Creating a Large XML Sitemap for Django
Suppose you have so many pages (many thousands) that you can't list every URL (each one a <loc>) in a single /sitemap.xml file. Then you need a /sitemaps.xml index file that points to other sitemap files. And if each of those holds thousands of URLs, you will want to gzip them.
This article shows how to create a sitemap index that points to 63 sitemap-{M}-{N}.xml.gz files covering about 1,000,000 URLs. The code is Python and the data comes from Django. Python is the important part; if you use something other than Django, you can still follow the idea and mentally substitute your own data source.
Generating .xml.gz files
This is the heart of the work: a generator function that takes a Django QuerySet instance (already ordered and filtered!), builds etrees from it, and dumps them to disk with gzip.
import gzip

from lxml import etree

# Must end in ".gz": the map of maps is built by globbing "sitemap-*.xml.gz"
outfile = "sitemap-{start}-{end}.xml.gz"
batchsize = 40_000


def generate(qs, base_url, outfile, batchsize):
    # Use `.values` to make the query much faster
    qs = qs.values("name", "id", "artist_id", "language")

    def start():
        return etree.Element(
            "urlset", xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        )

    def close(root, filename):
        with gzip.open(filename, "wb") as f:
            f.write(b'<?xml version="1.0" encoding="utf-8"?>\n')
            f.write(etree.tostring(root, pretty_print=True))

    root = filename = None
    count = 0
    for song in qs.iterator():
        if not count % batchsize:
            if filename:  # not the very first loop
                close(root, filename)
                yield filename
            filename = outfile.format(start=count, end=count + batchsize)
            root = start()
        # make_song_url() is the application's own helper (not shown here)
        loc = "{}{}".format(base_url, make_song_url(song))
        etree.SubElement(etree.SubElement(root, "url"), "loc").text = loc
        count += 1
    if filename:  # the queryset may have been empty
        close(root, filename)
        yield filename
The most important lines in terms of lxml.etree and sitemaps are:
root = etree.Element("urlset", xmlns="http://www.sitemaps.org/schemas/sitemap/0.9")
...
etree.SubElement(etree.SubElement(root, "url"), "loc").text = loc
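A quick way to convince yourself that these lines produce a usable file is to write a small one and read it back. The following is only a sketch: it swaps in the standard library's xml.etree.ElementTree for lxml so that it runs without extra dependencies, and the names write_urlset, read_locs and SITEMAP_NS are made up for illustration.

```python
import gzip
import xml.etree.ElementTree as ET

# Hypothetical constant name; the URI itself is the one used in the article.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def write_urlset(filename, locs):
    """Write a gzipped <urlset> with one <url><loc> per address."""
    root = ET.Element("urlset", xmlns=SITEMAP_NS)
    for loc in locs:
        ET.SubElement(ET.SubElement(root, "url"), "loc").text = loc
    with gzip.open(filename, "wb") as f:
        f.write(b'<?xml version="1.0" encoding="utf-8"?>\n')
        f.write(ET.tostring(root))


def read_locs(filename):
    """Parse a gzipped sitemap and return its <loc> values in order."""
    with gzip.open(filename, "rb") as f:
        tree = ET.parse(f)
    # Once parsed, every tag carries the namespace in "{uri}name" form.
    return [el.text for el in tree.iter("{%s}loc" % SITEMAP_NS)]
```

Note that the reader has to ask for the namespaced tag: after parsing, a plain "loc" lookup finds nothing, because the xmlns attribute puts every element into the sitemap namespace.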
Another important thing is the note about using .values(). Without it, Django creates a model instance for every row the iterator returns, which is expensive.
Equally important is using the Django ORM's iterator(), which is much more efficient than fiddling with limits and offsets.
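The single-pass batching idea does not depend on Django and can be looked at in isolation. Here is a minimal sketch, assuming any plain Python iterable in place of qs.iterator(); the helper name batches is hypothetical, and it uses itertools.islice rather than the count % batchsize check from generate above.

```python
from itertools import islice


def batches(iterable, batchsize):
    """Yield (start, rows) pairs, each holding at most `batchsize` rows."""
    it = iter(iterable)  # consumed exactly once, front to back
    start = 0
    while True:
        rows = list(islice(it, batchsize))
        if not rows:  # source exhausted (or empty from the start)
            return
        yield start, rows
        start += len(rows)


# Example: five rows in batches of two
for start, rows in batches(range(5), 2):
    print(start, rows)
```

Unlike generate, this variant holds one whole batch in memory at a time, so it trades a little memory for simpler control flow.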
Generating the map of maps
The map of maps itself does not need to be gzipped, since it will be tiny.
import datetime
import os
from glob import glob

from lxml import etree


def generate_map_of_maps(base_url, outfile):
    root = etree.Element(
        "sitemapindex", xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
    )
    with open(outfile, "wb") as f:
        f.write(b'<?xml version="1.0" encoding="UTF-8"?>\n')
        # Picks up the gzipped files in the current working directory
        files_created = sorted(glob("sitemap-*.xml.gz"))
        for file_created in files_created:
            sitemap = etree.SubElement(root, "sitemap")
            uri = "{}/{}".format(base_url, os.path.basename(file_created))
            etree.SubElement(sitemap, "loc").text = uri
            # Use the file's modification time (local time) as <lastmod>
            lastmod = datetime.datetime.fromtimestamp(
                os.stat(file_created).st_mtime
            ).strftime("%Y-%m-%d")
            etree.SubElement(sitemap, "lastmod").text = lastmod
        f.write(etree.tostring(root, pretty_print=True))
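To sanity-check the resulting index file, it can be parsed back into (loc, lastmod) pairs. This is a rough sketch rather than part of the original workflow: it uses the standard library's xml.etree.ElementTree instead of lxml, and the names read_index and NS are invented for the example.

```python
import xml.etree.ElementTree as ET

# Prefix-to-URI mapping used only for lookups; "sm" is an arbitrary prefix.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def read_index(path):
    """Return a list of (loc, lastmod) pairs from a sitemap index file."""
    root = ET.parse(path).getroot()
    entries = []
    for sitemap in root.findall("sm:sitemap", NS):
        loc = sitemap.findtext("sm:loc", namespaces=NS)
        lastmod = sitemap.findtext("sm:lastmod", namespaces=NS)
        entries.append((loc, lastmod))
    return entries
```

Comparing the number of entries against the number of .xml.gz files on disk is a cheap check that the glob matched what the generator wrote.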