Creating a Large XML Sitemap for Django

Suppose you have so many pages that a single /sitemap.xml file listing every URL (each in a <loc> element) is not enough; the sitemap protocol caps one file at 50,000 URLs. Instead, you create /sitemaps.xml, an index that points to other sitemap files. And with tens of thousands of URLs in each of those, it makes sense to compress them.

This article shows how to create a sitemap file that points to 63 sitemap-{M}-{N}.xml.gz files spanning about 1,000,000 URLs. The context is Python, and the data comes from Django. Python is the key part; if you use something other than Django, you can still follow the idea and mentally substitute your own data source.

Generating .xml.gz files

This is the core of the work: a generator function that takes a Django QuerySet instance (already ordered and filtered!), builds etrees, and dumps them to disk using gzip.

import gzip

from lxml import etree


outfile = "sitemap-{start}-{end}.xml.gz"  # gzipped; matches the glob used for the index
batchsize = 40_000


def generate(qs, base_url, outfile, batchsize):
    # Use `.values` to make the query much faster
    qs = qs.values("name", "id", "artist_id", "language")

    def start():
        return etree.Element(
            "urlset", xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        )

    def close(root, filename):
        with gzip.open(filename, "wb") as f:
            f.write(b'<?xml version="1.0" encoding="utf-8"?>\n')
            f.write(etree.tostring(root, pretty_print=True))

    root = filename = None

    count = 0
    for song in qs.iterator():
        if not count % batchsize:
            if filename:  # not the very first loop
                close(root, filename)
                yield filename
            filename = outfile.format(start=count, end=count + batchsize)
            root = start()
        loc = "{}{}".format(base_url, make_song_url(song))
        etree.SubElement(etree.SubElement(root, "url"), "loc").text = loc
        count += 1
    if filename:  # flush the last (possibly partial) batch; skip if qs was empty
        close(root, filename)
        yield filename
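
To make the naming scheme concrete, here is a small sketch that mirrors only the filename arithmetic of the loop above. The helper name expected_filenames is hypothetical, and the default template assumes the files end in .xml.gz. Note that the {end} part of the last filename is start + batchsize, so it can overshoot the real number of rows.

```python
def expected_filenames(total, batchsize=40_000,
                       outfile="sitemap-{start}-{end}.xml.gz"):
    """List the filenames the batching loop would produce for `total` rows.

    Hypothetical helper for illustration; it only reproduces the naming,
    it does not touch the database or the disk.
    """
    return [
        outfile.format(start=start, end=start + batchsize)
        for start in range(0, total, batchsize)
    ]


# 100,000 rows in batches of 40,000 give three files; the last one is partial.
print(expected_filenames(100_000))
```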

The most important lines in terms of lxml.etree and sitemaps are:

root = etree.Element("urlset", xmlns="http://www.sitemaps.org/schemas/sitemap/0.9")
...         
etree.SubElement(etree.SubElement(root, "url"), "loc").text = loc
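
To see what those two lines produce, here is a minimal sketch. It uses the standard library's xml.etree.ElementTree rather than lxml so that it runs without extra dependencies; the Element and SubElement calls have the same shape. The example.com URLs are made up for illustration.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

# Same structure as in the generator: <urlset> / <url> / <loc>
root = ET.Element("urlset", xmlns=SITEMAP_NS)
for loc in ("https://example.com/song/1", "https://example.com/song/2"):
    ET.SubElement(ET.SubElement(root, "url"), "loc").text = loc

xml_bytes = ET.tostring(root)
print(xml_bytes.decode("utf-8"))

# Parsing the result back shows that the xmlns attribute puts every
# element into the sitemap namespace.
parsed = ET.fromstring(xml_bytes)
locs = [el.text for el in parsed.iter("{%s}loc" % SITEMAP_NS)]
print(locs)
```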

Another important detail is the use of .values(). Without it, Django creates a model instance for every row the iterator returns, which is expensive.
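
Because .values() yields plain dicts, a URL helper such as make_song_url (called in the generator but not shown in this article) only needs dictionary keys. The sketch below is purely hypothetical: the real URL layout is not given here, so the /song/<id>/<name> pattern and the sample row are assumptions for illustration only.

```python
from urllib.parse import quote


def make_song_url(song):
    """Hypothetical helper: build a site-relative URL from a .values() dict.

    The path layout is an assumption; adapt it to your own URL scheme.
    """
    return "/song/{}/{}".format(song["id"], quote(song["name"]))


# A row shaped like what qs.values("name", "id", "artist_id", "language") yields
row = {"name": "Hey Jude", "id": 123, "artist_id": 7, "language": "en"}
print(make_song_url(row))
```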

It is also important to use the Django ORM's iterator(), which is much more efficient than fiddling with limits and offsets.
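
If your data does not come from Django, the same single-pass batching can be sketched over any iterable of URLs. This is only an illustration under stated assumptions: it uses the standard library's xml.etree.ElementTree instead of lxml, the function name write_sitemaps is made up, and the URLs are placeholders.

```python
import gzip
import itertools
import os
import tempfile
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def write_sitemaps(urls, directory, batchsize):
    """Consume `urls` once, writing one gzipped sitemap per batch."""
    urls = iter(urls)
    for start in itertools.count(0, batchsize):
        # Pull at most `batchsize` URLs without loading everything in memory
        batch = list(itertools.islice(urls, batchsize))
        if not batch:
            break
        root = ET.Element("urlset", xmlns=SITEMAP_NS)
        for loc in batch:
            ET.SubElement(ET.SubElement(root, "url"), "loc").text = loc
        name = "sitemap-{}-{}.xml.gz".format(start, start + batchsize)
        path = os.path.join(directory, name)
        with gzip.open(path, "wb") as f:
            f.write(b'<?xml version="1.0" encoding="utf-8"?>\n')
            f.write(ET.tostring(root, encoding="utf-8"))
        yield path


with tempfile.TemporaryDirectory() as tmp:
    urls = ("https://example.com/song/{}".format(i) for i in range(5))
    paths = list(write_sitemaps(urls, tmp, batchsize=2))
    names = [os.path.basename(p) for p in paths]
    # Read the last (partial) file back to check its contents
    with gzip.open(paths[-1], "rb") as f:
        last = ET.fromstring(f.read())
    last_locs = [el.text for el in last.iter("{%s}loc" % SITEMAP_NS)]

print(names)
print(last_locs)
```

Like iterator() in the Django version, the generator expression is consumed once from start to finish, so nothing has to be re-queried per batch.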

Generating the Map of Maps

The map of maps does not need to be compressed, as it will be tiny.

import datetime
import os
from glob import glob


def generate_map_of_maps(base_url, outfile):
    root = etree.Element(
        "sitemapindex", xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
    )

    with open(outfile, "wb") as f:
        f.write(b'<?xml version="1.0" encoding="UTF-8"?>\n')
        files_created = sorted(glob("sitemap-*.xml.gz"))
        for file_created in files_created:
            sitemap = etree.SubElement(root, "sitemap")
            uri = "{}/{}".format(base_url, os.path.basename(file_created))
            etree.SubElement(sitemap, "loc").text = uri
            lastmod = datetime.datetime.fromtimestamp(
                os.stat(file_created).st_mtime
            ).strftime("%Y-%m-%d")
            etree.SubElement(sitemap, "lastmod").text = lastmod
        f.write(etree.tostring(root, pretty_print=True))
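
As a quick sanity check, the index can be parsed back to list the files it points to. The sketch below uses the standard library's xml.etree.ElementTree; the function name list_sitemaps and the sample index (URL and date) are made up for illustration.

```python
import xml.etree.ElementTree as ET

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def list_sitemaps(index_xml):
    """Return (loc, lastmod) pairs from the bytes of a sitemap index."""
    root = ET.fromstring(index_xml)
    return [
        (
            node.findtext("sm:loc", namespaces=NS),
            node.findtext("sm:lastmod", namespaces=NS),
        )
        for node in root.findall("sm:sitemap", NS)
    ]


# Example data only; a real index would be read from the generated file.
example = b"""<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>https://example.com/sitemap-0-40000.xml.gz</loc>
    <lastmod>2020-01-01</lastmod>
  </sitemap>
</sitemapindex>
"""

print(list_sitemaps(example))
```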