handroll, sitemaps, and robots.txt

Google Webmaster Tools provides suggestions to improve your site ranking. The suggestions generally involve making it easier for Google’s crawlers to find your content. One such suggestion is adding a sitemap, which can increase your site’s visibility on the internet.

A sitemap is a listing of the pages within your site. Web crawlers often work by starting at the root of your website (like https://example.com/) and then navigating to each of the links on that page. The crawler repeats the process until it can’t find any more links. For a well-linked site, this process works well. Unfortunately, if you have a page that is not linked from any other page, the crawler will not know that it exists. The benefit of a sitemap is that you can inform the crawler about all of your pages, whether they are well linked or not.

handroll 3.1 includes a sitemap extension that will generate a sitemap for you automatically. To use it, add the following to your handroll.conf:

[site]
with_sitemap = True

That’s it! From now on, all of your HTML files will be included in a sitemap.txt file. Once you have a sitemap, you should inform web crawlers of its location.
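
A plain-text sitemap is simply a list of full URLs, one per line, so the generated sitemap.txt should look something like this (assuming a domain of https://example.com and a few hypothetical pages):

https://example.com/index.html
https://example.com/about.html
https://example.com/blog/first-post.html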

Conventionally, websites “communicate” with web crawlers via a robots.txt file. This file tells a crawler what it should or should not crawl. It also happens to be the place where you can specify the location of a sitemap file.

robots.txt wants the full URL to the sitemap file, so I used handroll’s new Jinja 2 template composer to generate the file without hardcoding my domain.

The whole file, named robots.txt.j2, looks like:

User-agent: *
Disallow:
Sitemap: {{ config.domain }}/sitemap.txt
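
When handroll composes the template, {{ config.domain }} is filled in with the domain from the site configuration. Assuming the domain is set to https://example.com, the rendered robots.txt would read:

User-agent: *
Disallow:
Sitemap: https://example.com/sitemap.txt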

With one additional line in my configuration file and three lines in a template file, I made it easier for web crawlers to find everything I care about on my website.