
Controlling site indexing


Overview

Websites provide suggestions to web crawlers via a file called robots.txt. By default, when your site is built, a basic robots.txt file is generated that includes the following:

Sitemap: https://academicpages.github.io/sitemap.xml

Here the URL - https://academicpages.github.io in this case - is replaced by the URL of your site. This is all most users need, since it allows major search engines to find your site's sitemap in XML format, which improves indexing. However, if you require more control over web crawlers and how your site is indexed, you will need to add a custom robots.txt file.

Custom robots.txt files

To create a custom robots.txt file, add one to the root of your repository, similar to the following:

User-agent: Googlebot
Disallow: /nogooglebot/

User-agent: *
Allow: /

Sitemap: https://www.example.com/sitemap.xml
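Before committing a custom file, you may want to sanity-check the rules locally. The snippet below is a minimal sketch, not part of the template, that uses Python's standard urllib.robotparser module against the example rules above:

from urllib.robotparser import RobotFileParser

# The example rules from the custom robots.txt above.
rules = """\
User-agent: Googlebot
Disallow: /nogooglebot/

User-agent: *
Allow: /
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# Googlebot is blocked under /nogooglebot/ but allowed elsewhere.
print(parser.can_fetch("Googlebot", "/nogooglebot/page.html"))  # False
print(parser.can_fetch("Googlebot", "/publications/"))          # True

# Every other crawler is allowed everywhere.
print(parser.can_fetch("Bingbot", "/nogooglebot/page.html"))    # True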

Once deployed, you can check the file by going to your site and adding /robots.txt to the URL (e.g., https://academicpages.github.io/robots.txt). Google's documentation for developers contains excellent information on writing rules for your own robots.txt. If you are interested in blocking the web crawlers used for AI training, the following is a good place to start; a sketch for verifying the deployed rules follows the example:

User-agent: anthropic-ai
Disallow: /
User-agent: Claude-Web
Disallow: /
User-agent: CCbot
Disallow: /
User-agent: FacebookBot
Disallow: /
User-agent: Google-Extended
Disallow: /
User-agent: GPTBot
Disallow: /
User-agent: PiplBot
Disallow: /
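Once these rules are deployed, you can confirm that the listed crawlers are actually blocked. The following is a minimal sketch using the same urllib.robotparser module; the robots.txt URL is the example address from above and should be replaced with your own site's:

from urllib.robotparser import RobotFileParser

# Replace with your own site's robots.txt URL.
ROBOTS_URL = "https://academicpages.github.io/robots.txt"

# User agents disallowed in the example above.
AI_CRAWLERS = [
    "anthropic-ai", "Claude-Web", "CCbot", "FacebookBot",
    "Google-Extended", "GPTBot", "PiplBot",
]

parser = RobotFileParser()
parser.set_url(ROBOTS_URL)
parser.read()  # fetch and parse the deployed robots.txt

for agent in AI_CRAWLERS:
    status = "blocked" if not parser.can_fetch(agent, "/") else "allowed"
    print(f"{agent}: {status}")

Keep in mind that this only checks your published rules; robots.txt is advisory, and crawlers may choose to ignore it.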