Controlling site indexing
Websites provide suggestions to web crawlers via a file called robots.txt. By default, when your page is built, a basic robots.txt file is generated that includes the following:
Sitemap: https://academicpages.github.io/sitemap.xml
Where the URL (https://academicpages.github.io in this case) is replaced by the URL of your site. This is typically all that most users need, since it allows major search engines to find the site's XML sitemap, which improves indexing of your site. However, if you require more control over web crawlers and the indexing of your site, you will need to add a custom robots.txt file.
To create a custom robots.txt file, add the file to the root of your repository with contents similar to the following:
User-agent: Googlebot
Disallow: /nogooglebot/
User-agent: *
Allow: /
Sitemap: https://www.example.com/sitemap.xml
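Before deploying, you can sanity-check these rules locally with Python's built-in urllib.robotparser module. This is a minimal sketch that assumes the example above is saved as robots.txt in the repository root; the paths being tested are only illustrations.

from urllib.robotparser import RobotFileParser

# Parse the robots.txt sitting in the repository root.
with open("robots.txt") as f:
    rules = RobotFileParser()
    rules.parse(f.read().splitlines())

# Googlebot is blocked under /nogooglebot/ but allowed elsewhere,
# and every other crawler is allowed everywhere.
print(rules.can_fetch("Googlebot", "https://www.example.com/nogooglebot/page.html"))     # False
print(rules.can_fetch("Googlebot", "https://www.example.com/publications/"))             # True
print(rules.can_fetch("SomeOtherBot", "https://www.example.com/nogooglebot/page.html"))  # True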
Once deployed, you can check it by going to your site and appending /robots.txt to the URL (e.g., https://academicpages.github.io/robots.txt). Google's documentation for developers contains excellent information on writing rules for your own robots.txt. If you are interested in blocking the web crawlers used for AI training, the following is a good place to start:
User-agent: anthropic-ai
Disallow: /
User-agent: Claude-Web
Disallow: /
User-agent: CCbot
Disallow: /
User-agent: FacebookBot
Disallow: /
User-agent: Google-Extended
Disallow: /
User-agent: GPTBot
Disallow: /
User-agent: PiplBot
Disallow: /
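As an alternative to checking in the browser, you can verify the deployed file programmatically with the same urllib.robotparser module. This is a sketch only: https://academicpages.github.io stands in for your own site URL, and the expected results assume your deployed robots.txt contains the rules above.

from urllib.robotparser import RobotFileParser

# Replace with your own GitHub Pages URL.
SITE = "https://academicpages.github.io"

rules = RobotFileParser(SITE + "/robots.txt")
rules.read()  # fetch and parse the live file

# Each AI-training crawler listed above should be blocked site-wide...
for agent in ("anthropic-ai", "Claude-Web", "CCbot", "FacebookBot",
              "Google-Extended", "GPTBot", "PiplBot"):
    print(agent, rules.can_fetch(agent, SITE + "/"))          # expect False

# ...while ordinary search crawlers remain unaffected.
print("Googlebot", rules.can_fetch("Googlebot", SITE + "/"))  # expect True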