Mastering robots.txt

The complete guide to controlling search engine crawlers on your website

What is robots.txt?

The Basics

The robots.txt file is a plain text file that webmasters create to tell web robots (typically search engine crawlers) which parts of a website they may crawl. It controls crawling rather than indexing: a URL blocked in robots.txt can still appear in search results if other pages link to it.

It's part of the Robots Exclusion Protocol (REP), a group of web standards that regulate how robots crawl the web, access and index content, and serve that content up to users.

Important: robots.txt is not a security measure. It's publicly accessible and shouldn't be used to hide sensitive information.
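
To see the protocol from a crawler's point of view, here is a minimal sketch using Python's standard urllib.robotparser module; the domain, URL, and crawler name are placeholders.

from urllib.robotparser import RobotFileParser

# Fetch and parse the site's robots.txt (placeholder domain).
rp = RobotFileParser()
rp.set_url("https://www.example.com/robots.txt")
rp.read()

# A well-behaved crawler asks for permission before fetching a page.
url = "https://www.example.com/private/report.html"
if rp.can_fetch("MyCrawler", url):
    print("Allowed to crawl", url)
else:
    print("robots.txt disallows", url)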

File Location

The robots.txt file must be placed in the root directory of your website. For example:

https://www.example.com/robots.txt

Search engines look only at this exact location; a file placed anywhere else (for example, in a subdirectory) won't be recognized. Each subdomain and protocol combination needs its own robots.txt file.
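
As a quick illustration (a sketch using the Python standard library and placeholder URLs), the snippet below derives the robots.txt location for any page URL: only the scheme and host are kept, which is also why every subdomain needs its own file.

from urllib.parse import urlsplit, urlunsplit

def robots_url(page_url: str) -> str:
    # robots.txt always lives at the root: keep only scheme and host.
    parts = urlsplit(page_url)
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))

print(robots_url("https://www.example.com/blog/post.html?id=7"))
# -> https://www.example.com/robots.txt
print(robots_url("https://shop.example.com/cart"))
# -> https://shop.example.com/robots.txt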

Core Functions of robots.txt

Disallow Access

Prevent search engine crawlers from accessing specific directories or pages on your website.

User-agent: *
Disallow: /private/

Specify User Agents

Give different instructions to different crawlers (Googlebot, Bingbot, etc.).

User-agent: Googlebot
Disallow: /images/

Sitemap Specification

Point crawlers to your XML sitemap so they can discover your URLs more efficiently.

Sitemap: https://example.com/sitemap.xml

Crawl Delay

Suggest how many seconds a crawler should wait between requests. Note that Googlebot ignores this directive, while crawlers such as Bingbot honor it.

User-agent: *
Crawl-delay: 10
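
If you run your own crawler, the delay can be read and honored with Python's urllib.robotparser; this is a minimal sketch with a placeholder domain and crawler name.

import time
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url("https://www.example.com/robots.txt")
rp.read()

# crawl_delay() returns the delay for the matching User-agent, or None.
delay = rp.crawl_delay("MyCrawler") or 1  # fall back to 1 second

for url in ["https://www.example.com/a", "https://www.example.com/b"]:
    if rp.can_fetch("MyCrawler", url):
        print("fetching", url)  # a real crawler would download the page here
        time.sleep(delay)       # pause between requests, as requested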

Allow Specific Pages

Allow access to specific pages within a blocked directory. Major crawlers such as Googlebot resolve conflicts by the most specific (longest) matching rule, so the narrower Allow overrides the broader Disallow.

User-agent: *
Disallow: /private/
Allow: /private/public-page.html

Comments

Add comments to explain your directives to other developers.

# Block all crawlers from temp files
User-agent: *
Disallow: /tmp/

Advanced robots.txt Features

Wildcards

Use the wildcard character (*) to match any sequence of characters in paths.

User-agent: *
Disallow: /*.pdf$ # Block all PDFs
Disallow: /tmp/* # Block all in /tmp directory

Pattern Matching

Use $ to specify the end of a URL pattern for precise matching.

User-agent: *
Disallow: /*.php$ # Block all PHP files
Allow: /index.php$ # Except the homepage
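
The sketch below illustrates these matching semantics in Python. It is only an illustration of the documented rules (* matches any sequence of characters, a trailing $ anchors the pattern to the end of the URL), not any crawler's actual implementation.

import re

def robots_pattern_to_regex(pattern: str) -> re.Pattern:
    # '*' matches any sequence of characters; a trailing '$' anchors the
    # pattern to the end of the URL. Everything else is matched literally,
    # starting from the beginning of the path.
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    body = "".join(".*" if ch == "*" else re.escape(ch) for ch in pattern)
    return re.compile(body + ("$" if anchored else ""))

rule = robots_pattern_to_regex("/*.php$")
for path in ["/index.php", "/app/main.php", "/page.php?x=1", "/readme.txt"]:
    print(path, "->", "blocked" if rule.match(path) else "allowed")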

Multiple Directives

Combine multiple directives for complex crawling rules.

User-agent: Googlebot-Image
Disallow: /images/private/
Allow: /images/private/logo.jpg
Crawl-delay: 5

Best Practices

Do's

  • Use for non-sensitive content: Ideal for duplicate content, internal search results pages, or infinite URL spaces such as faceted navigation and calendar pages.

  • Test your file: Use Google Search Console's robots.txt tester to validate your rules; a quick local sanity check is sketched after this list.

  • Keep it simple: Only include directives you actually need.
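
As that local sanity check (a sketch only, with hypothetical draft rules and URLs), Python's urllib.robotparser can run URLs against a draft file before you deploy it. Note that the standard-library parser follows the original REP and does plain prefix matching, so it does not understand the * and $ extensions shown earlier; use Search Console for authoritative testing.

from urllib.robotparser import RobotFileParser

# Hypothetical draft rules (plain prefixes only; no wildcards).
draft = """\
User-agent: *
Disallow: /private/
Disallow: /search/
"""

rp = RobotFileParser()
rp.parse(draft.splitlines())

for url in [
    "https://www.example.com/private/report.html",
    "https://www.example.com/search/?q=robots",
    "https://www.example.com/blog/post.html",
]:
    verdict = "allowed" if rp.can_fetch("MyCrawler", url) else "blocked"
    print(f"{verdict:7} {url}")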

Don'ts

  • Don't block CSS/JS: This can prevent Google from properly rendering your pages.

  • Don't use for security: Sensitive pages should use proper authentication.

  • Don't overuse wildcards: Overly broad patterns can accidentally block important pages.

Complete Example File

robots.txt
# Example robots.txt file
# Last updated: 2023-06-15

# Block all crawlers from temporary files
User-agent: *
Disallow: /tmp/
Disallow: /private/
Disallow: /search/
Disallow: /*.pdf$

# Special rules for Googlebot
User-agent: Googlebot
Disallow: /images/
Allow: /images/logo.jpg
Crawl-delay: 5

# Special rules for Bingbot
User-agent: bingbot
Disallow: /admin/
Crawl-delay: 10

# Sitemap location
Sitemap: https://www.example.com/sitemap.xml
Sitemap: https://www.example.com/image-sitemap.xml

Testing Tools

Google Search Console

The robots.txt Tester tool helps you check whether your robots.txt file blocks Google web crawlers from specific URLs.

SEO Robots.txt Validator

Free online tool to validate your robots.txt file syntax and test URL accessibility against your rules.

curl Command

Quickly check your live robots.txt file from the command line with this simple curl command. A 404 response means no robots.txt exists, which most crawlers treat as permission to crawl the entire site.

curl https://www.yourdomain.com/robots.txt