Mastering robots.txt

The complete guide to controlling search engine crawlers on your website

What is robots.txt?

The Basics

The robots.txt file is a plain text file that webmasters create to tell web robots (typically search engine crawlers) which parts of a website they may crawl. It controls crawling rather than indexing: a URL blocked in robots.txt can still appear in search results if other pages link to it.

It's part of the Robots Exclusion Protocol (REP), a group of web standards that regulate how robots crawl the web, access and index content, and serve that content up to users.

Important: robots.txt is not a security measure. It's publicly accessible and shouldn't be used to hide sensitive information.
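
To see the protocol from a crawler's point of view, here is a minimal sketch using Python's standard urllib.robotparser module; the domain, URL, and crawler name are placeholders.

from urllib.robotparser import RobotFileParser

# Fetch and parse the site's robots.txt (placeholder domain).
rp = RobotFileParser()
rp.set_url("https://www.example.com/robots.txt")
rp.read()

# A well-behaved crawler asks for permission before fetching a page.
url = "https://www.example.com/private/report.html"
if rp.can_fetch("MyCrawler", url):
    print("Allowed to crawl", url)
else:
    print("robots.txt disallows", url)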

File Location

The robots.txt file must be placed in the root directory of your website. For example:

https://www.example.com/robots.txt

Search engines look only at this exact location; a file placed anywhere else (for example, in a subdirectory) won't be recognized. Each subdomain and protocol combination needs its own robots.txt file.
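
As a quick illustration (a sketch using the Python standard library and placeholder URLs), the snippet below derives the robots.txt location for any page URL: only the scheme and host are kept, which is also why every subdomain needs its own file.

from urllib.parse import urlsplit, urlunsplit

def robots_url(page_url: str) -> str:
    # robots.txt always lives at the root: keep only scheme and host.
    parts = urlsplit(page_url)
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))

print(robots_url("https://www.example.com/blog/post.html?id=7"))
# -> https://www.example.com/robots.txt
print(robots_url("https://shop.example.com/cart"))
# -> https://shop.example.com/robots.txt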

Core Functions of robots.txt

Disallow Access

Prevent search engine crawlers from accessing specific directories or pages on your website.

User-agent: *
Disallow: /private/

Specify User Agents

Give different instructions to different crawlers (Googlebot, Bingbot, etc.).

User-agent: Googlebot
Disallow: /images/

Sitemap Specification

Point crawlers to your XML sitemap so they can discover your URLs more efficiently.

Sitemap: https://example.com/sitemap.xml

Crawl Delay

Suggest how many seconds a crawler should wait between requests. Note that Googlebot ignores this directive, while crawlers such as Bingbot honor it.

User-agent: *
Crawl-delay: 10
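
If you run your own crawler, the delay can be read and honored with Python's urllib.robotparser; this is a minimal sketch with a placeholder domain and crawler name.

import time
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url("https://www.example.com/robots.txt")
rp.read()

# crawl_delay() returns the delay for the matching User-agent, or None.
delay = rp.crawl_delay("MyCrawler") or 1  # fall back to 1 second

for url in ["https://www.example.com/a", "https://www.example.com/b"]:
    if rp.can_fetch("MyCrawler", url):
        print("fetching", url)  # a real crawler would download the page here
        time.sleep(delay)       # pause between requests, as requested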

Allow Specific Pages

Allow access to specific pages within a blocked directory. Major crawlers such as Googlebot resolve conflicts by the most specific (longest) matching rule, so the narrower Allow overrides the broader Disallow.

User-agent: *
Disallow: /private/
Allow: /private/public-page.html

Comments

Add comments to explain your directives to other developers.

# Block all crawlers from temp files
User-agent: *
Disallow: /tmp/

Advanced robots.txt Features

Wildcards

Use the wildcard character (*) to match any sequence of characters in paths.

User-agent: *
Disallow: /*.pdf$ # Block all PDFs
Disallow: /tmp/* # Block all in /tmp directory

Pattern Matching

Use $ to specify the end of a URL pattern for precise matching.

User-agent: *
Disallow: /*.php$ # Block all PHP files
Allow: /index.php$ # Except the homepage
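
The sketch below illustrates these matching semantics in Python. It is only an illustration of the documented rules (* matches any sequence of characters, a trailing $ anchors the pattern to the end of the URL), not any crawler's actual implementation.

import re

def robots_pattern_to_regex(pattern: str) -> re.Pattern:
    # '*' matches any sequence of characters; a trailing '$' anchors the
    # pattern to the end of the URL. Everything else is matched literally,
    # starting from the beginning of the path.
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    body = "".join(".*" if ch == "*" else re.escape(ch) for ch in pattern)
    return re.compile(body + ("$" if anchored else ""))

rule = robots_pattern_to_regex("/*.php$")
for path in ["/index.php", "/app/main.php", "/page.php?x=1", "/readme.txt"]:
    print(path, "->", "blocked" if rule.match(path) else "allowed")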

Multiple Directives

Combine multiple directives for complex crawling rules.

User-agent: Googlebot-Image
Disallow: /images/private/
Allow: /images/private/logo.jpg
Crawl-delay: 5

Best Practices

Do's

  • Use for non-sensitive content: Ideal for duplicate content, internal search results pages, or infinite URL spaces such as faceted navigation and calendar pages.

  • Test your file: Use Google Search Console's robots.txt tester to validate your rules; a quick local sanity check is sketched after this list.

  • Keep it simple: Only include directives you actually need.
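
As that local sanity check (a sketch only, with hypothetical draft rules and URLs), Python's urllib.robotparser can run URLs against a draft file before you deploy it. Note that the standard-library parser follows the original REP and does plain prefix matching, so it does not understand the * and $ extensions shown earlier; use Search Console for authoritative testing.

from urllib.robotparser import RobotFileParser

# Hypothetical draft rules (plain prefixes only; no wildcards).
draft = """\
User-agent: *
Disallow: /private/
Disallow: /search/
"""

rp = RobotFileParser()
rp.parse(draft.splitlines())

for url in [
    "https://www.example.com/private/report.html",
    "https://www.example.com/search/?q=robots",
    "https://www.example.com/blog/post.html",
]:
    verdict = "allowed" if rp.can_fetch("MyCrawler", url) else "blocked"
    print(f"{verdict:7} {url}")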

Don'ts

  • Don't block CSS/JS: This can prevent Google from properly rendering your pages.

  • Don't use for security: Sensitive pages should use proper authentication.

  • Don't overuse wildcards: Overly broad patterns can accidentally block important pages.

Complete Example File

robots.txt
# Example robots.txt file
# Last updated: 2023-06-15

# Block all crawlers from temporary files
User-agent: *
Disallow: /tmp/
Disallow: /private/
Disallow: /search/
Disallow: /*.pdf$

# Special rules for Googlebot
User-agent: Googlebot
Disallow: /images/
Allow: /images/logo.jpg
Crawl-delay: 5

# Special rules for Bingbot
User-agent: bingbot
Disallow: /admin/
Crawl-delay: 10

# Sitemap location
Sitemap: https://www.example.com/sitemap.xml
Sitemap: https://www.example.com/image-sitemap.xml

Testing Tools

Google Search Console

The robots.txt Tester tool helps you check whether your robots.txt file blocks Google web crawlers from specific URLs.

SEO Robots.txt Validator

Free online tool to validate your robots.txt file syntax and test URL accessibility against your rules.

curl Command

Quickly check your live robots.txt file from the command line with this simple curl command. A 404 response means no robots.txt exists, which most crawlers treat as permission to crawl the entire site.

curl https://www.yourdomain.com/robots.txt