What is robots.txt?
The Basics
The robots.txt file is a plain text file that webmasters create to instruct web robots (typically search engine crawlers) which pages on their website they may and may not crawl.
It's part of the Robots Exclusion Protocol (REP), a group of web standards that regulate how robots crawl the web, access and index content, and serve that content up to users.
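For example, this minimal robots.txt tells every crawler that the entire site may be crawled (an empty Disallow value blocks nothing):
User-agent: *
Disallow: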
Important: robots.txt is not a security measure. It's publicly accessible and shouldn't be used to hide sensitive information.
File Location
The robots.txt file must be placed in the root directory of your website. For example:
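https://www.example.com/robots.txt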
Search engines will look for this exact location. If it's placed elsewhere, it won't be recognized.
Core Functions of robots.txt
Disallow Access
Prevent search engine crawlers from accessing specific directories or pages on your website.
Disallow: /private/
Specify User Agents
Give different instructions to different crawlers (Googlebot, Bingbot, etc.).
User-agent: Googlebot
Disallow: /images/
Sitemap Specification
Point crawlers to your XML sitemap location for better indexing.
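Sitemap: https://www.example.com/sitemap.xml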
Crawl Delay
Suggest how many seconds a crawler should wait between requests; not all crawlers honor this directive.
Crawl-delay: 10
Allow Specific Pages
Allow access to specific pages within a blocked directory.
Disallow: /private/
Allow: /private/public-page.html
Comments
Add comments to explain your directives to other developers.
# Block all crawlers from temporary files
User-agent: *
Disallow: /tmp/
Advanced robots.txt Features
Wildcards
Use wildcard characters (*) to match any sequence of characters in paths.
Disallow: /*.pdf$ # Block all PDFs
Disallow: /tmp/* # Block all in /tmp directory
Pattern Matching
Use $ to specify the end of a URL pattern for precise matching.
Disallow: /*.php$ # Block all PHP files
Allow: /index.php$ # Except the homepage
Multiple Directives
Combine multiple directives for complex crawling rules.
User-agent: *
Disallow: /images/private/
Allow: /images/private/logo.jpg
Crawl-delay: 5
Best Practices
Do's
- Use for non-sensitive content: Ideal for duplicate content, search results pages, or infinite URL spaces.
- Test your file: Use Google Search Console's robots.txt tester to validate your rules.
- Keep it simple: Only include directives you actually need.
Don'ts
- Don't block CSS/JS: Blocking stylesheets and scripts can prevent Google from properly rendering your pages (see the example after this list).
- Don't use for security: Sensitive pages should be protected with proper authentication, not hidden in robots.txt.
- Don't overuse wildcards: Overly broad patterns can accidentally block important pages.
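For instance, if you disallow an assets directory, you can explicitly re-allow the stylesheet and script subfolders so crawlers can still render your pages. The /assets/ paths below are placeholders; substitute your site's actual directories:
User-agent: *
Disallow: /assets/
Allow: /assets/css/
Allow: /assets/js/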
Complete Example File
# Last updated: 2023-06-15
# Block all crawlers from temporary files
User-agent: *
Disallow: /tmp/
Disallow: /private/
Disallow: /search/
Disallow: /*.pdf$
# Special rules for Googlebot
User-agent: Googlebot
Disallow: /images/
Allow: /images/logo.jpg
Crawl-delay: 5
# Special rules for Bingbot
User-agent: bingbot
Disallow: /admin/
Crawl-delay: 10
# Sitemap location
Sitemap: https://www.example.com/sitemap.xml
Sitemap: https://www.example.com/image-sitemap.xml
Testing Tools
Google Search Console
The robots.txt Tester tool helps you check whether your robots.txt file blocks Google web crawlers from specific URLs.
SEO Robots.txt Validator
Free online tool to validate your robots.txt file syntax and test URL accessibility against your rules.
curl Command
Quickly check your robots.txt file from the command line with a simple curl command:
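curl https://www.example.com/robots.txt   # replace example.com with your own domain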