
Robots.txt Checker: How to Test and Validate Your Robots.txt File

NetVizor Team · April 3, 2026
#robots.txt #seo #web-crawling

A single mistake in your robots.txt file can accidentally block Google from crawling your entire website. It happens more often than you think – and the consequences can devastate your search rankings overnight. This guide explains what robots.txt does, how to check it correctly, and how to fix the most common errors.


Check Your Robots.txt Now

👉 Robots.txt Checker – Free Online Tool

Enter any domain and instantly see its robots.txt file, validate the syntax, and check which pages are blocked from crawlers.


What Is Robots.txt?

Robots.txt is a plain text file placed in the root directory of your website (yourdomain.com/robots.txt) that tells search engine crawlers which pages or sections they should or shouldn't access.

It's part of the Robots Exclusion Protocol – a standard that major search engines like Google, Bing, and others follow by convention (not by obligation).

What robots.txt controls:

  • Which pages crawlers can access
  • Which crawlers are affected (Google, Bing, specific bots)
  • Where your XML sitemap is located
  • Crawl delay between requests

How Robots.txt Works

When a search engine bot visits your site, it first checks yourdomain.com/robots.txt before crawling any page. Based on the rules it finds, it decides what to crawl.

User-agent: *
Disallow: /admin/
Disallow: /private/
Allow: /public/
Sitemap: https://yourdomain.com/sitemap.xml

Important distinction: Robots.txt controls crawling, not indexing. A page blocked by robots.txt won't be crawled – but it can still appear in search results if other pages link to it. To prevent indexing, use the noindex meta tag instead.
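
For reference, the noindex directive is a single tag in the page's <head> (there is also an equivalent X-Robots-Tag HTTP header):

<meta name="robots" content="noindex">

Note that noindex only works if crawlers can actually fetch the page – if the same URL is blocked in robots.txt, Google never sees the tag.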


Robots.txt Syntax Explained

Basic structure

User-agent: [bot name or *]
Disallow: [path to block]
Allow: [path to allow]
Crawl-delay: [seconds]
Sitemap: [full URL to sitemap]

User-agent directives

User-agent            Crawler
*                     All crawlers
Googlebot             Google (all)
Googlebot-Image       Google Images
Googlebot-Video       Google Video
Bingbot               Microsoft Bing
Slurp                 Yahoo Search
DuckDuckBot           DuckDuckGo
facebookexternalhit   Facebook link previews
Twitterbot            Twitter/X link previews

Allow and Disallow rules

# Block all crawlers from the entire site
User-agent: *
Disallow: /

# Allow all crawlers everywhere (default behavior)
User-agent: *
Disallow:

# Block a specific directory
User-agent: *
Disallow: /admin/

# Block a specific file
User-agent: *
Disallow: /private-page.html

# Block all PDFs
User-agent: *
Disallow: /*.pdf$

# Allow Google but block everything else
User-agent: Googlebot
Allow: /

User-agent: *
Disallow: /

Wildcard patterns

Pattern    Matches
/admin/    Exactly /admin/ and everything inside
/admin*    Anything starting with /admin
*.pdf$     All URLs ending in .pdf
/*?        All URLs with query parameters

How to Check Your Robots.txt File

Method 1: Online checker (fastest)

Use the Robots.txt Checker from NetVizor:

  1. Enter your domain
  2. See the current robots.txt content
  3. Check which paths are blocked or allowed
  4. Validate the syntax and spot errors

Method 2: Direct URL

Simply open yourdomain.com/robots.txt in your browser. If it returns a 404, you don't have a robots.txt file (which is fine – all pages are crawlable by default).

Method 3: Google Search Console

  1. Open Google Search Console
  2. Go to Settings → robots.txt
  3. Google shows the robots.txt it last fetched and when

This is especially useful to check if Googlebot sees the same robots.txt as you do – caching issues can cause discrepancies.

Method 4: Google's Robots.txt Tester

  1. Open Google's legacy robots.txt Tester (a separate tool from the Settings report in Method 3, which Google has been retiring)
  2. Test specific URLs against your current robots.txt
  3. See whether a URL is allowed or blocked
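
Method 5: Script the check with Python

If you want to automate this, Python's standard-library urllib.robotparser can fetch a live robots.txt and test URLs against it. A minimal sketch – the domain and paths below are placeholders, and the standard parser implements the original protocol, so Google-specific wildcard rules may not be evaluated exactly as Google does:

from urllib.robotparser import RobotFileParser

DOMAIN = "https://yourdomain.com"   # replace with the site you want to test

parser = RobotFileParser()
parser.set_url(f"{DOMAIN}/robots.txt")
parser.read()                       # fetches and parses the live file

# Test a few representative URLs against the rules for Googlebot
for path in ["/", "/admin/", "/private-page.html", "/assets/app.css"]:
    allowed = parser.can_fetch("Googlebot", f"{DOMAIN}{path}")
    print("ALLOWED" if allowed else "BLOCKED", path)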

Most Common Robots.txt Mistakes

Mistake 1: Accidentally blocking the entire site

The most catastrophic mistake:

# WRONG – blocks all crawlers from everything
User-agent: *
Disallow: /

This single rule prevents Google from crawling any page on your website. Rankings disappear within days.

How it happens: Developers add this during site maintenance and forget to remove it. Always check robots.txt after a site launch or migration.
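
A cheap safeguard is to run a small sanity check after every launch or migration – a minimal sketch using Python's standard library (the domain is a placeholder):

import sys
from urllib.robotparser import RobotFileParser

SITE = "https://yourdomain.com"     # replace with your site

parser = RobotFileParser(f"{SITE}/robots.txt")
parser.read()

# Fail loudly if every crawler is blocked from the site root
if not parser.can_fetch("*", f"{SITE}/"):
    sys.exit("robots.txt is blocking all crawlers from the site root")
print("OK: robots.txt allows crawling of the site root")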

Mistake 2: Blocking CSS and JavaScript files

# WRONG – prevents Google from rendering your pages
User-agent: *
Disallow: /wp-content/
Disallow: /assets/

If Google can't access your CSS and JavaScript, it can't properly render your pages. This hurts rankings because Google sees a broken version of your site.

Fix: Allow Googlebot to access all resources needed to render pages.

Mistake 3: Disallow without trailing slash

# Blocks only /admin (the exact URL)
Disallow: /admin

# Blocks /admin/ and everything inside it
Disallow: /admin/

Without the trailing slash, you only block the exact URL – not the directory and its contents.

Mistake 4: Wrong file location or filename

Robots.txt must be:

  • In the root directory (yourdomain.com/robots.txt)
  • Named exactly robots.txt (lowercase)
  • Served with a 200 status at the final URL (crawlers follow only a limited number of redirects)
  • Plain text format (text/plain)

A robots.txt at yourdomain.com/folder/robots.txt has no effect.
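
You can verify the status code, content type, and final location in a few lines of Python (a sketch – the URL is a placeholder, and urlopen raises an error if the file is missing or unreachable):

import urllib.request

URL = "https://yourdomain.com/robots.txt"   # replace with your domain

with urllib.request.urlopen(URL) as resp:
    print("Status:      ", resp.status)                       # expect 200
    print("Content-Type:", resp.headers.get("Content-Type"))  # expect text/plain
    print("Final URL:   ", resp.geturl())                     # exposes unexpected redirects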

Mistake 5: Blocking important pages by accident

# Meant to block /private/secret
# Actually blocks ALL pages starting with /p
Disallow: /p

Always test your rules with the Robots.txt Checker from NetVizor before publishing.

Mistake 6: Using robots.txt to hide sensitive content

Robots.txt is publicly visible – anyone can read it. If you list sensitive directories in robots.txt, you're actually advertising their existence to bad actors.

Use server-side authentication to protect sensitive content – not robots.txt.


Robots.txt for Common CMS Platforms

WordPress

User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php
Disallow: /wp-login.php
Disallow: /xmlrpc.php

Sitemap: https://yourdomain.com/sitemap_index.xml

Shopify

Shopify generates robots.txt automatically. You can customise it via the robots.txt.liquid template. Common additions:

User-agent: *
Disallow: /admin
Disallow: /cart
Disallow: /orders
Disallow: /checkout
Disallow: /account

Next.js / Nuxt.js

In Next.js, create public/robots.txt or use the next-sitemap package. In Nuxt 3, use the nuxt-simple-robots module or place it in the public/ directory.


Robots.txt vs Meta Noindex: What's the Difference?

These two mechanisms are often confused:

                   Robots.txt                       Meta Noindex
Controls           Crawling                         Indexing
Location           Root directory file              HTML <head> tag
Effect             Bot won't visit the page         Bot visits but won't index
Scope              Entire directories or patterns   Individual pages
Can still rank?    Yes (via links)                  No

When to use robots.txt:

  • Block crawlers from admin areas, internal tools
  • Prevent crawling of duplicate content
  • Save crawl budget on large sites

When to use noindex:

  • Remove specific pages from search results
  • Thank-you pages, login pages, internal search results

Crawl Budget: Why Robots.txt Matters for Large Sites

For large websites (100,000+ pages), crawl budget becomes critical. Google doesn't crawl every page of every site on every visit – it allocates a certain number of crawl requests per site.

Wasting crawl budget on unimportant pages (faceted navigation, filtered URLs, duplicate content) means important pages get crawled less frequently.

Robots.txt helps by blocking low-value URLs:

# Block faceted navigation (common e-commerce issue)
User-agent: *
Disallow: /*?color=
Disallow: /*?sort=
Disallow: /*?page=

# Block internal search results
Disallow: /search/

XML Sitemap in Robots.txt

Always include your sitemap URL in robots.txt – it helps search engines find and crawl your content:

User-agent: *
Disallow: /admin/

Sitemap: https://yourdomain.com/sitemap.xml

If you have multiple sitemaps:

Sitemap: https://yourdomain.com/sitemap-pages.xml
Sitemap: https://yourdomain.com/sitemap-posts.xml
Sitemap: https://yourdomain.com/sitemap-images.xml

Use the DNS Lookup tool from NetVizor to verify that your sitemap's domain resolves correctly, and make sure every URL listed in the sitemap returns a 200 status.
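
A short script can handle the 200-status check for you. A minimal sketch that reads a standard <urlset> sitemap and flags any URL that doesn't return 200 (the sitemap address is a placeholder; for very large sitemaps, consider checking a sample):

import urllib.error
import urllib.request
import xml.etree.ElementTree as ET

SITEMAP_URL = "https://yourdomain.com/sitemap.xml"   # replace with your sitemap
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

with urllib.request.urlopen(SITEMAP_URL) as resp:
    root = ET.fromstring(resp.read())

# HEAD-request every <loc> entry and report anything that isn't a 200
for loc in root.findall(".//sm:loc", NS):
    url = loc.text.strip()
    request = urllib.request.Request(url, method="HEAD")
    try:
        status = urllib.request.urlopen(request).status
    except urllib.error.HTTPError as err:
        status = err.code
    if status != 200:
        print(status, url)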


FAQ: Robots.txt Questions

Does robots.txt affect Google rankings? Indirectly, yes. Blocking important pages prevents Google from crawling and indexing them – which removes them from search results. Blocking CSS/JS hurts rendering quality. A clean, well-configured robots.txt helps Google crawl your site efficiently.

What happens if I don't have a robots.txt file? Nothing bad – all pages are crawlable by default. A missing robots.txt simply means no restrictions. Google won't penalise you for not having one.

Can I block specific countries or IPs with robots.txt? No. Robots.txt only controls crawlers – not human visitors, and not by location. Use server-side rules (Cloudflare, .htaccess, nginx config) to block IPs or countries.

Does every website need a robots.txt? Not necessarily. Small sites with no sensitive areas and no duplicate content issues don't need one. Larger sites, e-commerce platforms, and sites with admin areas should have one.

How quickly does Google update after I change robots.txt? Google typically re-fetches robots.txt within 24 hours. However, the effects on crawling can take days to propagate – previously blocked pages may take weeks to disappear from search results (or reappear after unblocking).

Can I use robots.txt to block AI crawlers? Yes. Add a group for each AI crawler's user-agent and disallow it:

User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: anthropic-ai
Disallow: /

Conclusion

Robots.txt is simple in concept but powerful in impact. A single misplaced rule can block your entire site from Google – and a well-crafted file can significantly improve how efficiently crawlers navigate your content.

Quick checklist:

  • Robots.txt is at yourdomain.com/robots.txt
  • No accidental Disallow: / for all crawlers
  • CSS and JavaScript files are accessible to Googlebot
  • Sitemap URL is included
  • Blocked directories use trailing slashes
  • Tested with the Robots.txt Checker from NetVizor

🤖 Check Your Robots.txt File – Free