By Arpad Balogh
June 2, 2021

Robots.txt is a text file that tells search engine crawlers which parts of your website they are allowed to crawl and which they are not.

It is the very first file a web crawler will access when it visits your website.

The Robots.txt is stored in the root folder of your website.

Major search engines generally respect the instructions in your Robots.txt file, but keep in mind that the same directives can be interpreted differently by different search engines.

The History of Robots.txt: Who Invented it?

Martijn Koster, inventor of Robots.txt

Robots.txt was originally introduced by Martijn Koster in February 1994 on the www-talk mailing list.


Martijn Koster was an employee at Nexor at the time.


Robots.txt was created as a way to regulate crawler activity on his web server after a rogue web crawler's excessive and aggressive requests overwhelmed the server, accidentally causing a denial-of-service attack.

Is having a Robots.txt File Necessary?

Having a robots.txt file is not necessary if you do not want to restrict any web crawler activity, although it is highly recommended to have one.
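
If you do keep one without restricting anything, a minimal "allow everything" Robots.txt can look like this (the empty Disallow value means nothing is blocked):

User-agent: *
Disallow: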

What Protocol does the Robots.txt use?

The robots.txt uses the "Robots Exclusion Protocol".

The Robots Exclusion Protocol specifies that web robots should fetch and read the robots.txt file first, before crawling any other pages on the site.

The protocol also specifies that compliant robots should obey the instructions given in this file, applying the most specific group of rules that matches them; a group addressed to a named crawler takes precedence over the generic "User-agent: *" group.
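
To illustrate how this works in practice, here is a small Python sketch using the standard urllib.robotparser module, showing how a polite crawler reads your Robots.txt before fetching any URL (the example.com addresses are just placeholders):

from urllib import robotparser

# Fetch and parse robots.txt first, as the protocol describes
parser = robotparser.RobotFileParser()
parser.set_url("https://example.com/robots.txt")
parser.read()

# Only crawl a URL if the rules that apply to this user agent allow it
if parser.can_fetch("Googlebot", "https://example.com/wp-admin/"):
    print("Allowed to crawl")
else:
    print("Blocked by robots.txt")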

Syntax: How to Create a Robots.txt File?

You can use any plain text editor such as Notepad, TextEdit or Wordpad to create a Robots.txt file on your computer.

Open your editor, create a new text file, make sure you are using UTF-8 as the encoding type, and save it as robots.txt.

Your Robots.txt will consist of groups of directives (rules) and, optionally, comments. The only formatting rule is that there should be one directive per line.

Each of these groups can tell the crawler the following information:


  • User-Agent: Which crawler (search engine bot) the rules below it apply to. There are several user agents, and each group of rules applies only to the user agent named above it (a minimal example follows this list).
  • Allow: Which files or directories the User-Agent has access to.
  • Disallow: Which files or directories the User-Agent doesn't have access to.
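
Put together, a minimal group could look like this (the paths are just placeholders):

User-agent: *
Disallow: /private/
Allow: /private/public-page.html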

These are the most basic directives; there are more options you can add to your Robots.txt, such as:

Sitemap Directive (for XML Sitemaps)

You can use the robots.txt file to specify any XML sitemap for your website.

You can do that by adding this line to your Robots.txt (the Sitemap directive is independent of any group, so it can go anywhere in the file):

Sitemap: http://www.example.com/sitemap.xml

Keep in mind that you can declare several sitemaps, so if you have multiple ones, include them all in your Robots.txt file.
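
For example, declaring two sitemaps (with placeholder URLs) looks like this:

Sitemap: https://www.example.com/post-sitemap.xml
Sitemap: https://www.example.com/page-sitemap.xml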

Crawl-Delay Directive

This directive allows you to specify how long a crawler should wait between successive requests.

You can use this, for example, if your website is under heavy load and slow, so bots do not overload it by sending requests too often.

You can do that by adding this line to your group:

Crawl-delay: 120

The value is the number of seconds a robot should wait between successive requests to your site.
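
For example, asking a crawler to wait 10 seconds between requests could look like this (note that Googlebot ignores Crawl-delay, while Bing and Yandex respect it):

User-agent: Bingbot
Crawl-delay: 10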

Robots.txt Example

Here is a basic Robots.txt example for a WordPress website:

User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php

# UTF-8 BOM Tested & Approved
User-agent: Googlebot

# Allow files critical for rendering
Allow: /*.js
Allow: /*.css

# Allow AJAX - Do Not Remove
Allow: /wp-admin/admin-ajax.php

# Prevent login page crawls etc
Disallow: /wp-login.php

# Prevent register page crawls etc
Disallow: /wp-register.php

# Sitemap files

Sitemap: https://example.com/sitemap_index.xml
Sitemap: https://example.com/post-sitemap.xml
Sitemap: https://example.com/page-sitemap.xml
Sitemap: https://example.com/category-sitemap.xml
Sitemap: https://example.com/author-sitemap.xml

You can test out your Robots.txt file in Google's Search Console Robots Testing Tool.

Where to put the Robots.txt file?


The Robots.txt file should be put in the root folder of your website host.

So if your website is https://example.com, then the Robots.txt file should be located at the root domain https://example.com/robots.txt

If you have a subdomain, like https://sub.example.com, your Robots.txt file should be located at https://sub.example.com/robots.txt

Keep in mind that the robots.txt file on your subdomain and root domain is different as these are handled as separate websites!

The only rule is that your Robots.txt cannot be put in a sub-folder, like https://example.com/folder/robots.txt

How to Use Wildcards in a Robots.txt file?


Major search engines (Google, Bing, Yahoo) allow for a limited form of wildcards for path values.

These are * and $.

These are not full regular expressions, but they allow simple pattern matching that you can use in your Disallow directives to get these search engines to exclude certain directories or files.

There are several ways you can use Wildcards in your Robots.txt file.

Let's look at a few Robots.txt Wildcard examples:

You Would Like to Disallow a Certain Path

You should use the following line in your Robots.txt:

Disallow: /example

This would tell Google (or any major search engine) not to crawl the following URLs, when crawling your website:

example.com/example
example.com/examples


But, as it is case sensitive, it wouldn't apply to anything like this:


example.com/EXample

You Would Like to Disallow a Certain Folder and Everything in That Folder:

You should use the following line in your Robots.txt:

Disallow: /example/

This would tell Google (or any major search engine) not to crawl the following folder and its contents when crawling your website:

example.com/example/
example.com/example/page

You Would Like to Disallow a Certain Type of File:

You should use the following line in your Robots.txt:


Disallow: /*.php$

This would tell Google (or any major search engine) not to crawl the following files on your website:

example.com/example.php
example.com/example/file.php

You Would Like to Disallow a Certain Type of File with Queries or Parameters as Well

You should use the following line in your Robots.txt:


Disallow: /*.php

This would tell Google (or any major search engine) not to crawl the following files on your website:

example.com/example.php
example.com/example/file.php
example.com/example/file.php?parameter

Of course, you can modify any of these lines to your liking, for example making them apply only to a certain folder, as shown below.
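
For example, a sketch of scoping that last rule to a single (placeholder) folder could look like this:

Disallow: /blog/*.php

This would block any .php URL under /blog/ (with or without parameters), while leaving .php files elsewhere on your site crawlable.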


How to Allow a File in a Disallowed Folder in Robots.txt?

There might be times when you don't want Google or other search engines to crawl a folder, but there are certain types of files in that folder that you do want crawled.

In that case, you should use a Disallow and an Allow directive one after the other.

Here is an example if you would like search engines to crawl JPEGs inside a disallowed folder:

Disallow: /folder/
Allow: /folder/*.jpg$


This will disallow the crawling of the folder, but will allow the crawling of any files that have the filetype .jpg (without any parameters).

Can You Block Different Search Engines From Crawling Your Website?

Yes, by using different user agents, you can block search engines from crawling your website.


To do this, you must use the Disallow directive and specify the user agent you would like to block.

For example, if you want to block Google, you can use this:

User-agent: Googlebot
Disallow: /


This will disallow the Googlebot user agent from crawling the entire site.

There are other user agents for other Search Engines, but the method for blocking them is the same.
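
For example, blocking Bing works the same way with its Bingbot user agent:

User-agent: Bingbot
Disallow: /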

Why is Robots.txt Important for SEO?

Robots.txt is important for SEO because you can block any crawler from accessing certain parts of your website and you can also help search engine spiders find your sitemaps.

Robots.txt can also be used to help eliminate duplicate content, which can be harmful for your SEO.

Robots.txt alone doesn't help you rank higher in the search results, but it is definitely a good step towards good technical SEO health for your website.

You can also declare your sitemaps in your Robots.txt, which helps crawlers crawl your website more easily.

Can You Block Indexing with Robots.txt?

Although you can block pages from being crawled, disallowing them in your Robots.txt is not a sure way of blocking those pages from being indexed in search engines.

For example, Google's crawler can still find links to the blocked pages from elsewhere and index pages that are blocked in your Robots.txt.

If you want Google not to index a page, you must not block it from being crawled in the Robots.txt; instead, use a noindex tag on the page itself.

If you use Disallow to block the crawling of the page and also use a noindex tag, the Disallow will prevent Google from ever seeing the noindex directive.

If you want to hide a page from the search results, simply add a noindex tag to that page!
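
The noindex tag itself is a meta tag placed in the <head> section of the page, for example:

<meta name="robots" content="noindex">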

How to Fix Submitted URL Blocked by Robots.txt Error in Google Search Console

What does this error mean?

This error means that Google tried to crawl a page on your website that you submitted for indexing, but it couldn't because the page is disallowed in your Robots.txt.

How to fix this error?

If you would like Google to index that page and the Disallow was added by accident, you should remove that rule from your Robots.txt.

If you would like Google to index only that page, but not the rest of the folder, you should use the Allow and Disallow method mentioned above.

How to Fix Sitemap Contains URLs Which are Blocked by Robots.txt Error

What does this error mean?

This error means that Google tried to crawl a page that is listed in your sitemap, but it is unable to because the page is blocked in your Robots.txt.

How to fix this error?

If you would like Google to index that page and the Disallow was added by accident, you should remove that rule from your Robots.txt.

If you would like Google to index only that page, but not the rest of the folder, you should use the Allow and Disallow method mentioned above.

If you don't want Google to index the page, you should remove it from your sitemap.

Does Blocking Directories with Robots.txt Help with Crawl Budget?

Yes, blocking directories or certain pages in your Robots.txt can help with the crawl budget for your website, but you shouldn't use it as a temporary fix.

Google states in its official documentation that this freed-up crawl budget won't be shifted to other pages unless your website is already hitting its serving limit.

This means that disallowing certain parts of your website can free up crawl budget, but Google only recommends Disallowing parts of your website that you don't want to have appearing in the search results in the long run.

You should also consider using a noindex tag for these pages if you want them crawled but not indexed, or password protection if you don't want them accessed at all.

Conclusion

The Robots.txt file is an important tool to control which pages on your website get crawled and to block specific pages or files from being crawled at all.

With that said, we recommend giving as much leeway to search engines and search engine spiders as possible.

We hope this article helped you create the perfect Robots.txt file, which is an essential part of a DIY SEO campaign.


And as always, feel free to contact us for any help!


About the author

Arpad Balogh is a Hungarian SEO Expert and Strategist with more than 8 years of experience in the field. He loves dogs, he is afraid of empty pools (so weird, right?), loves vegan food and has a passion for telling very bad jokes.

Technical SEO, Structured Data and Keyword research are the areas he loves the most.

As the founder of Slothio, his mission is to help 5,000 small business owners grow their businesses in the next three years, helping them be happier and enjoy life more.

Arpad Balogh

