robots.txt is a plain text file that tells web crawlers which parts of your site they may visit and which they should stay out of.

How does it work?

A well-behaved web crawler first goes to the root of a domain and looks for a robots.txt file.

For example, if a robot wants to visit www.example.com/welcome.html, it will first check whether www.example.com/robots.txt exists.
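
The lookup can be sketched in Python roughly as follows. The helper name robots_txt_url is hypothetical, and the http:// scheme is an assumption, since the example above leaves the scheme out.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url: str) -> str:
    """Return the robots.txt URL for the site that serves page_url.

    Hypothetical helper: keeps the scheme and host of the page and
    replaces the path with /robots.txt at the root of the domain.
    """
    parts = urlsplit(page_url)
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


# Assumed scheme (http://); the host and path come from the example above.
print(robots_txt_url("http://www.example.com/welcome.html"))
```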

Suppose it finds the following:

robots.txt:

# No robots, Please
User-agent: *
Disallow: /

In the file above:

User-agent: * means this section applies to all robots, and
Disallow: / tells the robot that it should not visit any pages on the site.
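
To see how a polite robot might act on these two lines, here is a minimal sketch using Python's standard urllib.robotparser module. The agent name ExampleBot is made up for illustration, and the rules are parsed from a string rather than fetched from a real server.

```python
from urllib.robotparser import RobotFileParser

# The same rules as in the example file above.
ROBOTS_TXT = """\
# No robots, Please
User-agent: *
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# "ExampleBot" is a hypothetical agent name; "*" matches any agent.
allowed = parser.can_fetch("ExampleBot", "http://www.example.com/welcome.html")
print(allowed)  # False: every path on the site is disallowed
```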

Note: Robots can ignore your /robots.txt, and the robots.txt file itself is publicly available, so anyone can read which paths you list in it.

The first point is especially important, since the robots that ignore the instructions are usually the malicious ones.

What to put inside?

robots.txt is a plain text file. Here are a few examples:

To allow all robots to visit all files:
User-agent: *
Disallow:

And the opposite, to keep all robots out:
User-agent: *
Disallow: /
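
The only difference between these two files is the value after Disallow:, which is easy to misread. As a small sketch, the hypothetical helper allows below runs both through Python's urllib.robotparser; the agent name ExampleBot is an assumption.

```python
from urllib.robotparser import RobotFileParser

ALLOW_ALL = "User-agent: *\nDisallow:\n"       # empty value: nothing is blocked
DISALLOW_ALL = "User-agent: *\nDisallow: /\n"  # "/" blocks every path


def allows(robots_txt: str, url: str, agent: str = "ExampleBot") -> bool:
    """Hypothetical helper: parse robots_txt and ask whether agent may fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


page = "http://www.example.com/welcome.html"
print(allows(ALLOW_ALL, page))     # True
print(allows(DISALLOW_ALL, page))  # False
```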

If you need to stop a specific agent from visiting a specific folder:
User-agent: SpecificBot # replace the 'SpecificBot' with the actual user-agent of the bot
Disallow: /notimportant/

The example above also shows how you can put comments in the file.
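
As a rough check of how such a rule behaves, the sketch below parses it with Python's urllib.robotparser. SpecificBot is the placeholder name from the example, OtherBot is a made-up second agent, and the page paths are assumptions.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: SpecificBot # replace the 'SpecificBot' with the actual user-agent of the bot
Disallow: /notimportant/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())  # the trailing comment is ignored by the parser

blocked_page = "http://www.example.com/notimportant/page.html"  # assumed path
open_page = "http://www.example.com/welcome.html"

print(parser.can_fetch("SpecificBot", blocked_page))  # False: the rule applies
print(parser.can_fetch("SpecificBot", open_page))     # True: outside the folder
print(parser.can_fetch("OtherBot", blocked_page))     # True: no section names this bot
```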

In addition, you can tell robots where your sitemap is located. The Sitemap line stands on its own and does not belong to any User-agent section:
Sitemap: http://www.example.com/sitemaps/sitemap.xml
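
A crawler could pick up the sitemap location along these lines. This is only a sketch, assuming Python 3.8 or later, where urllib.robotparser provides site_maps().

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
Sitemap: http://www.example.com/sitemaps/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# site_maps() returns a list of sitemap URLs, or None if the file lists none.
sitemaps = parser.site_maps()
print(sitemaps)
```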

Where to put it?

The short answer: in the top-level directory of your web server.

A bit longer: it should come directly after your domain name. For example, www.example.com/robots.txt, not www.example.com/robot_file/robots.txt.
