Nov 29
Robots.txt tips and tricks
robots.txt is a file you can use to tell web crawlers which parts of your site they may visit and which they should stay out of.
How does it work?
A well-behaved web crawler first goes to the root of a domain and looks for a robots.txt file.
For example, if a robot wants to fetch www.example.com/welcome.html, it first checks whether www.example.com/robots.txt exists.
Suppose the robot finds the following:
robots.txt:
# No robots, Please
User-agent: *
Disallow: /
In the file above:
User-agent: * means this section applies to all robots, and
Disallow: / tells the robot not to visit any pages on the site.
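To see how a well-behaved robot could interpret these rules, here is a minimal sketch using Python's standard urllib.robotparser module. The bot name AnyBot is made up for illustration, and the rules are fed in as text rather than fetched from a real site.

```python
from urllib.robotparser import RobotFileParser

# The example file from above, supplied as text instead of being downloaded.
rules = """\
# No robots, Please
User-agent: *
Disallow: /
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# "AnyBot" is a hypothetical user-agent; the * section applies to every robot.
blocked = not parser.can_fetch("AnyBot", "http://www.example.com/welcome.html")
print(blocked)
```

With Disallow: / in place, can_fetch should report that the page is off limits for any user-agent you pass in.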
Note: Robots can ignore your /robots.txt, and the file itself is publicly available.
The first point matters most, because the robots that ignore the instructions are usually the malicious ones. The second means robots.txt is a poor place to list paths you want to keep secret, since anyone can read it.
What to put inside?
robots.txt is a plain text file. Here are a few examples:
To allow all robots to visit all files:
User-agent: *
Disallow:
And the opposite, to keep all robots out:
User-agent: *
Disallow: /
To keep a specific agent out of a specific folder:
User-agent: SpecificBot # replace 'SpecificBot' with the actual user-agent of the bot
Disallow: /notimportant/
The example above also shows how to put comments in the file: everything after a # is ignored.
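A per-agent rule like this one can be checked the same way. The sketch below again uses Python's standard urllib.robotparser; OtherBot and the page paths are invented for the example, and SpecificBot is still only a placeholder name.

```python
from urllib.robotparser import RobotFileParser

# Rules for one named robot only; the trailing comment is ignored by the parser.
rules = """\
User-agent: SpecificBot # placeholder name, not a real crawler
Disallow: /notimportant/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

base = "http://www.example.com"
# SpecificBot is kept out of the folder but may visit everything else.
in_folder = parser.can_fetch("SpecificBot", base + "/notimportant/page.html")
elsewhere = parser.can_fetch("SpecificBot", base + "/welcome.html")
# "OtherBot" (hypothetical) has no matching section, so nothing restricts it.
other_bot = parser.can_fetch("OtherBot", base + "/notimportant/page.html")

print(in_folder, elsewhere, other_bot)
```

Only the named agent is affected, and only inside /notimportant/. Robots without a matching section fall back to being allowed.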
In addition, you can tell robots where your sitemap is located. The Sitemap line does not belong to any User-agent section, so it can stand on its own:
Sitemap: http://www.example.com/sitemaps/sitemap.xml
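If you want to check that a Sitemap line is picked up, Python's standard urllib.robotparser can list the sitemaps it found. This is a small sketch that assumes Python 3.8 or newer, where the site_maps() method is available.

```python
from urllib.robotparser import RobotFileParser

# A robots.txt that contains only the sitemap location from the example.
rules = """\
Sitemap: http://www.example.com/sitemaps/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# site_maps() returns a list of sitemap URLs, or None if the file has none.
sitemaps = parser.site_maps()
print(sitemaps)
```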
Where to put it?
The short answer: in the top-level directory of your web server.
A bit longer: it must sit directly after your domain name. For example, www.example.com/robots.txt, not www.example.com/robot_file/robots.txt.
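Because the file always lives at the root, the address a robot checks can be worked out from any page URL. Here is a rough sketch in Python; the function name robots_url is my own invention for this example, not part of any library.

```python
from urllib.parse import urlsplit, urlunsplit

def robots_url(page_url: str) -> str:
    """Return the robots.txt address for the site that serves page_url."""
    parts = urlsplit(page_url)
    # Keep the scheme and host, replace the path, drop query and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))

print(robots_url("http://www.example.com/welcome.html"))
# prints http://www.example.com/robots.txt
```

Whatever folder the page is in, the result points at the top level, which is why a file placed in www.example.com/robot_file/ would never be found.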
