robots.txt is a file you can use to tell web crawlers which parts of your site they may visit and which parts they should stay out of.
How does it work?
A well-behaved web crawler first fetches the root of a domain and looks for a robots.txt file.
For example, if a robot wants to visit www.example.com/welcome.html, it will first check whether www.example.com/robots.txt exists.
And suppose, for example, it finds:

# No robots, Please
User-agent: *
Disallow: /
In the file above:
User-agent: * means the section applies to all robots, and
Disallow: / tells the robot that it should not visit any page on the site.
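To see how a well-behaved crawler interprets these rules, here is a minimal sketch using Python's standard urllib.robotparser module (the rules are the hypothetical example from above; normally you would call set_url() and read() to fetch the live file):

```python
from urllib import robotparser

# Hypothetical robots.txt content, matching the example above
rules = [
    "User-agent: *",
    "Disallow: /",
]

parser = robotparser.RobotFileParser()
parser.parse(rules)

# "Disallow: /" for all agents blocks every page on the site
print(parser.can_fetch("SomeBot", "https://www.example.com/welcome.html"))
# False
```

A polite crawler performs exactly this check before requesting any page.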
Note: it is important to know that robots can ignore your /robots.txt, and that the file itself is publicly available, so anyone can see which parts of your site you want crawlers to skip.
The first point matters most, since the robots that ignore these instructions are usually the malicious ones; robots.txt is a convention, not an access-control mechanism.
What to put inside?
robots.txt is a plain text file. Here are a few examples:
To allow all robots to visit all files:

User-agent: *
Disallow:
And, the opposite, to keep all robots out:

User-agent: *
Disallow: /
If you need to disallow a specific agent from visiting a specific folder:

User-agent: SpecificBot # replace 'SpecificBot' with the actual user-agent of the bot
Disallow: /private/ # replace with the folder the bot should not visit

The example above also shows how you can put comments in the file: everything from a # to the end of the line is ignored by robots.
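You can verify per-agent rules like these with the same urllib.robotparser module (the agent name and folder are placeholders, as in the example above):

```python
from urllib import robotparser

# Hypothetical rules: block only 'SpecificBot' from /private/
parser = robotparser.RobotFileParser()
parser.parse([
    "User-agent: SpecificBot",
    "Disallow: /private/",
])

# The named bot is blocked from the folder...
print(parser.can_fetch("SpecificBot", "https://www.example.com/private/page.html"))
# False

# ...but any other bot, with no matching section, is allowed
print(parser.can_fetch("OtherBot", "https://www.example.com/private/page.html"))
# True
```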
In addition, you can tell robots where your sitemap is located:

Sitemap: https://www.example.com/sitemap.xml
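Sitemap lines are also exposed by urllib.robotparser, via site_maps() (available since Python 3.8; the URL is the example one):

```python
from urllib import robotparser

parser = robotparser.RobotFileParser()
parser.parse([
    "User-agent: *",
    "Disallow:",
    "Sitemap: https://www.example.com/sitemap.xml",
])

# site_maps() returns the declared sitemap URLs, or None if there are none
print(parser.site_maps())
# ['https://www.example.com/sitemap.xml']
```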
Where to put it?
The short answer: in the top-level directory of your web server.
A bit longer: it must sit directly under your domain name, at the root of the site. For example www.example.com/robots.txt, not www.example.com/robot_file/robots.txt.
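The rule above can be sketched in Python: given any page URL, a crawler derives the robots.txt location from the scheme and host alone (the function name here is made up for illustration):

```python
from urllib.parse import urlsplit

def robots_url(page_url):
    # robots.txt always lives at the root of the host,
    # regardless of how deep the page's path is
    parts = urlsplit(page_url)
    return f"{parts.scheme}://{parts.netloc}/robots.txt"

print(robots_url("https://www.example.com/some/deep/page.html"))
# https://www.example.com/robots.txt
```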