The robots.txt file is a commonly overlooked tool in the SEO toolbox. Despite being incredibly simple, it is something I see missing or badly implemented on sites time and again, so I have written a short guide here explaining its uses.
The robots.txt file is a small text file which resides in the web root of a website. Its purpose is to serve as a behaviour guideline for search engine spiders. In essence it tells Googlebot (and other spiders) which pages of a website they should and should not index for use in web searches. This makes it very important when you are doing SEO.
Before I go on I want to discuss an often overlooked and misunderstood aspect of the robots.txt file: security. The robots.txt file is a guideline only, and web spiders can and often do ignore it. Even the ones that do obey it often have different interpretations of how the rules work.
It is for this reason that it is a very bad idea to list sensitive directories in the robots.txt file, as an attacker can simply read it to see which directories you do not want indexed and use that to hone an attack. I personally recommend a policy of denying access to everything (without naming the resources being denied) and then allowing access to named public resources, rather than allowing access to everything and denying access to named resources.
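A rough sketch of that policy looks like the following. The '/public/' folder and 'index.html' page are just placeholder names standing in for whatever parts of your site you actually want indexed:
# deny everything to every spider by default,
# then explicitly allow the named public resources
# (/public/ and /index.html are placeholders for your own pages)
User-agent: *
Allow: /public/
Allow: /index.html
Disallow: /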
The first step in creating your robots.txt file is simply to create a blank text file named robots.txt and place it in the webroot of your site. It is then a simple matter of visiting 'http://yoursitedomain/robots.txt' to see if it is visible.
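While you are still deciding on your rules, a minimal permissive file is enough to confirm everything is wired up correctly. A bare sketch is simply (the empty Disallow value means nothing is blocked):
User-agent: *
Disallow: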
Once you have created your robots.txt file the first thing to do is write a comment. The robots.txt file uses the '#' symbol to start a comment. A comment can appear on its own line or be appended to a directive.
# Comments appear after the "#" symbol at the start of a line,
User-agent: * # or after a directive
The user-agent directive is the first directive of any rule group. Its basic function is to tell a web spider whether the directives that follow apply to it or not. You can have multiple user-agent directives in a single robots.txt file; all the rules directly below a user-agent directive and above any subsequent user-agent directive apply to the named spider. A visiting spider will therefore read through the list until it finds either its own name or the '*' name (all spiders), and it should then obey all the rules listed below that entry until it reaches a new user-agent directive.
# this robots.txt file contains several different user-agent directives
# Googlebot is, surprise surprise, the Google website indexer
User-agent: Googlebot
# rules for Googlebot to obey
# T-Rex is the Lycos spider
User-agent: T-Rex
# rules for T-Rex to obey
# all other spiders
User-agent: *
# rules for all other spiders to obey
For a list of the more commonly known spiders, visit the www.robotstxt.org website.
After the user-agent directive, the 'Disallow' directive is the most commonly known command; in fact many guides only discuss this directive. Its basic function is simply to disallow a spider from indexing a specific directory or file. In the case of a directory, all files and subdirectories are disallowed. You can make an exception by using the 'Allow' directive (see below).
#Do not allow spiders to index any resource in the myfolder directory
#or the mypage webpage
User-agent: *
Disallow: /myfolder/
Disallow: /mypage.html
The opposite of the Disallow directive, the Allow directive grants a spider access to a specific file or directory. This is important when you want to deny access to all the resources of a directory except one specific resource.
It is important to note that the Allow directive is implemented slightly differently by different spiders, but the idea is basically the same. Note that the Allow directive must appear before any related Disallow directives, since some spiders simply apply the first rule that matches. Consider the following.
#Allow a web spider to see the showfile webpage in myfolder
#but nothing else in the same folder
User-agent: *
Allow: /myfolder/showfile.html
Disallow: /myfolder/
Several major crawlers support a Crawl-delay directive, which determines how long a spider must wait after retrieving a page before it can request another page from the site. This is a good measure to implement as it reduces the likelihood of a spider behaving like a 'Denial of Service' attack, since it prevents your site from being overloaded by a spider requesting a large number of pages in quick succession.
The crawl-delay directive accepts a value which represents the number of seconds to wait between successive requests to the same server:
#wait ten seconds between page requests
User-agent: *
Crawl-delay: 10
Another important function of the robots.txt file, particularly for SEO, is the Sitemap directive, which tells a web spider where the site map (or maps) is stored. For example:
Sitemap: http://www.mysite.com/sitemaps/profiles-sitemap.xml
Sitemap: http://www.mysite.com/sitemaps/blog-sitemap.xml
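To tie it all together, here is a sketch of a complete robots.txt using the directives covered above. The folder, page and sitemap names are just the illustrative ones used earlier:
# Googlebot may index the showfile page but nothing else in myfolder
User-agent: Googlebot
Allow: /myfolder/showfile.html
Disallow: /myfolder/

# all other spiders: wait ten seconds between requests and skip myfolder
User-agent: *
Crawl-delay: 10
Disallow: /myfolder/

# site maps for the whole site
Sitemap: http://www.mysite.com/sitemaps/profiles-sitemap.xml
Sitemap: http://www.mysite.com/sitemaps/blog-sitemap.xml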