Writing a Robots.txt file - page 2

Author: Steven Neiland

User Agent

The user-agent directive is the first directive of any rule group. Its basic function is to tell a web spider whether the directives that follow apply to it. You can have multiple user-agent directives in a single robots.txt file; all the rules directly below a user-agent directive, up to the next user-agent directive, apply to that named spider. A visiting spider will therefore read through the list until it finds either the '*' name (all spiders) or its own name, and it should obey all the rules listed below that directive until it reaches a new user-agent directive.

# This robots.txt file contains several different user-agent directives

# Googlebot is, surprise surprise, the Google website indexer
User-agent: Googlebot
# Rules for Googlebot to obey

# T-Rex is the Lycos spider
User-agent: T-Rex
# Rules for T-Rex to obey

# All other spiders
User-agent: *
# Rules for all other spiders to obey

For a list of the more commonly known spiders, visit the www.robotstxt.org website.
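If you want to sanity-check how a crawler would pick its rule group, Python's standard-library urllib.robotparser module is a quick way to do it. The sketch below uses made-up rules, bot names, and the example.com domain purely for illustration:

```python
# Sketch: checking which user-agent group applies, using Python's
# standard-library robots.txt parser. Rules, bot names, and domain
# are placeholders for illustration.
from urllib import robotparser

rules = """\
User-agent: Googlebot
Disallow: /google-only/

User-agent: *
Disallow: /private/
"""

rp = robotparser.RobotFileParser()
rp.parse(rules.splitlines())

# Googlebot matches its own named group, so only that group's rules
# apply to it; the catch-all /private/ rule does not.
print(rp.can_fetch("Googlebot", "http://example.com/google-only/page.html"))  # False
print(rp.can_fetch("Googlebot", "http://example.com/private/page.html"))      # True

# A spider with no named group falls through to the '*' group.
print(rp.can_fetch("SomeOtherBot", "http://example.com/private/page.html"))   # False
```

Note how Googlebot is allowed into /private/ even though the '*' group forbids it: a spider obeys only the group that names it.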

Disallow

After the user-agent directive, the 'Disallow' directive is the most commonly known command; in fact, many guides only discuss this directive. Its basic function is simply to forbid a spider from indexing a specific directory or file. In the case of a directory, all files and subdirectories within it are disallowed. You can make an exception by using the 'Allow' directive (see below).

# Do not allow spiders to index any resource in the myfolder directory
# or the mypage webpage
User-agent: *
Disallow: /myfolder/
Disallow: /mypage.html
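These rules can be verified with Python's standard-library parser; the bot name and example.com domain below are placeholders:

```python
# Sketch: verifying the Disallow rules above. The bot name and
# domain are made up for illustration.
from urllib import robotparser

rules = """\
User-agent: *
Disallow: /myfolder/
Disallow: /mypage.html
"""

rp = robotparser.RobotFileParser()
rp.parse(rules.splitlines())

# Everything under /myfolder/ is blocked, as is /mypage.html itself.
print(rp.can_fetch("AnyBot", "http://example.com/myfolder/anything.html"))  # False
print(rp.can_fetch("AnyBot", "http://example.com/mypage.html"))             # False

# Other resources remain crawlable.
print(rp.can_fetch("AnyBot", "http://example.com/otherpage.html"))          # True
```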

Allow

The opposite of the Disallow directive, the Allow directive grants a spider access to a specific file or directory. This is useful when you want to deny access to every resource in a directory except one specific resource.

It is important to note that different spiders implement the Allow directive slightly differently, but the idea is basically the same. Note also that the Allow directive must appear before the Disallow directives it overrides. Consider the following.

# Allow a web spider to see the showfile webpage in myfolder
# but nothing else in the same folder

User-agent: *
Allow: /myfolder/showfile.html
Disallow: /myfolder/
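Python's standard-library parser applies rule lines in file order, first match wins, which is why putting the Allow line first matters in this sketch (bot name and domain are again placeholders):

```python
# Sketch: the earlier Allow line matches showfile.html before the
# broader Disallow line is reached, so only that one file is crawlable.
from urllib import robotparser

rules = """\
User-agent: *
Allow: /myfolder/showfile.html
Disallow: /myfolder/
"""

rp = robotparser.RobotFileParser()
rp.parse(rules.splitlines())

# The exception is honoured...
print(rp.can_fetch("AnyBot", "http://example.com/myfolder/showfile.html"))  # True
# ...while the rest of the folder stays blocked.
print(rp.can_fetch("AnyBot", "http://example.com/myfolder/hidden.html"))    # False
```

Keep in mind this first-match behaviour is one interpretation; some crawlers instead prefer the most specific (longest) matching rule regardless of order, which happens to give the same result here.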
