September 5, 2006

Write a correct robots.txt file for Googlebot and other User-agents

The robots.txt file, located in the root Web directory of a Web site, is used by robots such as Googlebot, MSNBot, or Yahoo! Slurp (Yahoo!'s Web crawler) to know which pages of the site are to be indexed by the search engine, and which pages should not be.
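For example, for a site served at http://www.example.com/ (a hypothetical address), crawlers will request the file at:

http://www.example.com/robots.txt

A robots.txt file placed anywhere else, for instance in a subdirectory, will be ignored by crawlers.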

This robots.txt file is a plain text file containing sections such as:

User-agent: Googlebot
Disallow: /private_content/
Disallow: /images/

In this example, Googlebot will exclude from the search engine index the pages located in the /private_content/ and /images/ directories.

The robots.txt syntax presented here should be used as is; in particular, a space is needed between the ":" and the page or directory path.

Comments may be inserted in the robots file. A comment line starts with a "#" character.
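For instance, a commented entry might look like this (the directory name is illustrative):

# Keep Googlebot out of the staging area
User-agent: Googlebot
Disallow: /staging/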

A more generic syntax exists to disallow files or directories for all User-agents:

User-agent: *
Disallow: /cgi-bin/
Disallow: /family/

If you combine both syntaxes, "User-agent: Bot-Name" and "User-agent: *", take care to place the "User-agent: *" section after all the "named" sections.

For example, Google's robot, Googlebot, reads the robots.txt file and uses the first User-agent section matching the pattern Googlebot*; it then stops reading the file.

The same behavior should apply to other bots, such as Yahoo! Slurp or MSNBot.
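Putting this together, a combined robots.txt might look like this (the paths are illustrative):

# Rules for Googlebot only
User-agent: Googlebot
Disallow: /private_content/
Disallow: /images/

# Rules for every other robot
User-agent: *
Disallow: /cgi-bin/
Disallow: /family/

Note that since Googlebot stops at its own section, any rule from the "User-agent: *" section that should also apply to Googlebot must be repeated in the Googlebot section.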
