In the days before Google (around 3 B.G., in fact), AltaVista was the newest search engine available to Internet users. To demonstrate the superior power of their minicomputers, the AltaVista team at Digital chose to crawl and index the entire Internet. At the time this was something new, and most webmasters did not want these 'robot' programs visiting the pages on their sites because of the increased load on their servers and the associated rise in bandwidth costs. This led to the Robots Exclusion Standard, created in 1996 to give webmasters a way to keep robots out.
Using a simple text file called robots.txt, you can instruct search engines to stay out of certain directories. Here is a very simple robots.txt that denies all search engines (User-agents) access to the /images directory.
User-agent: *
Disallow: /images
Rules are matched as path prefixes. By disallowing /images you are therefore also disallowing all subdirectories under /images, such as /images/logos, and any files whose paths begin with /images, such as /images.html.
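To see the prefix behavior in action, here is a minimal sketch using Python's standard urllib.robotparser module. The crawler name "ExampleBot" and the test paths are made up for illustration, and other parsers may differ in details.

```python
from urllib.robotparser import RobotFileParser

# The simple robots.txt from the text, one directive per line.
ROBOTS_TXT = """\
User-agent: *
Disallow: /images
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(path: str) -> bool:
    """Return True if the hypothetical crawler "ExampleBot" may fetch path."""
    return parser.can_fetch("ExampleBot", path)


if __name__ == "__main__":
    for path in ["/images", "/images/logos/a.png", "/images.html", "/index.html"]:
        print(path, allowed(path))
```

Only /index.html should come back as allowed here; the other three paths all start with /images and are blocked by the single Disallow line.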
The first draft of the standard did not include an "Allow" directive: anything that was not specifically disallowed was considered fair game to web crawlers. "Allow" was added later, but there is no guarantee that all search engines support it.
To disallow access to your entire web site use a robots.txt like this:
User-agent: *
Disallow: /
If User-agent is *, the lines that follow apply to all search engine robots. By specifying the signature of a particular web crawler as the User-agent, you can give instructions to that robot alone.
User-agent: Googlebot
Disallow: /google-secrets
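As a rough check of how a per-robot rule behaves, the sketch below feeds the Googlebot example to Python's standard urllib.robotparser and asks about two crawlers. "ExampleBot" is a hypothetical crawler name, and real search engines may match User-agent strings somewhat differently than this module does.

```python
from urllib.robotparser import RobotFileParser

# The Googlebot-only example from the text.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /google-secrets
"""

agent_parser = RobotFileParser()
agent_parser.parse(ROBOTS_TXT.splitlines())


def may_fetch(agent: str, path: str) -> bool:
    """Return True if the crawler named agent may fetch path."""
    return agent_parser.can_fetch(agent, path)


if __name__ == "__main__":
    # "ExampleBot" is a made-up crawler that has no section of its own.
    for agent in ["Googlebot", "ExampleBot"]:
        for path in ["/google-secrets/plan.html", "/index.html"]:
            print(agent, path, may_fetch(agent, path))
```

Because the file has no "User-agent: *" section, a crawler that is not named in it is not restricted at all; only Googlebot is kept out of /google-secrets.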
Since the initial specification was issued, some search engines have extended the protocol. One such extension permits the use of wildcards.
User-agent: Slurp
Disallow: /*.gif$
Here * matches any sequence of characters and the trailing $ anchors the match to the end of the URL, so this prevents Yahoo! (whose web crawler is called Slurp) from indexing any files on your site that end with ".gif". Keep in mind that wildcard matches are not supported by all search engines, so you have to preface these lines with the User-agent line of a crawler that does support them.
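Python's standard urllib.robotparser matches rules as plain prefixes and does not interpret wildcards, so the sketch below translates a rule into a regular expression by hand. It assumes the common convention that * matches any run of characters and a trailing $ anchors the end of the path; the helper names rule_to_regex and rule_matches are invented for this illustration, and individual search engines may apply slightly different rules.

```python
import re


def rule_to_regex(rule_path: str) -> "re.Pattern[str]":
    """Translate a robots.txt path rule into a compiled regular expression."""
    anchored = rule_path.endswith("$")
    if anchored:
        rule_path = rule_path[:-1]
    # Escape each literal piece, then join the pieces with ".*" for every "*".
    body = ".*".join(re.escape(piece) for piece in rule_path.split("*"))
    return re.compile(body + ("$" if anchored else ""))


def rule_matches(rule_path: str, url_path: str) -> bool:
    """Return True if url_path falls under the given rule."""
    # re.match anchors at the start, which gives the usual prefix behavior.
    return rule_to_regex(rule_path).match(url_path) is not None


if __name__ == "__main__":
    for url_path in ["/logo.gif", "/images/logo.gif", "/logo.gif.bak", "/logo.png"]:
        print(url_path, rule_matches("/*.gif$", url_path))
```

Under these assumptions the rule /*.gif$ matches /logo.gif and /images/logo.gif but not /logo.gif.bak, because the $ requires the path to end in ".gif". A rule without wildcards, such as /images, still behaves as a simple prefix match.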
You can combine several of the above techniques in one robots.txt file. Here's a theoretical example.
User-agent: *
Disallow: /bar

User-agent: Googlebot
Allow: /foo
Disallow: /bar
Disallow: /*.gif$
Disallow: /
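One way to reason about a combined file like this is to run it through a parser and compare what different crawlers see. The sketch below uses Python's standard urllib.robotparser, which applies the first matching rule within a section and treats the wildcard line as a literal path, so it only approximates how Googlebot itself would evaluate the file. "ExampleBot" is a hypothetical crawler name.

```python
from urllib.robotparser import RobotFileParser

# The combined example from the text, one directive per line.
COMBINED = """\
User-agent: *
Disallow: /bar

User-agent: Googlebot
Allow: /foo
Disallow: /bar
Disallow: /*.gif$
Disallow: /
"""

combined_parser = RobotFileParser()
combined_parser.parse(COMBINED.splitlines())


def verdict(agent: str, path: str) -> bool:
    """Return True if the crawler named agent may fetch path."""
    return combined_parser.can_fetch(agent, path)


if __name__ == "__main__":
    # "ExampleBot" is a made-up crawler that falls under the "*" section.
    for agent in ["Googlebot", "ExampleBot"]:
        for path in ["/foo/page.html", "/bar", "/index.html"]:
            print(agent, path, verdict(agent, path))
```

With this parser, Googlebot is governed only by its own section: it may fetch paths under /foo and nothing else, since the final "Disallow: /" catches everything that was not explicitly allowed. Every other crawler falls under the * section and is kept out of /bar only.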
Computer applications work great when it comes to following well-defined instructions. The human brain, however, is less efficient at this, so the best advice is to keep things simple.
Google's webmaster tools include a robots.txt analysis tool that is highly recommended. For more information on the Robots Exclusion Standard, point your browser to www.robotstxt.org.
Today, when companies spend a lot of money to be included in search engine listings, the idea of excluding your content may seem quaint. But from a security perspective there are many valid reasons for limiting what a search engine indexes on your site.