The robots.txt (not robot.txt as it is commonly misspelled) and robots meta tag are two similar methods for excluding the search engine spiders from indexing all or part of your website.
Perhaps you don’t want a curious little robot poking around your cgi folder. Perhaps you have set up a temporary directory or a private area. Or perhaps you have an example page that demonstrates keyword spamming or hidden text links, and you don’t want this page penalized by the search engines.
These are all reasons to block a search engine robot from indexing some pages, images, scripts and other elements of your website. The robots.txt file needs to go in the root folder of your site (the same folder as your home page’s index.html file), because robots request it from /robots.txt on your domain.
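Because the file always sits in the root folder, its address can be worked out from the address of any page on the site. The short Python sketch below illustrates that rule; the function name robots_txt_url and the example.com addresses are hypothetical and chosen only for illustration.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url: str) -> str:
    """Return the address where a robot would look for robots.txt."""
    parts = urlsplit(page_url)
    # Keep the scheme and host, replace the path, drop query and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_txt_url("https://www.example.com/shop/item.html?id=3"))
# https://www.example.com/robots.txt
```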
In order to exclude the search engine robots from all or parts of your website, many webmasters will use a robots.txt file. This file is text only (no html) and has two areas, each needing specific information: the User-agent area and the Disallow area.
The user-agent area contains the name of the robot, robots or all robots.
User-agent: *
In this example, the wildcard * means all robots.
User-agent: googlebot
In this example, the rules that follow apply only to Google’s robot.
The next area of the robots.txt file is the Disallow area. In this area you can exclude a robot or robots from indexing your folders, images, html pages, scripts or other files.
Disallow: /cgi-bin
In this example, only the cgi-bin folder is excluded.
Disallow: /query.html
In this example, only the query.html file is excluded.
If you would like your entire site not to be indexed by any of the search engines you would put this in your robots.txt file:
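As a minimal sketch, blocking every robot from the whole site takes a wildcard User-agent line followed by Disallow: /. The Python snippet below embeds such a file and uses the standard urllib.robotparser module to check how a compliant robot would read it. The example.com address and the robot name "anybot" are placeholders, not part of the original article.

```python
from urllib.robotparser import RobotFileParser

# Every robot (*) is kept out of the whole site (/).
WHOLE_SITE_RULES = """\
User-agent: *
Disallow: /
"""

whole_site = RobotFileParser()
whole_site.parse(WHOLE_SITE_RULES.splitlines())

# "anybot" and example.com are placeholders for illustration.
print(whole_site.can_fetch("anybot", "https://example.com/index.html"))  # False
```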
If you want to exclude all of the robots from a certain directory on your website, your robots.txt file would look like this:
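A sketch of such a file, assuming the directory to protect is the cgi-bin folder used earlier as an example: the wildcard User-agent line is followed by a Disallow line naming the directory with a trailing slash. The Python snippet checks the rules with the standard urllib.robotparser module; the example.com addresses, the script name and the robot name are placeholders.

```python
from urllib.robotparser import RobotFileParser

# Every robot is kept out of one directory; the rest of the site stays open.
DIRECTORY_RULES = """\
User-agent: *
Disallow: /cgi-bin/
"""

directory_rules = RobotFileParser()
directory_rules.parse(DIRECTORY_RULES.splitlines())

# Anything under /cgi-bin/ is blocked, other pages are not.
print(directory_rules.can_fetch("anybot", "https://example.com/cgi-bin/search.cgi"))  # False
print(directory_rules.can_fetch("anybot", "https://example.com/index.html"))  # True
```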
If you want to exclude the robot from indexing a certain file in a certain directory, the robots.txt file would look like this:
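For a single file, the Disallow line carries the full path to that file. The sketch below assumes a hypothetical file query.html inside a hypothetical /private/ directory and checks the result with Python’s standard urllib.robotparser module; the paths, the example.com address and the robot name are illustrative only.

```python
from urllib.robotparser import RobotFileParser

# Every robot is kept away from one file; its neighbours stay open.
SINGLE_FILE_RULES = """\
User-agent: *
Disallow: /private/query.html
"""

single_file = RobotFileParser()
single_file.parse(SINGLE_FILE_RULES.splitlines())

print(single_file.can_fetch("anybot", "https://example.com/private/query.html"))  # False
print(single_file.can_fetch("anybot", "https://example.com/private/index.html"))  # True
```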
If you would like to keep a specific search engine robot from indexing a specific file, the robots.txt file would look like this:
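Here the User-agent line names the one robot and the Disallow line names the one file. The sketch below assumes the robot is googlebot and the file is /query.html, both borrowed from the earlier examples, and uses Python’s standard urllib.robotparser module to show that other robots are unaffected. The name "otherbot" and the example.com address are placeholders.

```python
from urllib.robotparser import RobotFileParser

# Only googlebot is kept away from query.html.
ONE_ROBOT_RULES = """\
User-agent: googlebot
Disallow: /query.html
"""

one_robot = RobotFileParser()
one_robot.parse(ONE_ROBOT_RULES.splitlines())

print(one_robot.can_fetch("googlebot", "https://example.com/query.html"))  # False
print(one_robot.can_fetch("otherbot", "https://example.com/query.html"))   # True
```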
If you would like to see what the world’s top search engine Google is doing, take a look at the Google robots.txt file at www.google.com/robots.txt.
If you would like to see what our own White House is doing, take a look at the whitehouse.gov robots.txt file at www.whitehouse.gov/robots.txt (the whole file is too long for display here).
The character # is used for a comment in the robots.txt file. The rule of thumb is to put each comment on a new line of its own, starting with the #.
Robots Meta Tag
Another way to keep robots from indexing the html pages of a website, or from following their links, is to use a robots meta tag in the head section of a page. There are four main robots meta tags you can use to instruct a robot.
<meta name="robots" content="index,follow">
<meta name="robots" content="noindex,follow">
<meta name="robots" content="index,nofollow">
<meta name="robots" content="noindex,nofollow">
Index – instructs the robot to index the page.
Noindex – instructs the robot not to index the page.
Follow – instructs the robot to follow the links from the page and index them.
Nofollow – instructs the robot not to follow the links from the page and thus not index them.
If there is no robots meta tag on a page, then the default is “index,follow”, which means all robots will index the page and follow its links to other pages for indexing.
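To make the four values and the default concrete, here is a small Python sketch of how a robot might read the tag. It uses the standard html.parser module; the class name RobotsMetaParser and the function robots_directives are hypothetical names for this illustration, and real search engine robots may interpret the tag in more detail than this.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects the index/follow instructions from a robots meta tag."""

    def __init__(self):
        super().__init__()
        # Default when no robots meta tag is present: index,follow.
        self.index = True
        self.follow = True

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() != "robots":
            return
        content = attributes.get("content") or ""
        values = {value.strip().lower() for value in content.split(",")}
        if "noindex" in values:
            self.index = False
        if "nofollow" in values:
            self.follow = False


def robots_directives(html: str) -> tuple[bool, bool]:
    """Return (index, follow) for the given page source."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.index, parser.follow


page = '<html><head><meta name="robots" content="noindex,follow"></head></html>'
print(robots_directives(page))  # (False, True)
```

A page with no robots meta tag at all returns (True, True), matching the “index,follow” default.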