Can Googlebot See You? Let’s Find Out
Posted at 8:04am EST on 08/20/2007
Making sure that Googlebot can see all of your pages, or rather only the pages you want it to see, is very important. Google is the only major search engine provider to offer a tool that shows you whether or not its robot can see your pages.
For a long time, you just had to add rules to your robots.txt and wait a few months to see whether they actually stopped certain pages from being indexed, or whether the wildcard patterns (often loosely called “regex”) you entered actually worked. The process was far too slow.
The robots.txt analysis tool has been a feature of Google’s Webmaster Tools for quite some time now, but I only recently started using it more aggressively. I will be using this site, ginside.com, as an example of how effective this tool can be for webmasters.
As a quick reference, you can get to the robots.txt analysis tool by selecting one of your sites, then going to Diagnostic, then robots.txt analysis.

Taking a look at my robots.txt for ginside.com, you can see a few things going on. I have it set to User-agent: * because I want the rules to apply to all robots that read this file and crawl my site, which includes Googlebot. I disallow my login page and the admin panel, and then I start digging into a deeper level of robots.txt.
I am using wildcard patterns for the next two entries (I call them “regex” here, but robots.txt only understands the * and $ wildcards, not full regular expressions). The line that shows /*/feed/$ blocks all of my posts’ feed URLs. Even though I don’t have physical links going to the feeds, I felt it was best to rule out any possibility of someone else linking to them (they do exist) and getting them indexed. I don’t want my per-post feed links indexed because I feel they are worthless content in the search index of any search engine. They provide no value as search results. Someone may be interested in subscribing to a specific feed on my site, but they are more than likely not going to do that until after they have seen the post and become very interested in it.
The next entry, /date/*/$, blocks all of my archive pages. I have this entry because I’m running a test for duplicate content under the same domain. I have been trying a couple of different experiments with how indexing works and what the best method is to ensure that all my posts get indexed and given proper link popularity values.
So how does /date/*/$ actually work? Some simple wildcard syntax gets the job done. The /date/ part names the directory where I want the robots to start blocking. The * tells the robots to match anything after /date/, such as 07/2006 in /date/07/2006/, or whatever it may be. The $ marks the end of the URL, so the rule only matches URLs that end with the slash right before it.
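The matching described above can be sketched in a few lines of Python. This is only an approximation under stated assumptions (* matches any run of characters, a trailing $ pins the rule to the end of the path, and a rule without $ is a prefix match), not Googlebot’s actual implementation. The function names and the /some-post/ URLs are made up for illustration.

```python
import re


def compile_rule(rule):
    """Turn a robots.txt path rule using * and $ into a compiled regex.

    Assumed semantics: * matches any run of characters, a trailing $
    pins the match to the end of the URL path, and without $ the rule
    is a simple prefix match.
    """
    anchored = rule.endswith("$")
    if anchored:
        rule = rule[:-1]
    # Escape the literal pieces and join them with ".*" for each wildcard.
    body = ".*".join(re.escape(piece) for piece in rule.split("*"))
    return re.compile(body + ("$" if anchored else ""))


def is_blocked(path, rules):
    """Return True if any rule matches the path from its first character."""
    return any(compile_rule(rule).match(path) for rule in rules)


RULES = ["/*/feed/$", "/date/*/$"]

print(is_blocked("/date/07/2006/", RULES))    # archive page from the text
print(is_blocked("/some-post/feed/", RULES))  # per-post feed (hypothetical URL)
print(is_blocked("/some-post/", RULES))       # the post itself (hypothetical URL)
```

Under these assumptions the first two paths are blocked and the third is not, which is the behavior the two entries are meant to produce.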
Let’s put it through the test.

Looks like everything went according to plan and it only blocked that one URL. The screenshot above doesn’t show the full URL because it was tough to fit into the screenshot, so I tested with a shortened URL. Nonetheless, it’s working as it should.
The Yahoo!-specific crawl delay was properly ignored by Googlebot, and Googlebot recognizes my sitemap.
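A similar check can be run locally with Python’s standard-library urllib.robotparser. The robots.txt below is a hypothetical stand-in: the paths, the delay value, and the sitemap URL are placeholders, not the real contents of my file. It only shows how one parser reads such a file, not how Googlebot does.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: the paths, the delay value, and the sitemap URL
# are placeholders, not the actual contents of the file discussed above.
ROBOTS_TXT = """\
User-agent: *
Disallow: /wp-login.php
Disallow: /wp-admin/

User-agent: Slurp
Crawl-delay: 10

Sitemap: http://example.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

googlebot_delay = parser.crawl_delay("Googlebot")  # no delay is set for Googlebot
slurp_delay = parser.crawl_delay("Slurp")          # delay from the Slurp section
sitemaps = parser.site_maps()                      # requires Python 3.8 or later

print(googlebot_delay, slurp_delay, sitemaps)
```

With this input the parser reports no crawl delay for Googlebot, a delay of 10 for Slurp, and the single sitemap URL.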
This is just a brief article on how you can use robots.txt more effectively to control what is actually indexed on your site by major search engines (at least the ones that actually obey the robots.txt file).
JohnMu
Aug 20th, 2007
I’m fairly certain you don’t want your robots.txt like that. By having generic and specified sections, you are telling the Yahoo crawler to apply the crawl-delay but letting it crawl everything, including the feeds. The specific section overrides any setting you have in your generic section.
For instance:
would allow the Googlebot to crawl everything while disallowing access to all other crawlers.
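A file fitting that description can be checked with Python’s standard-library urllib.robotparser. Both robots.txt texts below are assumptions: the first is a guess at the kind of file meant here, and the second is a hypothetical layout with a specific section that holds only a crawl delay. The paths and the OtherBot name are invented. This shows how one parser applies the sections and says nothing definite about how Yahoo’s crawler behaves.

```python
from urllib.robotparser import RobotFileParser


def load(text):
    """Parse robots.txt text and return the parser."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


# Assumed file: the generic section blocks everything,
# the Googlebot section blocks nothing.
googlebot_only = load("""\
User-agent: *
Disallow: /

User-agent: Googlebot
Disallow:
""")

print(googlebot_only.can_fetch("Googlebot", "/any-page/"))  # True
print(googlebot_only.can_fetch("OtherBot", "/any-page/"))   # False

# Hypothetical layout: the specific section holds only a crawl delay.
delay_only = load("""\
User-agent: *
Disallow: /wp-admin/

User-agent: Slurp
Crawl-delay: 10
""")

print(delay_only.can_fetch("Slurp", "/wp-admin/"))     # True: Slurp section has no Disallow
print(delay_only.can_fetch("OtherBot", "/wp-admin/"))  # False: falls back to the * section
```

In this parser a robot that finds its own section uses only that section, so the Disallow lines in the generic section no longer apply to it.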
Similarly, you need to keep in mind that only Google, Yahoo and MSN (“only” probably 99.0%, but anyway :-)) actually use wildcards. This means that all other crawlers who see those disallow-lines will not be able to parse them and will allow those URLs. If you need to keep them out, you would have to put the full URLs into your robots.txt.
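Python’s standard-library urllib.robotparser can stand in for a crawler without wildcard support, since to my knowledge it treats * and $ in a path as literal characters. The sketch below compares a wildcard rule with a plain path rule. The SomeBot name is invented, and using the /date/ prefix as the literal rule is my simplification of listing full URLs.

```python
from urllib.robotparser import RobotFileParser


def can_fetch(robots_txt, path, agent="SomeBot"):
    """Return whether the (hypothetical) agent may fetch the path."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, path)


wildcard_rule = "User-agent: *\nDisallow: /date/*/$\n"
literal_rule = "User-agent: *\nDisallow: /date/\n"

# The wildcard line is read literally, so the archive URL is still allowed.
print(can_fetch(wildcard_rule, "/date/07/2006/"))  # True

# A plain path prefix is understood and blocks the same URL.
print(can_fetch(literal_rule, "/date/07/2006/"))   # False
```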
Jonathan Dingman
Aug 27th, 2007
John,
I actually went through and verified it: if a specific section has no listing of its own, such as Disallow:, the robot takes whatever it has from the section above. With what I have listed, Yahoo!’s Slurp obeys both sections, and it has actually done exactly that.
Personally, I don’t want Googlebot to be the only robot allowed to crawl my site. MSN and Yahoo! still send me a decent bit of traffic which I’m appreciative of.
It probably is the case that only the major search engines obey the wildcards, but those are all that I’m concerned with in this case. I’m just as happy not caring about the rest, since most people who even search for my site use one of those three major search engines.
But in any case, yes, you would want to use the fully qualified paths if you were wanting to exclude *every* robot that obeys the rules.
Thanks for your two cents, John.