Can Googlebot See You? Let’s Find Out

Written by Jonathan Dingman
08/20/2007 8:29 ET - Filed under Search

Making sure that Googlebot can see all of your pages, or rather only the pages you want it to see, is very important. Google is the only major search engine provider to offer any kind of tool that lets you check whether or not its robot can see your pages.

For a long time, you would just add rules to your robots.txt and wait a few months to see whether certain pages actually stopped being indexed, or whether the pattern you entered actually worked. I will loosely call these patterns “regex” here, although robots.txt only understands the * and $ wildcards rather than full regular expressions. The process was far too slow.

The robots.txt analysis tool has been a feature of Google’s Webmaster Tools for quite some time now, but I only recently started using it more heavily. I will be using this site, ginside.com, as an example of how effective this tool can be for webmasters.

As a quick reference, you can get to the robots.txt analysis tool by selecting one of your sites, then going to Diagnostic, then robots.txt analysis.

[Screenshot: ginside.com robots.txt]

Taking a look at my robots.txt for ginside.com, you can see a few things going on. I have it set to user-agent: * because I want the rules to apply to every robot that reads this file and crawls my site, which includes Googlebot. I disallow my login page and the admin panel, and then I start digging into a deeper level of robots.txt.

I am using regex-style wildcards for the next two entries. The line that shows /*/feed/$ blocks all of my posts’ feed URLs. Even though I don’t have physical links going to the feeds, they did exist, so I felt it was best to rule out any possibility of someone else linking to them and getting them indexed. I don’t want per-post feed links indexed because I feel they are worthless content in the search index of any search engine. They provide absolutely no value to anyone there. Someone may be interested in subscribing to a specific feed on my site, but they are more than likely not going to do that until after they have seen the post and become very interested in it.

The next entry, /date/*/$, blocks all of my archive pages. I have this entry because I’m running a test on duplicate content under the same domain. I have been trying a couple of different experiments with how indexing works and what the best method is to ensure that all my posts get indexed and given proper link popularity values.

So how does /date/*/$ actually work? Some simple wildcard syntax gets the job done. /date/ is the directory where I want the robots to start blocking. The * tells the robots to match anything after /date/, such as /date/07/2006/, or whatever it may be. The $ marks the end of the URL, so the rule only applies to addresses that end with that final slash.
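If you want to sanity-check patterns like these before pasting them into robots.txt, here is a rough Python sketch of the matching described above. It assumes the documented Googlebot behavior (* matches any run of characters, a trailing $ pins the match to the end of the URL), and the function names are my own, not anything Google provides. It also ignores Allow rules and rule precedence, so treat it as an approximation rather than a replacement for the analysis tool.

```python
import re

def robots_pattern_to_regex(pattern):
    """Translate a robots.txt path pattern (* and trailing $) into a regex.

    Hypothetical helper for illustration; not part of any Google tool.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape each literal chunk, then let every * stand for "any characters".
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    # A trailing $ means the URL must end here; otherwise it is a prefix match.
    return re.compile(body + (r"\Z" if anchored else ""))

def is_blocked(path, disallow_patterns):
    """Return True if any Disallow pattern matches from the start of the path."""
    return any(robots_pattern_to_regex(p).match(path) for p in disallow_patterns)

# The two wildcard entries discussed in this post.
RULES = ["/*/feed/$", "/date/*/$"]

# The test paths other than /date/07/2006/ are made up for this example.
for path in ["/date/07/2006/", "/some-post/feed/", "/some-post/", "/feed/"]:
    print(path, "blocked" if is_blocked(path, RULES) else "allowed")
```

Note that under these rules the site-wide /feed/ stays crawlable, since /*/feed/$ needs at least one more slash-delimited segment in front of it. Only the per-post feeds get caught.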

Let’s put it through the test.

[Screenshot: robots.txt analysis results]

Looks like everything went according to plan and it only blocked that certain URL. The screenshot above doesn’t show the full URL because it was tough to make the screenshot work with it, so I tested with that shortened URL instead. Nonetheless, it’s still working as it should.

Googlebot properly ignored the Yahoo!-specific crawl delay, and it recognized my sitemap.
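You can check the same two things offline with Python’s standard urllib.robotparser module. The sketch below is only an illustration: the robots.txt text is a stand-in I made up for the example (the login and admin paths, the delay of 5, and the sitemap URL are all assumptions, not my real file), and it needs Python 3.8 or later for site_maps(). Keep in mind that this module does simple prefix matching and does not understand the * and $ wildcards, so it is no use for testing the two entries above.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for illustration only; the paths, delay value,
# and sitemap URL are assumptions, not the real ginside.com file.
ROBOTS_TXT = """\
User-agent: *
Disallow: /wp-login.php
Disallow: /wp-admin/

User-agent: Slurp
Crawl-delay: 5

Sitemap: http://ginside.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# The crawl delay belongs to the Slurp (Yahoo!) group only.
print("Slurp delay:", parser.crawl_delay("Slurp"))
print("Googlebot delay:", parser.crawl_delay("Googlebot"))

# Sitemap lines are not tied to any user-agent group.
print("Sitemaps:", parser.site_maps())
```

The delay only comes back for Slurp because the Crawl-delay line sits inside the Slurp group, which is the same reason Googlebot skips it.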

This is just a brief article on how you can use robots.txt more effectively to control what is actually indexed on your site by major search engines (at least the ones that actually obey the robots.txt file).
