At SES New York, someone asked “Why don’t you provide a parameter, like ‘?googlebot=nocrawl’ to say ‘Googlebot, don’t index this page’?” That was a pretty good question. The short answer would be that on pages you don’t want indexed by spiders, you can add this meta tag to the page:
<META NAME="ROBOTS" CONTENT="NOINDEX">
You can read more about the noindex and nofollow meta tags on our webmaster pages.
But the user specifically wanted a url parameter. I mentioned that because the parameter “id” is often used for session IDs, Googlebot used to avoid urls with “?id=(let’s say a five digit or larger number)” but that I didn’t know if that was still true. I think someone else nearby asked “Isn’t that kind of an ugly hack though?” and I had to fall back on “You asked for something that worked, not something that was pretty.” The questioner persisted, but I was out of other ways to do it, so I said I’d pass the feedback on, namely “someone wants a url parameter that keeps Googlebot from indexing the page.”
That question came up again today, and I wanted to mention one more way to block Googlebot by using wildcards in robots.txt (Google supports wildcards like ‘*’ in robots.txt). Here’s how:
1. Add the parameter like ‘http://www.mattcutts.com/blog/some-random-post.html?googlebot=nocrawl’ to pages that you don’t want fetched by Googlebot.
2. Add the following to your robots.txt:
User-agent: Googlebot
Disallow: *googlebot=nocrawl
That’s it. We may see links to the pages with the nocrawl parameter, but we won’t crawl them. At most, we would show the url reference (the uncrawled link), but we wouldn’t ever fetch the page.
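If you want to sanity-check which urls a rule like this would catch before you deploy it, here is a rough sketch in Python. It is not Googlebot's actual matching code. The function names pattern_to_regex and is_disallowed are made up for illustration, and the sketch assumes that '*' matches any run of characters, that a trailing '$' anchors the end of the url, and that the rule is compared against the path plus query string.

```python
import re
from typing import List
from urllib.parse import urlsplit


def pattern_to_regex(pattern: str):
    """Turn a robots.txt path pattern into a compiled regex (illustrative only)."""
    # Assumption: a trailing '$' means "the url must end here".
    anchored_end = pattern.endswith("$")
    if anchored_end:
        pattern = pattern[:-1]
    # Escape every literal piece, then let each '*' match any run of characters.
    body = ".*".join(re.escape(piece) for piece in pattern.split("*"))
    return re.compile(body + ("$" if anchored_end else ""))


def is_disallowed(url: str, disallow_patterns: List[str]) -> bool:
    """Return True if any Disallow pattern matches the url's path and query."""
    parts = urlsplit(url)
    target = parts.path or "/"
    if parts.query:
        target += "?" + parts.query
    # re.match anchors at the start of the string, like a path-prefix rule.
    return any(
        pattern_to_regex(p).match(target) is not None
        for p in disallow_patterns
        if p  # an empty Disallow line blocks nothing
    )


if __name__ == "__main__":
    rules = ["*googlebot=nocrawl"]
    for url in (
        "http://www.mattcutts.com/blog/some-random-post.html?googlebot=nocrawl",
        "http://www.mattcutts.com/blog/some-random-post.html",
    ):
        print(url, "->", "blocked" if is_disallowed(url, rules) else "allowed")
```

The first url carries the parameter and is reported as blocked; the second does not and is reported as allowed. Because the rule starts with '*', it also catches the parameter when it comes after other parameters, such as '?page=2&googlebot=nocrawl'.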