View Full Version : index a page but not its session duplicates
06-08-2009, 02:19 PM
Is this set of directives doing what it's supposed to? Will all the bots interpret it the same way?
The obvious "Disallow: /*?" will not work, because then every URL containing a session ID would be disallowed, which we don't want.
Those two pages are the only ones with duplicate-content issues (title, etc.). The session ID is matched to a table and becomes the SELECTED item in a pull-down menu. It is also passed to the Submitted page on confirmation of submission.
We do want the Register and Reserve pages indexed, just not the individual sessions. This is further complicated by the fact that there are other pages in the same directory that depend entirely on session IDs, and we want all of those indexed.
The method you have should work, although you don't strictly need the Allow lines. You have two additional options as well: you could add the canonical tag dynamically, or add a meta noindex tag when a session ID is present in the page.
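For reference, here is a robots.txt sketch of the pattern under discussion. The /courses/ paths and file names are assumptions pieced together from the canonical examples later in the thread, not the poster's actual file:

```text
User-agent: *
# Block only the session variants (anything with a "?" and query string).
Disallow: /courses/register.php?
Disallow: /courses/reserve.php?
# Redundant but harmless: the bare URLs never match the Disallow
# prefixes above, so these Allow lines can be dropped.
Allow: /courses/register.php
Allow: /courses/reserve.php
```

One caveat on "will all the bots interpret it the same way": Disallow prefix matching is part of the original robots.txt convention, but Allow is a later extension that not every crawler honors, which is another reason it is safe to drop.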
06-08-2009, 03:24 PM
Am I correct in surmising from your comment that a URL without a '?' will be ignored by the Disallow rule?
I have included a rel="nofollow" attribute in the dynamically generated referring link. I'm hoping that if these session pages have been indexed in the past six months, they will eventually fall off the radar with this addition to robots.txt.
Is this the canonical approach to which you have referred?
<link rel="canonical" href="http://www.example.com/courses/register.php" />
<link rel="canonical" href="http://www.example.com/courses/reserve.php" />
Now the 'dumb and dumber' question: Which page should this tag go into? The target page(s) or the dynamic page referring to it (them)?
Our dynamic referring page uses a flag that toggles the link text and the target between the two pages above.
In reference to my previous comment: Disallow statements are prefix matches, so if you disallow file.php?, anything that starts with file.php? will be blocked. file.php itself, however, would still be allowed, so the Allow statement is not technically needed (but it won't cause a problem if you leave it there).
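The prefix semantics described above can be illustrated with a toy matcher. This is a minimal sketch of the classic Disallow behavior; the function name and the rule list are mine, not from any library:

```python
def is_blocked(path: str, disallow_rules: list[str]) -> bool:
    """Return True if any Disallow rule is a prefix of the URL path.

    This mirrors the classic robots.txt rule: a Disallow value blocks
    every URL that *starts with* that value, and nothing else.
    """
    return any(path.startswith(rule) for rule in disallow_rules)

# Hypothetical rules for the two pages discussed in this thread.
rules = ["/courses/register.php?", "/courses/reserve.php?"]

print(is_blocked("/courses/register.php?sid=42", rules))  # True: session duplicate blocked
print(is_blocked("/courses/register.php", rules))         # False: bare page still crawlable
print(is_blocked("/courses/other.php?sid=42", rules))     # False: other session pages untouched
```

The second call is the key point: the bare URL never contains the "?" that ends the Disallow value, so it can never match the prefix, which is why the Allow line is redundant.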
I don't think the rel="nofollow" attribute on the links will have any effect at all here: it appears that it simply tells the spider not to pass PageRank, and it may not actually prevent the page from being crawled.
I would consider the canonical tag, as shown, but only if the pages are actual duplicates or very similar; if there are large sections of unique text, Google may decide to disregard the canonical tag. If you do implement it, you would put the tag on all of the pages. So the first tag you gave would go on register.php and on every session variant of it. This tells the spider that the pages form one logical unit, but again, it will only work if there is limited unique content.
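To make "on all of the pages" concrete: the same canonical href is emitted whether or not a session ID is in the URL. The actual pages are PHP, so this Python helper (names mine) is just a sketch of the logic:

```python
def canonical_tag(request_path: str,
                  base_url: str = "http://www.example.com") -> str:
    """Build a canonical <link> for a page, ignoring any query string.

    Both /courses/register.php and /courses/register.php?sid=123 get
    the identical tag, so the session variants collapse onto the bare
    page in the index.
    """
    path = request_path.split("?", 1)[0]  # drop ?sid=... if present
    return '<link rel="canonical" href="%s%s" />' % (base_url, path)

print(canonical_tag("/courses/register.php?sid=123"))
# <link rel="canonical" href="http://www.example.com/courses/register.php" />
```

Because the query string is stripped before the tag is built, every session variant points back at the bare page, which is exactly the grouping described above.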
06-08-2009, 04:08 PM
Limited unique content - very much so. The only difference is the SELECTED text in the pull-down and the page's h1. I could have made a dynamic title but couldn't see the sense in it back then. I'm kind of kicking myself for letting this duplicate issue slide for so long; it only now began showing up in Google Webmaster Tools.
On looking at the rel="nofollow" thing a second time, it makes no sense to worry about link juice if the page isn't being crawled in the first place. I really don't know what gave me the idea to use it beyond a knee-jerk reaction. Thanks for pointing this out.
06-13-2009, 01:15 AM
So there is a way to do that, huh!
Still learning those tags and codes for my site.
Please keep it going.
Thanks a lot.