Does this constitute duplicate content?
Simon Young
09-08-2009, 05:54 AM
Hope you can help. I want to make sure that I am not being penalised for duplicate content, and if I am, perhaps someone could suggest a fix.
I have a PHP page where customers select their printer, which then takes them to the page for that particular printer. The selection page is Compatible inkjet printer cartridges - Epson, Canon and HP ink (http://www.dvd-and-media.com/cartridges_index.php). I then have a rewrite that converts the dynamic PHP results page to an HTML page. For example:
dvd-and-media.com/cartridges_purchase.php?Model=Epson+Stylus+Photo+R320 is rewritten and converted to dvd-and-media.com/Inkjet-Cartridges-Epson-Stylus-Photo-R320.html
However, if a customer adds the item to their cart and then clicks "continue shopping", they get the PHP version of the page, so both versions still exist. Does this mean a duplicate content penalty, or is this acceptable?
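A minimal sketch of the mapping described above, assuming the rewritten address is always "Inkjet-Cartridges-" plus the Model value with spaces turned into hyphens. That pattern is inferred from the single example given, and the function name static_url_for is hypothetical. Something like this could be used to build the "continue shopping" link so that it points at the .html version rather than the .php one:

```python
from urllib.parse import urlsplit, parse_qs


def static_url_for(php_url: str) -> str:
    """Map a dynamic cartridges_purchase.php URL to its rewritten .html form.

    Assumption: the rewritten name is 'Inkjet-Cartridges-' followed by the
    Model value with spaces replaced by hyphens, as in the one example given.
    """
    parts = urlsplit(php_url)
    # parse_qs decodes '+' as a space, so 'Epson+Stylus+Photo+R320'
    # becomes 'Epson Stylus Photo R320'.
    model = parse_qs(parts.query)["Model"][0]
    slug = "-".join(model.split())
    return f"{parts.scheme}://{parts.netloc}/Inkjet-Cartridges-{slug}.html"


dynamic = ("http://www.dvd-and-media.com/cartridges_purchase.php"
           "?Model=Epson+Stylus+Photo+R320")
print(static_url_for(dynamic))
```

If every internal link is generated this way, only one address per printer is ever exposed to visitors and spiders.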
inertia
09-08-2009, 06:12 AM
Why is the page after "continue shopping" not rewriting in the same way as the initial page?
SemAdvance
09-08-2009, 03:47 PM
A duplicate content penalty would be applied to Site B which takes Intellectual Property from Site A without permission, which is often written content, and posts it on their own site, as their own work. The search engines would supposedly be able to determine date of first inception and filter Site B from appearing in the upper results.
Duplicate pages within your own site would be seen by the search engines as standard operation for a great many e-commerce websites, and for others using a CMS, for example.
Hope this helps!
morestar
09-08-2009, 03:47 PM
Yes, right now, if that was my site, I would be quite worried about duplicate content issues.
So going to the product page brings up the HTML rewrite, and then if they continue shopping they get the actual PHP version of the page.
Have you thoroughly checked the URL of the new PHP (duplicate) page? Maybe it's not even the same page that gets rewritten in the first place...
morestar
09-08-2009, 03:48 PM
A duplicate content penalty would be applied to Site B which takes Intellectual Property from Site A without permission, which is often written content, and posts it on their own site, as their own work. The search engines would supposedly be able to determine date of first inception and filter Site B from appearing in the upper results.
Duplicate pages within your own site would be seen by the search engines as standard operation for a great many e-commerce websites, and for others using a CMS, for example.
Hope this helps!
OH!
I knew not...
morestar
09-08-2009, 03:56 PM
ya it might not considering the redirect to the duplicate page is done via a click submission.
But just in case, I would use, and I have used, rel="nofollow" on links that go to the cart... There's no reason to have it spidered, and you most likely wouldn't want the cart to be the landing page, anyway.
morestar
09-08-2009, 04:06 PM
I liken SEO to voodoo and make a sacrifice of rum and decapitate a chicken to Papa Legba, spirit of communications and crossroads, before every site launch.
I do the same...
ya that would be a great idea - the no follows, that should keep your mind at ease for a time anyways...the search engines don't need to see your cart pages...
Jlee350
09-08-2009, 04:07 PM
In fact I do NJ! ;)
It is GSiteCrawler - you'll have to search for it b/c I can't post URLs yet. :) But it's easy to find.
SnerdeyWebs
09-08-2009, 04:12 PM
Just following up on the question ( do you have a fav crawler to use )
Anyone have a list that they normally check with that they'd like to share :)
johnWorks
09-08-2009, 04:51 PM
Hope you can help. I want to make sure that I am not being penalised for duplicate content, and if I am, perhaps someone could suggest a fix.
No problem. Google added an additional link attribute so that you can specify which page you prefer to be the "real" page for these types of issues. It's rel="canonical".
I still can't post links on the forum here, but just google "rel canonical" and you should see the post on the Google Webmaster Central blog discussing it, and how to implement.
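As a rough illustration of the rel="canonical" approach, here is a small sketch that builds the tag the PHP version of a page could print inside its head section, pointing at the preferred .html address. The helper name canonical_link_tag is hypothetical, and the URL is simply the example from the first post:

```python
from html import escape


def canonical_link_tag(preferred_url: str) -> str:
    """Return a <link rel="canonical"> element for the page's <head>."""
    # Escape the URL so characters such as '&' or '"' cannot break the attribute.
    href = escape(preferred_url, quote=True)
    return f'<link rel="canonical" href="{href}" />'


tag = canonical_link_tag(
    "http://www.dvd-and-media.com/Inkjet-Cartridges-Epson-Stylus-Photo-R320.html"
)
print(tag)
```

Both the .php and the .html version would carry the same tag, so whichever one a spider reaches, it is told which address is the preferred one.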
dwells87199
09-08-2009, 04:51 PM
The second page does not need to be indexed. Who cares if Google penalizes it? If I am reading this correctly, the second page is accessed from the first page.
bsnelling
09-08-2009, 04:57 PM
Good advice! Do you have a favorite crawler?
1) I am fairly new here: I thought spiders crawl your site at an undetermined time. What do you mean by favorite crawler? Can I initiate the spider crawl, and if so, how?
2) We have a website for .com and .de and were also told this would be duplicate content. Is that correct?
3) You have posted here over 90 times and you managed to get your website on the blog. How do you do that? I was told by my SEO company to blog until I am blue in the face to get backlinks to our website. How do I get backlinks, though, if I am not allowed to add my website to the blog? I think I qualify after 10 postings, but I have not seen my www. ... from my profile in here...
Sorry for the rookie questions... Hope someone knows more than me, which is a definite yes anyway, lol !!!
Thank you!!!
allanp73
09-08-2009, 05:06 PM
From what it sounds like, it requires selections and then a click. Robots just follow links; they don't fill out forms. So I doubt they would even get to the second page. Both pages sound like they would have minimal content, and I wouldn't worry too much if neither got indexed.
nigeltpacker
09-08-2009, 05:31 PM
Mryang,
I do not think that the issue is with Google and duplication although there is a considerable amount of body text repetition on each page. Looking at the site there are a number of user experience issues that are distracting the user from their assigned task of purchasing ink cartridges.
Are you a printer cartridge supplier or a memory stick supplier? Are you a paper and label supplier or a battery supplier? Whilst all are relevant to a particular visitor, our research has shown that with users becoming more task orientated when purchasing online any product that is not in their scope of purchase will not register with them.
It may be worth considering splitting the site and starting new websites specialising in one product range. I can see that you have done a lot of work developing your website and you will have to weigh up the cost and profit benefits to developing new niche websites.
I hope this has given you food for thought
Regards
Nigel
CKarsting
09-08-2009, 06:03 PM
Duplicate content is usually only an issue when the content is being used on two different domains. For somewhat meaningless pages such as these, I would suggest excluding them in your robots.txt, for a few reasons. One, it will remove the chance of being penalized for any duplicate content (within the site), and two, it will keep from diluting the page rank of your more important pages. There really is no need to even have pages dedicated to printing or other non-topic-related pages on your website, so why worry about having them indexed? Food for thought.
Clint1
09-08-2009, 11:47 PM
Just following up on the question ( do you have a fav crawler to use )
Anyone have a list that they normally check with that they'd like to share :)
Xenu (http://home.snafu.de/tilman/xenulink.html) is another one.
Clint1
09-08-2009, 11:49 PM
A duplicate content penalty would be applied to Site B which takes Intellectual Property from Site A without permission, which is often written content, and posts it on their own site, as their own work. The search engines would supposedly be able to determine date of first inception and filter Site B from appearing in the upper results.
That's not how it works in G. I've repeatedly had long-standing top-ranked content stolen/hijacked and G rewards them by deleting my pages and replacing them with the parasite's pages.
scgalvin
09-09-2009, 02:09 AM
The duplicate content penalty issue is a touchy one; some people feel there is not a real "penalty" at all. I don't agree. I would guess Google looks at which content is the oldest and assumes that is the original content, but I'm sure it's not that simple.
As far as the problem at hand. I would recommend to use a robots.txt file to exclude the print friendly version, I would not rely on a no_follow.
Clint1
09-09-2009, 02:37 AM
The duplicate content penalty issue is a touchy one; some people feel there is not a real "penalty" at all. I don't agree. I would guess Google looks at which content is the oldest and assumes that is the original content, but I'm sure it's not that simple.
Yeah, it just depends on how one looks at it and their definition of "penalty". They will usually just pick one page to index then delete the other (on the same domain). If the domains are different, in my unfortunate experience with it, they'll delete the long-standing page and index the newer from the scraper parasite website--even when that parasite page has absolutely no PR nor IBL's.
As far as the problem at hand. I would recommend to use a robots.txt file to exclude the print friendly version, I would not rely on a no_follow.
Yeah again. (That's rel="nofollow"). The rumors flying now are that the jerks are penalizing for using rel="nofollow" because they view it as "PR sculpting". Well DUH, WTF did they think was going to happen with their PR Pandora's box. :confused: :confused: You introduce something as evil, nefarious, manipulative and potentially slanderous as that, and site owners are indeed going to have to find ways of "dealing" with it. It's a self-fulfilling prophecy.
Simon Young
09-09-2009, 04:37 AM
WOW really opened a can of worms that I didn't expect - so to summarise -
As my duplicate content is on the same domain then I should be OK
Also as it takes a click and an operation to get to the php page I should also be OK
Lastly I only risk one of the pages being ignored by Google and keep one in the index.
I understand the benefit of adding rel="nofollow" to the cart links, as this would definitely stop those PHP files being indexed, but that brings up one further issue that perhaps you guys can answer:
As my cart is handled on another domain by an external provider, the way my site is currently set up means I have absolutely thousands of external links on my site. Would adding rel="nofollow" to all those links do me any harm, or benefit me in the long run?
Clint1
09-09-2009, 08:45 AM
As my duplicate content is on the same domain then I should be OK.
Not really. As others have suggested, you should block the printer friendly pages (or whatever) in your robots.txt file.
Also as it takes a click and an operation to get to the php page I should also be OK
I'm not sure what you mean by "operation", but if it's a plain basic exposed link, it will be spidered and potentially picked up.
Lastly I only risk one of the pages being ignored by Google and keep one in the index.
"Generally", yes, but you have no way of knowing which, and they could end up deleting both pages.
I understand the benefit of adding rel="nofollow" to the cart links, as this would definitely stop those PHP files being indexed, but that brings up one further issue that perhaps you guys can answer:
As my cart is handled on another domain by an external provider, the way my site is currently set up means I have absolutely thousands of external links on my site. Would adding rel="nofollow" to all those links do me any harm, or benefit me in the long run?
No, use the robots.txt file to block the pages. See my post #23 above.
There's no point in allowing cart pages, or for that matter even bothering with them. Even if dup content is involved there, after all they're only cart pages and who really cares, right? ;)
You shouldn't block all OBL's, unless they are all cart and cart associated pages. But like I said, why bother. Some OBL's could even help you if they go to well-known sites or same-field sites as yours.
Simon Young
09-09-2009, 10:37 AM
Clint, can't really use the robots file to block the other 50% of pages as it would be over 4000 pages and they are constantly changing - I don't believe Google pays any attention to my exclusions already in my robots.txt file as I disallowed a directory (my marketing directory) several months ago and the content of that directory is STILL in the index.
I would love to exclude the cart URL using the robots file and that relates to all the links to the cart in my site which are always in a form -
<form method="POST" action="http://ww3.aitsafe.com/cf/add.cfm" onSubmit="return mmCheckIfSelected(this)">
So it would be the URL ww3.aitsafe.com and any variants after that that I would want to exclude.
I tried changing the form POST to include a nofollow, but that didn't give valid code, such as -
<form method="POST" action="http://ww3.aitsafe.com/cf/add.cfm" onSubmit="return mmCheckIfSelected(this)" rel="nofollow">
So what would I need to add as a line in my robots.txt file to basically say: ignore and don't follow ANY links which include ww3.aitsafe.com?
weegillis
09-09-2009, 11:45 AM
Have you found any results in the SERP that point to your printer friendly pages? If not, it might be safe to say they're not being indexed and you're off the hook already.
As has been pointed out by several previous posters, duplicate content on your site 'may' confuse the spiders, a wee little bit, but not much, and would not result in a penalty if I understand the SE's definition of duplicate content (across domains).
I would be more worried about upsetting the cart in progress, or hot linking into the cart process without it having been initialized. From a technical perspective, it is this sort of proactive fix I would be looking for, if such a problem might exist.
From what I can make of using rel="nofollow" for your third party shopping cart, it makes no sense to try to use it. Better to contact the provider and have them just ban robots on that URI altogether. Problem solved for everyone (subscribers).
It was also pointed out that one of the paths is generated on a click, which means the URL is contained by a Javascript method. Wouldn't this make the link invisible?
Clint1
09-10-2009, 01:53 AM
Clint, can't really use the robots file to block the other 50% of pages as it would be over 4000 pages and they are constantly changing
Yeah that would be a problem.
I don't believe Google pays any attention to my exclusions already in my robots.txt file as I disallowed a directory (my marketing directory) several months ago and the content of that directory is STILL in the index.
The Gbot's pay attention to mine. Have you tried putting the file through a validator to see if it's error-free, or tested in the G WMT area?
I would love to exclude the cart URL using the robots file and that relates to all the links to the cart in my site which are always in a form -
<form method="POST" action="http://ww3.aitsafe.com/cf/add.cfm" onSubmit="return mmCheckIfSelected(this)">
So it would be the URL ww3.aitsafe.com and any variants after that that I would want to exclude.
And you can't do that, because.....?
I tried changing the form POST to include a nofollow, but that didn't give valid code, such as -
<form method="POST" action="http://ww3.aitsafe.com/cf/add.cfm" onSubmit="return mmCheckIfSelected(this)" rel="nofollow">
Did you try it like this:
<form method="POST" action="http://ww3.aitsafe.com/cf/add.cfm" rel="nofollow" onSubmit="return mmCheckIfSelected(this)">
So what would I need to add as a line in my robots.txt file to basically say: ignore and don't follow ANY links which include ww3.aitsafe.com?
Now that I think about it, I don't know if that would even work because it's an outside domain. I don't know. You can of course put your own URL's in a robots.txt file, but I'm not aware how that could be done, or even if it could be done for someone else's URL.
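For what it's worth, robots.txt rules are matched against the path on the host that serves the file, so a rule in your own file can only cover your own URLs; rules for ww3.aitsafe.com would have to live in a robots.txt on that host. A small sketch with Python's standard urllib.robotparser shows how a well-behaved crawler would read a Disallow rule. The /marketing/ directory and the flyer.html file are assumed names for illustration only, since the real paths are not given in the thread:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for the shop's own host; '/marketing/' is an
# assumed directory name, not one taken from the real site.
rules = """\
User-agent: *
Disallow: /marketing/
""".splitlines()

parser = RobotFileParser()
parser.parse(rules)

# Only the path portion of the URL is compared against the rules.
blocked = parser.can_fetch(
    "Googlebot", "http://www.dvd-and-media.com/marketing/flyer.html")
allowed = parser.can_fetch(
    "Googlebot", "http://www.dvd-and-media.com/cartridges_index.php")

print(blocked)  # False: the path falls under the Disallow rule
print(allowed)  # True: no rule matches this path
```

Running a real robots.txt through a check like this is a quick way to spot a rule that does not match the paths it was meant to cover.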
Clint1
09-10-2009, 01:57 AM
Have you found any results in the SERP that point to your printer friendly pages? If not, it might be safe to say they're not being indexed and you're off the hook already.
Yeah that may solve everything. As long as it always stayed that way. You'd probably need to periodically check to see if that would remain the case.
johnWorks
09-10-2009, 01:32 PM
mryang,
Just took a look at your website and I'm starting to think this whole conversation is moot. But first...
RE: rel="nofollow" -- this is a link attribute and only works on anchor tags (<a href="" rel="nofollow">), so won't do anything for your <form> tag.
RE: robots.txt -- unless things have changed, robots.txt won't prevent pages from showing up in the index. The content doesn't get indexed, but the URI itself does and can show up as a URI only listing (no snippet). Best way to keep a page completely out of the index is to use meta robots tag:
<meta name="robots" content="none" />
That's not practical for your pages though. And I don't think it's an issue anyway because...
The "link" from a product page to add a product to the cart is not a link at all. It's a form submit button. AFAIK, SE spiders don't submit forms. If they did, I'd see my various "thank you" pages in the index.
Also AFAIK, the spiders don't crawl URIs specified in your form action. Spiders crawl links -- <a href="">.
(I know that spambots can submit forms, but I'm talking about SE spiders here. Why would they bother utilizing additional resources to do this when there are so many real links out there to crawl? ;))
Considering you have to submit 2 forms to get to the "duplicate" .php page, I don't think you've got anything to worry about here... unless you're linking to the duplicate page from somewhere else.
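The behaviour described above can be sketched with a toy link collector: it queues only anchor hrefs, sets aside anchors marked rel="nofollow", and never looks at a form's action. This is a simplified model of the poster's description, not how any real search engine spider is implemented, and the /cart.php link is a made-up example:

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Toy crawler front end: collects <a href> links, ignores <form action>."""

    def __init__(self):
        super().__init__()
        self.followable = []  # links a spider would queue for crawling
        self.skipped = []     # links marked rel="nofollow"

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag != "a" or not attrs.get("href"):
            return  # forms, inputs and everything else are not links
        rel_tokens = (attrs.get("rel") or "").lower().split()
        if "nofollow" in rel_tokens:
            self.skipped.append(attrs["href"])
        else:
            self.followable.append(attrs["href"])


# '/cart.php' is a hypothetical link used only for this illustration.
page = """
<a href="/Inkjet-Cartridges-Epson-Stylus-Photo-R320.html">Epson R320</a>
<a href="/cart.php" rel="nofollow">View cart</a>
<form method="POST" action="http://ww3.aitsafe.com/cf/add.cfm">
  <input type="submit" value="Add to cart">
</form>
"""

collector = LinkCollector()
collector.feed(page)
print(collector.followable)
print(collector.skipped)
```

The form's action URL never shows up in either list, which is the point being made about the add-to-cart button.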
nowreturn
09-11-2009, 01:37 AM
copyscape com
check
Clint1
09-11-2009, 02:32 AM
copyscape com
check
While that place is good to see if others have copied your webpages, it doesn't come into play here when one is asking about what is considered dupe content in the eyes of the SE bots, and whether/how to block it. ;)