Google Discussion Forum Google Discussion forum is for topics specifically related to Google. There is a subforum dedicated to AdSense/AdWords subjects.

  #1 (permalink)  
Old 02-26-2007, 05:56 PM
incrediblehelp's Avatar
WebProWorld 1,000+ Club
WebProWorld MVP
 
Join Date: Jan 2004
Location: Live in Cincy Now
Posts: 7,573
incrediblehelp RepRank 4incrediblehelp RepRank 4incrediblehelp RepRank 4incrediblehelp RepRank 4incrediblehelp RepRank 4
Default SI Duplicate Content Issues, Any Help?

I thought I would ask here before digging in this further. Obviously I have my own blog:

http://www.jaankanellis.com/

Take a look at this site: operator query:

http://www.google.com/search?q=site:...&start=10&sa=N

I do have plenty of pages indexed as supplemental results in Google right now, some for good reason. I pretty much understand why, and that is not really the concern here. The real concern is Google indexing duplicate pages generated by the poll plug-in I have installed on the blog. The site: operator results show plenty of them indexed:

http://www.jaankanellis.com/page/19/...rue&poll_id=5/

Normally I would not worry, but it seems to be affecting my traffic and indexing in Google. Google has indexed the crappy poll URL:

http://www.jaankanellis.com/page/19/...rue&poll_id=5/

as the main way to crawl my web content, when in actuality it needs to be crawling the URL below, which has been placed in the Supplemental Index:

http://www.jaankanellis.com/jagger3-here-it-comes/

As I usually preach, and as most here understand, Google normally doesn't do this. Usually they are smart enough to pick the right URL. In this case they have not. So my questions are:

1. How do I block the robots from accessing this "crap" URL?

2. What is the easiest way to fix this other than just blocking the bots from those URLs?
Reply With Quote
  #2 (permalink)  
Old 02-26-2007, 06:54 PM
kgun's Avatar
WebProWorld 1,000+ Club
WebProWorld MVP
 
Join Date: May 2005
Location: Norway
Posts: 5,723
kgun RepRank 10kgun RepRank 10kgun RepRank 10kgun RepRank 10kgun RepRank 10kgun RepRank 10kgun RepRank 10kgun RepRank 10kgun RepRank 10kgun RepRank 10kgun RepRank 10
Default

  • Content is content.
  • Style is style. Hint: Sitewide stylesheets.
  • Code is code. Hint: Include files.

Make it simple, as simple as possible, but no simpler.

Think of what you can do better when you go to bed. Think so hard that you dream of it.

Last but not least: make a better heading than the one for this post. You lost one customer because of unfocused KWs in this post :-)
Reply With Quote
  #3 (permalink)  
Old 02-26-2007, 10:32 PM
incrediblehelp's Avatar
WebProWorld 1,000+ Club
WebProWorld MVP
 
Join Date: Jan 2004
Location: Live in Cincy Now
Posts: 7,573
incrediblehelp RepRank 4incrediblehelp RepRank 4incrediblehelp RepRank 4incrediblehelp RepRank 4incrediblehelp RepRank 4
Default

Any more ideas folks?
Reply With Quote
  #4 (permalink)  
Old 02-27-2007, 08:50 AM
crankydave's Avatar
Moderator
WebProWorld Moderator
 
Join Date: Aug 2004
Location: Playing with fire!
Posts: 4,254
crankydave RepRank 9crankydave RepRank 9crankydave RepRank 9crankydave RepRank 9crankydave RepRank 9crankydave RepRank 9crankydave RepRank 9crankydave RepRank 9crankydave RepRank 9crankydave RepRank 9
Default

Have you considered "no follow" "no index" on your poll plugin/URL's?

Dave
Reply With Quote
  #5 (permalink)  
Old 02-27-2007, 12:10 PM
incrediblehelp's Avatar
WebProWorld 1,000+ Club
WebProWorld MVP
 
Join Date: Jan 2004
Location: Live in Cincy Now
Posts: 7,573
incrediblehelp RepRank 4incrediblehelp RepRank 4incrediblehelp RepRank 4incrediblehelp RepRank 4incrediblehelp RepRank 4
Default

Quote:
Originally Posted by crankydave
Have you considered "no follow" "no index" on your poll plugin/URL's?

Dave
Yes, but where is Google finding these URLs, and why would they consider them the primaries? It doesn't make much sense to me and I would like to figure it out.
Reply With Quote
  #6 (permalink)  
Old 02-27-2007, 12:28 PM
kgun's Avatar
WebProWorld 1,000+ Club
WebProWorld MVP
 
Join Date: May 2005
Location: Norway
Posts: 5,723
kgun RepRank 10kgun RepRank 10kgun RepRank 10kgun RepRank 10kgun RepRank 10kgun RepRank 10kgun RepRank 10kgun RepRank 10kgun RepRank 10kgun RepRank 10kgun RepRank 10
Default

So you changed the post title. IMO it is important to spend a few seconds thinking of a good post title with relevant KWs, for SE bot indexing, for your own ad, and for finding the post at WPW.
Reply With Quote
  #7 (permalink)  
Old 02-27-2007, 01:14 PM
crankydave's Avatar
Moderator
WebProWorld Moderator
 
Join Date: Aug 2004
Location: Playing with fire!
Posts: 4,254
crankydave RepRank 9crankydave RepRank 9crankydave RepRank 9crankydave RepRank 9crankydave RepRank 9crankydave RepRank 9crankydave RepRank 9crankydave RepRank 9crankydave RepRank 9crankydave RepRank 9
Default

Quote:
Originally Posted by incrediblehelp
Quote:
Originally Posted by crankydave
Have you considered "no follow" "no index" on your poll plugin/URL's?

Dave
Yes but where is Google finding these URLs and why would they consider them the primaries. Doesnt make much sense to me and I would like to figure it out.
Good point.

Wonder if they're following the JavaScript? Or the href for the view results link?

Dave
Reply With Quote
  #8 (permalink)  
Old 02-27-2007, 07:01 PM
craigmn3's Avatar
WebProWorld Veteran
 
Join Date: Jan 2004
Location: California
Posts: 335
craigmn3 RepRank 1
Default Switch

Switch the actual content of each page or blog entry, (while keeping a back up) then the information and navigation you wanted to be your entrance page will be there, put the poll page in the entrance page place.

I place no credence in the ability of any bot to determine what is a head and what is a tail. It's all math. Figure out the math and you're good. Art doesn't enter into it.
Reply With Quote
  #9 (permalink)  
Old 02-27-2007, 08:53 PM
WebProWorld Member
 
Join Date: Aug 2006
Posts: 72
EditFast RepRank 0
Default

Am I thinking too simply, or am I missing something?
1) Use robots.txt to bar Googlebot from that URL.
2) Use the nofollow meta tag.
3) Use .htaccess to bar Googlebot from that URL.
__________________
EditFast
Any Document --> Any Time!
Web Site Copy Editing & Proofreading
Reply With Quote
  #10 (permalink)  
Old 02-27-2007, 09:26 PM
WebProWorld Member
 
Join Date: Apr 2004
Location: Chicago, IL
Posts: 48
tacimala RepRank 0
Default

A lot of blogs right now are getting hit with a lot of supplemental results because there are too many ways to reach the same content, such as by date, by category, or by the RSS/feed URLs. From a search engine standpoint the only one that should matter is the original blog post URL. Block the other ways of accessing that same content in robots.txt and your supplemental index ratio will go way down.

Obviously you don't want to block the feed readers from these URL's, but just Google/Slurp/MSNbot.
Reply With Quote
  #11 (permalink)  
Old 02-27-2007, 10:51 PM
incrediblehelp's Avatar
WebProWorld 1,000+ Club
WebProWorld MVP
 
Join Date: Jan 2004
Location: Live in Cincy Now
Posts: 7,573
incrediblehelp RepRank 4incrediblehelp RepRank 4incrediblehelp RepRank 4incrediblehelp RepRank 4incrediblehelp RepRank 4
Default

Quote:
Originally Posted by EditFast
Am I thinking too simple or am I missing soimething?
1) Use the robots.txt to bar the googlebot from that URL.
2) Use the meta tag nofollow
3) use .htaccess to bar the google bot from that URL.
EditFast, take a look at the site: operator query above. It is not just one URL that I am talking about here. It's hundreds. I just posted one as an example.

If it was one URL, I wouldn't have to worry too much. In fact it is probably impossible for me to know all the URLs Google has indexed this way...blah.
Reply With Quote
  #12 (permalink)  
Old 02-28-2007, 12:20 AM
WebProWorld Member
 
Join Date: Apr 2004
Location: Chicago, IL
Posts: 48
tacimala RepRank 0
Default

Going a little further on what I wrote before: your poll is on every page, so it could be that Google sees the URLs generated by the poll as stronger than the URL you are hoping for, because they are used more often on the site. My guess is that if you follow my advice above, disallow Google from the directories you do not want indexed, and submit a sitemap to Webmaster Central, your problem will resolve itself within a reasonable amount of time. Take it a step further and do a 301 redirect, if you can, for the pages that are most affected by this. Redirect the bad URL that is currently indexed to the one you want.
Reply With Quote
  #13 (permalink)  
Old 02-28-2007, 01:21 AM
WebProWorld Member
 
Join Date: Jan 2007
Location: India
Posts: 38
ddwebguru RepRank 0
Default

Hi incrediblehelp, it's really a problem because there are so many pages. Are these pages orphans, or are they linked from your main site? (You can find this out for these pages in your Google webmaster account.) If they are not linked, you have to add a noindex, nofollow tag to each page.
Reply With Quote
  #14 (permalink)  
Old 02-28-2007, 04:50 AM
Webnauts's Avatar
WebProWorld 1,000+ Club
WebProWorld MVP
 
Join Date: Aug 2003
Location: Worldwide
Posts: 8,170
Webnauts RepRank 9Webnauts RepRank 9Webnauts RepRank 9Webnauts RepRank 9Webnauts RepRank 9Webnauts RepRank 9Webnauts RepRank 9Webnauts RepRank 9Webnauts RepRank 9Webnauts RepRank 9
Default

Jaan I am not sure if I really understood the problem.
If it is about URLs like this http://www.jaankanellis.com/page/19/...rue&poll_id=5/ then you can add in your robots.txt the rule:

Disallow: *?
__________________
"Being an expert isn't telling other people what you know. It's understanding what questions to ask, and flexibly applying your knowledge to the specific situation at hand. Being an expert means providing sensible, highly contextual direction." Jeff Atwood
SEO Workers - Search Engine Optimization Consulting Company | SEO Analysis Tool | Webnauts Net SEO
Reply With Quote
  #15 (permalink)  
Old 02-28-2007, 02:04 PM
incrediblehelp's Avatar
WebProWorld 1,000+ Club
WebProWorld MVP
 
Join Date: Jan 2004
Location: Live in Cincy Now
Posts: 7,573
incrediblehelp RepRank 4incrediblehelp RepRank 4incrediblehelp RepRank 4incrediblehelp RepRank 4incrediblehelp RepRank 4
Default

Quote:
Originally Posted by Webnauts
Jaan I am not sure if I really understood the problem.
If it is about URLs like this http://www.jaankanellis.com/page/19/...rue&poll_id=5/ then you can add in your robots.txt the rule:

Disallow: *?
Won't that disallow all URLs with ? parameters? If so, I don't like that option, as I will have other URLs using query parameters.
Reply With Quote
  #16 (permalink)  
Old 02-28-2007, 02:42 PM
crankydave's Avatar
Moderator
WebProWorld Moderator
 
Join Date: Aug 2004
Location: Playing with fire!
Posts: 4,254
crankydave RepRank 9crankydave RepRank 9crankydave RepRank 9crankydave RepRank 9crankydave RepRank 9crankydave RepRank 9crankydave RepRank 9crankydave RepRank 9crankydave RepRank 9crankydave RepRank 9
Default

Jaan...

I did a search for the following...

jal_no_js=true&poll_id=1/

Here's what I got...

http://www.google.com/search?hl=en&r...poll_id%3D1%2F

Here's a link to the first result I see...

http://forum.semiologic.com/discussi...seo-nightmare/

Perhaps this helps?

Dave
Reply With Quote
  #17 (permalink)  
Old 02-28-2007, 02:53 PM
incrediblehelp's Avatar
WebProWorld 1,000+ Club
WebProWorld MVP
 
Join Date: Jan 2004
Location: Live in Cincy Now
Posts: 7,573
incrediblehelp RepRank 4incrediblehelp RepRank 4incrediblehelp RepRank 4incrediblehelp RepRank 4incrediblehelp RepRank 4
Default

Quote:
Originally Posted by crankydave
Perhaps this helps?
From the article:

"The fix is to use plugins that enforce permalinks. You then get a 301 redirect to the proper uri when this occurs. End of story."

I am already doing this. Obviously the permalink plugin/code I am using is not picking up these as dups and initiating the 301 redirect.

Arrggggghhhh
Reply With Quote
  #18 (permalink)  
Old 02-28-2007, 03:09 PM
incrediblehelp's Avatar
WebProWorld 1,000+ Club
WebProWorld MVP
 
Join Date: Jan 2004
Location: Live in Cincy Now
Posts: 7,573
incrediblehelp RepRank 4incrediblehelp RepRank 4incrediblehelp RepRank 4incrediblehelp RepRank 4incrediblehelp RepRank 4
Default

Maybe use

Code:
Disallow: /*poll_id*
In the robots.txt?

I would definitely rather use a 301 redirect for those wrong URLs to the right ones. Not sure why the plugin is not working.
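
To see the difference between the two rules discussed so far (`*?` versus `/*poll_id*`), here is a minimal Python sketch of wildcard matching in the style Google documents for robots.txt, where `*` matches any run of characters and a trailing `$` anchors the end of the URL. The helper names and the sample paths are hypothetical, made up for illustration; this is not how Googlebot is actually implemented.

```python
import re


def rule_to_regex(pattern: str) -> re.Pattern:
    """Translate a robots.txt path pattern with '*' and trailing '$' into a regex."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape each literal piece and let '*' stand for "any run of characters".
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile("^" + body + ("$" if anchored else ""))


def is_blocked(path: str, disallow_patterns: list[str]) -> bool:
    """Return True if any Disallow pattern matches the start of the path."""
    return any(rule_to_regex(p).match(path) for p in disallow_patterns)


# Hypothetical paths, for illustration only.
poll_path = "/example-page/?jal_no_js=true&poll_id=1/"
search_path = "/?s=google"
post_path = "/jagger3-here-it-comes/"

for rule in ("/*?", "/*poll_id*"):
    for path in (poll_path, search_path, post_path):
        print(rule, path, is_blocked(path, [rule]))
```

Under these assumptions, `/*?` blocks every URL that carries a query string, while `/*poll_id*` only blocks URLs that contain `poll_id`, which is the narrower behavior asked for here.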
Reply With Quote
  #19 (permalink)  
Old 02-28-2007, 03:35 PM
crankydave's Avatar
Moderator
WebProWorld Moderator
 
Join Date: Aug 2004
Location: Playing with fire!
Posts: 4,254
crankydave RepRank 9crankydave RepRank 9crankydave RepRank 9crankydave RepRank 9crankydave RepRank 9crankydave RepRank 9crankydave RepRank 9crankydave RepRank 9crankydave RepRank 9crankydave RepRank 9
Default

Quote:
Originally Posted by incrediblehelp
Maybe use

Code:
Disallow: /*poll_id*
In the robots.txt?

I would definitely rather use a 301 redirect for those wrong URLs to the right ones. Not sure why the plugin is not working.
I just found the same thing you did. :)

Quote:
Firstly I believe you can remove these duplicate from google by editing your robots.txt file. Add the following

Code:
User-agent: *
Disallow: /*poll_id*
I'd prefer the 301 too but at least it *should* be a fix.

Then perhaps use the console to delete the URL's.

Dave
Reply With Quote
  #20 (permalink)  
Old 02-28-2007, 03:41 PM
incrediblehelp's Avatar
WebProWorld 1,000+ Club
WebProWorld MVP
 
Join Date: Jan 2004
Location: Live in Cincy Now
Posts: 7,573
incrediblehelp RepRank 4incrediblehelp RepRank 4incrediblehelp RepRank 4incrediblehelp RepRank 4incrediblehelp RepRank 4
Default

I posted at the following locations looking for help:

Forum: Democracy is not an SEO nightmare

plugin not 301 redirecting poll URLs by incrediblehelp
Reply With Quote
  #21 (permalink)  
Old 02-28-2007, 08:15 PM
WebProWorld New Member
 
Join Date: Feb 2007
Location: Miami
Posts: 1
googlecert RepRank 0
Default Google Index

Go here and sign up for google sitemaps:

https://www.google.com/accounts/Serv...Fhl%3Den&hl=en

Then go here and download the sitemap xml generator:

http://sourceforge.net/project/showf...kage_id=153422

Run the program on your site to crawl it and create a sitemap. Once you have a sitemap, you can assign each page an importance weight as a number, so your most important page would be weighted 1. You can also exclude pages from the generated sitemap so they are not listed for Googlebot.

Next, upload the generated sitemap files into the root directory of your website on your server, making note of the file names.

Now log into your Google Sitemaps account and, where it says to submit a sitemap, submit the XML files one at a time, for example sitemap.xml, then sitemap1.xml.

Once submitted, it takes Google about 20 minutes for the bot to come to your site to verify the sitemaps are correct. Then they will start to read the sitemap and the bot will act accordingly.
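
For reference, a sitemap file is plain XML, so it can also be produced without a separate generator. The sketch below is a minimal example in Python, assuming the sitemaps.org 0.9 format, where `priority` runs from 0.0 to 1.0 and is only a hint about a page's importance relative to other pages on the same site. The `build_sitemap` helper is a made-up name, and the priorities given to the two URLs are arbitrary.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(entries):
    """Build sitemap XML from (url, priority) pairs; priority is 0.0 to 1.0."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for loc, priority in entries:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = loc
        ET.SubElement(url, "priority").text = f"{priority:.1f}"
    return ET.tostring(urlset, encoding="unicode")


# URLs to exclude (such as the poll URLs) are simply left out of the list.
sitemap_xml = build_sitemap([
    ("http://www.jaankanellis.com/", 1.0),
    ("http://www.jaankanellis.com/jagger3-here-it-comes/", 0.8),
])
print(sitemap_xml)
```

Leaving a URL out of the sitemap does not by itself stop it from being crawled or indexed; that still needs robots.txt, a noindex tag, or a redirect.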

You will also receive the following data from google in the sitemap admin panel:

Crawl errors
Web crawl
Mobile Web
robots.txt analysis
Crawl rate
Preferred domain
Enhanced Image Search

This is from Google's help files on sitemaps:

Index stats use our advanced operators to provide you with sample results about how your site is indexed. We've used these advanced operators to return information about your home page. Click on the link to view a list of results. Stats that may be available are:

Indexed pages in your site - uses the site: operator to return a sample list of your indexed pages.
Pages that refer to your site's URL - uses the allinurl: operator to return a sample list of pages that mention your site's URL.
Pages that link to your site - uses the link: operator to return a sample list of pages that link to your site.
The current cache of your site - uses the cache: operator to return the current cache of your home page. If you don't want Google to cache your site, you can specify this in the <head> section of your pages.
Information we have about your site - uses the info: operator to return the description we have of your site.
Pages that are similar to your site - uses the related: operator to return a sample list of pages that we consider similar to your site.

Page analysis stats provide information about how the Googlebot views the crawled pages of your site. Stats that may be available are:

Type - the content type of your crawled pages. We use content type for the File Format search option in our Advanced Search.
Encodings - the encoding used by your crawled pages.
Common words - words in your site's content, and in external anchors to your site.

Query stats provide information about search queries that have returned pages from your site. If your site is listed in Google Mobile Web Search results, these queries are listed as well. Average top position is the highest position any page from your site ranked for that query, averaged over the last three weeks. Since our index is dynamic, this may not be the same as the current position of your site for this query.

For detailed information about our search query syntax, see the Google Web API reference. You can click on any listed search query to view the results of that query.

Stats that may be available are:

Top search queries - list the top queries that return results from your site. Note that this list is unrelated to where your site is listed in the search results.
Top search query clicks - the top search queries that directed traffic to your site. These are the top searches that caused users to click on a link to your site.

Hope this helps you

Freddie Molto
CMO
http://www.sdcmedia.com
Reply With Quote
  #22 (permalink)  
Old 03-01-2007, 10:59 AM
kgun's Avatar
WebProWorld 1,000+ Club
WebProWorld MVP
 
Join Date: May 2005
Location: Norway
Posts: 5,723
kgun RepRank 10kgun RepRank 10kgun RepRank 10kgun RepRank 10kgun RepRank 10kgun RepRank 10kgun RepRank 10kgun RepRank 10kgun RepRank 10kgun RepRank 10kgun RepRank 10
Default

Quote:
Originally Posted by incrediblehelp
As I usually preach and most understand here Google normally doest do this. "Usually they are smart enough to pick the right URL. In this case they have not. So my question is

1. How do I block the robots from accessing this "crap" URL?

2. What is the easiest way to fix this other than just blocking the bots from those URL?
Your post title and my first answer were of similar quality.

Quote:
Originally Posted by incrediblehelp
Quote:
Originally Posted by crankydave
Perhaps this helps?
From the article:

"The fix is to use plugins that enforce permalinks. You then get a 301 redirect to the proper uri when this occurs. End of story."

I am already doing this. Obviously the permalink plugin/code I am using is not picking up these as dups and initiating the 301 redirect.

Arrggggghhhh
Then you should modify the code.

Quote:
Originally Posted by incrediblehelp
Quote:
Originally Posted by EditFast
Am I thinking too simple or am I missing soimething?
1) Use the robots.txt to bar the googlebot from that URL.
2) Use the meta tag nofollow
3) use .htaccess to bar the google bot from that URL.
Editfast take a look at the site: operator above. It is not just one URL that I am talking about here. Its hundreds....I just posted one as a example.
If it was one URL, I wouldn't have to worry to much. In fact it is probably impossible for me to know all the URLs Google has indexed this way...blah.
My bolding.

My advice in order of priority:
  1. Modify / change incorrect code if possible.
  2. Block the relevant directories using .htaccess or robots.txt
  3. Then use permalinks / redirects for any remaining links in directories that can not be blocked.
  4. Look up the actual links and send a request to Google to delete them from the S(I).
Reply With Quote
  #23 (permalink)  
Old 03-01-2007, 03:45 PM
incrediblehelp's Avatar
WebProWorld 1,000+ Club
WebProWorld MVP
 
Join Date: Jan 2004
Location: Live in Cincy Now
Posts: 7,573
incrediblehelp RepRank 4incrediblehelp RepRank 4incrediblehelp RepRank 4incrediblehelp RepRank 4incrediblehelp RepRank 4
Default

Just to clarify:

1. I could fix this through robots.txt, and I will as a last resort. Here is the code I could use:

User-agent: Googlebot
Disallow: /*jal_no_js

and

User-agent: Slurp
Disallow: /*jal_no_js

2. I would like to 301 redirect the crap URLs to the correct ones, and I am working on that now.
3. It is not about blocking the URLs. First, Google shouldn't be giving these any weight to begin with. They should be using the correct ones, but we can't cry over spilled milk.
4. Building a sitemap or using the Webmaster console has nothing really to do with this. In fact, I am a big fan of NOT using a sitemap if your website/blog is getting adequate spidering without it, and mine is.
5. There is no "incorrect" code here. Google is just picking the wrong URLs.
6. I am using permalinks, as you can see from the URLs being rewritten on the fly with the post title as the subdirectory.
7. It is counterproductive to look up 100s or 1000s of URLs and manually delete them.
8. I really appreciate all comments thus far.
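
The 301 approach in point 2 could be sketched roughly as follows in Python. This is only an illustration under stated assumptions: `canonical_redirect` is a hypothetical helper, the example.com URLs are placeholders, and the sketch does nothing more than strip the `jal_no_js` and `poll_id` parameters. It does not know how to map a paginated poll URL to a specific post permalink, which would need knowledge of the blog's own URL scheme.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Query parameters added by the poll plug-in, as seen in this thread.
POLL_PARAMS = {"jal_no_js", "poll_id"}


def canonical_redirect(url):
    """Return (301, clean_url) if the URL carries poll parameters, else None."""
    parts = urlsplit(url)
    pairs = parse_qsl(parts.query, keep_blank_values=True)
    kept = [(key, value) for key, value in pairs if key not in POLL_PARAMS]
    if len(kept) == len(pairs):
        return None  # nothing to strip, serve the page as usual
    clean = urlunsplit(
        (parts.scheme, parts.netloc, parts.path, urlencode(kept), parts.fragment)
    )
    return 301, clean


# Placeholder URLs, for illustration only.
print(canonical_redirect("http://example.com/some-post/?jal_no_js=true&poll_id=1/"))
print(canonical_redirect("http://example.com/some-post/"))
```

In a real blog the same check would sit in the request handling (or in a rewrite rule) before the page is rendered, so that crawlers following the poll links receive a permanent redirect to the parameter-free URL.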
Reply With Quote
  #24 (permalink)  
Old 03-01-2007, 07:03 PM
kgun's Avatar
WebProWorld 1,000+ Club
WebProWorld MVP
 
Join Date: May 2005
Location: Norway
Posts: 5,723
kgun RepRank 10kgun RepRank 10kgun RepRank 10kgun RepRank 10kgun RepRank 10kgun RepRank 10kgun RepRank 10kgun RepRank 10kgun RepRank 10kgun RepRank 10kgun RepRank 10
Default

Quote:
Originally Posted by incrediblehelp
7. It is counter productive to look up 100s or 1000s of URLs and manual delete them.
Jaan, what about outsourcing? I remember a young boss I had, one of the better ones:

He had a lot of slogans, among others this one:

"I can not answer every question or solve every problem, but I can hit this button, and ask one of my employees."

But perhaps, that is not an option when you have done a redirect.

Myself, I love permalinks.
Reply With Quote
  #25 (permalink)  
Old 03-02-2007, 05:17 PM
WebProWorld Member
 
Join Date: Apr 2004
Location: Chicago, IL
Posts: 48
tacimala RepRank 0
Default

Quote:
Originally Posted by incrediblehelp
Just to clarify:

1. I could fix this through robots.txt and I will as a last resort. Here is the code I could use:

User-agent: Google
Disallow: *jal_no_js*

and

User-agent: Slurp
Disallow: *jal_no_js*

2. I would like to 301 redirect the crap URLs to the correct ones and I working on that now.
3. It is not about blocking the URLs. First Google shouldn't be giving these any weight to begin with. They should be using the correct ones, but we cant cry over spilled milk.
4. Building a sitemap or using Webmaster console has nothing really to do with this. In fact I am big fan of NOT using a sitemap if your website/blog is getting adequate spidering without it and I am.
5. There is no "incorrect" code here. Google is just picking the wrong URLs.
6. I am using permalink as you can see with the URLs being rewritten on the fly with the post title as the subdirectory.
7. It is counter productive to look up 100s or 1000s of URLs and manual delete them.
8. I really appreciate all comments thus far.
It is tedious to do it like that now, but my guess is that if you implement everything in your robots.txt, not only will it fix things for your future blog posts, but Google will also resolve the problem itself once your pages get cached again. 301s will just speed it up.
Reply With Quote
  #26 (permalink)  
Old 03-02-2007, 05:26 PM
incrediblehelp's Avatar
WebProWorld 1,000+ Club
WebProWorld MVP
 
Join Date: Jan 2004
Location: Live in Cincy Now
Posts: 7,573
incrediblehelp RepRank 4incrediblehelp RepRank 4incrediblehelp RepRank 4incrediblehelp RepRank 4incrediblehelp RepRank 4
Default

Yeah, I just added the robots.txt rule for Google today, even though I wanted a 301 redirect solution.
Reply With Quote