WebProWorld
Google Discussion Forum Google Discussion forum is for topics specifically related to Google. There is a subforum dedicated to AdSense/AdWords subjects.
  #1 (permalink)  
Old 07-11-2007, 02:15 PM
WebProWorld New Member
 
Join Date: Oct 2005
Posts: 4
briscoe98 RepRank 0
Default Old Pages Being Indexed By Third Party

Within the past six weeks we have had many page errors for old products that we no longer carry; in some cases we haven't carried them for a long time. I am assuming the requests come from some sort of robot or script that crawls the entire site but errors out when it hits the old product pages. We get several page errors (about 50-75 at a time) within a few minutes from the same IP address, and then it is fine for a couple of days.

When it happens, the IP addresses almost always come from international locations. I don't know whether it is one person masking their IP address or whether the requests really come from different places. The locations include Denmark, Norway, Hong Kong, Canada, and every once in a while the United States. The IP address never belongs to a common robot such as Google, MSN, AOL, Lycos, or Yahoo.
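One way to confirm the pattern described above is to group the error log by client IP and look for bursts. The sketch below assumes the errors have already been parsed into (timestamp, ip, path) tuples; that record layout, the name `find_bursts`, and the thresholds are illustrative assumptions, not part of any particular server's log format.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def find_bursts(errors, min_hits=50, max_gap=timedelta(minutes=2)):
    """Return {ip: [(first_seen, last_seen, count), ...]} for error bursts.

    `errors` is an iterable of (timestamp, ip, path) tuples (assumed layout).
    A burst is a run of errors from one IP in which consecutive hits are at
    most `max_gap` apart and the run contains at least `min_hits` errors.
    """
    by_ip = defaultdict(list)
    for ts, ip, _path in errors:
        by_ip[ip].append(ts)

    bursts = defaultdict(list)
    for ip, stamps in by_ip.items():
        stamps.sort()
        run = [stamps[0]]
        for ts in stamps[1:]:
            if ts - run[-1] <= max_gap:
                run.append(ts)  # still inside the same burst
            else:
                if len(run) >= min_hits:
                    bursts[ip].append((run[0], run[-1], len(run)))
                run = [ts]  # gap too long: start a new run
        if len(run) >= min_hits:
            bursts[ip].append((run[0], run[-1], len(run)))
    return dict(bursts)


if __name__ == "__main__":
    # Made-up data: 60 errors, two seconds apart, from a documentation IP.
    start = datetime(2007, 7, 11, 14, 0, 0)
    sample = [(start + timedelta(seconds=2 * i), "203.0.113.7", "/old/%d" % i)
              for i in range(60)]
    for ip, runs in find_bursts(sample).items():
        print(ip, runs)
```

Listing the bursts per IP makes it easier to see whether the same addresses come back every few days or whether each burst uses a new one.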

Another thing: the URL in these cases is always http://sitename.com and not http://www.sitename.com. The common robots always use www in our URL. Going back four years, I had never seen consecutive errors coming from a non-www URL until about six weeks ago.

I am worried that, since this is not coming from common robots, it might be something malicious, especially as it comes from various international IP addresses. Does anybody have suggestions on what this might be and what can be done to prevent it? I know I can use ISAPI Rewrite to fix the non-www issue, but I am more concerned about why old, nonexistent pages keep getting hit by something out there.
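Whatever rewrite tool is used, the non-www fix comes down to answering bare-host requests with a permanent redirect to the www host. Below is a minimal sketch of that decision in Python, with `sitename.com` standing in for the real domain and `canonical_redirect` a hypothetical helper name; it does not show actual ISAPI Rewrite rule syntax.

```python
from urllib.parse import urlsplit, urlunsplit

CANONICAL_HOST = "www.sitename.com"  # placeholder domain from the post


def canonical_redirect(url):
    """Return (301, new_url) if `url` uses the bare host, otherwise None."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    if host == CANONICAL_HOST:
        return None  # already canonical, serve normally
    if "www." + host != CANONICAL_HOST:
        return None  # some other host entirely; leave it alone
    new_url = urlunsplit(
        (parts.scheme, CANONICAL_HOST, parts.path, parts.query, parts.fragment)
    )
    return 301, new_url
```

For example, `canonical_redirect("http://sitename.com/products/old.asp?id=3")` yields a 301 to the same path and query on the www host. This sketch drops any explicit port, which is an acceptable simplification only for default-port sites.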

Last edited by briscoe98; 07-11-2007 at 02:28 PM.
  #2 (permalink)  
Old 07-11-2007, 06:16 PM
wige's Avatar
Moderator
WebProWorld Moderator
 
Join Date: Jun 2006
Location: United States
Posts: 2,657
wige RepRank 9
Default Re: Old Pages Being Indexed By Third Party

First I would try to find out where the traffic is coming from, to determine whether it might be a spider or other legitimate but erroneous traffic. Try doing a reverse lookup on the IP address: from your command prompt, enter ping -a -n 1 0.0.0.0, where 0.0.0.0 is the IP address. Check the site at the hostname it returns to see if it is a directory, a search engine, or something else.
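The same reverse lookup can be scripted when there are many IPs to check. This is a minimal sketch using Python's standard `socket.gethostbyaddr`; the wrapper name `reverse_lookup` and its `resolver` parameter are assumptions added here so the function can be exercised without network access.

```python
import socket


def reverse_lookup(ip, resolver=socket.gethostbyaddr):
    """Return the hostname registered for `ip`, or None if none is found."""
    try:
        # gethostbyaddr returns (hostname, alias_list, ip_address_list).
        hostname, _aliases, _addresses = resolver(ip)
    except (socket.herror, socket.gaierror):
        return None  # no reverse (PTR) record, or the lookup failed
    return hostname


if __name__ == "__main__":
    # 192.0.2.1 is a documentation address; substitute IPs from the logs.
    print(reverse_lookup("192.0.2.1"))
```

A reverse lookup only queries DNS, so it can return a hostname even when the machine at that address does not answer pings.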

Also, if a search engine or other spider is picking up old links, you may be able to locate the source of the old links by entering "yoursite.com/path/to/the.page" in Google, in quotes, exactly as it appears in your error logs. This should show you any page that lists that URL. You can also use the link: search to find true links to the pages in question. It is possible that there are links to the old pages, possibly even on a foreign-language web site, that some other SE indexes every few months, causing this periodic spike in traffic.

When I redesigned my primary site, after setting up all the redirects I set up a logging system to track access attempts to the old addresses, so I could track down old links that had not been updated. I found that the traffic to the old links would sometimes spike as search engines checked some pages, found the redirects, then ramped up their crawling pretty quickly. I noticed this mostly with Yahoo, which also sent randomized URLs to force the server to give error messages after the redesign. It seemed from the logs that once the spider found a few redirects, traffic from that spider increased as it explored the new structure.
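A logging system of that kind can be quite small. The sketch below is not the setup described above, only an illustration of the idea: it counts hits on retired URLs by (path, referrer) so the pages still linking to them stand out. The paths, the request layout (dicts with "path" and "referrer" keys), and the name `tally_old_url_hits` are all hypothetical.

```python
from collections import Counter

# Hypothetical set of retired URLs to watch.
OLD_PATHS = {"/products/widget-a.asp", "/products/widget-b.asp"}


def tally_old_url_hits(requests, old_paths=OLD_PATHS):
    """Count hits on retired URLs, grouped by (path, referrer).

    `requests` is an iterable of dicts with "path" and "referrer" keys
    (an assumed layout, not a real log format).
    """
    hits = Counter()
    for req in requests:
        if req["path"] in old_paths:
            referrer = req.get("referrer") or "(no referrer)"
            hits[(req["path"], referrer)] += 1
    return hits
```

Calling `hits.most_common(10)` on the result shows which referring pages send the most traffic to old addresses; a large "(no referrer)" bucket points toward a script working from a stored URL list rather than people following links.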

Depending on how you handled the removal of the old pages (404, 301, 302...), some databases may store the nonexistent URL and recheck it periodically. Supposedly, if you simply delete the old page, Google sees it as deleted but periodically checks back (monthly to semi-annually, depending on a range of factors) to see if the page has come back. If you set up a permanent redirect, Google retains the old URL for a much shorter time. I think a lot of other SEs do the same, but I have more info from my logs and hearsay regarding Google than about any of the others.

Also, there are software programs that let users store offline copies of web sites; IE5, for example, had this feature built in. The traffic could be related to a feature like that.
__________________
The best way to learn anything, is to question everything.

Last edited by wige; 07-11-2007 at 06:19 PM.
  #3 (permalink)  
Old 07-11-2007, 08:23 PM
Orion's Avatar
WebProWorld Veteran
WebProWorld MVP
 
Join Date: Sep 2003
Location: Halton Hills, ON
Posts: 702
Orion RepRank 4
Default Re: Old Pages Being Indexed By Third Party

You can also use DNS tools to look up IPs, owners, etc.

If it's OK traffic and not just someone trying to waste your bandwidth, I'd consider 301'ing the majority of those pages in your .htaccess file to either similar products or your main catalogue page(s). If it's a bot, hopefully it will pick up on that and actually fix the links it is following in its database.
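The redirect rules themselves depend on the server, but the decision behind them can be sketched in Python: send a retired product to a similar product when one is known, otherwise to the main catalogue page. Every path and name below (`REDIRECTS`, `CATALOGUE`, `RETIRED_PREFIX`, `redirect_for`) is hypothetical.

```python
# Hypothetical mapping from retired product pages to similar products.
REDIRECTS = {"/products/old-widget.asp": "/products/new-widget.asp"}
CATALOGUE = "/catalogue/"      # fallback target (assumed path)
RETIRED_PREFIX = "/products/"  # where retired pages used to live (assumed)


def redirect_for(path, still_live):
    """Return (status, location), or None to serve the page normally.

    `still_live` is the set of paths that still exist on the site.
    """
    if path in still_live:
        return None
    if path in REDIRECTS:
        return 301, REDIRECTS[path]  # permanent redirect to a similar product
    if path.startswith(RETIRED_PREFIX):
        return 301, CATALOGUE        # no close match: send to the catalogue
    return 404, None                 # never a product page: plain 404
```

Keeping the mapping in one table makes it easy to review which old URLs are still being requested and to retire entries once the traffic stops.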
  #4 (permalink)  
Old 07-11-2007, 08:46 PM
WebProWorld New Member
 
Join Date: Oct 2005
Posts: 4
briscoe98 RepRank 0
Default Re: Old Pages Being Indexed By Third Party

Thank you for your help. I tried the following and here is what happened:

I tried the ping command on the IP addresses from the last 5 attempts and all 5 timed out. Just to make sure the ping command was working, I tried known good IPs and it worked fine.

I tried 10 random links from these errors, in quotes, in Google, and all 10 came up with nothing.

Normally we serve custom 404 errors, but on product pages we redirect to a new page with product suggestions. Since these old pages are no longer indexed in Google, I am thinking that somebody is getting them from somewhere else, maybe a third-party website archive.

Could somebody be scanning these old URLs to look for old pages that we might still have on our server but no longer use, maybe to look for possible security loopholes? At this point I am thinking that it is not a search bot.
  #5 (permalink)  
Old 07-11-2007, 09:14 PM
incrediblehelp's Avatar
WebProWorld 1,000+ Club
WebProWorld MVP
 
Join Date: Jan 2004
Location: Live in Cincy Now
Posts: 7,573
incrediblehelp RepRank 4
Default Re: Old Pages Being Indexed By Third Party

So are you worried about this because you don't want to see all the errors in your log files, or because you don't want some weird bot crawling your pages and taking bandwidth? Sounds just like a crappy scraper bot to me.
  #6 (permalink)  
Old 07-11-2007, 09:23 PM
DrTandem1's Avatar
WebProWorld 1,000+ Club
 
Join Date: Oct 2003
Location: Encinitas, CA
Posts: 1,830
DrTandem1 RepRank 2
Default Re: Old Pages Being Indexed By Third Party

If pages of your site are no longer active but still getting traffic, you are losing that traffic by letting it go to a 404 error. Create a custom 404 error page that allows the visitor to still access your site. Also, as mentioned, use 301 redirects for such pages.
__________________
DrTandem's San Diego Web Page Design, drtandem.com
  #7 (permalink)  
Old 07-11-2007, 10:02 PM
WebProWorld Member
 
Join Date: Nov 2006
Location: Seattle
Posts: 64
shannonlp RepRank 0
Default Re: Old Pages Being Indexed By Third Party

Just a thought: I write spiders daily for my job. It is possible that someone has paid to have a custom spider created to scrape the product information from your site.

If this is the case, more than likely it would come from many different IPs. This type of bot does not read robots.txt and can read Java/Ajax, CAPTCHA images, encoded emails, pretty much everything that you think is safe from a bot.
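Because a bot like this ignores robots.txt, one possible way to spot it is to list requests for paths that robots.txt disallows, since well-behaved crawlers should never ask for them. Below is a minimal sketch using Python's standard `urllib.robotparser`; the robots.txt content, the (ip, path) request layout, the `ExampleBot` agent name, and the function name `disallowed_hits` are assumptions for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that fences off a section of the site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /discontinued/
"""


def disallowed_hits(robots_txt, requests, user_agent="ExampleBot"):
    """Return the (ip, path) requests that robots.txt tells crawlers to skip.

    `requests` is an iterable of (ip, path) tuples (assumed layout).
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [
        (ip, path)
        for ip, path in requests
        if not parser.can_fetch(user_agent, path)
    ]
```

This only helps if the pages being hit fall under a Disallow rule; it says nothing about requests for paths that robots.txt allows.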


On a good note, if it is a professional spider, these requests will create errors for its owner too, and it will then be modified to go after relevant pages.

Hope this helps
__________________
Web Designer and Custom Spider Creator
eCommerce and shopping cart information

Last edited by shannonlp; 07-11-2007 at 10:03 PM. Reason: spelling oops
  #8 (permalink)  
Old 07-12-2007, 12:10 PM
WebProWorld New Member
 
Join Date: Oct 2005
Posts: 4
briscoe98 RepRank 0
Default Re: Old Pages Being Indexed By Third Party

Quote:
Originally Posted by incrediblehelp
So are you worried about this because you don't want to see all the errors in your log files, or because you don't want some weird bot crawling your pages and taking bandwidth? Sounds just like a crappy scraper bot to me.
I am not really worried about the errors showing up. I am worried that somebody out there is trying to steal our content, or trying to pull our keywords or something along those lines. I could just be paranoid.

Chances are, if they are stealing keywords, it is for one of those sites where, when you click on the result in Google, the content has nothing to do with what the site actually is. Somebody just trying to gain rankings.

Last edited by briscoe98; 07-12-2007 at 12:18 PM.
  #9 (permalink)  
Old 07-12-2007, 02:55 PM
incrediblehelp's Avatar
WebProWorld 1,000+ Club
WebProWorld MVP
 
Join Date: Jan 2004
Location: Live in Cincy Now
Posts: 7,573
incrediblehelp RepRank 4
Default Re: Old Pages Being Indexed By Third Party

"Pull keywords"? You mean content, right? No one person "owns" keywords.
  #10 (permalink)  
Old 07-12-2007, 03:07 PM
WebProWorld New Member
 
Join Date: Oct 2005
Posts: 4
briscoe98 RepRank 0
Default Re: Old Pages Being Indexed By Third Party

Quote:
Originally Posted by incrediblehelp
"Pull keywords"? You mean content, right? No one person "owns" keywords.
Sorry about that. I mean pull content for products manufactured by us.