WebProWorld
Search Engine Optimization Forum SEO is much easier with help from peers and experts! The WebProWorld SEO forum is for the discussion and exploration of various search engine optimization topics. Any non (engine) specific SEO or SEM topics should go here.

  #1 (permalink)  
Old 12-20-2007, 01:32 PM
wige's Avatar
Moderator
WebProWorld Moderator
 
Join Date: Jun 2006
Location: United States
Posts: 2,822
wige RepRank 10
Default Indexing of Forbidden Content

Another thread (MSN Live Indexes Google Ads) is discussing the appearance of Google AdSense ads in the MSN/Live index. In theory, this should not be possible, as the URLs in question are blocked by robots.txt. (Line 15 of http://www.google.com/robots.txt specifically disallows the ads.) I can think of a few possible reasons why these pages would be indexed legitimately:
  • Google recently changed the URLs but did not update the robots.txt file in time.
  • For some reason, the Live bot was unable to access the robots.txt file.
  • There is some form of redirect, where the ads exist at another URL but the blocked URL is displayed for the content, possibly through an exploit of the AdWords code.
It was mentioned in the thread that it is not uncommon for URLs that are blocked by robots.txt to show up in the indexes of the various search engines. This has severe logistical and security ramifications if true.
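For anyone who wants to check how a rule-abiding bot should read a given robots.txt, here is a minimal sketch using Python's standard `urllib.robotparser`. The rules, the `ExampleBot` user agent, and the example.com URLs are made up for illustration; they are not the contents of Google's file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules for illustration only; not any real site's robots.txt.
ROBOTS_TXT = """\
User-agent: *
Disallow: /ads/
Disallow: /survey/submit
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def is_crawlable(parser: RobotFileParser, user_agent: str, url: str) -> bool:
    """Return True if the parsed rules allow this user agent to fetch the URL."""
    return parser.can_fetch(user_agent, url)


parser = build_parser(ROBOTS_TXT)
blocked = is_crawlable(parser, "ExampleBot", "http://example.com/ads/banner?id=1")
allowed = is_crawlable(parser, "ExampleBot", "http://example.com/articles/seo-tips")
print(blocked, allowed)
```

Note that this only answers "may the bot fetch this URL?". It says nothing about whether an engine will list the bare URL in its results.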

Has anyone encountered this issue, where you have blocked content through robots.txt or a robots meta tag and had it indexed? Do you know of any reason why properly blocked content would still be indexed?
__________________
The best way to learn anything, is to question everything.
  #2 (permalink)  
Old 12-20-2007, 04:36 PM
incrediblehelp's Avatar
WebProWorld 1,000+ Club
WebProWorld MVP
 
Join Date: Jan 2004
Location: Live in Cincy Now
Posts: 7,573
incrediblehelp RepRank 4
Default Re: Indexing of Forbidden Content

Microsoft Live Search Fixes Problem with Google AdWords Ads

Quote:
The issue stems from the way Live Search handles content disallowed by the Robots.txt file. We regularly check the robots.txt file of a site to ensure that we don't index and cache pages excluded by the webmaster. However, if we do find a link elsewhere on the web pointing to a page excluded by the robots.txt file, we may include the link and the anchor text in our index if we think it might be valuable to our users. Yesterday we accidently began including the links from the ads of Google AdSense customers. The issue has been fixed, and you should see the results disappear from our search results over the next couple days.
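The distinction in that statement (listing a link and its anchor text without ever fetching the page) can be modelled roughly as below. This is only a sketch of the described behaviour, not Live Search's actual code; `IndexEntry`, `record_link`, `fake_fetch`, and the `ExampleBot` user agent are hypothetical names.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional
from urllib.robotparser import RobotFileParser


@dataclass
class IndexEntry:
    url: str
    anchor_texts: List[str] = field(default_factory=list)
    content: Optional[str] = None  # stays None when the page was never fetched


def record_link(index: Dict[str, IndexEntry], parser: RobotFileParser,
                url: str, anchor_text: str,
                fetch: Callable[[str], str]) -> IndexEntry:
    """Record a link found on some other page; fetch content only if allowed."""
    entry = index.setdefault(url, IndexEntry(url))
    entry.anchor_texts.append(anchor_text)  # the link itself is always recorded
    if entry.content is None and parser.can_fetch("ExampleBot", url):
        entry.content = fetch(url)  # content only for pages robots.txt permits
    return entry


def fake_fetch(url: str) -> str:
    """Stand-in for a real HTTP request."""
    return "<html>public page</html>"


parser = RobotFileParser()
parser.parse(["User-agent: *", "Disallow: /private/"])
index: Dict[str, IndexEntry] = {}
blocked = record_link(index, parser, "http://example.com/private/login",
                      "Members login", fake_fetch)
public = record_link(index, parser, "http://example.com/about",
                     "About us", fake_fetch)
```

In this model the blocked URL still ends up in the index with its anchor text, which is how a disallowed page can appear in results even though the bot never requested it.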
  #3 (permalink)  
Old 12-20-2007, 04:39 PM
WebProWorld Veteran
 
Join Date: Aug 2006
Location: Burlington, Ontario, Canada.
Posts: 407
jtracking RepRank 1
Default Re: Indexing of Forbidden Content

I just had a thought, though: if the page is linked to from another site, what happens? Does Google get to that page before it reads robots.txt?
__________________
Post as-it-happens crime stories of criminal behaviour at crimedigg.com
  #4 (permalink)  
Old 12-20-2007, 04:44 PM
Peter (IMC)'s Avatar
WebProWorld MVP
WebProWorld MVP
 
Join Date: Dec 2003
Posts: 1,485
Peter (IMC) RepRank 4
Default Re: Indexing of Forbidden Content

Quote:
Originally Posted by jtracking View Post
I just had a thought, though: if the page is linked to from another site, what happens? Does Google get to that page before it reads robots.txt?
If they find enough links, they will index the link, but not the content. Generally they use the DMOZ title and description in these cases (if available, of course).
__________________
FREE SEO ! Really? YES! All you have to do is implement it!
Follow me on Twitter PeterIMC

Last edited by Peter (IMC); 12-20-2007 at 04:58 PM.
  #5 (permalink)  
Old 12-20-2007, 05:01 PM
incrediblehelp's Avatar
WebProWorld 1,000+ Club
WebProWorld MVP
 
Join Date: Jan 2004
Location: Live in Cincy Now
Posts: 7,573
incrediblehelp RepRank 4
Default Re: Indexing of Forbidden Content

Quote:
Originally Posted by Peter (IMC) View Post
If they find enough links, they will index the link, but not the content. Generally they use the DMOZ title and description in these cases (if available, of course).
I think you are missing the point here. No matter what, they should not index the URLs in this case if they adhere to Google's robots.txt file. It has nothing to do with seeing links from other websites.
  #6 (permalink)  
Old 12-20-2007, 05:23 PM
Peter (IMC)'s Avatar
WebProWorld MVP
WebProWorld MVP
 
Join Date: Dec 2003
Posts: 1,485
Peter (IMC) RepRank 4
Default Re: Indexing of Forbidden Content

Quote:
Originally Posted by incrediblehelp View Post
I think you are missing the point here. No matter what, they should not index the URLs in this case if they adhere to Google's robots.txt file. It has nothing to do with seeing links from other websites.
I was merely stating what they do, not what they should or should not do.

The discussion is interesting, though. Should they index the URL or not? It's like being a celebrity. You want your privacy and don't let anybody into your house, but does that forbid magazines from writing your name in their articles?
__________________
FREE SEO ! Really? YES! All you have to do is implement it!
Follow me on Twitter PeterIMC
  #7 (permalink)  
Old 12-20-2007, 05:55 PM
datetopia's Avatar
WebProWorld Pro
 
Join Date: Dec 2006
Location: Datetopia Dating Software
Posts: 139
datetopia RepRank 0
Default Re: Indexing of Forbidden Content

The perfect crawler should act as a human internet surfer.

It's the same as specifying keywords for the pages. Most engines will index what they consider popular and searchable information, and they will show results for keywords based on their algorithms, not on the website owner's instructions.
  #8 (permalink)  
Old 12-20-2007, 06:34 PM
DrTandem1's Avatar
WebProWorld 1,000+ Club
 
Join Date: Oct 2003
Location: Encinitas, CA
Posts: 1,830
DrTandem1 RepRank 2
Default Re: Indexing of Forbidden Content

Just an FYI: using a robots.txt file does not guarantee either indexing or not indexing. It is simply the designer's preference or request, and certain robots may ignore it.
__________________
DrTandem's San Diego Web Page Design, drtandem.com
  #9 (permalink)  
Old 12-20-2007, 06:42 PM
wige's Avatar
Moderator
WebProWorld Moderator
 
Join Date: Jun 2006
Location: United States
Posts: 2,822
wige RepRank 10
Default Re: Indexing of Forbidden Content

Quote:
The issue stems from the way Live Search handles content disallowed by the Robots.txt file. We regularly check the robots.txt file of a site to ensure that we don't index and cache pages excluded by the webmaster. However, if we do find a link elsewhere on the web pointing to a page excluded by the robots.txt file, we may include the link and the anchor text in our index if we think it might be valuable to our users. Yesterday we accidently began including the links from the ads of Google AdSense customers. The issue has been fixed, and you should see the results disappear from our search results over the next couple days.
Expletive. And I mean a really colorful one.

Quote:
Originally Posted by jtracking View Post
I just had a thought, though: if the page is linked to from another site, what happens? Does Google get to that page before it reads robots.txt?
In theory (and based on what I have read on various webmaster FAQ-type pages), when a search engine "discovers" a URL, that URL is added to a to-crawl index. When the spider goes through the to-crawl index, it first checks the robots file for the domain; then, if permitted, it attempts to retrieve the document and, if successful, processes and indexes it. At least, this is the way the standard assumes it works, and the way most search engines have claimed they work.
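That sequence (queue the URL, check the robots file for its host, fetch if permitted, then index) might look roughly like this. It is a simplified sketch of the process as described above, not how any particular engine is implemented; the `crawl` function, the `ExampleBot` user agent, and the in-memory `robots_by_host` and `pages` dictionaries standing in for the network are all assumptions.

```python
from collections import deque
from typing import Dict, Iterable, List, Tuple
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser


def crawl(seed_urls: Iterable[str],
          robots_by_host: Dict[str, str],
          pages: Dict[str, str],
          user_agent: str = "ExampleBot") -> Tuple[Dict[str, str], List[str]]:
    """Process a to-crawl queue, consulting robots.txt before every fetch.

    robots_by_host maps host -> robots.txt text, and pages maps url -> body;
    both are in-memory stand-ins for real HTTP requests.
    """
    frontier = deque(seed_urls)  # the "to-crawl index"
    parsers: Dict[str, RobotFileParser] = {}
    seen = set()
    indexed: Dict[str, str] = {}
    skipped: List[str] = []

    while frontier:
        url = frontier.popleft()
        if url in seen:
            continue
        seen.add(url)

        # Step 1: load (and cache) the robots rules for this host.
        host = urlsplit(url).netloc
        if host not in parsers:
            parser = RobotFileParser()
            parser.parse(robots_by_host.get(host, "").splitlines())
            parsers[host] = parser

        # Step 2: skip the URL entirely if the rules disallow it.
        if not parsers[host].can_fetch(user_agent, url):
            skipped.append(url)
            continue

        # Step 3: retrieve the document and index it if the fetch succeeded.
        body = pages.get(url)
        if body is not None:
            indexed[url] = body

    return indexed, skipped


robots = {"example.com": "User-agent: *\nDisallow: /survey/submit\n"}
site = {
    "http://example.com/": "home page",
    "http://example.com/survey/submit": "thanks for voting",
}
indexed, skipped = crawl(
    ["http://example.com/", "http://example.com/survey/submit",
     "http://example.com/missing"],
    robots, site,
)
```

In this model a disallowed URL never reaches the index at all, which is the behaviour I would have expected from a compliant crawler.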

Quote:
Originally Posted by datetopia View Post
The perfect crawler should act as a human internet surfer.

It's the same as specifying keywords for the pages. Most engines will index what they consider popular and searchable information, and they will show results for keywords based on their algorithms, not on the website owner's instructions.
But there is always certain content that is intended only for humans, not for bots. One example (from the robots.txt web site) is that of a survey. Blocking the script that processes the results should prevent spiders from crawling or indexing the URL, and tainting the survey results.

In the same way, you do not want that URL to be indexed, because the clicks are coming from users who have no idea what they are clicking on, and could taint the results of a survey - or analytics or campaign tracking, etc. In addition, robots.txt is used to keep spiders from indexing login pages for CMS systems and other critical apps to prevent Googlehacking. If I know a certain CMS is vulnerable to attack, and Google indexes a login page for that CMS, I could potentially search for all the sites that use that CMS and attack them.
__________________
The best way to learn anything, is to question everything.
  #10 (permalink)  
Old 01-01-2008, 05:19 PM
WebProWorld Pro
 
Join Date: Feb 2004
Location: Stupid question. At my PC.
Posts: 135
TechEvangelist RepRank 1
Default Re: Indexing of Forbidden Content

I have seen files that were blocked in the robots.txt file AFTER they had already been indexed continue to show up in Google's index for almost a year. The issue may be a matter of whether they are indexed before or after they are blocked. I have always found the robots meta tag to be more effective at eliminating URLs from search engine indexes.
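One possible reason for the difference: a crawler can only see a robots meta tag if it is allowed to fetch the page, so a URL that is disallowed in robots.txt never gets its `noindex` read. Here is a small sketch of how a bot might check the tag once it has the HTML. `RobotsMetaParser` and `may_index` are hypothetical names, and real engines may treat more directives than the two handled here.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect directives from <meta name="robots" content="..."> tags."""

    def __init__(self) -> None:
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr_map = {name.lower(): (value or "") for name, value in attrs}
        if attr_map.get("name", "").lower() != "robots":
            return
        # The content attribute is a comma-separated list, e.g. "noindex, nofollow".
        for token in attr_map.get("content", "").split(","):
            token = token.strip().lower()
            if token:
                self.directives.add(token)


def may_index(html: str) -> bool:
    """Return False if the page asks not to be indexed ("noindex" or "none")."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return not ({"noindex", "none"} & parser.directives)


blocked_page = '<html><head><meta name="robots" content="noindex, nofollow"></head></html>'
normal_page = "<html><head><title>Welcome</title></head></html>"
print(may_index(blocked_page), may_index(normal_page))
```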

MSN seems to be much better at using the robots.txt file to eliminate URLs. Yahoo appears to ignore it most of the time. I've seen blocked URLs show up in Yahoo for years after they were blocked.
__________________
Facts are meaningless. They can be used to prove anything. - Homer Simpson
MySQL Cheatsheet :: Arizona SEO training :: Phoenix Managed Services

Last edited by TechEvangelist; 01-01-2008 at 05:26 PM.