WebProWorld Part of WebProNews.com
Google Discussion Forum Google Discussion forum is for topics specifically related to Google. There is a subforum dedicated to AdSense/AdWords subjects.

#1 cppgenius (WebProWorld Member), 08-31-2007, 06:48 AM
Join Date: Sep 2005, Location: South Africa, Posts: 56
Sitemap used as a replacement for robots.txt

We have added a simple robots.txt file on our server to prevent unnecessary entries appearing in our error logs. There aren't any URLs that we want to exclude from the Google index, so we only added the 'Sitemap:' entry to the robots.txt file, nothing else.

Now Google is complaining about a specific set of URLs restricted by our robots.txt file. These URLs do not appear in our Sitemap and, best of all, they all come from one specific page.

The URLs are in the following format:
http://www.example.com/link.php?id=1&site=www.another-example.com

The link.php page is a central page that handles the redirection for all the affiliate programs we are signed up for. This makes it easy to manage our affiliate links in one central location, and it makes our links look more user-friendly throughout the site. It is essentially an affiliate link cloaking page. However, we have many links like this across our whole site, so I don't see the format of the link or the use of a redirection page as the cause of the problem. As I said, these URLs do not appear in our Sitemap, so it looks as if Google is using the Sitemap as the deciding factor for whether we want a link crawled or not.

I know I can solve this by adding the following to the robots.txt file:
User-agent: *
Disallow:

According to http://www.robotstxt.org/wc/exclusion-admin.html you can create an empty "/robots.txt" file as an alternative to the above. That is what I did, except for adding the Sitemap line.

Another thing I could try is to add the URLs to our Sitemap file, so I think I will be able to work around this problem, but it is strange to me how Google handled the whole issue. I mean, nowhere did we state in our robots file, or via any other means, that these URLs should not be crawled, so why does Google make this crazy assumption?

Am I missing something or is the Google spider stuck in his own web?
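
The reasoning in this post can be checked against a standard parser. The following is a minimal sketch using Python's urllib.robotparser, which follows the original robots exclusion conventions and is not Googlebot's own parser, so it only shows what the conventions imply. The sitemap location and the helper name is_allowed are assumptions made for illustration.

```python
from urllib.robotparser import RobotFileParser

# The URL pattern from the post; the host names are the post's own placeholders.
LINK_URL = "http://www.example.com/link.php?id=1&site=www.another-example.com"


def is_allowed(robots_txt: str, url: str = LINK_URL, agent: str = "Googlebot") -> bool:
    """Parse robots.txt text in memory and report whether `agent` may fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


# Hypothetical sitemap location, used only for illustration.
sitemap_only = "Sitemap: http://www.example.com/sitemap.xml\n"

# The workaround from the post: an explicit "allow everything" record.
explicit_allow = "User-agent: *\nDisallow:\n" + sitemap_only

# For contrast, a file that really does restrict the redirect page.
blocked = "User-agent: *\nDisallow: /link.php\n"

print(is_allowed(sitemap_only))
print(is_allowed(explicit_allow))
print(is_allowed(blocked))
```

Under this parser the Sitemap-only file and the explicit empty Disallow both allow the link.php URL, and only the last file blocks it. That matches the poster's expectation that a Sitemap line on its own should not restrict anything.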
#2 wige (WebProWorld Moderator), 08-31-2007, 09:43 AM
Join Date: Jun 2006, Location: United States, Posts: 1,782
Re: Sitemap used as a replacement for robots.txt

There are a few very different things that could cause this. A robots meta tag on the link.php page could do this, for example. Also, have you checked the robots.txt file that Google has cached, through Webmaster Tools?
__________________
The best way to learn anything, is to question everything.
#3 brandrocker (WebProWorld Member), 09-01-2007, 04:49 AM
Join Date: Dec 2006, Location: India, Posts: 49
Re: Sitemap used as a replacement for robots.txt

It looks like an old issue. You may want to read this as well:

Inside Google Sitemaps: Updated robots.txt status
#4 cppgenius (WebProWorld Member), 09-01-2007, 07:20 AM
Join Date: Sep 2005, Location: South Africa, Posts: 56
Re: Sitemap used as a replacement for robots.txt

wige, yes, I did check the cached version and tested the URLs in question, and guess what: Google said that the URLs were allowed by robots.txt.

Thanks for the link brandrocker, looks like one of those old issues with Google that was supposed to be fixed by now, but isn't.
#5 cppgenius (WebProWorld Member), 09-03-2007, 07:37 AM
Join Date: Sep 2005, Location: South Africa, Posts: 56
Re: Sitemap used as a replacement for robots.txt

Just to let you all know, adding the following lines to our robots.txt seems to have done the trick. The restricted URLs disappeared once Google downloaded the latest version of the file:

User-agent: *
Disallow:

It still remains a strange phenomenon, but hey, I'm not going to complain if Google accepted the workaround.
#6 cppgenius (WebProWorld Member), 09-06-2007, 03:12 AM
Join Date: Sep 2005, Location: South Africa, Posts: 56
Re: Sitemap used as a replacement for robots.txt

I did not specify a redirect status code (the 3xx series) in the PHP script of link.php; I only use

header("Location: http://www.example.com");

Do you think that this might have caused Google to get confused? This is not a permanent redirect; it is only a redirect to an affiliate URL, which may change from time to time. The page redirects to different URLs depending on the parameters passed with the URL, so which status code do you suggest is the safest to use, 302? I know search engines, especially Google, do not always like redirects, so I want to get this right.
#7 wige (WebProWorld Moderator), 09-06-2007, 10:26 AM
Join Date: Jun 2006, Location: United States, Posts: 1,782
Re: Sitemap used as a replacement for robots.txt

If no status code is specified in the PHP header() call, a Location header is sent with a 302 by default. Google does seem to prefer 301 redirects, though.

I think Googlebot may run into problems parsing a robots.txt file that contains only a Sitemap directive.
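
In PHP the status can be made explicit through the third argument of header(), for example header("Location: http://www.example.com", true, 302);. The decision a link.php-style redirector makes can also be sketched independently of PHP. The Python below is only an illustration of that logic, not the poster's script: the affiliate table, its destination URL, and the function name build_redirect are all hypothetical.

```python
from urllib.parse import parse_qs, urlsplit

# Hypothetical affiliate table; the id and destination are made up for illustration.
AFFILIATE_LINKS = {
    "1": "http://www.another-example.com/?ref=example",
}


def build_redirect(request_url: str, permanent: bool = False):
    """Return (status_code, headers) for a link.php-style request.

    A temporary 302 is returned unless `permanent` is set, which mirrors
    the default behaviour described in the post above.
    """
    query = parse_qs(urlsplit(request_url).query)
    link_id = query.get("id", [""])[0]
    target = AFFILIATE_LINKS.get(link_id)
    if target is None:
        # Unknown id: answer with "not found" instead of redirecting.
        return 404, {}
    return (301 if permanent else 302), {"Location": target}


status, headers = build_redirect(
    "http://www.example.com/link.php?id=1&site=www.another-example.com"
)
print(status, headers["Location"])
```

Making the status code an explicit choice, instead of relying on the default, keeps the temporary-versus-permanent decision visible in the code.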
#8 cppgenius (WebProWorld Member), 09-07-2007, 02:50 AM
Join Date: Sep 2005, Location: South Africa, Posts: 56
Re: Sitemap used as a replacement for robots.txt

Thanks wige, I expected that 302 would be the default but wasn't sure.

I found an interesting blog entry by Matt Cutts regarding 302 redirects.
http://www.mattcutts.com/blog/seo-ad...302-redirects/

It is true what you say about Google preferring 301 redirects, but do you think a 301 is appropriate for this specific situation? 301 redirects are important if you want to retain PageRank for pages that have been permanently moved to another location. Since I am not concerned about the PageRank of an affiliate link, I think 302 would be more appropriate here, because this redirect is more about functionality than PageRank. As a matter of fact, PageRank is not even a factor here. Please correct me if I'm wrong.

I have read that a lot of nasty things could be done with 302 redirects in the past, so Google no longer places such a high premium on them. If I understand the article by Cutts correctly, I don't have to worry about getting penalized because of these redirects as long as I use them in good faith and do not do weird redirects or redirect to bad or spammy pages.

On the other hand, what I see as good faith and what Google sees as good faith can be two completely different things, right?
#9 wige (WebProWorld Moderator), 09-07-2007, 09:52 AM
Join Date: Jun 2006, Location: United States, Posts: 1,782
Re: Sitemap used as a replacement for robots.txt

One of the things with redirects is the message they send the spider when the redirecting page is indexed. When a 302 redirect is encountered, the search engine will continue to check the redirecting page to see whether it still redirects. If you specify a 301 redirect, the spider will automatically count links to the old location as links to the new location, and rechecks of the redirecting page are greatly reduced.

Probably the primary consideration is how often you expect to change the destination of the redirects. If they will never, or almost never, change, use a 301. As far as technical issues go, bots usually treat URLs containing parameters as separate pages, so in theory these should not cause crawl errors unless the redirect is not in a proper format, and in that case the URL would be listed under not crawled rather than blocked by robots.txt.