08-31-2007, 06:48 AM
WebProWorld Member
Join Date: Sep 2005
Location: South Africa
Posts: 56
Sitemap used as a replacement for robots.txt
We have added a simple robots.txt file on our server to prevent unnecessary entries appearing in our error logs. There aren't any URLs that we want to exclude from the Google index, so we only added the 'Sitemap:' entry to the robots.txt file, nothing else.
Now Google is complaining about a specific set of URLs restricted by our robots.txt file. These URLs do not appear in our Sitemap and, best of all, they all come from one specific page.
The URLs are in the following format:
http://www.example.com/link.php?id=1&site=www.another-example.com
The link.php page is a central page that handles the redirection for all the affiliate programs we are signed up for. This makes it easy to manage our affiliate links in one place, and it makes our links look more user-friendly throughout the site. It is a kind of affiliate link cloaking page. However, we have many links like this across our whole site, so I don't see the format of the link or the use of a redirection page as the cause of the problem. As I said, these URLs do not appear in our Sitemap, so it looks as if Google is using the Sitemap as the deciding factor for whether we want a link crawled or not.
I know I can solve this by adding the following to the robots.txt file:
User-agent: *
Disallow:
According to http://www.robotstxt.org/wc/exclusion-admin.html you can create an empty "/robots.txt" file as an alternative to the above. That is what I did, except for adding the Sitemap line.
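As a rough cross-check of the two variants above, here is a minimal Python sketch using the standard library's urllib.robotparser. It only shows how a conventional robots.txt parser reads these files, not how Googlebot behaves. The sitemap URL is an assumed placeholder, and the link.php URL is the example from this post.

```python
from urllib.robotparser import RobotFileParser

# Example URL from this thread; the sitemap location below is an assumption.
LINK_URL = "http://www.example.com/link.php?id=1&site=www.another-example.com"

def is_allowed(robots_txt: str, user_agent: str = "Googlebot") -> bool:
    """Parse robots.txt text and report whether LINK_URL may be fetched."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, LINK_URL)

# Variant 1: only a Sitemap line, no User-agent or Disallow records.
sitemap_only = "Sitemap: http://www.example.com/sitemap.xml\n"

# Variant 2: the explicit "allow everything" record plus the Sitemap line.
explicit_allow = "User-agent: *\nDisallow:\n\n" + sitemap_only

print(is_allowed(sitemap_only))
print(is_allowed(explicit_allow))
```

Under this parser both variants allow the URL, so by the conventional rules a Sitemap-only file does not restrict anything.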
Another thing I could try is adding the URLs to our Sitemap file, so I think I will be able to work around this problem, but it is strange to me how Google handled the whole issue. I mean, nowhere did we state in our robots file, or via any other means, that these URLs should not be crawled, so why does Google make this crazy assumption?
Am I missing something or is the Google spider stuck in his own web?

08-31-2007, 09:43 AM
Moderator
Join Date: Jun 2006
Location: United States
Posts: 1,782
There are a few very different things that could cause this. A robots meta tag on the link.php page could do this, for example. Also, have you checked the cached copy of your robots.txt file in Google's Webmaster Tools?
__________________
The best way to learn anything, is to question everything.

09-01-2007, 07:20 AM
WebProWorld Member
wige, yes, I did check the cached version and tested the URLs in question and, guess what, Google said that the URLs were allowed by robots.txt.
Thanks for the link, brandrocker. It looks like one of those old Google issues that was supposed to be fixed by now but isn't.

09-03-2007, 07:37 AM
WebProWorld Member
Just to let you all know, adding the following lines to our robots.txt seems to have done the trick, because the restricted URLs disappeared once Google downloaded the latest version of our robots.txt file:
User-agent: *
Disallow:
It still remains a strange phenomenon, but hey, I'm not going to complain if Google accepted the workaround.

09-06-2007, 03:12 AM
WebProWorld Member
I did not specify a redirect status code (the 3xx series) in the PHP script for link.php; I only use:
header("Location: http://www.example.com");
Do you think that this might have confused Google? This is not a permanent redirect; it is only a redirect to an affiliate URL, which may change from time to time. The page redirects to different URLs depending on the parameters passed with the URL, so which status code do you suggest is the safest to use, 302? I know search engines, especially Google, do not always like redirects, so I want to get this right.

09-06-2007, 10:26 AM
Moderator
If no status code is specified in the header() call, PHP sends a Location header with a 302 by default. Google does seem to prefer 301 redirects though.
I think Googlebot may run into problems parsing a robots.txt file that only contains a sitemaps directive.
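On the header() question, a minimal sketch of the same idea in Python. The script in this thread is PHP, so this is only an analogue and not the poster's code. It states the status explicitly instead of relying on a default. The AFFILIATE_TARGETS table and its URL are hypothetical. In PHP itself, the status code can be passed as the third argument of header().

```python
from urllib.parse import parse_qs

# Hypothetical mapping of link ids to affiliate destinations.
AFFILIATE_TARGETS = {
    "1": "http://www.another-example.com/",
}

def link_app(environ, start_response):
    """WSGI app that mirrors link.php: look up ?id=... and redirect."""
    query = parse_qs(environ.get("QUERY_STRING", ""))
    target = AFFILIATE_TARGETS.get(query.get("id", [""])[0])
    if target is None:
        start_response("404 Not Found", [("Content-Type", "text/plain")])
        return [b"Unknown link id\n"]
    # Temporary redirect, stated explicitly because the target may change.
    start_response("302 Found", [("Location", target)])
    return [b""]
```

Swapping "302 Found" for "301 Moved Permanently" is the only change needed if the destinations are expected to stay fixed.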

09-07-2007, 02:50 AM
WebProWorld Member
Thanks wige, I expected that 302 would be the default but wasn't sure.
I found an interesting blog entry by Matt Cutts regarding 302 redirects.
http://www.mattcutts.com/blog/seo-ad...302-redirects/
It is true what you say about Google preferring 301 redirects, but do you think it is appropriate for this specific situation? 301 redirects are important if you want to retain PageRank for pages that have been permanently moved to another location. Since I am not concerned about the PageRank of an affiliate link, I think 302 would be more appropriate here, because this redirect is more about functionality than PageRank. As a matter of fact, PageRank is not even a factor here. Please correct me if I'm wrong.
I have read that you could do a lot of nasty things with a 302 redirect in the past, so Google no longer places such a high premium on them. If I understand the article by Cutts correctly, I don't have to worry about getting penalized because of these redirects as long as I use them in good faith and do not set up weird redirects or redirect to bad or spammy pages.
On the other hand, what I see as good faith and what Google sees as good faith can be two completely different things, right?

09-07-2007, 09:52 AM
Moderator
One thing to consider with redirects is the message they send the spider when the redirecting page is indexed. When a 302 redirect is encountered, the search engine will keep checking the redirecting page to see whether it still redirects. If you specify a 301 redirect, the spider will automatically count links to the old location as links to the new location, and rechecks of the redirecting page are greatly reduced.
Probably the primary consideration is how often you expect to change the destination of the redirects. If they will never, or almost never, change, use a 301. As for technical issues, bots usually treat URLs containing parameters as separate pages, so in theory these should not cause crawl errors unless the redirect is not in a proper format, and in those cases the URL would be listed under not crawled rather than blocked by robots.txt.
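To illustrate the point about parameters, a small Python sketch that identifies a page by its path plus query string. This is not how any particular bot is implemented; the crawl_key helper is hypothetical, and the second URL's id and site are made up for the example.

```python
from urllib.parse import urlsplit

# Two link.php URLs in the format used in this thread.
urls = [
    "http://www.example.com/link.php?id=1&site=www.another-example.com",
    "http://www.example.com/link.php?id=2&site=www.third-example.com",
]

def crawl_key(url: str) -> tuple:
    """Identify a page by host, path and query, so each parameter set is its own page."""
    parts = urlsplit(url)
    return (parts.netloc, parts.path, parts.query)

paths = {urlsplit(u).path for u in urls}
pages = {crawl_key(u) for u in urls}
print(len(paths), len(pages))
```

Both URLs share the single path /link.php, yet they count as two distinct pages once the query string is part of the key.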