Google Discussion Forum: for topics specifically related to Google. There is a subforum dedicated to AdSense/AdWords subjects.
|
For the past fortnight Googlebot has been on my site trying to index non-existent URLs, without first trying to obtain the robots.txt file. Is their bot turning rogue or what? Here is a sample of my server log for today, 23rd May:
crawl-66-249-65-198.googlebot.com - - [23/May/2007:03:01:27 +0100] "GET /search/search.pl?q=cars HTTP/1.1" 404 -
crawl-66-249-65-198.googlebot.com - - [23/May/2007:03:04:27 +0100] "GET /search/search.pl?q=bebo.com HTTP/1.1" 404 -
crawl-66-249-65-198.googlebot.com - - [23/May/2007:03:07:38 +0100] "GET /search/search.pl?q=cars HTTP/1.1" 404 -
crawl-66-249-65-198.googlebot.com - - [23/May/2007:03:10:41 +0100] "GET /search/search.pl?q=bebo.com HTTP/1.1" 404 -
crawl-66-249-65-198.googlebot.com - - [23/May/2007:03:14:22 +0100] "GET /search/search.pl?q=cars HTTP/1.1" 404 -
crawl-66-249-65-198.googlebot.com - - [23/May/2007:03:17:25 +0100] "GET /search/search.pl?q=bebo.com HTTP/1.1" 404 -
crawl-66-249-65-198.googlebot.com - - [23/May/2007:03:20:31 +0100] "GET /search/search.pl?q=cars HTTP/1.1" 404 -
crawl-66-249-65-198.googlebot.com - - [23/May/2007:03:23:42 +0100] "GET /search/search.pl?q=bebo.com HTTP/1.1" 404 -
crawl-66-249-65-198.googlebot.com - - [23/May/2007:03:26:52 +0100] "GET /search/search.pl?q=cars HTTP/1.1" 404 -
crawl-66-249-65-198.googlebot.com - - [23/May/2007:03:30:03 +0100] "GET /search/search.pl?q=bebo.com HTTP/1.1" 404 -
crawl-66-249-65-198.googlebot.com - - [23/May/2007:03:33:34 +0100] "GET /search/search.pl?q=cars HTTP/1.1" 404 -
crawl-66-249-65-198.googlebot.com - - [23/May/2007:03:36:44 +0100] "GET /search/search.pl?q=bebo.com HTTP/1.1" 404 -
crawl-66-249-65-198.googlebot.com - - [23/May/2007:03:40:00 +0100] "GET /search/search.pl?q=cars HTTP/1.1" 404 -
crawl-66-249-65-198.googlebot.com - - [23/May/2007:03:43:24 +0100] "GET /search/search.pl?q=bebo.com HTTP/1.1" 404 -
crawl-66-249-65-198.googlebot.com - - [23/May/2007:03:46:34 +0100] "GET /search/search.pl?q=cars HTTP/1.1" 404 -
crawl-66-249-65-198.googlebot.com - - [23/May/2007:03:49:44 +0100] "GET /search/search.pl?q=bebo.com HTTP/1.1" 404 -
crawl-66-249-65-198.googlebot.com - - [23/May/2007:03:52:54 +0100] "GET /search/search.pl?q=cars HTTP/1.1" 404 -
crawl-66-249-65-198.googlebot.com - - [23/May/2007:03:56:04 +0100] "GET /search/search.pl?q=bebo.com HTTP/1.1" 404 -
crawl-66-249-65-198.googlebot.com - - [23/May/2007:03:59:14 +0100] "GET /search/search.pl?q=cars HTTP/1.1" 404 -
crawl-66-249-65-198.googlebot.com - - [23/May/2007:04:02:33 +0100] "GET /search/search.pl?q=bebo.com HTTP/1.1" 404 -
crawl-66-249-65-198.googlebot.com - - [23/May/2007:04:05:43 +0100] "GET /search/search.pl?q=cars HTTP/1.1" 404 -
crawl-66-249-65-198.googlebot.com - - [23/May/2007:04:09:01 +0100] "GET /search/search.pl?q=bebo.com HTTP/1.1" 404 -
crawl-66-249-65-198.googlebot.com - - [23/May/2007:04:12:16 +0100] "GET /search/search.pl?q=cars HTTP/1.1" 404 -
crawl-66-249-65-198.googlebot.com - - [23/May/2007:04:15:37 +0100] "GET /search/search.pl?q=bebo.com HTTP/1.1" 404 -
crawl-66-249-65-198.googlebot.com - - [23/May/2007:04:18:47 +0100] "GET /search/search.pl?q=cars HTTP/1.1" 404 -
crawl-66-249-65-198.googlebot.com - - [23/May/2007:04:21:57 +0100] "GET /search/search.pl?q=bebo.com HTTP/1.1" 404 -
crawl-66-249-65-198.googlebot.com - - [23/May/2007:04:25:14 +0100] "GET /search/search.pl?q=cars HTTP/1.1" 404 -
crawl-66-249-65-198.googlebot.com - - [23/May/2007:04:28:24 +0100] "GET /search/search.pl?q=bebo.com HTTP/1.1" 404 -
Make up your own minds. I have emailed them about this, only to receive no reply and a continuation of the attempts to index these URLs, which do not exist. The search URL is incorrect for my search engine. All this does is slow the processes down on my server; they continuously try to index the same links, each one every 6 to 7 minutes. Does anyone have any ideas? I would be grateful for suggestions. I have modified my robots.txt file to ban the indexing of these URLs. Also, when a 404 is found, isn't the URL automatically removed from Google's index? It is from mine. Like I said, the above is an actual sample from my log today, but this has been ongoing for over a fortnight.
Thanks
Sincerely
http://www.littlemansearch.co.uk/index.html
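For anyone wanting to check a log like the one above without reading it by eye, here is a minimal Python sketch. It assumes the log format shown in the sample (host, two dashes, bracketed timestamp, quoted request, status, size); the `summarize` function and the `SAMPLE` list are made up for illustration, with the sample lines copied from the log above. Note that a hostname ending in `.googlebot.com` in a log is only a hint, not proof that the request really came from Google.

```python
import re
from collections import Counter

# Matches the log format shown in the sample:
# host - - [timestamp] "METHOD path PROTOCOL" status size
LOG_LINE = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\S+)$'
)


def summarize(lines, host_suffix=".googlebot.com"):
    """Count requests per (path, status) for hosts ending in host_suffix,
    and report whether /robots.txt was requested at all."""
    hits = Counter()
    robots_requested = False
    for line in lines:
        match = LOG_LINE.match(line.strip())
        if match is None or not match["host"].endswith(host_suffix):
            continue  # skip unparseable lines and other visitors
        path = match["path"]
        if path.split("?", 1)[0] == "/robots.txt":
            robots_requested = True
        hits[(path, int(match["status"]))] += 1
    return hits, robots_requested


# Three lines copied from the sample log above.
SAMPLE = [
    'crawl-66-249-65-198.googlebot.com - - [23/May/2007:03:01:27 +0100] "GET /search/search.pl?q=cars HTTP/1.1" 404 -',
    'crawl-66-249-65-198.googlebot.com - - [23/May/2007:03:04:27 +0100] "GET /search/search.pl?q=bebo.com HTTP/1.1" 404 -',
    'crawl-66-249-65-198.googlebot.com - - [23/May/2007:03:07:38 +0100] "GET /search/search.pl?q=cars HTTP/1.1" 404 -',
]

hits, robots_requested = summarize(SAMPLE)
for (path, status), count in hits.most_common():
    print(count, status, path)
print("robots.txt requested:", robots_requested)
```

Running this over a full day's log rather than a short sample would show whether robots.txt was fetched at some other time of day, which a two-hour excerpt cannot tell you.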
|
||||
|
Hi
It's not ignoring your robots.txt file. Your file is incorrect and contains quite a few errors. Therefore Googlebot and others will not download the file. Use the URL below to find the errors and how to resolve them: http://tool.motoricerca.info/robots-checker.phtml
You should enter the full URL to your robots.txt file.
Peace
|
|||
|
Yahoo indexed my site yesterday, and first it requested the robots.txt file, as it should. Googlebot used to, but one day I was checking my logs and noticed that it wasn't requesting the robots.txt file, and nothing has changed. Almost every other crawler requests the robots.txt file.
http://www.littlemansearch.co.uk
|
|||
|
Thanks for the tip. I followed the link and amended the few errors; let's just see if it works now. Great blog, by the way. I would normally agree, but I run a search engine myself. Visit http://www.awagawag.co.uk/search/search.pl?Mode=AnonAdd to add your website for crawling; it's free.
Sincerely
http://www.littlemansearch.co.uk
Last edited by Littlemansearch; 05-23-2007 at 09:05 AM. Reason: extra information
|
|||
|
Thanks for your reply.
I have used the noindex, nofollow rule in the areas of my web pages that I don't want indexed, but the problem remains that Google keeps trying to index the same link every 12 hours or so, thus tying my server up, as it is currently only on a normal PC and not on a dedicated server. They have been trying to index the same URLs for at least a fortnight. The thing is, the URLs do not exist, and according to my server logs it is not even requesting the robots.txt file, which is the first thing it should do. It's like walking into someone else's house uninvited. Thanks for the tip about Google using more than one bot.
Sincerely
Littleman Search
http://www.littlemansearch.co.uk/index.html
Last edited by Littlemansearch; 05-23-2007 at 12:15 PM.
|
||||
|
Google does not recheck the robots.txt file before every file is retrieved. It checks at most daily, and I think the average is closer to weekly. Based on the number of requests, it sounds like Google found a large number of links to this URL somehow. Is there any way you could create a 301 redirect to somewhere else? That should get Google to drop the page a lot faster.
Google looks at 404 error messages as the server saying, "For some unknown reason, I can't find what you are telling me should be here." Google then keeps saying "Find it yet? How about now?" Especially if a lot of other pages say it should be there. A 301 message tells Google "What you want was moved over there, and nothing will ever be here again. Stop asking." I think Google still crawls pages that are forbidden by robots.txt, to see if there are links on the resulting page that it can follow or index, but the page is not added to the index or cached. I have had 404s that Google kept looking for for over a year, until I put up 301 redirects. You can do the same thing in your server configuration or .htaccess, depending on your server software.
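To make the 404 versus 301 difference concrete, here is a rough Python sketch of a tiny server that answers the retired search URL with a permanent redirect. It is only an illustration of the response itself, not the server configuration or .htaccess route mentioned above. The `MOVED_PERMANENTLY` mapping, the choice of `/` as the target, and the `redirect_target` and `serve` names are assumptions for the example.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import urlsplit

# Hypothetical mapping: retired path -> its permanent new location.
MOVED_PERMANENTLY = {"/search/search.pl": "/"}


def redirect_target(request_path):
    """Return the new location for a retired path, or None if not retired.
    The query string (e.g. ?q=cars) is ignored when matching."""
    return MOVED_PERMANENTLY.get(urlsplit(request_path).path)


class RedirectHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        target = redirect_target(self.path)
        if target is None:
            # Anything not in the mapping is simply not found here.
            self.send_error(404)
            return
        # 301 tells the client the resource has moved for good.
        self.send_response(301)
        self.send_header("Location", target)
        self.send_header("Content-Length", "0")
        self.end_headers()


def serve(port=8000):
    """Run the sketch locally; stop it with Ctrl+C."""
    HTTPServer(("127.0.0.1", port), RedirectHandler).serve_forever()
```

Calling `serve()` and requesting `/search/search.pl?q=cars` on port 8000 should return a 301 with `Location: /`, while any other path returns a 404.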
__________________
The best way to learn anything, is to question everything. |
|
|||
|
This post may help clear this up a bit:
http://googlewebmastercentral.blogsp...about-googlebo...
Particularly this part:
"If my robots.txt file contains a directive for all bots as well as a specific directive for Googlebot, how does Googlebot interpret the line addressed to all bots? If your robots.txt file contains a generic or weak directive plus a directive specifically for Googlebot, Googlebot obeys the lines specifically directed at it."
If your file includes a user-agent: Googlebot line, Googlebot will obey that line and ignore the user-agent: * line. If your file does not include a user-agent: Googlebot line, then Googlebot obeys the user-agent: * line.
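The precedence rule described above can be tried out with Python's standard `urllib.robotparser` module, which applies a matching specific group before falling back to the `*` group. This is a sketch of the rule, not of Googlebot's own parser. The robots.txt content, the `/private/` path, and the `SomeOtherBot` name are invented for the example.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt with a generic group and a Googlebot-specific group.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/

User-agent: Googlebot
Disallow: /search/
"""


def build_parser(text):
    """Parse robots.txt content held in a string (no network access)."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# Googlebot follows only its own group, so /search/ is blocked for it...
print(parser.can_fetch("Googlebot", "/search/search.pl?q=cars"))
# ...and the generic /private/ rule does not apply to it.
print(parser.can_fetch("Googlebot", "/private/page.html"))
# A bot with no group of its own falls back to the * group.
print(parser.can_fetch("SomeOtherBot", "/private/page.html"))
```

The practical upshot is that a rule meant for every crawler has to be repeated inside the Googlebot group if the file has one.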
|
|||
|
If a request every 3 minutes ties up the machine, you may want to consider a bigger machine. That's not much, although I'd agree that fixing your robots.txt should make them quit hitting a 404 (which shouldn't take much, if any, resources from your machine.)
Brian.
__________________
ToolBarn.com, an Internet Retailer Top 500 and Inc. 500 Company | Tool Parts | Pet Supplies |
WebProNews
WebProWorld is an iEntry, Inc. ® site - © 2009 All Rights Reserved Privacy Policy and Legal iEntry, Inc. 2549 Richmond Rd. Lexington KY, 40509 |