Unknown robot (identified by 'robot')
chandrika
08-21-2007, 09:35 AM
Quite a few unknown robots spider my sites, which is usual. However, over the last few days one of them has spidered tens of thousands of pages, which is somewhat unusual: the only spiders I normally see crawling everything are the larger search engines, and those identify themselves.
Does anyone know whether it can be any kind of problem if you see an
Unknown robot (identified by 'robot')
in the stats that appears to be going through everything? Or is this normal, and just something I hadn't had before?
incrediblehelp
08-21-2007, 03:54 PM
Pretty normal. Lots of junk bots and scrapers visit my websites all the time. If you want, feel free to block them by IP.
You see that in your logs all the time. There is not much you can do about it unless you block them in .htaccess or figure out the IPs. It could be GigaBlast, a university crawler, or SnapBot.
Some of these guys need to be pressured by webmasters to resolve to a host name, or they should be banned.
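One way to check whether a visiting IP "resolves to a host name" is a forward-confirmed reverse DNS lookup: look up the host name for the IP, then look up that host name and confirm it points back to the same IP. The sketch below is only an illustration; the function name `verified_hostname` and its `reverse`/`forward` parameters are made up here, and it handles IPv4 only.

```python
import socket


def verified_hostname(ip, reverse=socket.gethostbyaddr,
                      forward=socket.gethostbyname_ex):
    """Return the host name for `ip` if reverse and forward DNS agree, else None.

    `reverse` and `forward` are hypothetical hooks so the lookups can be
    replaced in tests; by default the standard library resolvers are used.
    """
    try:
        host = reverse(ip)[0]           # (hostname, aliases, addresses)
        addresses = forward(host)[2]    # addresses the host name maps to
    except OSError:                     # covers socket.herror / socket.gaierror
        return None
    return host if ip in addresses else None
```

A bot whose IP returns None here has no confirmed host name, which is the kind of visitor the post suggests banning.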
khurramali
08-23-2007, 12:32 AM
If you are disturbed by them and want to keep the scraper bots away, I suggest that you block all bots except the ones you know, which identify themselves.
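As a rough sketch of the "block everything except the bots you know" idea, the check below sorts a User-Agent string into three groups. The `KNOWN_BOTS` list and the `classify` function are assumptions for illustration, not a vetted allowlist, and User-Agent strings can be faked, so treat the result as a hint only.

```python
import re

# Hypothetical allowlist: substrings of user agents for crawlers you trust.
KNOWN_BOTS = ("googlebot", "bingbot", "slurp")

# Generic words that suggest a crawler of some kind.
BOT_HINT = re.compile(r"bot|crawl|spider", re.IGNORECASE)


def classify(user_agent: str) -> str:
    """Return 'known-bot', 'unknown-bot', or 'other' for a User-Agent string."""
    ua = user_agent.lower()
    if any(name in ua for name in KNOWN_BOTS):
        return "known-bot"
    if BOT_HINT.search(ua):
        return "unknown-bot"
    return "other"
```

Only the 'unknown-bot' group would be candidates for blocking; anything in 'other' looks like an ordinary browser.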
dgswilson
03-11-2011, 05:32 PM
(if) - .htaccess
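RewriteEngine on
# Leave the error page itself alone so the rule does not rewrite it again.
RewriteCond %{REQUEST_URI} !^/somepage\.php$
# ^ anchors each pattern to the start of the User-Agent string.
RewriteCond %{HTTP_USER_AGENT} ^hunter [OR]
RewriteCond %{HTTP_USER_AGENT} ^checker [OR]
RewriteCond %{HTTP_USER_AGENT} ^discovery [OR]
RewriteCond %{HTTP_USER_AGENT} ^spider [OR]
RewriteCond %{HTTP_USER_AGENT} ^crawl [OR]
RewriteCond %{HTTP_USER_AGENT} ^robot [OR]
# The last condition must not carry [OR].
RewriteCond %{HTTP_USER_AGENT} ^bot
# /somepage.php stands for an error page or wherever you want to send them.
RewriteRule ^(.*)$ /somepage.php [L]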
Then check your error reports and see if you can identify them. It would be nice if you could track the page (I don't know a good way to do it). Also block "somepage.php" in robots.txt; this might narrow it down. I'm counting on the truly skilled to make or add corrections to this. I do know you can block these and not get a 503 error, since I just did it. We'll see what happens.
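On tracking the pages: one possible approach is to read the access log and tally which paths each bot-like visitor requested. The sketch below assumes the Apache "combined" log format, which may not match your server's setup; `LINE`, `BOT_WORDS`, and `pages_by_bot` are names invented for this example.

```python
import re
from collections import Counter, defaultdict

# Assumes Apache "combined" log format; adjust the pattern if yours differs.
LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]*\] "(?P<request>[^"]*)" '
    r'\d{3} \S+ "[^"]*" "(?P<ua>[^"]*)"'
)

# Words that suggest a crawler, matched anywhere in the User-Agent.
BOT_WORDS = re.compile(r"bot|crawl|spider", re.IGNORECASE)


def pages_by_bot(lines):
    """Map (ip, user agent) to a Counter of the paths that visitor requested."""
    result = defaultdict(Counter)
    for line in lines:
        m = LINE.match(line)
        if not m or not BOT_WORDS.search(m.group("ua")):
            continue  # unparseable line, or not bot-like
        parts = m.group("request").split()  # e.g. ["GET", "/page", "HTTP/1.1"]
        if len(parts) >= 2:
            result[(m.group("ip"), m.group("ua"))][parts[1]] += 1
    return result
```

Passing an open log file to `pages_by_bot` and sorting each Counter with `most_common()` would show which pages a given bot hits most.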
dgswilson
03-11-2011, 06:39 PM
Just got this (error log)
for "*bot" - 173.192.238.45 = http://spinn3r.com/robot -
(173.192.238.45) Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.0.19; aggregator:Spinn3r (Spinn3r 3.1); http://spinn3r.com/robot) Gecko...
for "robot" - 174-37-205-76.robot.spinn3r.com - Robots (Crawlers):
Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.2.1; aggregator:Spinn3r (Spinn3r 3.1); http://spinn3r.com/robot) Gecko/20021130
Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.0.19; aggregator:Spinn3r (Spinn3r 3.1); http://spinn3r.com/robot) Gecko/2010040121 Firefox/3.0.19
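Many crawlers put an info URL inside their User-Agent string, as the Spinn3r entries above do with http://spinn3r.com/robot. A small sketch for pulling those URLs out of a User-Agent so you can look up who the bot is; the function name `info_urls` is made up for this example.

```python
import re

# A URL runs until whitespace, a semicolon, or a closing parenthesis.
URL = re.compile(r"https?://[^\s;)]+")


def info_urls(user_agent: str) -> list:
    """Return every http(s) URL embedded in a User-Agent string."""
    return URL.findall(user_agent)
```

An empty list means the visitor gave no info URL, which fits the "unknown robot" case from the first post.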
dgswilson
03-11-2011, 09:15 PM
I "think" - the (spider) might be - (123.125.71.33) Baiduspider+(+http://www.baidu.com/search/spider.htm)
C.Rebecca
03-31-2011, 10:37 AM
You can integrate spider detection scripts into your website, or you can block them using .htaccess.
If you can track down their details, you can block such bots in .htaccess by IP address.
e.g.
<Limit GET HEAD POST>
order allow,deny
deny from aaa.aaa.aaa.aa
deny from bbb.bbb.bbb.
deny from somedomain.com
allow from all
</Limit>
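Before adding entries like these, it can help to check which visitors a deny list would actually catch. The sketch below mimics only the IP part of the matching: a full address matches itself, and a partial address with a trailing dot matches the whole range. It does not handle host names such as somedomain.com, and `to_network` and `is_denied` are names invented for this example.

```python
import ipaddress


def to_network(entry: str):
    """Turn '192.0.2.7' or a partial address like '198.51.100.' into a network."""
    octets = [o for o in entry.split(".") if o]
    if not 1 <= len(octets) <= 4:
        raise ValueError(f"not an IPv4 address or prefix: {entry!r}")
    padded = octets + ["0"] * (4 - len(octets))   # fill missing octets with 0
    prefix_len = 8 * len(octets)                  # 8 bits per given octet
    return ipaddress.ip_network(".".join(padded) + f"/{prefix_len}")


def is_denied(ip: str, deny_entries) -> bool:
    """True if `ip` falls inside any deny entry (IPv4 addresses/prefixes only)."""
    addr = ipaddress.ip_address(ip)
    return any(addr in to_network(entry) for entry in deny_entries)
```

For instance, with the placeholder entries "192.0.2.7" and "198.51.100." (documentation addresses, not real bots), any address starting with 198.51.100. counts as denied.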