High Bandwidth consumption by bogus rogue robots



NetProwler
09-28-2010, 01:23 AM
I manage a server which hosts a high-traffic content site, among others. The content site uses its own CMS for content delivery. With that preamble, I will narrate my problem here. Of late I have noticed deep crawling from many IP addresses, some even identifying themselves with the Googlebot user agent. A cursory check showed that they are not from Google.

The pattern itself shows that they are not from any mainstream robot. They fire off rapid requests for about 100 pages in one cycle. Apart from the 2 GB of bandwidth these rogue robots consume in 15 days, I am worried about them slowing down the server.

I can block them as soon as I notice significant traffic coming from one particular IP address. But this is not the ideal solution. It keeps me tied down to checking the raw server log files every day. What would my cat think of me?

Throttling is not an option, as it would impact other genuine robots. What are the options available for this problem?

ron angel
09-28-2010, 12:54 PM
I do not know the answer to the problem, but if it were me I would ignore it, as I do with my site. All publicity (crawling of the site) is good; the more people see your site the better. What your cat has got to do with it is something else I can't understand.

Bernd
09-28-2010, 05:37 PM
You could use a bot trap.
http://www.kloth.net/internet/bottrap.php
http://www.bot-trap.de/home/

newconceptdesign
09-28-2010, 07:07 PM
I don't know the technical details, but as far as I know you can use an .htaccess file to block robots. In theory you can use a robots.txt file, but most "bad bots" ignore this file anyway. Since asking them by name in robots.txt doesn't work, the solution is to block them in .htaccess by matching the beginning of their User-Agent string.

To find out more, search Google for "block robots with .htaccess".

I hope you'll find it helpful.

NetProwler
09-30-2010, 04:44 AM
Thanks Ron Angel, Bernd and newconceptdesign.

Since this bogus robot keeps eating up my bandwidth, I spend more time checking the raw server log files of every site on the box. This takes up considerable time, leaving me with little time for anything else. Who will feed my cat if I am practically living in the office?

I will check out the bot traps pointed out by Bernd.

Thanks again

weegillis
09-30-2010, 03:12 PM
If you are the sysadmin you won't need to use .htaccess, just configure your server. That way all the sites hosted on the server will share the same setup.

NetProwler
10-01-2010, 07:06 AM
Thanks weegillis. I will have to use the server configuration directly. The only thing is that I still need to figure out a way to identify, in real time, who the bad robot is. They masquerade as Googlebot or Yahoo Slurp or whatever suits their fancy.

wige
10-01-2010, 09:18 AM
You can try doing a reverse DNS lookup when you detect the bot. What I have done in the past is add some detection code to a site-wide script, such as an included header, that checks the user agent and, if the agent claims to be a known bot, verifies that claim: the reverse lookup should return a hostname in the search engine's own domain, and a forward lookup of that hostname should return the same IP. If the verification fails, I toss that bot's IP into a database so I can add it to my blacklist at a later date. This is useful if the attackers are spoofing known bot user agents.

It gets much harder if the attackers spoof user agents from common browsers. To combat that, I usually use some type of heuristic analysis. One of the basic ones is to gather all of the off-site HTTP referer strings and look for ones that come from search engines but carry queries unrelated to your site. This is a common trick that bad bots use, since most defenses have no way to catch it. You do have to hand-check these results, though, to be sure you aren't catching legitimate visitors who came through unexpected search results, but it can give you a good way to catch bots trying to impersonate human users.
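
For anyone who wants to try the reverse lookup check, here is a rough Python sketch of it (not the code I described above, just an illustration). The function name is_verified_bot and the suffix list are my own assumptions; check each search engine's documentation for the hostnames its crawlers really use. The resolver functions are passed in as parameters so the logic can be tried without touching DNS.

```python
import socket

# Assumed list of trusted hostname suffixes; adjust to the crawlers you care about.
TRUSTED_SUFFIXES = (".googlebot.com", ".google.com")


def is_verified_bot(ip, reverse=socket.gethostbyaddr,
                    forward=socket.gethostbyname_ex,
                    suffixes=TRUSTED_SUFFIXES):
    """Return True only if ip passes a forward-confirmed reverse DNS check."""
    try:
        hostname = reverse(ip)[0]            # PTR lookup: IP -> hostname
    except OSError:
        return False                         # no usable PTR record
    if not hostname.lower().rstrip(".").endswith(suffixes):
        return False                         # hostname is outside the trusted domains
    try:
        addresses = forward(hostname)[2]     # forward lookup: hostname -> IPv4 list
    except OSError:
        return False
    return ip in addresses                   # the two lookups must agree
```

The forward step matters because whoever controls an IP block can publish any PTR name they like; only the real domain owner can make that name resolve back to the same address. Note that gethostbyname_ex only returns IPv4 addresses, so an IPv6 setup would need a different forward lookup.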

For the config-based blacklist that was mentioned above, the easiest solution I have found is to use environment variables:

# To block a bad bot based on user agent
BrowserMatch "googlebot not-from-google" spambot
# To block a bad bot based on IP (dots escaped so they match literally)
SetEnvIf Remote_Addr ^1\.2\.3\.4$ spambot

Then in your site-specific configuration you can deny all access based on the flag "spambot" (Apache 2.2 syntax):
Order Allow,Deny
Allow from all
Deny from env=spambot

This returns a 403 Forbidden error to the client.
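
If the blacklist grows, the SetEnvIf lines can be generated rather than typed by hand. A minimal Python sketch, assuming the IPs are available as a list of strings; the function name setenvif_lines is made up for the example. It escapes each address because SetEnvIf treats the pattern as a regular expression, where an unescaped dot matches any character.

```python
import ipaddress
import re


def setenvif_lines(ips, flag="spambot"):
    """Build one SetEnvIf directive per distinct IP address."""
    lines = []
    for ip in sorted(set(ips)):
        addr = ipaddress.ip_address(ip)      # raises ValueError on a malformed address
        pattern = re.escape(str(addr))       # "1.2.3.4" becomes 1\.2\.3\.4
        lines.append(f"SetEnvIf Remote_Addr ^{pattern}$ {flag}")
    return lines


if __name__ == "__main__":
    # Example addresses only; paste the output into the server config.
    print("\n".join(setenvif_lines(["1.2.3.4", "5.6.7.8"])))
```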

NetProwler
10-02-2010, 03:51 AM
Wonderful idea, wige. Thank you so much. I have come around to using your idea. Right now I am testing the extra latency such a routine would add. The problem is that most webmasters do not understand the extra trouble we sysadmins take to keep their sites well oiled. At the first sign of trouble, they are likely to blame us.

wige
10-04-2010, 09:30 AM
From what I have seen, most of the latency comes from the reverse lookups of the visitors' IP addresses. This could be mitigated by doing the lookups later: write a script that goes through the log of suspected bots, finds IPs that don't verify, and blocks them, so that the delay does not happen at request time. It won't save your bandwidth right away, but it gives you a corrective action to knock the bots out after the fact.
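
A rough idea of what such an after-the-fact script could look like in Python. It assumes the access log is in Apache's combined format and that a verify(ip) function (for example a reverse DNS check) is supplied by the caller; the names find_impostors and CLAIMED_BOTS, and the substrings in that tuple, are assumptions for the example.

```python
import re

# Combined Log Format:
# ip ident user [time] "request" status bytes "referer" "user-agent"
QUOTED = r'"((?:[^"\\]|\\.)*)"'
LOG_LINE = re.compile(
    r'^(\S+) \S+ \S+ \[[^\]]*\] ' + QUOTED + r' \d{3} \S+ ' + QUOTED + ' ' + QUOTED
)

# Assumed user-agent substrings that mark a request as claiming to be a known bot.
CLAIMED_BOTS = ("googlebot", "slurp")


def find_impostors(log_lines, verify):
    """Return the sorted IPs that claim a bot user agent but fail verify(ip)."""
    suspects = set()
    for line in log_lines:
        match = LOG_LINE.match(line)
        if not match:
            continue                          # skip lines in another format
        ip, _request, _referer, agent = match.groups()
        if any(name in agent.lower() for name in CLAIMED_BOTS):
            suspects.add(ip)
    # Each distinct IP is verified once, however many requests it made.
    return sorted(ip for ip in suspects if not verify(ip))
```

Run from cron against the previous day's log, this keeps all the DNS lookups out of the request path.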

As far as the Apache directives go, if you make the changes directly in the config files rather than in .htaccess, you should not notice any difference at all. The stock Apache configuration already uses BrowserMatch rules to work around quirks in particular browsers, and these rules simply add an additional step to that existing process.

NetProwler
10-06-2010, 12:59 AM
Thanks, wige. You are spot on as always. Yes, the latency is largely due to the reverse lookup of the IP addresses. Right now my script sets up a table of rogue bot IP addresses, which another script then uses to do the lookups. This technique doesn't affect latency in any noticeable way.
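
For anyone following along, the two-script arrangement could be sketched roughly like this in Python with SQLite. This is not my actual code; the table layout, the function names and the verify(ip) callback are all assumptions made for the illustration.

```python
import sqlite3


def open_store(path=":memory:"):
    """Open (or create) the table of suspected bot addresses."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS suspects ("
        "ip TEXT PRIMARY KEY, agent TEXT, verified INTEGER)"
    )
    return db


def record_suspect(db, ip, agent):
    """Request-time step: a cheap insert with no DNS lookup."""
    db.execute(
        "INSERT OR IGNORE INTO suspects (ip, agent, verified) VALUES (?, ?, NULL)",
        (ip, agent),
    )
    db.commit()


def verify_pending(db, verify):
    """Batch step: check every unverified IP and return all that failed."""
    pending = [row[0] for row in db.execute(
        "SELECT ip FROM suspects WHERE verified IS NULL ORDER BY ip")]
    for ip in pending:
        db.execute("UPDATE suspects SET verified = ? WHERE ip = ?",
                   (1 if verify(ip) else 0, ip))
    db.commit()
    return [row[0] for row in db.execute(
        "SELECT ip FROM suspects WHERE verified = 0 ORDER BY ip")]
```

The first script only calls record_suspect, so a page request pays for one small insert; the second script calls verify_pending on a schedule and feeds the returned addresses to the blacklist.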