Internet Security Discussion Forum
This forum is for the discussion of security-related issues. If you find a new phishing scheme, spyware, virus or malicious site, let us know about it. If any of the above found you... here's where you ask for help.
Hello everybody,
I submitted my site to a search engine a few months ago, but lately unknown robots have been eating up my 2.5 GB of bandwidth in just 10-20 days. I only have about 2 short videos and about 2-30 images on my site, so there's no reason it should use that much bandwidth. I checked my access log and found this:
38.99.13.123 - - [23/Jun/2007:20:44:45 +0900] "GET /t/imagery/ HTTP/1.0" 404 17507 "-" "Mozilla/5.0 (Twiceler-0.9 http://www.cuill.com/twiceler/robot.html)"
61.135.162.52 - - [23/Jun/2007:20:44:53 +0900] "HEAD /folder/folder/folder/content.html HTTP/1.1" 200 0 "-" "Baiduspider+(+http://www.baidu.com/search/spider.htm)"
They were crawling my site every minute! Does anyone know how to block these robots using .htaccess? Please let me know if this is not the right forum for this issue. Thank you.
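Before blocking anything, it can help to see which user agents actually consume the bytes. The following is a minimal sketch, assuming the server writes the Apache combined log format shown in the post; the function name `bytes_by_agent` and the `SAMPLE` list are illustrative, not part of any tool mentioned in this thread.

```python
import re
from collections import Counter

# Combined log format: ip ident user [time] "request" status bytes "referer" "agent"
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<bytes>\d+|-) "(?P<referer>[^"]*)" "(?P<agent>[^"]*)"$'
)

def bytes_by_agent(lines):
    """Sum response bytes per user-agent string; lines that do not parse are skipped."""
    totals = Counter()
    for line in lines:
        m = LOG_LINE.match(line.strip())
        if m is None:
            continue
        size = 0 if m.group("bytes") == "-" else int(m.group("bytes"))
        totals[m.group("agent")] += size
    return totals

# The two log lines quoted in the post above.
SAMPLE = [
    '38.99.13.123 - - [23/Jun/2007:20:44:45 +0900] "GET /t/imagery/ HTTP/1.0" 404 17507 '
    '"-" "Mozilla/5.0 (Twiceler-0.9 http://www.cuill.com/twiceler/robot.html)"',
    '61.135.162.52 - - [23/Jun/2007:20:44:53 +0900] "HEAD /folder/folder/folder/content.html HTTP/1.1" 200 0 '
    '"-" "Baiduspider+(+http://www.baidu.com/search/spider.htm)"',
]

if __name__ == "__main__":
    for agent, total in bytes_by_agent(SAMPLE).most_common():
        print(f"{total:>8} bytes  {agent}")
```

On a real log you would pass the open file object instead of `SAMPLE`; the heaviest agents come out first.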
|
||||
|
Add the following lines to your .htaccess file:
order allow,deny
deny from 38.99.13.123
deny from 61.135.162.52
allow from all
__________________
SEO Workers - Search Engine Optimization Consulting Company | SEO Analysis Tool | Webnauts Net SEO |
|
||||
|
You can also use this in robots.txt, which should help:
User-agent: * #put spider name in here, or leave it as wildcard
Crawl-delay: 10
And some spiders will recognize this, which is part of the proposed new robots exclusion spec:
User-agent: * #put spider name in here, or leave it as wildcard
Request-rate: 1/5 # maximum rate is one page every 5 seconds
Visit-time: 0600-0845 # only visit between 6:00 AM and 8:45 AM UT (GMT)
__________________
Custom WordPress Themes, CubeCart templates, ModX templates, Movable Type templates. ~ B1tchslappin Political Blog ~ GreenSpeak Community Action Last edited by bj; 06-26-2007 at 02:31 PM. Reason: clarification |
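To check how a well-behaved client would read directives like the ones above, Python's standard `urllib.robotparser` can parse them. This is only a sketch: the bot name `ExampleBot` is hypothetical, and the standard-library parser understands `Crawl-delay` and `Request-rate` but has no support for `Visit-time`, so that line is left out here.

```python
from urllib.robotparser import RobotFileParser

# Directives taken from the post above; comments omitted for brevity.
ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 10
Request-rate: 1/5
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# "ExampleBot" is a made-up crawler name; it falls under the wildcard entry.
delay = parser.crawl_delay("ExampleBot")
rate = parser.request_rate("ExampleBot")

print("crawl delay (seconds):", delay)
print("request rate:", rate.requests, "request(s) per", rate.seconds, "seconds")
```

Whether a given crawler honors these values is up to the crawler; the file only states a request.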
|
||||
|
Oops, forgot this-- it's a robots.txt validator
|
|
||||
|
All aspects of the internet are evolving toward a rich-media, high-bandwidth environment.
You can either drive yourself crazy nickel-and-diming bandwidth, or you can move with the times and get a host with more bandwidth. 2.5 gigs is nothing today.
|
|||
|
If you adapted your robots.txt page to disallow the video, wouldn't this keep the bandwidth use down?
__________________
There is a time for every purpose under heaven. http://www.expresspools.com http://www.sjvwd.com
|
||||
|
I had an issue with this a couple of years ago... The Altavista and Lycos spiders ate up our bandwidth with a vengeance.
We blocked the spiders' IPs (about 50 altogether), and 2 weeks later we were dropped from Altavista and Lycos. Mind you, no one would care today, but 6-7 years ago Altavista and Lycos were HUGE. I know bandwidth is expensive when you don't host your own sites from a datacenter or the like, but if spiders are crawling the crap out of your site, it is for a reason. I let them go. Last edited by timmathews.com; 06-26-2007 at 05:54 PM. Reason: more explanation
|
||||
|
I recently had to pay for extra bandwidth as well. I don't think it was because of bots (and I see them regularly, especially Yahoo's) but because of the amount of spam. I was sick and tired of cleaning up thousands(!) of spam e-mails, and not only e-mails: the bulletin board was suffering from spam messages posted by bots written specifically for phpBB.
I decided to gather statistics on who the biggest spammers were. For one month I checked every spam e-mail and wrote down the IPs from the headers. Then I blocked whole ranges of IPs, even down to class A networks. Believe it or not, I cut spam by ~80%!
I was so angry that I overestimated my results and mistakenly blocked the 65.x IP range where Google's bot lives. You can imagine what happened to a web site with a PR of 6... The 6 turned to 0! Our sales stopped completely for 2 months. It took me a while (many hours of hard work and web site optimization) to get back on track.
The positive thing is that I am now working in the right direction by allowing IP addresses to hit my site only from the areas I want. It's like the top-down approach to security. If you want, get my results file here (some people asked me to post it): http://www.800-security.com/tech/SPAMaddresses.txt Please be careful, and verify your restrictions.
The biggest spammers are in Poland, Russia, and the Asian region. There are some in America as well. Use the following site to check WHOIS and similar services: Information Security Resources and Links. Security Certifications, Firewalls, IDS, Microsoft Security, CISSP, Security+
The answer to your problem is to gather statistics. Bots usually use the same IPs (no more than several addresses). Restrict the bandwidth eaters, but again: be careful. Use the Control Panel to restrict the addresses.
__________________
The Cyber Teacher http://www.rtek2000.com http://www.800-webdesign.com/web-master-links.html -Free Web Master's Resources _________________ |
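The "verify your restrictions" advice above can be partly automated: before deploying a block list, check that addresses you need to stay reachable are not inside any blocked range. A minimal sketch with the standard `ipaddress` module follows; the ranges and the "must stay reachable" address are hypothetical placeholders (`65.0.0.0/8` merely stands in for the "65.x" range mentioned in the post and `65.0.0.10` is not a real crawler address).

```python
import ipaddress

def blocked_by(address, blocked_ranges):
    """Return the blocked networks (CIDR strings) that contain the given address."""
    ip = ipaddress.ip_address(address)
    return [cidr for cidr in blocked_ranges if ip in ipaddress.ip_network(cidr)]

# Hypothetical block list for illustration only.
BLOCKED = ["38.99.13.0/24", "61.135.162.0/24", "65.0.0.0/8"]

# Hypothetical address of a crawler you want to keep.
MUST_STAY_REACHABLE = ["65.0.0.10"]

for address in MUST_STAY_REACHABLE:
    hits = blocked_by(address, BLOCKED)
    if hits:
        print(f"WARNING: {address} is covered by {hits}")
```

The wider the range, the more likely such a check will flag something you did not intend to block.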
|
|||
|
Quote:
I second the motion. I believe spiders are visiting our sites for a number of reasons and one of them is to list us in the directories etc. I usually try to find out where they came from and then see what they're about...most of the time they've got me on their site. : )
__________________
Post as-it-happens crime stories of criminal behaviour at crimedigg.com |
|
|||
|
Quote:
my terrible but maybe somehow they can find the dude who's spamming me. Anyway, in the end I found out that the more IP addresses you block, the greater the chance that you'll block legitimate visitors. I think I have an idea on how to easily block the spammers... I'll post it if it works.
Last edited by jtracking; 06-26-2007 at 08:42 PM.
|
||||
|
Quote:
DO NOT BLOCK THEM BECAUSE IT IS COSTING YOU MONEY, ALLOW THEM AND MONITOR THEM BECAUSE THEY WILL ESSENTIALLY MAKE YOU MONEY! Think about it: your car requires FUEL to move forward. Are you going to stop putting gas in it because it costs you money? I know that statement is broad, but that is the broadest analogy I could think of to get people to understand. Hit me.
|
||||
|
I think joining this free project, you can get more information about who is good for you and who is not: Distributed Spam Harvester Tracking Network | Project Honey Pot
I am using that myself and it is just incredible.
|
|||
|
I have a successful forum dedicated to the film industry of one of the states of India. I know a lot of people are visiting the forum, and so are a lot of spam bots. I want to know how many of the hits are actually due to humans and how many are due to spam bots. Is there any way of finding that out?
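One rough way to approach this question is to classify each hit by its user-agent string. The sketch below is a heuristic under stated assumptions: the hint substrings `spider`, `bot/`, `bot-`, `robot` and `crawl` mirror the labels in the statistics report quoted elsewhere in this thread, `slurp` is an added guess, and the browser string is only an illustrative example. Spam bots that fake a browser user agent will be counted as "other", so treat the bot count as a lower bound.

```python
from collections import Counter

# Substrings that suggest a crawler (see the assumptions in the text above).
BOT_HINTS = ("spider", "bot/", "bot-", "robot", "crawl", "slurp")

def looks_like_bot(user_agent):
    """True if the user-agent contains any crawler hint (case-insensitive)."""
    ua = user_agent.lower()
    return any(hint in ua for hint in BOT_HINTS)

def count_hits(user_agents):
    """Count hits per class: 'bot' for crawler-like agents, 'other' for the rest."""
    counts = Counter()
    for ua in user_agents:
        counts["bot" if looks_like_bot(ua) else "other"] += 1
    return counts

AGENTS = [
    "Mozilla/5.0 (Twiceler-0.9 http://www.cuill.com/twiceler/robot.html)",
    "Baiduspider+(+http://www.baidu.com/search/spider.htm)",
    # Illustrative browser user agent, not taken from the thread.
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.4) Gecko/20070515 Firefox/2.0.0.4",
]

print(dict(count_hits(AGENTS)))
```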
|
|
|||
|
Blocking robots using robots.txt or by IP address are both bad ideas.
Bad robots generally do not pay attention to robots.txt. Blocking IP addresses, as some have suggested, has all kinds of repercussions. Bots normally do not change their name very often, so use the following in the .htaccess file in your root directory, and deny all from inner directories except for your local IPs.
Using mod_rewrite (Apache): if the string or regular expression matches the User-Agent HTTP header, the request is answered with a forbidden page. Note that the Twiceler pattern is not anchored with ^, because that bot's user-agent string begins with Mozilla/5.0.
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} Twiceler [OR]
RewriteCond %{HTTP_USER_AGENT} ^Xaldon\ WebSpider [OR]
RewriteCond %{HTTP_USER_AGENT} ^Baiduspider
RewriteRule ^.* - [F,L]
You can change the RewriteRule and send them somewhere else, like a non-linked page that records hits and user agents, letting you know how many bad bots are taking the bait! You will have to use PHP and MySQL if you do not want to save it in a file.
If you do not have mod_rewrite, the following should help:
SetEnvIfNoCase user-agent "Twiceler" bad_bot=1
SetEnvIfNoCase user-agent "^Xaldon\ WebSpider" bad_bot=1
SetEnvIfNoCase user-agent "^Baiduspider" bad_bot=1
<FilesMatch "(.*)">
Order Allow,Deny
Allow from all
Deny from env=bad_bot
</FilesMatch>
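One thing worth checking with user-agent patterns like these is anchoring: a pattern that starts with `^` only matches at the very beginning of the header. The sketch below uses Python's `re` module as a stand-in for Apache's regex matching (an approximation that should hold for patterns this simple) and tests anchored and unanchored forms against the two user-agent strings from the access log in the first post.

```python
import re

# User-agent strings copied from the access log in the original post.
LOGGED_AGENTS = {
    "Twiceler": "Mozilla/5.0 (Twiceler-0.9 http://www.cuill.com/twiceler/robot.html)",
    "Baiduspider": "Baiduspider+(+http://www.baidu.com/search/spider.htm)",
}

def matches(pattern, user_agent):
    """Mimic an unflagged RewriteCond test: case-sensitive regex search."""
    return re.search(pattern, user_agent) is not None

for name, agent in LOGGED_AGENTS.items():
    print(name,
          "| anchored:", matches("^" + name, agent),
          "| unanchored:", matches(name, agent))
```

The logged Twiceler string begins with `Mozilla/5.0`, so only the unanchored form catches it, while Baiduspider is caught either way.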
|
|||
|
Quote:
|
|
|||
|
IMO the issue is much simpler than many of the replies I have read so far in this thread. I gave a rundown on my blog here:
Robots.txt >> Search Engine Robots generating too much traffic on your site? In a nutshell, you can reduce robot bandwidth consumption to an absolute minimum by simply putting an empty robots.txt file on your web site or blog.
|
||||
|
>>38.99.13.123 - - [23/Jun/2007:20:44:45 +0900] "GET /t/imagery/ HTTP/1.0" 404 17507 "-" "Mozilla/5.0 (Twiceler-0.9 http://www.cuill.com/twiceler/robot.html)"
That was a 404 for which the server returned 17507 bytes. It might be a custom Page Not Found error page. On a large site where bandwidth is a worry, avoid using a custom error page of more than a couple of KB if possible. They all add up eventually.
If it is the images that are being crawled in excess (the exact figure will vary depending on individual circumstances), you can do something to avoid that. It might be too cumbersome to reproduce here, but you can see how it is done here: Targetwoman Blog » Saving Bandwidth in servers
Incidentally, I did not write that blog post, nor do I have anything to do with it. I just happened to see it.
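To put "they all add up" into numbers, here is a back-of-the-envelope sketch. It assumes one request per minute (the rate the original poster reported) and that every such request receives the 17507-byte error page; the 2 KB alternative is a hypothetical size for a slimmed-down page. Real traffic will differ, so read the result as an order of magnitude only.

```python
ERROR_PAGE_BYTES = 17507            # size of the 404 response in the quoted log line
REQUESTS_PER_MINUTE = 1             # "crawling every minute", per the original post
MINUTES_PER_30_DAYS = 60 * 24 * 30  # 43,200 minutes

def monthly_bytes(response_bytes, per_minute=REQUESTS_PER_MINUTE):
    """Bytes sent over 30 days if every request gets a response of this size."""
    return response_bytes * per_minute * MINUTES_PER_30_DAYS

full = monthly_bytes(ERROR_PAGE_BYTES)
slim = monthly_bytes(2 * 1024)      # hypothetical 2 KB error page

print(f"{full / 1e6:.0f} MB vs {slim / 1e6:.0f} MB per 30 days")
```

Under these assumptions the large error page alone costs about 756 MB per 30 days, against roughly 88 MB for a 2 KB page, which is a sizeable share of a 2.5 GB quota.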
|
|||
|
Thank you all for the suggestions and advice!
@Sheriff: When I add the line
RewriteRule ^.* - [F,L]
it gives me an internal server error, so I removed the F. I hope that doesn't make the code useless?
@seo4china: Yes, my content is in both Chinese and English. Baidu crawls my site heavily. I also need to block Yahoo! Slurp China, but when I add
RewriteCond %{HTTP_USER_AGENT} ^Yahoo! slurp China [OR]
it gives me a 500 internal server error.
I submitted my site to a B2B search engine (Jayde.com). It did help increase my page ranking, but so much bandwidth is lost. Now I have these in my .htaccess:
########## Block unwanted robots ##########
RewriteCond %{HTTP_USER_AGENT} ^Twiceler-0.9 [OR]
RewriteCond %{HTTP_USER_AGENT} ^Baiduspider+ [OR]
RewriteCond %{HTTP_USER_AGENT} ^Baiduspider [OR]
RewriteCond %{HTTP_USER_AGENT} ^Xaldon\ WebSpider [OR]
RewriteCond %{HTTP_USER_AGENT} ^YodaoBot/1.0 [OR]
RewriteCond %{HTTP_USER_AGENT} ^Mozilla/4.0 [OR]
RewriteCond %{HTTP_USER_AGENT} ^Mozilla/5.0 [OR]
RewriteCond %{HTTP_REFERER} ^baidu\.com
RewriteRule ^.* - [L]
RewriteRule .*\.(jpg|jpeg|gif|png|bmp)$ http://www.mysite.com/image.jpg [R,NC]
Could somebody please check whether anything is wrong with this code? I just hope they won't eat up my remaining 500 MB of bandwidth, otherwise I will need to get paid hosting.
My Robots/Spiders Visitors:
| Robot | Hits | Bandwidth | Last visit |
| Unknown robot (identified by 'spider') | 181016+69 | 2.73 GB | 27 Jun 2007 - 03:34 |
| Yahoo! Slurp China | 24489+45 | 707.65 MB | 27 Jun 2007 - 03:16 |
| Googlebot | 12582+21 | 308.10 MB | 27 Jun 2007 - 03:33 |
| Unknown robot (identified by 'bot/' or 'bot-') | 4529+98 | 119.71 MB | 26 Jun 2007 - 15:41 |
| Unknown robot (identified by 'robot') | 4219+5 | 93.20 MB | 26 Jun 2007 - 12:18 |
| Yahoo Slurp | 1591+516 | 39.52 MB | 27 Jun 2007 - 03:32 |
| Ask | 1005+20 | 26.21 MB | 21 Jun 2007 - 16:58 |
| MSNBot | 574+313 | 14.50 MB | 26 Jun 2007 - 23:03 |
| Unknown robot (identified by 'crawl') | 438+8 | 10.84 MB | 27 Jun 2007 - 02:18 |
| MSNBot-media | 92+4 | 2.29 MB | 26 Jun 2007 - 22:52 |
| Unknown robot (identified by hit on 'robots.txt') | 0+10 | 2.77 KB | 24 Jun 2007 - 09:38 |
| Alexa (IA Archiver) | 1+2 | 29.33 KB | 16 Jun 2007 - 06:26 |
Last edited by josephx; 06-27-2007 at 11:25 AM.
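A few observations on the rule set posted above, offered as a sketch and not a tested configuration. A `RewriteRule` with `-` as the substitution and only the `[L]` flag leaves the request unchanged, so without `F` nothing is actually refused. Apache splits directive arguments on whitespace, so the unescaped spaces in `^Yahoo! slurp China` are a likely cause of the 500 error; the earlier Xaldon line escapes its space as `\ `. Finally, `^Mozilla/4.0` and `^Mozilla/5.0` match far more than bots. The Python check below approximates the regex matching with the `re` module; the two browser strings and the Googlebot string are illustrative examples not taken from this thread.

```python
import re

# User-agent patterns copied from the post above (case-sensitive, as in RewriteCond).
POSTED_PATTERNS = [
    r"^Twiceler-0.9", r"^Baiduspider+", r"^Baiduspider", r"^Xaldon\ WebSpider",
    r"^YodaoBot/1.0", r"^Mozilla/4.0", r"^Mozilla/5.0",
]

def matching_patterns(user_agent):
    """Return every posted pattern that would match this user-agent string."""
    return [p for p in POSTED_PATTERNS if re.search(p, user_agent)]

# Illustrative user agents (assumptions for the example).
FIREFOX = "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.4) Gecko/20070515 Firefox/2.0.0.4"
MSIE = "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"
GOOGLEBOT = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

for label, agent in [("Firefox", FIREFOX), ("MSIE", MSIE), ("Googlebot", GOOGLEBOT)]:
    print(label, "->", matching_patterns(agent))
```

If the `F` flag were restored with these conditions in place, ordinary visitors and Googlebot would be refused along with the unwanted robots, so the two `Mozilla` conditions are probably better dropped in favor of patterns that name the specific bots.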
|
||||
|
Blocking all access to the video(s) using the robots.txt file might be a decent idea (this would cut down on bandwidth), but it will end the listings of those videos in the search engines.
Another option: there are a TON of free video hosting solutions out there on the net (YouTube etc.). You could host the video(s) there and then embed them in your site. That way the actual video will get more play and not cost you anything in bandwidth or $. It will also probably play and stream much better than if you host it with the rest of your site, as most general web hosting is not properly optimized for true video streaming. Just my 2 cents.
__________________
Ron Boyd search engine optimization (seo), internet consulting, web design :: Follow Me: orionsweb |
|
||||
|
Quote:
RewriteRule .* - [E=HTTP_IF_MODIFIED_SINCE:%{HTTP:If-Modified-Since}]
RewriteRule .* - [E=HTTP_IF_NONE_MATCH:%{HTTP:If-None-Match}]
That will save you and the search engines a lot of bandwidth.
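The two rules above appear to copy the conditional request headers into environment variables so that a server-side script can read them. The saving itself comes from the script answering 304 Not Modified, with no body, when the client's copy is still current. The following is a minimal sketch of that decision only, not the author's implementation; the function name `conditional_status` is hypothetical and ETag handling (`If-None-Match`) is left out.

```python
from datetime import datetime, timezone
from email.utils import format_datetime, parsedate_to_datetime

def conditional_status(last_modified, if_modified_since=None):
    """Return 304 if the client's cached copy is still current, else 200."""
    if if_modified_since is None:
        return 200
    try:
        client_time = parsedate_to_datetime(if_modified_since)
    except (TypeError, ValueError):
        return 200  # unparseable header: fall back to a full response
    if client_time.tzinfo is None:
        client_time = client_time.replace(tzinfo=timezone.utc)
    # HTTP dates have one-second resolution, so drop microseconds before comparing.
    return 304 if last_modified.replace(microsecond=0) <= client_time else 200

# Example: the page last changed at this (illustrative) time.
page_time = datetime(2007, 6, 23, 11, 44, 45, tzinfo=timezone.utc)
header = format_datetime(page_time, usegmt=True)  # what a crawler would send back

print(conditional_status(page_time, header))  # crawler's copy is current
print(conditional_status(datetime(2007, 6, 24, tzinfo=timezone.utc), header))  # page changed since
```

A 304 response carries headers only, so repeat visits by a crawler that sends these headers cost very little bandwidth.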
|
||||
|
And when MSNBot is crawling your site, it generally does not try to access your site more frequently than one time every few seconds. If MSNBot determines that your site has a slow connection, it automatically adjusts the frequency. To specify a minimum frequency (in seconds), use the Crawl-delay parameter in the robots.txt file:
User-agent: msnbot
|
||||
|
Quote:
Hey, great thread. Just for my information, how is it incredible? It sounds interesting to me.
__________________
Invent the possibilities, not the obstacles. Tombstone Arizona - Tombstone Arizona History - Tombstone Arizona Souvenirs |
|
|||
|
After reading this thread as well as various other forums on bandwidth-sucking bots, I'm almost convinced that the best approach is to add bandwidth and let them run.
Six months ago I had to add bandwidth because my site got shut down for going over my quota. The problem arose again in December, when bandwidth consumption doubled compared to previous months. I didn't upgrade to more bandwidth because I noticed that it was the robots who were using it, although I did also have an increase in visitors. Anyway, I got several warnings from my host, but in the end they didn't shut me down even though I exceeded 120% of quota.
I have a dynamic shopping cart that, besides the PHP pages, generates an equivalent HTML catalog of over 2000 pages. I have on average just 240 visitors per day. To date I have used 17,000 of my 20,000 MB bandwidth quota for the month.
Okay, now in January I'm seeing the phenomenon continue. Here's what the biggest crawlers sucked today:
Googlebot 2.98 GB
Unknown robot 2.38 GB
Inktomi Slurp 238.53 MB
In addition, the following IPs are big consumers; I'm not sure what they do.
85.17.216.133 - 3.47 GB 14 Jan 2008
85.17.187.8 - 2.50 GB 15 Jan 2008
85.17.211.73 - 2.36 GB 08 Jan 2008
85.17.211.77 - 723.22 MB 11 Jan 2008
Are these average numbers for bandwidth consumption by robots? Thanks for your time.
__________________
<a href="www.bordadosdistintivos.com">Bordados</a><br> <a href="www.alicante-escapade.com">Alicante Escapade</a> |
|
||||
|
Quote:
Hope that helps! |
|
||||
|
Those 85.17.x.x IPs don't seem to resolve to any known search engines that I can find. The IP address block is managed by RIPE, and I can't see a reason for that much bot activity from it. Most search engines set their bots to resolve back to their own domain name. Additionally, searching Google for a few of those IP addresses turns up several web log files showing them at the top of some sites' traffic reports.
You might be able to save some bandwidth by blocking these bots, although you might want to capture the associated user-agent string and create a block based on that as well. As you catch these bad bots, you can then add them to your firewall so that the requests are denied and use virtually no bandwidth, or serve a very light (small file size) 403 Forbidden error message.
__________________
The best way to learn anything, is to question everything. Interestingly Average Security Blog |
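The observation that search engine bots resolve back to their operator's domain can be turned into a check: look up the hostname for the IP, confirm it ends with an expected domain suffix, then resolve that hostname forward and confirm it maps back to the same IP. The sketch below makes the resolvers injectable so the logic can be tried without network access; the function name, the `.search.example` suffix and the `203.0.113.x` addresses are all hypothetical documentation values.

```python
import socket

def verify_crawler(ip, allowed_suffixes, reverse=None, forward=None):
    """Forward-confirmed reverse DNS: IP -> hostname -> addresses must include the IP."""
    reverse = reverse or (lambda addr: socket.gethostbyaddr(addr)[0])
    forward = forward or (lambda host: socket.gethostbyname_ex(host)[2])
    try:
        host = reverse(ip)
        if not host.lower().endswith(tuple(allowed_suffixes)):
            return False
        return ip in forward(host)
    except OSError:
        return False  # no PTR record or lookup failure: treat as unverified

# Offline demonstration with made-up DNS data (no real lookups are performed).
FAKE_PTR = {"203.0.113.5": "crawl-5.search.example", "203.0.113.9": "host9.other.example"}
FAKE_A = {"crawl-5.search.example": ["203.0.113.5"]}

for ip in FAKE_PTR:
    ok = verify_crawler(ip, [".search.example"], FAKE_PTR.__getitem__, FAKE_A.__getitem__)
    print(ip, "verified" if ok else "not verified")
```

Called without the last two arguments, the function uses real DNS lookups; the suffixes to trust would have to come from each search engine's own documentation.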
|
|||
|
Thanks for the help. I am going to try and block those IPs.
Also, I don't have a google sitemap set up on this site.
WebProWorld is an iEntry, Inc. ® site - © 2009 All Rights Reserved Privacy Policy and Legal iEntry, Inc. 2549 Richmond Rd. Lexington KY, 40509 |