 |

11-21-2006, 11:04 AM
|
 |
WebProWorld 1,000+ Club
|
|
Join Date: May 2005
Location: Norway
Posts: 4,565
|
|
An interesting test.
I got the following idea; if I had time, I would run this experiment:
Set up a site on a unique IP address, with good content and a lot of IBLs, and observe the log statistics with the following in place:
robots.txt
User-agent: *
Disallow: /
Has anybody run that experiment over a long time interval and recorded which bots do not respect robots.txt?
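As a baseline for that experiment, here is a minimal sketch of what a well-behaved crawler should conclude from this file, using Python's standard `urllib.robotparser`. The bot name and the example.com URLs are made-up placeholders. Any bot that later shows up in the log fetching these pages is ignoring the file.

```python
from urllib.robotparser import RobotFileParser

# The robots.txt from the experiment: every robot is excluded from everything.
ROBOTS_TXT = """\
User-agent: *
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# "ExampleBot" and example.com are hypothetical placeholders.
for path in ("/", "/index.html", "/cgi-bin/form.pl"):
    allowed = parser.can_fetch("ExampleBot", "http://example.com" + path)
    print(path, "allowed" if allowed else "disallowed")
```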
Resource links for those who want to analyze results:
WebMasterWorld subforums:
- robots.txt
- Apache Web Server
- Website Analytics - Tracking and Logging
Browser Capabilities Project
|

11-21-2006, 11:18 AM
|
 |
WebProWorld 1,000+ Club
|
|
Join Date: Aug 2003
Location: Worldwide
Posts: 6,867
|
|
Kjell, I have not followed the links yet, but are you saying that Google, MSN and Yahoo do not obey these directives?
User-agent: Googlebot
Disallow: /
User-agent: MSNBot
Disallow: /
User-agent: Slurp
Disallow: /
|

11-21-2006, 01:02 PM
|
 |
WebProWorld 1,000+ Club
|
|
Join Date: May 2005
Location: Norway
Posts: 4,565
|
|
No, not their ordinary search engine bots. I was speaking generally, but they may have other bots. I have not said that they do.
To repeat, I was thinking of a general project. There are thousands of bots, and they could be collected one by one. It should be possible to do this automatically, that is, to write a program that:
- Lists them.
- Blocks them.
- Makes them public.
- Lists the domains and IP addresses they come from.
This should be a never ending marathon race. Ideally more than one secret domain should be used in different geographical regions.
Advantages:
- Decrease the burden on the internet, since a lot of them would give up.
- Save individual webmasters and websites bandwidth, and thereby reduce the need to upgrade, which saves money.
- Reduce spam, since many of them are email-harvesting bots. Reduced spam is time saved, and time is money.
|

11-21-2006, 01:10 PM
|
 |
WebProWorld 1,000+ Club
|
|
Join Date: Aug 2003
Location: Worldwide
Posts: 6,867
|
|
So far we have these in our .htaccess file:
# Simple spam protection against some of the more evil user agents
RewriteCond %{HTTP_USER_AGENT} ^BlackWidow [OR]
RewriteCond %{HTTP_USER_AGENT} ^Bot\ mailto:craftbot@yahoo.com [OR]
RewriteCond %{HTTP_USER_AGENT} ^ChinaClaw [OR]
RewriteCond %{HTTP_USER_AGENT} ^Custo [OR]
RewriteCond %{HTTP_USER_AGENT} ^DISCo [OR]
RewriteCond %{HTTP_USER_AGENT} ^Download\ Demon [OR]
RewriteCond %{HTTP_USER_AGENT} ^eCatch [OR]
RewriteCond %{HTTP_USER_AGENT} ^EirGrabber [OR]
RewriteCond %{HTTP_USER_AGENT} ^EmailCollector [OR]
RewriteCond %{HTTP_USER_AGENT} ^EmailSiphon [OR]
RewriteCond %{HTTP_USER_AGENT} ^EmailWolf [OR]
RewriteCond %{HTTP_USER_AGENT} ^Express\ WebPictures [OR]
RewriteCond %{HTTP_USER_AGENT} ^ExtractorPro [OR]
RewriteCond %{HTTP_USER_AGENT} ^EyeNetIE [OR]
RewriteCond %{HTTP_USER_AGENT} ^.*FileHound.*$ [OR]
RewriteCond %{HTTP_USER_AGENT} ^FlashGet [OR]
RewriteCond %{HTTP_USER_AGENT} ^GetRight [OR]
RewriteCond %{HTTP_USER_AGENT} ^GetWeb! [OR]
RewriteCond %{HTTP_USER_AGENT} ^Go!Zilla [OR]
RewriteCond %{HTTP_USER_AGENT} ^Go-Ahead-Got-It [OR]
RewriteCond %{HTTP_USER_AGENT} ^GrabNet [OR]
RewriteCond %{HTTP_USER_AGENT} ^Grafula [OR]
RewriteCond %{HTTP_USER_AGENT} ^HMView [OR]
RewriteCond %{HTTP_USER_AGENT} ^HTTrack [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Image\ Stripper [OR]
RewriteCond %{HTTP_USER_AGENT} ^Image\ Sucker [OR]
RewriteCond %{HTTP_USER_AGENT} Indy\ Library [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^InterGET [OR]
RewriteCond %{HTTP_USER_AGENT} ^Internet\ Ninja [OR]
RewriteCond %{HTTP_USER_AGENT} ^JetCar [OR]
RewriteCond %{HTTP_USER_AGENT} ^.*JoBo.*$ [OR]
RewriteCond %{HTTP_USER_AGENT} ^JOC\ Web\ Spider [OR]
RewriteCond %{HTTP_USER_AGENT} ^larbin [OR]
RewriteCond %{HTTP_USER_AGENT} ^Mass\ Downloader [OR]
RewriteCond %{HTTP_USER_AGENT} ^MIDown\ tool [OR]
RewriteCond %{HTTP_USER_AGENT} ^Mister\ PiX [OR]
RewriteCond %{HTTP_USER_AGENT} ^Navroad [OR]
RewriteCond %{HTTP_USER_AGENT} ^NearSite [OR]
RewriteCond %{HTTP_USER_AGENT} ^NetAnts [OR]
RewriteCond %{HTTP_USER_AGENT} ^NetSpider [OR]
RewriteCond %{HTTP_USER_AGENT} ^Net\ Vampire [OR]
RewriteCond %{HTTP_USER_AGENT} ^NetZIP [OR]
RewriteCond %{HTTP_USER_AGENT} ^Octopus [OR]
RewriteCond %{HTTP_USER_AGENT} ^Offline\ Explorer [OR]
RewriteCond %{HTTP_USER_AGENT} ^Offline\ Navigator [OR]
RewriteCond %{HTTP_USER_AGENT} ^PageGrabber [OR]
RewriteCond %{HTTP_USER_AGENT} ^Papa\ Foto [OR]
RewriteCond %{HTTP_USER_AGENT} ^pavuk [OR]
RewriteCond %{HTTP_USER_AGENT} ^pcBrowser [OR]
RewriteCond %{HTTP_USER_AGENT} ^RealDownload [OR]
RewriteCond %{HTTP_USER_AGENT} ^ReGet [OR]
RewriteCond %{HTTP_USER_AGENT} ^SiteSnagger [OR]
RewriteCond %{HTTP_USER_AGENT} ^SmartDownload [OR]
RewriteCond %{HTTP_USER_AGENT} ^SuperBot [OR]
RewriteCond %{HTTP_USER_AGENT} ^SuperHTTP [OR]
RewriteCond %{HTTP_USER_AGENT} ^Surfbot [OR]
RewriteCond %{HTTP_USER_AGENT} ^tAkeOut [OR]
RewriteCond %{HTTP_USER_AGENT} ^Teleport\ Pro [OR]
RewriteCond %{HTTP_USER_AGENT} ^.*TurnitinBot.*$ [OR]
RewriteCond %{HTTP_USER_AGENT} ^VoidEYE [OR]
RewriteCond %{HTTP_USER_AGENT} ^Web\ Image\ Collector [OR]
RewriteCond %{HTTP_USER_AGENT} ^Web\ Sucker [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebAuto [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebCopier [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebFetch [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebGo\ IS [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebLeacher [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebReaper [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebSauger [OR]
RewriteCond %{HTTP_USER_AGENT} ^Website\ eXtractor [OR]
RewriteCond %{HTTP_USER_AGENT} ^Website\ Quester [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebStripper [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebWhacker [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebZIP [OR]
RewriteCond %{HTTP_USER_AGENT} ^Wget [OR]
RewriteCond %{HTTP_USER_AGENT} ^.*Whacker.*$ [OR]
RewriteCond %{HTTP_USER_AGENT} ^Widow [OR]
RewriteCond %{HTTP_USER_AGENT} ^WWWOFFLE [OR]
RewriteCond %{HTTP_USER_AGENT} ^Xaldon\ WebSpider [OR]
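RewriteCond %{HTTP_USER_AGENT} ^Zeus [OR]
RewriteCond %{HTTP_USER_AGENT} ^.*adressendeutschland.*$
# Assumed completion (not shown in the original post): refuse matching requests with 403.
# Requires "RewriteEngine On" earlier in the file.
RewriteRule .* - [F]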
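A list this long is easy to get wrong by hand, for instance by leaving `[OR]` off a line in the middle of the chain, which silently turns that condition and the next one into an AND. A hedged Python sketch of generating the block from a plain list of names follows. The helper name `build_rules` and the closing `RewriteRule .* - [F]` line are assumptions, since the post shows only the conditions.

```python
import re

def build_rules(prefixes):
    """Return .htaccess lines refusing user agents that start with the given names."""
    if not prefixes:
        # Without any condition the RewriteRule would refuse every request.
        raise ValueError("need at least one user-agent prefix")
    lines = ["RewriteEngine On"]
    last = len(prefixes) - 1
    for i, name in enumerate(prefixes):
        # Every condition except the last one carries [OR].
        flag = "" if i == last else " [OR]"
        # re.escape backslash-escapes spaces and regex metacharacters.
        lines.append("RewriteCond %{HTTP_USER_AGENT} ^" + re.escape(name) + flag)
    # Assumed action: answer matching requests with 403 Forbidden.
    lines.append("RewriteRule .* - [F]")
    return lines

rules = build_rules(["BlackWidow", "Download Demon", "Zeus"])
print("\n".join(rules))
```

Keeping the names in a plain list also makes it simpler to publish and share them, as proposed earlier in the thread.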
|

11-21-2006, 01:16 PM
|
 |
WebProWorld 1,000+ Club
|
|
Join Date: May 2005
Location: Norway
Posts: 4,565
|
|
My two:
http://multifinanceit.com/htaccess.txt
http://multifinanceit.com/htaccess1.txt
They are old and have not been updated. This should be done by a formal institution; I would support such an institution with at least USD 100.
"Browscap.ini is getting too big. The main reason for this is the huge number of crawlers and other bots I've been adding. I never intended for browscap to keep track of so many crawlers and other bots".
Hosters should give an offer to implement them automatically. That will give an indication of how serious / reliable the hoster is.
|

11-24-2006, 06:52 PM
|
 |
WebProWorld 1,000+ Club
|
|
Join Date: May 2005
Location: Norway
Posts: 4,565
|
|
Quote:
|
Originally Posted by Webnauts
Kjell, I have not followed the links yet, but are you saying that Google, MSN and Yahoo do not obey these directives?
User-agent: Googlebot
Disallow: /
User-agent: MSNBot
Disallow: /
User-agent: Slurp
Disallow: /
|
Look at the banned-ip.xml file at Gary Keith's Browser Capabilities Project.
Some other resources:
IP Addresses of Search Engine Spiders
CrawlWall, the firewall for webpages.
Now when I start a new website, I begin by making:
- A robots.txt file.
- A .htaccess file that blocks a lot of bad bots.
- Configuration settings for PHP in the .htaccess file, so that the site is portable, e.g.:
php_value include_path "Path string"
A useful little script:
<?php
echo ( '<pre>' );
echo 'DOCUMENT_ROOT = ' . $_SERVER['DOCUMENT_ROOT'] ;
echo ( '</pre>' );
echo ( '<pre>' );
echo 'Include_path = ' . ini_get('include_path') . "\n";
echo ( '</pre>' );
?>
So before I write any content or markup, I take control of my part of the server where my sites are hosted.
That saves me time, bandwidth and referrer spam and, last but not least, gives me cleaner and easier-to-read logs.
Recommendation: When you start a new website, start by building a firewall around it.
|

11-25-2006, 02:46 AM
|
 |
WebProWorld 1,000+ Club
|
|
Join Date: Aug 2003
Location: Worldwide
Posts: 6,867
|
|
Kgun, great post. Thanks for sharing this valuable info.
|

11-27-2006, 02:20 PM
|
 |
WebProWorld 1,000+ Club
|
|
Join Date: May 2005
Location: Norway
Posts: 4,565
|
|
I use the following technique to ban an IP address or an IP block:
- Look up the IP at DnsStuff. Note the IP (xy.zw.221.230) and the CIDR (xy.zw.0.0/16).
- Block the IP: deny from xy.zw.221.230
- Block the IP range: deny from xy.zw.0.0/16
Where x,y,z and w are numbers.
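The two `deny from` lines can also be derived with Python's standard `ipaddress` module. This is only a sketch: the address is from a documentation range, the helper name `deny_lines` is made up, and the /16 prefix is an assumption taken from the example above. The real block size should come from the lookup.

```python
import ipaddress

def deny_lines(address, prefix_len=16):
    """Return 'deny from' lines for one address and for its enclosing block."""
    ip = ipaddress.ip_address(address)
    # strict=False masks off the host bits, giving the network the address sits in.
    network = ipaddress.ip_network(f"{address}/{prefix_len}", strict=False)
    return [f"deny from {ip}", f"deny from {network}"]

# 203.0.113.230 is a placeholder address reserved for documentation.
print("\n".join(deny_lines("203.0.113.230")))
```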
|

11-28-2006, 11:20 AM
|
 |
WebProWorld 1,000+ Club
|
|
Join Date: May 2005
Location: Norway
Posts: 4,565
|
|
Here is an interesting thread at WMW:
I need to ban a country using htaccess
Note the following posts:
- Key Master: You've got a big job ahead of you :). I'll start you off with the first IP block (using the SetEnvIf method).
SetEnvIf Remote_Addr ^61\.[0-3]\. ban
<Files ~ "^.*$">
order allow,deny
allow from all
deny from env=ban
</Files>
- BjarneDM:
In my experience in observing these scans for formmail, they are not based on an analysis of your website - they just try different IPs until they get a positive response when looking for [fF]orm[mM]ail.[cgi|pl] in either cgibin or cgi-bin.
Thus, there are three very simple defenses against these scans:
- use the latest version of formmail
- rename the cgibin folder into something random like eftesfge
- rename formmail to something either random or descriptive like OrderMail
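The SetEnvIf pattern quoted above, `^61\.[0-3]\.`, matches on the text of the remote address. A quick way to see which addresses it covers is to try the same regular expression in Python. This is a sketch with made-up sample addresses; the pattern spans 61.0.x.x through 61.3.x.x, which corresponds to 61.0.0.0/14.

```python
import re

# Same expression as in the SetEnvIf line: "61.", one digit from 0 to 3, then a dot.
BAN = re.compile(r"^61\.[0-3]\.")

# Sample addresses are illustrative only.
for addr in ("61.0.12.7", "61.3.200.1", "61.4.0.1", "61.30.0.1", "161.2.0.1"):
    print(addr, "ban" if BAN.search(addr) else "allow")
```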
DownLoad FormMail.
Related link:
Project BanBots
Also note the excellent tools at DnsStuff, like
CIDR/Netmask that calculates CIDR ranges.
|

11-30-2006, 09:06 AM
|
 |
WebProWorld 1,000+ Club
|
|
Join Date: May 2005
Location: Norway
Posts: 4,565
|
|
Note that it is also possible to block referrers whose domain names contain hyphens, or that have any given name in the URL. Your .htaccess must start with
RewriteEngine on
Include this line only once; it enables the rewriting engine.
Examples: you want to deny referrers with the following structures in the URL:
- word-word-whatever.com
RewriteCond %{HTTP_REFERER} ^(http://)?(www\.)?[a-z]+\-[a-z]+\-.*$ [NC,OR]
- word-word.word.com
RewriteCond %{HTTP_REFERER} ^(http://)?(www\.)?[a-z]+\-[a-z]+\.[a-z].*$ [NC,OR]
- freeforex
RewriteCond %{HTTP_REFERER} ^(http://)?(www\.)?.*(-|.)freeforex(-|.).*$ [NC,OR]
See more...
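Before such rules go live, it helps to try the expressions against sample referrers. The sketch below does that in Python, with `re.IGNORECASE` standing in for `[NC]`. The patterns are adapted from the three rules above. In the second one, the `+` that lets the second word be longer than one letter is my assumption about the intent. The helper name `is_blocked` and the referrer URLs are invented for illustration.

```python
import re

PATTERNS = [
    r"^(http://)?(www\.)?[a-z]+-[a-z]+-.*$",          # word-word-whatever.com
    r"^(http://)?(www\.)?[a-z]+-[a-z]+\.[a-z].*$",    # word-word.word.com
    r"^(http://)?(www\.)?.*(-|.)freeforex(-|.).*$",   # freeforex anywhere after one character
]
COMPILED = [re.compile(p, re.IGNORECASE) for p in PATTERNS]

def is_blocked(referrer):
    """Return True if any of the referrer patterns matches."""
    return any(rx.search(referrer) for rx in COMPILED)

# Hypothetical referrers, not taken from real logs.
for ref in (
    "http://www.cheap-pills-online.example/",
    "http://best-loans.example.com/",
    "http://www.freeforex.example/",
    "http://example.com/page",
):
    print(ref, "blocked" if is_blocked(ref) else "allowed")
```

Note that in the third pattern the unescaped `.` inside `(-|.)` matches any character, so the rule catches `freeforex` almost anywhere in the referrer, not only between hyphens or dots.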
|

11-30-2006, 07:16 PM
|
|
WebProWorld Pro
|
|
Join Date: Apr 2006
Location: Earth
Posts: 236
|
|
In robots.txt, we restrict everything from getting to the cgi-bin.
YET, Google Sitemaps lists over 200 specific pages in the cgi-bin as being restricted by the robots.txt.
The individual pages are NOT listed in robots.txt, just restrict "/cgi-bin/".
Nothing in our site is linked to any of these pages. They're pages over a year old! (We don't use that particular program any more).
It's like Google is reading my directory!
|

12-01-2006, 10:23 PM
|
 |
WebProWorld 1,000+ Club
|
|
Join Date: May 2005
Location: Norway
Posts: 4,565
|
|
This thread is primarily about .htaccess, since from a bad bot perspective robots.txt is a joke. If you do not believe me, try the following experiment.
- Delete your .htaccess file (if you have one).
- To exclude all robots from the entire server, upload the following robots.txt file to the root.
User-agent: *
Disallow: /
In theory your log should then be clean.
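To see how far practice is from theory, the access log can be scanned for anything other than requests for /robots.txt. A rough Python sketch follows, assuming Apache's combined log format. The regular expression, the function name `offenders` and the sample lines are illustrative, not taken from a real log. Human visitors will show up too, and user agents can be forged, so the result is only a starting list.

```python
import re
from collections import Counter

# Assumed layout: Apache "combined" log format.
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]+\] "(?P<method>\S+) (?P<path>\S+)[^"]*" '
    r'\d{3} \S+ "[^"]*" "(?P<agent>[^"]*)"'
)

def offenders(lines):
    """Count requests per (user agent, IP) for anything except /robots.txt."""
    counts = Counter()
    for line in lines:
        m = LOG_LINE.match(line)
        if m and m.group("path") != "/robots.txt":
            counts[(m.group("agent"), m.group("ip"))] += 1
    return counts

# Made-up sample lines using documentation IP ranges.
SAMPLE = [
    '192.0.2.10 - - [01/Dec/2006:22:23:00 +0100] "GET /robots.txt HTTP/1.1" 200 26 "-" "PoliteBot/1.0"',
    '198.51.100.7 - - [01/Dec/2006:22:24:10 +0100] "GET /index.html HTTP/1.1" 200 5120 "-" "EmailSiphon"',
    '198.51.100.7 - - [01/Dec/2006:22:24:11 +0100] "GET /contact.html HTTP/1.1" 200 2048 "-" "EmailSiphon"',
]

for (agent, ip), hits in offenders(SAMPLE).most_common():
    print(hits, ip, agent)
```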
|

12-09-2006, 11:26 AM
|
 |
WebProWorld Pro
|
|
Join Date: Jul 2003
Location: Springfield, Misery
Posts: 256
|
|
Quote:
|
Hosters should give an offer to implement them automatically. That will give an indication of how serious / reliable the hoster is.
|
Oh yeah... using FrontPage you get another hit in the gut by not being able to modify your .htaccess file without risking the failure of your site's extensions,
so I swing in the wind normally...
|

12-09-2006, 11:49 AM
|
 |
WebProWorld 1,000+ Club
|
|
Join Date: May 2005
Location: Norway
Posts: 4,565
|
|
Quote:
|
Originally Posted by krisidious
so I swing in the wind normally...
|
And sometimes I fight with windmills myself.
Building a firewall around your site using .htaccess is direct advice only for those who use the Apache web server, but it applies indirectly to those using other web servers such as Microsoft's IIS.
|

12-09-2006, 12:08 PM
|
 |
WebProWorld Pro
|
|
Join Date: Jul 2003
Location: Springfield, Misery
Posts: 256
|
|
I know I'm a bad man.... I use Apache Servers with Frontpage... don't whip me....
|

12-09-2006, 01:15 PM
|
 |
WebProWorld 1,000+ Club
|
|
Join Date: May 2005
Location: Norway
Posts: 4,565
|
|
Anything wrong with that? I mainly use IE 6.0 and not Opera, which I told my son to use. He loves it and threw out Firefox. That is not an ad but a fact. I advised him to also use Firefox.
It will take me some time to get used to Opera, time I use to fight with windmills :-)
P.S. I also have multiboot with Vista Beta and IE 7.0.
The problem is that I have two screens, and the last time I checked, Matrox did not have a graphics card supporting multiple screens for Vista.
|

08-07-2007, 02:25 PM
|
 |
WebProWorld 1,000+ Club
|
|
Join Date: May 2005
Location: Norway
Posts: 4,565
|
|
Re: An interesting test.
Good to note that my posts on WPW are being used.
"But then I came across this post by Kjell Gunnar Bleivik in which he basically outlines the essential components of the proposed solution. That was all the confirmation I needed. Time to get my hands dirty with some coding".
Read more ...
|

10-22-2007, 04:45 AM
|
 |
WebProWorld 1,000+ Club
|
|
Join Date: May 2005
Location: Norway
Posts: 4,565
|
|
Re: An interesting test.
Another confusing link because of the new forum software. The link is of course to this thread.
| |