WebProWorld Part of WebProNews.com
Internet Security Discussion Forum This forum is for the discussion of security related issues. If you find a new Phishing scheme, spyware, virus or malicious site - let us know about it. If any of the above found you... here's where you ask for help.

#1 (permalink)
11-21-2006, 11:04 AM
kgun
Join Date: May 2005
Location: Norway
Posts: 4,565
An interesting test.

I had the following idea. If I had the time, I would run this experiment:

Set up a site on a unique IP address, with good content and a lot of IBLs (inbound links), and observe the log statistics with the following file in place:

robots.txt
User-agent: *
Disallow: /

Has anybody run that experiment over a long period and recorded which bots do not respect robots.txt?

Resource links for those who want to analyze results:
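
As a rough illustration of how the log could be analyzed, here is a minimal Python sketch. It assumes the server writes Apache's Combined Log Format and that robots.txt disallows everything, as above. The function name `find_violators` and the heuristic are my own assumptions, not something from this thread: a client that fetched /robots.txt and then requested another path is counted as ignoring it. Clients that never read robots.txt cannot be told apart from ordinary browsers this way.

```python
import re
from collections import defaultdict

# Combined Log Format:
# ip ident user [time] "METHOD path proto" status size "referer" "user-agent"
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]*\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" '
    r'\d{3} \S+ "[^"]*" "(?P<agent>[^"]*)"'
)

def find_violators(log_lines):
    """Map each user agent that read robots.txt and then requested
    another path anyway to the set of IPs it came from."""
    read_robots = set()            # (ip, agent) pairs that fetched robots.txt
    violators = defaultdict(set)   # agent -> set of IPs
    for line in log_lines:
        m = LOG_LINE.match(line)
        if m is None:
            continue               # skip lines in an unexpected format
        key = (m["ip"], m["agent"])
        if m["path"] == "/robots.txt":
            read_robots.add(key)
        elif key in read_robots:
            violators[m["agent"]].add(m["ip"])
    return dict(violators)
```

Feeding it the lines of an access log returns a dictionary that can be sorted or published as a list of offenders.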

WebMasterWorld subforums
  • robots.txt
  • Apache Web Server
  • Website Analytics - Tracking and Logging

Browser Capabilities Project

#2 (permalink)
11-21-2006, 11:18 AM
Webnauts
Join Date: Aug 2003
Location: Worldwide
Posts: 6,867

Kjell, I have not followed the links yet, but are you saying that Google, MSN and Yahoo do not obey these directives?

User-agent: Googlebot
Disallow: /

User-agent: MSNBot
Disallow: /

User-agent: Slurp
Disallow: /

#3 (permalink)
11-21-2006, 01:02 PM
kgun

No, not their ordinary SE bots. I was thinking more generally, though they may run other bots as well; I am not claiming that they do.

To repeat, I was thinking of a general project. There are thousands of bots, and they could be collected one by one. It should be possible to do this automatically, that is, to write a program that:
  • Lists them.
  • Blocks them.
  • Makes them public.
  • Lists the domains and IP addresses they come from.

This should be a never-ending marathon. Ideally, more than one secret domain should be used, in different geographical regions.

Advantages:
  • Decrease the burden on the internet, since a lot of the bots would give up.
  • Save individual webmasters and websites bandwidth, and thereby reduce the need to upgrade, which saves money.
  • Reduce spam, since many of them are email-harvesting bots. Reduced spam is time saved, and time is money.

#4 (permalink)
11-21-2006, 01:10 PM
Webnauts

We have so far these in our .htaccess file:

# Simple spam protection against some of the more evil user agents
RewriteCond %{HTTP_USER_AGENT} ^BlackWidow [OR]
RewriteCond %{HTTP_USER_AGENT} ^Bot\ mailto:craftbot@yahoo.com [OR]
RewriteCond %{HTTP_USER_AGENT} ^ChinaClaw [OR]
RewriteCond %{HTTP_USER_AGENT} ^Custo [OR]
RewriteCond %{HTTP_USER_AGENT} ^DISCo [OR]
RewriteCond %{HTTP_USER_AGENT} ^Download\ Demon [OR]
RewriteCond %{HTTP_USER_AGENT} ^eCatch [OR]
RewriteCond %{HTTP_USER_AGENT} ^EirGrabber [OR]
RewriteCond %{HTTP_USER_AGENT} ^EmailCollector [OR]
RewriteCond %{HTTP_USER_AGENT} ^EmailSiphon [OR]
RewriteCond %{HTTP_USER_AGENT} ^EmailWolf [OR]
RewriteCond %{HTTP_USER_AGENT} ^Express\ WebPictures [OR]
RewriteCond %{HTTP_USER_AGENT} ^ExtractorPro [OR]
RewriteCond %{HTTP_USER_AGENT} ^EyeNetIE [OR]
RewriteCond %{HTTP_USER_AGENT} ^.*FileHound.*$ [OR]
RewriteCond %{HTTP_USER_AGENT} ^FlashGet [OR]
RewriteCond %{HTTP_USER_AGENT} ^GetRight [OR]
RewriteCond %{HTTP_USER_AGENT} ^GetWeb! [OR]
RewriteCond %{HTTP_USER_AGENT} ^Go!Zilla [OR]
RewriteCond %{HTTP_USER_AGENT} ^Go-Ahead-Got-It [OR]
RewriteCond %{HTTP_USER_AGENT} ^GrabNet [OR]
RewriteCond %{HTTP_USER_AGENT} ^Grafula [OR]
RewriteCond %{HTTP_USER_AGENT} ^HMView [OR]
RewriteCond %{HTTP_USER_AGENT} ^HTTrack [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Image\ Stripper [OR]
RewriteCond %{HTTP_USER_AGENT} ^Image\ Sucker [OR]
RewriteCond %{HTTP_USER_AGENT} Indy\ Library [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^InterGET [OR]
RewriteCond %{HTTP_USER_AGENT} ^Internet\ Ninja [OR]
RewriteCond %{HTTP_USER_AGENT} ^JetCar [OR]
RewriteCond %{HTTP_USER_AGENT} ^.*JoBo.*$ [OR]
RewriteCond %{HTTP_USER_AGENT} ^JOC\ Web\ Spider [OR]
RewriteCond %{HTTP_USER_AGENT} ^larbin [OR]
RewriteCond %{HTTP_USER_AGENT} ^Mass\ Downloader [OR]
RewriteCond %{HTTP_USER_AGENT} ^MIDown\ tool [OR]
RewriteCond %{HTTP_USER_AGENT} ^Mister\ PiX [OR]
RewriteCond %{HTTP_USER_AGENT} ^Navroad [OR]
RewriteCond %{HTTP_USER_AGENT} ^NearSite [OR]
RewriteCond %{HTTP_USER_AGENT} ^NetAnts [OR]
RewriteCond %{HTTP_USER_AGENT} ^NetSpider [OR]
RewriteCond %{HTTP_USER_AGENT} ^Net\ Vampire [OR]
RewriteCond %{HTTP_USER_AGENT} ^NetZIP [OR]
RewriteCond %{HTTP_USER_AGENT} ^Octopus [OR]
RewriteCond %{HTTP_USER_AGENT} ^Offline\ Explorer [OR]
RewriteCond %{HTTP_USER_AGENT} ^Offline\ Navigator [OR]
RewriteCond %{HTTP_USER_AGENT} ^PageGrabber [OR]
RewriteCond %{HTTP_USER_AGENT} ^Papa\ Foto [OR]
RewriteCond %{HTTP_USER_AGENT} ^pavuk [OR]
RewriteCond %{HTTP_USER_AGENT} ^pcBrowser [OR]
RewriteCond %{HTTP_USER_AGENT} ^RealDownload [OR]
RewriteCond %{HTTP_USER_AGENT} ^ReGet [OR]
RewriteCond %{HTTP_USER_AGENT} ^SiteSnagger [OR]
RewriteCond %{HTTP_USER_AGENT} ^SmartDownload [OR]
RewriteCond %{HTTP_USER_AGENT} ^SuperBot [OR]
RewriteCond %{HTTP_USER_AGENT} ^SuperHTTP [OR]
RewriteCond %{HTTP_USER_AGENT} ^Surfbot [OR]
RewriteCond %{HTTP_USER_AGENT} ^tAkeOut [OR]
RewriteCond %{HTTP_USER_AGENT} ^Teleport\ Pro [OR]
RewriteCond %{HTTP_USER_AGENT} ^.*TurnitinBot.*$ [OR]
RewriteCond %{HTTP_USER_AGENT} ^VoidEYE [OR]
RewriteCond %{HTTP_USER_AGENT} ^Web\ Image\ Collector [OR]
RewriteCond %{HTTP_USER_AGENT} ^Web\ Sucker [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebAuto [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebCopier [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebFetch [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebGo\ IS [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebLeacher [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebReaper [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebSauger [OR]
RewriteCond %{HTTP_USER_AGENT} ^Website\ eXtractor [OR]
RewriteCond %{HTTP_USER_AGENT} ^Website\ Quester [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebStripper [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebWhacker [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebZIP [OR]
RewriteCond %{HTTP_USER_AGENT} ^Wget [OR]
RewriteCond %{HTTP_USER_AGENT} ^.*Whacker.*$ [OR]
RewriteCond %{HTTP_USER_AGENT} ^Widow [OR]
RewriteCond %{HTTP_USER_AGENT} ^WWWOFFLE [OR]
RewriteCond %{HTTP_USER_AGENT} ^Xaldon\ WebSpider [OR]
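RewriteCond %{HTTP_USER_AGENT} ^Zeus [OR]
RewriteCond %{HTTP_USER_AGENT} ^.*adressendeutschland.*$
# The post did not include a RewriteRule; without one the conditions above have no effect.
# A typical rule (an assumption, not from the original post) answers 403 Forbidden:
RewriteRule ^ - [F]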
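
A list like the one above is tedious to maintain by hand, because every condition except the last needs the [OR] flag and the block needs a closing RewriteRule. The following Python sketch generates such a block from a plain list of agent names. The helper name `rewrite_block` and the closing `RewriteRule ^ - [F]` are my assumptions, not part of the posted file. `re.escape` also backslash-escapes spaces, which mod_rewrite needs because it splits directive arguments on whitespace.

```python
import re

def rewrite_block(agents):
    """Build RewriteCond lines joined by [OR], followed by one RewriteRule
    that answers 403 Forbidden to any user agent starting with a listed name."""
    if not agents:
        return ""                  # never emit a bare rule that blocks everyone
    conds = []
    for i, agent in enumerate(agents):
        # Every condition except the last carries the [OR] flag.
        flag = " [OR]" if i < len(agents) - 1 else ""
        conds.append(
            f"RewriteCond %{{HTTP_USER_AGENT}} ^{re.escape(agent)}{flag}"
        )
    return "\n".join(["RewriteEngine On", *conds, "RewriteRule ^ - [F]"])

if __name__ == "__main__":
    print(rewrite_block(["Download Demon", "Wget"]))
```

The generated text can be pasted into .htaccess; the agent list itself could come from a log analysis or from a published list of bad bots.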

#5 (permalink)
11-21-2006, 01:16 PM
kgun

My two:
http://multifinanceit.com/htaccess.txt

http://multifinanceit.com/htaccess1.txt

They are old and have not been updated. This should be done by a formal institution. I would support such an institution with at least USD 100.

"Browscap.ini is getting too big. The main reason for this is the huge number of crawlers and other bots I've been adding. I never intended for browscap to keep track of so many crawlers and other bots".

Hosters should give an offer to implement them automatically. That will give an indication of how serious / reliable the hoster is.

#6 (permalink)
11-24-2006, 06:52 PM
kgun

Quote:
Originally Posted by Webnauts
Kjell I did not follow the links yet, but do you want to say that Google, MSN and Yahoo do not obey to these commands?

User-agent: Googlebot
Disallow: /

User-agent: MSNBot
Disallow: /

User-agent: Slurp
Disallow: /

Look at the banned-ip.xml file at Gary Keith's Browser Capabilities Project.

Some other resources:
IP Addresses of Search Engine Spiders

CrawlWall, the firewall for webpages.

Now, when I start a new website, I begin by creating:
  • A robots.txt file.
  • A .htaccess file that blocks a lot of bad bots.
  • PHP configuration settings in the .htaccess file, so that the site is portable, e.g.:

    php_value include_path "Path string"

    A useful little script:

    <?php
    echo ( '<pre>' );
    echo 'DOCUMENT_ROOT = ' . $_SERVER['DOCUMENT_ROOT'] ;
    echo ( '</pre>' );
    echo ( '<pre>' );
    echo 'Include_path = ' . ini_get('include_path') . "\n";
    echo ( '</pre>' );
    ?>
  • So before I write any content and make any markup, I get control over my part of the server where my sites are hosted.

That saves me time and bandwidth, reduces referrer spam and, last but not least, gives me cleaner, easier-to-read logs.

Recommendation: When you start a new website, begin by building a firewall around it.

#7 (permalink)
11-25-2006, 02:46 AM
Webnauts

Kgun, great post. Thanks for sharing this valuable info.

#8 (permalink)
11-27-2006, 02:20 PM
kgun

I use the following technique to ban an IP address or an IP block:
  1. Look up the IP at DnsStuff. Note the IP (xy.zw.221.230) and the CIDR (xy.zw.0.0/16).
  2. Block the IP: deny from xy.zw.221.230
  3. Block the IP range: deny from xy.zw.0.0/16
where x, y, z and w are digits.
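
The two deny lines in the steps above can also be derived offline with Python's standard `ipaddress` module. This is only a sketch: the helper name `deny_lines` is hypothetical, the address 203.0.113.230 is a documentation placeholder, and the /16 prefix length is assumed to come from the whois lookup, since the real allocation may be larger or smaller.

```python
import ipaddress

def deny_lines(ip, prefix_len):
    """Return 'deny from' lines for a single address and for the
    enclosing network with the given prefix length."""
    addr = ipaddress.ip_address(ip)
    # strict=False lets us pass a host address and get its network back.
    network = ipaddress.ip_network(f"{ip}/{prefix_len}", strict=False)
    return [f"deny from {addr}", f"deny from {network}"]

if __name__ == "__main__":
    for line in deny_lines("203.0.113.230", 16):
        print(line)
```

Blocking a whole /16 also shuts out every legitimate visitor in that range, so the single-address line is the more cautious choice when only one host misbehaves.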

#9 (permalink)
11-28-2006, 11:20 AM
kgun

Here is an interesting thread at WMW:
I need to ban a country using htaccess

Note the following posts:
  • Key Master: You've got a big job ahead of you :). I'll start you off with the first IP block (using the SetEnvIf method).

SetEnvIf Remote_Addr ^61\.[0-3]\. ban
<Files ~ "^.*$">
order allow,deny
allow from all
deny from env=ban
</Files>
  • BjarneDM:
In my experience in observing these scans for formmail, they are not based on an analysis of your website - they just try different IPs until they get a positive response when looking for [fF]orm[mM]ail.[cgi|pl] in either cgibin or cgi-bin.
Thus, there are three very simple defenses against these scans:
  1. use the latest version of formmail
  2. rename the cgibin folder to something random like eftesfge
  3. rename formmail to something either random or descriptive like OrderMail
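
To see which addresses the SetEnvIf pattern quoted from Key Master actually covers, the regular expression can be tried in Python. This is a sketch under the assumption that Python's `re` treats this simple pattern the same way Apache's regex engine does. The helper name `is_banned` is mine, and the CIDR block 61.0.0.0/14 is my own reading of the pattern, not something stated in the quoted post.

```python
import ipaddress
import re

# Same pattern as in the SetEnvIf line: "61.", one digit from 0 to 3, then a dot.
BAN = re.compile(r"^61\.[0-3]\.")

# Assumed equivalent CIDR block: 61.0.0.0 through 61.3.255.255.
EQUIVALENT_BLOCK = ipaddress.ip_network("61.0.0.0/14")

def is_banned(remote_addr):
    """True if the textual address matches the SetEnvIf pattern."""
    return BAN.search(remote_addr) is not None

if __name__ == "__main__":
    for addr in ("61.2.10.20", "61.30.1.1", "61.4.0.1"):
        in_block = ipaddress.ip_address(addr) in EQUIVALENT_BLOCK
        print(addr, is_banned(addr), in_block)
```

Because the pattern works on the text of the address, the trailing `\.` matters: without it, 61.30.x.x would be banned as well.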

DownLoad FormMail.

Related link:
Project BanBots

Also note the excellent tools at DnsStuff, like

CIDR/Netmask that calculates CIDR ranges.

#10 (permalink)
11-30-2006, 09:06 AM
kgun

Note that it is also possible to block referrers by domain name structure, for example names containing hyphens, or by any word in the URL. Your .htaccess must start with

RewriteEngine on

Include this line only once; it enables the rewriting engine.


Examples: You want to deny referrers with the following structure in the URL:
  1. word-word-whatever.com

    RewriteCond %{HTTP_REFERER} ^(http://)?(www\.)?[a-z]+\-[a-z]+\-.*$ [NC,OR]
  2. word-word.word.com

    RewriteCond %{HTTP_REFERER} ^(http://)?(www\.)?[a-z]+\-[a-z]+\.[a-z]+.*$ [NC,OR]
  3. freeforex

    RewriteCond %{HTTP_REFERER} ^(http://)?(www\.)?.*(-|.)freeforex(-|.).*$ [NC,OR]

See more...
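
Referrer patterns like the three examples above are easy to get subtly wrong, so it helps to try them on sample URLs before putting them into .htaccess. The Python sketch below does that, assuming Python's `re` behaves like Apache's regex engine for these patterns and using `re.IGNORECASE` in place of the [NC] flag. Pattern 2 is written here with `[a-z]+` for the second word so that it matches the stated example word-word.word.com. The helper name `is_blocked` and the sample URLs are hypothetical.

```python
import re

REFERRER_PATTERNS = [
    r"^(http://)?(www\.)?[a-z]+\-[a-z]+\-.*$",           # word-word-whatever.com
    r"^(http://)?(www\.)?[a-z]+\-[a-z]+\.[a-z]+.*$",     # word-word.word.com
    r"^(http://)?(www\.)?.*(-|.)freeforex(-|.).*$",      # freeforex
]
COMPILED = [re.compile(p, re.IGNORECASE) for p in REFERRER_PATTERNS]

def is_blocked(referrer):
    """True if any of the patterns matches the Referer header value."""
    return any(p.search(referrer) for p in COMPILED)

if __name__ == "__main__":
    for url in ("http://word-word-whatever.com/",
                "http://word-word.word.com/",
                "http://www.example.com/"):
        print(url, is_blocked(url))
```

Note that in pattern 3 the unescaped dot in `(-|.)` matches any character, so it is looser than "hyphen or dot"; whether that is intended is for the rule's author to decide.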

#11 (permalink)
11-30-2006, 07:16 PM
blitzen
Join Date: Apr 2006
Location: Earth
Posts: 236

In robots.txt, we restrict everything from getting to the cgi-bin.

YET, Google Sitemaps lists over 200 specific pages in the cgi-bin as being restricted by the robots.txt.

The individual pages are NOT listed in robots.txt, just restrict "/cgi-bin/".

Nothing in our site is linked to any of these pages. They're pages over a year old! (We don't use that particular program any more).

It's like Google is reading my directory!
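
A note on why individual pages can show up as restricted even though robots.txt names only the directory: a Disallow line is a path prefix, so every URL beneath /cgi-bin/ counts as blocked. The Python sketch below shows this with the standard `urllib.robotparser` module. The helper name `blocked_paths`, the agent name and the sample paths are hypothetical, and the sketch says nothing about how a crawler learned those URLs in the first place.

```python
from urllib.robotparser import RobotFileParser

def blocked_paths(robots_txt, agent, paths):
    """Return the paths that a compliant crawler with this agent name
    may not fetch under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [p for p in paths if not parser.can_fetch(agent, p)]

if __name__ == "__main__":
    robots = "User-agent: *\nDisallow: /cgi-bin/\n"
    candidates = ["/cgi-bin/old-script.cgi", "/cgi-bin/", "/index.html"]
    print(blocked_paths(robots, "ExampleBot", candidates))
```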

#12 (permalink)
12-01-2006, 10:23 PM
kgun

This thread is primarily about .htaccess, since from a bad bot perspective robots.txt is a joke. If you do not believe me, try the following experiment.
  • Delete your .htaccess file (if you have one).
  • To exclude all robots from the entire server, upload the following robots.txt file to the root.

    User-agent: *
    Disallow: /


In theory, your log should then be clean.

#13 (permalink)
12-09-2006, 11:26 AM
krisidious
Join Date: Jul 2003
Location: Springfield, Misery
Posts: 256

Quote:
Hosters should give an offer to implement them automatically. That will give an indication of how serious / reliable the hoster is.
Oh yeah... using FrontPage, you take another hit in the gut by not being able to modify your .htaccess file without risking the failure of your site's extensions.

so I swing in the wind normally...
__________________
Kristoff Rand
Residential Home Designer
http://www.aboveallhouseplans.com

#14 (permalink)
12-09-2006, 11:49 AM
kgun

Quote:
Originally Posted by krisidious
so I swing in the wind normally...
And sometimes I, too, fight with windmills.

Building a firewall around your site using .htaccess is advice aimed directly at those who use the Apache web server, but it applies indirectly to those using other web servers, such as Microsoft's IIS.

#15 (permalink)
12-09-2006, 12:08 PM
krisidious

I know I'm a bad man.... I use Apache Servers with Frontpage... don't whip me....

#16 (permalink)
12-09-2006, 01:15 PM
kgun

Anything wrong with that? I mainly use IE 6.0 and not Opera, which I told my son to use. He loves it and threw out Firefox. That is not an ad, just a fact. I advised him to use Firefox as well.

It will take me some time to get used to Opera, time I use to fight with windmills :-)

P.S. I also have multiboot with Vista Beta and IE 7.0.

The problem is that I have two screens, and the last time I checked, Matrox did not have a multi-screen graphics card for Vista.

#17 (permalink)
08-07-2007, 02:25 PM
kgun

Good to note that my posts on WPW are being used.

"But then I came across this post by Kjell Gunnar Bleivik in which he basically outlines the essential components of the proposed solution. That was all the confirmation I needed. Time to get my hands dirty with some coding".

Read more ...

#18 (permalink)
10-22-2007, 04:45 AM
kgun

Another confusing link because of the new forum software. The link is of course to this thread.