#1
03-01-2005, 11:54 AM
Weedy Lady
WebProWorld Veteran
Join Date: Nov 2003
Location: mid south USA
Posts: 385
What is a grabber?

I'm trying very hard to get things set up on my site so the bad bots can't harvest things and so people can't steal my work.

In checking my stats under "browser types" I found four listings that were designated as "grabbers". They are:
Wget, Curl, Acrobat, and WebCopier

Do I want to ban these from my site, and if so how do I do it? If it is through .htaccess please give me the exact code to add to my existing .htaccess pages. I have one in each directory, so can put the code in all of them.

I don't think it should go in robots.txt because these were listed under browsers, but if it should go in there also please let me know, and how to code it.

And speaking of robots.txt -- Google has garnered all of my images (thousands of them) and people are clicking on them and then using them -- sometimes hotlinking to them. Do I want to ban the Google image robot, and if so will this hurt my rankings? I certainly do NOT want to ban the regular Google bot!!!!

PLEASE do not tell me to use a code to prevent hotlinking. I tried four versions and none of them would work. My CP has a link to click to stop hotlinking also, but when I activate it no images will load. At all. On my own pages. I deal with hotlinking by moving my graphics often. It's a lot of work, but it does work for me when nothing else will.
__________________
The Weedy Lady at
http://www.happydaycards.com
Free E Cards for holidays and all occasions, fun pages and great recipes.

#2
03-01-2005, 12:24 PM
paulhiles
WebProWorld 1,000+ Club
Join Date: Jul 2003
Location: UK
Posts: 2,803
Banning unwanted crawls

Hi there Weedy Lady,

There are several questions in your post. I'm not going to attempt to answer them all, but I'll give you a few quick replies to get the ball rolling.

A grabber is an automated crawler/spider, also known as a web scraper or screen scraper. Grabbers can be used legitimately, for example to index links from news sites and construct RSS feeds. However, there is always a flip side! The worst examples of 'grabbers' simply scour a site's content with the purpose of extracting email addresses and contact information.

I don't have a list of malicious grabbers / scrapers, but I'm sure other members will be able to supply a few names. I would imagine it's unlikely that any such program has been written to obey the robots exclusion rules, so as you say, adding them to your site's .htaccess file could well be your best bet.

You may find the following page helpful:
How to block spambots, ban spybots, and tell unwanted robots to go to hell

Regarding Google's spidering of images: the bot Google uses to index images is called Googlebot-Image. The robots.txt file can be modified to control the bot's activity. You can ban the bot from a specific directory, as below:

Code:
User-Agent: Googlebot-Image
Disallow: /images/

Alternatively, you could ban the bot from your site altogether:

Code:
User-Agent: Googlebot-Image
Disallow: /
I can't say for sure whether or not this would affect your rankings, but I would very much doubt it!
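
To double-check the first robots.txt variant above before uploading it, here is a minimal sketch using the robots.txt parser from Python's standard library. The file paths are made up for illustration, and Python's parser is its own implementation, so it only approximates how Google reads the file.

```python
from urllib.robotparser import RobotFileParser

# The first variant from the post: keep only the image bot out of /images/.
ROBOTS_TXT = """\
User-Agent: Googlebot-Image
Disallow: /images/
"""


def allowed(robots_txt: str, agent: str, path: str) -> bool:
    """Return True if `agent` may fetch `path` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, path)


if __name__ == "__main__":
    # Hypothetical paths, used only to exercise the rules.
    for agent in ("Googlebot-Image", "Googlebot"):
        for path in ("/images/card.gif", "/index.html"):
            print(agent, path, allowed(ROBOTS_TXT, agent, path))
```

Under this parser the image bot is kept out of /images/ while the regular Googlebot is not affected by the rule.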

#3
03-02-2005, 05:38 AM
Faglork
WebProWorld Veteran
Join Date: Feb 2005
Location: Forchheim, Germany
Posts: 947

Hello Weedy Lady,

there's a page over at Google with information on that:

"Remove an image from Google's Image Search"
http://www.google.com/remove.html#images

hth,
Alex

#4
03-02-2005, 08:53 AM
Weedy Lady
WebProWorld Veteran
Join Date: Nov 2003
Location: mid south USA
Posts: 385
thanks

Thanks to both of you for the information about Google images.

I've found some baddies in my stats that I would like to ban in .htaccess as well as in my robots.txt file, but I do that through my CP and it puts them into my .htaccess file. In order to do that I need a URL or full domain name rather than just the name of the robot.

When looking at my stats I find either an IP or a name for visitors, but can't seem to find an IP for the robots. I tried looking them up on Whois, but don't know if the information I got was for the bot companies or for other companies who are unfortunate enough to have a domain name that matches the bot name. Can someone give me the IPs of the following?

wget (all versions)
WebCopier
Web Image Collector
Curl

I would also like to ban MS FrontPage and Acrobat, but don't know how to do those either. I have looked and looked online, and all the instruction pages say you have to modify code on the server. I don't have my own server so can't do this.

Can someone provide IPs for the above 4?

Thanks

#5
03-02-2005, 09:09 AM
Faglork
WebProWorld Veteran
Join Date: Feb 2005
Location: Forchheim, Germany
Posts: 947
Re: thanks

Quote:
Originally Posted by Weedy Lady

wget (all versions)
WebCopier
Web Image Collector
Curl

[...]

Can someone provide IPs for the above 4?

Thanks

wget is a retrieval program:
http://directory.fsf.org/wget.html
--> not a robot, so no fixed IP

WebCopier is a retrieval program:
http://www.maximumsoft.com/
--> not a robot, so no fixed IP

It's pretty much the same with the other two.

Just do a Google search and you'll find them.

To cut it short: you can't ban them by IP; you have to ban them by agent ID, i.e. the user-agent string each program sends with its requests. You do not need a dedicated server for this, as it may be possible to modify your .htaccess accordingly.
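
To illustrate what banning by agent ID means, here is a small Python sketch of the matching logic only. It is not .htaccess code; the token list comes from the programs named in this thread, and the sample user-agent strings are assumptions about what such programs typically send.

```python
# Substrings taken from the programs named in this thread.
# Matching is case-insensitive, so "Wget" and "wget" are treated alike.
BLOCKED_AGENT_TOKENS = ("wget", "webcopier", "web image collector", "curl")


def is_blocked_agent(user_agent: str) -> bool:
    """Return True if the user-agent string contains any blocked token."""
    ua = user_agent.lower()
    return any(token in ua for token in BLOCKED_AGENT_TOKENS)


if __name__ == "__main__":
    # Illustrative user-agent strings (assumed, not taken from real logs).
    samples = [
        "Wget/1.9.1",
        "curl/7.12.1",
        "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)",
    ]
    for ua in samples:
        print(ua, "->", "blocked" if is_blocked_agent(ua) else "allowed")
```

Keep in mind that the user-agent string is supplied by the visiting program itself, so a determined grabber can send a browser-like string and slip past this kind of check.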

As for the robots.txt, I suggest putting all your pictures in a subdirectory and excluding all robots from that directory.

Alex

#6
03-02-2005, 09:21 AM
Weedy Lady
WebProWorld Veteran
Join Date: Nov 2003
Location: mid south USA
Posts: 385
ban to sub directories

I have 5 sub directories with graphics, and one with music files. I change the names of these directories quite often to keep hotlinkers away and take the images off if they have been hotlinked. I would have to change my robots.txt file each time I did this. I can do it........the only problem will be to remember to do it.
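
If remembering the robots.txt edit is the hard part, one option is to regenerate the file from a list of the current directory names. This is a rough Python sketch, and the directory names in it are hypothetical placeholders, not the real ones.

```python
def build_robots_txt(directories):
    """Build robots.txt text that asks all robots to stay out of each directory."""
    lines = ["User-agent: *"]
    for name in directories:
        # Normalise "pics", "/pics" and "/pics/" to the same "/pics/" form.
        lines.append(f"Disallow: /{name.strip('/')}/")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical directory names; replace with the current ones after each rename.
    current_dirs = ["graphics1", "graphics2", "music"]
    print(build_robots_txt(current_dirs), end="")
```

One trade-off to weigh: robots.txt is publicly readable, so listing the directories there also shows their current names to anyone who looks at the file.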

I have tried all variations of the hotlinking ban scripts and they just do not work for me. I copied and pasted so did not do them wrong. They just won't work. One of them worked for one check that I did, and then stopped working.

If that is the best way to keep the robots out of my graphics I guess I'll have to do it.

None of this is easy, is it?????

#7
03-02-2005, 08:58 PM
magic2147
WebProWorld New Member
Join Date: Feb 2005
Posts: 2
Re: ban to sub directories

You might like to look at the Copysentry services provided at http://www.copyscape.com/

But at the end of the day if they want to knock off your stuff badly enough, they will.

To a certain extent it's a bit like steganography and digital watermarking: a nice idea, but probably not worth the effort. Or, to put it another way, don't let the effort of protecting your work distract you from the main game.

BTW, which CP are you using? If it works the way you describe, it is an install issue, and your hosting company should be able to fix it or at least post a bug report with the CP developer.

Good luck

#8
03-02-2005, 10:10 PM
Weedy Lady
WebProWorld Veteran
Join Date: Nov 2003
Location: mid south USA
Posts: 385
to magic2147

Thanks for the link. I'll definitely look at it.

It isn't a problem with the CP. My hosting company says the hotlinking function works fine with other sites. I think it's the way I have mine set up.....or rather the way my coding has to be because of that. I have to put the FULL url path to any files not in the same folder (all graphics, all music, and several other things), and that makes them all look like links from outside sites.
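
For what it's worth, hotlink protection typically looks at the Referer header (the page that asked for the image), not at how the image URL is written in the HTML. A rough Python sketch of that kind of check follows. The allowed host list and the treatment of empty referers are assumptions, and a mismatch between the www and non-www form of the domain is only one possible reason a site's own pages get blocked.

```python
from urllib.parse import urlparse

# Assumed allow-list: both forms of the site's own domain.
ALLOWED_HOSTS = {"www.happydaycards.com", "happydaycards.com"}


def is_hotlink(referer, allowed_hosts=ALLOWED_HOSTS):
    """Return True if the request appears to come from a page on another site."""
    if not referer:
        # Assumption: empty referers are let through, as many setups do.
        return False
    host = (urlparse(referer).hostname or "").lower()
    return host not in allowed_hosts


if __name__ == "__main__":
    print(is_hotlink("http://www.happydaycards.com/cards/page.html"))
    print(is_hotlink("http://example.com/stolen.html"))
    # With only the www form allowed, the site's own non-www pages look foreign.
    print(is_hotlink("http://happydaycards.com/cards/page.html",
                     {"www.happydaycards.com"}))
```

If the allow-list in the CP contains only one form of the domain, adding the other form might be worth trying before giving up on the feature.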

I set it up with too many folders, and so many of my pages have really good rankings on the SEs that if I changed the whole structure now, I would have to do hundreds of redirects.

If ONLY I HAD the WPW forum and had read all this good stuff six years ago before I started my site, I definitely would have followed all the great web site design advice in the first place. Now I fear I have a monster that I just have to deal with. You don't even want to know what my directory structure is!

I truly appreciate everyone's help, because it has enabled me to make a lot more sense out of things, and I am now much more organized than before.

#9
03-02-2005, 10:18 PM
Weedy Lady
WebProWorld Veteran
Join Date: Nov 2003
Location: mid south USA
Posts: 385
re the link to copy sentry

Copy Sentry sounds like a great service, but it would cost me $200 per month. I don't make that much. In fact, I don't really make enough to cover all my expenses. Nice at income tax time.

Obviously, I'm not doing this for the money. I just treat it as if I were.