WebProWorld
Search Engine Optimization Forum SEO is much easier with help from peers and experts! The WebProWorld SEO forum is for the discussion and exploration of various search engine optimization topics. Any SEO or SEM topics that are not specific to one engine should go here.

  #1 (permalink)  
Old 11-21-2005, 05:20 AM
WebProWorld Member
 
Join Date: Oct 2005
Location: Manchester
Posts: 83
Psychobel RepRank 0
Default How to Identify a Scraper Site

First of all, apologies if this is requesting stuff that is already out there. I couldn't find the answer that I was looking for.

I sort of know that scraper sites exist and I've got a vague idea what they do. But I have no idea how to identify them and I don't really know what the impact on our sites would be.

I was playing with the Google Sitemap tool, in particular the AllinURL command, and I found a couple of sites that referenced our site and led to a 404 page with a "URL moved" description. It looked odd.

Any pointers gratefully received.
  #2 (permalink)  
Old 11-21-2005, 05:03 PM
WebProWorld Veteran
 
Join Date: Feb 2004
Location: NM, USA
Posts: 772
Andilinks RepRank 0
Default

Here are a few hallmarks of a "scraper site." Bear in mind that what matters is not the scraping itself--Google and other engines scrape in a literal sense--but the usefulness of the site. If a site actually has utility it is not a "scraper site" in the sense of being parasitic.

Scraper sites usually make little sense unless they are scraping entire pages, but usually they just take snippets to avoid copyright issues and so are gibberish if you read more than one sentence.

Sites that use snippets but actually are useful in some way are not scrapers since this requires some editing, juxtaposition...

Scraper sites are not updated.

There are more definitions I'm sure, some more objective than others.

Andi
__________________
...the Rockies may tumble, Gibralter may crumble... G & I Gershwin, 1937
  #3 (permalink)  
Old 11-21-2005, 05:05 PM
WebProWorld New Member
 
Join Date: May 2004
Location: South Florida
Posts: 7
neuron RepRank 0
Default How to build a scrap(p)er site

If I wanted to create a scraper page for "waterproof widgets" I would search on this phrase at any of the major search engines and then copy the SERPs page that is the result, post it to a page called www.domain.com/waterproof-widgets.html and then post Adsense on it.

By collecting the scraps from the SERPs I'm likely to get "content" that is highly relevant to the search engines, and the page can very likely rank for the result.

If I wanted to get fancy I could actually go pull more content from each of the links listed, or I could pull the results from all three major engines and then mix them up to get more content, make that content more original, and make it less detectable.

To get even fancier, you can pull the results from the SERPs on a daily basis via automated script so that your pages are updated regularly, giving you even more pull with the search engines.
  #4 (permalink)  
Old 11-21-2005, 05:17 PM
WebProWorld Veteran
 
Join Date: Feb 2004
Location: NM, USA
Posts: 772
Andilinks RepRank 0
Default

hmmmm... I don't think we needed instructions on how to build a scraper site.

Yes, there are lots of scraper programs--they advertise on AdWords a lot; just search "scraper." The sophistication they employ to enable theft is chilling; I have tried a few as shareware trials.

These programs are expensive and likely to be obsolete in the next Google update. In general I think the game is about over for the scrapers anyway; Google is hot on their case (or so it would seem).

Andi
__________________
...the Rockies may tumble, Gibralter may crumble... G & I Gershwin, 1937
  #5 (permalink)  
Old 11-21-2005, 05:25 PM
WebProWorld New Member
 
Join Date: Jun 2004
Posts: 13
tobyd RepRank 0
Default

http://www.copyscape.com/ - enter your URL and it will display any pages that use your content.
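
For anyone curious how a duplicate-content check can work in principle, here is a minimal Python sketch. It is not Copyscape's actual method (which I don't know); it simply compares overlapping word sequences ("shingles") between two texts. The function names, the shingle size of 5 and the sample sentences are all assumptions made up for illustration.

```python
import re


def shingles(text, size=5):
    """Return the set of overlapping word sequences of length `size`."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    if len(words) < size:
        # Text shorter than one shingle: treat the whole text as a single unit.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def overlap(original, suspect, size=5):
    """Fraction of the original's shingles that also appear in the suspect text."""
    mine = shingles(original, size)
    if not mine:
        return 0.0
    theirs = shingles(suspect, size)
    return len(mine & theirs) / len(mine)


# Hypothetical sample texts.
ours = "Our waterproof widgets are tested in salt water for thirty days before shipping."
copied = "Cheap deals! " + ours + " Click here."
unrelated = "Free e-cards for holidays and all occasions, fun pages and great recipes."

print(overlap(ours, copied))     # 1.0: every shingle of ours reappears
print(overlap(ours, unrelated))  # 0.0: no five-word run in common
```

A score near 1.0 suggests a verbatim copy, while shared boilerplate (a standard company blurb, a syndicated article) would also score high, so the number is a prompt to look at the page, not proof of theft.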
  #6 (permalink)  
Old 11-21-2005, 06:59 PM
WebProWorld Veteran
 
Join Date: Nov 2003
Location: mid south USA
Posts: 405
Weedy Lady RepRank 0
Default copyscape

I know there are sites that use my content. I find them through the "referrer" stats because they just copy and paste my entire code, which means they are hotlinking to my graphics.

The copyscape site that you listed only showed sites that link to mine... not any that are copying my stuff.

Links are good. Copying is bad. The copyscape site was not helpful.
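
To make the referrer approach concrete, here is a rough Python sketch of scanning a server log for image requests that arrive with someone else's page as the referrer. It assumes the log is in Apache's "combined" format; the function name, the image extension list and the sample hosts (scraper.example, www.mysite.example) are hypothetical.

```python
import re
from collections import Counter
from urllib.parse import urlparse

# Apache "combined" format:
# ip identity user [time] "METHOD path protocol" status size "referer" "user-agent"
LOG_LINE = re.compile(
    r'^(\S+) \S+ \S+ \[[^\]]+\] "(?:\S+) (\S+)[^"]*" \d{3} \S+ "([^"]*)" "[^"]*"'
)
IMAGE_EXT = (".gif", ".jpg", ".jpeg", ".png")


def hotlinking_hosts(log_lines, own_hosts):
    """Count image requests whose Referer points at a host that is not ours."""
    counts = Counter()
    for line in log_lines:
        match = LOG_LINE.match(line)
        if not match:
            continue  # not a combined-format line
        _ip, path, referer = match.groups()
        if not path.lower().split("?")[0].endswith(IMAGE_EXT):
            continue  # only images are of interest here
        host = (urlparse(referer).hostname or "").lower()
        if host and host not in own_hosts:
            counts[host] += 1
    return counts


# Hypothetical log lines for illustration.
sample = [
    '203.0.113.9 - - [21/Nov/2005:18:59:01 -0500] "GET /images/card1.gif HTTP/1.1" 200 5120 "http://scraper.example/cards.html" "Mozilla/4.0"',
    '198.51.100.4 - - [21/Nov/2005:18:59:05 -0500] "GET /images/card1.gif HTTP/1.1" 200 5120 "http://www.mysite.example/index.html" "Mozilla/4.0"',
    '198.51.100.7 - - [21/Nov/2005:18:59:09 -0500] "GET /images/card2.jpg HTTP/1.1" 200 7300 "-" "Mozilla/4.0"',
]
print(hotlinking_hosts(sample, {"www.mysite.example"}))
```

Requests with no referrer (logged as "-") are skipped, since they say nothing about where the image was embedded.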
__________________
The Weedy Lady at
http://www.happydaycards.com
Free E Cards for holidays and all occasions, fun pages and great recipes.
  #7 (permalink)  
Old 11-21-2005, 07:30 PM
WebProWorld New Member
 
Join Date: Jan 2004
Location: Texas
Posts: 12
BlackCat2 RepRank 0
Default Re: copyscape

Quote:
Originally Posted by Weedy Lady
The copyscape site that you listed only showed sites that link to mine... not any that are copying my stuff.
Well, then probably no sites were using anything from that page. I got great results with copyscape.

Of course, the page I tested makes use of some free-to-use articles for websites. It was still a good, valid test of how effective copyscape is.

Plus it fed my curiosity as to how many other sites had that same article, so there is yet another use for copyscape if you use free content articles on your sites.

It even picked up websites using the same text-based affiliate copy from the merchant that I have used. =)

Heidi
  #8 (permalink)  
Old 11-21-2005, 07:40 PM
WebProWorld Pro
 
Join Date: Jan 2004
Location: In bed at home
Posts: 107
danielle v2.1b RepRank 1
Default Re: copyscape

Quote:
Originally Posted by Weedy Lady
I know there are sites that use my content. I find them through the "referrer" stats because they just copy and paste my entire code, which means they are hotlinking to my graphics.
One of my web hosts (http://www.fxstudios.net) offers hotlink protection on accounts so that others can't steal your images. It's one thing to steal your content, but your bandwidth too? That's an outrage! Check with your ISP; not all offer hotlink protection, but if this is a serious issue for you, it can be a step towards dissuading some people from content theft.

I checked a couple of client URLs with copyscape and it did turn up some duplicate content which could be confused with scrapers. For example, we have a standard company introduction for one client which is often used on third-party sites as a link text descriptor, as well as on the homepage of the client company's website. It seems a useful tool; thanks for sharing it.
__________________
If you've worked in the Adult SEO industry, please tell me... how do you get it up?
My web designers
  #9 (permalink)  
Old 11-21-2005, 07:51 PM
WebProWorld Veteran
 
Join Date: Nov 2003
Location: mid south USA
Posts: 405
Weedy Lady RepRank 0
Default hotlinking protection

Yes, I am with fxstudios also (fantastic company!), but because of the java script I use on about 50 of my pages I can't use the hotlinking feature. I am hoping to find time after the holidays to completely redesign those pages to implement the hotlinking protection. However, the last time I turned it on as an experiment one of my friends could not get any of my graphics on my pages, and that does not include the ones with the Anfy java script on them. I change the "location" of my graphics every few days and that helps a bit.
__________________
The Weedy Lady at
http://www.happydaycards.com
Free E Cards for holidays and all occasions, fun pages and great recipes.
  #10 (permalink)  
Old 11-21-2005, 08:00 PM
WebProWorld Member
 
Join Date: Oct 2005
Location: Manchester
Posts: 83
Psychobel RepRank 0
Default

It certainly was useful to get an overview of the how-to. That made the penny drop for me, to some degree. The copyscape site has thrown up a few instances where our copy has shown up on really poor-quality sites. So this leads naturally to a few more questions:

1> What action can be taken? Legal?
2> I'm assuming turning them in to Google for AdSense may be useful. True?
3> What's the potential impact on my SERPs?

Thanks for all your help, I'm a wiser man already.
  #11 (permalink)  
Old 11-21-2005, 11:37 PM
WebProWorld Veteran
 
Join Date: Jun 2004
Location: Indiana
Posts: 589
google junky RepRank 1
Default

Quote:
Originally Posted by Psychobel
It certainly was useful to get an overview of the how-to. That made the penny drop for me, to some degree. The copyscape site has thrown up a few instances where our copy has shown up on really poor-quality sites. So this leads naturally to a few more questions:

1> What action can be taken? Legal?
2> I'm assuming turning them in to Google for AdSense may be useful. True?
3> What's the potential impact on my SERPs?

Thanks for all your help, I'm a wiser man already.
We covered this in a topic in the Google section of the forums. Read my post here:
http://www.webproworld.com/viewtopic...hlight=#262082


Read the others as well if you like.
  #12 (permalink)  
Old 11-22-2005, 01:37 AM
WebProWorld 1,000+ Club
 
Join Date: Dec 2003
Location: Toronto, Ontario, Canada
Posts: 2,345
ADAM Web Design RepRank 0
Default

Well...the first part, as google junky pointed out, was covered to death. So I won't go there again.

So, with that said, on to 2) and 3).

2) If Google has AdSense running on the SERP pages, it is unlikely they'll act any time soon. That's extra money in their coffers, and they're a for-profit corporation. So they've got a vested interest in keeping the spammers in.

Mind you, sooner or later they'll have to crack down on this type of thing...not because of irrelevant SERPs and scraper pages finding their way in, but because of advertisers who pull out because they're pissed off at the irrelevant scraper pages they find their ads on. It may take a while, but Google will act...eventually.

3) Minimal, in the worst case. You don't control the content of the scraper pages, or any other pages that you don't have FTP or other access to. If search engines penalized you for what those pages do, it would be all too easy for competitors to knock other competitors out of the SERPs.

I'm on at least 20 scraper pages that I know about (and hundreds more I probably don't), and nothing has ever happened to me in this regard. Again, I don't control it.

If you want to see some of them, use Google and search for "ADAM Web Design" and then one of 5fish.net, 360mediaworx.com, elehost.com, or abacus.ca. I'm not going to provide direct links since linking to spam is bad, mmmkay?
  #13 (permalink)  
Old 11-22-2005, 05:29 AM
WebProWorld Pro
 
Join Date: May 2005
Location: England
Posts: 131
thebloke RepRank 0
Default

I used to work for a company that took a question set input by the user, then used Java to scrape every known car insurance site in the UK and return an insurance quote from each of them.

A lot of companies have got wise to this sort of thing, which is why you see those distorted images with a random series of alphanumeric characters that you have to type in to proceed. Automated scrapers can't get past this.
  #14 (permalink)  
Old 11-22-2005, 05:47 AM
WebProWorld Member
 
Join Date: Oct 2005
Location: Manchester
Posts: 83
Psychobel RepRank 0
Default

Quote:
Originally Posted by thebloke
I used to work for a company that took a question set input by the user, then used Java to scrape every known car insurance site in the UK and return an insurance quote from each of them.

A lot of companies have got wise to this sort of thing, which is why you see those distorted images with a random series of alphanumeric characters that you have to type in to proceed. Automated scrapers can't get past this.
The problem with that approach is that it's not accessible and, strictly speaking, breaches UK accessibility legislation. It's also not great from a usability point of view for your average punter.
  #15 (permalink)  
Old 11-22-2005, 06:42 AM
WebProWorld Pro
 
Join Date: May 2005
Location: England
Posts: 131
thebloke RepRank 0
Default

Quote:
Originally Posted by Psychobel
The problem with that approach is that it's not accessible and, strictly speaking, breaches UK accessibility legislation. It's also not great from a usability point of view for your average punter.
I guess so. A lot of the big guns do it though - check out the login page for www.overture.com, for example. It seems to be the done thing for stopping scrapers.
  #16 (permalink)  
Old 11-22-2005, 08:48 AM
WebProWorld 1,000+ Club
WebProWorld MVP
 
Join Date: May 2005
Location: Norway
Posts: 6,635
kgun RepRank 4
Default

"It's one thing to steal your content, but your bandwidth too?"

1. Syndicate your content with RSS.
2. Take control of your server.
http://www.webproworld.com/viewtopic.php?t=55223
  #17 (permalink)  
Old 11-22-2005, 08:48 AM
WebProWorld Veteran
 
Join Date: Oct 2005
Posts: 532
aaron2005 RepRank 0
Default

Quote:
Originally Posted by tobyd
http://www.copyscape.com/ - enter your URL and it will display any pages that use your content.
Thanks for the helpful link.

If you do a search for people who backlink to you, you will find many splog scrapers. I think ADAM is correct: it really doesn't hurt you.

Think about how stupid it would be for a search engine to punish you for something you can't control.

And it's good to see more ladies in the house. :)
__________________
SEO Blog
  #18 (permalink)  
Old 11-22-2005, 10:41 AM
WebProWorld Pro
 
Join Date: Oct 2003
Location: western Colorado
Posts: 168
writergrrrl48 RepRank 0
Default Still a little confused

What is the purpose of a scraper site? I don't understand why someone would bother to build one ... seems like a waste of time.
__________________
Shirley Bradbury - Woman-owned web business in the wilds of Western Colorado
Web Site Design & Marketing Web conferencing services
  #19 (permalink)  
Old 11-22-2005, 11:50 AM
WebProWorld Member
 
Join Date: Nov 2004
Location: Washington State
Posts: 46
maxsun RepRank 0
Default Copyright infringement

We used copyscape and were amazed at how brazen people were in stealing our content completely, verbatim.

We contacted the worst offenders (one was even a paralegal!) and they all removed the content ASAP.

Copyscape says you can contact the offender's hosting company, ICANN, and the search engines. It's a pretty bad thing to get branded as a plagiarist.

We have a SERIOUS problem with people stealing our images on eBay. We've contacted eBay multiple times, and the offenders multiple times. So far, eBay hasn't even replied to us!

It's a serious issue and I'd sure like to know how to protect our images. We're on an Apache server (Unix). Any great ideas that won't affect the coming Xmas traffic or greatly upset the website?
__________________
MaxSun

http://www.LuckyGemstones.com
  #20 (permalink)  
Old 11-22-2005, 12:12 PM
WebProWorld Veteran
 
Join Date: Oct 2004
Posts: 451
wednesday RepRank 0
Default

htaccess for hotlinking protection
http://altlab.com/htaccess_tutorial.html
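
The idea behind referrer-based hotlink protection can be sketched in a few lines of Python. This is only an illustration of the decision an .htaccess rule or a server-side script would make, not the configuration from the tutorial linked above; the function name and the example hosts are hypothetical.

```python
from urllib.parse import urlparse


def allow_image_request(referer, own_hosts, allow_empty=True):
    """Decide whether to serve an image based on the Referer header."""
    if not referer:
        # Some browsers, proxies and firewalls strip the Referer header,
        # so refusing empty values can break images for real visitors.
        return allow_empty
    host = (urlparse(referer).hostname or "").lower()
    return host in own_hosts


own = {"www.mysite.example", "mysite.example"}
print(allow_image_request("http://www.mysite.example/cards.html", own))  # True
print(allow_image_request("http://scraper.example/copy.html", own))      # False
print(allow_image_request("", own))                                      # True
```

Because the Referer header is supplied by the client, it can be forged or missing, so this kind of check discourages casual hotlinking more than it stops a determined copier.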
  #21 (permalink)  
Old 11-22-2005, 12:57 PM
WebProWorld 1,000+ Club
WebProWorld MVP
 
Join Date: May 2005
Location: Norway
Posts: 6,635
kgun RepRank 4
Default Martin

Yes. Use .htaccess to get control of your web server.

You may also use PHP to prevent hot linking from other sites.

http://www.sitepoint.com/books/

http://www.sitepoint.com/books/phpan...2ebb8dedea92f3

Chapter 7.

Books that should be in every web designer's book collection.

You may download the first chapters free.

http://www.sitepoint.com/forums/
  #22 (permalink)  
Old 11-22-2005, 01:23 PM
WebProWorld Veteran
 
Join Date: Oct 2004
Posts: 451
wednesday RepRank 0
Default

You can prevent your site from being downloaded by using bot traps. For example: http://www.google.com/search?q=site%...world.com+trap

You must have Google as the referrer, because they are cloaking.
  #23 (permalink)  
Old 11-22-2005, 02:06 PM
WebProWorld 1,000+ Club
WebProWorld MVP
 
Join Date: May 2005
Location: Norway
Posts: 6,635
kgun RepRank 4
Default Again there is a simple method

"Setting a Spider-trap
The best method of identifying bad bots is to create what is known as a Spider-trap. Create a directory, block that directory to all agents using robots.txt and link to the directory from a page (usually as a small 1x1 pixel link).

Only bad bots will access that directory (i.e. they've ignored our robots.txt exclusion). These bots can then be directed to a script that will immediately grab their IP address, User Agent or Referrer and add it to an .htaccess file - so that they're banned from the site."


http://www.webproworld.com/viewtopic...cd74b4cd6fb731

And you may again use PHP.
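
The spider trap quoted above can be sketched in Python as well. This is a minimal illustration under my own assumptions, not the script from the linked post: the /trap/ directory name, the function names and the sample IP addresses are hypothetical, and the generated lines use the older Apache 2.2-style "Deny from" syntax (newer Apache versions express this differently).

```python
from urllib.robotparser import RobotFileParser

# robots.txt that blocks the (hypothetical) trap directory for all agents.
ROBOTS_TXT = """\
User-agent: *
Disallow: /trap/
"""


def well_behaved_bot_may_fetch(path, user_agent="ExampleBot"):
    """True if a bot that honours robots.txt would be allowed to request `path`."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, path)


def ban_lines(trap_hits):
    """Turn (ip, user_agent) pairs recorded in the trap into deny lines."""
    seen = []
    for ip, _user_agent in trap_hits:
        if ip not in seen:  # one line per address, in first-seen order
            seen.append(ip)
    return ["Deny from " + ip for ip in seen]


print(well_behaved_bot_may_fetch("/trap/index.html"))  # False
print(well_behaved_bot_may_fetch("/index.html"))       # True

# Hypothetical visitors that requested the trap anyway.
hits = [("203.0.113.9", "BadBot/1.0"), ("203.0.113.9", "BadBot/1.0")]
print(ban_lines(hits))  # ['Deny from 203.0.113.9']
```

Anything that shows up in the trap ignored the exclusion, which is the whole signal; appending the resulting lines to a live .htaccess file automatically is a separate step that deserves care, since a mistake there can lock out real visitors.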
  #24 (permalink)  
Old 11-22-2005, 04:36 PM
WebProWorld 1,000+ Club
WebProWorld MVP
 
Join Date: May 2005
Location: Norway
Posts: 6,635
kgun RepRank 4
Default PHP functions getting environment variable

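<?php
// Referer and User-Agent are sent by the client, so they can be missing or forged.
$url = isset($_SERVER["HTTP_REFERER"]) ? $_SERVER["HTTP_REFERER"] : "";
$browser = isset($_SERVER["HTTP_USER_AGENT"]) ? $_SERVER["HTTP_USER_AGENT"] : "";
// REMOTE_ADDR is filled in by the web server from the connection itself.
$ip = $_SERVER["REMOTE_ADDR"];
?>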
http://www.w3schools.com/php/php_functions.asp

Related link:
http://no2.php.net/getenv
  #25 (permalink)  
Old 11-25-2005, 05:51 AM
WebProWorld Pro
 
Join Date: May 2005
Location: England
Posts: 131
thebloke RepRank 0
Default Re: Martin

Quote:
Originally Posted by kgun
Use .htaccess to get control of your web server.
I think this can only be used on an Apache server? What about Microsoft?
  #26 (permalink)  
Old 11-26-2005, 03:04 AM
WebProWorld Pro
 
Join Date: Apr 2004
Posts: 167
roam_dx RepRank 0
Default

Quote:
I sort of know that scraper sites exist and I've got a vague idea what they do. But I have no idea how to identify them and I don't really know what the impact on our sites would be.

Here's what you're looking for:

http://www.webproworld.com/viewtopic.php?p=264747
  #27 (permalink)  
Old 11-27-2005, 09:47 PM
WebProWorld Veteran
 
Join Date: Oct 2005
Posts: 532
aaron2005 RepRank 0
Default

Here is the best tool to find scraper sites, period.

http://www.linkhounds.com/link-harvester/backlinks.php
__________________
SEO Blog
All times are GMT -4.