WebProWorld Part of WebProNews.com

Search Engine Optimization Forum SEO is much easier with help from peers and experts! The WebProWorld SEO forum is for the discussion and exploration of various search engine optimization topics. Any non (engine) specific SEO or SEM topics should go here.

  #1 (permalink)  
Old 11-21-2005, 05:20 AM
Psychobel Psychobel is offline
WebProWorld Member
 

Join Date: Oct 2005
Location: Manchester
Posts: 81
Psychobel RepRank 0
Default How to Identify a Scraper Site

First of all, apologies if this is requesting stuff that is already out there. I couldn't find the answer that I was looking for.

I sort of know that scraper sites exist and I've got a vague idea what they do. But I have no idea how to identify them and I don't really know what the impact on our sites would be.

I was playing with the Google Sitemap tool, in particular the allinurl: command, and I found a couple of sites that referenced our site and led to a 404 page with a "URL moved" description. It looked odd.

Any pointers gratefully received.
  #2 (permalink)  
Old 11-21-2005, 05:03 PM
Andilinks Andilinks is offline
WebProWorld Veteran
 

Join Date: Feb 2004
Location: NM, USA
Posts: 773
Andilinks RepRank 0
Default

Here are a few hallmarks of a "scraper site." Bear in mind that what matters is not the scraping itself (Google and other engines scrape in a literal sense) but the usefulness of the site. If a site actually has utility, it is not a "scraper site" in the sense of being parasitic.

Scraper sites make little sense unless they scrape entire pages. Usually they take only snippets to avoid copyright issues, so they read as gibberish if you go past the first sentence.

Sites that use snippets but actually are useful in some way are not scrapers since this requires some editing, juxtaposition...

Scraper sites are not updated.

There are more definitions I'm sure, some more objective than others.

Andi
__________________
...the Rockies may tumble, Gibraltar may crumble... G & I Gershwin, 1937
  #3 (permalink)  
Old 11-21-2005, 05:05 PM
neuron neuron is offline
WebProWorld New Member
 

Join Date: May 2004
Location: South Florida
Posts: 8
neuron RepRank 0
Default How to build a scrap(p)er site

If I wanted to create a scraper page for "waterproof widgets" I would search on this phrase at any of the major search engines and then copy the SERPs page that is the result, post it to a page called www.domain.com/waterproof-widgets.html and then post Adsense on it.

By collecting the scraps from the SERPs I'm likely to get "content" that is highly relevant to the search engines, and the page can very likely rank for the result.

If I wanted to get fancy I could actually go pull more content from each of the links listed, or I could pull the results from all three major engines and then mix them up to get more content, make that content more original, and make it less detectable.

To get even fancier, you can pull the results from the SERPs on a daily basis via automated script so that your pages are updated regularly, giving you even more pull with the search engines.
  #4 (permalink)  
Old 11-21-2005, 05:17 PM
Andilinks Andilinks is offline
WebProWorld Veteran
 

Join Date: Feb 2004
Location: NM, USA
Posts: 773
Andilinks RepRank 0
Default

hmmmm... I don't think we needed instructions on how to build a scraper site.

Yes, there are lots of scraper programs. They advertise on AdWords a lot; just search "scraper." The sophistication they employ to enable theft is chilling. I have tried a few as shareware trials.

These programs are expensive and likely to be obsolete in the next Google update. In general I think the game is about over for the scrapers anyway. Google is hot on their case (or so it would seem).

Andi
__________________
...the Rockies may tumble, Gibraltar may crumble... G & I Gershwin, 1937
  #5 (permalink)  
Old 11-21-2005, 05:25 PM
tobyd tobyd is offline
WebProWorld New Member
 

Join Date: Jun 2004
Posts: 12
tobyd RepRank 0
Default

http://www.copyscape.com/ - enter your URL and it will display any pages that use your content.
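
If you want to check a specific suspect page by hand, the general idea of "does this page reuse my text" can be sketched in a few lines of Python. This is only an illustration, not how Copyscape works internally (that isn't described in this thread). It assumes you have already saved your own copy and the suspect page's text as plain strings, and the names `shingles` and `overlap` are made up for this example. It compares overlapping five-word sequences, so short scraped snippets still show up.

```python
import re

def shingles(text, size=5):
    """Return the set of overlapping word sequences ("shingles") in text."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    if len(words) < size:
        # Text shorter than one shingle: treat the whole text as one unit.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}

def overlap(original, suspect, size=5):
    """Fraction of the original's shingles that also appear in the suspect text."""
    own = shingles(original, size)
    if not own:
        return 0.0
    return len(own & shingles(suspect, size)) / len(own)
```

A result close to 1.0 suggests the page carries most of your copy. A small but non-zero value is more in line with snippet scraping, or with shared boilerplate such as a syndicated article.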
  #6 (permalink)  
Old 11-21-2005, 06:59 PM
Weedy Lady Weedy Lady is offline
WebProWorld Veteran
 

Join Date: Nov 2003
Location: mid south USA
Posts: 374
Weedy Lady RepRank 0
Default copyscape

I know there are sites that use my content. I find them through the "referrer" stats because they just copy and paste my entire code, which means they are hotlinking to my graphics.

The copyscape site that you listed only showed sites that link to mine... not any that are copying my stuff.

Links are good. Copying is bad. The copyscape site was not helpful.
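
For anyone who wants to automate the referrer check described above, here is a rough Python sketch. It assumes the server writes the common Apache "combined" log format (the host's actual format may differ), and `hotlinkers` and `own_hosts` are names invented for this example. It counts image requests whose referrer is some other site, which is what a page that copied the HTML wholesale would generate.

```python
import re
from collections import Counter
from urllib.parse import urlparse

# Assumption: Apache/NCSA "combined" log lines, e.g.
# 1.2.3.4 - - [date] "GET /img/a.gif HTTP/1.1" 200 512 "http://ref/" "agent"
LINE = re.compile(r'"(?:GET|HEAD) (?P<path>\S+)[^"]*" \d{3} \S+ "(?P<referer>[^"]*)"')
IMAGE = re.compile(r"\.(?:gif|jpe?g|png)(?:\?.*)?$", re.IGNORECASE)

def hotlinkers(lines, own_hosts):
    """Count image requests per external referring host."""
    counts = Counter()
    for line in lines:
        match = LINE.search(line)
        if not match or not IMAGE.search(match.group("path")):
            continue  # not a parsable request, or not an image
        host = (urlparse(match.group("referer")).hostname or "").lower()
        if host and host not in own_hosts:
            counts[host] += 1  # image loaded from a page on someone else's site
    return counts
```

Feeding it the lines of an access log plus a set of your own hostnames gives a tally of outside sites pulling your graphics. Some of those will be harmless (web mail, translation or cache pages), so each one still needs a look by hand.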
__________________
The Weedy Lady at
http://www.happydaycards.com
Free E Cards for holidays and all occasions, fun pages and great recipes.
  #7 (permalink)  
Old 11-21-2005, 07:30 PM
BlackCat2 BlackCat2 is offline
WebProWorld New Member
 

Join Date: Jan 2004
Location: Texas
Posts: 13
BlackCat2 RepRank 0
Default Re: copyscape

Quote:
Originally Posted by Weedy Lady
The copyscape site that you listed only showed sites that link to mine... not any that are copying my stuff.
Well, then probably no sites were using anything from that page. I got great results with copyscape.

Of course, the page I used includes some free-to-use articles for websites. It was still a good, valid test of how effective copyscape is.

Plus it fed my curiosity as to how many other sites had that same article, so there is yet another use for copyscape if you use free content articles on your sites.

It even picked up websites using the same affiliate text from the merchant as I have used. =)

Heidi
  #8 (permalink)  
Old 11-21-2005, 07:40 PM
danielle v2.1b danielle v2.1b is offline
WebProWorld Member
 

Join Date: Jan 2004
Location: In bed at home
Posts: 88
danielle v2.1b RepRank 0
Default Re: copyscape

Quote:
Originally Posted by Weedy Lady
I know there are sites that use my content. I find them through the "referrer" stats because they just copy and paste my entire code, which means they are hotlinking to my graphics.
One of my web hosts (http://www.fxstudios.net) offers hotlink protection on accounts so that others can't steal your images. One thing is to steal your content, but your bandwidth too? That's an outrage! Check with your hosting provider; not all offer hotlink protection, but if this is a serious issue for you, it can be a step towards dissuading some from content theft.

I checked a couple of client URLs and it did draw some duplicate content which could be confused with scrapers. For example we have a standard company introduction for one client which is often used on third party sites as link text descriptor, as well as on the homepage of the client company's website. It seems a useful tool; thanks for sharing it.
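
For the curious, hotlink protection generally comes down to checking the Referer header on each image request. The Python sketch below shows that decision in isolation. It is not fxstudios' implementation (which isn't described here), and `allow_image_request`, `own_hosts` and `allow_empty` are names made up for the example.

```python
from urllib.parse import urlparse

def allow_image_request(referer, own_hosts, allow_empty=True):
    """Decide whether to serve an image based on the request's Referer header."""
    if not referer:
        # Some browsers, proxies and firewalls send no Referer at all.
        # Refusing these requests would break images for genuine visitors.
        return allow_empty
    host = (urlparse(referer).hostname or "").lower()
    # Serve only when the referring page is on one of our own hostnames.
    return host in own_hosts
```

The `allow_empty` switch is the main design choice: turning it off blocks more hotlinkers, but it also hides graphics from real visitors whose software strips the header. Since the header can be forged, this deters casual copying rather than preventing it.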
__________________
If you've worked in the Adult SEO industry, please tell me... how do you get it up?
  #9 (permalink)  
Old 11-21-2005, 07:51 PM
Weedy Lady Weedy Lady is offline
WebProWorld Veteran
 

Join Date: Nov 2003
Location: mid south USA
Posts: 374
Weedy Lady RepRank 0
Default hotlinking protection

Yes, I am with fxstudios also (fantastic company!), but because of the java script I use on about 50 of my pages I can't use the hotlinking feature. I am hoping to find time after the holidays to completely redesign those pages to implement the hotlinking protection. However, the last time I turned it on as an experiment one of my friends could not get any of my graphics on my pages, and that does not include the ones with the Anfy java script on them. I change the "location" of my graphics every few days and that helps a bit.
__________________
The Weedy Lady at
http://www.happydaycards.com
Free E Cards for holidays and all occasions, fun pages and great recipes.
  #10 (permalink)  
Old 11-21-2005, 08:00 PM
Psychobel Psychobel is offline
WebProWorld Member
 

Join Date: Oct 2005
Location: Manchester
Posts: 81
Psychobel RepRank 0
Default

It certainly was useful to give me an overview of the how-to. That made the penny drop for me, to some degree. The copyscape site has thrown up a few instances where our copy has shown up in really poor quality sites. So this leads on naturally to a few more questions:

1> What action can be taken? Legal?
2> I'm assuming turning them in to Google for AdSense abuse may be useful. True?
3> What's the potential impact on my SERPs?

Thanks for all your help, I'm a wiser man already
  #11 (permalink)  
Old 11-21-2005, 11:37 PM
google junky google junky is offline
WebProWorld Veteran
 

Join Date: Jun 2004
Location: Indiana
Posts: 484
google junky RepRank 0
Default

Quote:
Originally Posted by Psychobel
It certainly was useful to give me an overview of the how-to. That made the penny drop for me, to some degree. The copyscape site has thrown up a few instances where our copy has shown up in really poor quality sites. So this leads on naturally to a few more questions:

1> What action can be taken? Legal?
2> I'm assuming turning them in to Google for AdSense abuse may be useful. True?
3> What's the potential impact on my SERPs?

Thanks for all your help, I'm a wiser man already
We covered this in a topic in the Google section of the forums. Read my post here:
http://www.webproworld.com/viewtopic...hlight=#262082


Read the others as well if you like.
  #12 (permalink)  
Old 11-22-2005, 01:37 AM
ADAM Web Design ADAM Web Design is offline
WebProWorld 1,000+ Club
 

Join Date: Dec 2003
Location: Toronto, Ontario, Canada
Posts: 2,217
ADAM Web Design RepRank 0
Default

Well...the first part, as google junky pointed out, was covered to death. So I won't go there again.

So, with that said, on to 2) and 3).

2) If Google has AdSense running on the SERP pages, it is unlikely they'll act any time soon. That's extra money in their coffers, and they're a for-profit corporation. So they've got a vested interest in keeping the spammers in.

Mind you, sooner or later they'll have to crack down on this type of thing...not because of irrelevant SERPs and scraper pages finding their way in, but because of advertisers who pull out because they're pissed off at the irrelevant scraper pages they find their ads on. It may take a while, but Google will act...eventually.

3) Minimal, in the worst case. You don't control the content of the scraper pages, or any other pages that you don't have FTP or other access to. It would be all too easy for competitors to knock other competitors out of the SERPs.

I'm on at least 20 scraper pages that I know about (and hundreds more I probably don't), and nothing has ever happened to me in this regard. Again, I don't control it.

If you want to see some of them, use Google and search for "ADAM Web Design" and then one of 5fish.net , 360mediaworx.com , elehost.com , or abacus.ca . I'm not going to provide direct links since linking to spam is bad, mmmkay?
  #13 (permalink)  
Old 11-22-2005, 05:29 AM
thebloke thebloke is offline
WebProWorld Pro
 

Join Date: May 2005
Location: England
Posts: 119
thebloke RepRank 0
Default

I used to work for a company that took a question set input by the user, then used Java to scrape every known car insurance site in the UK and return an insurance quote from each of them.

A lot of companies have got wise to this sort of thing, which is why you see those distorted images with a random series of alphanumeric characters that you have to type in before you can proceed. Automated scrapers can't get past this.
  #14 (permalink)  
Old 11-22-2005, 05:47 AM
Psychobel Psychobel is offline
WebProWorld Member
 

Join Date: Oct 2005
Location: Manchester
Posts: 81
Psychobel RepRank 0
Default

Quote:
Originally Posted by thebloke
I used to work for a company that took a question set input by the user, then used Java to scrape every known car insurance site in the UK and return an insurance quote from each of them.

A lot of companies have got wise to this sort of thing, which is why you see those distorted images with a random series of alphanumeric characters that you have to type in before you can proceed. Automated scrapers can't get past this.
The problem with that approach is that it's not accessible and, strictly speaking, breaches UK accessibility legislation. It's also not great from a usability point of view for your average punter.
  #15 (permalink)  
Old 11-22-2005, 06:42 AM
thebloke thebloke is offline
WebProWorld Pro
 

Join Date: May 2005
Location: England
Posts: 119
thebloke RepRank 0
Default

Quote:
Originally Posted by Psychobel
The problem with that approach is that it's not accessible and, strictly speaking, breaches UK accessibility legislation. It's also not great from a usability point of view for your average punter.
I guess so. A lot of the big guns do it though - check out the login page for www.overture.com, for example. It seems to be the done thing for stopping scrapers.
  #16 (permalink)  
Old 11-22-2005, 08:48 AM
kgun kgun is offline
WebProWorld 1,000+ Club
 

Join Date: May 2005
Location: Norway
Posts: 4,565
kgun RepRank 3
Default

One thing is to steal your content, but your bandwidth too?

1. Syndicate your content with RSS.
2. Take control of your server.
http://www.webproworld.com/viewtopic.php?t=55223
  #17 (permalink)  
Old 11-22-2005, 08:48 AM
aaron2005 aaron2005 is offline
WebProWorld Veteran
 

Join Date: Oct 2005
Posts: 529
aaron2005 RepRank 0
Default

Quote:
Originally Posted by tobyd
http://www.copyscape.com/ - enter your url and it will display any pages that use your content.
Thanks for the helpful link.

If you do a search for people who backlink to you, you will find many splog scrapers. I think ADAM is correct: it really doesn't hurt you.

Think about how stupid it would be for a search engine to punish you for something you can't control.

And it's good to see more ladies in the house. :)
__________________
SEO Blog
  #18 (permalink)  
Old 11-22-2005, 10:41 AM
writergrrrl48 writergrrrl48 is offline
WebProWorld Pro
 

Join Date: Oct 2003
Location: western Colorado
Posts: 164
writergrrrl48 RepRank 0
Default Still a little confused

What is the purpose of a scraper site? I don't understand why someone would bother to build one ... seems like a waste of time.
__________________
Shirley Bradbury - Woman-owned web business in the wilds of Western Colorado
Web Site Design & Marketing Web conferencing services
  #19 (permalink)  
Old 11-22-2005, 11:50 AM
maxsun maxsun is offline
WebProWorld Member
 

Join Date: Nov 2004
Location: Washington State
Posts: 45
maxsun RepRank 0
Default Copyright infringement

We used copyscape and were amazed at how brazen people were at completely stealing our content, verbatim.

We contacted the worst offenders (one was even a ParaLegal!) and they all removed the content ASAP.

At copyscape they say you can contact the offender's hosting company, ICANN, and the search engines. It's pretty much a bad thing to get branded as a plagiarist.

We have a SERIOUS problem with people stealing our images on eBay. We've contacted eBay multiple times, and the offenders multiple times. And eBay, so far, hasn't even replied to us!

It's a serious issue and I'd sure like to know how to protect our images. We're on an Apache server (Unix). Any great ideas that won't affect the coming Xmas traffic or greatly upset the website?
__________________
MaxSun

http://www.LuckyGemstones.com
  #20 (permalink)  
Old 11-22-2005, 12:12 PM