11-21-2005, 05:20 AM
|
|
WebProWorld Member
|
|
Join Date: Oct 2005
Location: Manchester
Posts: 81
|
|
How to Identify a Scraper Site
First of all, apologies if this asks for something that is already out there. I couldn't find the answer that I was looking for.
I sort of know that scraper sites exist and I've got a vague idea what they do. But I have no idea how to identify them and I don't really know what the impact on our sites would be.
I was playing with the Google Sitemap tool, in particular, the AllinURL command and I found a couple of sites that referenced our site and led to a 404 page with a "URL moved" description. It looked odd.
Any pointers gratefully received.
|

11-21-2005, 05:03 PM
|
 |
WebProWorld Veteran
|
|
Join Date: Feb 2004
Location: NM, USA
Posts: 773
|
|
Here are a few hallmarks of a "scraper site." Bear in mind that what matters is not the scraping itself--Google and other engines scrape in a literal sense--but the usefulness of the site. If a site actually has utility it is not a "scraper site" in the sense of being parasitic.
Scraper sites usually make little sense unless they are scraping entire pages, but usually they just take snippets to avoid copyright issues and so are gibberish if you read more than one sentence.
Sites that use snippets but actually are useful in some way are not scrapers since this requires some editing, juxtaposition...
Scraper sites are not updated.
There are more definitions I'm sure, some more objective than others.
Andi
__________________
...the Rockies may tumble, Gibralter may crumble... G & I Gershwin, 1937
|

11-21-2005, 05:05 PM
|
|
WebProWorld New Member
|
|
Join Date: May 2004
Location: South Florida
Posts: 8
|
|
How to build a scrap(p)er site
If I wanted to create a scraper page for "waterproof widgets" I would search on this phrase at any of the major search engines and then copy the SERPs page that is the result, post it to a page called www.domain.com/waterproof-widgets.html and then post Adsense on it.
By collecting the scraps from the SERPs I'm likely to get "content" that is highly relevant to the search engines, and the page can very likely rank for the result.
If I wanted to get fancy I could actually go pull more content from each of the links listed, or I could pull the results from all three major engines and then mix them up to get more content, make that content more original, and make it less detectable.
To get even fancier, you can pull the results from the SERPs on a daily basis via automated script so that your pages are updated regularly, giving you even more pull with the search engines.
|

11-21-2005, 05:17 PM
|
 |
WebProWorld Veteran
|
|
Join Date: Feb 2004
Location: NM, USA
Posts: 773
|
|
hmmmm... I don't think we needed instructions on how to build a scraper site.
Yes, there are lots of scraper programs--they advertise on AdWords a lot, just search "scraper." The sophistication they employ to enable theft is chilling; I have tried a few as shareware trials.
These programs are expensive and likely to be obsolete in the next Google update. In general I think the game is about over for the scrapers anyway; Google is hot on their case (or so it would seem).
Andi
__________________
...the Rockies may tumble, Gibralter may crumble... G & I Gershwin, 1937
|

11-21-2005, 05:25 PM
|
|
WebProWorld New Member
|
|
Join Date: Jun 2004
Posts: 12
|
|
http://www.copyscape.com/ - enter your url and it will display any pages that use your content.
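For anyone curious how a duplicate-content check can work in principle, here is a minimal sketch. It is not Copyscape's actual method (that is not described in this thread); it assumes you already have the text of your own page and of a suspect page, and it uses a generic technique: comparing overlapping runs of words ("shingles"). The function names `shingles` and `overlap` and the default run length of 5 words are illustrative choices only.

```python
import re


def shingles(text, size=5):
    """Return the set of overlapping word runs ("shingles") found in text."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    if len(words) < size:
        # Text shorter than one shingle: treat the whole text as a single run.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def overlap(original, suspect, size=5):
    """Fraction of the original's shingles that also appear in the suspect text."""
    own = shingles(original, size)
    other = shingles(suspect, size)
    if not own:
        return 0.0
    return len(own & other) / len(own)
```

A score near 1.0 suggests the suspect page contains most of your text verbatim; a score near 0.0 suggests little or no copied wording. Where to draw the line in between is a judgment call.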
|

11-21-2005, 06:59 PM
|
 |
WebProWorld Veteran
|
|
Join Date: Nov 2003
Location: mid south USA
Posts: 374
|
|
copyscape
I know there are sites that use my content. I find them through the "referrer" stats because they just copy and paste my entire code, which means they are hotlinking to my graphics.
The copyscape site that you listed only showed sites that link to mine, not any that are copying my stuff.
Links are good. Copying is bad. The copyscape site was not helpful.
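The referrer approach described above can be automated. The following is a rough sketch, assuming the server writes its access log in the common "combined" format (the one with quoted referrer and user-agent fields at the end). The function name `hotlinking_hosts`, the list of image extensions, and the `own_hosts` parameter are all hypothetical; adjust them to your own setup.

```python
import re
from collections import Counter
from urllib.parse import urlparse

# Combined log format: ... "GET /path HTTP/1.1" status bytes "referrer" "user-agent"
LOG_LINE = re.compile(
    r'"(?:GET|HEAD) (?P<path>\S+) [^"]*" \d{3} \S+ "(?P<referer>[^"]*)"'
)
IMAGE_EXT = (".gif", ".jpg", ".jpeg", ".png")


def hotlinking_hosts(lines, own_hosts):
    """Count image requests per referring host that is not one of own_hosts."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.search(line)
        if not match:
            continue  # Not a GET/HEAD line in the expected format.
        path = match.group("path").split("?", 1)[0].lower()
        if not path.endswith(IMAGE_EXT):
            continue  # Only image requests are of interest here.
        host = (urlparse(match.group("referer")).hostname or "").lower()
        if host and host not in own_hosts:
            counts[host] += 1  # An outside page caused this image to load.
    return counts
```

Passing an open log file as `lines` and a set of your own hostnames as `own_hosts` returns a count per outside host; the hosts with the highest counts are the pages most worth a look. Requests with an empty or "-" referrer are skipped, since they cannot be attributed to any page.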
|

11-21-2005, 07:30 PM
|
 |
WebProWorld New Member
|
|
Join Date: Jan 2004
Location: Texas
Posts: 13
|
|
Re: copyscape
Quote:
|
Originally Posted by Weedy Lady
The copyscape site that you listed only showed sites that link to mine, not any that are copying my stuff.
|
Well, then probably no sites were using anything from that page. I got great results with copyscape.
Of course, the page I used includes some free-to-use articles for websites. It was still a good, valid test of how effective it is.
Plus it fed my curiosity as to how many other sites had that same article, so there is yet another use for copyscape if you use free content articles on your sites.
It even picked up websites using the same text-based affiliate copy from the merchant that I have used. =)
Heidi
|

11-21-2005, 07:40 PM
|
 |
WebProWorld Member
|
|
Join Date: Jan 2004
Location: In bed at home
Posts: 88
|
|
Re: copyscape
Quote:
|
Originally Posted by Weedy Lady
I know there are sites that use my content. I find them through the "referrer" stats because they just copy and paste my entire code, which means they are hotlinking to my graphics.
|
One of my web hosts (http://www.fxstudios.net) offers hotlink protection on accounts so that others can't steal your images. It's one thing to steal your content, but your bandwidth too? That's an outrage! Check with your ISP; not all offer hotlink protection, but if this is a serious issue for you it can be a step towards dissuading some from content theft.
I checked a couple of client URLs and it did draw some duplicate content which could be confused with scrapers. For example we have a standard company introduction for one client which is often used on third party sites as link text descriptor, as well as on the homepage of the client company's website. It seems a useful tool; thanks for sharing it.
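For readers wondering what hotlink protection does under the hood, the usual idea is a check on the Referer header the browser sends with each image request. The sketch below shows only that decision logic; it is not how fxstudios or any particular host implements the feature, and the function name `allow_image_request` and its parameters are made up for illustration.

```python
from urllib.parse import urlparse


def allow_image_request(referer, own_hosts, allow_empty=True):
    """Decide whether to serve an image, based on the request's Referer header.

    own_hosts is a set of hostnames that may embed the image; subdomains of
    those hosts are accepted too. allow_empty controls what happens when no
    usable referrer is sent (direct visits, privacy tools, some proxies).
    """
    if not referer:
        return allow_empty
    host = (urlparse(referer).hostname or "").lower()
    if not host:
        return allow_empty  # Malformed header: treat it like a missing one.
    return host in own_hosts or any(host.endswith("." + h) for h in own_hosts)
```

Allowing empty referrers by default is a deliberate trade-off: blocking them shuts out more hotlinkers, but it also hides images from legitimate visitors whose browsers or proxies strip the header. The Referer header can also be forged, so this discourages casual hotlinking rather than preventing determined copying.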
__________________
If you've worked in the Adult SEO industry, please tell me... how do you get it up?
|

11-21-2005, 07:51 PM
|
 |
WebProWorld Veteran
|
|
Join Date: Nov 2003
Location: mid south USA
Posts: 374
|
|
hotlinking protection
Yes, I am with fxstudios also (fantastic company!), but because of the java script I use on about 50 of my pages I can't use the hotlinking feature. I am hoping to find time after the holidays to completely redesign those pages to implement the hotlinking protection. However, the last time I turned it on as an experiment one of my friends could not get any of my graphics on my pages, and that does not include the ones with the Anfy java script on them. I change the "location" of my graphics every few days and that helps a bit.
|

11-21-2005, 08:00 PM
|
|
WebProWorld Member
|
|
Join Date: Oct 2005
Location: Manchester
Posts: 81
|
|
It certainly was useful to give me an overview of the how to. That made the penny drop, to some degree for me. The copyscape site has thrown up a few instances where our copy has shown up in really poor quality sites. So this leads on naturally to a few more questions:
1> What action can be taken? Legal?
2> I'm assuming reporting them to Google for AdSense abuse may be useful. True?
3> What's the potential impact on my SERPs?
Thanks for all your help, I'm a wiser man already
|

11-21-2005, 11:37 PM
|
|
WebProWorld Veteran
|
|
Join Date: Jun 2004
Location: Indiana
Posts: 484
|
|
Quote:
|
Originally Posted by Psychobel
It certainly was useful to give me an overview of the how to. That made the penny drop, to some degree for me. The copyscape site has thrown up a few instances where our copy has shown up in really poor quality sites. So this leads on naturally to a few more questions:
1> What action can be taken? Legal?
2> I'm assuming reporting them to Google for AdSense abuse may be useful. True?
3> What's the potential impact on my SERPs?
Thanks for all your help, I'm a wiser man already
|
We covered this in a topic on the google section of the forums. Read my post here
http://www.webproworld.com/viewtopic...hlight=#262082
Read the others as well if you like.
|

11-22-2005, 01:37 AM
|
 |
WebProWorld 1,000+ Club
|
|
Join Date: Dec 2003
Location: Toronto, Ontario, Canada
Posts: 2,217
|
|
Well...the first part, as google junky pointed out, was covered to death. So I won't go there again.
So, with that said, on to 2) and 3).
2) If Google has AdSense running on the SERP pages, it is unlikely they'll act any time soon. That's extra money in their coffers, and they're a for-profit corporation. So they've got a vested interest in keeping the spammers in.
Mind you, sooner or later they'll have to crack down on this type of thing...not because of irrelevant SERPs and scraper pages finding their way in, but because of advertisers who pull out because they're pissed off at the irrelevant scraper pages they find their ads on. It may take a while, but Google will act...eventually.
3) Minimal, in the worst case. You don't control the content of the scraper pages, or any other pages that you don't have FTP or other access to. It would be all too easy for competitors to knock other competitors out of the SERPs.
I'm on at least 20 scraper pages that I know about (and hundreds more I probably don't), and nothing has ever happened to me in this regard. Again, I don't control it.
If you want to see some of them, use Google and search for "ADAM Web Design" and then one of 5fish.net , 360mediaworx.com , elehost.com , or abacus.ca . I'm not going to provide direct links since linking to spam is bad, mmmkay?
|

11-22-2005, 05:29 AM
|
|
WebProWorld Pro
|
|
Join Date: May 2005
Location: England
Posts: 119
|
|
I used to work for a company that took a question set input by the user, then used Java to scrape every known car insurance site in the UK and return an insurance quote from each of them.
A lot of companies have got wise to this sort of thing, which is why you see those distorted images with a random series of alphanumeric characters that you have to type in before you can proceed. Automated scrapers can't get past this.
|

11-22-2005, 05:47 AM
|
|
WebProWorld Member
|
|
Join Date: Oct 2005
Location: Manchester
Posts: 81
|
|
Quote:
|
Originally Posted by thebloke
I used to work for a company that took a question set input by the user, then used Java to scrape every known car insurance site in the UK and return an insurance quote from each of them.
A lot of companies have got wise to this sort of thing, which is why you see those distorted images with a random series of alphanumeric characters that you have to type in before you can proceed. Automated scrapers can't get past this.
|
The problem with that approach is that it's not accessible and, strictly speaking, breaches UK accessibility legislation. It's also not great from a usability point of view for your average punter.
|

11-22-2005, 06:42 AM
|
|
WebProWorld Pro
|
|
Join Date: May 2005
Location: England
Posts: 119
|
|
Quote:
|
Originally Posted by Psychobel
The problem with that approach is that it's not accessible and, strictly speaking, breaches UK accessibility legislation. It's also not great from a usability point of view for your average punter.
|
I guess so. A lot of the big guns do it though - check out the login page for www.overture.com, for example. It seems to be the done thing for stopping scrapers.
|

11-22-2005, 08:48 AM
|
 |
WebProWorld 1,000+ Club
|
|
Join Date: May 2005
Location: Norway
Posts: 4,565
|
|
One thing is to steal your content, but your bandwidth too?
1. Syndicate your content with Rss.
2. Take control of your server.
http://www.webproworld.com/viewtopic.php?t=55223
|

11-22-2005, 08:48 AM
|
|
WebProWorld Veteran
|
|
Join Date: Oct 2005
Posts: 529
|
|
Quote:
|
Originally Posted by tobyd
http://www.copyscape.com/ - enter your url and it will display any pages that use your content.
|
Thanks for the helpful link.
If you do a search for people who backlink to you, you will find many splog scrapers. I think ADAM is correct; it really doesn't hurt you.
Think about how stupid it would be for a search engine to punish you for something you can't control.
And it's good to see more ladies in the house. :)
|

11-22-2005, 10:41 AM
|
 |
WebProWorld Pro
|
|
Join Date: Oct 2003
Location: western Colorado
Posts: 164
|
|
Still a little confused
What is the purpose of a scraper site? I don't understand why someone would bother to build one ... seems like a waste of time.
|

11-22-2005, 11:50 AM
|
|
WebProWorld Member
|
|
Join Date: Nov 2004
Location: Washington State
Posts: 45
|
|
Copyright infringement
We used copyscape and were amazed at how brazen people were at completely stealing our content, verbatim.
We contacted the worst offenders (one was even a paralegal!) and they all removed the content ASAP.
At copyscape they say you can contact their hosting company, ICANN, and the search engines. It's pretty much a bad thing to get branded as a plagiarist.
We have a SERIOUS problem with people stealing our images on eBay. We've contacted eBay multiple times, and the offenders multiple times. And eBay, so far, hasn't even replied to us!
It's a serious issue and I'd sure like to know how to protect our images. We're on an Apache Server (unix). Any great ideas that won't affect the coming Xmas traffic or greatly upset the website?
|
