| Webmaster Resources Discussion Forum Sitemaps and robots and logfiles -- Oh My! If you have any questions, comments, concerns and/or ideas about the tools currently available to webmasters to make their lives... 'easier'. Here's where you need to be. Know of a good tool? Post it here. Got something funny in your logfiles? Maybe we can help. |
Having read here that cloaking for a legitimate purpose does not seem to cause any problems with the SEs, I am thinking of cloaking my sitemap, using SE spider IP lists, so that only legitimate spiders can crawl it.
This is just a countermeasure against scrapers, who appear to use my sitemaps for scraping. Before I do it, I wanted to ask whether anyone can see any problem that cloaking a sitemap in this way could cause. Do you think it might help stop scraper sites? They do appear to be using my sitemaps in their scripts, and it seems I am offering them my site on a plate by allowing them access to my sitemaps.
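To make the idea concrete, here is a minimal sketch of an IP allowlist check for a sitemap request. The networks below are reserved documentation ranges standing in for a real spider IP list, and the function names and the redirect-to-homepage fallback are assumptions for illustration, not a tested cloaking setup.

```python
import ipaddress

# Hypothetical crawler ranges (RFC 5737 documentation addresses), NOT real
# search-engine IPs; a real list would have to come from a maintained source.
ALLOWED_NETWORKS = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("198.51.100.0/24"),
]


def may_fetch_sitemap(remote_addr: str) -> bool:
    """Return True if remote_addr falls inside one of the allowed networks."""
    try:
        addr = ipaddress.ip_address(remote_addr)
    except ValueError:
        return False  # malformed address: treat as not allowed
    return any(addr in net for net in ALLOWED_NETWORKS)


def sitemap_response(remote_addr: str) -> tuple[int, str]:
    """Serve the sitemap to allowed IPs, redirect everyone else to the homepage."""
    if may_fetch_sitemap(remote_addr):
        return 200, "/sitemap.xml"
    return 302, "/"
```

The check is only as good as the list behind it: a stale or incomplete list would lock out a legitimate spider just as readily as a scraper.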
__________________
2009 Hairstyles - Pictures of 2009 hairstyles and a virtual hairstyler demo. Price Comparison Site - Compare prices of well known brands and products. Last edited by chandrika; 04-02-2008 at 07:36 AM. |
|
||||
|
Well, I thought it would not be so easy for them to scrape my site if they can't access my sitemap. The sitemap makes it very easy for a scraper to find every URL on my site. They don't need to use bots or anything; they simply take the sitemap, and there they have a very nice index of every single page on my site.
When I see sites in Google that have scraped mine, I get to see a little of the scripts they use, and it appears that they use a link to my sitemap to generate the content for their site. Maybe it is not called scraping, because they do not actually take my content and display it anywhere; they simply present my content to Google as their own, using my page names, titles etc. But click the link in the search results and it is just a redirect to the affiliate page. For example, one such site doing this, as seen in the search results:
Fuji Z5fd Rasp Berry www. MYSITE.com/shopUK/sitemap.php?merchant=PC+World. www. MYSITE.com/shopUK/product/FUJI-Z5FD-MOCHA.html ... laptopwin DOT info/fuji-z5fd-rasp-berry/ - Similar pages - Note this
The link in the results simply redirects via an affiliate link to the merchant's site. It is really annoying me because it is not a one-off occurrence; a lot of similar sites are doing this and I want to do something about it. So I thought that if I cloak my sitemap, then sites like this, if they use my sitemap to create their pages and content, will be blocked from doing so. Maybe it's not important, maybe it doesn't matter and won't do any harm to my site, but I am not sure about that, so I want to understand it properly.
__________________
2009 Hairstyles - Pictures of 2009 hairstyles and a virtual hairstyler demo. Price Comparison Site - Compare prices of well known brands and products. Last edited by chandrika; 04-02-2008 at 07:19 PM. |
|
||||
|
I am actually an affiliate myself, and on my site I use merchants' datafeeds, which I download from the merchants and keep updated, to display price comparisons and a couple of other features.
The pages I am seeing in search results are scraped pages that redirect straight to a merchant (via the scraper's own affiliate link) but show Googlebot the page they scraped from my work. The sites doing this of course have no contact details, so I can't do anything except try to prevent them from being able to do it, which is why I thought cloaking my sitemap might put a stop to it.
__________________
2009 Hairstyles - Pictures of 2009 hairstyles and a virtual hairstyler demo. Price Comparison Site - Compare prices of well known brands and products. Last edited by chandrika; 04-02-2008 at 08:43 PM. |
|
||||
|
There is no way to hide from scrapers without hiding from SEs as well. Neither needs a site map in order to perform their task(s).
__________________
The Penn State Ticket Man http://www.pennstateticketman.com http://www.happyvalleytickets.com http://www.hounddogtours.com |
|
||||
|
Quote:
I have also tried limiting the number of allowed connections to my server, so that no single IP can have more than a set number of connections open without getting blocked. I was told that the SEs do not make so many connections, that only scrapers do, and the limit does not appear to have hampered the legitimate bots. But I do not think these sites are directly spidering my site. I know they can, but it would be a lot less easy for them than having the URLs handed to them in the sitemap, which can simply be loaded into their database for use in this link hijacking without spidering anything. I previously reported a couple of these sites to Google and I have noticed they have been removed, but I don't want to spend my days reporting stuff like that. I just want to secure my site so it isn't an issue and I can get on with my work.
__________________
2009 Hairstyles - Pictures of 2009 hairstyles and a virtual hairstyler demo. Price Comparison Site - Compare prices of well known brands and products. |
|
||||
|
Quote:
And, as there are legitimate uses for scrapers, I should not be surprised to find that some observe the rules set forth in a site's meta-data instructions to robots. The only way to stop any undesired bot from accessing your site, regardless of its purpose, is to identify and block the IP address(es) that it uses.
__________________
The Penn State Ticket Man http://www.pennstateticketman.com http://www.happyvalleytickets.com http://www.hounddogtours.com |
|
||||
|
Yes, I guess there are some legitimate scrapers. I have seen a couple that use scraped content in a useful way and provide a backlink.
So I am always going to run the risk of blocking legitimate visitors and bots if I try to tackle it that way. The specific problem, with the sites that use my sitemap URL in a script that shows Googlebot my content as theirs, is I suppose somewhat different from scraping: they are not actually scraping my site, just trying to trick Googlebot into thinking that my content is on their website when it isn't at all.
__________________
2009 Hairstyles - Pictures of 2009 hairstyles and a virtual hairstyler demo. Price Comparison Site - Compare prices of well known brands and products. |
|
||||
|
Quote:
Monitor both your site log(s) and Google for signs of illicit activity, and then: 1) Block the offending IP addresses from accessing your site(s); 2) Notify Google, and seek to have them remove the plagiarized content and/or block the offenders; 3) Seek the assistance of the offending parties' hosts and/or registrars; and/or, 4) Initiate legal action against said parties.
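As a rough illustration of the log-monitoring step, the sketch below counts sitemap requests per client IP. It assumes the server writes the common/combined access log format; the sample lines are made up, and the default path prefix is taken from the sitemap URL quoted earlier in the thread.

```python
import re
from collections import Counter

# Matches the start of a common/combined-format access log line:
# client IP, identity, user, [timestamp], "METHOD path protocol"
LOG_LINE = re.compile(r'^(\S+) \S+ \S+ \[[^\]]+\] "(?:GET|HEAD|POST) (\S+)')


def sitemap_hits_by_ip(lines, sitemap_prefix="/shopUK/sitemap.php"):
    """Count requests per client IP whose path starts with sitemap_prefix."""
    hits = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if match and match.group(2).startswith(sitemap_prefix):
            hits[match.group(1)] += 1
    return hits


# Made-up sample lines for illustration only.
sample = [
    '203.0.113.7 - - [02/Apr/2008:07:36:00 +0000] "GET /shopUK/sitemap.php?merchant=PC+World HTTP/1.1" 200 5120',
    '203.0.113.7 - - [02/Apr/2008:07:36:05 +0000] "GET /shopUK/sitemap.php?merchant=Other HTTP/1.1" 200 4096',
    '198.51.100.9 - - [02/Apr/2008:07:40:00 +0000] "GET /index.html HTTP/1.1" 200 1024',
]
print(sitemap_hits_by_ip(sample).most_common())  # [('203.0.113.7', 2)]
```

IPs that fetch the sitemap repeatedly without being a known spider are the candidates for step 1.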
__________________
The Penn State Ticket Man http://www.pennstateticketman.com http://www.happyvalleytickets.com http://www.hounddogtours.com |
|
||||
|
Thanks all for the advice, much appreciated.
I think I will set up a Google Alert to monitor when results like these come up. I hadn't thought of using Google Alerts for something like that, but I read about it somewhere and it will make it easier to keep an eye on what's going on. As for cloaking my sitemap, I will forget about that for now, as another thing I hadn't thought of is that using public IP lists of spiders is a pretty risky thing to do anyway.
__________________
2009 Hairstyles - Pictures of 2009 hairstyles and a virtual hairstyler demo. Price Comparison Site - Compare prices of well known brands and products. |
|
||||
|
Quote:
Good luck.
__________________
The Penn State Ticket Man http://www.pennstateticketman.com http://www.happyvalleytickets.com http://www.hounddogtours.com |
|
||||
|
Most website owners write their internal links like this:
<a href="/web_design_services">Web design services</a> This makes it much easier for automated bots to scrape your entire content. Once your website is copied to another server, / (the root) resolves to the scraper's domain instead of yours. To avoid this, always include your full website URL in the link's href, i.e. <a href="http://smartguy . com/web_design_services">Web Design</a> Now if somebody scrapes your content, he would actually be giving you a backlink. You can sit back and enjoy playing Mario while the scrapers bump their heads against their computers.
__________________
SEO Optimization Company - SEO Hawk - UK, US, Canada, and Australia SEO Optimisation UK | Latest SEO Blog on the Planet |
|
||||
|
Quote:
__________________
The Penn State Ticket Man http://www.pennstateticketman.com http://www.happyvalleytickets.com http://www.hounddogtours.com |
|
||||
|
Thanks, really appreciated all the advice here, Mario will have to wait though as I am still stuck on Zelda in the Cave Of Ordeals.
__________________
2009 Hairstyles - Pictures of 2009 hairstyles and a virtual hairstyler demo. Price Comparison Site - Compare prices of well known brands and products. |
|
||||
|
Quote:
__________________
The Penn State Ticket Man http://www.pennstateticketman.com http://www.happyvalleytickets.com http://www.hounddogtours.com |
|
||||
|
lol....yes...good point.
__________________
2009 Hairstyles - Pictures of 2009 hairstyles and a virtual hairstyler demo. Price Comparison Site - Compare prices of well known brands and products. |
|
||||
|
Quote:
They can easily scrape the content of an entire website and upload that same content to a new root. If yourwebsite . com uses / to represent the root, then you would be more vulnerable. Scraper software can copy your entire content and easily upload it. Once uploaded, your website's navigation structure becomes: scraper site . com/YOURPAGES If you use your full website URL rather than the root in your internal navigation, you could give scraper software a tough time. This is just a suggestion to safeguard your website. However, you cannot ward off someone who is bent on copying your text and website.
__________________
SEO Optimization Company - SEO Hawk - UK, US, Canada, and Australia SEO Optimisation UK | Latest SEO Blog on the Planet |
|
||||
|
I have done as suggested and am making sure all URLs are absolute instead of relative. As you say, if someone is determined to scrape it then they can, but I doubt anyone is that interested in my site, and if I make a few changes they might just go elsewhere.
Interestingly, when I started looking at this, my server was sending me loads of messages about IPs with multiple connections open. I reduced the number of allowed connections to the server, and at first I was still getting messages saying an IP had attempted to make 500 connections or whatever. But once I made it so that any IP attempting this got blocked, after a few weeks the messages stopped, and it appears they have stopped trying. So I think limiting the number of connections allowed to the server might be another way to limit the amount of scraping. I remember some years back I had to copy an old website of mine from online after my hard drive crashed. I used some software that could download the entire site, and there was an option for how many connections the spider would make; obviously, the more connections, the faster the crawl and copy. I expect many scrapers use software like that, so limiting connections on the server is another preventative measure. It won't stop them, but it can at least slow them down, which might put them off. I was told that limiting connections will not affect the legitimate spiders' crawling, and it does not appear to have done. I am not sure how legitimate bots crawl and record a site's data: do they just use a single connection, or make multiple connections to speed things up? What would be a reasonable number of connections to allow to a server?
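The limit-and-block behaviour described above could be sketched roughly like this. The class, its method names, and the cap are hypothetical, for illustration only; in practice this kind of limit is normally enforced by the web server or firewall rather than in application code, and the sketch does not suggest what a reasonable cap is.

```python
from collections import defaultdict


class ConnectionLimiter:
    """Track open connections per IP and block any IP that exceeds the cap."""

    def __init__(self, max_connections: int):
        self.max_connections = max_connections
        self.open_connections = defaultdict(int)
        self.blocked = set()

    def connect(self, ip: str) -> bool:
        """Register a new connection; return False if the IP is (now) blocked."""
        if ip in self.blocked:
            return False
        if self.open_connections[ip] >= self.max_connections:
            # One attempt over the cap is enough to block the IP from then on.
            self.blocked.add(ip)
            return False
        self.open_connections[ip] += 1
        return True

    def disconnect(self, ip: str) -> None:
        """Release one connection slot for the IP, if it holds any."""
        if self.open_connections[ip] > 0:
            self.open_connections[ip] -= 1
```

A site-download tool configured for many parallel connections would trip the cap quickly, while a client that opens only a few connections at a time would never notice it.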
__________________
2009 Hairstyles - Pictures of 2009 hairstyles and a virtual hairstyler demo. Price Comparison Site - Compare prices of well known brands and products. |
|
||||
|
I have found a really useful project to do with this.
The project is here: Distributed Spam Harvester Tracking Network | Project Honey Pot. They have an http blacklist that allows website administrators to "take advantage of the data generated by Project Honey Pot in order to keep suspicious and malicious web robots off their sites. Project Honey Pot tracks harvesters, comment spammers, and other suspicious visitors to websites." There are also some ways to avoid scrapers here: Use a simple bottrap to block bad bots - stop others from stealing your content today. By combining the honeypot with the htaccess stuff, it sounds like it might help. Will have to wait and see.
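For anyone curious how the http blacklist is queried, the sketch below builds the DNS name for a lookup and decodes an answer. It reflects my understanding of the http:BL format (access key, reversed IPv4 octets, then the dnsbl.httpbl.org zone; answers of the form 127.days.threat.type) and should be checked against Project Honey Pot's own documentation. The access key shown is a placeholder, and the function names are mine.

```python
import ipaddress


def httpbl_query_name(access_key: str, visitor_ip: str) -> str:
    """Build the DNS name to look up: key, reversed IPv4 octets, http:BL zone."""
    octets = str(ipaddress.IPv4Address(visitor_ip)).split(".")
    return ".".join([access_key, *reversed(octets), "dnsbl.httpbl.org"])


def parse_httpbl_answer(answer: str) -> dict:
    """Decode an answer of the form 127.<days>.<threat>.<type>."""
    first, days, threat, visitor_type = (int(part) for part in answer.split("."))
    if first != 127:
        raise ValueError("not a valid http:BL answer")
    return {
        "days_since_last_activity": days,
        # For search engines (type 0) this octet is assumed to carry an
        # engine identifier rather than a threat score.
        "threat_score": threat,
        "search_engine": visitor_type == 0,
        "suspicious": bool(visitor_type & 1),
        "harvester": bool(visitor_type & 2),
        "comment_spammer": bool(visitor_type & 4),
    }


# "abcdefghijkl" is a placeholder, not a real access key.
print(httpbl_query_name("abcdefghijkl", "203.0.113.7"))
```

The actual lookup would be an ordinary DNS A-record query on that name (for example with `socket.gethostbyname`); a failed lookup is taken to mean the IP is not listed.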
__________________
2009 Hairstyles - Pictures of 2009 hairstyles and a virtual hairstyler demo. Price Comparison Site - Compare prices of well known brands and products. Last edited by chandrika; 04-04-2008 at 07:22 PM. |
|
||||
|
Quote:
Whether one uses absolute or relative addressing is irrelevant, in that, if a relative address can be correctly resolved by a benign application it can also be so resolved by a malignant one. Therefore, employing absolute addressing will not serve to avoid the threat that you address.
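A small sketch of that point, using Python's standard `urljoin`: any client that knows the URL of the page it fetched can turn a relative link into the same absolute URL. The example.com domain is a hypothetical stand-in; the path is the one from the earlier post.

```python
from urllib.parse import urljoin


def resolve_link(page_url: str, href: str) -> str:
    """Resolve an href the way any HTTP client would, relative to the page URL."""
    return urljoin(page_url, href)


# Hypothetical page URL; the path is the one used earlier in the thread.
page = "http://www.example.com/index.html"
print(resolve_link(page, "/web_design_services"))
print(resolve_link(page, "http://www.example.com/web_design_services"))
```

Both calls print the same absolute URL, so a scraper loses nothing when the links are relative.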
__________________
The Penn State Ticket Man http://www.pennstateticketman.com http://www.happyvalleytickets.com http://www.hounddogtours.com |
|
||||
|
I guess that a sitemap serves essentially as a datafeed.
So any simple script can take advantage of an extensive sitemap. Other than cloaking the sitemap so that only the major SEs can use it, with a redirect to the homepage for anyone else, I don't see how I can stop them using my sitemaps to populate their own databases. All my anti-scraping efforts won't stop that, because they don't really even visit my site or scrape it; they simply have a list of my product page URLs and serve them to bots as if that were their content. It is the legit SE bot that then gets the content from my page. Maybe they use a 302 redirect or something, so that the legit SE puts their URL in the results but spiders my webpage and attributes my content to their page. That's what they do with a 302 redirect, isn't it?
__________________
2009 Hairstyles - Pictures of 2009 hairstyles and a virtual hairstyler demo. Price Comparison Site - Compare prices of well known brands and products. |
|
||||
|
Quote:
Or, they could be merely copying your sitemap, in which case your server log should show only your sitemap having been requested by the miscreants, and, if my understanding of 302s is correct, serving up your pages via a 302 on demand. The 302 would ensure that the SEs continued to return to their site for the page(s) in question. You might test whether the latter is the case by creating a temporary test page, including that page in your sitemap, but using robots.txt to instruct robots not to index that page. If the test page gets indexed under your troublemaker's site, but not under your own, that would suggest that they're employing redirection rather than actually copying your pages. Of course, if I'm wrong in my understanding of how 302s work, you'll need an expert in these matters, such as Webnauts, to help you further along in answering this question. One point which you may have addressed and I missed is the question of whether the pages in question are being indexed under your site, under the others only, or under both.
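Before running such a test, it may be worth confirming that the robots.txt rule really does cover the test page and nothing else. Here is a sketch using Python's standard robots.txt parser; the test page path, the example.com domain, and the rule itself are hypothetical. Note that a Disallow rule asks compliant robots not to crawl the page, which is the closest robots.txt comes to "do not index".

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that hides only the temporary test page.
ROBOTS_TXT = """\
User-agent: *
Disallow: /shopUK/product/TEST-PAGE.html
"""


def is_crawlable(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt allows user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

With these rules, the test page should come back as not crawlable for Googlebot while the ordinary product pages remain crawlable.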
__________________
The Penn State Ticket Man http://www.pennstateticketman.com http://www.happyvalleytickets.com http://www.hounddogtours.com |
|
||||
|
Quote:
__________________
2009 Hairstyles - Pictures of 2009 hairstyles and a virtual hairstyler demo. Price Comparison Site - Compare prices of well known brands and products. |
|
||||
|
I'd be interested to know how that works out. Until then, enjoy.
__________________
The Penn State Ticket Man http://www.pennstateticketman.com http://www.happyvalleytickets.com http://www.hounddogtours.com |
|
||||
|
During that hour or so period which serves to mediate between my being asleep and functionally awake, it occurred to me that there may be an easier and quicker way to test for the method being used by the scraper(s).
If you are able to visit the site(s) in question, and from there find link(s) which result in your page(s) being displayed, then all you need to do is make a small change, one easily detectable by you, to your page(s), then immediately (so as to avoid the question of whether or not the offending site had been crawled and re-indexed between the time of your making alterations and your time of viewing the results) visit the site(s) using "scraped" content, view the page(s) that you altered, and see if your alteration is present. If the offending site promptly shows your altered content, then they are simply using your site map to dynamically serve up the page content from your server; otherwise, they are serving up content that they've previously read from your site and cached. In any case, if you've not already done so, you might wish to place a copyright notice on all of your pages that are of concern to you, so as to have such available should the question of ownership arise.
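The change-and-check test can be sketched as follows. The marker format and function names are hypothetical, and fetching the suspect page is left out; the sketch covers only generating an easily detectable change and interpreting what comes back. An HTML comment is used as the "small change", on the assumption that the other site passes the page source through unmodified.

```python
import uuid


def make_marker() -> str:
    """Create a unique token that is easy to search for in page source."""
    return f"marker-{uuid.uuid4().hex}"


def tag_page(html: str, marker: str) -> str:
    """Append the marker as an HTML comment at the end of the page."""
    return f"{html}\n<!-- {marker} -->"


def classify_copy(suspect_html: str, marker: str) -> str:
    """Interpret whether the suspect page reflects the fresh marker."""
    if marker in suspect_html:
        return "live: content is fetched from your server on demand"
    return "cached: content was copied earlier"
```

If the marker is missing, that is consistent with a cached copy, but it could also mean the other site strips comments, so a visible change to the page text is the safer variant of the same test.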
__________________
The Penn State Ticket Man http://www.pennstateticketman.com http://www.happyvalleytickets.com http://www.hounddogtours.com |
WebProWorld is an iEntry, Inc. ® site - © 2009 All Rights Reserved Privacy Policy and Legal iEntry, Inc. 2549 Richmond Rd. Lexington KY, 40509 |