Quote:
Originally Posted by chandrika
Is that because they will spoof their IP addresses to look like the SE spiders' IP addresses?
I have also tried limiting the number of allowed connections to my server, so that no single IP can hold more than a set number of connections open without getting blocked. I was told that the SEs do not make that many connections, that only scrapers do, and the limit does not appear to have hampered the legitimate bots.
But I do not think these sites are directly spidering my site. I know they can, but it would be a lot less easy for them than just having the URLs given to them in the sitemap, which can simply be uploaded to their database for use in this link hijacking, without having spidered anything.
I previously reported a couple of these sites to Google and I have noticed they have been removed, but I don't want to spend my days reporting stuff like that. I just want to secure my site so it isn't an issue and I can get on with my work.
|
Legitimate spiders, crawlers, and robots do not need a sitemap in order to index a site, and neither do scrapers; therefore, there is no need for scrapers to impersonate SEs in order to do their job. All of these do precisely the same thing: locate and retrieve data from your site. The difference lies solely in what they do with this data once it has been obtained.
And, as there are legitimate uses for scrapers, I would not be surprised to find that some observe the rules set out in a site's instructions to robots.
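As a rough illustration of what "observing the rules" could look like on the scraper's side, here is a minimal Python sketch using the standard library's urllib.robotparser. It assumes the instructions are published as a robots.txt file (robots meta tags would need a different check). The rules, the user-agent name "ExampleBot" and the example.com URLs are made up for the example.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, invented for this sketch.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def may_fetch(user_agent: str, url: str) -> bool:
    """Return True if the sample rules above allow user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the robots.txt content as a list of lines.
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    print(may_fetch("ExampleBot", "https://example.com/index.html"))
    print(may_fetch("ExampleBot", "https://example.com/private/page.html"))
```

A bot that runs a check like this before each request is following the rules voluntarily; nothing in the file itself can force a badly behaved bot to do so.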
The only way to stop any undesired bot from accessing your site, regardless of its purpose, is to identify and block the IP address(es) that it uses.
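For what it is worth, the check itself is simple once you have identified the addresses. Below is a minimal Python sketch, assuming you keep your own list of offending addresses or ranges. The entries shown are reserved documentation ranges used as placeholders, not real scrapers, and the name is_blocked is hypothetical. In practice the same list would more likely go into your firewall or web server configuration.

```python
import ipaddress

# Placeholder blocklist, invented for this sketch (documentation-only ranges).
BLOCKED_NETWORKS = [
    ipaddress.ip_network("203.0.113.0/24"),    # a whole range
    ipaddress.ip_network("198.51.100.17/32"),  # a single address
]


def is_blocked(client_ip: str) -> bool:
    """Return True if client_ip falls inside any blocked network."""
    addr = ipaddress.ip_address(client_ip)
    return any(addr in net for net in BLOCKED_NETWORKS)


if __name__ == "__main__":
    for ip in ("203.0.113.45", "198.51.100.17", "192.0.2.1"):
        print(ip, "blocked" if is_blocked(ip) else "allowed")
```

Listing ranges rather than single addresses helps when a bot rotates through several addresses on the same network.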