Tactics for identifying and banning content scrapers
I just wanted to provide my 2 cents regarding blocking web scrapers.
First, my main interest here isn't so much protecting the content as preventing the abuse of server resources by aggressive scrapers. It's gotten so bad that they're effectively DDoSing the server on occasion. The challenge is that the IPs rotate constantly, as do the user agents and referrers. So how does one tackle this problem? Is there a way to identify scrapers and ban them automatically? I believe there is.
What Sogo7 said is very true:
The same is true for images, external CSS, and scripts. I've written a script that runs on cron every minute and reads the raw Apache access logs based on this premise, and I take it a step further. The script looks for what I'd deem "abusive behavior": four or more page requests, with no requests for page elements, within the span of a second. That's unnatural behavior. Even someone using a text-only browser is unlikely to make that many requests in a second. The script is very effective at identifying bots. So much so, in fact, that I had to create a whitelist of good bots (Google, Yahoo, MSN, Yandex, etc.) to keep them from being flagged as abusive.
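The detection rule above can be sketched in Python. This is a minimal illustration, not my actual script: the threshold, the asset-extension list, and the assumption of Apache's "combined" log format are all my own choices here.

```python
import re
from collections import defaultdict

# Threshold from the rule described above: 4+ page requests in one second,
# with no accompanying requests for page elements (images, CSS, JS).
PAGE_THRESHOLD = 4
ASSET_RE = re.compile(r'\.(css|js|png|jpe?g|gif|ico|svg|woff2?)(\?|$)', re.I)
# Minimal parser for Apache "combined" lines: IP ... [timestamp] "METHOD path ..."
LINE_RE = re.compile(r'^(\S+) \S+ \S+ \[([^\]]+)\] "(?:GET|POST|HEAD) (\S+)')

def find_abusers(log_lines, whitelist=()):
    """Return IPs that made PAGE_THRESHOLD+ page requests in a single
    second without requesting any page elements."""
    pages = defaultdict(int)   # (ip, second) -> page request count
    assets = defaultdict(int)  # (ip, second) -> asset request count
    for line in log_lines:
        m = LINE_RE.match(line)
        if not m:
            continue
        ip, second, path = m.group(1), m.group(2), m.group(3)
        if ip in whitelist:  # skip known-good bots (Google, Yahoo, etc.)
            continue
        if ASSET_RE.search(path):
            assets[(ip, second)] += 1
        else:
            pages[(ip, second)] += 1
    return {ip for (ip, sec), n in pages.items()
            if n >= PAGE_THRESHOLD and assets[(ip, sec)] == 0}

# A bot hammering four pages in the same second, next to a normal visitor:
sample = [
    '203.0.113.9 - - [12/Feb/2012:10:00:01 -0500] "GET /page1 HTTP/1.1" 200 512 "-" "-"',
    '203.0.113.9 - - [12/Feb/2012:10:00:01 -0500] "GET /page2 HTTP/1.1" 200 512 "-" "-"',
    '203.0.113.9 - - [12/Feb/2012:10:00:01 -0500] "GET /page3 HTTP/1.1" 200 512 "-" "-"',
    '203.0.113.9 - - [12/Feb/2012:10:00:01 -0500] "GET /page4 HTTP/1.1" 200 512 "-" "-"',
    '198.51.100.7 - - [12/Feb/2012:10:00:01 -0500] "GET /page1 HTTP/1.1" 200 512 "-" "-"',
    '198.51.100.7 - - [12/Feb/2012:10:00:01 -0500] "GET /style.css HTTP/1.1" 200 90 "-" "-"',
]
print(find_abusers(sample))  # only the bot IP is flagged
```

The normal visitor isn't flagged because their page request is accompanied by an asset request in the same second.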
The script returns the IP and hostname of the offending requests and stores them in a MySQL database, which is then queried when someone visits a page. If they aren't in the DB, that result is remembered in their session to avoid needless, repeated validation. For those that are in the database, the server returns a 403 with a captcha form prompting them to submit for reconsideration. This way, if a scraper was on a dynamic IP that has since been assigned to a legitimate user, that user can easily regain access, but a bot cannot.
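The gate described above can be sketched like this. I'm using SQLite and a plain dict here purely so the example is self-contained; the real setup uses MySQL and server-side sessions, and `check_request` is a hypothetical name.

```python
import sqlite3

# In-memory stand-in for the MySQL table of flagged IPs/hostnames.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE banned (ip TEXT PRIMARY KEY, hostname TEXT)")
db.execute("INSERT INTO banned VALUES ('203.0.113.9', 'crawler.example.net')")

def check_request(ip, session):
    """Return (status, body) for a page request."""
    if session.get("validated"):
        # Clean verdict cached in the session: skip the DB lookup entirely.
        return 200, "page content"
    row = db.execute("SELECT 1 FROM banned WHERE ip = ?", (ip,)).fetchone()
    if row:
        # Flagged: serve 403 plus a captcha so a legitimate user who
        # inherited a scraper's dynamic IP can request reconsideration.
        return 403, "captcha form"
    session["validated"] = True  # remember the clean verdict for this session
    return 200, "page content"

session = {}
print(check_request("198.51.100.7", session))  # (200, 'page content')
print(check_request("203.0.113.9", {}))        # (403, 'captcha form')
```

On the visitor's second request, the cached `validated` flag short-circuits the database query.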
To recap: this near-real-time script automates the identification of scrapers by recognizing unnatural behavior unlikely to come from a regular visitor. It stores the offending IPs and hostnames as a reference for validating each page request, and revokes access from those that fail. No more manually reviewing the access logs, hunting down IP blocks, and editing and bloating the .htaccess. Even the fail-safe (the captcha) is automated, so there's no need to manually review reconsideration requests.
Perhaps there is a better way to do what I'm trying to do, but this works perfectly for me. On another note, I installed monit and configured it to automatically restart the web server whenever it becomes unresponsive. Coupled with my script, this has effectively eliminated both server downtime and scrapers!
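For anyone curious, a monit rule along these lines will do the restart-on-unresponsive part. The pidfile and init-script paths are for a Debian-style Apache install, so adjust them to your system; this is a sketch, not my exact config.

```
# /etc/monit/conf.d/apache (assumed Debian-style paths)
check process apache with pidfile /var/run/apache2.pid
  start program = "/etc/init.d/apache2 start"
  stop program  = "/etc/init.d/apache2 stop"
  # Restart if Apache stops answering HTTP on localhost.
  if failed host localhost port 80 protocol http then restart
  # Give up if it keeps flapping, rather than restart-looping forever.
  if 3 restarts within 5 cycles then timeout
```

The `timeout` line is optional, but it keeps monit from endlessly restarting a server that's failing for some deeper reason.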
Last edited by weegillis; 02-12-2012 at 05:10 PM.
Reason: Please use 'Reply With Quote' to permit full post attribution.