
Thread: Tactics for identifying and banning content scrapers

  1. #1
    Junior Member
    Join Date
    Feb 2012

    Question Tactics for identifying and banning content scrapers

I just wanted to offer my two cents on blocking web scrapers.

First, my main interest in all of this isn't so much protecting the content as preventing aggressive scrapers from abusing server resources. It has gotten so bad that, on occasion, it is effectively DDoSing the server. The challenge I face is that the IPs rotate constantly, as do the user agents and referers. So how does one tackle this problem? Is there a way to identify scrapers and ban them automatically? I believe there is.

    What Sogo7 said is very true:

    Quote Originally Posted by Sogo7 View Post
    Remember scrapers read the actual page content as delivered from the webserver, therefore javascript elements have not been run and this can be used to your advantage...
The same is true for images and external CSS stylesheets. Working from this premise, I've written a script that runs from cron every minute and reads the raw Apache access logs, and I take it a step further. The script looks for what I would deem "abusive behavior": four or more page requests within a single second, with no accompanying requests for page elements. That's unnatural. Even someone using a text-only browser is unlikely to make that many requests in a second. The script is so effective at identifying bots that I had to create a whitelist of good bots (Google, Yahoo, MSN, Yandex, etc.) to exclude them from being flagged as abusive.
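The heuristic above can be sketched roughly as follows. This is a minimal illustration, not the author's actual script: the log-line regex assumes Apache's "combined" format, and the function name, asset-extension list, and threshold parameter are my own inventions. A real version would also skip whitelisted good bots (e.g. by reverse-DNS lookup) before flagging.

```python
import re
from collections import defaultdict

# Matches the start of an Apache "combined" log line: IP, timestamp, request.
LOG_RE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] "(?P<method>\S+) (?P<path>\S+)[^"]*"'
)
# Requests for these are "page elements"; a real browser fetches them.
ASSET_EXTS = ('.css', '.js', '.png', '.jpg', '.gif', '.ico')

def find_abusers(log_lines, threshold=4):
    """Return the set of IPs that made `threshold`+ page requests within
    a single second while requesting no page elements at all."""
    pages = defaultdict(int)   # (ip, timestamp) -> page-request count
    assets = defaultdict(int)  # (ip, timestamp) -> asset-request count
    for line in log_lines:
        m = LOG_RE.match(line)
        if not m:
            continue
        # Apache timestamps have one-second resolution, so (ip, time)
        # buckets all of an IP's requests for that second together.
        key = (m.group('ip'), m.group('time'))
        if m.group('path').lower().endswith(ASSET_EXTS):
            assets[key] += 1
        else:
            pages[key] += 1
    return {ip for (ip, sec), n in pages.items()
            if n >= threshold and assets[(ip, sec)] == 0}
```

Run from cron each minute over the newly appended log lines, the returned IPs would be inserted into the ban table.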

The script stores the IP and hostname of each offending requester in a MySQL database, which is then queried when someone visits a page. If a visitor isn't in the database, that result is remembered in their session to avoid needless, repeated validation for the rest of the session. Visitors who are in the database get a 403 with a captcha form prompting them to submit for reconsideration. That way, if a scraper was on a dynamic IP that has since been reassigned to a legitimate user, the new user can easily regain access, but a bot cannot.

To recap: this near real-time script automates the identification of scrapers by recognizing request behavior that a regular visitor is unlikely to produce. It stores the offending IPs and hostnames as a reference for validation on each page request, and revokes access from those that fail. No more manually reviewing the access logs, hunting down IP blocks, and editing an ever more bloated .htaccess. Even the fail-safe (the captcha) is automated; there's no need to manually review reconsideration requests.

Perhaps there is a better way to do what I'm trying to do, but this works perfectly for me. On a related note, I also installed monit and configured it to automatically restart the web server if it becomes unresponsive. That, coupled with my script, has effectively eliminated both the server downtime and the scrapers!
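For anyone curious about the monit piece, a minimal stanza along these lines does the job. This is an assumed Debian-style setup — the pid file path and init script locations vary by distro, so check yours before copying:

```
check process apache with pidfile /var/run/apache2.pid
    start program = "/etc/init.d/apache2 start"
    stop program  = "/etc/init.d/apache2 stop"
    if failed host 127.0.0.1 port 80 protocol http for 2 cycles then restart
```

The "for 2 cycles" guard keeps a single slow response from triggering an unnecessary restart.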
    Last edited by weegillis; 02-12-2012 at 05:10 PM. Reason: Please use 'Reply With Quote' to permit full post attribution.
