I have a website with content I have aggregated over time, and I am concerned it might be scraped and abused. However, I do need to keep the content accessible to bots like Googlebot and others. Is there any plugin I can use to detect and block malicious scrapers and bots? How can I set up a policy to control and manage this? Does anyone have experience with it? I have an Apache web server running a PHP web app.
Answers
Most of the tools here are mainly login-protection tools of the "3-strikes" kind. When I was webmaster for an Apache server, we had a script that checked each incoming IP address against a set of rules. It is fairly easy to write a script that checks requests against a policy file enforcing rules of your choice, such as "only x hits from this IP per second" and "only y unique downloads per IP per hour". We also had a unique 1x1 pixel on each page that we monitored. You can disallow that pixel with a robots.txt rule, so well-behaved bots will ignore it, but bad bots will attempt to download it, and you can then enforce your policy against them. Of course, you want to make sure you don't unintentionally block legitimate users who might be sharing one public IP!
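To make the idea concrete, here is a minimal sketch in Python of the kind of per-IP policy check described above. It is not the script we actually ran. The names `Policy`, `ScraperGuard`, and `allow`, the thresholds, and the trap URL `/trap/pixel.gif` are all made up for illustration, and it assumes the trap path is disallowed in robots.txt (for example with `Disallow: /trap/`). Rate limits only refuse the current request, while fetching the trap pixel blocks the IP outright; that split is one possible way to go easier on users sharing a public IP.

```python
import time
from collections import defaultdict, deque
from dataclasses import dataclass


@dataclass
class Policy:
    # All values are illustrative assumptions, not recommended settings.
    max_hits_per_second: int = 10
    max_downloads_per_hour: int = 100
    trap_path: str = "/trap/pixel.gif"  # hypothetical honeypot URL


class ScraperGuard:
    """In-memory per-IP policy check (a sketch; state is lost on restart)."""

    def __init__(self, policy):
        self.policy = policy
        self.hits = defaultdict(deque)      # ip -> timestamps within the last second
        self.downloads = defaultdict(dict)  # ip -> {path: time of last fetch}
        self.blocked = set()                # IPs that fetched the trap pixel

    def allow(self, ip, path, now=None):
        """Return True if this request should be served."""
        now = time.time() if now is None else now

        if ip in self.blocked:
            return False

        # Well-behaved bots obey robots.txt and never request the trap.
        if path == self.policy.trap_path:
            self.blocked.add(ip)
            return False

        # Rule 1: at most x hits from this IP per second (sliding window).
        recent = self.hits[ip]
        recent.append(now)
        while now - recent[0] >= 1.0:
            recent.popleft()
        if len(recent) > self.policy.max_hits_per_second:
            return False

        # Rule 2: at most y unique downloads per IP per hour.
        seen = self.downloads[ip]
        seen[path] = now
        for old in [p for p, t in seen.items() if now - t >= 3600.0]:
            del seen[old]
        return len(seen) <= self.policy.max_downloads_per_hour
```

In use, the request handler would call something like `guard.allow(client_ip, request_path)` before serving a page and return an error status when it gets `False`. For bots you want to keep, such as Googlebot, you would probably add an allowlist check in front of this, and verify those bots by something stronger than the User-Agent header, since that header is trivial to fake.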