Introduction: How to Identify Bad Bots and Block Them
There are good bots and there are bad bots. Good bots crawl and index your site and bring in traffic. Bad bots consume bandwidth, slow down your server, steal your content and probe for vulnerabilities they can use to compromise your server.
I have battled them for the last 15 years as a sysadmin, and this how-to is based on my personal experience. Bad bots come in all sizes and identify themselves with many different User-Agent strings. There are many bots out there, and they crawl your site with varying degrees of eagerness, but many are harmless. Besides the search engines, some robots are operated by other legitimate agencies: to find the best matching advertising campaign for a page's content, to gather linking information, or to take a snapshot for archiving purposes.
You can find a list of common bots here:
As far as we could tell, the bots on that list obey the directives in a website's robots.txt file. The list contains bots that give identifiable information in their User-Agent field. When you browse through it, you will also find that many major search engines switch User-Agent strings as needed. Most decent bots include a contact link in their User-Agent string, so that the webmaster can communicate a preference or find instructions for blocking them through the robots.txt text file.
You can slow down the rate of crawling or deny access to certain directories from that text file.
For example, you can deny the 'Zum' bot access to all the pages in your document root from your robots.txt file.
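To make the rule described above concrete, here is a minimal sketch that holds such a robots.txt in a string and checks it with Python's standard urllib.robotparser module. The User-agent token "Zum", the helper name is_allowed and the path /index.html are assumptions for illustration; the exact token a given bot honours should be taken from that bot's own documentation.

```python
from urllib.robotparser import RobotFileParser

# Illustrative robots.txt. The token "Zum" is an assumed User-agent name
# for the bot mentioned in the text, not a confirmed value.
ROBOTS_TXT = """\
User-agent: Zum
Disallow: /
"""


def is_allowed(robots_txt, user_agent, path):
    """Return True if robots_txt permits user_agent to fetch path."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, path)


if __name__ == "__main__":
    # "SomeOtherBot" is a hypothetical agent with no rule in the file.
    for agent in ("Zum", "SomeOtherBot"):
        print(agent, is_allowed(ROBOTS_TXT, agent, "/index.html"))
```

A well-behaved crawler performs essentially this check before each request. Nothing forces a bad bot to do the same, which is why the rest of this how-to does not rely on robots.txt alone.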
All regular bots will read this file and obey the directives in it. Bad bots, however, either don't bother to read your robots.txt file or read it only to learn which directories are off limits and then crawl them anyway. So this how-to uses a simple ruse to detect their intention and create a log file for further action.
Step 1: Create a file that can write a log on your server. I have given a Perl script for this here: bots.pl. Save this file in your cgi-bin directory (assuming that your server can execute Perl scripts) and set its permissions so that it is executable. Fire up your browser and point it to this page. You will see a blank page in the browser; the logged entry lets you read your browser's User-Agent string, your IP address, the referrer page (blank for now) and the server time at which the request was served.
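The kind of logging that Step 1 describes can be sketched as follows. This is a Python stand-in, not the bots.pl script itself. It assumes the standard CGI environment variables REMOTE_ADDR, HTTP_USER_AGENT and HTTP_REFERER, and the function names, the tab-separated layout and the log path are hypothetical choices.

```python
import os
from datetime import datetime


def format_log_entry(environ, now):
    """Build one tab-separated log line from CGI-style variables."""
    fields = [
        now.strftime("%Y-%m-%d %H:%M:%S"),      # server time of the request
        environ.get("REMOTE_ADDR") or "-",      # visitor's IP address
        environ.get("HTTP_USER_AGENT") or "-",  # User-Agent string
        environ.get("HTTP_REFERER") or "-",     # referrer page, if any
    ]
    # Flatten tabs and line breaks so a crafted header cannot forge
    # extra fields or extra log lines.
    cleaned = []
    for field in fields:
        for ch in ("\t", "\r", "\n"):
            field = field.replace(ch, " ")
        cleaned.append(field)
    return "\t".join(cleaned)


def log_visit(log_path, environ=None, now=None):
    """Append one entry for the current request to log_path."""
    environ = os.environ if environ is None else environ
    now = datetime.now() if now is None else now
    with open(log_path, "a", encoding="utf-8") as log:
        log.write(format_log_entry(environ, now) + "\n")
```

A CGI entry point would call log_visit with a path the web server user can write to, then print a Content-Type header followed by an empty body, which gives the blank page mentioned above.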
Step 2: Link the bots.pl page above from your index page, with the link hidden from human visitors.
Create a link like so:
You can use .htaccess to block the bad bots, assuming that you use the Apache HTTP server. If a few bad bots regularly use a particular User-Agent string, it is easy to block them based on that string:
SetEnvIfNoCase User-Agent "^Wget" bad_user
SetEnvIfNoCase User-Agent "^Riddler" bad_user
Deny from env=bad_user
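To see which visitors the rules above would catch, the matching can be approximated in a few lines of Python. This is only a sketch: Apache evaluates the patterns with its own regex engine, and the helper name is_bad_user and the sample User-Agent strings are made up for illustration. For simple anchored patterns like these two, the behaviour should be the same.

```python
import re

# Patterns copied from the SetEnvIfNoCase lines above.
BAD_AGENT_PATTERNS = [r"^Wget", r"^Riddler"]


def is_bad_user(user_agent, patterns=BAD_AGENT_PATTERNS):
    """Return True if any pattern matches, ignoring case like SetEnvIfNoCase."""
    return any(re.search(p, user_agent, re.IGNORECASE) for p in patterns)


if __name__ == "__main__":
    # Hypothetical User-Agent strings for illustration only.
    samples = ["Wget/1.21", "wget/1.0", "Mozilla/5.0 (compatible; Wget)"]
    for ua in samples:
        print(ua, "->", "blocked" if is_bad_user(ua) else "allowed")
```

Because of the leading ^, a string is flagged only when it starts with the bot's name. The third sample merely mentions Wget further along and is not flagged. Dropping the ^ would block more aggressively, at the risk of catching legitimate visitors.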
The above Instructable is based on this blog.
Thank you for reading this Instructable. I will be happy to answer any queries related to this Instructable in the comments section.