
There are good bots and there are bad bots. Good bots (robots which crawl websites) crawl and index your site and bring in traffic. Bad bots consume bandwidth, slow down your server, steal your content and look for vulnerabilities to compromise your server.


I have battled them over the last 15 years as a sysadmin, and this 'How-to' is based on that personal experience. Bad bots come in all sizes and use different User-Agent strings to identify themselves. There are many bots out there, crawling sites with varying levels of aggressiveness, and many of them are harmless. Apart from the search engines, some robots are operated by other legitimate agencies: to determine the best matching advertising campaign for a page's content, to look for linking information, or to take a snapshot for archiving purposes.

You can find a list of common bots here:

As far as we could tell, the bots in that list obey the directives in a site's robots.txt. The list contains bots with identifiable information given in their User-Agent field. When you browse through it, you will also find that many major search engines switch User-Agent strings as their needs dictate. Most decent bots give a contact link in their User-Agent string so that the webmaster can communicate their preferences, or offer ways to block them through the text file robots.txt.
From that file you can slow down the rate of crawling or deny access to certain directories.
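For instance, the non-standard Crawl-delay directive (honored by some crawlers such as Bing and Yandex, but ignored by others, including Google) asks a bot to wait between requests, and Disallow keeps it out of a directory. The directory name below is just an illustration:

User-agent: *
Crawl-delay: 10
Disallow: /private/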

For example, you can deny the 'Zum' bot access to all the pages under your document root from your robots.txt file like this:

User-agent: ZumBot

Disallow: /

All regular bots will read this file and obey the directives contained in it. But bad bots either don't bother to read your robots.txt at all, or read it only to find out which directories you have tried to put off limits. So this 'How-to identify bad bots' uses a simple ruse to detect their intention and create a log file for further action.

Step 1: Create a script that can write a log on your server. I have given a Perl script here, bots.pl. Make sure that this file is saved in your cgi-bin directory (assuming that your server can execute Perl scripts) and set its permissions to executable. Fire up your browser and point it at this page: the log entry will record your browser's User-Agent string, your IP address, the referrer page (blank for a direct visit) and the server time at which the request was served. In the browser itself you will just see a blank page.
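If you want to write your own version instead of using the script given with this Instructable, a minimal sketch could look like the following. The log path and log format here are only illustrative assumptions, not the exact attached script:

#!/usr/bin/perl
# bots.pl - minimal bot-trap logger sketch (log path and format are illustrative)
use strict;
use warnings;

my $log = '/var/www/logs/bad-bots.log';   # pick a path the web server user can write to

my $time    = scalar localtime;
my $ip      = $ENV{'REMOTE_ADDR'}     || 'unknown';
my $referer = $ENV{'HTTP_REFERER'}    || '';
my $ua      = $ENV{'HTTP_USER_AGENT'} || 'unknown';

# Append one line per hit: time, IP, referrer, User-Agent
open(my $fh, '>>', $log) or die "Cannot open $log: $!";
print $fh "$time\t$ip\t$referer\t$ua\n";
close($fh);

# Return an empty page so there is nothing interesting to see
print "Content-type: text/html\n\n";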

Step 2: The page bots.pl above should be linked from your index page, hidden from human visitors.
Create a link like so:

<a href="http://your-domain/cgi-bin/bots.pl"></a>
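For the ruse to work, the trap script should also be disallowed in robots.txt, so that well-behaved bots have been explicitly told to stay away from it (the path assumes the cgi-bin setup above):

User-agent: *
Disallow: /cgi-bin/bots.pl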

Now you are set. The log file will contain the details of the bad bots. But wait: to conserve bandwidth, most mainstream bots cache robots.txt. There is a possibility that a legitimate bot cached your robots.txt earlier and is not yet aware of the new directive; in that case it would still crawl this blocked page. Ignore such entries in your list.

Blocking Bad Bots


Check this bad-bots log file later for further remedial action. There are many ways to deny access to these unwelcome bots.

Option 1:
You can check each logged IP address against a white list (add your own IP address as well as those of the major search engines to this white list) and block the remaining IP addresses in the firewall.
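A rough sketch of that filtering step in Perl might look like this. The log format and file names follow the illustrative logger above, and the whitelist entries are placeholders; adapt them to your own setup and review the output before feeding it to your firewall:

#!/usr/bin/perl
# filter-bots.pl - sketch: skip whitelisted IPs and emit firewall rules for the rest
use strict;
use warnings;

# Add your own IP address and those of the major search engines here
my %whitelist = map { $_ => 1 } qw(
    127.0.0.1
    203.0.113.10
);

my %seen;
open(my $fh, '<', 'bad-bots.log') or die "Cannot open log: $!";
while (my $line = <$fh>) {
    chomp $line;
    # assumed log format: time <TAB> ip <TAB> referrer <TAB> user-agent
    my (undef, $ip) = split /\t/, $line;
    next unless defined $ip;
    next if $whitelist{$ip} || $seen{$ip}++;
    print "iptables -A INPUT -s $ip -j DROP\n";
}
close($fh);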

Alternatively, assign the User-Agent string to a deny list and return a 403 (Forbidden) status. This uses fewer server resources.

For example, one of our sites uses a CGI script in our CMS. The following snippet sends a 403 Forbidden status to User-Agents containing wget or Zum:

if ($ENV{'HTTP_USER_AGENT'} =~ /wget|zum/i) {
    print "Status: 403 Forbidden\n";
    print "Content-type: text/html\n\n";
    exit;
}

Option 2:
You can use .htaccess to block the bad bots, assuming that you use the Apache HTTP server. If a few bad bots regularly use a particular User-Agent string, it is easy to block them based on that string.

SetEnvIfNoCase User-Agent "^Wget" bad_user
SetEnvIfNoCase User-Agent "^Riddler" bad_user

Deny from env=bad_user
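Note that Deny from is the old Apache 2.2 syntax (available in 2.4 only through mod_access_compat). On Apache 2.4 the equivalent rule with mod_authz_core would be something along these lines:

<RequireAll>
    Require all granted
    Require not env bad_user
</RequireAll>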

The above Instructable is based on this blog.

Thank you for reading this Instructable. I will be happy to answer any queries related to this Instructable in the comments section.
