Hello everyone!
In this tutorial, "Log and Block 'bad bots' that disregard robots.txt", I will show you how to both log and block bad bots.
What is a bad bot?
A bad bot is a bot that does any of the following:
- Carries a well-known bad UA (user agent)
- Uses faked UA strings (either another bot's name or a regular browser's UA string)
- Ignores the robots.txt standard
The following will help against bad bots, but note that there are much worse bots out there that this may not catch.
This script's main concern is the third kind: bots that ignore the robots.txt standard.
Let's begin, shall we?
Step 1
- Create a file in your site root called 'blacklist.txt' and CHMOD it to 0666
- Add the following to your .htaccess
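<FilesMatch "blacklist\.txt$">
# No space after the comma, or Apache rejects the directive
Order deny,allow
Deny from all
# On Apache 2.4 and later, use "Require all denied" instead of the two lines above
</FilesMatch>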
- Download the attached "pixel.zip", extract it and upload the included 'pixel.gif' to your site root.
- Set up a subdirectory (e.g. 'badbots', but please use a random name, and be sure to replace it in all the following code/examples)
- Add:
User-agent: *
Disallow: /badbots/
to your robots.txt. If you do not have one, create it.
- Add a hidden link to all of your site's pages, preferably at the very bottom near the </body> tag, containing the following:
<a href="/badbots/" style="display: none;"><img src="/pixel.gif" alt=" " width="1" height="1" border="0" /></a>
Human users should never see this or be able to click the link.
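If your pages share a footer template, the hidden link can be generated in one place instead of being pasted everywhere. Here is a minimal sketch of that idea. The function name trap_link() is my own invention and not part of the script in this tutorial, and it assumes pixel.gif sits in the site root as described above.

```php
<?php
/**
 * Hypothetical helper (not part of the tutorial's script):
 * builds the hidden trap link for a given subdirectory name.
 *
 * @param string $dir  name of the trap subdirectory, e.g. 'badbots'
 * @return string      the HTML for the hidden link
 */
function trap_link($dir = 'badbots')
{
    // Normalise to "/name/" and encode anything unsafe in a URL
    $href = '/' . rawurlencode(trim($dir, '/')) . '/';

    return '<a href="' . $href . '" style="display: none;">'
         . '<img src="/pixel.gif" alt=" " width="1" height="1" border="0" /></a>';
}

// In a footer template, just before </body>:
echo trap_link('badbots');
```

Remember to pass the same random directory name you used in robots.txt.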
Step 2
Now we need to create a file to be the index page of your newly created subdirectory for these bots. Save it as index.php inside that subdirectory.
Here's the code. I've tried to comment as much as possible:
<?php
// #########################################################################
// Define some things...
// Your domain name. If this isn't right, set it manually by replacing $_SERVER['SERVER_NAME'] with your domain in quotes
define('DOMAIN', $_SERVER['SERVER_NAME']);
// Blacklist file, no need to edit
define('BLFILE', $_SERVER['DOCUMENT_ROOT'] . '/blacklist.txt');
// IP Address, no need to edit.
define('IPADDR', get_ip());
// #########################################################################
// Functions
/**
 * Gets the visitor's ip address
 *
 * Note: X-Forwarded-For and Client-IP are sent by the client and can be
 * faked, so only rely on them if your site sits behind a trusted proxy.
 *
 * @param none
 * @return string
 */
function get_ip()
{
    if (!empty($_SERVER['HTTP_X_FORWARDED_FOR'])
        && preg_match_all('#\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}#', $_SERVER['HTTP_X_FORWARDED_FOR'], $ips))
    {
        foreach ($ips[0] AS $val)
        {
            // Skip private ranges (10.x, 172.16-31.x, 192.168.x)
            if (!preg_match('#^(10|172\.(1[6-9]|2[0-9]|3[01])|192\.168)\.#', $val))
            {
                return $val;
            }
        }
    }
    if (!empty($_SERVER['HTTP_CLIENT_IP']))
    {
        return $_SERVER['HTTP_CLIENT_IP'];
    }
    // HTTP_FROM is not checked: it carries an email address, not an ip.
    return $_SERVER['REMOTE_ADDR'];
}
/**
 * Checks to see if the current visitor's ip is already in the log or not
 *
 * @param void
 * @return boolean
 */
function is_logged()
{
    // No log yet (or not readable) means nobody has been caught.
    if (!is_readable(BLFILE))
    {
        return false;
    }
    // file_get_contents() requires PHP 4.3.0 or later
    $bots = trim((string) file_get_contents(BLFILE));
    if ($bots != '')
    {
        $bots = preg_split("#\n#", $bots, -1, PREG_SPLIT_NO_EMPTY);
        foreach ($bots AS $bot)
        {
            // Each line starts with the ip followed by a space. strpos()
            // returns 0 for a match at the start of the line, so compare
            // with === 0. The trailing space stops 1.2.3.4 from matching
            // 1.2.3.40.
            if (strpos($bot, IPADDR . ' ') === 0)
            {
                return true;
            }
        }
    }
    return false;
}
// #########################################################################
// External mode: another page included this file to check the visitor.
// This has to run before any output, otherwise every page that includes
// this file would get the HTML below prepended to it.
$mode = (isset($mode)) ? $mode : '';
if ($mode == 'external')
{
    if (is_logged())
    {
        // This is a bad bot, reject it and stop the including page too
        sleep(10);
        echo '<p>This site is temporarily not available due to abuse.</p>';
        echo '<p>If you feel this is in error, send an email to the webmaster.<br />If you are an anti-social, ill-behaving bot, then just go away you good for nothing POS.</p>';
        exit;
    }
    // Not logged, hand control back to the including page
    return;
}
// #########################################################################
// Trap mode: something followed the hidden link, so log it.
// This will end up being something like:
// xx.xxx.xx.x [Wed, 06 Jun 2007 07:32:02] GET /badbots/ HTTP/1.0 - Mozilla/4.0 (compatible; MSIE 5.0; Windows NT)
// Referer and user agent are optional headers, so log '-' when missing.
$referer = isset($_SERVER['HTTP_REFERER']) ? $_SERVER['HTTP_REFERER'] : '-';
$agent = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '-';
$string = IPADDR . ' [' . gmdate('D, d M Y H:i:s') . '] ' . "$_SERVER[REQUEST_METHOD] ";
$string .= "$_SERVER[REQUEST_URI] $_SERVER[SERVER_PROTOCOL] $referer ";
$string .= "$agent\n";
if (!is_logged())
{
    // New bot, append it to the log (requires PHP 5 or later).
    // Without FILE_APPEND the whole log would be overwritten each time.
    file_put_contents(BLFILE, $string, FILE_APPEND | LOCK_EX);
}
unset($string);
// #########################################################################
// Begin HTML
?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
<title> </title>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
<meta http-equiv="Content-Language" content="en" />
</head>
<body>
<p>Nothing to see here, <a href="http://<?php echo htmlspecialchars(DOMAIN); ?>">move along</a>...</p>
</body>
</html>
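Once a few bots have been caught, you will probably want to read the log back. Below is a rough sketch of how a line in the format shown in the code comment above could be split into its parts. The function name parse_blacklist_line() and the array keys are my own and are not part of the script. Everything after the protocol (referer and user agent) is kept together as 'rest', because both can contain spaces.

```php
<?php
/**
 * Hypothetical helper: splits one blacklist.txt line into its fields.
 *
 * @param string $line  one line from the log
 * @return array|null   the fields, or null if the line does not match
 */
function parse_blacklist_line($line)
{
    // ip [timestamp] METHOD uri PROTOCOL rest-of-line
    $pattern = '/^(\S+) \[([^\]]+)\] (\S+) (\S+) (\S+) (.*)$/';
    if (!preg_match($pattern, trim($line), $m))
    {
        return null;
    }
    return array(
        'ip'       => $m[1],
        'time'     => $m[2],
        'method'   => $m[3],
        'uri'      => $m[4],
        'protocol' => $m[5],
        'rest'     => $m[6], // referer and user agent
    );
}

// Example line, using a documentation-only ip address
$entry = parse_blacklist_line('203.0.113.7 [Wed, 06 Jun 2007 07:32:02] GET /badbots/ HTTP/1.0 - ExampleBot/1.0');
echo $entry['ip'] . ' requested ' . $entry['uri'] . "\n";
```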
Step 3
If you have some sort of global file, add the following to it to protect your entire site. If not, include it at the very beginning of every page, before any other code/output:
<?php
$mode = 'external';
require_once($_SERVER['DOCUMENT_ROOT'] . '/badbots/index.php');
?>
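If you would rather not pull the whole trap page into every request, the lookup itself is small enough to live in your global file. The following is only a sketch under my own assumptions: the function name is_blacklisted() and its two parameters are made up for illustration, and it expects each log line to start with the ip followed by a space, as in the format above.

```php
<?php
/**
 * Hypothetical standalone check, not part of the tutorial's script.
 *
 * @param string $ip    the visitor's ip address
 * @param string $file  full path to blacklist.txt
 * @return boolean      true if the ip starts any line in the file
 */
function is_blacklisted($ip, $file)
{
    if (!is_readable($file))
    {
        return false;
    }
    $lines = file($file, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
    if ($lines === false)
    {
        return false;
    }
    foreach ($lines as $line)
    {
        // Match only at the start of the line; the trailing space
        // keeps 203.0.113.7 from matching 203.0.113.70
        if (strpos($line, $ip . ' ') === 0)
        {
            return true;
        }
    }
    return false;
}

// Try it out against a throwaway file
$tmp = tempnam(sys_get_temp_dir(), 'bl');
file_put_contents($tmp, "203.0.113.7 [Wed, 06 Jun 2007 07:32:02] GET /badbots/ HTTP/1.0 - ExampleBot/1.0\n");
var_dump(is_blacklisted('203.0.113.7', $tmp));
var_dump(is_blacklisted('203.0.113.70', $tmp));
unlink($tmp);
```

The first call should report true and the second false, since only a full ip at the start of a line counts as a match.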
And that's all folks! :)
You can email me by using the contact form at my blog with your comments, suggestions, problems, etc.