Log and Block 'bad bots' that disregard robots.txt
Posted on June 13th, 2007
PHP Coding
Hello everyone!

In this tutorial, "Log and Block 'bad bots' that disregard robots.txt", I will show you how to both log and block bad bots.

What is a bad bot?
A bad bot is a bot that either:

  • Carries a well-known bad UA (user agent)
  • Uses faked UA strings (either another bot's name or a user browser's UA string)
  • Ignores the robots.txt standard

The following will help against bad bots, but note that there are much worse bots out there that this may not catch.

This script's main concern is the third kind: bots that ignore the robots.txt standard.

Let's begin, shall we?

Step 1
  1. Create a file in your site root called 'blacklist.txt' and CHMOD it to 0666.
  2. Add the following to your .htaccess so that the file cannot be read over the web:

    <FilesMatch "blacklist\.txt$">
        Order deny,allow
        Deny from all
    </FilesMatch>

    (This is Apache 2.2 syntax. On Apache 2.4 without mod_access_compat, replace the two inner lines with "Require all denied".)
  3. Download the attached "pixel.zip", extract it and upload the included 'pixel.gif' to your site root.
  4. Set up a subdirectory (e.g. 'badbots'; please use a random name, and be sure to replace it in all of the following code and examples).
  5. Add:

    User-agent: *
    Disallow: /badbots/

    to your robots.txt. If you do not have one, create it.
  6. Add a hidden link to all of your site's pages, preferably at the very bottom near the </body> tag, containing the following:
    <a href="/badbots/" style="display: none;"><img src="/pixel.gif" alt=" " width="1" height="1" border="0" /></a>
    Human users should never see this or be able to click the link.
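
Since the subdirectory name has to match in robots.txt, in the hidden link and in the later include, it can help to keep it in one place. The following is only a sketch: the TRAP_DIR constant and the trap_link() and trap_robots_rule() helpers are hypothetical names that are not part of the original script, and the image is assumed to sit at /pixel.gif in the site root.

```php
<?php

// Hypothetical constant: replace 'badbots' with your own random directory name.
define('TRAP_DIR', 'badbots');

/**
* Builds the hidden trap link for a given directory (illustrative helper).
*
* @param  string  directory name, with or without surrounding slashes
* @return string
*/
function trap_link($dir)
{
    // Normalise to "/name/" and escape it for use inside the href attribute
    $href = '/' . htmlspecialchars(trim($dir, '/'), ENT_QUOTES) . '/';

    return '<a href="' . $href . '" style="display: none;">'
        . '<img src="/pixel.gif" alt=" " width="1" height="1" border="0" /></a>';
}

/**
* Builds the matching robots.txt rule (illustrative helper).
*
* @param  string  directory name, with or without surrounding slashes
* @return string
*/
function trap_robots_rule($dir)
{
    return "User-agent: *\nDisallow: /" . trim($dir, '/') . "/\n";
}
```

With this in a shared include, each page would call echo trap_link(TRAP_DIR); just before the </body> tag, and the output of trap_robots_rule(TRAP_DIR) is what belongs in robots.txt.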

Step 2
Now we need to create index.php, the index page of your newly created subdirectory, to catch these bots.

Here's the code; I've tried to comment as much as possible:

<?php

// #########################################################################
// Define some things...

// Your domain name. If this isn't right, set it manually by replacing $_SERVER['SERVER_NAME'] with a string.
define('DOMAIN', $_SERVER['SERVER_NAME']);

// Blacklist file, no need to edit
define('BLFILE', $_SERVER['DOCUMENT_ROOT'] . '/blacklist.txt');

// IP Address, no need to edit.
define('IPADDR', get_ip());

// #########################################################################
// Functions

/**
* Gets the ip address
*
* @param  none
* @return string
*/
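function get_ip()
{
    // Note: X-Forwarded-For and Client-IP are sent by the client and can be faked.
    // REMOTE_ADDR is the only value the visitor cannot choose.
    if (!empty($_SERVER['HTTP_X_FORWARDED_FOR']))
    {
        // The header may hold a comma-separated chain of addresses
        foreach (explode(',', $_SERVER['HTTP_X_FORWARDED_FOR']) AS $val)
        {
            $val = trim($val);

            // Take the first valid address that is not in a private or reserved range
            if (filter_var($val, FILTER_VALIDATE_IP, FILTER_FLAG_NO_PRIV_RANGE | FILTER_FLAG_NO_RES_RANGE))
            {
                return $val;
            }
        }
    }

    if (!empty($_SERVER['HTTP_CLIENT_IP']) && filter_var($_SERVER['HTTP_CLIENT_IP'], FILTER_VALIDATE_IP))
    {
        return $_SERVER['HTTP_CLIENT_IP'];
    }

    // HTTP_FROM holds an email address, not an IP, so it is not consulted.
    return $_SERVER['REMOTE_ADDR'];
}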

/**
* Checks to see if a given bot ip is already in the log or not
*
* @param  void
* @return boolean
*/
function is_logged()
{
    // Nothing has been logged yet if the blacklist is missing or unreadable
    if (!is_readable(BLFILE))
    {
        return false;
    }

    // One array entry per log line, without line endings or empty lines
    $bots = file(BLFILE, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);

    if (!empty($bots))
    {
        foreach ($bots AS $bot)
        {
            // Every log line starts with "<ip> ", so the match must be at offset 0.
            // A bare strpos() test fails here because offset 0 is falsy, and it
            // would also find 1.2.3.4 inside 11.2.3.45.
            if (strpos($bot, IPADDR . ' ') === 0)
            {
                return true;
            }
        }
    }
    return false;
}

// #########################################################################
// Handle the request

/**
* Ok, let's determine if we're coming from a page
* that needs to check if a bad bot is visiting.
*/
$mode = (isset($mode)) ? $mode : '';

switch ($mode)
{
    case 'external':
        // This is another page, checking if a bot is logged.
        // Nothing is printed for normal visitors, so the calling page renders as usual.
        if (is_logged())
        {
            // This is a bad bot, reject it and stop the calling page
            sleep(10);
?>
<p>This site is temporarily not available due to abuse.</p>
<p>If you feel this is in error, send an email to the webmaster.<br />If you are an anti-social, ill-behaving bot, then just go away you good for nothing POS.</p>
<?php
            exit;
        }
        break;
    default:
        // Direct visit to the trap directory: show a stub page, then log the visitor.
?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
<title> </title>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
<meta http-equiv="Content-Language" content="en" />
</head>

<body>

<p>Nothing to see here, <a href="http://<?php echo DOMAIN; ?>">move along</a>... </p>

</body>
</html>
<?php
        // This will end up being something like:
        // xx.xxx.xx.x [Wed, 06 Jun 2007 07:32:02] GET /file.php HTTP/1.0 Mozilla/4.0 (compatible; MSIE 5.0; Windows NT)
        // The referer and user agent headers are optional, so default them to ''.
        $referer = isset($_SERVER['HTTP_REFERER']) ? $_SERVER['HTTP_REFERER'] : '';
        $agent   = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';

        $string = IPADDR . ' [' . gmdate('D, d M Y H:i:s') . '] ' . "$_SERVER[REQUEST_METHOD] ";
        $string .= "$_SERVER[REQUEST_URI] $_SERVER[SERVER_PROTOCOL] $referer ";
        $string .= "$agent\n";

        if (!is_logged())
        {
            // New bot, append it to the log. Without FILE_APPEND the file would
            // be overwritten; LOCK_EX guards against concurrent writes.
            file_put_contents(BLFILE, $string, FILE_APPEND | LOCK_EX);
        }
        unset($string);
        break;
}

?>
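
To review who has been caught, the log lines can be split back into their parts. The sketch below assumes the line layout described in the script's comment (IP, timestamp in square brackets, method, URI, protocol, then referer and user agent). The parse_blacklist_line() helper is a hypothetical name and not part of the original script.

```php
<?php

/**
* Splits one blacklist.txt line into its parts (illustrative helper).
*
* @param  string  one line of the log
* @return array|false
*/
function parse_blacklist_line($line)
{
    // <ip> [<timestamp>] <method> <uri> <protocol> <referer and user agent>
    $pattern = '#^(\S+) \[([^\]]+)\] (\S+) (\S+) (\S+) (.*)$#';

    if (!preg_match($pattern, rtrim($line, "\r\n"), $m))
    {
        // Not a line in the expected layout
        return false;
    }

    return array(
        'ip'       => $m[1],
        'time'     => $m[2],
        'method'   => $m[3],
        'uri'      => $m[4],
        'protocol' => $m[5],
        // The referer may be empty and the user agent contains spaces,
        // so the two are kept together rather than guessed apart.
        'client'   => trim($m[6]),
    );
}
```

Looping over file($_SERVER['DOCUMENT_ROOT'] . '/blacklist.txt') and passing each line to this function would give a list of trapped addresses and the times they hit the hidden directory.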


Step 3
If you have some sort of global file that every page includes, add the following to it to protect your entire site. If not, include it at the very beginning of every page, before any other code or output:

<?php

$mode = 'external';
require_once($_SERVER['DOCUMENT_ROOT'] . '/badbots/index.php');

?>
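
The check this include triggers boils down to one question: does the visitor's IP start any line of blacklist.txt? As a rough sketch, that lookup can be written as a standalone function that is easy to try out in isolation. The ip_in_blacklist() helper is a hypothetical name, and the addresses in the usage note are placeholders.

```php
<?php

/**
* Tells whether an IP is the first field of any given log line (illustrative helper).
*
* @param  array   log lines, one entry per line
* @param  string  IP address to look for
* @return boolean
*/
function ip_in_blacklist($lines, $ip)
{
    foreach ($lines AS $line)
    {
        // Match the whole first field: the IP followed by a space, at offset 0.
        // Comparing with === 0 matters, because a plain truthiness test on
        // strpos() treats a match at offset 0 as "not found".
        if (strpos($line, $ip . ' ') === 0)
        {
            return true;
        }
    }
    return false;
}
```

Given a log that contains a line starting with "11.2.3.45 ", a lookup for 1.2.3.4 or 11.2.3.4 returns false, so a visitor is not blocked just because their address is a fragment of a logged one.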


And that's all folks! :)

You can email me by using the contact form at my blog with your comments, suggestions, problems, etc.
SecondV

SecondV is an experienced PHP and MySQL developer who has been programming for nearly 6 years.