Robot.txt


9 replies to this topic

#1 _*Creative Insanity_*

  • Guests

Posted 21 May 2007 - 02:34 PM

I have a robot text file in my web root as I want to stop those damn SE spiders as they are taking up WAY too much bandwidth for the good they are doing for my sites.
I have added this to the text file:
User-agent: *
Disallow: attachment.php
Disallow: avatar.php
Disallow: editpost.php
Disallow: member.php
Disallow: member2.php
Disallow: misc.php
Disallow: moderator.php
Disallow: newreply.php
Disallow: newthread.php
Disallow: online.php
Disallow: poll.php
Disallow: postings.php
Disallow: printthread.php
Disallow: private.php
Disallow: private2.php
Disallow: report.php
Disallow: search.php
Disallow: sendtofriend.php
Disallow: threadrate.php
Disallow: usercp.php
Disallow: /admin/
Disallow: /images/
Disallow: /mod/
Disallow: /attachment
Disallow: /showthread
but still they come in their masses and chew out my bandwidth.
Is there a sure fire way of stopping these spiders in their tracks?

ta muchly.
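
One thing worth checking in a rule list like the one above: a Disallow value is compared against the start of the URL path, and URL paths begin with "/". The sketch below uses Python's standard urllib.robotparser as a stand-in for a well-behaved crawler, which is an assumption, since real crawlers may parse more leniently. The helper name `allowed`, the agent name "ExampleBot" and the example.com URL are made up for illustration.

```python
from urllib.robotparser import RobotFileParser


def allowed(robots_txt, url, agent="ExampleBot"):
    """Parse robots.txt text and report whether `agent` may fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


# Rule written without a leading slash, as in the list above.
NO_SLASH = """\
User-agent: *
Disallow: attachment.php
"""

# The same rule with the leading slash added.
WITH_SLASH = """\
User-agent: *
Disallow: /attachment.php
"""

URL = "http://example.com/attachment.php?id=1"  # hypothetical URL

print(allowed(NO_SLASH, URL))    # the rule never matches the path "/attachment.php..."
print(allowed(WITH_SLASH, URL))  # the rule matches as a prefix of the path
```

With this parser the first rule is never applied, so entries such as attachment.php may need to be written as /attachment.php before a compliant bot will skip them.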

#2 Demonslay

    P2L Jedi

  • Members
  • 970 posts
  • Gender:Male
  • Location:A strange world where water falls out of the sky... for no reason.
  • Interests:Graphic Design, Coding, Splinter Cell, Cats

Posted 21 May 2007 - 03:21 PM

I've always seen it as 'robots.txt', but either way, not all crawlers follow the file. They aren't required to, and in fact, I don't think many other than large ones (such as Google and probably Ask) even use it.

This is a .htaccess implementation I've found before.

RewriteEngine On 
# Block Bots
RewriteCond %{HTTP_USER_AGENT} ^BlackWidow [OR] 
RewriteCond %{HTTP_USER_AGENT} ^Bot\ mailto:craftbot@yahoo.com [OR] 
RewriteCond %{HTTP_USER_AGENT} ^ChinaClaw [OR] 
RewriteCond %{HTTP_USER_AGENT} ^Custo [OR] 
RewriteCond %{HTTP_USER_AGENT} ^DISCo [OR] 
RewriteCond %{HTTP_USER_AGENT} ^Download\ Demon [OR] 
RewriteCond %{HTTP_USER_AGENT} ^eCatch [OR] 
RewriteCond %{HTTP_USER_AGENT} ^EirGrabber [OR] 
RewriteCond %{HTTP_USER_AGENT} ^EmailSiphon [OR] 
RewriteCond %{HTTP_USER_AGENT} ^EmailWolf [OR] 
RewriteCond %{HTTP_USER_AGENT} ^Express\ WebPictures [OR] 
RewriteCond %{HTTP_USER_AGENT} ^ExtractorPro [OR] 
RewriteCond %{HTTP_USER_AGENT} ^EyeNetIE [OR] 
RewriteCond %{HTTP_USER_AGENT} ^FlashGet [OR] 
RewriteCond %{HTTP_USER_AGENT} ^GetRight [OR] 
RewriteCond %{HTTP_USER_AGENT} ^GetWeb! [OR] 
RewriteCond %{HTTP_USER_AGENT} ^Go!Zilla [OR] 
RewriteCond %{HTTP_USER_AGENT} ^Go-Ahead-Got-It [OR] 
RewriteCond %{HTTP_USER_AGENT} ^GrabNet [OR] 
RewriteCond %{HTTP_USER_AGENT} ^Grafula [OR] 
RewriteCond %{HTTP_USER_AGENT} ^HMView [OR] 
RewriteCond %{HTTP_USER_AGENT} HTTrack [NC,OR] 
RewriteCond %{HTTP_USER_AGENT} ^Image\ Stripper [OR] 
RewriteCond %{HTTP_USER_AGENT} ^Image\ Sucker [OR] 
RewriteCond %{HTTP_USER_AGENT} Indy\ Library [NC,OR] 
RewriteCond %{HTTP_USER_AGENT} ^InterGET [OR] 
RewriteCond %{HTTP_USER_AGENT} ^Internet\ Ninja [OR] 
RewriteCond %{HTTP_USER_AGENT} ^JetCar [OR] 
RewriteCond %{HTTP_USER_AGENT} ^JOC\ Web\ Spider [OR] 
RewriteCond %{HTTP_USER_AGENT} ^larbin [OR] 
RewriteCond %{HTTP_USER_AGENT} ^LeechFTP [OR] 
RewriteCond %{HTTP_USER_AGENT} ^Mass\ Downloader [OR] 
RewriteCond %{HTTP_USER_AGENT} ^MIDown\ tool [OR] 
RewriteCond %{HTTP_USER_AGENT} ^Mister\ PiX [OR] 
RewriteCond %{HTTP_USER_AGENT} ^Navroad [OR] 
RewriteCond %{HTTP_USER_AGENT} ^NearSite [OR] 
RewriteCond %{HTTP_USER_AGENT} ^NetAnts [OR] 
RewriteCond %{HTTP_USER_AGENT} ^NetSpider [OR] 
RewriteCond %{HTTP_USER_AGENT} ^Net\ Vampire [OR] 
RewriteCond %{HTTP_USER_AGENT} ^NetZIP [OR] 
RewriteCond %{HTTP_USER_AGENT} ^Octopus [OR] 
RewriteCond %{HTTP_USER_AGENT} ^Offline\ Explorer [OR] 
RewriteCond %{HTTP_USER_AGENT} ^Offline\ Navigator [OR] 
RewriteCond %{HTTP_USER_AGENT} ^PageGrabber [OR] 
RewriteCond %{HTTP_USER_AGENT} ^Papa\ Foto [OR] 
RewriteCond %{HTTP_USER_AGENT} ^pavuk [OR] 
RewriteCond %{HTTP_USER_AGENT} ^pcBrowser [OR] 
RewriteCond %{HTTP_USER_AGENT} ^RealDownload [OR] 
RewriteCond %{HTTP_USER_AGENT} ^ReGet [OR] 
RewriteCond %{HTTP_USER_AGENT} ^SiteSnagger [OR] 
RewriteCond %{HTTP_USER_AGENT} ^SmartDownload [OR] 
RewriteCond %{HTTP_USER_AGENT} ^SuperBot [OR] 
RewriteCond %{HTTP_USER_AGENT} ^SuperHTTP [OR] 
RewriteCond %{HTTP_USER_AGENT} ^Surfbot [OR] 
RewriteCond %{HTTP_USER_AGENT} ^tAkeOut [OR] 
RewriteCond %{HTTP_USER_AGENT} ^Teleport\ Pro [OR] 
RewriteCond %{HTTP_USER_AGENT} ^VoidEYE [OR] 
RewriteCond %{HTTP_USER_AGENT} ^Web\ Image\ Collector [OR] 
RewriteCond %{HTTP_USER_AGENT} ^Web\ Sucker [OR] 
RewriteCond %{HTTP_USER_AGENT} ^WebAuto [OR] 
RewriteCond %{HTTP_USER_AGENT} ^WebCopier [OR] 
RewriteCond %{HTTP_USER_AGENT} ^WebFetch [OR] 
RewriteCond %{HTTP_USER_AGENT} ^WebGo\ IS [OR] 
RewriteCond %{HTTP_USER_AGENT} ^WebLeacher [OR] 
RewriteCond %{HTTP_USER_AGENT} ^WebReaper [OR] 
RewriteCond %{HTTP_USER_AGENT} ^WebSauger [OR] 
RewriteCond %{HTTP_USER_AGENT} ^Website\ eXtractor [OR] 
RewriteCond %{HTTP_USER_AGENT} ^Website\ Quester [OR] 
RewriteCond %{HTTP_USER_AGENT} ^WebStripper [OR] 
RewriteCond %{HTTP_USER_AGENT} ^WebWhacker [OR] 
RewriteCond %{HTTP_USER_AGENT} ^WebZIP [OR] 
RewriteCond %{HTTP_USER_AGENT} ^Wget [OR] 
RewriteCond %{HTTP_USER_AGENT} ^Widow [OR] 
RewriteCond %{HTTP_USER_AGENT} ^WWWOFFLE [OR] 
RewriteCond %{HTTP_USER_AGENT} ^Xaldon\ WebSpider [OR] 
RewriteCond %{HTTP_USER_AGENT} ^Zeus 
RewriteRule ^.* - [F,L]

And this is a way I've got in my main PHP core, just as a back-up I guess.

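// Block bots that send no User-Agent header, or only a '-' placeholder
$userAgent = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
if ($userAgent === '' || $userAgent === '-') die();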

Can't guarantee either works, as I know little about the issue myself or how to test how effective they are, but they might help some.
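
One way to try patterns like these without touching a live server is to replay user-agent strings against them offline. What follows is a minimal Python sketch of the same matching idea, not Apache itself. `is_blocked` and the two pattern lists are hypothetical names, only a handful of entries from the list above are copied in, and the sample agent strings in the usage lines are invented.

```python
import re

# A small sample of the names from the .htaccess list above.
# "^Name" conditions are case-sensitive matches at the start of the string;
# the [NC] conditions match anywhere in the string, ignoring case.
PREFIX_PATTERNS = ["BlackWidow", "Download Demon", "GetWeb!", "Wget", "Zeus"]
ANYWHERE_NOCASE_PATTERNS = ["HTTrack", "Indy Library"]

PREFIX_RE = re.compile(
    "^(?:" + "|".join(re.escape(p) for p in PREFIX_PATTERNS) + ")"
)
ANYWHERE_RE = re.compile(
    "|".join(re.escape(p) for p in ANYWHERE_NOCASE_PATTERNS), re.IGNORECASE
)


def is_blocked(user_agent):
    """Return True if a request with this User-Agent would be refused."""
    # Mirrors the PHP back-up check: a missing, empty or "-" agent is refused.
    if user_agent is None or user_agent in ("", "-"):
        return True
    # Mirrors the RewriteCond list: any one matching pattern is enough.
    return bool(PREFIX_RE.search(user_agent) or ANYWHERE_RE.search(user_agent))


print(is_blocked("Wget/1.10.2"))                             # starts with a listed name
print(is_blocked("Mozilla/5.0 (compatible; httrack 3.0)"))   # [NC] match anywhere
print(is_blocked("Mozilla/5.0 (compatible; Yahoo! Slurp)"))  # not on the list
```

The list names download tools and harvesters rather than search engine crawlers, so a rule set like this would not be expected to stop a search engine spider unless its name is added.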

#3 _*Creative Insanity_*

  • Guests

Posted 21 May 2007 - 03:29 PM

I have tried htaccess with no joy. I have also tried the meta tag one, also with no joy.
The biggest sucker seems to be Yahoo. Sometimes I see Yahoo spiders all over the place.. at one time there were 23 at the same time. Pathetic spiders, and they do no good for my site: when I search for content I have on my site, nothing from it turns up in the first 6 pages of their results.. so I say, what good are they if they do your site no good. I really don't want them. If they persist I am going to write to these SEs and do my nuts.

ta anyway Demonslay. This is one problem I have been fighting for a long time now.

#4 rc69

    PHP Master PD

  • P2L Staff
  • 3,827 posts
  • Gender:Male
  • Location:Here
  • Interests:Web Development

Posted 21 May 2007 - 05:00 PM

For Yahoo I had a similar problem. It wasn't killing my bandwidth, it was just too persistent. So I told it to slow down.
User-agent: Slurp
Crawl-delay: 60
Google "robots.txt" if you have further questions about the particular file. Even Yahoo has an FAQ about their bot on their site (a link to the FAQ is found in the bot's user_agent string).

Edit: One other thing I just noticed. I don't know if you have a typo in the title or not, but the file name should be "robots.txt" (plural of robot).

Edited by rc69, 21 May 2007 - 05:03 PM.


#5 _*Creative Insanity_*

  • Guests

Posted 21 May 2007 - 05:53 PM

ah huh.. ta RC, yeah I did have them as robot.. why on earth did I do that after all the pages I have read on robot text files.
I would have thought that User-agent: * would have covered slurp as well.
Learn something every day.. even at my age LOL.

#6 Demonslay

Posted 21 May 2007 - 05:58 PM

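It does cover 'Slurp', but only while no group names Slurp on its own. RC's code puts a special delay on that specific bot, and a crawler that follows the standard then reads just that group, so any Disallow lines meant for Slurp need repeating under it.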
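
How a wildcard group and a named group interact can be checked offline. This is a small sketch using Python's standard urllib.robotparser as a stand-in for a standards-following crawler, which is an assumption: Yahoo's own parser may differ. The file contents combine one rule from the first post with RC's Slurp lines, and "ExampleBot" and the example.com URL are made up.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Disallow: /search.php

User-agent: Slurp
Crawl-delay: 60
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

URL = "http://example.com/search.php"  # hypothetical URL

# Slurp has its own group, so only that group is consulted for it.
slurp_allowed = parser.can_fetch("Slurp", URL)
# A bot with no group of its own falls back to the "*" group.
other_allowed = parser.can_fetch("ExampleBot", URL)
# The delay comes from the group that matched the agent.
slurp_delay = parser.crawl_delay("Slurp")

print(slurp_allowed, other_allowed, slurp_delay)
```

In this parser the Slurp group carries the delay but none of the Disallow lines, so Slurp is still allowed to fetch search.php while other bots are not.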

#7 _*Creative Insanity_*

  • Guests

Posted 21 May 2007 - 08:25 PM

Well I have changed my code a little to read:
User-agent: *
Disallow: /
Which, according to the site I was using, closes the entire site to all bots.
Time will tell I guess :D

#8 Av-

    I Feel Left Out

  • Members
  • 1,971 posts
  • Gender:Male
  • Location:10 ft. below sea level

Posted 22 May 2007 - 11:55 AM

I think you are blocking ALL visitors from your site now... bot or not

#9 rc69

Posted 22 May 2007 - 01:12 PM

No, it's robots.txt, you're not a robot are you? .htaccess is the only thing that can ban everybody. Your typical browser, and most bots even, won't bother to look at that file. It's like JavaScript (robots.txt) vs. php (.htaccess).
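
To illustrate the point that the file only matters to software that chooses to read it, here is a rough sketch of the check a polite bot might run before each request, again with Python's standard urllib.robotparser. `bot_may_fetch`, "ExampleBot" and example.com are hypothetical. A browser never runs a check like this, which is why visitors are unaffected.

```python
from urllib.robotparser import RobotFileParser


def bot_may_fetch(robots_txt, path, agent="ExampleBot"):
    """Return True if a bot honouring `robots_txt` would request `path`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, "http://example.com" + path)


# Closes every path to every bot that honours the file.
CLOSED = "User-agent: *\nDisallow: /\n"

# The same two lines with the colon after User-agent left out.
NO_COLON = "User-agent *\nDisallow: /\n"

print(bot_may_fetch(CLOSED, "/"))
print(bot_may_fetch(CLOSED, "/showthread.php?t=1"))
print(bot_may_fetch(NO_COLON, "/"))
```

This parser skips a field line that has no colon, which leaves the Disallow line without a group to belong to, so the second file blocks nothing here. Real crawlers may be more forgiving, but it is a cheap thing to get right.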

Edited by rc69, 22 May 2007 - 01:13 PM.


#10 _*Creative Insanity_*

  • Guests

Posted 23 May 2007 - 01:47 AM

Ok I did have the file named as robot (why after all the info I have read on this) and since I renamed it I ain't seen one spider woo hoo!
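
The file name matters because a crawler does not search the site for the file. It requests one fixed path at the root of the host. A tiny Python sketch of that lookup, where `robots_url` is a hypothetical helper and the forum URL is made up:

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(page_url):
    """Build the one URL a crawler checks for rules covering `page_url`."""
    parts = urlsplit(page_url)
    # Keep the scheme and host, replace the path, drop query and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_url("http://example.com/forum/showthread.php?t=42"))
```

A file saved as robot.txt is never requested, so none of its rules are seen.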




