Saturday, March 01, 2003

bad 'bots

How much traffic to your web site comes from human visitors, and how much is from programs sniffing out information? One visitor that we've seen in our logs appeared to be coming from the International Atomic Energy Agency. (It's not.) I wasn't aware that companies like the RIAA and MPAA are using a service that sends out programs looking for their copyrighted materials. (They appear to be.)

There are a lot of programs that crawl the web and collect information. Some are out there indexing the web for search engines. Others are mapping how different blogs link to one another. Most of these programs are known as robots, spiders, or crawlers.

A protocol was developed in the mid '90s that tells these programs how they're supposed to behave on the web. Many of them do follow the Robots Exclusion Protocol. But not all of them do.

Following the protocol means putting a text file, robots.txt, on your server that programs are supposed to check before they roam around the rest of your site. The robots, spiders, and crawlers mentioned above can be instructed not to visit your site, or certain parts of it. If you want to deny access to specific ones, you can list them in your robots.txt file. But, as I said above, not all programs obey the file.
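For illustration, here's how a compliant crawler might consult those rules, using Python's standard urllib.robotparser. The file contents and bot names below are made up for the example:

```python
from urllib.robotparser import RobotFileParser

# A hypothetical robots.txt: ban "BadBot" everywhere, and keep
# everyone else out of /private/ only.
robots_txt = """\
User-agent: BadBot
Disallow: /

User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# A well-behaved crawler asks before it fetches anything:
parser.can_fetch("BadBot", "/index.html")          # False: banned site-wide
parser.can_fetch("SomeCrawler", "/index.html")     # True: allowed
parser.can_fetch("SomeCrawler", "/private/a.html") # False: off-limits to all
```

The catch, of course, is that the server can't enforce any of this; the crawler has to volunteer to ask.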

There are technological solutions to this problem. An excellent post on the subject is Mark Pilgrim's "How to block spambots, ban spybots, and tell unwanted robots to go to hell."
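The general idea behind that kind of blocking is to refuse requests at the server instead of asking nicely. A rough sketch for Apache (not Mark's exact rules; "EvilCrawler" is a placeholder for whatever User-Agent you want to ban):

```
# .htaccess sketch: flag any request whose User-Agent header
# contains "EvilCrawler", then deny flagged requests
SetEnvIfNoCase User-Agent "EvilCrawler" bad_bot
Order Allow,Deny
Allow from all
Deny from env=bad_bot
```

Bots can lie about their User-Agent, so this is a deterrent rather than a guarantee, but it stops the ones that announce themselves.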
