What is a SPAMBOT?
A spambot is a piece of software, a program that someone has written. Which language it was written in does not matter, but most are probably written in C for speed and portability reasons. A spambot should not be confused with regular robots, also known as spiders or web crawlers. A spambot starts out on a web page and scans it for two things: hyperlinks and email addresses. It stores the email addresses to use as targets for spam and follows each hyperlink to a new page, where the process starts all over. Unlike civilized robots, spambots usually do not follow the guidelines in the robots.txt file. Most spambots are part of a larger program, which lets them send out spam to email addresses as they find them. Others merely store the email addresses for later use. Spambots vary in their intelligence and sophistication, but even the smartest can be fooled fairly easily by the tricks on this site. The simplest spambot just finds mailto links and follows each hyperlink as it comes up, until it reaches a dead end. The smartest ones can recognize email addresses in many forms, recognize dead links, avoid certain types of email addresses (such as *.edu and *.gov), and track many pages at once.
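To see what one of your own pages exposes, the scanning step described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the code of any real spambot: the names `PageScanner` and `scan_page` are made up for this example, the address pattern is deliberately simple, and nothing is fetched from the network.

```python
import re
from html.parser import HTMLParser

# Deliberately simple pattern; real addresses can take many more forms.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


class PageScanner(HTMLParser):
    """Collects the two things a spambot looks for: links and addresses."""

    def __init__(self):
        super().__init__()
        self.links = []      # hyperlinks a bot would follow next
        self.emails = set()  # addresses a bot would keep as targets

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href") or ""
        if href.lower().startswith("mailto:"):
            # Drop the "mailto:" scheme and any ?subject=... part.
            self.emails.add(href[7:].split("?")[0])
        elif href:
            self.links.append(href)

    def handle_data(self, data):
        # Addresses written as plain text are picked up too.
        self.emails.update(EMAIL_RE.findall(data))


def scan_page(html):
    """Return (links, emails) found in one page of HTML."""
    scanner = PageScanner()
    scanner.feed(html)
    scanner.close()  # flush any text still buffered by the parser
    return scanner.links, sorted(scanner.emails)
```

Running `scan_page` on the HTML of your own page shows which addresses even a very simple bot would find there.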
Detecting SPAMBOTS
Why do you want to detect spambots? To put it simply, knowledge is power. Besides, it's always nice to know when people are abusing your site by running a spambot through it. Detecting them also helps you refine your anti-spambot tricks, by knowing where and how often they strike. It also makes it easier to refine your pages so that normal users are not affected as much by the spambots.
One of the best ways to detect spambots is to have more than one email address, then look carefully at your spam and see which address it was sent to. These days, there are a number of free email services you can use. (Note, however, that most of these services have spam filtering, so you may not receive the spam even if it is sent.) Take an email account that you do not use much, and put its address on a webpage. Don't give it out anywhere else. When you receive spam at this address, you know how and where the spammer got it.
Detection by using plussed email addresses
A "plussed" email address may not be available on all systems. When in doubt, send yourself some email to test it! A plussed email address is one in which a plus sign, followed by some other letters, is added after the username. The email is still delivered as if the plus sign and the other letters were not there. You can then look at your email and see what comes after the plus. For example, if your email address were "bill@abcdefg.com", you could also use "bill+spamtrap@abcdefg.com", "bill+monkey@abcdefg.com", or even "bill+FromMyWebpage@abcdefg.com".
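When sorting through received spam, the part after the plus can be pulled out of the recipient address automatically. The following is a minimal sketch; the function name `split_plus_tag` is an invention for this example.

```python
def split_plus_tag(address):
    """Return (delivery_address, tag); tag is None when there is no plus part."""
    local, at, domain = address.rpartition("@")
    if not at:
        raise ValueError("not an email address: %r" % address)
    # Everything after the first "+" in the username is the tag.
    user, plus, tag = local.partition("+")
    return user + "@" + domain, (tag if plus else None)
```

For the addresses above, `split_plus_tag("bill+spamtrap@abcdefg.com")` gives the pair `("bill@abcdefg.com", "spamtrap")`, so the tag tells you which page the address was taken from.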
Detection by using SMTP comments
Another way to do this is to use SMTP comments in the email address. The comment occurs in parentheses, and appears like this:
"bill(spamtrap)@abcdefg.com"
or
"bill@abcdefg.(spamtrapper2).com"
both of which are the same as "bill@abcdefg.com" - the items in the parentheses are ignored, similar to the plus method above. Note that some spambots may not pick up the email address if it has parentheses in it, but most probably will.
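To read the comment back out of an address that shows up in your spam, a regular expression is enough for simple cases. This is a rough sketch: `strip_comments` is a made-up name, and nested or quoted parentheses are not handled.

```python
import re

# One parenthesized comment with no nested parentheses inside it.
COMMENT_RE = re.compile(r"\(([^()]*)\)")


def strip_comments(address):
    """Return (plain_address, comments) for an address containing (comments)."""
    comments = COMMENT_RE.findall(address)
    plain = COMMENT_RE.sub("", address)
    return plain, comments
```

Here `strip_comments("bill(spamtrap)@abcdefg.com")` gives `("bill@abcdefg.com", ["spamtrap"])`.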
Detection by dynamic email addresses
A very nice way to detect not only where a spambot is getting your email addresses, but also when, is to use dynamic email addresses. The idea is to have the addresses on the page change over time, allowing you to tell when an address was taken from the page. Using a plussed email address is one way. Another, if you own a site, is to create random email addresses to populate it. A program run as a cronjob changes the email address to a pseudo-random name every hour. The actual web page is changed, so that spam can be tracked to the very hour in which the address was stolen. Note that the generated email addresses are best in the form XXYY@, where XX is a random alphabetic string and YY is the hour (from 0-23).
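The address generator for such a cronjob could look something like the sketch below. The function names, the default length of six letters, and the two-digit hour are choices made for this example. Rewriting the page and making sure the mail server accepts the generated name are left out.

```python
import random
import re
import string
from datetime import datetime


def make_trap_address(domain, now=None, length=6, rng=random):
    """Build an address of the form XXYY@domain: random letters plus the hour."""
    now = now or datetime.now()
    letters = "".join(rng.choice(string.ascii_lowercase) for _ in range(length))
    return "%s%02d@%s" % (letters, now.hour, domain)


def hour_from_address(address):
    """Recover the hour (0-23) from a generated address, or None if absent."""
    match = re.match(r"[a-z]+(\d{1,2})@", address)
    if match is None:
        return None
    hour = int(match.group(1))
    return hour if 0 <= hour <= 23 else None
```

Because the random part is purely alphabetic, the digits at the end of the username can only be the hour, which is what makes the XXYY form easy to decode later.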
Detection by using CGI traps
By using a page that only spambots are likely to visit, you can keep track of them by not only checking the access log for that page, but by having a small CGI script that writes a log of all the viewers of the page. You could also have the script triggered by certain actions on a page, or even by checking the path that was taken to get there.
The simplest method to lure the spambots into a page is to place a 1x1 pixel transparent GIF link in your page(s) that the bots will follow but that human viewers will likely never catch.
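A trap script of this kind does not need to do much. The sketch below assumes a web server that runs Python CGI scripts and passes the standard CGI variables REMOTE_ADDR, HTTP_USER_AGENT and HTTP_REFERER; the function names and the log file name `trap.log` are hypothetical.

```python
import os
import time


def format_visit(environ, when=None):
    """Build one tab-separated log line from the standard CGI variables."""
    when = time.time() if when is None else when
    stamp = time.strftime("%Y-%m-%d %H:%M:%S", time.gmtime(when))
    fields = [
        stamp,
        environ.get("REMOTE_ADDR", "-"),
        environ.get("HTTP_USER_AGENT", "-"),
        environ.get("HTTP_REFERER", "-"),  # the page the visitor came from
    ]
    return "\t".join(fields) + "\n"


def log_visit(environ, log_path):
    """Append the visitor's details to the trap log."""
    with open(log_path, "a") as log:
        log.write(format_visit(environ))


def respond(environ=os.environ, log_path="trap.log"):
    """Log the visitor, then return a minimal CGI response (an empty page)."""
    log_visit(environ, log_path)
    return "Content-Type: text/html\r\n\r\n<html><body></body></html>\n"
```

Used as a CGI script, the file would end by writing the result of `respond()` to standard output. The referer field gives a rough idea of the path that was taken to reach the trap page.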
Detecting by name in access logs
Some spambots are bold enough (or dumb enough) to announce their presence. Look in your access logs for suspicious-looking USER_AGENT fields. One clue is a small number of different IP addresses using a particular agent. You will also see some entries for "good robots", like these:
- InfoSeek Sidewinder/0.9
- Slurp/2.0 (slurp@inktomi.com; http://www.inktomi.com/slurp.html)
- ArchitextSpider/ libwww/5.0a
- Scooter/1.0 scooter@pa.dec.com
For an excellent list of web robots, take a look at The Web Robots Database. Notice how the Inktomi robot even leaves a web address about itself - very nice. If only they all did that...
For a good list of spambots and ways to configure your webserver to do something about them, see Protect your Webserver From Spam Harvesters.
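The "small number of different IP addresses per agent" clue can be checked with a short script. The sketch below assumes your server writes the common "combined" log format, with the referer and user agent in quotes at the end of each line; other formats need a different pattern. The function name `agents_by_ip_count` is made up for this example.

```python
import re
from collections import defaultdict

# Assumes the "combined" access log format.
LOG_RE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] "(?P<request>[^"]*)" '
    r'\d{3} \S+ "(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)


def agents_by_ip_count(lines):
    """Return (agent, distinct_ips, hits) rows, fewest distinct IPs first."""
    ips = defaultdict(set)
    hits = defaultdict(int)
    for line in lines:
        match = LOG_RE.match(line)
        if match is None:
            continue  # skip lines in some other format
        agent = match.group("agent")
        ips[agent].add(match.group("ip"))
        hits[agent] += 1
    rows = [(agent, len(ips[agent]), hits[agent]) for agent in ips]
    # Few IPs but many hits is the suspicious combination, so list it first.
    return sorted(rows, key=lambda row: (row[1], -row[2]))
```

Agents near the top of the list that you do not recognize as browsers or as good robots are worth a closer look.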
Detection by ratio in access logs
Spambots tend to be very impolite robots - not only do they tend to ignore robots.txt, but they greedily grab many web pages without even waiting a bit (most robots have a small delay between "fetches" to avoid slowing down the server). You can use this to your advantage by looking in your access logs for IP addresses that not only hit many of your pages, but did so in a short amount of time. Basically, spambots will have a high number of hits and a short time between hits.
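The ratio check can be automated along the following lines. This is a sketch under a few assumptions: the log uses the common Apache-style timestamp in square brackets (for example `[10/Oct/2000:13:55:36 -0700]`), and the thresholds of 20 hits and one second between hits are arbitrary starting points to tune against your own logs. The function names are made up for this example.

```python
import re
from collections import defaultdict
from datetime import datetime

# Client address, then the bracketed timestamp of an Apache-style log line.
LINE_RE = re.compile(r"^(\S+) \S+ \S+ \[([^\]]+)\]")


def hit_rates(lines):
    """Map each IP to (hits, mean seconds between consecutive hits)."""
    times = defaultdict(list)
    for line in lines:
        match = LINE_RE.match(line)
        if match is None:
            continue
        when = datetime.strptime(match.group(2), "%d/%b/%Y:%H:%M:%S %z")
        times[match.group(1)].append(when)
    rates = {}
    for ip, stamps in times.items():
        stamps.sort()
        if len(stamps) < 2:
            rates[ip] = (len(stamps), None)  # no gap to measure
            continue
        span = (stamps[-1] - stamps[0]).total_seconds()
        rates[ip] = (len(stamps), span / (len(stamps) - 1))
    return rates


def suspects(lines, min_hits=20, max_gap=1.0):
    """IPs with many hits and a short average time between them."""
    return sorted(
        ip
        for ip, (hits, gap) in hit_rates(lines).items()
        if hits >= min_hits and gap is not None and gap <= max_gap
    )
```

An address that shows up in `suspects` is not necessarily a spambot (a good robot or a browser loading many images can look similar), so treat the result as a list of candidates to inspect by hand.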