How to differentiate bots and humans

From the perspective of an HTTP server, all it sees are HTTP headers and an IP address. The headers are trivial to forge, and the IP address can be hidden behind proxies. So it seems impossible to tell a user from a bot ( web spider ), right? Not quite. By bots I mean both good bots ( which honor robots.txt and identify themselves ) and malicious bots ( which ignore robots.txt and pretend to be human users ).

You cannot tell a bot from a human with absolute certainty, but a few tricks can raise the confidence of your guess.

1. User-Agent header
A browser sends a User-Agent header that identifies it to the HTTP server, like this one from Firefox:

User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.14) Gecko/20080404 Firefox/2.0.0.14

A good bot will identify itself, like MSNBot:

MSNBOT/0.1 (http://search.msn.com/msnbot.htm)

But bad bots will pretend to be a browser, so this one alone cannot separate bad bots from humans.
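For illustration, here is a minimal Python sketch of sorting requests by their User-Agent; the marker list and header dictionary are invented examples, not an exhaustive database:

KNOWN_BOT_MARKERS = ("msnbot", "googlebot", "slurp", "spider", "crawler")

def classify_user_agent(headers):
    # A missing User-Agent is itself suspicious: real browsers always send one.
    ua = headers.get("User-Agent", "").lower()
    if not ua:
        return "suspicious"
    if any(marker in ua for marker in KNOWN_BOT_MARKERS):
        return "declared-bot"   # good bots identify themselves
    return "unknown"            # a browser, or a bad bot in disguise

print(classify_user_agent({"User-Agent": "MSNBOT/0.1 (http://search.msn.com/msnbot.htm)"}))
# -> declared-bot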

2. Cache
A browser will cache resources such as images and scripts by default. Place the same image on several closely related pages; a poorly written bot will fetch the same resource repeatedly instead of caching it.
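A rough sketch of the bookkeeping, assuming the server logs (IP, resource) pairs; the threshold of 3 is an arbitrary example value:

from collections import defaultdict

fetch_counts = defaultdict(int)   # (ip, resource) -> number of fetches

def record_fetch(ip, resource, threshold=3):
    # A caching browser fetches a shared image once; a naive bot re-fetches
    # it on every page that embeds it.
    fetch_counts[(ip, resource)] += 1
    return fetch_counts[(ip, resource)] >= threshold

for _ in range(3):
    suspicious = record_fetch("203.0.113.7", "/images/shared-logo.png")
print(suspicious)   # True: three fetches of the same cacheable image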

3. Invisible links
Links that cannot be clicked or seen by normal means are unlikely to be reached by the average human. A link like

<a href="http://youcannotseethislink.com/"></a>

should never get clicked by a human. If some IPs request it anyway, a few of them may be real humans, but the far more likely explanation is bots scraping the HTML.
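A minimal server-side sketch of the trap; the path name is invented, and a real deployment would randomize it per visitor:

HONEYPOT_PATH = "/youcannotseethislink"   # reachable only via the invisible anchor
flagged_ips = set()

def handle_request(ip, path):
    # A human cannot see or click the link, so any hit is a strong bot signal.
    if path == HONEYPOT_PATH:
        flagged_ips.add(ip)

handle_request("203.0.113.7", "/youcannotseethislink")
print("203.0.113.7" in flagged_ips)   # True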

4. Regular expressions
Since many bots use regular expressions to parse web pages, place links inside HTML comments:

<!-- <a href="http://www.grepthis.com/">About me</a> -->

Few users will open the HTML source and grab such links, so whoever fetches them was likely a bot in the first place. You can also put links in places where they should not be, like these:

<img src="http://thisisnormallink.com" alt="http://youshouldnevercomehere.com/">

<thetag href="http://youshouldnevercomehere.com/"></thetag>

Smarter implementations will not fall for this, but trivial regular expressions will. Beware of text-browser users, though.
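To see why the trap works, here is a quick Python demonstration; the pattern below is a typical example of the sloppy regex a naive scraper might use:

import re

page = '<!-- <a href="http://www.grepthis.com/">About me</a> -->'
naive_links = re.findall(r'href="([^"]+)"', page)
print(naive_links)   # ['http://www.grepthis.com/'] -- the commented-out trap fired

A browser ( and any real HTML parser ) ignores the comment entirely, so only regex-driven scrapers ever see that URL.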

5. Regular expressions: hidden typos in link protocols
Some websites try to obfuscate links from casual users or bots by writing

hxxp://www.alinktosomething.com/

htp://www.alinktosomething.com/

ttp://www.alinktosomething.com/ ( Japanese sites love this one! )

If these deliberately broken links get picked up even when hidden from view, that is a good sign of regex-driven bots at work.
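A quick demonstration of how a sloppy scraper regex ( an invented example ) swallows all three broken schemes:

import re

text = ("hxxp://www.alinktosomething.com/ "
        "htp://www.alinktosomething.com/ "
        "ttp://www.alinktosomething.com/")
print(re.findall(r'\b\w{2,4}://\S+', text))   # all three "links" match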

6. Deep-depth links
Few users are eager to browse meaningless links. Make a link lead to another page, then to yet another, and repeat for a few more depths, filling the deep pages with no meaningful content. Most people quit a few levels in, but some bots will dig really deep.
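One possible sketch, assuming a /maze/<depth> URL scheme ( both the scheme and the cutoff are invented for illustration ):

DEPTH_LIMIT = 10   # humans rarely get anywhere near this deep

def maze_page(depth):
    # Each meaningless page links only to the next one, a level deeper.
    return f'<html><body><a href="/maze/{depth + 1}">next</a></body></html>'

def looks_like_bot(max_depth_reached):
    return max_depth_reached >= DEPTH_LIMIT

print(maze_page(3))         # page 3 links to /maze/4
print(looks_like_bot(12))   # True: nobody clicks "next" twelve times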

7. Burst of HTTP requests
Humans might refresh a page repeatedly in a short timeframe, but that is only a single page. A human ( an IP ) is unlikely to hit many different pages within a short timeframe. Take this one lightly though, because the users might be behind a proxy, and it will not work well against distributed bots either.
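A sliding-window sketch of the counting; the window length and threshold are invented values that would need tuning against real traffic:

import time
from collections import defaultdict, deque

WINDOW_SECONDS = 10
MAX_DISTINCT_PAGES = 20
history = defaultdict(deque)   # ip -> deque of (timestamp, path)

def too_bursty(ip, path):
    # A burst across many different pages is bot-like; a burst on a single
    # page is probably just someone hammering F5.
    now = time.time()
    q = history[ip]
    q.append((now, path))
    while q and now - q[0][0] > WINDOW_SECONDS:
        q.popleft()
    return len({p for _, p in q}) > MAX_DISTINCT_PAGES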

8. Links in JavaScript
Since JavaScript emulation is still not trivial, generate some links with JavaScript. IP addresses that fetch plenty of static HTML links but never these generated ones could be bots, because they cannot see links produced by JavaScript.
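A sketch of the server-side comparison, with invented path names, IPs, and thresholds:

JS_ONLY_PATH = "/generated-by-js"   # written into the DOM by a script, absent from raw HTML
static_hits = {"203.0.113.7": 250, "198.51.100.4": 40}   # pages fetched per IP
js_hits = {"198.51.100.4"}          # IPs that requested JS_ONLY_PATH

for ip, count in static_hits.items():
    if count > 100 and ip not in js_hits:
        print(ip, "reads lots of HTML but never the JS link -> likely a bot")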

9. robots.txt
There is a robots.txt for webservers to tell (good) bots what not to crawl ( or to tell bad bots what to crawl =] ). Humans rarely read this file, so an IP fetching it should raise a flag ( see the sketch after point 10 ).

10. robots.txt + unreachable link in other pages
Place a link that cannot be found anywhere except in robots.txt. Since good bots generally honor robots.txt, whoever visits that link is even more likely a bad bot or a curious hacker, though a bot is the likelier guess.
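A sketch combining this point with point 9, using an invented honeypot path and arbitrary additive scores:

from collections import defaultdict

HONEYPOT = "/disallowed-honeypot/"   # appears ONLY in robots.txt, linked nowhere else
scores = defaultdict(int)            # ip -> bot score

def score_request(ip, path):
    if path == "/robots.txt":
        scores[ip] += 1   # point 9: humans rarely read this file
    elif path.startswith(HONEYPOT):
        scores[ip] += 5   # point 10: ignored robots.txt AND found the hidden path

score_request("203.0.113.7", "/robots.txt")
score_request("203.0.113.7", "/disallowed-honeypot/page")
print(scores["203.0.113.7"])   # 6: very likely a bad bot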

11. Browser fingerprinting
Different browsers have their own implementations and order the HTTP headers of a request differently. Say Firefox sends 5 headers in the order A B C D E; Internet Explorer might send A B D E C, and Opera A B E D C. Requests that match none of these patterns are likely from self-tailored bots.
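A toy sketch of the idea; the header orderings below are invented placeholders, since real profiles would have to be sampled from live browser traffic:

BROWSER_HEADER_ORDERS = {
    "firefox-like": ["Host", "User-Agent", "Accept", "Accept-Language", "Connection"],
    "ie-like":      ["Accept", "Accept-Language", "User-Agent", "Host", "Connection"],
}

def matches_known_browser(header_names):
    # Compare the received ordering against each known browser ordering.
    return any(header_names == order for order in BROWSER_HEADER_ORDERS.values())

print(matches_known_browser(["User-Agent", "Host", "Accept"]))   # False: matches no profile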

12. RFC conformance
Web browsers send RFC-conformant HTTP requests; poorly written bots might not, which is again a good indication of a self-tailored bot. I am not entirely sure about this one; if you know more, please let me know, thank you.

13. IP blocklists
If an IP appears on some public blocklist and is acting even a little like a bot, it is reasonably safe to assume bot behaviour.
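A sketch of a standard DNS blocklist lookup, which reverses the IPv4 octets under the list's zone; zen.spamhaus.org is one well-known zone, though you should check a list's usage policy before querying it from production:

import socket

def on_dnsbl(ip, zone="zen.spamhaus.org"):
    # DNSBL convention: 203.0.113.7 is looked up as 7.113.0.203.<zone>.
    # Any A record means the IP is listed; NXDOMAIN means it is not.
    query = ".".join(reversed(ip.split("."))) + "." + zone
    try:
        socket.gethostbyname(query)
        return True
    except socket.gaierror:
        return False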

None of these can tell a human from a bot with 100% certainty, but together they should give a decent guess about the identity behind an IP address. The above methods are what I can think of at the moment, though I am sure a lot of you know more than what is listed here. Let me know if you know other tricks that can identify them.


2 Responses to “How to differentiate bots and humans”


  1. cskane July 13, 2008 at 7:24 am

    From point 6, I think you can develop another similar approach which needs at most two pages to finish the work. The idea is that we have two pages A and B that link to each other, with either one of them recording which IPs request the page. In most cases, stupid bots will keep visiting the nodes in this cycle.

    You may improve this by randomly generating some meaningless parameters for the links each time the page is loaded. I think even a smart bot will keep running between these two pages, because different parameters may produce a different resulting HTML page. Then the more often an IP appears in the records, the higher the probability that a bot owns that IP.

    Of course it is not 100% certain when detecting robots, because bots can browse the web through proxies or other means of hiding themselves.

  2. log0 July 13, 2008 at 7:33 am

    That’s one trick, looping the spider, but it depends on whether you want to kill the spider. Most crawlers keep track of their history, though, so you probably need to append a dynamic parameter to be sure.

    It’s just one of the methods to get more confirmation of a guess, no guarantees.

