Archive for the 'WebAppSec' Category

First look on Cookies


I wrote a simple script to set some cookies, and found some interesting numbers for the maximum number of cookies that can be set per domain name per path. The cookies are of the form <key>=<val>, e.g. 1=1, 2=1, 3=1, 4=1. As I found out, the length of the cookie name matters.
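The original script is not shown, but the test can be sketched with Python's standard library: a tiny HTTP server that emits more Set-Cookie headers than any browser should keep, then reports how many the browser actually sends back. The port and header count here are arbitrary choices.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

N = 300  # more Set-Cookie headers than any tested browser retained

def count_cookies(cookie_header):
    """Count non-empty key=value pairs in a Cookie request header."""
    return len([c for c in cookie_header.split(";") if c.strip()])

class CookieTest(BaseHTTPRequestHandler):
    def do_GET(self):
        # Report back how many cookies the browser returned to us.
        kept = count_cookies(self.headers.get("Cookie", ""))
        body = f"cookies received back: {kept}".encode()
        self.send_response(200)
        for i in range(1, N + 1):
            self.send_header("Set-Cookie", f"{i}=1")
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

def main(port=8000):
    # Reload the page twice in each browser: the second load shows
    # how many of the N cookies survived.
    HTTPServer(("", port), CookieTest).serve_forever()
```

Visiting the page twice and comparing the reported count against N is enough to find each browser's per-domain ceiling.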

Internet Explorer 7 – 20 cookies, maximum of 244 Set-Cookies per page.

Firefox 3 – 50 cookies.

Safari 3 – 1161 cookies, no limit of Set-cookies per page. See analysis below.

Opera 9 – 30 cookies.

Chrome 0.4 – 59~70 cookies; I have no idea why it varies.

Tencent Traveller 2 – 20 cookies, follows the behaviour of Internet Explorer 7.

Except for Safari 3, all browsers have a limit on the number of cookies that can be set. I guess Safari is using a linked list for that. For most browsers, although the HTTP response code is 200, they report the page as one that cannot be displayed. Safari, however, has no limit, so when the cookie headers grow too long ( > 7619 ), Apache replies with a 400 Bad Request.

I haven't thought of any interesting tests yet, but feel free to discuss if there is anything we can do with them. By the way, I remember Hotmail sets a whole lot of cookies; BrowserSense and BS are obviously just duplicates (legacy code, yeehh!). I wonder whether they will be hitting the limits soon? =)

Tencent Traveller 2, which I bet none of you outside China will know about, is a browser in China built on top of IE7. Think of it as a GUI on top of IE7; it even shares IE7's cookies. I have no idea about its adoption in China. Only after testing did I realize I was using a very old version of it. I'll see if there's anything interesting in its newest version, 4.4.

So much for debugging last time. Let’s get back to the web. =P

Tencent Traveller –

RFC2109 –

RFC2965 –


China is a Good Place to Pen Test

In the midst of crazy work, I breezed through some websites in China this month and randomly injected some simple attack vectors. I found several large sites that are vulnerable to XSS and SQL injection.

The large sites include DangDang, Sina China, Sogou, and Baidu. Some of them fixed the problems after I emailed them, or maybe after reading their logs. However, the SQL injection in DangDang remains unfixed, and that is not good. I will not disclose it here, though ( you can certainly find it easily; it is simple and buggy ).

To give you an idea of how big the sites above are relative to China, here are some analogies :

DangDang – Amazon

Sina China – Yahoo! News

Sogou –

Baidu – Google ( bonus, Baidu beats Google in China )

I guess I will have to find more time to play with these sites and look for more holes. But for now, I have tons of work piling up. Ouch. I am on vacation today, in the office.

More About the Characters Causing XSS in Opera

I talked about an XSS in Opera 9.51 discovered by Chris Weber [1]. I spoke with Chris and he hinted at something about character encoding, which I knew little about at the time, and I found out why it is happening.

Except for U+180E and U+180F, they are all associated with spaces ( the Zs, Zl, Zp categories ) [2] [3]. In fact :

U+2028 – Line Separator in Unicode 3.0
U+2029 – Paragraph Separator in Unicode 3.0

Different kinds of spaces in Unicode :

U+2000 to U+200A

U+180E is the Mongolian Vowel Separator character [4], and U+180F is a non-existent character. Somehow U+180E falls into the Space Separator category as well. As for U+180F, I really have no idea how that happened. If you know what is going on, I would really appreciate your sharing it.
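The category assignments above are easy to check from code. A minimal sketch using Python's unicodedata module enumerates every BMP code point in the separator categories; note that the result depends on the Unicode version bundled with your Python, and U+180E was reclassified out of Zs in later Unicode versions, so its presence in the output varies.

```python
import unicodedata

# Collect BMP code points whose general category is a separator:
# Zs (space), Zl (line), or Zp (paragraph).
separators = {cp: unicodedata.category(chr(cp))
              for cp in range(0x10000)
              if unicodedata.category(chr(cp)) in ("Zs", "Zl", "Zp")}

for cp, cat in sorted(separators.items()):
    # unicodedata.name raises on unnamed code points unless a default
    # is supplied.
    print(f"U+{cp:04X} {cat} {unicodedata.name(chr(cp), '<unnamed>')}")
```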

Well, quite a good lesson on international character sets, no? I really recommend Chris Weber's blog [5]; it contains a lot of information on character encoding and web application security that you should not miss.

References :

[1] –
[2] –
[3] –
[4] –
[5] –

Interesting XSS In Opera 9.51

Bad things can happen when things are not treated as what they are. Consider the HTML tag below :

<img src=non-existent-link&onerror=alert(1) />

The browser should treat it as the following content :

<img src="non-existent-link&onerror=alert(1)" />

Not too bad, right? But, what if the character “&” is treated as space?

<img src="non-existent-link" onerror="alert(1)" />

Not too great: our image tag points to something non-existent, inadvertently triggers the onerror callback, and hands out free JavaScript to everyone who comes along… in Opera 9.51. Credits to Chris Weber for the discovery.

This is what happens when some character is treated as a space character. ( No, the "&" character itself is not the culprit; it is just for easier illustration. )

Here is a list of characters that are treated as space characters :

U+2000 to U+200A

Chris Weber's list seems to be smaller, but all of the above worked when I tested them against Opera 9.51 on Windows XP. So I am pretty sure those are extra findings. =)
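The testing can be sketched as a small generator: build one HTML page per suspect code point, substituting the code point where the "&" sat in the example tag, then open each page in the browser under test. The suspect list, file names, and payload here are just one way to do it.

```python
# Suspect code points: the space run U+2000..U+200A, the line and
# paragraph separators, and the two Mongolian code points above.
SUSPECTS = list(range(0x2000, 0x200B)) + [0x2028, 0x2029, 0x180E, 0x180F]

def make_test_page(cp):
    """HTML snippet with the code point injected in attribute position.

    If the browser treats chr(cp) as a space, the onerror attribute is
    parsed separately and alert() fires with the code point's value.
    """
    return f'<img src=non-existent-link{chr(cp)}onerror=alert({cp}) />'

def write_pages(directory="."):
    # One file per suspect; open each in the browser being tested.
    for cp in SUSPECTS:
        with open(f"{directory}/test-{cp:04x}.html", "w",
                  encoding="utf-8") as f:
            f.write(make_test_page(cp))
```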

I also tested Firefox 3.0.1, Internet Explorer 7.0.5730.13, and Chrome. The vulnerability does not exist there. For Safari 3.1.2 (525.21), I *think* there are no problems because it passed some of the tests, but I failed to run all of them due to its serious performance problems ( my guess is a memory leak? ). More on that in the next post.

Reference :

Advisory : Attack of the Mongolian space evaders

URL Bruteforce Discovery

Save as .bak ??? Ouch!


Have you ever saved .php as .php.bak?

Yep, no kidding. Have you ever done that? I did. By reversing the thought process of a web developer, we can guess a lot of the logic and patterns behind his creation. In the case above, the page might leak its naming pattern, the salt for its passwords, or even a cleartext database connection string. By bruteforcing different naming conventions you can find a lot of unexpected resources, resources you can be sure were not meant for strangers like us.

But since the resources expose themselves naked…

So, here is a list of extensions you can try ( any more? Some of them are from w3af. ) :

== Web Pages ==
.htm .html .shtml .php .aspx .asp .jsp .py .txt .out .pl

== Source Code ==
.js .phps .java .c .cpp .cs .fla .vb .inc .old .old1 .bak .bak1 .backup .backup1

== Source Code Append ==
.v1 .v2 ,v1 ,v2

== Data File ==
.xml .properties

== Archives ==
.tar .zip .rar .gzip .bzip2 .gz .tgz .tar.gz .jar .arj

== Binaries ==
.swf .class .exe

Note that the Source Code Append entries produce names such as login.php,v1 , which is a convention actually in use.
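The extension list above turns naturally into a small probing loop. A minimal stdlib-only sketch, with only a few extensions shown and a placeholder target; extend EXTENSIONS with the full list as needed:

```python
import urllib.request
import urllib.error

# A few entries from the lists above; ",v1" / ",v2" are appended with
# a comma, matching the login.php,v1 convention.
EXTENSIONS = [".bak", ".bak1", ".old", ".backup", ".inc", ".phps",
              ".tar.gz", ".zip", ",v1", ",v2"]

def candidates(base_url, path):
    """Build the list of backup/variant URLs for one known path."""
    return [f"{base_url}/{path}{ext}" for ext in EXTENSIONS]

def probe(base_url, path, timeout=5):
    """Yield (url, status) per candidate; status is None on network error."""
    for url in candidates(base_url, path):
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                yield url, resp.status
        except urllib.error.HTTPError as e:
            yield url, e.code
        except urllib.error.URLError:
            yield url, None
```

Anything that comes back 200 instead of 404 deserves a closer look.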

There is also another kind of URL discovery, more elegant and harder to automate: guessing the naming convention of pages. Suppose it is the 2nd of September; you can also exploit logic errors in a date-based URL scheme such as this :

Maybe we can have a look into the future, stocks? =) Not bad. There are a lot of such guesses you can try, and unexpected rewards await. As this case shows, such unexpected access to certain areas of a website can prove disastrous.

References :
w3af –

How to differentiate bots and humans

From the perspective of an HTTP server, all it sees are HTTP headers and an IP address. The HTTP headers are trivial to forge, and so is the IP address. So it seems impossible to tell a user from a bot ( web spider ), right? Not quite. By bots, here I mean both good bots ( which honor robots.txt and identify themselves ) and malicious bots ( which don't honor robots.txt and pretend to be human users ).

You cannot tell a bot from a human with absolute certainty, but tricks can be employed to give higher confidence in a guess.

1. User-Agent header
A browser sends a User-Agent header that identifies it to the HTTP server, like the one below ( Firefox ):

User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv: Gecko/20080404 Firefox/

A good bot will identify itself, like MSNBot :

MSNBOT/0.1 (

But bad bots will pretend to be a browser, so you can't tell bad bots and humans apart with this one.

2. Cache
A browser caches resources such as images and links by default. Lay out the same image or link on different, closely related pages; poorly written bots will repeatedly fetch the same resource regardless of which page links to it.

3. Invisible links
Links that cannot be clicked or seen by normal means are less likely to be reached by the average human. A link like

<a href=""></a>

should not get clicked by humans. If some IP really did, a few of them might be real humans, but a more likely guess is a bot scraping the HTML.

4. Regular expression
Since most bots use regular expressions to parse web pages, place links inside HTML comments.

<!-- <a href="">About me</a> -->

Since few users would open the HTML source and grab the links, it would most likely be bots in the first place. You can also put links in places where they should not be, like these :

<img src="" alt=""></img>

<thetag href=""></thetag>

Smarter implementations will not fall for this; trivial regular expressions will. Beware of text-browser users, though.
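The trap-link ideas above reduce to the same server-side check: plant URLs that only appear in hidden links, HTML comments, or bogus tags, and flag any IP that requests one. A sketch with made-up paths and a made-up log format:

```python
# Hypothetical trap paths, embedded only in hidden links, commented-out
# links, or bogus tags; no ordinary user should ever request these.
TRAP_PATHS = {"/hidden.html", "/comment-only.html", "/thetag-target.html"}

def flag_trap_hits(access_log_lines):
    """Return the set of IPs that requested any trap path.

    Assumes simplified log lines shaped like: '<ip> <method> <path>'.
    Adapt the parsing to your real access-log format.
    """
    flagged = set()
    for line in access_log_lines:
        parts = line.split()
        if len(parts) >= 3 and parts[2] in TRAP_PATHS:
            flagged.add(parts[0])
    return flagged
```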

5. Regular expression – hidden typo in link protocols
It is known that some websites try to obfuscate links from common users or bots by writing them as



ttp:// ( the Japanese guys love this! )

If these links, despite being hidden like this, get picked up, it could be a sign of a regular expression in a bot doing the job.

6. Deep-depth links
Few users are eager to browse meaningless links. Make a link into another page, then into yet another page, and repeat for a few more depths, filling the deep pages with no meaningful content. Most people will quit a few levels deep, but some bots will dig really deep.

7. Burst of HTTP requests
Humans might refresh a page repeatedly in a short timeframe, but that is only a single page. A human ( an IP ) is unlikely to request many different pages within a short timeframe in general. Take this lightly, though, because the users might be behind a proxy. This also will not work well against distributed bots.

8. Links in Javascript
Since JavaScript emulation is still non-trivial for bots, generate some links with JavaScript. IP addresses that fetch a lot of the static HTML links but never these generated links could be bots, because they cannot see links generated by JavaScript.

9. robots.txt
robots.txt lets webservers tell (good) bots what not to crawl ( or tell bad bots what to crawl =] ). Humans rarely read this file, so it should raise a flag if some IP fetches it.

10. robots.txt + unreachable link in other pages
If there is a link that cannot be found anywhere except in robots.txt, then, since good bots generally honor robots.txt, those who follow that link are even more likely bad bots or hackers. Either way, a bot is the likely guess.

11. Browser fingerprinting
Different browsers have their own implementations and ordering of HTTP headers in a request. Say Firefox sends five headers in the order A B C D E; Internet Explorer might send A B D E C, and Opera A B E D C. Requests that match none of these patterns are likely from self-tailored bots.
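The ordering check can be sketched as comparing the relative order of a request's header names against known browser orderings. The orderings below are illustrative only; real browsers differ by version, so fingerprints would have to be collected from live traffic.

```python
# Made-up fingerprints for illustration; collect real ones from
# observed browser traffic before relying on this.
KNOWN_ORDERS = {
    "firefox-like": ["Host", "User-Agent", "Accept",
                     "Accept-Language", "Connection"],
    "ie-like": ["Accept", "Accept-Language", "User-Agent",
                "Host", "Connection"],
}

def matches_known_order(header_names):
    """Return the label whose relative header order the request matches.

    Only headers present in both the request and the fingerprint are
    compared, so missing optional headers do not break the match.
    """
    for label, order in KNOWN_ORDERS.items():
        seen = [h for h in header_names if h in order]
        expected = [h for h in order if h in header_names]
        if seen and seen == expected:
            return label
    return None
```

A request whose common headers appear in an order no known browser uses is a candidate for a self-tailored bot.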

12. RFC conformance
Web browsers send RFC-conformant HTTP request packets; poorly written bots might not, and this is again a good indication of a self-tailored bot. I am not exactly sure about this one; if you know more, please let me know, thank you.

13. IP blocklists
If an IP is on some public IP blocklist and is acting even a little like a bot, it is safe to assume bot behaviour.

None of these can distinguish a human from a bot 100% of the time, but together they should give a decent guess about the identity behind an IP address. The above methods are what I can think of at the moment, though I am sure many of you know more than what is listed. Let me know if you know of other tricks that can identify them.
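Combining the signals can be sketched as a simple weighted score per IP. The signal names, weights, and threshold below are entirely made up for illustration; in practice they would be tuned against real traffic.

```python
# Hypothetical weights for the signals discussed above; higher means
# stronger evidence of a bot.
SIGNALS = {
    "fetched_robots_txt": 2,
    "hit_hidden_link": 3,
    "hit_commented_link": 3,
    "ignored_js_links": 1,
    "burst_requests": 1,
    "odd_header_order": 2,
    "on_ip_blocklist": 2,
}

def bot_score(observed):
    """Sum the weights of the signals observed for one client IP."""
    return sum(SIGNALS[s] for s in observed if s in SIGNALS)

def looks_like_bot(observed, threshold=4):
    """Flag an IP once its accumulated score crosses the threshold."""
    return bot_score(observed) >= threshold
```

No single signal decides anything; the score only says "this IP behaves more like a bot than a human", which matches the spirit of the list above.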