2012-10-30 10:17:05 UTC
of hyperlinks. If a spider or bot starts to walk that site, it will visit
literally hundreds of thousands or perhaps millions of pages, many of which
are things like "vdiff" and "annotate" which are computationally expensive
to generate or like "zip" or "tarball" which give multi-megabyte replies.
If you get a lot of bots walking a Fossil site, it can really load down the
CPU and run up bandwidth charges.
To prevent this, Fossil uses bot-exclusion techniques. First it looks at
the USER_AGENT string in the HTTP header and uses that to distinguish bots
from humans. Of course, a USER_AGENT string is easily forged, but most
bots are honest about who they are so this is a good initial filter. (The
undocumented "fossil test-ishuman" command can be used to experiment with
this bot discriminator.)
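As a rough illustration of this kind of first-pass filter, here is a minimal Python sketch. It is not Fossil's actual test: the function name `looks_human` and both marker lists are assumptions made up for the example.

```python
# Hypothetical marker lists, chosen for illustration only.
BROWSER_PREFIXES = ("Mozilla/", "Opera/")
BOT_MARKERS = ("bot", "spider", "crawler", "slurp")


def looks_human(user_agent: str) -> bool:
    """Guess whether a USER_AGENT string belongs to an interactive browser."""
    # Interactive browsers conventionally announce a well-known product token.
    if not user_agent.startswith(BROWSER_PREFIXES):
        return False
    # Honest bots usually name themselves somewhere in the string.
    lowered = user_agent.lower()
    return not any(marker in lowered for marker in BOT_MARKERS)
```

A string such as "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)" passes a check like this one, which is why a filter of this kind can only ever be a first line of defense.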
The second line of defense is that hyperlinks are disabled in the
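transmitted HTML. There is no href= attribute on the <a> tags. The href=
attributes are added by javascript code that runs after the page has
loaded. The idea here is that a bot can easily forge a USER_AGENT string,
but running javascript is more work, and even malicious bots
don't normally go to that kind of trouble.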
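The shape of such a page can be sketched in Python as follows. This is only an illustration of the technique under stated assumptions: the function `render_links`, the generated element ids, and the inline script are all hypothetical and are not Fossil's actual output.

```python
import html
import json


def render_links(links):
    """Return HTML whose <a> tags only gain an href= when the script runs.

    `links` is a list of (url, label) pairs. The ids and the inline
    script are illustrative assumptions, not what Fossil really emits.
    """
    anchors = []
    targets = {}
    for n, (url, label) in enumerate(links):
        anchor_id = f"a{n}"
        targets[anchor_id] = url
        # The anchor is sent without any href= attribute.
        anchors.append(f'<a id="{anchor_id}">{html.escape(label)}</a>')
    # Escape "</" so a URL cannot terminate the <script> element early.
    table = json.dumps(targets).replace("</", "<\\/")
    script = (
        "<script>"
        f"var t={table};"
        "for(var k in t){document.getElementById(k).href=t[k];}"
        "</script>"
    )
    return "\n".join(anchors) + "\n" + script
```

An agent that only parses the markup sees anchors with nothing to follow. Only an agent that executes the script ends up with working links.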
So, then, to walk a Fossil website, an agent has to (1) present a
USER_AGENT string from a known friendly web browser and (2) interpret
javascript.
This two-phase defense against bots is usually effective. But last night,
a couple of bots got through on the SQLite website. No great damage was
done as we have ample bandwidth and CPU reserves to handle this sort of
thing. Even so, I'd like to understand how they got through so that I
might improve Fossil's defenses.
The first run on the SQLite website originated in Chantilly, VA and gave a
USER_AGENT string as follows:
Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.1; WOW64; Trident/5.0;
SLCC2; .NET_CLR 2.0.50727; .NET_CLR 3.5.30729; .NET_CLR 3.0.30729;
Media_Center_PC 6.0; .NET4.0C; WebMoney_Advisor; MS-RTC_LM_8)
The second run came from Berlin and gave this USER_AGENT:
Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)
Both sessions started out innocently. The logs suggest that there really
was a human operator initially. But then after about 3 minutes of "normal"
browsing, each session started downloading every hyperlink in sight at a
rate of about 5 to 10 pages per second. It is as if the user had pressed a
"Download Entire Website" button on their browser. Question: Is there
such a button in IE?
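A sudden switch from ordinary browsing to several pages per second is something a server could in principle notice. The following Python sketch counts requests per client in a sliding window. The class name `BurstDetector` and both thresholds are assumptions picked for illustration; nothing here describes an existing Fossil feature.

```python
from collections import defaultdict, deque


class BurstDetector:
    """Flag a client that makes too many requests inside a short window."""

    def __init__(self, max_requests=15, window_seconds=3.0):
        # Both limits are illustrative guesses, not tuned values.
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._hits = defaultdict(deque)  # client id -> recent timestamps

    def record(self, client, timestamp):
        """Record one request; return True if the client now looks automated."""
        hits = self._hits[client]
        hits.append(timestamp)
        # Drop timestamps that have fallen out of the sliding window.
        while hits and timestamp - hits[0] > self.window_seconds:
            hits.popleft()
        return len(hits) > self.max_requests
```

With these defaults, a client fetching a page every few seconds is never flagged, while one fetching 5 to 10 pages per second is flagged within about three seconds.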
Another question: Are significant numbers of people still using IE6 and
IE7? Could we simply change Fossil to consider IE prior to version 8 to be
a bot, and hence not display any hyperlinks until the user has logged in?
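To make the proposed rule concrete, here is a small Python sketch of a version check. The helper names are hypothetical. One caveat: IE9 in compatibility view reports "MSIE 7.0" while still sending a "Trident/5.0" token, as in the first USER_AGENT above, so this sketch prefers the Trident token when one is present.

```python
import re

# Trident engine versions map to IE releases: Trident/4.0 is IE8,
# Trident/5.0 is IE9, and so on (engine major version plus 4).
_MSIE = re.compile(r"MSIE (\d+)\.")
_TRIDENT = re.compile(r"Trident/(\d+)\.")


def ie_major_version(user_agent):
    """Return the IE major version implied by a USER_AGENT, or None."""
    trident = _TRIDENT.search(user_agent)
    if trident:
        return int(trident.group(1)) + 4
    msie = _MSIE.search(user_agent)
    return int(msie.group(1)) if msie else None


def is_old_ie(user_agent, minimum=8):
    """True when the agent claims to be IE older than `minimum`."""
    version = ie_major_version(user_agent)
    return version is not None and version < minimum
```

Under this sketch the Berlin agent (MSIE 6.0, no Trident token) counts as old IE, while the Chantilly agent is treated as IE9 and would not be caught by a "prior to version 8" rule.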
Yet another question: Is there any other software on Windows that I am not
aware of that might be causing the above behaviors? Are there plug-ins or
other tools for IE that will walk a website and download all its content?
Finally: Do you have any further ideas on how to defend a Fossil website
against runs such as the two we observed on SQLite last night?
Tnx for the feedback....
D. Richard Hipp