Discussion: Help improve bot exclusion

Richard Hipp
2012-10-30 10:17:05 UTC
A Fossil website for a project with a few thousand check-ins can have a lot
of hyperlinks. If a spider or bot starts to walk that site, it will visit
literally hundreds of thousands, or perhaps millions, of pages, many of which
are things like "vdiff" and "annotate" that are computationally expensive to
generate, or like "zip" and "tarball" that give multi-megabyte replies.
If a lot of bots walk a Fossil site, they can really load down the
CPU and run up bandwidth charges.

To prevent this, Fossil uses bot-exclusion techniques. First, it looks at
the USER_AGENT string in the HTTP header and uses that to distinguish bots
from humans. Of course, a USER_AGENT string is easily forged, but most
bots are honest about who they are, so this is a good initial filter. (The
undocumented "fossil test-ishuman" command can be used to experiment with
this bot discriminator.)
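
In outline, that kind of filter amounts to something like the sketch below.
The marker substrings are illustrative examples, not the list Fossil
actually checks:

    // Sketch of a USER_AGENT-based bot filter. The marker substrings are
    // illustrative examples, not the list Fossil actually checks.
    const BOT_MARKERS = ["bot", "spider", "crawl", "wget", "curl", "libwww"];
    const HUMAN_MARKERS = ["mozilla", "opera"];

    function looksHuman(userAgent: string): boolean {
      const ua = userAgent.toLowerCase();
      // Anything that admits to being a crawler is treated as a bot.
      if (BOT_MARKERS.some((m) => ua.includes(m))) return false;
      // Otherwise require a marker that ordinary browsers send.
      return HUMAN_MARKERS.some((m) => ua.includes(m));
    }

    looksHuman("Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)"); // true
    looksHuman("Googlebot/2.1 (+http://www.google.com/bot.html)");    // false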

The second line of defense is that hyperlinks are disabled in the
transmitted HTML: there is no href= attribute on the <a> tags. The href=
attributes are added by JavaScript code that runs after the page has been
loaded. The idea here is that a bot can easily forge a USER_AGENT string,
but running JavaScript code is a bit more work, and even malicious bots
don't normally go to that kind of trouble.
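
In outline, the technique looks something like the sketch below. The
data-href attribute is just one way to carry the link target and is not
necessarily what Fossil actually emits:

    // Sketch of the deferred-hyperlink idea: the server sends <a> tags
    // without href=, carrying the target in a data attribute instead, and
    // this script fills in href= once the page has loaded. Simplified
    // illustration only; Fossil's real markup and script may differ.
    window.addEventListener("load", () => {
      document.querySelectorAll<HTMLAnchorElement>("a[data-href]").forEach((a) => {
        if (a.dataset.href) a.href = a.dataset.href; // a crawler that ignores JS never sees this
      });
    });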

So, then, to walk a Fossil website, an agent has to (1) present a
USER_AGENT string from a known friendly web browser and (2) interpret
JavaScript.

This two-phase defense against bots is usually effective. But last night,
a couple of bots got through on the SQLite website. No great damage was
done as we have ample bandwidth and CPU reserves to handle this sort of
thing. Even so, I'd like to understand how they got through so that I
might improve Fossil's defenses.

The first run on the SQLite website originated in Chantilly, VA and gave a
USER_AGENT string as follows:

Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.1; WOW64; Trident/5.0;
SLCC2; .NET_CLR 2.0.50727; .NET_CLR 3.5.30729; .NET_CLR 3.0.30729;
Media_Center_PC 6.0; .NET4.0C; WebMoney_Advisor; MS-RTC_LM_8)

The second run came from Berlin and gave this USER_AGENT:

Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)

Both sessions started out innocently. The logs suggest that there really
was a human operator initially. But then, after about 3 minutes of "normal"
browsing, each session started downloading every hyperlink in sight at a
rate of about 5 to 10 pages per second. It is as if the user had pressed a
"Download Entire Website" button on their browser. Question: Is there
such a button in IE?

Another question: Are significant numbers of people still using IE6 and
IE7? Could we simply change Fossil to consider IE prior to version 8 to be
a bot, and hence not display any hyperlinks until the user has logged in?

Yet another question: Is there any other software on Windows that I am not
aware of that might be causing the above behaviors? Are there plug-ins or
other tools for IE that will walk a website and download all its content?

Finally: Do you have any further ideas on how to defend a Fossil website
against runs such as the two we observed on SQLite last night?

Tnx for the feedback....
--
D. Richard Hipp
***@sqlite.org
Arjen Markus
2012-10-30 10:23:17 UTC
On Tue, 30 Oct 2012 06:17:05 -0400, Richard Hipp wrote:
Post by Richard Hipp
This two-phase defense against bots is usually effective. But last night,
a couple of bots got through on the SQLite website. No great damage was
done as we have ample bandwidth and CPU reserves to handle this sort of
thing. Even so, I'd like to understand how they got through so that I
might improve Fossil's defenses.
The first run on the SQLite website originated in Chantilly, VA and gave a
USER_AGENT string as follows:
Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.1; WOW64; Trident/5.0;
SLCC2; .NET_CLR 2.0.50727; .NET_CLR 3.5.30729; .NET_CLR 3.0.30729;
Media_Center_PC 6.0; .NET4.0C; WebMoney_Advisor; MS-RTC_LM_8)
The second run came from Berlin and gave this USER_AGENT:
Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)
Both sessions started out innocently. The logs suggest that there really
was a human operator initially. But then, after about 3 minutes of "normal"
browsing, each session started downloading every hyperlink in sight at a
rate of about 5 to 10 pages per second. It is as if the user had pressed a
"Download Entire Website" button on their browser. Question: Is there
such a button in IE?
I just tried it: you can save a URL as a single web page or as a "web
archive" (extension .mht, whatever that means). So it seems quite
possible, and it appears to be the default when using "Save as".

This was with IE 8.

Regards,

Arjen



Lluís Batlle i Rossell
2012-10-30 10:23:50 UTC
Post by Richard Hipp
Finally: Do you have any further ideas on how to defend a Fossil website
against runs such as the two we observed on SQLite last night?
This problem affects almost any web software, and I think that job is delegated
to robots.txt. Isn't this approach good enough? And in the particular case of
the fossil standalone server, it could serve a robots.txt.
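For example, something along these lines (a hypothetical sketch; the
Disallow paths are only examples of the expensive pages):

    // Hypothetical sketch: a standalone server answering /robots.txt itself,
    // steering well-behaved crawlers away from the expensive pages.
    // The Disallow paths are examples, not an official Fossil list.
    import { createServer } from "node:http";

    const ROBOTS_TXT = [
      "User-agent: *",
      "Disallow: /vdiff",
      "Disallow: /annotate",
      "Disallow: /zip",
      "Disallow: /tarball",
      "",
    ].join("\n");

    createServer((req, res) => {
      if (req.url === "/robots.txt") {
        res.writeHead(200, { "Content-Type": "text/plain" });
        res.end(ROBOTS_TXT);
        return;
      }
      // ... everything else would go to the normal page generator ...
      res.writeHead(404);
      res.end();
    }).listen(8080);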

How do programs like 'viewcvs' or 'viewsvn' deal with that?

Regards,
Lluís.
Richard Hipp
2012-10-30 12:20:14 UTC
Post by Lluís Batlle i Rossell
Post by Richard Hipp
Finally: Do you have any further ideas on how to defend a Fossil website
against runs such as the two we observed on SQLite last night?
This problem affects almost any web software, and I think that job is delegated
to robots.txt. Isn't this approach good enough?
Robots.txt only works over an entire domain. If your Fossil server is
running as CGI within that domain, you can manually modify your robots.txt
file to exclude all or part of the Fossil URI space. But as that file is
not under the control of Fossil, you have to make this configuration
yourself; Fossil cannot help you. This burden can become acute when you
are managing many dozens or even hundreds of Fossil repositories. An
automatic system is better.
Post by Lluís Batlle i Rossell
And in the particular case of
the fossil standalone server, it could serve a robots.txt.
How do programs like 'viewcvs' or 'viewsvn' deal with that?
Regards,
Lluís.
--
D. Richard Hipp
***@sqlite.org
Bernd Paysan
2012-10-30 12:27:57 UTC
On Tue, Oct 30, 2012 at 6:23 AM, Lluís Batlle i Rossell wrote:
Post by Lluís Batlle i Rossell
Post by Richard Hipp
Finally: Do you have any further ideas on how to defend a Fossil website
against runs such as the two we observed on SQLite last night?
This problem affects almost any web software, and I think that job is delegated
to robots.txt. Isn't this approach good enough?
Robots.txt only works over an entire domain. If your Fossil server is
running as CGI within that domain, you can manually modify your robots.txt
file to exclude all or part of the fossil URI space. But as that file is
not under control of Fossil, you have to make this configuration yourself -
Fossil cannot help you. This burden can become acute when you are managing
many dozens or even hundreds of Fossil repositories. An automatic system
is better.
Search engine crawlers do honor the robots meta tag:

http://www.robotstxt.org/meta.html

Adding this is a piece of cake (just change the page template), but it doesn't
help against malware.
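
For instance, the header template could emit the tag only on the expensive
pages (a rough sketch; the page names are illustrative):

    // Rough sketch: emit a robots meta tag in the page header for pages
    // that are expensive to generate. The page list is illustrative.
    const NOINDEX_PAGES = new Set(["vdiff", "annotate", "zip", "tarball"]);

    function pageHead(title: string, page: string): string {
      const robots = NOINDEX_PAGES.has(page)
        ? '<meta name="robots" content="noindex, nofollow">\n'
        : "";
      return `<head>\n<title>${title}</title>\n${robots}</head>`;
    }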
--
Bernd Paysan
"If you want it done right, you have to do it yourself"
http://bernd-paysan.de/
Kees Nuyt
2012-10-30 12:16:28 UTC
On Tue, 30 Oct 2012 06:17:05 -0400, Richard Hipp <***@sqlite.org> wrote:

[...]
Post by Richard Hipp
Both sessions started out innocently. The logs suggest that there really
was a human operator initially. But then after about 3 minutes of "normal"
browsing, each session starts downloading every hyperlink in sight at a
rate of about 5 to 10 pages per second. It is as if the user had pressed a
"Download Entire Website" button on their browser. Question: Is there
such a button in IE?
No, just "save page as ...". It will not follow hyperlinks, only save
html and embedded resources, like images.
Post by Richard Hipp
Another question: Are significant numbers of people still using IE6 and
IE7? Could we simply change Fossil to consider IE prior to version 8 to be
a bot, and hence not display any hyperlinks until the user has logged in?
I don't think it would help much. Newer versions will potentially run
the same add-ons.

By the way, over 5% of the population still use these older versions.
http://stats.wikimedia.org/archive/squid_reports/2012-09/SquidReportClients.htm
Post by Richard Hipp
Yet another question: Is there any other software on Windows that I am not
aware of that might be causing the above behaviors? Are there plug-ins or
other tools for IE that will walk a website and download all its content?
There are several browser add-ons that will try to walk complete
websites, e.g.:
http://www.winappslist.com/download_managers.htm
http://www.unixdaemon.net/ie-plugins.html

One can also think of validator tools.

Standalone programs usually will not run JavaScript.
Post by Richard Hipp
Finally: Do you have any further ideas on how to defend a Fossil website
against runs such as the two we observed on SQLite last night?
Perhaps the href JavaScript should run "onfocus" rather than "onload"?
(untested)

Other defenses could use DoS-defense techniques, like not honouring (or
aggressively delaying responses to) more than a certain number of requests
within a certain time. That is not nice, because the server would have to
maintain (more) session state.

Sidenote:
As far as I can tell, several modern browsers have a "read ahead" option
that will try to load more pages of the site before a link is clicked.
https://developers.google.com/chrome/whitepapers/prerender
Those will not walk a whole site, though.
--
Groet, Cordialement, Pozdrawiam, Regards,

Kees Nuyt
Kees Nuyt
2012-10-30 13:01:47 UTC
On Tue, 30 Oct 2012 06:17:05 -0400, Richard Hipp wrote:
Post by Richard Hipp
Finally: Do you have any further ideas on how to defend a Fossil website
against runs such as the two we observed on SQLite last night?
Another suggestion:
Include a (mostly invisible, perhaps hard to recognize) logout hyperlink
on every page that immediately invalidates the session if it is followed.
Users will not see it and will not be bothered by it; bots will stumble
upon it.
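
Roughly like this (a sketch only; the URL, the hiding technique and the
session functions are made up):

    // Sketch of the honeypot idea: every page carries a link no human is
    // expected to follow; any session that requests it is dropped.
    // The URL, styling and session functions below are placeholders.
    const TRAP_URL = "/honeypot/logout";

    function trapLink(): string {
      // Hidden from sighted users and from assistive technology alike.
      return `<a href="${TRAP_URL}" style="display:none" tabindex="-1"
                 aria-hidden="true" rel="nofollow">&nbsp;</a>`;
    }

    function checkRequest(url: string, sessionId: string,
                          invalidateSession: (id: string) => void): boolean {
      if (url === TRAP_URL) {
        invalidateSession(sessionId); // the bot stumbled into the trap
        return false;                 // stop honoring this session
      }
      return true;
    }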
--
Groet, Cordialement, Pozdrawiam, Regards,

Kees Nuyt
Stanislav Paskalev
2012-10-30 13:31:13 UTC
Add/remove the links on mouseover. This might be a little far-fetched,
though, and should probably be exposed as an option.

Regards,
Stanislav Paskalev
Post by Kees Nuyt
On Tue, 30 Oct 2012 06:17:05 -0400, Richard Hipp wrote:
Post by Richard Hipp
Finally: Do you have any further ideas on how to defend a Fossil website
against runs such as the two we observed on SQLite last night?
Include a (mostly invisible, perhaps hard to recognize) logout hyperlink
on every page that immediately invalidates the session if it is
followed. Users will not see it and not be bothered by it, bots will
stumble upon it.
--
Groet, Cordialement, Pozdrawiam, Regards,
Kees Nuyt
Nolan Darilek
2012-10-30 15:13:47 UTC
And, most importantly, don't sacrifice accessibility in the name of
excluding bots. Mouseover links are notoriously inaccessible. The same
goes for only adding href= on focus via JS rather than on page load: if I
tab through a page, that would seem to break keyboard navigation.
Post by Stanislav Paskalev
Add/remove the links on mouseOver. Although this might be a little bit
far-fetched and should probably be exposed as an option.
Regards,
Stanislav Paskalev
Post by Kees Nuyt
On Tue, 30 Oct 2012 06:17:05 -0400, Richard Hipp wrote:
Post by Richard Hipp
Finally: Do you have any further ideas on how to defend a Fossil website
against runs such as the two we observed on SQLite last night?
Include a (mostly invisible, perhaps hard to recognize) logout hyperlink
on every page that immediately invalidates the session if it is
followed. Users will not see it and not be bothered by it, bots will
stumble upon it.
--
Groet, Cordialement, Pozdrawiam, Regards,
Kees Nuyt
Kees Nuyt
2012-10-30 16:02:08 UTC
On Tue, 30 Oct 2012 10:13:47 -0500, Nolan Darilek wrote:
Post by Nolan Darilek
And, most importantly, don't sacrifice accessibility in the name of
excluding bots. Mouseover links are notoriously inaccessible. Same with
only adding href on focus via JS rather than on page load. If I tab
through a page, that would seem to break keyboard navigation.
I agree.
I should have been more explicit: run the script when <body> gets focus,
not per hyperlink.
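
Something like the following sketch, as one reading of that idea
(hypothetical: it arms all the hrefs once, on the first sign of user
interaction, so keyboard users are covered by their first keystroke):

    // Hypothetical sketch of "arm the links on focus rather than per link":
    // all hrefs are filled in once, on the first sign of a real user, which
    // a page-fetching script never produces.
    function armHyperlinks(): void {
      document.querySelectorAll<HTMLAnchorElement>("a[data-href]").forEach((a) => {
        if (a.dataset.href) a.href = a.dataset.href;
      });
    }

    ["focus", "keydown", "mousemove", "touchstart"].forEach((ev) =>
      window.addEventListener(ev, armHyperlinks, { once: true })
    );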
--
Groet, Cordialement, Pozdrawiam, Regards,

Kees Nuyt
Steve Havelka
2012-10-30 16:27:57 UTC
My guess is that you don't really want to filter out bots, specifically,
but rather anyone who is attempting to hit every link Fossil makes; that
is to say, it's the behavior we're trying to stop here, not the actor.

I suppose what I'd do is set up a mechanism to detect when the remote
user is pulling down data too quickly to be a non-abusive human, and when
Fossil detects that, send back a blank "Whoa, nellie! Slow down, human!"
page for a minute or five.

I'd allow the user to configure two thresholds: the number of pages per
second that triggers this, and the number of seconds within a five-minute
window during which the pages-per-second threshold may be exceeded. I'd
give them defaults of "3 pages per second" and "3 times in five minutes".
So, for example, if a user hits 3 links in one second, which can happen if
you know exactly where you're going and the repository loads quickly, it's
OK the first time, even the second, but the third time it locks you out of
the web interface for a little while.

Command-line stuff, like cloning/push/pull actions, ought to remain
accessible under all circumstances, regardless of the activity on the
web UI.
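
As a sketch of that scheme (names, data structures and defaults are
illustrative, not an actual Fossil patch):

    // Sketch of the two-threshold throttle described above. Defaults follow
    // the suggestion: a second with 3 or more pages counts as a strike, and
    // 3 strikes within a 5-minute window locks the web UI out for 5 minutes.
    class WebThrottle {
      private hitsThisSecond = new Map<string, { second: number; count: number }>();
      private strikes = new Map<string, number[]>();
      private lockedUntil = new Map<string, number>();

      constructor(
        private pagesPerSecond = 3,
        private strikesAllowed = 3,
        private strikeWindowMs = 5 * 60 * 1000,
        private lockoutMs = 5 * 60 * 1000,
      ) {}

      /** Returns true if this web request should be served. */
      allow(client: string, now = Date.now()): boolean {
        if (now < (this.lockedUntil.get(client) ?? 0)) return false;

        // Count requests in the current one-second bucket.
        const second = Math.floor(now / 1000);
        const bucket = this.hitsThisSecond.get(client);
        const count = bucket && bucket.second === second ? bucket.count + 1 : 1;
        this.hitsThisSecond.set(client, { second, count });

        // The moment the per-second threshold is reached, record one strike.
        if (count === this.pagesPerSecond) {
          const recent = (this.strikes.get(client) ?? []).filter(
            (t) => now - t < this.strikeWindowMs,
          );
          recent.push(now);
          this.strikes.set(client, recent);
          if (recent.length >= this.strikesAllowed) {
            this.lockedUntil.set(client, now + this.lockoutMs);
            return false;
          }
        }
        return true; // clone/push/pull traffic would bypass this check
      }
    }

A browser opening a handful of tabs trips a strike or two at worst, while
a sustained 5-to-10-pages-per-second run like the ones described above
would be locked out within a few seconds.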

What do you think?
Nico Williams
2012-10-31 23:24:56 UTC
If the robot runs in the context of a browser (as a plugin, say), then
using JS to populate href attributes becomes irrelevant: the robot sees
the DOM of the page as it would be rendered to the user.

Kees Nuyt's suggestion of a hidden link which disables the session
when followed strikes me as quite clever and worth trying. I also
second Kees' suggestion of applying DoS fighting techniques.

Nico