2015-03-02 12:30:44 UTC
succinctly points up some of the problems with DVCS versus centralized
VCS (like subversion). Much further discussion occurs on the various
news aggregator sites.
So I was thinking, could Fossil 2.0 be enhanced in ways to support
scaling to the point where it works on really massive projects?
The key idea would be to relax the requirement that each client load
the entire history of the project. Instead, a clone would only load a
limited amount of history (a month, a year, perhaps even just the most
recent check-in). This would make cloning much faster and the
resulting clone much smaller. Missing content could be downloaded
from the server on an as-needed basis. So, for example, if the user
does "fossil update trunk:2010-01-01" then the local client would
first have to go back to the server to fetch content from 2010. The
additional content would be added to the local repository. And so the
repository would still grow. But it grows only on an as-needed basis
rather than starting out at full size. And in the common case where
the developer never needs to look at any content over a few months
old, the growth is limited.
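
As a rough illustration of the as-needed fetching described above, here is a minimal Python sketch. It is not Fossil code. The class ShallowRepo, the fetch_remote callback, and the dictionary standing in for the server are all hypothetical; the only point is that a missing artifact is pulled from the server once and then stays in the local repository.

```python
from typing import Callable, Dict


class ShallowRepo:
    """Hypothetical local repository that starts with only part of the history."""

    def __init__(self, fetch_remote: Callable[[str], bytes]) -> None:
        self.local: Dict[str, bytes] = {}   # artifacts already stored locally
        self.fetch_remote = fetch_remote    # assumed way to ask the server
        self.remote_fetches = 0             # how often we had to go to the server

    def get(self, artifact_id: str) -> bytes:
        """Return an artifact, downloading and keeping it if it is missing."""
        if artifact_id not in self.local:
            self.local[artifact_id] = self.fetch_remote(artifact_id)
            self.remote_fetches += 1
        return self.local[artifact_id]


# Stand-in for the server: artifact id -> content (made-up data).
SERVER = {"a1": b"check-in from 2010", "b2": b"recent check-in"}

repo = ShallowRepo(SERVER.__getitem__)
repo.local["b2"] = SERVER["b2"]   # the shallow clone loaded only recent content

old = repo.get("a1")      # not local yet: fetched from the server and stored
again = repo.get("a1")    # second request is served from the local repository
```

The local store grows only when old content is actually requested, which is the behavior the proposal relies on.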
By downloading the meta-data that is currently computed locally by
"rebuild", many operations on older content, such as timelines or
search, could be performed even without having the data present. In
the "bsd-src.fossil" repository, the content is 78% of the repository
file and the meta-data is the other 22%. So a clone that stored only
the most recent content together with all meta-data might be about
one quarter the size of a full clone. For even greater savings, the
meta-data could also be time-limited, though not as severely as the
content. A clone might, for example, initialize with only the last
month of content and the last five years of meta-data.
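
The size estimate can be worked through with a few lines of Python. The 78% and 22% figures are the ones quoted above for bsd-src.fossil. The share of content that counts as "recent" (1% here) is purely an assumption chosen for illustration, and the function name is hypothetical.

```python
def shallow_clone_fraction(meta_fraction: float,
                           content_fraction: float,
                           recent_content_share: float) -> float:
    """Size of a shallow clone relative to a full clone.

    All meta-data is kept, plus only the given share of the content.
    """
    return meta_fraction + content_fraction * recent_content_share


# 22% meta-data and 78% content, as measured for bsd-src.fossil.
# The 1% recent-content share is an assumed value, not a measurement.
fraction = shallow_clone_fraction(0.22, 0.78, 0.01)
print(f"shallow clone is about {fraction:.0%} of a full clone")
```

With only a small share of the content kept, the result is dominated by the 22% of meta-data, which is where the "about one quarter" estimate comes from.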
For "wide" repositories (such as bsd-src) that hold many thousands of
files in a single check-out, Fossil could be enhanced to allow
cloning, checkout, and commit of just a small slice of the entire
tree. So, for example, a clone might hold just the bin/ subdirectory
of bsd-src containing just 56 files, rather than all 147720 files of a
complete check-out. Fossil should be able to do everything it
normally does with just this subset, including commit changes, except
that on new manifests generated by the commit, the R-card would have
to be omitted since the entire tree is necessary to compute the
R-card. But the R-card is optional already, controlled by the
"repo-cksum" setting, which is turned off in bsd-src, so there would
be no loss in functionality.
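
To show why a slice cannot produce a whole-tree checksum, here is a small Python sketch. The checksum below is a simplified stand-in and is not claimed to be Fossil's actual R-card algorithm. The function names and the three-file tree are made up for illustration.

```python
import hashlib
from typing import Dict


def slice_tree(tree: Dict[str, bytes], prefix: str) -> Dict[str, bytes]:
    """Keep only the files whose path starts with the given directory prefix."""
    return {name: data for name, data in tree.items() if name.startswith(prefix)}


def tree_checksum(tree: Dict[str, bytes]) -> str:
    """Simplified whole-tree checksum: MD5 over every file name and its content.

    This only illustrates the dependency on every file in the tree.
    """
    h = hashlib.md5()
    for name in sorted(tree):
        h.update(name.encode("utf-8"))
        h.update(tree[name])
    return h.hexdigest()


# Made-up miniature check-out.
full_tree = {
    "bin/a.c": b"int main(void) { return 0; }\n",
    "bin/b.c": b"int helper(void) { return 1; }\n",
    "lib/c.c": b"int lib(void) { return 2; }\n",
}

bin_only = slice_tree(full_tree, "bin/")

# A client holding only bin/ can hash what it has, but that is not the
# checksum of the complete tree, because lib/c.c is missing.
slice_sum = tree_checksum(bin_only)
full_sum = tree_checksum(full_tree)
```

Since the slice checksum differs from the full-tree checksum, a client that holds only the slice has no way to produce the latter without the rest of the files.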
Tickets and wiki in a clone might be similarly limited to (say) the
previous 12 months of content, or the most recent change, whichever is
larger.
With these kinds of changes, it seems like Fossil might be made to
scale to arbitrarily massive repositories on the client side. On the
server side, the current design would work until the repository grew
too big to fit into a single disk file, at which point the server
would need to be redesigned to use a client/server database, like
PostgreSQL, that can scale to sizes larger than the 140 terabyte limit
of SQLite. But that would be a really big repo. 22 years of BSD
history fits in 7.2 GB, or 61 GB uncompressed. So it would take a
rather larger project to get into the terabyte range.
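
For a sense of scale, the figures above can be compared directly. This is back-of-the-envelope arithmetic only, and it assumes decimal units (1 GB = 10^9 bytes, 1 TB = 10^12 bytes).

```python
GB = 10**9    # assuming decimal units
TB = 10**12

bsd_repo = 7.2 * GB       # 22 years of BSD history, compressed
sqlite_limit = 140 * TB   # SQLite size limit quoted above

# How many BSD-sized histories fit in one terabyte, and under the limit.
per_terabyte = TB / bsd_repo
under_limit = sqlite_limit / bsd_repo

print(f"about {per_terabyte:.0f} BSD-sized repositories per terabyte")
print(f"about {under_limit:.0f} BSD-sized repositories under the 140 TB limit")
```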
The sync protocol would need to be greatly enhanced to support this
functionality. Also, the schema for the meta-data, which currently is
an implementation detail, would need to become part of the interface.
Exposing the meta-data as interface would have been unthinkable a few
years ago, but at this point we have accumulated enough experience
about what is needed in the meta-data to perhaps make exposing its
design a reasonable alternative.
These are just thoughts to elicit comments and discussion. I have
several unrelated and much higher-priority tasks to keep me busy at
the moment, so this is not something that would happen right away,
unless somebody else steps up to do a lot of the implementation work.
D. Richard Hipp