Discussion:
Fossil 2.1: Scaling
Richard Hipp
2015-03-02 12:30:44 UTC
Permalink
Ben Pollack's essay at
http://bitquabit.com/post/unorthodocs-abandon-your-dvcs-and-return-to-sanity/
succinctly points up some of the problems with DVCS versus centralized
VCS (like subversion). Much further discussion occurs on the various
news aggregator sites.

So I was thinking, could Fossil 2.0 be enhanced in ways to support
scaling to the point where it works on really massive projects?

The key idea would be to relax the requirement that each client load
the entire history of the project. Instead, a clone would only load a
limited amount of history (a month, a year, perhaps even just the most
recent check-in). This would make cloning much faster and the
resulting clone much smaller. Missing content could be downloaded
from the server on an as-needed basis. So, for example, if the user
does "fossil update trunk:2010-01-01" then the local client would
first have to go back to the server to fetch content from 2010. The
additional content would be added to the local repository. And so the
repository would still grow. But it grows only on an as-needed basis
rather than starting out at full size. And in the common case where
the developer never needs to look at any content over a few months
old, the growth is limited.
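A minimal sketch of that on-demand behavior (in Python, with plain dicts standing in for Fossil's SQLite blob table; the class and method names are hypothetical, not Fossil's):

```python
# Artifacts missing from a shallow clone are pulled from the server on
# first access and cached locally, so the clone grows only as old
# history is actually touched.

class RemoteServer:
    """Stands in for the master repository, which holds full history."""
    def __init__(self, artifacts):
        self.artifacts = artifacts          # uuid -> content
        self.fetch_count = 0

    def fetch(self, uuid):
        self.fetch_count += 1
        return self.artifacts[uuid]

class ShallowClone:
    """Holds only recent artifacts; older ones are fetched lazily."""
    def __init__(self, server, recent):
        self.server = server
        self.blobs = dict(recent)           # local cache of artifacts

    def get_artifact(self, uuid):
        if uuid not in self.blobs:          # e.g. "fossil update trunk:2010-01-01"
            self.blobs[uuid] = self.server.fetch(uuid)
        return self.blobs[uuid]

server = RemoteServer({"old1": b"2010 content", "new1": b"2015 content"})
clone = ShallowClone(server, recent={"new1": b"2015 content"})

clone.get_artifact("new1")                  # already local: no round trip
clone.get_artifact("old1")                  # one fetch from the server
clone.get_artifact("old1")                  # now cached: no second fetch
```

The repository still only ever grows, exactly as described above, but the growth is driven by what the developer actually looks at.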

By downloading the meta-data that is currently computed locally by
"rebuild", many operations on older content, such as timelines or
search, could be performed even without having the data present. In
the "bsd-src.fossil" repository, the content is 78% of the repository
file and the meta-data is the other 22%. So a clone that stored only
the most recent content together with all metadata might be about
1/4th the size of a full clone. For even greater savings, perhaps the
metadata could be time-limited, though not as severely as the content.
So perhaps the clone would only initialize to the last month of
content and the last five years of metadata.

For "wide" repositories (such as bsd-src) that hold many thousands of
files in a single check-out, Fossil could be enhanced to allow
cloning, checkout, and commit of just a small slice of the entire
tree. So, for example, a clone might hold just the bin/ subdirectory
of bsd-src containing just 56 files, rather than all 147720 files of a
complete check-out. Fossil should be able to do everything it
normally does with just this subset, including commit changes, except
that on new manifests generated by the commit, the R-card would have
to be omitted since the entire tree is necessary to compute the
R-card. But the R-card is optional already, controlled by the
"repo-cksum" setting, which is turned off in bsd-src, so there would
be no loss in functionality.
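To illustrate why the R-card has to be omitted (this is not Fossil's actual R-card algorithm, just a stand-in checksum over sorted name/content pairs):

```python
# The repository checksum covers every file in the check-out, so any
# file outside the cloned slice makes it uncomputable from the narrow
# clone alone.

import hashlib

def repo_checksum(tree):
    """Checksum over the *entire* tree, in a canonical order."""
    h = hashlib.md5()
    for name in sorted(tree):
        h.update(name.encode() + b"\0" + tree[name] + b"\0")
    return h.hexdigest()

full_tree = {"bin/ls.c": b"ls", "bin/cat.c": b"cat", "lib/libc.c": b"libc"}
narrow_slice = {k: v for k, v in full_tree.items() if k.startswith("bin/")}

# The slice alone yields a different checksum than the full tree, so a
# commit made from a narrow clone would have to omit the R-card.
assert repo_checksum(narrow_slice) != repo_checksum(full_tree)
```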

Tickets and wiki in a clone might be similarly limited to (say) the
previous 12 months of content, or the most recent change, whichever is
larger.

With these kinds of changes, it seems like Fossil might be made to
scale to arbitrarily massive repositories on the client side. On the
server side, the current design would work until the repository grew
too big to fit into a single disk file, at which point the server
would need to be redesigned to use a client/server database, like
PostgreSQL, that can scale to sizes larger than the 140 terabyte limit
of SQLite. But that would be a really big repo. 22 years of BSD
history fits in 7.2 GB, or 61 GB uncompressed. So it would take a
rather larger project to get into the terabyte range.

The sync protocol would need to be greatly enhanced to support this
functionality. Also, the schema for the meta-data, which currently is
an implementation detail, would need to become part of the interface.
Exposing the meta-data as interface would have been unthinkable a few
years ago, but at this point we have accumulated enough experience
about what is needed in the meta-data to perhaps make exposing its
design a reasonable alternative.

These are just thoughts to elicit comments and discussion. I have
several unrelated and much higher-priority tasks to keep me busy at
the moment, so this is not something that would happen right away,
unless somebody else steps up to do a lot of the implementation work.
--
D. Richard Hipp
***@sqlite.org
Richard Boehme
2015-03-02 14:33:21 UTC
Permalink
One question that arises is: how do I define what a "server" is? Can I
get the complete repository history for everything else but get a more
limited history for files that are larger than a certain size, or that
have certain extensions?

How would this work with sub-repositories? (Sorry, I'm not versed very
well in fossil, but I understand that there can be sub-repositories
nested under the main one, for instance for a directory which contains
a lot of videos or images.)

Thanks.

Richard
--
Thank you.

Richard Boehme

Email: ***@gmail.com
Phone: 443-739-8502
Work Phone: 410-966-6606 (Mon - Thu 6 AM - 4:30 PM)
Richard Hipp
2015-03-02 14:44:33 UTC
Permalink
Post by Richard Boehme
One question that arises is: how do I define what a "server" is? Can I
get the complete repository history for everything else but get a more
limited history for files that are larger than a certain size, or that
have certain extensions?
That is theoretically possible given the file format. It is simply a
question of writing the necessary code to implement that capability.
Post by Richard Boehme
How would this work with sub-repositories (sorry, not versed very well
in fossil, but I understand that there can be sub respositories that
are nested under the main one (for instance for a directory which
contains a lot of videos or images))
I think sub-repositories are an orthogonal topic.
--
D. Richard Hipp
***@sqlite.org
Joerg Sonnenberger
2015-03-02 16:24:09 UTC
Permalink
Post by Richard Hipp
So I was thinking, could Fossil 2.0 be enhanced in ways to support
scaling to the point where it works on really massive projects?
I think the single biggest practical issue right now still goes back to
the baseline manifests not being efficient enough. Would you consider
changing the rules to allow truly incremental manifests? I agree that
having full manifests is sometimes nicer, but I think those could be
built on demand and cached separately. I believe that is the majority of
the current meta-data, which matters a lot whenever a rebuild happens.

Joerg
Richard Hipp
2015-03-02 16:38:38 UTC
Permalink
Post by Joerg Sonnenberger
Post by Richard Hipp
So I was thinking, could Fossil 2.0 be enhanced in ways to support
scaling to the point where it works on really massive projects?
I think the single biggest practical issue right now still goes back to
the baseline manifests not being efficient enough. Would you consider
changing the rules to allow truly incremental manifests? I agree that
having full manifests is sometimes nicer, but I think those could be
built on demand and cached separately. I believe that is the majority of
the current meta-data, which matters a lot whenever a rebuild happens.
The current mechanism is to have periodic full baseline manifests, and
then have deltas against those baselines in between. Hence, no more
than two artifacts ever need to be decoded in order to access a
manifest - the baseline and its delta.

Are you proposing to have deltas of deltas, so that a potentially
large number of artifacts need to be decoded in order to reconstruct
the complete manifest?

I don't understand how that would help. Can you provide more explanation?
--
D. Richard Hipp
***@sqlite.org
Joerg Sonnenberger
2015-03-02 17:22:38 UTC
Permalink
Post by Richard Hipp
Post by Joerg Sonnenberger
Post by Richard Hipp
So I was thinking, could Fossil 2.0 be enhanced in ways to support
scaling to the point where it works on really massive projects?
I think the single biggest practical issue right now still goes back to
the baseline manifests not being efficient enough. Would you consider
changing the rules to allow truly incremental manifests? I agree that
having full manifests is sometimes nicer, but I think those could be
built on demand and cached separately. I believe that is the majority of
the current meta-data, which matters a lot whenever a rebuild happens.
The current mechanism is to have periodic full baseline manifests, and
then have deltas against those baselines in between. Hence, no more
than two artifacts ever need to be decoded in order to access a
manifest - the baseline and its delta.
I know. The manifest contains two parts: non-file content and the file
list. For delta manifests, the file list is encoded as changes relative
to the baseline.
Post by Richard Hipp
Are you proposing to have deltas of deltas, so that a potentially
large number of artifacts need to be decoded in order to reconstruct
the complete manifest?
I think we have two different situations when it comes to accessing the
file list:

(1) Getting the full list. This is primarily used for initial checks and
as part of the status handling of checkouts, maybe also for the web view.

(2) Getting the changes relative to another checkin. This is what update
etc. is interested in.

The problem with the baseline encoding is that it still has a high
degree of redundancy. While delta compression removes a good chunk of
the overhead in terms of disk space, rebuild still has to process the
full amount. That's a significant part for a large tree. My suggestion
is to store a plain file delta in the manifest. Let's call this a
pure delta manifest. Rebuild parsing is then linear in the number of
changed files. The plink table is a direct mapping of the pure delta
manifest; they have effectively the same data. To keep the performance
of case (1) above, a new full-manifest table is stored separately and
computed on demand. That can be either during rebuild or on first
access. Heuristics like "X commits since last full manifest" can be
applied. This is a (local) cache: no need to transfer it via the sync
protocol, no need to preserve it during rebuild either.
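The reconstruction step can be sketched like this (in Python, with dicts standing in for the manifest artifacts and the proposed cache table; the structure is mine, not Fossil's):

```python
# Each check-in stores only the files it adds or removes relative to its
# parent; the full file list is reconstructed on demand by walking the
# delta chain, then cached -- the stand-in here for a separate
# full-manifest table that never travels over the sync protocol.

deltas = {
    # checkin: (parent, files added, files removed)
    "c1": (None, {"a", "b"}, set()),
    "c2": ("c1", {"c"}, set()),
    "c3": ("c2", {"d"}, {"a"}),
}
full_cache = {}                       # local cache, rebuilt lazily

def file_list(checkin):
    if checkin in full_cache:
        return full_cache[checkin]
    parent, added, removed = deltas[checkin]
    base = file_list(parent) if parent else set()
    files = (base - removed) | added
    full_cache[checkin] = files       # cache so later lookups are cheap
    return files
```

With this sketch, `file_list("c3")` walks c1 and c2 once, yields `{"b", "c", "d"}`, and caches the full lists for all three check-ins along the way; rebuild itself only ever parses the per-commit deltas.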

Joerg
Warren Young
2015-03-03 21:58:18 UTC
Permalink
I’m going to start two different reply forks: I’ll reply to the Pollack article here, then send another message later to chime in on your proposal, drh.
Post by Richard Hipp
Ben Pollack's essay at
http://bitquabit.com/post/unorthodocs-abandon-your-dvcs-and-return-to-sanity/
succinctly points up some of the problems with DVCS versus centralized
VCS (like subversion).
Thanks for the pointer. It sums up most of my problems with the Git and GitHub models. It’s too bad Pollack doesn’t include Fossil in his comparison.

I don’t think all of his points apply to Fossil:

1. “Sanely track renames.” In this respect, I think Fossil offers one step forward, one back relative to Subversion.

While Fossil does seem to realize that a rename isn’t the same thing as add+delete *most* of the time — I have managed to confuse it a few times into seeing a rename as add+delete — it doesn’t backtrace through a rename in finfo output:

f new x.fossil
mkdir x
cd x
f open ../x.fossil
touch a
f add a
f ci -m 'initial'
f mv a b
mv a b
f ci -m 'renamed a to b'
f finfo b
2015-03-03 [bc09e28048] .. (user: warren, artifact: [da39a3ee5e], branch:
trunk)

Point being, I usually end up having to go into “fossil ui” to trace the ancestry of a file back through a rename. Doubtless it is possible to do this from the command line somehow, but I miss the behavior of “svn log” which did the backtrace for you.
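The backtrace itself is conceptually simple. Here is a sketch of what a rename-following finfo could do (the history records are hypothetical stand-ins for whatever Fossil keeps internally; its mlink table does track filename changes across check-ins):

```python
# Walk a file's history backward, switching to the old name whenever a
# rename record is crossed -- the "svn log"-style backtrace.

history = [
    # (checkin, old name or None if created, name after the check-in)
    ("ci1", None, "a"),               # 'a' created
    ("ci2", "a", "b"),                # 'a' renamed to 'b'
    ("ci3", "b", "b"),                # 'b' edited
]

def finfo_with_renames(name):
    """Yield check-ins touching `name`, following renames backward."""
    out = []
    for checkin, old, new in reversed(history):
        if new == name:
            out.append(checkin)
            name = old or name        # follow the rename back to 'a'
    return out
```

Under these assumptions, `finfo_with_renames("b")` returns all three check-ins, including the pre-rename history of "a" that the current finfo output stops short of.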


2. Explosion of manuals and tutorials.

To some extent, the relative paucity of Fossil training material is a consequence of its…ummm… unpopularity? But, it is also the case that it doesn’t *need* as much training material.

I read the Red Bean Book, and I still don’t quite understand how svn branching and merging is supposed to work in practice. We ended up ignoring about half of the mechanics proposed therein, which made our conversion to Fossil more difficult, but it worked at the time for us.

With Fossil, though, I’m branching and merging for the first time without difficulty. The hardest part was getting past the scar tissue laid down by svn. “It’s that simple, really? No way, can’t be. There’s got to be more to it than that!”

Partly this is due to the ability to create a branch as part of a checkin with "f ci --branch”. Partly it is due to the branch structure being made visible in f ui. These are genuine advances over Subversion, and I thank you for them, drh.


3. “fossil bundle” makes Fossil nearly as easy to use as Subversion for drive-by patches. I believe the equivalent Fossil sequence is:

a. Clone the master repo
b. Open a copy of the repo
c. Make your change
d. Check it in on a branch; ignore the auto-sync complaint
e. Bundle the new branch
f. Send the bundle to the project maintainer
g. Watch it get ignored

So, just two more steps than svn, rather than four more, as with git.

One of the two extra steps is due to the fact that clone is separate from open in Fossil, which I consider a feature. It allows multiple opens on a local clone.

I absolutely hate the Git alternative where you have to keep switching the local checkout/repo to see different branches. The checkout operation itself is time consuming, then it eats more time due to the forced rebuilds, since it must rebuild the objects to match the changed sources. With Fossil, I can keep not only multiple source checkouts from a single repo clone, but also multiple build trees.

The other extra step is the apparent necessity to check your changes in. This is also a feature since it’s how Fossil records the checkin comment, the user info, etc. If the project maintainer accepts the bundle without changes, this step saves him work that he’d otherwise have to do manually in the patch(1) case.

If someone wants to tie “fossil bundle” into the ticketing system, we can save a step here by bypassing the email step. (fossil bundle submit?)

Perhaps we could save the other step by offering a clone-and-open mode, perhaps by storing the Fossil repo file inside the opened tree? I propose “fossil hack,” so named because it would be used by people who just want to do a quick hack on your repo, not seriously spend a lot of time with it.
Warren Young
2015-03-03 22:34:45 UTC
Permalink
Post by Richard Hipp
The key idea would be to relax the requirement that each client load
the entire history of the project. Instead, a clone would only load a
limited amount of history (a month, a year, perhaps even just the most
recent check-in).
This would be wonderful!

I would suggest a refinement on the simple “SELECT * WHERE modification_date < 1month” idea, though: I actually want the past month (or whatever) of history on *each* open branch relative to the date of the checkin time on the tip of that branch.

That is, if I last changed the “v8” branch two years ago, I still expect it to give me a month’s worth of file info on that branch. I need it if I am going to do tech support on that branch: “v10 does thus-and-such when you press the Foobie button, but v8 behaves differently. Can you look into the code to find out why?” I also need it if I’m going to backport a fix to v8 from v10.
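In SQL terms, the refinement looks something like this (the schema is hypothetical, not Fossil's, and integer timestamps stand in for dates):

```python
# Instead of one global cutoff ("mtime > now - 1 month"), keep a window
# per branch measured back from that branch's own tip, so a dormant
# branch like v8 still carries its last month of history.

import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE checkin (branch TEXT, mtime INTEGER)")
db.executemany("INSERT INTO checkin VALUES (?, ?)", [
    ("trunk", 1000), ("trunk", 995), ("trunk", 900),   # active branch
    ("v8",    300),  ("v8",    295), ("v8",    100),   # tip is years old
])

# Keep everything within 30 "days" of each branch's *own* tip:
kept = db.execute("""
    SELECT c.branch, c.mtime
      FROM checkin AS c
      JOIN (SELECT branch, MAX(mtime) AS tip
              FROM checkin GROUP BY branch) AS t
        ON t.branch = c.branch
     WHERE c.mtime >= t.tip - 30
     ORDER BY c.branch, c.mtime DESC
""").fetchall()
# A single global cutoff of "mtime >= 970" would have dropped all of
# v8's history; this query keeps v8's last month as measured from v8.
```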

I’m thinking this should be the default behavior, at least by configuration at the master repo level. It should take extra flags to get a complete clone. This is a break with existing practice, but Pollack’s right: it’s rare for me to actually need deep history on every branch. The longest I typically need history for is however long it’s been since the last release on that branch.

If I get this feature, it’s going to make me want another one, though: the ability to merge two repositories.

For one of my several svn repos which I converted to Fossil, I purposely checked in only the tip of the svn trunk into a fresh repo, rather than convert a decade of history. I did it for the reason Pollack points out in the article: faster checkins and other tree traversals. If I need to go back into the pre-Fossil history, I have a separate svn-to-Fossil conversion repo.

If Fossil’s speed becomes independent of the depth of the checkin history, I’d like the ability to make those two into a single repo, since the *effect* of a clone will then be more like my current setup, where there are relatively few checkins, and none of the blobs have forks in their history yet.

I realize I can probably do that by hand with some kind of export-and-reimport via the Git fast-export path, but I’d like to do it entirely within Fossil, if possible.
Post by Richard Hipp
For "wide" repositories (such as bsd-src) that hold many thousands of
files in a single check-out, Fossil could be enhanced to allow
cloning, checkout, and commit of just a small slice of the entire
tree.
This would also be awesome. I miss that from Subversion. Those of us who have converted from Subversion often have trees that depend on the ability to check out just a slice of the tree.

I had only one Subversion repo at home, with everything “versionable” stored in it. I could then check out different chunks of the tree, placing each where I wanted that working subtree to live.

If I were creating such a system fresh in Fossil today, I’d create separate Fossil repos for each different sub-tree of files. Not because this is what *I* need, but because this is what *Fossil* expects.

I’m currently hacking around this by checking the monolithic repo out into a hidden location, then creating symlinks back into subfolders of that checkout. Yes, ick.

I’ve considered reconverting the repo, using Git’s ability to rewrite history and thereby slice the repo up, but that’s just more work than it’s worth to me.

What I really want is what you propose: Subversion-like subrepo checkouts.
Nico Williams
2015-04-17 01:58:55 UTC
Permalink
Post by Richard Hipp
Ben Pollack's essay at
http://bitquabit.com/post/unorthodocs-abandon-your-dvcs-and-return-to-sanity/
succinctly points up some of the problems with DVCS versus centralized
VCS (like subversion). Much further discussion occurs on the various
news aggregator sites.
So I was thinking, could Fossil 2.0 be enhanced in ways to support
scaling to the point where it works on really massive projects?
The key idea would be to relax the requirement that each client load
the entire history of the project. Instead, a clone would only load a
git can do this, and it's a relatively new feature. The really nice
thing would be to load whatever is needed on demand, or to perform
certain operations (e.g., producing annotated sources, viewing
history, ...) on the server.
Post by Richard Hipp
limited amount of history (a month, a year, perhaps even just the most
recent check-in). This would make cloning much faster and the
resulting clone much smaller. Missing content could be downloaded
from the server on an as-needed basis. So, for example, if the user
does "fossil update trunk:2010-01-01" then the local client would
first have to go back to the server to fetch content from 2010. The
additional content would be added to the local repository. And so the
repository would still grow. But it grows only on an as-needed basis
rather than starting out at full size. And in the common case where
the developer never needs to look at any content over a few months
old, the growth is limited.
By downloading the meta-data that is currently computed locally by
"rebuild", many operations on older content, such as timelines or
search, could be performed even without having the data present. In
the "bsd-src.fossil" repository, the content is 78% of the repository
file and the meta-data is the other 22%. So a clone that stored only
the most recent content together with all metadata might be about
1/4th the size of a full clone. For even greater savings, perhaps the
metadata could be time-limited, though not as severely as the content.
So perhaps the clone would only initialize to the last month of
content and the last five years of metadata.
For "wide" repositories (such as bsd-src) that hold many thousands of
files in a single check-out, Fossil could be enhanced to allow
cloning, checkout, and commit of just a small slice of the entire
tree. So, for example, a clone might hold just the bin/ subdirectory
of bsd-src containing just 56 files, rather than all 147720 files of a
complete check-out. Fossil should be able to do everything it
normally does with just this subset, including commit changes, except
that on new manifests generated by the commit, the R-card would have
to be omitted since the entire tree is necessary to compute the
R-card. But the R-card is optional already, controlled by the
"repo-cksum" setting, which is turned off in bsd-src, so there would
be no loss in functionality.
Yes, this would be very nice. Though a BSD would probably need
significant build system rototilling to make it possible for
developers to work on isolated portions of the code with partial
clones only.
Post by Richard Hipp
The sync protocol would need to be greatly enhanced to support this
functionality. Also, the schema for the meta-data, which currently is
an implementation detail, would need to become part of the interface.
Exposing the meta-data as interface would have been unthinkable a few
years ago, but at this point we have accumulated enough experience
about what is needed in the meta-data to perhaps make exposing its
design a reasonable alternative.
Exposing the metadata would be one of the best things Fossil could do,
IMO, once it's ready.

Nico
--
