Discussion:
Backups of deconstructed fossil repositories
Thomas Levine
2018-06-17 18:16:29 UTC
As content is added to a fossil repository, files in the corresponding
deconstructed repository never change; they are only added. Most backup
software will track changes to the deconstructed repository with great
efficiency.

I should thus take my backups of the deconstructed repositories, yes?
That is, should I back up the SQLite database format of the fossil
repository or the deconstructed directory format of the repository?

One inconvenience I noted is that the deconstruct command always writes
artefacts to the filesystem, even if a file of the appropriate name and
size and contents already exists. Would the developers welcome a flag
to blob_write_to_file in src/blob.c to skip the writing of a new
artefact file if the file already exists? That is, rebuild_step in
src/rebuild.c would check for the existence of the file corresponding to
the artefact's hash, and if such a file exists already (even if its
content is wrong), rebuild_step would skip writing this artefact.
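
Roughly, I am imagining something like the following. The wrapper name is
mine and only a sketch; a real patch would presumably add a flag to
blob_write_to_file() itself. Blob and blob_write_to_file() are the existing
ones from src/blob.c.

    #include <unistd.h>   /* access() */

    /* Hypothetical wrapper: skip the write entirely when a file with the
    ** artefact's content-addressed name is already on disk. */
    static void blob_write_to_file_if_missing(Blob *pBlob, const char *zFilename){
      if( access(zFilename, F_OK)==0 ){
        return;   /* a file with the hash-derived name exists; skip the write */
      }
      blob_write_to_file(pBlob, zFilename);
    }
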
Warren Young
2018-06-17 20:05:48 UTC
Post by Thomas Levine
One inconvenience I noted is that the deconstruct command always writes
artefacts to the filesystem, even if a file of the appropriate name and
size and contents already exists.
You might want to split that observation into two, as rsync does:

- name, size, and modification date match
- contents also match

If you’re willing to gamble that whenever the first test returns true the second will also return true, it buys you a big increase in speed. The gamble is worth taking as long as the files’ modification timestamps are trustworthy.

When the timestamps aren’t trustworthy, you do the first test, then if that returns true, also do the second as extra assurance.
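
In code, the two stages might look roughly like this. This is a sketch
rather than rsync's or Fossil's actual implementation, and the function
names are made up:

    #include <stdio.h>
    #include <string.h>
    #include <sys/stat.h>
    #include <sys/types.h>
    #include <time.h>

    /* Stage 1: cheap metadata test (same size, same mtime). */
    static int quick_match(const char *zFilename, off_t size, time_t mtime){
      struct stat st;
      if( stat(zFilename, &st)!=0 ) return 0;    /* missing file: no match */
      return st.st_size==size && st.st_mtime==mtime;
    }

    /* Stage 2: expensive test, compare the actual bytes against aData. */
    static int content_match(const char *zFilename, const char *aData, size_t nData){
      FILE *in = fopen(zFilename, "rb");
      char zBuf[8192];
      size_t got, off = 0;
      if( in==0 ) return 0;
      while( (got = fread(zBuf, 1, sizeof(zBuf), in))>0 ){
        if( off+got>nData || memcmp(zBuf, aData+off, got)!=0 ){
          fclose(in);
          return 0;
        }
        off += got;
      }
      fclose(in);
      return off==nData;
    }

With trustworthy timestamps you stop after quick_match(); when they are not
trustworthy, quick_match() merely narrows the candidates and content_match()
supplies the extra assurance.
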
Post by Thomas Levine
Would the developers welcome a flag
to blob_write_to_file in src/blob.c to skip the writing of a new
artefact file if the file already exists?
In addition to your backup case, it might also benefit snapshotting mechanisms found in many virtual machine systems and in some of the more advanced filesystems. (ZFS, btrfs, APFS…)

However, I’ll also give a counterargument to the whole idea: you probably aren’t saving anything in the end. An intelligent deconstruct + backup probably saves no net I/O over just re-copying the Fossil repo DB to the destination unless the destination is *much* slower than the machine being backed up.

(rsync was created for the common case where networks are much slower than the computers they connect. rsync within a single computer is generally no faster than cp -r, and sometimes slower, unless you take the mtime optimization mentioned above.)

The VM/ZFS + snapshots case has a similar argument against it: if you’re using snapshots to back up a Fossil repo, deconstruction isn’t helpful. The snapshot/CoW mechanism will only clone the changed disk blocks in the repo.

So, what problem are you solving? If it isn’t the slow-networks problem, I suspect you’ve got an instance of the premature optimization problem here. If you go ahead and implement it, measure before committing the change, and if you measure a meaningful difference, document the conditions to help guide expectations.
Warren Young
2018-06-17 20:08:38 UTC
Post by Warren Young
If you’re willing to gamble that whenever the first test returns true the second will also return true, it buys you a big increase in speed. The gamble is worth taking as long as the files’ modification timestamps are trustworthy.
I just remembered something: “fossil up” purposely does not modify the mtimes of the files it writes to match the mtime of the file in the repository because it can cause difficult-to-diagnose build system errors. Writing changed files out with the current wall time as the mtime is more likely to cause correct builds.

I wonder if the fossil deconstruct mechanism also does the same thing? If so, then you can’t take the rsync mtime optimization without changing that behavior.
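
If deconstruct does write artefacts with the current wall time, making the
optimization safe would mean stamping each artefact file with a
deterministic time instead, e.g. the commit time. Something along these
lines, where the helper name and the choice of timestamp are only my
assumptions, not existing Fossil behavior:

    #include <fcntl.h>      /* AT_FDCWD */
    #include <sys/stat.h>   /* utimensat() */
    #include <time.h>

    /* Hypothetical helper: give an artefact file a stable mtime so that a
    ** later size+mtime comparison does not depend on when deconstruct ran. */
    static int set_stable_mtime(const char *zFilename, time_t stableTime){
      struct timespec aTimes[2];
      aTimes[0].tv_sec = stableTime;  aTimes[0].tv_nsec = 0;   /* atime */
      aTimes[1].tv_sec = stableTime;  aTimes[1].tv_nsec = 0;   /* mtime */
      return utimensat(AT_FDCWD, zFilename, aTimes, 0);
    }
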
Stephan Beal
2018-06-17 20:12:51 UTC
Post by Warren Young
If you’re willing to gamble that whenever the first test returns true the
second will also return true, it buys you a big increase in speed. The
gamble is worth taking as long as the files’ modification timestamps are
trustworthy.
I just remembered something: “fossil up” purposely does not modify the
mtimes of the files it writes to match the mtime of the file in the
repository because it can cause difficult-to-diagnose build system errors.
Writing changed files out with the current wall time as the mtime is more
likely to cause correct builds.
To that I'm going to add that fossil doesn't actually store any file
timestamps! It only records the time of a commit. When fossil is asked
"what's the timestamp for file X?", the answer is really the timestamp of
the last commit in which that file was modified.
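
For the curious, that answer can be reproduced directly against the
repository database. A minimal sketch with the SQLite C API; the filename,
mlink, and event table names are my reading of the repository schema, so
check them against your fossil version before relying on this:

    #include <stdio.h>
    #include <sqlite3.h>

    /* Print the time of the last check-in that touched zFile; this is the
    ** only "file timestamp" a fossil repository can answer with. */
    static void print_last_change(sqlite3 *db, const char *zFile){
      const char *zSql =
        "SELECT datetime(event.mtime)"
        "  FROM filename, mlink, event"
        " WHERE filename.name = ?1"
        "   AND mlink.fnid = filename.fnid"
        "   AND event.objid = mlink.mid"
        " ORDER BY event.mtime DESC LIMIT 1";
      sqlite3_stmt *pStmt;
      if( sqlite3_prepare_v2(db, zSql, -1, &pStmt, 0)!=SQLITE_OK ) return;
      sqlite3_bind_text(pStmt, 1, zFile, -1, SQLITE_STATIC);
      if( sqlite3_step(pStmt)==SQLITE_ROW ){
        printf("%s last changed at %s\n", zFile,
               (const char*)sqlite3_column_text(pStmt, 0));
      }
      sqlite3_finalize(pStmt);
    }
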
--
----- stephan beal
http://wanderinghorse.net/home/stephan/
"Freedom is sloppy. But since tyranny's the only guaranteed byproduct of
those who insist on a perfect world, freedom will have to do." -- Bigby Wolf
Thomas Levine
2018-06-29 23:21:46 UTC
Post by Warren Young
However, I’ll also give a counterargument to the whole idea: you
probably aren’t saving anything in the end. An intelligent deconstruct
+ backup probably saves no net I/O over just re-copying the Fossil repo
DB to the destination unless the destination is *much* slower than the
machine being backed up.
(rsync was created for the common case where networks are much slower
than the computers they connect. rsync within a single computer is
generally no faster than cp -r, and sometimes slower, unless you take
the mtime optimization mentioned above.)
The VM/ZFS + snapshots case has a similar argument against it: if you’re
using snapshots to back up a Fossil repo, deconstruction isn’t helpful.
The snapshot/CoW mechanism will only clone the changed disk blocks in
the repo.
So, what problem are you solving? If it isn’t the slow-networks
problem, I suspect you’ve got an instance of the premature optimization
problem here. If you go ahead and implement it, measure before
committing the change, and if you measure a meaningful difference,
document the conditions to help guide expectations.
I want my approximately daily backups to be small.

I currently version the fossil SQLite files in borg, and I am considering versioning the artefact dumps instead. I figure these will change less than the SQLite files do and that they will also be smaller because they lack caches.

But the backups are already very small.

I suppose I could test this.

Richard Hipp
2018-06-17 22:42:52 UTC
Post by Thomas Levine
As content is added to a fossil repository, files in the corresponding
deconstructed repository never change; they are only added. Most backup
software will track changes to the deconstructed repository with great
efficiency.
I should thus take my backups of the deconstructed repositories, yes?
Fossil itself tracks changes with great efficiency. The best backup
of a fossil repository is a clone.

The self-hosting Fossil repo at https://fossil-scm.org/ is backed up
by two clones, one at https://www2.fossil-scm.org/ and the other at
https://www3.fossil-scm.org/site.cgi. Each of these clones is in a
separate data center in a different part of the world. The second
clone uses a different ISP (DigitalOcean instead of Linode). Both
clones sync to the master hourly via a cron job.
--
D. Richard Hipp
***@sqlite.org