Discussion:
[fossil-users] can fossil try harder on sync failure?
Matt Welland
2014-04-16 16:01:28 UTC
Permalink
We are seeing quite a few of these sync failures on busy repositories:

fossil commit cfgdat tests -m "Added another drc test"
Autosync: ssh://host/path/project.fossil
Round-trips: 1 Artifacts sent: 0 received: 0
Error: Database error: database is locked: {UPDATE event SET mtime=(SELECT
m1 FROM time_fudge WHERE mid=objid) WHERE objid IN (SELECT mid FROM
time_fudge);}
Round-trips: 1 Artifacts sent: 0 received: 0
Pull finished with 360 bytes sent, 280 bytes received
Autosync failed
continue in spite of sync failure (y/N)? n

Could fossil silently retry a couple times instead of giving up so easily?

If the user says y and continues then we get forks in the timeline which
are very confusing to non-experts.
--
Matt
-=-
90% of the nation's wealth is held by 2% of the people. Bummer to be in the
majority...
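
A silent-retry wrapper along the lines Matt asks for might look like the sketch below. Python is used purely for illustration; `sync_once`, the boolean return convention, and the retry counts are assumptions, not fossil's actual C API:

```python
import time

def autosync_with_retry(sync_once, attempts=3, delay=1.0):
    """Retry a failed autosync a few times before giving up and
    prompting the user.  `sync_once` is a hypothetical stand-in for
    fossil's sync step: it returns True on success and False when the
    remote database was locked."""
    for attempt in range(attempts):
        if sync_once():
            return True
        if attempt < attempts - 1:
            time.sleep(delay)  # give the competing writer time to finish
    return False  # all attempts failed; fall back to the y/N prompt
```

The point of the sleep is that "database is locked" usually means another commit is mid-flight, so an immediate retry would likely hit the same lock.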
Stephan Beal
2014-04-16 16:14:35 UTC
Permalink
On Wed, Apr 16, 2014 at 6:01 PM, Matt Welland <***@gmail.com> wrote:

> Error: Database error: database is locked: {UPDATE event SET mtime=(SELECT
> m1 FROM time_fudge WHERE mid=objid) WHERE objid IN (SELECT mid FROM
> time_fudge);}
> Round-trips: 1 Artifacts sent: 0 received: 0
> Pull finished with 360 bytes sent, 280 bytes received
> Autosync failed
> continue in spite of sync failure (y/N)? n
>
> Could fossil silently retry a couple times instead of giving up so easily?
>
> If the user says y and continues then we get forks in the timeline which
> are very confusing to non-experts.
>

Isn't the db being locked a sign that a fork is almost imminent? If someone
is writing to the repo and that lock is blocking your autosync, then a fork
has possibly already happened (or will if autosync retries, either
automatically or because the user tapped Y).


--
----- stephan beal
http://wanderinghorse.net/home/stephan/
http://gplus.to/sgbeal
"Freedom is sloppy. But since tyranny's the only guaranteed byproduct of
those who insist on a perfect world, freedom will have to do." -- Bigby Wolf
Matt Welland
2014-04-16 16:22:01 UTC
Permalink
On Wed, Apr 16, 2014 at 9:14 AM, Stephan Beal <***@googlemail.com> wrote:

>
> On Wed, Apr 16, 2014 at 6:01 PM, Matt Welland <***@gmail.com> wrote:
>
>> Error: Database error: database is locked: {UPDATE event SET
>> mtime=(SELECT m1 FROM time_fudge WHERE mid=objid) WHERE objid IN (SELECT
>> mid FROM time_fudge);}
>> Round-trips: 1 Artifacts sent: 0 received: 0
>> Pull finished with 360 bytes sent, 280 bytes received
>> Autosync failed
>> continue in spite of sync failure (y/N)? n
>>
>> Could fossil silently retry a couple times instead of giving up so easily?
>>
>> If the user says y and continues then we get forks in the timeline which
>> are very confusing to non-experts.
>>
>
> Isn't the db being locked a sign that a fork is almost imminent? If someone
> is writing to the repo and that lock is blocking your autosync, then a fork
> has possibly already happened (or will if autosync retries, either
> automatically or because the user tapped Y).
>

Yes, exactly. Presumably a commit from someone else is in progress. All
fossil has to do is wait a second, try the sync again, and then report the
"fossil will fork" message if appropriate or follow through with the commit
if the overlapping commit was on a different branch.


>
> --
> ----- stephan beal
> http://wanderinghorse.net/home/stephan/
> http://gplus.to/sgbeal
> "Freedom is sloppy. But since tyranny's the only guaranteed byproduct of
> those who insist on a perfect world, freedom will have to do." -- Bigby Wolf
>
> _______________________________________________
> fossil-users mailing list
> fossil-***@lists.fossil-scm.org
> http://lists.fossil-scm.org:8080/cgi-bin/mailman/listinfo/fossil-users
>
>


--
Matt
-=-
90% of the nation's wealth is held by 2% of the people. Bummer to be in the
majority...
Stephan Beal
2014-04-16 16:26:49 UTC
Permalink
On Wed, Apr 16, 2014 at 6:22 PM, Matt Welland <***@gmail.com> wrote:

> then try the sync again and then report the "fossil will fork" message if
> appropriate or follow through with the commit if the overlapping commit was
> on a different branch.
>

Ah, right - i didn't think that through to the next step. That does indeed
sound like it would be an improvement. This weekend is a four-day one for
us in southern Germany (for Easter), so i'll see if i can tinker with this
if someone doesn't beat me to it.

--
----- stephan beal
http://wanderinghorse.net/home/stephan/
http://gplus.to/sgbeal
"Freedom is sloppy. But since tyranny's the only guaranteed byproduct of
those who insist on a perfect world, freedom will have to do." -- Bigby Wolf
Joerg Sonnenberger
2014-04-16 16:35:35 UTC
Permalink
On Wed, Apr 16, 2014 at 06:26:49PM +0200, Stephan Beal wrote:
> Ah, right - i didn't think that through to the next step. That does indeed
> sound like it would be an improvement. This weekend is a four-day one for
> us in southern Germany (for Easter), so i'll see if i can tinker with this
> if someone doesn't beat me to it.

It would also be nice if clone didn't abort with removal of the
repository on such errors. pull/push should return an error etc.
There are a bunch of basic usability issues in this area. This is made
worse by pull not being read-only...

Joerg
Stephan Beal
2014-04-16 16:40:51 UTC
Permalink
On Wed, Apr 16, 2014 at 6:35 PM, Joerg Sonnenberger <***@britannica.bec.de
> wrote:

> It would also be nice if clone didn't abort with removal of the
> repository on such errors. pull/push should return an error etc.
> There are a bunch of basic usability issues in this area. This is made
> worse by pull not being read-only...
>

i make no promises but will look into it.

--
----- stephan beal
http://wanderinghorse.net/home/stephan/
http://gplus.to/sgbeal
"Freedom is sloppy. But since tyranny's the only guaranteed byproduct of
those who insist on a perfect world, freedom will have to do." -- Bigby Wolf
Rich Neswold
2014-04-16 20:40:23 UTC
Permalink
On Wed, Apr 16, 2014 at 11:35 AM, Joerg Sonnenberger
<***@britannica.bec.de> wrote:
> It would also be nice if clone didn't abort with removal of the
> repository on such errors. pull/push should return an error etc.
> There are a bunch of basic usability issues in this area. This is made
> worse by pull not being read-only...

It would be even nicer if it didn't throw away partial "pull" data on
a DB timeout:

I'm trying to pull the latest NetBSD changes (to pull in the
Heartbleed fixes) and my session keeps failing with the "fudge time"
error. Unfortunately, this means all the data it transferred
(sometimes over 1GB!) gets rolled back and I have to try again later.

It would be nice if fossil would break the "pull" into smaller
transactions which contain valid timeline commits so, if there's a
database timeout, the next time I try to "pull" it can continue where
it left off.

--
Rich
Stephan Beal
2014-04-17 15:03:19 UTC
Permalink
Richard Hipp
2014-04-17 15:13:38 UTC
Permalink
On Thu, Apr 17, 2014 at 11:03 AM, Stephan Beal <***@googlemail.com> wrote:

> On Wed, Apr 16, 2014 at 10:40 PM, Rich Neswold <***@gmail.com> wrote:
>
>> It would be nice if fossil would break the "pull" into smaller
>> transactions which contain valid timeline commits so, if there's a
>> database timeout, the next time I try to "pull" it can continue where
>> it left off.
>>
>
> That's a very interesting idea. That's not something for a weekend hack
> (it would require bigger changes), but that would certainly be of benefit
> in libfossil once it is far enough along to sync. There's no specific
> reason why it has to internally track the transient sync data the same way
> fossil(1) does. e.g. it might make sense to buffer it all to an extra table
> and then feed that table to the part which does the real work.
>
>
Would this really require a big change? Seems like about all you have to
do is COMMIT after each round-trip to the server, rather than waiting to
COMMIT at the very end. Or, just COMMIT instead of ROLLBACK after getting
a server timeout.


--
D. Richard Hipp
***@sqlite.org
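
Richard Hipp's COMMIT-per-round-trip idea can be sketched with SQLite directly. The snippet below uses Python's sqlite3 module purely for illustration; the `blob` table and `apply_round_trips` function are stand-ins, not fossil's actual schema or code. The key property is that a failure mid-transfer only rolls back the current round-trip, not everything already received:

```python
import sqlite3

def apply_round_trips(db_path, batches):
    """Commit after every round-trip instead of once at the very end,
    so an interrupted sync keeps everything already received."""
    con = sqlite3.connect(db_path)
    con.execute("CREATE TABLE IF NOT EXISTS blob(content TEXT)")
    con.commit()
    applied = 0
    try:
        for batch in batches:
            with con:  # one transaction per round-trip; COMMIT on success
                for artifact in batch:
                    con.execute("INSERT INTO blob(content) VALUES (?)",
                                (artifact,))
            applied += 1
    finally:
        con.close()
    return applied
```

If the batch iterator raises (say, on a network drop), every previously completed round-trip is already durable, and the next pull only has to fetch what is still missing.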
Stephan Beal
2014-04-17 15:16:04 UTC
Permalink
On Thu, Apr 17, 2014 at 5:13 PM, Richard Hipp <***@sqlite.org> wrote:

> That's a very interesting idea. That's not something for a weekend hack
>> (it would require bigger changes),
>>
>>
> Would this really require a big change?
>

i kinda made a conservative guess there ;).

Seems like about all you have to do is COMMIT after each round-trip to
> the server, rather than waiting to COMMIT at the very end. Or, just COMMIT
> instead of ROLLBACK after getting a server timeout.
>

Would that be a valid strategy? Couldn't we end up with a partial state
which we can't work from until the pull finishes to completion?

--
----- stephan beal
http://wanderinghorse.net/home/stephan/
http://gplus.to/sgbeal
"Freedom is sloppy. But since tyranny's the only guaranteed byproduct of
those who insist on a perfect world, freedom will have to do." -- Bigby Wolf
Richard Hipp
2014-04-17 15:33:59 UTC
Permalink
On Thu, Apr 17, 2014 at 11:16 AM, Stephan Beal <***@googlemail.com> wrote:

>
> Seems like about all you have to do is COMMIT after each round-trip to
>> the server, rather than waiting to COMMIT at the very end. Or, just COMMIT
>> instead of ROLLBACK after getting a server timeout.
>>
>
> Would that be a valid strategy? Couldn't we end up with a partial state
> which we can't work from until the pull finishes to completion?
>
>
The logic (in manifest.c) is designed to be able to deal with partial state
transfers. I'm not saying there are definitely no bugs, but I'm pretty
sure it does work.


--
D. Richard Hipp
***@sqlite.org
Matt Welland
2014-04-17 16:08:45 UTC
Permalink
I'm not sure if this is relevant, but I found with sqlite3 that in
situations with high contention for a database (multiple coincident
reads/writes), backing off and trying again in a half second, rather
than relying on the sqlite3 timeout, seemed to increase overall throughput
and reliability. I suspect that if, on the "server" side, there are
multiple concurrent readers and a concurrent writer, a brief release of
the read lock to allow any pending writers to complete their work might
improve overall throughput and decrease the number of sqlite3 errors.

One fossil I work with is getting over a hundred commits a day and the
sqlite3 failures result in a fork every few weeks. I'm glad to say the
database itself has proven resistant to corruption under this heavy load.
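
The short-timeout-plus-backoff pattern Matt describes can be demonstrated with Python's sqlite3 module (the function name, retry counts, and delays are illustrative; fossil itself is C and uses SQLite's busy handler). A "database is locked" error surfaces as `sqlite3.OperationalError`:

```python
import sqlite3
import time

def execute_with_backoff(con, sql, params=(), attempts=5, delay=0.05):
    """Run one statement, but on 'database is locked' give up the busy
    wait, back off briefly, and retry, instead of sitting in one long
    sqlite3 busy timeout."""
    for attempt in range(attempts):
        try:
            with con:  # commits on success, rolls back on error
                con.execute(sql, params)
            return True
        except sqlite3.OperationalError as exc:
            if "locked" not in str(exc):
                raise
            time.sleep(delay * (attempt + 1))  # back off a little longer each time
    return False
```

The connection itself would be opened with a short busy timeout (e.g. `sqlite3.connect(path, timeout=0.01)`), so the lock check returns quickly and the sleep between attempts gives pending writers a window to finish.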

On Thu, Apr 17, 2014 at 8:33 AM, Richard Hipp <***@sqlite.org> wrote:

>
> On Thu, Apr 17, 2014 at 11:16 AM, Stephan Beal <***@googlemail.com> wrote:
>
>>
>> Seems like about all you have to do is COMMIT after each round-trip to
>>> the server, rather than waiting to COMMIT at the very end. Or, just COMMIT
>>> instead of ROLLBACK after getting a server timeout.
>>>
>>
>> Would that be a valid strategy? Couldn't we end up with a partial state
>> which we can't work from until the pull finishes to completion?
>>
>>
> The logic (in manifest.c) is designed to be able to deal with partial
> state transfers. I'm not saying there are definitely no bugs, but I'm
> pretty sure it does work.
>
>
> --
> D. Richard Hipp
> ***@sqlite.org
>
>
>


--
Matt
-=-
90% of the nation's wealth is held by 2% of the people. Bummer to be in the
majority...
Tony Papadimitriou
2014-04-18 09:01:42 UTC
Permalink
This is a trivial question because I can get around the problem but I need to know what’s the correct way. For example,

I have a global ignore-glob that applies to most fossil repos. But, on occasion, I need to have an empty ignore-glob list for a specific repository.

If I unset the local setting, then the global setting kicks in. But I want an empty local ignore-glob. Of course I could put some nonsense string that will never (hopefully) match anything, but that doesn’t seem the right way to do it. What is considered a null value that is different from unset? Two quotes perhaps? Since setting the ignore-glob does not require quotes, will the “” string be interpreted as null or as a string containing two quotes?

Thanks.
Stephan Beal
2014-04-18 09:17:48 UTC
Permalink
On Fri, Apr 18, 2014 at 11:01 AM, Tony Papadimitriou <***@acm.org> wrote:

> I have a global ignore-glob that applies to most fossil repos. But, on
> occasion, I need to have an empty ignore-glob list for a specific
> repository.
>
> If I unset the local setting, then the global setting kicks in. But I
> want an empty local ignore-glob. Of course I
>

Try this:

mkdir .fossil-settings
echo '' > .fossil-settings/ignore-glob
fossil add .fossil-settings/ignore-glob
fossil ci -m 'added ignore-glob setting' .fossil-settings/ignore-glob

then you'll have an empty ignore-glob for that repo. You can configure the
glob by changing the contents of that file.


--
----- stephan beal
http://wanderinghorse.net/home/stephan/
http://gplus.to/sgbeal
"Freedom is sloppy. But since tyranny's the only guaranteed byproduct of
those who insist on a perfect world, freedom will have to do." -- Bigby Wolf
Andy Goth
2014-04-18 15:51:51 UTC
Permalink
This being Unix, there are a million ways to do things. Just for the
sake of curiosity, here are 0.0004% more of the possibilities. I only
bring this up because I know several of my coworkers don't know about
these tricks, so I imagine some others out there might not either.

On 4/18/2014 4:17 AM, Stephan Beal wrote:
> echo '' > .fossil-settings/ignore-glob

(Unless you're in DOS or the Windows shell) echo with zero arguments
just prints a newline. So you can skip the quotes.

$ echo > .fossil-settings/ignore-glob

In bash (not csh), you can skip the echo too:

$ > .fossil-settings/ignore-glob

This will trigger file creation which is a side effect of redirection,
but since there's no command, there's nothing to write.

csh explicitly forbids an empty command pipeline, so use the ":"
quasi-command which is a no-op ("does nothing, successfully"):

% : > .fossil-settings/ignore-glob

Or if the file doesn't already exist, touch will create it in the
process of updating its timestamp:

$ touch .fossil-settings/ignore-glob

--
Andy Goth | <andrew.m.goth/at/gmail/dot/com>
Ron Wilson
2014-04-18 19:55:22 UTC
Permalink
On Fri, Apr 18, 2014 at 11:51 AM, Andy Goth <***@gmail.com> wrote:

> which is a no-op ("does nothing, successfully"):
>

Being in the world of "bare silicon", this gave me a chuckle. "does
nothing, successfully" has a side effect, therefore actually does
something. Specifically, it declares success. For us, no-op is just "does
nothing", leaving the state the same as before. (Ok, technically, time has
passed and the instruction pointer register has incremented, so nothing
actually "does nothing".)
Andy Bradford
2014-05-05 04:44:21 UTC
Permalink
Thus said Richard Hipp on Thu, 17 Apr 2014 11:33:59 -0400:

> > Would that be a valid strategy? Couldn't we end up with a
> > partial state which we can't work from until the pull finishes to
> > completion?
> >
> The logic (in manifest.c) is designed to be able to deal with partial
> state transfers. I'm not saying there are definitely no bugs, but I'm
> pretty sure it does work.

I've made some changes on the per-round-trip-commit branch that
implement what you suggested, but I believe I've run into a potential
bug---though perhaps introduced by my most recent changes. It now
successfully COMMITs with each round-trip and if I interrupt the
transfer the next time I pull, it only gets those things that were
missed in the previous sync. This part works great as far as I can tell!

However, with checkin [1317331eed] it now allows errors that happen
during the http_exchange to actually be returned to the caller, so a
network failure will also result in a COMMIT (rather than a fatal error).
It seems that this latter behavior introduces (or uncovers) a serious
problem.

While the sync operation was running, if the network connection is
killed at just the right moment, the sync operation does indeed fail,
but then the last set of changes gets committed and the update continues
as follows:

$ fossil up
Autosync: http://***@remote:8080/
Round-trips: 11 Artifacts sent: 0 received: 101
server did not reply
Pull finished with 16512 bytes sent, 16267948 bytes received
Autosync failed
content missing for file.77
UPDATE file.1
...
REMOVE file.77
...
updated-to: 57487563d6208c04cbbeee3efa0280cb9166000a 2014-05-05 03:37:46 UTC
changes: 100 files modified.

Now, I may think that my repository is in a good state, but it is not,
as one file is entirely missing, and fossil update will not restore it.
If I do another update, Fossil shows that the rest of the missing
artifacts were pulled down, but it still does not restore the files
reported as having missing content.

Also, it does not even show that this file is missing (e.g. neither a
MISSING file.1 nor a DELETE file.1), and if I make a change (say to
file.1) and then checkin, the commit actually includes file.77 as a
DELETED file even though there was no visual indication that this event
would occur.

I even tried a rebuild and it still will not restore the missing file.
If I close/open the local repository it actually does finally bring back
the file:

$ fossil op ../clone.fossil
file.77
project-name: <unnamed>
repository: /tmp/clone/../clone.fossil
local-root: /tmp/clone/
config-db: /home/amb/.fossil
project-code: 43748f4be07be41523019a2c4532effbc3f5a02f
checkout: 57487563d6208c04cbbeee3efa0280cb9166000a 2014-05-05 03:37:46 UTC
parent: a2330f3775d7a939d9f0dd448bca639c1208505d 2014-05-05 03:30:47 UTC
leaf: open
tags: trunk
comment: three (user: amb)
checkins: 4

Notice that the UUID matches that which was received during the sync
operation. Any ideas what might be going on here? Sometimes I've seen as
much as 75% of the files end up with ``content missing.'' If I compile
without [1317331eed] this particular problem doesn't happen, and all
http_exchange errors are treated as fatal which result in an eventual
ROLLBACK of the current round-trip.

What have I missed? Perhaps with the per round-trip commit it is not
really necessary to also have a COMMIT if the network drops that is
implemented in [1317331eed]?

Thanks,

Andy
--
TAI64 timestamp: 4000000053671747
Andy Bradford
2014-05-05 07:19:45 UTC
Permalink
Thus said "Andy Bradford" on 04 May 2014 22:44:21 -0600:

> What have I missed? Perhaps with the per round-trip commit it is not
> really necessary to also have a COMMIT if the network drops that is
> implemented in [1317331eed]?

Ok, as it turns out, one potential resolution is to simply call
fossil_fatal when autosync fails during update here:

http://www.fossil-scm.org/index.html/artifact/f90dabeaf78a319b0b3b8791c0dded8d2f6170ec?ln=132

This does have one side-effect when autosync is enabled, but it does
perhaps further distinguish update from checkout. The side-effect is
that if we call fossil_fatal here, it seems that it will not be possible
to update to a different revision than the current checkout while the
network is down. It will be possible, however, to do a checkout (or if
autosync is disabled, updates work).

Andy
--
TAI64 timestamp: 4000000053673bb2
Joerg Sonnenberger
2014-04-17 18:55:03 UTC
Permalink
On Thu, Apr 17, 2014 at 11:13:38AM -0400, Richard Hipp wrote:
> Would this really require a big change? Seems like about all you have to
> do is COMMIT after each round-trip to the server, rather than waiting to
> COMMIT at the very end. Or, just COMMIT instead of ROLLBACK after getting
> a server timeout.

Yes, please. Even for local syncs, the overhead should be small. For
remote operations, net latency should eat everything... That reminds me,
the other problem with the network protocol is its synchronous nature.
Consider the case of having enough phantoms to issue the next round
before processing the answer of the server. Sending that request in
parallel while processing the answer would significantly increase
throughput.

Joerg
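
Joerg's pipelining suggestion, i.e. keeping the next request on the wire while the previous reply is still being processed, could be sketched like this (the `exchange` function is a hypothetical stand-in for one HTTP round-trip; fossil's real sync loop is C and strictly alternates):

```python
from concurrent.futures import ThreadPoolExecutor

def exchange(round_no):
    # Stand-in for one HTTP round-trip (hypothetical): returns artifacts.
    return [f"artifact-{round_no}-{i}" for i in range(3)]

def pipelined_pull(rounds):
    """Overlap network and processing: send request n while still
    processing the reply to request n-1."""
    received = []
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(exchange, 0)       # first request on the wire
        for n in range(1, rounds):
            reply = pending.result()             # wait for round n-1
            pending = pool.submit(exchange, n)   # next request goes out now
            received.extend(reply)               # process while it is in flight
        received.extend(pending.result())        # final reply
    return received
```

This only helps when there are already enough known phantoms to build the next request without seeing the current reply, which is exactly the case Joerg describes.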
Andy Bradford
2014-04-19 21:39:38 UTC
Permalink
Thus said Richard Hipp on Thu, 17 Apr 2014 11:13:38 -0400:

> Would this really require a big change? Seems like about all you have
> to do is COMMIT after each round-trip to the server, rather than
> waiting to COMMIT at the very end. Or, just COMMIT instead of ROLLBACK
> after getting a server timeout.

I think Fossil already does the latter; or I just read the code wrong.
At the end of client_sync() it calls db_end_transaction(0):

http://www.fossil-scm.org/index.html/artifact/dace4194506b2ea7?ln=1936

Which will cause a COMMIT to happen unless there are errors (with commit
hooks):

http://www.fossil-scm.org/index.html/artifact/17595c8a94256a4d?ln=162,185

Am I wrong?

Andy
--
TAI64 timestamp: 400000005352ed3d
Rich Neswold
2014-04-17 15:12:44 UTC
Permalink
On Wed, Apr 16, 2014 at 3:40 PM, Rich Neswold <***@gmail.com> wrote:
> It would be even nicer if it didn't throw away partial "pull" data on
> a DB timeout:
>
> I'm trying to pull the latest NetBSD changes (to pull in the
> Heartbleed fixes) and my session keeps failing with the "fudge time"
> error. Unfortunately, this means all the data it transferred
> (sometimes over 1GB!) gets rolled back and I have to try again later.
>
> It would be nice if fossil would break the "pull" into smaller
> transactions which contain valid timeline commits so, if there's a
> database timeout, the next time I try to "pull" it can continue where
> it left off.

I may be confused and I'm definitely ignorant of fossil internals.

The first few times that my "pull"s failed, there was no obvious
change to the timeline so I assumed none of the data was being saved.
After the last timeout, however, there were some new entries from the
NetBSD project. So maybe new pulls start where the previous left off
after all. (The heartbleed bug probably caused many changes to several
NetBSD branches, so there are probably many more entries to pull than
normal.)

I'll hit Mr. Sonnenberger's server a few more time throughout the day
and see if I can eventually complete a pull.

--
Rich
Joerg Sonnenberger
2014-04-17 18:46:16 UTC
Permalink
On Thu, Apr 17, 2014 at 10:12:44AM -0500, Rich Neswold wrote:
> The first few times that my "pull"s failed, there was no obvious
> change to the timeline so I assumed none of the data was being saved.
> After the last timeout, however, there were some new entries from the
> NetBSD project. So maybe new pulls start where the previous left off
> after all. (The heartbleed bug probably caused many changes to several
> NetBSD branches, so there are probably many more entries to pull than
> normal.)

Please note that while moving to a newer, faster server I also moved the
source to /cvsroot to match "real" CVS. That was responsible for quite a
few changes.

Joerg
Rich Neswold
2014-04-17 19:06:26 UTC
Permalink
On Thu, Apr 17, 2014 at 1:46 PM, Joerg Sonnenberger
<***@britannica.bec.de> wrote:
> Please note that while moving to a newer, faster server I also moved the
> source to /cvsroot to match "real" CVS. That was responsible for quite a
> few changes.

So I'm sync'ing a completely new repository on top of mine? A fossil
repository doesn't have a UUID to tell if I shouldn't pull from a
remote anymore? Like, if for some strange reason, Mr. Sonnenberger
decided to replace the NetBSD repo with the fossil repo, I'd be
pulling fossil source into my repo/ticket/wiki without a warning?

Or is my ignorance showing again? :)

--
Rich
Richard Hipp
2014-04-17 19:10:34 UTC
Permalink
On Thu, Apr 17, 2014 at 3:06 PM, Rich Neswold <***@gmail.com> wrote:

>
> So I'm sync'ing a completely new repository on top of mine?
>

Every project has a "project-id", which is supposed to be unique. Fossil
recognizes when the project-ids do not match and refuses to sync.

That said, there is nothing to prevent a clever individual, like Joerg,
from manually setting a duplicate project-id using raw SQL statements. But
on the other hand, why would he do that?


--
D. Richard Hipp
***@sqlite.org
Rich Neswold
2014-04-17 19:14:23 UTC
Permalink
On Thu, Apr 17, 2014 at 2:10 PM, Richard Hipp <***@sqlite.org> wrote:
> Every project has a "project-id", which is supposed to be unique. Fossil
> recognizes when the project-ids do not match and refuses to sync.
>
> That said, there is nothing to prevent a clever individual, like Joerg, from
> manually setting a duplicate project-id using raw SQL statements. But on
> the other hand, why would he do that?

Good! So the extra long pull times are simply due to Joerg doing some
housekeeping. Thanks for the information!

--
Rich
Andreas Kupries
2014-04-17 19:11:11 UTC
Permalink
On Thu, Apr 17, 2014 at 12:06 PM, Rich Neswold <***@gmail.com> wrote:
> On Thu, Apr 17, 2014 at 1:46 PM, Joerg Sonnenberger
> <***@britannica.bec.de> wrote:
>> Please note that while moving to a newer, faster server I also moved to
>> source to /cvsroot to match "real" CVS. That was responsible for quite a
>> few changes.
>
> So I'm sync'ing a completely new repository on top of mine? A fossil
> repository doesn't have a UUID to tell if I shouldn't pull from a
> remote anymore?

A project in fossil has a project id ... called a project-code.

Example:
fossil info ; on my Tcl checkout
=>
project-name: Tcl Source Code
project-code: 1ec9da4c469c29f4717e2a967fe6b916d9c8c06e

Fossil will not push/pull between repos of different project codes.

My understanding of Joerg's mail was that he moved files around in
the repository to match a specific directory structure, not that
he created a new project.

> Like, if for some strange reason, Mr. Sonnenberger
> decided to replace the NetBSD repo with the fossil repo, I'd be
> pulling fossil source into my repo/ticket/wiki without a warning?
>
> Or is my ignorance showing again? :)


--
Andreas Kupries
Senior Tcl Developer
Code to Cloud: Smarter, Safer, Faster(tm)
F: 778.786.1133
***@activestate.com
http://www.activestate.com
Learn about Stackato for Private PaaS: http://www.activestate.com/stackato

EuroTcl'2014, July 12-13, Munich, GER -- http://www.eurotcl.tcl3d.org/
21'st Tcl/Tk Conference: Nov 10-14, Portland, OR, USA --
http://www.tcl.tk/community/tcl2014/
Joerg Sonnenberger
2014-04-17 20:56:09 UTC
Permalink
On Thu, Apr 17, 2014 at 02:06:26PM -0500, Rich Neswold wrote:
> On Thu, Apr 17, 2014 at 1:46 PM, Joerg Sonnenberger
> <***@britannica.bec.de> wrote:
> > Please note that while moving to a newer, faster server I also moved to
> > source to /cvsroot to match "real" CVS. That was responsible for quite a
> > few changes.
>
> So I'm sync'ing a completely new repository on top of mine?

No, just that the original RCS files moved, which in some cases changes
the way certain RCS keywords are expanded during the fossil conversion.

Joerg
Matt Welland
2014-04-18 15:52:09 UTC
Permalink
Just FYI, I'm seeing this kind of message quite often. This is due to
overlapping clone operations on large fossils on relatively slow disk.

Round-trips: 1 Artifacts sent: 0 received: 0
Round-trips: 1 Artifacts sent: 0 received: 109
Round-trips: 2 Artifacts sent: 0 received: 109
Round-trips: 2 Artifacts sent: 0 received: 773
Round-trips: 3 Artifacts sent: 0 received: 773
Round-trips: 3 Artifacts sent: 0 received: 895
Round-trips: 4 Artifacts sent: 0 received: 895
Error: Database error: database is locked: {UPDATE event SET mtime=(SELECT
m1 FROM time_fudge WHERE mid=objid) WHERE objid IN (SELECT mid FROM
time_fudge);}


On Thu, Apr 17, 2014 at 1:56 PM, Joerg Sonnenberger <***@britannica.bec.de
> wrote:

> On Thu, Apr 17, 2014 at 02:06:26PM -0500, Rich Neswold wrote:
> > On Thu, Apr 17, 2014 at 1:46 PM, Joerg Sonnenberger
> > <***@britannica.bec.de> wrote:
> > > Please note that while moving to a newer, faster server I also moved to
> > > source to /cvsroot to match "real" CVS. That was responsible for quite
> a
> > > few changes.
> >
> > So I'm sync'ing a completely new repository on top of mine?
>
> No, just that the original RCS files moved, which in some cases changes
> the way certain RCS keywords are expanded during the fossil conversion.
>
> Joerg
>



--
Matt
-=-
90% of the nation's wealth is held by 2% of the people. Bummer to be in the
majority...
Matt Welland
2014-04-18 16:00:00 UTC
Permalink
Sorry for the multiple mails but I have a little more info.

I can reliably reproduce this. Just do two simultaneous clones via ssh from
a large fossil. This is on NFS. It happens very quickly so fossil is giving
up pretty fast.


On Fri, Apr 18, 2014 at 8:52 AM, Matt Welland <***@gmail.com> wrote:

> Just FYI, I'm seeing this kind of message quite often. This is due to
> overlapping clone operations on large fossils on relatively slow disk.
>
> Round-trips: 1 Artifacts sent: 0 received: 0
> Round-trips: 1 Artifacts sent: 0 received: 109
> Round-trips: 2 Artifacts sent: 0 received: 109
> Round-trips: 2 Artifacts sent: 0 received: 773
> Round-trips: 3 Artifacts sent: 0 received: 773
> Round-trips: 3 Artifacts sent: 0 received: 895
> Round-trips: 4 Artifacts sent: 0 received: 895
> Error: Database error: database is locked: {UPDATE event SET mtime=(SELECT
> m1 FROM time_fudge WHERE mid=objid) WHERE objid IN (SELECT mid FROM
> time_fudge);}
>
>
> On Thu, Apr 17, 2014 at 1:56 PM, Joerg Sonnenberger <
> ***@britannica.bec.de> wrote:
>
>> On Thu, Apr 17, 2014 at 02:06:26PM -0500, Rich Neswold wrote:
>> > On Thu, Apr 17, 2014 at 1:46 PM, Joerg Sonnenberger
>> > <***@britannica.bec.de> wrote:
>> > > Please note that while moving to a newer, faster server I also moved
>> to
>> > > source to /cvsroot to match "real" CVS. That was responsible for
>> quite a
>> > > few changes.
>> >
>> > So I'm sync'ing a completely new repository on top of mine?
>>
>> No, just that the original RCS files moved, which in some case changes
>> the way certain RCS keyword are expanded during the fossil conversion.
>>
>> Joerg
>>
>
>
>
> --
> Matt
> -=-
> 90% of the nations wealth is held by 2% of the people. Bummer to be in the
> majority...
>



--
Matt
-=-
90% of the nation's wealth is held by 2% of the people. Bummer to be in the
majority...
Stephan Beal
2014-04-18 16:21:44 UTC
Permalink
On Fri, Apr 18, 2014 at 6:00 PM, Matt Welland <***@gmail.com> wrote:

> I can reliably reproduce this. Just do two simultaneous clones via ssh
> from a large fossil. This is on NFS. It happens very quickly so fossil is
> giving up pretty fast.
>

NFS w/ db file == fundamentally bad idea.

db.c sets the default busy timeout to 5 seconds.

--
----- stephan beal
http://wanderinghorse.net/home/stephan/
http://gplus.to/sgbeal
"Freedom is sloppy. But since tyranny's the only guaranteed byproduct of
those who insist on a perfect world, freedom will have to do." -- Bigby Wolf
Matt Welland
2014-04-18 16:47:33 UTC
Permalink
On Fri, Apr 18, 2014 at 9:21 AM, Stephan Beal <***@googlemail.com> wrote:

> NFS w/ db file == fundamentally bad idea.
>
> db.c sets the default busy timeout to 5 seconds.
>

So you are recommending we abandon fossil because of this? Storing the
files on local disk is not an option for us. Also, other than being a
little slow, storing fossils on NFS has not been an issue.

I did some more testing and this is unique to using ssh and it occurs on
local disk just as fast as on NFS.

Anyone sharing fossils using ssh will run into this sooner or later. This
is using 1.28
--
Matt
-=-
90% of the nation's wealth is held by 2% of the people. Bummer to be in the
majority...
Stephan Beal
2014-04-18 17:03:00 UTC
Permalink
On Fri, Apr 18, 2014 at 6:47 PM, Matt Welland <***@gmail.com> wrote:

> So you are recommending we abandon fossil because of this? Storing the
> files on local disk is not an option for us. Also, other than being a
> little slow, storing fossils on NFS has not been an issue.
>

Search this page for "NFS":

http://sqlite.org/howtocorrupt.html


> I did some more testing and this is unique to using ssh and it occurs on
> local disk just as fast as on NFS.
>

Then you're lucky.


> Anyone sharing fossils using ssh will run into this sooner or later. This
> is using 1.28
>

SSH is not the problem - NFS is historically problematic when it comes to
file locking. I've seen apps slow down by a factor of a hundred when using
locking over NFS, and heard/read many horror stories of shared file
corruption over buggy NFS implementations.

--
----- stephan beal
http://wanderinghorse.net/home/stephan/
http://gplus.to/sgbeal
"Freedom is sloppy. But since tyranny's the only guaranteed byproduct of
those who insist on a perfect world, freedom will have to do." -- Bigby Wolf
Matt Welland
2014-04-18 18:15:37 UTC
Permalink
On Fri, Apr 18, 2014 at 10:03 AM, Stephan Beal <***@googlemail.com>wrote:

> NFS is historically problematic when it comes to file locking.


This is true. However, technology doesn't stop evolving. The locking on NFS
on the systems I use seems pretty rock solid. I push sqlite3 to extremes on
NFS and there have been challenges, but all things considered it is quite
remarkable how well it performs. As I mentioned in a previous email, the
built-in timeout mechanism in sqlite3 seems to tie up the database; using
shorter timeouts and delaying a short while before retrying really seemed
to improve throughput.
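The approach described here (a short busy timeout plus a brief, jittered delay before retrying, rather than camping inside SQLite's busy handler) can be sketched as follows; the helper name and backoff constants are illustrative, not from Fossil.

```python
import random
import sqlite3
import time

def execute_with_retry(conn, sql, params=(), tries=5, base_delay=0.05):
    # Illustrative helper (not Fossil code): on "database is locked",
    # back off briefly with jitter and retry, instead of letting SQLite's
    # busy handler hold the caller for the whole timeout window.
    for attempt in range(tries):
        try:
            return conn.execute(sql, params)
        except sqlite3.OperationalError as e:
            if "locked" not in str(e) or attempt == tries - 1:
                raise
            time.sleep(base_delay * (2 ** attempt) * (0.5 + random.random()))

conn = sqlite3.connect(":memory:", timeout=0.1)  # short busy timeout
conn.execute("CREATE TABLE event(mtime REAL)")
execute_with_retry(conn, "INSERT INTO event VALUES (?)", (1.0,))
conn.commit()
print(conn.execute("SELECT COUNT(*) FROM event").fetchone()[0])  # → 1
```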

Overall I'd say be cautious, but don't hesitate to keep sqlite3 in your
tool chest even if you have to work on NFS.

Note that the issue I'm seeing is happening with no NFS.
--
Matt
-=-
90% of the nations wealth is held by 2% of the people. Bummer to be in the
majority...
Richard Hipp
2014-04-18 17:12:07 UTC
Permalink
On Fri, Apr 18, 2014 at 12:47 PM, Matt Welland <***@gmail.com> wrote:

>
> I did some more testing and this is unique to using ssh and it occurs on
> local disk just as fast as on NFS.
>

I don't have NFS set up anywhere so I cannot test that. But I can do
multiple ssh clones from a different machine and when I do, everything
works fine. I've tried as many as three different, simultaneous clones of
the same repo, all running at the same time. I've used both an old Mac and
a BeagleBoard as the server. (The client is always my Linux desktop.) It
always works. I cannot recreate the problem.

Do you have any additional hints for me?

--
D. Richard Hipp
***@sqlite.org
Matt Welland
2014-04-18 17:32:26 UTC
Permalink
NFS is not needed to reproduce this. Simultaneous parallel cloning via ssh
from one file is giving me this every single time.

Could it be an OS dependency? I'm on SuSE Linux (SLES 11).

I downloaded the binary from fossil-scm.org and tested again and get
exactly the same issue.

I do happen to be cloning from and to the same host.


On Fri, Apr 18, 2014 at 10:12 AM, Richard Hipp <***@sqlite.org> wrote:

>
>
>
> On Fri, Apr 18, 2014 at 12:47 PM, Matt Welland <***@gmail.com>wrote:
>
>>
>> I did some more testing and this is unique to using ssh and it occurs on
>> local disk just as fast as on NFS.
>>
>
> I don't have NFS set up anywhere so I cannot test that. But I can do
> multiple ssh clones from a different machine and when I do, everything
> works fine. I've tried as many as three different, simultaneous clones of the
> same repo, all running at the same time. I've used both an old mac and a
> beagleboard as the server. (Client is always my linux desktop.) It
> always works. I cannot recreate the problem.
>
> Do you have any additional hints for me?
>
> --
> D. Richard Hipp
> ***@sqlite.org
>
>


--
Matt
-=-
90% of the nations wealth is held by 2% of the people. Bummer to be in the
majority...
Richard Hipp
2014-04-18 17:39:49 UTC
Permalink
On Fri, Apr 18, 2014 at 1:32 PM, Matt Welland <***@gmail.com> wrote:

> NFS is not needed to reproduce this. Simultaneous parallel cloning via ssh
> from one file is giving me this every single time.
>
> Could it be an OS dependency? I'm on SuSe Linux (SLES11).
>
> I downloaded the binary from fossil-scm.org and tested again and get
> exactly the same issue.
>
> I do happen to be cloning from and to the same host.
>
>
Tried again here, running three simultaneous clones of the same repo, but
this time ssh to the same host. Still no errors.

--
D. Richard Hipp
***@sqlite.org
Matt Welland
2014-04-18 17:41:39 UTC
Permalink
How big is the repo? The one I'm cloning is 420 MB. Perhaps that is a
factor?


On Fri, Apr 18, 2014 at 10:39 AM, Richard Hipp <***@sqlite.org> wrote:

>
>
>
> On Fri, Apr 18, 2014 at 1:32 PM, Matt Welland <***@gmail.com> wrote:
>
>> NFS is not needed to reproduce this. Simultaneous parallel cloning via
>> ssh from one file is giving me this every single time.
>>
>> Could it be an OS dependency? I'm on SuSe Linux (SLES11).
>>
>> I downloaded the binary from fossil-scm.org and tested again and get
>> exactly the same issue.
>>
>> I do happen to be cloning from and to the same host.
>>
>>
> Tried again here, running three simultaneous clones of the same repo, but
> this time ssh to the same host. Still no errors.
>
> --
> D. Richard Hipp
> ***@sqlite.org
>
>


--
Matt
-=-
90% of the nations wealth is held by 2% of the people. Bummer to be in the
majority...
Richard Hipp
2014-04-18 17:50:08 UTC
Permalink
On Fri, Apr 18, 2014 at 1:41 PM, Matt Welland <***@gmail.com> wrote:

> How big is the repo? The one I'm cloning is 420 MB. Perhaps that is a
> factor?
>
>
I was using SQLite, 55MB. The biggest repo I have at hand is
System.Data.SQLite at 264MB. I just did three simultaneous ssh clones of
it without any issues.


--
D. Richard Hipp
***@sqlite.org
Martin Gagnon
2014-04-18 18:42:45 UTC
Permalink
On Fri, Apr 18, 2014 at 01:50:08PM -0400, Richard Hipp wrote:
> On Fri, Apr 18, 2014 at 1:41 PM, Matt Welland <***@gmail.com> wrote:
>
> How big is the repo? The one I'm cloning is 420 MB. Perhaps that is a
> factor?
>
> I was using SQLite, 55MB. The biggest repo I have at hand is
> System.Data.SQLite at 264MB. I just did three simultaneous ssh clones of
> it without any issues.
>

Maybe delete mode vs. WAL mode?

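For what it's worth, the difference being asked about is observable directly: in WAL mode a reader does not block on an open write transaction, whereas in the default rollback-journal ("delete") mode it can hit SQLITE_BUSY. A small sketch in Python's stdlib sqlite3 (not how Fossil itself configures repositories):

```python
import os
import sqlite3
import tempfile

# Illustrative file name and schema; the point is the journal mode.
path = os.path.join(tempfile.mkdtemp(), "repo.sqlite")

db = sqlite3.connect(path, isolation_level=None)
mode = db.execute("PRAGMA journal_mode=WAL").fetchone()[0]
db.execute("CREATE TABLE blob(content)")
db.execute("BEGIN IMMEDIATE")            # open write transaction
db.execute("INSERT INTO blob VALUES ('artifact')")

# In WAL mode this read succeeds while the write is still open and sees
# the last committed snapshot; in delete mode it could fail with
# "database is locked" instead.
reader = sqlite3.connect(path, timeout=0.0)
rows = reader.execute("SELECT COUNT(*) FROM blob").fetchone()[0]

db.execute("COMMIT")
print(mode, rows)                        # → wal 0
```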
--
Martin G.
Andy Bradford
2014-04-19 00:39:19 UTC
Permalink
Thus said Matt Welland on Fri, 18 Apr 2014 10:32:26 -0700:

> Could it be an OS dependency? I'm on SuSe Linux (SLES11).

No, I can reproduce it on OpenBSD. I'm looking at it more closely to see
what might be causing it. Basically, you need a long commit in progress
and then try to sync. I can also reproduce it if I am committing via
HTTP and trying to pull via SSH.

Andy
--
TAI64 timestamp: 400000005351c5da
Andy Bradford
2014-04-19 00:56:09 UTC
Permalink
Thus said Matt Welland on Fri, 18 Apr 2014 10:41:39 -0700:

> How big is the repo? The one I'm cloning is 420 MB. Perhaps that is a
> factor?

No, the problem appears to be the difference between using test-http and
http as the remote command. The default behavior for the Fossil client
is to send a remote ``fossil test-http /path'' to the server. If instead
I force the Fossil client to talk to:

fossil http /path/to/fossil

Everything works as expected (e.g. no locking issues).

Andy
--
TAI64 timestamp: 400000005351c9cc
Andy Bradford
2014-04-19 01:39:31 UTC
Permalink
Thus said "Andy Bradford" on 18 Apr 2014 18:56:09 -0600:

> Everything works as expected (e.g. no locking issues).

I spoke too soon. If I give the Fossil user permissions (e.g. don't
clone as nobody) then the issue arises again.

It doesn't appear to be isolated to just SSH. I can cause locking
errors with HTTP too, including some interesting behavior like files
unexpectedly being removed from the cloned repository.

For example, after modifying a test repo and then starting a sync via
HTTP, I started a sync from another HTTP client which caused this:

$ f sync
Sync with http://***@fossil:8080/
Round-trips: 24 Artifacts sent: 24 received: 0
Error: Database error: database is locked: {COMMIT}
Round-trips: 24 Artifacts sent: 24 received: 0
Sync finished with 3840799 bytes sent, 65248 bytes received

After the second sync completed (having received a partial set of
artifacts), I did an update and suddenly files began being removed
(presumably due to the partial commit above).

After letting the original commit run to completion another time, I did
an update in the second but the files have not come back:

On the first committer:
$ f stat | grep checkout
checkout: 4393b959511a32ec949f32900049d1195226f8d4 2014-04-19 01:05:46 UTC
$ ls | wc -l
101

On the second:
$ f stat | grep checkout
checkout: 4393b959511a32ec949f32900049d1195226f8d4 2014-04-19 01:05:46 UTC
$ ls | wc -l
31

Only after closing and opening the local repository was I able to make
the files come back, but I wonder what would have happened had I tried
to commit files during the time that it was in this state.

It's possible that using SSH as a client makes it easier to cause
problems, but it does appear to be possible to run into issues with HTTP
as well. Once I even got told that a fork had happened (both clients had
autosync on). But it seems to be due in part to large commits and slow
disk.

I'll keep looking at it as I get time.

Andy
--
TAI64 timestamp: 400000005351d3f6
Andy Bradford
2014-04-19 00:28:28 UTC
Permalink
Thus said Matt Welland on Fri, 18 Apr 2014 08:52:09 -0700:

> Round-trips: 1 Artifacts sent: 0 received: 0 Round-trips: 1
> Artifacts sent: 0 received: 109 Round-trips: 2
> Artifacts sent: 0 received: 109 Round-trips: 2
> Artifacts sent: 0 received: 773 Round-trips: 3
> Artifacts sent: 0 received: 773 Round-trips: 3
> Artifacts sent: 0 received: 895 Round-trips: 4
> Artifacts sent: 0 received: 895 Error: Database error: database is locked:
> {UPDATE event SET mtime=(SELECT m1 FROM time_fudge WHERE mid=objid) WHERE
> objid IN (SELECT mid FROM time_fudge);} <#key_3_2>

What version of Fossil produced this output?

What version of fossil was on the remote side?

Thanks,

Andy
--
TAI64 timestamp: 400000005351c34f
Jan Danielsson
2014-05-02 15:39:20 UTC
Permalink
On 18/04/14 17:52, Matt Welland wrote:
> Just FYI, I'm seeing this kind of message quite often. This is due to
> overlapping clone operations on large fossils on relatively slow disk.
[---]
> Artifacts sent: 0 received: 895 Error: Database error: database is locked:
> {UPDATE event SET mtime=(SELECT m1 FROM time_fudge WHERE mid=objid) WHERE
> objid IN (SELECT mid FROM time_fudge);} <#key_3_2>

That error is the reason I had to switch over to the git port of the
NetBSD fossil repository. First I thought it was a fossil-on-BSD
problem, but I got it on Linux as well.

As you say, it is highly reproducible, but it requires quite a bit of
time to trigger sometimes.

I'm not running on NFS, but I get the exact same behavior.

--
Kind Regards,
Jan
Andy Bradford
2014-05-02 17:57:02 UTC
Permalink
Thus said Jan Danielsson on Fri, 02 May 2014 17:39:20 +0200:

> > Artifacts sent: 0 received: 895 Error: Database error: database is locked:
> > {UPDATE event SET mtime=(SELECT m1 FROM time_fudge WHERE mid=objid) WHERE
> > objid IN (SELECT mid FROM time_fudge);} <#key_3_2>
> [...]
> As you say, it is highly reproducible, but it requires quite a bit
> of time to trigger sometimes.

This particular error hasn't come up since this checkin (which didn't
make it into Fossil 1.28, so it's only in trunk or on branch-1.28):

http://www.fossil-scm.org/index.html/info/b4dffdac5e706980d911a0e672526ad461ec0640

I wonder if you could try again with a build from trunk?

Thanks,

Andy
--
TAI64 timestamp: 400000005363dc90
Jan Danielsson
2014-05-20 01:24:07 UTC
Permalink
On 02/05/14 19:57, Andy Bradford wrote:
>>> Artifacts sent: 0 received: 895 Error: Database error: database is locked:
>>> {UPDATE event SET mtime=(SELECT m1 FROM time_fudge WHERE mid=objid) WHERE
>>> objid IN (SELECT mid FROM time_fudge);} <#key_3_2>
>> [...]
>> As you say, it is highly reproducible, but it requires quite a bit
>> of time to trigger sometimes.
>
> This particular error hasn't come up since this checkin (which didn't
> make it into Fossil 1.28, so it's only in trunk or on branch-1.28):
>
> http://www.fossil-scm.org/index.html/info/b4dffdac5e706980d911a0e672526ad461ec0640
>
> I wonder if you could try again with a build from trunk?

I've been using later versions of fossil for both the NetBSD and
pkgsrc repositories since this discussion took place, and I had one
"{COMMIT}" error, but other than that it has worked great. I'm so happy
to be able to nuke the git repositories I have been using as a work-around.

I'm very, very happy about this fix -- it changes a lot for me
(exclusively to the better).

--
Kind Regards,
Jan
Rich Neswold
2014-05-02 15:10:46 UTC
Permalink
On Thu, Apr 17, 2014 at 10:12 AM, Rich Neswold <***@gmail.com> wrote:
> On Wed, Apr 16, 2014 at 3:40 PM, Rich Neswold <***@gmail.com> wrote:
>> It would be nice if fossil would break the "pull" into smaller
>> transactions which contain valid timeline commits so, if there's a
>> database timeout, the next time I try to "pull" it can continue where
>> it left off.
>
> The first few times that my "pull"s failed, there was no obvious
> change to the timeline so I assumed none of the data was being saved.
> After the last timeout, however, there were some new entries from the
> NetBSD project. So maybe new pulls start were the previous left off
> after all.

Although syncs/pulls appear to make progress even when a failure
occurs, I'd still like to see "fossil pull" break the request
into multiple smaller transactions. The "one transaction for the
entire request" approach doesn't scale at all.

My main NetBSD fossil repo is 11G. I want to keep a copy on another
machine so my local changes are stored in more than one location. Last
night, my backup was 2G (because I hadn't sync-ed in a while) so I
started a "fossil pull" and then went home. This morning, the pull was
aborted by a "signal 2" and my local directory showed the following:

[~/repo]$ ls -l
total 351919880
-rw-r--r-- 1 neswold 2335277056 May 1 16:03 netbsd.fossil
-rw-r--r-- 1 neswold 9273344 May 2 09:37 netbsd.fossil-shm
-rw-r--r-- 1 neswold 177838427024 May 2 06:32 netbsd.fossil-wal

That's right, my write-ahead file is 177 GB (16x the expected size of
the final repository!).

I'm doing a "fossil sqlite" and it's slowly trying to apply the
transaction, but I really don't have any hope it will succeed -- I'm
just curious what it will do. More than likely, I'll delete this repo
and clone it again.

There have to be points during a sync/pull that the target repository
is in a stable, consistent state. The transaction can be committed and
a new one started. Or maybe add a command-line option to pull/sync
which lets the user select how many artifacts to pull over and then
the user can run the command multiple times until nothing is left to
transfer.
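The checkpointed-pull idea above can be sketched as follows; the schema and function are hypothetical stand-ins for Fossil's actual sync loop, but they show how committing every N artifacts bounds each transaction (and the WAL) while letting an interrupted pull keep its progress.

```python
import sqlite3

def pull_with_checkpoints(conn, artifacts, checkpoint_every=100):
    # Hypothetical stand-in for Fossil's sync loop: commit every N
    # artifacts so an interruption loses at most one batch instead of
    # rolling back the entire pull.
    conn.execute("BEGIN")
    for i, content in enumerate(artifacts, 1):
        conn.execute("INSERT INTO blob(content) VALUES (?)", (content,))
        if i % checkpoint_every == 0:
            conn.execute("COMMIT")   # durable checkpoint
            conn.execute("BEGIN")    # start the next batch
    conn.execute("COMMIT")

conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE blob(content)")
pull_with_checkpoints(conn, [f"artifact-{i}" for i in range(250)])
print(conn.execute("SELECT COUNT(*) FROM blob").fetchone()[0])  # → 250
```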

--
Rich
Rich Neswold
2014-05-02 15:15:34 UTC
Permalink
On Fri, May 2, 2014 at 10:10 AM, Rich Neswold <***@gmail.com> wrote:
> That's right, my write-ahead file is 177 GB (16x the expected size of
> the final repository!)
>
> I'm doing a "fossil sqlite" and it's slowly trying to apply the
> transaction, but I really don't have any hope it will succeed -- I'm
> just curious what it will do.

It looks like "fossil sqlite" was simply verifying the integrity of
the database. It ended up deleting the 177 GB of work it did
overnight.

--
Rich
Andy Bradford
2014-05-07 05:59:38 UTC
Permalink
Thus said Rich Neswold on Wed, 16 Apr 2014 15:40:23 -0500:

> It would be nice if fossil would break the "pull" into smaller
> transactions which contain valid timeline commits so, if there's a
> database timeout, the next time I try to "pull" it can continue where
> it left off.

I've been working a bit on implementing a per round-trip commit as
suggested by Richard and it does commit in smaller transactions, though
not all of them will be valid timeline commits:

http://www.fossil-scm.org/index.html/info/d02f144d708e89299ae28a2b99eeb829a6799c5f

Basically it does a commit each round trip and defers execution of hooks
until the last round-trip happens. I'm not convinced this is the correct
behavior---specifically, should it execute them even if there is an
error during sync?

Also, there is one potential surprise factor involved after a partial
sync occurs but it's hard to predict how often it will actually happen.
It's possible that there are phantoms in the repository that will
manifest themselves if the resulting change is interrupted at just the
right time, and autosync is turned off, and one attempts to update to a
version that has those phantoms. It seems that this particular behavior
has been in Fossil since 2011, but perhaps difficult to expose because
Fossil would rollback the entire sync if there were any failures. It
won't result in data loss, but may make things confusing if a commit is
made while in this state as the files will show up as Deleted in the
checkin, even though there was no indication that they would be deleted
(except the warnings/REMOVE that happened during the update).

When phantoms are encountered while running ``fossil update,'' you
will see a warning about ``content missing'' and Fossil will then remove
the file from the current checkout and report it as REMOVEd.
fossil status, however, will not know about that and will report that the
current checkout is up-to-date. Here's the relevant code:

http://www.fossil-scm.org/index.html/artifact/64d8e49634442edde612084f8b60f4185630d8be?ln=108,111

I'm not sure what the correct behavior should be. If we remove the
continue on line 110, fossil will not remove the files, but will attempt
to merge an empty file with whatever exists (or replace the current file
with a 0 byte file if it is current). Neither seems to be the optimal
way to handle this. Another option would be to have ``fossil update''
abort when it sees phantoms, thus making it even more difficult to
accidentally check in file deletes.

Also, I'm not sure how much it would take to only accept ``valid
timeline commits'' as you suggested.

Feedback would be appreciated.

Thanks,

Andy
--
TAI64 timestamp: 400000005369cbeb
Richard Hipp
2014-05-07 11:06:31 UTC
Permalink
On Wed, May 7, 2014 at 1:59 AM, Andy Bradford <amb-***@bradfords.org>wrote:

>
> Basically it does a commit each round trip and defers execution of hooks
> until the last round-trip happens.
>

That is scary.

The purpose of the hooks is to verify that all of the content in the
repository is still accessible. Before each commit, the hooks run to
verify that all of the artifacts can still be un-deltaed and uncompressed
and they survive those operations intact. Suppose some future change to
Fossil introduces a bug that causes the delta or compress operations to
lose information so that historical artifacts are no longer recoverable.
The hooks are intended to detect that problem *before* it can permanently
damage the repository.

Doing a commit without running the hooks disables that very important
safety mechanism.

--
D. Richard Hipp
***@sqlite.org
Andy Bradford
2014-05-07 14:32:47 UTC
Permalink
Thus said Richard Hipp on Wed, 07 May 2014 07:06:31 -0400:

> The purpose of the hooks is to verify that all of the content in the
> repository is still accessible. Before each commit, the hooks run to
> verify that all of the artifacts can still be un-deltaed and
> uncompressed and they survive those operations intact.

Hmm, that does indeed sound problematic to be disabled and it certainly
should not be done if it can compromise the integrity of the
artifacts. Perhaps I misunderstood the purpose of this block of code in
manifest_crosslink_end():

http://www.fossil-scm.org/index.html/artifact/05e0e4bec391ca300d1a6fc30fc19c0a12454be1?ln=1506,1518

It's a simple change to restore these lines to call
manifest_crosslink_end(MC_PERMIT_HOOKS) for each round-trip instead of
just once at the end:

http://www.fossil-scm.org/index.html/artifact/ab14c3fbb94acf319a0bf4e60ba8c8f8b98975e1?ln=1917,1922

Thanks,

Andy
--
TAI64 timestamp: 40000000536a4432
Richard Hipp
2014-05-07 15:02:55 UTC
Permalink
On Wed, May 7, 2014 at 10:32 AM, Andy Bradford <amb-***@bradfords.org>wrote:

> Thus said Richard Hipp on Wed, 07 May 2014 07:06:31 -0400:
>
> > The purpose of the hooks is to verify that all of the content in the
> > repository is still accessible. Before each commit, the hooks run to
> > verify that all of the artifacts can still be un-deltaed and
> > uncompressed and they survive those operations intact.
>
> Hmm, that does indeed sound problematic to be disabled and it certainly
> should not be done if it can compromise the integrity of the
> artifacts. Perhaps I misunderstood the purpose of this block of code in
> manifest_crosslink_end():
>
>
> http://www.fossil-scm.org/index.html/artifact/05e0e4bec391ca300d1a6fc30fc19c0a12454be1?ln=1506,1518
>

We might be talking about different hooks. I'm concerned about the
verify_before_commit hook implemented here:

http://www.fossil-scm.org/fossil/artifact/615e25ed6?ln=94-104

--
D. Richard Hipp
***@sqlite.org
Stephan Beal
2014-05-07 15:11:33 UTC
Permalink
On Wed, May 7, 2014 at 5:02 PM, Richard Hipp <***@sqlite.org> wrote:

>
> On Wed, May 7, 2014 at 10:32 AM, Andy Bradford <amb-***@bradfords.org>wrote:
>
>>
>> http://www.fossil-scm.org/index.html/artifact/05e0e4bec391ca300d1a6fc30fc19c0a12454be1?ln=1506,1518
>>
>
> We might be talking about different hooks. I'm concerned about the
> verify_before_commit hook implemented here:
>
> http://www.fossil-scm.org/fossil/artifact/615e25ed6?ln=94-104
>

My understanding (from having elided it in libfossil) is that
MC_PERMIT_HOOKS refers to commit hooks (TH1/TCL code).

Sidebar: the verify-before-commit hook was one of the first features
libfossil got because it's such a godsend to not have to worry so much
before writing to the db.

--
----- stephan beal
http://wanderinghorse.net/home/stephan/
http://gplus.to/sgbeal
"Freedom is sloppy. But since tyranny's the only guaranteed byproduct of
those who insist on a perfect world, freedom will have to do." -- Bigby Wolf
Andy Bradford
2014-05-07 15:29:12 UTC
Permalink
Thus said Richard Hipp on Wed, 07 May 2014 11:02:55 -0400:

> We might be talking about different hooks. I'm concerned about the
> verify_before_commit hook implemented here:
>
> http://www.fossil-scm.org/fossil/artifact/615e25ed6?ln=94-104

Yes, it does appear that we were talking about different hooks. I did
not alter anything with content_put_ex or verify_before_commit. I must
admit, modifying this part of the code has been scary, so any code
review is welcome. :-)

Rich Neswold
2014-05-08 20:18:43 UTC
Permalink
On Wed, May 7, 2014 at 12:59 AM, Andy Bradford
<amb-sendok-***@bradfords.org> wrote:
> Thus said Rich Neswold on Wed, 16 Apr 2014 15:40:23 -0500:
>
>> It would be nice if fossil would break the "pull" into smaller
>> transactions which contain valid timeline commits so, if there's a
>> database timeout, the next time I try to "pull" it can continue where
>> it left off.
>
> I've been working a bit on implementing a per round-trip commit as
> suggested by Richard and it does commit in smaller transactions, though
> not all of them will be valid timeline commits:
>
> http://www.fossil-scm.org/index.html/info/d02f144d708e89299ae28a2b99eeb829a6799c5f
>
> Basically it does a commit each round trip and defers execution of hooks
> until the last round-trip happens. I'm not convinced if this is correct
> behavior---specifically, should it execute them even if there is an
> error during sync?

I was thinking of attacking the problem a little higher up (since I'm
way too nervous touching the low-level stuff):

The idea is to add a command line option to indicate that you want a
partial sync (e.g. --pull-limit 10000). This option would only be
honored for pulls -- if pushes are occurring, ignore the option
because it complicates finding an interruption point for both pulls
and pushes.

Process cards as they come in and decrement the counter when a card
that represents a "checkpoint" has been completed. When the counter is
zero, we break the outer loop (set 'go' to 0):

https://www.fossil-scm.org/index.html/artifact/dace4194506b2ea732ca27f68300b156816e403a?ln=1482

When the loop is exited, all the database closing hooks are done and
we simply haven't transferred all the history. Issuing another pull
will transfer N more artifacts. Eventually, the full history will be
transferred.

Of course, if the command line option isn't given, then process cards
until the sender says they're done sending.

--
Rich
Andy Bradford
2014-05-09 01:33:25 UTC
Permalink
Thus said Rich Neswold on Thu, 08 May 2014 15:18:43 -0500:

> I was thinking of attacking the problem a little higher up (since I'm
> way too nervous touching the low-level stuff):

So did I initially, though my first thought was simply to have autosync
try multiple times when failing (in the autosync-tries branch). Then
Richard mentioned that it could be done by simply committing more
frequently, and so I focused on that approach. I think it actually works
quite well and I even added some protections to handle corner cases
where a user might receive a partial sync, but then attempt to
update/merge to a checkin that is not complete:

http://www.fossil-scm.org/index.html/info/f2adddfe601d33c98974f9c645e8aceb9622aa86

One is free to force the update/merge if one desires with the
--force-missing option.

It would be interesting to get some actual testing with the repository
that was mentioned would rollback after a 1GB sync to see how it does.
Make sure it's a spare clone repository just in case, though I haven't
seen any problems in my testing.

Thoughts?

Andy
--
TAI64 timestamp: 40000000536c3088
Doug Franklin
2014-05-09 03:00:03 UTC
Permalink
On 2014-05-08 16:18, Rich Neswold wrote:
> On Wed, May 7, 2014 at 12:59 AM, Andy Bradford
> <amb-sendok-***@bradfords.org> wrote:
>> Thus said Rich Neswold on Wed, 16 Apr 2014 15:40:23 -0500:
>>
>>> It would be nice if fossil would break the "pull" into smaller
>>> transactions which contain valid timeline commits so, if there's a
>>> database timeout, the next time I try to "pull" it can continue where
>>> it left off.

Does SQLite support nested transactions? If so, that would seem to be
worth considering.

--
Thanks,
DougF (KG4LMZ)
Andy Bradford
2014-05-09 03:08:49 UTC
Permalink
Thus said Doug Franklin on Thu, 08 May 2014 23:00:03 -0400:

> Does SQLite support nested transactions? If so, that would seem to be
> worth considering.

It does appear to support them:

https://www.sqlite.org/lang_transaction.html

Andy
--
TAI64 timestamp: 40000000536c46e4
Rich Neswold
2014-05-09 05:30:22 UTC
Permalink
On Thu, May 8, 2014 at 10:08 PM, Andy Bradford <amb-***@bradfords.org> wrote:
> Thus said Doug Franklin on Thu, 08 May 2014 23:00:03 -0400:
>
>> Does SQLite support nested transactions? If so, that would seem to be
>> worth considering.
>
> It does appear to support them:
>
> https://www.sqlite.org/lang_transaction.html

I don't think nested transactions would help the problem I'm hoping
will get solved.

--
Rich
Stephan Beal
2014-05-09 09:36:38 UTC
Permalink
On Fri, May 9, 2014 at 5:08 AM, Andy Bradford <amb-***@bradfords.org>wrote:

> Thus said Doug Franklin on Thu, 08 May 2014 23:00:03 -0400:
>
> > Does SQLite support nested transactions? If so, that would seem to be
> > worth considering.
>
> It does appear to support them:
>
> https://www.sqlite.org/lang_transaction.html


It doesn't directly support them, but fossil/libfossil add a level of
abstraction which simulates them. The notable requirement is that one use
the [lib]fossil C APIs to begin/end transactions, as opposed to using
BEGIN/END directly. Fossil has an assertion in place to catch cases where
COMMIT is called directly from SQL code while a C-initiated transaction is
open.
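The simulated nesting described here is typically built on SAVEPOINTs plus a begin/end depth counter in the C wrapper. A minimal sketch of the idea in Python's stdlib sqlite3 (my illustration, not the fossil/libfossil API):

```python
import sqlite3

class Tx:
    # Depth-counting wrapper: the outermost begin/end maps to BEGIN/COMMIT,
    # inner levels map to SAVEPOINT/RELEASE, which SQLite supports natively.
    def __init__(self, conn):
        self.conn, self.depth = conn, 0

    def begin(self):
        if self.depth == 0:
            self.conn.execute("BEGIN")
        else:
            self.conn.execute(f"SAVEPOINT sp{self.depth}")
        self.depth += 1

    def end(self, commit=True):
        self.depth -= 1
        if self.depth == 0:
            self.conn.execute("COMMIT" if commit else "ROLLBACK")
        elif commit:
            self.conn.execute(f"RELEASE sp{self.depth}")
        else:
            self.conn.execute(f"ROLLBACK TO sp{self.depth}")
            self.conn.execute(f"RELEASE sp{self.depth}")

conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE t(x)")
tx = Tx(conn)
tx.begin()                     # outer transaction
conn.execute("INSERT INTO t VALUES (1)")
tx.begin()                     # "nested" transaction -> savepoint
conn.execute("INSERT INTO t VALUES (2)")
tx.end(commit=False)           # roll back only the inner work
tx.end(commit=True)            # outer commit keeps the first row
print(conn.execute("SELECT COUNT(*) FROM t").fetchone()[0])  # → 1
```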

--
----- stephan beal
http://wanderinghorse.net/home/stephan/
http://gplus.to/sgbeal
"Freedom is sloppy. But since tyranny's the only guaranteed byproduct of
those who insist on a perfect world, freedom will have to do." -- Bigby Wolf
Andy Bradford
2014-04-20 05:18:49 UTC
Permalink
Thus said Matt Welland on Wed, 16 Apr 2014 09:01:28 -0700:

> fossil commit cfgdat tests -m "Added another drc test"
> Autosync: ssh://host/path/project.fossil
> Round-trips: 1 Artifacts sent: 0 received: 0
> Error: Database error: database is locked: {UPDATE event SET mtime=(SELECT
> m1 FROM time_fudge WHERE mid=objid) WHERE objid IN (SELECT mid FROM
> time_fudge);}
> Round-trips: 1 Artifacts sent: 0 received: 0
> Pull finished with 360 bytes sent, 280 bytes received
> Autosync failed
> continue in spite of sync failure (y/N)? n

I've done a fair bit of profiling with this, and this seems to happen
primarily with the test-http command (the default sync method for SSH
clients). I don't know what the history is behind the test-http command,
but my guess is that it was really not intended to be a heavily used
sync method for shared repositories. I'm not really sure why this
particular database locking error happens so frequently with test-http,
but not at all with http. This is happening in manifest_crosslink_end()
when it's trying to fudge times.

If I force my SSH command to use http instead of test-http, this error
disappears entirely and I only ever see an occasional locking error due
to multiple committers when I try to commit large change sets (like a
10,000 line, 840K change set); same behavior as standard HTTP/HTTPS
transports in my environment (slow disk/cpu/network).

Are all your users using SSH to access shared repositories? Or do you
just have a few users using SSH?

Perhaps it would be better to switch to using SSH keys and forced
commands to cause fossil to use http instead of test-http? This does
require a bit more setup. For example, each .fossil has to have the
remote_user_ok configuration enabled so you can setup the REMOTE_USER
environment variable for them. This is because there currently is no
mechanism to use Fossil authentication while using SSH as the transport
and fossil http requires it if you want to commit.

I suppose an alternative configuration would be to give nobody/anonymous
users the ability to write, which if SSH authentication is the only
allowed sync method it may be acceptable. The only drawback that I see
there is that the rcvfrom information would show up as having come from
nobody, e.g.,

User: amb
Received From: nobody @ 192.168.1.9 on 2014-04-20 04:33:35

I think one thing I've learned from all this is that forks and database
locking errors occur much more frequently on slow hardware and large
change sets. Also, I seem to be able to cause forking that goes
undetected (without a warning). All of this probably explains why it is
difficult to reproduce except on older hardware.

As for making sync try harder, we could certainly just loop X number of
times if we think it is worth it (not sure how feasible it will be to
make it silent, or if there will be other side effects). Here I have it
loop for 10 times before bailing. As you can see it failed once, but
then succeeded the second time and received updates that indicate it is
out of sync:

$ fossil ci -m synctest2
Autosync: ssh://fossil/tmp/test.fossil
Round-trips: 1 Artifacts sent: 0 received: 0
Error: Database error: database is locked: {UPDATE event SET mtime=(SELECT m1 FROM time_fudge WHERE mid=objid) WHERE objid IN (SELECT mid FROM time_fudge);}
Round-trips: 1 Artifacts sent: 0 received: 0
Pull finished with 314 bytes sent, 280 bytes received
Autosync failed
Autosync: ssh://fossil/tmp/test.fossil
Round-trips: 3 Artifacts sent: 0 received: 102
Pull finished with 3451 bytes sent, 170661 bytes received
would fork. "update" first or use --allow-fork.

There was also a sync failure on the first committer after it
successfully committed the artifacts:

$ fossil ci -m synctest1
Autosync: ssh://fossil/tmp/test.fossil
Round-trips: 1 Artifacts sent: 0 received: 0
Pull finished with 316 bytes sent, 229 bytes received
New_Version: 04e7debfa4f29ee3c1635007e3f380f0a0630366
Autosync: ssh://fossil/tmp/test.fossil
Round-trips: 3 Artifacts sent: 101 received: 0
Error: Database error: database is locked: {UPDATE event SET mtime=(SELECT m1 FROM time_fudge WHERE mid=objid) WHERE objid IN (SELECT mid FROM time_fudge);}
Round-trips: 3 Artifacts sent: 101 received: 0
Sync finished with 179617 bytes sent, 3234 bytes received
Autosync failed
Autosync: ssh://fossil/tmp/test.fossil
Round-trips: 1 Artifacts sent: 0 received: 1
Sync finished with 4916 bytes sent, 2724 bytes received
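
The "loop X times" behavior above can also be approximated on the client
side with a plain shell wrapper; this is only a sketch of the idea (the
retry helper below is hypothetical, not part of Fossil):

```shell
#!/bin/sh
# Hypothetical helper: run a command, retrying up to $1 times with a
# one-second pause between attempts. Sketch only; Fossil's own retry
# logic would live in the binary, not in a wrapper like this.
retry() {
  max=$1; shift
  n=1
  while ! "$@"; do
    if [ "$n" -ge "$max" ]; then
      echo "giving up after $n attempts" >&2
      return 1
    fi
    n=$((n + 1))
    sleep 1
  done
}

# Example (assumes fossil is on PATH):
# retry 10 fossil ci -m synctest2
```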

Thoughts?

Andy
--
TAI64 timestamp: 40000000535358db
Andy Bradford
2014-04-21 07:38:32 UTC
Permalink
Thus said Matt Welland on Wed, 16 Apr 2014 09:01:28 -0700:

> Autosync: ssh://host/path/project.fossil
> Round-trips: 1 Artifacts sent: 0 received: 0
> Error: Database error: database is locked: {UPDATE event SET mtime=(SELECT
> m1 FROM time_fudge WHERE mid=objid) WHERE objid IN (SELECT mid FROM
> time_fudge);}

Have you tried running the latest from trunk on your fossil server? You
can test this easily without impacting existing users via SSH by
installing the new version to a different location on the server and
then cloning with a URL of:

fossil clone ssh://host/path/project.fossil?fossil=/path/to/new/fossil clone.fossil

I tried the latest from trunk and I don't see this particular error
anymore. If this also goes away for you, then you simply need to update
your servers (no client updates should be necessary).

Andy
--
TAI64 timestamp: 400000005354cb1b
Matt Welland
2014-04-21 16:26:25 UTC
Permalink
On Mon, Apr 21, 2014 at 12:38 AM, Andy Bradford <amb-***@bradfords.org>wrote:

> Thus said Matt Welland on Wed, 16 Apr 2014 09:01:28 -0700:
>
> > Autosync: ssh://host/path/project.fossil
> > Round-trips: 1 Artifacts sent: 0 received: 0
> > Error: Database error: database is locked: {UPDATE event SET
> mtime=(SELECT
> > m1 FROM time_fudge WHERE mid=objid) WHERE objid IN (SELECT mid FROM
> > time_fudge);}
>
> Have you tried running the latest from trunk on your fossil server?


Yes! This is fixed on latest! Any idea which commit fixes the problem?

I guess we should switch to this not-officially-released version but what
other issues am I likely to run into?

Strictly speaking I'd feel more comfortable with a version 1.28 patched
with whatever fixes the bug rather than taking on the myriad of changes
made since 1.28 was released.

What do people advise?


> You
> can test this easily without impacting existing users via SSH by
> installing the new version to a different location on the server and
> then cloning with a URL of:
>
> fossil clone ssh://host/path/project.fossil?fossil=/path/to/new/fossil
> clone.fossil
>
> I tried the latest from trunk and I don't see this particular error
> anymore. If this also goes away for you, then you simply need to update
> your servers (no client updates should be necessary).
>
> Andy
> --
> TAI64 timestamp: 400000005354cb1b
>
>
>


--
Matt
-=-
90% of the nations wealth is held by 2% of the people. Bummer to be in the
majority...
Stephan Beal
2014-04-21 16:31:12 UTC
Permalink
On Mon, Apr 21, 2014 at 6:26 PM, Matt Welland <***@gmail.com> wrote:

> What do people advise?
>

Historically speaking there has been little or no reason not to rely on the
tip of the trunk. Rarely, something gets put in which breaks the build, but
that doesn't happen often and is always fixed quickly.

i've used Fossil daily since the end of 2007, and the only copy of the
fossil binary on my machines is the one under my clone of the main repo.
That has occasionally bitten me (requiring me to go download a binary), but
only when i'm tinkering on fossil, can't compile it, and have already
cleaned up the old binary (so can't stash/revert my changes).

--
----- stephan beal
http://wanderinghorse.net/home/stephan/
http://gplus.to/sgbeal
"Freedom is sloppy. But since tyranny's the only guaranteed byproduct of
those who insist on a perfect world, freedom will have to do." -- Bigby Wolf
Andy Bradford
2014-04-22 04:40:23 UTC
Permalink
Thus said Matt Welland on Mon, 21 Apr 2014 09:26:25 -0700:

> Yes! This is fixed on latest! Any idea which commit fixes the problem?

Will you tell me exactly which version of fossil it is? e.g. run
``fossil version'' with the fossil binary that exhibits the problem on
the server.

Thanks,

Andy
--
TAI64 timestamp: 400000005355f2da
Matt Welland
2014-04-22 04:49:34 UTC
Permalink
On Mon, Apr 21, 2014 at 9:40 PM, Andy Bradford <amb-***@bradfords.org>wrote:

> Thus said Matt Welland on Mon, 21 Apr 2014 09:26:25 -0700:
>
> > Yes! This is fixed on latest! Any idea which commit fixes the problem?
>
> Will you tell me exactly which version of fossil it is? e.g. run
> ``fossil version'' with the fossil binary that exhibits the problem on
> the server.
>

It is the version 1.28 downloaded from the downloads page. Which should be
3d49f04587

BTW, note that it is the same fossil binary accessible from both the client
and server perspective as it is off of NFS. I don't think this matters but
thought I'd mention it.





Thanks,
>
> Andy
> --
> TAI64 timestamp: 400000005355f2da
>
>
>


--
Matt
-=-
90% of the nations wealth is held by 2% of the people. Bummer to be in the
majority...
Andy Bradford
2014-04-22 05:36:21 UTC
Permalink
Thus said Matt Welland on Mon, 21 Apr 2014 09:26:25 -0700:

> Yes! This is fixed on latest! Any idea which commit fixes the problem?

I ran fossil bisect to figure out where the fix came into trunk [I must
say, this is the first time I've used fossil bisect and it was quite
handy!]. Here is the last BAD commit that had the problem:

http://www.fossil-scm.org/index.html/timeline?dp=ab00f2b007d5229d

And the commit just after that by drh fixes it [b4dffdac5e]
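
For anyone who hasn't tried it, a bisect session looks roughly like this
(subcommand names from current Fossil; the known-good hash is a
placeholder):

```
fossil bisect bad ab00f2b007d5229d    # oldest check-in known to show the bug
fossil bisect good <known-good-hash>  # fossil checks out a midpoint
# build, test, then mark the result and repeat until it converges:
fossil bisect good                    # or: fossil bisect bad
```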

It was also merged into the 1.28 branch (branch-1.28):

http://www.fossil-scm.org/index.html/info/ebac09bcf72fbed9b389c07766a931264df9e304

So if you feel better sticking with Fossil version 1.28, you can
update to the latest on branch-1.28.
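
Assuming you build from a checkout of the self-hosting Fossil
repository, that would be roughly (directory name is an example):

```
cd fossil-checkout
fossil update branch-1.28
./configure && make
```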

Andy
--
TAI64 timestamp: 400000005355fff8
Andy Bradford
2014-06-13 04:01:43 UTC
Permalink
Thus said Matt Welland on Wed, 16 Apr 2014 09:01:28 -0700:

> Could fossil silently retry a couple times instead of giving up so
> easily?

Not silent, but it can retry:

http://www.fossil-scm.org/index.html/info/76bc297e96211b50d7b7e518ba45663c80889f1f

This still won't avoid the occasional fork if the user answers ``Yes''
to the question, but it will try as many times as you configure it to
try.
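
On later Fossil versions the retry count is exposed as a setting; the
name autosync-tries here is taken from those later releases and may not
match the linked commit exactly:

```
fossil settings autosync-tries 3
```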

Andy
--
TAI64 timestamp: 40000000539a77ca