Discussion:
Delay with `fossil ui' (related to backoffice processing?)
(too old to reply)
Florian Balmer
2018-07-21 17:48:42 UTC
Permalink
I've noticed something with the current tip version of Fossil (on Windows):

I see random delays for the `fossil ui' command, i.e. the web browser
is opened at once, but is displaying the loading/busy indicator and a
blank page for some time, before the default web page finally appears:

http://localhost:8080/timeline?c=current

When repeatedly entering `fossil ui' from the command prompt, followed
by immediately closing the web browser as soon as the /timeline web
page is displayed, then hitting Ctrl+C to stop the Fossil web server,
and restarting the cycle by running `fossil ui' again, the delay seems
to appear with every 5th to 10th cycle.

During the delay, I can see two fossil.exe processes with `tasklist |
findstr fossil', whereas with `fossil ui' without the delay, there's
only one fossil.exe process.

The %TEMP% directory has 3 or more files appearing instantaneously at
the beginning of the delay, named fossil_server_P8080_00000N_xxx.txt,
with N being low numbers, and xxx being 'in', 'out', and 'cmd',
containing the HTTP request and response, and a file/IP address list.

When terminating the delayed `fossil ui' server (responds to Ctrl+C)
and immediately relaunching it, it is as fast as usual, without any
delay. The delay takes approximately 60 seconds, and after that the
HTTP request completes normally with the display of the /timeline web
page, and the temporary files are cleared.

I tried to find a "last good" version by means of "bisecting", but
unfortunately no success, so far. All I can say is that the delay
never appears with Fossil version 2.6 [9718f3b078] 2018-05-04 12:56:42
UTC, the binary for Windows downloaded from the Fossil web site.

My tests were on Windows 10.0.17134.167 (64-bit), with MSVC
19.14.26430 (32-bit).

--Florian
Warren Young
2018-07-22 03:43:07 UTC
Permalink
I've noticed something with the current tip version of Fossil…
I tried to find a "last good" version by means of "bisecting", but…the delay never appears with Fossil 2.6.
I can see only two ways for those two sentences to be true at once, but it’s unclear which is the case from the way you phrased it:

1. The tip of trunk and *only* the tip of trunk exhibits the problem behavior.

2. Every checkin after 2.6 exhibits it.

Which is it?

If the problem first appeared with [76800769], the fix drh checked in to clean up the dangling journal files you were reporting, then I wonder if the delay is waiting out some SQLite lock.

Windows’ native locking behavior is much more restrictive than the default on pretty much every other platform, so I wonder if the new shutdown code path created a temporary deadlock that gets broken by a SQLite timeout.

Try backing that checkin out to see if the prior problem (dangling journal files) reappears, while the delay also goes away:

fossil up trunk # back out from the bisect
fossil revert # just to be sure
fossil merge --backout 76800769

Then build and test.
Warren Young
2018-07-22 03:45:06 UTC
Permalink
Also, does this behavior happen on only one machine? If it happens on many different Windows boxes within your reach, that has different implications than if it only happens on the one you happen to be using most often.
Florian Balmer
2018-07-22 07:18:12 UTC
Permalink
Thank very much for your feedback!
Post by Warren Young
Which is it?
With the random appearance of the delay, and the long duration of 60
seconds, I eventually gave up on "bisecting".

But with a more systematic approach, I've now got:

Last good: [06507038]
First bad: [947081aa]
Post by Warren Young
If the problem first appeared with [76800769] ...
Backing out [76800769] (and the dependent [d216ea9a]) does not fix the delay.
Post by Warren Young
Also, does this behavior happen on only one machine?
I can also see it on a Windows 7 (32-bit) VM.

Here's what I've found out in the meantime:

Waiting for 60 seconds for the delay to time out, until the default
/timeline web page is loaded in the browser, instead of terminating
the server with Ctrl+C, will cause another 60 seconds delay for any
links clicked in the web ui.

Reloading the delayed /timeline web page with F5, before the 60
seconds have elapsed, gives a speedy refresh, with no delay, so
multiple requests within the 60 seconds period are not delayed.

Waiting for 60 seconds before running `fossil ui' after the last
server was terminated, causes the delay to always happen.

The observed delay of 60 seconds seems somehow related to:

[5544931c] src/backoffice.c:59
#define BKOFCE_LEASE_TIME 60 /* Length of lease validity */

On Windows, the DbgView utility is useful to trace messages from
OutputDebugStringA(), handy if there's no attached console, and easier
than attaching WinDbg to spawned sub-processes:

https://live.sysinternals.com/Dbgview.exe

By adding calls to OutputDebugStringA() before and after the following
line of source code, I found that one of the fossil.exe processes
seems to be stuck here during the delay:

[5544931c] src/backoffice.c:240
sqlite3_sleep(1000*(x.tmCurrent - tmNow + 1));

This code path is only entered if the delay becomes manifest.

I don't understand all the details of the backoffice processing code,
yet, in particular not the condition to trigger the sleep. But the
sleep seems to block the web ui, somehow?

--Florian
Warren Young
2018-07-22 09:28:20 UTC
Permalink
Post by Florian Balmer
[5544931c] src/backoffice.c:240
That’s brand new code, less than a week old. It’s not surprising it’s not rock-solid yet.

I expect drh is now on the case. :)
Florian Balmer
2018-07-22 10:21:30 UTC
Permalink
That’s brand new code, less than a week old. It’s not surprising it’s
not rock-solid yet.
Yes, I agree, I no not expect complex new features to be flawless from
the beginning.

But I've stumbled into the issue while testing a fix from the current
tip, and decided to report it early.
I expect drh is now on the case. :)
The Windows HTTP server (and thus `fossil ui') works by spawning
sub-processes for each individual HTTP request, right? This results in
multiple sub-processes per web page (one for the HTML page, one for
CSS, another one for JavaScript, etc.)?

Maybe that these sub-processes concurrently try to "get the backoffice
lease" after successfully serving their HTTP request, resulting in a
race, and some of the sub-processes sleeping and waiting for 60
seconds for the lease to expire?

I noticed the following commit:

[a32a92d2] At the end of CGI processing, close the output pipe before
starting backoffice processing, in order to let higher levels know
that the CGI is finished.

http://fossil-scm.org/index.html/info/a32a92d2

Maybe this fixed the same (or a similar) problem with the HTTP server
on *nix, so the web browser quickly receives all the HTTP output,
before the HTTP serving process is turning to backoffice work?

But on Windows, g.httpOut seems to be a temporary file, and not a
pipe, and the `fossil ui' main process may not be waiting for the file
to become signaled when (any) I/O completes (which is a discouraged
technique anyway), but for the sub-process that generated the file to
exit? And this one sub-process sleeping is causing the delay in the
`fossil ui' main process?

Sorry for my prosaic speculation and no hard facts, but I find it all
very interesting!

Another minor point:

The following line of source code from the aforementioned commit:

g.httpOut = fossil_fopen("/dev/null", "wb");

always seems to result in g.httpOut==NULL on Windows, as "/dev/null"
is not a valid Windows device. This probably doesn't matter from the
logic of the code, as writing to a FILE* pointing to NULL may have
exactly the same effect as writing to a FILE* pointing to "/dev/null"?

--Florian
Richard Hipp
2018-07-22 14:44:48 UTC
Permalink
Post by Florian Balmer
The Windows HTTP server (and thus `fossil ui') works by spawning
sub-processes for each individual HTTP request, right? This results in
multiple sub-processes per web page (one for the HTML page, one for
CSS, another one for JavaScript, etc.)?
Every Fossil webpage is generated by a separate subprocess. This is
true for Windows, Mac, Linux, BSD, whatever. And it is true
regardless of how Fossil is launched.

The backoffice is a facility to do background processing - things like
sending email notifications, syncing with peers, maybe doing
aggressive compaction of the repository - anything that is not
directly interacting with the user. Backoffice runs after a webpage
is generated, in the same subprocess. Once the entire webpage has
been generated, Fossil closes the socket or file into which the
webpage is being written, and then attempts to do backoffice
processing. There can only be one backoffice process at a time, and
another backoffice that is "on deck" - ready to run when the current
one finishes.

In most systems, the webserver returns results back to the user as
soon as the output socket closes. So the user does not realize that
the process that generated the webpage continues running for up to 60
seconds of backoffice work. But on Windows for "fossil ui" and
"fossil server", the webpage is not returned to the user until the
subprocess actually terminates. Hence, if the subprocess gets
involves in backoffice work, that can delay the return of content to
the user.

The only comes up on Windows. I do not yet have a good work-around.
--
D. Richard Hipp
***@sqlite.org
Warren Young
2018-07-23 17:37:27 UTC
Permalink
Post by Florian Balmer
writing to a FILE* pointing to NULL may have
exactly the same effect as writing to a FILE* pointing to "/dev/null”?
You can open a file called NUL on Windows to get the same effect as /dev/null on POSIX type platforms.

Unlike on POSIX, NUL is just a special-cased file name, not the name of an actual device, so it’s possible to get around it with NTFS file path trickery:

https://superuser.com/questions/282194/

That’s no concern here, since the Fossil code will likely just say "NUL".
Florian Balmer
2018-07-22 10:32:06 UTC
Permalink
... writing to a FILE* pointing to NULL may have exactly the same
effect as writing to a FILE* pointing to "/dev/null"?
Well, unless there's a directory named "dev" in the root of the
current drive, which would result in creation (due to the passed "wb"
mode) of a file named "null", on Windows.

--Florian
Florian Balmer
2018-07-22 16:32:56 UTC
Permalink
Hence, if the subprocess gets involves in backoffice work, that can
delay the return of content to the user.
The only comes up on Windows. I do not yet have a good work-around.
I see, thank you very much for the detailed explanations.

I only have the obvious (inefficient?) ideas, like using a
command-line switch to direct the "HTTP renderer" to offload
backoffice processing to another separate process, and exit
immediately once the HTTP response is ready.

Looking forward to see what will be your solution. I can always learn
a lot of interesting things from the Fossil source code.

--Florian
Richard Hipp
2018-07-22 19:02:37 UTC
Permalink
Post by Florian Balmer
I see random delays for the `fossil ui' command,
Please test the latest trunk version and let me know if you are still
seeing issues.

Anybody who has the capability, please also verify that "fossil server
--scgi" is now working again. I do not have a SCGI webserver set up
on windows with which to test that command so I implemented the fixes
there without actually checking them.
--
D. Richard Hipp
***@sqlite.org
Florian Balmer
2018-07-22 22:06:47 UTC
Permalink
Post by Richard Hipp
Please test the latest trunk version and let me know if you are still
seeing issues.
Thank you very much for the fix!

I see no more delays with "fossil ui" on Windows.

Just a thought: would it make sense to have --nodelay as a global
option recognized by all commands, and maybe also through an
environment variable, and a CGI control option? So that any backoffice
processing could be temporarily disabled?

This would probably require an explicit "backoffice" command, which
could then also be set up to run as a task or cron job every few
hours.

This may be handy for fatal incidents that require "repository
surgery", or restoring from backups, where you don't want any
notifications to be sent before things are fixed.

But maybe this is overengineering.

--Florian
Richard Hipp
2018-07-22 22:46:35 UTC
Permalink
Post by Florian Balmer
Just a thought: would it make sense to have --nodelay as a global
option recognized by all commands, and maybe also through an
environment variable, and a CGI control option? So that any backoffice
processing could be temporarily disabled?
Everything is in flux right now. Let's iron out these details once I
get everything actually working. There is a lot more to be done
before we reach that point.
--
D. Richard Hipp
***@sqlite.org
Florian Balmer
2018-07-24 07:25:24 UTC
Permalink
Post by Warren Young
You can open a file called NUL on Windows to get the same effect as
/dev/null on POSIX type platforms.
A very interesting topic!

NULL
(void*)0, or nullptr (was used in a confusing manner, in my post)

\Device\Null
Windows NT (object manager) equivalent to /dev/null

\??\NUL
Windows NT global symlink to \Device\Null
Not accepted by Win32 APIs

\\.\NUL
\\?\NUL
Local Device and Root Local Device forms of \??\NUL
Works with Win32 APIs

(\\.\ and \\?\ both escape to \DosDevices, the latter without any path
name preprocessing, and on current versions of Windows NT, \DosDevices
is a is a symlink to \??, which is a virtual directory containing both
objects from the \GLOBAL?? and from the user session directories)

NUL
Win32 API magic file name mapped to \\.\NUL

\\?\C:\dev\NUL
Path to file C:\dev\NUL (must be specified as an absolute path)
Works with Win32 APIs

(\\?\ escapes to \DosDevices, which links to \??, where the symlink
named C: is found, pointing to \Device\HarddiskVolumeN, and finally
resolving to \Device\HarddiskVolumeN\dev\NUL)
Post by Warren Young
That’s no concern here, since the Fossil code will likely just say
"NUL".
Yes, I think the Win32 variant should be:

(a) g.httpOut = fossil_fopen("NUL", "wb");

instead of:

(b) g.httpOut = fossil_fopen("/dev/null", "wb");

Code snippet (a) returns a valid (i.e. non-zero) FILE*, on Windows.

If the root drive of the fossil.exe current (working) directory has a
directory named "dev", say "C:\dev", code snippet (b) creates the file
"C:\dev\null", on Windows.

Just tested with fossil_open().

--Florian

Loading...