Discussion:
[fossil-users] Mix of UTF-8 and CP1251 (Russian Cyrillic) in project
Ruslan Popov
2010-06-25 07:05:01 UTC
Hi all,

I've tried to use Fossil on the Russian version of Windows 7. I made a
commit with Russian text in the comment; when I ran the UI and looked at
the timeline, the Russian text appeared as squares. Is there any way to
set the character encoding of commits? It would be nice if they could be
converted to UTF-8 automatically while committing. If there is no such
ability, can I change the text of existing commits (into English, for
example)?
--
Ruslan Popov
phone: +7 916 926 1205
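
What Ruslan describes is consistent with CP1251 bytes being rendered as
UTF-8. A minimal Python sketch of that failure mode (the sample string is
an assumption, not his actual comment):

    # CP1251 bytes rendered as if they were UTF-8 (hypothetical sample).
    raw = "Привет, мир".encode("cp1251")        # what a CP1251 editor writes
    print(raw.decode("utf-8", errors="replace"))
    # Every Cyrillic byte is invalid UTF-8, so each becomes U+FFFD,
    # which many fonts draw as a square.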
Sergey Sfeli
2010-06-25 10:15:33 UTC
Post by Ruslan Popov
I've tried to use Fossil on the Russian version of Windows 7. I made a
commit with Russian text in the comment; when I ran the UI and looked at
the timeline, the Russian text appeared as squares.
Why not just use a text editor that supports UTF-8 and write your
comments in UTF-8 instead of cp1251? You can set or change the default
text editor with "fossil set editor anything-else-than-notepad" (I am
using Notepad2, for example).


Question to Richard Hipp: can you please add any UTF-8 character(s) to
the following text to help text editors auto-detect the right encoding?

# Enter comments on this check-in. Lines beginning with # are ignored.
# The check-in comment follows wiki formatting rules.
Ruslan Popov
2010-06-25 10:17:32 UTC
Sergey, I now use Emacs and its mule-utf-8-unix encoding for the commit buffer.
Post by Sergey Sfeli
Post by Ruslan Popov
I've tried to use Fossil on the Russian version of Windows 7. I made a
commit with Russian text in the comment; when I ran the UI and looked at
the timeline, the Russian text appeared as squares.
Why not just use a text editor that supports UTF-8 and write your
comments in UTF-8 instead of cp1251? You can set or change the default
text editor with "fossil set editor anything-else-than-notepad" (I am
using Notepad2, for example).
Question to Richard Hipp: can you please add any UTF-8 character(s) to
the following text to help text editors auto-detect the right encoding?
# Enter comments on this check-in. Lines beginning with # are ignored.
# The check-in comment follows wiki formatting rules.
--
Ruslan Popov
phone: +7 916 926 1205
Michal Suchanek
2010-06-25 13:34:32 UTC
Post by Sergey Sfeli
Post by Ruslan Popov
I've tried to use Fossil on the Russian version of Windows 7. I made a
commit with Russian text in the comment; when I ran the UI and looked at
the timeline, the Russian text appeared as squares.
Why not just use a text editor that supports UTF-8 and write your
comments in UTF-8 instead of cp1251? You can set or change the default
text editor with "fossil set editor anything-else-than-notepad" (I am
using Notepad2, for example).
Question to Richard Hipp: can you please add any UTF-8 character(s) to
the following text to help text editors auto-detect the right encoding?
# Enter comments on this check-in.  Lines beginning with # are ignored.
# The check-in comment follows wiki formatting rules.
Perhaps fossil should have a "system encoding" which it would get from
the environment (the locale on Unix-likes, the codepage on Windows) and
mark all commit messages with it.

This would mark commits with the correct encoding on most Unix-like
systems. On Windows there is a "DOS codepage" and a "Windows codepage",
so there is no completely reliable way of determining the encoding used
on the system, although the "Windows codepage" is what most windowed
programs use. Still, there should be a way to set the encoding
explicitly.
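
A rough Python sketch of the idea (the helper name and the fallback are
illustrative assumptions, not anything Fossil actually does):

    import locale

    def commit_text_to_utf8(raw: bytes) -> str:
        # Guess the "system encoding": the locale's preferred encoding,
        # which on Windows is the ANSI codepage (e.g. cp1251 on Russian
        # systems).
        system_enc = locale.getpreferredencoding(False) or "utf-8"
        # Best-effort conversion; undecodable bytes become U+FFFD
        # instead of aborting the commit.
        return raw.decode(system_enc, errors="replace")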

It remains an open question, though, what to do when displaying the
timeline of a repository whose commits use multiple encodings.

Thanks

Michal
Michael Richter
2010-06-25 15:00:14 UTC
Post by Michal Suchanek
Perhaps fossil should have a "system encoding" which it would get from
the environment (the locale on Unix-likes, the codepage on Windows) and
mark all commit messages with it.
I vote that this is an extraordinarily bad idea.

Fossil is a *distributed* SCM system. Potentially the distributed database
in question could be spread around the world. Do you really want the
nightmare (and impossibility!) of trying to keep track of which project is
in which encoding scheme on which machine? UTF-8 is a standard
*explicitly* designed to *stop* this kind of confusion. It's also been
around since 1993, so your development tools have had plenty of time to
catch on and actually use it.
--
"Perhaps people don't believe this, but throughout all of the discussions of
entering China our focus has really been what's best for the Chinese people.
It's not been about our revenue or profit or whatnot."
--Sergey Brin, demonstrating the emptiness of the "don't be evil" mantra.
Owen Shepherd
2010-06-25 16:09:17 UTC
The trouble is that UTF-8 is a poor standard. It bloats many texts, is
quite expensive to parse, and has only one redeeming feature: it never
creates embedded nulls. I suppose the fact that it shares its encoding
with ASCII is a feature too, but only a minor one.

Personally, I think that most systems should adopt SCSU as their
storage encoding, but that's unlikely to happen until C strings and
MIME (two paragons of awfulness) die out.
Post by Michael Richter
Post by Michal Suchanek
Perhaps fossil should have a "system encoding" which it would get from
the environment (the locale on Unix-likes, the codepage on Windows) and
mark all commit messages with it.
I vote that this is an extraordinarily bad idea.
Fossil is a distributed SCM system.  Potentially the distributed database in
question could be spread around the world.  Do you really want the nightmare
(and impossibility!) of trying to keep track of which project is in which
encoding scheme on which machine?  UTF-8 is a standard explicitly designed
to stop this kind of confusion.  It's also been around since 1993, so your
development tools have had plenty of time to catch on and actually use it.
Michal Suchanek
2010-06-25 17:53:11 UTC
Post by Owen Shepherd
The trouble is that UTF-8 is a poor standard. It bloats many texts, is
quite expensive to parse, and has only one redeeming feature: it never
creates embedded nulls. I suppose the fact that it shares its encoding
with ASCII is a feature too, but only a minor one.
Personally, I think that most systems should adopt SCSU as their
storage encoding, but that's unlikely to happen until C strings and
MIME (two paragons of awfulness) die out.
Post by Michael Richter
Post by Michal Suchanek
Perhaps fossil should have a "system encoding" which it would get from
the environment (the locale on Unix-likes, the codepage on Windows) and
mark all commit messages with it.
I vote that this is an extraordinarily bad idea.
Fossil is a distributed SCM system.  Potentially the distributed database in
question could be spread around the world.  Do you really want the nightmare
(and impossibility!) of trying to keep track of which project is in which
encoding scheme on which machine?  UTF-8 is a standard explicitly designed
to stop this kind of confusion.  It's also been around since 1993, so your
development tools have had plenty of time to catch on and actually use it.
The fact is that Windows is a supported platform, and on Windows common
tools do not use UTF-8, for better or worse. So there should at least
be code to identify the system encoding and convert it to the repository
encoding.

Also note that UTF-8, and Unicode in general, is not the encoding of
choice for CJK languages, for various reasons. I guess it is acceptable
to convert from the system encoding to UTF-8 on a best-effort basis
(which usually causes minimal loss of information, if any) so that
repository commit messages and other text shown on the web can be
merged together without resorting to iframes or similar atrocities.

The tracked files themselves are, of course, free to be in any
encoding. Still, displaying files in an arbitrary encoding in a UTF-8
web app is somewhat troublesome, so it would be an advantage to be able
to start a repository in a different encoding, or to switch the web
encoding, so that files in different encodings can be viewed easily.
Tagging files with an encoding when they are interpreted as text by
fossil would also be useful.

Thanks

Michal
Owen Shepherd
2010-06-25 18:18:48 UTC
One of the reasons that I'm a fan of SCSU is that, even with a
relatively simple encoder, it produces output comparable in
efficiency to most legacy encodings.
Post by Michal Suchanek
The fact is that Windows is a supported platform, and on Windows common
tools do not use UTF-8, for better or worse. So there should at least
be code to identify the system encoding and convert it to the repository
encoding.
Also note that UTF-8, and Unicode in general, is not the encoding of
choice for CJK languages, for various reasons. I guess it is acceptable
to convert from the system encoding to UTF-8 on a best-effort basis
(which usually causes minimal loss of information, if any) so that
repository commit messages and other text shown on the web can be
merged together without resorting to iframes or similar atrocities.
The tracked files themselves are, of course, free to be in any
encoding. Still, displaying files in an arbitrary encoding in a UTF-8
web app is somewhat troublesome, so it would be an advantage to be able
to start a repository in a different encoding, or to switch the web
encoding, so that files in different encodings can be viewed easily.
Tagging files with an encoding when they are interpreted as text by
fossil would also be useful.
Andreas Kupries
2010-06-25 18:24:37 UTC
As an FYI, I googled SCSU:

http://en.wikipedia.org/wiki/Standard_Compression_Scheme_for_Unicode
Post by Owen Shepherd
One of the reasons that I'm a fan of SCSU is that, even with a
relatively simple encoder, it produces output comparable in
efficiency to most legacy encodings.
--
Andreas Kupries
Senior Tcl Developer
ActiveState, The Dynamic Language Experts

P: 778.786.1122
F: 778.786.1133
***@activestate.com
http://www.activestate.com
Get insights on Open Source and Dynamic Languages at www.activestate.com/blog
Michal Suchanek
2010-06-25 18:36:37 UTC
Post by Owen Shepherd
One of the reasons that I'm a fan of SCSU is that, even with a
relatively simple encoder, it produces output comparable in
efficiency to most legacy encodings.
SCSU is a horrendous encoding because it uses shifts. When a shift is
lost, the rest of the text takes on a completely different meaning. In
UTF-8, if you remove part of the text, only that part is affected (if
you cut mid-character you create a bad character at worst, and it can
be clearly detected).
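
That self-synchronising behaviour is easy to demonstrate; a small Python
sketch (the sample string is assumed):

    data = "Привет".encode("utf-8")             # 12 bytes, 2 per character
    cut = data[:-1]                             # truncate mid-character
    print(cut.decode("utf-8", errors="replace"))
    # -> "Приве�": only the damaged character is lost; the rest survives.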

Thanks

Michal
Owen Shepherd
2010-06-25 19:37:36 UTC
Post by Michal Suchanek
Post by Owen Shepherd
One of the reasons that I'm a fan of SCSU is that, even with a
relatively simple encoder, it produces output comparable in
efficiency to most legacy encodings.
SCSU is a horrendous encoding because it uses shifts. When a shift is
lost, the rest of the text takes on a completely different meaning. In
UTF-8, if you remove part of the text, only that part is affected (if
you cut mid-character you create a bad character at worst, and it can
be clearly detected).
And how often do you lose a couple of bytes in the middle of a file?
More precisely, how often do you lose them and not have a checksum
failure (or some other error) notifying you of it?

It's a particularly egregious complaint in the context of Fossil,
where all records are hashed anyway! Additionally, if the same kind of
error were to occur to the SQLite file that the repository is
contained in, it would probably be trashed irretrievably.

Years of experience with binary and other modal file formats (XML and
HTML, to name two very common ones) show that this is a complete
non-issue.

SCSU is of course a poor choice for an in-memory format (use UTF-16)
or for interacting with the console (for backwards compatibility you're
probably going to have to use UTF-8). But for a storage format,
particularly one embedded within a database? It's pretty much perfect.
Michal Suchanek
2010-06-26 12:47:56 UTC
Post by Owen Shepherd
And how often do you lose a couple of bytes in the middle of a file?
More precisely, how often do you lose them and not have a checksum
failure (or some other error) notifying you of it?
If the file is a web page then quite often, and it does not have a checksum.

If the encoding is intended solely for storage, then anything that is
easy to work with would do, and SCSU does not seem to particularly
shine in that area compared to better-known and more widespread
encodings for which tools are readily available.
Post by Owen Shepherd
It's a particularly egregious complaint in the context of Fossil,
where all records are hashed anyway! Additionally, if the same kind of
error were to occur to the SQLite file that the repository is
contained in, it would probably be trashed irretrievably.
Years of experience with binary and other modal file formats (XML and
HTML, to name two very common ones) show that this is a complete
non-issue.
It is not an issue if the partial data still makes sense, which is not
the case with SCSU shifts, which completely change the meaning of the
rest of the data.
Post by Owen Shepherd
SCSU is of course a poor choice for an in-memory format (use UTF-16)
or for interacting with the console (for backwards compatibility you're
probably going to have to use UTF-8). But for a storage format,
particularly one embedded within a database? It's pretty much perfect.
Anybody who suggests using UTF-16 for anything has no idea about
useful encodings, in my book. UTF-16 has no advantages whatsoever, only
disadvantages.

SCSU is not that useful for storage compression, since fossil already
uses zlib, and it has no other advantages that I am aware of.

Thanks

Michal
Owen Shepherd
2010-06-26 16:05:09 UTC
Post by Michal Suchanek
If the file is a web page then quite often, and it does not have a checksum.
In that case I'd have to question the quality of your networking
equipment and software. Losing a couple of bytes in the middle of a
web page should not be possible under TCP (unless, perhaps, one is
under attack from a malicious third party, in which case a bit of data
loss is the least of your worries).

And HTML is also a file format with the equivalent of shifts; it just
calls them tags.
Post by Michal Suchanek
If the encoding is intended solely for storage, then anything that is
easy to work with would do, and SCSU does not seem to particularly
shine in that area compared to better-known and more widespread
encodings for which tools are readily available.
When embedded inside some other file format (such as a Fossil
repository), this is a non-issue.
Post by Michal Suchanek
Post by Owen Shepherd
It's a particularly egregious complaint in the context of Fossil,
where all records are hashed anyway! Additionally, if the same kind of
error were to occur to the SQLite file that the repository is
contained in, it would probably be trashed irretrievably.
Years of experience with binary and other modal file formats (XML and
HTML, to name two very common ones) show that this is a complete
non-issue.
It is not an issue if the partial data still makes sense, which is not
the case with SCSU shifts, which completely change the meaning of the
rest of the data.
And yet we are discussing Fossil here, where the loss of a few bytes
will destroy the repository or abort the sync operation anyway.
Post by Michal Suchanek
Post by Owen Shepherd
SCSU is of course a poor choice for an in-memory format (use UTF-16)
or for interacting with the console (for backwards compatibility you're
probably going to have to use UTF-8). But for a storage format,
particularly one embedded within a database? It's pretty much perfect.
Anybody who suggests using UTF-16 for anything has no idea about
useful encodings, in my book. UTF-16 has no advantages whatsoever, only
disadvantages.
Would you care to enumerate your points, then?
Post by Michal Suchanek
SCSU is not that useful for storage compression, since fossil already
uses zlib, and it has no other advantages that I am aware of.
Deflate compression is only applied to commits. Deflate has
significant overhead and is inapplicable to smaller pieces of text
(such as commit strings), which can nonetheless contribute
significantly to size. On the other hand, SCSU performs better than
UTF-8 for the vast majority of real-world texts, as already noted.
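
The overhead claim is easy to check; a small Python sketch (the commit
message is invented for the example):

    import zlib

    msg = b"Fix typo in the README"            # a typical short commit string
    packed = zlib.compress(msg)
    print(len(msg), len(packed))
    # e.g. 22 vs ~30: the zlib header, trailer and symbol tables make
    # deflate *grow* short strings rather than shrink them.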
Richard Hipp
2010-06-26 16:34:45 UTC
Post by Owen Shepherd
Post by Michal Suchanek
SCSU is not that useful for storage compression, since fossil already
uses zlib, and it has no other advantages that I am aware of.
Deflate compression is only applied to commits. Deflate has
significant overhead and is inapplicable to smaller pieces of text
(such as commit strings), which can nonetheless contribute
significantly to size. On the other hand, SCSU performs better than
UTF-8 for the vast majority of real-world texts, as already noted.
The check-in comments in Fossil are contained in the manifest artifacts,
which are both delta-compressed and deflated prior to storage in the
current implementation.

Copies of check-in comments are stored uncompressed in a separate table
(the EVENT table) for ease of access during queries such as "timeline".
But the amount of text stored there is small. In Fossil's self-hosting
repository (with 3409 events) there is 337KB of comment text, or about
2.3% of the total repository space. In the 10-year history of SQLite
there are 8664 events with 869KB of text, or 2.2% of the total
repository space. In both of those examples the comments are pure ASCII,
so SCSU compression would make no difference. But notice that we could
store the text as UTF-32 and it would still be less than 10% of the
total repository.

In contrast, the delta- and deflate-compressed artifacts comprise about
70% and 80% of the repository space for Fossil and SQLite, respectively.
The artifact compression is very effective, achieving compression ratios
of 19:1 for Fossil and 39:1 for SQLite.
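
As a back-of-the-envelope check of those figures (the totals below are
derived from the quoted percentages, not measured):

    # 337 KB of comment text at ~2.3% of the repository implies:
    print(337 / 0.023)     # ~14,650 KB total for the Fossil repository
    print(869 / 0.022)     # ~39,500 KB total for the SQLite repository
    # UTF-32 would at most quadruple the comment text: 4 * 2.3% = 9.2% < 10%.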
--
---------------------
D. Richard Hipp
***@sqlite.org
Michal Suchanek
2010-06-26 19:59:37 UTC
Post by Owen Shepherd
In that case I'd have to question the quality of your networking
equipment and software. Losing a couple of bytes in the middle of a
web page should not be possible under TCP (unless, perhaps, one is
under attack from a malicious third party, in which case a bit of data
loss is the least of your worries).
Indeed, in the case of web pages the loss is at the end; parts missing
in the middle are the result of different streams being inserted, so
SCSU would not suffer more breakage there than other encodings. Still,
there is no apparent benefit in using it.
Post by Owen Shepherd
And HTML is also a file format with the equivalent of shifts; it just
calls them tags.
However, most HTML parsers are quite capable of parsing incomplete
HTML, because the tags don't change the meaning of the text except when
it is part of a tag attribute.
Post by Owen Shepherd
Post by Michal Suchanek
Post by Owen Shepherd
SCSU is of course a poor choice for an in-memory format (use UTF-16)
or for interacting with the console (for backwards compatibility you're
probably going to have to use UTF-8). But for a storage format,
particularly one embedded within a database? It's pretty much perfect.
Anybody who suggests using UTF-16 for anything has no idea about
useful encodings, in my book. UTF-16 has no advantages whatsoever, only
disadvantages.
Would you care to enumerate your points, then?
UTF-8 is endianness-independent and null-free; UTF-16 is neither. In
transport, losing a byte (or a packet with an unknown, possibly odd,
number of bytes) corrupts at most one character of UTF-8, but it may
misalign the whole stream of UTF-16.

UTF-32 is dword-aligned; you can index into it as an array, and every
position is a codepoint. UTF-16 has surrogate pairs, so you have to
decode the whole string to get at the codepoints.
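
Both points can be verified with a few lines of Python (a sketch):

    s = "\U0001D11E"                            # U+1D11E MUSICAL SYMBOL G CLEF
    print(len(s.encode("utf-16-le")) // 2)      # 2 code units: a surrogate pair
    print(len(s.encode("utf-32-le")) // 4)      # 1 code unit: array indexing works

    b = "Hello".encode("utf-16-le")
    print(b[1:].decode("utf-16-le", errors="replace"))
    # Dropping one byte misaligns every following UTF-16 code unit.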

I know of no language for which UTF-16 is storage-efficient. For
languages using the Latin script, UTF-8 or legacy encodings are about
twice as efficient. For Cyrillic, legacy encodings are much more
efficient; I don't know how UTF-16 compares to UTF-8 here. For CJK,
UTF-16 is about 2/3 of UTF-8, but more efficient alternative encodings
exist and are in widespread use.
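
For concreteness, byte counts for a few assumed sample words (Python
sketch):

    for sample in ("hello", "привет", "こんにちは"):
        u8 = sample.encode("utf-8")
        u16 = sample.encode("utf-16-le")
        print(sample, len(u8), len(u16))
    # Latin: 5 vs 10 (UTF-8 wins); Cyrillic: 12 vs 12 (a tie, while cp1251
    # needs only 6); Japanese: 15 vs 10 (UTF-16 wins).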

If you know of any advantage of UTF-16, then please enlighten me.

Thanks

Michal
Owen Shepherd
2010-06-26 21:47:55 UTC
Post by Michal Suchanek
Indeed, in the case of web pages the loss is at the end; parts missing
in the middle are the result of different streams being inserted, so
SCSU would not suffer more breakage there than other encodings. Still,
there is no apparent benefit in using it.
For storing many short strings, whether compiled into one bundle or
not, SCSU is ideal.
Post by Michal Suchanek
Post by Owen Shepherd
And HTML is also a file format with the equivalent of shifts; it just
calls them tags.
However, most HTML parsers are quite capable of parsing incomplete
HTML, because the tags don't change the meaning of the text except when
it is part of a tag attribute.
]]> begs to differ. But, again, we rarely experience this issue with
the omnipresent binary formats.
Post by Michal Suchanek
UTF-8 is endianness-independent and null-free; UTF-16 is neither. In
transport, losing a byte (or a packet with an unknown, possibly odd,
number of bytes) corrupts at most one character of UTF-8, but it may
misalign the whole stream of UTF-16.
I said UTF-16 /in memory/, not for transport. A whole different kettle of fish.
Post by Michal Suchanek
UTF-32 is dword-aligned; you can index into it as an array, and every
position is a codepoint. UTF-16 has surrogate pairs, so you have to
decode the whole string to get at the codepoints.
You rarely need to index into it at codepoint intervals. For most
things, pointers are sufficient.

And you should note that "dword" is a rather vague term; I presume you
are referring to the x86's 32-bit double word (which is not even used
consistently in x86 documentation: the i386 SysV ABI used by all the
Unix-likes takes a word to be 32 bits).

(I could also mention that every index in a UTF-16 string is also
technically a codepoint, but let's not get into a battle of semantics;
the correct term for what you are referring to is a scalar value.)
Post by Michal Suchanek
I know of no language for which UTF-16 is storage-efficient. For
languages using the Latin script, UTF-8 or legacy encodings are about
twice as efficient. For Cyrillic, legacy encodings are much more
efficient; I don't know how UTF-16 compares to UTF-8 here. For CJK,
UTF-16 is about 2/3 of UTF-8, but more efficient alternative encodings
exist and are in widespread use.
Said more efficient alternative encodings are not Unicode and should
not be considered serializations of it. An endemic problem with
using them as such is that some have mapped characters over the ASCII
common set, a prime example being that Shift-JIS replaced the
backslash with a yen sign. Those legacy encodings also often require
complex string-search logic (Shift-JIS again being a prime example).

For Chinese, the recommended backwards-compatible encoding is GB 18030.
This is a good effort but flawed (decoding it is an absolute
nightmare), and it should be converted to a more usable format
(e.g. UTF-16) for in-memory use.
Post by Michal Suchanek
If you know any advantage of UTF-16 then please enlighten me.
UTF-16 is very efficient to work with. It's for this reason that many
languages which adopted Unicode after the expansion of the coding space
still picked it (Python, for one). It is an effective trade-off of
space and speed.
Michal Suchanek
2010-06-27 09:08:09 UTC
Post by Owen Shepherd
You rarely need to index into it at codepoint intervals. For most
things, pointers are sufficient.
(I could also mention that every index in a UTF-16 string is also
technically a codepoint, but let's not get into a battle of semantics;
the correct term for what you are referring to is a scalar value.)
Well, it isn't, any more than it is in UTF-8. Some braindead runtimes
expected that a (16-bit) word in UTF-16 actually is a codepoint and
had to be fixed later, and there are issues with legacy code which
still expects the old behaviour.

It may be that on many CPUs it is more time-efficient if the branch
which reads more than once from the string to get the codepoint is
rarely executed, but then the ultimate efficiency is achieved with
UTF-32, which is 32-bit aligned, by far the fastest on most CPUs, and
needs no branching at all at this level. Also, for short strings,
reduced code complexity outweighs any savings from compressing strings
with a more space-efficient encoding.
Post by Owen Shepherd
Said more efficient alternative encodings are not Unicode and should
not be considered serializations of it. An endemic problem with
using them as such is that some have mapped characters over the ASCII
common set, a prime example being that Shift-JIS replaced the
backslash with a yen sign. Those legacy encodings also often require
complex string-search logic (Shift-JIS again being a prime example).
For Chinese, the recommended backwards-compatible encoding is GB 18030.
This is a good effort but flawed (decoding it is an absolute
nightmare), and it should be converted to a more usable format
(e.g. UTF-16) for in-memory use.
These are not Unicode; they were developed for Japanese and Chinese,
respectively.

And I don't see why you promote SCSU, which maps just about anything
over ASCII and has shifts, yet bash Shift-JIS for mapping the yen sign
over the backslash and having shifts. There is an issue with legacy
software using yen instead of backslash, but that is not an issue with
the encoding itself; it's an issue with how it is abused. It's still a
valid reason to avoid the encoding, though, since it makes the
semantics of these incorrectly used codes ambiguous.

Now, SCSU might be easier to decode, which makes the required code
smaller in comparison to other encodings, but SCSU is not widely
supported and would require constant recoding to and from the system
encoding and the web encoding, which inflates the required code again.
Post by Owen Shepherd
Post by Michal Suchanek
If you know of any advantage of UTF-16, then please enlighten me.
UTF-16 is very efficient to work with. It's for this reason that many
languages which adopted Unicode after the expansion of the coding space
still picked it (Python, for one). It is an effective trade-off of
space and speed.
I don't see the efficiency. It's a middle-of-the-road approach which
does not really work well for any case, and people hope it doesn't turn
out too badly. Now, with fast processors and cheap RAM, optimization is
a low priority, and pretty much anything goes as long as it works
correctly. The lack of a need for efficiency does not make UTF-16
efficient, though.

But if you strive for efficiency, then I don't see any need for an
"internal encoding" in fossil. Fossil processes strings only once:
when you commit a changeset, it cuts an excerpt of the commit message
and stores it for use in the timeline and such.

Thanks

Michal

Andrey Cherezov
2010-06-25 19:07:19 UTC
UTF-8 in the database, any other encoding at the developer's console. SVN and Git have this feature (variable encodings for commits and for console output); these are per-developer variables, not related to the p2p sync encoding.
----- Original Message -----
From: Michael Richter
To: fossil-***@lists.fossil-scm.org
Sent: Friday, June 25, 2010 6:00 PM

Fossil is a distributed SCM system. Potentially the distributed database in question could be spread around the world. Do you really want the nightmare (and impossibility!) of trying to keep track of which project is in which encoding scheme on which machine?
Owen Shepherd
2010-06-25 15:53:25 UTC
Post by Sergey Sfeli
Post by Ruslan Popov
I've tried to use Fossil on the Russian version of Windows 7. I made a
commit with Russian text in the comment; when I ran the UI and looked at
the timeline, the Russian text appeared as squares.
Why not just use a text editor that supports UTF-8 and write your
comments in UTF-8 instead of cp1251? You can set or change the default
text editor with "fossil set editor anything-else-than-notepad" (I am
using Notepad2, for example).
Question to Richard Hipp: can you please add any UTF-8 character(s) to
the following text to help text editors auto-detect the right encoding?
# Enter comments on this check-in.  Lines beginning with # are ignored.
# The check-in comment follows wiki formatting rules.
The correct character to add would be U+FEFF (UTF-8 "\xEF\xBB\xBF"),
ZERO WIDTH NO-BREAK SPACE, commonly called the byte order mark, which
may be familiar from showing up as "" in badly encoded documents.
For sensible programs, this should be a dead giveaway to switch into
UTF-8 mode, chief among these programs being Notepad. For programs
which assume UTF-8 but don't detect based on BOMs, it should be an
invisible character.
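
In Python terms, a small sketch of both behaviours (the template line is
taken from earlier in the thread):

    bom = "\ufeff".encode("utf-8")
    print(bom)                                  # b'\xef\xbb\xbf'
    data = bom + "# Enter comments on this check-in.".encode("utf-8")
    print(data.decode("utf-8-sig"))             # 'utf-8-sig' strips the BOM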