Discussion:
Fossil interprets plain-text file as a binary file
Byron Sanchez
2017-03-28 00:44:13 UTC
I'm tracking several plain-text files in a repository. These are emacs
org-mode files.

Fossil sees most of the files in this repo as normal plain-text files and
as such, they can be diffed via the fossil web interface.

Recently, however, fossil has started interpreting one of these org-mode
files as a binary file. Now, fossil prompts with its binary-file warning
each time I update the file. In addition, this file can no longer be diffed
in the web interface, since fossil believes it to be a binary file.

I'm wondering what steps I should take to debug this, or if there are any
common causes for this sort of thing? Very long lines perhaps or possibly
unicode characters?

The file in question is about 3.3 megabytes in size, and as far as I am
aware, a normal plain-text org-mode file.

Any ideas would be very appreciated!

Thanks,

Byron Sanchez
Scott Robison
2017-03-28 00:57:03 UTC
On Mar 27, 2017 6:44 PM, "Byron Sanchez" <***@gmail.com> wrote:

Recently, however, fossil has started interpreting one of these org-mode
files as a binary file. Now, fossil prompts with its binary-file warning
each time I update the file. In addition, this file can no longer be diffed
in the web interface, since fossil believes it to be a binary file.

I'm wondering what steps I should take to debug this, or if there are any
common causes for this sort of thing? Very long lines perhaps or possibly
unicode characters?


Long lines, invalid unicode sequences, or many control codes.

What type of data is it? Source code, poetry?
Ross Berteig
2017-03-28 01:09:39 UTC
Post by Byron Sanchez
I'm tracking several plain-text files in a repository. These are emacs
org-mode files.
Fossil sees most of the files in this repo as normal plain-text files
and as such, they can be diffed via the fossil web interface.
Recently, however, fossil has started interpreting one of these
org-mode files as a binary file. Now, fossil prompts with its
binary-file warning each time I update the file. In addition, this
file can no longer be diffed in the web interface, since fossil
believes it to be a binary file.
I'm wondering what steps I should take to debug this, or if there are
any common causes for this sort of thing? Very long lines perhaps or
possibly unicode characters?
Try the command "fossil test-looks-like-utf" to see the conditions that
fossil tests for your file. That should help narrow down what to look
for in the file that caused it to suddenly smell binary. It usually
decides a file is binary if it has a line that is "too long", or has a
NUL byte and is not UTF-16.

I believe that a line is too long if it is more than about 8191 ASCII
characters, a restriction based on the size of the buffer used in the
diff engine.

The other thing that can happen is to accidentally save a text file in
an encoding other than UTF-8, with some character not included in the
base 7-bit ASCII set. In my experience this was usually some accented
letter from LATIN1, or a symbol such as 'µ' or '°'. Your editor will
likely calmly edit and save the file, everything looks fine, but the
saved file has bytes that make an invalid UTF-8 sequence. That does have
a different warning message than binary data (likely "invalid UTF-8") so
isn't your problem with this file.
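
The checks described above can be approximated outside of Fossil. The following is a minimal sketch in Python, not Fossil's own detection logic: the function names are made up for illustration, and the 8191-byte limit is simply the "about 8191" figure quoted above, taken as an assumption.

```python
from pathlib import Path

# Assumption: 8191 bytes is the longest line still treated as text,
# based on the "about 8191" figure mentioned in this thread.
MAX_LINE_BYTES = 8191


def scan_bytes(data: bytes) -> dict:
    """Report NUL bytes, the longest line, and strict UTF-8 validity."""
    longest = max((len(line) for line in data.split(b"\n")), default=0)
    try:
        data.decode("utf-8")
        valid_utf8 = True
    except UnicodeDecodeError:
        valid_utf8 = False
    return {
        "has_nul": b"\x00" in data,
        "longest_line": longest,
        "has_long_line": longest > MAX_LINE_BYTES,
        "valid_utf8": valid_utf8,
    }


def scan_file(path: str) -> dict:
    """Convenience wrapper: read a file as raw bytes and scan it."""
    return scan_bytes(Path(path).read_bytes())
```

Calling scan_file() on the suspect file (the path is whatever your file is called) gives a quick second opinion alongside "fossil test-looks-like-utf". Note that a NUL byte is valid UTF-8 as far as Python's decoder is concerned, so "has_nul" and "valid_utf8" are independent results.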
Post by Byron Sanchez
The file in question is about 3.3 megabytes in size, and as far as I
am aware, a normal plain-text org-mode file.
Any ideas would be very appreciated!
--
Ross Berteig ***@CheshireEng.com
Cheshire Engineering Corp. http://www.CheshireEng.com/
+1 626 303 1602
Richard Hipp
2017-03-28 01:21:38 UTC
Post by Ross Berteig
I believe that a line is too long if it is more than about 8191 ASCII
characters, a restriction based on the size of the buffer used in the
diff engine.
Technically, that restriction is due to the way hashes are computed on
individual lines during the diff. For diffing, the file is broken up
into individual lines, and every line is given a 32-bit hash that
helps to speed up locating the differences. The lower 13 bits of the
hash are the length of the line in bytes. The upper 19 bits are the
actual hash.
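
To see why a 13-bit length field caps lines at 8191 bytes, here is a small sketch of that kind of packing in Python. It only illustrates the bit layout described above; the mixing step is a placeholder and is not Fossil's hash function, and all names are invented for the example.

```python
LENGTH_BITS = 13                       # low bits hold the line length
LENGTH_MASK = (1 << LENGTH_BITS) - 1   # 8191, the largest storable length
HASH_BITS = 32 - LENGTH_BITS           # 19 bits remain for the hash itself


def toy_line_hash(line: bytes) -> int:
    """Pack a 19-bit toy hash and a 13-bit length into one 32-bit value."""
    if len(line) > LENGTH_MASK:
        raise ValueError("line too long to fit in the length field")
    h = 0
    for byte in line:
        # Placeholder mixing step, NOT the hash Fossil actually uses.
        h = (h * 31 + byte) & ((1 << HASH_BITS) - 1)
    return (h << LENGTH_BITS) | len(line)


def line_length(packed: int) -> int:
    """Recover the line length from the low 13 bits."""
    return packed & LENGTH_MASK
```

A line of 8192 bytes or more has no representable length in this layout, which is the sense in which such a line is "too long".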
--
D. Richard Hipp
***@sqlite.org
Ross Berteig
2017-03-28 20:54:34 UTC
Post by Richard Hipp
Post by Ross Berteig
I believe that a line is too long if it is more than about 8191 ASCII
characters, a restriction based on the size of the buffer used in the
diff engine.
Technically, that restriction is due to the way hashes are computed on
individual lines during the diff. For diffing, the file is broken up
into individual lines, and every line is given a 32-bit hash that
helps to speed up locating the differences. The lower 13 bits of the
hash are the length of the line in bytes. The upper 19 bits are the
actual hash.
Interesting. I didn't read further into the code than the definition of
LENGTH_MASK and the comment that describes it in diff.c. I did wonder
slightly at the name of that symbol, but it was described as the length
of a line so I just ran with it. In lookslike.c we have
UTF16_LENGTH_MASK which is described by the comment as being the same
quantity expressed for UTF16 chars.

But the comment and definition don't seem to agree. Richard, take a look at
https://www.fossil-scm.org/index.html/artifact?name=3ac38fafa91d274c&ln=220-226
Line 225 would compute the width of UTF16_LENGTH_MASK as 13-2-1, or 10
bits, which gives 1023 for UTF16_LENGTH_MASK. But the comment says 4096....

Either the code, the comment, or I am confused here. Since I'm poking
at test cases for this stuff, I'll see if I can add one that probes the
UTF16 line length question.
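
For reference, the arithmetic behind those numbers, as a small Python sketch. It only evaluates (1 << bits) - 1 for a few widths; it is not a reading of lookslike.c, and the 12-bit row is included purely because its mask is the value adjacent to the 4096 in the comment.

```python
def mask_for_bits(bits: int) -> int:
    """Largest value that fits in an unsigned field of the given width."""
    return (1 << bits) - 1


for bits in (13, 13 - 2 - 1, 12):
    print(bits, mask_for_bits(bits))
# 13 bits -> 8191, 10 bits -> 1023, 12 bits -> 4095
```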
--
Ross Berteig ***@CheshireEng.com
Cheshire Engineering Corp. http://www.CheshireEng.com/
+1 626 303 1602
Byron Sanchez
2017-03-28 01:40:04 UTC
That was it!

I ran the command and received the output:

Starts with UTF-8 BOM: no
Starts with UTF-16 BOM: no
Looks like UTF-8: no
Has flag LOOK_NUL: yes
Has flag LOOK_CR: no
Has flag LOOK_LONE_CR: no
Has flag LOOK_LF: yes
Has flag LOOK_LONE_LF: yes
Has flag LOOK_CRLF: no
Has flag LOOK_LONG: no
Has flag LOOK_INVALID: no
Has flag LOOK_ODD: no
Has flag LOOK_SHORT: no

I deleted the null characters. I didn't have to address any of the other
flags in my case, just the null characters. After that, fossil recognized
the file as plain text again.

Thank you for the help!
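
For anyone hitting the same thing: finding the stray NUL bytes in a multi-megabyte file by eye is tedious. Below is a minimal Python sketch for locating and removing them. The helper names are invented for this example, and it assumes the NULs are unwanted, as they were here.

```python
def nul_positions(data: bytes) -> list:
    """Return (line_number, column) pairs, both 1-based, for each NUL byte."""
    hits = []
    for lineno, line in enumerate(data.split(b"\n"), start=1):
        for col, byte in enumerate(line, start=1):
            if byte == 0:
                hits.append((lineno, col))
    return hits


def strip_nuls(data: bytes) -> bytes:
    """Return a copy of the data with every NUL byte removed."""
    return data.replace(b"\x00", b"")
```

Read the file in binary mode (open(path, "rb")), inspect nul_positions() first, and only write back the result of strip_nuls() once you are sure nothing meaningful is being dropped.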
Ross Berteig
2017-03-28 20:41:03 UTC
Post by Byron Sanchez
That was it!
Starts with UTF-8 BOM: no
Starts with UTF-16 BOM: no
Looks like UTF-8: no
Has flag LOOK_NUL: yes
Has flag LOOK_CR: no
Has flag LOOK_LONE_CR: no
Has flag LOOK_LF: yes
Has flag LOOK_LONE_LF: yes
Has flag LOOK_CRLF: no
Has flag LOOK_LONG: no
Has flag LOOK_INVALID: no
Has flag LOOK_ODD: no
Has flag LOOK_SHORT: no
I deleted the null characters. I didn't have to address any of the
other flags in my case, just the null characters. After that, fossil
recognized the file as plain text again.
Unexpected NUL characters in a field of normal text will definitely
cause fossil to treat a file as binary.

If you really do need to store a NUL byte in a text file (as some sort
of delimiter, perhaps), fossil permits the over-long two-byte UTF-8
encoding 0xC0 0x80, even though that is a technical violation of the
UTF-8 specification. Allowing that particular over-long encoding is a
common extension of UTF-8.
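
As an illustration of that over-long encoding, here is a short Python sketch that swaps raw NUL bytes for 0xC0 0x80 and back. The function names are made up for the example, and it assumes the input is otherwise valid UTF-8 (where the byte 0xC0 never occurs, so the reverse mapping is unambiguous).

```python
OVERLONG_NUL = b"\xc0\x80"  # over-long two-byte encoding of U+0000


def hide_nuls(data: bytes) -> bytes:
    """Replace each raw NUL byte with the over-long form."""
    return data.replace(b"\x00", OVERLONG_NUL)


def restore_nuls(data: bytes) -> bytes:
    """Turn the over-long form back into raw NUL bytes."""
    return data.replace(OVERLONG_NUL, b"\x00")


def strict_utf8_ok(data: bytes) -> bool:
    """True if Python's strict UTF-8 decoder accepts the bytes."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False
```

One trade-off to keep in mind: Python's strict decoder rejects the over-long form, so strict_utf8_ok(hide_nuls(...)) is False. Other strictly validating tools may behave the same way, even if fossil accepts the file.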

The other flags just indicate that you have normal *nix line endings
rather than the CR LF endings used by DOS and Windows (and many, many
older systems) or the CR-only endings used by older Macs.
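
To check which line-ending style a file actually uses, something like the following Python sketch works. It is an independent illustration, not how fossil computes its LOOK_* flags, and the function name and result keys are invented here.

```python
import re


def count_line_endings(data: bytes) -> dict:
    """Count CR LF pairs, CR not followed by LF, and LF not preceded by CR."""
    return {
        "crlf": len(re.findall(rb"\r\n", data)),
        "lone_cr": len(re.findall(rb"\r(?!\n)", data)),
        "lone_lf": len(re.findall(rb"(?<!\r)\n", data)),
    }
```

A file with only "lone_lf" counts has plain *nix endings; only "crlf" counts means DOS/Windows endings; a mix of nonzero counts suggests the file has been edited on more than one platform.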
--
Ross Berteig ***@CheshireEng.com
Cheshire Engineering Corp. http://www.CheshireEng.com/
+1 626 303 1602