Discussion:
Memorable names for ticket IDs
(too old to reply)
David Given
2013-09-04 21:07:32 UTC
Permalink
Based on various conversations in the past I have been working on a
semi-interesting features: using mnemonic encoding to turn ticket hashes
into something that humans can remember.

This uses a built-in dictionary to convert a hex hash (e.g. '74a95e62')
into a series of words ('TOAST MOZART TULIP'). There are actually
several standard algorithms; the one I like best is mnemonic encoding,
and there's a sample implementation here:

https://github.com/singpolyma/mnemonicode

A four-byte hash can be stored as three words, which is just the right
size to fit in the human brain. However, the algorithm copes fine with
longer or shorter hashes (you get more or fewer words).

I've done a basic implementation of this, which is now in the
dg-codenames branch; the algorithm itself and its dictionary is in
src/mnemonic.c.

You can now do:

./fossil ticket history 'TOAST MOZART TULIP'

...to see an individual ticket.

If you want to list the memorable names in a ticket report, add this
line to the report's query statement:

mnemonic_encode(substr(tkt_uuid,1,8)) AS 'Id',

To play with the algorithm, do:

echo "select mnemonic_encode('12345678');" | ./fossil sqlite3
or echo "select mnemonic_decode('CRASH CHAPTER CASINO');" | ./fossil
sqlite3

...to encode and decode.

Right now the only thing I've changed to accept mnemonic encodings as
well as hashes is the "fossil ticket" command; I haven't touched the web
UI as it's rather complicated. So right now the only way to get the
mnemonic for a bug is to change the ticket report as described above or
to manually get the encoding. Here are a few from the Fossil ticket
database to get you started. This lets you see what sort of codename is
produced:

74a95e62cf TOAST MOZART TULIP
Visual indicator for which code check in being viewed
0c657fd35f ARNOLD COMMON BORIS
Allow aliases on commands
93c266d3ee SUMMER SIMPLE MILK
Difficulty to correctly add/list files/directories with accended
characters in name/path and see them listed in "Files" menu.
263b45306c PILGRIM PUMP ELITE
Add configuration to adjust length of version UUID
04a259be40 PEACE SHERIFF AMBER
operations upon all files
7636b10ddf CITIZEN ELASTIC VAMPIRE
option to abort checkin if comment is empty
2a34de01fc SOCIETY FELIX FASHION
"wiki unlist" as a possible solution to deleting wiki pages.

I'd appreciate any comments --- including any suggestions as to whether
this is actually a useful feature. I, personally, find these things
vastly easier to handle than 8-character hex strings. However I hear
rumours that other peoples' opinions are valid too...
--
┌───  ───── http://www.cowlark.com ─────
│
│ "Ripley's Law: Never go further for the cat than the cat would go for
│ you." --- Vexxarr Bleen (trans. Hunter Cressall)
Stephan Beal
2013-09-04 21:40:22 UTC
Permalink
Post by David Given
I'd appreciate any comments --- including any suggestions as to whether
this is actually a useful feature. I, personally, find these things
vastly easier to handle than 8-character hex strings. However I hear
rumours that other peoples' opinions are valid too...
As before, i find it really cool, if even if only for its conversational
value. i can actually imagine that i'd use it (but won't really know that
until trying it out). i will certainly be forking this into libfossil in
some form or other. You've already done the sqlite3 binding, leaving me
with little left to port :). If this can be "flavored" by clients (custom
word lists), i'd probably spend way to much time playing with it. libfossil
allows clients to plug in "crosslink handlers" which will, in theory, allow
them to do filtering like this for event entries (though whether that will
work in practice is as yet undetermined).
--
----- stephan beal
http://wanderinghorse.net/home/stephan/
http://gplus.to/sgbeal
Ross Berteig
2013-09-04 21:54:17 UTC
Permalink
Post by David Given
I'd appreciate any comments --- including any suggestions as to whether
this is actually a useful feature. I, personally, find these things
vastly easier to handle than 8-character hex strings. However I hear
rumours that other peoples' opinions are valid too...
As before, i find it really cool, if even if only for its conversational
value. i can actually imagine that i'd use it (but won't really know
that until trying it out). i will certainly be forking this into
libfossil in some form or other. You've already done the sqlite3
binding, leaving me with little left to port :). If this can be
"flavored" by clients (custom word lists), i'd probably spend way to
much time playing with it. libfossil allows clients to plug in
"crosslink handlers" which will, in theory, allow them to do filtering
like this for event entries (though whether that will work in practice
is as yet undetermined).
While thinking in the blue sky, it would also be neat if it did the
prefix-match trick that fossil does now for hex ids. That is, if
[TOAST], [TOAST MOZART], and [TOAST MOZART TULIP] all matched
[74a95e62cf]; and naturally supporting even longer phrases to
disambiguate past the first eight digits. Given the mapping of three
words to 32 bits, however, there is some subtlety to address there so it
may not be as trivial as just decoding what you got and shifting.

I'd also make sure decoding is case insensitive and allow the hyphenated
forms to make it easier to type both at the command line and to present
in an URL.
--
Ross Berteig ***@CheshireEng.com
Cheshire Engineering Corp. http://www.CheshireEng.com/
David Given
2013-09-05 10:27:40 UTC
Permalink
[...]
Post by Ross Berteig
Post by Stephan Beal
As before, i find it really cool, if even if only for its conversational
value. i can actually imagine that i'd use it (but won't really know
that until trying it out). i will certainly be forking this into
libfossil in some form or other. You've already done the sqlite3
binding, leaving me with little left to port :)
It was impressively straightforward --- the hardest part was figuring
out how the db init stuff works (which I have rationalised a bit, see
earlier bug report). Sqlite's nice and easy to bind to.
Post by Ross Berteig
Post by Stephan Beal
If this can be
"flavored" by clients (custom word lists), i'd probably spend way to
much time playing with it.
Theoretically yes, but in practice it's liable to be way too hard ---
the dictionary is huge, 1626 words (plus seven special words for
overflow), and there's a fair amount of subtlety to the design. For
example, no word is a prefix of another word; all words are unique when
truncated to five letters; all words sound reasonably different...
there's lots of information here:

http://web.archive.org/web/20090918202746/http://tothink.com/mnemonic/wordlist.html

Ideally the dictionary should go into the repository somewhere, which
should make it trivially modifiable, but I don't know how to do that. I
assume that adding 1673 named configuration variables is antisocial?

[...]
Post by Ross Berteig
While thinking in the blue sky, it would also be neat if it did the
prefix-match trick that fossil does now for hex ids. That is, if
[TOAST], [TOAST MOZART], and [TOAST MOZART TULIP] all matched
[74a95e62cf]; and naturally supporting even longer phrases to
disambiguate past the first eight digits. Given the mapping of three
words to 32 bits, however, there is some subtlety to address there so it
may not be as trivial as just decoding what you got and shifting.
Each word represents 10.7 bits. So one word encodes one byte; two words
encode two bytes; three words encode *either* three or four bytes
depending on the last word. (There are seven special words used to
encode the leftover three bits when encoding three bytes.) So
theoretically it's possible... however, playing with the encoder I see that:

12345678 = CRASH CHAPTER CASINO
123456 = ROVER TRIBAL EGO
1234 = JAMES ACTIVE
12 = ALCOHOL

This looks suspiciously like I'm encoding stuff in little-endian order
rather than big-endian order. I'll have a look at that.

Incidentally, the full ticket hash encodes to TOAST MOZART TULIP TOTAL
PERFECT CLONE MORPH SIZE BETTY ATLAS ARENA TRADE PLUME REMARK CORNER.
Which is still a mouthful, but I bet I can read that over the phone
faster than you can read the hex!

...

Out of interest, when do people using Fossil actually use hashes to
refer to things? It's fairly common to use the hash in git to refer to a
specific checkin, for example, but I've never found an urge to do that
in Fossil.
--
┌───  ───── http://www.cowlark.com ─────
│ "USER'S MANUAL VERSION 1.0: The information presented in this
│ publication has been carefully for reliability." --- anonymous
│ computer hardware manual
Stephan Beal
2013-09-05 10:58:09 UTC
Permalink
Post by David Given
Out of interest, when do people using Fossil actually use hashes to
refer to things? It's fairly common to use the hash in git to refer to a
specific checkin, for example, but I've never found an urge to do that
in Fossil.
i think we end up using them in email (pasted links) more than anything
else. i use them from the CLI occasionally, but probably 90%+ of such uses
has something to do with the dev of fossil or libfossil, where i'm working
with raw artifact data, so they "don't really count." In other source trees
i rarely need the UUIDs... i'm trying to think of an instance and no recent
ones comes to mind.
--
----- stephan beal
http://wanderinghorse.net/home/stephan/
http://gplus.to/sgbeal
Richard Hipp
2013-09-05 13:55:04 UTC
Permalink
Post by David Given
12345678 = CRASH CHAPTER CASINO
123456 = ROVER TRIBAL EGO
1234 = JAMES ACTIVE
12 = ALCOHOL
I think the above doesn't really work for an application like Fossil. I
think that a prefix of the SHA1 hash should encode to a prefix of the
mnemonic, and the other way arround too - a prefix of the mnemonic should
decode back to a prefix of the original hash. To do this, some changes
need to be made to the encoding mechanism. (1) You have to abandon the
seven 3-letter-words that are used for 24-bit values, since if the 24-bit
value is a prefix of a longer value, those 3-letter-words will not be used
when encoding the longer values. (2) The round-trip from hex to mnemonic
back to hex might truncate half-bytes off the end of the hex value.

The hex hash is encoded into mnemonic words in 4-byte or 8-character
chunks.

HHH -> WORD -> HH
HHHHHH -> WORD-WORD -> HHHHH
HHHHHHHH -> WORD-WORD-WORD -> HHHHHHHH

It takes a minimum of 3 hex character (12 bits) to encode a single mnemonic
word if the prefix property is to be preserved. But a single mnemonic will
only convert back to 2 hex characters. Similarly, 6 hex characters are
needed to generate two mnemonic words, but 2 mnemonic words will only give
5 hex characters of output. 8 hex characters converts to 3 mnemonic words
and then back to 8 hex characters, without truncation.

The encoding mechanism currently used never truncates. If you encode N hex
characters (N must be even) then you will get back N hex characters when
decoding. However, a prefix of the mnemonic encoding does not yield a
prefix of the original hex hash. I think that for this application, prefix
preservation is more important that avoidance of encode/decode truncation.
--
D. Richard Hipp
***@sqlite.org
David Given
2013-09-05 14:21:35 UTC
Permalink
Richard Hipp wrote:
[...]
Post by Richard Hipp
I think the above doesn't really work for an application like Fossil. I
think that a prefix of the SHA1 hash should encode to a prefix of the
mnemonic, and the other way arround too - a prefix of the mnemonic
should decode back to a prefix of the original hash.
[...]
Post by Richard Hipp
The hex hash is encoded into mnemonic words in 4-byte or 8-character
chunks.
HHH -> WORD -> HH
HHHHHH -> WORD-WORD -> HHHHH
HHHHHHHH -> WORD-WORD-WORD -> HHHHHHHH
Yes, that makes a lot of sense for Fossil's particular use case (which
the mnemonic encoding algorithm wasn't designed for, of course).

I'll have a go tonight and see what happens --- the encoder/decoder is
wrong and needs to be rewritten anyway.
--
┌───  ───── http://www.cowlark.com ─────
│ "USER'S MANUAL VERSION 1.0: The information presented in this
│ publication has been carefully for reliability." --- anonymous
│ computer hardware manual
Matt Welland
2013-09-05 15:52:52 UTC
Permalink
Can tickets be tagged? A quick browse of the tickets page and I conclude
the answer is probably no. If tickets could be tagged similar to the
timeline then tags could be used to give tickets a sync friendly short name
which could be used in the ticket listings and when discussing the tickets.

The mnemonic method is a good idea but because the names are long I'm not
sure I'd use them much.
Post by David Given
[...]
Post by Richard Hipp
I think the above doesn't really work for an application like Fossil. I
think that a prefix of the SHA1 hash should encode to a prefix of the
mnemonic, and the other way arround too - a prefix of the mnemonic
should decode back to a prefix of the original hash.
[...]
Post by Richard Hipp
The hex hash is encoded into mnemonic words in 4-byte or 8-character
chunks.
HHH -> WORD -> HH
HHHHHH -> WORD-WORD -> HHHHH
HHHHHHHH -> WORD-WORD-WORD -> HHHHHHHH
Yes, that makes a lot of sense for Fossil's particular use case (which
the mnemonic encoding algorithm wasn't designed for, of course).
I'll have a go tonight and see what happens --- the encoder/decoder is
wrong and needs to be rewritten anyway.
--
┌───  ───── http://www.cowlark.com ─────
│ "USER'S MANUAL VERSION 1.0: The information presented in this
│ publication has been carefully for reliability." --- anonymous
│ computer hardware manual
_______________________________________________
fossil-users mailing list
http://lists.fossil-scm.org:8080/cgi-bin/mailman/listinfo/fossil-users
--
Matt
-=-
90% of the nations wealth is held by 2% of the people. Bummer to be in the
majority...
Stephan Beal
2013-09-05 17:56:56 UTC
Permalink
Post by Matt Welland
Can tickets be tagged? A quick browse of the tickets page and I conclude
the answer is probably no. If tickets could be tagged similar to the
timeline then tags could be used to give tickets a sync friendly short name
which could be used in the ticket listings and when discussing the tickets.
The short answer is yes, tickets can be tagged. A tag can be applied to ANY
blob, and a ticket is a blob. The longer answer is that fossil doesn't
internally do much with tags except for the ones we already know and love
(branches, bgcolor, checkin comments, etc.). One of the things i'll be
exploring with libfossil is using tags to implement comment threads,
primarily for code reviews/comments. i am currently of the opinion that
that can be done within fossil's current architecture (but would need some
new code), but haven't gotten to the point of trying it out yet.

The mnemonic method is a good idea but because the names are long I'm not
Post by Matt Welland
sure I'd use them much.
Agreed, but i'd try it out just because i haven't seen it done in an SCM
before, and i find it quite a novel idea. It would make for livelier
mailing list discussions:

me: see checkout BRAZEN MONKEY WHO RIDES WOLVES
you: what did you call me?!?

Or maybe not.
--
----- stephan beal
http://wanderinghorse.net/home/stephan/
http://gplus.to/sgbeal
Ron Wilson
2013-09-05 21:30:50 UTC
Permalink
Post by Matt Welland
Can tickets be tagged? A quick browse of the tickets page and I conclude
the answer is probably no. If tickets could be tagged similar to the
timeline then tags could be used to give tickets a sync friendly short name
which could be used in the ticket listings and when discussing the tickets.
As Stephan mentioned, yes they can be, but Fossil ticket handling doesn't
do anything with tags.

You can define additional fields for tickets, so you could add a "short
name" field to tickets already. You can also define ticket reports that can
select and/or sort tickets by this field.
David Given
2013-09-05 21:41:59 UTC
Permalink
[...]
Post by David Given
Post by Richard Hipp
HHH -> WORD -> HH
HHHHHH -> WORD-WORD -> HHHHH
HHHHHHHH -> WORD-WORD-WORD -> HHHHHHHH
[...]
Post by David Given
I'll have a go tonight and see what happens --- the encoder/decoder is
wrong and needs to be rewritten anyway.
After having attempted it, I think this is actually going to be really
hard, or at any rate beyond my math skills at 2200 at night.

The way the encoding works is that it divides the number up into three
base 1626 digits, and those digits are indices into the dictionary. This
means that each digit represents 10 2/3 bits of the hash.

However, truncating the hash by dropping hex digits involves truncating
it to a whole number of bits wide. This is equivalent to taking the
modulus of a power of 2. Unfortunately because this isn't a multiple of
1626, it's going to perturb all the base 1626 digits, and so change the
encoding.

A slightly more comprehensible example using bases 9 and 10 demonstrates
the problem:

123456789_10 % 10000000_9 = 123456789 % 4782969 = 3882564

If anyone can prove me wrong, please do...

I think, without a mathematical proof, that maintaining the ability to
take prefixes of an encoded name will require us to use a dictionary
that fits into a precise number of bits. Truncating the dictionary to
2^10 entries would be the simplest approach, but this means that our
three words no longer encode exactly 32 bits --- we only get 30 bits,
which is 7 hex characters. Four words gets us 40 bits, which is 10 hex
characters. We don't get anything in between.

I don't think seven hex characters is unique enough, based on the stats
other people have figured out. Could someone do the collision check on
the NetBSD repository, if they have a copy? (It's not worth checking
out, as it's kind of huge.)

(The alternative is a 2^11 word dictionary, which means coming up with
another 422 suitable words from somewhere, which I don't think is
feasible. The author claims the current dictionary took 300 hours to
compile.)
--
┌───  ───── http://www.cowlark.com ─────
│
│ "Ripley's Law: Never go further for the cat than the cat would go for
│ you." --- Vexxarr Bleen (trans. Hunter Cressall)
Stephan Beal
2013-09-05 21:58:06 UTC
Permalink
Post by David Given
I don't think seven hex characters is unique enough, based on the stats
other people have figured out. Could someone do the collision check on
the NetBSD repository, if they have a copy? (It's not worth checking
out, as it's kind of huge.)
i don't have netbsd, but tcl says 9 is the magic number:

sqlite> create temp table xxx as select count(*) n, substr(uuid,1,10) u
from blob group by u;
sqlite> select count(*) from xxx where n>1;
0
sqlite> drop table xxx; create temp table xxx as select count(*) n,
substr(uuid,1,8) u from blob group by u; select count(*) from xxx where n>1;
2
sqlite> drop table xxx; create temp table xxx as select count(*) n,
substr(uuid,1,9) u from blob group by u; select count(*) from xxx where n>1;
0

sqlite> select count(*) from xxx;
131732
--
----- stephan beal
http://wanderinghorse.net/home/stephan/
http://gplus.to/sgbeal
Edward Berner
2013-09-06 06:42:52 UTC
Permalink
Post by David Given
I think, without a mathematical proof, that maintaining the ability to
take prefixes of an encoded name will require us to use a dictionary
that fits into a precise number of bits. Truncating the dictionary to
2^10 entries would be the simplest approach, but this means that our
three words no longer encode exactly 32 bits --- we only get 30 bits,
which is 7 hex characters. Four words gets us 40 bits, which is 10 hex
characters. We don't get anything in between.
Maybe borrow an idea from old-school telephone numbers and use a word
(or two) followed by digits? If words are 10 bits, then word-word-[five
decimal digits] could encode 36 bits. Alternately, word-word-[four hex
digits] would also give 36 bits.
--
Edward
Ron Wilson
2013-09-06 17:26:04 UTC
Permalink
Post by David Given
I think, without a mathematical proof, that maintaining the ability to
take prefixes of an encoded name will require us to use a dictionary
that fits into a precise number of bits. Truncating the dictionary to
2^10 entries would be the simplest approach, but this means that our
three words no longer encode exactly 32 bits --- we only get 30 bits,
which is 7 hex characters. Four words gets us 40 bits, which is 10 hex
characters. We don't get anything in between.
Should still be possible to resolve the "endian" issue with the existing
word generation algorithm.

However, as someone pointed out, a 3 or 4 word (or even all the words)
"memorable key" could be pre-computed in Fossil and stored in a SQLite
table, then a prefix of the memorable key could be matched the same way
Fossil matches prefixes of hash keys.
David Mason
2013-09-06 20:35:46 UTC
Permalink
Post by David Given
(The alternative is a 2^11 word dictionary, which means coming up with
another 422 suitable words from somewhere, which I don't think is
feasible. The author claims the current dictionary took 300 hours to
compile.)
There's an 7776 common word list at
http://world.std.com/~reinhold/diceware.html

That would give you 2^12, and if you could find 416 more (possibly the
difference between the list you already have and the diceware list is
416 words).you'd get 2^13 which, based on the stats being reported
would almost always work.

../Dave
David Given
2013-09-07 22:43:26 UTC
Permalink
On 05/09/13 22:41, David Given wrote:
[...]
Post by David Given
I think, without a mathematical proof, that maintaining the ability to
take prefixes of an encoded name will require us to use a dictionary
that fits into a precise number of bits. Truncating the dictionary to
2^10 entries would be the simplest approach, but this means that our
three words no longer encode exactly 32 bits --- we only get 30 bits,
which is 7 hex characters. Four words gets us 40 bits, which is 10 hex
characters. We don't get anything in between.
I've done this. It's now prefix-friendly:

123456789a -> BINARY CABARET LUNAR COBRA
12345678 -> BINARY CABARET LUNAR
12345 -> BINARY CABARET
123 -> BINARY

With the new encoding scheme our old friend TOAST MOZART TULIP now
becomes MISTER SHERIFF JAVA TOKYO, which I reckon is pretty memorable.
It also now accepts lowercase, and - and _ as delimiters for
command-line friendliness.

I still haven't given up on encoding three words into 32 bits, but it'll
require more thought.

Here is our new bug list:

74a95e62cf MISTER SHERIFF JAVA TOKYO
0c657fd35f ASPIRIN PRISM SALSA DIVIDE
93c266d3ee POSTAL APOLLO MASTER RIVER
263b45306c CLOCK NEXT HAWAII CARAVAN
04a259be40 ALCOHOL PATRON RELAX PLASTIC
7636b10ddf MODULAR EDUCATE BELGIUM MONTANA
2a34de01fc CONCEPT CHAPTER FREEDOM OCEAN

...

(Replies to miscellaneous people summarised here)

I did look at the Diceware dictionary; it's large, but I don't think
it's very good quality. Not only does it lack a lot of the features I
like in the Mnemonic Encoding dictionary --- e.g. Diceware has
CLEAN/CLEAR/CLEAT, where ME words are all unique in the first five
letters and are fairly distinct from each other anyway --- but it also
contains a lot of nonwords like HJ, HK, HL, HM, A, AA, AAA, AAA etc. It
may be worth some work with awk scripts to try and add to the ME
dictionary, though. All I need is another 422 words and then I get 11
bits per word...

Abstract names (such as random syllables) are very dense, but I don't
think they're very memorable --- they lack the flavour of the English words.

OTOH, thinking about the actual use case for these, I'm not sure there's
going to be that many situations where someone needs to remember an
entire unambiguous hash. The main use case I'm thinking of here, which
is where someone from work comes over to my desk and says 'So, about bug
7188...' and I say 'huh?'. I want memorable bug names so that *I* can
disambiguate them in my rather faulty memory. So someone saying 'So,
about bug we-choo-fa...' is much more likely to get a coherent response
from me than 7188.

OTGH I'd still rather they referred to it as MISTER SHERIFF simply for
cool value!
--
┌───  ───── http://www.cowlark.com ─────
│
│ "Ripley's Law: Never go further for the cat than the cat would go for
│ you." --- Vexxarr Bleen (trans. Hunter Cressall)
Alaric Snell-Pym
2013-09-16 08:38:07 UTC
Permalink
Post by David Given
OTOH, thinking about the actual use case for these, I'm not sure there's
going to be that many situations where someone needs to remember an
entire unambiguous hash. The main use case I'm thinking of here, which
is where someone from work comes over to my desk and says 'So, about bug
7188...' and I say 'huh?'. I want memorable bug names so that *I* can
disambiguate them in my rather faulty memory. So someone saying 'So,
about bug we-choo-fa...' is much more likely to get a coherent response
from me than 7188.
OTGH I'd still rather they referred to it as MISTER SHERIFF simply for
cool value!
I concur. I think the memorable names are great for spoken usage around
the office, and even when copied and pasted into emails, you'll be more
able to look at the name and remember "Oh, that's the bug about the core
dump when you pass -v" than if presented with a hex hash; you'll be less
likely to have to look it up.

Also, tickets for feature requests rather than bugs automatically give
things code names, reminiscent of all these NSA project names Snowden's
leaked. BULLRUN, CHEESY NAME, etc :-D

Now, how apart names for commits, too? "This bug was introduced by
commit ABSENT FRIEND TROUSERS..."

ABS
--
Alaric Snell-Pym
http://www.snell-pym.org.uk/alaric/
Ross Berteig
2013-09-05 17:43:38 UTC
Permalink
Post by David Given
12345678 = CRASH CHAPTER CASINO
123456 = ROVER TRIBAL EGO
1234 = JAMES ACTIVE
12 = ALCOHOL
I think the above doesn't really work for an application like Fossil. I
think that a prefix of the SHA1 hash should encode to a prefix of the
mnemonic, and the other way around too - a prefix of the mnemonic
should decode back to a prefix of the original hash....
I'm not convinced we ever need to generate the memorable name from any
less information than the entire hash. We just need a prefix of the
memorable name to match the entire name. I don't see a case for
producing the name ad-hoc from a hash prefix like [12345] without the
entire hash also being available. I think granting that relaxes the
requirements enough that prefixes of memorable names can work just as
well as the prefix of a hash does now.

One way to do that (again, blue sky, from someone not intimate with the
insides of fossil) would be to cache the memorable names in a table
where they can be located by SQL just as the prefix of a hash is now. In
that case, you could easily match [CRASH-CHAP] too, even though CHAP
isn't in the word list. (That might need to be a constraint on the word
list: no word is a prefix of any other word.)

Another way to match a prefix from a (partial) memorable name is change
the encoder to a big-endian process so that it is stable, then match can
by decoding the words and trimming to just the number of bits
represented, perhaps rounding down to entire hex digits, then proceed as
usual for a hash prefix presented in hex.
--
Ross Berteig ***@CheshireEng.com
Cheshire Engineering Corp. http://www.CheshireEng.com/
Stephan Beal
2013-09-05 17:50:26 UTC
Permalink
Post by Ross Berteig
One way to do that (again, blue sky, from someone not intimate with the
insides of fossil) would be to cache the memorable names in a table where
they can be located by SQL just as the prefix of a hash is now.
FWIW: that part is simple. We have a central routine which resolves
"symbolic names" (including prefixes) to their full SHA1, and that would
just need to be extended to check another option. Easiest would be that it
just passes the user-provided data to the sqlite-bound word decoder
function and see if it returns a result.
--
----- stephan beal
http://wanderinghorse.net/home/stephan/
http://gplus.to/sgbeal
Doug Franklin
2013-09-05 20:39:26 UTC
Permalink
On 2013-09-05 13:43, Ross Berteig wrote:

First off, as you mentioned a few posts ago, you've got to get it
examining the bits in the same order and chunks regardless of the
"endianness" of the host. At that point, it seems like prefixing ought
to start working as expected.
Post by Ross Berteig
I'm not convinced we ever need to generate the memorable name from any
less information than the entire hash.
These are going to get long, fast, as someone else already noted.
However, in the past I've seen algorithms to generate "random" "words"
that are still pronounceable, one character at a time. The trouble here
is likely to be the comparative lack of vowels. Perhaps an algorithm
generating words and/or phrases from syllables instead of characters.
And perhaps some algorithm along those lines could mashup with the code
you already have to produce prefixable, pronounceable, memorable "phrase
tokens" for the hashes, that don't end up too long for people to want to
deal with them.
--
Thanks,
DougF (KG4LMZ)
Stephan Beal
2013-09-05 20:46:48 UTC
Permalink
On Thu, Sep 5, 2013 at 10:39 PM, Doug Franklin
These are going to get long, fast, as someone else already noted. However,
in the past I've seen algorithms to generate "random" "words" that are
still pronounceable, one character at a time. The trouble here is likely
to be the comparative lack of vowels. Perhaps an algorithm generating
words and/or phrases from syllables instead of characters. And perhaps some
algorithm along those lines could mashup with the code you already have to
produce prefixable, pronounceable, memorable "phrase tokens" for the
hashes, that don't end up too long for people to want to deal with them.
A long, long time ago i stole some Perl code which creates new words by
reading in client-provided text. So if you want Viking-sounding words, read
in something Scandinavian.

i've got a copy of it here:

http://wanderinghorse.net/gaming/randomwords/

there is a sample word list, and the main tarball contains flavor texts
taken from Project Gutenberg.
--
----- stephan beal
http://wanderinghorse.net/home/stephan/
http://gplus.to/sgbeal
Edward Berner
2013-09-06 06:32:36 UTC
Permalink
Post by Doug Franklin
These are going to get long, fast, as someone else already noted.
However, in the past I've seen algorithms to generate "random" "words"
that are still pronounceable, one character at a time. The trouble
here is likely to be the comparative lack of vowels.
FWIW, I've seen one program that generates random pronounceable
passwords by simply alternating consonants and vowels. They weren't
proper words but were fairly easy to remember.
--
Edward
Ross Berteig
2013-09-07 00:11:29 UTC
Permalink
Post by Edward Berner
Post by Doug Franklin
These are going to get long, fast, as someone else already noted.
However, in the past I've seen algorithms to generate "random" "words"
that are still pronounceable, one character at a time. The trouble
here is likely to be the comparative lack of vowels.
FWIW, I've seen one program that generates random pronounceable
passwords by simply alternating consonants and vowels. They weren't
proper words but were fairly easy to remember.
I was also wondering about how this might look, so I tossed together a
proof of concept encoder in Lua to try out some sample values. I didn't
put a lot of effort into building syllables, other than to combine
enough consonants and vowels to get a set of 256 distinct syllables.
That provides a one syllable per byte mapping, which eliminates most of
the ugly edge cases with a prefix.

If you really wanted to be able to represent an odd number of hex digits
this way, I'd probably handle that by using a trailing consonant without
a vowel to code the sixteen possible values. I didn't try mocking that
up, however, and the attached encoder simply discards an extra digit if
present.

Here are some sample values encoded into my list of 256 syllables:

00000000 ba-ba-ba-ba
12345678 di-ku-poo-wa
74a95e62 vu-che-roo-si
da39a3ee5e6b4b0d3255bfef95601890afd80709
ghi-le-clo-phoo-roo-to-no-cy-ki-py-frau-
phau-bly-sa-fa-bla-chau-gha-bau-ce

[74a95e62] is of course the TOAST MOZART TULIP from David's original
suggestion, and [da39a3ee5e] is the full name of an empty artifact.

I displayed them with hyphens for discussion, but the hyphens could be
removed and syllables grouped in other patterns. I'd likely want a
decoder to ignore case, white space, and punctuation so that Babababa
would be just as good as BABA BABA or ba-ba-ba-ba. I did not bother to
verify that it isn't possible for "unfortunate" words to appear in the
generated output, and I suppose that if adopted as a core feature that
would probably be a good idea.

I'm not sure this post reflects anything more than blue-sky musing, but
if someone is curious enough about the code it is attached under a CC0
public domain declaration.
--
Ross Berteig ***@CheshireEng.com
Cheshire Engineering Corp. http://www.CheshireEng.com/
Doug Franklin
2013-09-07 02:12:22 UTC
Permalink
Post by Ross Berteig
00000000 ba-ba-ba-ba
12345678 di-ku-poo-wa
74a95e62 vu-che-roo-si
da39a3ee5e6b4b0d3255bfef95601890afd80709
ghi-le-clo-phoo-roo-to-no-cy-ki-py-frau-
phau-bly-sa-fa-bla-chau-gha-bau-ce
That last one is too long for easy comprehension or handling. But the
four syllable ones for four bytes work reasonably well. Odd numbers of
digits could possibly be handled by having a separate set of sixteen
syllables used only in the case of a trailing odd hex digit, or forcing
sixteen specific syllables to be used in that case.
--
Thanks,
DougF (KG4LMZ)
Ron Wilson
2013-09-07 18:54:57 UTC
Permalink
Post by Ross Berteig
00000000 ba-ba-ba-ba
12345678 di-ku-poo-wa
74a95e62 vu-che-roo-si
da39a3ee5e6b4b0d3255bfef95601890afd80709
ghi-le-clo-phoo-roo-to-no-cy-ki-py-frau-
phau-bly-sa-fa-bla-chau-gha-bau-ce
I like this, but...

At least in English, 'c' can be pronounced as either 'k' or 's'. Also, when
used as a vowel, 'y' can be pronounced like an 'i'. Also, 'q' can sometimes
be pronounced as 'k'.

Speakers of other languages will likely have similar problems.

As I recall, there is an international phonetic code. Not sure how useful
this will be, though
Ross Berteig
2013-09-05 17:59:56 UTC
Permalink
Post by David Given
Out of interest, when do people using Fossil actually use hashes to
refer to things? It's fairly common to use the hash in git to refer to a
specific checkin, for example, but I've never found an urge to do that
in Fossil.
Personally I use them almost entirely in the form of links between
tickets and checkins, where I'm doing copy and paste in any case so the
hex works out fine.

A secondary use is as a unique id for "that bug" in email and
occasionally in conversation with either others on the team or my
customer's team. I'm not sure if writing "ticket [1234]" or "ticket
[crash-chapter]" is better, but the latter is likely a lot better if
trying to read a hash over the phone.

Mitigating the long hex problem for me is the fact that none of my
projects have lasted long enough to grow to the point were using more
than 4 to 6 digits of hash prefix feels dangerous.
--
Ross Berteig ***@CheshireEng.com
Cheshire Engineering Corp. http://www.CheshireEng.com/
Stephan Beal
2013-09-05 18:18:10 UTC
Permalink
Post by Ross Berteig
Mitigating the long hex problem for me is the fact that none of my
projects have lasted long enough to grow to the point were using more than
4 to 6 digits of hash prefix feels dangerous.
i'm not 100% sure my query is right (i'm no SQL guru), but in the mail
fossil repo we seem to have only 14 collisions (across 21k blobs) at 6
digits:

sqlite> create temp table xxx as select count(*) n, substr(uuid,1,6) u from
blob group by u;
sqlite> select * from xxx where n>1;
2|6fdf52
2|8d712d
2|940431
2|b652b9
2|ba837f
2|bdbf14
2|be32eb
2|c1b1ba
2|c8735d
2|d07537
2|d43165
2|e980ba
2|ef17fb
2|f94f7e
sqlite> select count(*) from xxx where n>1;
14

8 digits seems to be the magic number:

sqlite> create temp table xxx as select count(*) n, substr(uuid,1,10) u
from blob group by u;
sqlite> select count(*) from xxx where n>1;
0
sqlite> drop table xxx;
sqlite> create temp table xxx as select count(*) n, substr(uuid,1,8) u from
blob group by u;
sqlite> select count(*) from xxx where n>1;
0
sqlite> drop table xxx;
sqlite> create temp table xxx as select count(*) n, substr(uuid,1,7) u from
blob group by u;
sqlite> select count(*) from xxx where n>1;
1
sqlite> create temp table xxx as select count(*) n, substr(uuid,1,6) u from
blob group by u;
sqlite> select count(*) from xxx where n>1;
14
--
----- stephan beal
http://wanderinghorse.net/home/stephan/
http://gplus.to/sgbeal
Ross Berteig
2013-09-05 19:00:51 UTC
Permalink
Post by Stephan Beal
....
i'm not 100% sure my query is right (i'm no SQL guru), but in the mail
fossil repo we seem to have only 14 collisions (across 21k blobs) at 6
So ran this query over my current project's repo, which has only 2254
blobs, for a range of prefix length:

prefix collisions
------ ----------
2 256
3 449
4 47
5 2
6 0

Looks like 6 is safe for me, but 4 actually is more likely to cause a
confusion than I expected. Interesting.
--
Ross Berteig ***@CheshireEng.com
Cheshire Engineering Corp. http://www.CheshireEng.com/
Andy Bradford
2013-09-07 03:44:58 UTC
Permalink
Based on various conversations in the past I have been working on a
semi-interesting features: using mnemonic encoding to turn ticket
hashes into something that humans can remember.
What about using the ICAO alphabet? It be easy to read over the phone as
some have suggested. Might be somewhat easier to remember than just the
hash. And it can be easily reversed in the brain.

74-ALPHA-95-ECHO-62 I can easily derive 74a95e62. In fact, you might
even just tell me 74-ALPHA-95 and that will be sufficient.

Let's see:

$ fossil ticket show 7 | awk '{ print $ 2 }' | sed -e 's/[a-f]/-&-/g;s/a/ALPHA/g;s/b/BRAVO/g;s/c/CHARLIE/g;s/d/DELTA/g;s/e/ECHO/g;s/f/FOXTROT/g;s/--/-/g;s/^-//;s/-$//'
#
2-ALPHA-34-DELTA-ECHO-01-FOXTROT-CHARLIE
7636-BRAVO-10-DELTA-DELTA-FOXTROT
04-ALPHA-259-BRAVO-ECHO-40
263-BRAVO-45306-CHARLIE
500-ALPHA-94-BRAVO-9-ECHO-5
93-CHARLIE-266-DELTA-3-ECHO-ECHO
0-CHARLIE-657-FOXTROT-DELTA-35-FOXTROT
74-ALPHA-95-ECHO-62-CHARLIE-FOXTROT
1-DELTA-18-ALPHA-7-CHARLIE-110
83-ALPHA-FOXTROT-1-FOXTROT-5401
9-ALPHA-42-DELTA-DELTA-09-ECHO-0
BRAVO-92-CHARLIE-4160-FOXTROT-ECHO
CHARLIE-BRAVO-6-BRAVO-ALPHA-ALPHA-160-DELTA
ECHO-816200-FOXTROT-95
2-ECHO-3-DELTA-CHARLIE-63382
ECHO-65-DELTA-ALPHA-ALPHA-109-FOXTROT
BRAVO-DELTA-5614-BRAVO-02-DELTA
7-CHARLIE-88-DELTA-10923
13-DELTA-BRAVO-6-DELTA-CHARLIE-ECHO-79
ECHO-8-ALPHA-CHARLIE-83275-FOXTROT
2384764107
665995-ECHO-480
ALPHA-BRAVO-82-ALPHA-0-BRAVO-DELTA-21
FOXTROT-1-CHARLIE-8-FOXTROT-2-DELTA-33-CHARLIE
7-DELTA-2-BRAVO-DELTA-BRAVO-0-ECHO-23
8-ALPHA-41-DELTA-37834
BRAVO-95-ECHO-3635-FOXTROT-CHARLIE
BRAVO-7-FOXTROT-CHARLIE-3-FOXTROT-0569
700-DELTA-423-DELTA-2-ALPHA
DELTA-BRAVO-FOXTROT-7908517
4-CHARLIE-303443-BRAVO-3
FOXTROT-DELTA-42-FOXTROT-1-FOXTROT-CHARLIE-3-DELTA
DELTA-642-CHARLIE-3621-ALPHA
95-DELTA-893-CHARLIE-207
5-FOXTROT-9-FOXTROT-516-ALPHA-22
ECHO-4-DELTA-ALPHA-774-BRAVO-ECHO-BRAVO
974618-FOXTROT-ECHO-5-ALPHA
7-ALPHA-27-ECHO-10-FOXTROT-1-FOXTROT
DELTA-0-ECHO-0-BRAVO-CHARLIE-0-DELTA-ALPHA-5
ALPHA-CHARLIE-8-DELTA-CHARLIE-DELTA-394-CHARLIE
23-DELTA-314856-CHARLIE
7-ALPHA-CHARLIE-3-CHARLIE-7-BRAVO-ALPHA-1-BRAVO
07-ALPHA-38-FOXTROT-3809
0669-ECHO-17923
5-FOXTROT-BRAVO-944026-DELTA
BRAVO-074-ALPHA-7588-BRAVO
ALPHA-410962-BRAVO-8-ECHO
9-DELTA-ALPHA-FOXTROT-677325
1457156-BRAVO-27
DELTA-2-CHARLIE-9-ALPHA-2-FOXTROT-ECHO-01

Maybe? Maybe not?

Just a thought.

Andy
--
TAI64 timestamp: 40000000522aa15d
Continue reading on narkive:
Loading...