[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

string hash function in SRFI-13

This page is part of the web mail archives of SRFI 13 from before July 7th, 2015. The new archives for SRFI 13 contain all messages, not just those from before July 7th, 2015.

Brad Lucier pointed out that my reference implementation of string-hash
(and char-set-hash) doesn't match the comments above it. Here's an
explanation of the issue.

The basic idea for the string-hash function (this is not part of the spec;
it's just the particular implementation chosen for the reference
implementation) is taken from Java. We treat the string cn...c0 as
a radix-37 numeral (albeit one where the individual digits can be
larger then 36), and use that for the hash value:

    h = (c0 + c1*37 + c2*37^2 + ...) mod bound

where BOUND is the desired range of the hash function. Now, the expression
in parens is going to be a a really big number for a long string; we'd like
to compute H without using a bignum package. We *could* do this:

    h := 0
    for i = n to 0 do
      h := (c[i] + h*37) mod bound

That's a lot of mod operations, however. Instead, why not define MASK
to be a string of 1-bits long enough to "cover" BOUND, that is
   mask := 2^(ceiling(lg bound)) - 1
and then just keep these low bits in the intermediate computation, finishing
up with a final mod op:

    h := 0
    for i = n to 0 do
      h := (c[i] + h*37) and mask
    h := h mod bound

Great -- that's much more efficient. But it ain't what we originally set out
to compute -- it isn't computing the h value I originally wrote down. That
is what Brad spotted (which I had missed).

The question is: does it matter? Is this more-quickly-computed value
any less good as a hash value?

I don't think so. I have not thought it through carefully, however.

Anybody else have anything else to say?