This page is part of the web mail archives of SRFI 69 from before July 7th, 2015. The new archives for SRFI 69 contain all messages, not just those from before July 7th, 2015.
On the other hand, the largest fixnum is likely to be either a power of two, or one less than a power of two; as hashes are conventionally implemented (via modulus operations) these are poor choices for a hash range boundary. Number theory suggests that a much better choice is the largest prime fixnum representable, or at least a fixnum with no small factors.
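The worry above can be made concrete with a small sketch (in Python, purely for illustration; the values are hypothetical, not from the SRFI): hash values that differ only in their high bits all collide when the modulus is a power of two, because `h % 2**k` keeps only the low k bits, while a prime modulus depends on every bit.

```python
# Hypothetical hash values that differ only above bit 4.
hashes = [0x10, 0x110, 0x210, 0x310]

power_of_two = 16
prime = 17

print([h % power_of_two for h in hashes])  # -> [0, 0, 0, 0]  (all collide)
print([h % prime for h in hashes])         # -> [16, 0, 1, 2] (all distinct)
```

This is the sense in which a power-of-two range boundary is fragile: it silently discards the high bits of a weak hash.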
I'm afraid I don't see your point. A "default bound" is a generic hash value that has a larger range than the indexes of a hash table, which means you *will* be doing a modulo operation to convert the generic hash value to an index. Yes, some people recommend that *the number of buckets* be a prime number, but I don't know how well-founded that is. If the hash function is properly "random", does it matter? This FAQ: http://burtleburtle.net/bob/hash/hashfaq.html says: "Anything with a modulo prime at the end is probably bad; a decent hash function could use a power of two and wouldn't need to rely on the modulo prime to further mix anything."
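The FAQ's point can be sketched as well (again in Python, for illustration only): if the hash function finishes with a good bit-mixing step, the low bits already depend on all input bits, and a power-of-two reduction becomes harmless. The finalizer below uses SplitMix64-style constants as an assumed example of such a mixer.

```python
def mix(h):
    """64-bit avalanche finalizer in the style of SplitMix64 (illustrative)."""
    mask = (1 << 64) - 1
    h &= mask
    h ^= h >> 30
    h = (h * 0xBF58476D1CE4E5B9) & mask
    h ^= h >> 27
    h = (h * 0x94D049BB133111EB) & mask
    h ^= h >> 31
    return h

# The same high-bits-only values that all collided mod 16 above
# now spread across buckets once mixed, even with a power-of-two size.
hashes = [0x10, 0x110, 0x210, 0x310]
print([mix(h) % 16 for h in hashes])
```

The design question in this thread is thus really about *where* the mixing happens: in the hash function itself, or via a prime bucket count.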
Otherwise you run into rehash problems where the system may be able to find no valid rehash value even though the table is not full.
If by "rehash" you mean linear probing (secondary hashing, IIRC), then obviously the sequence of probes should be exhaustive, and one easy way to do that is to have an increment that is relatively prime to the number of buckets. However, if each bucket contains a chain (as in the reference implementation) that's not an issue. If by "rehash" you mean resizing (increasing the number of buckets), I still don't see the relevance.
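The coprime-increment condition mentioned above is easy to demonstrate (a Python sketch; the function name is mine, not from any implementation): a stride relatively prime to the bucket count visits every bucket exactly once, while a stride sharing a factor with it cycles early and leaves buckets unreachable even in a non-full table.

```python
from math import gcd

def probe_sequence(start, stride, nbuckets):
    """Buckets visited by repeated probing with a fixed stride."""
    return [(start + i * stride) % nbuckets for i in range(nbuckets)]

# gcd(7, 10) == 1: every bucket is visited exactly once.
seq = probe_sequence(3, 7, 10)
assert gcd(7, 10) == 1 and sorted(seq) == list(range(10))

# gcd(4, 10) == 2: the probe cycles after visiting only half the buckets,
# which is exactly the "no valid rehash value though not full" failure.
seq2 = probe_sequence(3, 4, 10)
print(sorted(set(seq2)))  # -> [1, 3, 5, 7, 9]
```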
Definitely; when you have the capability of resizing the table, you must be able to rehash everything in the table. Usually you can make that much faster by caching the hashes. Caching the hashes also makes it much faster (in the average case) to check and make sure that you are looking at an actual *match* rather than something else that hashed into the same bucket.
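Both benefits of caching described above — rehash-free resizing and cheap mismatch rejection — can be sketched in a minimal chained table (Python, illustrative only; class and method names are mine, not the SRFI's API):

```python
class HashTable:
    """Minimal chained hash table that caches each entry's full hash."""

    def __init__(self, nbuckets=8):
        self.buckets = [[] for _ in range(nbuckets)]

    def _index(self, h):
        return h % len(self.buckets)

    def set(self, key, value):
        h = hash(key)                       # computed once, then cached
        chain = self.buckets[self._index(h)]
        for entry in chain:
            if entry[0] == h and entry[1] == key:
                entry[2] = value
                return
        chain.append([h, key, value])       # entry = [cached hash, key, value]

    def get(self, key, default=None):
        h = hash(key)
        for eh, ek, ev in self.buckets[self._index(h)]:
            # Cheap integer comparison first; the (possibly expensive)
            # key comparison runs only when the cached hashes agree.
            if eh == h and ek == key:
                return ev
        return default

    def resize(self, nbuckets):
        old = self.buckets
        self.buckets = [[] for _ in range(nbuckets)]
        for chain in old:
            for entry in chain:
                # Cached hash reused: no key is rehashed during resize.
                self.buckets[self._index(entry[0])].append(entry)

t = HashTable()
t.set("a", 1)
t.set("b", 2)
t.resize(32)
print(t.get("a"), t.get("b"))  # -> 1 2
```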
If we can resize (and don't want to recalculate hashes when resizing) then the cached hash needs to be table-size-independent. It should use an implementation-tuned range, and a reasonable choice matches the fixnum range (or the non-negative fixnum range). It also makes sense that this range match the default bound when the hash functions are called with only one parameter.
--
--Per Bothner
per@xxxxxxxxxxx   http://per.bothner.com/
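The convention being argued for can be sketched like this (Python, illustrative; the bound value and hash function are assumptions, not the SRFI's definitions): hash functions take an optional bound and return a value in [0, bound), with a large, table-size-independent default that tables can cache and reduce per resize.

```python
# Assumed default: a large, fixnum-sized bound (hypothetical value).
DEFAULT_BOUND = 2**61 - 1

def string_hash(s, bound=DEFAULT_BOUND):
    """Toy string hash returning a value in [0, bound)."""
    h = 0
    for ch in s:
        h = (h * 31 + ord(ch)) % bound
    return h

full = string_hash("scheme")   # table-size-independent; safe to cache
index = full % 16              # reduced against the *current* bucket count
```

On resize, only the final reduction (`full % nbuckets`) changes; the cached `full` value stays valid, which is exactly why the cached range must not depend on the table size.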