[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

*To*: bear <bear@xxxxxxxxx>*Subject*: Re: should hash function upper bound be exclusive?*From*: Per Bothner <per@xxxxxxxxxxx>*Date*: Tue, 09 Aug 2005 21:52:10 -0700*Cc*: srfi-69@xxxxxxxxxxxxxxxxx*Delivered-to*: srfi-69@xxxxxxxxxxxxxxxxx*In-reply-to*: <Pine.LNX.4.58.0508091632550.969@xxxxxxxxxxxxxx>*References*: <42F8E871.5070903@xxxxxxxxxxx> <Pine.LNX.4.58.0508091632550.969@xxxxxxxxxxxxxx>*User-agent*: Mozilla Thunderbird 1.0.6-1.1.fc4 (X11/20050720)

bear wrote:

On the other hand, the largest fixnum is likely to be either a power of two, or one less than a power of two; as hashes are conventionally implemented (via modulus operations) these are poor choices for a hash range boundary. Numerical theory indicates that a much better choice is the highest prime fixnum representable, or at least a fixnum with no small factors.

I'm afraid I don't see your point. A "default bound" is a generic hash value that has a larger range than indexes for a hash table, which means you *will* be doing a modulo operation to convert the generic hash value to an index. Yes, some people recommend that that *the number of buckets* be a prime number, but I don't how well-founded that is. If the hash function is properly "random", does it matter? This faq: http://burtleburtle.net/bob/hash/hashfaq.html says: Anything with a modulo prime at the end is probably bad; a decent hash function could use a power of two and wouldn't need to rely on the modulo prime to further mix anything.

Otherwise you run into rehash problems where the system may be able to find no valid rehash value even though the table is not full.

If by "rehash" you mean linear probing (secondary hashing, IIRC), then obviously the sequence of probes should be exhaustive, and one easy way to do that is to have an increment that is relatively prime to the number of buckets. Howwever, if each bucket contains a chain (as in the referernce implementation) that's not an issue. If by "rehash" you mean resizing (increasing the number of buckets), I still don't see the relevance.

Definitely; when you have the capability of resizing the table, you must be able to rehash everything in the table. Usually you can make that much faster by caching the hashes. Caching the hashes also makes it much faster (in the average case) to check and make sure that you are looking at an actual *match* rather than something else that hashed into the same bucket.

If we can resize (and don't want to recalculate hashes when resizing) then the cached hash needs to be table-size-independent. It should use an implementation-tuned range, and a reasonable choice matches the fixnum range (or the non-negative fixnum range). It also makes sense that this range match the default bound when the hash functions are called with only one parameter. -- --Per Bothner per@xxxxxxxxxxx http://per.bothner.com/

**References**:**should hash function upper bound be exclusive?***From:*Per Bothner

**Re: should hash function upper bound be exclusive?***From:*bear

- Prev by Date:
**Re: should hash function upper bound be exclusive?** - Next by Date:
**Re: hash for eq?** - Previous by thread:
**Re: should hash function upper bound be exclusive?** - Next by thread:
**Re: should hash function upper bound be exclusive?** - Index(es):