c - Perfect/ideal hash to isolate anagrams
In an effort to accelerate fast-out behaviour when testing strings for anagrams, I came up with a prime-based hashing scheme, although it looks like I wasn't the first.
The basic idea is to map letters to prime numbers and to compute the product of these primes. Any rearrangement of the letters will have the same product, and if the result can be arbitrarily large then, by unique prime factorisation, no combination of other letters can produce the same result.
I had initially envisioned this as just a hash: eventually the product would overflow and start to alias other letter combinations. However, by mapping the most frequent letters to the smallest primes, the product grows slowly and can often avoid overflow altogether. In that case we get a perfect hash, giving both definite positive and definite negative results without additional testing.
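A minimal sketch of the scheme described above, in C. The 64-bit product width, the particular frequency order ("etaoinshrdlcumwfgypbvkjxqz") and the names `letter_prime` and `anagram_hash` are all assumptions made for illustration, not something fixed by the idea itself. The `exact` flag reports whether the product stayed within 64 bits, which is the condition under which the hash is perfect.

```c
#include <stdint.h>
#include <stdbool.h>
#include <ctype.h>

/* Prime assigned to each letter 'a'..'z'. The i-th most frequent letter
   gets the i-th prime, using an assumed frequency order
   "etaoinshrdlcumwfgypbvkjxqz" (e -> 2, t -> 3, a -> 5, ...). */
static const uint64_t letter_prime[26] = {
    /* a */ 5,  /* b */ 71, /* c */ 37, /* d */ 29, /* e */ 2,
    /* f */ 53, /* g */ 59, /* h */ 19, /* i */ 11, /* j */ 83,
    /* k */ 79, /* l */ 31, /* m */ 43, /* n */ 13, /* o */ 7,
    /* p */ 67, /* q */ 97, /* r */ 23, /* s */ 17, /* t */ 3,
    /* u */ 41, /* v */ 73, /* w */ 47, /* x */ 89, /* y */ 61,
    /* z */ 101
};

/* Multiplies together the primes for every letter in word.
   Characters outside a-z (after lowercasing) are skipped; this
   assumes an ASCII-compatible character set.
   *exact is true when the product never exceeded 64 bits, so the
   value identifies the letter multiset uniquely. When *exact is
   false the value has wrapped modulo 2^64 and is only an ordinary
   hash that needs a follow-up comparison. */
static uint64_t anagram_hash(const char *word, bool *exact)
{
    uint64_t h = 1;
    *exact = true;
    for (; *word; word++) {
        int c = tolower((unsigned char)*word);
        if (c < 'a' || c > 'z')
            continue;
        uint64_t p = letter_prime[c - 'a'];
        if (h > UINT64_MAX / p)
            *exact = false;      /* h * p would not fit in 64 bits */
        h *= p;                  /* unsigned wrap-around is well defined */
    }
    return h;
}
```

With this table "tea" hashes to 3 * 2 * 5 = 30, and "listen" and "silent" give the same exact product. A run of ten z's already exceeds 64 bits and comes back with `exact` set to false, which shows how quickly rare letters eat the available range.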
What's notable is that it doesn't fill the coding space very efficiently before overflowing. No result will have any prime factors greater than 103, and the distribution of small primes is fixed and not necessarily a great match to letter frequency.
Now I'm wondering if there's something substantially better than this: something that covers more results with perfect hashes and has a strong distribution in the remaining cases.
The densest coding scheme I can think of is to sort the letters and then pack them into a word with an entropy coder. In this scheme the letter frequency will obviously be enormously biased because of the range constraints applied to each position (e.g., the likelihood of a sorted array starting with z is substantially lower than that of a sorted array ending with z).
That sounds like a whole lot of work, though, and I can't see it guaranteeing a good distribution in the overflow case.
Perhaps there's a better set of factors to map the letters to, and a better way to detect when the risk of aliasing has started. Or a hashing scheme that doesn't rely on multiplication? Something that's easy to calculate?
so that's:
- A perfect hash for as much real-world input as possible (for some sensible number of bits).
- A strong hash for the remaining cases, with a means of distinguishing the two cases.
- Easy to calculate.
English language constraints (26 letters with typical English-like word structure) will do fine. Multi-byte coding schemes are a whole other problem.
C code is preferred because I understand it.
If you are using n-bit hashes with an alphabet of size m, you can get a unique hash for anagrams up to (n-m) characters long using the approach described here. This makes collision detection unnecessary, but it does limit your word size depending on the size of the alphabet and your available space.
To allow words of any length, I would use n-1 bits to hash words up to (n-m-1) characters in length, and save the last bit to signal that the word is longer than that. In those cases you would use the remaining n-1 bits for your prime-number or other hashing algorithm, but of course you would have to do collision detection any time you got multiple words in those buckets. Since in a real-world application the majority of words fall into the shorter lengths, you'll drastically cut the collision detection needed, confining it to the longer words.
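The linked approach is not reproduced in this post, so the following is only a guess at one encoding that meets the stated bound: write each letter's count in unary (that many 1 bits) followed by a 0 separator, which takes (length + m) bits in total. The sketch below assumes n = 64 and m = 26, so words of up to 37 letters get an exact key and the top bit flags longer words. The fallback is an FNV-1a-style mix over the letter counts, swapped in here as a stand-in for "prime-number or other hashing algorithm"; the names `anagram_key` and `fallback_hash` are hypothetical.

```c
#include <stdint.h>
#include <ctype.h>

#define ALPHABET      26                        /* m */
#define HASH_BITS     64                        /* n */
#define EXACT_MAX_LEN (HASH_BITS - 1 - ALPHABET) /* n-m-1 = 37 */
#define LONG_WORD_FLAG (UINT64_C(1) << (HASH_BITS - 1))

/* Stand-in hash for long words: FNV-1a-style mixing applied to the
   letter counts, so it is still independent of letter order.
   Collisions are possible here and must be resolved by the caller. */
static uint64_t fallback_hash(const unsigned counts[ALPHABET])
{
    uint64_t h = UINT64_C(14695981039346656037); /* FNV offset basis */
    for (int i = 0; i < ALPHABET; i++) {
        h ^= counts[i];
        h *= UINT64_C(1099511628211);            /* FNV prime */
    }
    return h;
}

/* Returns a 64-bit key that is equal for anagrams.
   Top bit clear: the key is exact (no two different letter multisets
   share it). Top bit set: the word was too long and the low 63 bits
   are an ordinary hash. Assumes ASCII; non-letters are skipped. */
static uint64_t anagram_key(const char *word)
{
    unsigned counts[ALPHABET] = {0};
    unsigned len = 0;

    for (; *word; word++) {
        int c = tolower((unsigned char)*word);
        if (c < 'a' || c > 'z')
            continue;
        counts[c - 'a']++;
        len++;
    }

    if (len > EXACT_MAX_LEN)
        return LONG_WORD_FLAG | (fallback_hash(counts) >> 1);

    /* Unary histogram: counts[i] one-bits, then a zero separator.
       Uses at most len + 26 <= 63 bits, so bit 63 stays clear. */
    uint64_t key = 0;
    unsigned pos = 0;
    for (int i = 0; i < ALPHABET; i++) {
        for (unsigned k = 0; k < counts[i]; k++)
            key |= UINT64_C(1) << pos++;
        pos++;
    }
    return key;
}
```

Under this layout "a" maps to 1, "b" to 2, and both "ab" and "ba" to 5 (binary 101). A caller would compare keys directly when the top bit is clear, and fall back to a full letter-count comparison only when two keys match with the top bit set.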