c - Perfect/ideal hash to isolate anagrams


In an effort to accelerate fast-out behaviour when testing strings for anagrams, I came up with a prime-based hashing scheme, although it looks like I wasn't the first.

The basic idea is to map letters to prime numbers and to compute the product of these primes. Any rearrangement of the letters will have the same product, and, because prime factorisation is unique, if the result can be arbitrarily large then no combination of other letters can produce the same result.
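The idea above can be sketched in a few lines of C. This is a minimal illustration, not the original code: the function name `prime_product`, the table name `LETTER_PRIME`, and the alphabetical assignment of primes are all assumptions made for the example, and it assumes an ASCII-style character set where 'a'..'z' are contiguous.

```c
#include <stdint.h>

/* First 26 primes, assigned to 'a'..'z' in alphabetical order.
   The assignment is arbitrary and chosen only for illustration. */
static const uint64_t LETTER_PRIME[26] = {
     2,  3,  5,  7, 11, 13, 17, 19, 23, 29, 31, 37, 41,
    43, 47, 53, 59, 61, 67, 71, 73, 79, 83, 89, 97, 101
};

/* Multiply together the primes of all letters in s.
   Non-letters are skipped and case is folded. Overflow is NOT
   detected here: long words simply wrap modulo 2^64. */
uint64_t prime_product(const char *s)
{
    uint64_t h = 1;
    for (; *s; ++s) {
        int c = (unsigned char)*s;
        if (c >= 'A' && c <= 'Z')
            c += 'a' - 'A';              /* fold to lower case */
        if (c >= 'a' && c <= 'z')
            h *= LETTER_PRIME[c - 'a'];
    }
    return h;
}
```

With this table, `prime_product("abc")` is 2 * 3 * 5 = 30, and `prime_product("listen")` equals `prime_product("silent")` because multiplication does not care about order.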

I had initially envisioned this as just a hash. Eventually the product would overflow and start to alias other letter combinations. However, by mapping the most frequent letters to the smallest primes, the product grows slowly and can often avoid overflow altogether. In this case we get a perfect hash, giving both definite positive and negative results without additional testing.
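A sketch of that refinement follows, under a few stated assumptions: the frequency order in `FREQ_ORDER` is one commonly quoted approximation for English and is not taken from the question, the names `anagram_hash` and `PRIMES` are hypothetical, and letters are assumed contiguous as in ASCII. The function reports whether the product stayed exact, which is the condition under which the hash can be treated as perfect.

```c
#include <stdbool.h>
#include <stdint.h>

/* Approximate English letter-frequency order, most common first.
   This particular order is an assumption for illustration. */
static const char FREQ_ORDER[] = "etaoinshrdlcumwfgypbvkjxqz";

static const uint64_t PRIMES[26] = {
     2,  3,  5,  7, 11, 13, 17, 19, 23, 29, 31, 37, 41,
    43, 47, 53, 59, 61, 67, 71, 73, 79, 83, 89, 97, 101
};

/* Stores the (possibly wrapped) product in *out.
   Returns true if no multiplication overflowed, i.e. the value is
   exact and equal hashes imply anagrams; returns false otherwise. */
bool anagram_hash(const char *s, uint64_t *out)
{
    uint64_t prime_of[26];
    for (int i = 0; i < 26; ++i)         /* smallest primes to commonest letters */
        prime_of[FREQ_ORDER[i] - 'a'] = PRIMES[i];

    uint64_t h = 1;
    bool exact = true;
    for (; *s; ++s) {
        int c = (unsigned char)*s;
        if (c >= 'A' && c <= 'Z')
            c += 'a' - 'A';
        if (c < 'a' || c > 'z')
            continue;
        uint64_t p = prime_of[c - 'a'];
        if (h > UINT64_MAX / p)          /* h * p would exceed 64 bits */
            exact = false;
        h *= p;
    }
    *out = h;
    return exact;
}
```

When the function returns false the result is still usable as an ordinary hash, but a match then needs a follow-up comparison. Nothing here says how well distributed the wrapped values are.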

What's notable is that it doesn't fill the coding space very efficiently before overflowing. No result will have any prime factors greater than 103, and the distribution of small primes is fixed and not necessarily a great match to letter frequency.

Now I'm wondering if there's something substantially better than this: something that covers more results with perfect hashes and has a strong distribution in the remaining cases.

The densest coding scheme I can think of is to sort the letters and then pack them into a word with an entropy coder. In this scheme the letter frequency will be enormously biased because of the range constraints applied to each position (e.g., the likelihood of a sorted array starting with z is substantially lower than that of a sorted array ending with z).

That sounds like a whole lot of work, though, and I can't see it guaranteeing a good distribution in the overflow case.

Perhaps there's a better set of factors to map the letters to, and a better way to detect when the risk of aliasing has started. Or a hashing scheme that doesn't rely on multiplication? Something that's easy to calculate?

So that's:

  • A perfect hash for as much real-world input as possible (for a sensible number of bits).
  • A strong hash for the remaining cases, with a means of distinguishing the two cases.
  • Easy to calculate.

English language constraints (26 letters with typical English-like word structure) will do fine. Multi-byte coding schemes are a whole other problem.

C code preferred because I understand it.

If you are using n-bit hashes with an alphabet of size m, you can get a unique hash for anagrams up to (n-m) characters long using the approach I described here. This makes collision detection unnecessary, but it does limit your word size depending on the size of the alphabet and your available space.
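The linked description is not reproduced in this text, so the following is only a guess at a scheme that matches the (n-m) bound: a unary ("stars and bars") packing in which each letter contributes one 1-bit per occurrence followed by a single 0-bit terminator. That uses m terminator bits plus one bit per character, so a 64-bit word with a 26-letter alphabet holds words of up to 38 letters. The name `unary_anagram_code` and the constants are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

enum {
    ALPHABET  = 26,                      /* m */
    HASH_BITS = 64,                      /* n */
    MAX_LEN   = HASH_BITS - ALPHABET     /* n - m = 38 letters */
};

/* Packs the letter counts of s into *out as 26 unary runs, starting
   at bit 0 with the run for 'a'. Returns false (leaving *out
   untouched) if s has more than MAX_LEN letters. Assumes 'a'..'z'
   are contiguous, as in ASCII. */
bool unary_anagram_code(const char *s, uint64_t *out)
{
    unsigned count[ALPHABET] = {0};
    unsigned len = 0;

    for (; *s; ++s) {
        int c = (unsigned char)*s;
        if (c >= 'A' && c <= 'Z')
            c += 'a' - 'A';
        if (c < 'a' || c > 'z')
            continue;                    /* ignore non-letters */
        if (++len > MAX_LEN)
            return false;                /* does not fit in n bits */
        ++count[c - 'a'];
    }

    uint64_t code = 0;
    unsigned pos = 0;
    for (int i = 0; i < ALPHABET; ++i) {
        for (unsigned k = 0; k < count[i]; ++k)
            code |= (uint64_t)1 << pos++;   /* one 1-bit per occurrence */
        ++pos;                              /* 0-bit ends this letter's run */
    }
    *out = code;
    return true;
}
```

For example, "abc" and "cab" both produce the bit pattern 10101 (decimal 21). Since the counts can be read back from the code, two words share a code only if they are anagrams, so no collision check is needed for words within the length limit.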

To allow words of any length, I would use n-1 bits to hash words up to (n-m-1) characters in length, and save the last bit to signal that the word is (n-m) characters or longer. In those cases you would use the remaining n-1 bits for your prime-number or other hashing algorithm, but of course you would then have to do collision detection any time you got multiple words in those buckets. Since in a real-world application the majority of words will occupy the shorter word lengths, you'll drastically cut the collision detection needed for the longer words.
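One way this two-tier layout might look for n = 64 and m = 26 is sketched below. It is an illustration under assumptions rather than the answer's actual code: the exact tier uses a unary packing of letter counts in the low 63 bits (words of up to 37 letters), the fallback is a wrapped prime product with the top bit forced on, and the names `hybrid_anagram_hash`, `LONG_FLAG`, and `MAX_EXACT` are hypothetical.

```c
#include <stdint.h>

enum { ALPHABET = 26, CODE_BITS = 63, MAX_EXACT = CODE_BITS - ALPHABET };

#define LONG_FLAG ((uint64_t)1 << 63)    /* set: hash is NOT collision-free */

static const uint64_t PRIMES[ALPHABET] = {
     2,  3,  5,  7, 11, 13, 17, 19, 23, 29, 31, 37, 41,
    43, 47, 53, 59, 61, 67, 71, 73, 79, 83, 89, 97, 101
};

/* Top bit clear: exact code, equal values imply anagrams.
   Top bit set: ordinary hash, equal values must be verified.
   Assumes 'a'..'z' are contiguous, as in ASCII. */
uint64_t hybrid_anagram_hash(const char *s)
{
    unsigned count[ALPHABET] = {0};
    unsigned len = 0;

    for (; *s; ++s) {
        int c = (unsigned char)*s;
        if (c >= 'A' && c <= 'Z')
            c += 'a' - 'A';
        if (c < 'a' || c > 'z')
            continue;
        ++count[c - 'a'];
        ++len;
    }

    if (len <= MAX_EXACT) {
        /* Short word: unary runs, one 0-bit terminator per letter. */
        uint64_t code = 0;
        unsigned pos = 0;
        for (int i = 0; i < ALPHABET; ++i) {
            for (unsigned k = 0; k < count[i]; ++k)
                code |= (uint64_t)1 << pos++;
            ++pos;
        }
        return code;                     /* uses at most 63 bits */
    }

    /* Long word: prime product, wrapping modulo 2^64, flagged. */
    uint64_t h = 1;
    for (int i = 0; i < ALPHABET; ++i)
        for (unsigned k = 0; k < count[i]; ++k)
            h *= PRIMES[i];
    return h | LONG_FLAG;
}
```

A caller would compare two hashes first and, only when they are equal and have the top bit set, fall back to a full comparison such as sorted letters or letter counts.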

