ChibiHash: Small, Fast 64 bit hash function

(nrk.neocities.org)

104 points thunderbong | 4 comments | 18 Nov 24 00:23 UTC

teo_zero ◴[18 Nov 24 06:59 UTC] No.42170385[source]▶

While the benefit of processing chunks of 8 bytes is obvious, what's the purpose of grouping those into macrogroups of 4? Does it trigger any implicit parallelism I failed to spot? Or is it just to end this phase with the 4 h[] having had the same amount of entropy, and thus starting the next one with h[0]?

> The way it loads the 8 bytes is also important. The correct way is to load via shift+or

> This is free of any UB, works on any alignment and on any machine regardless of its endianness. It's also fast, gcc and clang recognize this pattern and optimize it into a single mov instruction on x86 targets.

Is a single MOV instruction still fast when the 8 bytes begin on an odd address?
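
For reference, the shift+or load described in the quote can be sketched roughly as below. This is only an illustration of the pattern, not necessarily the article's exact code; the helper name `load64le` is made up for this example.

```c
#include <stdint.h>

/* Hypothetical helper name, used only for this illustration.
 * Assembles a little-endian 64-bit value one byte at a time, so the
 * source never performs a wide (possibly misaligned) memory access. */
static uint64_t load64le(const uint8_t *p)
{
    return (uint64_t)p[0] <<  0 | (uint64_t)p[1] <<  8 |
           (uint64_t)p[2] << 16 | (uint64_t)p[3] << 24 |
           (uint64_t)p[4] << 32 | (uint64_t)p[5] << 40 |
           (uint64_t)p[6] << 48 | (uint64_t)p[7] << 56;
}
```

Since the C code only reads single bytes, the result is the same whether `p` is aligned or not. Whether the compiler turns it into one load instruction, and how that instruction performs on an odd address, depends on the target.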

replies(2): >>42171256 #>>42171768 #

xxs ◴[18 Nov 24 12:25 UTC] No.42171768[source]▶

>>42170385 #

As mentioned, x86/x86-64 is quite generous with non-aligned access. On architectures that require aligned loads, though, they will be aligned (all lowest bits being zero at the start), so the address stays aligned with each 8-byte load.

replies(1): >>42177097 #

1. teo_zero ◴[18 Nov 24 21:12 UTC] No.42177097[source]▶

>>42171768 #

> they will be aligned (all lowest bits being zero at the start

I don't think so. At the start k will be equal to keyIn, which can point to any arbitrary memory location.

replies(1): >>42180803 #

2. rerdavies ◴[19 Nov 24 07:21 UTC] No.42180803[source]▶

>>42177097 (TP) #

But in practice, it will always point to the start of a buffer allocated from the heap, which will be aligned.

replies(1): >>42182400 #

3. teo_zero ◴[19 Nov 24 11:46 UTC] No.42182400[source]▶

>>42180803 #

Why? If you are hashing names, for example, or keywords, they can begin at whatever position inside the buffer.

The case where you allocate a buffer, populate it with random contents and hash the whole of it is an artificial setup for benchmarking, but far from real use cases.

replies(1): >>42183440 #

4. xxs ◴[19 Nov 24 13:54 UTC] No.42183440{3}[source]▶

>>42182400 #

I was about to answer pretty much as 'rerdavies' did. If you have such requirements, the specific architectures would need wrappers to copy the data, or do prefix/postfix fix-ups (à la crc32 & AVX implementations on non-perfect boundaries).
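
One possible reading of the "wrapper to copy the data" option, at the smallest granularity, is to copy each 8-byte chunk into an aligned local before using it. The sketch below is an assumption about what such a wrapper might look like, not something taken from the article; `load64_via_copy` is a made-up name.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical wrapper: read 8 bytes from any address by copying them
 * into an aligned local variable first. memcpy has no alignment
 * requirement on its source. Note that the result is in host byte
 * order, unlike a shift+or load, which fixes the byte order. */
static uint64_t load64_via_copy(const void *p)
{
    uint64_t v;
    memcpy(&v, p, sizeof v);
    return v;
}
```

The prefix/postfix alternative would instead process the unaligned head and tail bytes separately and use aligned loads only for the middle of the input.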
