104 points thunderbong | 1 comment
teo_zero ◴[] No.42170385[source]
While the benefit of processing chunks of 8 bytes is obvious, what's the purpose of grouping those into macrogroups of 4? Does it trigger any implicit parallelism I failed to spot? Or is it just to end this phase with the 4 h[] having had the same amount of entropy, and thus starting the next one with h[0]?

> The way it loads the 8 bytes is also important. The correct way is to load via shift+or
>
> This is free of any UB, works on any alignment and on any machine regardless of its endianness. It's also fast, gcc and clang recognize this pattern and optimize it into a single mov instruction on x86 targets.

Is a single MOV instruction still fast when the 8 bytes begin on an odd address?

replies(2): >>42171256 #>>42171768 #
1. e4m2 ◴[] No.42171256[source]
> Is a single MOV instruction still fast when the 8 bytes begin on an odd address?

On x86, yes. A misaligned load carries no performance penalty unless it also straddles a cache line boundary, and even then it is only marginally slower.
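
For reference, a minimal sketch of the shift+or load being discussed, plus a check for the one case where misalignment costs anything. The names `load64_le` and `straddles_cache_line` are made up for illustration, not taken from the article, and the 64-byte line size is an assumption that holds for typical x86 parts but is not guaranteed everywhere.

```c
#include <stddef.h>
#include <stdint.h>

/* Portable little-endian 64-bit load built from byte reads, shifts and ORs.
 * It only ever reads through unsigned char, so there is no alignment or
 * aliasing UB, and the result is the same on any host endianness.
 * (Hypothetical name; the article's own helper may differ.) */
static uint64_t load64_le(const unsigned char *p)
{
    return (uint64_t)p[0]       |
           (uint64_t)p[1] <<  8 |
           (uint64_t)p[2] << 16 |
           (uint64_t)p[3] << 24 |
           (uint64_t)p[4] << 32 |
           (uint64_t)p[5] << 40 |
           (uint64_t)p[6] << 48 |
           (uint64_t)p[7] << 56;
}

/* Returns 1 if an 8-byte load starting at addr would cross a cache line
 * boundary. ASSUMPTION: 64-byte cache lines. An offset of 56 still fits
 * (bytes 56..63); anything above 56 spills into the next line. */
static int straddles_cache_line(uintptr_t addr)
{
    return (addr & 63u) > 56u;
}
```

With 64-byte lines, only 7 of the 64 possible start offsets within a line straddle a boundary, so a stream of loads at arbitrary offsets hits the slower path fairly rarely.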