A leap year check in three instructions

1. qingcharles ◴[15 May 25 22:22 UTC] No.43999955[source]▶

I love these incomprehensible magic number optimizations. Every time I see one I wonder how many optimizations like this we missed back in the old days when we were writing all our inner loops in assembly?

Does anyone have a collection of these things?

replies(4): >>43999992 #>>44000134 #>>44000173 #>>44000633 #

2. masfuerte ◴[15 May 25 22:26 UTC] No.43999992[source]▶

>>43999955 (TP) #

We didn't miss them. In those days they weren't optimizations. Multiplications were really expensive.

replies(6): >>44000045 #>>44001019 #>>44001612 #>>44001634 #>>44001978 #>>44008837 #

3. kurthr ◴[15 May 25 22:33 UTC] No.44000045[source]▶

>>43999992 #

and divides were worse. (1 cycle add, 10 cycle mult, 60 cycle div)

replies(3): >>44000122 #>>44000234 #>>44000838 #

4. genewitch ◴[15 May 25 22:45 UTC] No.44000122{3}[source]▶

>>44000045 #

That's fair but mod is division, or no? So realistically the new magic number version would be faster. Assuming there is 32 bit int support. Sorry, this is above my paygrade.

replies(1): >>44001372 #

5. tylerhou ◴[15 May 25 22:48 UTC] No.44000134[source]▶

>>43999955 (TP) #

You should look at supercompilation.

replies(1): >>44002008 #

6. owl_vision ◴[15 May 25 22:55 UTC] No.44000173[source]▶

>>43999955 (TP) #

there is "Hacker's Delight" by Henry S. Warren, Jr.

https://en.wikipedia.org/wiki/Hacker's_Delight

replies(1): >>44000223 #

7. qingcharles ◴[15 May 25 23:01 UTC] No.44000223[source]▶

>>44000173 #

Looks awesome, thank you :)

8. qingcharles ◴[15 May 25 23:03 UTC] No.44000234{3}[source]▶

>>44000045 #

Yeah, I'm thinking more of ones that remove all the divs from some crazy math functions for graphics rendering and replace them all with bit shifts or boolean ops.

9. ryao ◴[16 May 25 00:18 UTC] No.44000633[source]▶

>>43999955 (TP) #

Here is a short list:

https://graphics.stanford.edu/~seander/bithacks.html

It is not on the list, but #define CMP(X, Y) (((X) > (Y)) - ((X) < (Y))) is an efficient way to do generic comparisons for things that want UNIX-style comparators. If you compare the output against 0 to check for some form of greater than, less than or equality, the compiler should automatically simplify it. For example, CMP(X, Y) > 0 is simplified to (X > Y) by a compiler.

The signum(x) function that is equivalent to CMP(X, 0) can be done in 3 or 4 instructions depending on your architecture without any comparison operations:

https://www.cs.cornell.edu/courses/cs6120/2022sp/blog/supero...

It is such a famous example, that compilers probably optimize CMP(X, 0) to that, but I have not checked. Coincidentally, the expansion of CMP(X, 0) is on the bit hacks list.
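
For illustration, here is a minimal C sketch of the macro used as a qsort() comparator and as a signum. The helper names cmp_int and signum are made up for this example, and whether a given compiler reduces them to the branch-free sequences mentioned above is something to check on your own toolchain.

```c
#include <stdlib.h>

/* Three-way compare: yields -1, 0, or 1. Unlike `return a - b;`,
   it cannot overflow for values far apart. */
#define CMP(X, Y) (((X) > (Y)) - ((X) < (Y)))

/* qsort-style comparator for ints (hypothetical helper name). */
static int cmp_int(const void *pa, const void *pb)
{
    int a = *(const int *)pa;
    int b = *(const int *)pb;
    return CMP(a, b);
}

/* signum expressed through the same macro (hypothetical helper name). */
static int signum(int x)
{
    return CMP(x, 0);
}
```

Calling qsort(v, n, sizeof v[0], cmp_int) sorts an int array in ascending order; swapping the arguments inside CMP would sort it in descending order.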

There are a few more superoptimized mathematical operations listed here:

https://www2.cs.arizona.edu/~collberg/Teaching/553/2011/Reso...

Note that the assembly code appears to be for the Motorola 68000 processor and it makes use of flags that are set in edge cases to work.

Finally, there is a list of helpful macros for bit operations that originated in OpenSolaris (as far as I know) here:

https://github.com/freebsd/freebsd-src/blob/master/sys/cddl/...

There used to be an Open Solaris blog post on them, but Oracle has taken it down.

Enjoy!

replies(3): >>44001468 #>>44002802 #>>44002875 #

10. ryao ◴[16 May 25 00:58 UTC] No.44000838{3}[source]▶

>>44000045 #

Division still is worse:

https://github.com/ridiculousfish/libdivide

11. godelski ◴[16 May 25 01:28 UTC] No.44001019[source]▶

>>43999992 #

Related: Computerphile had a video a few months ago where they try to put compute time relative to human time, similar to the way one might visualize an atom by making the proton the size of a golf ball. I think it can help put some costs into perspective and really show why branching matters, as well as the great engineering done to hide some of the slowdowns. But some things are definitely being masked simply by the sheer speed of the clock (like how the small size of a proton hides how empty an atom is).

https://youtube.com/watch?v=PpaQrzoDW2I

12. bobmcnamara ◴[16 May 25 02:39 UTC] No.44001372{4}[source]▶

>>44000122 #

Many compilers will compute div-by-a-constant using the invert, multiply, and shift off the remainder trick. Once you have that, you can do mod-by-a-constant as a derivative and usually still beat 1-bit or 2-bit division.
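
As a concrete sketch of that trick in C, here is unsigned division by 10 done with one multiply and one shift, and the remainder derived from it. The function names are made up for the example. The constant 0xCCCCCCCD is ceil(2^35 / 10), which is the commonly used magic number for this divisor on 32-bit unsigned inputs.

```c
#include <stdint.h>

/* x / 10 for any 32-bit unsigned x, without a divide instruction.
   Multiply by the scaled reciprocal ceil(2^35 / 10) in 64 bits,
   then shift the fractional part away. */
static uint32_t div10(uint32_t x)
{
    return (uint32_t)(((uint64_t)x * 0xCCCCCCCDu) >> 35);
}

/* x % 10 derived from the quotient: one more multiply and a subtract. */
static uint32_t mod10(uint32_t x)
{
    return x - div10(x) * 10u;
}
```

Other divisors need their own constant and shift amount, so this is not a drop-in for arbitrary constants.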

13. JdeBP ◴[16 May 25 03:03 UTC] No.44001468[source]▶

>>44000633 #

For an entire book on this stuff, see Henry S. Warren Jr.'s Hacker's Delight. The "three valued compare function" is in chapter 2, for example.

14. ◴[16 May 25 03:35 UTC] No.44001612[source]▶

>>43999992 #

15. JdeBP ◴[16 May 25 05:00 UTC] No.44001978[source]▶

>>43999992 #

Multiplications of this word length, one should clarify. It's not that multiplication was an inherently more expensive or different operation back then (assuming from context here that the "old days" of coding inner loops in assembly language pre-date even the 32-bit ALU era). Binary multiplication has not changed in millennia. Ancient Egyptians were using the same binary integer multiplication logic 5 millennia ago as ALUs do today.

It was that generally the fast hardware multiplication operations in ALUs didn't have very many bits in the register word length, so multiplications of wider words had to be done with library functions that did long multiplication in (say) base 256.

So the code in the headlined article would not be "three instructions" but three calls to internal helper library functions that the compiler uses for long-word multiplication, comparison, and bitwise AND. That is not markedly better than three helper calls for the three original modulo operations, and it is in fact worse than the bit-twiddled modulo-powers-of-2 version found halfway down the headlined article, which would only need to check the least significant byte and would not call library functions for two of the 32-bit modulo operations.
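
To make the comparison concrete, here is a C sketch of the textbook check next to a masked variant of the kind described. This is not copied from the article; the function names are made up and the masked form is my own reconstruction of the general idea.

```c
#include <stdbool.h>
#include <stdint.h>

/* Textbook rule: three modulo operations. */
static bool is_leap_textbook(uint32_t y)
{
    return (y % 4 == 0) && (y % 100 != 0 || y % 400 == 0);
}

/* Same result with two of the three modulos replaced by masks.
   Once y is a multiple of 4, "multiple of 100" is the same as
   "multiple of 25", and a multiple of 25 is a multiple of 400
   exactly when it is also a multiple of 16. */
static bool is_leap_masked(uint32_t y)
{
    return (y & 3) == 0 && (y % 25 != 0 || (y & 15) == 0);
}
```

Both masks only look at the low four bits of the year, which is why a narrow ALU can evaluate them without any long-word helper.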

Bonus points to anyone who remembers the helper function names in Microsoft BASIC's runtime library straight off the top of xyr head. It is probably a good thing that I finally seem to have forgotten them. (-: They all began with "B$" as I recall.

replies(3): >>44002816 #>>44004271 #>>44007422 #

16. mshockwave ◴[16 May 25 05:08 UTC] No.44002008[source]▶

>>44000134 #

Sometimes also known as superoptimization. Many superoptimizers also use SMT solvers like Z3, which is mentioned in the article.

replies(1): >>44002057 #

17. tylerhou ◴[16 May 25 05:19 UTC] No.44002057{3}[source]▶

>>44002008 #

Yes, sorry, superoptimization is the correct term.

18. eru ◴[16 May 25 07:49 UTC] No.44002802[source]▶

>>44000633 #

> It is not on the list, but #define CMP(X, Y) (((X) > (Y)) - ((X) < (Y))) is an efficient way to do generic comparisons for things that want UNIX-style comparators. If you compare the output against 0 to check for some form of greater than, less than or equality, the compiler should automatically simplify it. For example, CMP(X, Y) > 0 is simplified to (X > Y) by a compiler.

I guess this only applies when the compiler knows what version of > you are using?

Eg it might not work in C++ when < and > are overloaded for eg strings?

replies(3): >>44003877 #>>44008285 #>>44008439 #

19. eru ◴[16 May 25 07:51 UTC] No.44002816{3}[source]▶

>>44001978 #

> Multiplications of this word length, one should clarify. It's not that multiplication was an inherently more expensive or different operation back then (assuming from context here that the "old days" of coding inner loops in assembly language pre-date even the 32-bit ALU era). Binary multiplication has not changed in millennia. Ancient Egyptians were using the same binary integer multiplication logic 5 millennia ago as ALUs do today.

Well, we can actually multiply long binary numbers asymptotically faster than Ancient Egyptians.

See eg https://en.wikipedia.org/wiki/Karatsuba_algorithm

20. kmoser ◴[16 May 25 08:02 UTC] No.44002875[source]▶

>>44000633 #

There's also this classic: https://en.wikipedia.org/wiki/Fast_inverse_square_root

replies(1): >>44006975 #

21. trollbridge ◴[16 May 25 11:00 UTC] No.44003877{3}[source]▶

>>44002802 #

The compiler would resolve that before the optimiser.

replies(1): >>44011881 #

22. kruador ◴[16 May 25 11:52 UTC] No.44004271{3}[source]▶

>>44001978 #

Most 8-bit CPUs didn't even have a hardware multiply instruction. To multiply on a 6502, for example, or a Z80, you have to add repeatedly. You can multiply by a power of 2 by shifting left, so you can get a bigger result by switching between shifting and adding or subtracting. Although, again, on these earlier CPUs you can only shift by one bit at a time, rather than by a variable number of bits.

There's also the difference between multiplying by a hard-coded value, which can be implemented with shifts and adds, and multiplying two variables, which has to be done with an algorithm.
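
A rough C sketch of both cases, assuming 8-bit operands and a 16-bit result. The function names are made up, and a real 6502 or Z80 routine would be hand-written assembly rather than C.

```c
#include <stdint.h>

/* Multiply two 8-bit variables using only single-bit shifts and adds,
   the way a software routine on a CPU with no multiply instruction
   might. */
static uint16_t mul8_shift_add(uint8_t a, uint8_t b)
{
    uint16_t acc = 0;
    uint16_t addend = a;          /* a, shifted left once per step */
    while (b != 0) {
        if (b & 1)
            acc += addend;        /* this bit of b is set: add */
        addend <<= 1;             /* shift by one bit at a time */
        b >>= 1;
    }
    return acc;
}

/* Multiplying by a hard-coded 10 needs no loop: 10*x = 8*x + 2*x. */
static uint16_t mul10(uint8_t x)
{
    return (uint16_t)((x << 3) + (x << 1));
}
```

The loop runs at most eight times for 8-bit operands, once per bit of the multiplier.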

The 8086 did have multiply instructions, but they were implemented as a loop in the microcode, adding the multiplicand, or not, once for each bit in the multiplier. More at https://www.righto.com/2023/03/8086-multiplication-microcode.... Multiplying by a fixed value using shifts and adds could be faster.

The prototype ARM1 did not have a multiply instruction. The architecture does have a barrel shifter which can shift one of the operands by any number of bits. For a fixed multiplication, it's possible to compute multiplying by a power of two, by (power of two plus 1), or by (power of two minus 1) in a single instruction. The latter is why ARM has both a SUB (subtract) instruction, computing rd := rs1 - Operand2, and a RSB (Reverse SuBtract) instruction, computing rd := Operand2 - rs1. The second operand goes through the barrel shifter, allowing you to write an instruction like 'RSB R0, R1, R1, LSL #4' meaning 'R0 := (R1 << 4) - R1', or in other words '(R1 * 16) - R1', or R1 * 15.
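
The same three single-instruction forms written out in C, as a sketch with made-up function names. Whether a compiler actually emits a shifted-operand ADD or RSB for these depends on the target and the optimization settings.

```c
#include <stdint.h>

/* x * 16: a single shift. */
static uint32_t mul16(uint32_t x) { return x << 4; }

/* x * 17 = (x << 4) + x: an add with a shifted second operand. */
static uint32_t mul17(uint32_t x) { return (x << 4) + x; }

/* x * 15 = (x << 4) - x: the reverse-subtract form. */
static uint32_t mul15(uint32_t x) { return (x << 4) - x; }
```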

ARMv2 added in MUL and MLA (MuLtiply and Accumulate) instructions. The hardware ARM2 implementation uses a Booth's encoder to multiply 2 bits at a time, taking up to 16 cycles for 32 bits. It can exit early if the remaining bits are all 0s.

Later ARM cores implemented an optional wider multiplier (that's the 'M' in 'ARM7TDMI', for example) that could multiply more bits at a time, therefore executing in fewer cycles. I believe ARM7TDMI was 8-bit, completing in up to 4 cycles (again, offering early exit). Modern ARM cores can do 64-bit multiplies in a single cycle.

replies(1): >>44004441 #

23. cbm-vic-20 ◴[16 May 25 12:08 UTC] No.44004441{4}[source]▶

>>44004271 #

The base RISC-V instruction set does not include hardware multiply instructions. Most implementations do include the M (or related) extensions that provide them, but if you are building a processor that doesn't need it, you don't need to include it.

24. ryao ◴[16 May 25 15:57 UTC] No.44006975{3}[source]▶

>>44002875 #

That is an approximation. If approximations are acceptable, then here is a trick you might like. In loops that call cosf(i * C) and/or sinf(i * C), where i is incremented by 1 on each iteration and C is some constant expression, you can call cosf() and sinf() once (or twice if i starts at something other than 0 or 1) outside of the loop and use the angle addition formula to do accumulation via multiplication and addition inside the loop. The loop will run significantly faster.

Even if you only need one of cosf() or sinf(), many CPUs calculate both values at the same time, so taking the other is free. If you only need single precision values, you can do this in double precision to avoid much of the errors you would get by doing this in single precision.
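
A minimal C sketch of the accumulation, assuming i starts at 0. The function name and signature are made up for this example: the caller computes cos(C) and sin(C) once (for instance with cos() and sin() from <math.h>) and passes them in as cos_step and sin_step.

```c
#include <stddef.h>

/* Fill cos_out[i] = cos(i*C) and sin_out[i] = sin(i*C) for i in [0, n).
   cos_step and sin_step are cos(C) and sin(C), computed once by the
   caller. The running values are kept in double and rounded to float
   only when stored. */
static void fill_sincos(float *cos_out, float *sin_out, size_t n,
                        double cos_step, double sin_step)
{
    double c = 1.0;   /* cos(0) */
    double s = 0.0;   /* sin(0) */
    for (size_t i = 0; i < n; i++) {
        cos_out[i] = (float)c;
        sin_out[i] = (float)s;
        /* Angle addition: advance from i*C to (i+1)*C. */
        double c_next = c * cos_step - s * sin_step;
        double s_next = s * cos_step + c * sin_step;
        c = c_next;
        s = s_next;
    }
}
```

Rounding error still accumulates from one iteration to the next, so for very long loops it may be worth recomputing the exact values every so often.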

This trick can be used to accelerate the RoPE relative positional encoding calculations used in inference for llama 3 and likely others. I have done this and seen a measurable speed up, although these calculations are such a small part of inference that it was a small improvement.

25. kens ◴[16 May 25 16:39 UTC] No.44007422{3}[source]▶

>>44001978 #

> Binary multiplication has not changed in millennia. Ancient Egyptians were using the same binary integer multiplication logic 5 millennia ago as ALUs do today.

It turns out that multiplication in modern ALUs is very different. The Pentium, for instance, does multiplication using base-8, not base-2, cutting the number of additions by a factor of 3. It also uses Booth's algorithm, so much of the time it is subtracting, not adding.

26. ◴[16 May 25 18:08 UTC] No.44008285{3}[source]▶

>>44002802 #

27. ryao ◴[16 May 25 18:25 UTC] No.44008439{3}[source]▶

>>44002802 #

My comment had been meant for C, but it should apply to C++ too even when operator overloading is used, provided the comparisons are simple and inlined. If you add overloads for the > and < operators in your string example to a place where they would inline, and the overload compares .length(), this should simplify. For example, godbolt shows that CMP(X, Y) == 0 is optimized to one mov instruction and one cmp instruction despite operator overloads when I implement your string example:

https://godbolt.org/z/nGbPhz86q

If you did not inline the operator overloads and had them in another compilation unit, do not expect this to simplify (unless you use LTO).

If you have compound comparators in the operator overloads (such that on equality in one field, it considers a second for a tie breaker), I would not expect it to simplify, although the compiler could surprise me.

replies(1): >>44011879 #

28. Someone ◴[16 May 25 19:08 UTC] No.44008837[source]▶

>>43999992 #

And branches were cheaper without pipelining

29. eru ◴[17 May 25 03:42 UTC] No.44011879{4}[source]▶

>>44008439 #

I was more thinking of eg lexicographic comparisons of strings, not just comparing by length.

Yes, if you have a smart enough compiler, or a simple enough comparison, this will simplify.

replies(1): >>44023696 #

30. eru ◴[17 May 25 03:43 UTC] No.44011881{4}[source]▶

>>44003877 #

I'd like to see that resolved for eg lexicographic comparison of strings.

31. ryao ◴[18 May 25 19:18 UTC] No.44023696{5}[source]▶

>>44011879 #

You could use CMP(A, B) as part of your lexicographic comparison: compare the strings character by character and output the first non-zero result (unless you find both strings are equal, in which case you would output zero).

If you implement the operators, you can use CMP(A, B) to turn it into a three value output, since it works solely using Boolean logic, but I would be surprised if it simplified. I am half prepared to be surprised since the compiler might do some CSE after inlining and then do some other transformation. That said, you really only want to use CMP(A, B) for numerical comparisons.
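
Roughly what I mean, as a C sketch. The helper name lex_cmp is made up, and the prefix rule (shorter string sorts first) is an assumption about the ordering you want.

```c
#include <stddef.h>

#define CMP(X, Y) (((X) > (Y)) - ((X) < (Y)))

/* Lexicographic three-way compare of two byte buffers
   (hypothetical helper, similar in spirit to memcmp).
   Returns the first non-zero per-byte result; if one buffer is a
   prefix of the other, the shorter one sorts first. */
static int lex_cmp(const unsigned char *a, size_t a_len,
                   const unsigned char *b, size_t b_len)
{
    size_t n = a_len < b_len ? a_len : b_len;
    for (size_t i = 0; i < n; i++) {
        int r = CMP(a[i], b[i]);
        if (r != 0)
            return r;
    }
    return CMP(a_len, b_len);
}
```

The loop with an early exit is the part I would not expect a compiler to collapse when the result is only compared against 0.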

replies(1): >>44026955 #

32. eru ◴[19 May 25 06:22 UTC] No.44026955{6}[source]▶

>>44023696 #

Yes, you can definitely manually define it. I was talking about what we can reasonably expect the compiler to figure out on its own.