A leap year check in three instructions

(hueffner.de)

1. drewg123 ◴[15 May 25 22:47 UTC] No.44000124[source]▶

I tend to be of the opinion that for modern general purpose CPUs in this era, such micro-optimizations are totally unnecessary because modern CPUs are so fast that instructions are almost free.

But do you know what's not free? Memory accesses[1]. So when I'm optimizing things, I focus on making things more cache friendly.

[1] http://gec.di.uminho.pt/discip/minf/ac0102/1000gap_proc-mem_...

replies(14): >>44000191 #>>44000255 #>>44000266 #>>44000351 #>>44000378 #>>44000418 #>>44000430 #>>44000433 #>>44000478 #>>44000639 #>>44000687 #>>44001113 #>>44001140 #>>44001975 #

2. drewg123 ◴[15 May 25 22:57 UTC] No.44000191[source]▶

>>44000124 (TP) #

To temper this slightly, these sorts of optimizations are useful on embedded CPUs for device firmware, IOT, etc. I've worked on smart NIC CPUs where cycles were so precious we'd do all kinds of crazy unreadable things.

replies(2): >>44000245 #>>44000547 #

3. sitzkrieg ◴[15 May 25 23:05 UTC] No.44000245[source]▶

>>44000191 #

on the flip side of the topic, trying to do any datetime handling on the edge of embedded compute is going to be wrong 100% of the time anyway

replies(1): >>44000441 #

4. kreco ◴[15 May 25 23:06 UTC] No.44000255[source]▶

>>44000124 (TP) #

> such micro-optimizations are totally unnecessary because modern CPUs are so fast that instructions are almost free.

I'm amazed by the fact there is always someone who will say that such optimization are totally unnecessary.

replies(2): >>44000462 #>>44001066 #

5. andrepd ◴[15 May 25 23:08 UTC] No.44000266[source]▶

>>44000124 (TP) #

> I tend to be of the opinion that for modern general purpose CPUs in this era, such micro-optimizations are totally unnecessary because modern CPUs are so fast that instructions are almost free.

What does this mean? Free? Optimisations are totally unnecessary because... instructions are free?

The implementation in TFA is probably on the order of 5x more efficient than a naive approach. This is time and energy as well. I don't understand what "free" means in this context.

Calendar operations are performed probably trillions of times every second across all types of computers. If you can make them more time- and energy-efficient, why wouldn't you?

If there's a problem with modern software it's too much bloat, not too much optimisation.

replies(2): >>44000312 #>>44000329 #

6. jdlshore ◴[15 May 25 23:15 UTC] No.44000312[source]▶

>>44000266 #

GP made an important point that you seemed to have missed: in modern architectures, it’s much more important to minimize memory access than to minimize instructions. They weren’t saying optimization isn’t important, they were describing how to optimize on modern systems.

7. drewg123 ◴[15 May 25 23:18 UTC] No.44000329[source]▶

>>44000266 #

If this is indeed done trillions of times a second, which I frankly have a hard time believing, then sure, it might be worth it. But on a modern CPU, focusing on an optimization like this is a poor use of developer resources. There are likely several other optimizations related to cache locality that you could find in less time than it would take to do this, and those other optimizations would probably give several orders of magnitude more improvement.

Not to mention that the final code is basically a giant WTF for anybody reading it. It will be an attractive nuisance that people will be drawn to, like moths to a flame, any time there is a bug around calendar operations.

replies(2): >>44000402 #>>44000448 #

8. croes ◴[15 May 25 23:21 UTC] No.44000351[source]▶

>>44000124 (TP) #

> The world could run on older hardware if software optimization was a priority

https://news.ycombinator.com/item?id=43971464

replies(1): >>44002989 #

9. PaulKeeble ◴[15 May 25 23:27 UTC] No.44000378[source]▶

>>44000124 (TP) #

The thing is about these optimisations (assuming they test as higher performance) is that they can get applied in a library and then everyone benefits from the speedup that took some hard graft to work out. Very few people bake their own date API nowadays if they can avoid it since it already exists and techniques like this just speed up every programme whether its on the critical path or not.

replies(1): >>44000391 #

10. codexb ◴[15 May 25 23:30 UTC] No.44000391[source]▶

>>44000378 #

That's basically compilers these days. It used to be that you could try and optimize your code, inline things here and there, but these days, you're not going to beat the compiler optimization.

replies(6): >>44000550 #>>44000584 #>>44000692 #>>44000889 #>>44000980 #>>44001055 #

11. wtetzner ◴[15 May 25 23:32 UTC] No.44000402{3}[source]▶

>>44000329 #

> But on a modern CPU, focusing on an optimization like this is a poor use of developer resources.

How many people are rolling their own datetime code? This seems like a totally fine optimization to put into popular datetime libraries.

12. achierius ◴[15 May 25 23:36 UTC] No.44000418[source]▶

>>44000124 (TP) #

Just because CPU performance is increasing faster than DRAM speeds doesn't mean that CPU performance is "free" while memory is "expensive". One thing that you're ignoring is the impact of caches and prefetching logic which have significantly improved the performance of memory-bound workloads in the last 5-10 years. DRAM might be slow, but if you avoid going out to DRAM...

More broadly, it 100% depends on your workload. You'd be surprised at how many workloads are compute-bound, even today: LLM inference might be memory bound (thus how it's possible to get remotely good performance on a CPU), but training, esp. prefill, is very much not. And once you get out of LLMs, I'd say that most applications of ML tend to be compute bound.

13. GuB-42 ◴[15 May 25 23:39 UTC] No.44000430[source]▶

>>44000124 (TP) #

Integer division (and modulo) is not cheap on most CPUs. Along with memory access and branch prediction, it is something worth optimizing for.

And since you are talking about memory. Code also goes in memory. Shorter code is more cache friendly.

I don't see a use case where it matters for this particular application (it doesn't mean there isn't) but well targeted micro-optimizations absolutely matter.

14. klysm ◴[15 May 25 23:39 UTC] No.44000433[source]▶

>>44000124 (TP) #

I think about branches a lot too when optimizing

15. RaoulP ◴[15 May 25 23:41 UTC] No.44000441{3}[source]▶

>>44000245 #

Would you mind elaborating?

16. andrepd ◴[15 May 25 23:42 UTC] No.44000448{3}[source]▶

>>44000329 #

> There are likely several other optimizations related to cache locality that you could find in less time than it would take to do this, and those other optimizations would probably give several orders of magnitude more improvement.

How is cache / memory access relevant in a subroutine that performs a check on a 16bit number?

> Not to mention that the final code is basically a giant WTF for anybody reading it. It will be an attractive nuisance that people will be drawn to, like moths to a flame, any time there is a bug around calendar operations.

1: comments are your friend

2: a unit test can assert that this function is equivalent to the naive one in about half a millisecond.

17. recursive ◴[15 May 25 23:45 UTC] No.44000462[source]▶

>>44000255 #

Some people have significant positions on CPU manufacturers, so there will always be at least a few.

18. adonovan ◴[15 May 25 23:48 UTC] No.44000478[source]▶

>>44000124 (TP) #

Quite right. If you use a compiled language (e.g. Go) the difference between the two implementations is indeed negligible.

https://go.dev/play/p/i72xCyhqRkC

19. bigiain ◴[16 May 25 00:02 UTC] No.44000547[source]▶

>>44000191 #

I suspect most IOT device manufacturers expect/design their device to be landfill before worrying about leap year math. (In my least optimistic moments, I suspect some of them may intentionally implement known broken algorithms that make their eWaste stop working correctly at some point in the near future that's statistically likely to bear beyond the warranty period.)

replies(1): >>44001037 #

20. ◴[16 May 25 00:02 UTC] No.44000550{3}[source]▶

>>44000391 #

21. ryao ◴[16 May 25 00:08 UTC] No.44000584{3}[source]▶

>>44000391 #

Meanwhile, GCC will happily implement bsearch() without cmov instructions and the result will be slower than a custom implementation on which it emits cmov instructions. I do not believe anyone has filed a bug report specifically about the inefficient bsearch(), but the bug report I filed a few years ago on inefficient code generation for binary search functions is still open, so I see no point in bothering:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110001

Binary searches on OpenZFS B-Tree nodes are faster in part because we did not wait for the compiler:

https://github.com/openzfs/zfs/commit/677c6f8457943fe5b56d7a...

Eliminating comparator function overhead via inlining is also a part of the improvement, which we would not have had because the OpenZFS code is not built with LTO, so even if the compiler fixes that bug, the patch will still have been useful.

22. kragen ◴[16 May 25 00:18 UTC] No.44000639[source]▶

>>44000124 (TP) #

It's true that this code was optimized from 2.6ns down to 0.9ns, a saving of 1.7ns, while an L2 cache miss might be 80ns. But 1.7ns is still about 2% of the 80ns, and it's about 70% of the 2.6ns. You don't want to start optimizing by reducing things that are 2% of your cost, but 2% isn't insignificant.

The bigger issue is that probably you don't need to do leap-year checks very often so probably your leap-year check isn't the place to focus unless it's, like, sending a SQL query across a data center or something.

23. __turbobrew__ ◴[16 May 25 00:27 UTC] No.44000687[source]▶

>>44000124 (TP) #

This is why linear arrays are the fastest datastructure unless proven otherwise.

24. kragen ◴[16 May 25 00:27 UTC] No.44000692{3}[source]▶

>>44000391 #

That is a meme that people repeat a lot, but it turns out to be wrong:

https://cr.yp.to/talks/2015.04.16/slides-djb-20150416-a4.pdf (though see https://blog.regehr.org/archives/1515: "This piece (...) explains why Daniel J. Bernstein’s talk, The death of optimizing compilers (audio [http://cr.yp.to/talks/2015.04.16/audio.ogg]) is wrong", citing https://news.ycombinator.com/item?id=9397169)

https://blog.royalsloth.eu/posts/the-compiler-will-optimize-...

http://lua-users.org/lists/lua-l/2011-02/msg00742.html

https://web.archive.org/web/20150213004932/http://x264dev.mu...

replies(1): >>44007896 #

25. matheusmoreira ◴[16 May 25 01:07 UTC] No.44000889{3}[source]▶

>>44000391 #

These days optimizing compilers are your number one enemy.

They'll "optimize" your code by deleting it. They'll "prove" your null/overflow checks are useless and just delete them. Then they'll "prove" your entire function is useless or undefined and just "optimize" it to a no-op or something. Make enough things undefined and maybe they'll turn the main function into a no-op.

In languages like C, people are well advised to disable some problematic optimizations and explicitly force the compiler to assume some implementation details to make things sane.

replies(1): >>44001068 #

26. nerdralph ◴[16 May 25 01:21 UTC] No.44000980{3}[source]▶

>>44000391 #

I was disappointed with how bad avr-gcc was at optimizing code. Here's one of several examples I found. https://nerdralph.blogspot.com/2015/03/fastest-avr-software-...

27. crote ◴[16 May 25 01:34 UTC] No.44001037{3}[source]▶

>>44000547 #

"Is year divisible by four" will work perfectly fine for the next 75 years. Random consumer devices are definitely not going to survive that long, so being capable of dealing with it adds exactly zero value to the product.

It's hard to imagine being in a situation where the cost of the additional "year divisible by 100" check, or even the "year divisible by 400" check is too much to bear, and it's trivial enough that the developer overhead is negligible, but you never know when you need those extra handful of bytes of memory I guess...

replies(1): >>44012244 #

28. godelski ◴[16 May 25 01:38 UTC] No.44001055{3}[source]▶

>>44000391 #

Definitely not true.

You aren't going to beat the compiler if you have to consider a wide range of inputs and outputs but that isn't a typical setting and you can actually beat them. Even in general settings this can be true because it's still a really hard problem for the compiler to infer things you might know. That's why C++ has all those compiler hints and why people optimize with gcc flags other than -O.

It's often easy to beat Blas in matrix multiplication if you know some conditions on that matrix. Because Blas will check to find the best algo first but might not know (you could call directly of course and there you likely won't win, but you're competing against a person not the compiler).

Never over estimate the compiler. The work the PL people do is unquestionably useful but they'll also be the first to say you can beat it.

You should always do what Knuth suggested (the often misunderstood "premature optimization" quote) and get the profile.

29. godelski ◴[16 May 25 01:39 UTC] No.44001066[source]▶

>>44000255 #

I'm amazed by the fact there is always someone who misinterprets Knuth's "premature optimization", reading as "don't optimize" instead of "pull out the profiler"

30. ryao ◴[16 May 25 01:39 UTC] No.44001068{4}[source]▶

>>44000889 #

If they prove a NULL check is always false, it means you have dead code.

For example:

  if (p == NULL) return;

  if (p == NULL) doSomething();

It is safe to delete the second one. Even if it is not deleted, it will never be executed.

What is problematic is when they remove something like memset() right before a free operation, when the memset() is needed to sanitize sensitive data like encryption keys. There are ways of forcing compilers to retain the memset(), such as using functions designed not to be optimized out, such as explicit_bzero(). You can see how we took care of this problem in OpenZFS here:

https://github.com/openzfs/zfs/pull/14544

replies(2): >>44001488 #>>44002944 #

31. godelski ◴[16 May 25 01:48 UTC] No.44001113[source]▶

>>44000124 (TP) #

  > modern CPUs are so fast that instructions are almost free.

Please don't.

These things compound. You especially need to consider typical computer usage involves using more than one application at a time. There's a tragedy of the commons issue that's often ignored. It can be if you're optimizing your code (you're minimizing your share!) but it can't be if you're not.

I guarantee you we'd have a lot of faster things if people invested even a little time (these also compound :). Two great examples might be Llama.cpp and FlashAttention. Both of these have had a huge impact of people (among a number of other works) but don't get nearly the same attention as other stuff. These are popular instances but I promise you that there's a million problems like these waiting to be solved. It's just not flashy, but hey plumbers and garbagemen are pretty critical jobs too

replies(1): >>44001211 #

32. ◴[16 May 25 01:52 UTC] No.44001140[source]▶

>>44000124 (TP) #

33. EnPissant ◴[16 May 25 02:05 UTC] No.44001211[source]▶

>>44001113 #

You haven't refuted the parent comment at all. They asserted that instructions are insignificant, and other things, such as memory accesses, dominate.

replies(2): >>44003142 #>>44010019 #

34. matheusmoreira ◴[16 May 25 03:10 UTC] No.44001488{5}[source]▶

>>44001068 #

They just think you have lots of dead code because of silly undefined behavior nonsense.

  char *allocate_a_string_please(int n)
  {
      if (n + 1 < n)
          return 0; // overflow

      return malloc(n + 1); // space for the NUL
  }

This code seems okay at first glance, it's a simple integer overflow check that makes sense to anyone who reads it. The addition will overflow when n equals INT_MAX, it's going to wrap around and the function will return NULL. Reasonable.

Unfortunately, we cannot have nice things because of optimizing compilers and the holy C standard.

The compiler "knows" that signed integer overflow is undefined. In practice, it just assumes that integer overflow cannot ever happen and uses this "fact" to "optimize" this program. Since signed integers "cannot" overflow, it "proves" that the condition always evaluates to false. This leads it to conclude that both the condition and the consequent are dead code.

Then it just deletes the safety check and introduces potential security vulnerabilities into the software.

They had to add literal compiler builtins to let people detect overflow conditions and make the compiler actually generate the code they want it to generate.

Fighting the compiler's assumptions and axioms gets annoying at some point and people eventually discover the mercy of compiler flags such as -fwrapv and -fno-strict-aliasing. Anyone doing systems programming with strict aliasing enabled is probably doing it wrong. Can't even cast pointers without the compiler screwing things up.

replies(2): >>44001867 #>>44006492 #

35. oguz-ismail ◴[16 May 25 04:35 UTC] No.44001867{6}[source]▶

>>44001488 #

>if (n + 1 < n)

No one does this

replies(1): >>44004104 #

36. timewizard ◴[16 May 25 04:59 UTC] No.44001975[source]▶

>>44000124 (TP) #

They're not free at all. They're unlikely to create noticeable latency; however, CPU clock speeds and thus power consumption haven't been constant for years now. You are paying for that lack of optimization and mobile users would thank you to do it.

There is no silver bullet.

37. Dylan16807 ◴[16 May 25 08:15 UTC] No.44002944{5}[source]▶

>>44001068 #

If they prove a check is always false, it means you have dead code or you made a mistake.

It is very very hard to write C without mistakes.

When not-actually-dead code gets removed, the consequences of many mistakes get orders of magnitudes worse.

38. Dylan16807 ◴[16 May 25 08:22 UTC] No.44002989[source]▶

>>44000351 #

I disagree.

This is the kind of optimization that makes you need a 1.5MHz CPU instead of a 1MHz CPU, but saves devs weeks of effort. It's the kind of thing you give up when you move optimization from priority 1 to 2, or from 2 to 3. It would still run blazingly fast on a 30 year old computer. It's a perfectly good tradeoff.

The stuff that bogs down modern hardware is optimization being priority 8 or not even on the list of considerations.

39. windward ◴[16 May 25 08:48 UTC] No.44003142{3}[source]▶

>>44001211 #

They do, until you have a tough problem that's still too slow after it's cache efficient

40. matheusmoreira ◴[16 May 25 11:33 UTC] No.44004104{7}[source]▶

>>44001867 #

Oh people absolutely do this.

Here's a 2018 example.

https://github.com/mruby/mruby/commit/180f39bf4c5246ff77ef71...

https://github.com/mruby/mruby/issues/4062

  while (l >= bsiz - blen) {
      bsiz *= 2;

      if (bsiz < 0)
          mrb_raise(mrb, E_ARGUMENT_ERROR, "too big specifier");
  }

> bsiz*=2 can become negative.

> However with -O2 the mrb_raise is never triggered, since bsiz is a signed integer.

> Signed integer overflows are undefined behaviour and thus gcc removes the check.

People have even categorized this as a compiler vulnerability.

https://www.kb.cert.org/vuls/id/162289

> C compilers may silently discard some wraparound checks

And they aren't wrong.

The programmer wrote reasonable code that makes sense and perfectly aligns with their mental model of the machine.

The compiler took this code and screwed it up because it violates compiler assumptions about some abstract C machine nobody really cares about.

replies(2): >>44004726 #>>44006869 #

41. Joker_vD ◴[16 May 25 12:39 UTC] No.44004726{8}[source]▶

>>44004104 #

I propose we rewrite everything in my yet-unnamed new low-level language:

    loop
        while l >= bsize - blen;
        bsiz, ovf := bsiz * 2;
        if ovf <> 0 then
            mrb_raise(mrb, E_ARGUMENT_ERROR, ("too big specifier\x00"));
        end
    end

replies(1): >>44006586 #

42. ryao ◴[16 May 25 15:16 UTC] No.44006492{6}[source]▶

>>44001488 #

I would consider the use of signed integers for sizes to be wrong, but if you insist on using them in this example, just test for (n == INT_MAX). malloc itself uses size_t, which is unsigned.

I have been known to write patches converting signed integers to unsigned integers in places where signed arithmetic makes no sense.

replies(1): >>44016456 #

43. ryao ◴[16 May 25 15:23 UTC] No.44006586{9}[source]▶

>>44004726 #

Just stop using signed integers to hold sizes. malloc itself takes size_t, which is unsigned.

44. ◴[16 May 25 15:49 UTC] No.44006869{8}[source]▶

>>44004104 #

45. kragen ◴[16 May 25 17:25 UTC] No.44007896{4}[source]▶

>>44000692 #

I guess I should mention that https://blog.regehr.org/archives/1515 doesn't dispute that people can pretty much always beat the shit out of optimizing compilers; Regehr explicitly says, "of course there’s plenty of hot code that wants to be optimized by hand." Rather, where he disagrees is whether it's worthwhile to use optimizing compilers for other code.

Daniel Berlin's https://news.ycombinator.com/item?id=9397169 does kind of disagree, saying, "If GCC didn't beat an expert at optimizing interpreter loops, it was because they didn't file a bug and give us code to optimize," but his actual example is the CPython interpreter loop, which is light-years from the kind of hand-optimized assembly interpreter Mike Pall's post is talking about, and moreover it wasn't feeding an interpreter loop to GCC but rather replacing interpretation with run-time compilation. Mostly what he disagrees about is the same thing Regehr disagrees about: whether there's enough code in the category of "not worth hand-optimizing but still runs often enough to matter", not whether you can beat a compiler by hand-optimizing your code. On the contrary, he brings up whole categories of code where compilers can't hope to compete with hand-optimization, such as numerical algorithms where optimization requires sacrificing numerical stability. mpweiher's comment in response discusses other scenarios where compilers can't hope to compete, like systems-level optimization.

It's worth reading the comments by haberman and Mike Pall in the HN thread there where they correct Berlin about LuaJIT, and kjksf also points out a number of widely-used libraries that got 2–4× speedups over optimized C by hand-optimizing the assembly: libjpeg-turbo, Skia, and ffmpeg. It'd be interesting to see if the intervening 10 years have changed the situation, because GCC and LLVM have improved in that time, but I doubt they've improved by even 50%, much less 300%.

46. godelski ◴[16 May 25 21:42 UTC] No.44010019{3}[source]▶

>>44001211 #

  >> You especially need to consider typical computer usage involves using more than one application at a time. There's a tragedy of the commons issue

These resources include:

  - disk/ssd/long term memory
  - RAM/System memory
  - Cache
  
  BUT ALSO
  - Registers
  - CPU Cores
  - Busses/Lanes/Bandwith
  - Locks
  - Network

My point is that I/O only dominates when you're actually acting efficiently. This is dominating in the case of measuring a single operating program.

You're forgetting that when multiple programs are running that there's a lot more going on. There's a lot more communication going on too. The caches are super tiny and in high competition. To handle interlacing all those instructions. Even a program's niceness can dramatically change total performance. This is especially true when we're talking about unoptimized programs because all those little things that the OS has to manage pile up.

Get out your computer architecture book and do a skim to refresh. Even Knuth's Book (s)[1] discuss much of this because to write good programs you gotta understand the environment they're running in. Otherwise I'd be like trying to build a car but not knowing if you're building it for the city, Antarctica, or even the moon. The environment is critical to the assumptions you can make.

[0] https://en.wikipedia.org/wiki/Nice_(Unix)

[1] https://www-cs-faculty.stanford.edu/~knuth/taocp.html

replies(1): >>44011985 #

47. godelski ◴[17 May 25 04:09 UTC] No.44011985{4}[source]▶

>>44010019 #

Actually had a good real world example today. My TV had an update a month ago and since then most of the apps don't actually work. Netflix plays with like 3 pixels, Hulu just hangs. Luckily I use the TV as a monitor 99% of the time. But I think we all know how slow a lot of these systems get. Things getting slower over time... it works fine when things ship but like I suggested, issues build over time. One app here, another there, and before you know it you're buying a new TV, computer, phone, whatever

48. bigiain ◴[17 May 25 05:28 UTC] No.44012244{4}[source]▶

>>44001037 #

> but you never know when you need those extra handful of bytes of memory I guess...

I guess on all those crappiest of 32 bit devices with 2.6 based kernels (which are practically unupgradeable due to vendor patches[1]) which will never make it to 2100 to the "year divisible by 100" check, because their time_t overflows in 2038.

On the plus side, that overflow turns the date into 1901, so perhaps they won't actually blow up due to leapyear bugs for another couple of years? That'll be a fun vintage computing archeological bug fix for someone.

[1] I am at least partly responsible for one of these products from a failed hardware startup I work at in 2012/13.

49. matheusmoreira ◴[17 May 25 19:39 UTC] No.44016456{7}[source]▶

>>44006492 #

The real problem is the fact compilers constantly screw up perfectly good code by optimizing it based on unreasonable "letter of the law" assumptions.

The issue of signed versus unsigned is tangential at best. There are good arguments in favor of the use of signed and unsigned integers. Neither type should cause compilers to screw code up beyond recognition.

The fact is signed integer overflow makes sense to programmers and works just fine on the machines people care to write code for. Nobody really cares that the C standard says signed integer overflow is undefined. That's just an excuse. If it's undefined, then simply define it.

Of course you can test for INT_MAX. The problem is you have to somehow know that you must do it that way instead of trying to observe the actual overflow. People tend to learn that sort of knowledge by being burned by optimizing compilers. I'd very much rather not have adversarial compilers instead.

replies(1): >>44016552 #

50. ryao ◴[17 May 25 19:55 UTC] No.44016552{8}[source]▶

>>44016456 #

I have been known to fix bugs involving signed integer overflow in C code others wrote.

In any case, send your feedback to the C standard committee. They can change the standard. It would not be a bad change to make.

↑