386 points by ingve | 19 comments
1. SleepyMyroslav ◴[] No.35738552[source]
If being branchless is an important property of the algorithm, then it is better to enforce it, or at least test for it. If his GCC version gets an update and stops producing the assembly he wants, no one will ever know.

Which brings us back to the regular discussion: C (and C++) does not match hardware anymore. There is no real control over important properties of the generated code. Programmers need tools to control what they write. Plug-and-pray compilation is not a solid engineering approach.
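A minimal sketch of what enforcing it in the source could look like: write the selection purely with integer masks, so there is no branch in the code to begin with (a hypothetical helper, not the article's code, and the backend can still in principle do what it likes):

    #include <cstdint>

    // Branch-free select: returns a if cond is true, else b.
    // -static_cast<uint64_t>(cond) is all-ones when cond is true
    // and all-zeros when it is false.
    inline std::uint64_t select_branchless(bool cond,
                                           std::uint64_t a,
                                           std::uint64_t b) {
        std::uint64_t mask = -static_cast<std::uint64_t>(cond);
        return (a & mask) | (b & ~mask);
    }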

replies(6): >>35738629 #>>35738633 #>>35738677 #>>35738943 #>>35741009 #>>35745436 #
2. gsliepen ◴[] No.35738629[source]
What you gain is hardware independence. There is a lot of variation in CPUs; even if you stick with one vendor, they will have in-order efficiency cores and out-of-order performance cores, and an algorithm optimized for one might not work as well on the other. I think it's better if time is spent by compiler engineers producing good assembly for all CPUs, instead of giving programmers tools to optimize their code for one particular CPU.
3. pjmlp ◴[] No.35738633[source]
The myth that they match has been busted since at least the Pentium came to be.

A good read of Michael Abrash's books explains that quite well, as does playing around with Intel's VTune.

replies(2): >>35738709 #>>35738794 #
4. TuringTest ◴[] No.35738677[source]
Wasn't C created to avoid matching hardware? I.e. not caring about word size, number of registers, instruction set, etc.

I thought the whole point of writing programs in C was originally being able to write portable software (and a portable OS) that could be executed on different machines? It only became specialized in giving machine-specific instructions when other languages took over.

replies(3): >>35739004 #>>35739054 #>>35740495 #
5. touisteur ◴[] No.35738709[source]
For those looking for it: https://www.jagregory.com/abrash-black-book

Even if some/most of the actual tricks are not up to date (ahem), the whole book is filled with techniques, stories, concepts... It's more than ever a Zen-of-optimization opus.

Can someone on HN who is close to Michael Abrash tell him that, should he write again, about whatever he wants, even gardening or Vulkan wrangling, he has guaranteed readers.

replies(1): >>35739282 #
6. SleepyMyroslav ◴[] No.35738794[source]
If that myth was busted 20 years ago, around the time I started working with Pentiums, then the HN crowd never got the memo.

Even right now I have two replies above yours that completely ignore the point of the discussed article, which is that the algorithm is 'branchless'.

PS. I agree with the comment in this topic from 'mgaunard' that the algorithm should have been written as explicitly branchless.

replies(1): >>35738885 #
7. pjmlp ◴[] No.35738885{3}[source]
The HN crowd has never gotten the memo on many subjects; another one is how game development culture differs from FOSS.
8. Arnt ◴[] No.35738943[source]
We lost that control when CPUs became increasingly multilayered and complicated: 20-step pipelines, enough parallelism to run a hundred instructions at the same time, micro-ops and microcode, branch predictors good enough to unroll some loops completely. What programmer today understands the branch prediction logic used on a production system? Well enough to understand what difference branchlessness makes?

And it doesn't seem to hurt. I, at least, have worked both on systems where I can read and write assembly and on one where I can't, and my "need to control" isn't such that the lack of assembly knowledge makes a significant impact on my effectiveness on the latter platform.

9. mcv ◴[] No.35739004[source]
Exactly. I'd rather have a way to guarantee this algorithm is branchless regardless of the compiler and underlying hardware. Although I suppose it depends on the underlying hardware whether being branchless offers any advantage at all.

Of course compilers should be optimizing for the hardware they're compiling for, but this article shows that can be very hit-or-miss.

10. pjmlp ◴[] No.35739054[source]
The whole point of creating C was to make UNIX portable; there had already been other efforts with the same purpose going on in the industry since JOVIAL's creation in 1958.
11. liendolucas ◴[] No.35739282{3}[source]
Slightly off-topic... There was a post on HN about how to set up a nice retro DOS development environment on Linux. Despite searching for it, I can't find it... If a gentle soul remembers it, much appreciated. I think a good DOS environment is a must in order to follow the book.
replies(1): >>35753660 #
12. segfaultbuserr ◴[] No.35740495[source]
> Wasn't C created to avoid matching hardware?

Not exactly. C code from the early Research Unix days often makes very specific assumptions about how the hardware behaves. As a starter, C was created in an age when 36-bit mainframes still ruled the world, yet it settled on 8-bit bytes rather than 6-bit characters, because the PDP-11 is a 16-bit machine.

More precisely, you could say C was created to avoid writing assembly code. Or you could say the original purpose of C was to create a minimal high-level programming language that clears the lowest bar of portability, and no more. C itself is a paradox: it's sometimes known as "portable assembly", a name that captures the contradiction.

On one hand, it provided a lightweight abstraction layer that allowed basic high-level programming while still being simple enough that a compiler could be written for, or ported to, a new platform easily (for early C, at least).

On the other hand, C was in fact intimately tied to the hardware platform it ran on. Originally, the operations in C were designed to compile directly to its original hardware platform, the PDP-11, rather than being defined by a formal mathematical specification. The behavior of C was basically "the most natural result of the platform it's running on." This is why C has a ton of undefined behaviors. But paradoxically, this is also what made C portable: it could be matched directly to hardware without heavy abstractions, and thus C stayed simple.

Today we like to see C as portable, so "never rely on unspecified and undefined behaviors" is the rule, and language lawyers tell us that C should be understood in terms of an abstract machine in the symbolic sense. Compilers perform increasingly complicated and aggressive logic and symbolic transformations for optimization and vectorization, under the assumption that no undefined behavior occurs.

But if you read early C programs on Unix, you will see that developers made liberal use of unspecified and undefined behaviors, with very specific assumptions about their machines; in early C, undefined behaviors were arguably a feature, not a bug.

C didn't support floating-point numbers until a hardware FPU was installed in the PDP-11, and even then it only supported double-precision math, not single-precision, simply because the PDP-11 FPU had a global mode, making mode-switching messy, so Unix developers didn't want to manage it in the kernel. The famous Unix comment "you are not expected to understand this" marked code that went as far as depending on the exact assembly generated by the compiler (to be fair, it was only a temporary hack and was later removed, but it shows how C was capable of being used). Meanwhile, under today's portability requirements, C programmers are not even supposed to assume that signed integers use two's complement encoding (before C23), and signed overflow remains undefined!
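To give a small, generic illustration of how aggressive these transformations can be (not code from the article): because signed overflow is undefined, GCC and Clang at -O2 will typically fold the whole comparison below to a constant:

    // The optimizer may assume x + 1 never overflows, since signed
    // overflow is undefined, and compile this to `return true;`.
    bool always_true(int x) {
        return x + 1 > x;
    }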

So there is an inherent contradiction inside C: it is simultaneously portable and machine-dependent.

The original C was "it is what the hardware is" (but still portable at large, because of its simplicity); today's C is "it is what the abstract machine is, as defined by esoteric rules from language lawyers."

To show this conflict, I would quote Linus Torvalds:

> Yeah, let's just say that the original C designers were better at their job than a gaggle of standards people who were making bad crap up to make some Fortran-style programs go faster.

I don't exactly agree with Linus, and I don't believe today's heavy symbolic transformation and auto-vectorization should be taken away from C; I don't believe we should go back to "pcc", where the compiler did little more than straight translation. I think it's reasonable to demand highly optimized code, of course. I'm just saying that there is a mismatch between C's hacker-friendly roots and its role as a general-purpose language in industry after it took over the world (ironically, exactly due to its hacker-friendliness). The original hacker-friendly design is just not the most appropriate tool for this job. It was not designed to do this to begin with, and so it has created this unfortunate situation.

So C in today's form is neither hacker-friendly nor production-friendly. But its old "hacker-friendly" image is still deeply attractive, even if it's illusory.

replies(1): >>35750731 #
13. dist1ll ◴[] No.35741009[source]
> Which brings us back to the regular discussion: C (and C++) does not match hardware anymore. There is no real control over important properties of the generated code. Programmers need tools to control what they write. Plug-and-pray compilation is not a solid engineering approach.

Not sure how one relates to the other. Do you want more fine-grained control over the emitted assembly? Or do you want a general-purpose CPU that exposes microarchitectural details (i.e. "matching hardware")?

Only the former can be solved by tooling.

replies(1): >>35741788 #
14. waynecochran ◴[] No.35741788[source]
C++ could add an explicit conditional move, I suppose. The allowed types for `x` and `y` would have to be restrictive:

      x = std::cmove(y, flag);
The compiler would be slightly more compelled to use a hardware conditional move than in the following case:

      if (flag) x = y;
The other option is to do something more like CUDA / SIMD kernels do... every line gets executed, but each instruction inside the "false branch" becomes a no-op. Of course this requires hardware support.
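In SIMD land, the "compute both sides, let a mask pick" idea already exists on x86 as blend instructions. A rough sketch with SSE4.1 intrinsics (illustrative only; compile with -msse4.1):

    #include <smmintrin.h>  // SSE4.1

    // Per 8-bit lane: out = (high bit of cond set) ? a : b, no branch.
    // _mm_blendv_epi8(x, y, mask) picks bytes from y where the mask's
    // high bit is set, otherwise from x.
    static inline __m128i select_lanes(__m128i cond, __m128i a, __m128i b) {
        return _mm_blendv_epi8(b, a, cond);
    }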
replies(1): >>35753963 #
15. eklitzke ◴[] No.35745436[source]
Every C and C++ compiler has supported inline asm for decades, so that's what you should use if you really need to control the assembly output. The fact that you can switch between the two within the same function is one of the selling points of both languages.
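For example, on x86-64 with the GCC/Clang extended-asm dialect, a conditional move can be pinned down explicitly. A rough sketch (a hypothetical helper; not portable, and MSVC is excluded, as the reply below notes):

    // Forces an actual cmov: returns a if cond != 0, else b,
    // regardless of what the optimizer would otherwise pick.
    static inline long select_cmov(long cond, long a, long b) {
        long result = b;
        asm("test %1, %1\n\t"   // set ZF from cond
            "cmovnz %2, %0"     // result = a when cond != 0
            : "+r"(result)
            : "r"(cond), "r"(a)
            : "cc");
        return result;
    }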
replies(1): >>35746161 #
16. muricula ◴[] No.35746161[source]
MSVC does not support inline asm for x86_64 code. Additionally, inline asm can be quite fragile, can impede other optimization opportunities, and it is easy to misstate your inline asm's invariants and side effects to the compiler.
17. rocqua ◴[] No.35750731{3}[source]
I always believed the difference between unspecified and undefined behavior was specifically tuned so that unspecified behavior was fine to use un-portably, whereas undefined behavior was never meant to be used; hence the wording of the standard being carefully chosen.

Of course, this makes it surprising that relying on two's complement is undefined rather than unspecified, as are other cases like a left shift that produces 2^31 (e.g. 1 << 31 with a 32-bit int).

From this story, it sounds like the distinction between undefined and unspecified behavior is newer than I thought, and perhaps more about optimization from the beginning?
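A concrete pair of examples may help mark the boundary (standard C/C++ behavior, not specific to this thread):

    #include <climits>

    int f() { return 1; }
    int g() { return 2; }

    int calls() {
        // Unspecified: f() and g() may be evaluated in either order,
        // but every allowed order is still a well-defined execution.
        return f() - g();
    }

    int next(int x) {
        // Undefined when x == INT_MAX: the standard places no
        // requirements at all on the program's behavior.
        return x + 1;
    }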

18. touisteur ◴[] No.35753660{4}[source]
Yes, a DOS environment is necessary if you want to follow along with the code examples and tackle some of the optimization challenges (but mostly only try to beat the book if you have an actual 8086/286/386/486/Pentium I processor, or a cycle-precise emulator).

I read the whole book(s) thrice, cover to cover, without touching a computer (those were the days) in my teens, and it was easy to follow and full of nuggets for a young aspiring programmer. I had third-hand 8086, 386 and Pentium 75 boxes at the time, but didn't open Turbo C before I'd finished the book, and then it was to try to implement a BSP tree, then a whole 3D stereo (anaglyph) software renderer (inspired by the book).

19. touisteur ◴[] No.35753963{3}[source]
Pushing this a bit further, one could read about SPMD and ISPC (originally by Matt Pharr): https://pharr.org/matt/blog/2018/04/30/ispc-all

I've used it (sparingly) for vectorizable, branchy code and it's mostly been a simple process, with very efficient binaries produced (often beating hand-written intrinsics by an intermediate-level coder, themselves beating the autovectorizer).

Don't know about using it in prod on multi-generational hardware, though.