170 points judicious | 7 comments
1. zackmorris ◴[] No.45407721[source]
I need something like this for a switch() statement (technically a list of arbitrary functions). Sort of like evaluating up to N branches in one step.

The idea is to take some number of inputs A, B, C, ... and conceptually perform all of the possible functions simultaneously, then keep the one that's desired and throw the rest away. For any arbitrary logic. Ideally using fewer operations than all of that, but that's optional. Driven by one variable, it would look like:

  // branch version
  switch(var1)
  {
    case 1:
      var4 = var2 + var3;
      break;
    case 2:
      var4 = var2 - var3;
      break;
    case 3:
      var4 = var2 * var3;
      break;
    case 4:
      var4 = var2 / var3;
      break;
    // ...
    default:
      var4 = 0; // optional and arbitrary
      break;
  }
  
  // branchless version
  var4 = magic(var1, var2, var3);
I don't know how to do this outside of programmable hardware like an FPGA. The problem is that it's extremely challenging to write/solve the formulas that map ordinary arithmetic and bitwise functions to the larger set of functions.

So for now, I may just use a switch() statement in a shader and let it figure out any reusable logic internally. I don't know if shaders have come far enough to allow that performantly, or if they end up just calculating everything and throwing away the paths not taken, which would suggest that the max speed would scale as 1/N for the number of cases.

Does anyone know? Maybe truth tables? Or a SAT solver? I'm also not sure how this would work for floating point, but that's optional too.

Edit: I updated the cases to show how var1 controls the math performed on var2, var3 and var4.

2. judicious ◴[] No.45407851[source]
I'm by no means an expert on this topic, but what about using a predefined hash map from value to function call/closure, unless that also yields branching underneath? Additionally, maybe you want to use macros of some kind to generate the "magic" function you're looking for, constructing a single branchless statement of up to N terms, whether with procedural macros in Rust or C macros.
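A minimal C sketch of the dispatch-table idea (an array rather than a hash map, since the selector is a small integer; the name `magic` and the clamp-to-default behavior are assumptions). The indexed call avoids the switch, though the CPU still performs an indirect jump, and a compiler may lower a dense switch to the same jump table anyway:

```c
#include <stddef.h>

typedef int (*op_fn)(int, int);

static int op_zero(int a, int b) { (void)a; (void)b; return 0; }
static int op_add(int a, int b)  { return a + b; }
static int op_sub(int a, int b)  { return a - b; }
static int op_mul(int a, int b)  { return a * b; }
static int op_div(int a, int b)  { return b != 0 ? a / b : 0; }

/* Slot 0 is the default case; slots 1..4 mirror the switch above. */
static const op_fn ops[] = { op_zero, op_add, op_sub, op_mul, op_div };

int magic(unsigned var1, int var2, int var3)
{
    /* Out-of-range selectors fall through to the default (index 0). */
    size_t idx = var1 < sizeof ops / sizeof ops[0] ? var1 : 0;
    return ops[idx](var2, var3);
}
```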
3. AnotherGoodName ◴[] No.45407994[source]
Think of a way to make each case evaluate to 0 or 1; then you have a single sum over all possibilities, where that 0 or 1 multiplies the portion of the switch statement it represents (0 means not active, 1 includes it).

Eg. if we had just 1 or 2 as options for the case statements, note that test1 = xor(var1 & 0x1, var1>>1 & 0x1) & var1 will equal '1' if var1 is 1, or '0' if var1 is 2, 3 or 4. (If it goes to 5 you need another xor at the next highest bit.)

Likewise test2 = xor(var1 & 0x1, var1>>1 & 0x1) & var1>>1 equals '0' if var1 is 1, 3 or 4, but '1' if it's 2.

So test1 * (value when var1 is 1) + test2 * (value when var1 is 2) will get you a branchless version for the simple 1-or-2 case. This will pretty much always be much slower than the branching version, but some types of systems out there don't really do branching and are linear. This type of zeroing trick is really common on matrix multipliers.

Essentially you're creating binary logic gates that give 0 or 1 for the exact value you want and blindly multiplying that by the value that case would have produced.
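A sketch of this trick in C for the two-case subset (var1 == 1 selects add, var1 == 2 selects subtract; the name `select2` is made up). Note the second test masks with var1 >> 1 so the xor result lines up with bit 0:

```c
/* 0/1 masks built from bit tests, then a masked sum of both cases. */
int select2(unsigned var1, int var2, int var3)
{
    unsigned x = (var1 & 1u) ^ ((var1 >> 1) & 1u); /* 1 iff var1 is 1 or 2 */
    unsigned test1 = x & var1;                     /* 1 iff var1 == 1 */
    unsigned test2 = x & (var1 >> 1);              /* 1 iff var1 == 2 */
    /* Both results are computed; the 0/1 factors keep only one. */
    return (int)test1 * (var2 + var3) + (int)test2 * (var2 - var3);
}
```

As the comment notes, this only discriminates correctly up to var1 == 4; larger selectors need more bit tests.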

4. secondcoming ◴[] No.45408378[source]
Compilers would turn that into a jump table:

https://godbolt.org/z/14h5djYe8

Although this looks branchless if you don't care about divide-by-zero. The ALU might not be too happy:

https://godbolt.org/z/9nqG3xMPY

5. fwip ◴[] No.45408469[source]
Maybe I'm missing something obvious, but how about:

    vars[1] = var2 + var3;
    vars[2] = var2 - var3;
    ...
    vars[0] = 0;
    var4 = vars[var1];
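A filled-in version of that sketch for the four cases in the original switch (the wrapper name `magic_all` and the clamping of out-of-range selectors to the default slot are assumptions). Every case is computed unconditionally, then the selector indexes the result:

```c
int magic_all(unsigned var1, int var2, int var3)
{
    int vars[5];
    vars[0] = 0;                           /* default case */
    vars[1] = var2 + var3;
    vars[2] = var2 - var3;
    vars[3] = var2 * var3;
    vars[4] = var3 != 0 ? var2 / var3 : 0; /* guard divide-by-zero */
    return vars[var1 < 5 ? var1 : 0];
}
```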
6. yccs27 ◴[] No.45411069[source]
The most general approach is to calculate all the different cases, and then do a branchless selection:

    var4 = (var1==1)*(var2+var3) | (var1==2)*(var2-var3) | ...
Of course this is basically the slowest possible option, but it might work as a starting point for a general-purpose optimizer to find a faster solution. If it doesn't, this will likely compile to conditional move instructions.

It might help if you replace the comparison and multiplication with an equivalent expression made from bitwise operations, but I believe most compilers already know how to do this transformation.
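One way to write the bitwise variant this paragraph alludes to (the name `select_masked` is made up): -(var1 == k) yields an all-ones word when the test holds and zero otherwise, so AND-masking replaces the multiply:

```c
/* All-ones/all-zero masks instead of 0/1 multipliers. */
int select_masked(unsigned var1, int var2, int var3)
{
    int r = 0;
    r |= -(int)(var1 == 1) & (var2 + var3);
    r |= -(int)(var1 == 2) & (var2 - var3);
    r |= -(int)(var1 == 3) & (var2 * var3);
    return r; /* 0 for any other selector */
}
```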

7. zackmorris ◴[] No.45463140[source]
Thank you everyone for your answers, they got me thinking in new ways.

In case anyone ever finds this, I realized that a brute force solution would be a lookup table of all combinations. For three 8-bit variables, that's 2^24 = 16,777,216 entries, or 16 MB of EPROM.
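The brute-force table could be sketched like this in C (the helper names and the choice of cases 1..4 with a 0 default are assumptions; index layout is var1 in the high byte):

```c
#include <stdlib.h>

#define LUT_SIZE (1u << 24) /* three 8-bit inputs -> 16 MB table */

static unsigned char compute(unsigned v1, unsigned char v2, unsigned char v3)
{
    switch (v1) {
    case 1:  return (unsigned char)(v2 + v3);
    case 2:  return (unsigned char)(v2 - v3);
    case 3:  return (unsigned char)(v2 * v3);
    case 4:  return v3 ? (unsigned char)(v2 / v3) : 0;
    default: return 0;
    }
}

/* Precompute every (var1, var2, var3) combination once; afterwards the
   "switch" is a single load: lut[(var1 << 16) | (var2 << 8) | var3]. */
unsigned char *build_lut(void)
{
    unsigned char *lut = malloc(LUT_SIZE);
    if (!lut) return NULL;
    for (unsigned i = 0; i < LUT_SIZE; i++)
        lut[i] = compute(i >> 16, (i >> 8) & 0xFF, i & 0xFF);
    return lut;
}
```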

Then we could apply truth-table simplification techniques like Karnaugh maps, sum of products (SOP), product of sums (POS), etc.:

https://workforce.libretexts.org/Bookshelves/Electronics_Tec...

https://tma.main.jp/logic/index_en.html

https://karnaughmapsolver.com/

I asked AI, and if the bits approximate a pseudorandom pattern, then we would need approximately log2(2^24) = 24 levels of 2-input boolean logic gates. For NAND or NOR gates, apparently that's something like 3x2^27 ≈ 400 million gates. If we could manage 24-input gates somehow, that might be reduced to ceil(log2(24)) = 5 levels, but internally that would occupy about the same amount of die area. Perhaps using a hybrid of gates, multiplexers and EPROMs could result in a much smaller circuit at the cost of higher propagation delay (latency), which is the approach FPGAs take.

We could then play with the ordering of the 256 possible values of the control variable var1 to see if any assignment results in a reduced number of logic levels (by maximizing the size of Karnaugh map loops). I suspect that might only save a few levels at most, but even that could cut the number of gates by 50-90%.

Thankfully the control variable var1 might only have 16-64 possibilities, which lowers it to 4-6 bits, or just 2^20 = 1,048,576 to 2^22 = 4,194,304 entries (1-4 MB of EPROM), but still roughly 3x2^25 ≈ 100 million gates worst case.

For 16, 32 or 64 bit integers or floats, it might be better to just calculate all cases with dedicated hardware and select the right one.