You see this most obviously, even visually, in places like game engines. In Unity, the difference between non-Burst and Burst-compiled code is dramatic. By comparison, the difference between single-core and multi-core execution in the job system is often irrelevant: if each job doesn't burn enough CPU time, the benefit of going multicore evaporates. Sending a job out to be run on the fleet carries real overhead, and it has to be worth that one-time 100x latency cost both ways.
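You can see the same tradeoff outside of Unity with nothing but plain C++ threads. This is a deliberately crude sketch, not Unity code: a real job system keeps a persistent worker pool, so its per-job overhead is far smaller than spawning a thread, but the shape of the argument is the same. The names and sizes here (`tiny_job`, `kCount`, `kJobs`) are made up for illustration.

```cpp
// Crude sketch: when per-job work is tiny, dispatch overhead wins.
#include <chrono>
#include <cstdio>
#include <thread>
#include <vector>

// Deliberately tiny per-job workload.
static void tiny_job(float* data, size_t begin, size_t end) {
    for (size_t i = begin; i < end; ++i) data[i] = data[i] * 2.0f + 1.0f;
}

int main() {
    constexpr size_t kCount = 1 << 20;
    std::vector<float> data(kCount, 1.0f);

    // Single core: just run the loop.
    auto t0 = std::chrono::steady_clock::now();
    tiny_job(data.data(), 0, kCount);
    auto t1 = std::chrono::steady_clock::now();

    // "Multicore": chop the same work into many small jobs, each handed to a
    // fresh thread. The spawn/join cost is the overhead paid per job.
    constexpr size_t kJobs = 256;
    auto t2 = std::chrono::steady_clock::now();
    std::vector<std::thread> workers;
    for (size_t j = 0; j < kJobs; ++j) {
        size_t begin = j * (kCount / kJobs);
        size_t end = begin + (kCount / kJobs);
        workers.emplace_back(tiny_job, data.data(), begin, end);
    }
    for (auto& w : workers) w.join();
    auto t3 = std::chrono::steady_clock::now();

    auto us = [](auto a, auto b) {
        return std::chrono::duration_cast<std::chrono::microseconds>(b - a).count();
    };
    std::printf("single core:    %lld us\n", (long long)us(t0, t1));
    std::printf("many tiny jobs: %lld us\n", (long long)us(t2, t3));
}
```

On most machines the "multicore" version loses badly, because each job is a microsecond or two of arithmetic wrapped in tens of microseconds of dispatch.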
The GPU is the ultimate example of this. Some workloads benefit dramatically from its massive parallelism; others are entirely infeasible by comparison. This is at the heart of my problem with the current machine learning research paradigm: some ML techniques run terribly on a GPU, yet we seem to have convinced ourselves that a GPU is a prerequisite for any kind of ML work. It all boils down to the latency of the compute. Getting data in and out of a GPU takes an eternity compared to hitting L1 cache. And there are other fundamental problems with GPUs, like warp divergence, that preclude clever workarounds.
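To put a rough shape on that, here is a minimal CUDA sketch that times a full round trip (copy in, trivial kernel, copy out) for a deliberately small buffer. The kernel name and sizes are made up, and the numbers will vary by hardware; the point is only that the copies and the launch dominate when the arithmetic itself is tiny.

```cpp
// Minimal CUDA sketch: round-trip latency vs. a trivial amount of compute.
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

__global__ void scale(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = data[i] * 2.0f + 1.0f;
    // If neighboring threads in a warp took different branches here, the warp
    // would execute both paths serially -- that's warp divergence.
}

int main() {
    const int n = 4096;                      // deliberately small workload
    const size_t bytes = n * sizeof(float);

    float* host = (float*)malloc(bytes);
    for (int i = 0; i < n; ++i) host[i] = 1.0f;

    float* dev = nullptr;
    cudaMalloc(&dev, bytes);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    cudaMemcpy(dev, host, bytes, cudaMemcpyHostToDevice);  // host -> device
    scale<<<(n + 255) / 256, 256>>>(dev, n);               // the "real" work
    cudaMemcpy(host, dev, bytes, cudaMemcpyDeviceToHost);  // device -> host
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    // The copies and launch typically cost tens of microseconds, while 4096
    // fused multiply-adds on data already sitting in L1 would take nanoseconds.
    std::printf("round trip for %d floats: %.3f ms\n", n, ms);

    cudaFree(dev);
    free(host);
    return 0;
}
```

The comment inside the kernel is where the second problem lives: threads in a warp execute in lockstep, so naturally branchy code ends up running every taken path serially and never sees the throughput the spec sheet promises.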