> 1. Compose the program into several threads of execution, traditionally scheduled and ran by the operating system
Step 0 is missing:
Compose the program into several lanes of execution, traditionally executed via SIMD.
Assuming that threading is the first manifestation of concurrency leaves a massive amount of performance on the table on modern computer architectures, since every core has SIMD units that go unused before a second thread is even spawned.
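To make "lanes" concrete, here is a rough sketch in Rust of a single-threaded sum split into independent lanes. The names `sum_lanes` and `LANES` are made up for illustration, and the width of 8 assumes 256-bit registers holding `f32` values. This uses no explicit intrinsics: it only arranges the work so that an optimizing compiler can typically map the inner loop onto SIMD instructions, which is likely but not guaranteed.

```rust
/// Hypothetical lane count: 8 x f32 fits one 256-bit register (an assumption).
const LANES: usize = 8;

/// Sums a slice using LANES independent partial sums on a single thread.
/// The lanes never depend on each other, so the inner loop is a candidate
/// for auto-vectorization.
fn sum_lanes(data: &[f32]) -> f32 {
    let mut acc = [0.0f32; LANES];
    let mut chunks = data.chunks_exact(LANES);

    // Each full chunk contributes one element to every lane.
    for chunk in &mut chunks {
        for lane in 0..LANES {
            acc[lane] += chunk[lane];
        }
    }

    // Elements that do not fill a whole chunk are added one at a time.
    let tail: f32 = chunks.remainder().iter().sum();

    // Horizontal reduction: collapse the lanes into a single result.
    acc.iter().sum::<f32>() + tail
}
```

Note that the additions happen in a different order than in a plain left-to-right loop, so floating-point results can differ slightly on general input. That reordering is exactly what a compiler is not allowed to do on its own for floats, which is why the lanes have to be spelled out in the program.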