I think the actually interesting, non-obvious part starts at "Redesigning Algorithms For Uniform Work Distribution". All the prior stuff you basically get for free in a functional language that has a thread pool or futures built in. The real question is how you write algorithms, or parts of programs, so that they lend themselves to being run in parallel as small units whose results merge easily again (parallel map-reduce), or maybe don't even need to be merged at all. That is the genuinely difficult part, aside from transforming a mutating program into FP style and picking appropriate data structures.
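As a minimal sketch of what I mean (using Haskell's `parallel` package here, purely as my own example, not anything from the article): if the per-item work is independent and the merge step is an associative fold, the decomposition is trivial and the merge is cheap.

```haskell
import Control.Parallel.Strategies (parMap, rdeepseq)

-- Hypothetical example: count words per document in parallel,
-- then merge with a plain sum. Each chunk of work is independent,
-- and the merge (+) is associative, so results can be combined
-- in any order or grouping.
wordCount :: String -> Int
wordCount = length . words

totalWords :: [String] -> Int
totalWords docs = sum (parMap rdeepseq wordCount docs)
```

(Compile with -threaded and run with +RTS -N to actually use multiple cores.) The hard part in real programs is getting the problem into this shape in the first place.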
And then of course the heuristics start to become important: how much parallelism can you add before the overhead eats the speedup?
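Concretely, the granularity knob often looks something like this (again my own sketch, assuming the same `parallel` package): you batch work into chunks per spark and tune the chunk size by measurement.

```haskell
import Control.Parallel.Strategies (using, parListChunk, rdeepseq)

-- Hypothetical granularity control: evaluate the result list in chunks
-- of n elements per spark. Too small a chunk and scheduling overhead
-- dominates; too large and cores sit idle. The right n is usually
-- found empirically, not derived.
chunkedMap :: Int -> (a -> Int) -> [a] -> [Int]
chunkedMap n f xs = map f xs `using` parListChunk n rdeepseq
```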
Another question is energy efficiency. Is it more important to finish the calculation as quickly as possible, or would it be OK to take longer but do less work in total, thanks to less overhead and little or no merging?