Unsure how this is defined (*) but graph cutting approaches to concurrent task scheduling is both pessimistic (poor utilization of available resources) and (iirc) NP-hard, so you pay an big cost upfront.
On the other hand, if you know the indegree/outdegree of each node at the time they are visited (meaning the graph is static) you can run Kahn's algorithm concurrently and put each node into a shared work queue. You can optimize that further by having per-thread work queues and stealing between them. Depending on what the nodes represent there are even more optimizations and heuristics, concurrent task scheduling is a hard problem.
* imagine the graph
(a, b) (a, c) (b, d) (c, d)
Is it possible to get nodes b and c in parallel "subsets" in your library?