←back to thread

110 points ww520 | 1 comments | | HN request time: 0.365s | source

I believe the best way to learn a language is by doing an in-depth project. This is my first Zig project intended for learning the ropes on publishing a Zig package. It turns out to be quite solid and performant. It might be a bit over-engineered.

This little library is packed with the following features:

  - Building dependency graph from dependency data.
  - Performing topological sort on the dependency graph.
  - Generating dependence-free subsets for parallel processing.
  - Cycle detection and cycle reporting.
Show context
jitl ◴[] No.43552526[source]
For something like task scheduling in a build system, if you use an up-front partition of the tasks like this, you may have a task that has all its dependencies fulfilled because tasks may have different durations, but still remains unscheduled.

For example, with this input:

    $ cat data.file
    root: parentA parentB
    parentA: C D
    parentB: E F
Asking the package tool for the sorted sets gives me:

    $ ./zig-out/bin/toposort-cli --data data.file
    Processing succeeded.
      Topologically sorted sets: [ { C D E F }  { parentA parentB }  { root }  ]
      Topologically sorted list: [ C D E F parentA parentB root ]
      Nodes: [ root parentA parentB C D E F ]
      Dependency tree:
      [ C D E F ]
        C -> [ parentA ]
          parentA -> [ root ]
            root ->
        D -> [ parentA ]
          parentA -> [ root ]
            root ->
        E -> [ parentB ]
          parentB -> [ root ]
            root ->
        F -> [ parentB ]
          parentB -> [ root ]
            root ->
        
If we follow the README's advice to parallel process set-by-set, we can easily have starvation: once `C` and `D` are finished, we are ready to start `parentA`, even if E and F are still work-in-progress.

How would I use the API of this package to detect and start the task `parentA` as soon as `C` and `D` are finished? I guess when a task is finished, I can ask for the list of dependent tasks, and for each of them, check if all of their dependencies are finished, and if so start the task. But this real-world need feels un-met; it feels odd to me to focus on the sorted sets instead of more practical scheduling.

That is kind of doing Kahn's algorithm iteratively during the build. It would be cool to try and optimize that for maximum performance.

replies(1): >>43552842 #
ww520 ◴[] No.43552842[source]
That's a great point! Unfortunately Topological Sort generates a linear order, forcing nodes to run one after another. This library attempts to bring some parallel processing into the picture by grouping dependence-free nodes together. This produces a linear batches.

  {C D E F}, {parentA parentB}, {root} 
Within a batch, nodes can run parallel. Between batches, they still need to run one after another.

What you're asking for is to partition the dependency graph according to node dependency with minimum span.

  { {C D} parentA } \
  { {E F} parentB } - {root}
Or with a shared leader.

  { {C D S} parentA } \
  { {E F S} parentB } - {root}
And on top of that, add node weight into the consideration. That's hard.

For now, you can send notification to a dependent when the leading task finishes. E.g. When C finishes, notifies parentA. When D finishes, notifies parentA. When parentA notices that all its leading tasks have done, it can start.

The library can help in maintaining the dependency relationship and let the task query its leaders and dependents.

For task running, it would be a separate library using the TopoSort library and specifically geared toward scheduling tasks.

replies(1): >>43554474 #
genewitch ◴[] No.43554474[source]
Did you use this library on the advent of code 2024? I'd never heard of topological sorting prior to that problem - and it was real early in the game.
replies(2): >>43555508 #>>43557346 #
1. rk06 ◴[] No.43555508[source]
Topological sorting is just "depth first traversal" in a trench coat. I have implemented it thrice in my day job.

It is actually more commonly implemented than any other algorithm in CS course