Add DaggerMPI subpackage for MPI integrations #356

Draft
jpsamaroo wants to merge 3 commits into master

Conversation

jpsamaroo
Member

Building on Dagger's unified hashing framework, DaggerMPI.jl allows DAGs to execute efficiently on an MPI cluster. Per-task hashes are used to "color" the DAG, disabling execution of each task on all but one MPI worker. Data movement is typically peer-to-peer via MPI Send and Recv, coordinated by tags computed from the same coloring scheme. This lets Dagger's scheduler remain unmodified and unaware of the MPI cluster's existence, while still providing "exactly once" execution semantics for each task in the DAG.
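
For intuition, here is a minimal sketch of the coloring and tagging scheme, assuming MPI.jl's v0.20-style `send`/`recv` API; `owner_rank`, `mpi_tag`, and `move_result` are illustrative names, not the DaggerMPI.jl implementation:

```julia
# Minimal sketch of the coloring/tagging idea -- not the DaggerMPI.jl source.
# Assumes MPI.jl and a per-task hash `h` from Dagger's unified hashing framework.
using MPI

MPI.Init()
const comm   = MPI.COMM_WORLD
const nranks = MPI.Comm_size(comm)
const myrank = MPI.Comm_rank(comm)

# "Color" the DAG: the task's hash deterministically picks the one rank that
# actually executes it; every other rank skips the work.
owner_rank(h::UInt) = Int(h % UInt(nranks))
should_execute(h::UInt) = owner_rank(h) == myrank

# Derive a point-to-point tag from the same hash, so sender and receiver agree
# on it without any coordination from the scheduler.
mpi_tag(h::UInt) = Int(h % UInt(32768))  # stay within the guaranteed MPI tag range

# Peer-to-peer data movement: the owner sends the result to a consumer rank,
# which posts the matching receive using the identical tag.
function move_result(h::UInt, result, consumer_rank::Int)
    if should_execute(h) && consumer_rank != myrank
        MPI.send(result, comm; dest=consumer_rank, tag=mpi_tag(h))
        return result
    elseif consumer_rank == myrank && !should_execute(h)
        return MPI.recv(comm; source=owner_rank(h), tag=mpi_tag(h))
    end
    return result
end
```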

Implements a "semantic" hashing algorithm which hashes Thunks based on the functional behavior of the code being executed. The intention is for a task's hash to have an identical value across different Julia sessions whenever the task computes the same value. This is important for implementing a "headless" worker-worker cluster, where there is no coordinating head worker and all workers can see the entire computational program.
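
As a toy illustration of what "identical across sessions" means (this is not the algorithm implemented here), a session-stable hash can be built from inputs that do not depend on object identity, such as a function's qualified name and the hashes of its arguments:

```julia
# Toy illustration of a session-stable ("semantic") task hash -- not the PR's algorithm.
# Identity-based hashes (e.g. objectid) differ between Julia sessions; hashing the
# function's name together with the hashes of its inputs does not (for a given Julia version).
semantic_hash(x) = hash(x)  # plain values hash deterministically within a Julia version

struct ToyTask
    f::Function
    args::Tuple
end

function semantic_hash(t::ToyTask)
    h = hash(string(parentmodule(t.f), '.', nameof(t.f)))
    for a in t.args
        h = hash(semantic_hash(a), h)  # fold in each input's semantic hash
    end
    return h
end

# Two separate sessions building the "same" task compute the same hash:
semantic_hash(ToyTask(+, (1, 2)))
```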

Hashes are computed automatically and can be queried with `get_task_hash()` while running in a task context, or directly as `get_task_hash(task)` for any Dagger task type. Hashes are also provided within `Dagger.move` calls, where the input task's hash is available as well.
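
A brief usage sketch of this API, assuming the functions are reachable as `Dagger.get_task_hash`; `report_hash` is an illustrative helper, not part of the PR:

```julia
using Dagger

# A task that reports its own hash from inside the running task context:
function report_hash(x)
    h = Dagger.get_task_hash()      # hash of the currently executing task
    println("executing with hash ", h)
    return x + 1
end

t = Dagger.@spawn report_hash(41)
fetch(t)

# Query the hash of any Dagger task type directly:
h = Dagger.get_task_hash(t)

# Within `Dagger.move`, the input task's hash is likewise available; DaggerMPI.jl
# uses it to compute matching MPI tags on both the sending and receiving ranks.
```
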
@jpsamaroo
Member Author

This PR is mostly good to go; however, there is one aspect I'm not happy about: all nodes must spawn the same tasks in the same order, or else we get a hang. This is currently necessary because we do one-sided, "blind" sends and receives, assuming that the counterpart will be posted.
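
As a minimal illustration of the hazard (not the package code), consider two ranks where only one of them decides to spawn a task:

```julia
# Illustration of the hang: both ranks must walk the DAG identically. Here rank 1's
# rank-local logic skips a task whose result rank 0 blindly waits to receive,
# so rank 0 blocks forever.
using MPI

MPI.Init()
comm = MPI.COMM_WORLD
rank = MPI.Comm_rank(comm)
tag  = 7  # would normally be derived from the task's hash

if rank == 1
    spawn_it = false          # imagine conditional logic that only rank 1 evaluates
    if spawn_it
        MPI.send(42, comm; dest=0, tag=tag)
    end
elseif rank == 0
    x = MPI.recv(comm; source=1, tag=tag)  # hangs: the matching send is never posted
end
```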

A partial fix for this problem might involve asynchronously exchanging task hashes between nodes; when we find a task hash that hasn't been registered on a given node, we "vote" to assign it to a node which does have it (and make sure all nodes are aware of that decision for downstream tasks which depend on that task's result). We can initiate this vote from any node whose tasks are stalled waiting on data (send or receive); it's essentially a more active way to guarantee that the data becomes available, since some node will eventually post or consume the result.

That fix is really only a workaround; a more complete solution might involve providing a way to inform the cluster that there is conditional logic around a task spawn point, using that as a cue to initiate an early vote, or letting the user explicitly select which node(s) to consider.

jpsamaroo marked this pull request as a draft on July 30, 2024, 19:05.