GitHub - It4innovations/hyperqueue: Scheduler for sub-node tasks for HPC systems with batch scheduling

HyperQueue is a tool designed to simplify execution of large workflows (task graphs) on HPC clusters. It allows you to execute a large number of tasks in a simple way, without having to manually submit jobs into batch schedulers like Slurm or PBS. You specify what you want to compute and HyperQueue automatically asks for computational resources and dynamically load-balances tasks across all allocated nodes and resources. HyperQueue can also work without Slurm/PBS as a general distributed task execution engine.

Documentation

If you find a bug or a problem with HyperQueue, please create an issue. For more general discussion or feature requests, please use our discussion forum. If you want to chat with the HyperQueue developers, you can use our Zulip server.

If you use HyperQueue in your research, please consider citing it.

[Figure: HyperQueue running on a distributed cluster that uses Slurm or PBS]

Features

Complex resource management
- Load balancing of tasks across all available (HPC) resources
- Automatic submission of Slurm/PBS jobs on behalf of the user
- Complex and arbitrary task resource requirements (# of cores, GPUs, memory, FPGAs, ...)
  - Non-fungible resources (tasks are assigned specific resources, e.g. a GPU with ID 1)
  - Fractional resources (tasks can require e.g. 0.5 of a GPU)
  - Resource variants (tasks can require e.g. 1 GPU and 4 CPU cores OR 16 CPU cores)
  - Related resources (tasks can require e.g. 4 CPU cores in the same NUMA node)
High performance
- Scales to hundreds of nodes/workers and millions of tasks
- Overhead per task is below 0.1 ms
- Allows streaming of stdout/stderr from tasks to avoid creating many small files on distributed filesystems
Simple user interface
- Task graphs can be defined via a CLI, TOML workflow files or a Python API
- Cluster utilization can be monitored with a real-time dashboard
Easy deployment
- Provided as a single, statically linked binary without any runtime dependencies (apart from libc)
- No admin access to a cluster is needed for its usage
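
To make the resource requirements listed above more concrete, here is a minimal sketch that composes `hq submit` command lines with a CPU count and a named resource amount. It only prints the commands. The helper `submit_cmd`, the resource name `gpus`, and the script names are hypothetical, and the `--cpus` and `--resource` flag spellings are written from memory of the HyperQueue CLI, so please confirm them with `hq submit --help` for your version.

```shell
#!/usr/bin/env bash
# Sketch: compose `hq submit` invocations that carry resource requests.
# Assumptions: the --cpus / --resource flag spellings, and a worker-side
# resource named "gpus" (use whatever name your workers actually report).

# submit_cmd CPUS RESOURCE AMOUNT COMMAND...
# Prints the command line only; nothing is executed.
submit_cmd() {
    cpus="$1"; resource="$2"; amount="$3"; shift 3
    printf 'hq submit --cpus=%s --resource %s=%s %s\n' \
        "$cpus" "$resource" "$amount" "$*"
}

submit_cmd 4 gpus 1 ./train.sh      # 4 cores plus one whole GPU
submit_cmd 1 gpus 0.5 ./infer.sh    # fractional resource: half a GPU
```

Printing instead of executing keeps the sketch safe to run on a machine without HyperQueue. Once the printed lines look right, they can be pasted into a shell where a server is running.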

Installation

Download the latest binary distribution from this link.

Unpack the downloaded archive:

$ tar -xvzf hq-<version>-linux-x64.tar.gz

That's it! Just use the unpacked hq binary.

If you want to try the newest features, you can also download a nightly build.
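
The unpack step can also be wrapped in a small helper that checks the result. The sketch below is one possible way to do that. `unpack_hq` is a hypothetical function name, and it assumes that the archive places the `hq` binary at its top level.

```shell
#!/usr/bin/env bash
# Sketch: unpack a HyperQueue archive and confirm the hq binary is usable.
# unpack_hq is a hypothetical helper, not part of HyperQueue.

unpack_hq() {
    archive="$1"   # path to the downloaded hq-<version>-linux-x64.tar.gz
    dest="$2"      # directory to unpack into

    mkdir -p "$dest" || return 1
    tar -xzf "$archive" -C "$dest" || return 1

    # Assumption: the archive places the hq binary at its top level.
    if [ ! -x "$dest/hq" ]; then
        echo "no executable hq found in $dest" >&2
        return 1
    fi
    echo "$dest/hq"   # print the path so callers can add it to PATH
}
```

A call such as `unpack_hq "hq-<version>-linux-x64.tar.gz" "$HOME/bin"` prints the path of the unpacked binary, or returns a non-zero status if the archive did not contain an executable named `hq`.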

Submitting a simple task

Start a server (e.g. on a login node, a cluster partition, or simply on your PC)
```
$ hq server start &
```
Submit a job (command echo 'Hello world' in this case)
```
$ hq submit echo 'Hello world'
```
Ask for computing resources
- Either start a worker manually
```
$ hq worker start &
```
- Or configure automatic submission of workers into PBS/Slurm
  - PBS:
```
$ hq alloc add pbs --time-limit 1h -- -q <queue>
```
  - Slurm:
```
$ hq alloc add slurm --time-limit 1h -- -p <partition>
```

See the result of the job once it finishes

$ hq job wait last
$ hq job cat last stdout
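
The manual-worker variant of the steps above can be collected into one script. This is a sketch rather than an official workflow. By default it only prints the commands. The variable `HQ_DEMO_EXECUTE`, the helpers `run` and `run_bg`, and the two-second pause are all inventions of this example.

```shell
#!/usr/bin/env bash
# Sketch: the quickstart steps in one script.
# Commands are only printed unless HQ_DEMO_EXECUTE=1 is set
# (a variable name invented for this sketch).

# Run a command in the foreground, or print it in dry-run mode.
run() {
    if [ "${HQ_DEMO_EXECUTE:-0}" = "1" ]; then
        "$@"
    else
        printf '+ %s\n' "$*"
    fi
}

# Run a command in the background, or print it in dry-run mode.
run_bg() {
    if [ "${HQ_DEMO_EXECUTE:-0}" = "1" ]; then
        "$@" &
        sleep 2   # crude pause so the process can start; adjust as needed
    else
        printf '+ %s &\n' "$*"
    fi
}

quickstart() {
    run_bg hq server start            # 1. start a server
    run hq submit echo 'Hello world'  # 2. submit a job
    run_bg hq worker start            # 3. provide computing resources
    run hq job wait last              # 4. block until the job finishes
    run hq job cat last stdout        # 5. show the job's output
}

quickstart
```

In dry-run mode the script prints five lines, one per step, which makes it easy to review the sequence before running it with `HQ_DEMO_EXECUTE=1` on a machine where `hq` is installed.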

What's next?

Check out the documentation.

You can find the FAQ (frequently asked questions) here.

HyperQueue team

We are a group of researchers working at IT4Innovations, the Czech National Supercomputing Center. We welcome any outside contributions.

Publications

HyperQueue: Efficient and ergonomic task graphs on HPC clusters (paper @ SoftwareX journal)
HyperQueue: Overcoming Limitations of HPC Job Managers (poster @ SuperComputing'21)

If you want to cite HyperQueue, you can use the following BibTeX entry:

@article{hyperqueue,
  title = {HyperQueue: Efficient and ergonomic task graphs on HPC clusters},
  journal = {SoftwareX},
  volume = {27},
  pages = {101814},
  year = {2024},
  issn = {2352-7110},
  doi = {https://doi.org/10.1016/j.softx.2024.101814},
  url = {https://www.sciencedirect.com/science/article/pii/S2352711024001857},
  author = {Jakub Beránek and Ada Böhm and Gianluca Palermo and Jan Martinovič and Branislav Jansík},
  keywords = {Distributed computing, Task scheduling, High performance computing, Job manager}
}

Acknowledgement

This work was supported by the LIGATE project. This project has received funding from the European High-Performance Computing Joint Undertaking (JU) under grant agreement No 956137. The JU receives support from the European Union’s Horizon 2020 research and innovation programme and Italy, Sweden, Austria, the Czech Republic, Switzerland.
This work was supported by the Ministry of Education, Youth and Sports of the Czech Republic through the e-INFRA CZ (ID:90140).

License

MIT
