Compute #121
Merged: 14 commits, May 5, 2024
@@ -4,7 +4,7 @@

This flow is a simple linear workflow that verifies your cloud configuration. The
`start` and `end` steps will run locally, while the `hello` step will [run
remotely](/scaling/remote-tasks/introduction). After [configuring
remotely](/scaling/remote-tasks/requesting-resources). After [configuring
Metaflow](/getting-started/infrastructure) to run in the cloud, data and metadata about
your runs will be stored remotely. This means you can use the client to access
information about any flow from anywhere.
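For reference, a minimal sketch of such a flow might look like the following; the flow name, step
bodies, and `@batch` resource values are illustrative (not part of this change), and `@kubernetes`
could be used in the same place:

```python
from metaflow import FlowSpec, step, batch

class HelloCloudFlow(FlowSpec):

    @step
    def start(self):
        # Runs locally
        print("Starting locally")
        self.next(self.hello)

    @batch(cpu=1, memory=4000)  # runs remotely once cloud infrastructure is configured
    @step
    def hello(self):
        # Runs remotely; the artifact is stored in the cloud datastore
        self.message = "Hello from the cloud!"
        self.next(self.end)

    @step
    def end(self):
        # Runs locally again
        print(self.message)

if __name__ == "__main__":
    HelloCloudFlow()
```

Because run metadata and artifacts live in the cloud datastore, the Client API can read them from
anywhere, for example `Flow('HelloCloudFlow').latest_run['hello'].task.data.message` in a notebook.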
4 changes: 2 additions & 2 deletions docs/index.md
@@ -31,12 +31,12 @@ projects.
- [Creating Flows](metaflow/basics)
- [Inspecting Flows and Results](metaflow/client)
- [Debugging Flows](metaflow/debugging)
- [Visualizing Results](metaflow/visualizing-results/) ✨*New Features*✨
- [Visualizing Results](metaflow/visualizing-results/)

## II. Scalable Flows

- [Introduction to Scalable Compute and Data](scaling/introduction)
- [Executing Tasks remotely](scaling/remote-tasks/introduction)
- [Computing at Scale](scaling/remote-tasks/introduction) ✨*Updated and Expanded*✨
- [Managing Dependencies](scaling/dependencies)
- [Dealing with Failures](scaling/failures)
- [Loading and Storing Data](scaling/data)
6 changes: 6 additions & 0 deletions docs/introduction/metaflow-resources.md
@@ -36,12 +36,17 @@

## Videos

- [Metaflow Office Hours](https://www.youtube.com/watch?v=76k5r6s6M1Q&list=PLUsOvkBBnJBcdF7SScDIndbnkMnFOoskA) -
cool stories from companies using Metaflow.
- [Metaflow on YouTube](https://www.youtube.com/results?search_query=metaflow+ml).
- You can start with [this recent
overview](https://www.youtube.com/watch?v=gZnhSHvhuFQ).

## Blogs

Find [a more comprehensive list of Metaflow users and posts here](https://outerbounds.com/stories/). Here's
a small sample:

- 23andMe: [Developing safe and reliable ML products at
23andMe](https://medium.com/23andme-engineering/machine-learning-eeee69d40736)
- AWS: [Getting started with the open source data science tool Metaflow on
@@ -50,6 +55,7 @@
CNN](https://medium.com/cnn-digital/accelerating-ml-within-cnn-983f6b7bd2eb)
- Latana: [Brand Tracking with Bayesian Statistics and AWS
Batch](https://aws.amazon.com/blogs/startups/brand-tracking-with-bayesian-statistics-and-aws-batch/)
- Netflix: [Supporting Diverse ML Systems at Netflix](https://netflixtechblog.com/supporting-diverse-ml-systems-at-netflix-2d2e6b6d205d)
- Netflix: [Open-Sourcing Metaflow, a Human-Centric Framework for Data
Science](https://netflixtechblog.com/open-sourcing-metaflow-a-human-centric-framework-for-data-science-fa72e04a5d9)
- Netflix: [Unbundling Data Science Workflows with Metaflow and AWS Step
5 changes: 3 additions & 2 deletions docs/metaflow/visualizing-results/README.md
@@ -39,7 +39,8 @@

In contrast, cards are not meant for building complex, interactive dashboards or for
ad-hoc exploration, which is a spot-on use case for notebooks. If you are curious, you can read
more about the motivation for cards in [the original release blog post](https://outerbounds.com/blog/integrating-pythonic-visual-reports-into-ml-pipelines/) and
more about the motivation for cards in [the original release blog
post](https://outerbounds.com/blog/integrating-pythonic-visual-reports-into-ml-pipelines/) and
[the announcement post for updating cards](https://outerbounds.com/blog/metaflow-dynamic-cards/).

## How to use cards?
@@ -85,7 +86,7 @@
above.

Also, crucially, cards work in any compute
environment such as laptops, [any remote tasks](/scaling/remote-tasks/introduction), or
environment such as laptops, [any remote tasks](/scaling/remote-tasks/requesting-resources), or
when the flow is [scheduled to run automatically](/production/introduction). Hence, you
can use cards to inspect and debug results during prototyping, as well as monitor the
quality of production runs.
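As a hedged illustration (flow name, card component, and metric below are made up), the same
card-producing code runs unchanged on a laptop, `--with batch`/`--with kubernetes`, or on a
production scheduler:

```python
from metaflow import FlowSpec, step, card, current
from metaflow.cards import Markdown

class CardDemoFlow(FlowSpec):

    @card(type='blank')          # attach a card to this step
    @step
    def start(self):
        accuracy = 0.93          # placeholder for a real metric
        current.card.append(Markdown(f"# Model report\nAccuracy: {accuracy:.2f}"))
        self.next(self.end)

    @step
    def end(self):
        pass

if __name__ == "__main__":
    CardDemoFlow()
```

After a run, the card could be opened with `python carddemo.py card view start`, assuming the file
is saved as `carddemo.py`.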
@@ -123,4 +124,4 @@

:::tip
You can test example cards interactively in [Metaflow
Sandbox](https://account.outerbounds.dev/account/?workspace=/home/workspace/workspaces/dynamic-cards/workspace.code-workspace), conveniently in the browser.

@@ -80,7 +80,7 @@ You should see a blank page with a blue “Hello World!” text.
![](</assets/card-docs-html_(2).png>)

A particularly useful feature of card templates is that they work in any compute
environment, even when [executing tasks remotely](/scaling/remote-tasks/introduction).
environment, even when [executing tasks remotely](/scaling/remote-tasks/requesting-resources).
For instance, if you have AWS Batch set up, you can run the flow as follows:

```
@@ -36,7 +36,7 @@ required in the code. All data artifacts produced by steps run on Airflow are available
using the [Client API](../../metaflow/client.md). All tasks are run on Kubernetes
respecting the `@resources` decorator, as if the `@kubernetes` decorator was added to
all steps, as explained in [Executing Tasks
Remotely](/scaling/remote-tasks/introduction#safeguard-flags).
Remotely](/scaling/remote-tasks/requesting-resources).
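For instance, a sketch like the one below (flow name, step bodies, and resource values are
arbitrary) keeps working unchanged when compiled to an Airflow DAG; each task is launched on
Kubernetes with the requested resources:

```python
from metaflow import FlowSpec, step, resources

class TrainFlow(FlowSpec):

    @step
    def start(self):
        self.next(self.train)

    @resources(cpu=4, memory=16000)  # honored by Kubernetes when scheduled via Airflow
    @step
    def train(self):
        self.model = "placeholder"   # stand-in for real training code
        self.next(self.end)

    @step
    def end(self):
        pass

if __name__ == "__main__":
    TrainFlow()
```

The DAG file could then be generated with something like `python trainflow.py airflow create train_dag.py`.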

This document describes the basics of Airflow scheduling. If your project involves
multiple people, multiple workflows, or it is becoming business-critical, check out the
@@ -115,7 +115,7 @@ the default with the `--max-workers` option. For instance, `airflow create --max-workers
500` allows 500 tasks to be executed concurrently for every foreach step.

This option is similar to [`run
--max-workers`](/scaling/remote-tasks/introduction#safeguard-flags) that is used to
--max-workers`](/scaling/remote-tasks/controlling-parallelism) that is used to
limit concurrency outside Airflow.

### Deploy-time parameters
@@ -25,7 +25,7 @@ changes are required in the code. All data artifacts produced by steps run on Argo
Workflows are available using the [Client API](../../metaflow/client.md). All tasks are
run on Kubernetes respecting the `@resources` decorator, as if the `@kubernetes`
decorator was added to all steps, as explained in [Executing Tasks
Remotely](/scaling/remote-tasks/introduction#safeguard-flags).
Remotely](/scaling/remote-tasks/requesting-resources).

This document describes the basics of Argo Workflows scheduling. If your project
involves multiple people, multiple workflows, or it is becoming business-critical, check
@@ -129,7 +129,7 @@ the default with the `--max-workers` option. For instance, `argo-workflows create
--max-workers 500` allows 500 tasks to be executed concurrently for every foreach step.

This option is similar to [`run
--max-workers`](/scaling/remote-tasks/introduction#safeguard-flags) that is used to
--max-workers`](/scaling/remote-tasks/controlling-parallelism) that is used to
limit concurrency outside Argo Workflows.

### Deploy-time parameters
@@ -23,7 +23,7 @@ changes are required in the code. All data artifacts produced by steps run on AWS Step
Functions are available using the [Client API](../../metaflow/client). All tasks are run
on AWS Batch respecting the `@resources` decorator, as if the `@batch` decorator was
added to all steps, as explained in [Executing Remote
Tasks](/scaling/remote-tasks/introduction).
Tasks](/scaling/remote-tasks/requesting-resources).

This document describes the basics of AWS Step Functions scheduling. If your project
involves multiple people, multiple workflows, or it is becoming business-critical, check
@@ -146,7 +146,7 @@ the default with the `--max-workers` option. For instance, `step-functions create
--max-workers 500` allows 500 tasks to be executed concurrently for every foreach step.

This option is similar to [`run
--max-workers`](/scaling/remote-tasks/introduction#safeguard-flags) that is used to
--max-workers`](/scaling/remote-tasks/controlling-parallelism) that is used to
limit concurrency outside AWS Step Functions.

### Deploy-time parameters
6 changes: 3 additions & 3 deletions docs/scaling/data.md
@@ -29,7 +29,7 @@ manipulation:
are caused by out-of-sync features.
* It is quicker to iterate on your model. Testing and debugging Python is easier than
testing and debugging SQL.
* You can request [arbitrary amount of resources](/scaling/remote-tasks/introduction)
* You can request [arbitrary amount of resources](/scaling/remote-tasks/requesting-resources)
for your data manipulation needs.
* Instead of having data manipulation code in two places (SQL and Python), all code can
be clearly laid out in a single place, in a single language, for maximum readability.
@@ -65,7 +65,7 @@ of `metaflow.S3` to directly interface with data files on S3 backing your tables
data is loaded directly from S3, there is no limitation to the number of parallel
processes. The size of data is only limited by the size of your instance, which can be
easily controlled with [the `@resources`
decorator](/scaling/remote-tasks/introduction#requesting-resources-with-resources-decorator).
decorator](/scaling/remote-tasks/requesting-resources).
The best part is that this approach is blazingly fast compared to executing SQL.

The main downside of this approach is that the table needs to have partitions that match
@@ -549,7 +549,7 @@ Read more about [fast data processing with `metaflow.S3` in this blog post](http

For maximum performance, ensure that
[the `@resources(memory=)`
setting](/scaling/remote-tasks/introduction#requesting-resources-with-resources-decorator)
setting](/scaling/remote-tasks/requesting-resources)
is higher than the amount of data you are downloading with `metaflow.S3`.
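As a rough sketch (the bucket, partition, and memory size below are hypothetical), a step that
pulls a whole partition with `metaflow.S3` under a generous `@resources(memory=)` setting could
look like this:

```python
from metaflow import FlowSpec, step, resources, S3

class FastDataFlow(FlowSpec):

    @resources(memory=32000)  # keep memory above the total size of the download
    @step
    def start(self):
        # Download every object under a (hypothetical) table partition in parallel
        with S3(s3root='s3://my-bucket/my-table/ds=2024-05-01/') as s3:
            objs = s3.get_all()
            total_mb = sum(obj.size for obj in objs) / 1024 ** 2
            print(f"downloaded {len(objs)} objects, {total_mb:.1f} MB")
        self.next(self.end)

    @step
    def end(self):
        pass

if __name__ == "__main__":
    FastDataFlow()
```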

If the amount of data is higher than the available disk space, you can use the
4 changes: 2 additions & 2 deletions docs/scaling/dependencies/README.md
@@ -30,7 +30,7 @@ others.
to retrieve a flow, rerun it, and get similar results. To make this possible, a
flow can't depend on libraries that are only available on your laptop.

3. **[Remote execution of tasks](/scaling/remote-tasks/introduction)** requires that
3. **[Remote execution of tasks](/scaling/remote-tasks/requesting-resources)** requires that
all dependencies can be reinstantiated on the fly in a remote environment.
Again, this is not possible if the flow depends on libraries that are only
available on your laptop.
@@ -61,7 +61,7 @@ dependencies like this automatically.

3. Crucially, Metaflow packages Metaflow itself for remote execution so that you
don't have to install it manually when
[using `@batch` and `@kubernetes`](/scaling/remote-tasks/introduction). Also, this
[using `@batch` and `@kubernetes`](/scaling/remote-tasks/requesting-resources). Also, this
guarantees that all tasks use the same version of Metaflow consistently.

4. External libraries can be included via [the `@pypi` and `@conda`
2 changes: 1 addition & 1 deletion docs/scaling/dependencies/containers.md
@@ -1,7 +1,7 @@

# Defining Custom Images

All [tasks executed remotely](/scaling/remote-tasks/introduction) run in a container,
All [tasks executed remotely](/scaling/remote-tasks/requesting-resources) run in a container,
both on [Kubernetes](/scaling/remote-tasks/kubernetes) and in
[AWS Batch](/scaling/remote-tasks/aws-batch). Hence, the default environment for
remote tasks is defined by the container (Docker) image used.
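For example, a custom image could be selected for a whole run from the command line; the flow file
name and registry path below are placeholders:

```
$ python myflow.py run --with batch:image=registry.example.com/ml/custom:1.0
$ python myflow.py run --with kubernetes:image=registry.example.com/ml/custom:1.0
```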
2 changes: 1 addition & 1 deletion docs/scaling/dependencies/libraries.md
@@ -236,7 +236,7 @@ in the `end` step.
## Libraries in remote tasks

A major benefit of `@pypi` and `@conda` is that they allow you to define libraries that will be
automatically made available when [you execute tasks remotely](/scaling/remote-tasks/introduction)
automatically made available when [you execute tasks remotely](/scaling/remote-tasks/requesting-resources)
on `@kubernetes` or on `@batch`. You don't need to do anything special to make this happen.
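A hedged sketch (flow name and package versions are illustrative): the packages declared in
`@pypi` are resolved and shipped to the remote task automatically:

```python
from metaflow import FlowSpec, step, pypi, kubernetes

class RemoteLibsFlow(FlowSpec):

    @kubernetes(cpu=2, memory=8000)
    @pypi(python='3.11', packages={'pandas': '2.2.2'})
    @step
    def start(self):
        import pandas as pd   # provided by @pypi in the remote container
        print(pd.__version__)
        self.next(self.end)

    @step
    def end(self):
        pass

if __name__ == "__main__":
    RemoteLibsFlow()
```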

If you have `@kubernetes` or `@batch` [configured to work with
4 changes: 2 additions & 2 deletions docs/scaling/dependencies/project-structure.md
@@ -3,7 +3,7 @@

This page describes how to arrange files in your projects to
follow software development best practices, which also leads to
[easy remote execution](/scaling/remote-tasks/introduction).
[easy remote execution](/scaling/remote-tasks/requesting-resources).

## Separating code to modules

@@ -55,7 +55,7 @@ can run the flow as usual:
python teaflow.py run
```
The `teatime.py` module works out of the box. If you have
[remote execution](/scaling/remote-tasks/introduction) set up,
[remote execution](/scaling/remote-tasks/requesting-resources) set up,
you can run the code `--with batch` or `--with kubernetes` and it works
equally well!

4 changes: 2 additions & 2 deletions docs/scaling/failures.md
@@ -52,7 +52,7 @@ sometimes the `start` step raises an exception and needs to be retried. By default,
always succeed.

It is recommended that you use `retry` every time you [run tasks
remotely](/scaling/remote-tasks/introduction). Instead of annotating every step with a
remotely](/scaling/remote-tasks/requesting-resources). Instead of annotating every step with a
retry decorator, you can also automatically add a retry decorator to all steps that do
not have one as follows:

@@ -104,7 +104,7 @@ can be retried without concern.
### Maximizing Safety

By default, `retry` retries the step three times before giving up. It waits for 2
minutes between retries for [remote tasks](/scaling/remote-tasks/introduction). This
minutes between retries for [remote tasks](/scaling/remote-tasks/requesting-resources). This
means that if your code fails fast, any transient platform issues need to get resolved
in less than 10 minutes or the whole run will fail. 10 minutes is typically more than
enough, but sometimes you want both a belt and suspenders.
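For example (the values and helper below are arbitrary, not recommendations), you can lengthen the
retry schedule on a flaky step and pair it with `@catch` as a last-resort fallback:

```python
import random

from metaflow import FlowSpec, step, retry, catch

def call_flaky_service():
    # Stand-in for a call that fails transiently
    if random.random() < 0.5:
        raise RuntimeError("transient failure")
    return "ok"

class SturdyFlow(FlowSpec):

    @catch(var='start_failure')                  # record the error instead of failing the run
    @retry(times=4, minutes_between_retries=10)  # wait longer than the default 3 retries x 2 minutes
    @step
    def start(self):
        self.result = call_flaky_service()
        self.next(self.end)

    @step
    def end(self):
        pass

if __name__ == "__main__":
    SturdyFlow()
```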
26 changes: 14 additions & 12 deletions docs/scaling/introduction.md
@@ -9,17 +9,18 @@ optimization](https://xkcd.com/1691/) is probably not the best use of your time.

Instead, you can leverage the cloud to get a bigger laptop or more laptops (virtually,
not literally). This is the Stage II in Metaflow development: Scaling flows with the
cloud. Luckily Metaflow makes this trivially easy - no changes in the code required -
cloud. Metaflow makes this easy - no changes in the code required -
after you have done the initial legwork to [configure infrastructure for
Metaflow](/getting-started/infrastructure).

![](/assets/intro-cartoon-2.svg)

## Supersizing Flows

Here's how Metaflow can help make your project more scalable:
Here's how Metaflow can help make your project more scalable, both technically and
organizationally:

1. You can make your existing flows more scalable just by adding a line of code,
1. You can make your existing flow scalable just by adding a line of code,
`@resources`. This way you can request more CPUs, memory, or GPUs in your flows. Or, you
can parallelize processing over multiple instances, even thousands of them.

@@ -28,18 +29,18 @@ attracting interest from colleagues too. Metaflow contains a number of features,
such as [namespaces](/scaling/tagging), which make collaboration smoother by allowing many
people to contribute without interfering with each other's work accidentally.

### Toolbox of Scalability
### Easy patterns for scalable, high-performance code

There is no single magic formula for scalability. Instead of proposing a novel paradigm
to make your Python code faster, Metaflow provides a set of pragmatic tools, leveraging
the best off-the-shelf components and services, which help you make code more scalable
and performant depending on your specific needs.
to make your Python code faster, [Metaflow provides a set of
practical, commonly used patterns](/scaling/remote-tasks/introduction), which help you
make code more scalable and performant depending on your specific needs.

The scalability tools fall into three categories:
The patterns fall into three categories:

- **Performance Optimization**: You can improve performance of your code by utilizing
off-the-shelf, high-performance libraries such as
[XGboost](https://github.com/dmlc/xgboost) or [Tensorflow](https://tensorflow.org).
[XGboost](https://github.com/dmlc/xgboost) or [PyTorch](https://pytorch.org/).
Or, if you need something more custom, you can leverage the vast landscape of data
tools for Python, including compilers like [Numba](https://numba.pydata.org) to speed
up your code.
@@ -53,7 +54,9 @@ care of provisioning such machines on demand.
- **Scaling Out**: Besides executing code on a single instance, Metaflow makes it easy
to parallelize steps over an arbitrarily large number of instances, leveraging
Kubernetes and AWS Batch, giving you access to a virtually unlimited amount of computing
power.
power. Besides running many independent tasks, you can also [create large compute
clusters](/scaling/remote-tasks/distributed-computing) on the fly, e.g.
to train large (language) models, as sketched below.
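Here is a hedged sketch of the scale-out pattern (shard count, resource sizes, and the per-shard
work are arbitrary): a `foreach` fans a step out over many tasks, each of which can run remotely
with `--with batch` or `--with kubernetes`:

```python
from metaflow import FlowSpec, step, resources

class FanOutFlow(FlowSpec):

    @step
    def start(self):
        self.shards = list(range(100))       # one task per shard
        self.next(self.process, foreach='shards')

    @resources(cpu=2, memory=8000)           # applied per task when run remotely
    @step
    def process(self):
        self.result = self.input * 2         # placeholder for real per-shard work
        self.next(self.join)

    @step
    def join(self, inputs):
        self.results = [inp.result for inp in inputs]
        self.next(self.end)

    @step
    def end(self):
        pass

if __name__ == "__main__":
    FanOutFlow()
```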

Often an effective recipe for scalability is a combination of these three techniques:
Start with high-performance Python libraries, run them on large instances, and if
@@ -73,8 +76,7 @@ In this section, you will learn how to make your flows capable of handling more
execute faster. You will also learn how to scale projects over multiple people by
organizing results better. We cover five topics:

1. [Executing tasks remotely with Kubernetes or AWS
Batch](/scaling/remote-tasks/introduction)
1. [Scaling compute in the cloud](/scaling/remote-tasks/introduction)
2. [Dealing with failures](/scaling/failures)
3. [Managing execution environments](/scaling/dependencies)
4. [Loading and storing data efficiently](/scaling/data)
53 changes: 49 additions & 4 deletions docs/scaling/remote-tasks/aws-batch.md
@@ -35,11 +35,49 @@ You can also specify the resource requirements on the command line:
$ python BigSum.py run --with batch:cpu=4,memory=10000,queue=default,image=ubuntu:latest
```

## My job is stuck in `RUNNABLE` state. What do I do?
## Using GPUs and Trainium instances with AWS Batch

To use GPUs in Metaflow tasks that run on AWS Batch, you need to run the flow in a
[Job Queue](https://docs.aws.amazon.com/batch/latest/userguide/job_queues.html) that
is attached to a [Compute
Environment](https://docs.aws.amazon.com/batch/latest/userguide/compute_environments.html)
with GPU/Trainium instances.

To set this up, you can either modify the allowable instances in a [Metaflow AWS deployment
template](https://github.com/outerbounds/metaflow-tools/tree/master/aws) or manually add such a
compute environment from the AWS console. The steps are:

1. Create a compute environment with GPU-enabled EC2 instances or Trainium instances.
2. Attach the compute environment to a new Job Queue - for example named `my-gpu-queue`.
3. Run a flow with a GPU task in the `my-gpu-queue` job queue by
- setting the `METAFLOW_BATCH_JOB_QUEUE` environment variable, or
- setting the `METAFLOW_BATCH_JOB_QUEUE` value in your Metaflow config, or
- (most explicit) setting the `queue` parameter in the `@batch` decorator.

It is a good practice to separate the job queues that you run GPU tasks on from those that do not
require GPUs (or Trainium instances). This makes it easier to track hardware-accelerated workflows,
which can be costly, separately from other workflows. Just add a line like
```python
@batch(gpu=1, queue='my-gpu-queue')
```
in steps that require GPUs.
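For instance, assuming a flow file named `gpuflow.py` (hypothetical), the queue can also be chosen
for a whole run from the command line, using the environment variable or the `--with` syntax:

```
$ METAFLOW_BATCH_JOB_QUEUE=my-gpu-queue python gpuflow.py run --with batch:gpu=1
$ python gpuflow.py run --with batch:gpu=1,queue=my-gpu-queue
```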

## My job is stuck in `RUNNABLE` state. What should I do?

Does the Batch job queue you are trying to run the Metaflow task in have a compute environment
with EC2 instances that provide the requested resources? For example, if your job queue is connected to
a single compute environment that only has `p3.2xlarge` as a GPU instance type, and a user requests 2
GPUs, that job will never get scheduled because a `p3.2xlarge` instance only has 1 GPU.

Consult [this
For more information, [see this
article](https://docs.aws.amazon.com/batch/latest/userguide/troubleshooting.html#job_stuck_in_runnable).

## My job is stuck in `STARTING` state. What should I do?

Are the resources requested in your Metaflow code/command sufficient? Especially when using
custom GPU images, you might need to increase the requested memory to pull the container image
into your compute environment.

## Listing and killing AWS Batch tasks

If you interrupt a Metaflow run, any AWS Batch tasks launched by the run get killed by
@@ -101,8 +139,15 @@ As a convenience feature, you can also see the logs of any past step as follows:
$ python bigsum.py logs 15/end
```


## Disk space

You can request higher disk space on AWS Batch instances by using an unmanaged Compute
Environment with a custom AMI.

## How to configure AWS Batch for distributed computing?

[See these instructions](https://outerbounds.com/engineering/operations/distributed-computing/)
if you want to use AWS Batch for [distributed computing](/scaling/remote-tasks/distributed-computing).


