Compute #121
Merged: 14 commits, May 5, 2024
@@ -4,7 +4,7 @@

This flow is a simple linear workflow that verifies your cloud configuration. The
`start` and `end` steps will run locally, while the `hello` step will [run
remotely](/scaling/remote-tasks/introduction). After [configuring
remotely](/scaling/remote-tasks/requesting-resources). After [configuring
Metaflow](/getting-started/infrastructure) to run in the cloud, data and metadata about
your runs will be stored remotely. This means you can use the client to access
information about any flow from anywhere.
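For reference, a minimal sketch of such a flow might look like the following; the flow name, step
bodies, and `@batch` resource values are illustrative (not part of this change), and `@kubernetes`
could be used in the same place:

```python
from metaflow import FlowSpec, step, batch

class HelloCloudFlow(FlowSpec):

    @step
    def start(self):
        # Runs locally
        print("Starting locally")
        self.next(self.hello)

    @batch(cpu=1, memory=4000)  # runs remotely once cloud infrastructure is configured
    @step
    def hello(self):
        # Runs remotely; the artifact is stored in the cloud datastore
        self.message = "Hello from the cloud!"
        self.next(self.end)

    @step
    def end(self):
        # Runs locally again
        print(self.message)

if __name__ == "__main__":
    HelloCloudFlow()
```

Because run metadata and artifacts live in the cloud datastore, the Client API can read them from
anywhere, for example `Flow('HelloCloudFlow').latest_run['hello'].task.data.message` in a notebook.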
4 changes: 2 additions & 2 deletions docs/index.md
@@ -31,12 +31,12 @@ projects.
- [Creating Flows](metaflow/basics)
- [Inspecting Flows and Results](metaflow/client)
- [Debugging Flows](metaflow/debugging)
- [Visualizing Results](metaflow/visualizing-results/) ✨*New Features*✨
- [Visualizing Results](metaflow/visualizing-results/)

## II. Scalable Flows

- [Introduction to Scalable Compute and Data](scaling/introduction)
- [Executing Tasks remotely](scaling/remote-tasks/introduction)
- [Computing at Scale](scaling/remote-tasks/introduction) ✨*Updated and Expanded*✨
- [Managing Dependencies](scaling/dependencies)
- [Dealing with Failures](scaling/failures)
- [Loading and Storing Data](scaling/data)
6 changes: 6 additions & 0 deletions docs/introduction/metaflow-resources.md
@@ -36,12 +36,17 @@

## Videos

- [Metaflow Office Hours](https://www.youtube.com/watch?v=76k5r6s6M1Q&list=PLUsOvkBBnJBcdF7SScDIndbnkMnFOoskA) -
cool stories from companies using Metaflow.
- [Metaflow on YouTube](https://www.youtube.com/results?search_query=metaflow+ml).
- You can start with [this recent
overview](https://www.youtube.com/watch?v=gZnhSHvhuFQ).

## Blogs

Find [a more comprehensive list of Metaflow users and posts here](https://outerbounds.com/stories/). Here's
a small sample:

- 23andMe: [Developing safe and reliable ML products at
23andMe](https://medium.com/23andme-engineering/machine-learning-eeee69d40736)
- AWS: [Getting started with the open source data science tool Metaflow on
@@ -50,6 +55,7 @@
CNN](https://medium.com/cnn-digital/accelerating-ml-within-cnn-983f6b7bd2eb)
- Latana: [Brand Tracking with Bayesian Statistics and AWS
Batch](https://aws.amazon.com/blogs/startups/brand-tracking-with-bayesian-statistics-and-aws-batch/)
- Netflix: [Supporting Diverse ML Systems at Netflix](https://netflixtechblog.com/supporting-diverse-ml-systems-at-netflix-2d2e6b6d205d)
- Netflix: [Open-Sourcing Metaflow, a Human-Centric Framework for Data
Science](https://netflixtechblog.com/open-sourcing-metaflow-a-human-centric-framework-for-data-science-fa72e04a5d9)
- Netflix: [Unbundling Data Science Workflows with Metaflow and AWS Step
5 changes: 3 additions & 2 deletions docs/metaflow/visualizing-results/README.md
@@ -39,7 +39,8 @@

In contrast, cards are not meant for building complex, interactive dashboards or for
ad-hoc exploration, which is a spot-on use case for notebooks. If you are curious, you can read
more about the motivation for cards in [the original release blog post](https://outerbounds.com/blog/integrating-pythonic-visual-reports-into-ml-pipelines/) and
more about the motivation for cards in [the original release blog
post](https://outerbounds.com/blog/integrating-pythonic-visual-reports-into-ml-pipelines/) and
[the announcement post for updating cards](https://outerbounds.com/blog/metaflow-dynamic-cards/).

## How to use cards?
@@ -85,7 +86,7 @@
above.

Also, crucially, cards work in any compute
environment such as laptops, [any remote tasks](/scaling/remote-tasks/introduction), or
environment such as laptops, [any remote tasks](/scaling/remote-tasks/requesting-resources), or
when the flow is [scheduled to run automatically](/production/introduction). Hence, you
can use cards to inspect and debug results during prototyping, as well as monitor the
quality of production runs.
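As a hedged illustration (flow name, card component, and metric below are made up), the same
card-producing code runs unchanged on a laptop, `--with batch`/`--with kubernetes`, or on a
production scheduler:

```python
from metaflow import FlowSpec, step, card, current
from metaflow.cards import Markdown

class CardDemoFlow(FlowSpec):

    @card(type='blank')          # attach a card to this step
    @step
    def start(self):
        accuracy = 0.93          # placeholder for a real metric
        current.card.append(Markdown(f"# Model report\nAccuracy: {accuracy:.2f}"))
        self.next(self.end)

    @step
    def end(self):
        pass

if __name__ == "__main__":
    CardDemoFlow()
```

After a run, the card could be opened with `python carddemo.py card view start`, assuming the file
is saved as `carddemo.py`.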
@@ -123,4 +124,4 @@

:::tip
You can test example cards interactively in [Metaflow
Sandbox](https://account.outerbounds.dev/account/?workspace=/home/workspace/workspaces/dynamic-cards/workspace.code-workspace), conveniently in the browser.

@@ -80,7 +80,7 @@ You should see a blank page with a blue “Hello World!” text.
![](</assets/card-docs-html_(2).png>)

A particularly useful feature of card templates is that they work in any compute
environment, even when [executing tasks remotely](/scaling/remote-tasks/introduction).
environment, even when [executing tasks remotely](/scaling/remote-tasks/requesting-resources).
For instance, if you have AWS Batch set up, you can run the flow as follows:

```
@@ -36,7 +36,7 @@ required in the code. All data artifacts produced by steps run on Airflow are available
using the [Client API](../../metaflow/client.md). All tasks are run on Kubernetes
respecting the `@resources` decorator, as if the `@kubernetes` decorator was added to
all steps, as explained in [Executing Tasks
Remotely](/scaling/remote-tasks/introduction#safeguard-flags).
Remotely](/scaling/remote-tasks/requesting-resources).
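For instance, a sketch like the one below (flow name, step bodies, and resource values are
arbitrary) keeps working unchanged when compiled to an Airflow DAG; each task is launched on
Kubernetes with the requested resources:

```python
from metaflow import FlowSpec, step, resources

class TrainFlow(FlowSpec):

    @step
    def start(self):
        self.next(self.train)

    @resources(cpu=4, memory=16000)  # honored by Kubernetes when scheduled via Airflow
    @step
    def train(self):
        self.model = "placeholder"   # stand-in for real training code
        self.next(self.end)

    @step
    def end(self):
        pass

if __name__ == "__main__":
    TrainFlow()
```

The DAG file could then be generated with something like `python trainflow.py airflow create train_dag.py`.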

This document describes the basics of Airflow scheduling. If your project involves
multiple people, multiple workflows, or it is becoming business-critical, check out the
@@ -115,7 +115,7 @@ the default with the `--max-workers` option. For instance, `airflow create --max-workers
500` allows 500 tasks to be executed concurrently for every foreach step.

This option is similar to [`run
--max-workers`](/scaling/remote-tasks/introduction#safeguard-flags) that is used to
--max-workers`](/scaling/remote-tasks/controlling-parallelism) that is used to
limit concurrency outside Airflow.

### Deploy-time parameters
@@ -25,7 +25,7 @@ changes are required in the code. All data artifacts produced by steps run on Argo
Workflows are available using the [Client API](../../metaflow/client.md). All tasks are
run on Kubernetes respecting the `@resources` decorator, as if the `@kubernetes`
decorator was added to all steps, as explained in [Executing Tasks
Remotely](/scaling/remote-tasks/introduction#safeguard-flags).
Remotely](/scaling/remote-tasks/requesting-resources).

This document describes the basics of Argo Workflows scheduling. If your project
involves multiple people, multiple workflows, or it is becoming business-critical, check
@@ -129,7 +129,7 @@ the default with the `--max-workers` option. For instance, `argo-workflows create
--max-workers 500` allows 500 tasks to be executed concurrently for every foreach step.

This option is similar to [`run
--max-workers`](/scaling/remote-tasks/introduction#safeguard-flags) that is used to
--max-workers`](/scaling/remote-tasks/controlling-parallelism) that is used to
limit concurrency outside Argo Workflows.

### Deploy-time parameters
@@ -23,7 +23,7 @@ changes are required in the code. All data artifacts produced by steps run on AWS Step
Functions are available using the [Client API](../../metaflow/client). All tasks are run
on AWS Batch respecting the `@resources` decorator, as if the `@batch` decorator was
added to all steps, as explained in [Executing Remote
Tasks](/scaling/remote-tasks/introduction).
Tasks](/scaling/remote-tasks/requesting-resources).

This document describes the basics of AWS Step Functions scheduling. If your project
involves multiple people, multiple workflows, or it is becoming business-critical, check
@@ -146,7 +146,7 @@ the default with the `--max-workers` option. For instance, `step-functions create
--max-workers 500` allows 500 tasks to be executed concurrently for every foreach step.

This option is similar to [`run
--max-workers`](/scaling/remote-tasks/introduction#safeguard-flags) that is used to
--max-workers`](/scaling/remote-tasks/controlling-parallelism) that is used to
limit concurrency outside AWS Step Functions.

### Deploy-time parameters
6 changes: 3 additions & 3 deletions docs/scaling/data.md
@@ -29,7 +29,7 @@ manipulation:
are caused by out-of-sync features.
* It is quicker to iterate on your model. Testing and debugging Python is easier than
testing and debugging SQL.
* You can request [arbitrary amount of resources](/scaling/remote-tasks/introduction)
* You can request [arbitrary amount of resources](/scaling/remote-tasks/requesting-resources)
for your data manipulation needs.
* Instead of having data manipulation code in two places (SQL and Python), all code can
be clearly laid out in a single place, in a single language, for maximum readability.
@@ -65,7 +65,7 @@ of `metaflow.S3` to directly interface with data files on S3 backing your tables
data is loaded directly from S3, there is no limitation to the number of parallel
processes. The size of data is only limited by the size of your instance, which can be
easily controlled with [the `@resources`
decorator](/scaling/remote-tasks/introduction#requesting-resources-with-resources-decorator).
decorator](/scaling/remote-tasks/requesting-resources).
The best part is that this approach is blazingly fast compared to executing SQL.

The main downside of this approach is that the table needs to have partitions that match
@@ -549,7 +549,7 @@ Read more about [fast data processing with `metaflow.S3` in this blog post](http

For maximum performance, ensure that
[the `@resources(memory=)`
setting](/scaling/remote-tasks/introduction#requesting-resources-with-resources-decorator)
setting](/scaling/remote-tasks/requesting-resources)
is higher than the amount of data you are downloading with `metaflow.S3`.
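As a rough sketch (the bucket, partition, and memory size below are hypothetical), a step that
pulls a whole partition with `metaflow.S3` under a generous `@resources(memory=)` setting could
look like this:

```python
from metaflow import FlowSpec, step, resources, S3

class FastDataFlow(FlowSpec):

    @resources(memory=32000)  # keep memory above the total size of the download
    @step
    def start(self):
        # Download every object under a (hypothetical) table partition in parallel
        with S3(s3root='s3://my-bucket/my-table/ds=2024-05-01/') as s3:
            objs = s3.get_all()
            total_mb = sum(obj.size for obj in objs) / 1024 ** 2
            print(f"downloaded {len(objs)} objects, {total_mb:.1f} MB")
        self.next(self.end)

    @step
    def end(self):
        pass

if __name__ == "__main__":
    FastDataFlow()
```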

If the amount of data is higher than the available disk space, you can use the
4 changes: 2 additions & 2 deletions docs/scaling/dependencies/README.md
@@ -30,7 +30,7 @@ others.
to retrieve a flow, rerun it, and get similar results. To make this possible, a
flow can't depend on libraries that are only available on your laptop.

3. **[Remote execution of tasks](/scaling/remote-tasks/introduction)** requires that
3. **[Remote execution of tasks](/scaling/remote-tasks/requesting-resources)** requires that
all dependencies can be reinstantiated on the fly in a remote environment.
Again, this is not possible if the flow depends on libraries that are only
available on your laptop.
@@ -61,7 +61,7 @@ dependencies like this automatically.

3. Crucially, Metaflow packages Metaflow itself for remote execution so that you
don't have to install it manually when
[using `@batch` and `@kubernetes`](/scaling/remote-tasks/introduction). Also, this
[using `@batch` and `@kubernetes`](/scaling/remote-tasks/requesting-resources). Also, this
guarantees that all tasks use the same version of Metaflow consistently.

4. External libraries can be included via [the `@pypi` and `@conda`
2 changes: 1 addition & 1 deletion docs/scaling/dependencies/containers.md
@@ -1,7 +1,7 @@

# Defining Custom Images

All [tasks executed remotely](/scaling/remote-tasks/introduction) run in a container,
All [tasks executed remotely](/scaling/remote-tasks/requesting-resources) run in a container,
both on [Kubernetes](/scaling/remote-tasks/kubernetes) and in
[AWS Batch](/scaling/remote-tasks/aws-batch). Hence, the default environment for
remote tasks is defined by the container (Docker) image used.
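For example, a custom image could be selected for a whole run from the command line; the flow file
name and registry path below are placeholders:

```
$ python myflow.py run --with batch:image=registry.example.com/ml/custom:1.0
$ python myflow.py run --with kubernetes:image=registry.example.com/ml/custom:1.0
```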
2 changes: 1 addition & 1 deletion docs/scaling/dependencies/libraries.md
@@ -236,7 +236,7 @@ in the `end` step.
## Libraries in remote tasks

A major benefit of `@pypi` and `@conda` is that they allow you to define libraries that will be
automatically made available when [you execute tasks remotely](/scaling/remote-tasks/introduction)
automatically made available when [you execute tasks remotely](/scaling/remote-tasks/requesting-resources)
on `@kubernetes` or on `@batch`. You don't need to do anything special to make this happen.
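A hedged sketch (flow name and package versions are illustrative): the packages declared in
`@pypi` are resolved and shipped to the remote task automatically:

```python
from metaflow import FlowSpec, step, pypi, kubernetes

class RemoteLibsFlow(FlowSpec):

    @kubernetes(cpu=2, memory=8000)
    @pypi(python='3.11', packages={'pandas': '2.2.2'})
    @step
    def start(self):
        import pandas as pd   # provided by @pypi in the remote container
        print(pd.__version__)
        self.next(self.end)

    @step
    def end(self):
        pass

if __name__ == "__main__":
    RemoteLibsFlow()
```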

If you have `@kubernetes` or `@batch` [configured to work with
4 changes: 2 additions & 2 deletions docs/scaling/dependencies/project-structure.md
@@ -3,7 +3,7 @@

This page describes how to arrange files in your projects to
follow software development best practices, which also leads to
[easy remote execution](/scaling/remote-tasks/introduction).
[easy remote execution](/scaling/remote-tasks/requesting-resources).

## Separating code to modules

@@ -55,7 +55,7 @@ can run the flow as usual:
python teaflow.py run
```
The `teatime.py` module works out of the box. If you have
[remote execution](/scaling/remote-tasks/introduction) set up,
[remote execution](/scaling/remote-tasks/requesting-resources) set up,
you can run the code `--with batch` or `--with kubernetes` and it works
equally well!

4 changes: 2 additions & 2 deletions docs/scaling/failures.md
@@ -52,7 +52,7 @@ sometimes the `start` step raises an exception and needs to be retried. By default,
always succeed.

It is recommended that you use `retry` every time you [run tasks
remotely](/scaling/remote-tasks/introduction). Instead of annotating every step with a
remotely](/scaling/remote-tasks/requesting-resources). Instead of annotating every step with a
retry decorator, you can also automatically add a retry decorator to all steps that do
not have one as follows:

@@ -104,7 +104,7 @@ can be retried without concern.
### Maximizing Safety

By default, `retry` retries the step three times before giving up. It waits for 2
minutes between retries for [remote tasks](/scaling/remote-tasks/introduction). This
minutes between retries for [remote tasks](/scaling/remote-tasks/requesting-resources). This
means that if your code fails fast, any transient platform issues need to get resolved
in less than 10 minutes or the whole run will fail. 10 minutes is typically more than
enough, but sometimes you want both a belt and suspenders.
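For example (the values and helper below are arbitrary, not recommendations), you can lengthen the
retry schedule on a flaky step and pair it with `@catch` as a last-resort fallback:

```python
import random

from metaflow import FlowSpec, step, retry, catch

def call_flaky_service():
    # Stand-in for a call that fails transiently
    if random.random() < 0.5:
        raise RuntimeError("transient failure")
    return "ok"

class SturdyFlow(FlowSpec):

    @catch(var='start_failure')                  # record the error instead of failing the run
    @retry(times=4, minutes_between_retries=10)  # wait longer than the default 3 retries x 2 minutes
    @step
    def start(self):
        self.result = call_flaky_service()
        self.next(self.end)

    @step
    def end(self):
        pass

if __name__ == "__main__":
    SturdyFlow()
```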
26 changes: 14 additions & 12 deletions docs/scaling/introduction.md
@@ -9,17 +9,18 @@ optimization](https://xkcd.com/1691/) is probably not the best use of your time.

Instead, you can leverage the cloud to get a bigger laptop or more laptops (virtually,
not literally). This is the Stage II in Metaflow development: Scaling flows with the
cloud. Luckily Metaflow makes this trivially easy - no changes in the code required -
cloud. Metaflow makes this easy - no changes in the code required -
after you have done the initial legwork to [configure infrastructure for
Metaflow](/getting-started/infrastructure).

![](/assets/intro-cartoon-2.svg)

## Supersizing Flows

Here's how Metaflow can help make your project more scalable:
Here's how Metaflow can help make your project more scalable, both technically and
organizationally:

1. You can make your existing flows more scalable just by adding a line of code,
1. You can make your existing flow scalable just by adding a line of code,
`@resources`. This way you can request more CPUs, memory, or GPUs in your flows. Or, you
can parallelize processing over multiple instances, even thousands of them.

@@ -28,18 +29,18 @@ attracting interest from colleagues too. Metaflow contains a number of features,
such as [namespaces](/scaling/tagging), which make collaboration smoother by allowing many
people to contribute without interfering with each other's work accidentally.

### Toolbox of Scalability
### Easy patterns for scalable, high-performance code

There is no single magic formula for scalability. Instead of proposing a novel paradigm
to make your Python code faster, Metaflow provides a set of pragmatic tools, leveraging
the best off-the-shelf components and services, which help you make code more scalable
and performant depending on your specific needs.
to make your Python code faster, [Metaflow provides a set of
practical, commonly used patterns](/scaling/remote-tasks/introduction), which help you
make code more scalable and performant depending on your specific needs.

The scalability tools fall into three categories:
The patterns fall into three categories:

- **Performance Optimization**: You can improve performance of your code by utilizing
off-the-shelf, high-performance libraries such as
[XGboost](https://github.com/dmlc/xgboost) or [Tensorflow](https://tensorflow.org).
[XGboost](https://github.com/dmlc/xgboost) or [PyTorch](https://pytorch.org/).
Or, if you need something more custom, you can leverage the vast landscape of data
tools for Python, including compilers like [Numba](https://numba.pydata.org) to speed
up your code.
@@ -53,7 +54,9 @@ care of provisioning such machines on demand.
- **Scaling Out**: Besides executing code on a single instance, Metaflow makes it easy
to parallelize steps over an arbitrarily large number of instances, leveraging
Kubernetes and AWS Batch, giving you access to a virtually unlimited amount of computing
power.
power. Besides running many independent tasks, you can also [create large compute
clusters](/scaling/remote-tasks/distributed-computing) on the fly, e.g.
to train large (language) models, as sketched below.
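Here is a hedged sketch of the scale-out pattern (shard count, resource sizes, and the per-shard
work are arbitrary): a `foreach` fans a step out over many tasks, each of which can run remotely
with `--with batch` or `--with kubernetes`:

```python
from metaflow import FlowSpec, step, resources

class FanOutFlow(FlowSpec):

    @step
    def start(self):
        self.shards = list(range(100))       # one task per shard
        self.next(self.process, foreach='shards')

    @resources(cpu=2, memory=8000)           # applied per task when run remotely
    @step
    def process(self):
        self.result = self.input * 2         # placeholder for real per-shard work
        self.next(self.join)

    @step
    def join(self, inputs):
        self.results = [inp.result for inp in inputs]
        self.next(self.end)

    @step
    def end(self):
        pass

if __name__ == "__main__":
    FanOutFlow()
```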

Often an effective recipe for scalability is a combination of these three techniques:
Start with high-performance Python libraries, run them on large instances, and if
@@ -73,8 +76,7 @@ In this section, you will learn how to make your flows capable of handling more
execute faster. You will also learn how to scale projects over multiple people by
organizing results better. We cover five topics:

1. [Executing tasks remotely with Kubernetes or AWS
Batch](/scaling/remote-tasks/introduction)
1. [Scaling compute in the cloud](/scaling/remote-tasks/introduction)
2. [Dealing with failures](/scaling/failures)
3. [Managing execution environments](/scaling/dependencies)
4. [Loading and storing data efficiently](/scaling/data)
53 changes: 49 additions & 4 deletions docs/scaling/remote-tasks/aws-batch.md
@@ -35,11 +35,49 @@ You can also specify the resource requirements on the command line:
$ python BigSum.py run --with batch:cpu=4,memory=10000,queue=default,image=ubuntu:latest
```

## My job is stuck in `RUNNABLE` state. What do I do?
## Using GPUs and Trainium instances with AWS Batch

To use GPUs in Metaflow tasks that run on AWS Batch, you need to run the flow in a
[Job Queue](https://docs.aws.amazon.com/batch/latest/userguide/job_queues.html) that
is attached to a [Compute
Environment](https://docs.aws.amazon.com/batch/latest/userguide/compute_environments.html)
with GPU/Trainium instances.

To set this up, you can either modify the allowable instances in a [Metaflow AWS deployment
template](https://github.com/outerbounds/metaflow-tools/tree/master/aws) or manually add such a
compute environment from the AWS console. The steps are:

1. Create a compute environment with GPU-enabled EC2 instances or Trainium instances.
2. Attach the compute environment to a new Job Queue - for example named `my-gpu-queue`.
3. Run a flow with a GPU task in the `my-gpu-queue` job queue by
- setting the `METAFLOW_BATCH_JOB_QUEUE` environment variable, or
- setting the `METAFLOW_BATCH_JOB_QUEUE` value in your Metaflow config, or
- (most explicit) setting the `queue` parameter in the `@batch` decorator.

It is a good practice to separate the job queues that you run GPU tasks on from those that do not
require GPUs (or Trainium instances). This makes it easier to track hardware-accelerated workflows,
which can be costly, separately from other workflows. Just add a line like
```python
@batch(gpu=1, queue='my-gpu-queue')
```
in steps that require GPUs.
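For instance, assuming a flow file named `gpuflow.py` (hypothetical), the queue can also be chosen
for a whole run from the command line, using the environment variable or the `--with` syntax:

```
$ METAFLOW_BATCH_JOB_QUEUE=my-gpu-queue python gpuflow.py run --with batch:gpu=1
$ python gpuflow.py run --with batch:gpu=1,queue=my-gpu-queue
```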

## My job is stuck in `RUNNABLE` state. What should I do?

Does the Batch job queue you are trying to run the Metaflow task in have a compute environment
with EC2 instances that provide the requested resources? For example, if your job queue is connected to
a single compute environment that only has `p3.2xlarge` as a GPU instance type, and a user requests 2
GPUs, that job will never get scheduled because a `p3.2xlarge` instance only has 1 GPU.

Consult [this
For more information, [see this
article](https://docs.aws.amazon.com/batch/latest/userguide/troubleshooting.html#job_stuck_in_runnable).

## My job is stuck in `STARTING` state. What should I do?

Are the resources requested in your Metaflow code/command sufficient? Especially when using
custom GPU images, you might need to increase the requested memory to pull the container image
into your compute environment.

## Listing and killing AWS Batch tasks

If you interrupt a Metaflow run, any AWS Batch tasks launched by the run get killed by
@@ -101,8 +139,15 @@ As a convenience feature, you can also see the logs of any past step as follows:
$ python bigsum.py logs 15/end
```


## Disk space

You can request higher disk space on AWS Batch instances by using an unmanaged Compute
Environment with a custom AMI.

## How to configure AWS Batch for distributed computing?

[See these instructions](https://outerbounds.com/engineering/operations/distributed-computing/)
if you want to use AWS Batch for [distributed computing](/scaling/remote-tasks/distributed-computing).


