Docs: Add warning against installing Ray in base environment #3267

Merged (11 commits, Mar 14, 2024)
45 changes: 45 additions & 0 deletions docs/source/running-jobs/distributed-jobs.rst
@@ -137,3 +137,48 @@ This allows you to SSH directly into the worker nodes, if required.
# Worker nodes.
$ ssh mycluster-worker1
$ ssh mycluster-worker2


Executing a Distributed Ray Program
Member:
For discussion: We may want to add a dedicated guide on how to use Ray as part of the user job; that page could use a tip like this. A full YAML example (#3195) may help promote this recommendation more directly.

Collaborator:

Yes, let's add a YAML example of starting a distributed job with Ray, as mentioned above. We can move it to a separate page later, after this PR is merged.

Contributor (author):

Just to confirm: I should add the command to run the example in distributed_ray_train (the one that uses FashionMNIST), and include that YAML file and its output as an example in the documentation?

Collaborator:

Yes, we can add the YAML file inline, together with the command to run it :)

------------------------------------
To execute a distributed Ray program on many VMs, you can use the following example:

.. code-block:: console

   $ sky launch ray_train.yaml

.. code-block:: yaml
   :emphasize-lines: 5-5,17-17,19-19,22-22

   resources:
     accelerators: L4:2
     memory: 64+

   num_nodes: 2

   setup: |
     pip install "ray[train]"
     pip install tqdm
     pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118

   run: |
     sudo chmod 777 -R /var/tmp
     head_ip=`echo "$SKYPILOT_NODE_IPS" | head -n1`
     num_nodes=`echo "$SKYPILOT_NODE_IPS" | wc -l`
     if [ "$SKYPILOT_NODE_RANK" == "0" ]; then
       ps aux | grep ray | grep 6379 &> /dev/null || ray start --head --disable-usage-stats --port 6379
       sleep 5
       python train.py --num-workers $num_nodes
     else
       sleep 5
       ps aux | grep ray | grep 6379 &> /dev/null || ray start --address $head_ip:6379 --disable-usage-stats
     fi
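
The ``run`` section above expects a ``train.py`` Ray Train script to be present on every node (for example, synced via SkyPilot's ``workdir`` or downloaded in ``setup``). As a rough sketch of what such a script could look like (the model, dataset, and hyperparameters below are illustrative placeholders, not the actual ``distributed_ray_train`` FashionMNIST example mentioned in the discussion):

.. code-block:: python

   """Hypothetical minimal train.py driven by Ray Train's TorchTrainer."""
   import argparse

   import torch
   import torch.nn as nn
   from torch.utils.data import DataLoader
   from torchvision import datasets, transforms

   import ray.train
   import ray.train.torch
   from ray.train import ScalingConfig
   from ray.train.torch import TorchTrainer


   def train_loop_per_worker(config):
       # Runs on every Ray Train worker (one worker per node with the YAML above).
       model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10))
       model = ray.train.torch.prepare_model(model)  # wraps in DDP, moves to GPU

       dataset = datasets.FashionMNIST(
           root="/tmp/data", train=True, download=True,
           transform=transforms.ToTensor(),
       )
       loader = DataLoader(dataset, batch_size=config["batch_size"], shuffle=True)
       loader = ray.train.torch.prepare_data_loader(loader)  # distributed sampling

       optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
       loss_fn = nn.CrossEntropyLoss()

       for epoch in range(config["epochs"]):
           for images, labels in loader:
               loss = loss_fn(model(images), labels)
               optimizer.zero_grad()
               loss.backward()
               optimizer.step()
           # Report per-epoch metrics back to the Ray Train driver.
           ray.train.report({"epoch": epoch, "loss": loss.item()})


   if __name__ == "__main__":
       parser = argparse.ArgumentParser()
       parser.add_argument("--num-workers", type=int, default=1)
       args = parser.parse_args()

       trainer = TorchTrainer(
           train_loop_per_worker,
           train_loop_config={"batch_size": 64, "epochs": 2},
           scaling_config=ScalingConfig(num_workers=args.num_workers, use_gpu=True),
       )
       trainer.fit()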

.. warning::

   **Avoid Installing Ray in the Base Environment**

   Before running a distributed Ray program, make sure that Ray is **not** installed in the base environment. Installing a different version of Ray there can lead to version conflicts and hard-to-debug failures.

   To keep the environment for your distributed Ray program clean and stable, it is highly recommended to **create a dedicated virtual environment** for Ray and its dependencies. This isolates the Ray installation and prevents it from interfering with other packages in the base environment.
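
   For example, a dedicated environment could be set up as follows (a sketch only; the environment name and install commands are illustrative and should be adapted to your project):

   .. code-block:: console

      $ python -m venv ~/ray-env
      $ source ~/ray-env/bin/activate
      $ pip install "ray[train]"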
