Docs: Add warning against installing Ray in base environment #3267

Merged (11 commits, Mar 14, 2024)
45 changes: 45 additions & 0 deletions docs/source/running-jobs/distributed-jobs.rst
@@ -137,3 +137,48 @@ This allows you to SSH directly into the worker nodes, if required.
# Worker nodes.
$ ssh mycluster-worker1
$ ssh mycluster-worker2


Executing a Distributed Ray Program
Member:
For discussion: We may want to add a dedicated guide on how to use Ray as part of the user job; that page could use a tip like this. A full YAML example (#3195) may help promote this recommendation more directly.

Collaborator:

Yes, let's add a YAML example of starting a distributed job with Ray, as mentioned above. We can move it to a separate page later, after this PR is merged.

Contributor (author):

Just to confirm: I should add the command to run the example in distributed_ray_train (the one that uses FashionMNIST), and include that YAML file and its output as an example in the documentation?

Collaborator:

Yes, we can add the YAML file inline, together with the command to run it :)

------------------------------------
To execute a distributed Ray program on many VMs, you can use the following example:

.. code-block:: console

   $ sky launch ray_train.yaml

.. code-block:: yaml
   :emphasize-lines: 5-5,17-17,19-19,22-22

   resources:
     accelerators: L4:2
     memory: 64+

   num_nodes: 2

   setup: |
     pip install "ray[train]"
     pip install tqdm
     pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118

   run: |
     sudo chmod 777 -R /var/tmp
     head_ip=`echo "$SKYPILOT_NODE_IPS" | head -n1`
     num_nodes=`echo "$SKYPILOT_NODE_IPS" | wc -l`
     if [ "$SKYPILOT_NODE_RANK" == "0" ]; then
       ps aux | grep ray | grep 6379 &> /dev/null || ray start --head --disable-usage-stats --port 6379
       sleep 5
       python train.py --num-workers $num_nodes
     else
       sleep 5
       ps aux | grep ray | grep 6379 &> /dev/null || ray start --address $head_ip:6379 --disable-usage-stats
     fi
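
The ``run`` section above expects a ``train.py`` Ray Train script to be present on every node (for example, synced via SkyPilot's ``workdir`` or downloaded in ``setup``). As a rough sketch of what such a script could look like (the model, dataset, and hyperparameters below are illustrative placeholders, not the actual ``distributed_ray_train`` FashionMNIST example mentioned in the discussion):

.. code-block:: python

   """Hypothetical minimal train.py driven by Ray Train's TorchTrainer."""
   import argparse

   import torch
   import torch.nn as nn
   from torch.utils.data import DataLoader
   from torchvision import datasets, transforms

   import ray.train
   import ray.train.torch
   from ray.train import ScalingConfig
   from ray.train.torch import TorchTrainer


   def train_loop_per_worker(config):
       # Runs on every Ray Train worker (one worker per node with the YAML above).
       model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10))
       model = ray.train.torch.prepare_model(model)  # wraps in DDP, moves to GPU

       dataset = datasets.FashionMNIST(
           root="/tmp/data", train=True, download=True,
           transform=transforms.ToTensor(),
       )
       loader = DataLoader(dataset, batch_size=config["batch_size"], shuffle=True)
       loader = ray.train.torch.prepare_data_loader(loader)  # distributed sampling

       optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
       loss_fn = nn.CrossEntropyLoss()

       for epoch in range(config["epochs"]):
           for images, labels in loader:
               loss = loss_fn(model(images), labels)
               optimizer.zero_grad()
               loss.backward()
               optimizer.step()
           # Report per-epoch metrics back to the Ray Train driver.
           ray.train.report({"epoch": epoch, "loss": loss.item()})


   if __name__ == "__main__":
       parser = argparse.ArgumentParser()
       parser.add_argument("--num-workers", type=int, default=1)
       args = parser.parse_args()

       trainer = TorchTrainer(
           train_loop_per_worker,
           train_loop_config={"batch_size": 64, "epochs": 2},
           scaling_config=ScalingConfig(num_workers=args.num_workers, use_gpu=True),
       )
       trainer.fit()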

.. warning::

   **Avoid Installing Ray in the Base Environment**

   Before running a distributed Ray program, make sure that Ray is **not** installed in the base environment. Installing a different version of Ray there can lead to version conflicts and hard-to-debug failures.

   To keep the environment for your distributed Ray program clean and stable, it is highly recommended to **create a dedicated virtual environment** for Ray and its dependencies. This isolates the Ray installation and prevents it from interfering with other packages in the base environment.
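
   For example, a dedicated environment could be set up as follows (a sketch only; the environment name and install commands are illustrative and should be adapted to your project):

   .. code-block:: console

      $ python -m venv ~/ray-env
      $ source ~/ray-env/bin/activate
      $ pip install "ray[train]"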
