Merge branch 'devel' of github.com:NREL/rlmolecule into devel
Showing 4 changed files with 294 additions and 2 deletions.
@@ -0,0 +1,114 @@
Official documentation here: https://docs.ray.io/en/latest/cluster/cloud.html

The example that currently works uses a cluster launched by the local `example-full.yaml`.
It uses 1 head and 2 worker CPU m5.large instances running the
`rayproject/ray-ml:latest-cpu` image. It also installs rllib as part of the
`setup_commands`.
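
For orientation, here is a trimmed excerpt of that local `example-full.yaml` (the full file appears later in this commit):

```
docker:
    image: "rayproject/ray-ml:latest-cpu"

available_node_types:
    ray.head.default:
        node_config:
            InstanceType: m5.large
    ray.worker.default:
        min_workers: 2
        max_workers: 2
        node_config:
            InstanceType: m5.large

setup_commands:
    - pip install ray[rllib]
```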

## Default cluster configuration fails in multiple ways

The default yaml file, which uses a GPU Docker image and compute instance,
fails in multiple ways "as is".

1. Subnet error:

```
No usable subnets found, try manually creating an instance in your specified
region to populate the list of subnets and trying this again. Note that the
subnet must map public IPs on instance launch unless you set `use_internal_ips: true`
in the `provider` config.
```
__Fix:__ Set `use_internal_ips: True` in the `provider` configuration.
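
In the local `example-full.yaml`, the `provider` block with this fix applied looks like:

```
provider:
    type: aws
    region: us-west-2
    availability_zone: us-west-2a,us-west-2b
    cache_stopped_nodes: True
    use_internal_ips: True
```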

2. Credentials error:

```
botocore.exceptions.NoCredentialsError: Unable to locate credentials
```
__Fix:__ Set the profile name to `default` in `~/.aws/credentials`.
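
For reference, a minimal `~/.aws/credentials` with a `default` profile looks like this (placeholder values, not real credentials):

```
[default]
aws_access_key_id = YOUR_ACCESS_KEY_ID
aws_secret_access_key = YOUR_SECRET_ACCESS_KEY
```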

3. Disk space error:

```
latest-gpu: Pulling from rayproject/ray-ml
e4ca327ec0e7: Pull complete
...
850f7f4138ca: Extracting 3.752GB/3.752GB
3b7026c2a927: Download complete
failed to register layer: Error processing tar file(exit status 1): write /usr/lib/x86_64-linux-gnu/dri/iris_dri.so: no space left on device
Shared connection to 172.18.106.160 closed.
New status: update-failed
!!!
SSH command failed.
!!!
Failed to setup head node.
```
__Fix:__ Increasing storage from 100 GiB to 500 GiB seems to have done the trick:

```
available_node_types:
    ray.head.default:
        node_config:
            BlockDeviceMappings:
                - DeviceName: /dev/sda1
                  Ebs:
                      VolumeSize: 500
```

## Multi-node training example using only CPU instances

Here are the steps that demonstrated multi-node training of a PPO policy on
the CartPole-v0 environment. The Docker image and EC2 instances are CPU-only.

1. Start the cluster using the local version of the yaml file.

```
ray up example-full.yaml
```

2. SSH into the head node and check ray status.

```
ray attach example-full.yaml # on local machine
ray status # on remote machine
```

Be patient -- the worker nodes take a long time to start up and connect to the head!
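
While you wait, you can also poke at the cluster from Python inside the attached session. This is a minimal sketch using Ray's standard API, not part of the original steps:

```
import ray

# Attach to the cluster that `ray up` already started (don't start a new one).
ray.init(address="auto")

# With 1 head + 2 worker m5.large nodes this should report about 6 CPUs.
print(ray.cluster_resources())

# One entry per node that has joined so far.
print(len(ray.nodes()))
```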

3. Run a simple training example, ensuring that more than a single node is used.
With 1 head and 2 worker m5.large nodes, this command runs rllib using all
available CPUs (there are 6 total).

```
rllib train --run PPO --env CartPole-v0 --ray-num-cpus 6 --config '{"num_workers": 5}'
```
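
Where results end up is not covered above; by default `rllib train` writes them under `~/ray_results` on the head node, so something like the following (run on the head node, and assuming TensorBoard is available in the ray-ml image) lets you watch progress -- treat the exact paths as an assumption rather than something verified here:

```
ls ~/ray_results                        # one subdirectory per experiment
tensorboard --logdir ~/ray_results --port 6006
```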

## Multi-node training example using GPU head and CPU worker instances

Same as above, but set

```
docker:
    head_image: "rayproject/ray-ml:latest-gpu"
    worker_image: "rayproject/ray-ml:latest-cpu"
```

and use something like this command to train:

```
rllib train --env CartPole-v0 --run PPO --ray-num-gpus 1 --ray-num-cpus 6 --config '{"num_workers": 5}'
```

Here we are using 1 head GPU instance and 2 worker CPU instances, with a total of 1 GPU and 6 CPUs.

## Other notes

* Out of two attempts to create ray clusters, one had a failed worker node (1
out of 2 worker nodes failed, unsure why).

* The cluster is quite slow to launch, 15+ minutes with only 2 small worker nodes.
This is not just the docker pull / launch steps, but setting up the Ray cluster
itself. Could be due to using low-performance resources?
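
* When a node fails to come up or the launch seems stuck, tailing the autoscaler
logs from the local machine is usually the quickest way to see what the cluster
launcher is doing:

```
ray monitor example-full.yaml
```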
@@ -0,0 +1,178 @@
# This configuration file has been adapted from the Ray autoscaler's default example-full.yaml.

# A unique identifier for the head node and workers of this cluster.
cluster_name: default

# The maximum number of worker nodes to launch in addition to the head
# node.
max_workers: 2

# The autoscaler will scale up the cluster faster with higher upscaling speed.
# E.g., if the task requires adding more nodes then autoscaler will gradually
# scale up the cluster in chunks of upscaling_speed*currently_running_nodes.
# This number should be > 0.
upscaling_speed: 1.0

# This executes all commands on all nodes in the docker container,
# and opens all the necessary ports to support the Ray cluster.
# Empty string means disabled.
docker:
    image: "rayproject/ray-ml:latest-cpu" # CPU image; change this to latest-gpu if you need GPU support
    # image: rayproject/ray:latest-gpu # use this one if you don't need ML dependencies, it's faster to pull
    container_name: "ray_container"
    # If true, pulls latest version of image. Otherwise, `docker run` will only pull the image
    # if no cached version is present.
    pull_before_run: True
    run_options: # Extra options to pass into "docker run"
        - --ulimit nofile=65536:65536

    # Example of running a GPU head with CPU workers
    # head_image: "rayproject/ray-ml:latest-gpu"
    # Allow Ray to automatically detect GPUs

    # worker_image: "rayproject/ray-ml:latest-cpu"
    # worker_run_options: []

# If a node is idle for this many minutes, it will be removed.
idle_timeout_minutes: 5

# Cloud-provider specific configuration.
provider:
    type: aws
    region: us-west-2
    # Availability zone(s), comma-separated, that nodes may be launched in.
    # Nodes are currently spread between zones by a round-robin approach,
    # however this implementation detail should not be relied upon.
    availability_zone: us-west-2a,us-west-2b
    # Whether to allow node reuse. If set to False, nodes will be terminated
    # instead of stopped.
    cache_stopped_nodes: True # If not present, the default is True.
    use_internal_ips: True

# How Ray will authenticate with newly launched nodes.
auth:
    ssh_user: ubuntu
# By default Ray creates a new private keypair, but you can also use your own.
# If you do so, make sure to also set "KeyName" in the head and worker node
# configurations below.
# ssh_private_key: /path/to/your/key.pem

# Tell the autoscaler the allowed node types and the resources they provide.
# The key is the name of the node type, which is just for debugging purposes.
# The node config specifies the launch config and physical instance type.
available_node_types:
    ray.head.default:
        # The node type's CPU and GPU resources are auto-detected based on AWS instance type.
        # If desired, you can override the autodetected CPU and GPU resources advertised to the autoscaler.
        # You can also set custom resources.
        # For example, to mark a node type as having 1 CPU, 1 GPU, and 5 units of a resource called "custom", set
        # resources: {"CPU": 1, "GPU": 1, "custom": 5}
        resources: {}
        # Provider-specific config for this node type, e.g. instance type. By default
        # Ray will auto-configure unspecified fields such as SubnetId and KeyName.
        # For more documentation on available fields, see:
        # http://boto3.readthedocs.io/en/latest/reference/services/ec2.html#EC2.ServiceResource.create_instances
        node_config:
            InstanceType: m5.large
            ImageId: ami-0a2363a9cff180a64 # Deep Learning AMI (Ubuntu) Version 30
            # You can provision additional disk space with a conf as follows
            BlockDeviceMappings:
                - DeviceName: /dev/sda1
                  Ebs:
                      VolumeSize: 100
            # Additional options in the boto docs.
    ray.worker.default:
        # The minimum number of worker nodes of this type to launch.
        # This number should be >= 0.
        min_workers: 2
        # The maximum number of worker nodes of this type to launch.
        # This takes precedence over min_workers.
        max_workers: 2
        # The node type's CPU and GPU resources are auto-detected based on AWS instance type.
        # If desired, you can override the autodetected CPU and GPU resources advertised to the autoscaler.
        # You can also set custom resources.
        # For example, to mark a node type as having 1 CPU, 1 GPU, and 5 units of a resource called "custom", set
        # resources: {"CPU": 1, "GPU": 1, "custom": 5}
        resources: {}
        # Provider-specific config for this node type, e.g. instance type. By default
        # Ray will auto-configure unspecified fields such as SubnetId and KeyName.
        # For more documentation on available fields, see:
        # http://boto3.readthedocs.io/en/latest/reference/services/ec2.html#EC2.ServiceResource.create_instances
        node_config:
            InstanceType: m5.large
            ImageId: ami-0a2363a9cff180a64 # Deep Learning AMI (Ubuntu) Version 30
            # Run workers on spot by default. Comment this out to use on-demand.
            # NOTE: If relying on spot instances, it is best to specify multiple different instance
            # types to avoid interruption when one instance type is experiencing heightened demand.
            # Demand information can be found at https://aws.amazon.com/ec2/spot/instance-advisor/
            InstanceMarketOptions:
                MarketType: spot
                # Additional options can be found in the boto docs, e.g.
                #   SpotOptions:
                #       MaxPrice: MAX_HOURLY_PRICE
            # Additional options in the boto docs.

# Specify the node type of the head node (as configured above).
head_node_type: ray.head.default

# Files or directories to copy to the head and worker nodes. The format is a
# dictionary from REMOTE_PATH: LOCAL_PATH, e.g.
file_mounts: {
#    "/path1/on/remote/machine": "/path1/on/local/machine",
#    "/path2/on/remote/machine": "/path2/on/local/machine",
}

# Files or directories to copy from the head node to the worker nodes. The format is a
# list of paths. The same path on the head node will be copied to the worker node.
# This behavior is a subset of the file_mounts behavior. In the vast majority of cases
# you should just use file_mounts. Only use this if you know what you're doing!
cluster_synced_files: []

# Whether changes to directories in file_mounts or cluster_synced_files in the head node
# should sync to the worker node continuously
file_mounts_sync_continuously: False

# Patterns for files to exclude when running rsync up or rsync down
rsync_exclude:
    - "**/.git"
    - "**/.git/**"

# Pattern files to use for filtering out files when running rsync up or rsync down. The file is searched for
# in the source directory and recursively through all subdirectories. For example, if .gitignore is provided
# as a value, the behavior will match git's behavior for finding and using .gitignore files.
rsync_filter:
    - ".gitignore"

# List of commands that will be run before `setup_commands`. If docker is
# enabled, these commands will run outside the container and before docker
# is setup.
initialization_commands: []

# List of shell commands to run to set up nodes.
setup_commands:
    - pip install ray[rllib]
    # Note: if you're developing Ray, you probably want to create a Docker image that
    # has your Ray repo pre-cloned. Then, you can replace the pip installs
    # below with a git checkout <your_sha> (and possibly a recompile).
    # To run the nightly version of ray (as opposed to the latest), either use a rayproject docker image
    # that has the "nightly" (e.g. "rayproject/ray-ml:nightly-gpu") or uncomment the following line:
    # - pip install -U "ray[default] @ https://s3-us-west-2.amazonaws.com/ray-wheels/latest/ray-2.0.0.dev0-cp37-cp37m-manylinux2014_x86_64.whl"

# Custom commands that will be run on the head node after common setup.
head_setup_commands: []

# Custom commands that will be run on worker nodes after common setup.
worker_setup_commands: []

# Command to start ray on the head node. You don't need to change this.
head_start_ray_commands:
    - ray stop
    - ray start --head --port=6379 --object-manager-port=8076 --autoscaling-config=~/ray_bootstrap_config.yaml

# Command to start ray on worker nodes. You don't need to change this.
worker_start_ray_commands:
    - ray stop
    - ray start --address=$RAY_HEAD_IP:6379 --object-manager-port=8076

head_node: {}
worker_nodes: {}