Merge branch 'devel' of github.com:NREL/rlmolecule into devel
jlaw9 committed Jan 8, 2022
2 parents 8e4d932 + f506db9 commit 29820cd
Showing 4 changed files with 294 additions and 2 deletions.
114 changes: 114 additions & 0 deletions devtools/aws/README.md
@@ -0,0 +1,114 @@
Official documentation here: https://docs.ray.io/en/latest/cluster/cloud.html

The example that currently works uses a cluster launched with the local `example-full.yaml`.
It uses 1 head node and 2 worker nodes, all CPU-only m5.large instances, running the
`rayproject/ray-ml:latest-cpu` image. It also installs RLlib as part of the
`setup_commands`.

## Default cluster configuration fails in multiple ways

The default YAML file, which uses a GPU Docker image and compute instance,
fails in multiple ways when used as is.

1. Subnet error:

```
No usable subnets found, try manually creating an instance in your specified
region to populate the list of subnets and trying this again. Note that the
subnet must map public IPs on instance launch unless you set `use_internal_ips: true`
in the `provider` config.
```
__Fix:__ Set `use_internal_ips: True` in the `provider` configuration.

2. Credentials error:

```
botocore.exceptions.NoCredentialsError: Unable to locate credentials
```
__Fix:__ Set the profile name to `default` in `~/.aws/credentials`. (A minimal `boto3` check to confirm the profile is picked up is sketched after this list.)

3. Disk space error:

```
latest-gpu: Pulling from rayproject/ray-ml
e4ca327ec0e7: Pull complete
...
850f7f4138ca: Extracting 3.752GB/3.752GB
3b7026c2a927: Download complete
failed to register layer: Error processing tar file(exit status 1): write /usr/lib/x86_64-linux-gnu/dri/iris_dri.so: no space left on device
Shared connection to 172.18.106.160 closed.
New status: update-failed
!!!
SSH command failed.
!!!
Failed to setup head node.
```
__Fix:__ Increasing storage from 100 GiB to 500 GiB seems to have done the trick:

```
available_node_types:
    ray.head.default:
        node_config:
            BlockDeviceMappings:
                - DeviceName: /dev/sda1
                  Ebs:
                      VolumeSize: 500
```
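
To confirm the credentials fix (item 2) before launching, a minimal `boto3` check can be run on the local machine. This is a sketch rather than part of the setup: it assumes `boto3` is installed locally and only verifies that the `default` profile resolves to a set of credentials.

```
# Sketch: confirm that the "default" profile in ~/.aws/credentials resolves
# before running `ray up`. Assumes boto3 is installed on the local machine.
import boto3

session = boto3.Session(profile_name="default", region_name="us-west-2")
creds = session.get_credentials()
if creds is None:
    raise RuntimeError("No credentials found for the 'default' profile")
print("Using access key:", creds.access_key[:4] + "...")
```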

## Multi-node training example using only CPU instances

Here are the steps that demonstrated multi-node training of a PPO policy on
the CartPole-v0 environment. The Docker image and EC2 instances are cpu-only.

1. Start the cluster using the local version of the yaml file.

```
ray up example-full.yaml
```

2. SSH into the head node and check `ray status`.

```
ray attach example-full.yaml # on local machine
ray status # on remote machine
```

Be patient -- the worker nodes take a long time to start up and connect to the head!
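
As a rough programmatic complement to `ray status`, the snippet below uses Ray's Python API (a sketch; it assumes the cluster from `example-full.yaml` is already up and that you run it on the head node inside the `ray attach` session):

```
# Sketch: inspect the running cluster from Python on the head node.
# ray.init(address="auto") attaches to the existing cluster rather than starting one.
import ray

ray.init(address="auto")
print(ray.cluster_resources())            # total CPUs (and GPUs, if any) across all nodes
print(len(ray.nodes()), "nodes connected")
ray.shutdown()
```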

3. Run a simple training example, ensuring that more than a single node is used.
With 1 head and 2 worker m5.large nodes, this command runs RLlib using all
available CPUs (there are 6 in total).

```
rllib train --run PPO --env CartPole-v0 --ray-num-cpus 6 --config '{"num_workers": 5}'
```
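
For reference, here is a rough Python equivalent of the CLI command above (a sketch assuming the Ray 1.x `tune.run` API that ships in the `ray-ml` image; the stopping criterion is arbitrary and only there to make the script terminate):

```
# Sketch: roughly what the `rllib train` command above does, via Ray Tune's
# Python API (Ray 1.x style). Run this on the head node.
import ray
from ray import tune

ray.init(address="auto")  # attach to the running cluster
tune.run(
    "PPO",
    config={"env": "CartPole-v0", "num_workers": 5},
    stop={"training_iteration": 10},  # arbitrary stopping point for the example
)
```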

## Multi-node training example using GPU head and CPU worker instances

Same as above, but set

```
docker:
head_image: "rayproject/ray-ml:latest-gpu"
worker_image: "rayproject/ray-ml:latest-cpu"
```

and use something like this command to train:

```
rllib train --env CartPole-v0 --run PPO --ray-num-gpus 1 --ray-num-cpus 6 --config '{"num_workers": 5}'
```

Here we are using 1 head GPU instance and 2 worker CPU instances, with a total of 1 GPU and 6 CPUs.
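
A corresponding sketch for the GPU variant (same assumptions as above); the only change is asking RLlib to give the trainer process 1 GPU while the 5 rollout workers remain on CPU:

```
# Sketch: GPU variant of the training example. The trainer process requests
# 1 GPU (on the head node); the rollout workers stay on CPU.
import ray
from ray import tune

ray.init(address="auto")
tune.run(
    "PPO",
    config={"env": "CartPole-v0", "num_workers": 5, "num_gpus": 1},
    stop={"training_iteration": 10},
)
```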


## Other notes

* Out of two attempts to create a Ray cluster, one ended up with a failed worker node
(1 of the 2 worker nodes failed to come up; the cause is unclear).

* The cluster is quite slow to launch: 15+ minutes with only 2 small worker nodes.
This is not just the Docker pull / launch steps but also setting up the Ray cluster
itself. Possibly due to using low-performance resources.

178 changes: 178 additions & 0 deletions devtools/aws/example-full.yaml
@@ -0,0 +1,178 @@
# This configuration file has been modified from Ray's default AWS example (example-full.yaml);
# see README.md in this directory for the changes that were needed.

# A unique identifier for the head node and workers of this cluster.
cluster_name: default

# The maximum number of worker nodes to launch in addition to the head
# node.
max_workers: 2

# The autoscaler will scale up the cluster faster with higher upscaling speed.
# E.g., if the task requires adding more nodes then the autoscaler will gradually
# scale up the cluster in chunks of upscaling_speed*currently_running_nodes.
# This number should be > 0.
upscaling_speed: 1.0

# This executes all commands on all nodes in the docker container,
# and opens all the necessary ports to support the Ray cluster.
# Empty string means disabled.
docker:
    image: "rayproject/ray-ml:latest-cpu"  # CPU-only image; change to latest-gpu if you need GPU support
    # image: rayproject/ray:latest-gpu  # use this one if you don't need ML dependencies, it's faster to pull
    container_name: "ray_container"
    # If true, pulls latest version of image. Otherwise, `docker run` will only pull the image
    # if no cached version is present.
    pull_before_run: True
    run_options:  # Extra options to pass into "docker run"
        - --ulimit nofile=65536:65536

    # Example of running a GPU head with CPU workers
    # head_image: "rayproject/ray-ml:latest-gpu"
    # Allow Ray to automatically detect GPUs

    # worker_image: "rayproject/ray-ml:latest-cpu"
    # worker_run_options: []

# If a node is idle for this many minutes, it will be removed.
idle_timeout_minutes: 5

# Cloud-provider specific configuration.
provider:
    type: aws
    region: us-west-2
    # Availability zone(s), comma-separated, that nodes may be launched in.
    # Nodes are currently spread between zones by a round-robin approach,
    # however this implementation detail should not be relied upon.
    availability_zone: us-west-2a,us-west-2b
    # Whether to allow node reuse. If set to False, nodes will be terminated
    # instead of stopped.
    cache_stopped_nodes: True  # If not present, the default is True.
    use_internal_ips: True

# How Ray will authenticate with newly launched nodes.
auth:
    ssh_user: ubuntu
    # By default Ray creates a new private keypair, but you can also use your own.
    # If you do so, make sure to also set "KeyName" in the head and worker node
    # configurations below.
    # ssh_private_key: /path/to/your/key.pem

# Tell the autoscaler the allowed node types and the resources they provide.
# The key is the name of the node type, which is just for debugging purposes.
# The node config specifies the launch config and physical instance type.
available_node_types:
    ray.head.default:
        # The node type's CPU and GPU resources are auto-detected based on AWS instance type.
        # If desired, you can override the autodetected CPU and GPU resources advertised to the autoscaler.
        # You can also set custom resources.
        # For example, to mark a node type as having 1 CPU, 1 GPU, and 5 units of a resource called "custom", set
        # resources: {"CPU": 1, "GPU": 1, "custom": 5}
        resources: {}
        # Provider-specific config for this node type, e.g. instance type. By default
        # Ray will auto-configure unspecified fields such as SubnetId and KeyName.
        # For more documentation on available fields, see:
        # http://boto3.readthedocs.io/en/latest/reference/services/ec2.html#EC2.ServiceResource.create_instances
        node_config:
            InstanceType: m5.large
            ImageId: ami-0a2363a9cff180a64  # Deep Learning AMI (Ubuntu) Version 30
            # You can provision additional disk space with a conf as follows
            BlockDeviceMappings:
                - DeviceName: /dev/sda1
                  Ebs:
                      VolumeSize: 100
            # Additional options in the boto docs.
    ray.worker.default:
        # The minimum number of worker nodes of this type to launch.
        # This number should be >= 0.
        min_workers: 2
        # The maximum number of worker nodes of this type to launch.
        # This takes precedence over min_workers.
        max_workers: 2
        # The node type's CPU and GPU resources are auto-detected based on AWS instance type.
        # If desired, you can override the autodetected CPU and GPU resources advertised to the autoscaler.
        # You can also set custom resources.
        # For example, to mark a node type as having 1 CPU, 1 GPU, and 5 units of a resource called "custom", set
        # resources: {"CPU": 1, "GPU": 1, "custom": 5}
        resources: {}
        # Provider-specific config for this node type, e.g. instance type. By default
        # Ray will auto-configure unspecified fields such as SubnetId and KeyName.
        # For more documentation on available fields, see:
        # http://boto3.readthedocs.io/en/latest/reference/services/ec2.html#EC2.ServiceResource.create_instances
        node_config:
            InstanceType: m5.large
            ImageId: ami-0a2363a9cff180a64  # Deep Learning AMI (Ubuntu) Version 30
            # Run workers on spot by default. Comment this out to use on-demand.
            # NOTE: If relying on spot instances, it is best to specify multiple different instance
            # types to avoid interruption when one instance type is experiencing heightened demand.
            # Demand information can be found at https://aws.amazon.com/ec2/spot/instance-advisor/
            InstanceMarketOptions:
                MarketType: spot
                # Additional options can be found in the boto docs, e.g.
                #   SpotOptions:
                #       MaxPrice: MAX_HOURLY_PRICE
            # Additional options in the boto docs.

# Specify the node type of the head node (as configured above).
head_node_type: ray.head.default

# Files or directories to copy to the head and worker nodes. The format is a
# dictionary from REMOTE_PATH: LOCAL_PATH, e.g.
file_mounts: {
# "/path1/on/remote/machine": "/path1/on/local/machine",
# "/path2/on/remote/machine": "/path2/on/local/machine",
}

# Files or directories to copy from the head node to the worker nodes. The format is a
# list of paths. The same path on the head node will be copied to the worker node.
# This behavior is a subset of the file_mounts behavior. In the vast majority of cases
# you should just use file_mounts. Only use this if you know what you're doing!
cluster_synced_files: []

# Whether changes to directories in file_mounts or cluster_synced_files in the head node
# should sync to the worker node continuously
file_mounts_sync_continuously: False

# Patterns for files to exclude when running rsync up or rsync down
rsync_exclude:
- "**/.git"
- "**/.git/**"

# Pattern files to use for filtering out files when running rsync up or rsync down. The file is searched for
# in the source directory and recursively through all subdirectories. For example, if .gitignore is provided
# as a value, the behavior will match git's behavior for finding and using .gitignore files.
rsync_filter:
- ".gitignore"

# List of commands that will be run before `setup_commands`. If docker is
# enabled, these commands will run outside the container and before docker
# is setup.
initialization_commands: []

# List of shell commands to run to set up nodes.
setup_commands:
    - pip install ray[rllib]
    # Note: if you're developing Ray, you probably want to create a Docker image that
    # has your Ray repo pre-cloned. Then, you can replace the pip installs
    # below with a git checkout <your_sha> (and possibly a recompile).
    # To run the nightly version of ray (as opposed to the latest), either use a rayproject docker image
    # that has the "nightly" (e.g. "rayproject/ray-ml:nightly-gpu") or uncomment the following line:
    # - pip install -U "ray[default] @ https://s3-us-west-2.amazonaws.com/ray-wheels/latest/ray-2.0.0.dev0-cp37-cp37m-manylinux2014_x86_64.whl"

# Custom commands that will be run on the head node after common setup.
head_setup_commands: []

# Custom commands that will be run on worker nodes after common setup.
worker_setup_commands: []

# Command to start ray on the head node. You don't need to change this.
head_start_ray_commands:
    - ray stop
    - ray start --head --port=6379 --object-manager-port=8076 --autoscaling-config=~/ray_bootstrap_config.yaml

# Command to start ray on worker nodes. You don't need to change this.
worker_start_ray_commands:
    - ray stop
    - ray start --address=$RAY_HEAD_IP:6379 --object-manager-port=8076

head_node: {}
worker_nodes: {}
2 changes: 1 addition & 1 deletion rlmolecule/alphazero/tensorflow/tf_keras_policy.py
@@ -122,7 +122,7 @@ def align_input_names(keras_inputs, mask_dict):

 class TimeCsvLogger(tf.keras.callbacks.CSVLogger):
     def on_epoch_end(self, epoch, logs=None):
-        logs = logs or {}
+        logs = logs.copy() or {}
         logs['time'] = datetime.now().strftime('%Y-%m-%d %H:%M:%S.%f')
         super(TimeCsvLogger, self).on_epoch_end(epoch, logs)

2 changes: 1 addition & 1 deletion tests/alphazero/test_molecule_alphazero.py
@@ -139,7 +139,7 @@ def test_train_policy_model(self, game):

         weights_before = problem.batched_policy_model.get_weights()[1]

-        history = problem.train_policy_model(steps_per_epoch=10, epochs=1)
+        history = problem.train_policy_model(steps_per_epoch=10, epochs=1, verbose=2)
         assert np.isfinite(history.history['loss'][0])
         assert 'policy.01.index' in os.listdir(problem.policy_checkpoint_dir)

