Merge branch 'devel' of github.com:NREL/rlmolecule into devel
jlaw9 committed Jan 8, 2022
2 parents 8e4d932 + f506db9 commit 29820cd
Showing 4 changed files with 294 additions and 2 deletions.
114 changes: 114 additions & 0 deletions devtools/aws/README.md
@@ -0,0 +1,114 @@
Official documentation here: https://docs.ray.io/en/latest/cluster/cloud.html

The example that currently works uses a cluster launched with the local `example-full.yaml`.
It uses 1 head node and 2 worker nodes, all CPU-only m5.large instances, running the
`rayproject/ray-ml:latest-cpu` image. It also installs RLlib as part of the
`setup_commands`.

## Default cluster configuration fails in multiple ways

The default YAML file, which uses a GPU Docker image and compute instance,
fails in multiple ways when used as is.

1. Subnet error:

```
No usable subnets found, try manually creating an instance in your specified
region to populate the list of subnets and trying this again. Note that the
subnet must map public IPs on instance launch unless you set `use_internal_ips: true`
in the `provider` config.
```
__Fix:__ Set `use_internal_ips: True` in the `provider` configuration.

2. Credentials error:

```
botocore.exceptions.NoCredentialsError: Unable to locate credentials
```
__Fix:__ Set the profile name to `default` in `~/.aws/credentials`. (A minimal `boto3` check to confirm the profile is picked up is sketched after this list.)

3. Disk space error:

```
latest-gpu: Pulling from rayproject/ray-ml
e4ca327ec0e7: Pull complete
...
850f7f4138ca: Extracting 3.752GB/3.752GB
3b7026c2a927: Download complete
failed to register layer: Error processing tar file(exit status 1): write /usr/lib/x86_64-linux-gnu/dri/iris_dri.so: no space left on device
Shared connection to 172.18.106.160 closed.
New status: update-failed
!!!
SSH command failed.
!!!
Failed to setup head node.
```
__Fix:__ Increasing storage from 100 GiB to 500 GiB seems to have done the trick:

```
available_node_types:
    ray.head.default:
        node_config:
            BlockDeviceMappings:
                - DeviceName: /dev/sda1
                  Ebs:
                      VolumeSize: 500
```
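
To confirm the credentials fix (item 2) before launching, a minimal `boto3` check can be run on the local machine. This is a sketch rather than part of the setup: it assumes `boto3` is installed locally and only verifies that the `default` profile resolves to a set of credentials.

```
# Sketch: confirm that the "default" profile in ~/.aws/credentials resolves
# before running `ray up`. Assumes boto3 is installed on the local machine.
import boto3

session = boto3.Session(profile_name="default", region_name="us-west-2")
creds = session.get_credentials()
if creds is None:
    raise RuntimeError("No credentials found for the 'default' profile")
print("Using access key:", creds.access_key[:4] + "...")
```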

## Multi-node training example using only CPU instances

Here are the steps that demonstrated multi-node training of a PPO policy on
the CartPole-v0 environment. The Docker image and EC2 instances are cpu-only.

1. Start the cluster using the local version of the yaml file.

```
ray up example-full.yaml
```

2. SSH into the head node and check `ray status`.

```
ray attach example-full.yaml # on local machine
ray status # on remote machine
```

Be patient -- the worker nodes take a long time to start up and connect to the head!
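
As a rough programmatic complement to `ray status`, the snippet below uses Ray's Python API (a sketch; it assumes the cluster from `example-full.yaml` is already up and that you run it on the head node inside the `ray attach` session):

```
# Sketch: inspect the running cluster from Python on the head node.
# ray.init(address="auto") attaches to the existing cluster rather than starting one.
import ray

ray.init(address="auto")
print(ray.cluster_resources())            # total CPUs (and GPUs, if any) across all nodes
print(len(ray.nodes()), "nodes connected")
ray.shutdown()
```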

3. Run a simple training example, ensuring that more than a single node is used.
With 1 head and 2 worker m5.large nodes, this command runs RLlib using all
available CPUs (there are 6 in total).

```
rllib train --run PPO --env CartPole-v0 --ray-num-cpus 6 --config '{"num_workers": 5}'
```
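
For reference, here is a rough Python equivalent of the CLI command above (a sketch assuming the Ray 1.x `tune.run` API that ships in the `ray-ml` image; the stopping criterion is arbitrary and only there to make the script terminate):

```
# Sketch: roughly what the `rllib train` command above does, via Ray Tune's
# Python API (Ray 1.x style). Run this on the head node.
import ray
from ray import tune

ray.init(address="auto")  # attach to the running cluster
tune.run(
    "PPO",
    config={"env": "CartPole-v0", "num_workers": 5},
    stop={"training_iteration": 10},  # arbitrary stopping point for the example
)
```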

## Multi-node training example using GPU head and CPU worker instances

Same as above, but set

```
docker:
head_image: "rayproject/ray-ml:latest-gpu"
worker_image: "rayproject/ray-ml:latest-cpu"
```

and use something like this command to train:

```
rllib train --env CartPole-v0 --run PPO --ray-num-gpus 1 --ray-num-cpus 6 --config '{"num_workers": 5}'
```

Here we are using 1 head GPU instance and 2 worker CPU instances, with a total of 1 GPU and 6 CPUs.
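
A corresponding sketch for the GPU variant (same assumptions as above); the only change is asking RLlib to give the trainer process 1 GPU while the 5 rollout workers remain on CPU:

```
# Sketch: GPU variant of the training example. The trainer process requests
# 1 GPU (on the head node); the rollout workers stay on CPU.
import ray
from ray import tune

ray.init(address="auto")
tune.run(
    "PPO",
    config={"env": "CartPole-v0", "num_workers": 5, "num_gpus": 1},
    stop={"training_iteration": 10},
)
```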


## Other notes

* Out of two attempts to create a Ray cluster, one ended up with a failed worker node
(1 of the 2 worker nodes failed to come up; the cause is unclear).

* The cluster is quite slow to launch: 15+ minutes with only 2 small worker nodes.
This is not just the Docker pull / launch steps but also setting up the Ray cluster
itself. Possibly due to using low-performance resources.

178 changes: 178 additions & 0 deletions devtools/aws/example-full.yaml
@@ -0,0 +1,178 @@
# This configuration file has been modified from Ray's default AWS example (example-full.yaml);
# see README.md in this directory for the changes that were needed.

# A unique identifier for the head node and workers of this cluster.
cluster_name: default

# The maximum number of worker nodes to launch in addition to the head
# node.
max_workers: 2

# The autoscaler will scale up the cluster faster with higher upscaling speed.
# E.g., if the task requires adding more nodes then the autoscaler will gradually
# scale up the cluster in chunks of upscaling_speed*currently_running_nodes.
# This number should be > 0.
upscaling_speed: 1.0

# This executes all commands on all nodes in the docker container,
# and opens all the necessary ports to support the Ray cluster.
# Empty string means disabled.
docker:
    image: "rayproject/ray-ml:latest-cpu"  # CPU-only image; change to latest-gpu if you need GPU support
    # image: rayproject/ray:latest-gpu  # use this one if you don't need ML dependencies, it's faster to pull
    container_name: "ray_container"
    # If true, pulls latest version of image. Otherwise, `docker run` will only pull the image
    # if no cached version is present.
    pull_before_run: True
    run_options:  # Extra options to pass into "docker run"
        - --ulimit nofile=65536:65536

    # Example of running a GPU head with CPU workers
    # head_image: "rayproject/ray-ml:latest-gpu"
    # Allow Ray to automatically detect GPUs

    # worker_image: "rayproject/ray-ml:latest-cpu"
    # worker_run_options: []

# If a node is idle for this many minutes, it will be removed.
idle_timeout_minutes: 5

# Cloud-provider specific configuration.
provider:
    type: aws
    region: us-west-2
    # Availability zone(s), comma-separated, that nodes may be launched in.
    # Nodes are currently spread between zones by a round-robin approach,
    # however this implementation detail should not be relied upon.
    availability_zone: us-west-2a,us-west-2b
    # Whether to allow node reuse. If set to False, nodes will be terminated
    # instead of stopped.
    cache_stopped_nodes: True  # If not present, the default is True.
    use_internal_ips: True

# How Ray will authenticate with newly launched nodes.
auth:
    ssh_user: ubuntu
    # By default Ray creates a new private keypair, but you can also use your own.
    # If you do so, make sure to also set "KeyName" in the head and worker node
    # configurations below.
    # ssh_private_key: /path/to/your/key.pem

# Tell the autoscaler the allowed node types and the resources they provide.
# The key is the name of the node type, which is just for debugging purposes.
# The node config specifies the launch config and physical instance type.
available_node_types:
    ray.head.default:
        # The node type's CPU and GPU resources are auto-detected based on AWS instance type.
        # If desired, you can override the autodetected CPU and GPU resources advertised to the autoscaler.
        # You can also set custom resources.
        # For example, to mark a node type as having 1 CPU, 1 GPU, and 5 units of a resource called "custom", set
        # resources: {"CPU": 1, "GPU": 1, "custom": 5}
        resources: {}
        # Provider-specific config for this node type, e.g. instance type. By default
        # Ray will auto-configure unspecified fields such as SubnetId and KeyName.
        # For more documentation on available fields, see:
        # http://boto3.readthedocs.io/en/latest/reference/services/ec2.html#EC2.ServiceResource.create_instances
        node_config:
            InstanceType: m5.large
            ImageId: ami-0a2363a9cff180a64  # Deep Learning AMI (Ubuntu) Version 30
            # You can provision additional disk space with a conf as follows
            BlockDeviceMappings:
                - DeviceName: /dev/sda1
                  Ebs:
                      VolumeSize: 100
            # Additional options in the boto docs.
    ray.worker.default:
        # The minimum number of worker nodes of this type to launch.
        # This number should be >= 0.
        min_workers: 2
        # The maximum number of worker nodes of this type to launch.
        # This takes precedence over min_workers.
        max_workers: 2
        # The node type's CPU and GPU resources are auto-detected based on AWS instance type.
        # If desired, you can override the autodetected CPU and GPU resources advertised to the autoscaler.
        # You can also set custom resources.
        # For example, to mark a node type as having 1 CPU, 1 GPU, and 5 units of a resource called "custom", set
        # resources: {"CPU": 1, "GPU": 1, "custom": 5}
        resources: {}
        # Provider-specific config for this node type, e.g. instance type. By default
        # Ray will auto-configure unspecified fields such as SubnetId and KeyName.
        # For more documentation on available fields, see:
        # http://boto3.readthedocs.io/en/latest/reference/services/ec2.html#EC2.ServiceResource.create_instances
        node_config:
            InstanceType: m5.large
            ImageId: ami-0a2363a9cff180a64  # Deep Learning AMI (Ubuntu) Version 30
            # Run workers on spot by default. Comment this out to use on-demand.
            # NOTE: If relying on spot instances, it is best to specify multiple different instance
            # types to avoid interruption when one instance type is experiencing heightened demand.
            # Demand information can be found at https://aws.amazon.com/ec2/spot/instance-advisor/
            InstanceMarketOptions:
                MarketType: spot
                # Additional options can be found in the boto docs, e.g.
                #   SpotOptions:
                #       MaxPrice: MAX_HOURLY_PRICE
            # Additional options in the boto docs.

# Specify the node type of the head node (as configured above).
head_node_type: ray.head.default

# Files or directories to copy to the head and worker nodes. The format is a
# dictionary from REMOTE_PATH: LOCAL_PATH, e.g.
file_mounts: {
# "/path1/on/remote/machine": "/path1/on/local/machine",
# "/path2/on/remote/machine": "/path2/on/local/machine",
}

# Files or directories to copy from the head node to the worker nodes. The format is a
# list of paths. The same path on the head node will be copied to the worker node.
# This behavior is a subset of the file_mounts behavior. In the vast majority of cases
# you should just use file_mounts. Only use this if you know what you're doing!
cluster_synced_files: []

# Whether changes to directories in file_mounts or cluster_synced_files in the head node
# should sync to the worker node continuously
file_mounts_sync_continuously: False

# Patterns for files to exclude when running rsync up or rsync down
rsync_exclude:
- "**/.git"
- "**/.git/**"

# Pattern files to use for filtering out files when running rsync up or rsync down. The file is searched for
# in the source directory and recursively through all subdirectories. For example, if .gitignore is provided
# as a value, the behavior will match git's behavior for finding and using .gitignore files.
rsync_filter:
- ".gitignore"

# List of commands that will be run before `setup_commands`. If docker is
# enabled, these commands will run outside the container and before docker
# is setup.
initialization_commands: []

# List of shell commands to run to set up nodes.
setup_commands:
    - pip install ray[rllib]
    # Note: if you're developing Ray, you probably want to create a Docker image that
    # has your Ray repo pre-cloned. Then, you can replace the pip installs
    # below with a git checkout <your_sha> (and possibly a recompile).
    # To run the nightly version of ray (as opposed to the latest), either use a rayproject docker image
    # that has the "nightly" (e.g. "rayproject/ray-ml:nightly-gpu") or uncomment the following line:
    # - pip install -U "ray[default] @ https://s3-us-west-2.amazonaws.com/ray-wheels/latest/ray-2.0.0.dev0-cp37-cp37m-manylinux2014_x86_64.whl"

# Custom commands that will be run on the head node after common setup.
head_setup_commands: []

# Custom commands that will be run on worker nodes after common setup.
worker_setup_commands: []

# Command to start ray on the head node. You don't need to change this.
head_start_ray_commands:
    - ray stop
    - ray start --head --port=6379 --object-manager-port=8076 --autoscaling-config=~/ray_bootstrap_config.yaml

# Command to start ray on worker nodes. You don't need to change this.
worker_start_ray_commands:
    - ray stop
    - ray start --address=$RAY_HEAD_IP:6379 --object-manager-port=8076

head_node: {}
worker_nodes: {}
2 changes: 1 addition & 1 deletion rlmolecule/alphazero/tensorflow/tf_keras_policy.py
@@ -122,7 +122,7 @@ def align_input_names(keras_inputs, mask_dict):

 class TimeCsvLogger(tf.keras.callbacks.CSVLogger):
     def on_epoch_end(self, epoch, logs=None):
-        logs = logs or {}
+        logs = logs.copy() or {}
         logs['time'] = datetime.now().strftime('%Y-%m-%d %H:%M:%S.%f')
         super(TimeCsvLogger, self).on_epoch_end(epoch, logs)

2 changes: 1 addition & 1 deletion tests/alphazero/test_molecule_alphazero.py
@@ -139,7 +139,7 @@ def test_train_policy_model(self, game):

         weights_before = problem.batched_policy_model.get_weights()[1]

-        history = problem.train_policy_model(steps_per_epoch=10, epochs=1)
+        history = problem.train_policy_model(steps_per_epoch=10, epochs=1, verbose=2)
         assert np.isfinite(history.history['loss'][0])
         assert 'policy.01.index' in os.listdir(problem.policy_checkpoint_dir)

