Oracle Cloud Infrastructure Data Science Service supports distributed training with Jobs for the following frameworks: Dask, Horovod, TensorFlow Distributed, and PyTorch Distributed.
Complete the following key prerequisites before you run distributed training on Oracle Cloud Infrastructure Data Science Service.
- Configure your network - required for inter-node communication. (Note: we are working to provide managed networking for your distributed cluster communication, coming soon.)
- Create an object storage bucket - used for storing checkpoints, logs, and other artifacts during the training process.
- Set the policies - required for the distributed job to access OCI services.
- Configure your Auth Token - to use the OCI SDK on your local machine to create, run, and monitor the jobs.
- Install a desktop container management tool - our CLI requires a desktop container management tool to build, run, and push your container images.
- Install ads[opctl] - required for packaging the training script and launching OCI Data Science distributed jobs.
- Create a container registry repository - to store the container images that will be used during the distributed training.
- OCI = Oracle Cloud Infrastructure
- DT = Distributed Training
- ADS = Oracle Accelerated Data Science Library
- OCIR = Oracle Cloud Infrastructure Container Registry
You need to use a private subnet for distributed training and configure the ports in the VCN Security List to allow traffic for communication between nodes.
Note: we are working to provide managed networking for your distributed cluster communication, which will remove this configuration step in the future.
If you're working on a proof of concept, you can open all ingress/egress TCP ports in the subnet.
The following default ports are used by the corresponding frameworks:
- Dask:
  - Scheduler Port: 8786. More information here
  - Dashboard Port: 8787. More information here
  - Worker Ports: random by default. It is good to open a specific range of ports and then provide that range in the startup option. More information here
  - Nanny Process Ports: random by default. It is good to open a specific range of ports and then provide that range in the startup option. More information here
- PyTorch: by default, PyTorch uses port 29400.
- Horovod: you need to allow all traffic within the subnet.
- TensorFlow: Worker Port: allow traffic from all source ports to one worker port (default: 12345). If changed, provide this in the train.yaml config.
See also: Security Lists
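For illustration, opening the Dask scheduler and dashboard ports (8786-8787) with the OCI CLI could look like the sketch below. The security list OCID and CIDR are placeholders, and note that `--ingress-security-rules` replaces the list's existing ingress rules rather than appending to them:

```bash
# Hypothetical sketch: allow TCP 8786-8787 (Dask scheduler and dashboard)
# from within the subnet's CIDR block. OCID and CIDR are placeholders.
# Caution: this call REPLACES the security list's current ingress rules.
oci network security-list update \
  --security-list-id ocid1.securitylist.oc1..example \
  --ingress-security-rules '[{
    "protocol": "6",
    "source": "10.0.0.0/24",
    "tcpOptions": {"destinationPortRange": {"min": 8786, "max": 8787}}
  }]'
```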
Create an object storage bucket in your Oracle Cloud Infrastructure tenancy to be used for the distributed training.
Distributed training uses OCI Object Storage to store artifacts, outputs, checkpoints, etc. The bucket should be created before starting any distributed training. The manage objects policy provided later in this guide is needed for users and job runs to read and write files in the bucket you will create, and is required for job runs to synchronize generated artifacts.
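For example, the bucket could be created with the OCI CLI as sketched below; the bucket name and compartment OCID are placeholders:

```bash
# A minimal sketch; replace the bucket name and compartment OCID with your own.
oci os bucket create \
  --name distributed-training-artifacts \
  --compartment-id ocid1.compartment.oc1..example
```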
To use Distributed Training on OCI Data Science Service, your accounts and services require access to multiple resources, which are specified by the OCI policies shown below.
If you're just trying out Oracle Cloud Infrastructure Data Science Distributed Training in a proof-of-concept project, you may not need more than a few administrators with full access to everything. In that case, you can simply create any new users you need and add them to the Administrators group. The users will be able to do anything with any kind of resource, and you can create all your resources directly in the tenancy (the root compartment). You don't need to create any compartments yet, or any other policies beyond the Tenant Admin Policy, which automatically comes with your tenancy and can't be changed. Additionally, you have to create the following resources:
- Create a Dynamic Group in your cloud tenancy with the following matching rules (see the CLI sketch after this list):

    all { resource.type = 'datasciencenotebooksession' }
    all { resource.type = 'datasciencejobrun' }
    all { resource.type = 'datasciencemodeldeployment' }
    all { resource.type = 'datasciencepipelinerun' }

- Create a policy in your root compartment with the following statements:

    allow service datascience to use virtual-network-family in tenancy
    allow dynamic-group <your-dynamic-group-name> to manage data-science-family in tenancy
    allow dynamic-group <your-dynamic-group-name> to manage all-resources in tenancy

  Replace <your-dynamic-group-name> with the name of your dynamic group!

- Create any new user(s) you need and add them to your Administrators Group.
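As referenced in the first step, the dynamic group could also be created with the OCI CLI, combining the four rules with any{}; the group name and description below are placeholders:

```bash
# Hypothetical sketch; the group name and description are placeholders.
oci iam dynamic-group create \
  --name distributed-training-dg \
  --description "OCI Data Science resources for distributed training" \
  --matching-rule "any {all {resource.type = 'datasciencenotebooksession'}, all {resource.type = 'datasciencejobrun'}, all {resource.type = 'datasciencemodeldeployment'}, all {resource.type = 'datasciencepipelinerun'}}"
```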
If you're past the proof-of-concept phase and want to restrict access to your resources, first:
- Make sure you're familiar with the basic IAM components, and read through the example scenario: Overview of Identity and Access Management
- Think about how to organize your resources into compartments: Learn Best Practices for Setting Up Your Tenancy
- Learn the basics of how policies work: How Policies Work
- Check the OCI Data Science Policies Guidance
At a high level, the process is as follows:
- Create a Dynamic Group in your cloud tenancy with the following matching rules:

    all { resource.type = 'datasciencenotebooksession' }
    all { resource.type = 'datasciencejobrun' }
    all { resource.type = 'datasciencemodeldeployment' }
    all { resource.type = 'datasciencepipelinerun' }

- Create a User Group in your cloud tenancy. Only the users belonging to this group will have access to the service, as per the policies we will write in the next step.

- Create the policies in the compartment where you intend to use the OCI Data Science Service (see the CLI sketch after this list):

    Allow service datascience to manage virtual-network-family in compartment <your_compartment_name>
    Allow dynamic-group <your-dynamic-group-name> to read repos in compartment <your_compartment_name>
    Allow dynamic-group <your-dynamic-group-name> to manage data-science-family in compartment <your_compartment_name>
    Allow dynamic-group <your-dynamic-group-name> to manage log-groups in compartment <your_compartment_name>
    Allow dynamic-group <your-dynamic-group-name> to manage log-content in compartment <your_compartment_name>
    Allow dynamic-group <your-dynamic-group-name> to manage objects in compartment <your_compartment_name>
    Allow group <your-data-science-users-group> to manage repos in compartment <your_compartment_name>
    Allow group <your-data-science-users-group> to manage data-science-family in compartment <your_compartment_name>
    Allow group <your-data-science-users-group> to use virtual-network-family in compartment <your_compartment_name>
    Allow group <your-data-science-users-group> to manage log-groups in compartment <your_compartment_name>
    Allow group <your-data-science-users-group> to manage log-content in compartment <your_compartment_name>
    Allow group <your-data-science-users-group> to read metrics in compartment <your_compartment_name>
    Allow group <your-data-science-users-group> to manage objects in compartment <your_compartment_name>

  Replace <your-dynamic-group-name> with the name of your dynamic group, <your-data-science-users-group> with your user group name, and <your_compartment_name> with the name of the compartment where your distributed training should run.

- Add the users required to have access to the Data Science service to the group you created in step (2).
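As referenced in step (3), the policy could also be created with the OCI CLI; the compartment OCID and names below are placeholders, and the statement list is abbreviated to two of the statements above:

```bash
# Hypothetical sketch; OCID and names are placeholders, statements abbreviated.
oci iam policy create \
  --compartment-id ocid1.compartment.oc1..example \
  --name distributed-training-policies \
  --description "OCI Data Science distributed training policies" \
  --statements '["Allow service datascience to manage virtual-network-family in compartment my-compartment", "Allow dynamic-group distributed-training-dg to manage data-science-family in compartment my-compartment"]'
```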
You can restrict the permission to a specific container repository, for example:
Allow <group|dynamic-group> <group-name> to read repos in compartment <your_compartment_name> where all { target.repo.name=<your_repo_name> }
See also: Policies to Control Repository Access
Using the following policy, you can restrict access to a specific bucket only, for example:
Allow <group|dynamic-group> <group-name> to manage buckets in compartment <your_compartment_name> where all {target.bucket.name=<your_bucket_name>}
See also: Object Storage Policies
Configure your API Auth Token to be able to run and test your code locally and monitor the logs.
The OCI Auth Token is used by the OCI CLI and Python SDK. Follow the guidance from the online documentation to configure it: https://docs.oracle.com/en-us/iaas/Content/API/Concepts/apisigningkey.htm
At a high level, the instructions are:
- (1) Log in to your Oracle Cloud account
- (2) Select your account from the top right dropdown menu
- (3) Generate a new API Auth Key
- (4) Download the private key and store it in your $HOME/.oci folder
- (5) Copy the suggested configuration and store it in your $HOME/.oci/config file
- (6) Update $HOME/.oci/config with the suggested configuration from step (5) and point it to your private key
- (7) Test the SDK or CLI
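One quick way to perform step (7), assuming the default profile in $HOME/.oci/config:

```bash
# Smoke test: lists the OCI regions if the config and key are set up correctly.
oci iam region list
# Prints your tenancy's Object Storage namespace (also useful later for OCIR).
oci os ns get
```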
The ADS OPCTL CLI requires a desktop container management tool to build, run, and push your containers (for example, Docker, which is used in the examples below).
Install the ads[opctl] CLI, which is required to package (containerize) your distributed training script and launch OCI Data Science Distributed Training Jobs.
- For Linux and Windows Subsystem for Linux:

    curl https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -o Miniconda3-latest-Linux-x86_64.sh

- For macOS Intel:

    curl https://repo.anaconda.com/miniconda/Miniconda3-latest-MacOSX-x86_64.sh -o Miniconda3-latest-MacOSX-x86_64.sh

- For macOS Apple Silicon:

    curl https://repo.anaconda.com/miniconda/Miniconda3-latest-MacOSX-arm64.sh -o Miniconda3-latest-MacOSX-arm64.sh

- Run the installer that matches your host system:

    bash Miniconda3-latest-<Linux|MacOSX>-<x86_64|arm64>.sh

- You may need to restart your terminal or run source ~/.bashrc or ~/.zshrc to enable the conda command. Use conda -V to test that it was installed successfully.
- Create a new conda environment:

    conda create -n distributed-training python=3.8

- Activate it:

    conda activate distributed-training

- Install oracle-ads[opctl] >= 2.8.0 in the activated conda environment:

    python3 -m pip install "oracle-ads[opctl]>=2.8.0"

- Test that the CLI runs:

    ads opctl -h
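To double-check that the installed version meets the oracle-ads >= 2.8.0 requirement, you can also query the package directly (a small sanity check, not part of the official steps):

```bash
# Prints the installed oracle-ads version; expect 2.8.0 or newer.
python3 -c "import ads; print(ads.__version__)"
```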
OCI Data Science Distributed Training uses OCI Container Registry to store the container image.
You may need to docker login to the Oracle Cloud Container Registry (OCIR) from your local machine, if you haven't done so before, to be able to push your images. To log in, use your API Auth Token, which can be created under your Oracle Cloud Account -> Auth Token. You need to log in only once.

    docker login -u '<tenancy-namespace>/<username>' <region>.ocir.io

... where <tenancy-namespace> is the auto-generated Object Storage namespace string of your tenancy (as shown on the Tenancy Information page).
If your tenancy is federated with Oracle Identity Cloud Service, use the format <tenancy-namespace>/oracleidentitycloudservice/<username>.
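If you don't know your tenancy namespace, you can look it up with the OCI CLI. The namespace, username, and region below are illustrative placeholders:

```bash
# Prints the tenancy's Object Storage namespace (the <tenancy-namespace> value).
oci os ns get
# Hypothetical example for the Ashburn region; substitute your own values.
docker login -u 'mytenancynamespace/jdoe@example.com' iad.ocir.io
```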
Create a repository in your Container Registry in your Oracle Cloud tenancy, preferably in the same compartment where you will run the distributed training. For more information, follow the Creating a Repository Guide.
Take note of the name of the repository as well as the namespace; these will be used later.
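As a sketch, the repository can also be created with the OCI CLI; the compartment OCID and repository name below are placeholders:

```bash
# Hypothetical sketch; replace the compartment OCID and repository name.
oci artifacts container repository create \
  --compartment-id ocid1.compartment.oc1..example \
  --display-name distributed-training/my-image
```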
You are now all set to create, test and launch your distributed training workload. Refer to the specific framework guides to continue.