Added GPU support for Google Cloud #2

Merged
merged 24 commits into main on Jul 12, 2024

Conversation

@exolyr (Collaborator) commented Mar 29, 2024

No description provided.

exolyr and others added 19 commits February 14, 2024 18:38
- check_machine_type_availability() verifies the machine type is available in the zone
- check_gpu_model_support() verifies that the machine type and GPU model are compatible
- check_gpu_enabled() verifies that GPU_ENABLED=true if GPU_MODEL is populated (a sketch of these checks follows this commit list)
- Created src/infractl/deploy/gcp/main.py
- Moved check_* functions to src/infractl/deploy/gcp/main.py
- Made icl/jupyterhub module support 'intel' or 'nvidia'
- Check/validate functions from gke.sh moved to src/infractl/deploy/gcp/main.py
- Built a new multi-stage GPU profile image that uses nvidia/cuda:12.2.2-base-ubuntu22.04 as the base and adds the pbchekin/icl-jupyterhub:0.0.21 changes
- Changed GKE_GPU_DRIVER_VERSION environment variable default to "LATEST"
- Added outputs to terraform modules for visibility and ease of future debugging
- Modified terraform/icl module to dynamically set selected GPU image with jupyterhub_gpu_profile_image
- Added shared_gpu variable to terraform/gcp and terraform/gcp/icl-cluster
- Created new conditional module in terraform/gcp/icl-cluster dependent on shared_gpu variable value
- Modified pool names to reflect exclusive vs shared GPU modes
- Added node_count and gpu_count variables to easily allow future addition of multi-node deployments
- Changed var.jupyterhub_extra_resource_limits from map(string) to string
- Removed default value for var.jupyterhub_extra_resource_limits
- Added default value for jupyterhub_extra_resource_limits
- Fixed subprocess.run calls using shell
- Removed unused $GPU_ENABLED parameter from gke.sh call to infractl.deploy.gcp.main
- isort, black, and pylint changes
…to enhance flexibility in specifying the type of GPU
- Trailing lines added
- Unneeded whitespace trimmed
- subprocess import in main.py changed and reordered
- Unintentional ray downgrade reverted
- Duplicate variable declaration removed from gke.sh
- Added GKE_GPU_DRIVER_VERSION description to gke.sh help output
- Removed print lines from subprocess.CalledProcessError exceptions
Co-authored-by: Pavel Chekin <[email protected]>
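
The commits above describe three validation helpers moved into src/infractl/deploy/gcp/main.py and a switch away from shell-based subprocess.run calls. The snippet below is only a minimal sketch of what such checks could look like, not the code from this pull request: the gcloud invocation is a command known to exist, but the COMPATIBLE_GPUS table, the function signatures, and the error handling are assumptions made for illustration.

# Illustrative sketch only -- not the implementation in src/infractl/deploy/gcp/main.py.
import os
import subprocess

# Hypothetical compatibility table; real GPU support depends on machine family and zone.
COMPATIBLE_GPUS = {
    "n1-standard-8": {"nvidia-tesla-t4", "nvidia-tesla-v100"},
    "a2-highgpu-1g": {"nvidia-tesla-a100"},
}

def check_machine_type_availability(machine_type: str, zone: str) -> None:
    """Fail if the machine type is not available in the given zone."""
    # Argument list instead of shell=True, so values are not re-parsed by a shell.
    subprocess.run(
        ["gcloud", "compute", "machine-types", "describe", machine_type, "--zone", zone],
        check=True,
        capture_output=True,
    )

def check_gpu_model_support(machine_type: str, gpu_model: str) -> None:
    """Fail if the machine type and GPU model are not compatible."""
    if gpu_model not in COMPATIBLE_GPUS.get(machine_type, set()):
        raise ValueError(f"GPU model {gpu_model} is not supported on {machine_type}")

def check_gpu_enabled() -> None:
    """Fail if GPU_MODEL is populated but GPU_ENABLED is not set to true."""
    if os.environ.get("GPU_MODEL") and os.environ.get("GPU_ENABLED") != "true":
        raise ValueError("GPU_MODEL is set but GPU_ENABLED is not 'true'")

In this sketch a failed gcloud call surfaces as subprocess.CalledProcessError, which lines up with the later commit that removed print lines from those exception handlers.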
@pbchekin requested a review from kwasd on March 29, 2024 at 16:45
@pbchekin (Collaborator) commented:

@kwasd please take a look before merging.

@kwasd mentioned this pull request on Apr 10, 2024
@kwasd mentioned this pull request on May 14, 2024
- Added firewall-rule-bastion-ports module
- Added additional variables to /terraform/gcp/variables.tf for bastion-host
- Added generate_bastion_key function to create a public SSH key when CREATE_BASTION="true"
- Added two new environment variables related to bastion creation
- Added a function that checks BASTION_SOURCE_RANGES exists and is non-empty when CREATE_BASTION=true (sketched below)
…nce installation methods will vary across environments.
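
The bastion commits above describe a guard added to gke.sh that requires BASTION_SOURCE_RANGES whenever CREATE_BASTION=true. The real check is a shell function; the following is a small Python sketch of the equivalent logic, with only the two environment variable names taken from the commit messages and everything else assumed.

# Sketch of the bastion precondition; the actual check lives in gke.sh as a shell function.
import os

def check_bastion_source_ranges() -> None:
    """When CREATE_BASTION=true, BASTION_SOURCE_RANGES must be set and non-empty."""
    if os.environ.get("CREATE_BASTION") == "true":
        ranges = os.environ.get("BASTION_SOURCE_RANGES", "")
        if not ranges.strip():
            raise ValueError(
                "CREATE_BASTION=true but BASTION_SOURCE_RANGES is empty; "
                "set it to the CIDR ranges allowed to reach the bastion host"
            )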
@@ -91,6 +91,7 @@ function x1_terraform_args() {
-var jupyterhub_extra_resource_limits="${JUPYTERHUB_EXTRA_RESOURCE_LIMITS}"
-var gpu_enabled="${GPU_ENABLED}"
-var gpu_type="${GPU_TYPE}"
-var deployment_type="aws"
Review comment from a Collaborator:

I think it may be better to introduce something like nvidia_gpu_operator_enabled instead, if we need provider-specific behavior. This could be passed from each provider's top-level script. So it will keep modules provider-agnostic.

Reply from the author (exolyr, Collaborator):

Agreed, this variable has been changed to enable_nvidia_operator. Let me know if AWS deployments with and without GPU work with the new change.

…nvidia_operator"

- Changed some ENV default values from string to boolean to better align with what Terraform expects
- Fixed bastion_name variable
- Updated terraform/icl/main.tf to use updated "enable_nvidia_operator" variable
@kwasd (Collaborator) commented Jul 9, 2024

I've checked the AWS part, LGTM!

@exolyr merged commit a9ce9e9 into main on Jul 12, 2024
2 checks passed