-
Notifications
You must be signed in to change notification settings - Fork 1
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Added GPU support for Google Cloud #2
Conversation
- check_machine_type_availability() verifies the machine is available in the zone - check_gpu_model_support() verifies that tha machine and gpu model are compatible - check_gpu_enabled() verifies that GPU_ENABLED=true if GPU_MODEL is populated
- Created src/infractl/deploy/gcp/main.py - Moved check_* functions to src/infractl/deploy/gcp/main.py - Made icl/jupyterhub module support 'intel' or 'nvidia'
- Check/validate fucntions from gke.sh moved to src/infractl/deploy/gcp/main.py - Built new mutli-stage image for GPU profile image which uses nvidia/cuda:12.2.2-base-ubuntu22.04 as the base and adds pbchekin/icl-jupyterhub:0.0.21 changes - Changed GKE_GPU_DRIVER_VERSION environment variable default to "LATEST" - Added outputs to terraform modules for visibility and ease of future debugging - Modified terraform/icl module to dynamically set selected GPU image with jupyterhub_gpu_profile_image
- Added shared_gpu variable added to terraform/gcp and terraform/gcp/icl-cluster - Created new conditional module in terraform/gcp/icl-cluster dependent on shared_gpu variable value - Modified pool names to reflect exclusive vs shared GPU modes - Added node_count and gpu_count variables to easily allow future addition of multi-node deployments - Changed var.jupyterhub_extra_resource_limits from map(string) to string - Removed default value for var.jupyterhub_extra_resource_limits
- Added default value for jupyterhub_extra_resource_limits - Fixed subprocess.run calls using shell - Removed unused $GPU_ENABLED parameter from gke.sh call to infractl.deploy.gcp.main - isort, black, and pyline changes
…to enhance flexibility in specifying the type of GPU
- Trailing lines added - Uneeded whitepspace trimmed - subprocess import in main.py changed and reordered - Unintentional ray downgrade reverted - Duplicate variable declaration removed from gke.sh - Added GKE_GPU_DRIVER_VERSION description to gke.sh help output - Removed print lines from subprocess.CalledProcessError exceptions
Co-authored-by: Pavel Chekin <[email protected]>
Co-authored-by: Pavel Chekin <[email protected]>
Co-authored-by: Vadim Musin <[email protected]>
@kwasd please take a look before merging. |
- Added firewall-rule-bastion-ports module - Added additional variables to /terraform/gcp/variables.tf for bastion-host - Added generate_bastion_key function to create public SSH key when CREATE_BASTION="true" - Added two new environment variables related to bastion creation - Added function check BASTION_SOURCE_RANGES exists and -neq "" if CREATE_BASTION=true
…nce installtion methods will vary across environments.
scripts/deploy/aws.sh
Outdated
@@ -91,6 +91,7 @@ function x1_terraform_args() { | |||
-var jupyterhub_extra_resource_limits="${JUPYTERHUB_EXTRA_RESOURCE_LIMITS}" | |||
-var gpu_enabled="${GPU_ENABLED}" | |||
-var gpu_type="${GPU_TYPE}" | |||
-var deployment_type="aws" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it may be better to introduce something like nvidia_gpu_operator_enabled
instead, if we need provider-specific behavior. This could be passed from each provider's top-level script. So it will keep modules provider-agnostic.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed, this variable has been changed to enable_nvidia_operator. Let me know if AWS deployments with and without GPU work with the new change.
…nvidia_operator" - Changed some ENV default values from string to boolean to better align with what TF is expecting - Fixed bastion_name variable - Updated terraform/icl/main.tf to use updated "enable_nvidia_operator" variable
I've checked the AWS part, LGTM! |
No description provided.