Added GPU support for Google Cloud #2

Merged
merged 24 commits into main on Jul 12, 2024

Conversation

@exolyr (Collaborator) commented Mar 29, 2024

No description provided.

exolyr and others added 19 commits February 14, 2024 18:38
- check_machine_type_availability() verifies the machine type is available in the zone
- check_gpu_model_support() verifies that the machine type and GPU model are compatible
- check_gpu_enabled() verifies that GPU_ENABLED=true if GPU_MODEL is populated (a sketch of these checks follows this commit list)
- Created src/infractl/deploy/gcp/main.py
- Moved check_* functions to src/infractl/deploy/gcp/main.py
- Made icl/jupyterhub module support 'intel' or 'nvidia'
- Check/validate functions from gke.sh moved to src/infractl/deploy/gcp/main.py
- Built a new multi-stage GPU profile image that uses nvidia/cuda:12.2.2-base-ubuntu22.04 as the base and adds the pbchekin/icl-jupyterhub:0.0.21 changes
- Changed GKE_GPU_DRIVER_VERSION environment variable default to "LATEST"
- Added outputs to terraform modules for visibility and ease of future debugging
- Modified terraform/icl module to dynamically set selected GPU image with jupyterhub_gpu_profile_image
- Added shared_gpu variable to terraform/gcp and terraform/gcp/icl-cluster
- Created new conditional module in terraform/gcp/icl-cluster dependent on shared_gpu variable value
- Modified pool names to reflect exclusive vs shared GPU modes
- Added node_count and gpu_count variables to easily allow future addition of multi-node deployments
- Changed var.jupyterhub_extra_resource_limits from map(string) to string
- Removed default value for var.jupyterhub_extra_resource_limits
- Added default value for jupyterhub_extra_resource_limits
- Fixed subprocess.run calls using shell
- Removed unused $GPU_ENABLED parameter from gke.sh call to infractl.deploy.gcp.main
- isort, black, and pylint changes
…to enhance flexibility in specifying the type of GPU
- Trailing lines added
- Unneeded whitespace trimmed
- subprocess import in main.py changed and reordered
- Unintentional ray downgrade reverted
- Duplicate variable declaration removed from gke.sh
- Added GKE_GPU_DRIVER_VERSION description to gke.sh help output
- Removed print lines from subprocess.CalledProcessError exceptions
Co-authored-by: Pavel Chekin <[email protected]>
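
The commits above describe three validation helpers moved into src/infractl/deploy/gcp/main.py and a switch away from shell-based subprocess.run calls. The snippet below is only a minimal sketch of what such checks could look like, not the code from this pull request: the gcloud invocation is a command known to exist, but the COMPATIBLE_GPUS table, the function signatures, and the error handling are assumptions made for illustration.

# Illustrative sketch only -- not the implementation in src/infractl/deploy/gcp/main.py.
import os
import subprocess

# Hypothetical compatibility table; real GPU support depends on machine family and zone.
COMPATIBLE_GPUS = {
    "n1-standard-8": {"nvidia-tesla-t4", "nvidia-tesla-v100"},
    "a2-highgpu-1g": {"nvidia-tesla-a100"},
}

def check_machine_type_availability(machine_type: str, zone: str) -> None:
    """Fail if the machine type is not available in the given zone."""
    # Argument list instead of shell=True, so values are not re-parsed by a shell.
    subprocess.run(
        ["gcloud", "compute", "machine-types", "describe", machine_type, "--zone", zone],
        check=True,
        capture_output=True,
    )

def check_gpu_model_support(machine_type: str, gpu_model: str) -> None:
    """Fail if the machine type and GPU model are not compatible."""
    if gpu_model not in COMPATIBLE_GPUS.get(machine_type, set()):
        raise ValueError(f"GPU model {gpu_model} is not supported on {machine_type}")

def check_gpu_enabled() -> None:
    """Fail if GPU_MODEL is populated but GPU_ENABLED is not set to true."""
    if os.environ.get("GPU_MODEL") and os.environ.get("GPU_ENABLED") != "true":
        raise ValueError("GPU_MODEL is set but GPU_ENABLED is not 'true'")

In this sketch a failed gcloud call surfaces as subprocess.CalledProcessError, which lines up with the later commit that removed print lines from those exception handlers.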
@pbchekin requested a review from kwasd on March 29, 2024 at 16:45
@pbchekin (Collaborator) commented:

@kwasd please take a look before merging.

@kwasd mentioned this pull request on Apr 10, 2024
@kwasd mentioned this pull request on May 14, 2024
- Added firewall-rule-bastion-ports module
- Added additional variables to /terraform/gcp/variables.tf for bastion-host
- Added generate_bastion_key function to create a public SSH key when CREATE_BASTION="true"
- Added two new environment variables related to bastion creation
- Added a function that checks BASTION_SOURCE_RANGES exists and is non-empty when CREATE_BASTION=true (sketched below)
…nce installation methods will vary across environments.
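
The bastion commits above describe a guard added to gke.sh that requires BASTION_SOURCE_RANGES whenever CREATE_BASTION=true. The real check is a shell function; the following is a small Python sketch of the equivalent logic, with only the two environment variable names taken from the commit messages and everything else assumed.

# Sketch of the bastion precondition; the actual check lives in gke.sh as a shell function.
import os

def check_bastion_source_ranges() -> None:
    """When CREATE_BASTION=true, BASTION_SOURCE_RANGES must be set and non-empty."""
    if os.environ.get("CREATE_BASTION") == "true":
        ranges = os.environ.get("BASTION_SOURCE_RANGES", "")
        if not ranges.strip():
            raise ValueError(
                "CREATE_BASTION=true but BASTION_SOURCE_RANGES is empty; "
                "set it to the CIDR ranges allowed to reach the bastion host"
            )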
@@ -91,6 +91,7 @@ function x1_terraform_args() {
-var jupyterhub_extra_resource_limits="${JUPYTERHUB_EXTRA_RESOURCE_LIMITS}"
-var gpu_enabled="${GPU_ENABLED}"
-var gpu_type="${GPU_TYPE}"
-var deployment_type="aws"
Review comment from a Collaborator:

I think it may be better to introduce something like nvidia_gpu_operator_enabled instead, if we need provider-specific behavior. This could be passed from each provider's top-level script. So it will keep modules provider-agnostic.

Reply from the author (exolyr, Collaborator):

Agreed, this variable has been changed to enable_nvidia_operator. Let me know if AWS deployments with and without GPU work with the new change.

…nvidia_operator"

- Changed some ENV default values from string to boolean to better align with what Terraform expects
- Fixed bastion_name variable
- Updated terraform/icl/main.tf to use updated "enable_nvidia_operator" variable
@kwasd (Collaborator) commented Jul 9, 2024

I've checked the AWS part, LGTM!

@exolyr merged commit a9ce9e9 into main on Jul 12, 2024
2 checks passed