Added GPU support for Google Cloud #2

Merged: 24 commits, Jul 12, 2024

Commits on Feb 14, 2024

  1. 77b44eb
  2. 2ae143e
  3. 5b0286b
  4. 24f44e0
  5. 930c1e7
  6. Updated gcp icl-cluster module

    exolyr committed Feb 14, 2024
    d74306f

Commits on Mar 6, 2024

  1. Additional functions added

    - check_machine_type_availability() verifies the machine is available in the zone
    - check_gpu_model_support() verifies that the machine type and GPU model are compatible
    - check_gpu_enabled() verifies that GPU_ENABLED=true if GPU_MODEL is populated
    exolyr committed Mar 6, 2024
    f778776
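
The two GPU checks listed in this commit can be sketched as follows. This is a minimal illustration in Python (the language these checks were moved to in a later commit), not the code from the PR. The compatibility table `SUPPORTED_GPUS`, the function signatures, and the error messages are assumptions made for the example.

```python
# Hypothetical compatibility table for illustration only; the pairs the PR
# actually validates are not listed in the commit message.
SUPPORTED_GPUS = {
    "a2-highgpu-1g": {"nvidia-tesla-a100"},
    "g2-standard-4": {"nvidia-l4"},
}


def check_gpu_enabled(gpu_enabled: str, gpu_model: str) -> None:
    """Raise if GPU_MODEL is populated while GPU_ENABLED is not "true"."""
    if gpu_model and gpu_enabled.strip().lower() != "true":
        raise ValueError("GPU_MODEL is set, so GPU_ENABLED must be true")


def check_gpu_model_support(machine_type: str, gpu_model: str) -> None:
    """Raise if the GPU model is not listed for the given machine type."""
    allowed = SUPPORTED_GPUS.get(machine_type, set())
    if gpu_model not in allowed:
        raise ValueError(
            f"GPU model {gpu_model!r} is not supported on {machine_type!r}"
        )
```

Raising an exception rather than returning a flag lets the caller stop the deployment at the first invalid combination.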

Commits on Mar 8, 2024

  1. Moved some gke.sh functions to Python

    - Created src/infractl/deploy/gcp/main.py
    - Moved check_* functions to src/infractl/deploy/gcp/main.py
    - Made icl/jupyterhub module support 'intel' or 'nvidia'
    exolyr committed Mar 8, 2024
    7aad5b2
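
As a rough idea of what a check_* function looks like once it lives in Python, here is a sketch of a machine type availability check. It assumes the check shells out to `gcloud compute machine-types describe`; the PR may use a different command or API. The `run` parameter is a hypothetical injection point added so the function can be exercised without gcloud installed.

```python
import subprocess
from typing import Callable, Sequence


def _run(cmd: Sequence[str]) -> subprocess.CompletedProcess:
    # Argument list, no shell; a non-zero exit is inspected by the caller.
    return subprocess.run(cmd, capture_output=True, text=True, check=False)


def check_machine_type_availability(
    machine_type: str,
    zone: str,
    run: Callable[[Sequence[str]], subprocess.CompletedProcess] = _run,
) -> bool:
    """Return True if gcloud can describe the machine type in the zone."""
    cmd = [
        "gcloud", "compute", "machine-types", "describe", machine_type,
        "--zone", zone,
        "--format", "value(name)",
    ]
    result = run(cmd)
    return result.returncode == 0 and result.stdout.strip() == machine_type
```

In a test, `run` can be replaced by a function that returns a prepared `subprocess.CompletedProcess`.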

Commits on Mar 12, 2024

  1. Python migration and new nvidia base image

    - Check/validate functions from gke.sh moved to src/infractl/deploy/gcp/main.py
    - Built a new multi-stage GPU profile image that uses nvidia/cuda:12.2.2-base-ubuntu22.04 as the base and adds the pbchekin/icl-jupyterhub:0.0.21 changes
    - Changed GKE_GPU_DRIVER_VERSION environment variable default to "LATEST"
    - Added outputs to terraform modules for visibility and ease of future debugging
    - Modified terraform/icl module to dynamically set selected GPU image with jupyterhub_gpu_profile_image
    exolyr committed Mar 12, 2024
    05a479a

Commits on Mar 14, 2024

  1. Enable shared gpu and extra_resource_limits fix

    - Added shared_gpu variable to terraform/gcp and terraform/gcp/icl-cluster
    - Created new conditional module in terraform/gcp/icl-cluster dependent on shared_gpu variable value
    - Modified pool names to reflect exclusive vs shared GPU modes
    - Added node_count and gpu_count variables to easily allow future addition of multi-node deployments
    - Changed var.jupyterhub_extra_resource_limits from map(string) to string
    - Removed default value for var.jupyterhub_extra_resource_limits
    exolyr committed Mar 14, 2024
    da15670

Commits on Mar 20, 2024

  1. Bug Fixes and Linting

    - Added default value for jupyterhub_extra_resource_limits
    - Fixed subprocess.run calls using shell
    - Removed unused $GPU_ENABLED parameter from gke.sh call to infractl.deploy.gcp.main
    - isort, black, and pylint changes
    exolyr committed Mar 20, 2024
    70d6ab2
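
The commit message does not say how the subprocess.run calls were fixed. One common fix, sketched here as an assumption, is to pass the command as an argument list instead of a shell string, so that values containing shell metacharacters are passed through literally. The helper name `run_command` is hypothetical.

```python
import subprocess
import sys
from typing import Sequence


def run_command(args: Sequence[str]) -> str:
    """Run a command without a shell and return its stripped stdout.

    check=True raises subprocess.CalledProcessError on a non-zero exit.
    """
    result = subprocess.run(args, capture_output=True, text=True, check=True)
    return result.stdout.strip()


if __name__ == "__main__":
    # The second element of the value is not executed, because no shell
    # ever parses it; the child process receives it as a single argument.
    value = "us-central1-a; echo injected"
    print(run_command([sys.executable, "-c", "import sys; print(sys.argv[1])", value]))
```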

Commits on Mar 21, 2024

  1. 70dfbf0
  2. Updated the conditional logic in the xpumanager module instantiation to enhance flexibility in specifying the type of GPU

    exolyr committed Mar 21, 2024
    ffc9d31
  3. 8c5450f
  4. Formatting and review suggestions

    - Trailing lines added
    - Unneeded whitespace trimmed
    - subprocess import in main.py changed and reordered
    - Unintentional ray downgrade reverted
    - Duplicate variable declaration removed from gke.sh
    - Added GKE_GPU_DRIVER_VERSION description to gke.sh help output
    - Removed print lines from subprocess.CalledProcessError exceptions
    exolyr committed Mar 21, 2024
    0de90d5
  5. Update scripts/deploy/gke.sh

    Co-authored-by: Pavel Chekin
    exolyr and pbchekin authored Mar 21, 2024
    76847c0
  6. Update terraform/gcp/modules/icl-cluster/main.tf

    Co-authored-by: Pavel Chekin
    exolyr and pbchekin authored Mar 21, 2024
    553f533

Commits on Mar 22, 2024

  1. 9346369

Commits on Mar 26, 2024

  1. Apply suggestions from code review

    Co-authored-by: Vadim Musin
    exolyr and kwasd authored Mar 26, 2024
    98c3f9f

Commits on Jun 11, 2024

  1. Added bastion-host terraform module

    - Added firewall-rule-bastion-ports module
    - Added additional variables to /terraform/gcp/variables.tf for bastion-host
    - Added generate_bastion_key function to create public SSH key when CREATE_BASTION="true"
    - Added two new environment variables related to bastion creation
    - Added function to check that BASTION_SOURCE_RANGES is set and not "" if CREATE_BASTION=true
    exolyr committed Jun 11, 2024
    d034480
  2. 6c0e8a5
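
The BASTION_SOURCE_RANGES check described in commit d034480 can be sketched as below. The PR adds this check to the deployment script; the version here is a Python illustration only. It assumes the variable holds a comma-separated list of CIDR ranges, and the CIDR parsing step is an addition for the example that the commit message does not mention.

```python
import ipaddress
from typing import List


def check_bastion_source_ranges(create_bastion: str, source_ranges: str) -> List[str]:
    """Validate BASTION_SOURCE_RANGES when CREATE_BASTION is "true".

    Returns the parsed ranges, or an empty list when no bastion is requested.
    """
    if create_bastion.strip().lower() != "true":
        return []
    # Assumption: ranges are comma-separated, e.g. "10.0.0.0/8,192.168.1.0/24".
    ranges = [r.strip() for r in source_ranges.split(",") if r.strip()]
    if not ranges:
        raise ValueError(
            "BASTION_SOURCE_RANGES must be set and non-empty when CREATE_BASTION=true"
        )
    for cidr in ranges:
        # ip_network raises ValueError for anything that is not a valid network.
        ipaddress.ip_network(cidr, strict=False)
    return ranges
```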

Commits on Jun 12, 2024

  1. afc6b37
  2. Added deployment_type variable to control execution of GPU modules since installation methods will vary across environments.

    exolyr committed Jun 12, 2024
    4bd7883

Commits on Jun 27, 2024

  1. Changed "deployment_type" variable to more generic boolean "enable_nvidia_operator"

    
    - Changed some ENV default values from string to boolean to better align with what TF is expecting
    - Fixed bastion_name variable
    - Updated terraform/icl/main.tf to use updated "enable_nvidia_operator" variable
    exolyr committed Jun 27, 2024
    39cb829
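
To illustrate why boolean-style defaults line up better with Terraform, here is a small sketch that turns an environment value into a strict boolean and renders it as a Terraform `-var` argument. The environment variable name `ENABLE_NVIDIA_OPERATOR` and both helper functions are assumptions for the example; the PR may pass variables to Terraform differently.

```python
import os
from typing import Mapping


def env_bool(name: str, default: bool = False,
             environ: Mapping[str, str] = os.environ) -> bool:
    """Read an environment variable as a strict boolean ("true" or "false")."""
    raw = environ.get(name)
    if raw is None:
        return default
    value = raw.strip().lower()
    if value == "true":
        return True
    if value == "false":
        return False
    raise ValueError(f"{name} must be 'true' or 'false', got {raw!r}")


def tf_bool_var(name: str, value: bool) -> str:
    """Render a boolean as a Terraform -var argument using a lowercase literal."""
    return f"-var={name}={'true' if value else 'false'}"


if __name__ == "__main__":
    enabled = env_bool("ENABLE_NVIDIA_OPERATOR", default=False)
    print(tf_bool_var("enable_nvidia_operator", enabled))
```

Rejecting anything other than "true" or "false" surfaces a misconfigured value before Terraform runs, rather than as a type error during planning.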