
[UX] Add infeasibility reasons to the exception message #3986

Open · wants to merge 2 commits into master

Conversation


@Conless Conless commented Sep 25, 2024

This PR fixes #3911 by summarizing the infeasibility reasons for each resource into a table and appending it to the end of the final exception message.

Here is a minimal example.

$ sky launch -c k8s --cloud kubernetes -i 10 -y
I 09-25 20:04:46 optimizer.py:719] == Optimizer ==
I 09-25 20:04:46 optimizer.py:742] Estimated cost: $0.0 / hour
I 09-25 20:04:46 optimizer.py:742] 
I 09-25 20:04:46 optimizer.py:867] Considered resources (1 node):
I 09-25 20:04:46 optimizer.py:937] ---------------------------------------------------------------------------------------------
I 09-25 20:04:46 optimizer.py:937]  CLOUD        INSTANCE    vCPUs   Mem(GB)   ACCELERATORS   REGION/ZONE   COST ($)   CHOSEN   
I 09-25 20:04:46 optimizer.py:937] ---------------------------------------------------------------------------------------------
I 09-25 20:04:46 optimizer.py:937]  Kubernetes   2CPU--2GB   2       2         -              kubernetes    0.00          ✔     
I 09-25 20:04:46 optimizer.py:937] ---------------------------------------------------------------------------------------------
I 09-25 20:04:46 optimizer.py:937] 
Running task on cluster k8s...
I 09-25 20:04:46 cloud_vm_ray_backend.py:4421] Creating a new cluster: 'k8s' [1x Kubernetes(2CPU--2GB)].
I 09-25 20:04:46 cloud_vm_ray_backend.py:4421] Tip: to reuse an existing cluster, specify --cluster (-c). Run `sky status` to see existing clusters.
W 09-25 20:04:48 cloud_vm_ray_backend.py:2000] sky.exceptions.NotSupportedError: The following features are not supported by Kubernetes:
W 09-25 20:04:48 cloud_vm_ray_backend.py:2000]  Feature  Reason                                     
W 09-25 20:04:48 cloud_vm_ray_backend.py:2000]  stop     Kubernetes does not support stopping VMs.  
W 09-25 20:04:48 cloud_vm_ray_backend.py:2026] 
W 09-25 20:04:48 cloud_vm_ray_backend.py:2026] Provision failed for 1x Kubernetes(2CPU--2GB) in kubernetes. Trying other locations (if any).

sky.exceptions.ResourcesUnavailableError: Failed to provision all possible launchable resources. Relax the task's resource requirements: 1x Kubernetes()
To keep retrying until the cluster is up, use the `--retry-until-up` flag.
The reasons for the infeasibility of each resource are summarized below. For detailed explanations, please refer to the log above.
Resource               Reason                                      
Kubernetes(2CPU--2GB)  Cloud does not support requested features.  
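For reference, a minimal sketch (illustrative only, not the PR's actual code; the hard-coded infeasibility_reasons dict stands in for the per-resource exceptions collected during failover) of how such a summary table can be built with prettytable and attached to the raised exception:

from prettytable import PrettyTable

from sky import exceptions

# Map each attempted resource to a short, human-readable reason. In the PR this
# mapping is derived from the exceptions collected during failover; it is
# hard-coded here purely for illustration.
infeasibility_reasons = {
    'Kubernetes(2CPU--2GB)': 'Cloud does not support requested features.',
}

table = PrettyTable(['Resource', 'Reason'])
for resource, reason in infeasibility_reasons.items():
    table.add_row([resource, reason])

message = ('Failed to provision all possible launchable resources.\n'
           'The reasons for the infeasibility of each resource are '
           'summarized below.\n' + table.get_string())
raise exceptions.ResourcesUnavailableError(message)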

Tested (run the relevant ones):

  • Code formatting: bash format.sh
  • Any manual or new tests for this PR (please specify below)
  • All smoke tests: pytest tests/test_smoke.py
  • Relevant individual smoke tests: pytest tests/test_smoke.py::test_fill_in_the_name
  • Backward compatibility tests: conda deactivate; bash -i tests/backward_compatibility_tests.sh

@Michaelvll Michaelvll self-requested a review September 27, 2024 07:32
Comment on lines +2070 to +2078
table = log_utils.create_table(['Resource', 'Reason'])
for (resource, exception) in resource_exceptions.items():
    table.add_row([
        resource,
        _EXCEPTION_SUMMARY_MESSAGE[exception.__class__]
    ])
raise exceptions.ResourcesUnavailableError(
    _RESOURCES_UNAVAILABLE_LOG + '\n' + table.get_string(),
    failover_history=failover_history)
Collaborator

Instead of parsing the exceptions here, should we directly rely on the failover_history to generate the reason table at the caller? Or, is there a reason we have to do it here?

It might be good to test with, e.g., sky launch --gpus H100:8 to see what the output looks like when failing over through many regions.

Conless (Contributor, Author)

> Instead of parsing the exceptions here, should we directly rely on the failover_history to generate the reason table at the caller? Or, is there a reason we have to do it here?

Yes, this is a design I've tried, but I don't think the failover_history gives enough information for users to identify the problem. For example, when I run sky launch --gpus H100:8, the (partial) failover history would be

[ResourcesUnavailableError('Failed to acquire resources in us-central1-a. Try changing resource requirements or use another zone.'), ResourcesUnavailableError('Failed to acquire resources in us-west1-a. Try changing resource requirements or use another zone.'),

As you can see, it only contains the region of each failed provision attempt and does not even include the cloud provider or resource information. So I think constructing the mapping from each resource to its exception here is more user-friendly.
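To make the contrast concrete, here is a rough illustration (hypothetical data shapes, not SkyPilot's actual internals) of what each source carries:

from sky import exceptions

# failover_history only records the per-zone failure messages; the cloud and
# instance type are not recoverable from them.
failover_history = [
    exceptions.ResourcesUnavailableError(
        'Failed to acquire resources in us-central1-a. Try changing resource '
        'requirements or use another zone.'),
    exceptions.ResourcesUnavailableError(
        'Failed to acquire resources in us-west1-a. Try changing resource '
        'requirements or use another zone.'),
]

# Keying the exceptions by the resource instead preserves the cloud, instance
# type, and accelerators, which is exactly what the summary table needs to show.
resource_exceptions = {
    "GCP(a3-highgpu-8g, {'H100': 8})": failover_history[0],
    "AWS(p5.48xlarge, {'H100': 8})": failover_history[1],
}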

> It might be good to test with, e.g., sky launch --gpus H100:8 to see what the output looks like when failing over through many regions.

Sure, here is the final output:

$ sky launch --gpus H100:8
sky.exceptions.ResourcesUnavailableError: Failed to provision all possible launchable resources. Relax the task's resource requirements: 1x <Cloud>({'H100': 8})
To keep retrying until the cluster is up, use the `--retry-until-up` flag.
The reasons for the infeasibility of each resource are summarized below. For detailed explanations, please refer to the log above.
Resource                         Reason                                                  
GCP(a3-highgpu-8g, {'H100': 8})  Requested resources cannot be satisfied on this cloud.  
GCP(a3-highgpu-8g, {'H100': 8})  Requested resources cannot be satisfied on this cloud.  
GCP(a3-highgpu-8g, {'H100': 8})  Requested resources cannot be satisfied on this cloud.  
GCP(a3-highgpu-8g, {'H100': 8})  Requested resources cannot be satisfied on this cloud.  
AWS(p5.48xlarge, {'H100': 8})    Requested resources cannot be satisfied on this cloud.  
AWS(p5.48xlarge, {'H100': 8})    Requested resources cannot be satisfied on this cloud.  
AWS(p5.48xlarge, {'H100': 8})    Requested resources cannot be satisfied on this cloud.  
GCP(a3-highgpu-8g, {'H100': 8})  Requested resources cannot be satisfied on this cloud.  
GCP(a3-highgpu-8g, {'H100': 8})  Requested resources cannot be satisfied on this cloud.  
GCP(a3-highgpu-8g, {'H100': 8})  Requested resources cannot be satisfied on this cloud.  
AWS(p5.48xlarge, {'H100': 8})    Requested resources cannot be satisfied on this cloud.  
GCP(a3-highgpu-8g, {'H100': 8})  Requested resources cannot be satisfied on this cloud.  
GCP(a3-highgpu-8g, {'H100': 8})  Requested resources cannot be satisfied on this cloud.  
GCP(a3-highgpu-8g, {'H100': 8})  Requested resources cannot be satisfied on this cloud.  
GCP(a3-highgpu-8g, {'H100': 8})  Requested resources cannot be satisfied on this cloud.  

It seems to work as expected.

A special case occurs when a resource has too many requirements, causing the 'Resource' column to become very long, which affects the display in the terminal.
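One possible mitigation (just a sketch, not part of this PR; it assumes the summary table is a plain prettytable.PrettyTable) would be to cap the column width so that long resource strings wrap instead of overflowing the terminal:

from prettytable import PrettyTable

table = PrettyTable(['Resource', 'Reason'])
# Cap every column at 40 characters; prettytable wraps longer cell contents
# onto multiple lines rather than widening the table past the terminal width.
table.max_width = 40
table.add_row([
    # A hypothetical resource string with many requirements, to force wrapping.
    "GCP(a3-highgpu-8g, {'H100': 8}, disk_size=512, ports=['8888'], spot=True)",
    'Requested resources cannot be satisfied on this cloud.',
])
print(table.get_string())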

Development
Successfully merging this pull request may close these issues:
  • [UX] Better logging for unsupported features (#3911)
2 participants