
[UX] Add infeasibility reasons to the exception message #3986

Open · wants to merge 2 commits into master

Conversation


@Conless Conless commented Sep 25, 2024

This PR fixes #3911 by summarizing the infeasibility reasons for each resource into a table and appending it to the end of the final exception message.

Here is a minimal example.

$ sky launch -c k8s --cloud kubernetes -i 10 -y
I 09-25 20:04:46 optimizer.py:719] == Optimizer ==
I 09-25 20:04:46 optimizer.py:742] Estimated cost: $0.0 / hour
I 09-25 20:04:46 optimizer.py:742] 
I 09-25 20:04:46 optimizer.py:867] Considered resources (1 node):
I 09-25 20:04:46 optimizer.py:937] ---------------------------------------------------------------------------------------------
I 09-25 20:04:46 optimizer.py:937]  CLOUD        INSTANCE    vCPUs   Mem(GB)   ACCELERATORS   REGION/ZONE   COST ($)   CHOSEN   
I 09-25 20:04:46 optimizer.py:937] ---------------------------------------------------------------------------------------------
I 09-25 20:04:46 optimizer.py:937]  Kubernetes   2CPU--2GB   2       2         -              kubernetes    0.00          ✔     
I 09-25 20:04:46 optimizer.py:937] ---------------------------------------------------------------------------------------------
I 09-25 20:04:46 optimizer.py:937] 
Running task on cluster k8s...
I 09-25 20:04:46 cloud_vm_ray_backend.py:4421] Creating a new cluster: 'k8s' [1x Kubernetes(2CPU--2GB)].
I 09-25 20:04:46 cloud_vm_ray_backend.py:4421] Tip: to reuse an existing cluster, specify --cluster (-c). Run `sky status` to see existing clusters.
W 09-25 20:04:48 cloud_vm_ray_backend.py:2000] sky.exceptions.NotSupportedError: The following features are not supported by Kubernetes:
W 09-25 20:04:48 cloud_vm_ray_backend.py:2000]  Feature  Reason                                     
W 09-25 20:04:48 cloud_vm_ray_backend.py:2000]  stop     Kubernetes does not support stopping VMs.  
W 09-25 20:04:48 cloud_vm_ray_backend.py:2026] 
W 09-25 20:04:48 cloud_vm_ray_backend.py:2026] Provision failed for 1x Kubernetes(2CPU--2GB) in kubernetes. Trying other locations (if any).

sky.exceptions.ResourcesUnavailableError: Failed to provision all possible launchable resources. Relax the task's resource requirements: 1x Kubernetes()
To keep retrying until the cluster is up, use the `--retry-until-up` flag.
The reasons for the infeasibility of each resource are summarized below. For detailed explanations, please refer to the log above.
Resource               Reason                                      
Kubernetes(2CPU--2GB)  Cloud does not support requested features.  
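For reference, a minimal sketch (illustrative only, not the PR's actual code; the hard-coded infeasibility_reasons dict stands in for the per-resource exceptions collected during failover) of how such a summary table can be built with prettytable and attached to the raised exception:

from prettytable import PrettyTable

from sky import exceptions

# Map each attempted resource to a short, human-readable reason. In the PR this
# mapping is derived from the exceptions collected during failover; it is
# hard-coded here purely for illustration.
infeasibility_reasons = {
    'Kubernetes(2CPU--2GB)': 'Cloud does not support requested features.',
}

table = PrettyTable(['Resource', 'Reason'])
for resource, reason in infeasibility_reasons.items():
    table.add_row([resource, reason])

message = ('Failed to provision all possible launchable resources.\n'
           'The reasons for the infeasibility of each resource are '
           'summarized below.\n' + table.get_string())
raise exceptions.ResourcesUnavailableError(message)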

Tested (run the relevant ones):

  • Code formatting: bash format.sh
  • Any manual or new tests for this PR (please specify below)
  • All smoke tests: pytest tests/test_smoke.py
  • Relevant individual smoke tests: pytest tests/test_smoke.py::test_fill_in_the_name
  • Backward compatibility tests: conda deactivate; bash -i tests/backward_compatibility_tests.sh

@Michaelvll Michaelvll self-requested a review September 27, 2024 07:32
Comment on lines +2070 to +2078
table = log_utils.create_table(['Resource', 'Reason'])
for (resource, exception) in resource_exceptions.items():
    table.add_row([
        resource,
        _EXCEPTION_SUMMARY_MESSAGE[exception.__class__]
    ])
raise exceptions.ResourcesUnavailableError(
    _RESOURCES_UNAVAILABLE_LOG + '\n' + table.get_string(),
    failover_history=failover_history)
Collaborator

Instead of parsing the exceptions here, should we directly rely on the failover_history to generate the reason table at the caller? Or, is there a reason we have to do it here?

It might be good to test with, e.g., sky launch --gpus H100:8 to see what the output looks like when failing over through many regions.

Conless (Contributor, Author)

> Instead of parsing the exceptions here, should we directly rely on the failover_history to generate the reason table at the caller? Or, is there a reason we have to do it here?

Yes, this is a design I've tried, but I don't think the failover_history gives enough information for users to identify the problem. For example, when I run sky launch --gpus H100:8, the (partial) failover history would be

[ResourcesUnavailableError('Failed to acquire resources in us-central1-a. Try changing resource requirements or use another zone.'), ResourcesUnavailableError('Failed to acquire resources in us-west1-a. Try changing resource requirements or use another zone.'),

As you can see, it only contains the region of each failed provision attempt and does not even include the cloud provider or resource information. So I think constructing the mapping from each resource to its exception here is more user-friendly.
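To make the contrast concrete, here is a rough illustration (hypothetical data shapes, not SkyPilot's actual internals) of what each source carries:

from sky import exceptions

# failover_history only records the per-zone failure messages; the cloud and
# instance type are not recoverable from them.
failover_history = [
    exceptions.ResourcesUnavailableError(
        'Failed to acquire resources in us-central1-a. Try changing resource '
        'requirements or use another zone.'),
    exceptions.ResourcesUnavailableError(
        'Failed to acquire resources in us-west1-a. Try changing resource '
        'requirements or use another zone.'),
]

# Keying the exceptions by the resource instead preserves the cloud, instance
# type, and accelerators, which is exactly what the summary table needs to show.
resource_exceptions = {
    "GCP(a3-highgpu-8g, {'H100': 8})": failover_history[0],
    "AWS(p5.48xlarge, {'H100': 8})": failover_history[1],
}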

> It might be good to test with, e.g., sky launch --gpus H100:8 to see what the output looks like when failing over through many regions.

Sure, here is the final output:

$ sky launch --gpus H100:8
sky.exceptions.ResourcesUnavailableError: Failed to provision all possible launchable resources. Relax the task's resource requirements: 1x <Cloud>({'H100': 8})
To keep retrying until the cluster is up, use the `--retry-until-up` flag.
The reasons for the infeasibility of each resource are summarized below. For detailed explanations, please refer to the log above.
Resource                         Reason                                                  
GCP(a3-highgpu-8g, {'H100': 8})  Requested resources cannot be satisfied on this cloud.  
GCP(a3-highgpu-8g, {'H100': 8})  Requested resources cannot be satisfied on this cloud.  
GCP(a3-highgpu-8g, {'H100': 8})  Requested resources cannot be satisfied on this cloud.  
GCP(a3-highgpu-8g, {'H100': 8})  Requested resources cannot be satisfied on this cloud.  
AWS(p5.48xlarge, {'H100': 8})    Requested resources cannot be satisfied on this cloud.  
AWS(p5.48xlarge, {'H100': 8})    Requested resources cannot be satisfied on this cloud.  
AWS(p5.48xlarge, {'H100': 8})    Requested resources cannot be satisfied on this cloud.  
GCP(a3-highgpu-8g, {'H100': 8})  Requested resources cannot be satisfied on this cloud.  
GCP(a3-highgpu-8g, {'H100': 8})  Requested resources cannot be satisfied on this cloud.  
GCP(a3-highgpu-8g, {'H100': 8})  Requested resources cannot be satisfied on this cloud.  
AWS(p5.48xlarge, {'H100': 8})    Requested resources cannot be satisfied on this cloud.  
GCP(a3-highgpu-8g, {'H100': 8})  Requested resources cannot be satisfied on this cloud.  
GCP(a3-highgpu-8g, {'H100': 8})  Requested resources cannot be satisfied on this cloud.  
GCP(a3-highgpu-8g, {'H100': 8})  Requested resources cannot be satisfied on this cloud.  
GCP(a3-highgpu-8g, {'H100': 8})  Requested resources cannot be satisfied on this cloud.  

It seems to work as expected.

A special case occurs when a resource has too many requirements, causing the 'Resource' column to become very long, which affects the display in the terminal.
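One possible mitigation (just a sketch, not part of this PR; it assumes the summary table is a plain prettytable.PrettyTable) would be to cap the column width so that long resource strings wrap instead of overflowing the terminal:

from prettytable import PrettyTable

table = PrettyTable(['Resource', 'Reason'])
# Cap every column at 40 characters; prettytable wraps longer cell contents
# onto multiple lines rather than widening the table past the terminal width.
table.max_width = 40
table.add_row([
    # A hypothetical resource string with many requirements, to force wrapping.
    "GCP(a3-highgpu-8g, {'H100': 8}, disk_size=512, ports=['8888'], spot=True)",
    'Requested resources cannot be satisfied on this cloud.',
])
print(table.get_string())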

Development
Successfully merging this pull request may close these issues:
  • [UX] Better logging for unsupported features (#3911)
2 participants