[UX] Add infeasibility reasons to the exception message #3986
table = log_utils.create_table(['Resource', 'Reason'])
for (resource, exception) in resource_exceptions.items():
    table.add_row([
        resource,
        _EXCEPTION_SUMMARY_MESSAGE[exception.__class__]
    ])
raise exceptions.ResourcesUnavailableError(
    _RESOURCES_UNAVAILABLE_LOG + '\n' + table.get_string(),
    failover_history=failover_history)
Instead of parsing the exceptions here, should we rely directly on the failover_history to generate the reason table at the caller? Or is there a reason we have to do it here?
It might be good to test with, e.g., sky launch --gpus H100:8 to see what the output looks like when failing over through many regions.
Instead of parsing the exceptions here, should we directly rely on the failover_history to generate reason table at the caller? Or, is there a reason we have to do it here?
Yes, this was a design that I tried, but I don't think the failover_history gives users enough information to identify the problem. For example, when I run sky launch --gpus H100:8, the (partial) failover history is:
[ResourcesUnavailableError('Failed to acquire resources in us-central1-a. Try changing resource requirements or use another zone.'), ResourcesUnavailableError('Failed to acquire resources in us-west1-a. Try changing resource requirements or use another zone.'),
As you can see, it only contains the region or zone of each failed provision; it does not even include the cloud provider or the resource information. So I think constructing the mapping from each resource to its exception here is more user-friendly.
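A minimal, self-contained sketch of the mapping approach described above. It is not the code in this PR: the exception classes, `SUMMARY_BY_EXCEPTION`, `DEFAULT_SUMMARY`, and `summarize` are hypothetical stand-ins, and plain string formatting is used in place of `log_utils.create_table` so the example runs without SkyPilot installed.

```python
from typing import Dict, Type


class ResourcesUnavailableError(Exception):
    """Hypothetical stand-in for a provisioning failure."""


class NotSupportedError(Exception):
    """Hypothetical stand-in for an unsupported-feature failure."""


# Assumed mapping from exception class to a one-line, user-facing summary.
SUMMARY_BY_EXCEPTION: Dict[Type[Exception], str] = {
    ResourcesUnavailableError:
        'Requested resources cannot be satisfied on this cloud.',
    NotSupportedError:
        'Requested features are not supported on this cloud.',
}
# Fallback so an unmapped exception class does not raise KeyError.
DEFAULT_SUMMARY = 'Unknown failure; see the log above.'


def summarize(resource_exceptions: Dict[str, Exception]) -> str:
    """Render a two-column 'Resource / Reason' table as plain text."""
    rows = [('Resource', 'Reason')]
    for resource, exc in resource_exceptions.items():
        reason = SUMMARY_BY_EXCEPTION.get(type(exc), DEFAULT_SUMMARY)
        rows.append((resource, reason))
    # Pad the first column to the width of its longest entry.
    width = max(len(resource) for resource, _ in rows)
    return '\n'.join(
        f'{resource.ljust(width)}  {reason}' for resource, reason in rows)


if __name__ == '__main__':
    print(summarize({
        "GCP(a3-highgpu-8g, {'H100': 8})": ResourcesUnavailableError('...'),
        "AWS(p5.48xlarge, {'H100': 8})": ResourcesUnavailableError('...'),
    }))
```

Looking the summary up with `.get` and a default is a choice made for this sketch; indexing the mapping directly, as in the diff above, would raise `KeyError` for any exception class that has no entry.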
It might be good to test with, e.g. sky launch --gpus H100:8 to see how the output for failover through many regions look like.
Sure, here is the final output:
$ sky launch --gpus H100:8
sky.exceptions.ResourcesUnavailableError: Failed to provision all possible launchable resources. Relax the task's resource requirements: 1x <Cloud>({'H100': 8})
To keep retrying until the cluster is up, use the `--retry-until-up` flag.
The reasons for the infeasibility of each resource are summarized below. For detailed explanations, please refer to the log above.
Resource                         Reason
GCP(a3-highgpu-8g, {'H100': 8})  Requested resources cannot be satisfied on this cloud.
GCP(a3-highgpu-8g, {'H100': 8})  Requested resources cannot be satisfied on this cloud.
GCP(a3-highgpu-8g, {'H100': 8})  Requested resources cannot be satisfied on this cloud.
GCP(a3-highgpu-8g, {'H100': 8})  Requested resources cannot be satisfied on this cloud.
AWS(p5.48xlarge, {'H100': 8})    Requested resources cannot be satisfied on this cloud.
AWS(p5.48xlarge, {'H100': 8})    Requested resources cannot be satisfied on this cloud.
AWS(p5.48xlarge, {'H100': 8})    Requested resources cannot be satisfied on this cloud.
GCP(a3-highgpu-8g, {'H100': 8})  Requested resources cannot be satisfied on this cloud.
GCP(a3-highgpu-8g, {'H100': 8})  Requested resources cannot be satisfied on this cloud.
GCP(a3-highgpu-8g, {'H100': 8})  Requested resources cannot be satisfied on this cloud.
AWS(p5.48xlarge, {'H100': 8})    Requested resources cannot be satisfied on this cloud.
GCP(a3-highgpu-8g, {'H100': 8})  Requested resources cannot be satisfied on this cloud.
GCP(a3-highgpu-8g, {'H100': 8})  Requested resources cannot be satisfied on this cloud.
GCP(a3-highgpu-8g, {'H100': 8})  Requested resources cannot be satisfied on this cloud.
GCP(a3-highgpu-8g, {'H100': 8})  Requested resources cannot be satisfied on this cloud.
Seems that it works as expected.
A special case occurs when a resource has too many requirements: the 'Resource' column becomes very long, which hurts readability in the terminal.
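One possible way to handle the long 'Resource' column, sketched under assumptions: the helper names `truncate` and `default_resource_width` are hypothetical and are not part of this PR. The idea is to cap the resource string at a fixed width and mark the cut with an ellipsis.

```python
import shutil


def truncate(text: str, max_width: int = 40, placeholder: str = '...') -> str:
    """Cut `text` to at most `max_width` characters, ending in `placeholder`."""
    if max_width < len(placeholder):
        raise ValueError('max_width is smaller than the placeholder')
    if len(text) <= max_width:
        return text
    return text[:max_width - len(placeholder)] + placeholder


def default_resource_width() -> int:
    """Assumed policy: give the 'Resource' column about half the terminal."""
    columns = shutil.get_terminal_size(fallback=(80, 24)).columns
    return max(20, columns // 2)


if __name__ == '__main__':
    long_resource = ("GCP(a3-highgpu-8g, {'H100': 8}, disk_size=1024, "
                     "ports=['8080', '8888'])")
    print(truncate(long_resource, default_resource_width()))
```

Truncation hides part of the requirements, so the full resource string would still need to appear in the detailed log above the table.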
This PR fixes #3911 by summarizing the infeasibility reasons for each resource in a table and appending it to the end of the final exception message.
Here is a minimal example.
Tested (run the relevant ones):
bash format.sh
pytest tests/test_smoke.py
pytest tests/test_smoke.py::test_fill_in_the_name
conda deactivate; bash -i tests/backward_compatibility_tests.sh