-
Notifications
You must be signed in to change notification settings - Fork 495
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Serve] Make controller regions/ choose from replica resources #4053
base: master
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for adding this feature @euclidgame ! This is awesome. The PR looks mostly good for me. Left some discussions ;)
sky/utils/controller_utils.py
Outdated
for cloud_name, regions in requested_clouds_with_region_zone.items() | ||
for region, zones in regions.items() for zone in zones | ||
for cloud in [clouds.CLOUD_REGISTRY.from_str(cloud_name)] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a little bit confusing. Lets do explicit for loop instead?
sky/utils/controller_utils.py
Outdated
requested_clouds_with_region_zone[cloud_name] = { | ||
'_allow_any_region': {'_allow_any_zone'} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it possible to left blank if enable any region? same apply for zone.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, it is possible, but it requires special handling to determine whether an empty set indicates that any region is allowed, which means we should only add new regions if it’s the first time. Using a placeholder makes it easier. Maybe I can use None
instead of the specific strings?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for adding this @euclidgame ! Looks good to me. Left some discussions ;)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for fixing this @euclidgame ! Left some discussions ;)
# 2. All resources has cloud specified. Some of them | ||
# could NOT host controllers. Return a set, only | ||
# containing those could host controllers. | ||
# 2. Some resources cannot host controllers. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we revert this?
# 3. Some resources does not have cloud specified. | ||
# Return the default resources. | ||
# 3. Some resources do not have cloud specified. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same, could we revert this?
@@ -138,3 +149,63 @@ def _could_host_controllers(cloud: sky.clouds.Cloud) -> bool: | |||
assert len(controller_resources) == 1 | |||
config = list(controller_resources)[0].to_yaml_config() | |||
assert config == default_controller_resources, config | |||
|
|||
# 4. All resources have clouds, regions, and zones specified. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lets add the expected output of this group as previous comments.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Previously, if the controller resources specified in the config yaml has a cloud, we directly returned it. however, after adding this, we might still want to respect the region & zones for replica resoruces? (e.g. config yaml only says cloud=aws, but replica resoruces further suggested some regions)
skypilot/sky/utils/controller_utils.py
Lines 508 to 509 in 9201def
if controller_resources_to_use.cloud is not None: | |
return {controller_resources_to_use} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I use controller_resources_to_use
as a filter after requested_clouds_with_region_zone
without the previous early return:
filtered_clouds = ({controller_cloud} if controller_cloud else
requested_clouds_with_region_zone.keys())
# Filter regions and zones and construct the result
result = set()
for cloud_name in filtered_clouds:
regions = requested_clouds_with_region_zone.get(cloud_name, {None: {None}})
# Filter regions if controller_resources_to_use.region is specified
filtered_regions = ({controller_region}
if controller_region else regions.keys())
for region in filtered_regions:
zones = regions.get(region, {None})
# Filter zones if controller_resources_to_use.zone is specified
filtered_zones = ({controller_zone} if controller_zone else zones)
# Create combinations of cloud, region, and zone
for zone in filtered_zones:
resource_copy = controller_resources_to_use.copy(
cloud=clouds.CLOUD_REGISTRY.from_str(cloud_name),
region=region,
zone=zone)
result.add(resource_copy)
Might be tedious :-), let me know your thoughts.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We currently only allows single resource for controller, which might simplify it a little bit. Pls check:
skypilot/sky/utils/controller_utils.py
Lines 487 to 497 in c6ae536
# TODO(tian): Support multiple resources for the controller. One blocker | |
# here is the semantic if controller resources use `ordered` and we want | |
# to override it with multiple cloud from task resources. | |
if len(controller_resources) != 1: | |
with ux_utils.print_exception_no_traceback(): | |
raise ValueError( | |
CONTROLLER_RESOURCES_NOT_VALID_MESSAGE.format( | |
controller_type=controller.value.controller_type, | |
err=f'Expected exactly one resource, got ' | |
f'{len(controller_resources)} resources: ' | |
f'{controller_resources}').capitalize()) |
Thanks for the suggestions @cblmemo ! All fixed, please review :-) |
Fixes #3364 .
Tested (run the relevant ones):
bash format.sh
pytest tests/test_smoke.py
pytest tests/test_smoke.py::test_fill_in_the_name
conda deactivate; bash -i tests/backward_compatibility_tests.sh