Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Serve] Make controller regions/ choose from replica resources #4053

Open
wants to merge 12 commits into
base: master
Choose a base branch
from

Conversation

euclidgame
Copy link

@euclidgame euclidgame commented Oct 9, 2024

Fixes #3364 .

Tested (run the relevant ones):

  • Code formatting: bash format.sh
  • Any manual or new tests for this PR (please specify below)
    • Unit test for the case where all resources are specified with clouds, regions and zones.
    • Unit test for the case where all resources are specified with clouds and regions, only part of them have zones.
    • Unit test for mixed cases: some resources have clouds, regions and zones; some don't have regions or zones.
  • All smoke tests: pytest tests/test_smoke.py
  • Relevant individual smoke tests: pytest tests/test_smoke.py::test_fill_in_the_name
  • Backward compatibility tests: conda deactivate; bash -i tests/backward_compatibility_tests.sh

@euclidgame euclidgame marked this pull request as ready for review October 9, 2024 20:44
Copy link
Collaborator

@cblmemo cblmemo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for adding this feature @euclidgame ! This is awesome. The PR looks mostly good for me. Left some discussions ;)

Comment on lines 525 to 527
for cloud_name, regions in requested_clouds_with_region_zone.items()
for region, zones in regions.items() for zone in zones
for cloud in [clouds.CLOUD_REGISTRY.from_str(cloud_name)]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a little bit confusing. Lets do explicit for loop instead?

Comment on lines 491 to 493
requested_clouds_with_region_zone[cloud_name] = {
'_allow_any_region': {'_allow_any_zone'}
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible to left blank if enable any region? same apply for zone.

Copy link
Author

@euclidgame euclidgame Oct 12, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it is possible, but it requires special handling to determine whether an empty set indicates that any region is allowed, which means we should only add new regions if it’s the first time. Using a placeholder makes it easier. Maybe I can use None instead of the specific strings?

sky/utils/controller_utils.py Show resolved Hide resolved
Copy link
Collaborator

@cblmemo cblmemo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for adding this @euclidgame ! Looks good to me. Left some discussions ;)

sky/utils/controller_utils.py Outdated Show resolved Hide resolved
tests/unit_tests/test_controller_utils.py Show resolved Hide resolved
tests/unit_tests/test_controller_utils.py Show resolved Hide resolved
Copy link
Collaborator

@cblmemo cblmemo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for fixing this @euclidgame ! Left some discussions ;)

Comment on lines 94 to 108
# 2. All resources has cloud specified. Some of them
# could NOT host controllers. Return a set, only
# containing those could host controllers.
# 2. Some resources cannot host controllers.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we revert this?

Comment on lines 130 to 142
# 3. Some resources does not have cloud specified.
# Return the default resources.
# 3. Some resources do not have cloud specified.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same, could we revert this?

@@ -138,3 +149,63 @@ def _could_host_controllers(cloud: sky.clouds.Cloud) -> bool:
assert len(controller_resources) == 1
config = list(controller_resources)[0].to_yaml_config()
assert config == default_controller_resources, config

# 4. All resources have clouds, regions, and zones specified.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lets add the expected output of this group as previous comments.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Previously, if the controller resources specified in the config yaml has a cloud, we directly returned it. however, after adding this, we might still want to respect the region & zones for replica resoruces? (e.g. config yaml only says cloud=aws, but replica resoruces further suggested some regions)

if controller_resources_to_use.cloud is not None:
return {controller_resources_to_use}

Copy link
Author

@euclidgame euclidgame Oct 20, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I use controller_resources_to_use as a filter after requested_clouds_with_region_zone without the previous early return:

    filtered_clouds = ({controller_cloud} if controller_cloud else
                       requested_clouds_with_region_zone.keys())

    # Filter regions and zones and construct the result
    result = set()
    for cloud_name in filtered_clouds:
        regions = requested_clouds_with_region_zone.get(cloud_name, {None: {None}})

        # Filter regions if controller_resources_to_use.region is specified
        filtered_regions = ({controller_region}
                            if controller_region else regions.keys())

        for region in filtered_regions:
            zones = regions.get(region, {None})

            # Filter zones if controller_resources_to_use.zone is specified
            filtered_zones = ({controller_zone} if controller_zone else zones)

            # Create combinations of cloud, region, and zone
            for zone in filtered_zones:
                resource_copy = controller_resources_to_use.copy(
                    cloud=clouds.CLOUD_REGISTRY.from_str(cloud_name),
                    region=region,
                    zone=zone)
                result.add(resource_copy)

Might be tedious :-), let me know your thoughts.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We currently only allows single resource for controller, which might simplify it a little bit. Pls check:

# TODO(tian): Support multiple resources for the controller. One blocker
# here is the semantic if controller resources use `ordered` and we want
# to override it with multiple cloud from task resources.
if len(controller_resources) != 1:
with ux_utils.print_exception_no_traceback():
raise ValueError(
CONTROLLER_RESOURCES_NOT_VALID_MESSAGE.format(
controller_type=controller.value.controller_type,
err=f'Expected exactly one resource, got '
f'{len(controller_resources)} resources: '
f'{controller_resources}').capitalize())

@euclidgame
Copy link
Author

Thanks for the suggestions @cblmemo ! All fixed, please review :-)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Core][Controller] Respect region/zone settings in controller resources when creating controller
2 participants