
Add MLflow log_model option #1544

Open · wants to merge 21 commits into base: main

Conversation

@nancyhung (Contributor) commented Sep 24, 2024

Context

In order to support customers with sensitive storage network configurations, we have to use the log_model API. This causes duplicate artifact uploads, which is inefficient, so we will roll it out only to the customers who require it.

This PR contains the first of 2 changes:

  1. When saving the final HF checkpoint, use log_model instead of uploading to MLflow artifacts.
  • Functionally, a user can still find their HF checkpoint files in UC if they wish to download the model weights and serve them elsewhere.
  • Instead of calling save_model, register_model, and uploading to UC directly via the remote uploader downloader object, this change simplifies the control logic with the mlflow.log_model function. This function is also critical for secure training requirements, such as customer firewalls or private endpoints. Logging a model to MLflow runs the necessary steps to save and register a model for deployment.
  • This change only affects the logic for saving the final HF checkpoint. All other logic remains the same.
  2. In a follow-up PR, we'll modify the intermediate checkpointing logic to also use log_model without registering the model. That way, a user can still manually register their intermediate checkpoints for evaluation.
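The single-call flow described above can be sketched as follows. This is a hedged illustration only: the helper name `build_log_model_kwargs` is invented here, and the keyword names are assumptions taken from the PR diff snippets, not the final llmfoundry API.

```python
# Hedged sketch of the final-HF-checkpoint path: collect everything for one
# mlflow_logger.log_model call instead of separate save_model /
# register_model / remote-upload steps. Names are illustrative.
from typing import Any, Optional


def build_log_model_kwargs(
    transformers_model_path: str,
    using_peft: bool,
    task: str,
    registered_model_name: Optional[str],
) -> dict[str, Any]:
    """Build the kwargs for a single log_model call."""
    kwargs: dict[str, Any] = {
        'transformers_model': transformers_model_path,
        'flavor': 'peft' if using_peft else 'transformers',
        'artifact_path': 'model',
        'task': task,
    }
    if registered_model_name is not None:
        # Registering is what makes the final checkpoint deployable from UC.
        kwargs['registered_model_name'] = registered_model_name
    return kwargs


kwargs = build_log_model_kwargs(
    '/tmp/hf_checkpoint',
    using_peft=False,
    task='llm/v1/chat',
    registered_model_name='main.models.my_llama',
)
print(kwargs['flavor'])  # transformers
```

For intermediate checkpoints (part 2 of this work), the same sketch would simply pass `registered_model_name=None` so the model is logged but not registered.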

Testing

When incorporating this in MAPI, we should enable final_register_only so that the final checkpoint is uploaded only via the log_model logic, rather than also uploading a duplicate copy to MLflow artifacts. All tests were done in AWS staging.

Works for older models
[Databricks staging] Llama3 8b
Run: llama3-log-model-xusOti
Llama3 8b was successfully deployed here: https://e2-dogfood.staging.cloud.databricks.com/ml/endpoints/test-log-model?o=6051921418418893.

Works for newest models with extra security
[MCT] Llama3.2 1b
Run: llama3-log-model-O50ClW
Experiment: https://dbc-559ffd80-2bfc.cloud.databricks.com/ml/experiments/2854093459220376?viewStateShareKey=55a332dc80d7200b6a6301d8f0163155ce9aac54d21436c9d292f0745e0bff05
Endpoint: https://dbc-559ffd80-2bfc.cloud.databricks.com/ml/endpoints/test-llama321b?o=7395834863327820

…cleans up the code a little and prevents us from having forked logic in Composer to fetch by run_id
@nancyhung nancyhung requested a review from a team as a code owner September 24, 2024 01:03
@dakinggg (Collaborator) left a comment

What testing have you done? We need to make sure everything e2e shows up properly

llmfoundry/callbacks/hf_checkpointer.py (outdated, resolved)
@nancyhung nancyhung closed this Sep 25, 2024
@nancyhung nancyhung reopened this Oct 1, 2024
@nancyhung nancyhung changed the title Add MLflow log_model option [WIP] Add MLflow log_model option Oct 4, 2024
@nancyhung nancyhung changed the title [WIP] Add MLflow log_model option Add MLflow log_model option Oct 26, 2024
@dakinggg (Collaborator) left a comment

In the linked run, I don't see a registered model connected to your run. It should show up under registered models.
[Screenshot attached: 2024-10-25, 7:46 PM]

@@ -76,6 +76,11 @@ def _maybe_get_license_filename(

If the license file does not exist, returns None.
"""
# Early return if no local directory exists
Collaborator:

this should never happen right?

Collaborator:

assuming that is correct, please remove


Used mainly to log from a child process.

Inputs:
Collaborator:

Follow the docstring format used elsewhere (e.g. `Args:` and the matching indentation).
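A hedged example of the Google-style `Args:` format the reviewer is asking for; the function and parameter names here are illustrative, not the actual helper from the diff.

```python
import logging


def log_from_child(message: str, level: int) -> None:
    """Log a message from a child process.

    Used mainly to log from a child process.

    Args:
        message (str): The message to log.
        level (int): The Python logging level to emit at.
    """
    logging.getLogger('llmfoundry').log(level, message)
```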

model_uri=model_uri,
name=name,
await_creation_for=await_creation_for,
logging.getLogger('llmfoundry').setLevel(python_logging_level)
Collaborator:

it should still set composer log level too
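The reviewer's point, sketched: the child process should apply the forwarded level to both the llmfoundry and composer loggers, not just one. This is a minimal illustration, assuming `python_logging_level` is the integer level passed in from the parent process.

```python
import logging


def set_process_log_levels(python_logging_level: int) -> None:
    # Apply the forwarded level to both packages, per the review comment.
    logging.getLogger('llmfoundry').setLevel(python_logging_level)
    logging.getLogger('composer').setLevel(python_logging_level)


set_process_log_levels(logging.DEBUG)
```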

mlflow_logger.log_model(
transformers_model=transformers_model_path,
flavor=flavor,
artifact_path='model', # TODO: where should we define this parent dir name?
Collaborator:

fix?


@@ -171,7 +226,7 @@ class HuggingFaceCheckpointer(Callback):

def __init__(
self,
save_folder: str,
save_folder: Optional[str],
Collaborator:

this is probably a change for part 2 not for this pr?

)
with context_manager:
new_model_instance.save_pretrained(temp_save_dir)
original_tokenizer.save_pretrained(temp_save_dir)
Collaborator:

should move the next if statement out too (if new_model_instance....)

@@ -702,14 +751,6 @@ def tensor_hook(
True,
) if is_te_imported and state.precision == Precision.AMP_FP8 else contextlib.nullcontext(
)
with context_manager:
Collaborator:

not sure if an equivalent to this is necessary or not, can you determine that?

'transformers_model_path':
temp_save_dir,
'flavor':
'peft' if self.using_peft else 'transformers',
Collaborator:

we haven't added peft support for log_model in Composer yet. Will need to think about how to handle this.

'peft' if self.using_peft else 'transformers',
'python_logging_level':
logging.getLogger('llmfoundry').level,
'task':
Collaborator:

we should just be ** the self.mlflow_logging_config, rather than picking out pieces of it.
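The reviewer's suggestion, sketched with an illustrative config dict (the keys are assumed from the surrounding diff, and `log_model` here is a stand-in stub, not the real MLflow logger):

```python
from typing import Any


def log_model(**kwargs: Any) -> dict[str, Any]:
    # Stand-in for mlflow_logger.log_model; just echoes what it received.
    return kwargs


mlflow_logging_config: dict[str, Any] = {
    'flavor': 'transformers',
    'task': 'llm/v1/chat',
    'metadata': {'task': 'llm/v1/chat'},
}

# Instead of picking out individual keys...
picked = log_model(
    flavor=mlflow_logging_config['flavor'],
    task=mlflow_logging_config['task'],
)

# ...unpack the whole config so any new key flows through automatically.
unpacked = log_model(**mlflow_logging_config)
```

The unpacked form avoids silently dropping config keys (like `metadata` above) that the hand-picked form never forwards.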
