Expose packed: False, set log_peak_memory_stats: True, set compile: False #1872

Merged
merged 10 commits into pytorch:main on Oct 26, 2024

Conversation

krammnic
Contributor

Context

What is the purpose of this PR? Is it to

  • add a new feature
  • fix a bug
  • update tests and/or documentation
  • other (Clean up)

Please link to any issues this PR addresses.

Changelog

What are the changes made in this PR?

Test plan

Please make sure to do each of the following if applicable to your PR. If you're unsure about any one of these just ask and we will happily help. We also have a contributing page for some guidance on contributing.

  • run pre-commit hooks and linters (make sure you've first installed via pre-commit install)
  • add unit tests for any new functionality
  • update docstrings for any new or updated methods or classes
  • run unit tests via pytest tests
  • run recipe tests via pytest tests -m integration_test
  • manually run any new or modified recipes with sufficient proof of correctness
  • include relevant commands and any other artifacts in this summary (pastes of loss curves, eval results, etc.)

UX

If your function changed a public API, please add a dummy example of what the user experience will look like when calling it.
Here is a docstring example
and a tutorial example

  • I did not change any public API
  • I have added an example to docs or docstrings


pytorch-bot bot commented Oct 20, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/torchtune/1872

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit b0b4b14 with merge base 3ca0d30:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot added the CLA Signed label Oct 20, 2024
Contributor

@joecummings left a comment

two nits: overall, big fan of this UX improvement

@@ -45,7 +45,9 @@ resume_from_checkpoint: False

# Dataset
dataset:
packed: False # Set to true for great speed ups
Contributor

Huge nit: can we move this below the _component_ declaration?

That way it reads more as an option for the specific builder.
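
For illustration, a hypothetical snippet with that ordering (the builder name here is just an example, not necessarily the one this config uses) would read:

dataset:
  _component_: torchtune.datasets.alpaca_dataset
  packed: False # Set to true for great speed ups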

Contributor Author

Yeah, I thought like this at first too, but I followed the original declaration from the issue:

dataset:
packed=False # Set to true for great speed ups

Will be fixed

@@ -57,7 +58,7 @@ loss:
_component_: torchtune.modules.loss.CEWithChunkedOutputLoss
max_steps_per_epoch: null
gradient_accumulation_steps: 1

compile: False
Contributor

Should we be explicit that this will torch.compile the model and loss? compile seems like a vague name

Contributor

@felipemello1 Oct 21, 2024

agreed. I think we can add a comment, similar to packed

compile=False # pytorch compile, set to true for perf/memory improvement

wdyt?

Contributor Author

Add some comment there?

Contributor Author

Ok, will add

@felipemello1
Contributor

@SalmanMohammadi , do we support compile and packed in PPO? If not, maybe we should not add those to these configs.

@SalmanMohammadi
Collaborator

@SalmanMohammadi , do we support compile and packed in PPO? If not, maybe we should not add those to these configs.

Compile... yes, kind of? It's going to be overhauled soon to support it properly; it's very out of date, it doesn't even have hierarchical compilation.

@felipemello1
Contributor

@SalmanMohammadi , do we support compile and packed in PPO? If not, maybe we should not add those to these configs.

Compile... yes, kind of? It's going to be overhauled soon to support it properly; it's very out of date, it doesn't even have hierarchical compilation.

ok, so compile is NOT a no-op, and we SHOULD have it in the configs. However, packed does NOT work with PPO, and should NOT be added to ppo configs. Is that right?

@SalmanMohammadi
Collaborator

@SalmanMohammadi , do we support compile and packed in PPO? If not, maybe we should not add those to these configs.

Compile... yes, kind of? It's going to be overhauled soon to support it properly; it's very out of date, it doesn't even have hierarchical compilation.

ok, so compile is NOT a no-op, and we SHOULD have it in the configs. However, packed does NOT work with PPO, and should NOT be added to ppo configs. Is that right?

CORRECT

@felipemello1
Contributor

@krammnic, we also have to check if compile/packed work for the knowledge distillation recipe. If not, we need to remove them from the configs. In short, I know for sure that these work for the LoRA and full finetuning recipes/configs.

@SalmanMohammadi
Collaborator

@krammnic packed should also be removed from all the DPO configs, please.

@krammnic
Contributor Author

krammnic commented Oct 21, 2024

@krammnic, we also have to check if compile/packed work for the knowledge distillation recipe. If not, we need to remove them from the configs. In short, I know for sure that these work for the LoRA and full finetuning recipes/configs.

Sure, will test then

@krammnic
Contributor Author

@krammnic packed should also be removed from all the DPO configs, please.

Done

@krammnic
Contributor Author

Added a comment to compile: False

@felipemello1
Contributor

We will need to update recipes to log memory. We are getting the error

FAILED tests/recipes/test_lora_dpo_single_device.py::TestLoRADPOSingleDeviceRecipe::test_save_and_load_merged_weights - ValueError: Logging memory stats is only supported on CUDA devices, got cpu

So where log_peak_memory_stats, we need to add "if device = 'cuda'" and add info "log.info("log_peak_memory_stats was se to True, however, training does not use cuda. Setting log_peak_memory_stats=False."

cc: @ebsmothers

@krammnic
Contributor Author

We will need to update recipes to log memory. We are getting the error

FAILED tests/recipes/test_lora_dpo_single_device.py::TestLoRADPOSingleDeviceRecipe::test_save_and_load_merged_weights - ValueError: Logging memory stats is only supported on CUDA devices, got cpu

So where log_peak_memory_stats, we need to add "if device = 'cuda'" and add info "log.info("log_peak_memory_stats was se to True, however, training does not use cuda. Setting log_peak_memory_stats=False."

cc: @ebsmothers

Sure, will be done!

@felipemello1
Contributor

We will need to update recipes to log memory. We are getting the error

FAILED tests/recipes/test_lora_dpo_single_device.py::TestLoRADPOSingleDeviceRecipe::test_save_and_load_merged_weights - ValueError: Logging memory stats is only supported on CUDA devices, got cpu

So where log_peak_memory_stats, we need to add "if device = 'cuda'" and add info "log.info("log_peak_memory_stats was se to True, however, training does not use cuda. Setting log_peak_memory_stats=False."

cc: @ebsmothers

edit: let's actually do this check in the init of the recipe. In the future, we can move all of these checks to some function like "config_parse". We already have multiple of these checks in the init.

@SalmanMohammadi
Collaborator

We will need to update recipes to log memory. We are getting the error

FAILED tests/recipes/test_lora_dpo_single_device.py::TestLoRADPOSingleDeviceRecipe::test_save_and_load_merged_weights - ValueError: Logging memory stats is only supported on CUDA devices, got cpu

So where log_peak_memory_stats, we need to add "if device = 'cuda'" and add info "log.info("log_peak_memory_stats was se to True, however, training does not use cuda. Setting log_peak_memory_stats=False."
cc: @ebsmothers

edit: let's actually do this check in the init of the recipe. In the future, we can move all of these checks to some function like "config_parse". We already have multiple of these checks in the init.

I believe we're already doing this in most recipes when the stats are logged - the DPO recipe hasn't been updated.

@SalmanMohammadi
Collaborator

We will need to update recipes to log memory. We are getting the error

FAILED tests/recipes/test_lora_dpo_single_device.py::TestLoRADPOSingleDeviceRecipe::test_save_and_load_merged_weights - ValueError: Logging memory stats is only supported on CUDA devices, got cpu

So where log_peak_memory_stats, we need to add "if device = 'cuda'" and add info "log.info("log_peak_memory_stats was se to True, however, training does not use cuda. Setting log_peak_memory_stats=False."
cc: @ebsmothers

edit: let's actually do this check in the init of the recipe. In the future, we can move all of these checks to some function like "config_parse". We already have multiple of these checks in the init.

I believe we're already doing this in most recipes when the stats are logged - the DPO recipe hasn't been updated.

The DPO recipe uses:

                        if self._log_peak_memory_stats:
                            log_dict.update(
                                training.get_memory_stats(device=self._device)
                            )

it should be

                        if self._device.type == "cuda" and self._log_peak_memory_stats:
                            log_dict.update(
                                training.get_memory_stats(device=self._device)
                            )

@krammnic
Contributor Author

Added the required check:

if self._log_peak_memory_stats and self._device.type == "cuda":
     log.info(
                "log_peak_memory_stats was se to True, however, training does not use cuda. Setting log_peak_memory_stats=False."
     )
     self._log_peak_memory_stats = False

@@ -127,6 +127,12 @@ def __init__(self, cfg: DictConfig) -> None:
self._log_every_n_steps = cfg.get("log_every_n_steps", 1)
self._log_peak_memory_stats = cfg.get("log_peak_memory_stats", False)

if self._log_peak_memory_stats and self._device.type == "cuda":
Collaborator

@SalmanMohammadi Oct 23, 2024

Thoughts on whether we actually need this? I realise we kind of fail "silently" at the moment by just not logging if we aren't running on CUDA

if (
    self._device.type == "cuda"
    and self._log_peak_memory_stats
):

@joecummings @felipemello1

As-is we're now duplicating this check - once in the init, and also every time we log the memory stats (in model setup, and during training) - which isn't super clean. Personally I'd rather just make the check in the relevant logging util - but we don't have to block on this.
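
A rough sketch of that alternative (the helper name maybe_update_memory_stats is hypothetical; training.get_memory_stats is the util the recipes already call) could put the guard next to the logging call instead of in every recipe:

from typing import Any, Dict

import torch

from torchtune import training


def maybe_update_memory_stats(
    log_dict: Dict[str, Any], device: torch.device, enabled: bool
) -> None:
    # Peak memory stats are only available on CUDA, so skip silently otherwise.
    if enabled and device.type == "cuda":
        log_dict.update(training.get_memory_stats(device=device))

Called from the train loop, this would collapse the repeated if self._log_peak_memory_stats blocks into a single line.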

Contributor

Let me get back to you on this, @krammnic. Thanks for making all of these changes! :)

Contributor Author

I don't really see the problem with 2 partially duplicated checks. The point is that if we have logic like this in train:

log_dict = {
    "loss": loss_to_log,
    "lr": self._optimizer.param_groups[0]["lr"],
    "tokens_per_second_per_gpu": num_tokens / time_per_step,
}
if self._log_peak_memory_stats:
    log_dict.update(
        training.get_memory_stats(device=self._device)
    )
self._metric_logger.log_dict(
    log_dict,
    step=self.global_step,
)

We can't do anything better, can we? The check in __init__ is about CUDA and logging (once). The check in train probably should not be about "cuda" (there is no use case) and not about logging. I'm not sure if this should be in _metric_logger either.

@krammnic
Contributor Author

Fixed some nits. Probably should be fine


# Training env
device: cuda

# Memory management
enable_activation_checkpointing: True
custom_sharded_layers: ['tok_embeddings', 'output']
compile: False # set it to True for better memory and performance
compile=False # pytorch compile, set to true for perf/memory improvement# set it to True for better memory and performance
Contributor

oops

@@ -61,7 +62,7 @@ loss:
max_steps_per_epoch: null
gradient_accumulation_steps: 1
optimizer_in_bwd: True
compile: False # set it to True for better memory and performance
compile=False # pytorch compile, set to true for perf/memory improvement# set it to True for better memory and performance
Contributor

Do you mind quickly double checking? Also, if you are using a script, maybe as a sanity check make sure that compile/packed don't appear twice?

Contributor

this should be a colon? compile: False

Contributor Author

Yes, obviously. Fixed

@@ -121,6 +121,12 @@ def __init__(self, cfg: DictConfig) -> None:
self._log_every_n_steps = cfg.get("log_every_n_steps", 1)
self._log_peak_memory_stats = cfg.get("log_peak_memory_stats", False)

if self._log_peak_memory_stats and self._device.type == "cuda":
log.info(
"log_peak_memory_stats was se to True, however, training does not use cuda. Setting log_peak_memory_stats=False."
Contributor

I think that there are typos. That's my fault, I guess you just copied/pasted what I wrote earlier.

Contributor Author

Yes))) I didn't double-check because it's pretty minor.

Contributor Author

Fixed typos

Contributor

also, it should be self._device.type != "cuda" not "=="

Contributor

@krammnic sorry, just realized that the condition is wrong

Contributor Author

Good catch! Yeah, needs to be fixed.
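
For reference, a sketch of the corrected guard (using the != condition pointed out above; the exact log wording is just an assumption about the eventual cleanup):

if self._log_peak_memory_stats and self._device.type != "cuda":
    log.info(
        "log_peak_memory_stats was set to True; however, training does not use cuda. "
        "Setting log_peak_memory_stats=False."
    )
    # Disable the flag so later logging calls never query CUDA memory stats on CPU.
    self._log_peak_memory_stats = False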

@@ -71,7 +72,7 @@ fsdp:
epochs: 1
max_steps_per_epoch: null
gradient_accumulation_steps: 16
compile: False # set it to True for better memory and performance
compile=False # pytorch compile, set to true for perf/memory improvement# set it to True for better memory and performance
Contributor

@felipemello1 Oct 24, 2024

other configs still need fixing :(

Contributor Author

Fixed

@SalmanMohammadi mentioned this pull request Oct 26, 2024
@krammnic
Contributor Author

Fixed all typos

@krammnic closed this Oct 26, 2024
@krammnic reopened this Oct 26, 2024
Contributor

@felipemello1 left a comment

lgtm! thanks for the pr!

@@ -99,7 +100,7 @@ device: cuda
enable_activation_checkpointing: True
custom_sharded_layers: ['tok_embeddings', 'output']
fsdp_cpu_offload: True
compile: False # set it to True for better memory and performance
compile=False # pytorch compile, set to true for perf/memory improvement# set it to True for better memory and performance
Contributor

needs fix

@@ -89,15 +90,15 @@ loss:
epochs: 1
max_steps_per_epoch: null
gradient_accumulation_steps: 1
compile: False # set it to True for better memory and performance
compile=False # pytorch compile, set to true for perf/memory improvement# set it to True for better memory and performance
Contributor

needs fix

@@ -88,15 +89,15 @@ loss:
epochs: 1
max_steps_per_epoch: null
gradient_accumulation_steps: 1
compile: False # set it to True for better memory and performance
compile=False # pytorch compile, set to true for perf/memory improvement# set it to True for better memory and performance
Contributor

I think that the easiest way is to ctrl+F and search for "compile=". There are >15 such cases.
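
If a script helps, a quick throwaway check along these lines (the recipes/configs path is an assumption about where the YAML files live) will list every remaining "compile=" line:

from pathlib import Path

# Assumed location of the YAML configs; adjust to the actual repo layout.
config_dir = Path("recipes/configs")

for path in sorted(config_dir.rglob("*.yaml")):
    for lineno, line in enumerate(path.read_text().splitlines(), start=1):
        # Flag the broken "compile=" form that should be "compile: False # ...".
        if "compile=" in line:
            print(f"{path}:{lineno}: {line.strip()}")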

@@ -119,6 +119,12 @@ def __init__(self, cfg: DictConfig) -> None:
self._log_every_n_steps = cfg.get("log_every_n_steps", 1)
self._log_peak_memory_stats = cfg.get("log_peak_memory_stats", False)

if self._log_peak_memory_stats and self._device.type == "cuda":
Contributor

Suggested change
if self._log_peak_memory_stats and self._device.type == "cuda":
if self._log_peak_memory_stats and self._device.type != "cuda":

@@ -127,6 +127,12 @@ def __init__(self, cfg: DictConfig) -> None:
self._log_every_n_steps = cfg.get("log_every_n_steps", 1)
self._log_peak_memory_stats = cfg.get("log_peak_memory_stats", False)

if self._log_peak_memory_stats and self._device.type == "cuda":
Contributor

Suggested change
if self._log_peak_memory_stats and self._device.type == "cuda":
if self._log_peak_memory_stats and self._device.type != "cuda":

@@ -130,6 +130,12 @@ def __init__(self, cfg: DictConfig) -> None:
self._log_every_n_steps = cfg.get("log_every_n_steps", 1)
self._log_peak_memory_stats = cfg.get("log_peak_memory_stats", False)

if self._log_peak_memory_stats and self._device.type == "cuda":
Contributor

Suggested change
if self._log_peak_memory_stats and self._device.type == "cuda":
if self._log_peak_memory_stats and self._device.type != "cuda":

@@ -120,6 +120,12 @@ def __init__(self, cfg: DictConfig) -> None:
self._log_every_n_steps = cfg.get("log_every_n_steps", 1)
self._log_peak_memory_stats = cfg.get("log_peak_memory_stats", False)

if self._log_peak_memory_stats and self._device.type == "cuda":
Contributor

Suggested change
if self._log_peak_memory_stats and self._device.type == "cuda":
if self._log_peak_memory_stats and self._device.type != "cuda":

@@ -116,6 +116,12 @@ def __init__(self, cfg: DictConfig) -> None:
self._log_every_n_steps = cfg.get("log_every_n_steps", 1)
self._log_peak_memory_stats = cfg.get("log_peak_memory_stats", False)

if self._log_peak_memory_stats and self._device.type == "cuda":
Contributor

Suggested change
if self._log_peak_memory_stats and self._device.type == "cuda":
if self._log_peak_memory_stats and self._device.type != "cuda":

@@ -121,6 +121,12 @@ def __init__(self, cfg: DictConfig) -> None:
self._log_every_n_steps = cfg.get("log_every_n_steps", 1)
self._log_peak_memory_stats = cfg.get("log_peak_memory_stats", False)

if self._log_peak_memory_stats and self._device.type == "cuda":
Contributor

Suggested change
if self._log_peak_memory_stats and self._device.type == "cuda":
if self._log_peak_memory_stats and self._device.type != "cuda":

@@ -72,14 +73,15 @@ loss:
epochs: 1
max_steps_per_epoch: null
gradient_accumulation_steps: 32
compile: False
Contributor

I guess we didn't add the comment in every config?

@felipemello1 merged commit 2c948c6 into pytorch:main Oct 26, 2024
17 checks passed