Expose packed: False, set log_peak_memory_stats: True, set compile: False #1872
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/torchtune/1872
Note: Links to docs will display an error until the docs builds have been completed.
✅ No Failures as of commit b0b4b14 with merge base 3ca0d30. This comment was automatically generated by Dr. CI and updates every 15 minutes.
two nits: overall, big fan of this UX improvement
@@ -45,7 +45,9 @@ resume_from_checkpoint: False

# Dataset
dataset:
  packed: False # Set to true for great speed ups
Huge nit: can we move this below the `_component_` declaration? That way it reads more as an option for the specific builder.
Yeah, I thought about it like this at first too, but I followed the original declaration from the issue:

dataset:
  packed=False # Set to true for great speed ups

Will be fixed
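For context, here is a minimal sketch of why the placement matters (this is an illustrative helper, not torchtune's actual `config.instantiate`): every sibling key of `_component_` is forwarded to that builder as a keyword argument, so `packed` really is an option of the specific dataset builder.

```python
from omegaconf import DictConfig, OmegaConf

def instantiate_like(node: DictConfig):
    # Hypothetical helper mirroring how a `_component_`-style node is resolved:
    # the `_component_` string names the callable, all sibling keys become kwargs.
    kwargs = {k: v for k, v in node.items() if k != "_component_"}
    module_path, _, name = str(node["_component_"]).rpartition(".")
    builder = getattr(__import__(module_path, fromlist=[name]), name)
    return builder(**kwargs)

# Stdlib stand-in for a dataset builder, just to show the mechanics:
cfg = OmegaConf.create(
    "dataset:\n"
    "  _component_: datetime.timedelta\n"
    "  days: 1  # sibling keys are builder options, exactly like `packed`\n"
)
print(instantiate_like(cfg.dataset))  # 1 day, 0:00:00
```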
@@ -57,7 +58,7 @@ loss:
  _component_: torchtune.modules.loss.CEWithChunkedOutputLoss
max_steps_per_epoch: null
gradient_accumulation_steps: 1

compile: False
Should we be explicit that this will `torch.compile` the model and loss? `compile` seems like a vague name.
agreed. I think we can add a comment, similar to packed:

compile=False # pytorch compile, set to true for perf/memory improvement

wdyt?
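For readers of the thread, a rough sketch of what `compile: True` ends up meaning in a recipe (assumed wiring, not the exact torchtune code path): both the model and the loss get wrapped with `torch.compile`.

```python
import torch
from torch import nn

# `compile_flag` stands in for the YAML `compile` flag discussed above.
compile_flag = True

model = nn.Linear(16, 4)
loss_fn = nn.CrossEntropyLoss()

if compile_flag:
    # torch.compile the model and the loss, as the config comment hints.
    model = torch.compile(model)
    loss_fn = torch.compile(loss_fn)

logits = model(torch.randn(8, 16))
loss = loss_fn(logits, torch.randint(0, 4, (8,)))
print(loss.item())
```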
Add some comment there?
Ok, will add
@SalmanMohammadi, do we support compile and packed in PPO? If not, maybe we should not add those to these configs.
Compile... yes, kind of? It's going to be overhauled soon to properly support it; it's very out-of-date, it doesn't even have hierarchical compilation.
ok, so compile is NOT a no-op, and we SHOULD have it in the configs. However, packed does NOT work with PPO, and should NOT be added to ppo configs. Is that right?
CORRECT
@krammnic, we also have to check if compile/packed work for the knowledge distillation recipe. If not, we need to remove it from the configs. In short, I know for sure that these work for LoRA and full finetuning recipes/configs.
@krammnic packed should also be removed from all the DPO configs, please.
Sure, will test then
Done
Added a comment to compile: False
We will need to update the recipes to log memory. We are getting the error
So wherever we use log_peak_memory_stats, we need to add "if device = 'cuda'" and add info: log.info("log_peak_memory_stats was se to True, however, training does not use cuda. Setting log_peak_memory_stats=False."). cc: @ebsmothers
Sure, will be done!
edit: let's actually do this check in the init of the recipe. In the future, we can move all of these checks to some function like "config_parse". We already have multiple of these checks in the init.
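A minimal sketch of the check being discussed, written the way it later landed (with `!=` rather than `==`, and with the typo fixed); the helper name here is illustrative, not torchtune's API.

```python
import logging

import torch
from omegaconf import DictConfig, OmegaConf

log = logging.getLogger(__name__)

def resolve_log_peak_memory_stats(cfg: DictConfig, device: torch.device) -> bool:
    # Read the flag from the config, then disable it if we're not on CUDA,
    # logging why, instead of silently skipping the stats later.
    flag = cfg.get("log_peak_memory_stats", False)
    if flag and device.type != "cuda":
        log.info(
            "log_peak_memory_stats was set to True, however, training does not "
            "use cuda. Setting log_peak_memory_stats=False."
        )
        flag = False
    return flag

# On a CPU-only run the flag is flipped back to False.
cfg = OmegaConf.create({"log_peak_memory_stats": True})
print(resolve_log_peak_memory_stats(cfg, torch.device("cpu")))  # False
```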
I believe we're already doing this in most recipes when the stats are logged - the DPO recipe hasn't been updated.
The DPO recipe uses:

    if self._log_peak_memory_stats:
        log_dict.update(
            training.get_memory_stats(device=self._device)
        )

it should be:

    if self._device.type == "cuda" and self._log_peak_memory_stats:
        log_dict.update(
            training.get_memory_stats(device=self._device)
        )
Add required check:
recipes/qat_distributed.py (Outdated)

@@ -127,6 +127,12 @@ def __init__(self, cfg: DictConfig) -> None:
        self._log_every_n_steps = cfg.get("log_every_n_steps", 1)
        self._log_peak_memory_stats = cfg.get("log_peak_memory_stats", False)

        if self._log_peak_memory_stats and self._device.type == "cuda":
Thoughts on whether we actually need this? I realise we kind of fail "silently" at the moment by just not logging if we aren't running on CUDA:

torchtune/recipes/lora_finetune_single_device.py, lines 716 to 719 in 17ba37d:

    if (
        self._device.type == "cuda"
        and self._log_peak_memory_stats
    ):

As-is we're now duplicating this check - once in the init, and also every time we log the memory stats (in model setup, and during training), which isn't super clean. Personally I'd rather just make the check in the relevant logging util - but we don't have to block on this.
let me get back to you on this @krammnic . Thanks for making all of these changes! :)
I'm not really seeing the problem with 2 partially duplicated checks. The point is that we have logic like this in train:

    log_dict = {
        "loss": loss_to_log,
        "lr": self._optimizer.param_groups[0]["lr"],
        "tokens_per_second_per_gpu": num_tokens / time_per_step,
    }
    if self._log_peak_memory_stats:
        log_dict.update(
            training.get_memory_stats(device=self._device)
        )
    self._metric_logger.log_dict(
        log_dict,
        step=self.global_step,
    )

We can't do anything better, can we? The check in __init__ is about cuda and logging (once). The check in train probably should not be about "cuda" (there is no use case) and not about logging. I'm not sure this should live in _metric_logger either.
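For what it's worth, the alternative floated above could look roughly like this (a sketch only; the key names are illustrative and this is not torchtune's `training.get_memory_stats`): keep the CUDA check inside a single logging helper so the recipes never repeat it.

```python
from typing import Dict

import torch

def maybe_memory_stats(device: torch.device) -> Dict[str, float]:
    # Return peak memory stats only when they are meaningful; on CPU/MPS this
    # is a no-op, so callers can always `log_dict.update(...)` unconditionally.
    if device.type != "cuda":
        return {}
    gib = 1024**3
    return {
        "peak_memory_alloc": torch.cuda.max_memory_allocated(device) / gib,
        "peak_memory_reserved": torch.cuda.max_memory_reserved(device) / gib,
    }

# In the train loop above this would collapse to:
#     log_dict.update(maybe_memory_stats(self._device))
print(maybe_memory_stats(torch.device("cpu")))  # {}
```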
Fixed some nits. Probably should be fine

# Training env
device: cuda

# Memory management
enable_activation_checkpointing: True
custom_sharded_layers: ['tok_embeddings', 'output']
-compile: False # set it to True for better memory and performance
+compile=False # pytorch compile, set to true for perf/memory improvement# set it to True for better memory and performance
oops
@@ -61,7 +62,7 @@ loss:
max_steps_per_epoch: null
gradient_accumulation_steps: 1
optimizer_in_bwd: True
-compile: False # set it to True for better memory and performance
+compile=False # pytorch compile, set to true for perf/memory improvement# set it to True for better memory and performance
do you mind quickly double checking? Also, if you are using a script, maybe as a sanity check make sure that compile/packed don't appear twice?
this should be a colon? compile: False
Yes, obviously. Fixed
recipes/full_finetune_distributed.py (Outdated)

@@ -121,6 +121,12 @@ def __init__(self, cfg: DictConfig) -> None:
        self._log_every_n_steps = cfg.get("log_every_n_steps", 1)
        self._log_peak_memory_stats = cfg.get("log_peak_memory_stats", False)

        if self._log_peak_memory_stats and self._device.type == "cuda":
            log.info(
                "log_peak_memory_stats was se to True, however, training does not use cuda. Setting log_peak_memory_stats=False."
            )
I think there are typos. That's my fault, I guess you just copied/pasted what I wrote earlier.
Yes))) I didn't double check because it's pretty minor
Fixed typos
also, it should be self._device.type != "cuda" not "=="
@krammnic sorry, just realized that the condition is wrong
Good catch! Yeah, needs to be fixed
@@ -71,7 +72,7 @@ fsdp:
epochs: 1
max_steps_per_epoch: null
gradient_accumulation_steps: 16
-compile: False # set it to True for better memory and performance
+compile=False # pytorch compile, set to true for perf/memory improvement# set it to True for better memory and performance
other configs still need fixing :(
Fixed
Fixed all typos
lgtm! thanks for the pr!
recipes/configs/llama3/70B_full.yaml (Outdated)

@@ -99,7 +100,7 @@ device: cuda
enable_activation_checkpointing: True
custom_sharded_layers: ['tok_embeddings', 'output']
fsdp_cpu_offload: True
-compile: False # set it to True for better memory and performance
+compile=False # pytorch compile, set to true for perf/memory improvement# set it to True for better memory and performance
needs fix
recipes/configs/llama3/70B_lora.yaml (Outdated)

@@ -89,15 +90,15 @@ loss:
epochs: 1
max_steps_per_epoch: null
gradient_accumulation_steps: 1
-compile: False # set it to True for better memory and performance
+compile=False # pytorch compile, set to true for perf/memory improvement# set it to True for better memory and performance
needs fix
@@ -88,15 +89,15 @@ loss:
epochs: 1
max_steps_per_epoch: null
gradient_accumulation_steps: 1
-compile: False # set it to True for better memory and performance
+compile=False # pytorch compile, set to true for perf/memory improvement# set it to True for better memory and performance
I think the easiest way is to ctrl+F and search for "compile=". There are >15 such cases.
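If a script is easier than ctrl+F, a throwaway check along these lines would do (assuming the configs live under recipes/configs/):

```python
from pathlib import Path

# Flag configs that still use the broken `compile=` / `packed=` form, or that
# picked up the new comment without dropping the old one (two '#' on one line).
for path in sorted(Path("recipes/configs").rglob("*.yaml")):
    for lineno, line in enumerate(path.read_text().splitlines(), start=1):
        broken_form = "compile=" in line or "packed=" in line
        doubled_comment = line.count("#") > 1 and ("compile" in line or "packed" in line)
        if broken_form or doubled_comment:
            print(f"{path}:{lineno}: {line.strip()}")
```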
@@ -119,6 +119,12 @@ def __init__(self, cfg: DictConfig) -> None:
        self._log_every_n_steps = cfg.get("log_every_n_steps", 1)
        self._log_peak_memory_stats = cfg.get("log_peak_memory_stats", False)

        if self._log_peak_memory_stats and self._device.type == "cuda":
Suggested change:
-        if self._log_peak_memory_stats and self._device.type == "cuda":
+        if self._log_peak_memory_stats and self._device.type != "cuda":
recipes/qat_distributed.py (Outdated)

@@ -127,6 +127,12 @@ def __init__(self, cfg: DictConfig) -> None:
        self._log_every_n_steps = cfg.get("log_every_n_steps", 1)
        self._log_peak_memory_stats = cfg.get("log_peak_memory_stats", False)

        if self._log_peak_memory_stats and self._device.type == "cuda":
Suggested change:
-        if self._log_peak_memory_stats and self._device.type == "cuda":
+        if self._log_peak_memory_stats and self._device.type != "cuda":
recipes/lora_dpo_distributed.py (Outdated)

@@ -130,6 +130,12 @@ def __init__(self, cfg: DictConfig) -> None:
        self._log_every_n_steps = cfg.get("log_every_n_steps", 1)
        self._log_peak_memory_stats = cfg.get("log_peak_memory_stats", False)

        if self._log_peak_memory_stats and self._device.type == "cuda":
Suggested change:
-        if self._log_peak_memory_stats and self._device.type == "cuda":
+        if self._log_peak_memory_stats and self._device.type != "cuda":
@@ -120,6 +120,12 @@ def __init__(self, cfg: DictConfig) -> None:
        self._log_every_n_steps = cfg.get("log_every_n_steps", 1)
        self._log_peak_memory_stats = cfg.get("log_peak_memory_stats", False)

        if self._log_peak_memory_stats and self._device.type == "cuda":
Suggested change:
-        if self._log_peak_memory_stats and self._device.type == "cuda":
+        if self._log_peak_memory_stats and self._device.type != "cuda":
@@ -116,6 +116,12 @@ def __init__(self, cfg: DictConfig) -> None:
        self._log_every_n_steps = cfg.get("log_every_n_steps", 1)
        self._log_peak_memory_stats = cfg.get("log_peak_memory_stats", False)

        if self._log_peak_memory_stats and self._device.type == "cuda":
Suggested change:
-        if self._log_peak_memory_stats and self._device.type == "cuda":
+        if self._log_peak_memory_stats and self._device.type != "cuda":
recipes/full_finetune_distributed.py (Outdated)

@@ -121,6 +121,12 @@ def __init__(self, cfg: DictConfig) -> None:
        self._log_every_n_steps = cfg.get("log_every_n_steps", 1)
        self._log_peak_memory_stats = cfg.get("log_peak_memory_stats", False)

        if self._log_peak_memory_stats and self._device.type == "cuda":
Suggested change:
-        if self._log_peak_memory_stats and self._device.type == "cuda":
+        if self._log_peak_memory_stats and self._device.type != "cuda":
@@ -72,14 +73,15 @@ loss:
epochs: 1
max_steps_per_epoch: null
gradient_accumulation_steps: 32
compile: False
I guess we didn't add the comment in every config?
Context
What is the purpose of this PR? Is it to
Please link to any issues this PR addresses.

Changelog
What are the changes made in this PR?

Test plan
Please make sure to do each of the following if applicable to your PR. If you're unsure about any one of these, just ask and we will happily help. We also have a contributing page for some guidance on contributing.
- pre-commit install
- pytest tests
- pytest tests -m integration_test

UX
If your function changed a public API, please add a dummy example of what the user experience will look like when calling it.
Here is a docstring example and a tutorial example.