
BN Fixes #783

Open · wants to merge 3 commits into dev
Conversation


@adefazio adefazio commented Sep 5, 2024

There are some subtle issues with how BatchNorm is handled in the PyTorch version of the code. Currently, workload.model_fn has an update_batch_norm parameter, which in theory should allow the submission to control whether the batch-norm statistics are updated during a forward pass. The issues are the following:

  • The update_batch_norm_fn function stores the old momentum parameter for each batchnorm layer in a momentum_backup variable, so it can be restored later, before zeroing the parameter. However, if it is called with update_batch_norm=False twice in a row, it overwrites the momentum_backup with 0 on the second call, so momentum then remains zero for the remainder of training.
  • In PyTorch's built-in BatchNorm, momentum=0 means the running statistics shouldn't be updated. This is the opposite of the usual EMA momentum convention (e.g. in Adam), where 1 means no update and 0 means the running value is replaced by the latest value at every step. The custom BatchNorm modules used in the two librispeech workloads follow this second, more standard convention instead. However, update_batch_norm_fn sets the momentum to zero for all three layer types, resulting in incorrect behavior for the librispeech workloads.
  • The update_batch_norm_fn sets the BN layers to eval mode. This doesn't make sense, as it prevents the use case where you use batch-computed statistics (train mode) without also updating the running statistics. The BN layers can be set to eval mode separately by passing ForwardPassMode.EVAL to the forward pass, so removing this .eval() call doesn't prevent the submission from using eval mode during a forward pass.
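The first issue above can be reproduced with a minimal pure-Python sketch. The class and the function body here are stand-ins that mirror the described pattern, not the actual workload code:

```python
class FakeBN:
    """Stand-in for a BatchNorm layer with a momentum attribute."""
    def __init__(self, momentum=0.1):
        self.momentum = momentum

def update_batch_norm_fn(module, update_batch_norm):
    # Buggy pattern: the backup is taken unconditionally, so a second
    # call with update_batch_norm=False backs up the already-zeroed value.
    # Fix: only store momentum_backup if one hasn't been stored yet.
    if not update_batch_norm:
        module.momentum_backup = module.momentum
        module.momentum = 0.0
    elif hasattr(module, 'momentum_backup'):
        module.momentum = module.momentum_backup

bn = FakeBN(momentum=0.1)
update_batch_norm_fn(bn, update_batch_norm=False)
update_batch_norm_fn(bn, update_batch_norm=False)  # overwrites the backup with 0
update_batch_norm_fn(bn, update_batch_norm=True)   # "restores" momentum to 0
print(bn.momentum)  # 0.0, not the original 0.1
```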

This PR switches the custom BN code to follow the PyTorch BN convention, so that momentum=0 doesn't update the running buffers. It also fixes the issues in the update_batch_norm_fn function mentioned above.
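The two momentum conventions in play can be contrasted with a small arithmetic sketch (function names are illustrative, not from the codebase):

```python
def pytorch_bn_update(running, batch, momentum):
    # PyTorch nn.BatchNorm convention: momentum weights the *new* value,
    # so momentum=0 freezes the running statistic.
    return (1 - momentum) * running + momentum * batch

def ema_update(running, batch, momentum):
    # Adam-style EMA convention (followed by the custom librispeech BN
    # modules before this PR): momentum weights the *old* value,
    # so momentum=0 replaces the running statistic with the batch value.
    return momentum * running + (1 - momentum) * batch

running, batch = 10.0, 2.0
print(pytorch_bn_update(running, batch, momentum=0.0))  # 10.0: frozen
print(ema_update(running, batch, momentum=0.0))         # 2.0: overwritten
```

Setting momentum=0 for all layer types therefore freezes the builtin layers but fully overwrites the running statistics of the custom ones, which is the incorrect behavior described in issue 2.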

@adefazio adefazio requested a review from a team as a code owner September 5, 2024 15:39

github-actions bot commented Sep 5, 2024

MLCommons CLA bot All contributors have signed the MLCommons CLA ✍️ ✅


priyakasimbeg commented Sep 5, 2024

Thanks for spotting all these issues. I agree, we should incorporate these fixes.
I spotted one more subtle issue in our JAX code, similar to point 3 above.

  def __call__(self,
               x: spec.Tensor,
               update_batch_norm: bool = True) -> spec.Tensor:
    conv = functools.partial(nn.Conv, use_bias=False, dtype=self.dtype)
    norm = functools.partial(
        nn.BatchNorm,
        use_running_average=not update_batch_norm,
        momentum=0.9,
        epsilon=1e-5,
        dtype=self.dtype)

This prevents the use case where you want to use (or not use) the running average in train mode independently of whether you update the batch-norm statistics. Maybe we need an extra arg in the call functions to distinguish between train and eval mode (or just whether or not to use_running_average), instead of inferring it from the update_batch_norm arg?
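One way to sketch the suggested decoupling, in plain Python (the helper name and both parameter names are hypothetical, not part of the codebase):

```python
# Hypothetical mapping from two explicit submission-level flags onto the
# two independent BatchNorm behaviors, instead of deriving both from a
# single update_batch_norm argument.
def batch_norm_flags(train: bool, update_batch_norm: bool):
    """Return (use_running_average, update_running_stats)."""
    use_running_average = not train              # eval mode -> running stats
    update_running_stats = train and update_batch_norm
    return use_running_average, update_running_stats

# Train-mode batch statistics without touching the running buffers,
# which the current single-flag API cannot express:
print(batch_norm_flags(train=True, update_batch_norm=False))  # (False, False)
```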


adefazio commented Sep 6, 2024

recheck

@priyakasimbeg priyakasimbeg changed the base branch from main to dev September 12, 2024 19:20

@priyakasimbeg priyakasimbeg left a comment


LGTM. JAX changes are in a follow-up PR.
Nit: Can you run `yapf -i -r -vv -p` on the files that are failing the Linting and yapf tests?

@priyakasimbeg priyakasimbeg mentioned this pull request Oct 17, 2024