
Wrong implementation about EMA ? #1

Open
YilinLiu97 opened this issue Feb 13, 2019 · 8 comments


YilinLiu97 commented Feb 13, 2019

Hi @perone, I found that the teacher model's weights don't seem to be updated: it performed as badly as when it was first initialized.

def update_ema_variables(model, ema_model, alpha, global_step):
    alpha = min(1 - 1 / (global_step + 1), alpha)
    for ema_param, param in zip(ema_model.parameters(), model.parameters()):
        ema_param.data.mul_(alpha).add_(1 - alpha, param.data)

About this line:

ema_param.data.mul_(alpha).add_(1 - alpha, param.data)

shouldn't this be ema_param.data.mul_(alpha).add_((1 - alpha) * param.data)?
Thanks!

@YilinLiu97 YilinLiu97 changed the title About EMA Wrong implementation about EMA ? Feb 13, 2019

perone commented Feb 14, 2019

Hi @YilinLiu97, why do you think it's wrong? Can you provide a concrete case where it fails? As far as I can tell, our implementation of the update rule is definitely correct.


YilinLiu97 commented Feb 14, 2019

Sorry, no offense meant. I could be wrong, but these are the parameters printed out during training, and they don't seem to follow the rule alpha * teacher_param + (1 - alpha) * student_param.

I think that line should be ema_param.data.mul_(alpha).add_((1 - alpha) * param.data) instead of ema_param.data.mul_(alpha).add_(1 - alpha, param.data). Maybe there was a typo.

('t_p: ', Parameter containing:
tensor([ 0.0007, -0.0006, 0.0046, -0.0033, 0.0004, 0.0262, 0.0153, -0.0259,
-0.0115, -0.0015, -0.0117, -0.0060, 0.0161, 0.0104, 0.0080, -0.0015,
-0.0116, -0.0160, 0.0247, -0.0227, 0.0077, 0.0052, 0.0217, 0.0111,
-0.0036, -0.0176, -0.0188, 0.0026, -0.0163, 0.0155],
device='cuda:0'))
('p: ', Parameter containing:
tensor([-0.0322, -0.0153, 0.0206, -0.0212, -0.0274, 0.0293, 0.0225, -0.0279,
-0.0272, -0.0282, -0.0272, -0.0261, 0.0275, 0.0261, 0.0274, -0.0251,
0.0014, -0.0285, 0.0296, -0.0296, 0.0105, -0.0209, 0.0123, 0.0227,
-0.0162, -0.0081, -0.0079, -0.0233, -0.0145, 0.0030],
device='cuda:0', requires_grad=True))
('(after) t_p: ', Parameter containing:
tensor([ 0.0007, -0.0006, 0.0046, -0.0033, 0.0004, 0.0262, 0.0153, -0.0259,
-0.0115, -0.0016, -0.0117, -0.0060, 0.0161, 0.0104, 0.0080, -0.0015,
-0.0116, -0.0160, 0.0247, -0.0227, 0.0077, 0.0052, 0.0217, 0.0111,
-0.0036, -0.0176, -0.0187, 0.0026, -0.0163, 0.0155],
device='cuda:0'))

@YilinLiu97

t_p: teacher's weights before the update
p: student's weights
(after) t_p: teacher's weights after the update

@YilinLiu97

I used alpha=0.5 for this test.


perone commented Feb 15, 2019

Hi @YilinLiu97, no problem at all. Let me give an example that should make it clear:

import torch

alpha = 0.99
params = torch.randn(100, 100)
ema_params = torch.randn(100, 100)
ema_params_2 = ema_params.clone()

# Case 1: the form used in the repository
tensor_a = ema_params.data.mul_(alpha).add_(1 - alpha, params.data)

# Case 2: the form you suggested
tensor_b = ema_params_2.data.mul_(alpha).add_((1 - alpha) * params.data)

torch.allclose(tensor_a, tensor_b)

If you execute that, you'll see that they give exactly the same results, and allclose returns True.
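As a side note (an illustration, not code from the repo): newer PyTorch releases deprecate the positional add_(scalar, tensor) overload in favor of the keyword form, which reads closer to the EMA formula:

import torch

alpha = 0.99
param = torch.randn(10)
ema_param = torch.randn(10)

# x.add_(y, alpha=c) computes x += c * y, so this performs
# ema_param = alpha * ema_param + (1 - alpha) * param
ema_param.mul_(alpha).add_(param, alpha=1 - alpha)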


YilinLiu97 commented Feb 18, 2019

Thanks very much for the detailed explanation! Yeah, you're right. I forgot that I 'deep copied' the model's weights into the teacher's weights at the very beginning, so with an alpha of 0.5 the parameters indeed looked very similar.

After some extensive experiments, I found that the key to domain adaptation in this experiment seems to be the use of group normalization(?), because of the discrepancy in running_mean/std between the source and target domains. Without it, the EMA model doesn't seem to work at all. I also found that with GN it's hard to tune the hyperparameters; the results were worse than with BN even in a supervised setting.

Theoretically, if alpha = 0, batches of source data are forwarded only to the student model, and batches of target data are forwarded only to the teacher, this would be equivalent to AdaBN, right? I've tried this too, but the EMA model still doesn't give reasonable results. Do you possibly know why? Thanks!
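For reference, a minimal sketch of the AdaBN-style baseline I mean (adapt_bn_stats is my own illustrative helper, assuming a plain PyTorch model with BatchNorm2d layers and a target-domain DataLoader):

import torch
import torch.nn as nn

def adapt_bn_stats(model, target_loader, device="cuda"):
    # AdaBN idea: keep all learned weights, but re-estimate the
    # BatchNorm running_mean / running_var on the target domain.
    for m in model.modules():
        if isinstance(m, nn.BatchNorm2d):
            m.reset_running_stats()
    model.train()  # BN only updates its running stats in train mode
    with torch.no_grad():  # no gradient updates, stats only
        for x, _ in target_loader:
            model(x.to(device))
    return model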


perone commented Mar 1, 2019

Hi @YilinLiu97, sorry for the late reply. This depends a lot on the task and its requirements. You can actually use BatchNorm, but you need to be careful with the internally estimated statistics because you're dealing with two different domains; AdaBN is one way to help with that. However, in many cases where you have large inputs you need small batches because of memory, and for small batches BatchNorm underperforms GroupNorm, so it really depends on the task.
I would recommend going through the recommendations at the bottom of this page, where they give some very good advice that helped us get good results: https://github.com/CuriousAI/mean-teacher.
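In case it helps, a minimal sketch of swapping BatchNorm2d for GroupNorm in an existing model (batchnorm_to_groupnorm is an illustrative helper; num_groups is a hyperparameter and must divide each layer's channel count):

import torch.nn as nn

def batchnorm_to_groupnorm(module, num_groups=32):
    # Recursively replace BatchNorm2d with GroupNorm, so normalization
    # no longer depends on (small-)batch statistics.
    for name, child in module.named_children():
        if isinstance(child, nn.BatchNorm2d):
            setattr(module, name, nn.GroupNorm(num_groups, child.num_features))
        else:
            batchnorm_to_groupnorm(child, num_groups)
    return module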


KingMV commented Jun 12, 2021

Mean-Teacher only uses model.parameters() to update the EMA model, but model.parameters() does not contain the running statistics of BatchNorm. You can check this by printing the BatchNorm buffers of the model and the EMA model.
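A sketch of one common way to handle this (copying the BatchNorm buffers in addition to the EMA over the parameters; an illustration, not the repo's code):

import torch

def update_ema_variables(model, ema_model, alpha, global_step):
    alpha = min(1 - 1 / (global_step + 1), alpha)
    # EMA over the learnable parameters, as in the original code
    for ema_param, param in zip(ema_model.parameters(), model.parameters()):
        ema_param.data.mul_(alpha).add_(param.data, alpha=1 - alpha)
    # BatchNorm running_mean / running_var live in buffers, not in
    # parameters, so they must be handled separately; here they are
    # simply copied from the student to the teacher.
    for ema_buf, buf in zip(ema_model.buffers(), model.buffers()):
        ema_buf.data.copy_(buf.data)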
