Wrong implementation of EMA? #1
Comments
Hi @YilinLiu97, why do you think it's wrong? Can you provide a concrete case where it would be wrong? For me, our implementation of the update rule is definitely correct.
Sorry, no offense intended. I could be wrong, but these are the parameters printed out during training, and they do not seem to follow the rule alpha*teacher_param + (1-alpha)*student_param. I think that line should be ...
('t_p: ', Parameter containing: ...
t_p - the teacher's weights before being updated
I used alpha=0.5 for this test.
Hi @YilinLiu97, no problem at all. Let me give an example that might make it clear:
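A minimal sketch of that kind of comparison (the tensor values and names here are illustrative, not taken from the repo):

```python
import torch

alpha = 0.5
teacher = torch.tensor([1.0, 2.0, 3.0])  # stands in for the teacher (EMA) weights
student = torch.tensor([5.0, 6.0, 7.0])  # stands in for the student weights

# Explicit rule: alpha * teacher + (1 - alpha) * student
explicit = alpha * teacher + (1 - alpha) * student

# In-place form, i.e. the chained mul_/add_ update
ema = teacher.clone()
ema.mul_(alpha).add_(student, alpha=1 - alpha)

print(explicit)  # tensor([3., 4., 5.])
print(ema)       # tensor([3., 4., 5.])
```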
If you execute that, you'll see that they give exactly the same results.
Thanks very much for the detailed explanation! Yeah, you are right. I forgot that I deep-copied the model's weights into the teacher's weights at the very beginning, so with an alpha of 0.5 the parameters indeed looked very similar.

After some extensive experiments, I found that the key to domain adaptation in this experiment seems to be the use of group normalization(?), because of the discrepancy of running_mean/std between the source and target domains. Without it the EMA model doesn't seem to work at all. I also found that with GN it's hard to tune the hyperparameters; the results were worse than using BN even in a supervised setting.

Theoretically, if alpha = 0, batches of source data were forwarded to the model only, and batches of target data were forwarded to the teacher only, this would be equivalent to AdaBN, right? I've also tried this, but the EMA model still doesn't give reasonable results. Do you possibly know why? Thanks!
Hi @YilinLiu97, sorry for the late reply. This depends a lot on the task and its requirements. You can actually use BatchNorm, but you need to be careful with the internally estimated statistics because you're dealing with two different domains; AdaBN is one way to help with that. However, in many cases when you have large inputs you need to use small batches because of memory, and for small batches BatchNorm underperforms GroupNorm, so it depends a lot on the task.
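For example, a generic way to make a block batch-size independent is to swap BatchNorm for GroupNorm (a sketch only, not code from this repo; `conv_block` and its arguments are made up for illustration):

```python
import torch.nn as nn

def conv_block(in_ch, out_ch, groups=8):
    # GroupNorm normalizes over channel groups instead of the batch dimension,
    # so it does not rely on batch statistics and behaves the same for
    # batch size 1 or 32 (out_ch must be divisible by `groups`).
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
        nn.GroupNorm(num_groups=groups, num_channels=out_ch),
        nn.ReLU(inplace=True),
    )
```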
Mean-Teacher only uses model.parameters() to update the EMA model, but model.parameters() does not contain the running statistics of BatchNorm. You can check this by printing the BatchNorm parameters of the model and of the EMA model.
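For instance, a quick self-contained check (a toy model, not the one from this repo) shows that the running statistics live in buffers, not parameters:

```python
import torch.nn as nn

model = nn.Sequential(nn.Conv2d(3, 8, kernel_size=3), nn.BatchNorm2d(8))

# Learnable weights only: conv weight/bias and the BatchNorm affine weight/bias.
print([name for name, _ in model.named_parameters()])

# The running mean/var are buffers, so an EMA over model.parameters()
# never touches them.
print([name for name, _ in model.named_buffers()])
# -> ['1.running_mean', '1.running_var', '1.num_batches_tracked']
```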
Hi @perone, I found that the teacher model's weights don't seem to be updated, as it performed as badly as when it was first initialized.
def update_ema_variables(model, ema_model, alpha, global_step):
    ...
About the update line in this function: shouldn't it be
ema_param.data.mul_(alpha).add_((1 - alpha) * param.data)
? Thanks!