Unable to replicate results on CIFAR10 #1

Open · kevinkasa opened this issue Feb 23, 2023 · 2 comments

@kevinkasa

Hello, thank you for providing this codebase!

Unfortunately, I am unable to replicate the reported results on the CIFAR10 dataset (MNIST seems to work fine). I am using the conda environment provided in the repo and running as described (albeit on a Slurm cluster).

The evaluation results I get for the backbone CIFAR10 model are:
Accuracy: 0.763000
Coverage: 0.989100
Size: 3.685510

For the baseline fine-tuned model:
Accuracy: 0.762930
Coverage: 0.989310
Size: 3.701350

And conformal training fine-tuned:
Accuracy: 0.760380
Coverage: 0.989150
Size: 3.764270

I can also provide the training logs if needed. It seems that the size loss is not decreasing much (or at all) during fine-tuning. I believe I am following the instructions as described, but I could be missing something. Please let me know if you'd like any further information. Thank you!
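
For context, by "size loss" I mean what I understand to be the smooth set-size penalty from the paper: soft prediction-set memberships summed per example, with a hinge above a target size. A minimal sketch of that understanding (function and parameter names are mine, not the repo's exact code):

```python
# Minimal sketch (my understanding, not the repo's exact code) of the smooth size loss:
# soft set membership via a sigmoid around the threshold tau, then a hinge penalty
# on the expected set size above a target kappa.
import jax
import jax.numpy as jnp

def soft_size_loss(scores, tau, temperature=0.1, kappa=1.0):
    # scores: [batch, num_classes] conformity scores (e.g. softmax probabilities)
    membership = jax.nn.sigmoid((scores - tau) / temperature)   # ~1 if the class is in the set
    set_sizes = jnp.sum(membership, axis=-1)                    # expected set size per example
    return jnp.mean(jnp.maximum(set_sizes - kappa, 0.0))        # penalize sizes above kappa
```

This is the quantity that stays roughly flat during fine-tuning in my runs.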

@davidstutz
Collaborator

It is a bit hard to debug without details, but I think the main problem is that the base accuracy is too low. Our model in the paper obtained 82% instead of 76% (see Table F in the appendix). The difference in inefficiency on CIFAR10 between baseline and conformal training is usually rather small and depends strongly on accuracy. I don't think your cluster is the problem. It could also be that the configuration in this repository differs from the model we used for the paper. I just checked the original model: it indeed gets ~82% accuracy and seems to use the same configuration (ResNet-34, 4 channels, flip+crop data augmentation), but I wasn't able to match all the details; it could be that the data augmentation or something similar changed in the meantime.

If it is critical for you to reproduce the paper results, I'd suggest playing around with data augmentation/number of channels to get to 82% or higher accuracy. Let me know what you think.
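
For reference, by flip+crop I mean the standard CIFAR10 recipe: a random horizontal flip followed by zero-padding with 4 pixels and a random 32x32 crop. A generic sketch (illustrative only, not necessarily line-for-line what this repository does):

```python
# Generic sketch of standard CIFAR10 flip+crop augmentation; illustrative only,
# not necessarily the exact implementation used in this repository.
import numpy as np

def flip_and_crop(image, rng, pad=4):
    # image: [32, 32, 3] array, rng: np.random.Generator
    if rng.random() < 0.5:
        image = image[:, ::-1, :]                                # random horizontal flip
    padded = np.pad(image, ((pad, pad), (pad, pad), (0, 0)))     # zero-pad by 4 pixels
    y = rng.integers(0, 2 * pad + 1)
    x = rng.integers(0, 2 * pad + 1)
    return padded[y:y + 32, x:x + 32, :]                         # random 32x32 crop
```

If the augmentation in the current code deviates from this, that could account for part of the accuracy gap.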

@kevinkasa
Author

Hi @davidstutz, thanks for your response. I agree that the config parameters do seem to match what is reported in the paper. I believe only the weight decay value is not reported, so I'm not sure if there is a difference there; however, I am still familiarizing myself with the codebase.

I did try increasing the number of channels to 8 and 16; here are some of the results:

8 channels:
backbone:
acc: 0.825710
coverage: 0.989400
size: 2.890260

baseline:
acc: 0.825520
coverage: 0.989220
size: 2.907480

conformal:
acc: 0.825000
coverage: 0.989510
size: 2.913100

16 channels:
backbone:
acc: 0.859080
coverage: 0.988810
size: 2.530480

baseline:
acc: 0.859750
coverage: 0.988630
size: 2.616380

conformal:
acc: 0.858120
coverage: 0.989130
size: 2.530860

With 8 channels, the model achieves ~82% accuracy; however, the conformal training set sizes still seem higher than expected. With 16 channels, the model achieves higher accuracy and smaller set sizes, but here the conformal sets are similar to the regular backbone ones.

I was able to replicate the reported results on Fashion-MNIST (in addition to MNIST), so the discrepancy probably does lie in the CIFAR10 config parameters.

It does seem odd that your local codebase performs differently with the same number of channels and augmentations. I wonder if there are still some differences between the configurations, but I have not been able to find them. I would like to integrate your codebase into my own work, so it would be great to confirm that I have it working properly first. Are there other things you would suggest looking into? If you have the time, could you run this version of the code and confirm the results I am seeing, just to rule out any issues with packages or installation on my end?

I can also provide training logs or additional information if it would be helpful. Thank you!
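
For completeness, here is roughly how I read off the coverage/size numbers above: standard split conformal prediction with a threshold on the true-class probability and alpha = 0.01 (which matches the ~0.99 coverage). The function and variable names below are mine, in case my evaluation differs from yours:

```python
# Rough sketch of how I compute coverage and average set size: split conformal
# prediction thresholding the predicted probabilities; names are mine, not the repo's.
import numpy as np

def coverage_and_size(cal_probs, cal_labels, test_probs, test_labels, alpha=0.01):
    n = len(cal_labels)
    # conformity score = predicted probability of the true class on the calibration split
    cal_scores = cal_probs[np.arange(n), cal_labels]
    # threshold so that roughly (1 - alpha) of calibration scores lie above it
    tau = np.quantile(cal_scores, np.floor(alpha * (n + 1)) / n)
    pred_sets = test_probs >= tau                                # [num_test, num_classes] booleans
    coverage = pred_sets[np.arange(len(test_labels)), test_labels].mean()
    avg_size = pred_sets.sum(axis=1).mean()
    return coverage, avg_size
```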
