Issue #60: cuda device mismatch in `DataParallel` when not using `cuda:0`
janfb commented:
Hi there, thanks for this package, it's really helpful!
On a cluster with multiple GPUs, I have my model on device `cuda:1`. When calculating FID with a passed `gen` function, new samples are generated during FID calculation. To that end, a `model_fn(x)` function is defined here:

clean-fid/cleanfid/features.py, lines 23 to 25 (at commit bd44693)

and if `use_dataparallel=True`, the model will be wrapped with `model = torch.nn.DataParallel(model)`.
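For context, the referenced lines amount to roughly the following (a paraphrased sketch based on the description above, not the verbatim source; `build_model_fn` is a hypothetical name used only for illustration):

```python
import torch.nn as nn

def build_model_fn(model: nn.Module, use_dataparallel: bool = True):
    # Paraphrased sketch of the logic described above: optionally wrap the
    # feature extractor in DataParallel, then expose it via a closure.
    if use_dataparallel:
        model = nn.DataParallel(model)  # device_ids defaults to all GPUs

    def model_fn(x):
        return model(x)

    return model_fn
```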
Problem: `DataParallel` has a kwarg `device_ids=None`, which defaults to all the available devices and then selects the first device as the "source" device, i.e., `cuda:0`. Later it asserts that all parameters and buffers of the model are on that device.

Now, if `device_ids` is not passed, this will result in an error because my model's device is different from `cuda:0`.
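Here is a minimal standalone repro of that failure mode in plain PyTorch (not clean-fid code), assuming a machine with at least two visible GPUs:

```python
import torch
import torch.nn as nn

# Assumes >= 2 visible GPUs: the module lives on cuda:1, but DataParallel
# with device_ids=None picks cuda:0 as the source device.
model = nn.Linear(8, 8).to("cuda:1")
dp = nn.DataParallel(model)  # device_ids defaults to all visible devices

x = torch.randn(4, 8, device="cuda:1")
out = dp(x)
# RuntimeError: module must have its parameters and buffers on device
# cuda:0 (device_ids[0]) but found one of them on device: cuda:1
```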
I am wondering why `DataParallel` just hard-codes everything to the first of all available devices, but there is a solution on the `cleanfid` side for this problem.

Solution: pass `device_ids` with the device of the model:
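Something along these lines (a sketch, not a tested patch; `wrap_dataparallel` is a hypothetical helper standing in for the wrapping step in `cleanfid/features.py`):

```python
import torch.nn as nn

def wrap_dataparallel(model: nn.Module, use_dataparallel: bool = True) -> nn.Module:
    # Sketch of the proposed fix: derive device_ids from where the model's
    # parameters actually live, instead of letting DataParallel default to
    # all visible devices with cuda:0 as the source.
    if use_dataparallel:
        device = next(model.parameters()).device
        model = nn.DataParallel(model, device_ids=[device])
    return model
```

Note that with a single entry in `device_ids` the wrapper no longer scatters work across GPUs; if multi-GPU inference should be preserved, one could instead list all devices with the model's device first, since `DataParallel` only requires the parameters to live on `device_ids[0]`.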
I would be happy to make a PR fixing this. Unless I am missing something?
Cheers,
Jan
Gaurav replied:

Hi Jan, Thank you for pointing this out! -Gaurav