When training a model across multiple GPUs with `parallel_mode="data_parallel"`, a "CUDA out of memory" error on one worker triggers an exception in the `onmt.utils.distributed.all_reduce_and_rescale_tensors` function.
The issue arises because, after an OOM error, the gradients are never computed, so the `tensors` parameter ends up being an empty list on the affected worker.
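A simplified sketch of how this happens (this is an illustration, not the exact OpenNMT-py trainer code; `train_step`, `n_gpus`, and the direct `model(batch)` loss computation are hypothetical simplifications, and the rescale denominator is assumed to be the GPU count, matching the division described below):

```python
import torch
from onmt.utils.distributed import all_reduce_and_rescale_tensors

def train_step(model, batch, n_gpus):
    try:
        loss = model(batch)  # simplified: forward pass returning the loss
        loss.backward()
    except RuntimeError as e:
        if "out of memory" not in str(e):
            raise
        torch.cuda.empty_cache()  # recover what memory we can and skip the step

    # After an OOM, no .grad tensors were populated, so grads == [] here,
    # while the other workers enter the collective call with full lists.
    grads = [p.grad.data for p in model.parameters()
             if p.requires_grad and p.grad is not None]
    all_reduce_and_rescale_tensors(grads, float(n_gpus))
```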
The proposed solution is to pass a list of zero-filled tensors instead. This lets `torch.distributed.all_reduce` complete instead of blocking indefinitely while the other workers wait for the failed one. The drawback is that the accumulated gradients are still divided by the total number of GPUs even though one of them contributed only zeros, so the effective gradient for that step is slightly understated. A minimal sketch is shown below.
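A minimal sketch of the zero-filling workaround, assuming the gradient list is built per parameter as above; the helper name `grads_or_zeros` and the `oom_occurred` flag are hypothetical:

```python
import torch

def grads_or_zeros(model, oom_occurred):
    """Return the real gradients, or zero placeholders after an OOM so that
    this worker still participates in the collective all_reduce call."""
    if not oom_occurred:
        return [p.grad.data for p in model.parameters()
                if p.requires_grad and p.grad is not None]
    # Zero tensors with matching shapes/dtypes keep torch.distributed.all_reduce
    # on the other workers from blocking, but the summed gradient is still
    # rescaled by the full GPU count, so this step's update is scaled down.
    return [torch.zeros_like(p.data) for p in model.parameters()
            if p.requires_grad]
```

The returned list would then be passed to `all_reduce_and_rescale_tensors` in place of the plain `grads` list.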