Best practice with distributed training? #895

Answered by rwightman
songtianhui asked this question in Q&A
Discussion options

@songtianhui Pretty much all models featured here that were trained with OpenCLIP use `--local-loss --gather-with-grad`; it's the only option that scales. When we first implemented it, we verified that, with the gradient flowing through the gather, the local-loss results were equivalent to computing the global loss.
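The equivalence claimed above can be checked numerically on a single process. The sketch below (my own illustration, not OpenCLIP code) splits a batch across simulated ranks: the "global" loss builds the full NxN similarity matrix, while each simulated rank computes only its local n x N slice against the gathered features, as `--local-loss` does. With equal per-rank batch sizes, the mean of the per-rank losses matches the global loss.

```python
import numpy as np

def cross_entropy(logits, labels):
    # Numerically stable softmax cross-entropy, averaged over rows.
    logits = logits - logits.max(axis=1, keepdims=True)
    logp = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -logp[np.arange(len(labels)), labels].mean()

rng = np.random.default_rng(0)
world_size, per_gpu, dim = 4, 8, 16   # hypothetical setup for illustration
N = world_size * per_gpu
img = rng.normal(size=(N, dim))
txt = rng.normal(size=(N, dim))
img /= np.linalg.norm(img, axis=1, keepdims=True)
txt /= np.linalg.norm(txt, axis=1, keepdims=True)
scale = 100.0  # CLIP-style logit scale

# Global loss: full N x N similarity matrix, both directions.
logits = scale * img @ txt.T
labels = np.arange(N)
global_loss = 0.5 * (cross_entropy(logits, labels) + cross_entropy(logits.T, labels))

# Local loss: each "rank" scores only its own slice against all gathered features.
local_losses = []
for r in range(world_size):
    sl = slice(r * per_gpu, (r + 1) * per_gpu)
    lbl = labels[sl]  # positives live at this rank's global offsets
    loss_i2t = cross_entropy(scale * img[sl] @ txt.T, lbl)
    loss_t2i = cross_entropy(scale * txt[sl] @ img.T, lbl)
    local_losses.append(0.5 * (loss_i2t + loss_t2i))
local_loss = float(np.mean(local_losses))

print(np.isclose(global_loss, local_loss))  # the two losses agree
```

In real training the gather is `torch.distributed` collective communication, and `--gather-with-grad` ensures gradients flow back through the gathered features; this single-process sketch only checks the forward-value equivalence.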

Answer selected by songtianhui
This discussion was converted from issue #894 on June 12, 2024 16:41.