During LLM training, sequences in a batch are typically padded to the same length, and this padding introduces additional memory and compute waste. If paged attention were used in training, could it reduce memory usage and thus allow a bigger batch size?
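For a concrete sense of the waste, here is a minimal PyTorch sketch (the sequence lengths are made up for illustration) showing how much of a padded batch ends up being padding tokens:

```python
import torch
from torch.nn.utils.rnn import pad_sequence

# Hypothetical batch of variable-length token sequences (lengths are illustrative).
seq_lens = [100, 400, 1000]
seqs = [torch.randint(0, 32_000, (n,)) for n in seq_lens]

# Standard batching pads every sequence to the longest one in the batch.
batch = pad_sequence(seqs, batch_first=True, padding_value=0)  # shape: (3, 1000)

total_tokens = sum(seq_lens)
allocated_tokens = batch.numel()
waste = 1 - total_tokens / allocated_tokens
print(f"real tokens: {total_tokens}, allocated: {allocated_tokens}, "
      f"padding waste: {waste:.0%}")  # here, 50% of the batch is padding
```

Activations and the attention computation scale with the allocated (padded) shape, which is what motivates the question of whether a paged, non-contiguous layout could recover that memory during training as well.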