use flash_attn_with_kvcache for faster inference #2539

vince62s · 2023-12-19T15:09:08Z

This PR does:

Switch from the apex RMSNorm to the awq_inference_engine RMSNorm, which is much faster.
Add rotary_theta as an option (Llama and Mistral used 1e4, while Mixtral and Mistral v0.2 use 1e6).
Use FlashAttention-2's flash_attn_with_kvcache instead of the regular flash_attn_func for step > 0 (updating the cache in place is much faster in certain cases).
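
To illustrate two of the points above, here is a minimal pure-Python sketch. It is not the code from this PR. `rope_inv_freq` and `PreallocatedKVCache` are hypothetical names made up for this example. The first shows how the rotary base (1e4 vs 1e6) changes the standard rotary inverse frequencies `theta ** (-2i / head_dim)`. The second is a toy model of writing each decoding step into a preallocated buffer in place, which is the general idea behind updating a KV cache in place rather than concatenating a new tensor at every step.

```python
def rope_inv_freq(head_dim, theta=1e4):
    """Rotary inverse frequencies: theta ** (-2i / head_dim), i = 0 .. head_dim/2 - 1.

    Hypothetical helper for illustration; `theta` plays the role of rotary_theta.
    """
    return [theta ** (-(2 * i) / head_dim) for i in range(head_dim // 2)]


class PreallocatedKVCache:
    """Toy cache (assumption: one key per step, stored in a plain list).

    The buffer is allocated once; each step writes into the next free slot
    instead of building a new, longer buffer.
    """

    def __init__(self, max_len):
        self.keys = [None] * max_len  # allocated once, up front
        self.length = 0               # number of valid positions so far

    def append(self, key):
        if self.length >= len(self.keys):
            raise IndexError("cache is full")
        self.keys[self.length] = key  # in-place write, no reallocation
        self.length += 1

    def valid(self):
        """Return only the positions filled so far."""
        return self.keys[: self.length]


if __name__ == "__main__":
    # A larger base gives smaller (slower-rotating) high-index frequencies.
    print(rope_inv_freq(8, theta=1e4))
    print(rope_inv_freq(8, theta=1e6))

    cache = PreallocatedKVCache(max_len=4)
    for step in range(3):
        cache.append(f"k{step}")
    print(cache.valid())
```

In a real decoder the buffer would be a GPU tensor and the attention kernel would perform both the write and the attention computation. The sketch only shows why no reallocation is needed per step.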

vince62s added 11 commits December 14, 2023 18:01

first try new cache

10a5916

testMerge branch 'master' into newcache

45c6884

Merge branch 'master' into newcache

d3f7199

use flash_attn_with_kvcache

aa2ae4e

Merge branch 'master' into newcache

43f9883

fix flake

32af499

fix

c920d65

patch rmsnorm for multiexperts

3f81b8f

black is black

ea900d8

rope theta as an option

d0ec7a8

black

17373ca

vince62s changed the title ~~[WIP] use flash_attn_with_kvcache for faster inference~~ use flash_attn_with_kvcache for faster inference Dec 26, 2023

vince62s merged commit 0436cdd into OpenNMT:master Dec 26, 2023
2 checks passed

l-k-11235 mentioned this pull request Dec 29, 2023

Restored masked scaled dot attention #2542

Merged
