Releases · flashinfer-ai/flashinfer

27 Aug 01:18

github-actions

v0.1.6

9ee26e7

0.1.6 (2024-08-27)

SM75 Support

Starting from 0.1.6, our pre-built wheels include experimental support for sm75 (Turing architecture GPUs such as Tesla T4, Quadro RTX 6000, and RTX 2080).

API Changes

`plan`/`run`

Starting from 0.1.6, the begin_forward/forward/end_forward APIs are replaced by the new plan/run API.

forward is renamed to run, which is more precise and consistent with the naming convention of CUTLASS's Python API.
begin_forward is renamed to plan, which is consistent with the naming convention of the nvmath API.
end_forward is deprecated and has no effect after this PR.

There are slight differences between the old forward API and the new run API:

All extra arguments, such as causal and logits_soft_cap, are now passed to the plan API (previously begin_forward) and cached until the next plan call. The run API only takes the query and KV-Cache tensors.

The old begin_forward/forward/end_forward APIs are still functional, but we will gradually deprecate them in future releases.

Check #466 for more details.
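
The calling pattern described above can be illustrated with a toy, pure-Python stand-in. This is only a sketch and not FlashInfer code: `ToyAttentionWrapper`, its list-based inputs, the tanh form of the soft cap, and the way causal queries are aligned to the end of the keys are all assumptions made for illustration. Only the method names and the `causal` and `logits_soft_cap` options come from the notes above.

```python
import math


class ToyAttentionWrapper:
    """Toy stand-in (NOT FlashInfer) showing the plan/run calling pattern."""

    def __init__(self):
        self._plan = None  # options cached by the most recent plan() call

    def plan(self, causal=False, logits_soft_cap=0.0):
        # Extra options are supplied once here and cached until the next plan().
        self._plan = {"causal": causal, "logits_soft_cap": logits_soft_cap}

    def run(self, q, k, v):
        # run() only takes query and key/value data; options come from the plan.
        if self._plan is None:
            raise RuntimeError("call plan() before run()")
        causal = self._plan["causal"]
        cap = self._plan["logits_soft_cap"]
        scale = 1.0 / math.sqrt(len(q[0]))
        offset = len(k) - len(q)  # assumption: queries align to the end of the keys
        dim = len(v[0])
        out = []
        for i, qi in enumerate(q):
            last = i + offset if causal else len(k) - 1
            logits = []
            for j in range(last + 1):
                s = scale * sum(a * b for a, b in zip(qi, k[j]))
                if cap > 0.0:
                    s = cap * math.tanh(s / cap)  # assumed soft-cap form
                logits.append(s)
            m = max(logits)
            w = [math.exp(s - m) for s in logits]
            z = sum(w)
            out.append([sum(w[j] * v[j][d] for j in range(len(w))) / z
                        for d in range(dim)])
        return out

    # Old names kept as thin aliases, mirroring the renaming described above.
    begin_forward = plan
    forward = run

    def end_forward(self):
        # No-op, kept only for backward compatibility.
        pass


wrapper = ToyAttentionWrapper()
q = [[1.0, 0.0], [0.0, 1.0]]
k = [[1.0, 0.0], [0.0, 1.0]]
v = [[1.0, 2.0], [3.0, 4.0]]

wrapper.plan(causal=True)
out_causal = wrapper.run(q, k, v)  # reuses the cached options
wrapper.plan(causal=False)         # re-plan to change options
out_full = wrapper.run(q, k, v)
```

With this split, a caller that runs the same configuration many times (for example, once per layer) only repeats the cheap `run` call.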

`MultiLevelCascadeAttentionWrapper`

Starting from 0.1.6, we introduce a new MultiLevelCascadeAttentionWrapper API for cascade inference. It supports multi-level cascade inference in which the KV-Cache of all levels can be managed in a unified Paged KV-Cache.

See the documentation and tutorial for API usage and an explanation of the layout.

The old BatchDecodeWithSharedPrefixPagedKVCacheWrapper and BatchPrefillWithSharedPrefixPagedKVCacheWrapper will be deprecated in future releases.
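
To give some intuition for why per-level attention can be computed separately, the sketch below shows one common way to combine partial attention results: each level produces an output and a log-sum-exp (lse) value, and the two are merged with a weighted average. This is a pure-Python illustration under stated assumptions, not the wrapper's implementation. The function names, the single-query setting, and the omission of the softmax scale are all hypothetical simplifications.

```python
import math


def attention_state(q, keys, values):
    """Return (output, lse) of single-query softmax attention over one KV segment."""
    logits = [sum(a * b for a, b in zip(q, k)) for k in keys]
    m = max(logits)
    weights = [math.exp(s - m) for s in logits]
    z = sum(weights)
    dim = len(values[0])
    out = [sum(w * v[d] for w, v in zip(weights, values)) / z for d in range(dim)]
    return out, m + math.log(z)


def merge_states(state_a, state_b):
    """Combine two partial attention states computed over disjoint KV segments."""
    (out_a, lse_a), (out_b, lse_b) = state_a, state_b
    m = max(lse_a, lse_b)
    wa, wb = math.exp(lse_a - m), math.exp(lse_b - m)
    merged = [(wa * x + wb * y) / (wa + wb) for x, y in zip(out_a, out_b)]
    return merged, m + math.log(wa + wb)


# Level 0: prefix shared by every request; level 1: KV unique to one request.
shared_k, shared_v = [[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]]
unique_k, unique_v = [[1.0, 1.0]], [[2.0, 2.0]]
q = [0.5, -0.25]

merged, merged_lse = merge_states(
    attention_state(q, shared_k, shared_v),
    attention_state(q, unique_k, unique_v),
)
# Reference: attention over the concatenated KV in a single pass.
full, full_lse = attention_state(q, shared_k + unique_k, shared_v + unique_v)
```

The merged result matches attention over the concatenated KV up to floating-point error, which is what allows a shared prefix to be processed once and reused across requests.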

Features

sm75 support (#448, #449)
add MultiLevelCascadeAttentionWrapper API (#462) (1e37989)
add accept num, emit num metric for ChainSpeculativeSampling (#450) (fa38b5e)
support bmm fp8 (#469) (f1c0b68)

Refactor

refactor: replace begin_forward/forward/end_forward with plan/run (#466)

Misc

misc: improve error handling of sampling kernels (#456) (0dce178)

Performance Improvements

slight optimization on f16->f8 fragment layout swizzling (#453) (0d61871)
slight optimization on fragment layout swizzle (#458) (7c397cb)
use persistent kernel for merging attention states (#459) (be6bf5b)

Acknowledgement

We thank @LiuXiaoxuanPKU for enhancing the speculative sampling operator, @merrymercy for suggesting the API change, and @zhyncs for integrating the cuBLAS implementation of fp8 BMM.

13 Aug 10:19

github-actions

v0.1.5

838d050

0.1.5 (2024-08-13)

Bugfix

Fix PagedPrefill python api and some typos (#441) (3fff008)
fix prefill kernels' lse result for empty kv-cache (#440) (6ac28f4)

Features

decouple float and int workspace buffer (#442) (a7ee566)

Performance Improvements

faster fp8->fp16 dequantization for pre sm_90 arch (#439) (c93f647)

Acknowledgement

We thank the community for their contributions and feedback: @comaniac, @hnyls2002, @jianfei-wangg, @Yard1.

09 Aug 09:07

github-actions

v0.1.4

9ca04e4

0.1.4 (2024-08-09)

Features

append attention kernels for fp8 kv-cache (#420) (906c2f5)
support min_p sampling (#422) (d52f2da)
deterministic sampling (#417) (0dd801d)
more sampling operator options (#431) (68df9c4)
support fused add rmsnorm (#419) (b781513)
support fused silu mul (#427) (ea0ba9a)
feat: support fused gelu tanh mul (#434) (2c9d1c3)
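
As a rough reference for what min_p sampling does, the sketch below applies the commonly used definition: tokens whose probability falls below `min_p` times the largest probability are discarded and the rest are renormalized before sampling. This is a plain-Python illustration under that assumption. The function name `min_p_filter` is hypothetical and the code is not the GPU kernel added in #422.

```python
import random


def min_p_filter(probs, min_p):
    """Zero out tokens below min_p * max(probs), then renormalize the rest."""
    threshold = min_p * max(probs)
    kept = [p if p >= threshold else 0.0 for p in probs]
    total = sum(kept)
    return [p / total for p in kept]


probs = [0.5, 0.3, 0.15, 0.05]
filtered = min_p_filter(probs, min_p=0.2)  # threshold is 0.2 * 0.5 = 0.1

# Sample a token id from the filtered distribution (seeded for repeatability).
rng = random.Random(0)
token = rng.choices(range(len(filtered)), weights=filtered)[0]
```

Because the threshold scales with the top probability, the filter is stricter when the model is confident and more permissive when the distribution is flat.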

Bug Fixes

fix dispatch fp16 type when enable fp8 (#430) (daa5566)
improve numerical stability of sampling kernels (#429) (898d8ea)

Other improvements

break up _kernels into multiple modules (#428) (8e482d9)

Acknowledgement

We thank the community for their contributions and feedback: @comaniac, @esmeetu, @LiuXiaoxuanPKU, @peng1999, @xslingcn, @Yard1, @zhyncs.

31 Jul 10:47

yzh119

v0.1.3

d6c8400

0.1.3 (2024-07-31)

Bugfix

bugfix: Fix cudagraph mode of BatchPrefillWithRaggedKVCacheWrapper (#412) (9907bc)
fix cu118 cub usage for sampling kernels (#410) (58d359)

Misc

enhance allocator error info and add shape check for prefill begin forward functions (#413) (5e36c5)

29 Jul 12:11

github-actions

v0.1.2

d2f6a42

0.1.2 (2024-07-29)

Bugfix

Fix the sampling kernel bug for cu118 (#386, #387) (0cd499, dc3f18)

Features

add llama 3.1 style rope (#401) (4c89dec)
non-inplace rope operators (#405) (74ffba1)
sliding window attention (#406) (28cffd3)
support non-contiguous (packed) input for prefill kernels (#404) (68c3719)
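
For readers unfamiliar with Llama 3.1 style RoPE, the sketch below shows the frequency rescaling as it appears in Meta's reference code, to the best of our understanding: high-frequency components are left alone, low-frequency components are divided by a scale factor, and the band in between is interpolated smoothly. The function name and the default values (scale factor 8, low/high frequency factors 1 and 4, original context length 8192, `rope_theta` of 500000) are assumptions for illustration. The parameters exposed by the operator added in #401 may differ.

```python
import math


def llama31_scale_inv_freq(inv_freqs, scale_factor=8.0, low_freq_factor=1.0,
                           high_freq_factor=4.0, old_context_len=8192):
    """Rescale RoPE inverse frequencies (defaults are assumed, see lead-in)."""
    low_freq_wavelen = old_context_len / low_freq_factor
    high_freq_wavelen = old_context_len / high_freq_factor
    scaled = []
    for freq in inv_freqs:
        wavelen = 2.0 * math.pi / freq
        if wavelen < high_freq_wavelen:
            # Short wavelengths (high frequencies) are kept unchanged.
            scaled.append(freq)
        elif wavelen > low_freq_wavelen:
            # Long wavelengths (low frequencies) are slowed down.
            scaled.append(freq / scale_factor)
        else:
            # In between, interpolate between the scaled and unscaled value.
            smooth = (old_context_len / wavelen - low_freq_factor) / (
                high_freq_factor - low_freq_factor)
            scaled.append((1.0 - smooth) * freq / scale_factor + smooth * freq)
    return scaled


head_dim, rope_theta = 128, 500000.0
inv_freqs = [rope_theta ** (-2.0 * i / head_dim) for i in range(head_dim // 2)]
scaled = llama31_scale_inv_freq(inv_freqs)
```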

Performance Improvements

slight optimization on merge states (#313) (701c813)

20 Jul 09:15

github-actions

v0.1.1

b64d5c9

0.1.1 (2024-07-20)

Bugfix

fix the invalid kernel configuration for architectures with small shared memory size (#385) (cdac57)

Features

expose decoupled kv-cache to pytorch api (#383) (457a0ae)

Performance Improvements

use stmatrix in epilogue for sm90+ (#380) (c6f20d1)

17 Jul 08:29

github-actions

v0.1.0

58b68d0

0.1.0 (2024-07-17)

Features

Add mask to merge_state_in_place (#372) (e14fa81)
expose pytorch api for block sparse attention (#375) (4bba6fa)
Fused GPU sampling kernel for joint top-k & top-p sampling (#374) (6e028eb)
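
One plausible reading of "joint top-k & top-p" is that a token survives only if it satisfies both criteria on the original distribution. The sketch below implements that reading in plain Python as a reference. The function name `top_k_top_p_filter` is hypothetical, and the fused kernel from #374 may apply the two filters differently (for example, through rejection sampling rather than explicit sorting).

```python
def top_k_top_p_filter(probs, top_k, top_p):
    """Keep tokens that are both in the top-k set and in the top-p nucleus."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep = set()
    cumulative = 0.0
    for rank, idx in enumerate(order):
        # Stop once either the top-k budget or the top-p mass is exhausted.
        if rank >= top_k or cumulative >= top_p:
            break
        keep.add(idx)
        cumulative += probs[idx]
    total = sum(probs[i] for i in keep)
    return [probs[i] / total if i in keep else 0.0 for i in range(len(probs))]


probs = [0.4, 0.3, 0.2, 0.1]
nucleus_bound = top_k_top_p_filter(probs, top_k=3, top_p=0.6)  # top-p is tighter
top_k_bound = top_k_top_p_filter(probs, top_k=1, top_p=0.99)   # top-k is tighter
```

Since both filters keep a prefix of the probability-sorted tokens, the joint result is simply the shorter of the two prefixes.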

12 Jul 05:54

github-actions

v0.0.9

17a5f1b

0.0.9 (2024-07-12)

Bugfix

fix the decode kernel segfault in cudagraph mode (#368) (c69cfa)
fix decode kernels output for empty kv cache (#363) (ac72b1)
check gpu id in PyTorch APIs and use input tensor's gpu default stream (#361) (1b84fa)

Performance Improvements

accelerate alibi (#365) (4f0a9f9)
accelerate gqa performance (#356) (e56ddad)
Optimize tensor conversions in C++ code to avoid unnecessary copies (#366) (1116237)

Acknowledgement

We thank @Yard1, @Ying1123 and @zhyncs for their contributions.

03 Jul 07:58

yzh119

v0.0.8

478447e

0.0.8 (2024-07-03)

Bugfix

fix prefill/append kernel behavior for empty kv-cache (#353) (7adc8c)
fix decode attention kernel with logits cap (#350) (f5f7a2)

28 Jun 10:15

github-actions

v0.0.7

fec77d0

0.0.7 (2024-06-28)

Breaking Changes

batch_decode_with_padded_kv_cache was removed; we encourage users to use BatchDecodeWithPagedKVCacheWrapper instead. (#343)
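
When moving from a padded layout to a paged one, the main extra work is describing which pages each request owns. The sketch below builds CSR-style metadata (an index pointer array, a flat page index array, and the number of valid entries in each request's last page) from a list of sequence lengths. It is a plain-Python illustration: the helper name `build_paged_kv_metadata` and the sequential page allocation are hypothetical, and the array names are assumed to correspond to what the paged wrapper expects, so check the API documentation before relying on them.

```python
def build_paged_kv_metadata(seq_lens, page_size):
    """Describe a paged KV-cache for a batch, assuming every length is >= 1."""
    kv_indptr = [0]        # request i owns kv_indices[kv_indptr[i]:kv_indptr[i + 1]]
    kv_indices = []        # flat list of page ids for all requests
    kv_last_page_len = []  # number of valid entries in each request's last page
    next_free_page = 0     # hypothetical allocator: hand out pages sequentially
    for n in seq_lens:
        num_pages = (n + page_size - 1) // page_size  # ceiling division
        kv_indices.extend(range(next_free_page, next_free_page + num_pages))
        next_free_page += num_pages
        kv_indptr.append(len(kv_indices))
        kv_last_page_len.append(n - (num_pages - 1) * page_size)
    return kv_indptr, kv_indices, kv_last_page_len


indptr, indices, last_page_len = build_paged_kv_metadata([5, 16, 1], page_size=4)
```

Unlike a padded layout, no memory is reserved for positions beyond each request's actual length, apart from the unused tail of its last page.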

Bugfix

fix the forward_return_lse function in BatchPrefillWithRaggedKVCache class (#337)
fix the scheduler behavior of large page size (#333)

Features

customize logits_soft_cap value (#339) (a2498f5)
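
For context, logits soft-capping is commonly defined as squashing each attention logit through a scaled tanh so that its magnitude never exceeds the cap. The sketch below shows that formula in plain Python. It assumes this is the form the `logits_soft_cap` value controls, and the function name `soft_cap` is hypothetical.

```python
import math


def soft_cap(logit, cap):
    """Smoothly bound a logit to the open interval (-cap, cap)."""
    return cap * math.tanh(logit / cap)


small = soft_cap(0.1, cap=30.0)     # nearly unchanged for small logits
large = soft_cap(1000.0, cap=30.0)  # saturates at the cap for large logits
```

Small logits pass through almost unchanged because tanh is close to the identity near zero, while large logits saturate at the cap.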

Performance Improvements

change minimal kv_chunk_size back to 128 (#329) (f237f5f)
more options for kv tile size (#336) (bf2a6c7)
