Releases · ggerganov/llama.cpp

28 Feb 18:44

b2293

08c5ee8

llama : remove deprecated API (#5770)

ggml-ci


28 Feb 12:20

github-actions

b2291

8c0e8f4


sync : ggml


28 Feb 12:21

github-actions

b2288

a693bea


server : hit Ctrl+C twice to exit (#5734)

* server: twice ctrl+C to exit

* std::atomic_flag

* sigint: message

* sigint: stderr

* Update examples/server/server.cpp

Co-authored-by: Jared Van Bortel <[email protected]>

---------

Co-authored-by: Jared Van Bortel <[email protected]>
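
The "twice Ctrl+C" behaviour with `std::atomic_flag` can be pictured roughly as follows. This is a minimal sketch and not the code from `examples/server/server.cpp`: the names `g_interrupted`, `g_shutdown_requested`, `note_interrupt`, `on_sigint` and `install_sigint_handler` are hypothetical, the message text is made up, and the use of POSIX `write` for the stderr message is an assumption.

```cpp
#include <atomic>
#include <csignal>
#include <cstdlib>
#include <unistd.h>   // POSIX write(); assumption: a POSIX platform

// Hypothetical names for illustration only.
static std::atomic_flag  g_interrupted = ATOMIC_FLAG_INIT;
static std::atomic<bool> g_shutdown_requested{false};

// Returns true on the first interrupt, false on every later one.
// test_and_set() atomically sets the flag and returns its previous value.
static bool note_interrupt() {
    return !g_interrupted.test_and_set();
}

extern "C" void on_sigint(int /*signo*/) {
    if (note_interrupt()) {
        // First Ctrl+C: ask the main loop to shut down gracefully.
        g_shutdown_requested.store(true);
        return;
    }
    // Second Ctrl+C: report on stderr and terminate immediately.
    static const char msg[] = "Received second interrupt, terminating immediately.\n";
    ssize_t n = write(STDERR_FILENO, msg, sizeof(msg) - 1);
    (void) n;
    std::_Exit(130);
}

// Call once at startup; the main loop would then poll g_shutdown_requested.
static void install_sigint_handler() {
    std::signal(SIGINT, on_sigint);
}
```

The point of the flag is that the first signal only requests a clean shutdown, so a stuck shutdown can still be interrupted by pressing Ctrl+C again.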


28 Feb 12:20

github-actions

b2287

adcb12a


llama : fix non-quantization of expert gating tensors (#5754)

This reverts a single line from #5475


28 Feb 12:20

github-actions

b2286

177628b


llama : improve BERT tokenization (#5740)

* implement nfd for stripping accents in wpm tokenizer

* sort nfd map; reuse iterator

* use builtin tolower

* add locale include

* Simplify to_lower cases

Co-authored-by: Jared Van Bortel <[email protected]>

---------

Co-authored-by: Jared Van Bortel <[email protected]>
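
The accent-stripping steps listed above (a sorted NFD map, then a builtin `tolower`) could look roughly like this. It is a sketch under assumptions: `NfdEntry`, `kNfdTable`, `strip_accent` and `normalize` are hypothetical names, the six-entry table is illustrative only, and it collapses "decompose, then drop the combining mark" into a single lookup. Lowercasing is limited to ASCII here.

```cpp
#include <algorithm>
#include <cctype>
#include <cstdint>
#include <iterator>
#include <string>

// Hypothetical, tiny stand-in for an NFD map: precomposed code point -> base letter.
// The table must stay sorted by `composed` so that binary search works.
struct NfdEntry { uint32_t composed; uint32_t base; };

static const NfdEntry kNfdTable[] = {
    {0x00C9, 0x0045}, // É -> E
    {0x00E0, 0x0061}, // à -> a
    {0x00E9, 0x0065}, // é -> e
    {0x00F1, 0x006E}, // ñ -> n
    {0x00F6, 0x006F}, // ö -> o
    {0x00FC, 0x0075}, // ü -> u
};

// Binary-search the sorted table; unknown code points pass through unchanged.
static uint32_t strip_accent(uint32_t cp) {
    const NfdEntry * it = std::lower_bound(
        std::begin(kNfdTable), std::end(kNfdTable), cp,
        [](const NfdEntry & e, uint32_t key) { return e.composed < key; });
    if (it != std::end(kNfdTable) && it->composed == cp) {
        return it->base;
    }
    return cp;
}

// Strip accents, then lowercase ASCII letters with the builtin tolower.
static std::u32string normalize(const std::u32string & text) {
    std::u32string out;
    out.reserve(text.size());
    for (char32_t c : text) {
        uint32_t cp = strip_accent(static_cast<uint32_t>(c));
        if (cp < 0x80) {
            cp = static_cast<uint32_t>(std::tolower(static_cast<unsigned char>(cp)));
        }
        out.push_back(static_cast<char32_t>(cp));
    }
    return out;
}
```

With this table, `normalize(U"\u00C9t\u00E9")` yields `U"ete"`. A real tokenizer would need a full decomposition table and non-ASCII case folding.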


28 Feb 12:19

github-actions

b2284

efc7225


server : add "/chat/completions" alias for "/v1/..." (#5722)

* Add "/chat/completions" as alias for "/v1/chat/completions"

* merge to upstream master

* minor : fix trailing whitespace

---------

Co-authored-by: Georgi Gerganov <[email protected]>
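
An alias of this kind amounts to registering the same handler under two paths. The sketch below uses a hypothetical in-memory `Router` class rather than the server's real HTTP library; `Handler`, `Router`, `make_router` and the `"chat:"` response format are all invented for illustration.

```cpp
#include <functional>
#include <map>
#include <string>
#include <utility>

// Hypothetical handler type: takes a request body, returns a response body.
using Handler = std::function<std::string(const std::string &)>;

// Hypothetical minimal router; not the API of any real HTTP library.
class Router {
public:
    void post(const std::string & path, Handler h) {
        routes_[path] = std::move(h);
    }

    // Returns the handler's response, or "404" when no route matches.
    std::string dispatch(const std::string & path, const std::string & body) const {
        auto it = routes_.find(path);
        if (it == routes_.end()) {
            return std::string("404");
        }
        return it->second(body);
    }

private:
    std::map<std::string, Handler> routes_;
};

// Register one handler under both the versioned path and its alias.
inline Router make_router() {
    Router r;
    Handler chat = [](const std::string & body) { return "chat:" + body; };
    r.post("/v1/chat/completions", chat);
    r.post("/chat/completions", chat);   // alias
    return r;
}
```

Because both paths share one handler object, a request to either path produces the same response for the same body.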


28 Feb 11:45

github-actions

b2283

7c4263d


ggml : make i-quants work with super-blocks of 64 (CPU,Metal) (#5760)

* WIP: make i-quants work for QK_K = 64

* iq2_xs: attempt to fix AVX dot product for QK_K = 64

Tests pass, but I get gibberish.

* QK_K = 64 tests pass on ARM_NEON and Metal

Sadly, that does not mean it actually works.

* Make CUDA compile with QK_K = 64

Tests don't pass, plus we get misaligned access

* Q2_K: fixed bug in imatrix quantization for QK_K = 64

* iq1_s: turn off SIMD implementation for QK_K = 64 (it does not work)

---------

Co-authored-by: Iwan Kawrakow <[email protected]>


27 Feb 18:01

github-actions

b2282

cb49e0f


Attempt to fix android build (#5752)

Co-authored-by: Iwan Kawrakow <[email protected]>


27 Feb 15:26

github-actions

b2281

0becb22


IQ4_XS: a 4.25 bpw quantization (#5747)

* Try IQ4_NL with blocks of 64 - does not look good

* iq4_xs: go to super-blocks of 256 and 6-bit scales for blocks of 32

* iq4_xs: CUDA works - 133.2 t/s

* iq4_xs: AVX2 dot product

* iq4_xs: ARM_NEON dot product

* iq4_nl: Metal implementation

As usual, Metal / Apple Silicon don't like my quants.

* iq3_xs: minor fix

* iq4_xs: shrink by using IQ3_S for attn_k and attn_q

* iq4_xs: revert using IQ3_S for attn_k and attn_v

PPL vs size is good, but CPU performance suffers: on M2 Max
TG-128 drops to 21.7 t/s from 28.8, and on a Ryzen-7950X
to 14.5 t/s from 15.8 t/s. On CUDA we have 135 t/s when
using IQ3_S vs 133 t/s with pure IQ4_XS.

* Fix CI

* iq4_xs: Added forgotten check for 256 divisibility

---------

Co-authored-by: Iwan Kawrakow <[email protected]>
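
The 4.25 bpw figure can be checked with a little arithmetic. The notes give super-blocks of 256 weights and 6-bit scales for blocks of 32. The sketch below additionally assumes 4-bit quants (implied by the name) and one 16-bit (fp16) scale per super-block; that last term is an assumption not stated in the notes. All constant and function names are hypothetical.

```cpp
// Sizes taken from the release notes.
constexpr int kSuperBlock = 256;  // weights per super-block
constexpr int kBlock      = 32;   // weights per block
constexpr int kScaleBits  = 6;    // bits per block scale

// Assumptions, not stated in the notes.
constexpr int kQuantBits      = 4;   // bits per quantized weight
constexpr int kSuperScaleBits = 16;  // one fp16 scale per super-block

// Total storage for one super-block, in bits.
constexpr int bits_per_super_block() {
    return kSuperBlock * kQuantBits                 // 256 * 4 = 1024
         + (kSuperBlock / kBlock) * kScaleBits      //   8 * 6 =   48
         + kSuperScaleBits;                         //             16
}

// Average storage per weight.
constexpr double bits_per_weight() {
    return static_cast<double>(bits_per_super_block()) / kSuperBlock;
}

static_assert(bits_per_super_block() == 1088, "1024 + 48 + 16 bits");
```

Under these assumptions a super-block takes 1088 bits (136 bytes), and 1088 / 256 = 4.25 bits per weight.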


27 Feb 15:15

github-actions

b2280

c24a2a6


cuda : replace remaining shfl_xor with calls to warp_reduce functions…


Releases listed on this page, newest first: b2293, b2291, b2288, b2287, b2286, b2284, b2283, b2282, b2281, b2280