Releases: ggerganov/llama.cpp

b2293

28 Feb 18:44
08c5ee8
llama : remove deprecated API (#5770)

ggml-ci

b2291

28 Feb 12:20
8c0e8f4
sync : ggml

b2288

28 Feb 12:21
a693bea
server : hit Ctrl+C twice to exit (#5734)

* server: twice ctrl+C to exit

* std::atomic_flag

* sigint: message

* sigint: stderr

* Update examples/server/server.cpp

Co-authored-by: Jared Van Bortel

---------

Co-authored-by: Jared Van Bortel
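
The "hit Ctrl+C twice" behaviour with `std::atomic_flag` can be sketched roughly as follows. This is a minimal illustration, not the code from examples/server/server.cpp: the names `g_sigint_seen`, `g_shutdown_requested`, `note_interrupt`, `on_sigint` and `install_sigint_handler` are hypothetical, and the exit status 130 is only the common shell convention for SIGINT. The stderr message mentioned in the bullets is left out to keep the handler async-signal-safe.

```cpp
#include <atomic>
#include <cassert>
#include <csignal>
#include <cstdlib>

// Hypothetical names for illustration; not taken from the server source.
static std::atomic_flag g_sigint_seen = ATOMIC_FLAG_INIT;
static volatile std::sig_atomic_t g_shutdown_requested = 0;

// Records one interrupt. Returns true if an earlier interrupt was already
// recorded, because test_and_set() returns the flag's previous value.
bool note_interrupt() {
    return g_sigint_seen.test_and_set();
}

extern "C" void on_sigint(int /*signo*/) {
    if (note_interrupt()) {
        // Second Ctrl+C: stop waiting for a graceful shutdown and exit now.
        // 130 = 128 + SIGINT by shell convention (an assumption of this sketch).
        std::_Exit(130);
    }
    // First Ctrl+C: only request a graceful shutdown; the main loop is
    // expected to poll this variable and clean up.
    g_shutdown_requested = 1;
}

void install_sigint_handler() {
    const auto prev = std::signal(SIGINT, on_sigint);
    assert(prev != SIG_ERR);
    (void) prev; // silence the unused warning when NDEBUG is defined
}
```

A server loop would call `install_sigint_handler()` once at startup and leave its accept loop when `g_shutdown_requested` becomes non-zero. A lock-free flag is used because it is one of the few synchronization primitives that may be touched from a signal handler.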

b2287

28 Feb 12:20
adcb12a
llama : fix non-quantization of expert gating tensors (#5754)

This reverts a single line from #5475

b2286

28 Feb 12:20
177628b
llama : improve BERT tokenization (#5740)

* implement nfd for stripping accents in wpm tokenizer

* sort nfd map; reuse iterator

* use builtin tolower

* add locale include

* Simplify to_lower cases

Co-authored-by: Jared Van Bortel

---------

Co-authored-by: Jared Van Bortel
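
The idea behind the bullets above (a sorted NFD map searched with a reused iterator, followed by lowercasing) might look like the sketch below. It is a simplification under stated assumptions: the table `k_accent_folds` is a tiny hypothetical subset, it folds a precomposed code point straight to its base letter instead of performing a full NFD decomposition and dropping combining marks, and lowercasing is limited to ASCII. None of the names come from the llama.cpp source.

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <iterator>
#include <string>

// One entry: a precomposed code point and the base letter it folds to.
struct AccentFold {
    char32_t composed;
    char32_t base;
};

// Hypothetical, tiny subset of a real table. Must stay sorted by `composed`
// so that binary search works.
static const AccentFold k_accent_folds[] = {
    {0x00C9, U'E'}, // E with acute
    {0x00E0, U'a'}, // a with grave
    {0x00E7, U'c'}, // c with cedilla
    {0x00E9, U'e'}, // e with acute
    {0x00F1, U'n'}, // n with tilde
    {0x00FC, U'u'}, // u with diaeresis
};

bool fold_less(const AccentFold & a, const AccentFold & b) {
    return a.composed < b.composed;
}

// Returns the base letter for a known accented code point, else cp unchanged.
char32_t strip_accent(char32_t cp) {
    const AccentFold * first = std::begin(k_accent_folds);
    const AccentFold * last  = std::end(k_accent_folds);
    assert(std::is_sorted(first, last, fold_less));
    const AccentFold key = {cp, 0};
    // Single lookup; the same iterator is used for the test and the result.
    const AccentFold * it = std::lower_bound(first, last, key, fold_less);
    return (it != last && it->composed == cp) ? it->base : cp;
}

// Lowercases ASCII only; other code points pass through unchanged.
char32_t to_lower_ascii(char32_t cp) {
    return cp < 0x80 ? static_cast<char32_t>(std::tolower(static_cast<int>(cp))) : cp;
}

// Strips accents, then lowercases, code point by code point.
std::u32string normalize_wpm(const std::u32string & text) {
    std::u32string out;
    out.reserve(text.size());
    for (char32_t cp : text) {
        out.push_back(to_lower_ascii(strip_accent(cp)));
    }
    return out;
}
```

With this table, the code points of "Café" normalize to "cafe", which is the form an uncased WordPiece vocabulary typically expects.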

b2284

28 Feb 12:19
efc7225
server : add "/chat/completions" alias for "/v1/..." (#5722)

* Add "/chat/completions" as alias for "/v1/chat/completions"

* merge to upstream master

* minor : fix trailing whitespace

---------

Co-authored-by: Georgi Gerganov
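
An alias of this kind usually amounts to registering one handler under two paths. The sketch below shows that pattern with a hypothetical in-memory `Router`; it does not use the server's actual HTTP library, and the handler body and the "404" string are placeholders.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <string>
#include <utility>

// Hypothetical handler type: takes a request body, returns a response body.
using Handler = std::function<std::string(const std::string &)>;

// Hypothetical minimal router, standing in for a real HTTP server.
struct Router {
    std::map<std::string, Handler> routes;

    void post(const std::string & path, Handler h) {
        routes[path] = std::move(h);
    }

    std::string dispatch(const std::string & path, const std::string & body) const {
        const auto it = routes.find(path);
        if (it == routes.end()) {
            return "404"; // placeholder for a real error response
        }
        return it->second(body);
    }
};

Router make_router() {
    Router r;
    // Placeholder handler; a real one would run the chat completion.
    const Handler chat = [](const std::string & body) {
        return "chat:" + body;
    };
    // Same handler under both paths, so the alias cannot drift from the original.
    r.post("/chat/completions", chat);
    r.post("/v1/chat/completions", chat);
    return r;
}
```

Sharing one handler object, rather than copying its body, keeps both endpoints identical by construction.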

b2283

28 Feb 11:45
7c4263d
ggml : make i-quants work with super-blocks of 64 (CPU,Metal) (#5760)

* WIP: make i-quants work for QK_K = 64

* iq2_xs: attempt to fix AVX dot product for QK_K = 64

Tests pass, but I get gibberish.

* QK_K = 64 tests pass on ARM_NEON and Metal

Sadly, that does not mean it actually works.

* Make CUDA compile with QK_K = 64

Tests don't pass, plus we get misaligned access

* Q2_K: fixed bug in imatrix quantization for QK_K = 64

* iq1_s: turn off SIMD implementation for QK_K = 64 (it does not work)

---------

Co-authored-by: Iwan Kawrakow

b2282

27 Feb 18:01
cb49e0f
Attempt to fix android build (#5752)

Co-authored-by: Iwan Kawrakow

b2281

27 Feb 15:26
0becb22
IQ4_XS: a 4.25 bpw quantization (#5747)

* Try IQ4_NL with blocks of 64 - does not look good

* iq4_xs: go to super-blocks of 256 and 6-bit scales for blocks of 32

* iq4_xs: CUDA works - 133.2 t/s

* iq4_xs: AVX2 dot product

* iq4_xs: ARM_NEON dot product

* iq4_nl: Metal implementation

As usual, Metal / Apple Silicon don't like my quants.

* iq3_xs: minor fix

* iq4_xs: shrink by using IQ3_S for attn_k and attn_q

* iq4_xs: revert using IQ3_S for attn_k and attn_v

PPL vs size is good, but CPU performance suffers: on an M2 Max,
TG-128 drops from 28.8 t/s to 21.7 t/s, and on a Ryzen-7950X
from 15.8 t/s to 14.5 t/s. On CUDA we get 135 t/s when
using IQ3_S vs 133 t/s with pure IQ4_XS.

* Fix CI

* iq4_xs: Added forgotten check for 256 divisibility

---------

Co-authored-by: Iwan Kawrakow
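
The 4.25 bpw in the title can be checked against the layout named in the bullets: super-blocks of 256 weights, 4-bit quants, and one 6-bit scale per block of 32. The arithmetic below additionally assumes one 16-bit scale per super-block, which these notes do not state. All constant and function names are hypothetical, and `row_fits_super_blocks` only illustrates what a "256 divisibility" check could look like.

```cpp
// Values taken from the release notes.
constexpr int k_super_block = 256; // weights per super-block
constexpr int k_block       = 32;  // weights per block
constexpr int k_quant_bits  = 4;   // bits per quantized weight
constexpr int k_scale_bits  = 6;   // bits per block scale

// Assumption of this sketch: one 16-bit scale per super-block.
constexpr int k_super_scale_bits = 16;

constexpr int k_blocks_per_super = k_super_block / k_block; // 8

// 256*4 quant bits + 8*6 scale bits + 16 super-scale bits = 1088 bits.
constexpr int k_total_bits =
    k_super_block * k_quant_bits +
    k_blocks_per_super * k_scale_bits +
    k_super_scale_bits;

static_assert(k_total_bits == 1088, "bits per super-block");
static_assert(k_total_bits % 8 == 0, "super-block is a whole number of bytes");

// 1088 bits / 256 weights = 4.25 bits per weight.
constexpr double bits_per_weight() {
    return static_cast<double>(k_total_bits) / k_super_block;
}

// A row can only use this layout if its length is a multiple of 256.
constexpr bool row_fits_super_blocks(long n_per_row) {
    return n_per_row % k_super_block == 0;
}
```

Under these assumptions a super-block occupies 136 bytes, and a row of 4096 weights passes the divisibility check while a row of 4000 does not.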

b2280

27 Feb 15:15
c24a2a6
cuda : replace remaining shfl_xor with calls to warp_reduce functions…