Releases: ggerganov/llama.cpp
Releases · ggerganov/llama.cpp
b2293
llama : remove deprecated API (#5770) ggml-ci
b2291
sync : ggml
b2288
server : hit Ctrl+C twice to exit (#5734) * server: twice ctrl+C to exit * std::atomic_flag * sigint: message * sigint: stderr * Update examples/server/server.cpp Co-authored-by: Jared Van Bortel <[email protected]> --------- Co-authored-by: Jared Van Bortel <[email protected]>
b2287
llama : fix non-quantization of expert gating tensors (#5754) This reverts a single line from #5475
b2286
llama : improve BERT tokenization (#5740) * implement nfd for stripping accents in wpm tokenizer * sort nfd map; reuse iterator * use builtin tolower * add locale include * Simplify to_lower cases Co-authored-by: Jared Van Bortel <[email protected]> --------- Co-authored-by: Jared Van Bortel <[email protected]>
b2284
server : add "/chat/completions" alias for "/v1/...` (#5722) * Add "/chat/completions" as alias for "/v1/chat/completions" * merge to upstream master * minor : fix trailing whitespace --------- Co-authored-by: Georgi Gerganov <[email protected]>
b2283
ggml : make i-quants work with super-blocks of 64 (CPU,Metal) (#5760) * WIP: make i-quants work for QK_K = 64 * iq2_xs: attempt to fix AVX dot product for QK_K = 64 Tests pass, but I get gibberish. * QK_K = 64 tests pass on ARM_NEON and Metal Sadly, that does not mean it actually works. * Make CUDA compile with QK_K = 64 Tests don't pass, plus we get misaligned access * Q2_K: fixed bug in imatrix quantization for QK_K = 64 * iq1_s: turn off SIMD implementation for QK_K = 64 (it does not work) --------- Co-authored-by: Iwan Kawrakow <[email protected]>
b2282
Attempt to fix android build (#5752) Co-authored-by: Iwan Kawrakow <[email protected]>
b2281
IQ4_XS: a 4.25 bpw quantization (#5747) * Try IQ4_NL with blocks of 64 - does not look good * iq4_xs: go to super-blocks of 256 and 6-bit scales for blocks of 32 * iq4_xs: CUDA works - 133.2 t/s * iq4_xs: AVX2 dot product * iq4_xs: ARM_NEON dot product * iq4_nl: Metal implementation As usual, Metal / Apple Silicon don't like my quants. * iq3_xs: minor fix * iq4_xs: shrink by using IQ3_S for attn_k and attn_q * iq4_xs: revert using IQ3_S for attn_k and attn_v PPL vs size is good, but CPU performance suffers: on M2 Max TG-128 drops to 21.7 t/s from 28.8, and on a Ryzen-7950X to 14.5 t/s from 15.8 t/s. On CUDA we have 135 t/s when using IQ3_S vs 133 t/s with pure IQ4_XS. * Fix CI * iq4_xs: Added forgotten check for 256 divisibility --------- Co-authored-by: Iwan Kawrakow <[email protected]>
b2280
cuda : replace remaining shfl_xor with calls to warp_reduce functions…