Releases: vectorch-ai/ScaleLLM
v0.2.1
What's Changed
- feat: added awq marlin qlinear by @guocuimi in #315
- build: speed up compilation for marlin kernels by @guocuimi in #316
- test: added unittests for marlin kernels by @guocuimi in #317
- refactor: clean up build warnings and refactor marlin kernels by @guocuimi in #318
- fix: clean up build warnings: "LOG" redefined by @guocuimi in #319
- cmake: make includes private and disable jinja2cpp build by @guocuimi in #320
- ci: allow build without requiring a physical gpu device by @guocuimi in #321
- fix: put item into asyncio.Queue in a thread-safe way by @guocuimi in #324
- refactor: added static switch for marlin kernel dispatch by @guocuimi in #325
- feat: fix and use marlin kernel for awq by default by @guocuimi in #326
Full Changelog: v0.2.0...v0.2.1
v0.2.0
What's Changed
- kernel: port softcap support for flash attention by @guocuimi in #298
- test: added unittests for attention sliding window by @guocuimi in #299
- model: added gemma2 with softcap and sliding window support by @guocuimi in #300
- kernel: support kernel test in python via pybind by @guocuimi in #301
- test: added unittests for marlin fp16xint4 gemm by @guocuimi in #302
- fix: move eos out of stop token list to honor ignore_eos option by @guocuimi in #305
- refactor: move models to upper folder by @guocuimi in #306
- kernel: port gptq marlin kernel and fp8 marlin kernel by @guocuimi in #307
- rust: upgrade rust libs to latest version by @guocuimi in #309
- refactor: remove the logic loading individual weight from shared partitions by @guocuimi in #311
- feat: added fused column parallel linear by @guocuimi in #313
- feat: added gptq marlin qlinear layer by @guocuimi in #312
- kernel: port awq repack kernel by @guocuimi in #314
Full Changelog: v0.1.9...v0.2.0
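The softcap support ported in #298 and used by gemma2 in #300 refers to logit soft-capping: squashing raw attention scores into a bounded range with tanh before softmax. A minimal sketch of the idea (not ScaleLLM's kernel code, which applies this inside flash attention):

```python
import math

def softcap(scores, cap):
    """Logit soft-capping: maps each raw attention score into
    (-cap, cap) via cap * tanh(score / cap). Near zero it is
    approximately the identity; large scores saturate at +/-cap."""
    return [cap * math.tanh(s / cap) for s in scores]

vals = softcap([0.5, 10.0, 100.0], cap=50.0)
```

Small scores pass through almost unchanged, while outliers are clamped smoothly, which is what keeps gemma2-style attention numerically stable.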
v0.1.9
What's Changed
- ci: cancel all previous runs if a new one is triggered by @guocuimi in #283
- pypi: fix invalid classifier by @guocuimi in #284
- refactor: remove exllama kernels by @guocuimi in #285
- kernel: added marlin dense and sparse kernels by @guocuimi in #287
- debug: added environment collection script. by @guocuimi in #288
- kernel: added triton kernel build support by @guocuimi in #289
- feat: added THUDM/glm-4* support by @guocuimi in #292
- fix: handle unfinished utf8 bytes for tiktoken tokenizer by @guocuimi in #293
- triton: fix build error and add example with unittest by @guocuimi in #294
- model: added qwen2 support by @guocuimi in #295
- feat: added sliding window support for QWen2 by @guocuimi in #296
- ci: fix pytest version to avoid flakiness by @guocuimi in #297
Full Changelog: v0.1.8...v0.1.9
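The fix in #293 deals with a streaming-decode pitfall: a multi-byte UTF-8 character can be split across two token chunks, so decoding each chunk eagerly produces "�". The usual remedy, sketched below with a hypothetical helper class (not ScaleLLM's actual implementation), is to buffer trailing incomplete bytes until they form a complete character:

```python
class Utf8StreamDecoder:
    """Buffers bytes across chunks and only emits text once the
    byte sequence forms complete UTF-8 characters."""

    def __init__(self):
        self.buf = b""

    def feed(self, chunk: bytes) -> str:
        self.buf += chunk
        # Find the longest prefix that decodes cleanly; a UTF-8
        # character is at most 4 bytes, so only the tail can be cut off.
        for end in range(len(self.buf), max(len(self.buf) - 4, -1), -1):
            try:
                text = self.buf[:end].decode("utf-8")
            except UnicodeDecodeError:
                continue
            self.buf = self.buf[end:]
            return text
        return ""
```

Feeding the three bytes of "€" in two chunks yields "" for the first chunk and the full character once the last byte arrives.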
v0.1.8
v0.1.7
What's Changed
- build: fix build error with gcc-13 by @guocuimi in #264
- kernel: upgrade cutlass to 3.5.0 + cuda 12.4 for sm89 fp8 support by @guocuimi in #265
- cmake: define header only library instead of symbol link for cutlass and flashinfer by @guocuimi in #266
- feat: added range to support Range-for loops by @guocuimi in #267
- kernel: added attention cpu implementation for testing by @guocuimi in #268
- build: added nvbench as submodule by @guocuimi in #269
- build: upgrade cmake required version from 3.18 to 3.26 by @guocuimi in #270
- ci: build and test in devel docker image by @guocuimi in #272
- ci: use manylinux image to build wheel and run pytest by @guocuimi in #271
- attention: added tile logic using cute::local_tile into cpu attention by @guocuimi in #273
- kernel: added playground for learning and experimenting with cute by @guocuimi in #274
- feat: added rope scaling support for llama3.1 by @guocuimi in #277
- update docs for llama3.1 support and bump up version by @guocuimi in #278
Full Changelog: v0.1.6...v0.1.7
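The rope scaling added for llama3.1 in #277 rescales RoPE frequency bands by wavelength. A sketch of the commonly published llama3.1 scheme follows; the constants are the widely used defaults, not values read from ScaleLLM's source. High-frequency bands are kept as-is, low-frequency bands are divided by the scaling factor, and the bands in between are smoothly interpolated:

```python
import math

def llama3_scaled_freqs(freqs, factor=8.0, low_freq_factor=1.0,
                        high_freq_factor=4.0, old_context_len=8192):
    """Llama-3.1-style RoPE frequency scaling (illustrative)."""
    low_wavelen = old_context_len / low_freq_factor
    high_wavelen = old_context_len / high_freq_factor
    out = []
    for f in freqs:
        wavelen = 2 * math.pi / f
        if wavelen < high_wavelen:      # high frequency: unchanged
            out.append(f)
        elif wavelen > low_wavelen:     # low frequency: down-scaled
            out.append(f / factor)
        else:                           # mid band: smooth interpolation
            smooth = (old_context_len / wavelen - low_freq_factor) / (
                high_freq_factor - low_freq_factor)
            out.append((1 - smooth) * f / factor + smooth * f)
    return out
```

This stretches the slow positional bands to cover a longer context while leaving the fast bands, which encode local order, untouched.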
v0.1.6
What's Changed
- allow deploying docs when triggered on demand by @guocuimi in #253
- [model] support vision language model llava. by @liutongxuan in #178
- dev: fix issues in run_in_docker script by @guocuimi in #254
- dev: added cuda 12.4 build support by @guocuimi in #255
- build: fix multiple definition issue by @guocuimi in #256
- fix: check against num_tokens instead of num_prompt_tokens for shared blocks by @guocuimi in #257
- bugfix: fix invalid max_cache_size when device is cpu. by @liutongxuan in #259
- ci: fail test if not all tests were passed successfully by @guocuimi in #263
- Revert "[model] support vision language model llava. (#178)" by @guocuimi in #262
Full Changelog: v0.1.5...v0.1.6
v0.1.5
Major changes
- added stream options to include usage info in response
- fixed multi-GPU CUDA graph capture issue
What's Changed
- feat: added include_usage into stream options for stream scenarios by @guocuimi in #243
- feat: added unittests for openai server by @guocuimi in #244
- [minor] use available memory to calculate cache_size by default. by @liutongxuan in #245
- refactor: only do sampling in driver worker (rank=0) by @guocuimi in #247
- fix multiple devices cuda graph capture issue by @guocuimi in #248
- revert torch.cuda.empty_cache change by @guocuimi in #249
- ci: added release workflow by @guocuimi in #250
- fix workflow by @guocuimi in #251
- fix: pass in secrets for workflow calls. by @guocuimi in #252
Full Changelog: v0.1.4...v0.1.5
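The stream options added in #243 follow the OpenAI convention: setting `stream_options.include_usage` makes the final streamed chunk carry token-count usage stats. A sketch of such a request body (the model name is a placeholder, and this builds the payload only, without calling a server):

```python
def build_stream_request(model: str, prompt: str) -> dict:
    """Build an OpenAI-style streaming completion request that asks
    the server to append usage info to the final stream chunk."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": True,
        "stream_options": {"include_usage": True},
    }

req = build_stream_request("demo-model", "hello")
```

Without `include_usage`, streamed responses omit token counts, so clients previously had to count tokens themselves.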
v0.1.4
Major changes
- Added logprobs for completion and chat apis
- Added best_of for completion and chat apis
What's Changed
- feat: added openai compatible logprobs support by @guocuimi in #232
- feat: added logprobs support for legacy completion api by @guocuimi in #233
- feat: added logprobs for grpc server by @guocuimi in #234
- feat: added best_of functionality for completion apis by @guocuimi in #236
- feat: added token_ids into sequence output for better debuggability. by @guocuimi in #237
- feat: added id_to_token for tokenizer to handle unfinished byte sequence, ending with "�" by @guocuimi in #238
- refactor: split pybind11 binding definitions into separate files by @guocuimi in #239
- feat: added logprobs support for speculative decoding by @guocuimi in #240
- feat: added synchronization for batch inference by @guocuimi in #241
- feat: added 'repr' function for scalellm package by @guocuimi in #242
Full Changelog: v0.1.3...v0.1.4
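The logprobs support added in #232-#234 returns per-token log-probabilities alongside generated text. Two quantities clients typically derive from them, shown here as a small self-contained helper (illustrative, not part of the scalellm package), are the sequence log-probability and perplexity:

```python
import math

def sequence_stats(token_logprobs):
    """Given per-token log-probabilities from a logprobs-enabled
    completion API, return (sequence log-prob, perplexity)."""
    total = sum(token_logprobs)
    ppl = math.exp(-total / len(token_logprobs))
    return total, ppl

# Four tokens each with probability 0.5 give a perplexity of 2.
total, ppl = sequence_stats([math.log(0.5)] * 4)
```

The same per-token scores are what `best_of` (#236) ranks candidates by when sampling several completions and returning the best one.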
v0.1.3
Major changes
- Model arg hotfix for llama3
- Added more helper functions
What's Changed
- fix: load vocab_size first then use it to decide model type for model sharing between llama3, llama2 and Yi. by @guocuimi in #230
- feat: added with statement support to release memory and exposed help function for tokenizer by @guocuimi in #231
Full Changelog: v0.1.2...v0.1.3
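The `with` statement support added in #231 lets callers release model memory deterministically on scope exit. The pattern, sketched with a hypothetical class (names and fields are illustrative stand-ins, not scalellm's API):

```python
class ManagedModel:
    """Context-manager pattern for deterministic resource release."""

    def __init__(self, name: str):
        self.name = name
        self.loaded = True      # stands in for allocating GPU memory

    def release(self):
        self.loaded = False     # stands in for freeing device memory

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        self.release()
        return False            # do not swallow exceptions
```

Used as `with ManagedModel("demo") as m: ...`, the memory is released when the block exits, even if an exception is raised inside it.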
v0.1.2
Major changes
- set up github pages for docs https://docs.vectorch.com/
- set up whl repository to host published whls: https://whl.vectorch.com/
- support pip install with different versions, for example:
pip install scalellm -i https://whl.vectorch.com/cu121/torch2.3/
- added latency and system metrics
- added initial monitoring dashboard.
- bug fixes for the decoder, rejection sampler, and llama2 default values
What's Changed
- ci: added workflow to publish docs to GitHub Pages by @guocuimi in #206
- docs: added docs skeleton by @guocuimi in #207
- docs: fixed source directory and added announcement by @guocuimi in #208
- feat: added monitoring docker compose for prometheus and grafana by @guocuimi in #209
- feat: Added prometheus metrics by @guocuimi in #210
- feat: added token related latency metrics by @guocuimi in #211
- fix: fix weight load issue for fused qkv and added more unittests for weight loading by @guocuimi in #213
- fix: use a consistent version for whl by @guocuimi in #214
- refactor: move setup.py to top level by @guocuimi in #217
- feat: carry over prompt to output for feature parity by @guocuimi in #218
- added missing changes for carrying over prompt by @guocuimi in #219
- fix: set correct default value of rope_theta for llama2 by @guocuimi in #223
- feat: convert pickle to safetensors for fast loading by @guocuimi in #224
- docs: add livehtml for docs development by @guocuimi in #225
- fix: use error instead of CHECK when prompt input is empty by @guocuimi in #226
- fix: avoid tensor conversion for converted ones. by @guocuimi in #228
- feat: added time_to_first_token and inter_token metrics for both stream and non-stream requests by @guocuimi in #227
- fix: decode ending tokens one by one to handle unfinished tokens by @guocuimi in #229
Full Changelog: v0.1.1...v0.1.2
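The time_to_first_token and inter_token metrics added in #227 are derived from timestamps: TTFT is the gap between request arrival and the first emitted token, and inter-token latency is the gap between consecutive tokens. A minimal sketch of that bookkeeping (the helper is illustrative, not ScaleLLM's metrics code):

```python
def latency_metrics(request_start: float, token_times: list):
    """Return (time-to-first-token, list of inter-token gaps) from a
    request-start timestamp and per-token arrival timestamps."""
    ttft = token_times[0] - request_start
    gaps = [b - a for a, b in zip(token_times, token_times[1:])]
    return ttft, gaps

ttft, gaps = latency_metrics(0.0, [0.5, 0.6, 0.8])
```

In a streaming deployment TTFT tracks prefill cost while the inter-token gaps track decode throughput, which is why the release records them for both stream and non-stream requests.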