
Releases: vectorch-ai/ScaleLLM

v0.2.1

04 Sep 23:00

What's Changed

  • feat: added awq marlin qlinear by @guocuimi in #315
  • build: speed up compilation for marlin kernels by @guocuimi in #316
  • test: added unittests for marlin kernels by @guocuimi in #317
  • refactor: clean up build warnings and refactor marlin kernels by @guocuimi in #318
  • fix: clean up build warnings: "LOG" redefined by @guocuimi in #319
  • cmake: make includes private and disable jinja2cpp build by @guocuimi in #320
  • ci: allow build without requiring a physical gpu device by @guocuimi in #321
  • fix: put item into asyncio.Queue in a thread-safe way by @guocuimi in #324 (see the sketch after this list)
  • refactor: added static switch for marlin kernel dispatch by @guocuimi in #325
  • feat: fix and use marlin kernel for awq by default by @guocuimi in #326
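PR #324 addresses a classic asyncio pitfall: asyncio.Queue is not thread-safe, so producing into it from a worker thread must be scheduled on the event-loop thread. A minimal sketch of the general pattern (not ScaleLLM's actual code):

```python
import asyncio
import threading

async def main() -> None:
    loop = asyncio.get_running_loop()
    queue: asyncio.Queue = asyncio.Queue()

    def producer() -> None:
        # asyncio.Queue is not thread-safe: instead of calling
        # queue.put_nowait() here, hand the put to the event-loop thread.
        for item in range(3):
            loop.call_soon_threadsafe(queue.put_nowait, item)

    threading.Thread(target=producer).start()
    for _ in range(3):
        print(await queue.get())

asyncio.run(main())
```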

Full Changelog: v0.2.0...v0.2.1

v0.2.0

22 Aug 01:49

What's Changed

  • kernel: port softcap support for flash attention by @guocuimi in #298 (see the sketch after this list)
  • test: added unittests for attention sliding window by @guocuimi in #299
  • model: added gemma2 with softcap and sliding window support by @guocuimi in #300
  • kernel: support kernel test in python via pybind by @guocuimi in #301
  • test: added unittests for marlin fp16xint4 gemm by @guocuimi in #302
  • fix: move eos out of stop token list to honor ignore_eos option by @guocuimi in #305
  • refactor: move models to upper folder by @guocuimi in #306
  • kernel: port gptq marlin kernel and fp8 marlin kernel by @guocuimi in #307
  • rust: upgrade rust libs to latest version by @guocuimi in #309
  • refactor: remove the logic for loading individual weights from shared partitions by @guocuimi in #311
  • feat: added fused column parallel linear by @guocuimi in #313
  • feat: added gptq marlin qlinear layer by @guocuimi in #312
  • kernel: port awq repack kernel by @guocuimi in #314
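The softcap ported in #298 and used by gemma2 (#300) squashes attention logits with a scaled tanh before the softmax. A hedged sketch of the idea in plain PyTorch, not ScaleLLM's kernel code; the cap of 50.0 mirrors Gemma 2's published attention-softcap default:

```python
import torch

def softcap(scores: torch.Tensor, cap: float = 50.0) -> torch.Tensor:
    # squash raw attention logits into (-cap, cap) with a smooth tanh
    return cap * torch.tanh(scores / cap)

# usage in a naive (non-flash) attention reference
q = torch.randn(1, 8, 16, 64)  # (batch, heads, seq, head_dim)
k = torch.randn(1, 8, 16, 64)
scores = q @ k.transpose(-2, -1) / 64**0.5
probs = torch.softmax(softcap(scores), dim=-1)
```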

Full Changelog: v0.1.9...v0.2.0

v0.1.9

04 Aug 00:38

Full Changelog: v0.1.8...v0.1.9

v0.1.8

25 Jul 12:02 · 2e14170

Full Changelog: v0.1.7...v0.1.8

v0.1.7

24 Jul 06:12 · f0f7e07

What's Changed

  • build: fix build error with gcc-13 by @guocuimi in #264
  • kernel: upgrade cutlass to 3.5.0 + cuda 12.4 for sm89 fp8 support by @guocuimi in #265
  • cmake: define header only library instead of symbol link for cutlass and flashinfer by @guocuimi in #266
  • feat: added range to support range-for loops by @guocuimi in #267
  • kernel: added attention cpu implementation for testing by @guocuimi in #268
  • build: added nvbench as submodule by @guocuimi in #269
  • build: upgrade cmake required version from 3.18 to 3.26 by @guocuimi in #270
  • ci: build and test in devel docker image by @guocuimi in #272
  • ci: use manylinux image to build wheel and run pytest by @guocuimi in #271
  • attention: added tile logic using cute::local_tile into cpu attention by @guocuimi in #273
  • kernel: added playground for learning and experimenting with cute by @guocuimi in #274
  • feat: added rope scaling support for llama3.1 by @guocuimi in #277 (see the sketch after this list)
  • update docs for llama3.1 support and bump up version by @guocuimi in #278
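The rope scaling added in #277 follows the Llama 3.1 scheme: low-frequency rotary components are slowed down by a fixed factor, high-frequency components are left alone, and the band in between is interpolated smoothly. A sketch mirroring Meta's published reference logic (with the Llama 3.1 defaults), not ScaleLLM's implementation:

```python
import math
import torch

def llama31_scale_inv_freq(
    inv_freq: torch.Tensor,
    factor: float = 8.0,
    low_freq_factor: float = 1.0,
    high_freq_factor: float = 4.0,
    old_context_len: int = 8192,
) -> torch.Tensor:
    low_wavelen = old_context_len / low_freq_factor
    high_wavelen = old_context_len / high_freq_factor
    wavelen = 2 * math.pi / inv_freq
    # interpolation weight in [0, 1] for the middle band
    smooth = (old_context_len / wavelen - low_freq_factor) / (
        high_freq_factor - low_freq_factor
    )
    # long wavelengths (low frequency): divide by factor; short ones: keep
    scaled = torch.where(wavelen > low_wavelen, inv_freq / factor, inv_freq)
    mid = (wavelen <= low_wavelen) & (wavelen >= high_wavelen)
    interp = (1 - smooth) * inv_freq / factor + smooth * inv_freq
    return torch.where(mid, interp, scaled)
```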

Full Changelog: v0.1.6...v0.1.7

v0.1.6

04 Jul 00:34 · 7aeb7fa

What's Changed

  • allow deploying docs when triggered on demand by @guocuimi in #253
  • [model] support vision language model llava. by @liutongxuan in #178
  • dev: fix issues in run_in_docker script by @guocuimi in #254
  • dev: added cuda 12.4 build support by @guocuimi in #255
  • build: fix multiple definition issue by @guocuimi in #256
  • fix: check against num_tokens instead of num_prompt_tokens for shared blocks by @guocuimi in #257
  • bugfix: fix invalid max_cache_size when device is cpu. by @liutongxuan in #259
  • ci: fail test if not all tests were passed successfully by @guocuimi in #263
  • Revert "[model] support vision language model llava. (#178)" by @guocuimi in #262

Full Changelog: v0.1.5...v0.1.6

v0.1.5

21 Jun 22:54 · ed0c74e

Major changes

  • Added stream options to include usage info in responses (see the sketch after this list)
  • Fixed a multi-GPU CUDA graph capture issue
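With stream options supported, an OpenAI-compatible client can ask the server to append token usage to a streamed response. A hedged example against a locally running ScaleLLM server; the base_url, port, and model name are placeholders:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="EMPTY")

stream = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    messages=[{"role": "user", "content": "Hello!"}],
    stream=True,
    stream_options={"include_usage": True},  # the new option
)
for chunk in stream:
    if chunk.choices:
        print(chunk.choices[0].delta.content or "", end="")
    if chunk.usage is not None:
        print("\n", chunk.usage)  # the final chunk carries token counts
```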

Full Changelog: v0.1.4...v0.1.5

v0.1.4

15 Jun 17:16

Major changes

  • Added logprobs for the completion and chat APIs
  • Added best_of for the completion and chat APIs (see the sketch after this list)
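Both options mirror the OpenAI API surface. A hedged example for the legacy completion API; the base_url and model name are placeholders for a local ScaleLLM server:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="EMPTY")

resp = client.completions.create(
    model="meta-llama/Meta-Llama-3-8B",
    prompt="The capital of France is",
    max_tokens=8,
    logprobs=2,   # top-2 logprobs for each generated token
    best_of=4,    # sample 4 completions server-side, return the best one
)
choice = resp.choices[0]
print(choice.text)
print(choice.logprobs.token_logprobs)
```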

What's Changed

  • feat: added openai compatible logprobs support by @guocuimi in #232
  • feat: added logprobs support for legacy completion api by @guocuimi in #233
  • feat: added logprobs for grpc server by @guocuimi in #234
  • feat: added best_of functionality for completion apis by @guocuimi in #236
  • feat: added token_ids into sequence output for better debuggability. by @guocuimi in #237
  • feat: added id_to_token for tokenizer to handle unfinished byte sequence, ending with "�" by @guocuimi in #238
  • refactor: split pybind11 binding definitions into separate files by @guocuimi in #239
  • feat: added logprobs support for speculative decoding by @guocuimi in #240
  • feat: added synchronization for batch inference by @guocuimi in #241
  • feat: added 'repr' function for scalellm package by @guocuimi in #242

Full Changelog: v0.1.3...v0.1.4

v0.1.3

07 Jun 04:59

Major changes

  • Model arg hotfix for llama3
  • Added more helper functions

What's Changed

  • fix: load vocab_size first, then use it to decide the model type for model sharing between llama3, llama2, and Yi by @guocuimi in #230 (see the sketch after this list)
  • feat: added with statement support to release memory and exposed help function for tokenizer by @guocuimi in #231
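The vocab_size heuristic works because these families share the llama architecture string but ship distinct vocabularies. A hypothetical sketch of the idea; the function name is illustrative, and the sizes are the models' well-known defaults:

```python
def resolve_model_type(vocab_size: int) -> str:
    # Llama 3 ships a 128,256-token vocab, Yi 64,000, Llama 2 32,000
    if vocab_size == 128256:
        return "llama3"
    if vocab_size == 64000:
        return "yi"
    return "llama2"
```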

Full Changelog: v0.1.2...v0.1.3

v0.1.2

06 Jun 09:18 · 917c416

Major changes

  • Set up GitHub Pages for docs: https://docs.vectorch.com/
  • Set up a wheel repository to host published wheels: https://whl.vectorch.com/
  • Support pip install for different versions, for example: pip install scalellm -i https://whl.vectorch.com/cu121/torch2.3/
  • Added latency and system metrics
  • Added an initial monitoring dashboard
  • Bug fixes for the decoder, the rejection sampler, and a default value for llama2

What's Changed

  • ci: added workflow to publish docs to GitHub Pages by @guocuimi in #206
  • docs: added docs skeleton by @guocuimi in #207
  • docs: fixed source directory and added announcement by @guocuimi in #208
  • feat: added monitoring docker compose for prometheus and grafana by @guocuimi in #209
  • feat: Added prometheus metrics by @guocuimi in #210
  • feat: added token related latency metrics by @guocuimi in #211
  • fix: fix weight load issue for fused qkv and added more unittests for weight loading by @guocuimi in #213
  • fix: use a consistent version for whl by @guocuimi in #214
  • refactor: move setup.py to top level by @guocuimi in #217
  • feat: carry over prompt to output for feature parity by @guocuimi in #218
  • added missing changes for carrying over prompt by @guocuimi in #219
  • fix: set correct default value of rope_theta for llama2 by @guocuimi in #223
  • feat: convert pickle to safetensors for fast loading by @guocuimi in #224 (see the sketch after this list)
  • docs: add livehtml for docs development by @guocuimi in #225
  • fix: use error instead of CHECK when prompt input is empty by @guocuimi in #226
  • fix: avoid tensor conversion for already-converted tensors by @guocuimi in #228
  • feat: added time_to_first_token and inter_token metrics for both stream and non-stream requests by @guocuimi in #227
  • fix: decode ending tokens one by one to handle unfinished tokens by @guocuimi in #229
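The conversion in #224 boils down to reading a pickle-based checkpoint and re-serializing it with safetensors, which can then be memory-mapped on load. A hedged sketch using the public safetensors API; the file names are placeholders:

```python
import torch
from safetensors.torch import save_file

# load the pickle-based checkpoint onto CPU
state_dict = torch.load("pytorch_model.bin", map_location="cpu")
# safetensors requires contiguous, non-shared tensors
state_dict = {k: v.contiguous() for k, v in state_dict.items()}
save_file(state_dict, "model.safetensors")
```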

Full Changelog: v0.1.1...v0.1.2