Releases: vectorch-ai/ScaleLLM
v0.2.1
What's Changed
- feat: added awq marlin qlinear by @guocuimi in #315
- build: speed up compilation for marlin kernels by @guocuimi in #316
- test: added unittests for marlin kernels by @guocuimi in #317
- refactor: clean up build warnings and refactor marlin kernels by @guocuimi in #318
- fix: clean up build warnings: "LOG" redefined by @guocuimi in #319
- cmake: make includes private and disable jinja2cpp build by @guocuimi in #320
- ci: allow build without requiring a physical gpu device by @guocuimi in #321
- fix: put item into asyncio.Queue in a thread-safe way by @guocuimi in #324
- refactor: added static switch for marlin kernel dispatch by @guocuimi in #325
- feat: fix and use marlin kernel for awq by default by @guocuimi in #326
Full Changelog: v0.2.0...v0.2.1
v0.2.0
What's Changed
- kernel: port softcap support for flash attention by @guocuimi in #298
- test: added unittests for attention sliding window by @guocuimi in #299
- model: added gemma2 with softcap and sliding window support by @guocuimi in #300
- kernel: support kernel test in python via pybind by @guocuimi in #301
- test: added unittests for marlin fp16xint4 gemm by @guocuimi in #302
- fix: move eos out of stop token list to honor ignore_eos option by @guocuimi in #305
- refactor: move models to upper folder by @guocuimi in #306
- kernel: port gptq marlin kernel and fp8 marlin kernel by @guocuimi in #307
- rust: upgrade rust libs to latest version by @guocuimi in #309
- refactor: remove the logic loading individual weight from shared partitions by @guocuimi in #311
- feat: added fused column parallel linear by @guocuimi in #313
- feat: added gptq marlin qlinear layer by @guocuimi in #312
- kernel: port awq repack kernel by @guocuimi in #314
Full Changelog: v0.1.9...v0.2.0
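The softcap support ported in #298 and used by gemma2 in #300 refers to logit soft-capping: squashing raw attention scores into a bounded range with tanh before softmax. A minimal sketch of the idea (not ScaleLLM's kernel code, which applies this inside flash attention):

```python
import math

def softcap(scores, cap):
    """Logit soft-capping: maps each raw attention score into
    (-cap, cap) via cap * tanh(score / cap). Near zero it is
    approximately the identity; large scores saturate at +/-cap."""
    return [cap * math.tanh(s / cap) for s in scores]

vals = softcap([0.5, 10.0, 100.0], cap=50.0)
```

Small scores pass through almost unchanged, while outliers are clamped smoothly, which is what keeps gemma2-style attention numerically stable.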
v0.1.9
What's Changed
- ci: cancel all previous runs if a new one is triggered by @guocuimi in #283
- pypi: fix invalid classifier by @guocuimi in #284
- refactor: remove exllama kernels by @guocuimi in #285
- kernel: added marlin dense and sparse kernels by @guocuimi in #287
- debug: added environment collection script. by @guocuimi in #288
- kernel: added triton kernel build support by @guocuimi in #289
- feat: added THUDM/glm-4* support by @guocuimi in #292
- fix: handle unfinished utf8 bytes for tiktoken tokenizer by @guocuimi in #293
- triton: fix build error and add example with unittest by @guocuimi in #294
- model: added qwen2 support by @guocuimi in #295
- feat: added sliding window support for QWen2 by @guocuimi in #296
- ci: fix pytest version to avoid flakiness by @guocuimi in #297
Full Changelog: v0.1.8...v0.1.9
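The fix in #293 deals with a streaming-decode pitfall: a multi-byte UTF-8 character can be split across two token chunks, so decoding each chunk eagerly produces "�". The usual remedy, sketched below with a hypothetical helper class (not ScaleLLM's actual implementation), is to buffer trailing incomplete bytes until they form a complete character:

```python
class Utf8StreamDecoder:
    """Buffers bytes across chunks and only emits text once the
    byte sequence forms complete UTF-8 characters."""

    def __init__(self):
        self.buf = b""

    def feed(self, chunk: bytes) -> str:
        self.buf += chunk
        # Find the longest prefix that decodes cleanly; a UTF-8
        # character is at most 4 bytes, so only the tail can be cut off.
        for end in range(len(self.buf), max(len(self.buf) - 4, -1), -1):
            try:
                text = self.buf[:end].decode("utf-8")
            except UnicodeDecodeError:
                continue
            self.buf = self.buf[end:]
            return text
        return ""
```

Feeding the three bytes of "€" in two chunks yields "" for the first chunk and the full character once the last byte arrives.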
v0.1.8
v0.1.7
What's Changed
- build: fix build error with gcc-13 by @guocuimi in #264
- kernel: upgrade cutlass to 3.5.0 + cuda 12.4 for sm89 fp8 support by @guocuimi in #265
- cmake: define header only library instead of symbol link for cutlass and flashinfer by @guocuimi in #266
- feat: added range to support Range-for loops by @guocuimi in #267
- kernel: added attention cpu implementation for testing by @guocuimi in #268
- build: added nvbench as submodule by @guocuimi in #269
- build: upgrade cmake required version from 3.18 to 3.26 by @guocuimi in #270
- ci: build and test in devel docker image by @guocuimi in #272
- ci: use manylinux image to build wheel and run pytest by @guocuimi in #271
- attention: added tile logic using cute::local_tile into cpu attention by @guocuimi in #273
- kernel: added playground for learning and experimenting with cute by @guocuimi in #274
- feat: added rope scaling support for llama3.1 by @guocuimi in #277
- update docs for llama3.1 support and bump up version by @guocuimi in #278
Full Changelog: v0.1.6...v0.1.7
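The rope scaling added for llama3.1 in #277 rescales RoPE frequency bands by wavelength. A sketch of the commonly published llama3.1 scheme follows; the constants are the widely used defaults, not values read from ScaleLLM's source. High-frequency bands are kept as-is, low-frequency bands are divided by the scaling factor, and the bands in between are smoothly interpolated:

```python
import math

def llama3_scaled_freqs(freqs, factor=8.0, low_freq_factor=1.0,
                        high_freq_factor=4.0, old_context_len=8192):
    """Llama-3.1-style RoPE frequency scaling (illustrative)."""
    low_wavelen = old_context_len / low_freq_factor
    high_wavelen = old_context_len / high_freq_factor
    out = []
    for f in freqs:
        wavelen = 2 * math.pi / f
        if wavelen < high_wavelen:      # high frequency: unchanged
            out.append(f)
        elif wavelen > low_wavelen:     # low frequency: down-scaled
            out.append(f / factor)
        else:                           # mid band: smooth interpolation
            smooth = (old_context_len / wavelen - low_freq_factor) / (
                high_freq_factor - low_freq_factor)
            out.append((1 - smooth) * f / factor + smooth * f)
    return out
```

This stretches the slow positional bands to cover a longer context while leaving the fast bands, which encode local order, untouched.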
v0.1.6
What's Changed
- allow deploying docs when triggered on demand by @guocuimi in #253
- [model] support vision language model llava. by @liutongxuan in #178
- dev: fix issues in run_in_docker script by @guocuimi in #254
- dev: added cuda 12.4 build support by @guocuimi in #255
- build: fix multiple definition issue by @guocuimi in #256
- fix: check against num_tokens instead of num_prompt_tokens for shared blocks by @guocuimi in #257
- bugfix: fix invalid max_cache_size when device is cpu. by @liutongxuan in #259
- ci: fail test if not all tests were passed successfully by @guocuimi in #263
- Revert "[model] support vision language model llava. (#178)" by @guocuimi in #262
Full Changelog: v0.1.5...v0.1.6
v0.1.5
Major changes
- added stream options to include usage info in response
- fixed multi-GPU CUDA graph capture issue
What's Changed
- feat: added include_usage into stream options for stream scenarios by @guocuimi in #243
- feat: added unittests for openai server by @guocuimi in #244
- [minor] use available memory to calculate cache_size by default. by @liutongxuan in #245
- refactor: only do sampling in driver worker (rank=0) by @guocuimi in #247
- fix multiple devices cuda graph capture issue by @guocuimi in #248
- revert torch.cuda.empty_cache change by @guocuimi in #249
- ci: added release workflow by @guocuimi in #250
- fix workflow by @guocuimi in #251
- fix: pass in secrets for workflow calls. by @guocuimi in #252
Full Changelog: v0.1.4...v0.1.5
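The stream options added in #243 follow the OpenAI convention: setting `stream_options.include_usage` makes the final streamed chunk carry token-count usage stats. A sketch of such a request body (the model name is a placeholder, and this builds the payload only, without calling a server):

```python
def build_stream_request(model: str, prompt: str) -> dict:
    """Build an OpenAI-style streaming completion request that asks
    the server to append usage info to the final stream chunk."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": True,
        "stream_options": {"include_usage": True},
    }

req = build_stream_request("demo-model", "hello")
```

Without `include_usage`, streamed responses omit token counts, so clients previously had to count tokens themselves.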
v0.1.4
Major changes
- Added logprobs for completion and chat apis
- Added best_of for completion and chat apis
What's Changed
- feat: added openai compatible logprobs support by @guocuimi in #232
- feat: added logprobs support for legacy completion api by @guocuimi in #233
- feat: added logprobs for grpc server by @guocuimi in #234
- feat: added best_of functionality for completion apis by @guocuimi in #236
- feat: added token_ids into sequence output for better debuggability. by @guocuimi in #237
- feat: added id_to_token for tokenizer to handle unfinished byte sequence, ending with "�" by @guocuimi in #238
- refactor: split pybind11 binding definitions into separate files by @guocuimi in #239
- feat: added logprobs support for speculative decoding by @guocuimi in #240
- feat: added synchronization for batch inference by @guocuimi in #241
- feat: added 'repr' function for scalellm package by @guocuimi in #242
Full Changelog: v0.1.3...v0.1.4
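The logprobs support added in #232-#234 returns per-token log-probabilities alongside generated text. Two quantities clients typically derive from them, shown here as a small self-contained helper (illustrative, not part of the scalellm package), are the sequence log-probability and perplexity:

```python
import math

def sequence_stats(token_logprobs):
    """Given per-token log-probabilities from a logprobs-enabled
    completion API, return (sequence log-prob, perplexity)."""
    total = sum(token_logprobs)
    ppl = math.exp(-total / len(token_logprobs))
    return total, ppl

# Four tokens each with probability 0.5 give a perplexity of 2.
total, ppl = sequence_stats([math.log(0.5)] * 4)
```

The same per-token scores are what `best_of` (#236) ranks candidates by when sampling several completions and returning the best one.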
v0.1.3
Major changes
- Model arg hotfix for llama3
- Added more helper functions
What's Changed
- fix: load vocab_size first then use it to decide model type for model sharing between llama3, llama2 and Yi. by @guocuimi in #230
- feat: added with statement support to release memory and exposed help function for tokenizer by @guocuimi in #231
Full Changelog: v0.1.2...v0.1.3
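The `with` statement support added in #231 lets callers release model memory deterministically on scope exit. The pattern, sketched with a hypothetical class (names and fields are illustrative stand-ins, not scalellm's API):

```python
class ManagedModel:
    """Context-manager pattern for deterministic resource release."""

    def __init__(self, name: str):
        self.name = name
        self.loaded = True      # stands in for allocating GPU memory

    def release(self):
        self.loaded = False     # stands in for freeing device memory

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        self.release()
        return False            # do not swallow exceptions
```

Used as `with ManagedModel("demo") as m: ...`, the memory is released when the block exits, even if an exception is raised inside it.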
v0.1.2
Major changes
- set up github pages for docs https://docs.vectorch.com/
- set up whl repository to host published whls: https://whl.vectorch.com/
- support pip install with different versions, for example:
pip install scalellm -i https://whl.vectorch.com/cu121/torch2.3/
- added latency and system metrics
- added initial monitoring dashboard.
- bug fixes for the decoder, rejection sampler, and llama2 default values
What's Changed
- ci: added workflow to publish docs to GitHub Pages by @guocuimi in #206
- docs: added docs skeleton by @guocuimi in #207
- docs: fixed source directory and added announcement by @guocuimi in #208
- feat: added monitoring docker compose for prometheus and grafana by @guocuimi in #209
- feat: Added prometheus metrics by @guocuimi in #210
- feat: added token related latency metrics by @guocuimi in #211
- fix: fix weight load issue for fused qkv and added more unittests for weight loading by @guocuimi in #213
- fix: use a consistent version for whl by @guocuimi in #214
- refactor: move setup.py to top level by @guocuimi in #217
- feat: carry over prompt to output for feature parity by @guocuimi in #218
- added missing changes for carrying over prompt by @guocuimi in #219
- fix: set correct default value of rope_theta for llama2 by @guocuimi in #223
- feat: convert pickle to safetensors for fast loading by @guocuimi in #224
- docs: add livehtml for docs development by @guocuimi in #225
- fix: use error instead of CHECK when prompt input is empty by @guocuimi in #226
- fix: avoid tensor conversion for converted ones. by @guocuimi in #228
- feat: added time_to_first_token and inter_token metrics for both stream and non-stream requests by @guocuimi in #227
- fix: decode ending tokens one by one to handle unfinished tokens by @guocuimi in #229
Full Changelog: v0.1.1...v0.1.2
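The time_to_first_token and inter_token metrics added in #227 are derived from timestamps: TTFT is the gap between request arrival and the first emitted token, and inter-token latency is the gap between consecutive tokens. A minimal sketch of that bookkeeping (the helper is illustrative, not ScaleLLM's metrics code):

```python
def latency_metrics(request_start: float, token_times: list):
    """Return (time-to-first-token, list of inter-token gaps) from a
    request-start timestamp and per-token arrival timestamps."""
    ttft = token_times[0] - request_start
    gaps = [b - a for a, b in zip(token_times, token_times[1:])]
    return ttft, gaps

ttft, gaps = latency_metrics(0.0, [0.5, 0.6, 0.8])
```

In a streaming deployment TTFT tracks prefill cost while the inter-token gaps track decode throughput, which is why the release records them for both stream and non-stream requests.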