Zero-dependency tokenizer for the RWKV project.
It should also work for EleutherAI's GPT-NeoX and Pythia models, as they use the same tokenizer.
```bash
npm i rwkv-tokenizer-node
```
```javascript
const tokenizer = require("rwkv-tokenizer-node");

// Encode into token ints: [12092, 3645, 2]
const tokens = tokenizer.encode("Hello World!");

// Decode back to "Hello World!"
const decoded = tokenizer.decode(tokens);
```
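Decoding the encoded tokens should reproduce the original string, including non-ASCII text. A quick round-trip sanity check (the sample string is arbitrary, not part of the package):

```javascript
const tokenizer = require("rwkv-tokenizer-node");

// Arbitrary multilingual sample, just to exercise UTF-8 handling
const text = "こんにちは世界 / Привет, мир / 你好，世界!";

const roundTripped = tokenizer.decode(tokenizer.encode(text));
console.log(roundTripped === text); // expected: true
```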
Its primary purpose is for use in RWKV-cpp-node, though it could probably be used for other use cases (e.g. a pure-JS implementation of GPT-NeoX or RWKV).
- Performance: it is somewhat disappointing that this is easily 10x slower than the Python implementation (which, I believe, is backed by the Rust library). However, it is generally still fast enough for most use cases (see the timing sketch below).
- Why not use the Hugging Face library? Sadly, the official Hugging Face tokenizer library for Node.js is broken: huggingface/tokenizers#911

PS: Anyone with ideas on how to improve its performance, without failing the test suite, is welcome to contribute.
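As a rough way to gauge throughput on your own machine, here is a minimal timing sketch (the sample text and repetition count are arbitrary, not part of the package):

```javascript
const tokenizer = require("rwkv-tokenizer-node");

// Arbitrary sample text, repeated to get a reasonably large input
const sample = "The quick brown fox jumps over the lazy dog. ".repeat(1000);

const start = process.hrtime.bigint();
const tokens = tokenizer.encode(sample);
const elapsedMs = Number(process.hrtime.bigint() - start) / 1e6;

console.log(`Encoded ${tokens.length} tokens in ${elapsedMs.toFixed(2)} ms`);
```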
```bash
# This runs the sole test file: test/tokenizer.test.js
npm run test
```
The Python script used to seed the reference data (using the Hugging Face tokenizer) is found at test/build-test-token-json.py. The test includes a very extensive UTF-8 test file covering all major (and many minor) languages.
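Conceptually, the test asserts that this package's output matches the Hugging Face reference data. A simplified sketch of that kind of check (the reference file name and its JSON shape here are assumptions for illustration, not the actual test code):

```javascript
const assert = require("assert");
const tokenizer = require("rwkv-tokenizer-node");

// Hypothetical reference file: a JSON map of { "some text": [token, ids, ...] }
const reference = require("./test/reference-tokens.json");

for (const [text, expectedTokens] of Object.entries(reference)) {
  // Encoding must match the Hugging Face reference exactly
  assert.deepStrictEqual(tokenizer.encode(text), expectedTokens);

  // Decoding the tokens must reproduce the original text
  assert.strictEqual(tokenizer.decode(expectedTokens), text);
}
```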
@picocreator - the current maintainer of the project; ping him on the RWKV Discord if you have any questions about this project
@saharNooby - whose implementation the current one is heavily based on
@cztomsik, @josephrocca, @BlinkDL - for their various implementations, which were used as references to squash out encoding mismatches with the HF implementation