Japanese version: [README-ja.md](README-ja.md)
- Japanese dataset pre-cleaning
- Japanese dataset quality filtering
- Japanese dataset dedup
- Incremental pre-training
- Fine-tuning with a Japanese dataset (e.g. Alpaca)
- (Mini)conda
- Python 3.10+ (Python 3.8+ may work)
- CMake and C++17 compiler
  - Install the C++ compiler via `sudo apt-get install build-essential` on Ubuntu, or `conda install -c conda-forge cxx-compiler`
  - Install CMake via `conda install -c conda-forge cmake`
- KenLM

Build and install the KenLM Python module:

```bash
$ sudo apt install build-essential cmake libboost-system-dev libboost-thread-dev libboost-program-options-dev libboost-test-dev libeigen3-dev zlib1g-dev libbz2-dev liblzma-dev
$ git clone https://github.com/kpu/kenlm
$ cd kenlm
$ python setup.py bdist_wheel
$ python -m pip install -U dist/kenlm*.whl
```
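As a quick sanity check that the module built correctly, the sketch below loads a KenLM model and scores a whitespace-tokenized sentence. The model path is an assumption; point it at whatever `.arpa`/`.bin` model you build or download.

```python
# Sanity-check sketch for the kenlm Python module; the model path is assumed.
import kenlm

model = kenlm.Model("lm_sp/ja.arpa.bin")       # hypothetical path to a KenLM model
sentence = "吾輩 は 猫 で ある"                  # tokens separated by spaces
print(model.score(sentence, bos=True, eos=True))  # total log10 probability
print(model.perplexity(sentence))                 # perplexity of the sentence
```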
- sentencepiece

```bash
$ sudo apt install sentencepiece
```
- Download the SentencePiece and KenLM pretrained models (for the `ja` language):

```bash
$ bash download_lm.sh
```
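A hedged sketch of how the two downloaded models are typically combined for LM scoring: SentencePiece segments raw Japanese text into pieces, and KenLM computes a perplexity over those pieces. The file names below follow the cc_net convention and are assumptions; adjust them to whatever `download_lm.sh` actually places on disk.

```python
# Sketch: perplexity scoring with the downloaded SentencePiece + KenLM models.
# File names are assumptions; adjust to the output of download_lm.sh.
import kenlm
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="lm_sp/ja.sp.model")  # assumed path
lm = kenlm.Model("lm_sp/ja.arpa.bin")                            # assumed path

def perplexity(text: str) -> float:
    # Tokenize with SentencePiece, then score the space-joined pieces with KenLM.
    pieces = " ".join(sp.encode(text, out_type=str))
    return lm.perplexity(pieces)

print(perplexity("吾輩は猫である。名前はまだ無い。"))
```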
Set up the Python environment using conda. We need to create two conda environments, since the spacy-transformers module (used by the ginza module) requires an older transformers version that does not support the Llama classes (importing LlamaTokenizer from transformers fails).
```bash
$ conda create -n jp-llama-experiment python=3.10
$ conda activate jp-llama-experiment
$ python -m pip install -r requirements.txt
$ conda deactivate

$ conda create -n jp-llama-experiment-nlp python=3.10
$ conda activate jp-llama-experiment-nlp
$ python -m pip install -r requirements-ja-nlp.txt
```
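To confirm which transformers build each environment ends up with, a small (assumed) check is to print the version and try the import that the pinned spacy-transformers stack cannot satisfy; it should succeed in `jp-llama-experiment` and fail in `jp-llama-experiment-nlp`.

```python
# Run inside each conda environment to verify the transformers version
# and whether the Llama classes are importable there.
import transformers

print("transformers", transformers.__version__)
try:
    from transformers import LlamaTokenizer
    print("LlamaTokenizer is available")
except ImportError as err:
    print("LlamaTokenizer is NOT available:", err)
```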
- Download datasets.
- Run the dataset cleaner
- Train Japanese Tokenizer
- Merge Japanese Tokenizer into LLaMA Tokenizer
- LoRA incremental training using Japanese Tokenizer
- Fine-tune with a Japanese dataset (e.g. Alpaca)
This is a required step to train the Tokenizer, build the KenLM model, etc.
- cc100ja
- mc4 ja
- OSCAR2301 ja
- wiki40b/ja
See `00_download_dataset` for details.
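Purely as an illustration of what the download step provides, the sketch below streams a few cc100 ja records through the Hugging Face `datasets` library; the repository's own scripts in `00_download_dataset` are the authoritative way to fetch the corpora, and the `cc100` loader requires a `datasets` version that still supports it.

```python
# Illustrative sketch only: peek at cc100 ja via Hugging Face datasets.
# Streaming avoids downloading the full (~75 GB uncompressed) corpus.
from datasets import load_dataset

ds = load_dataset("cc100", lang="ja", split="train", streaming=True)
for i, example in enumerate(ds):
    print(example["text"].strip())
    if i >= 2:   # just look at a few lines
        break
```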
- `01_prepare_dataset`
- `02_normalize/`
- `03_clean_step1/`
- `04_lm_scoring/`
- `05_dedup/`
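The exact rules live in the numbered directories above; the sketch below only illustrates the kind of normalization an early step such as `02_normalize/` might apply (NFKC plus whitespace collapsing is an assumption for illustration, not the actual rule set).

```python
# Sketch of simple text normalization for Japanese web text.
# NFKC + whitespace collapsing are illustrative assumptions, not the repo's rules.
import unicodedata

def normalize_line(line: str) -> str:
    line = unicodedata.normalize("NFKC", line)  # unify full-width/half-width forms
    return " ".join(line.split())               # collapse runs of whitespace

print(normalize_line("Ｈｅｌｌｏ　世界！  テスト"))  # -> "Hello 世界! テスト"
```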
W.I.P.

See [notes on training a Japanese tokenizer on cc100 ja with Hugging Face tokenizers](https://zenn.dev/syoyo/articles/8647ae42a3be63) for details (in Japanese).
Train the Japanese Tokenizer from cc100 ja. It will download 40 GB of the cc100 ja dataset (75 GB uncompressed). 128 GB of CPU memory is required to train the Japanese Tokenizer. After downloading, run `train_jp_tokenizer.py`. This step takes some time to train.

T.B.W.
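In the meantime, here is a minimal sketch of Unigram tokenizer training with Hugging Face `tokenizers`, in the spirit of `train_jp_tokenizer.py`; the input file, vocabulary size, and special tokens are assumptions, so see the linked article and the actual script for the real settings.

```python
# Minimal sketch: train a Unigram (SentencePiece-style) tokenizer on a cc100 ja
# text shard. Paths, vocab size, and special tokens are illustrative assumptions.
from tokenizers import SentencePieceUnigramTokenizer

tokenizer = SentencePieceUnigramTokenizer()
tokenizer.train(
    files=["data/cc100_ja_part0.txt"],   # hypothetical shard of cc100 ja
    vocab_size=32000,
    special_tokens=["<unk>", "<s>", "</s>"],
    unk_token="<unk>",
)
tokenizer.save("jp_tokenizer.json")

print(tokenizer.encode("吾輩は猫である。").tokens)
```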
T.B.W.
T.B.W.
- Japanese-specific line-wise filtering (see the sketch below)
- Exact Dedup using Suffix Array
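A hedged sketch of what the Japanese-specific line-wise filter could look like: keep a line only if it is long enough and a large fraction of its characters are Japanese. The thresholds and the character-class heuristic are assumptions for illustration, not the rules that will ship here.

```python
# Sketch of a Japanese-specific line-wise filter.
# Thresholds and the character-class heuristic are illustrative assumptions.
import re

# Hiragana, Katakana, and CJK unified ideographs.
JA_CHARS = re.compile(r"[\u3040-\u309F\u30A0-\u30FF\u4E00-\u9FFF]")

def keep_line(line: str, min_len: int = 10, min_ja_ratio: float = 0.5) -> bool:
    line = line.strip()
    if len(line) < min_len:
        return False
    return len(JA_CHARS.findall(line)) / len(line) >= min_ja_ratio

lines = ["Buy now!!! http://example.com", "吾輩は猫である。名前はまだ無い。"]
print([l for l in lines if keep_line(l)])  # only the Japanese line survives
```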
MIT license unless licensing terms are explicitly denoted otherwise. Some scripts are licensed under Apache 2.0 or BSD.
- Chinese LLaMa: Apache 2.0: https://github.com/ymcui/Chinese-LLaMA-Alpaca
- cc_net: MIT License https://github.com/facebookresearch/cc_net
- utf8proc: MIT license + permissive Unicode data license https://github.com/JuliaStrings/utf8proc
- jagger: We choose the BSD license. https://www.tkl.iis.u-tokyo.ac.jp/~ynaga/jagger/
- c4-dataset-script: MIT license. https://github.com/shjwudp/c4-dataset-script