Japanese version: [README-ja.md](README-ja.md)
- Japanese dataset pre-cleaning
- Japanese dataset quality filtering
- Japanese dataset dedup
- Incremental pre-training
- Fine-tuning with a Japanese dataset (e.g. Alpaca)
- (Mini)conda
- Python 3.10+ (Python 3.8+ may work)
- CMake and C++17 compiler
  - Install the C++ compiler via `sudo apt-get install build-essential` on Ubuntu, or `conda install -c conda-forge cxx-compiler`
  - Install CMake via `conda install -c conda-forge cmake`
- KenLM

Build and install the KenLM Python module:

```bash
$ sudo apt install build-essential cmake libboost-system-dev libboost-thread-dev libboost-program-options-dev libboost-test-dev libeigen3-dev zlib1g-dev libbz2-dev liblzma-dev
$ git clone https://github.com/kpu/kenlm
$ cd kenlm
$ python setup.py bdist_wheel
$ python -m pip install -U dist/kenlm*.whl
```
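As a quick sanity check that the module built correctly, the sketch below loads a KenLM model and scores a whitespace-tokenized sentence. The model path is an assumption; point it at whatever `.arpa`/`.bin` model you build or download.

```python
# Sanity-check sketch for the kenlm Python module; the model path is assumed.
import kenlm

model = kenlm.Model("lm_sp/ja.arpa.bin")       # hypothetical path to a KenLM model
sentence = "吾輩 は 猫 で ある"                  # tokens separated by spaces
print(model.score(sentence, bos=True, eos=True))  # total log10 probability
print(model.perplexity(sentence))                 # perplexity of the sentence
```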
- sentencepiece

```bash
$ sudo apt install sentencepiece
```
- Download the SentencePiece and KenLM pretrained models (for the `ja` language):

```bash
$ bash download_lm.sh
```
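A hedged sketch of how the two downloaded models are typically combined for LM scoring: SentencePiece segments raw Japanese text into pieces, and KenLM computes a perplexity over those pieces. The file names below follow the cc_net convention and are assumptions; adjust them to whatever `download_lm.sh` actually places on disk.

```python
# Sketch: perplexity scoring with the downloaded SentencePiece + KenLM models.
# File names are assumptions; adjust to the output of download_lm.sh.
import kenlm
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="lm_sp/ja.sp.model")  # assumed path
lm = kenlm.Model("lm_sp/ja.arpa.bin")                            # assumed path

def perplexity(text: str) -> float:
    # Tokenize with SentencePiece, then score the space-joined pieces with KenLM.
    pieces = " ".join(sp.encode(text, out_type=str))
    return lm.perplexity(pieces)

print(perplexity("吾輩は猫である。名前はまだ無い。"))
```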
Set up the Python environment using conda. We need to create two conda environments, since the spacy-transformers module (used by the ginza module) requires an older transformers version that does not support the Llama classes (importing LlamaTokenizer from transformers fails).
```bash
$ conda create -n jp-llama-experiment python=3.10
$ conda activate jp-llama-experiment
$ python -m pip install -r requirements.txt
$ conda deactivate

$ conda create -n jp-llama-experiment-nlp python=3.10
$ conda activate jp-llama-experiment-nlp
$ python -m pip install -r requirements-ja-nlp.txt
```
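To confirm which transformers build each environment ends up with, a small (assumed) check is to print the version and try the import that the pinned spacy-transformers stack cannot satisfy; it should succeed in `jp-llama-experiment` and fail in `jp-llama-experiment-nlp`.

```python
# Run inside each conda environment to verify the transformers version
# and whether the Llama classes are importable there.
import transformers

print("transformers", transformers.__version__)
try:
    from transformers import LlamaTokenizer
    print("LlamaTokenizer is available")
except ImportError as err:
    print("LlamaTokenizer is NOT available:", err)
```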
- Download datasets.
- Run the dataset cleaner
- Train Japanese Tokenizer
- Merge Japanese Tokenizer into LLaMA Tokenizer
- LoRA incremental training using Japanese Tokenizer
- Fine-tune with a Japanese dataset (e.g. Alpaca)
This is a required step to train the Tokenizer, build the KenLM model, etc.
- cc100ja
- mc4 ja
- OSCAR2301 ja
- wiki40b/ja
See `00_download_dataset` for details.
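Purely as an illustration of what the download step provides, the sketch below streams a few cc100 ja records through the Hugging Face `datasets` library; the repository's own scripts in `00_download_dataset` are the authoritative way to fetch the corpora, and the `cc100` loader requires a `datasets` version that still supports it.

```python
# Illustrative sketch only: peek at cc100 ja via Hugging Face datasets.
# Streaming avoids downloading the full (~75 GB uncompressed) corpus.
from datasets import load_dataset

ds = load_dataset("cc100", lang="ja", split="train", streaming=True)
for i, example in enumerate(ds):
    print(example["text"].strip())
    if i >= 2:   # just look at a few lines
        break
```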
- `01_prepare_dataset`
- `02_normalize/`
- `03_clean_step1/`
- `04_lm_scoring/`
- `05_dedup/`
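The exact rules live in the numbered directories above; the sketch below only illustrates the kind of normalization an early step such as `02_normalize/` might apply (NFKC plus whitespace collapsing is an assumption for illustration, not the actual rule set).

```python
# Sketch of simple text normalization for Japanese web text.
# NFKC + whitespace collapsing are illustrative assumptions, not the repo's rules.
import unicodedata

def normalize_line(line: str) -> str:
    line = unicodedata.normalize("NFKC", line)  # unify full-width/half-width forms
    return " ".join(line.split())               # collapse runs of whitespace

print(normalize_line("Ｈｅｌｌｏ　世界！  テスト"))  # -> "Hello 世界! テスト"
```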
W.I.P.

See [notes on training a Japanese tokenizer on cc100 ja with Hugging Face tokenizers](https://zenn.dev/syoyo/articles/8647ae42a3be63) for details (in Japanese).
Train the Japanese Tokenizer from cc100 ja. It will download 40 GB of the cc100 ja dataset (75 GB uncompressed). 128 GB of CPU memory is required to train the Japanese Tokenizer. After downloading, run `train_jp_tokenizer.py`. This step takes some time to train.

T.B.W.
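In the meantime, here is a minimal sketch of Unigram tokenizer training with Hugging Face `tokenizers`, in the spirit of `train_jp_tokenizer.py`; the input file, vocabulary size, and special tokens are assumptions, so see the linked article and the actual script for the real settings.

```python
# Minimal sketch: train a Unigram (SentencePiece-style) tokenizer on a cc100 ja
# text shard. Paths, vocab size, and special tokens are illustrative assumptions.
from tokenizers import SentencePieceUnigramTokenizer

tokenizer = SentencePieceUnigramTokenizer()
tokenizer.train(
    files=["data/cc100_ja_part0.txt"],   # hypothetical shard of cc100 ja
    vocab_size=32000,
    special_tokens=["<unk>", "<s>", "</s>"],
    unk_token="<unk>",
)
tokenizer.save("jp_tokenizer.json")

print(tokenizer.encode("吾輩は猫である。").tokens)
```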
T.B.W.
T.B.W.
- Japanese-specific line-wise filtering (see the sketch below)
- Exact Dedup using Suffix Array
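A hedged sketch of what the Japanese-specific line-wise filter could look like: keep a line only if it is long enough and a large fraction of its characters are Japanese. The thresholds and the character-class heuristic are assumptions for illustration, not the rules that will ship here.

```python
# Sketch of a Japanese-specific line-wise filter.
# Thresholds and the character-class heuristic are illustrative assumptions.
import re

# Hiragana, Katakana, and CJK unified ideographs.
JA_CHARS = re.compile(r"[\u3040-\u309F\u30A0-\u30FF\u4E00-\u9FFF]")

def keep_line(line: str, min_len: int = 10, min_ja_ratio: float = 0.5) -> bool:
    line = line.strip()
    if len(line) < min_len:
        return False
    return len(JA_CHARS.findall(line)) / len(line) >= min_ja_ratio

lines = ["Buy now!!! http://example.com", "吾輩は猫である。名前はまだ無い。"]
print([l for l in lines if keep_line(l)])  # only the Japanese line survives
```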
MIT license unless licensing terms are explicitly denoted otherwise. Some scripts are licensed under Apache 2.0 or BSD.
- Chinese LLaMa: Apache 2.0: https://github.com/ymcui/Chinese-LLaMA-Alpaca
- cc_net: MIT License https://github.com/facebookresearch/cc_net
- utf8proc: MIT license + permissive Unicode data license https://github.com/JuliaStrings/utf8proc
- jagger: We choose the BSD license. https://www.tkl.iis.u-tokyo.ac.jp/~ynaga/jagger/
- c4-dataset-script: MIT license. https://github.com/shjwudp/c4-dataset-script