int4 inference
Li Yudong (李煜东) edited this page May 22, 2023 · 4 revisions
Running int4 inference with llama.cpp requires converting the model format and then quantizing the model.
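Convert the TencentPretrain checkpoint to the LLaMA format. --layers is the number of Transformer layers: 32 for the 7B model; for other sizes use n_layers from the matching params.json at the end of this page.
python3 scripts/convert_tencentpretrain_to_llama.py --input_model_path chatflow_7b.bin \
--output_model_path consolidated.00.pth \
--layers 32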
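To double-check that --layers matched the model, one can count the layer indices in the converted checkpoint. This is a minimal sketch that assumes the output uses the original LLaMA key naming (layers.<i>.attention.wq.weight and so on); the helper name count_layers is hypothetical.

```python
import re
from typing import Iterable, Set

# Matches keys such as "layers.31.attention.wq.weight" (LLaMA naming, assumed).
_LAYER_KEY = re.compile(r"^layers\.(\d+)\.")


def count_layers(keys: Iterable[str]) -> int:
    """Return the number of distinct Transformer layers referenced by the keys."""
    indices: Set[int] = set()
    for key in keys:
        match = _LAYER_KEY.match(key)
        if match:
            indices.add(int(match.group(1)))
    if not indices:
        return 0
    # Indices should run contiguously from 0 to n-1; a gap suggests a bad conversion.
    expected = set(range(max(indices) + 1))
    if indices != expected:
        raise ValueError(f"missing layer indices: {sorted(expected - indices)}")
    return len(indices)
```

With PyTorch installed, `count_layers(torch.load("consolidated.00.pth", map_location="cpu").keys())` should return 32 for the 7B model.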
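Download and build llama.cpp (the commands on this page match llama.cpp as of May 2023; later releases replaced convert-pth-to-ggml.py, renamed ./main and ./quantize, and switched to the GGUF model format):
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make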
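Copy the converted model into llama.cpp's models/ directory and create the matching config file, params.json (its contents for each model size are listed at the end of this page). The layout should look like this: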
├── models
│   ├── chatflow_7b
│   │   ├── consolidated.00.pth
│   │   └── params.json
│   └── tokenizer.model
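Before running the conversion, a quick check that every file sits where this layout expects can save a failed run. A minimal sketch using only the standard library; the function name missing_files and the default model directory name are illustrative.

```python
from pathlib import Path
from typing import List


def missing_files(models_dir: str, model_name: str = "chatflow_7b") -> List[str]:
    """List the files from the expected layout that are absent under models_dir."""
    root = Path(models_dir)
    required = [
        root / model_name / "consolidated.00.pth",
        root / model_name / "params.json",
        # The tokenizer sits next to the model directory, not inside it.
        root / "tokenizer.model",
    ]
    return [str(p) for p in required if not p.is_file()]
```

Run from the llama.cpp directory, `missing_files("models")` should return an empty list once the layout is complete.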
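Convert the model to ggml format (the trailing 1 selects f16 output), then quantize it to 4-bit (the trailing 2 selects the q4_0 type):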
python3 convert-pth-to-ggml.py models/chatflow_7b 1
./quantize ./models/chatflow_7b/ggml-model-f16.bin ./models/chatflow_7b/ggml-model-q4_0.bin 2
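To illustrate what the q4_0 step does, here is a simplified sketch of block-wise 4-bit quantization: weights are split into blocks of 32, and each block keeps one scale plus 32 four-bit codes. It follows the early q4_0 scheme only loosely (scale = max|x| / 7, codes offset by 8, Python's round-half-to-even) and is not bit-compatible with any particular llama.cpp version; all function names are hypothetical.

```python
from typing import List, Tuple

BLOCK = 32  # weights per block, matching ggml's q4_0 block size


def quantize_block(xs: List[float]) -> Tuple[float, List[int]]:
    """Quantize one block to (scale, 4-bit codes in 0..15), loosely like early q4_0."""
    amax = max(abs(x) for x in xs)
    scale = amax / 7 if amax > 0 else 0.0
    inv = 1.0 / scale if scale > 0 else 0.0
    # x * inv lies in [-7, 7]; round it and shift by 8 so it fits an unsigned nibble.
    codes = [min(15, max(0, round(x * inv) + 8)) for x in xs]
    return scale, codes


def dequantize_block(scale: float, codes: List[int]) -> List[float]:
    """Recover approximate weights; each error is at most scale / 2."""
    return [(c - 8) * scale for c in codes]


def pack_nibbles(codes: List[int]) -> bytes:
    """Store two 4-bit codes per byte (low nibble first): 32 codes -> 16 bytes."""
    return bytes(codes[i] | (codes[i + 1] << 4) for i in range(0, len(codes), 2))


def bits_per_weight(scale_bits: int, block: int = BLOCK) -> float:
    """Average storage per weight: 4 bits of code plus a share of the block's scale."""
    return (4 * block + scale_bits) / block


if __name__ == "__main__":
    weights = [((i * 37) % 19 - 9) / 10 for i in range(BLOCK)]
    scale, codes = quantize_block(weights)
    restored = dequantize_block(scale, codes)
    worst = max(abs(a - b) for a, b in zip(weights, restored))
    print(f"scale={scale:.4f} max_error={worst:.4f} packed_bytes={len(pack_nibbles(codes))}")
```

A 32-bit scale costs 5 bits per weight and a 16-bit scale (as stored by newer ggml revisions) 4.5 bits, against 16 bits for f16, so ggml-model-q4_0.bin ends up at roughly 30% of the size of ggml-model-f16.bin.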
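Run int4 inference (the example prompt asks "What fun places are there in Beijing?"; -n 256 caps generation at 256 tokens):
./main -m ./models/chatflow_7b/ggml-model-q4_0.bin -p "北京有什么好玩的地方?\n" -n 256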
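When scripting this step from Python, passing the arguments as a list sidesteps shell quoting: inside double quotes the shell hands "\n" to ./main as a literal backslash followed by n, whereas "\n" in a Python string is a real newline. A hedged sketch that uses only the -m, -p and -n flags shown above; the helper names are hypothetical.

```python
import subprocess
from typing import List


def build_main_args(model: str, prompt: str, n_predict: int = 256,
                    binary: str = "./main") -> List[str]:
    """Assemble the argv for llama.cpp's ./main using the flags from this page."""
    return [binary, "-m", model, "-p", prompt, "-n", str(n_predict)]


def run_main(args: List[str]) -> str:
    """Run ./main without a shell and return what it printed to stdout."""
    result = subprocess.run(args, capture_output=True, text=True, check=True)
    return result.stdout
```

For example, `run_main(build_main_args("./models/chatflow_7b/ggml-model-q4_0.bin", "北京有什么好玩的地方?\n"))` runs the same command as above from inside the llama.cpp directory.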
params.json for each model size:
7B: {"dim": 4096, "multiple_of": 256, "n_heads": 32, "n_layers": 32, "norm_eps": 1e-06, "vocab_size": -1}
13B: {"dim": 5120, "multiple_of": 256, "n_heads": 40, "n_layers": 40, "norm_eps": 1e-06, "vocab_size": -1}
33B: {"dim": 6656, "multiple_of": 256, "n_heads": 52, "n_layers": 60, "norm_eps": 1e-06, "vocab_size": -1}
65B: {"dim": 8192, "multiple_of": 256, "n_heads": 64, "n_layers": 80, "norm_eps": 1e-05, "vocab_size": -1}
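The four configurations differ only in dim, n_heads, n_layers and norm_eps, so params.json can be generated rather than typed by hand. The sketch below writes the file for a chosen size and, as a sanity check, estimates the parameter count using the feed-forward sizing rule from Meta's LLaMA reference code (hidden size = int(2/3 × 4 × dim), rounded up to a multiple of multiple_of). The size labels, helper names and the 32000-token vocabulary used in the estimate are assumptions; vocab_size -1 in params.json means the actual vocabulary size comes from tokenizer.model.

```python
import json
from pathlib import Path

# Contents of params.json per model size, as listed above.
PARAMS = {
    "7B": {"dim": 4096, "multiple_of": 256, "n_heads": 32, "n_layers": 32,
           "norm_eps": 1e-06, "vocab_size": -1},
    "13B": {"dim": 5120, "multiple_of": 256, "n_heads": 40, "n_layers": 40,
            "norm_eps": 1e-06, "vocab_size": -1},
    "33B": {"dim": 6656, "multiple_of": 256, "n_heads": 52, "n_layers": 60,
            "norm_eps": 1e-06, "vocab_size": -1},
    "65B": {"dim": 8192, "multiple_of": 256, "n_heads": 64, "n_layers": 80,
            "norm_eps": 1e-05, "vocab_size": -1},
}


def write_params_json(model_dir: str, size: str) -> Path:
    """Write params.json for the given size into model_dir (created if needed)."""
    path = Path(model_dir) / "params.json"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(PARAMS[size]))
    return path


def ffn_hidden_dim(dim: int, multiple_of: int) -> int:
    """Feed-forward width: int(2/3 * 4 * dim), rounded up to a multiple of multiple_of."""
    hidden = int(2 * (4 * dim) / 3)
    return multiple_of * ((hidden + multiple_of - 1) // multiple_of)


def estimate_params(cfg: dict, vocab_size: int = 32000) -> int:
    """Approximate parameter count of a LLaMA-style model (vocab size is assumed)."""
    dim = cfg["dim"]
    hidden = ffn_hidden_dim(dim, cfg["multiple_of"])
    # Per layer: wq/wk/wv/wo, three FFN matrices, two RMSNorm weight vectors.
    per_layer = 4 * dim * dim + 3 * dim * hidden + 2 * dim
    embeddings = 2 * vocab_size * dim  # tok_embeddings + output projection
    return cfg["n_layers"] * per_layer + embeddings + dim  # + final norm
```

For example, `write_params_json("models/chatflow_7b", "7B")` produces the params.json used in the layout above, and `estimate_params(PARAMS["7B"])` gives about 6.74 billion parameters, i.e. roughly 13.5 GB at f16 before quantization.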