
int4 Inference

Li Yudong (李煜东) edited this page May 22, 2023 · 4 revisions

Running int4 inference with llama.cpp requires converting the model format and then quantizing the model.

Convert to LLaMA format

Conversion script:

python3 scripts/convert_tencentpretrain_to_llama.py --input_model_path chatflow_7b.bin \
                                                    --output_model_path consolidated.00.pth \
                                                    --layers 32

Convert to ggml

git clone https://github.com/ggerganov/llama.cpp

Copy the converted model into the models/ directory and create the corresponding config file. Directory layout:

├── models
│   ├── chatflow_7b
│   │   ├── consolidated.00.pth
│   │   └── params.json
│   └── tokenizer.model
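The layout above can be created like this, using the 7B params.json values listed at the bottom of this page (the converted weights and the LLaMA tokenizer still need to be copied in):

```shell
# Create the model directory and write the 7B params.json (contents from this page)
mkdir -p models/chatflow_7b
cat > models/chatflow_7b/params.json <<'EOF'
{"dim": 4096, "multiple_of": 256, "n_heads": 32, "n_layers": 32, "norm_eps": 1e-06, "vocab_size": -1}
EOF
# Then copy in the converted weights and the LLaMA sentencepiece tokenizer:
#   cp consolidated.00.pth models/chatflow_7b/
#   cp tokenizer.model models/
```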

Convert the model (the trailing 1 selects f16 output):

python3 convert-pth-to-ggml.py models/chatflow_7b 1

Model quantization

./quantize ./models/chatflow_7b/ggml-model-f16.bin ./models/chatflow_7b/ggml-model-q4_0.bin 2
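The trailing 2 selects the q4_0 quantization type. Conceptually, q4_0 quantizes weights in blocks of 32, with one shared scale per block. A simplified numpy sketch of that idea (not the exact ggml bit layout, which packs two 4-bit values per byte):

```python
import numpy as np

def quantize_block(block):
    """Map a block of 32 float weights to signed 4-bit ints plus one scale."""
    amax = float(np.abs(block).max())
    d = amax / 7.0 if amax > 0 else 1.0       # scale so values land in [-7, 7]
    q = np.clip(np.round(block / d), -8, 7).astype(np.int8)
    return d, q

def dequantize_block(d, q):
    return d * q.astype(np.float32)

rng = np.random.default_rng(0)
w = rng.standard_normal(32).astype(np.float32)
d, q = quantize_block(w)
max_err = float(np.abs(dequantize_block(d, q) - w).max())
```

The reconstruction error per weight is bounded by half the block scale, which is why outliers in a block hurt: one large weight inflates d for all 32 values.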

Run

./main -m ./models/chatflow_7b/ggml-model-q4_0.bin -p "北京有什么好玩的地方?\n" -n 256


Config file (params.json) format

7b

{"dim": 4096, "multiple_of": 256, "n_heads": 32, "n_layers": 32, "norm_eps": 1e-06, "vocab_size": -1}

13b

{"dim": 5120, "multiple_of": 256, "n_heads": 40, "n_layers": 40, "norm_eps": 1e-06, "vocab_size": -1}

30b

{"dim": 6656, "multiple_of": 256, "n_heads": 52, "n_layers": 60, "norm_eps": 1e-06, "vocab_size": -1}

65b

{"dim": 8192, "multiple_of": 256, "n_heads": 64, "n_layers": 80, "norm_eps": 1e-05, "vocab_size": -1}
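As a sanity check, the params.json values above determine the parameter count and hence the quantized file size. A rough Python sketch, assuming the standard LLaMA vocab size of 32000 and the q4_0 cost of 18 bytes per 32 weights (16 bytes of packed nibbles plus a 2-byte scale):

```python
def llama_params(dim, n_layers, multiple_of=256, vocab=32000):
    # LLaMA FFN hidden size: 2/3 * 4*dim, rounded up to a multiple of multiple_of
    h = int(2 * (4 * dim) / 3)
    hidden = multiple_of * ((h + multiple_of - 1) // multiple_of)
    # per layer: 4 attention projections, 3 FFN matrices, 2 RMSNorm weights
    per_layer = 4 * dim * dim + 3 * dim * hidden + 2 * dim
    # plus token embedding, output head, and final norm
    return 2 * vocab * dim + dim + n_layers * per_layer

n = llama_params(4096, 32)          # 7B config above
q4_bytes = n // 32 * 18             # q4_0: 32 weights -> 18 bytes
print(f"{n / 1e9:.2f}B params, ~{q4_bytes / 2**30:.1f} GiB at q4_0")
```

For the 7B config this works out to roughly 6.7B parameters, so the q4_0 file is a few GiB rather than the ~13 GiB of the f16 model.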