PPL LLM Serving

Overview

ppl.llm.serving is part of the PPL.LLM system.

If you are new to this project, we recommend reading the Overview of system first.

ppl.llm.serving is a serving framework based on ppl.nn for various Large Language Models (LLMs). This repository contains a gRPC-based server and inference support for LLaMA.

Prerequisites

- Linux running on x86_64 or arm64 CPUs
- GCC >= 9.4.0
- CMake >= 3.18
- Git >= 2.7.0
- CUDA Toolkit >= 11.4 (11.6 recommended), required for CUDA builds
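
The version requirements above can be checked before building. The following is a minimal sketch and is not part of the repository: `version_ge` and `check_tool` are hypothetical helper names, and the parsing assumes the usual output formats of `gcc -dumpfullversion`, `cmake --version`, `git --version`, and `nvcc --version`.

```shell
version_ge() {
    # Returns 0 when dotted version $1 is greater than or equal to $2.
    awk -v a="$1" -v b="$2" 'BEGIN {
        na = split(a, x, "."); nb = split(b, y, ".")
        n = (na > nb) ? na : nb
        for (i = 1; i <= n; i++) {
            xi = (i <= na) ? x[i] + 0 : 0   # missing components count as 0
            yi = (i <= nb) ? y[i] + 0 : 0
            if (xi > yi) exit 0
            if (xi < yi) exit 1
        }
        exit 0
    }'
}

check_tool() {
    # $1 = label, $2 = minimum version, $3 = detected version (may be empty)
    if [ -z "$3" ]; then
        echo "$1: not found (need >= $2)"
    elif version_ge "$3" "$2"; then
        echo "$1: $3 OK (need >= $2)"
    else
        echo "$1: $3 too old (need >= $2)"
    fi
}

# A tool that is not installed yields an empty version string.
check_tool GCC 9.4.0 "$(gcc -dumpfullversion 2>/dev/null)"
check_tool CMake 3.18 "$(cmake --version 2>/dev/null | awk 'NR==1 {print $3}')"
check_tool Git 2.7.0 "$(git --version 2>/dev/null | awk '{print $3}')"
check_tool CUDA 11.4 "$(nvcc --version 2>/dev/null | sed -n 's/.*release \([0-9.]*\).*/\1/p')"
```

The script only reports what it finds and never aborts, so a missing CUDA Toolkit shows up as "not found" rather than as an error.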

Quick Start

Here is a brief tutorial. Refer to the LLaMA Guide for more details.

Installing Prerequisites (on Debian or Ubuntu, for example)
```
apt-get install build-essential cmake git
```

Cloning Source Code

git clone https://github.com/openppl-public/ppl.llm.serving.git

Building from Source

./build.sh -DPPLNN_USE_LLM_CUDA=ON -DPPLNN_CUDA_ENABLE_NCCL=ON -DPPLNN_ENABLE_CUDA_JIT=OFF -DPPLNN_CUDA_ARCHITECTURES="'80;86;87'" -DPPLCOMMON_CUDA_ARCHITECTURES="'80;86;87'"

NCCL is required if multiple GPU devices are used.
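
The architecture lists in the build command (`'80;86;87'`) are CUDA compute capabilities with the dot removed. As a sketch of how such a list could be derived for a specific machine: `caps_to_arch_list` is a hypothetical helper, not something the repository provides, and the `nvidia-smi` query mentioned in the comment assumes a reasonably recent NVIDIA driver. Whether narrowing the list suits your deployment is an assumption to verify, since the resulting binaries would target only the listed architectures.

```shell
caps_to_arch_list() {
    # Reads compute capabilities (e.g. "8.0"), one per line, from stdin and
    # prints them as a sorted, de-duplicated list such as "80;86".
    tr -d '. ' | sort -n -u | paste -s -d ';' -
}

# Example with fixed input; on a GPU machine the input could instead come from
#   nvidia-smi --query-gpu=compute_cap --format=csv,noheader
ARCHS="$(printf '8.6\n8.0\n8.6\n' | caps_to_arch_list)"
echo "-DPPLNN_CUDA_ARCHITECTURES=\"'$ARCHS'\" -DPPLCOMMON_CUDA_ARCHITECTURES=\"'$ARCHS'\""
```

The printed flags keep the same nested quoting as the build command above, so they can be pasted in place of the two architecture options.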

Exporting Models

Refer to ppl.pmx for details.
Running Server
```
./ppl-build/ppl_llama_server /path/to/server/config.json
```
Server config examples can be found in src/models/llama/conf. Set the following fields to the correct values before running the server:
- model_dir: path to the models exported by ppl.pmx.
- model_param_path: path to the model parameters file, $model_dir/params.json.
- tokenizer_path: path to the tokenizer files for sentencepiece.
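
A wrong path in any of these fields is easy to catch before launching the server. The sketch below is an illustration only: `check_config_paths` is a hypothetical helper, the three values have to be copied by hand from your config file, and it assumes that model_dir is a directory while tokenizer_path may be either a file or a directory.

```shell
check_config_paths() {
    # $1 = model_dir, $2 = model_param_path, $3 = tokenizer_path
    status=0
    [ -d "$1" ] || { echo "model_dir is not a directory: $1"; status=1; }
    [ -f "$2" ] || { echo "model_param_path is not a file: $2"; status=1; }
    [ -e "$3" ] || { echo "tokenizer_path does not exist: $3"; status=1; }
    return "$status"
}

# Example (placeholder paths; substitute the values from your config):
# MODEL_DIR=/path/to/model_dir
# check_config_paths "$MODEL_DIR" "$MODEL_DIR/params.json" /path/to/tokenizer \
#     && ./ppl-build/ppl_llama_server /path/to/server/config.json
```

The function prints one line per problem and returns non-zero if any check fails, so it can gate the server start with `&&` as in the commented example.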
Running Client: send requests through gRPC to query the model
```
./ppl-build/client_sample 127.0.0.1:23333
```
See tools/client_sample.cc for more details.
Benchmarking
```
./ppl-build/client_qps_measure --target=127.0.0.1:23333 --tokenizer=/path/to/tokenizer/path --dataset=tools/samples_1024.json --request_rate=inf
```
See tools/client_qps_measure.cc for more details. --request_rate is the number of requests per second; the value inf means all client requests are sent with no interval between them.
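
To compare behavior under different loads, the benchmark can be repeated at several request rates. This is a minimal sketch that uses only the flags shown above: `run_sweep` and the `QPS_BIN` variable are hypothetical names, the rates in the example are arbitrary, and the server is assumed to be already running at the target address.

```shell
# Path of the benchmark binary; override QPS_BIN to point elsewhere.
QPS_BIN="${QPS_BIN:-./ppl-build/client_qps_measure}"

run_sweep() {
    # Runs the benchmark once per request rate given as an argument.
    for rate in "$@"; do
        echo "== request_rate=$rate =="
        "$QPS_BIN" --target=127.0.0.1:23333 \
            --tokenizer=/path/to/tokenizer/path \
            --dataset=tools/samples_1024.json \
            --request_rate="$rate"
    done
}

# Example: run_sweep 1 2 4 8 inf
```

Ending the sweep with inf gives the no-interval case described above as a point of comparison for the finite rates.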
Running inference offline:
```
./ppl-build/offline_inference /path/to/server/config.json
```
See tools/offline_inference.cc for more details.

License

This project is distributed under the Apache License, Version 2.0.
