# Serving Large Language Models using KServe and llama-cpp-python
See llama.cpp's docs for a list of supported models.

Requirements:

- kserve: see the getting started guide.
- llama.cpp: uses ggml's GGUF model format (`.gguf` extension), so models must first be converted to this format; see the guide or use pre-converted models.
This example uses `mistral-7b-q2k-extra-small.gguf` from `ikawrakow/mistral-7b-quantized-gguf`. Other models can be deployed by providing a patch that specifies a URL to a GGUF model; see `manifests/models/` for examples.
```bash
bash scripts/setup_model.sh
```
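`scripts/setup_model.sh` is not reproduced here; as a rough sketch, under the assumption that it fetches the file from the Hugging Face Hub, the download could look like this:

```python
# Illustrative sketch only -- not the contents of scripts/setup_model.sh.
# Assumes the huggingface_hub package is installed (pip install huggingface_hub).
from huggingface_hub import hf_hub_download

path = hf_hub_download(
    repo_id="ikawrakow/mistral-7b-quantized-gguf",  # repo named above
    filename="mistral-7b-q2k-extra-small.gguf",     # quantized model file
)
print(f"Model downloaded to {path}")
```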
```bash
# Create the ServingRuntime/InferenceService (GPU inference)
bash scripts/deploy.sh gpu
# or, to create the ServingRuntime/InferenceService (CPU inference):
# bash scripts/deploy.sh cpu

export ISVC_URL=$(kubectl get isvc llama-cpp-python -o jsonpath='{.status.components.predictor.url}')

# Check that the model is available
curl -k ${ISVC_URL}/v1/models

# Perform inference
bash scripts/inference_simple.sh "The quick brown fox "
```
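`scripts/inference_simple.sh` is not reproduced here; a minimal Python sketch of the same kind of request, assuming the server's OpenAI-compatible `/v1/completions` endpoint (with `verify=False` mirroring the `curl -k` above), might be:

```python
# Minimal sketch of a completion request against the service; assumes the
# OpenAI-compatible /v1/completions endpoint served by llama-cpp-python.
import os
import requests

isvc_url = os.environ["ISVC_URL"]
resp = requests.post(
    f"{isvc_url}/v1/completions",
    json={"prompt": "The quick brown fox ", "max_tokens": 64},
    verify=False,  # accept the self-signed certificate, like `curl -k`
)
resp.raise_for_status()
print(resp.json()["choices"][0]["text"])
```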
Note that since the llama-cpp-python application is built on FastAPI, the server exposes interactive Swagger documentation of its endpoints at `/docs`, as well as an OpenAPI spec at `/openapi.json`:

```bash
xdg-open ${ISVC_URL}/docs  # Linux
open ${ISVC_URL}/docs      # macOS
```
Note: the OpenAI Python client uses httpx and certifi to perform requests, and it does not currently seem possible to disable TLS verification for its queries.
```bash
# (Optional) Set up a virtualenv for inference
python -m venv .venv
source .venv/bin/activate
pip install openai

# Perform inference
export ISVC_URL=$(oc get isvc llama-cpp-python -o jsonpath='{.status.components.predictor.url}')
python examples/inference.py
```
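`examples/inference.py` is not reproduced here; a minimal sketch of this kind of call with the OpenAI client, assuming the server's OpenAI-compatible API (the model id and prompt below are illustrative), could look like:

```python
# Sketch of an OpenAI-client call against the service; the model id and
# prompt are illustrative, not taken from examples/inference.py.
import os
from openai import OpenAI

client = OpenAI(
    base_url=f"{os.environ['ISVC_URL']}/v1",
    api_key="not-used",  # the llama-cpp-python server does not require a real key
)
completion = client.completions.create(
    model="mistral-7b-q2k-extra-small",  # hypothetical model id
    prompt="The quick brown fox ",
    max_tokens=64,
)
print(completion.choices[0].text)
```

With a self-signed certificate, a call like this will only succeed once the certificate workaround below has been applied.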
Because TLS verification cannot be disabled in the OpenAI client, the workaround below appends the server's certificate to certifi's CA bundle, which makes it possible to use a self-signed certificate with the OpenAI API:

```bash
python utils/server_cert.py ${ISVC_URL}
cat *pem >> .venv/lib/*/site-packages/certifi/cacert.pem
python examples/inference.py
```
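`utils/server_cert.py` is not reproduced here; as a sketch, a script like it could fetch the serving certificate with Python's standard library (the output filename is illustrative):

```python
# Sketch of fetching a server's TLS certificate with the standard library;
# utils/server_cert.py may differ. Host and port are parsed from ISVC_URL.
import os
import ssl
from urllib.parse import urlparse

url = urlparse(os.environ["ISVC_URL"])
pem = ssl.get_server_certificate((url.hostname, url.port or 443))
with open("server.pem", "w") as f:
    f.write(pem)
```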
Alternatively, `scripts/inference_chat.py` performs chat inference using the `requests` library (which, unlike the OpenAI client, can skip TLS verification):

```bash
export ISVC_URL=$(oc get isvc llama-cpp-python -o jsonpath='{.status.components.predictor.url}')

python -m venv .venv
source .venv/bin/activate
pip install requests

python scripts/inference_chat.py
```
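`scripts/inference_chat.py` itself is not shown here; a minimal sketch of a chat request, assuming the standard OpenAI-compatible `/v1/chat/completions` endpoint (the message content is illustrative), could be:

```python
# Sketch of a chat request with `requests`, assuming the OpenAI-compatible
# /v1/chat/completions endpoint; scripts/inference_chat.py may differ.
import os
import requests

isvc_url = os.environ["ISVC_URL"]
resp = requests.post(
    f"{isvc_url}/v1/chat/completions",
    json={"messages": [{"role": "user", "content": "Tell me a joke."}]},
    verify=False,  # accept the self-signed certificate
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```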
Troubleshooting:

- The InferenceService fails to come up with the error `llama_load_model_from_file: failed to load model`. Workaround: the GPU memory might be insufficient to hold the model. `N_GPU_LAYERS` can be set to the number of layers to offload to the GPU, leaving the remaining layers to be handled by the CPU.
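For context, `N_GPU_LAYERS` appears to correspond to llama-cpp-python's `n_gpu_layers` setting; as an illustration of what partial offloading means (the model path and layer count below are made up, not taken from this repo):

```python
# Sketch of partial GPU offloading with llama-cpp-python; model path and
# layer count are illustrative.
from llama_cpp import Llama

llm = Llama(
    model_path="/models/mistral-7b-q2k-extra-small.gguf",
    n_gpu_layers=20,  # offload 20 layers to the GPU; the rest run on the CPU
)
print(llm("The quick brown fox ", max_tokens=16)["choices"][0]["text"])
```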