# Serving Large Language Models using KServe and llama-cpp-python
See llama.cpp's docs for a list of supported models.

Requirements:

- kserve: see the getting started guide.
- llama.cpp: uses ggml's GGUF model format (`.gguf` extension), so models must first be converted to this format; see the guide or use pre-converted models.
This example uses `mistral-7b-q2k-extra-small.gguf` from `ikawrakow/mistral-7b-quantized-gguf`. Other models can be deployed by providing a patch that specifies a URL to a GGUF model; see `manifests/models/` for examples.
```bash
bash scripts/setup_model.sh
```
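`scripts/setup_model.sh` is not reproduced here; as a rough sketch, under the assumption that it fetches the file from the Hugging Face Hub, the download could look like this:

```python
# Illustrative sketch only -- not the contents of scripts/setup_model.sh.
# Assumes the huggingface_hub package is installed (pip install huggingface_hub).
from huggingface_hub import hf_hub_download

path = hf_hub_download(
    repo_id="ikawrakow/mistral-7b-quantized-gguf",  # repo named above
    filename="mistral-7b-q2k-extra-small.gguf",     # quantized model file
)
print(f"Model downloaded to {path}")
```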
```bash
# Create the ServingRuntime/InferenceService (GPU inference)
bash scripts/deploy.sh gpu
# or, to create the ServingRuntime/InferenceService (CPU inference):
# bash scripts/deploy.sh cpu

export ISVC_URL=$(kubectl get isvc llama-cpp-python -o jsonpath='{.status.components.predictor.url}')

# Check that the model is available
curl -k ${ISVC_URL}/v1/models

# Perform inference
bash scripts/inference_simple.sh "The quick brown fox "
```
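`scripts/inference_simple.sh` is not reproduced here; a minimal Python sketch of the same kind of request, assuming the server's OpenAI-compatible `/v1/completions` endpoint (with `verify=False` mirroring the `curl -k` above), might be:

```python
# Minimal sketch of a completion request against the service; assumes the
# OpenAI-compatible /v1/completions endpoint served by llama-cpp-python.
import os
import requests

isvc_url = os.environ["ISVC_URL"]
resp = requests.post(
    f"{isvc_url}/v1/completions",
    json={"prompt": "The quick brown fox ", "max_tokens": 64},
    verify=False,  # accept the self-signed certificate, like `curl -k`
)
resp.raise_for_status()
print(resp.json()["choices"][0]["text"])
```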
Note that since the llama-cpp-python application is built on FastAPI, the server exposes interactive Swagger documentation of its endpoints at `/docs`, as well as an OpenAPI spec at `/openapi.json`:

```bash
xdg-open ${ISVC_URL}/docs  # Linux
open ${ISVC_URL}/docs      # macOS
```
Note: the OpenAI Python client uses httpx and certifi to perform requests, and it does not currently seem possible to disable TLS verification for its queries.
```bash
# (Optional) Set up a virtualenv for inference
python -m venv .venv
source .venv/bin/activate
pip install openai

# Perform inference
export ISVC_URL=$(oc get isvc llama-cpp-python -o jsonpath='{.status.components.predictor.url}')
python examples/inference.py
```
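`examples/inference.py` is not reproduced here; a minimal sketch of this kind of call with the OpenAI client, assuming the server's OpenAI-compatible API (the model id and prompt below are illustrative), could look like:

```python
# Sketch of an OpenAI-client call against the service; the model id and
# prompt are illustrative, not taken from examples/inference.py.
import os
from openai import OpenAI

client = OpenAI(
    base_url=f"{os.environ['ISVC_URL']}/v1",
    api_key="not-used",  # the llama-cpp-python server does not require a real key
)
completion = client.completions.create(
    model="mistral-7b-q2k-extra-small",  # hypothetical model id
    prompt="The quick brown fox ",
    max_tokens=64,
)
print(completion.choices[0].text)
```

With a self-signed certificate, a call like this will only succeed once the certificate workaround below has been applied.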
Because TLS verification cannot be disabled in the OpenAI client, the workaround below appends the server's certificate to certifi's CA bundle, which makes it possible to use a self-signed certificate with the OpenAI API:

```bash
python utils/server_cert.py ${ISVC_URL}
cat *pem >> .venv/lib/*/site-packages/certifi/cacert.pem
python examples/inference.py
```
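`utils/server_cert.py` is not reproduced here; as a sketch, a script like it could fetch the serving certificate with Python's standard library (the output filename is illustrative):

```python
# Sketch of fetching a server's TLS certificate with the standard library;
# utils/server_cert.py may differ. Host and port are parsed from ISVC_URL.
import os
import ssl
from urllib.parse import urlparse

url = urlparse(os.environ["ISVC_URL"])
pem = ssl.get_server_certificate((url.hostname, url.port or 443))
with open("server.pem", "w") as f:
    f.write(pem)
```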
Alternatively, `scripts/inference_chat.py` performs chat inference using the `requests` library (which, unlike the OpenAI client, can skip TLS verification):

```bash
export ISVC_URL=$(oc get isvc llama-cpp-python -o jsonpath='{.status.components.predictor.url}')

python -m venv .venv
source .venv/bin/activate
pip install requests

python scripts/inference_chat.py
```
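`scripts/inference_chat.py` itself is not shown here; a minimal sketch of a chat request, assuming the standard OpenAI-compatible `/v1/chat/completions` endpoint (the message content is illustrative), could be:

```python
# Sketch of a chat request with `requests`, assuming the OpenAI-compatible
# /v1/chat/completions endpoint; scripts/inference_chat.py may differ.
import os
import requests

isvc_url = os.environ["ISVC_URL"]
resp = requests.post(
    f"{isvc_url}/v1/chat/completions",
    json={"messages": [{"role": "user", "content": "Tell me a joke."}]},
    verify=False,  # accept the self-signed certificate
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```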
Troubleshooting:

- The InferenceService fails to come up with the error `llama_load_model_from_file: failed to load model`. Workaround: the GPU memory might be insufficient to hold the model. `N_GPU_LAYERS` can be set to the number of layers to offload to the GPU, leaving the remaining layers to be handled by the CPU.
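For context, `N_GPU_LAYERS` appears to correspond to llama-cpp-python's `n_gpu_layers` setting; as an illustration of what partial offloading means (the model path and layer count below are made up, not taken from this repo):

```python
# Sketch of partial GPU offloading with llama-cpp-python; model path and
# layer count are illustrative.
from llama_cpp import Llama

llm = Llama(
    model_path="/models/mistral-7b-q2k-extra-small.gguf",
    n_gpu_layers=20,  # offload 20 layers to the GPU; the rest run on the CPU
)
print(llm("The quick brown fox ", max_tokens=16)["choices"][0]["text"])
```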