This guide demonstrates how to run FastChat serving with IPEX-LLM on Intel GPUs via Docker.
Follow the instructions in this guide to install Docker on Linux.
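For reference, on Ubuntu a typical Docker installation looks like the sketch below; the exact commands depend on your distribution, so treat this as illustrative and follow the official Docker documentation for details.
# Illustrative Docker installation on Ubuntu (check the official Docker docs for your distribution)
sudo apt-get update
sudo apt-get install -y docker.io
sudo systemctl enable --now docker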
# This image will be updated every day
docker pull intelanalytics/ipex-llm-serving-xpu:latest
To map the xpu into the container, you need to specify --device=/dev/dri when booting the container. Change /path/to/models to the local folder containing your models so that it is mounted into the container.
#!/bin/bash
export DOCKER_IMAGE=intelanalytics/ipex-llm-serving-xpu:latest
export CONTAINER_NAME=ipex-llm-serving-xpu-container
sudo docker run -itd \
--net=host \
--device=/dev/dri \
-v /path/to/models:/llm/models \
-e no_proxy=localhost,127.0.0.1 \
--memory="32G" \
--name=$CONTAINER_NAME \
--shm-size="16g" \
$DOCKER_IMAGE
After the container is booted, you can get into it through docker exec:
docker exec -it ipex-llm-serving-xpu-container /bin/bash
To verify that the device is successfully mapped into the container, run sycl-ls and check the result. On a machine with an Arc A770, sample output looks like this:
root@arda-arc12:/# sycl-ls
[opencl:acc:0] Intel(R) FPGA Emulation Platform for OpenCL(TM), Intel(R) FPGA Emulation Device 1.2 [2023.16.7.0.21_160000]
[opencl:cpu:1] Intel(R) OpenCL, 13th Gen Intel(R) Core(TM) i9-13900K 3.0 [2023.16.7.0.21_160000]
[opencl:gpu:2] Intel(R) OpenCL Graphics, Intel(R) Arc(TM) A770 Graphics 3.0 [23.17.26241.33]
[ext_oneapi_level_zero:gpu:0] Intel(R) Level-Zero, Intel(R) Arc(TM) A770 Graphics 1.3 [1.3.26241]
For convenience, we have provided a script named /llm/start-fastchat-service.sh for you to start the service. However, the script only covers the most common scenarios. If it doesn't meet your needs, you can always find the complete FastChat guidance at Serving using IPEX-LLM and FastChat.
Before starting the service, you can refer to this section to set up our recommended runtime configurations.
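As a minimal sketch, the runtime settings commonly recommended for Intel GPUs look like the following; the exact values depend on your hardware, so check the IPEX-LLM documentation for your GPU before adopting them.
# Illustrative runtime settings inside the container (adjust for your hardware)
source /opt/intel/oneapi/setvars.sh                     # initialize the oneAPI environment
export USE_XETLA=OFF                                    # commonly recommended on Arc GPUs
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1  # commonly recommended; verify for your GPU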
Now we can start the FastChat service. Use the provided script /llm/start-fastchat-service.sh as follows:
# Only the MODEL_PATH needs to be set, other parameters have default values
export MODEL_PATH=YOUR_SELECTED_MODEL_PATH
export LOW_BIT_FORMAT=sym_int4
export CONTROLLER_HOST=localhost
export CONTROLLER_PORT=21001
export WORKER_HOST=localhost
export WORKER_PORT=21002
export API_HOST=localhost
export API_PORT=8000
# Use the default model_worker
bash /llm/start-fastchat-service.sh -w model_worker
If everything goes smoothly, the result should be similar to the following figure:
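As a quick sanity check, you can query the OpenAI-compatible API server for the list of served models, assuming the default API_HOST and API_PORT values above:
# List the models served by the OpenAI-compatible API server
curl http://localhost:8000/v1/models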
By default, the ipex_llm_worker is used as the backend engine. You can also use vLLM as the backend engine, as in the following example:
# Only the MODEL_PATH needs to be set, other parameters have default values
export MODEL_PATH=YOUR_SELECTED_MODEL_PATH
export LOW_BIT_FORMAT=sym_int4
export CONTROLLER_HOST=localhost
export CONTROLLER_PORT=21001
export WORKER_HOST=localhost
export WORKER_PORT=21002
export API_HOST=localhost
export API_PORT=8000
# Use the vllm_worker instead of the default model_worker
bash /llm/start-fastchat-service.sh -w vllm_worker
The vllm_worker may start more slowly than the normal ipex_llm_worker. The booted service should be similar to the following figure:
Note
To verify/use the service booted by the script, follow the instructions in this guide.
After a request has been sent to the openai_api_server, the corresponding inference latency can be found in the worker log, as shown below:
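For example, a chat completion request can be sent to the OpenAI-compatible endpoint as follows; the model name is a placeholder and should match the model you loaded, and the host and port assume the default API_HOST and API_PORT above.
# Send a chat completion request to the OpenAI-compatible API server
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "YOUR_MODEL_NAME",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'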