FastChat provides OpenAI-compatible APIs for its supported models, so you can use FastChat as a local drop-in replacement for OpenAI APIs. The FastChat server is compatible with both the openai-python library and cURL commands.
The following OpenAI APIs are supported:
- Chat Completions. (Reference: https://platform.openai.com/docs/api-reference/chat)
- Completions. (Reference: https://platform.openai.com/docs/api-reference/completions)
- Embeddings. (Reference: https://platform.openai.com/docs/api-reference/embeddings)
The REST API can also be used from Google Colab, as demonstrated in the FastChat_API_GoogleColab.ipynb notebook, available in our repository.
First, launch the controller:
python3 -m fastchat.serve.controller
Then, launch the model worker(s):
python3 -m fastchat.serve.model_worker --model-path lmsys/vicuna-7b-v1.5
Finally, launch the RESTful API server:
python3 -m fastchat.serve.openai_api_server --host localhost --port 8000
Now, let us test the API server.
The goal of openai_api_server.py is to implement a fully OpenAI-compatible API server, so the models can be used directly with the openai-python library.
First, install the OpenAI Python package (version 1.0 or later):
pip install --upgrade openai
Then, interact with the Vicuna model:
import openai
openai.api_key = "EMPTY"
openai.base_url = "http://localhost:8000/v1/"
model = "vicuna-7b-v1.5"
prompt = "Once upon a time"
# create a completion
completion = openai.completions.create(model=model, prompt=prompt, max_tokens=64)
# print the completion
print(prompt + completion.choices[0].text)
# create a chat completion
completion = openai.chat.completions.create(
    model=model,
    messages=[{"role": "user", "content": "Hello! What is your name?"}],
)
# print the completion
print(completion.choices[0].message.content)
Streaming is also supported; see test_openai_api.py. If your API server is behind a proxy, you will need to turn off buffering. In Nginx, you can do so by setting proxy_buffering off; in the location block for the proxy.
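As a minimal sketch of the streaming mode mentioned above, the snippet below passes stream=True to the chat completions call and collects the text deltas as they arrive. The helper names iter_text and stream_chat are hypothetical, and the server address and model name are assumed to match the launch commands earlier in this document.

```python
def iter_text(chunks):
    """Yield the non-empty text deltas from an iterable of chat-completion chunks."""
    for chunk in chunks:
        if not chunk.choices:
            continue  # skip chunks that carry no choices
        text = chunk.choices[0].delta.content
        if text:  # content can be None, e.g. when a chunk only carries the role
            yield text


def stream_chat(prompt, model="vicuna-7b-v1.5"):
    """Stream one chat completion from a local FastChat server and return the full text."""
    import openai  # imported lazily so iter_text can be used without the package

    openai.api_key = "EMPTY"
    openai.base_url = "http://localhost:8000/v1/"  # assumed server address
    stream = openai.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    )
    pieces = []
    for text in iter_text(stream):
        print(text, end="", flush=True)  # show tokens as they arrive
        pieces.append(text)
    print()
    return "".join(pieces)
```

With the servers running, stream_chat("Hello! What is your name?") prints the reply incrementally and returns it as one string.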
cURL is another good tool for observing the output of the API.
List Models:
curl http://localhost:8000/v1/models
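The same listing can be read from Python with only the standard library. This is a sketch that assumes the response follows the OpenAI list format, with a top-level "data" array whose entries each have an "id" field. The function names are hypothetical.

```python
import json
import urllib.request


def model_ids(listing):
    """Extract the model names from a parsed /v1/models response."""
    return [entry["id"] for entry in listing.get("data", [])]


def fetch_model_ids(base_url="http://localhost:8000/v1"):
    """Query a running API server and return the names of the models it serves."""
    with urllib.request.urlopen(f"{base_url}/models") as resp:
        return model_ids(json.load(resp))
```

Calling fetch_model_ids() is a quick way to confirm which model names the server accepts in the "model" field of the requests that follow.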
Chat Completions:
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "vicuna-7b-v1.5",
"messages": [{"role": "user", "content": "Hello! What is your name?"}]
}'
Text Completions:
curl http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "vicuna-7b-v1.5",
"prompt": "Once upon a time",
"max_tokens": 41,
"temperature": 0.5
}'
Embeddings:
curl http://localhost:8000/v1/embeddings \
-H "Content-Type: application/json" \
-d '{
"model": "vicuna-7b-v1.5",
"input": "Hello world!"
}'
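The three POST requests above share one shape: a JSON body sent with a Content-Type: application/json header. As a sketch, they can be reproduced from Python without the openai package. The helpers build_request and send are hypothetical names, and the base URL is assumed to match the server launched earlier.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1"  # assumed server address


def build_request(endpoint, payload, base_url=BASE_URL):
    """Build a JSON POST request equivalent to the cURL commands above."""
    return urllib.request.Request(
        f"{base_url}/{endpoint.lstrip('/')}",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request):
    """Send the request to a running server and return the parsed JSON response."""
    with urllib.request.urlopen(request) as resp:
        return json.load(resp)


# Same body as the Embeddings cURL example; nothing is sent until send() is called.
embedding_request = build_request(
    "embeddings", {"model": "vicuna-7b-v1.5", "input": "Hello world!"}
)
```

With the server running, send(embedding_request) returns the parsed response body as a dictionary.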
If you want to run multiple models on the same machine and in the same process, you can replace the model_worker step above with a multi-model variant:
python3 -m fastchat.serve.multi_model_worker \
--model-path lmsys/vicuna-7b-v1.5 \
--model-names vicuna-7b-v1.5 \
--model-path lmsys/longchat-7b-16k \
--model-names longchat-7b-16k
This loads both models onto the same accelerator and in the same process. This works best when using a Peft model that triggers the PeftModelAdapter.
TODO: Base model weight optimization will be fixed once this Peft issue is resolved.
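When the list of models grows, the repeated --model-path / --model-names pairs of the multi-model command can be generated rather than typed by hand. The sketch below only assembles the argument list; multi_model_worker_cmd is a hypothetical helper, and using sys.executable in place of python3 is an assumption about how the worker is started.

```python
import sys


def multi_model_worker_cmd(models):
    """Build the multi_model_worker command line.

    `models` maps each model path to the name it should be served under.
    """
    cmd = [sys.executable, "-m", "fastchat.serve.multi_model_worker"]
    for path, name in models.items():
        # One --model-path / --model-names pair per model, in insertion order.
        cmd += ["--model-path", path, "--model-names", name]
    return cmd


cmd = multi_model_worker_cmd(
    {
        "lmsys/vicuna-7b-v1.5": "vicuna-7b-v1.5",
        "lmsys/longchat-7b-16k": "longchat-7b-16k",
    }
)
```

The resulting list can be passed to subprocess.Popen(cmd) to start the worker.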
This OpenAI-compatible API server supports LangChain. See LangChain Integration for details.
By default, a timeout error will occur if a model worker does not respond within 100 seconds. If your model/hardware is slower, you can change this timeout through an environment variable:
export FASTCHAT_WORKER_API_TIMEOUT=<larger timeout in seconds>
If you hit an out-of-memory (OOM) error while creating embeddings, you can use a smaller batch size by setting:
export FASTCHAT_WORKER_API_EMBEDDING_BATCH_SIZE=1
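If the API server is started from a Python script rather than a shell, the two variables above can be passed through the child process environment. This is a sketch under the assumption that the variables must be present in the server process's environment when it starts. The helper names fastchat_env and launch_api_server are hypothetical.

```python
import os
import subprocess
import sys


def fastchat_env(timeout_s=None, embedding_batch_size=None, base_env=None):
    """Return a copy of the environment with the FastChat overrides applied."""
    env = dict(os.environ if base_env is None else base_env)
    if timeout_s is not None:
        env["FASTCHAT_WORKER_API_TIMEOUT"] = str(int(timeout_s))
    if embedding_batch_size is not None:
        env["FASTCHAT_WORKER_API_EMBEDDING_BATCH_SIZE"] = str(int(embedding_batch_size))
    return env


def launch_api_server(env, host="localhost", port=8000):
    """Start the API server as a child process with the given environment."""
    cmd = [
        sys.executable, "-m", "fastchat.serve.openai_api_server",
        "--host", host, "--port", str(port),
    ]
    return subprocess.Popen(cmd, env=env)
```

For example, launch_api_server(fastchat_env(timeout_s=300, embedding_batch_size=1)) starts the server with a 300-second timeout and an embedding batch size of 1.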
Some features to be implemented:
- Support more parameters like logprobs, logit_bias, user, presence_penalty and frequency_penalty
- Model details (permissions, owner and create time)
- Edits API
- Rate Limitation Settings