[Serve] Proxy w/ retry #3395
Conversation
Force-pushed from 5e21426 to 9d449e7, then from 9d449e7 to f8bb15e.
@Michaelvll This is ready for a look now 🫡 I'm still running smoke tests and adding a new streaming test; will report back later.
Awesome @cblmemo! Went through the PR and the code looks mostly good to me! Left several comments. I will give it a try soon.
@@ -33,7 +33,7 @@ Client disconnected, stopping computation.
You can also run

```bash
-curl -L http://<endpoint>/
+curl http://<endpoint>/
```

and manually Ctrl + C to cancel the request and see logs.
Can we use an OpenAI API client to stream the output and check if abortion/cancellation work?
The smoke test `test_skyserve_cancel` passed. Do you think that is enough?
I mean we should test the integration with the OpenAI API on both the client and the server (`vllm.entrypoints.openai.api_server`) side, to make sure our solution works for the most commonly used library.
Just tested with the llama2 example (both the service YAML using vLLM and the OpenAI client), with and without streaming, and it works well 🫡
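For reference, a rough sketch of that kind of check, assuming the service exposes an OpenAI-compatible API and the `openai>=1.0` client; the endpoint URL and model name below are placeholders, not values taken from this PR:

```python
# Hypothetical test sketch: stream from an OpenAI-compatible SkyServe endpoint
# and interrupt with Ctrl+C to check that cancellation is propagated.
from openai import OpenAI

# Placeholders: replace with the actual service endpoint and model name.
client = OpenAI(base_url='http://<endpoint>/v1', api_key='EMPTY')

try:
    stream = client.chat.completions.create(
        model='meta-llama/Llama-2-7b-chat-hf',
        messages=[{'role': 'user', 'content': 'Tell me a long story.'}],
        stream=True,
    )
    for chunk in stream:
        delta = chunk.choices[0].delta.content
        if delta:
            print(delta, end='', flush=True)
except KeyboardInterrupt:
    # Ctrl+C here should abort the request; the replica logs should show
    # that the underlying generation was cancelled as well.
    print('\nRequest aborted by client.')
```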
Does aborting the request work for it as well?
sky/serve/load_balancer.py (outdated)

```python
response = await self._proxy_request_to(ready_replica_url, request)
if response is not None:
    return response
```
What will happen if the replica dies when the streaming is not finished? Will we retry here?
Currently we don't. One challenge here is recovering the state of the stream: after a recovery, the new replica needs to start streaming from the middle so that the client is not aware that an error happened. (E.g., if the total stream is [hello, nice, to, meet, you] and an error happens after [hello, nice], the new replica needs to stream [to, meet, you], not the whole array [hello, nice, to, meet, you].) But how could the new replica know where to start? It might be possible to do some special-case handling for LLM workloads (record some information, ...), but it would be really hard to support general streaming workloads.
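To make the difficulty concrete, here is a purely hypothetical sketch (not implemented in this PR) of what resuming a stream across a replica failure would entail. `pick_replica` and `stream_from` are made-up helpers, and the approach only works if the new replica can deterministically reproduce the exact same chunk sequence, which generally does not hold:

```python
# Hypothetical sketch only -- NOT implemented in this PR. Resuming a stream
# after a replica failure would mean tracking how many chunks were already
# forwarded and skipping that prefix on the retried request; it also assumes
# the new replica regenerates the exact same sequence.
from typing import AsyncIterator, Callable


async def proxy_stream_with_resume(
    pick_replica: Callable[[], str],
    stream_from: Callable[[str], AsyncIterator[bytes]],
) -> AsyncIterator[bytes]:
    delivered = 0  # chunks already forwarded to the client
    while True:
        replica_url = pick_replica()
        try:
            index = 0
            async for chunk in stream_from(replica_url):
                if index >= delivered:
                    yield chunk  # only forward content the client has not seen
                    delivered += 1
                index += 1
            return  # stream finished normally
        except ConnectionError:
            # Replica died mid-stream: retry on another replica and try to
            # fast-forward past the prefix the client already received.
            continue
```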
Makes sense. In that case, if an underlying replica is scaled down or preempted, will the client just hang and take a long time to time out? It might be good to include a TODO here, if that is the case, to optimize for replica failure.
It will be closed immediately due to the lost connection:

```
Hello
world!
Nice
to
curl: (18) transfer closed with outstanding read data remaining
```
If you decide to store state:

> ... special case handling for the LLM workloads (record some information, ...) ...

it would be ideal to be able to turn this off, to ensure one can run with zero retention for security.
Yeah, definitely! We will provide an option to enable or disable this if we implement the feature. Thanks for the advice!
All SkyServe smoke tests passed 🫡
Thanks for the update @cblmemo! Mostly looks good to me. Left several comments
sky/serve/load_balancer.py (outdated)

```python
response = await self._proxy_request_to(ready_replica_url, request)
if response is not None:
    return response
```
> Makes sense. In that case, if an underlying replica is scaled down or preempted, will the client just hang and take a long time to time out? It might be good to include a TODO here, if that is the case, to optimize for replica failure.

TODO:
sky/serve/load_balancer.py (outdated)

```python
for client in client_to_close:
    asyncio.run(client.aclose())
```
Shall we profile the performance of this?
I benchmarked the time to close a client and it is on the order of milliseconds, mostly ranging from 2 to 6 ms. I also changed it to async execution so it runs in the background, as a safeguard.
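For context, a rough sketch of what closing the httpx clients in the background can look like; the names here are illustrative, not the exact code in the PR:

```python
# Illustrative sketch (not the exact PR code): schedule httpx client cleanup
# as a background task instead of blocking the request path on aclose().
import asyncio
import httpx


async def close_clients_in_background(clients_to_close: list) -> None:
    for client in clients_to_close:
        await client.aclose()  # typically takes only a few milliseconds


def schedule_client_cleanup(loop: asyncio.AbstractEventLoop,
                            clients_to_close: list) -> None:
    # Fire-and-forget: the request handler does not wait for the close to finish.
    loop.create_task(close_clients_in_background(clients_to_close))
```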
Tested for abortion and it works as well. Use the following script to launch the LB & worker. UPD: I'll test e2e LLM workloads later.

```python
import fastapi, uvicorn, asyncio
import multiprocessing

from sky.serve import load_balancer

REPLICA_URLS = []
PROCESSES = []
CONTROLLER_PORT = 20018
WORD_TO_STREAM = 'Hello world! Nice to meet you!'
TIME_TO_SLEEP = 0.2


def _start_streaming_replica(port):
    app = fastapi.FastAPI()

    @app.get('/')
    async def stream():

        async def generate_words():
            for word in WORD_TO_STREAM.split() * 1000:
                yield word + "\n"
                print('===========WORKER', word)
                await asyncio.sleep(TIME_TO_SLEEP)

        return fastapi.responses.StreamingResponse(generate_words(),
                                                   media_type="text/plain")

    @app.get('/non-stream')
    async def non_stream():
        return {'message': WORD_TO_STREAM}

    @app.get('/error')
    async def error():
        raise fastapi.HTTPException(status_code=500,
                                    detail='Internal Server Error')

    uvicorn.run(app, host='0.0.0.0', port=port)


def _start_streaming_replica_in_process(port):
    global PROCESSES, REPLICA_URLS
    STREAMING_REPLICA_PROCESS = multiprocessing.Process(
        target=_start_streaming_replica, args=(port,))
    STREAMING_REPLICA_PROCESS.start()
    PROCESSES.append(STREAMING_REPLICA_PROCESS)
    REPLICA_URLS.append(f'http://0.0.0.0:{port}')


def _start_controller():
    app = fastapi.FastAPI()
    flip_flop = False

    @app.post('/controller/load_balancer_sync')
    async def lb_sync(request: fastapi.Request):
        return {'ready_replica_urls': REPLICA_URLS}

    uvicorn.run(app, host='0.0.0.0', port=CONTROLLER_PORT)


def _start_controller_in_process():
    global PROCESSES
    CONTROLLER_PROCESS = multiprocessing.Process(target=_start_controller)
    CONTROLLER_PROCESS.start()
    PROCESSES.append(CONTROLLER_PROCESS)


if __name__ == '__main__':
    try:
        _start_streaming_replica_in_process(7001)
        _start_controller_in_process()
        lb = load_balancer.SkyServeLoadBalancer(
            controller_url=f'http://0.0.0.0:{CONTROLLER_PORT}',
            load_balancer_port=7000)
        lb.run()
    finally:
        for p in PROCESSES:
            p.terminate()
```
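In case it helps reproduce the test, a minimal client sketch for the script above (it assumes the load balancer from the script is listening on port 7000): it streams the response and prints chunks as they arrive; Ctrl+C aborts the request, and killing the replica process mid-stream is one way to observe the disconnect behavior discussed earlier.

```python
# Small client sketch for the script above: stream through the load balancer
# on port 7000 and print chunks as they come. Ctrl+C aborts the request.
import httpx

with httpx.stream('GET', 'http://0.0.0.0:7000/', timeout=None) as response:
    for chunk in response.iter_text():
        print(chunk, end='', flush=True)
```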
Just tested with a modified version of FastChat and the abortion works well. I used the OpenAI client here and manually Ctrl+C'd to abort the request. The YAML I used:

```yaml
service:
  readiness_probe: /v1/models
  replicas: 1

resources:
  ports: 8087
  memory: 32+
  accelerators: L4:1
  disk_size: 1024
  disk_tier: best

envs:
  MODEL_SIZE: 7
  HF_TOKEN: <huggingface-token>

setup: |
  conda activate chatbot
  if [ $? -ne 0 ]; then
    conda create -n chatbot python=3.9 -y
    conda activate chatbot
  fi

  # Install dependencies
  git clone https://github.com/cblmemo/fschat-print-streaming.git fschat
  cd fschat
  git switch print-stream
  pip install -e ".[model_worker,webui]"
  python -c "import huggingface_hub; huggingface_hub.login('${HF_TOKEN}')"

run: |
  conda activate chatbot
  echo 'Starting controller...'
  python -u -m fastchat.serve.controller --host 0.0.0.0 > ~/controller.log 2>&1 &
  sleep 10
  echo 'Starting model worker...'
  python -u -m fastchat.serve.model_worker --host 0.0.0.0 \
    --model-path meta-llama/Llama-2-${MODEL_SIZE}b-chat-hf \
    --num-gpus $SKYPILOT_NUM_GPUS_PER_NODE 2>&1 \
    | tee model_worker.log &
  echo 'Waiting for model worker to start...'
  while ! `cat model_worker.log | grep -q 'Uvicorn running on'`; do sleep 1; done
  echo 'Starting openai api server...'
  python -u -m fastchat.serve.openai_api_server --host 0.0.0.0 --port 8087 | tee ~/openai_api_server.log
```
This is awesome and an important update @cblmemo! The code looks good to me. Once the tests pass, I think it should be good to go.
Co-authored-by: Zhanghao Wu <[email protected]>
Thanks! Rerunning all smoke tests now; will merge after all of them pass 🫡
Fixed a bug introduced in #3484. Merging now.
Sure. The smoke test |
@cblmemo @Michaelvll This seems to have broken my service, which exposes the Gradio UI rather than the API itself. Here's what my service config looks like:
The UI looks all messed up and I just get a bunch of error messages when trying to use it. If I open the URL of the instance directly, it works fine.
Hi @anishchopra! Thanks for the feedback. This is probably because the proxy sends queries to both replicas, but the requests might contain session-related information, which makes them error out; that is why a single-replica deployment works well. Previously, a user was redirected to one replica and all interactions were made with that replica. For now, we would suggest using SkyServe to host your API endpoint and launching the Gradio server manually, pointing it at the service endpoint. If you have further suggestions or requirements, could you help by filing an issue for this? Thanks!
@cblmemo I actually only had one replica running in this case. Your suggestion of launching a Gradio server separately does work; however, I bring up this issue because it points to something not being proxied correctly.
Humm, could you share the output of |
Use a proxy in our load balancer, with retry. It is useful for spot-based serving. (A rough illustrative sketch of the idea follows after the test checklist below.)
TODO:
Tested (run the relevant ones):

- `bash format.sh`
- `pytest tests/test_smoke.py --serve`
- `pytest tests/test_smoke.py::test_fill_in_the_name`
- `bash tests/backward_comaptibility_tests.sh`
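For readers skimming the description, a minimal illustrative sketch of the proxy-with-retry idea, assuming a FastAPI + httpx based load balancer; the actual implementation is in `sky/serve/load_balancer.py` and differs in its details:

```python
# Illustrative sketch of proxying with retry across ready replicas; not the
# real sky/serve/load_balancer.py code, just the shape of the idea.
import fastapi
import httpx

MAX_RETRIES = 3


async def proxy_with_retry(request: fastapi.Request,
                           ready_replica_urls: list,
                           client: httpx.AsyncClient):
    last_exc = None
    for _ in range(MAX_RETRIES):
        for url in ready_replica_urls:
            try:
                proxied = client.build_request(
                    request.method,
                    f'{url}{request.url.path}',
                    headers=request.headers.raw,
                    content=await request.body())
                response = await client.send(proxied, stream=True)
                # Stream the replica's response back to the client.
                return fastapi.responses.StreamingResponse(
                    response.aiter_raw(),
                    status_code=response.status_code,
                    headers=response.headers)
            except httpx.RequestError as exc:
                # Replica unreachable (e.g. spot instance preempted): try the
                # next ready replica instead of failing the request.
                last_exc = exc
    raise fastapi.HTTPException(status_code=503,
                                detail=f'No ready replicas: {last_exc!r}')
```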