[AI Gallery] Add quantized LLMs with Ollama (#3422)
* WIP

* arm64 support

* wip

* wip

* add ollama to ai gallery

* minor edits

* minor edits

* Updates

* comments

* Add 'new' tag
romilbhardwaj authored Apr 11, 2024
1 parent d0f20ab commit 226c1eb
Showing 7 changed files with 391 additions and 3 deletions.
4 changes: 3 additions & 1 deletion README.md
@@ -27,6 +27,7 @@

----
:fire: *News* :fire:
- [Apr, 2024] Using [**Ollama**](https://github.com/ollama/ollama) to deploy quantized LLMs on CPUs and GPUs: [**example**](./llm/ollama/)
- [Mar, 2024] Serve and deploy [**Databricks DBRX**](https://www.databricks.com/blog/introducing-dbrx-new-state-art-open-llm) on your infra: [**example**](./llm/dbrx/)
- [Feb, 2024] Deploying and scaling [**Gemma**](https://blog.google/technology/developers/gemma-open-models/) with SkyServe: [**example**](./llm/gemma/)
- [Feb, 2024] Speed up your LLM deployments with [**SGLang**](https://github.com/sgl-project/sglang) for 5x throughput on SkyServe: [**example**](./llm/sglang/)
@@ -159,14 +160,15 @@ Runnable examples:
- [Vicuna chatbots: Training & Serving](./llm/vicuna/) (from official Vicuna team)
- [Train your own Vicuna on Llama-2](./llm/vicuna-llama-2/)
- [Self-Hosted Llama-2 Chatbot](./llm/llama-2/)
- [Ollama: Quantized LLMs on CPUs](./llm/ollama/)
- [LoRAX](./llm/lorax/)
- [QLoRA](https://github.com/artidoro/qlora/pull/132)
- [LLaMA-LoRA-Tuner](https://github.com/zetavg/LLaMA-LoRA-Tuner#run-on-a-cloud-service-via-skypilot)
- [Tabby: Self-hosted AI coding assistant](https://github.com/TabbyML/tabby/blob/bed723fcedb44a6b867ce22a7b1f03d2f3531c1e/experimental/eval/skypilot.yaml)
- [LocalGPT](./llm/localgpt)
- [Falcon](./llm/falcon)
- Add yours here & see more in [`llm/`](./llm)!
- Framework examples: [PyTorch DDP](https://github.com/skypilot-org/skypilot/blob/master/examples/resnet_distributed_torch.yaml), [DeepSpeed](./examples/deepspeed-multinode/sky.yaml), [JAX/Flax on TPU](https://github.com/skypilot-org/skypilot/blob/master/examples/tpu/tpuvm_mnist.yaml), [Stable Diffusion](https://github.com/skypilot-org/skypilot/tree/master/examples/stable_diffusion), [Detectron2](https://github.com/skypilot-org/skypilot/blob/master/examples/detectron2_docker.yaml), [Distributed](https://github.com/skypilot-org/skypilot/blob/master/examples/resnet_distributed_tf_app.py) [TensorFlow](https://github.com/skypilot-org/skypilot/blob/master/examples/resnet_app_storage.yaml), [Ray Train](examples/distributed_ray_train/ray_train.yaml), [NeMo](https://github.com/skypilot-org/skypilot/blob/master/examples/nemo/nemo.yaml), [programmatic grid search](https://github.com/skypilot-org/skypilot/blob/master/examples/huggingface_glue_imdb_grid_search_app.py), [Docker](https://github.com/skypilot-org/skypilot/blob/master/examples/docker/echo_app.yaml), [Cog](https://github.com/skypilot-org/skypilot/blob/master/examples/cog/), [Unsloth](https://github.com/skypilot-org/skypilot/blob/master/examples/unsloth/unsloth.yaml), and [many more (`examples/`)](./examples).
- Framework examples: [PyTorch DDP](https://github.com/skypilot-org/skypilot/blob/master/examples/resnet_distributed_torch.yaml), [DeepSpeed](./examples/deepspeed-multinode/sky.yaml), [JAX/Flax on TPU](https://github.com/skypilot-org/skypilot/blob/master/examples/tpu/tpuvm_mnist.yaml), [Stable Diffusion](https://github.com/skypilot-org/skypilot/tree/master/examples/stable_diffusion), [Detectron2](https://github.com/skypilot-org/skypilot/blob/master/examples/detectron2_docker.yaml), [Distributed](https://github.com/skypilot-org/skypilot/blob/master/examples/resnet_distributed_tf_app.py) [TensorFlow](https://github.com/skypilot-org/skypilot/blob/master/examples/resnet_app_storage.yaml), [Ray Train](examples/distributed_ray_train/ray_train.yaml), [NeMo](https://github.com/skypilot-org/skypilot/blob/master/examples/nemo/nemo.yaml), [programmatic grid search](https://github.com/skypilot-org/skypilot/blob/master/examples/huggingface_glue_imdb_grid_search_app.py), [Docker](https://github.com/skypilot-org/skypilot/blob/master/examples/docker/echo_app.yaml), [Cog](https://github.com/skypilot-org/skypilot/blob/master/examples/cog/), [Unsloth](https://github.com/skypilot-org/skypilot/blob/master/examples/unsloth/unsloth.yaml), [Ollama](https://github.com/skypilot-org/skypilot/blob/master/llm/ollama) and [many more (`examples/`)](./examples).

Follow updates:
- [Twitter](https://twitter.com/skypilot_org)
1 change: 1 addition & 0 deletions docs/source/_gallery_original/frameworks/ollama.md
3 changes: 2 additions & 1 deletion docs/source/_gallery_original/index.rst
@@ -19,10 +19,11 @@ Contents
.. toctree::
:maxdepth: 1
:caption: Inference Engines
:caption: Inference Frameworks

vLLM <frameworks/vllm>
Hugging Face TGI <frameworks/tgi>
Ollama <frameworks/ollama>
SGLang <frameworks/sglang>
LoRAX <frameworks/lorax>

1 change: 1 addition & 0 deletions docs/source/_static/custom.js
@@ -28,6 +28,7 @@ document.addEventListener('DOMContentLoaded', () => {
{ selector: '.caption-text', text: 'SkyServe: Model Serving' },
{ selector: '.toctree-l1 > a', text: 'Running on Kubernetes' },
{ selector: '.toctree-l1 > a', text: 'DBRX (Databricks)' },
{ selector: '.toctree-l1 > a', text: 'Ollama' },
];
newItems.forEach(({ selector, text }) => {
document.querySelectorAll(selector).forEach((el) => {
3 changes: 2 additions & 1 deletion docs/source/docs/index.rst
@@ -78,6 +78,7 @@ Runnable examples:
* `Vicuna chatbots: Training & Serving <https://github.com/skypilot-org/skypilot/tree/master/llm/vicuna>`_ (from official Vicuna team)
* `Train your own Vicuna on Llama-2 <https://github.com/skypilot-org/skypilot/blob/master/llm/vicuna-llama-2>`_
* `Self-Hosted Llama-2 Chatbot <https://github.com/skypilot-org/skypilot/tree/master/llm/llama-2>`_
* `Ollama: Quantized LLMs on CPUs <https://github.com/skypilot-org/skypilot/tree/master/llm/ollama>`_
* `LoRAX <https://github.com/skypilot-org/skypilot/tree/master/llm/lorax/>`_
* `QLoRA <https://github.com/artidoro/qlora/pull/132>`_
* `LLaMA-LoRA-Tuner <https://github.com/zetavg/LLaMA-LoRA-Tuner#run-on-a-cloud-service-via-skypilot>`_
@@ -86,7 +87,7 @@ Runnable examples:
* `Falcon <https://github.com/skypilot-org/skypilot/tree/master/llm/falcon>`_
* Add yours here & see more in `llm/ <https://github.com/skypilot-org/skypilot/tree/master/llm>`_!

* Framework examples: `PyTorch DDP <https://github.com/skypilot-org/skypilot/blob/master/examples/resnet_distributed_torch.yaml>`_, `DeepSpeed <https://github.com/skypilot-org/skypilot/blob/master/examples/deepspeed-multinode/sky.yaml>`_, `JAX/Flax on TPU <https://github.com/skypilot-org/skypilot/blob/master/examples/tpu/tpuvm_mnist.yaml>`_, `Stable Diffusion <https://github.com/skypilot-org/skypilot/tree/master/examples/stable_diffusion>`_, `Detectron2 <https://github.com/skypilot-org/skypilot/blob/master/examples/detectron2_docker.yaml>`_, `Distributed <https://github.com/skypilot-org/skypilot/blob/master/examples/resnet_distributed_tf_app.py>`_ `TensorFlow <https://github.com/skypilot-org/skypilot/blob/master/examples/resnet_app_storage.yaml>`_, `NeMo <https://github.com/skypilot-org/skypilot/blob/master/examples/nemo/nemo.yaml>`_, `programmatic grid search <https://github.com/skypilot-org/skypilot/blob/master/examples/huggingface_glue_imdb_grid_search_app.py>`_, `Docker <https://github.com/skypilot-org/skypilot/blob/master/examples/docker/echo_app.yaml>`_, `Cog <https://github.com/skypilot-org/skypilot/blob/master/examples/cog/>`_, `Unsloth <https://github.com/skypilot-org/skypilot/blob/master/examples/unsloth/unsloth.yaml>`_, and `many more <https://github.com/skypilot-org/skypilot/tree/master/examples>`_.
* Framework examples: `PyTorch DDP <https://github.com/skypilot-org/skypilot/blob/master/examples/resnet_distributed_torch.yaml>`_, `DeepSpeed <https://github.com/skypilot-org/skypilot/blob/master/examples/deepspeed-multinode/sky.yaml>`_, `JAX/Flax on TPU <https://github.com/skypilot-org/skypilot/blob/master/examples/tpu/tpuvm_mnist.yaml>`_, `Stable Diffusion <https://github.com/skypilot-org/skypilot/tree/master/examples/stable_diffusion>`_, `Detectron2 <https://github.com/skypilot-org/skypilot/blob/master/examples/detectron2_docker.yaml>`_, `Distributed <https://github.com/skypilot-org/skypilot/blob/master/examples/resnet_distributed_tf_app.py>`_ `TensorFlow <https://github.com/skypilot-org/skypilot/blob/master/examples/resnet_app_storage.yaml>`_, `NeMo <https://github.com/skypilot-org/skypilot/blob/master/examples/nemo/nemo.yaml>`_, `programmatic grid search <https://github.com/skypilot-org/skypilot/blob/master/examples/huggingface_glue_imdb_grid_search_app.py>`_, `Docker <https://github.com/skypilot-org/skypilot/blob/master/examples/docker/echo_app.yaml>`_, `Cog <https://github.com/skypilot-org/skypilot/blob/master/examples/cog/>`_, `Unsloth <https://github.com/skypilot-org/skypilot/blob/master/examples/unsloth/unsloth.yaml>`_, `Ollama <https://github.com/skypilot-org/skypilot/blob/master/llm/ollama>`_ and `many more <https://github.com/skypilot-org/skypilot/tree/master/examples>`_.

Follow updates:

299 changes: 299 additions & 0 deletions llm/ollama/README.md
@@ -0,0 +1,299 @@
# Ollama: Run quantized LLMs on CPUs and GPUs
<p align="center">
<img src="https://i.imgur.com/HfqnGVA.png" width="400">
</p>

[Ollama](https://github.com/ollama/ollama) is a popular library for running LLMs on both CPUs and GPUs.
It supports a wide range of models, including quantized versions of `llama2`, `llama2:70b`, `mistral`, `phi`, `gemma:7b` and many [more](https://ollama.com/library).
You can use SkyPilot to run these models on CPU instances on any cloud provider, Kubernetes cluster, or even on your local machine.
And if your instance has GPUs, Ollama will automatically use them for faster inference.

In this example, you will run a quantized version of Llama2 on 4 CPUs with 8GB of memory, and then scale it up to more replicas with SkyServe.

## Prerequisites
To get started, install the latest version of SkyPilot:

```bash
pip install "skypilot-nightly[all]"
```

For detailed installation instructions, please refer to the [installation guide](https://skypilot.readthedocs.io/en/latest/getting-started/installation.html).

Once installed, run `sky check` to verify you have cloud access.

### [Optional] Running locally on your machine
If you do not have cloud access, you can also run this recipe on your local machine by creating a local Kubernetes cluster with `sky local up`.

Make sure you have KinD installed and Docker running with 5 or more CPUs and 10GB or more of memory allocated to the [Docker runtime](https://kind.sigs.k8s.io/docs/user/quick-start/#settings-for-docker-desktop).
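
If you want to double-check what your Docker runtime currently has available, one way (assuming the `docker` CLI is on your PATH) is:

```console
# Show the CPUs and memory currently allocated to the Docker runtime
docker info --format '{{.NCPU}} CPUs, {{.MemTotal}} bytes of memory'
```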

To create a local Kubernetes cluster, run:

```console
sky local up
```

<details>
<summary>Example outputs:</summary>

```console
$ sky local up
Creating local cluster...
To view detailed progress: tail -n100 -f ~/sky_logs/sky-2024-04-09-19-14-03-599730/local_up.log
I 04-09 19:14:33 log_utils.py:79] Kubernetes is running.
I 04-09 19:15:33 log_utils.py:117] SkyPilot CPU image pulled.
I 04-09 19:15:49 log_utils.py:123] Nginx Ingress Controller installed.
⠸ Running sky check...
Local Kubernetes cluster created successfully with 16 CPUs.
`sky launch` can now run tasks locally.
Hint: To change the number of CPUs, change your docker runtime settings. See https://kind.sigs.k8s.io/docs/user/quick-start/#settings-for-docker-desktop for more info.
```
</details>

After running this, `sky check` should show that you have access to a Kubernetes cluster.

## SkyPilot YAML
To run Ollama with SkyPilot, create a YAML file with the following content:

<details>
<summary>Click to see the full recipe YAML</summary>

```yaml
envs:
  MODEL_NAME: llama2 # mistral, phi, other ollama supported models
  OLLAMA_HOST: 0.0.0.0:8888 # Host and port for Ollama to listen on

resources:
  cpus: 4+
  memory: 8+  # 8 GB+ for 7B models, 16 GB+ for 13B models, 32 GB+ for 33B models
  # accelerators: L4:1  # No GPUs necessary for Ollama, but you can use them to run inference faster
  ports: 8888

service:
  replicas: 2
  # An actual request for readiness probe.
  readiness_probe:
    path: /v1/chat/completions
    post_data:
      model: $MODEL_NAME
      messages:
        - role: user
          content: Hello! What is your name?
      max_tokens: 1

setup: |
  # Install Ollama
  if [ "$(uname -m)" == "aarch64" ]; then
    # For Apple silicon support
    sudo curl -L https://ollama.com/download/ollama-linux-arm64 -o /usr/bin/ollama
  else
    sudo curl -L https://ollama.com/download/ollama-linux-amd64 -o /usr/bin/ollama
  fi
  sudo chmod +x /usr/bin/ollama

  # Start `ollama serve` and capture PID to kill it after pull is done
  ollama serve &
  OLLAMA_PID=$!

  # Wait for ollama to be ready
  IS_READY=false
  for i in {1..20}; do
    ollama list && IS_READY=true && break
    sleep 5
  done
  if [ "$IS_READY" = false ]; then
    echo "Ollama was not ready after 100 seconds. Exiting."
    exit 1
  fi

  # Pull the model
  ollama pull $MODEL_NAME
  echo "Model $MODEL_NAME pulled successfully."

  # Kill `ollama serve` after pull is done
  kill $OLLAMA_PID

run: |
  # Run `ollama serve` in the foreground
  echo "Serving model $MODEL_NAME"
  ollama serve
```
</details>

You can also get the full YAML [here](https://github.com/skypilot-org/skypilot/tree/master/llm/ollama/ollama.yaml).

## Serving Llama2 with a CPU instance

Start serving Llama2 on a 4 CPU instance with the following command:
```console
sky launch ollama.yaml -c ollama --detach-run
```

Wait until the `sky launch` command returns successfully.

<details>
<summary>Example outputs:</summary>

```console
...
== Optimizer ==
Target: minimizing cost
Estimated cost: $0.0 / hour

Considered resources (1 node):
-------------------------------------------------------------------------------------------------------
CLOUD INSTANCE vCPUs Mem(GB) ACCELERATORS REGION/ZONE COST ($) CHOSEN
-------------------------------------------------------------------------------------------------------
Kubernetes 4CPU--8GB 4 8 - kubernetes 0.00 ✔
AWS c6i.xlarge 4 8 - us-east-1 0.17
Azure Standard_F4s_v2 4 8 - eastus 0.17
GCP n2-standard-4 4 16 - us-central1-a 0.19
Fluidstack rec3pUyh6pNkIjCaL 6 24 RTXA4000:1 norway_4_eu 0.64
-------------------------------------------------------------------------------------------------------
...
```

</details>
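
Because the task was launched with `--detach-run`, the `ollama serve` output is not streamed to your terminal; you can follow the task's logs at any time with:

```console
sky logs ollama
```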

**💡Tip:** You can further reduce costs by using the `--use-spot` flag to run on spot instances.
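
For example:

```console
sky launch ollama.yaml -c ollama --detach-run --use-spot
```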

To launch a different model, use the `MODEL_NAME` environment variable:

```console
sky launch ollama.yaml -c ollama --detach-run --env MODEL_NAME=mistral
```

Ollama supports `llama2`, `llama2:70b`, `mistral`, `phi`, `gemma:7b` and many more models.
See the full list [here](https://ollama.com/library).
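
Larger models need more memory. Here is a sketch of launching a bigger model with extra memory (this assumes your SkyPilot version supports the `--memory` override flag; alternatively, edit `memory:` in the YAML, and note the sizing below is a rough guess):

```console
# llama2:70b is a much larger model; request more memory for it
sky launch ollama.yaml -c ollama --detach-run --env MODEL_NAME=llama2:70b --memory 64+
```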

Once the `sky launch` command returns successfully, you can interact with the model via
- Standard OpenAI-compatible endpoints (e.g., `/v1/chat/completions`)
- [Ollama API](https://github.com/ollama/ollama/blob/main/docs/api.md)

To curl `/v1/chat/completions`:
```console
ENDPOINT=$(sky status --endpoint 8888 ollama)
curl $ENDPOINT/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "llama2",
"messages": [
{
"role": "system",
"content": "You are a helpful assistant."
},
{
"role": "user",
"content": "Who are you?"
}
]
}'
```

<details>
<summary>Example curl response:</summary>

```json
{
"id": "chatcmpl-322",
"object": "chat.completion",
"created": 1712015174,
"model": "llama2",
"system_fingerprint": "fp_ollama",
"choices": [
{
"index": 0,
"message": {
"role": "assistant",
"content": "Hello there! *adjusts glasses* I am Assistant, your friendly and helpful AI companion. My purpose is to assist you in any way possible, from answering questions to providing information on a wide range of topics. Is there something specific you would like to know or discuss? Feel free to ask me anything!"
},
"finish_reason": "stop"
}
],
"usage": {
"prompt_tokens": 29,
"completion_tokens": 68,
"total_tokens": 97
}
}
```
</details>
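
You can also call Ollama's native API on the same endpoint. For example, a minimal `/api/generate` request (fields as documented in the Ollama API docs linked above):

```console
curl $ENDPOINT/api/generate -d '{
  "model": "llama2",
  "prompt": "Why is the sky blue?",
  "stream": false
}'
```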

**💡Tip:** To speed up inference, you can use GPUs by specifying the `accelerators` field in the YAML.
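
For example, a sketch of launching with a single L4 GPU (availability and cost depend on your cloud; you can also uncomment the `accelerators:` line in the YAML instead):

```console
sky launch ollama.yaml -c ollama --detach-run --gpus L4:1
```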

To stop the instance:
```console
sky stop ollama
```

To shut down all resources:
```console
sky down ollama
```

If you are using a local Kubernetes cluster created with `sky local up`, shut it down with:
```console
sky local down
```

## Serving LLMs on CPUs at scale with SkyServe

After experimenting with the model, you can deploy multiple replicas of the model with autoscaling and load-balancing using SkyServe.

With no change to the YAML, launch a fully managed service on your infra:
```console
sky serve up ollama.yaml -n ollama
```
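
As with `sky launch`, you can serve a different model by overriding the `MODEL_NAME` environment variable (this assumes `--env` is accepted by `sky serve up` in your SkyPilot version, as it is by `sky launch`):

```console
sky serve up ollama.yaml -n ollama --env MODEL_NAME=mistral
```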

Wait until the service is ready:
```console
watch -n10 sky serve status ollama
```

<details>
<summary>Example outputs:</summary>

```console
Services
NAME VERSION UPTIME STATUS REPLICAS ENDPOINT
ollama 1 3m 15s READY 2/2 34.171.202.102:30001

Service Replicas
SERVICE_NAME ID VERSION IP LAUNCHED RESOURCES STATUS REGION
ollama 1 1 34.69.185.170 4 mins ago 1x GCP(vCPU=4) READY us-central1
ollama 2 1 35.184.144.198 4 mins ago 1x GCP(vCPU=4) READY us-central1
```
</details>


Get a single endpoint that load-balances across replicas:
```console
ENDPOINT=$(sky serve status --endpoint ollama)
```

**💡Tip:** SkyServe fully manages the lifecycle of your replicas. For example, if a spot replica is preempted, the controller will automatically replace it. This significantly reduces the operational burden while saving costs.
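
A sketch of what requesting spot replicas could look like in the recipe's `resources` section (the rest of the YAML stays the same):

```yaml
resources:
  cpus: 4+
  memory: 8+
  use_spot: true  # Run replicas on spot instances; SkyServe replaces preempted ones
  ports: 8888
```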

To curl the endpoint:
```console
curl -L $ENDPOINT/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "llama2",
"messages": [
{
"role": "system",
"content": "You are a helpful assistant."
},
{
"role": "user",
"content": "Who are you?"
}
]
}'
```

To shut down all resources:
```console
sky serve down ollama
```

See more details in [SkyServe docs](https://skypilot.readthedocs.io/en/latest/serving/sky-serve.html).
