/v1/embeddings please #310
Replies: 14 comments 13 replies
-
Hi! Adding support to return embeddings is definitely on our roadmap. In addition, I believe the modifications needed to support embeddings are not very complicated, so this would be a very good first issue. If you're interested, feel free to contribute!
-
I looked into it, hoping to pick it up as a "good first issue", but did not find it straightforward to implement. I'm afraid any changes I made would just be hacks. If you have any pointers on how and where I could best add it, I'd be happy to give it a second look.
-
@zhuohan123 and @yuhai-china are you talking about a multilingual or a monolingual model?
-
@Vinno97 are you still working on it? I would love to help, since I'm interested in using it too.
-
No, I haven't come back to it. I had hoped I could just create a new endpoint that hooked into the model and returned the last hidden state, but I found that the LLMEngine was built so much around text generation that I didn't see myself adding embeddings into it easily and cleanly. But do give it a try! I must admit I spent less than an hour looking into it.
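For illustration only, here is a minimal sketch of what "returning the last hidden state" as an embedding could mean, assuming the engine exposed per-token hidden states and an attention mask; the function name and the choice of mean pooling are mine, not anything that exists in vLLM:

```python
# Hypothetical sketch, not vLLM code: turn a model's last hidden state into a
# fixed-size embedding with masked mean pooling.
import torch

def pool_last_hidden_state(last_hidden_state: torch.Tensor,
                           attention_mask: torch.Tensor) -> torch.Tensor:
    """last_hidden_state: [batch, seq_len, hidden]; attention_mask: [batch, seq_len]."""
    mask = attention_mask.unsqueeze(-1).to(last_hidden_state.dtype)
    summed = (last_hidden_state * mask).sum(dim=1)   # sum over non-padding tokens
    counts = mask.sum(dim=1).clamp(min=1e-9)         # number of real tokens per row
    return summed / counts                           # [batch, hidden]
```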
-
@yuhai-china @Vinno97 @bm777 Thanks for your interest in this. I previously misunderstood this API as returning the hidden states of the generated sequence, which would be easy. However, it turns out this API is for a completely different set of models (i.e., BERT-like embedding models). The current vLLM mainly focuses on autoregressive generation, and for embeddings neither paged attention nor continuous batching helps performance. Therefore, I think it's better to use other libraries for embeddings for now. In the future, when we extend the scope of vLLM, we will look into this again.
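As one example of the "other libraries" route, a standalone library like sentence-transformers can cover this today; a minimal sketch (the model name is only an example):

```python
# Minimal sketch of computing embeddings outside vLLM with sentence-transformers.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
embeddings = model.encode([
    "vLLM is a fast inference engine.",
    "Embeddings map text to vectors.",
])
print(embeddings.shape)  # (2, 384) for this particular model
```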
-
Moving this issue to Discussions, as it's more of a longer-term plan.
-
Does anyone have recommendations for tools like vLLM for embedding models?
-
While waiting for this major vLLM feature, I created a very simple merged version of vLLM and HuggingFace Text Embeddings Inference to get one API with the full set of OpenAI endpoints (/v1/embeddings, /v1/chat/completions, ...): https://github.com/leoguillaume/VLLMEmbeddings
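For the curious, the general shape of such a merge can be as simple as a thin proxy that routes /v1/embeddings to a Text Embeddings Inference server and everything else to a vLLM server. The sketch below only illustrates that idea (ports and routing are assumptions), it is not the linked repo's actual code:

```python
# Hypothetical routing sketch: forward /v1/embeddings to a TEI server and all
# other /v1/* calls to a vLLM OpenAI-compatible server.
import httpx
from fastapi import FastAPI, Request
from fastapi.responses import JSONResponse

VLLM_URL = "http://localhost:8000"  # assumed vLLM OpenAI-compatible server
TEI_URL = "http://localhost:8080"   # assumed Text Embeddings Inference server

app = FastAPI()

@app.post("/v1/{path:path}")
async def proxy(path: str, request: Request):
    # Route embeddings to TEI, everything else (chat, completions, ...) to vLLM.
    upstream = TEI_URL if path == "embeddings" else VLLM_URL
    body = await request.json()
    async with httpx.AsyncClient() as client:
        resp = await client.post(f"{upstream}/v1/{path}", json=body, timeout=120)
    return JSONResponse(status_code=resp.status_code, content=resp.json())
```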
-
This looks like it's been implemented! #3734
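For later readers, a quick usage sketch against a vLLM OpenAI-compatible server started with an embedding model; the model name, port, and startup command below are assumptions for illustration, not a statement of exactly which models are supported:

```python
# Assumes a vLLM OpenAI-compatible server is already running locally with an
# embedding model, e.g. something like:
#   python -m vllm.entrypoints.openai.api_server --model intfloat/e5-mistral-7b-instruct
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
resp = client.embeddings.create(
    model="intfloat/e5-mistral-7b-instruct",   # illustrative model name
    input=["Hello world", "vLLM embeddings"],
)
print(len(resp.data), len(resp.data[0].embedding))
```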
-
Hi all, as output of the generate() method, I would also like to get the hidden_states associated with the generated sequences. As far as I searched, this wasn't available; has it been implemented now?
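In the meantime, one workaround outside vLLM is plain Hugging Face transformers, whose generate() can return hidden states; a minimal sketch (the model name is only an example, and generation will be slower than vLLM):

```python
# Sketch using Hugging Face transformers (not vLLM): collect the hidden states
# of generated tokens via return_dict_in_generate / output_hidden_states.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "gpt2"  # illustrative model
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)

inputs = tokenizer("Hello, my name is", return_tensors="pt")
out = model.generate(
    **inputs,
    max_new_tokens=8,
    return_dict_in_generate=True,
    output_hidden_states=True,
)
# out.hidden_states is a tuple with one entry per generation step; each entry
# is a tuple of per-layer tensors. Take the last layer's vector for the newest
# token at each step.
last_layer_per_step = [step[-1][:, -1, :] for step in out.hidden_states]
gen_hidden = torch.stack(last_layer_per_step, dim=1)  # [batch, new_tokens, hidden]
print(gen_hidden.shape)
```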
-
Have you found an alternative method?
-
So is it supported now?
-
Is it supported? If possible, could this be added to the vllm_worker of FastChat? Thanks. https://github.com/lm-sys/FastChat/blob/main/fastchat/serve/vllm_worker.py
-
When will the /v1/embeddings API be available?
Thank you