Replies: 6 comments 1 reply
-
Yes, I have also encountered this: with the same prompt, the same sentence gets generated.
-
Yes, I've encountered the same problem. I would also like to know how to solve it.
-
Same here.
-
#590 To be clear, the repetition/potential degradation issue is not related to special tokens such as `</s>`.
-
I have seen this behavior myself. It's blocking me.
-
I also noticed this issue. I think it could be solved by adding an option to use an HF-style sampler; it should be possible to match HF's behavior exactly.
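Not an official fix, just a minimal sketch of what an HF-style sampling step looks like (plain temperature sampling, as in transformers' `do_sample=True` path with no top-k/top-p), in case someone wants to compare it against vLLM's sampler:

```python
import torch

def hf_style_sample(logits: torch.Tensor, temperature: float = 0.7) -> int:
    # Scale logits by temperature, softmax, then draw one token id,
    # mirroring HF generate() with do_sample=True and only temperature set.
    probs = torch.softmax(logits / temperature, dim=-1)
    return torch.multinomial(probs, num_samples=1).item()
```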
-
Hi guys, great work!
I have been experimenting with the library for several weeks, and I immediately noticed that sampled tokens (with the same temperature and other settings) are significantly more deterministic with vLLM than with HF Transformers using the same models. With temperature below 0.7, the first 5-10 sampled tokens are often exactly the same across a few different generations, sometimes even recreating the original text from the dataset verbatim, as if greedy decoding were happening (when it is not). This unfortunately leads to a significant repetition issue I have never seen with HF.
Initially I thought this had to do with special tokens (such as `</s>`), but that was not the case. In the meantime, I have also been checking and modifying the codebase to see if there is any discrepancy in the sampling process, but I have not been able to pin down the difference.
Has anyone else experienced similar behavior?
Might be related: #450
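In case it helps anyone reproduce this, here is a rough side-by-side comparison one could run (model name and prompt are placeholders; assumes the public vLLM and transformers generation APIs):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from vllm import LLM, SamplingParams

model_id = "meta-llama/Llama-2-7b-hf"  # placeholder model
prompt = "The quick brown fox"         # placeholder prompt

# vLLM: sample the same prompt several times with temperature < 0.7
llm = LLM(model=model_id)
params = SamplingParams(temperature=0.5, top_p=0.9, max_tokens=32)
for out in llm.generate([prompt] * 5, params):
    print("vLLM:", out.outputs[0].text)

# HF Transformers: same model, same sampling settings
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)
inputs = tok(prompt, return_tensors="pt").to(model.device)
for _ in range(5):
    ids = model.generate(
        **inputs, do_sample=True, temperature=0.5, top_p=0.9, max_new_tokens=32
    )
    print("HF:  ", tok.decode(ids[0, inputs.input_ids.shape[1]:],
                              skip_special_tokens=True))
```

If vLLM's five completions start with the same 5-10 tokens while the HF runs diverge, that matches the behavior described above.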