
During forward inference with the embedding model, the same text at different batch sizes produces embeddings that differ in the last few decimal places #1113

Open
liangleikun opened this issue Sep 18, 2024 · 4 comments

Comments

@liangleikun

liangleikun commented Sep 18, 2024

During forward inference with the embedding model, the same text encoded at different batch sizes produces embeddings that differ in the last few decimal places. What causes this? In theory, identical inputs should produce exactly identical outputs.
The example code used:

  from FlagEmbedding import FlagModel

  sentences_1 = ["样例数据"]
  sentences_2 = ["样例数据", "样例数据", "样例数据", "样例数据"]
  model_path = './model_dir/bge-base-zh-v1.5'
  model = FlagModel(model_path,
                    query_instruction_for_retrieval="为这个句子生成表示以用于检索相关文章:",
                    # query_instruction_for_retrieval="",
                    use_fp16=False)  # Setting use_fp16 to True speeds up computation with a slight performance degradation
  embeddings_1 = model.encode(sentences_1)
  embeddings_2 = model.encode(sentences_2)

  print(embeddings_1[0][0])
  print(embeddings_2[0][0])

Output:
-0.0024678216
-0.0024678234
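
For scale, the two printed values differ by roughly 1.8e-9, which is well below float32 round-off at this magnitude. A quick check of that (a sketch, assuming NumPy; the literal values are the ones reported above):

  import numpy as np

  a = np.float32(-0.0024678216)
  b = np.float32(-0.0024678234)
  print(abs(a - b))                               # ~1.8e-09
  print(np.allclose(a, b, rtol=1e-5, atol=1e-8))  # True: within float32 tolerance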

@smoothdvd

Try setting the seed to 0.

@liangleikun
Author

Try setting the seed to 0.

The difference is still there, and it shouldn't be related to the seed anyway: this isn't training, it's inference with the model in eval(). In theory there should be no randomness at inference time, so I can't figure out where the difference comes from.
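
Worth noting: this kind of difference does not require any randomness. With a different batch shape the backend may pick a different kernel or accumulation order, and float32 addition is not associative, so the last bits of the result can change. A minimal sketch (assuming PyTorch; whether the last bits actually differ depends on your CPU/GPU and BLAS backend):

  import torch

  torch.manual_seed(0)
  layer = torch.nn.Linear(768, 768).eval()
  x = torch.randn(1, 768)

  with torch.no_grad():
      single = layer(x)                      # the row on its own (batch size 1)
      batched = layer(x.repeat(4, 1))[:1]    # the same row inside a batch of 4

  print(torch.equal(single, batched))        # may be False on some backends
  print((single - batched).abs().max())      # difference, if any, is tiny (~1e-7 or smaller)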

@ZiyiXia
Collaborator

ZiyiXia commented Sep 24, 2024

This doesn't look like randomness; it looks more like precision loss. I reproduced a similar situation when running inference on CPU, but have not observed the problem with GPU inference so far. My guess is the error comes from how the hardware loads the model, transfers data, or carries out the computation.
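
If that is the case, the two embeddings should agree to within normal float32 tolerance and behave identically for retrieval. One way to confirm, reusing embeddings_1 and embeddings_2 from the snippet above (a sketch, assuming NumPy):

  import numpy as np

  diff = np.max(np.abs(embeddings_1[0] - embeddings_2[0]))
  print(diff)                                               # expected to be ~1e-8 or smaller
  print(np.allclose(embeddings_1[0], embeddings_2[0],
                    rtol=1e-5, atol=1e-6))                  # True if it is only round-off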

@smoothdvd

[Screenshot 2024-09-24 at 4:22:19 PM]

I tried it in Codespaces and got the same result.
