Clarification on Prompt Usage and Special Tokens in LLARA-Passage Code #1129

Open

jhy12 opened this issue Oct 7, 2024 · 1 comment

jhy12 commented Oct 7, 2024

Dear Authors,

Firstly, thank you for your insightful paper, "Llama2Vec: Unsupervised Adaptation of Large Language Models for Dense Retrieval." I found it highly informative and am excited about its potential applications.

While studying the paper and experimenting with the code provided, I noticed some discrepancies between the described methodology and the actual implementation in the BAAI/LLARA-passage model. Specifically, in Section 3.2 and the fine-tuning section of the paper, you mention:

For pretraining, the model uses the NEXT prompt to generate query embeddings and the SELF prompt for answer embeddings.
The NEXT prompt is defined as "The next sentence is:".
The SELF prompt is defined as "The input sentence is:".
You note that while fine-tuning, the formulation can be changed into N2N (NEXT-to-NEXT) or S2S (SELF-to-SELF) but do not indicate that the NEXT and SELF prompts themselves would be altered.
However, in the code from your GitHub repository, the prompts and use of special tokens differ:

Prompts in the Code:

Query Inputs:
prefix = '"'
suffix = '", predict the following passage within eight words: '
Passage Inputs:
prefix = '"'
suffix = '", summarize the above passage within eight words: '
Use of Special Tokens:

The code also appends special tokens to these prompts, which are not mentioned in the paper.
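
For concreteness, here is a minimal sketch (my own illustration, not code from the repository) of how the prompts above appear to be applied to raw text before tokenization. `SPECIAL_TOKENS`, `build_query_input`, and `build_passage_input` are hypothetical names; the exact special tokens used by the repository are not reproduced here:

```python
# Minimal sketch, not the repository's actual code. SPECIAL_TOKENS is a
# hypothetical placeholder for the special tokens the repository appends.
SPECIAL_TOKENS = ""  # placeholder

def build_query_input(query: str) -> str:
    # Query side: quote the text, then ask the model to predict the passage.
    prefix = '"'
    suffix = '", predict the following passage within eight words: '
    return prefix + query + suffix + SPECIAL_TOKENS

def build_passage_input(passage: str) -> str:
    # Passage side: quote the text, then ask the model to summarize it.
    prefix = '"'
    suffix = '", summarize the above passage within eight words: '
    return prefix + passage + suffix + SPECIAL_TOKENS

print(build_query_input("what is dense retrieval?"))
print(build_passage_input("Dense retrieval encodes queries and passages into vectors."))
```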
Given these differences, I have a few questions:

  1. Prompt Discrepancy:

Why do the prompts in the code differ from those described in the paper?
In the paper, the prompts are simply NEXT ("The next sentence is:") and SELF ("The input sentence is:").
The code uses more elaborate prompts like "predict the following passage within eight words" and "summarize the above passage within eight words".
Could you please explain the rationale behind this change?

  2. Alignment with the Paper:

Is the code an exact implementation of the methods described in the paper (which is just a fine-tuned version of Llama2Vec on MS MARCO), or are there modifications specific to the BAAI/LLARA-passage model?
I'm curious if the Llama2Vec fine-tuned on MS MARCO described in your paper was trained using these modified prompts and special tokens, as shown in the code, but not mentioned in detail in your paper.
Or, during fine-tuning on MS MARCO, did you use the original NEXT ("The next sentence is:") and SELF ("The input sentence is:") prompts as described in the paper?
To accurately reproduce the implementation detailed in your paper, how should I construct the prompts for queries and passages while fine-tuning on MS MARCO?

To conclude, should I use the modified prompts and special tokens as in the code, or adhere to the original NEXT and SELF prompts mentioned in the paper?
Are there any additional details about the prompts or special tokens that are important for replication but were not included in the paper? (For example, you write that we should use the NEXT and SELF prompts, but did not write in detail that the NEXT and SELF prompts themselves should be changed during fine-tuning.)

Thank you for your time and for contributing such valuable research to the community.

@545999961 (Collaborator) commented:

  1. In the paper, "SELF" and "NEXT" are used merely as referential labels to describe what those prompts represent, not as the literal prompt strings.
  2. The N2N and S2S formulations are not employed during fine-tuning; rather, the fine-tuned models are capable of using them.
  3. BAAI/LLARA-passage is the passage version of Llama2Vec.
  4. Regarding the use of prompts and special tokens, these do not significantly influence the overall results. If fine-tuning is based on LLARA-pretrain, it is essential to use the same prompts as those used during the pretraining phase, as mentioned in the fine-tuning section. However, if you start from Llama and then proceed through pre-training and fine-tuning yourself, you may use either the prompts in the code or those discussed in the paper. The key requirement is to keep the prompts consistent across pre-training, fine-tuning, and inference.
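
A minimal sketch of the consistency requirement in point 4: define the prompt strings once and reuse the same set verbatim at every stage. The dictionary layout and function name below are illustrative only, and the placement of the paper-style prompts as suffixes is an assumption, not something stated in the paper or this thread:

```python
# Hypothetical sketch: pin one prompt set so that pre-training, fine-tuning,
# and inference all consume identical strings. Do not mix the two sets.
PROMPT_SETS = {
    "paper": {  # NEXT / SELF prompts as worded in the paper (placement assumed)
        "query":   ("", "The next sentence is: "),
        "passage": ("", "The input sentence is: "),
    },
    "code": {   # prompts as worded in the repository code
        "query":   ('"', '", predict the following passage within eight words: '),
        "passage": ('"', '", summarize the above passage within eight words: '),
    },
}

def apply_prompt(text: str, side: str, prompt_set: str = "code") -> str:
    # Wrap raw text with the chosen prompt set's prefix and suffix.
    prefix, suffix = PROMPT_SETS[prompt_set][side]
    return prefix + text + suffix

# Pick one prompt_set and reuse it verbatim for pre-training, fine-tuning, and inference.
print(apply_prompt("what is dense retrieval?", "query", prompt_set="code"))
```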
