Data Processing in Details (General Format) #57
-
Single Tag Statistic: We start with using only GPT4-GPT4 conversations. We have a total of 90 scenarios, each with 5 dialogues. For easy scenarios, we observe:
-
One issue with our previous data is that some prompts exceed the maximum token size FastChat can finetune on without OOM. The tolerable token size is 2048, but our maximum prompt length is about 2900 tokens. Since we are finetuning the model rather than training from scratch, it is not harmful to remove the formatting string we append at the end of each prompt, which instructs the model on how the generation should be formatted. Once we removed the formatting portion, we still had about 95 data points over the tolerable token count. A sliding window is hence implemented for these long conversations. The main idea is:
For example, if the dialogue prompt has 10 turns and more than 2048 tokens, we first divide the prompt into context + dialogues. We keep the context unchanged and start with the first turn (#turn 0) in the dialogues. If the remaining total token count (num_token(context) + num_token(dialogues - first turn)) is less than 2048, then the truncated prompt is missing only the dialogue from turn 0. If the truncated prompt is still too long, we iterate to the next turn, #turn 1, and see how many tokens are left after removing the dialogues for turns 0 and 1. Once we reach the target token count, we stop removing dialogues and combine the context with the remaining dialogues to form the truncated prompt. Note that the tokens of the result field, which is the generation for each prompt dialogue, are not counted for each data point. Since none of the "to-be-generated" sentences is longer than 2048 tokens, we do not handle the extreme scenario where the prediction itself would exceed 2048 tokens. Formatting that is removed: "Your available action types are ... Please only generate a JSON string including the action type and the argument. ========================================================"
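Below is a minimal sketch of this sliding-window truncation. The `num_tokens` helper is a stand-in for the finetuning model's tokenizer (a real implementation would use something like `len(tokenizer.encode(text))`), and the way context and turns are joined is illustrative.

```python
# Sliding-window truncation sketch: drop the earliest dialogue turns until
# the prompt fits under the token budget.
MAX_TOKENS = 2048


def num_tokens(text: str) -> int:
    # Rough whitespace-based approximation, for illustration only.
    return len(text.split())


def truncate_prompt(context: str, turns: list[str], max_tokens: int = MAX_TOKENS) -> str:
    """Keep the context, drop the earliest turns until everything fits."""
    start = 0
    while start < len(turns):
        candidate = "\n".join([context] + turns[start:])
        if num_tokens(candidate) <= max_tokens:
            return candidate
        start += 1  # drop the next-earliest turn and retry
    # If even the final turn alone does not fit, fall back to the context only.
    return context
```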
-
Data Processing involves 4 main steps:
Select a tag and pull all episodelogs. For now, we mainly consider clean tags involving GPT4. The tag determines the models behind the LLM agents in the dialogues.
Filter the dialogues by quality, using the goal-achieving score per agent per scenario. Note that for each scenario, we want to guarantee that some dialogues are included in the dataset, so we specify a minimum number of dialogues per agent within a scenario. Currently we use half of the number of dialogues per scenario as the minimum, so for the GPT4-GPT4 tag, where each scenario has 5 dialogues, we require at least 2 dialogues per agent from the scenario.
Split the scenarios into train and test sets by scenario difficulty. Difficulty has been defined in redis; we collected 76 easy scenarios vs. 14 hard scenarios.
For each selected dialogue and agent position, convert the episodelog format of the dialogue into a model, prompt, and result format, or "completion format". The model is the LLM model of the agent of interest. The prompt concatenates the background of the dialogue with all previous conversation between the two agents up to a specific turn. The result is the to-be-predicted dialogue/action by the agent of interest, i.e., the next sentence the agent would say or the next action the agent would take, given all previous info in the prompt. We save each completion-format dialogue into a JSON file.
For step 4, note that for a given dialogue and agent of interest, multiple JSON files may be created, i.e., multiple data points. We only want to predict the next sentence/action by the agent of interest, but we want to predict all of that agent's sentences across sequential turns. So if the dialogue has 10 turns, 5 for each agent, then we generate 5 JSON files for this dialogue and the selected agent.
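An illustrative sketch of this conversion for one episode and one agent position is below. The `speaker_of` helper and the file-naming scheme are hypothetical; `context` and `turns` are assumed to come from the already-parsed episodelog.

```python
# Convert one episode into completion-format data points for one agent,
# producing one data point per turn spoken by that agent.
import json
from typing import Callable


def episode_to_datapoints(model: str, context: str, turns: list[str],
                          agent: str, speaker_of: Callable[[str], str]) -> list[dict]:
    datapoints = []
    for i, turn in enumerate(turns):
        if speaker_of(turn) != agent:
            continue  # only predict the turns produced by the agent of interest
        datapoints.append({
            "model": model,                              # LLM behind the agent
            "prompt": "\n".join([context] + turns[:i]),  # background + all prior turns
            "result": turn,                              # next utterance/action to predict
        })
    return datapoints


def save_datapoints(datapoints: list[dict], prefix: str) -> None:
    # One JSON file per data point, as described above.
    for k, dp in enumerate(datapoints):
        with open(f"{prefix}_{k}.json", "w") as f:
            json.dump(dp, f)
```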
For step 2, it is important that the selection is per scenario per agent position. We apply the filtering using the goal reward score for each agent, based on the distribution of scores per agent per scenario. Say the scenario has 5 dialogues in total; then each agent has 5 scores. We plot the distribution of scores for each agent and derive the average score per agent.
Then we rank the dialogues for both agents. We first select the top x dialogues for each agent, where x is the minimum number of dialogues we require per scenario. Then, for the remaining dialogues, at each rank i we look at the rank-i dialogue for agent 1 and the rank-i dialogue for agent 2. For each dialogue, we check whether its score is above min(7, avg agent score). 7 is the global score indicating good quality, derived from the distribution of goal scores across all scenarios and all agents; this number can be adjusted depending on the need.
If both dialogues at rank i satisfy the requirement, i.e., have scores above min(7, avg agent score), then we add both dialogue-agent pairs to the list. If either one does not satisfy the condition, we add neither.
As a concrete example, consider a scenario with 5 dialogues, where the ranking of dialogues by goal score from high to low is [5, 4, 3, 2, 1] for agent 1 and [1, 3, 2, 4, 5] for agent 2. Since we require at least two dialogues per agent, we first add (agent1, 5), (agent1, 4), (agent2, 1), (agent2, 3) to the scenario list.
Then we look at rank 3, which is (agent 1, dialogue 3) and (agent 2, dialogue 2). If score(agent1, dia3) > min(avg agent1 score, 7) and score(agent2, dia2) > min(avg agent2 score, 7), we add both (agent 1, 3) and (agent 2, 2); otherwise we add neither. By doing so, we guarantee every scenario has data points present in the dataset, and the dialogues from each agent position are balanced.
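The per-scenario selection can be sketched as follows, assuming the per-agent goal scores are available as dictionaries mapping dialogue id to score; the global threshold of 7 and the min() with the per-agent average follow the description above.

```python
# Per-scenario dialogue selection: always keep the top `min_per_agent`
# dialogues per agent, then keep a rank-i pair only if both dialogues
# clear their agent's threshold.
GLOBAL_THRESHOLD = 7.0


def select_dialogues(scores_a1: dict[str, float], scores_a2: dict[str, float],
                     min_per_agent: int) -> list[tuple[str, str]]:
    """Return (agent, dialogue_id) pairs selected for one scenario."""
    thr_a1 = min(GLOBAL_THRESHOLD, sum(scores_a1.values()) / len(scores_a1))
    thr_a2 = min(GLOBAL_THRESHOLD, sum(scores_a2.values()) / len(scores_a2))

    # Rank dialogue ids by score, highest first, separately for each agent.
    rank_a1 = sorted(scores_a1, key=scores_a1.get, reverse=True)
    rank_a2 = sorted(scores_a2, key=scores_a2.get, reverse=True)

    # Always keep the top `min_per_agent` dialogues for each agent.
    selected = [("agent1", d) for d in rank_a1[:min_per_agent]]
    selected += [("agent2", d) for d in rank_a2[:min_per_agent]]

    # For each remaining rank, keep the pair only if both dialogues pass
    # their agent's threshold; otherwise keep neither.
    for d1, d2 in zip(rank_a1[min_per_agent:], rank_a2[min_per_agent:]):
        if scores_a1[d1] > thr_a1 and scores_a2[d2] > thr_a2:
            selected.append(("agent1", d1))
            selected.append(("agent2", d2))
    return selected


# With min_per_agent = 2 and rankings [5, 4, 3, 2, 1] (agent 1) and
# [1, 3, 2, 4, 5] (agent 2), dialogues 5, 4 and 1, 3 are always kept, and the
# rank-3 pair (3, 2) is kept only if both dialogues clear their thresholds.
```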