
mmlu evaluation not working #28

Open
zhengyang-wang opened this issue Oct 24, 2024 · 1 comment

@zhengyang-wang

Thanks for open-sourcing this! I tested the repo with a small job and found that the 0-shot MMLU score stays at 0.22945449366187154 throughout training. I checked the dumped evaluation details and found that the log-likelihood of every choice is always 0. For example:

{'doc_id': 0,
 'doc': {'question': " Just war theory's principle of military necessity belongs to",
  'subject': 'moral_disputes',
  'choices': ['jus in bello.',
   'jus ad bellum.',
   'moral nihilism.',
   'all of the above'],
  'answer': 0},
 'target': 0,
 'arguments': [["The following are multiple choice questions (with answers) about moral disputes.\n\nJust war theory's principle of military necessity belongs to\nA. jus in bello.\nB. jus ad bellum.\nC. moral nihilism.\nD. all of the above\nAnswer:",
   ' A'],
  ["The following are multiple choice questions (with answers) about moral disputes.\n\nJust war theory's principle of military necessity belongs to\nA. jus in bello.\nB. jus ad bellum.\nC. moral nihilism.\nD. all of the above\nAnswer:",
   ' B'],
  ["The following are multiple choice questions (with answers) about moral disputes.\n\nJust war theory's principle of military necessity belongs to\nA. jus in bello.\nB. jus ad bellum.\nC. moral nihilism.\nD. all of the above\nAnswer:",
   ' C'],
  ["The following are multiple choice questions (with answers) about moral disputes.\n\nJust war theory's principle of military necessity belongs to\nA. jus in bello.\nB. jus ad bellum.\nC. moral nihilism.\nD. all of the above\nAnswer:",
   ' D']],
 'resps': [[[0.0, True]], [[0.0, True]], [[0.0, True]], [[0.0, True]]],
 'filtered_resps': [[0.0, True], [0.0, True], [0.0, True], [0.0, True]],
 'doc_hash': 'bb0de79f79411c47783968714ec9fe3c69d89753e22c88f044420a7e00049a15',
 'prompt_hash': 'c0f269e09cf44177328b16fb43734d6675416710a24286fea49dfad5663d2fb4',
 'target_hash': '5feceb66ffc86f38d952786c6d696c79c2dbc239dd4e91b46729d73a27fb57e9',
 'acc': 1.0}
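
With all four choices tied at exactly 0.0, the scorer presumably falls back to the first option, which would explain the constant accuracy around 0.229. One quick way to confirm the pattern across the whole dump is to scan the per-sample file for documents whose filtered_resps are all zero. A minimal sketch, assuming the samples are stored as JSON Lines with the fields shown above (the file name is just a placeholder):

```python
import json

# Hypothetical path to the dumped samples; adjust to your output directory.
path = "samples_mmlu.jsonl"

total = zeroed = 0
with open(path) as f:
    for line in f:
        doc = json.loads(line)
        # filtered_resps is a list of [loglikelihood, is_greedy] pairs, one per choice.
        lls = [r[0] for r in doc["filtered_resps"]]
        total += 1
        if all(ll == 0.0 for ll in lls):
            zeroed += 1

print(f"{zeroed}/{total} docs have all-zero log-likelihoods")
```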

The other benchmarks seem normal, e.g. PIQA:

{'doc_id': 0,
 'doc': {'goal': "How do I ready a guinea pig cage for it's new occupants?",
  'sol1': 'Provide the guinea pig with a cage full of a few inches of bedding made of ripped paper strips, you will also need to supply it with a water bottle and a food dish.',
  'sol2': 'Provide the guinea pig with a cage full of a few inches of bedding made of ripped jeans material, you will also need to supply it with a water bottle and a food dish.',
  'label': 0},
 'target': 0,
 'arguments': [["Question: How do I ready a guinea pig cage for it's new occupants?\nAnswer:",
   ' Provide the guinea pig with a cage full of a few inches of bedding made of ripped paper strips, you will also need to supply it with a water bottle and a food dish.'],
  ["Question: How do I ready a guinea pig cage for it's new occupants?\nAnswer:",
   ' Provide the guinea pig with a cage full of a few inches of bedding made of ripped jeans material, you will also need to supply it with a water bottle and a food dish.']],
 'resps': [[[-120.0, False]], [[-122.5, False]]],
 'filtered_resps': [[-120.0, False], [-122.5, False]],
 'doc_hash': 'ab177c9b9ad0fd48149e873e3d4804752991338a90c2072f52b975f86a7ca78e',
 'prompt_hash': '14e2e90bdc64add59c76a88c2efb217a132b459947f9f2e3bbe8580b71beb533',
 'target_hash': '5feceb66ffc86f38d952786c6d696c79c2dbc239dd4e91b46729d73a27fb57e9',
 'acc': 1.0,
 'acc_norm': 1.0}

The saved eval config is:

harness:
  tasks:
  - hellaswag
  - nq_open
  - piqa
  - winogrande
  - arc
  - race
  - mmlu
  num_fewshot: null
  device: null
  use_cache: null
  cache_requests: false
  rewrite_requests_cache: false
  delete_requests_cache: false
  limit: null
  bootstrap_iters: 100000
  check_integrity: false
  write_out: false
  log_samples: true
  system_instruction: null
  apply_chat_template: false
  fewshot_as_multiturn: false
  gen_kwargs: null
  verbosity: INFO
  predict_only: false
  random_seed: 0
  numpy_random_seed: 1234
  torch_random_seed: 1234
  fewshot_random_seed: 1234
wandb: null
global_step: 60000
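
The config keys above appear to mirror the arguments of lm-evaluation-harness's simple_evaluate, so the broken task can probably be isolated without re-running the full suite. A minimal sketch of such a reproduction, assuming a Hugging Face-style checkpoint (the path and limit are placeholders, not values from this run):

```python
# A minimal reproduction sketch via the lm-evaluation-harness Python API (not the
# repo's own eval script). The checkpoint path and limit are placeholders.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",                                   # Hugging Face backend
    model_args="pretrained=/path/to/checkpoint",  # placeholder path
    tasks=["mmlu"],
    num_fewshot=0,
    limit=10,          # a handful of docs per subtask is enough to see the 0.0s
    log_samples=True,  # keep per-doc resps / filtered_resps as in the dump above
)
print(results["results"])
```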

Could it be that MMLU's answer is only a single token and the eval script misses the last token? Looking forward to a fix.
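To illustrate the suspicion, here is a toy sketch (purely hypothetical, not the repo's or the harness's actual code; all names are made up) of how an off-by-one in the continuation slice would zero out exactly the single-token case while leaving multi-token continuations, like PIQA's, looking roughly normal:

```python
# Toy illustration of the suspected bug: the continuation slice stops one
# position too early, so a 1-token answer contributes nothing to the sum.

def loglikelihood_buggy(ctx_tokens, cont_tokens, next_token_logprobs):
    """Sum the log-probs of cont_tokens given ctx_tokens.

    next_token_logprobs[i] is the log-prob of token i+1 given tokens[:i+1],
    i.e. standard next-token scoring over the concatenated ctx + cont sequence.
    """
    n_ctx, n_cont = len(ctx_tokens), len(cont_tokens)
    # Suspected off-by-one: the slice below drops the last continuation token.
    # A correct slice would be next_token_logprobs[n_ctx - 1 : n_ctx + n_cont - 1].
    cont_logprobs = next_token_logprobs[n_ctx - 1 : n_ctx + n_cont - 2]
    return sum(cont_logprobs)


# 3-token context, 1-token continuation (like MMLU's " A" / " B" / " C" / " D"):
# the slice is empty and the "log-likelihood" comes out as exactly 0.
print(loglikelihood_buggy([11, 12, 13], [14], [-1.2, -0.7, -2.3]))  # -> 0
```

Under such a bug a longer continuation would only lose its final token, which would be consistent with PIQA still producing plausible scores.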

@BadrYoubiIdrissi
Contributor

Hey, thank you for the issue! Let me try to reproduce it and see what the problem is. We've only recently switched to lm-harness from our internal evals, so some things might be broken. Sorry about that!
