[Evaluation] Add DiscoveryBench Benchmark #4465

Open
suranah opened this issue Oct 18, 2024 · 1 comment
Labels
enhancement New feature or request

Comments

@suranah
Contributor

suranah commented Oct 18, 2024

What problem or use case are you trying to solve?

Add DiscoveryBench to OpenHands' evaluation suite. DiscoveryBench contains 264 tasks collected across 6 diverse domains, such as biology, economics, and sociology. It incorporates discovery workflows from published papers to approximate the real-world challenges faced by researchers.

https://github.com/allenai/discoverybench/
https://x.com/mbodhisattwa/status/1811524569410531333

Do you have thoughts on the technical implementation?

The implementation will consist of:

  1. Inference script to solve a DiscoveryBench task (goal & datasets)
  2. Faceted evaluation script to rigorously evaluate the answers
  3. Documentation for the OpenHands users
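
To make the first two items concrete, here is a rough sketch of how the pieces might fit together. Everything in it is an assumption for illustration only: `DiscoveryTask`, `build_prompt`, `run_inference`, and `evaluate_facets` are hypothetical names, the agent is just a callable that maps a prompt to an answer, and the facet scoring is a placeholder exact match rather than DiscoveryBench's actual evaluation logic.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class DiscoveryTask:
    """Hypothetical container for one task: a goal plus its datasets."""
    goal: str
    dataset_paths: List[str] = field(default_factory=list)


def build_prompt(task: DiscoveryTask) -> str:
    """Render the goal and the dataset list into one instruction string."""
    datasets = "\n".join(f"- {path}" for path in task.dataset_paths)
    return f"Goal: {task.goal}\nDatasets:\n{datasets}"


def run_inference(task: DiscoveryTask, agent: Callable[[str], str]) -> str:
    """Item 1: pass the prompt to an agent and return its stripped answer."""
    return agent(build_prompt(task)).strip()


def evaluate_facets(predicted: Dict[str, str], gold: Dict[str, str]) -> Dict[str, float]:
    """Item 2: score each gold facet separately.

    Placeholder scoring (assumption): case-insensitive exact match per facet.
    A facet missing from the prediction scores 0.0.
    """
    return {
        facet: float(predicted.get(facet, "").strip().lower() == value.strip().lower())
        for facet, value in gold.items()
    }
```

In an actual integration, the agent callable would presumably wrap an OpenHands agent run, and the per-facet comparison would be replaced by the benchmark's own evaluation; the point of the sketch is only that inference and evaluation can stay decoupled behind small interfaces.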

Additional context

We are working on a PR for this and will seek OpenHands contributors' input to finalize it. Tagging the other contributors to the PR: @Ethan0456, @majumderb and @neubig who helped us chart out the integration.

@suranah suranah added the enhancement New feature or request label Oct 18, 2024
@majumderb

Read details of the benchmark in this paper: https://arxiv.org/pdf/2407.01725v1
(A new version with more analysis is coming soon to arXiv)
