This repository contains datasets for evaluating advanced RAG over multiple documents. We created them because existing eval datasets did not adequately reflect the RAG use cases we see in production: they typically cover Q&A over a single document (or just a few), whereas customers often need to run RAG over much larger sets of documents.
The goal of these datasets is to reflect real-life customer usage by incorporating:

- Q&A over many documents, not just a few
- Realistic long-form documents similar to those customers actually use, rather than standard academic examples
- Questions of varying difficulty (a short loading sketch follows this list), including:
  - Single-Doc, Single-Chunk RAG: Questions whose answer can be found in one contiguous region (a text or table chunk) of a single doc. To answer correctly, the RAG system needs to retrieve that chunk and pass it into the LLM context. For example:
    - "What did Microsoft report as its net cash from operating activities in the Q3 2022 10-Q?"
  - Single-Doc, Multi-Chunk RAG: Questions whose answer spans multiple non-contiguous regions (text or table chunks) of a single doc. To answer correctly, the RAG system needs to retrieve several correct chunks from one doc, which can be challenging for certain types of questions. For example:
    - "For Amazon's Q1 2023, how does the share repurchase information in the financial statements correlate with the equity section in the management discussion?"
  - Multi-Doc RAG: Questions whose answer spans multiple non-contiguous regions (text or table chunks) across multiple docs. To answer correctly, the RAG system needs to retrieve several correct chunks from multiple docs. For example:
    - "How has Apple's revenue from iPhone sales fluctuated across quarters?"
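The three categories above are useful for slicing evaluation results by difficulty. Below is a minimal sketch of how one dataset's QnA pairs might be loaded and grouped by category; the file layout and field names (`qna.jsonl`, `question_type`) are assumptions for illustration, not this repository's actual schema.

```python
import json
from collections import Counter
from pathlib import Path

# Assumed location of one dataset's QnA pairs; adjust to the actual layout.
QA_FILE = Path("sec_10q/qna.jsonl")


def load_qna(path: Path) -> list[dict]:
    """Load QnA pairs from a JSONL file (one JSON object per line)."""
    with path.open() as f:
        return [json.loads(line) for line in f if line.strip()]


def split_by_difficulty(pairs: list[dict]) -> dict[str, list[dict]]:
    """Group QnA pairs by difficulty category so each tier can be scored separately."""
    groups: dict[str, list[dict]] = {}
    for pair in pairs:
        # Assumed category labels: "single-doc-single-chunk",
        # "single-doc-multi-chunk", "multi-doc".
        groups.setdefault(pair.get("question_type", "unknown"), []).append(pair)
    return groups


if __name__ == "__main__":
    groups = split_by_difficulty(load_qna(QA_FILE))
    counts = Counter({label: len(items) for label, items in groups.items()})
    print(counts.most_common())
```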
Current status for each dataset:
| Dataset | Status | # of Documents | # of QnA pairs |
|---|---|---|---|
| SEC 10-Q | v1 | 20 | 195 |
| NTSB Aviation Incident Accident Reports | Draft | 20 | in progress |
| NIH Clinical Trial Protocols | Draft | 20 | in progress |
| US Federal Agency Reports | Draft | 20 | in progress |
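Because each question category is defined by which chunks must be retrieved, one straightforward way to use these datasets is to score a retriever on chunk-level recall before judging end-to-end answers. The sketch below assumes each QnA pair carries a list of gold chunk IDs (called `gold_chunk_ids` here) and that you supply your own `retrieve` function; neither name comes from this repository.

```python
from typing import Callable, Iterable

# A user-supplied retriever: takes a question and returns IDs of retrieved chunks.
# How chunks are identified (e.g. "doc_id:chunk_index") is an assumption for illustration.
Retriever = Callable[[str], Iterable[str]]


def chunk_recall(pairs: list[dict], retrieve: Retriever, gold_field: str = "gold_chunk_ids") -> float:
    """Average fraction of gold chunks that the retriever surfaces per question.

    Multi-chunk and multi-doc questions are only answerable if all of their
    gold chunks are retrieved, which is what makes those tiers harder than
    single-doc, single-chunk questions.
    """
    scores = []
    for pair in pairs:
        gold = set(pair.get(gold_field, []))
        if not gold:
            continue  # skip pairs without gold chunk annotations
        retrieved = set(retrieve(pair["question"]))
        scores.append(len(gold & retrieved) / len(gold))
    return sum(scores) / len(scores) if scores else 0.0
```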