This repository contains datasets for evaluating advanced RAG over multiple documents. We created them because existing eval datasets did not adequately reflect the RAG use cases we see in production: they typically cover Q&A over a single document (or just a few), whereas customers often need to run RAG over much larger sets of documents.
The goal of these datasets is to reflect real-life customer usage by incorporating:

- Q&A over many documents, not just a few
- Realistic long-form documents similar to those customers actually use, rather than standard academic examples
- Questions of varying difficulty (a short loading sketch follows this list), including:
  - Single-Doc, Single-Chunk RAG: Questions whose answer can be found in one contiguous region (a text or table chunk) of a single doc. To answer correctly, the RAG system needs to retrieve that chunk and pass it into the LLM context. For example:
    - "What did Microsoft report as its net cash from operating activities in the Q3 2022 10-Q?"
  - Single-Doc, Multi-Chunk RAG: Questions whose answer spans multiple non-contiguous regions (text or table chunks) of a single doc. To answer correctly, the RAG system needs to retrieve several correct chunks from one doc, which can be challenging for certain types of questions. For example:
    - "For Amazon's Q1 2023, how does the share repurchase information in the financial statements correlate with the equity section in the management discussion?"
  - Multi-Doc RAG: Questions whose answer spans multiple non-contiguous regions (text or table chunks) across multiple docs. To answer correctly, the RAG system needs to retrieve several correct chunks from multiple docs. For example:
    - "How has Apple's revenue from iPhone sales fluctuated across quarters?"
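The three categories above are useful for slicing evaluation results by difficulty. Below is a minimal sketch of how one dataset's QnA pairs might be loaded and grouped by category; the file layout and field names (`qna.jsonl`, `question_type`) are assumptions for illustration, not this repository's actual schema.

```python
import json
from collections import Counter
from pathlib import Path

# Assumed location of one dataset's QnA pairs; adjust to the actual layout.
QA_FILE = Path("sec_10q/qna.jsonl")


def load_qna(path: Path) -> list[dict]:
    """Load QnA pairs from a JSONL file (one JSON object per line)."""
    with path.open() as f:
        return [json.loads(line) for line in f if line.strip()]


def split_by_difficulty(pairs: list[dict]) -> dict[str, list[dict]]:
    """Group QnA pairs by difficulty category so each tier can be scored separately."""
    groups: dict[str, list[dict]] = {}
    for pair in pairs:
        # Assumed category labels: "single-doc-single-chunk",
        # "single-doc-multi-chunk", "multi-doc".
        groups.setdefault(pair.get("question_type", "unknown"), []).append(pair)
    return groups


if __name__ == "__main__":
    groups = split_by_difficulty(load_qna(QA_FILE))
    counts = Counter({label: len(items) for label, items in groups.items()})
    print(counts.most_common())
```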
Current status for each dataset:
| Dataset | Status | # of Documents | # of QnA pairs |
|---|---|---|---|
| SEC 10-Q | v1 | 20 | 195 |
| NTSB Aviation Incident Accident Reports | Draft | 20 | in progress |
| NIH Clinical Trial Protocols | Draft | 20 | in progress |
| US Federal Agency Reports | Draft | 20 | in progress |
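Because each question category is defined by which chunks must be retrieved, one straightforward way to use these datasets is to score a retriever on chunk-level recall before judging end-to-end answers. The sketch below assumes each QnA pair carries a list of gold chunk IDs (called `gold_chunk_ids` here) and that you supply your own `retrieve` function; neither name comes from this repository.

```python
from typing import Callable, Iterable

# A user-supplied retriever: takes a question and returns IDs of retrieved chunks.
# How chunks are identified (e.g. "doc_id:chunk_index") is an assumption for illustration.
Retriever = Callable[[str], Iterable[str]]


def chunk_recall(pairs: list[dict], retrieve: Retriever, gold_field: str = "gold_chunk_ids") -> float:
    """Average fraction of gold chunks that the retriever surfaces per question.

    Multi-chunk and multi-doc questions are only answerable if all of their
    gold chunks are retrieved, which is what makes those tiers harder than
    single-doc, single-chunk questions.
    """
    scores = []
    for pair in pairs:
        gold = set(pair.get(gold_field, []))
        if not gold:
            continue  # skip pairs without gold chunk annotations
        retrieved = set(retrieve(pair["question"]))
        scores.append(len(gold & retrieved) / len(gold))
    return sum(scores) / len(scores) if scores else 0.0
```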