pmcwhannel/treccastweb

TREC Conversational Assistance Track (CAsT)

There are currently few datasets appropriate for training and evaluating models for Conversational Information Seeking (CIS). The main aim of TREC CAsT is to advance research on conversational search systems. The goal of the track is to create a reusable benchmark for open-domain, information-centric conversational dialogues.

The track will run in 2020 and establish a concrete and standard collection of data with information needs to make systems directly comparable.

This is the second year of TREC CAsT, which will run as a track in TREC. This year we aim to focus on candidate information ranking in context:

  • Read the dialogue context: Track the evolution of the information need in the conversation, identifying salient information needed for the current turn in the conversation
  • Retrieve Candidate Response Information: Perform retrieval over a large collection of paragraphs (or knowledge base content) to identify relevant information

Year 2 (TREC 2020)

Data

Topics

Baselines

  • NEW - BM25 + BERT baseline - We provide a BM25 + BERT reranked baseline run for the raw utterances, automatically rewritten utterances, and the manually rewritten utterances.
  • NEW - Interactive web UI - A simple web UI with the BM25 + BERT model used to create the baseline runs. No rewriting is performed.

Collection

Guidelines

News

  • May 2020: Year 2 guidelines released
  • July 2020: Year 2 evaluation topics released

Contact

Important Dates

  • Training data release: See previous data
  • Test topic release: July 9th
  • Run submission: August 19th

Organizers

Year 1 (TREC 2019)

2019 Data

Topics

Resolved Topic Annotations

  • To facilitate work on passage ranking alone, we manually resolved coreference and conversational ambiguity in the topics. We make these annotations available to participants who may not have access to automatic methods. Runs that use this data are manual runs. The annotations are provided in a tab-separated format with the turn id (query id) and the rewritten query in text form.
  • TRAIN: Sample annotations on two training queries (for exemplars)
  • EVALUATION: Complete annotations on the year 1 evaluation topics.
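
As a minimal sketch of reading the tab-separated annotations described above, the snippet below maps each turn id to its rewritten query. The function name `load_rewrites` and the sample rows are invented for illustration; the only thing taken from the description is the two-column layout (turn id, then rewritten query).

```python
def load_rewrites(lines):
    """Map each turn id to its rewritten query.

    `lines` is any iterable of text lines, for example an open file.
    Assumes one annotation per line: turn id, a tab, then the query.
    """
    rewrites = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        turn_id, sep, query = line.partition("\t")
        if not sep:
            raise ValueError(f"expected a tab-separated line, got: {line!r}")
        rewrites[turn_id.strip()] = query.strip()
    return rewrites


# Illustrative rows only; these are not taken from the released files.
sample_rows = [
    "1_1\tWhat is throat cancer?\n",
    "1_2\tIs throat cancer treatable?\n",
]
rewrites = load_rewrites(sample_rows)
```

With a real annotation file, the same function could be called as `load_rewrites(open(path, encoding="utf-8"))`, where `path` points to the downloaded file.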

Baselines

  • Indri search interface - We provide an Indri index of the CAsT collection. See the help page for details on indexing parameters and statistics. It includes a standard batch search API limited to 50 queries per batch.
  • Baseline retrieval - We provide an Indri Query Likelihood baseline run, including both the topics and the run files in trec eval format: train queries, train run file, test queries, test run file. Queries are generated by rewriting with AllenNLP coreference resolution, and stopwords are removed using the Indri stopword list.
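
Two details above lend themselves to a short sketch: the 50-query limit per batch, and run files in trec eval format. The helpers below are hypothetical (`batch_queries` and `format_run_lines` are not part of any CAsT tool), and they do not call the search API itself. The run line layout assumed here is the conventional six-column TREC one: query id, the literal `Q0`, document id, rank, score, and run tag.

```python
def batch_queries(queries, batch_size=50):
    """Yield consecutive slices of `queries`, each at most `batch_size` long."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(queries), batch_size):
        yield queries[start:start + batch_size]


def format_run_lines(query_id, ranked, run_tag):
    """Format one query's results as TREC run lines.

    `ranked` is a list of (doc_id, score) pairs, best first.
    Ranks are assigned from 1 in list order.
    """
    return [
        f"{query_id} Q0 {doc_id} {rank} {score:.4f} {run_tag}"
        for rank, (doc_id, score) in enumerate(ranked, start=1)
    ]


# Illustrative values only.
batches = list(batch_queries([f"query {i}" for i in range(120)]))
lines = format_run_lines("1_1", [("MARCO_1", 2.5), ("CAR_abc", 1.0)], "demo_run")
```

Here 120 queries split into three batches (50, 50, and 20), each small enough for one batch request.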

Collection

Document ID format

  • The document id format is [collection_id_paragraph_id] with collection id and paragraph id separated by an underscore.
  • The collection ids are in the set: {MARCO, CAR, WAPO}.
  • The paragraph ids are the standard ids provided by MARCO and CAR. For WAPO the paragraph id is [article_id-paragraph_index], where paragraph_index is the 1-based index of the paragraph within the article according to the provided paragraph markup, and the two parts are separated by a single dash.
  • Example WaPo combined document id: [WAPO_903cc1eab726b829294d1abdd755d5ab-1], or CAR: [CAR_6869dee46ab12f0f7060874f7fc7b1c57d53144a]
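
The id format above can be split back into its parts with a few lines of Python. This is a sketch under the rules stated in this section only; `DocId` and `parse_doc_id` are hypothetical names. The WAPO paragraph index is taken from the last dash, on the assumption that an article id may itself contain dashes.

```python
from typing import NamedTuple, Optional

COLLECTION_IDS = {"MARCO", "CAR", "WAPO"}


class DocId(NamedTuple):
    collection: str
    paragraph_id: str
    article_id: Optional[str] = None       # WAPO only
    paragraph_index: Optional[int] = None  # WAPO only, 1-based


def parse_doc_id(doc_id):
    """Split a combined document id into collection and paragraph parts."""
    collection, sep, paragraph_id = doc_id.partition("_")
    if not sep or not paragraph_id or collection not in COLLECTION_IDS:
        raise ValueError(f"not a valid document id: {doc_id!r}")
    if collection != "WAPO":
        return DocId(collection, paragraph_id)
    # WAPO paragraph ids end in "-<paragraph_index>".
    article_id, dash, index = paragraph_id.rpartition("-")
    if not dash or not article_id or not index.isdigit() or int(index) < 1:
        raise ValueError(f"not a valid WAPO paragraph id: {paragraph_id!r}")
    return DocId(collection, paragraph_id, article_id, int(index))


wapo = parse_doc_id("WAPO_903cc1eab726b829294d1abdd755d5ab-1")
car = parse_doc_id("CAR_6869dee46ab12f0f7060874f7fc7b1c57d53144a")
```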

Duplicate handling

  • Early analysis found that the MARCO and WaPo corpora both contain a significant number of near-duplicate paragraphs. We have run near-duplicate detection to cluster results; only one result per duplicate cluster will be evaluated. We suggest that you remove duplicates (keeping the canonical document) from your indices.
  • A README with the process and file format.
  • Washington Post duplicate file
  • MARCO duplicate file
  • Note: The tools in the repository below require these files as input for processing the collection and perform deduplication when the data is generated.
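
If duplicates are not removed at indexing time, a ranked list can also be collapsed afterwards so that each duplicate cluster contributes one result. The sketch below assumes the duplicate files have already been loaded into a dictionary from duplicate id to canonical id; the actual file layout is described in the README listed above and is not reproduced here. The name `dedupe_ranking` and the sample ids are hypothetical.

```python
def dedupe_ranking(ranked_doc_ids, canonical_of):
    """Keep one document per duplicate cluster, preserving rank order.

    `canonical_of` maps a duplicate document id to its canonical id.
    Ids missing from the mapping are treated as their own canonical id.
    """
    seen = set()
    kept = []
    for doc_id in ranked_doc_ids:
        canonical = canonical_of.get(doc_id, doc_id)
        if canonical in seen:
            continue  # another member of this cluster already ranked higher
        seen.add(canonical)
        kept.append(canonical)
    return kept


# Illustrative ids only.
canonical_of = {"MARCO_2": "MARCO_1"}
ranking = ["MARCO_2", "MARCO_1", "MARCO_3", "WAPO_abc-1"]
deduped = dedupe_ranking(ranking, canonical_of)
```

The output reports the canonical id at the rank of the cluster's best-scoring member. That is one design choice among several; dropping non-canonical documents from the index, as suggested above, avoids the question entirely.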

Code and tools

  • TREC-CAsT Tools repository with code and scripts for processing data.
  • The tools contain scripts for parsing the collection into standard indexing formats. It also provides APIs for working with the topics (in text, json, and protocol buffer formats).
  • Note: This will evolve over time; it currently contains topic definition files and scripts for reading and loading topics.
