Skip to content

Latest commit

 

History

History
248 lines (199 loc) · 12.2 KB

README.md

File metadata and controls

248 lines (199 loc) · 12.2 KB

Maverick Coref

Conference License: CC BY-NC 4.0 Pip Package git

This is the official repository for Maverick: Efficient and Accurate Coreference Resolution Defying Recent Trends.

Python Package

The maverick-coref Python package provides an easy API to use Maverick models, enabling efficient and accurate coreference resolution with few lines of code.

Install the library from PyPI

pip install maverick-coref

or from source

git clone https://github.com/SapienzaNLP/maverick-coref.git
cd maverick-coref
pip install -e .

Loading a Pretrained Model

Maverick models can be loaded using huggingface_id or local path:

from maverick import Maverick
model = Maverick(
  hf_name_or_path = "maverick_hf_name" | "maverick_ckpt_path", default = "sapienzanlp/maverick-mes-ontonotes"
  device = "cpu" | "cuda", default = "cuda:0"
)

Available Models

Available models at SapienzaNLP huggingface hub:

hf_model_name training dataset Score Singletons
"sapienzanlp/maverick-mes-ontonotes" OntoNotes 83.6 No
"sapienzanlp/maverick-mes-litbank" LitBank 78.0 Yes
"sapienzanlp/maverick-mes-preco" PreCo 87.4 Yes

N.B. Each dataset has different annotation guidelines, choose your model according to your use case.

Inference

Inputs

Maverick inputs can be formatted as either:

  • plain text:
    text = "Barack Obama is traveling to Rome. The city is sunny and the president plans to visit its most important attractions"
  • word-tokenized text, as a list of tokens:
    word_tokenized = ['Barack', 'Obama', 'is', 'traveling', 'to', 'Rome', '.',  'The', 'city', 'is', 'sunny', 'and', 'the', 'president', 'plans', 'to', 'visit', 'its', 'most', 'important', 'attractions']
  • sentence split, word-tokenized text, i.e., OntoNotes like input, as a list of lists of tokens:
    ontonotes_format = [['Barack', 'Obama', 'is', 'traveling', 'to', 'Rome', '.'], ['The', 'city', 'is', 'sunny', 'and', 'the', 'president', 'plans', 'to', 'visit', 'its', 'most', 'important', 'attractions']] 

Predict

You can use model.predict() to obtain coreference predictions. For a sample input, the model will a dictionary containing:

  • tokens, word tokenized version of the input.
  • clusters_token_offsets, a list of clusters containing mentions' token offsets.
  • clusters_text_mentions, a list of clusters containing mentions in plain text.

Example:

model.predict(ontonotes_format)
>>> {
'tokens': ['Barack', 'Obama', 'is', 'traveling', 'to', 'Rome', '.', 'The', 'city', 'is', 'sunny', 'and', 'the', 'president', 'plans', 'to', 'visit', 'its', 'most', 'important', 'monument', ',', 'the', 'Colosseum'], 
'clusters_token_offsets': [[(5, 5), (7, 8), (17, 17)], [(0, 1), (12, 13)]], 
'clusters_text_mentions': [['Rome', 'The city', 'its'], ['Barack Obama', 'the president']]
}

If you input plain text, the model will include also char level offsets as clusters_char_offsets:

model.predict(text)
>>> {
  'tokens': [...], 
  'clusters_token_offsets': [...], 
  'clusters_char_offsets': [[(29, 32), (35, 42), (86, 88)], [(0, 11), (57, 69)]], 
  'clusters_text_mentions': [...]
  }

🚨Additional Features🚨

Since Coreference Resolution may serve as a stepping stone for many downstream use cases, in this package we cover multiple additional features:

  • Singletons, either include or exclude singletons (i.e., single mention clusters) prediction by setting singletons to True or False. (hint: for accurate singletons use preco- or litbank-based models, since ontonotes does not include singletons and therefore the model is not trained to extract any)

    #supported input: ontonotes_format
    model.predict(ontonotes_format, singletons=True)
    {'tokens': [...], 
    'clusters_token_offsets': [((5, 5), (7, 8), (17, 17)), ((0, 1), (12, 13)), ((17, 20),)],
    'clusters_char_offsets': None, 
    'clusters_token_text': [['Rome', 'The city', 'its'], ['Barack Obama', 'the president'], ['its most important attractions']], 
    'clusters_char_text': None
    }
  • Clustering-only, predict with predefined mentions (clustering-only), by passing mentions as a list of token offsets.

    #supported input: ontonotes_format
    mentions = [(0, 1), (5, 5), (7, 8)]
    model.predict(ontonotes_format, predefined_mentions=mentions)
    >>> {'tokens': [...], 
    'clusters_token_offsets': [((5, 5), (7, 8))],
    'clusters_char_offsets': None, 
    'clusters_token_text': [['Rome', 'The city']], 
    'clusters_char_text': None}
  • Starting from gold clusters, predict starting from gold clusters, by passing the model the mentions as a list of token offsets. (Note: since starting clusters will be the first in the token offset outputs, to obtain the coreference resolution predictions only for starting clusters it is enough to take the first N clusters, where N is the number of starting clusters.)

    #supported input: ontonotes_format 
    clusters = [[(5, 5), (7, 8)], [(0, 1)]]
    model.predict(ontonotes_format, add_gold_clusters=clusters)
    >>> {'tokens': [...], 'clusters_token_offsets': [((5, 5), (7, 8), (17, 17)), ((0, 1), (12, 13))], 'clusters_char_offsets': None, 'clusters_token_text': [['Rome', 'The city', 'its'], ['Barack Obama', 'the president']], 'clusters_char_text': None}
  • Speaker information, since OntoNotes models are trained with additional speaker information (more info here), you can specify speaker information with OntoNotes format.

  #supported input: ontonotes_format 
  speakers = [["Mark", "Mark", "Mark", "Mark", "Mark"],["Jhon", "Jhon", "Jhon", "Jhon"]]
  model.predict(ontonotes_format, speakers=clusters)

Using the official Training and Evaluation Script

This same repository contains also the code to train and evaluate Maverick systems using pytorch-lightning and Hydra.

We strongly suggest to directly use the python package for easier inference and downstream usage.

Environment

To set up the training and evaluation environment, run the bash script setup.sh that you can find at top level in this repository. This script will handle the creation of a new conda environment and will take care of all the requirements and data preprocessing for training and evaluating a model on OntoNotes.

Simply run on the command line:

bash ./setup.sh

N.B. Remember to put the zip file ontonotes-release-5.0_LDC2013T19.tgz in the folder data/prepare_ontonotes/ if you want to preprocess Ontonotes with the standard preprocessing proposed by e2e-coref. OntoNotes can be downloaded, upon registration, at the following link

Data

Official Links:

Since those datasets usually require a preprocessing step to obtain the OntoNotes-like jsonlines format, we release ready-to-use version: https://drive.google.com/drive/u/3/folders/18dtd1Qt4h7vezlm2G0hF72aqFcAEFCUo.

Hydra

This repository uses Hydra configuration environment.

  • In conf/data/ each yaml file contains a dataset configuration.
  • conf/evaluation/ contains the model checkpoint file path and device settings for model evaluation.
  • conf/logging/ contains details for wandb logging.
  • In conf/model/, each yaml file contains a model setup.
  • conf/train/ contains training configurations.
  • conf/root.yaml regulates the overall configuration of the environment.

Train

To train a Maverick model, modify conf/root.yaml with your custom setup. By default, this file contains the settings for training and evaluating on the OntoNotes dataset.

To train a new model, follow the steps in Environment section and run the following script:

conda activate maverick_env
python maverick/train.py

Evaluate

To evaluate an existing model, it is necessary to set up two different environment variables.

  1. Set the dataset path in conf/root.yaml, by default it is set to OntoNotes.
  2. Set the model checkpoint path in conf/evaluation/default_evaluation.yaml.

Finally run the following:

conda activate env_name
python maverick/evaluate.py

This will directly output the CoNLL-2012 scores, and, under the experiments/ folder, a output.jsonlines file containing the model outputs in OntoNotes style.

Replicate paper results

The weights of each model can be found in the SapienzaNLP huggingface hub. To replicate any of the paper results, download the weights.ckpt of a model from the its model card files and follow the steps reported in the Evaluate section.

E.G. to replicate the state of the art results of Maverick_mes on OntoNotes:

  • download the weights from here.
  • copy the local path of the weights in conf/evaluation/default_evaluation.yaml.
  • activate the project's conda environment with conda activate maverick_coref.
  • run python maverick/evaluate.py

Citation

This work has been published at ACL 2024 main conference. If you use any part, please consider citing our paper as follows:

@inproceedings{martinelli-etal-2024-maverick,
    title = "Maverick: Efficient and Accurate Coreference Resolution Defying Recent Trends",
    author = "Martinelli, Giuliano  and
      Barba, Edoardo  and
      Navigli, Roberto",
    booktitle = "Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
    month = aug,
    year = "2024",
    address = "Bangkok, Thailand",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2024.acl-long.722",
    pages = "13380--13394",
}

License

The data and software are licensed under Creative Commons Attribution-NonCommercial-ShareAlike 4.0.