naive string matching model #666

Open
jameshowison opened this issue Aug 13, 2020 · 1 comment

@jameshowison
Contributor

jameshowison commented Aug 13, 2020

We talked about creating a model using naive string matching. Its primary use is to identify areas likely to be "mention-rich," given our finding from the manual annotation that mentions tend to cluster together in papers. The expectation is that "go" lists of known software, adjusted for well-known ambiguous phrases, can find those mention-rich chunks for further annotation. Expect decent recall, but very low precision!
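To make the "mention-rich chunk" idea concrete, here is a minimal sketch of how paragraphs could be ranked by the number of go-list hits. The list contents, the function names, and the single-token matching are all assumptions for illustration, not the project's actual lists or code.

```python
import re

# Hypothetical "go" list of known software names (lowercased) and a list of
# well-known ambiguous entries to exclude. Illustrative values only.
GO_LIST = {"spss", "stata", "matlab", "r", "excel"}
AMBIGUOUS = {"r"}


def count_hits(paragraph, go_list=GO_LIST, ambiguous=AMBIGUOUS):
    """Count tokens in the paragraph that appear in the adjusted go list."""
    terms = set(go_list) - set(ambiguous)
    tokens = re.findall(r"\w+", paragraph.lower())
    return sum(1 for tok in tokens if tok in terms)


def rank_paragraphs(paragraphs):
    """Return (index, hits) pairs, most hits first, dropping zero-hit paragraphs."""
    scored = [(i, count_hits(p)) for i, p in enumerate(paragraphs)]
    return sorted((s for s in scored if s[1] > 0), key=lambda s: s[1], reverse=True)
```

This only handles single-token names and ignores case, which is part of why precision would be low; it is meant for picking chunks to hand to annotators, not for producing final annotations.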

To that end, commit a02b847 moved the software_lists I was playing around with into data/software_lists/. @kermitt2 is going to use those to implement a matching model, resulting in JSON files with entity_spans with resp="naive_string_match" or something similar.
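A rough sketch of what such a matcher might look like, assuming a plain list of software names and whole-word, case-sensitive matching. Only the `resp` value comes from the description above; the other span fields (`start`, `end`, `text`, `type`) and the function name are hypothetical and may not match the real JSON format.

```python
import re


def naive_match(text, software_names, ambiguous=(), resp="naive_string_match"):
    """Return one span dict per whole-word occurrence of a listed name."""
    skip = set(ambiguous)
    # Longest names first, so "SPSS Statistics" wins over "SPSS".
    names = sorted((n for n in software_names if n not in skip), key=len, reverse=True)
    if not names:
        return []
    pattern = re.compile(
        r"(?<!\w)(?:" + "|".join(re.escape(n) for n in names) + r")(?!\w)"
    )
    return [
        {
            "start": m.start(),  # hypothetical field names, character offsets
            "end": m.end(),
            "text": m.group(0),
            "type": "software",
            "resp": resp,
        }
        for m in pattern.finditer(text)
    ]
```

For example, calling `naive_match(sentence, names, ambiguous=["R"])` would skip the bare name "R" while still matching longer names that happen to contain it as part of a word boundary-delimited phrase.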

Might be interesting to compare that effort against our gold standard annotations (after removing the specific strings for software names from that set), and against the trained model.
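One possible way to run that comparison, sketched under the assumption that both the gold standard and the matcher output can be reduced to `(doc_id, start, end)` tuples and that only exact span matches count. The function name and the tuple layout are illustrative.

```python
def precision_recall(predicted, gold):
    """Exact-match precision and recall over sets of (doc_id, start, end) spans."""
    predicted, gold = set(predicted), set(gold)
    true_pos = len(predicted & gold)
    precision = true_pos / len(predicted) if predicted else 0.0
    recall = true_pos / len(gold) if gold else 0.0
    return precision, recall
```

Exact matching is the strictest choice; a looser overlap-based criterion would probably flatter the naive matcher, so it is worth stating which one is used when reporting numbers.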

@kermitt2 kermitt2 mentioned this issue Aug 17, 2020
@kermitt2
Member

Back to the #666

Everything is available now.

Under softcite-dataset/code/corpus/ we have a set of scripts to convert TEI files to JSON, with annotations corresponding to basic matching using the lists in data/software_lists/. Documentation here.

The sequence would be as follows:

  • we start from a set of PDFs (from the Softcite corpus or any new PDFs); they can be converted into TEI XML via Grobid (Grobid must be installed, via its Docker image for example), the simplest way being to use the Grobid Python client

  • we convert the TEI XML into the JSON format via TEI2LossyJSON.py: we have paragraph-level segments with "ref-spans"

  • we add software annotations, either via corpus2JSON.py if we started with the PDFs from the Softcite corpus and want to inject the manual annotations, or via enrichJSON.py to add annotations with the "naive string matching" method or, alternatively, via the software mention service (it must be installed and running)

When annotations are added, we will have sentence-level segments with "ref-spans" and the added "entity-spans" for the software annotations.

About the corresponding added data introduced in #665:

  • softcite-dataset/data/tei contains the TEI files for the softcite corpus PDF (obtained with Grobid)
  • softcite-dataset/data/json contains the JSON files with the softcite corpus manual annotations (obtained with corpus2JSON.py).
  • softcite-dataset/data/json_with_whitelist contains the JSON files with the softcite corpus manual annotations and the "naive string matching" annotations (identified with "resp": "whitelist") (obtained with enrichJSON.py).
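To separate the two kinds of annotations in those files, something like the following could work. This is only a sketch: it assumes each sentence-level segment is a dict holding an "entity-spans" list whose items carry a "resp" field, and the exact layout of the real files should be checked against the documentation. The "manual" value in the usage example is a placeholder, not a confirmed label.

```python
from collections import Counter


def count_by_resp(segments):
    """Count entity spans per "resp" value across a list of segment dicts."""
    counts = Counter()
    for segment in segments:
        # Assumed key name; segments without annotations are skipped.
        for span in segment.get("entity-spans", []):
            counts[span.get("resp", "unknown")] += 1
    return counts


# Hypothetical in-memory example standing in for a loaded JSON file.
example_segments = [
    {"text": "We used SPSS.", "entity-spans": [{"resp": "whitelist"}, {"resp": "manual"}]},
    {"text": "No software here.", "entity-spans": []},
    {"text": "Plots were made in Excel.", "entity-spans": [{"resp": "whitelist"}]},
]
print(count_by_resp(example_segments))
```

Counting whitelist spans per document would also give a quick first look at which papers the naive matcher flags as mention-rich.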
