Grobid augmenter body sections, paragraphs, sentences #275
This PR makes the Grobid augmenter return what Grobid provides in the `<body>` section of the XML for what we call "Sections", made of "Headers" and "Body Text", which Grobid provides as coordinates of `<head>` (headers), `<p>` (paragraphs) and `<s>` (sentences).

While working on the test for this, I noticed that the number of sentences found in the body text (249) was not the same as the number of times the `<s>` tag was found in the actual XML (271). Of the 22 extras, 8 were from the paper abstract, which Grobid does not return as part of the body text but under `<teiHeader>` > `<profileDesc>` > `<abstract>` > `<div>`, and the remaining 14 were from Figure and Table `<div>`s.

I decided to leave all of those sentences out (lone sentences without any encompassing section), since for our current purposes we're just interested in the body text within "Sections".
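For anyone who wants to reproduce the count comparison, here is a minimal sketch using only the Python standard library. It is not the code in this PR: the `count_sentences` helper and the tiny `SAMPLE_TEI` document are hypothetical, and the sample assumes a simplified layout in which section sentences sit under `<body>` > `<div>` > `<p>` and figure captions sit inside `<figure>`.

```python
import xml.etree.ElementTree as ET

# TEI namespace used by Grobid's XML output.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Hypothetical, heavily simplified TEI document for illustration only.
SAMPLE_TEI = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <profileDesc>
      <abstract>
        <div><p><s>Abstract sentence.</s></p></div>
      </abstract>
    </profileDesc>
  </teiHeader>
  <text>
    <body>
      <div>
        <head>Introduction</head>
        <p><s>First body sentence.</s><s>Second body sentence.</s></p>
      </div>
      <figure>
        <head>Figure 1</head>
        <figDesc><div><p><s>Caption sentence.</s></p></div></figDesc>
      </figure>
    </body>
  </text>
</TEI>"""


def count_sentences(tei_xml: str) -> dict:
    """Count <s> tags overall, in the abstract, and in body sections."""
    root = ET.fromstring(tei_xml)
    # Every <s> anywhere in the document.
    total = len(root.findall(".//tei:s", TEI_NS))
    # <s> tags under the abstract in the TEI header.
    abstract = len(
        root.findall("./tei:teiHeader//tei:abstract//tei:s", TEI_NS)
    )
    # <s> tags in paragraphs of top-level body <div>s (our "Sections").
    sections = len(
        root.findall("./tei:text/tei:body/tei:div/tei:p/tei:s", TEI_NS)
    )
    return {
        "total": total,
        "abstract": abstract,
        "sections": sections,
        # Whatever is left: figure/table sentences in this sample.
        "other": total - abstract - sections,
    }


counts = count_sentences(SAMPLE_TEI)
print(counts)
```

On the real paper the same breakdown would be expected to account for the gap described above (249 section sentences, 8 abstract sentences and 14 figure/table sentences, totalling 271), assuming the XML follows the layout sketched here.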