Grobid augmenter body sections, paragraphs, sentences #275

Merged
geli-gel merged 4 commits into main from grobid_augmenter_sections
Aug 15, 2023

@geli-gel geli-gel commented Aug 15, 2023

This PR makes the Grobid augmenter return what Grobid provides in the <body> section of the XML. We call these "Sections", made up of "Headers" and "Body Text"; Grobid provides them as coordinates of <head> (headers), <p> (paragraphs), and <s> (sentences).
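
For readers unfamiliar with the Grobid output, here is a minimal sketch of how sections, paragraphs, and sentences could be pulled out of the <body>. It is not the code in this PR: the sample XML is made up, and `parse_coords` and `extract_sections` are hypothetical helper names. It assumes the TEI namespace and that each `coords` attribute holds `page,x,y,width,height` boxes separated by semicolons.

```python
import xml.etree.ElementTree as ET

TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Made-up sample for illustration only; not taken from the PR's test fixture.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text><body>
    <div>
      <head coords="1,50.0,100.0,80.0,10.0">Introduction</head>
      <p coords="1,50.0,120.0,200.0,24.0">
        <s coords="1,50.0,120.0,200.0,12.0">First sentence.</s>
        <s coords="1,50.0,132.0,150.0,12.0;2,50.0,40.0,60.0,12.0">Second sentence.</s>
      </p>
    </div>
  </body></text>
</TEI>"""


def parse_coords(value):
    """Turn 'page,x,y,w,h;page,x,y,w,h' into a list of box tuples."""
    boxes = []
    for chunk in (value or "").split(";"):
        if chunk.strip():
            page, x, y, w, h = chunk.split(",")
            boxes.append((int(page), float(x), float(y), float(w), float(h)))
    return boxes


def extract_sections(xml_text):
    """Collect header, paragraph, and sentence boxes for each top-level body <div>."""
    root = ET.fromstring(xml_text)
    sections = []
    for div in root.findall("./tei:text/tei:body/tei:div", TEI_NS):
        head = div.find("tei:head", TEI_NS)
        paragraphs = []
        for p in div.findall("tei:p", TEI_NS):
            sentences = [parse_coords(s.get("coords")) for s in p.findall("tei:s", TEI_NS)]
            paragraphs.append({"coords": parse_coords(p.get("coords")), "sentences": sentences})
        sections.append({
            "header": parse_coords(head.get("coords")) if head is not None else [],
            "paragraphs": paragraphs,
        })
    return sections


sections = extract_sections(SAMPLE)
```

Only direct children of <body> are visited, so sentences that sit outside a section <div> are not picked up by this sketch.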

While working on the test for this, I noticed that the number of sentences found in the body text (249) did not match the number of <s> tags in the actual XML (271). Of the 22 extras, 8 came from the paper's abstract, which Grobid returns not as part of the body text but under <teiHeader> > <profileDesc> > <abstract> > <div>, and the other 14 came from figure and table <div>s.

I decided to leave all of those sentences out (lone sentences without an enclosing section), since for our current purposes we're only interested in the body text within "Sections".
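
The sentence-count reconciliation described above can be sketched roughly as follows. This is an illustration under assumptions, not the test added in this PR: the sample XML is made up (it does not reproduce the 249/271 counts), `count_sentences` is a hypothetical helper, and the placement of the figure caption inside a `<figure>` element is an assumption about the Grobid output.

```python
import xml.etree.ElementTree as ET

TEI = "{http://www.tei-c.org/ns/1.0}"

# Made-up sample: 2 abstract sentences, 3 section sentences, 1 figure sentence.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader><profileDesc><abstract>
    <div><p><s>Abstract one.</s><s>Abstract two.</s></p></div>
  </abstract></profileDesc></teiHeader>
  <text><body>
    <div><head>Intro</head><p><s>Body one.</s><s>Body two.</s><s>Body three.</s></p></div>
    <figure><figDesc><div><p><s>Caption.</s></p></div></figDesc></figure>
  </body></text>
</TEI>"""


def count_sentences(xml_text):
    """Count <s> tags overall, in the abstract, and inside body sections."""
    root = ET.fromstring(xml_text)

    # Every <s> anywhere in the document.
    total = sum(1 for _ in root.iter(TEI + "s"))

    # <s> tags under teiHeader > profileDesc > abstract.
    abstract = root.find(f"./{TEI}teiHeader/{TEI}profileDesc/{TEI}abstract")
    in_abstract = sum(1 for _ in abstract.iter(TEI + "s")) if abstract is not None else 0

    # <s> tags inside paragraphs of top-level body <div>s (the "Sections").
    in_sections = 0
    body = root.find(f"./{TEI}text/{TEI}body")
    if body is not None:
        for div in body.findall(TEI + "div"):
            for p in div.findall(TEI + "p"):
                in_sections += len(p.findall(TEI + "s"))

    return {
        "total": total,
        "abstract": in_abstract,
        "sections": in_sections,
        "other": total - in_abstract - in_sections,
    }


counts = count_sentences(SAMPLE)
```

The "other" bucket is whatever remains after the abstract and section sentences are subtracted from the total, which in the case described above corresponds to the figure and table sentences.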

@regan-huff regan-huff left a comment

🚂

@geli-gel geli-gel merged commit dcf5b2c into main Aug 15, 2023
5 checks passed
@geli-gel geli-gel deleted the grobid_augmenter_sections branch August 15, 2023 22:33