group boxes into lines with tolerance #274

Merged
gituser768 merged 4 commits into main from dh-robust-merge-citation-boxes on Aug 17, 2023

Conversation

gituser768
Contributor

This PR groups boxes into lines without requiring their box.t values to match exactly. Instead, box.t is compared at the third decimal place, a tolerance that seems tight enough to keep distinct lines apart but loose enough to catch most near-matches.
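As a rough illustration of the grouping idea (not the code in this PR), here is a minimal sketch that buckets boxes by page and by box.t rounded to three decimals. The `Box` fields other than `t`, the function name `group_boxes_into_lines`, and the `ndigits` parameter are assumptions made for the example.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    # Hypothetical box layout: left, top, width, height (relative units) and page.
    l: float
    t: float
    w: float
    h: float
    page: int


def group_boxes_into_lines(boxes, ndigits=3):
    """Group boxes whose top coordinate agrees after rounding to `ndigits`."""
    lines = defaultdict(list)
    for box in boxes:
        # Boxes on the same page with the same rounded top share a line.
        lines[(box.page, round(box.t, ndigits))].append(box)
    # Lines in reading order (page, then top); boxes left to right within a line.
    return [sorted(group, key=lambda b: b.l) for _, group in sorted(lines.items())]


boxes = [
    Box(l=0.5, t=0.12341, w=0.1, h=0.01, page=0),
    Box(l=0.1, t=0.12338, w=0.1, h=0.01, page=0),
    Box(l=0.1, t=0.2, w=0.1, h=0.01, page=0),
]
lines = group_boxes_into_lines(boxes)
```

One caveat with rounding as the tolerance: two tops that straddle a rounding boundary still land in different buckets even when they are very close, which may be why the description says this catches "most" cases.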

This PR also changes how adjacent boxes are merged for a mention whose two spans may be far apart. Instead of skipping the merge entirely in that case, it now merges only the boxes whose associated spans are close together.
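The span-proximity rule could look something like the sketch below. This is an assumption-laden illustration rather than the PR's implementation: the `Span` and `Box` classes, the `merge_close_boxes` function, and the `max_gap` threshold (measured in characters) are all hypothetical names chosen for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    # Hypothetical character offsets of the text a box covers.
    start: int
    end: int


@dataclass(frozen=True)
class Box:
    # Hypothetical box layout: left, top, width, height.
    l: float
    t: float
    w: float
    h: float


def union(a, b):
    """Smallest box that contains both a and b."""
    left, top = min(a.l, b.l), min(a.t, b.t)
    right = max(a.l + a.w, b.l + b.w)
    bottom = max(a.t + a.h, b.t + b.h)
    return Box(left, top, right - left, bottom - top)


def merge_close_boxes(pairs, max_gap=1):
    """Merge (span, box) pairs whose spans are at most `max_gap` characters apart."""
    merged = []
    for span, box in sorted(pairs, key=lambda p: p[0].start):
        if merged and span.start - merged[-1][0].end <= max_gap:
            prev_span, prev_box = merged[-1]
            new_span = Span(prev_span.start, max(prev_span.end, span.end))
            merged[-1] = (new_span, union(prev_box, box))
        else:
            # Span is far from the previous one: keep its box separate.
            merged.append((span, box))
    return merged


pairs = [
    (Span(0, 5), Box(0.1, 0.1, 0.2, 0.02)),
    (Span(6, 10), Box(0.3, 0.1, 0.2, 0.02)),
    (Span(500, 510), Box(0.1, 0.8, 0.2, 0.02)),
]
result = merge_close_boxes(pairs)
```

Here the first two boxes are merged because their spans are one character apart, while the distant third span keeps its own box.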

@geli-gel
Contributor

Looks like this will improve things, but I'm wondering whether we could take advantage of PDFPlumber's line segmentation (rows) downstream to decide how to draw boxes.

@geli-gel
Contributor

geli-gel commented Aug 15, 2023

Just realized a version bump is needed here. I'm setting it to 0.9.11 in my PR, so you can take 0.9.12.

@gituser768 gituser768 merged commit e9708d6 into main Aug 17, 2023
5 checks passed
@gituser768 gituser768 deleted the dh-robust-merge-citation-boxes branch August 17, 2023 17:53