-
Notifications
You must be signed in to change notification settings - Fork 1.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
BUG: Fix incorrect tm_matrix in call to visitor_text #2060
Conversation
Supply the old tm_matrix when flushing out `text` to the `visitor_text` in `crlf_space_check`.
Codecov ReportPatch coverage:
Additional details and impacted files@@ Coverage Diff @@
## main #2060 +/- ##
==========================================
+ Coverage 94.17% 94.28% +0.10%
==========================================
Files 41 41
Lines 7333 7365 +32
Branches 1442 1451 +9
==========================================
+ Hits 6906 6944 +38
+ Misses 266 262 -4
+ Partials 161 159 -2
☔ View full report in Codecov by Sentry. |
@troethe |
Check the coordinates computed from the tm_matrix and cm_matrix that are given to `visitor_text` when extracting text. The test computes the coordinates of three letters in different parts of a test page based on the matrices given to the `visitor_text` function and checks, if they are roughly where they should be.
@pubpub-zz : All right, I added a new test on a simplified test document where the correct positioning is a bit more apparent. Let me know if you have any feedback. Edit: Also note that my original PR was wrong, because I passed the |
@troethe From my side we are good to go :-) Please adjust the top comment of the PR in such a way that it can be the commit message of the squash-merge. For example:
One last question:
What do you mean by "preliminary"? Doesn't this fix the mentioned issue? |
@pubpub-zz I guess this PR is fine from your perspective? |
Yes, it does, but when I originally submitted this, I didn't quite understand all of the code yet and I had not thoroughly tested it as shown by the fact that it ended up not being quite right. I guess I should have called it a WIP/draft instead. |
Ok, then please let me know when you think I can merge :-) |
Supply the old tm_matrix when flushing out `text` to the `visitor_text` in `crlf_space_check`. The new one might already be changed and unrelated to the current text. Also add a test for the tm_matrix and cm_matrix that are given to `visitor_text` when extracting text. The test computes the coordinates of three letters in different parts of a test page based on the matrices and checks, if they are roughly where they should be. Fixes py-pdf#2059
7c4c1c4
to
2e42262
Compare
@MartinThoma Ok, I think it's good now! |
No problem :-) Thank you for the contribution 🙏 I will make a release that contains it today |
If you want, I can add you to https://pypdf.readthedocs.io/en/latest/meta/CONTRIBUTORS.html :-) |
## What's new ### Performance Improvements (PI) - optimize _decode_png_prediction (#2068) ### Bug Fixes (BUG) - Fix incorrect tm_matrix in call to visitor_text (#2060) - Writing German characters into form fields (#2047) - Prevent stall when accessing image in corrupted pdf (#2081) - append() fails when articles do not have /T (#2080) ### Robustness (ROB) - Cope with xref not followed by separator (#2083) [Full Changelog](3.15.0...3.15.1)
Sure, why not. :) Thank you! |
Supply the old tm_matrix when flushing out
text
to thevisitor_text
in
crlf_space_check
. The new one might already be changed andunrelated to the current text.
Also add a test for the tm_matrix and cm_matrix that are given to
visitor_text
when extracting text.The test computes the coordinates of three letters in different
parts of a test page based on the matrices and checks, if they are
roughly where they should be.
Fixes #2059