[]TJ command parsed improperly #710
Comments
This likely has something to do with some representation of newlines in strings that isn't being escaped. I'll need to know the initial state of the document stream @DisabledMonkey. Please add the following line in your PDFObject.php as the first line of the
And let me know what output you see, specifically around this TJ command. Thanks. |
Trying to look at just a small portion, this is what it looks like before any processing in that
so does seem like the |
Thanks, but there is something that's not showing up here since I can copy paste that string into the unit tests and it parses properly. Change the added line to |
here you go:
thanks |
Does the match appear more than once? The string you posted doesn't quite match with the one from your previous post. |
I dug through it until I eventually found a portion that trips it up. It almost seems like something earlier in the PDF causes later parts to break. This chunk should return at least one bad []TJ block:
|
In your line-numbered screenshot from the OP, can you provide lines 525 to 600? |
Sorry for all the back and forth. My boss gave me permission to share a similar doc that has the same problem from 2018 |
Thanks for that. I cannot reproduce the broken TJ command behaviour with my current copy of PdfParser, but your document does contain inline images. Can you check whether this issue might be resolved by the recently merged #693? If not that, it might have something to do with your PHP version of 7.2 perhaps? |
Can confirm master branch is working, so does appear like the changes in #693 fixed this. Thanks for your time and assistance. |
Description:
A PDF containing []TJ commands is parsed incorrectly: part of a []TJ command is returned as part of the parsed text.
Per this section of code
pdfparser/src/Smalot/PdfParser/PDFObject.php
Lines 358 to 365 in 14adf31
But looking at the results of that, we can see the formatContent method returned that []TJ command across several lines
PDF input
It is a document containing financial information for the company I work for, so I can't provide it.
Expected output & actual output
Expected: actual English text
Actual: part of the []TJ command is returned
In version 2.7 the PDF in question parsed correctly, but it hasn't in any version since.
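For context on what "actual English text" means here: in a PDF content stream, the TJ operator takes an array that mixes strings with numeric position adjustments, and only the strings carry text. The sketch below is a rough Python illustration of that idea, not PdfParser's code. The name `tj_text` is hypothetical, and it assumes literal `( )` strings with only escaped parentheses and backslashes (no hex strings, no nested parentheses).

```python
import re

# A literal PDF string: "(" ... ")" with backslash escapes.
# Simplification: nested unescaped parentheses are not handled.
_STRING = re.compile(r"\((?:\\.|[^\\()])*\)", re.DOTALL)


def tj_text(array_body: str) -> str:
    """Return the text carried by the body of a [ ... ] TJ array.

    Numbers between the strings are position adjustments and are skipped.
    """
    parts = []
    for match in _STRING.finditer(array_body):
        inner = match.group(0)[1:-1]  # drop the surrounding parentheses
        # Only \( \) and \\ are unescaped in this sketch.
        parts.append(re.sub(r"\\([()\\])", r"\1", inner))
    return "".join(parts)


print(tj_text("(Hel) -20 (lo)"))  # Hello
```

If the array is cut off partway, for example because the command was split across lines before this step, the leftover fragment can leak into the output instead of being decoded, which matches the symptom described above.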
Code
While this might not be the proper way to fix the problem....
I saw that the dictionary command had a similar case accounted for, so I added a bit of code to do the same thing for this TJ command, and it fixed my output.
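The general idea of such a workaround can be sketched as follows. This is not the PHP code from PDFObject.php, nor the change merged in #693; it is a Python illustration under stated assumptions. The name `join_tj_arrays` and both regular expressions are hypothetical, and the sketch assumes that the problem is line breaks between the elements of a `[ ... ] TJ` array, and that literal strings contain no nested unescaped parentheses.

```python
import re

# A literal PDF string with backslash escapes (no nested parentheses).
_STRING = r"\((?:\\.|[^\\()])*\)"
# A whole "[ ... ] TJ" command: the array may hold strings and anything
# else that is not a bracket or parenthesis (numbers, whitespace, hex).
_TJ_ARRAY = re.compile(r"\[((?:" + _STRING + r"|[^\[\]()])*)\]\s*TJ\b", re.DOTALL)
# Splits an array body into literal strings and the runs between them.
_PIECE = re.compile(_STRING + r"|[^()]+", re.DOTALL)


def _flatten(match: re.Match) -> str:
    parts = []
    for piece in _PIECE.finditer(match.group(1)):
        text = piece.group(0)
        if not text.startswith("("):
            # Outside strings, whitespace (including newlines) only
            # separates tokens, so collapsing it is safe.
            text = re.sub(r"\s+", " ", text)
        parts.append(text)  # string contents are left untouched
    return "[" + "".join(parts) + "] TJ"


def join_tj_arrays(stream: str) -> str:
    """Put every [ ... ] TJ command on a single line."""
    return _TJ_ARRAY.sub(_flatten, stream)


stream = "BT\n[(Hel)-20\n(lo)\n15(!)]TJ\nET"
print(join_tj_arrays(stream))
```

Newlines inside a string are deliberately preserved, since they are part of the string's value; only the separators between array elements are normalized.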