[]TJ command parsed improperly #710

Closed
DisabledMonkey opened this issue May 10, 2024 · 10 comments
@DisabledMonkey

  • PHP Version: 7.2
  • PDFParser Version: > 2.7.0

Description:

A PDF containing []TJ commands is parsed incorrectly: part of the command itself is returned in the parsed text.

Per this section of code

// A cleaned stream has one command on every line, so split the
// cleaned stream content on \r\n into an array
$textCleaned = preg_split(
    '/(\r\n|\n|\r)/',
    $this->formatContent($content),
    -1,
    \PREG_SPLIT_NO_EMPTY
);
It sounds like each command is supposed to end up on a single line of text.
But looking at the result, we can see that the formatContent() method returned the []TJ command split across several lines:
[Screenshot: line-numbered output of formatContent()]
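A minimal sketch of why a line break inside a TJ array would defeat the split quoted above. The stream content is invented for illustration, and the raw newline byte inside the string is an assumption, not something taken from the reporter's document.

```php
<?php
// Hypothetical cleaned stream with three commands: BT, a TJ array, ET.
// The double-quoted "\n" inside the array is a raw line feed byte (assumed).
$cleaned = "BT\n[(A)12(\n)34(B)]TJ\nET";

// Same pattern and flags as the snippet quoted in the issue.
$lines = preg_split('/(\r\n|\n|\r)/', $cleaned, -1, PREG_SPLIT_NO_EMPTY);

// The TJ array is cut in two, so four fragments come back instead of three.
print_r($lines);
```

With this input the second and third fragments are `[(A)12(` and `)34(B)]TJ`, which matches the symptom of a partial command leaking into the text.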

PDF input

It is a document containing financial information for the company I work for, so I can't provide it.

Expected output & actual output

Expected: actual English text
Actual: a fragment of the []TJ command is returned
[Screenshot: actual output]
In version 2.7 the PDF in question parsed correctly, but it hasn't in any version since.

Code

This might not be the proper way to fix the problem, but I saw that the dictionary command had a similar case accounted for, so I added a bit of code to do the same for the TJ command, and it fixed my output:

while (preg_match('/(\[.*?\] *)(TJ)/', $content, $dicttext)) {
    $dictid = uniqid('DICT_', true);
    $dictstore[$dictid] = $dicttext[1];
    $content = preg_replace(
        '/'.preg_quote($dicttext[0], '/').'/',
        ' ###'.$dictid.'###'.$dicttext[2],
        $content,
        1
    );
}
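The same placeholder idea can be sketched as a self-contained round trip. This is an illustration only: `protectTjArrays()` and `restoreTjArrays()` are hypothetical helpers, not part of PdfParser, the restore step is an assumption about what has to happen later, and the `s` flag is an addition that the snippet above does not have.

```php
<?php
// Hypothetical helper: swap each [...] array that precedes TJ for a placeholder.
function protectTjArrays(string $content, array &$store): string
{
    // The 's' flag lets '.' match newlines, so arrays spanning lines are caught.
    return preg_replace_callback(
        '/(\[.*?\] *)(TJ)/s',
        function (array $m) use (&$store): string {
            $id = uniqid('DICT_', true);
            $store[$id] = $m[1];          // remember the original array text
            return ' ###'.$id.'###'.$m[2]; // placeholder followed by the operator
        },
        $content
    );
}

// Hypothetical helper: put the stored arrays back in place of the placeholders.
function restoreTjArrays(string $content, array $store): string
{
    foreach ($store as $id => $text) {
        $content = str_replace('###'.$id.'###', $text, $content);
    }
    return $content;
}

$store = [];
$raw = "BT\n[(A)12(\n)34(B)]TJ\nET"; // invented stream with a raw newline
$protected = protectTjArrays($raw, $store);
$lines = preg_split('/(\r\n|\n|\r)/', $protected, -1, PREG_SPLIT_NO_EMPTY);
$restored = restoreTjArrays($protected, $store);
```

The pattern is deliberately naive: it can start at an unrelated `[` earlier in the stream, and it does not account for a `]` inside a string, so it is a workaround sketch rather than a parser.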
@k00ni added the bug label May 13, 2024
@GreyWyvern
Contributor

This likely has something to do with some representation of newlines in strings that isn't being escaped. I'll need to know the initial state of the document stream @DisabledMonkey.

Please add the following line in your PDFObject.php as the first line of the formatContent() function:

var_dump($content);

And let me know what output you see, specifically around this TJ command. Thanks.

@DisabledMonkey
Author

Looking at just a small portion, this is what the content looks like before any processing in the formatContent() function:

[(.)35.2013(\r)73.2169(\x05)39.5429(\r)73.2169(\n)18.7748(\x1E)-5.3566(\x1F)54.1166(\n)4.20113(#)13.7749(\v)36.1006(\x1E)9.21705(\t)91.217(\x17)83.1011(\x17)83.1011(\x02)74.2167(\x1E)9.21705(\x06)57.1009(\x1F)39.5421(\t)91.217(\n)18.7757(\x03)446]TJ

So it does seem like the (\n) entries in there are what cause the problem, and escaping them in some manner should hopefully fix it.

@GreyWyvern
Contributor

Thanks, but there is something that's not showing up here since I can copy paste that string into the unit tests and it parses properly.

Change the added line to var_dump(bin2hex($content)); then do a search for 5b282e2933352e32303133285c722937332e3231 in the output and paste 500 characters here, beginning from the matched text.
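The search step can also be scripted. A minimal sketch, assuming only PHP's built-in `bin2hex()`, `hex2bin()`, `strpos()` and `substr()`; `hexWindow()` is a hypothetical helper name, and the sample stream is invented. It searches the raw bytes rather than the hex text so that a match cannot start halfway through a byte.

```php
<?php
// Hypothetical helper: return up to $hexLength hex characters of $content,
// starting at the first occurrence of the bytes encoded by $needleHex.
function hexWindow(string $content, string $needleHex, int $hexLength = 500): ?string
{
    $pos = strpos($content, hex2bin($needleHex));
    if ($pos === false) {
        return null; // needle not present in this stream
    }
    // Two hex characters encode one byte.
    return bin2hex(substr($content, $pos, intdiv($hexLength, 2)));
}

// The search string from the comment above; it is the hex form of the text
// [(.)35.2013(\r)73.21 with a literal backslash followed by 'r'.
$needle = '5b282e2933352e32303133285c722937332e3231';

// Invented sample stream; single quotes keep the backslash sequences literal.
$sample = "q\n".'[(.)35.2013(\r)73.2169(\x05)39.5429]TJ'."\nET\n";

echo hexWindow($sample, $needle, 40), "\n";
```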

@DisabledMonkey
Author

DisabledMonkey commented May 13, 2024

Here you go:

5b282e2933352e32303133285c722937332e3231363928052933392e35343239285c722937332e32313639285c6e2931382e37373438281e292d352e33353636281f2935342e3131365b282e2933352e32303133285c722937332e323136285c6e29342e323031313328232931332e37373439280b2933362e31303036281e29392e323137303528092939312e32313728172938332e3130313128172938332e3130313128022937342e32313637281e29392e323137303528062935372e31303039281f2933392e3534323128092939312e323137285c6e2931382e373735372803293434365d544a0a45540a510a302e36323839303620670a

Thanks.

@GreyWyvern
Contributor

Does the match appear more than once? The string you posted doesn't quite match with the one from your previous post.
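One thing the hex form makes checkable is whether a `(\n)` entry holds a raw line feed byte or the two-character escape. A minimal sketch using an excerpt copied from the hex dump posted above; it only speaks for this excerpt, not for the whole document.

```php
<?php
// Excerpt from the posted hex dump, covering one "(\n)" entry and its offset.
$excerptHex = '285c6e2931382e37373438';
$bytes = hex2bin($excerptHex);

// Backslash followed by 'n' (bytes 5c 6e): an escaped newline inside a string.
$hasEscapedNewline = strpos($bytes, '\\n') !== false;

// A raw line feed (byte 0a): this is what would break a line-based split.
$hasRawNewline = strpos($bytes, "\n") !== false;

var_dump($bytes, $hasEscapedNewline, $hasRawNewline);
```

Here the excerpt contains the escape sequence and no raw line feed, so at least this entry should not split a line on its own.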

@DisabledMonkey
Author

I dug through the output until I found a portion that trips it up. It almost seems like something earlier in the PDF causes later parts to break.

This chunk should return at least one bad []TJ block:

44503c3c2f507265646963746f722031350a2f436f6c756d6e732031320a2f436f6c6f727320333e3e0a494420789c637cfefafdcea3171980e00703047c40625b1b8a9bea6b322ed87020dedf1e24f093e12344e60744dd47202320a3e0c28e053045d85400190950452b0ec407d863550124130a608a0202ecb1aa0002a8a2092b0e240015fd607088486440020b264cf8f083a1a002aec8c31ed38c0f6012a128c0c31eab0a54450e06585500b9050d70451606585500051b208a4e5fbc9e5adec98003ccee2c0785382e69640000f755840d0a454920510a3020670a710a382e33333333332030203020382e33333333332030203020636d2042540a2f52313420382e32352054660a302e3939383036352030203020312035342e3731362036343520546d0a5b28072934352e373133281a2931312e37373533280b2933362e3130303628172938332e3130303228022937342e3231363728032938312e36353833280429342e323031313328052935342e3131363628062935372e31303039281f2935342e31313636285c6e29342e323031313328232931332e37373439280b2933362e31303036281d292d302e373938373735285c722937332e3231363928052935342e31313636285c722935382e36343333285c6e2931382e37373438281e29392e32313739342802293532365d544a0a45540a510a302e36323839303620670a3330302035333331203236383820362072650a660a710a343338382035333631203339332038332072652057206e0a3020670a710a382e33333333332030203020382e33333333332030203020636d20

@GreyWyvern
Contributor

In your line-numbered screenshot from the OP, can you provide lines 525 to 600?

@DisabledMonkey
Author

DisabledMonkey commented May 13, 2024

Sorry for all the back and forth.
Obviously it's difficult to pick and choose pieces of it.

My boss gave me permission to share a similar doc from 2018 that has the same problem.

@GreyWyvern
Contributor

Thanks for that. I cannot reproduce the broken TJ command behaviour with my current copy of PdfParser, but your document does contain inline images. Can you check whether this issue might be resolved by the recently merged #693?

If not that, it might have something to do with your PHP version of 7.2 perhaps?
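For readers following along, the start of the hex chunk posted earlier decodes to an inline image dictionary followed by the ID operator, which is consistent with the inline images mentioned here. A minimal sketch, assuming only PHP's built-in `hex2bin()`; the excerpt is copied from that chunk, and the sketch does not establish that this data is what confused the parser.

```php
<?php
// First bytes of the hex chunk posted earlier in the thread, up to and
// including the ID operator that starts the inline image data.
$excerptHex = '44503c3c2f507265646963746f722031350a'
    .'2f436f6c756d6e732031320a'
    .'2f436f6c6f727320333e3e0a'
    .'494420';

$decoded = hex2bin($excerptHex);

// "\nID " marks the start of raw image bytes, which run until a later EI.
$hasInlineImageData = strpos($decoded, "\nID ") !== false;

var_dump($decoded, $hasInlineImageData);
```

The bytes between ID and EI are compressed image data, so they can contain arbitrary values, including line break bytes and bracket characters, which a text-oriented cleanup step could plausibly misread.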

@DisabledMonkey
Author

I can confirm the master branch is working, so it does appear that the changes in #693 fixed this.
I should have just waited a few more days, I guess.

Thanks for your time and assistance.

@DisabledMonkey closed this as not planned May 13, 2024