Fix hexadecimal decoding (fixes #715) #716

krzyc · 2024-05-27T17:56:33Z

Type of pull request

Bug fix (involves code and configuration changes)

About

Fix for #715 as requested by @k00ni. Tests are passing but I am not confident that this will not break pdfparser because I lack knowledge of internal work of library and PDF format.

k00ni · 2024-05-28T05:43:32Z

There are syntax errors, please check: https://github.com/smalot/pdfparser/actions/runs/9259258917/job/25483604622?pr=716#step:6:13

krzyc · 2024-05-28T08:16:49Z

@k00ni I have fixed PHP 7.x incompatibility.
Regarding testSpecialCharsEncodedAsHex test failed - I have still not found a method to fix it - I have added this test because during additional inspection I have discovered that there is more related problems with round brackets.
But I think that PR is already in state that allows us to verify whether my approach is correct.

k00ni · 2024-05-28T13:39:51Z

Is there a reference in the PDF specification you referring to with this changes in general?

krzyc · 2024-05-28T14:44:00Z

Probably, but it was not intentional. Let it be "PDF Reference sixth edition" by Adobe Systems Incorporated (https://opensource.adobe.com/dc-acrobat-sdk-docs/pdfstandards/pdfreference1.7old.pdf) page 53:
"Any characters may appear in a string except unbalanced parentheses and the backslash, which must be treated specially" e.g.

( Strings may contain balanced parentheses ( ) and
special characters ( * ! & } ^ % and so on ) . )

(Maybe tests should include this string instead of mine.)

I was mainly interested in reading Sig objects (https://www.adobe.com/devnet-docs/acrobatetk/tools/DigSig/Acrobat_DigitalSignatures_in_PDF.pdf), but pdfparser was truncating binary data so I was forced to dive into basics and found more problems as in attached tests.

My concern is I do not understand why pdfparser is parsing strings like this:

pdfparser/src/Smalot/PdfParser/Element/ElementString.php

Lines 62 to 74 in 4b86c66

    
           // Find next ')' not escaped. 
        
           $cur_start_text = $start_search_end = 0; 
        
           while (false !== ($cur_start_pos = strpos($name, ')', $start_search_end))) { 
        
               $cur_extract = substr($name, $cur_start_text, $cur_start_pos - $cur_start_text); 
        
               preg_match('/(?P<escape>[\\\]*)$/s', $cur_extract, $match); 
        
               if (!(\strlen($match['escape']) % 2)) { 
        
                   break; 
        
               } 
        
               $start_search_end = $cur_start_pos + 1; 
        
           } 
        
           // Extract string. 
        
           $name = substr($name, 0, (int) $cur_start_pos);

so I am not comfortable at modifying it.

I am looking for advice or help in fixing problem as I written earlier in #715.

krzyc · 2024-05-29T11:21:12Z

I have rewritten element string parser and now tests pass.
I have two observations:

backslash followed by space should return backslash and space, but this triggers fail of tests, so I kept old behavior (returning space only) for compatibility - to be discussed.
octal parsing should be probably moved to parsing loop, but I kept current behavior - to be discussed.

I am awaiting your feedback.

k00ni · 2024-05-30T06:59:59Z

Thanks for your effort! I try to give you feedback as soon as possible.

- Moved some test-related code to separate function to improve code readability - renamed and refined testSpecialCharsEncodedAsHex: you don't need an if-clause in a test if you insist on certain values along the way, just use assertX to check for expected values ("fail early") - ElementString: moved the part which handles escaped characters to a separate function to improve code readability - added references / comments

... which fail (mostly) without the fixes in this branch.

k00ni

@krzyc Again, thanks for your effort. I dove into your code and the PDF specification 1.7 and holy cow, the spec has over 1300 pages. This topic is not easy but I think I got what you try to achieve with your patch as well as what the spec says in this regard. But further feedback is welcomed as always, CC: @j0k3r @GreyWyvern.

I took the liberty to change some of your code. Here is a summary of my changes:

refactored some code parts
added references and some comments
added a few tests

(for further details, please see my commit notes)

About the tests: I copied a few example texts from the spec (page 54 + 55) to see if PDFParser fails without your fixes. It does most of the time. Only this one works pre- and post patch.

What is great is the fact that it now parses strings like (ref):

(Strings may contain balanced parentheses ( ) and
special characters (*!&}^% and so on).)

Pre-patch PDFParser would have only got until Strings may contain balanced parentheses (. This change is super relevant for scientific papers which often use parentheses in some way (( ), [ ], { }, ...).

I can image there are further edge cases which still might not work. But we should aim for the following: It can get new features/capabilities as long as it retains its current ones.

To your knowledge, are there any cases left which worked before but don't now?

k00ni · 2024-06-05T08:14:18Z

src/Smalot/PdfParser/Element/ElementString.php

+            case ' ': // TODO: this should probably be removed - kept for compatibility
+                $processedName .= $nextChar;
+                $name = substr($name, 1);
+                ++$position;
+                break;


What do you mean with "kept for compatibility"? To what?

k00ni · 2024-06-05T08:17:51Z

src/Smalot/PdfParser/Element/ElementHexa.php

-            // repackage string as standard
-            $name = '('.self::decode($name).')';
-            $element = ElementDate::parse($name, $document);
+            $name = self::decode($name);
+            $element = ElementDate::parse('('.$name.')', $document);

            if (!$element) {
-                $element = ElementString::parse($name, $document);
+                $element = new self($name);
            }


Please give me a short summary what you try to accomplish here.

Why removing the reference to $document?

Also why switching from ElementString to ElementHexa? Is it to match others existing behavior from other elements classes?

GreyWyvern · 2024-06-05T15:55:14Z

After an initial review, this appears to be an issue solely with ElementString::parse. There are zero changes required to the ElementHexa.php file IMHO. ElementHexa::parse sends ElementString::parse a perfectly valid string, and it's ElementString that proceeds to mess it up.

The only reason ElementHexa is involved at all is because it's a vector that allows this sort of tricky (but still valid!) string to reach ElementString, whereas normally ElementString would rarely see such strings.

ElementString::parse expects all internal parentheses to be escaped, and I would wager that in >99% of PDFs, they are. But according to the PDF Reference, balanced, unescaped parentheses are allowed. So this is a matter of just updating ElementString::parse to handle balanced, unescaped parentheses. I haven't fully examined the updated code in ElementString::parse in this PR, but as long as it does that, there should be no other edits necessary to ElementHexa.php.

Edit: In fact, when I update the test hexadecimal string to '\(\(\\\\' the output is '()\', not truncated at all, because all the parentheses were escaped as ElementString::parse expected.

k00ni · 2024-06-21T11:37:08Z

What are your plans here @krzyc?

As mentioned, the change in ElementHexa does not seem to be required.

GreyWyvern · 2024-06-24T13:19:24Z

I've been thinking about this some more and it's a little more complicated than it seems at first glance.

ElementString accepts (and currently expects) strings with escaped slashes and parentheses. 99% of the time strings from PDF document streams will be escaped, BUT unescaped, balanced parentheses are allowed. Unescaped backslashes are NOT allowed; according to the PDF reference if they aren't followed by a valid escape character, they should be ignored.

And that is the problem with strings from ElementHexa. It passes a completely unescaped (but valid) string, but if it contains literal backslashes, then further parsing it will remove them or result in an incorrect special character.

Strings that originate from ElementHexa should be passed to ElementString to create the appropriate object, but ElementString::parse() should probably not be called since the string is already unescaped. Or if necessary, a new function to process already unescaped strings can be added.

k00ni · 2024-07-19T10:42:38Z

I don't want to let this PR become stale, but my time is limited currently. What is the minimal amount of work to finish this? A revert of the ElementHexa part and some refinement should be enough. Any suggestions?

CC @GreyWyvern @j0k3r

k00ni · 2024-08-16T06:46:04Z

Closed due to inactivity and no response from the author.

krzyc added 2 commits May 27, 2024 19:45

Fix hexadecimal decoding (issue smalot#715)

a604278

More round bracket related tests

edaa7f4

k00ni added needs work fix labels May 28, 2024

Fix PHP 7.x compatibility

8e6d3b7

Rewrite ElementString parser

4ba5b8d

krzyc marked this pull request as ready for review May 29, 2024 11:22

k00ni linked an issue Jun 5, 2024 that may be closed by this pull request

Closing round bracket encoded in hexadecimal format breaks parsing #715

Open

ElementStringTest: added further tests, ...

53d5ccd

... which fail (mostly) without the fixes in this branch.

k00ni reviewed Jun 5, 2024

View reviewed changes

k00ni closed this Aug 16, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix hexadecimal decoding (fixes #715) #716

Fix hexadecimal decoding (fixes #715) #716

krzyc commented May 27, 2024

k00ni commented May 28, 2024 •

edited

Loading

krzyc commented May 28, 2024

k00ni commented May 28, 2024

krzyc commented May 28, 2024

krzyc commented May 29, 2024 •

edited

Loading

k00ni commented May 30, 2024

k00ni left a comment •

edited

Loading

k00ni Jun 5, 2024

k00ni Jun 5, 2024

j0k3r Jun 5, 2024

GreyWyvern commented Jun 5, 2024 •

edited

Loading

k00ni commented Jun 21, 2024

GreyWyvern commented Jun 24, 2024

k00ni commented Jul 19, 2024 •

edited

Loading

k00ni commented Aug 16, 2024

Fix hexadecimal decoding (fixes #715) #716

Fix hexadecimal decoding (fixes #715) #716

Conversation

krzyc commented May 27, 2024

Type of pull request

About

k00ni commented May 28, 2024 • edited Loading

krzyc commented May 28, 2024

k00ni commented May 28, 2024

krzyc commented May 28, 2024

krzyc commented May 29, 2024 • edited Loading

k00ni commented May 30, 2024

k00ni left a comment • edited Loading

Choose a reason for hiding this comment

k00ni Jun 5, 2024

Choose a reason for hiding this comment

k00ni Jun 5, 2024

Choose a reason for hiding this comment

j0k3r Jun 5, 2024

Choose a reason for hiding this comment

GreyWyvern commented Jun 5, 2024 • edited Loading

k00ni commented Jun 21, 2024

GreyWyvern commented Jun 24, 2024

k00ni commented Jul 19, 2024 • edited Loading

k00ni commented Aug 16, 2024

k00ni commented May 28, 2024 •

edited

Loading

krzyc commented May 29, 2024 •

edited

Loading

k00ni left a comment •

edited

Loading

GreyWyvern commented Jun 5, 2024 •

edited

Loading

k00ni commented Jul 19, 2024 •

edited

Loading