Account for inline images in formatContent() #693

GreyWyvern · 2024-03-15T17:59:32Z

Type of pull request

Bug fix (involves code and configuration changes)

About

formatContent() now accounts for inline image BI ... ID ... EI commands in document streams. Resolves #691.

Checklist for code / configuration changes

In case you changed the code/configuration, please read each of the following checkboxes as they contain valuable information:

Please add at least one test case (unit test, system test, ...) to demonstrate that the change is working. If existing code was changed, your tests cover these code parts as well.
Please run PHP-CS-Fixer before committing, to conform to our coding styles. See https://github.com/smalot/pdfparser/blob/master/.php-cs-fixer.php for more information about our coding styles.
In case you fix an existing issue, please do one of the following:
- Write in this text something like fixes #1234 to outline that you are providing a fix for the issue #1234.

`formatContent()` now accounts for inline image `BI ... ID ... EI` commands in document streams.

Include the `BI` command in the regexp, and move inline image detection after string replacement to prevent false-positives.

k00ni

@GreyWyvern 👍

tests/PHPUnit/Integration/PDFObjectTest.php

GreyWyvern · 2024-03-25T15:38:21Z

Converting this to a draft for now. @iGrog supplied another PDF that still had the issue, and while fixing it I became sure there is an edge case: if a (string) contains the BI keyword, and ID and EI can then be found further on in the document, a large chunk of the document could be ignored. There is a very small chance of this happening, but it's there.

The internal content of the captured BI ... ID ... EI needs to be checked to verify that it is indeed inline image content before allowing the replace. I'll work on this and update this PR when ready.
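To make the edge case concrete, here is a minimal sketch in Python (used purely for illustration; the library itself is PHP). The pattern and the sample stream are assumptions made up for this example, not the code in this PR. It shows how a naive `BI ... ID ... EI` match that starts inside a string can swallow unrelated text.

```python
import re

# Hypothetical content stream: "BI" appears inside a text string, and a
# genuine inline image follows later in the same stream.
stream = (
    "BT (BI is a keyword) Tj ET\n"
    "BT (Important text) Tj ET\n"
    "BI /W 1 /H 1 /BPC 8 /CS /G ID \x00 EI\n"
)

# Naive pattern (an assumption for illustration): match lazily from the
# first BI token to the next ID token and then to the next EI token.
naive = re.compile(r"\bBI\b.*?\bID\b.*?\bEI\b", re.DOTALL)

# The match begins at the BI inside the first string, so everything up to
# the real image's EI is removed, including the second text block.
cleaned = naive.sub("", stream)
print(repr(cleaned))
```

The second `BT ... ET` block disappears from `cleaned`, which is the "large chunk of the document" being ignored.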

@iGrog

Add the /s modifier so the `.` token matches newlines as well. Thanks to @iGrog for supplying another PDF that demonstrated this issue. Add the same modifier for dictionaries as well, fixing this oversight. Move the inline image replacement before string replacement, because parentheses in binary image data may otherwise be interpreted as the start of a string. Move the inline images test to its own function and add a newline to the sample data to test for the dotall modifier change.
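The effect of the dotall modifier can be sketched as follows. This is a Python illustration, not the PHP code from this PR: PCRE's `/s` modifier corresponds to `re.DOTALL` in Python, and the pattern and sample data are assumptions for the example.

```python
import re

# Hypothetical inline image whose dictionary and data both contain newlines.
sample = "BI /W 2 /H 2\n/BPC 8 /CS /G ID \x01\x02\n\x03\x04 EI"

# Illustrative pattern only; the regexp in PDFObject.php may differ.
pattern = r"\bBI\b.*?\bID\b.*?\bEI\b"

# Without dotall, "." stops at the first newline, so no match is found.
without_dotall = re.search(pattern, sample)

# With dotall, "." also matches newlines and the whole image is captured.
with_dotall = re.search(pattern, sample, re.DOTALL)

print(without_dotall)
print(with_dotall is not None)
```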

k00ni · 2024-03-25T15:45:25Z

I really appreciate you taking the time!

`BI` "commands" within strings should not be parsed as the beginning of inline image blocks. Detect if the `BI` we found is inside a (string) and if it is, note the offset and move past it for the next match.
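One way to picture the "skip `BI` inside a string and move the offset past it" idea is the sketch below. It is written in Python for illustration and is not the PHP implementation in this PR; the helper names `inside_string` and `find_bi_outside_strings` are hypothetical, and the parenthesis counting is deliberately simplified.

```python
import re

BI_TOKEN = re.compile(r"\bBI\b")


def inside_string(stream: str, offset: int) -> bool:
    """Return True if `offset` falls inside a (literal string).

    Simplified: counts unescaped parentheses from the start of the stream
    and assumes no stray parentheses appear outside of strings.
    """
    depth = 0
    i = 0
    while i < offset:
        ch = stream[i]
        if ch == "\\":
            i += 2  # skip the backslash and the character it escapes
            continue
        if ch == "(":
            depth += 1
        elif ch == ")" and depth > 0:
            depth -= 1
        i += 1
    return depth > 0


def find_bi_outside_strings(stream: str) -> list[int]:
    """Collect offsets of BI tokens that are not inside a string."""
    starts = []
    offset = 0
    while True:
        match = BI_TOKEN.search(stream, offset)
        if match is None:
            break
        if not inside_string(stream, match.start()):
            starts.append(match.start())
        # Always move past the current BI so the search makes progress.
        offset = match.end()
    return starts


stream = "BT (BI is text) Tj ET\nBI /W 1 /H 1 ID \x00 EI\n"
print(find_bi_outside_strings(stream))
```

Only the second `BI`, the one that opens the real inline image, is reported.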

src/Smalot/PdfParser/PDFObject.php

In the case where a valid inline image dictionary isn't found, or if the dictionary doesn't include the required parameters Height and Width, also bump the search offset forward by the current match position so we don't fall into a loop here.
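The dictionary check and the offset bump can be sketched like this. It is a Python illustration under stated assumptions, not the PHP code in this PR: the function names are hypothetical, the pattern is simplified, and accepting both the abbreviated keys (`W`, `H`) and the full names (`Width`, `Height`) is an assumption of the sketch.

```python
import re

INLINE_IMAGE = re.compile(
    r"\bBI\b(?P<dict>.*?)\bID\b(?P<data>.*?)\bEI\b", re.DOTALL
)


def has_dimensions(dictionary: str) -> bool:
    """Check that the candidate dictionary names both a width and a height.

    Simplified: collects every /Name token, including names used as values.
    """
    names = set(re.findall(r"/(\w+)", dictionary))
    return bool(names & {"W", "Width"}) and bool(names & {"H", "Height"})


def strip_inline_images(stream: str) -> str:
    """Remove BI ... ID ... EI blocks whose dictionary looks valid."""
    out = []
    offset = 0
    while True:
        match = INLINE_IMAGE.search(stream, offset)
        if match is None:
            out.append(stream[offset:])
            break
        if has_dimensions(match.group("dict")):
            # Valid inline image: keep the text before it, drop the image.
            out.append(stream[offset:match.start()])
            offset = match.end()
        else:
            # Not a valid image: keep the BI token and resume right after
            # it, so the same match is never retried (no endless loop).
            bi_end = match.start() + 2
            out.append(stream[offset:bi_end])
            offset = bi_end
    return "".join(out)


stream = "BI /Foo 1 ID junk EI q\nBI /W 1 /H 1 ID \x00 EI Q\n"
result = strip_inline_images(stream)
print(repr(result))
```

The first candidate has no width or height, so it is left in place; the second is removed.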

GreyWyvern · 2024-05-07T19:16:22Z

So, the last thing left here that the code wouldn't cover is a genuine inline image that doesn't have a proper image-properties dictionary with a width and height. The code in this PR skips over it, but such an inline image (probably very rare, if it happens at all) could contain binary content that causes errors in the way PdfParser interprets the document stream (unbalanced Q/q, etc.).

We can:

- Just accept it as is; the document with such an inline image is malformed anyway. There should be no expectation of error-free parsing in such a case.
- Not check for the height and width in the dictionary at all, and just accept all BI ... ID ... EI sequences outside of strings as "valid" inline images. This allows the possibility (minuscule?) of finding false-positive inline image sequences.

I've no data to back it up, but I believe the second case, where formatContent() finds a BI ... ID ... EI sequence outside of a string, but in error, is probably rarer than an inline image dictionary not containing a height and width. But then again I could be wrong!

Regardless, I would recommend keeping the dictionary check just in case. If it gets released and users find the array-access error again, then we can always remove it. In this case, this PR is ready to be taken out of draft status as-is.

k00ni

> Regardless, I would recommend keeping the dictionary check just in case. If it gets released and users find the array-access error again, then we can always remove it. In this case, this PR is ready to be taken out of draft status as-is.

In my opinion, if something is not according to the specification, you can go yolo and do whatever you want. As you described in great detail, the only thing we can do with ill-formed PDFs is to try to make the best of it. As a user/developer I surely appreciate it when software can handle ill-formed data to some extent. It keeps me sane. On the other hand, we are a community which maintains the library in our spare time, so there must be a balance.

That being said, your arguments make sense and I will follow your advice here @GreyWyvern. Please do the final preparations and mark the PR ready for review.

Below are just a few remarks/suggestions.

src/Smalot/PdfParser/PDFObject.php

Add "Step X:" to the comments to better define what the inline image replacement code is doing. Small adjustment to the balanced parentheses regexp to also exclude open parenthesis '(' from the matching. This will ensure replacing balanced parentheses from the innermost to the outermost.
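Why excluding `(` matters can be sketched as follows. This is a Python illustration, not the regexp from PDFObject.php: both patterns are assumptions reconstructed from the description above, the helper name is hypothetical, and escaped parentheses are ignored for brevity.

```python
import re

# Assumed "before" pattern: only ")" is excluded, so a match can start at
# an outer "(" and end at the first ")" it meets.
OUTER_FIRST = re.compile(r"\([^)]*\)")

# Assumed "after" pattern: "(" is excluded too, so only a pair with no
# nested parentheses (the innermost pair) can match.
INNERMOST = re.compile(r"\([^()]*\)")


def strip_balanced_parens(text: str) -> str:
    """Remove balanced (...) groups, innermost first, until none remain."""
    previous = None
    while previous != text:
        previous = text
        text = INNERMOST.sub("", text)
    return text


sample = "a (b (c) d) e"

# Matches "(b (c)" and leaves an unbalanced ")" behind.
broken = OUTER_FIRST.sub("", sample)

# Removes "(c)" first, then the now-simple outer pair.
clean = strip_balanced_parens(sample)

print(repr(broken))
print(repr(clean))
```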

k00ni · 2024-05-13T06:33:58Z

Thank you very much @GreyWyvern

GreyWyvern added 2 commits March 15, 2024 13:37

Account for inline image data in formatContent()

adb1194

Include BI command in the regexp

4ae52e7

k00ni added the fix label Mar 25, 2024

k00ni requested changes Mar 25, 2024

GreyWyvern marked this pull request as draft March 25, 2024 15:33

GreyWyvern and others added 2 commits March 25, 2024 16:10

More robust check for BI within strings

8d00508

Merge branch 'master' into inline-images

1d2b0ac

k00ni requested changes Apr 2, 2024

GreyWyvern mentioned this pull request Apr 22, 2024

Trying to access array offset on value of type null (PDFObject.php line 795) #691

Closed

k00ni reviewed May 10, 2024

GreyWyvern marked this pull request as ready for review May 10, 2024 15:16

k00ni merged commit a19d555 into smalot:master May 13, 2024
29 checks passed

GreyWyvern mentioned this pull request May 13, 2024

[]TJ command parsed improperly #710

Closed
