Fix returning empty text in some cases #666

xAzoom · 2024-01-12T11:11:13Z

Type of pull request

Bug fix (involves code and configuration changes)

About

In some cases, we may have received empty text, due to an invalid !\in_array($xobject->getUniqueId(), self::$recursionStack) comparison in PDFObject (e.g., in_array('0000000000000e110000000000000000', ['0000000000000e030000000000000000']) will return true). For more information, see https://www.php.net/manual/en/function.spl-object-hash.php notes. In this PR strict parameter to in_array has been added.

k00ni

Thank you @xAzoom for this PR!

Please provide at least one unit test to cover this change. Or is there already one?

README.md

xAzoom · 2024-01-14T16:22:42Z

Thank you @xAzoom for this PR!

Please provide at least one unit test to cover this change. Or is there already one?

Unfortunately, I can't mock the internal protected getUniqueId() function in PDFObject to ensure that the comparison is already correct. But I think it's most important that this change doesn't break any part of the library and passes all tests.

Is it possible to release v2.7.1 with this fix?

k00ni · 2024-01-19T14:02:40Z

Unfortunately, I can't mock the internal protected getUniqueId() function in PDFObject to ensure that the comparison is already correct. But I think it's most important that this change doesn't break any part of the library and passes all tests.

I also have no idea currently how to mock this code part. It might be possible, but is not worth my time.

Can you provide a PDF which triggered the problem before this fix? That would also fit.

Is it possible to release v2.7.1 with this fix?

It could be part of our next release (v2.8.1) when its ready. But I can give you no fixed date.

GreyWyvern · 2024-01-24T17:02:19Z

After examining this PR, my first thought was: "There's gotta be more than just two places in PdfParser that use in_array()!" So I checked, and it is in fact real. These are the only two places in the whole source that use that function. Amazing!

While it would be easy to create a test to directly compare strict vs. non-strict in_array(), the two locations changed are extracting id text from the document to use. So it would be much easier if we had a test document to use for the test suite.

If that's impossible, I can see if I can somehow create some fake document code that tests this, but I ain't looking forward to it!

k00ni · 2024-01-25T13:27:32Z

Please merge in master branch, to get rid of these coding style issues.

xAzoom · 2024-01-31T20:13:59Z

Unfortunately, I can't mock the internal protected getUniqueId() function in PDFObject to ensure that the comparison is already correct. But I think it's most important that this change doesn't break any part of the library and passes all tests.

I also have no idea currently how to mock this code part. It might be possible, but is not worth my time.

Can you provide a PDF which triggered the problem before this fix? That would also fit.

Is it possible to release v2.7.1 with this fix?

It could be part of our next release (v2.8.1) when its ready. But I can give you no fixed date.

Here I also have the bad news, it totally does not depend on the PDF document, but rather on the structure of the code. My quick fix in my project was to move one of the test methods up, and all tests passed (I didn't actually change any line of code, just reordered the methods). So I suppose spl_object_hash works in a deterministic way, but it will be difficult to simulate the creation of hashes like "0000000000000e030000000000000000000000000000".

k00ni · 2024-03-01T09:51:00Z

Sorry for the delay here. We recently fixed Scrutinizer which provides code coverage. Its report says that the lines in question are covered by

I think its reasonable to merge it without any tests.

k00ni requested changes Jan 12, 2024

View reviewed changes

README.md Outdated Show resolved Hide resolved

k00ni added fix parsing fail When (almost) nothing can be extracted from a given PDF labels Jan 12, 2024

xAzoom force-pushed the master branch from 47ed3b2 to bdc35f8 Compare January 14, 2024 16:13

Fix returning empty text in some cases

208196e

xAzoom force-pushed the master branch from bdc35f8 to 208196e Compare January 31, 2024 20:11

Merge branch 'master' into xAzoom_pr_666

dc85416

k00ni merged commit 6b53144 into smalot:master Mar 1, 2024
29 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix returning empty text in some cases #666

Fix returning empty text in some cases #666

xAzoom commented Jan 12, 2024

k00ni left a comment

xAzoom commented Jan 14, 2024

k00ni commented Jan 19, 2024 •

edited

Loading

GreyWyvern commented Jan 24, 2024

k00ni commented Jan 25, 2024

xAzoom commented Jan 31, 2024

k00ni commented Mar 1, 2024

Fix returning empty text in some cases #666

Fix returning empty text in some cases #666

Conversation

xAzoom commented Jan 12, 2024

Type of pull request

About

k00ni left a comment

Choose a reason for hiding this comment

xAzoom commented Jan 14, 2024

k00ni commented Jan 19, 2024 • edited Loading

GreyWyvern commented Jan 24, 2024

k00ni commented Jan 25, 2024

xAzoom commented Jan 31, 2024

k00ni commented Mar 1, 2024

k00ni commented Jan 19, 2024 •

edited

Loading