Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix returning empty text in some cases #666

Merged
merged 2 commits into from
Mar 1, 2024
Merged

Conversation

xAzoom
Copy link
Contributor

@xAzoom xAzoom commented Jan 12, 2024

Type of pull request

Bug fix (involves code and configuration changes)

About

In some cases, we may have received empty text, due to an invalid !\in_array($xobject->getUniqueId(), self::$recursionStack) comparison in PDFObject (e.g., in_array('0000000000000e110000000000000000', ['0000000000000e030000000000000000']) will return true). For more information, see https://www.php.net/manual/en/function.spl-object-hash.php notes. In this PR strict parameter to in_array has been added.

Copy link
Collaborator

@k00ni k00ni left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @xAzoom for this PR!

Please provide at least one unit test to cover this change. Or is there already one?

README.md Outdated Show resolved Hide resolved
@k00ni k00ni added fix parsing fail When (almost) nothing can be extracted from a given PDF labels Jan 12, 2024
@xAzoom
Copy link
Contributor Author

xAzoom commented Jan 14, 2024

Thank you @xAzoom for this PR!

Please provide at least one unit test to cover this change. Or is there already one?

Unfortunately, I can't mock the internal protected getUniqueId() function in PDFObject to ensure that the comparison is already correct. But I think it's most important that this change doesn't break any part of the library and passes all tests.

Is it possible to release v2.7.1 with this fix?

@k00ni
Copy link
Collaborator

k00ni commented Jan 19, 2024

Unfortunately, I can't mock the internal protected getUniqueId() function in PDFObject to ensure that the comparison is already correct. But I think it's most important that this change doesn't break any part of the library and passes all tests.

I also have no idea currently how to mock this code part. It might be possible, but is not worth my time.

Can you provide a PDF which triggered the problem before this fix? That would also fit.

Is it possible to release v2.7.1 with this fix?

It could be part of our next release (v2.8.1) when its ready. But I can give you no fixed date.

@GreyWyvern
Copy link
Contributor

After examining this PR, my first thought was: "There's gotta be more than just two places in PdfParser that use in_array()!" So I checked, and it is in fact real. These are the only two places in the whole source that use that function. Amazing!

While it would be easy to create a test to directly compare strict vs. non-strict in_array(), the two locations changed are extracting id text from the document to use. So it would be much easier if we had a test document to use for the test suite.

If that's impossible, I can see if I can somehow create some fake document code that tests this, but I ain't looking forward to it!

@k00ni
Copy link
Collaborator

k00ni commented Jan 25, 2024

Please merge in master branch, to get rid of these coding style issues.

@xAzoom
Copy link
Contributor Author

xAzoom commented Jan 31, 2024

Unfortunately, I can't mock the internal protected getUniqueId() function in PDFObject to ensure that the comparison is already correct. But I think it's most important that this change doesn't break any part of the library and passes all tests.

I also have no idea currently how to mock this code part. It might be possible, but is not worth my time.

Can you provide a PDF which triggered the problem before this fix? That would also fit.

Is it possible to release v2.7.1 with this fix?

It could be part of our next release (v2.8.1) when its ready. But I can give you no fixed date.

Here I also have the bad news, it totally does not depend on the PDF document, but rather on the structure of the code. My quick fix in my project was to move one of the test methods up, and all tests passed (I didn't actually change any line of code, just reordered the methods). So I suppose spl_object_hash works in a deterministic way, but it will be difficult to simulate the creation of hashes like "0000000000000e030000000000000000000000000000".

@k00ni
Copy link
Collaborator

k00ni commented Mar 1, 2024

Sorry for the delay here. We recently fixed Scrutinizer which provides code coverage. Its report says that the lines in question are covered by

I think its reasonable to merge it without any tests.

@k00ni k00ni merged commit 6b53144 into smalot:master Mar 1, 2024
29 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
fix parsing fail When (almost) nothing can be extracted from a given PDF
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants