Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add ability to ignore PDF encryption check #632

Closed
wants to merge 3 commits into from

Conversation

DivineOmega
Copy link

In some cases PDF files may be internally marked as encrypted even though the content is not encrypted and can be read.

This MR provides a config option to inform the PDF parser to ignore the encryption and attempt to read the PDF anyway.

This therefore provides a work around for the following issues:

Copy link
Collaborator

@k00ni k00ni left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @DivineOmega, it is a helpful addition!

I only have a few things:

@GreyWyvern
Copy link
Contributor

This is a good addition, but as the OP says, this is a workaround. Eventually in the future the simple check in Parser::parseContent() should be modified to check if the document actually cannot be read.

        if (isset($xref['trailer']['encrypt'])) {
            throw new \Exception('Secured pdf file are currently not supported.');
        }

It should be taken into account that a future fix for this would obsolete the use of the config option being added here. That's probably the only thing I don't like about this change.

@k00ni
Copy link
Collaborator

k00ni commented Sep 11, 2023

@DivineOmega Are you still with us here?

@k00ni k00ni closed this Sep 12, 2023
@DivineOmega
Copy link
Author

Hi. Sorry for the delayed response. Things have been busy recently.

I didn't end up actually using this functionality myself. I found that a majority of the PDFs I ignored the encryption check for would actually be parsed as containing no text or limited useful text. I'm not sure why this is and so my workaround here ended up not being useful for my use case.

This library still provides some of the best parsing I've found. My solution was to use an alternative parser if this one detected an encrypted PDF.

@unixnut
Copy link
Contributor

unixnut commented Nov 21, 2023

@k00ni Can you please reopen and merge this, as in some cases the PDFs are from a predictable origin and are readable but are marked as encrypted. I believe it is up to the caller to test that the data they get is valid.

I am willing to write the test (using test.pdf from #488) and the docs. But first I would need agreement that the merge would be done if those conditions are met.

Thanks.

@k00ni
Copy link
Collaborator

k00ni commented Nov 21, 2023

@unixnut Thank you for your interest. You have my full support. It would be great if we could agree on the following list:

  • Create a new PR, with a reference to this one. This way we have a clean start.
  • Add at least 2 (very simple) tests: one with new option active and one with not active
  • Please add a note to the function header of the new Config options. Something like "this is a workaround, don't rely on it, may change in the future, further information in the following PR XXX" (see comment Add ability to ignore PDF encryption check #632 (comment)) (I can do that, if you want)
  • A note in the documentation (I can do that, if you want)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants