Add ability to ignore PDF encryption check #632

DivineOmega · 2023-08-14T07:49:27Z

In some cases PDF files may be internally marked as encrypted even though the content is not encrypted and can be read.

This MR provides a config option that tells the PDF parser to ignore the encryption flag and attempt to read the PDF anyway.

It therefore provides a workaround for the following issues:

k00ni

Thank you @DivineOmega, it is a helpful addition!

I only have a few things:

- Please add a simple test or two, to prove it is working as intended and to avoid regressions later on.
- Would you mind adding a section to https://github.com/smalot/pdfparser/blob/master/doc/CustomConfig.md as well?

src/Smalot/PdfParser/Parser.php

GreyWyvern · 2023-08-21T16:15:52Z

This is a good addition, but as the OP says, this is a workaround. Eventually the simple check in Parser::parseContent() should be modified to check whether the document actually cannot be read.

        if (isset($xref['trailer']['encrypt'])) {
            throw new \Exception('Secured pdf file are currently not supported.');
        }

It should be taken into account that a future fix for this would obsolete the use of the config option being added here. That's probably the only thing I don't like about this change.
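To make the discussion concrete, here is a minimal sketch of how a config flag could gate the check quoted above. It is not the code from this PR: `SketchConfig`, `setIgnoreEncryption()`, `getIgnoreEncryption()` and `guardEncryption()` are hypothetical names chosen for illustration. Only the `$xref['trailer']['encrypt']` condition and the exception message are taken from the snippet above.

```php
<?php

// Hypothetical stand-in for the library's Config class (all names are assumptions).
final class SketchConfig
{
    // Defaults to false so the existing behaviour is unchanged unless opted in.
    private bool $ignoreEncryption = false;

    public function setIgnoreEncryption(bool $ignore): void
    {
        $this->ignoreEncryption = $ignore;
    }

    public function getIgnoreEncryption(): bool
    {
        return $this->ignoreEncryption;
    }
}

/**
 * Mirrors the check quoted above, but skips it when the flag is set.
 * $xref is assumed to have the array shape used in Parser::parseContent().
 */
function guardEncryption(array $xref, SketchConfig $config): void
{
    if (isset($xref['trailer']['encrypt']) && !$config->getIgnoreEncryption()) {
        throw new \Exception('Secured pdf file are currently not supported.');
    }
}
```

With the flag left at its default, a trailer containing an `encrypt` entry still throws; with the flag set to `true`, the guard returns and parsing would continue. If the check is later changed to detect whether the content is really unreadable, a flag like this would no longer be needed.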

k00ni · 2023-09-11T06:02:06Z

@DivineOmega Are you still with us here?

DivineOmega · 2023-09-12T06:52:45Z

Hi. Sorry for the delayed response. Things have been busy recently.

I didn't end up using this functionality myself. I found that the majority of the PDFs for which I ignored the encryption check were parsed as containing no text or very little useful text. I'm not sure why this is, so my workaround here ended up not being useful for my use case.

This library still provides some of the best parsing I've found. My solution was to use an alternative parser if this one detected an encrypted PDF.
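The fallback approach described here can be sketched roughly as follows. This is an illustration under assumptions, not the code actually used: `extractTextWithFallback()` and both callables are hypothetical. In practice the primary callable might wrap something like `(new \Smalot\PdfParser\Parser())->parseFile($path)->getText()`, and the fallback would wrap whichever alternative parser is available.

```php
<?php

/**
 * Try the primary extractor first and fall back to an alternative
 * when it throws (for example on the "Secured pdf file" exception).
 *
 * Both callables are assumed to take a file path and return the text.
 */
function extractTextWithFallback(string $path, callable $primary, callable $fallback): string
{
    try {
        return $primary($path);
    } catch (\Exception $e) {
        // The primary parser refused or failed; hand the file to the alternative.
        return $fallback($path);
    }
}
```

Catching every `\Exception` is deliberately broad for this sketch; a real caller may prefer to inspect the message and only fall back for the encryption case.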

unixnut · 2023-11-21T08:40:10Z

@k00ni Can you please reopen and merge this? In some cases the PDFs come from a predictable origin and are readable but are marked as encrypted. I believe it is up to the caller to test that the data they get is valid.

I am willing to write the test (using test.pdf from #488) and the docs. But first I would need agreement that the merge would be done if those conditions are met.

Thanks.

k00ni · 2023-11-21T10:05:26Z

@unixnut Thank you for your interest. You have my full support. It would be great if we could agree on the following list:

- Create a new PR with a reference to this one. This way we have a clean start.
- Add at least 2 (very simple) tests: one with the new option active and one with it inactive.
- Please add a note to the function header of the new Config options. Something like "this is a workaround, don't rely on it, may change in the future, further information in the following PR XXX" (see the earlier comment in #632). I can do that, if you want.
- Add a note to the documentation. I can do that, if you want.

Jordan Hall added 2 commits August 14, 2023 08:44

Add ability to ingore PDF encryption check

b3a1446

Switch to ! syntax

02c33bc

k00ni requested changes Aug 14, 2023

k00ni reviewed Aug 14, 2023

src/Smalot/PdfParser/Parser.php (outdated)

Update src/Smalot/PdfParser/Parser.php

44916ca

k00ni added the enhancement label Aug 21, 2023

k00ni closed this Sep 12, 2023
