Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DSS-3207: added memory setting usage parameter @ dss-pades-pdfbox #177

Draft
wants to merge 3 commits into
base: develop
Choose a base branch
from

Conversation

roy20021
Copy link

Follow-up of #176

@roy20021 roy20021 marked this pull request as draft September 19, 2024 13:37
@roy20021
Copy link
Author

roy20021 commented Sep 24, 2024

@bsanchezb

Better to avoid changing the constructors of PdfBoxSignatureService, or other interfaces. It may have a big impact on the implementors. Instead, it would be better to configure the PDFSignatureService with a setter, as done in AbstractPdfObjFactory#configure.

Done. Actually, I have changed some PDFBoxUtil's methods' interface.

As you mentioned, support of similar setting in OpenPdf implementation would be highly desirable. Could you please evaluate the possibility to implement it? The MemoryUsageSetting from PdfBox looks to be easy to configure, so maybe we could introduce our own DSS wrapper for memory settings to be shared between two implementations, and to be transformed to implementation specific class when needed?

Done, however the mapping cannot be 1-to-1 due of different level of underlying support

For the formatting, try to simply follow our style. Please add JavaDocs for any protected or public methods and classes, if possible. Simply add a small description on what method does and comments for input and output objects. See other classes as an example.

I tried.. I propose (and I can do in another ad-hoc PR) to configure spotless to also format the code

For the testing you may see what we do in unit test PAdESLevelBSignWithTempFileHandlerTest.java, retrieving the memory usage in runtime (not ideal, but works well for big memory differences).

Done but I am not satisfied of the results. Often the memory consumption isn't reliable and test may fail (I commented the asserts).

Anyway the activity is fine, for example If I run test with small max heap value the memory only setting test fails instead the file one succeed.
Moreover, I think that the unit test should be about using properly the underlying libraries.. not checking their behaviour so unit testing also them.
Can you have a look here?

P.S. I run the spotless maven goal and it come out many files were without the license header.

@roy20021
Copy link
Author

@bsanchezb can you have a look, please?

Maybe there are too changes.. do I need to remove the license commit?

@bsanchezb
Copy link
Contributor

Dear Andrea,

Thank you for the PR, it looks very good.

I have added a few comments directly in the code, could you please take a look and address the issues?

I would also highly appreciate, if you could add a documentation about the near future, within the cookbook, just after the DSSResourcesHandler chapter.

KR,
Aleksandr

@roy20021
Copy link
Author

Dear Andrea,

Thank you for the PR, it looks very good.

I have added a few comments directly in the code, could you please take a look and address the issues?

I would also highly appreciate, if you could add a documentation about the near future, within the cookbook, just after the DSSResourcesHandler chapter.

KR, Aleksandr

@bsanchezb I do not see any comments, have you submitted a review?

Copy link
Contributor

@bsanchezb bsanchezb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please start from using PdfMemoryUsageSetting directly in PdfBoxDocumentReader/ITextDocumentReader constructors instead of implementation-specific classes. This should simplify the code and leave conversion to the very end.

FileDocument fileDocument = (FileDocument) dssDocument;
this.pdDocument = PDDocument.load(fileDocument.getFile(), passwordProtection, memoryUsageSetting);
} else if (dssDocument instanceof InMemoryDocument) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do you need a special handling for InMemoryDocument? Is generic dssDocument.openStream() not sufficient?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The PDDocument.load(stream, ..) needs twice the memory so if it is already in memory I skip such loading making it more efficient.
You can check the code of PdfBox at org.apache.pdfbox.io.ScratchFile.createBuffer(InputStream input)

if (ITextPdfMemoryUsageSetting.Mode.FULL.equals(memoryUsageSetting.getMode())) {
this.pdfReader = new PdfReader(filenameSource, passwordProtection);
} else {
this.pdfReader = new PdfReader(new RandomAccessFileOrArray(filenameSource), passwordProtection);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it not the same, as new PdfReader(filenameSource, passwordProtection) ?

Should not be

new PdfReader(new RandomAccessFileOrArray(filenameSource, true, false), passwordProtection);

used instead?

I would even go further, and simplify the code to:

if (dssDocument instanceof FileDocument && ITextPdfMemoryUsageSetting.Mode.FULL.equals(memoryUsageSetting.getMode())) {
  FileDocument fileDocument = (FileDocument) dssDocument;
  String filenameSource = fileDocument.getFile().getAbsolutePath();
  this.pdfReader = new PdfReader(filenameSource, passwordProtection);
} else {
  try (InputStream is = dssDocument.openStream()) {
	  this.pdfReader = new PdfReader(is, passwordProtection);
  }
}

considering the implementation of RandomAccessFileOrArray class.

I also find ITextPdfMemoryUsageSetting modes FULL and PARTIAL confusing. I would prefer FILE and MEMORY to be used instead,

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I reply starting from the last sentences.

There are 2 different but very related topics: loading mode (full or partial) and memory setting (file, memory).
Clearly, they are related because if the loading mode is full than all pdf has to be loaded in memory.

At best of my knowledge, each library offers different level of configuration among such topics.
For OpenPdf, as a conversational reference, take a look at: https://stackoverflow.com/questions/62164906/use-of-randomaccessfileorarray
Check by yourself the code of the PdfReader constructors regards the last loading line: readPdfPartial(); for RandomAccessFileOrArray or readPdf(); for the other ones.

PdfBox models it differently and inherently uses partial mode when file usage is requested.


Regards new RandomAccessFileOrArray(filenameSource, true, false) won't be okay because will load all pdf into memory so as a full load (instead of partial load).
Anyway, thanks to your suggestion, I notice the com.lowagie.text.Document.plainRandomAccess attribute and the same semantic into the RandomAccessFileOrArray constructor.
In case of FileDocument AND partial loading I can set that flag to true so the content will be loaded without mapping memory.

I see here an issue of expressivity of the common model and the implementation specific ones. It might be desirable to config whatever combination but I guess that wrapping it is not a solution so just 'reasonable' combination can be provided.

* Load options
*/
public enum Mode {
FULL, PARTIAL
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please change to FILE and MEMORY modes. I do not see how "PARTIAL" is possible with RandomAccessFileOrArray implementation. Do I miss something?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have commented elsewhere regards FULL/PARTIAL and FILE/MEMORY topics. Hope that clarify.

* MemoryUsageSetting's load options
*/
public enum Mode {
MEMORY, FILE, MIXED;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add javadocs for different mode enums.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As per other comments, the mapping is pretty implementation biased. Shall I write that here?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this file essential for the unit test provided? Not possible to achieve the same result with "big_file.pdf" document? I would prefer to avoid committing this big document to the repository, if possible.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done but I am not satisfied of the results. Often the memory consumption isn't reliable and test may fail (I commented the asserts).

Anyway the activity is fine, for example If I run test with small max heap value the memory only setting test fails instead the file one succeed.
Moreover, I think that the unit test should be about using properly the underlying libraries.. not checking their behaviour so unit testing also them.
Can you have a look here?

Test will often fail with the "big_file.pdf". As per my previous comment, I have left this PR in draft mode due of this open topic.

I think it is the wrong approach to verify the memory consumption. I propose to check that the PDF libs are called as supposed to be and not "unit testing" also their behaviour.

Please advice.

* @throws IOException if an exception occurs
* @throws eu.europa.esig.dss.pades.exception.InvalidPasswordException if the password is not provided or
* invalid for a protected document
*/
public PdfBoxDocumentReader(DSSDocument dssDocument, String passwordProtection, MemoryUsageSetting memoryUsageSetting)
public PdfBoxDocumentReader(DSSDocument dssDocument, String passwordProtection,
MemoryUsageSetting memoryUsageSetting)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about propagating generic PdfMemoryUsageSetting till PdfBoxDocumentReader/ITextDocumentReader constructors, instead of using implementation specific classes? This would help to avoid calling PdfBoxUtils.getMemoryUsageSetting on every call of PdfBoxDocumentReader constructor. Additionally, this may help to avoid creation of OpenPdf specific class ITextPdfMemoryUsageSetting.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay to update PdfBoxDocumentReader/ITextDocumentReader constructors to take PdfMemoryUsageSetting instead of specific implementation class.

Instead, it will not avoid the creation of the specific OpenPdf one.

* @return {@link DSSDocument} PNG screenshot
*/
public static DSSDocument generateScreenshot(DSSDocument pdfDocument, int page) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please keep default simplified methods within PdfBoxUtils class, with an in-memory configuration of PdfMemoryUsageSetting loaded by default.

@@ -60,42 +63,42 @@ void init() {

@Test
void generateScreenshotTest() {
DSSDocument screenshot = PdfBoxUtils.generateScreenshot(sampleDocument, 1);
DSSDocument screenshot = PdfBoxUtils.generateScreenshot(sampleDocument, 1, MemoryUsageSetting.setupMainMemoryOnly());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if you keep old PdfBoxUtils methods, no need to modify the corresponding unit tests

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do you want I create both version of each method?
I won't keep only the old PdfBoxUtils methods since that will always load into memory and it is not the deal. Am I correct?

import java.util.List;
import java.util.stream.Collectors;

import org.apache.commons.lang3.tuple.Pair;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please avoid apache lang3 lib in the unit tests. It should be possible to count memory consumption for you use case "on the fly" within #doAllSigns method. That is the only used value, if I'm not mistaken.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is commented code that I left waiting your advice that's because...

Please have a look at the other comment about 'unit testing' the pdf libraries.

@bsanchezb
Copy link
Contributor

bsanchezb commented Oct 15, 2024

Hello @roy20021,

Thank you for the comment. I did not know it was not visible to you. Now you should have the review available.
Should you have any questions, please do not hesitate to ask.

KR,
Aleksandr

@roy20021
Copy link
Author

@bsanchezb I have commented your observations, can you see them?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants