Very high memory allocation / consumption #330
-
Hey folks, I have a question about a particular PDF that has very high memory consumption when trying to extract its text. I nailed down the issue to:
My idea now was to iterate over the PDF's pages, count the page object types, and, if there are more FPDF_PAGEOBJ_FORM objects than a certain threshold, remove them all from the respective page before trying to extract the text. However, the memory profile looks as follows:

Line #    Mem usage     Increment  Occurrences   Line Contents
=============================================================
   176    117.7 MiB     117.7 MiB            1   @profile
   177                                           def _set_text(self):
   181    117.7 MiB       0.0 MiB            1       text = ""
   182    527.8 MiB       0.0 MiB            7       for page_nr in range(len(self.pdf)):
   183    685.8 MiB     567.3 MiB            6           page = self.pdf[page_nr]
   184                                                   # Count FPDF_PAGEOBJ_* objects in the page
   185    685.8 MiB    -316.0 MiB            6           counter = [0, 0, 0, 0, 0]
   186
   187    686.2 MiB  -60972.1 MiB         6577           for objects in page.get_objects():
   188    686.2 MiB  -60656.7 MiB         6572               counter[objects.type - 1] += 1
   189    686.2 MiB  -60656.7 MiB         6572               if counter[4] > 5000:
   190    686.2 MiB       0.0 MiB            1                   break
   191
   192    686.2 MiB    -316.7 MiB            6           print(counter, page_nr)
   193
   194                                                   # Remove Form objects if more than 5000 form object on page
   197    686.2 MiB    -316.7 MiB            6           tracemalloc.start()
   198    686.2 MiB    -316.7 MiB            6           if counter[4] > 5000:
   199    692.9 MiB       6.7 MiB            1               bitmap = page.render(scale=2)
   200    700.4 MiB       7.5 MiB            1               pil_image = bitmap.to_pil()
   201    709.5 MiB       9.1 MiB            1               pil_image.show()
   202    709.5 MiB       0.0 MiB            1               i = 0
   203    718.4 MiB -1534787.7 MiB       108919               for objects in page.get_objects():
   204    718.4 MiB -1534749.3 MiB       108918                   i += 1
   205    718.4 MiB -1534749.6 MiB       108918                   if i % 10000 == 0:
   206    718.3 MiB    -121.1 MiB           10                       print(i)
   207    718.4 MiB -1534749.3 MiB       108918                   if objects.type == 5:
   208    718.4 MiB -1520109.9 MiB       108002                       try:
   209    718.4 MiB -1520142.1 MiB       108002                           page.remove_obj(objects)
   210    718.4 MiB -1520147.8 MiB       108000                       except (PdfiumError, ValueError):
   211                                                                    # ValueError: Page object is not part of the page
   212                                                                    # PdfiumError: Failed to remove page object.
   213                                                                    # Seems like one could ignore that?
   214    718.4 MiB -1520148.2 MiB       108000                           pass
   215    679.2 MiB     -39.2 MiB            1               bitmap = page.render(scale=2)
   216    683.9 MiB       4.7 MiB            1               pil_image = bitmap.to_pil()
   217    683.9 MiB      -0.0 MiB            1               pil_image.show()
   218    683.9 MiB    -316.8 MiB            6           current, peak = tracemalloc.get_traced_memory()
   219    683.9 MiB    -312.1 MiB            6           print(f"Current memory usage: {current / 1024 / 1024:.2f} MB")
   220    683.9 MiB    -312.1 MiB            6           print(f"Peak memory usage: {peak / 1024 / 1024:.2f} MB")
   221    643.3 MiB    -352.8 MiB            6           tracemalloc.stop()
   222
   223    643.3 MiB    -230.2 MiB            6           page_text = page.get_textpage().get_text_bounded()
   224    527.8 MiB    -346.3 MiB            6           page.close()
   225    527.8 MiB       0.0 MiB            6           text += "\r\n" + page_text
   226    527.8 MiB       0.0 MiB            1       self.text = text

So, it seems like even just loading the page object already consumes quite a lot of memory. Any idea how to solve this issue? Maybe I'm completely on the wrong track? Thanks a lot!
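For reference, here is the idea condensed into a self-contained sketch. This is just an illustration of the approach, not the exact code from above: the input path and the 5000 threshold are placeholders, and FPDF_PAGEOBJ_FORM is taken from pypdfium2's raw bindings.

import pypdfium2 as pdfium
import pypdfium2.raw as pdfium_c

FORM_THRESHOLD = 5000  # arbitrary cut-off, same as in the profile above

def extract_text(path):
    pdf = pdfium.PdfDocument(path)
    text = ""
    for page_nr in range(len(pdf)):
        page = pdf[page_nr]

        # count form objects on the page
        n_forms = sum(
            1 for obj in page.get_objects()
            if obj.type == pdfium_c.FPDF_PAGEOBJ_FORM
        )

        if n_forms > FORM_THRESHOLD:
            # materialize the generator before mutating the page
            for obj in list(page.get_objects()):
                if obj.type == pdfium_c.FPDF_PAGEOBJ_FORM:
                    try:
                        page.remove_obj(obj)
                    except (pdfium.PdfiumError, ValueError):
                        pass  # "not part of the page" / "failed to remove"

        text += "\r\n" + page.get_textpage().get_text_bounded()
        page.close()
    return text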
-
If someone wants to try it themselves, the PDF I'm speaking about is this one; page 4 in particular is the issue.
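A minimal reproduction could look roughly like this (a sketch only: it assumes a Unix-like system for resource.getrusage, and "problem.pdf" stands in for the linked file; since the peak happens in pdfium's native code, process-level peak RSS is more telling than Python-level tracing):

import resource
import pypdfium2 as pdfium

def peak_rss_mib():
    # ru_maxrss is reported in KiB on Linux (bytes on macOS)
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024

pdf = pdfium.PdfDocument("problem.pdf")  # placeholder for the PDF above
print(f"after opening the document: ~{peak_rss_mib():.1f} MiB peak RSS")

page = pdf[3]  # page 4, zero-based
print(f"after loading page 4:       ~{peak_rss_mib():.1f} MiB peak RSS")

text = page.get_textpage().get_text_bounded()
print(f"after text extraction:      ~{peak_rss_mib():.1f} MiB peak RSS ({len(text)} chars)")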
-
Thanks for the report.

I can confirm there's a substantial peak in RAM usage when running with either --strategy range or --strategy bounded, while there is no peak with --pages 1-3. However, this isn't a bindings issue. Please re-submit to pdfium instead.

Also, I think you are indeed on the wrong track with your "workaround". I would refrain from messing with deletion of some objects on certain conditions, as the peak occurs mainly when loading the page, whereas loading the textpage or extracting text does not cause further peaks (you can try this in a python console).

Another observation is that pdfium isn't the only PDF engine that appears to have problems with this file. pdfjs and poppler, too, seem to have some trouble on page 4.

What is a bit concerning, page.close() (or other …
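For the "try this in a python console" part, a rough way to watch where the jump happens could look like this (assuming psutil is installed; "problem.pdf" again stands in for the file in question):

>>> import psutil, pypdfium2 as pdfium
>>> rss_mib = lambda: psutil.Process().memory_info().rss / 2**20
>>> pdf = pdfium.PdfDocument("problem.pdf")
>>> rss_mib()                            # baseline
>>> page = pdf[3]                        # loading page 4: the peak shows up here
>>> rss_mib()
>>> textpage = page.get_textpage()       # no comparable jump expected here ...
>>> text = textpage.get_text_bounded()   # ... nor here
>>> rss_mib()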
-
Ah, thanks for your fast response! Since I have no idea how quickly this issue might be resolved: do you have any idea how to tell whether it is "safe" to load a page before the memory peak happens? And can you point me towards where I should report this issue? I'm not completely sure where pdfium is "managed".
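One thing I might try in the meantime (not something suggested in this thread, just a defensive pattern): probe the page in a child process whose address space is capped, so a pathological page fails in isolation instead of exhausting the main process. This sketch assumes a Unix-like OS (resource.setrlimit) and again uses a placeholder path:

import multiprocessing
import resource
import pypdfium2 as pdfium

MEM_LIMIT = 2 * 1024**3  # 2 GiB cap for the probe process (tune as needed)

def _probe(path, index):
    resource.setrlimit(resource.RLIMIT_AS, (MEM_LIMIT, MEM_LIMIT))
    pdf = pdfium.PdfDocument(path)
    page = pdf[index]  # this is where the peak would hit
    page.get_textpage().get_text_bounded()

def page_seems_safe(path, index):
    proc = multiprocessing.Process(target=_probe, args=(path, index))
    proc.start()
    proc.join()
    # a crashed or errored probe (non-zero exit code) means the cap was exceeded
    return proc.exitcode == 0

if __name__ == "__main__":
    print(page_seems_safe("problem.pdf", 3))  # probe page 4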