Very high memory allocation / consumption #330
-
Hey folks, I have a question about a particular PDF that has very high memory consumption when trying to extract its text. I nailed down the issue to:
My idea now was to iterate over the PDF's pages, count the page object types, and, if there are more FPDF_PAGEOBJ_FORM objects than a certain threshold, remove them all from the respective page before trying to extract the text. However, the memory profile looks as follows:

Line #    Mem usage     Increment  Occurrences   Line Contents
=============================================================
   176    117.7 MiB     117.7 MiB            1   @profile
   177                                           def _set_text(self):
   181    117.7 MiB       0.0 MiB            1       text = ""
   182    527.8 MiB       0.0 MiB            7       for page_nr in range(len(self.pdf)):
   183    685.8 MiB     567.3 MiB            6           page = self.pdf[page_nr]
   184                                                   # Count FPDF_PAGEOBJ_* objects in the page
   185    685.8 MiB    -316.0 MiB            6           counter = [0, 0, 0, 0, 0]
   186
   187    686.2 MiB  -60972.1 MiB         6577           for objects in page.get_objects():
   188    686.2 MiB  -60656.7 MiB         6572               counter[objects.type - 1] += 1
   189    686.2 MiB  -60656.7 MiB         6572               if counter[4] > 5000:
   190    686.2 MiB       0.0 MiB            1                   break
   191
   192    686.2 MiB    -316.7 MiB            6           print(counter, page_nr)
   193
   194                                                   # Remove Form objects if more than 5000 form object on page
   197    686.2 MiB    -316.7 MiB            6           tracemalloc.start()
   198    686.2 MiB    -316.7 MiB            6           if counter[4] > 5000:
   199    692.9 MiB       6.7 MiB            1               bitmap = page.render(scale=2)
   200    700.4 MiB       7.5 MiB            1               pil_image = bitmap.to_pil()
   201    709.5 MiB       9.1 MiB            1               pil_image.show()
   202    709.5 MiB       0.0 MiB            1               i = 0
   203    718.4 MiB -1534787.7 MiB       108919               for objects in page.get_objects():
   204    718.4 MiB -1534749.3 MiB       108918                   i += 1
   205    718.4 MiB -1534749.6 MiB       108918                   if i % 10000 == 0:
   206    718.3 MiB    -121.1 MiB           10                       print(i)
   207    718.4 MiB -1534749.3 MiB       108918                   if objects.type == 5:
   208    718.4 MiB -1520109.9 MiB       108002                       try:
   209    718.4 MiB -1520142.1 MiB       108002                           page.remove_obj(objects)
   210    718.4 MiB -1520147.8 MiB       108000                       except (PdfiumError, ValueError):
   211                                                                    # ValueError: Page object is not part of the page
   212                                                                    # PdfiumError: Failed to remove page object.
   213                                                                    # Seems like one could ignore that?
   214    718.4 MiB -1520148.2 MiB       108000                           pass
   215    679.2 MiB     -39.2 MiB            1               bitmap = page.render(scale=2)
   216    683.9 MiB       4.7 MiB            1               pil_image = bitmap.to_pil()
   217    683.9 MiB      -0.0 MiB            1               pil_image.show()
   218    683.9 MiB    -316.8 MiB            6           current, peak = tracemalloc.get_traced_memory()
   219    683.9 MiB    -312.1 MiB            6           print(f"Current memory usage: {current / 1024 / 1024:.2f} MB")
   220    683.9 MiB    -312.1 MiB            6           print(f"Peak memory usage: {peak / 1024 / 1024:.2f} MB")
   221    643.3 MiB    -352.8 MiB            6           tracemalloc.stop()
   222
   223    643.3 MiB    -230.2 MiB            6           page_text = page.get_textpage().get_text_bounded()
   224    527.8 MiB    -346.3 MiB            6           page.close()
   225    527.8 MiB       0.0 MiB            6           text += "\r\n" + page_text
   226    527.8 MiB       0.0 MiB            1       self.text = text

So, it seems like even just loading the page object already consumes quite a lot of memory. Any idea how to solve this issue? Maybe I'm completely on the wrong track? Thanks a lot!
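For reference, here is the idea condensed into a self-contained sketch. This is just an illustration of the approach, not the exact code from above: the input path and the 5000 threshold are placeholders, and FPDF_PAGEOBJ_FORM is taken from pypdfium2's raw bindings.

import pypdfium2 as pdfium
import pypdfium2.raw as pdfium_c

FORM_THRESHOLD = 5000  # arbitrary cut-off, same as in the profile above

def extract_text(path):
    pdf = pdfium.PdfDocument(path)
    text = ""
    for page_nr in range(len(pdf)):
        page = pdf[page_nr]

        # count form objects on the page
        n_forms = sum(
            1 for obj in page.get_objects()
            if obj.type == pdfium_c.FPDF_PAGEOBJ_FORM
        )

        if n_forms > FORM_THRESHOLD:
            # materialize the generator before mutating the page
            for obj in list(page.get_objects()):
                if obj.type == pdfium_c.FPDF_PAGEOBJ_FORM:
                    try:
                        page.remove_obj(obj)
                    except (pdfium.PdfiumError, ValueError):
                        pass  # "not part of the page" / "failed to remove"

        text += "\r\n" + page.get_textpage().get_text_bounded()
        page.close()
    return text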
-
If someone wants to try it themselves, the PDF I'm speaking about is this one; page 4 in particular is the issue.
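A minimal reproduction could look roughly like this (a sketch only: it assumes a Unix-like system for resource.getrusage, and "problem.pdf" stands in for the linked file; since the peak happens in pdfium's native code, process-level peak RSS is more telling than Python-level tracing):

import resource
import pypdfium2 as pdfium

def peak_rss_mib():
    # ru_maxrss is reported in KiB on Linux (bytes on macOS)
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024

pdf = pdfium.PdfDocument("problem.pdf")  # placeholder for the PDF above
print(f"after opening the document: ~{peak_rss_mib():.1f} MiB peak RSS")

page = pdf[3]  # page 4, zero-based
print(f"after loading page 4:       ~{peak_rss_mib():.1f} MiB peak RSS")

text = page.get_textpage().get_text_bounded()
print(f"after text extraction:      ~{peak_rss_mib():.1f} MiB peak RSS ({len(text)} chars)")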
-
Thanks for the report.

I can confirm there's a substantial peak in RAM usage when running with either --strategy range or --strategy bounded, while there is no peak with --pages 1-3. However, this isn't a bindings issue. Please re-submit to pdfium instead.

Also, I think you are indeed on the wrong track with your "workaround". I would refrain from messing with deletion of some objects on certain conditions, as the peak occurs mainly when loading the page, whereas loading the textpage or extracting text does not cause further peaks (you can try this in a python console).

Another observation is that pdfium isn't the only PDF engine that appears to have problems with this file. pdfjs and poppler, too, seem to have some trouble on page 4.

What is a bit concerning, page.close() (or other …
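For the "try this in a python console" part, a rough way to watch where the jump happens could look like this (assuming psutil is installed; "problem.pdf" again stands in for the file in question):

>>> import psutil, pypdfium2 as pdfium
>>> rss_mib = lambda: psutil.Process().memory_info().rss / 2**20
>>> pdf = pdfium.PdfDocument("problem.pdf")
>>> rss_mib()                            # baseline
>>> page = pdf[3]                        # loading page 4: the peak shows up here
>>> rss_mib()
>>> textpage = page.get_textpage()       # no comparable jump expected here ...
>>> text = textpage.get_text_bounded()   # ... nor here
>>> rss_mib()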
-
Ah, thanks for your fast response! Since I have no idea how quickly this issue might be resolved: do you have any idea how to tell whether it is "safe" to load a page before the memory peak happens? And can you point me towards where I should report this issue? I'm not completely sure where pdfium is "managed".
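One thing I might try in the meantime (not something suggested in this thread, just a defensive pattern): probe the page in a child process whose address space is capped, so a pathological page fails in isolation instead of exhausting the main process. This sketch assumes a Unix-like OS (resource.setrlimit) and again uses a placeholder path:

import multiprocessing
import resource
import pypdfium2 as pdfium

MEM_LIMIT = 2 * 1024**3  # 2 GiB cap for the probe process (tune as needed)

def _probe(path, index):
    resource.setrlimit(resource.RLIMIT_AS, (MEM_LIMIT, MEM_LIMIT))
    pdf = pdfium.PdfDocument(path)
    page = pdf[index]  # this is where the peak would hit
    page.get_textpage().get_text_bounded()

def page_seems_safe(path, index):
    proc = multiprocessing.Process(target=_probe, args=(path, index))
    proc.start()
    proc.join()
    # a crashed or errored probe (non-zero exit code) means the cap was exceeded
    return proc.exitcode == 0

if __name__ == "__main__":
    print(page_seems_safe("problem.pdf", 3))  # probe page 4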