Text visitor example in docs does not work #2881
Comments
Thanks for the report. This is not specific to version 5.0; it apparently was already broken in version 3. Using
Great, thanks for looking into it. Can you briefly explain how these matrices should be used? I've read the docs, but I am honestly still a bit confused. The docs here say "It is recommended to use the user_matrix as it takes into all transformations." (user_matrix seems to also be called cm). Then a bit later it says:
And then here you are suggesting that we should actually be using the tm matrix and not the cm matrix at all? My goal is to extract text from a PDF and know the bounding box coordinates in PDF user space. For example, pymupdf has a get text blocks method that returns bbox coordinates. What would be the equivalent in pypdf?
In this specific case (for the PDF given), AFAIK there is no way to get the bounding boxes at the moment, just the "reference position" from the visitors. To get full bounding boxes, you would have to further work with the font properties.
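For readers following the matrix discussion, here is a minimal sketch of how a "reference position" could be derived inside a text visitor. It assumes both matrices use the PDF six-number layout `[a, b, c, d, e, f]` and that the text-space origin is mapped through `tm` first and `cm` second. The helper names (`text_origin_in_user_space`, `make_visitor`, `collect_positions`) are hypothetical and not part of pypdf; only `PdfReader`, `extract_text(visitor_text=...)` and the five-argument visitor signature come from the library. The thread does not confirm that this combined position is correct for every pypdf version, and it yields a single point, not a bounding box.

```python
from typing import List, Sequence, Tuple

Matrix = Sequence[float]  # PDF matrix in the form [a, b, c, d, e, f]


def text_origin_in_user_space(tm: Matrix, cm: Matrix) -> Tuple[float, float]:
    """Map the text-space origin through tm, then through cm (assumed order)."""
    e, f = tm[4], tm[5]            # translation part of the text matrix
    a, b, c, d, tx, ty = cm[:6]    # current transformation matrix
    return (e * a + f * c + tx, e * b + f * d + ty)


def make_visitor(out: List[Tuple[float, float, str]]):
    """Build a visitor that records (x, y, text) for non-blank chunks."""
    def visitor(text, cm, tm, font_dict, font_size):
        if text.strip():
            x, y = text_origin_in_user_space(tm, cm)
            out.append((x, y, text))
    return visitor


def collect_positions(pdf_path: str, page_index: int = 0):
    """Run the visitor over one page; pypdf is imported lazily here."""
    from pypdf import PdfReader

    found: List[Tuple[float, float, str]] = []
    page = PdfReader(pdf_path).pages[page_index]
    page.extract_text(visitor_text=make_visitor(found))
    return found
```

Calling `collect_positions("GeoBase_NHNC1_Data_Model_UML_EN.pdf")` and printing the tuples makes it easier to compare the combined position against raw `cm[5]` or `tm[5]` before choosing thresholds for a header or footer filter.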
Got it, sounds like this library is not a good fit for my use case, and pdfminer might be better. Just for my knowledge, where would you say pypdf excels compared with other open source Python PDF libraries? What is the intended use case?
I consider pypdf the liberally licensed PDF library written in pure Python for reading, modifying and writing PDF files. This includes handling metadata, doing (basic) text extraction, extracting images, filling forms, adding watermarks and backgrounds to pages, removing or adding pages (including merging), transforming pages, ... Depending on your use case, other libraries might be a better fit at the moment, which I am not going to deny. Working with signed PDF files as in pyhanko, extracting character-level data as in pdfminer.six or the MuPDF CLI, and rendering pages to images as with poppler are not supported; these are common cases where I rely on other tools as well.
I am trying to figure out how to extract text based on line coordinates, using the example from https://github.com/py-pdf/pypdf/blob/main/docs/user/extract-text.md#example-1-ignore-header-and-footer with the example document. However, that does not seem to work. The y coordinates passed to the visitor don't seem correct at all, or at least I don't understand what they mean. Is the example provided no longer how the code works, or is something broken? The actual extracted text looks correct to me, but the visitor output does not.
Environment
Which environment were you using when you encountered the problem?
Code + PDF
The PDF used is the one in the example, https://github.com/py-pdf/pypdf/blob/main/resources/GeoBase_NHNC1_Data_Model_UML_EN.pdf