Text visitor example in docs does not work #2881

Open
lucasgadams opened this issue Sep 27, 2024 · 5 comments
Labels
nf-documentation Non-functional change: Documentation

Comments

@lucasgadams

lucasgadams commented Sep 27, 2024

I am trying to figure out how to extract text based on line coordinates, using the example from https://github.com/py-pdf/pypdf/blob/main/docs/user/extract-text.md#example-1-ignore-header-and-footer with the example document. However, that does not seem to work. The y coordinates visited don't seem correct at all, or at least I don't understand what they mean. Is the example provided no longer how the code works, or is something broken? The text returned by extract_text looks correct to me, but the text collected by the visitor does not.

Environment

Which environment were you using when you encountered the problem?

$ python -m platform
macOS-14.5-arm64-arm-64bit

$ python -c "import pypdf;print(pypdf._debug_versions)"
pypdf==5.0.0, crypt_provider=('cryptography', '43.0.1'), PIL=10.4.0

Code + PDF

In [7]: from pypdf import PdfReader
   ...:
   ...: reader = PdfReader("GeoBase_NHNC1_Data_Model_UML_EN.pdf")
   ...: page = reader.pages[3]
   ...:
   ...: parts = []
   ...:
   ...:
   ...: def visitor_body(text, cm, tm, font_dict, font_size):
   ...:     y = cm[5]
   ...:     if 50 < y < 720:
   ...:         parts.append(text)
   ...:         print(f"Adding text within coordinates: {text}")
   ...:     else:
   ...:         print(f"Skipping text out of range: {y}")
   ...:
   ...:
   ...: extracted_text = page.extract_text(visitor_text=visitor_body)
   ...: text_body = "".join(parts)
   ...: print(f"Size extracted text: {len(extracted_text)}")
   ...: print(f"Size visited text: {len(text_body)}")
   ...:
   ...:
   ...:
   ...:
Skipping text out of range: 0.0
[... the same line is printed 316 times in total ...]
Size extracted text: 1814
Size visited text: 0

The PDF used is the one in the example, https://github.com/py-pdf/pypdf/blob/main/resources/GeoBase_NHNC1_Data_Model_UML_EN.pdf

@stefan6419846
Collaborator

Thanks for the report. This is not specific to version 5.0; apparently the example has already been broken since version 3. Using tm instead of cm fixes it in this case.
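
A minimal sketch of that change, assuming the same five-argument visitor signature used in the report. `make_body_visitor` and `extract_body` are hypothetical helper names, and the 50/720 bounds are simply carried over from the docs example; they are not guaranteed to suit other documents.

```python
def make_body_visitor(parts, y_min=50, y_max=720):
    """Return a visitor that keeps text whose text-matrix y lies in (y_min, y_max)."""

    def visitor_body(text, cm, tm, font_dict, font_size):
        # Read the vertical translation from the text matrix (tm),
        # not from the current transformation matrix (cm).
        y = tm[5]
        if y_min < y < y_max:
            parts.append(text)

    return visitor_body


def extract_body(pdf_path, page_index):
    """Hypothetical wrapper: collect only the text between header and footer."""
    # Imported lazily so make_body_visitor can be tried without pypdf installed.
    from pypdf import PdfReader

    parts = []
    page = PdfReader(pdf_path).pages[page_index]
    page.extract_text(visitor_text=make_body_visitor(parts))
    return "".join(parts)
```

With the document from the report this would be called as `extract_body("GeoBase_NHNC1_Data_Model_UML_EN.pdf", 3)`.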

stefan6419846 added the label "nf-documentation (Non-functional change: Documentation)" on Sep 28, 2024
stefan6419846 changed the title from "Text visitor broken in 5.0" to "Text visitor example in docs does not work" on Sep 28, 2024
@lucasgadams
Author

Great, thanks for looking into it. Can you briefly explain to me how these matrices should be used? I've read the docs but I am honestly still a bit confused. The docs here say "It is recommended to use the user_matrix as it takes into all transformations." (user_matrix seems to also be called cm). Then a bit later it says:

If you want to get the full transformation from text to user space, you can use the mult function (available in global import) as follows: txt2user = mult(tm, cm). The font size is the raw text size and affected by the user_matrix.

And then here you are suggesting that we should actually be using the tm matrix and not the cm matrix at all?

My goal is to extract text from a PDF and know its bounding box coordinates in PDF user space. For example, pymupdf has a method for getting text blocks that returns bbox coordinates. What would be the equivalent in pypdf?

@stefan6419846
Collaborator

In this specific case (for the PDF given), mult(tm, cm) should be equivalent to tm as far as I remember. Thus using tm in this case would work, but mult(tm, cm) is better.
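
To illustrate why the two can agree, here is a small sketch of composing a text matrix with a current transformation matrix, both in the six-element `[a, b, c, d, e, f]` form. `compose` is a hypothetical stand-in written for this example; that it matches what pypdf's `mult` computes is an assumption, and the matrix values are made up. When cm is the identity, which is consistent with the `0.0` values printed for `cm[5]` in the report, the composed matrix equals tm.

```python
def compose(tm, cm):
    """Apply tm first, then cm; both are PDF matrices [a, b, c, d, e, f]."""
    return [
        tm[0] * cm[0] + tm[1] * cm[2],
        tm[0] * cm[1] + tm[1] * cm[3],
        tm[2] * cm[0] + tm[3] * cm[2],
        tm[2] * cm[1] + tm[3] * cm[3],
        tm[4] * cm[0] + tm[5] * cm[2] + cm[4],  # x translation in user space
        tm[4] * cm[1] + tm[5] * cm[3] + cm[5],  # y translation in user space
    ]


identity = [1.0, 0.0, 0.0, 1.0, 0.0, 0.0]
tm = [1.0, 0.0, 0.0, 1.0, 72.0, 700.0]      # made-up text position
scaled = [2.0, 0.0, 0.0, 2.0, 10.0, 20.0]   # made-up cm that scales and shifts

print(compose(tm, identity) == tm)  # True: an identity cm leaves tm unchanged
print(compose(tm, scaled)[4:])      # [154.0, 1420.0]
```

In a visitor, the y coordinate to filter on would then be element 5 of the composed matrix rather than `tm[5]` or `cm[5]` alone.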

AFAIK there is no way to get the bounding boxes at the moment, just the "reference position" from the visitors. To get full bounding boxes, you would have to further work with the font properties.
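
As a very rough illustration of what "working with the font properties" could involve, the sketch below estimates a box from a reference position, the font size, and per-glyph widths given in thousandths of a text-space unit. Everything here is an assumption made for the example: `estimate_bbox` is a hypothetical helper, not a pypdf function; the default width, ascent, and descent are placeholder numbers; and the text is taken to be horizontal and unrotated, with `x` and `y` already in user space. Character spacing, word spacing, horizontal scaling, and any scaling in the matrices are ignored.

```python
def estimate_bbox(x, y, text, font_size, glyph_widths=None,
                  default_width=500, ascent=750, descent=-250):
    """Estimate (x0, y0, x1, y1) for a horizontal text run starting at (x, y).

    glyph_widths maps characters to widths in 1/1000 text-space units;
    characters missing from it fall back to default_width (a placeholder).
    """
    widths = glyph_widths or {}
    # Total horizontal advance of the run, scaled by the font size.
    advance = sum(widths.get(ch, default_width) for ch in text) / 1000 * font_size
    y0 = y + descent / 1000 * font_size  # below the baseline
    y1 = y + ascent / 1000 * font_size   # above the baseline
    return (x, y0, x + advance, y1)


# Two glyphs at the placeholder width of 500, font size 10.
print(estimate_bbox(72, 700, "ab", 10))  # (72, 697.5, 82.0, 707.5)
```

Real widths would have to come from the font's own metrics, which this sketch does not attempt to read.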

@lucasgadams
Author

Got it. It sounds like this library is not a good fit for my use case, and pdfminer might be better. Just for my knowledge, where would you say pypdf excels compared to other open source Python PDF libraries? What is the intended use case?

@stefan6419846
Collaborator

I consider pypdf the liberally licensed PDF library written in pure Python for reading, modifying and writing PDF files. This includes handling metadata, doing (basic) text extraction, extracting images, filling forms, adding watermarks and backgrounds to pages, removing or adding pages (including merging), transforming pages, ...

Depending on your use case, other libraries might be a better fit at the moment, which I am not going to deny. Working with signed PDF files as in pyhanko, extracting character-level data as in pdfminer.six or the MuPDF CLI, and rendering pages to images as with poppler are not supported; these are common cases where I rely on other tools as well.
