x values in the tm_matrix are wrong #2075

Closed
rkiddy opened this issue Aug 10, 2023 · 16 comments · Fixed by #2206
Labels
is-bug From a users perspective, this is a bug - a violation of the expected behavior with a compliant PDF workflow-advanced-text-extraction Getting coordinates, font weight, font type, ...

Comments

@rkiddy

rkiddy commented Aug 10, 2023

I am trying to read what seems to be a fairly simple PDF. Here is a bit from one page:

Screenshot from 2023-08-09 17-44-08

I am pulling out the y and then x value from the tm_matrix and the text from the visitor_text. I am getting this:

[...]
{'text_matrix': [1, 0, 0, 1, 26, 541], 'text': 'ADOPTION \n'}
{'text_matrix': [1, 0, 0, 1, 123, 530], 'text': 'adults, adoption of,  AB 1756 '}
{'text_matrix': [1, 0, 0, 1, 273, 519], 'text': 'agencies, organizations, etc.: requirements, prohibitions, etc.,  SB 807 '}
{'text_matrix': [1, 0, 0, 1, 245, 508], 'text': 'assistance programs, adoption: nonminor dependents,  SB 9 '}
{'text_matrix': [1, 0, 0, 1, 114, 497], 'text': 'birth certificates,  AB 1302 '}
{'text_matrix': [1, 0, 0, 1, 35, 486], 'text': 'contact agreements, postadoption— \n'}
{'text_matrix': [1, 0, 0, 1, 110, 474], 'text': 'birth parents,  AB 1650 '}
{'text_matrix': [1, 0, 0, 1, 93, 463], 'text': 'siblings,  AB 20 '}
{'text_matrix': [1, 0, 0, 1, 130, 452], 'text': 'facilitators, adoption,  AB 120'}
{'text_matrix': [1, 0, 0, 1, 164, 452], 'text': ';  SB 120'}
{'text_matrix': [1, 0, 0, 1, 184, 452], 'text': ',  807 '}
{'text_matrix': [1, 0, 0, 1, 199, 441], 'text': 'failed adoptions: reproductive loss leave,  SB 848 '}
{'text_matrix': [1, 0, 0, 1, 300, 430], 'text': 'hearings, adoption finalization: remote proceedings, technology, etc.,  SB 21 '}
{'text_matrix': [1, 0, 0, 1, 135, 419], 'text': 'native american tribes,  AB 120'}
{'text_matrix': [1, 0, 0, 1, 168, 419], 'text': ';  SB 120 '}
{'text_matrix': [1, 0, 0, 1, 170, 408], 'text': 'parental rights, reinstatement of,  AB 20 '}
{'text_matrix': [1, 0, 0, 1, 265, 397], 'text': 'parents, prospective adoptive: criminal background checks,  SB 824 '}
{'text_matrix': [1, 0, 0, 1, 26, 386], 'text': 'ADULT EDUCATION \n'}
{'text_matrix': [1, 0, 0, 1, 150, 375], 'text': 'services, adult educational,  SB 877 '}
{'text_matrix': [1, 0, 0, 1, 140, 364], 'text': 'week, adult education,  ACR 31 '}
{'text_matrix': [1, 0, 0, 1, 26, 353], 'text': 'ADVERTISING. See also MARKETING; and particular subject matter (e.g., \n'}
{'text_matrix': [1, 0, 0, 1, 68, 342], 'text': 'ELECTIONS). \n'}
{'text_matrix': [1, 0, 0, 1, 211, 331], 'text': 'alcoholic beverages: tied-house restrictions,  AB 546'}
{'text_matrix': [1, 0, 0, 1, 231, 331], 'text': ',  840'}
{'text_matrix': [1, 0, 0, 1, 251, 331], 'text': ',  1294'}
{'text_matrix': [1, 0, 0, 1, 290, 331], 'text': ' ;  SB 392'}
{'text_matrix': [1, 0, 0, 1, 310, 331], 'text': ',  430 '}
{'text_matrix': [1, 0, 0, 1, 206, 320], 'text': 'campaign re social equity, civil rights, etc.,  SB 447 '}
{'text_matrix': [1, 0, 0, 1, 87, 309], 'text': 'cannabis,  AB 794'}
{'text_matrix': [1, 0, 0, 1, 107, 309], 'text': ',  1207 '}
{'text_matrix': [1, 0, 0, 1, 35, 298], 'text': 'elections. See ELECTIONS. \n'}
{'text_matrix': [1, 0, 0, 1, 35, 287], 'text': 'false, misleading, etc., advertising— \n'}
{'text_matrix': [1, 0, 0, 1, 155, 276], 'text': 'disgorgement, remedy of,  AB 1366 '}
{'text_matrix': [1, 0, 0, 1, 218, 265], 'text': 'master of divinity: prohibited title displays,  AB 1564 '}
{'text_matrix': [1, 0, 0, 1, 232, 254], 'text': 'pregnancy-related services: civil penalties, etc.,  AB 315'}
{'text_matrix': [1, 0, 0, 1, 253, 254], 'text': ',  602 '}
{'text_matrix': [1, 0, 0, 1, 172, 243], 'text': 'pricing for goods and services,  SB 478 '}
{'text_matrix': [1, 0, 0, 1, 321, 232], 'text': 'hotels, short-term rentals, etc., advertised rates: mandatory fee disclosures,  SB 683 '}
{'text_matrix': [1, 0, 0, 1, 247, 221], 'text': 'housing rental properties advertised rates: disclosures,  SB 611 '}
{'text_matrix': [1, 0, 0, 1, 25, 190], 'text': '*2023–24 First Extraordinary Session bills are designated (1X). '}

There are only 2 levels of indentation in the text, as you can see from the screenshot, yet the x values are all over the place.

The amount of error in the x value seems to be somewhat proportional to the number of spaces that have been lost. I wonder if this is significant.
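One way to put a number on "all over the place" is to count how many distinct x offsets show up. The sketch below is only an illustration: `x_levels` and its `tolerance` parameter are hypothetical names, not part of pypdf, and the records are the first six lines of the output above.

```python
def x_levels(records, tolerance=2):
    """Cluster the x offsets (text_matrix[4]) of the records.

    Sorted x values that are within `tolerance` of the previous value are
    put into the same cluster. `tolerance` is an arbitrary choice here.
    """
    xs = sorted({rec["text_matrix"][4] for rec in records})
    levels = []
    for x in xs:
        if levels and x - levels[-1][-1] <= tolerance:
            levels[-1].append(x)
        else:
            levels.append([x])
    return levels


# First six records copied from the output above.
records = [
    {"text_matrix": [1, 0, 0, 1, 26, 541], "text": "ADOPTION \n"},
    {"text_matrix": [1, 0, 0, 1, 123, 530], "text": "adults, adoption of,  AB 1756 "},
    {"text_matrix": [1, 0, 0, 1, 273, 519], "text": "agencies, organizations, etc.: requirements, prohibitions, etc.,  SB 807 "},
    {"text_matrix": [1, 0, 0, 1, 245, 508], "text": "assistance programs, adoption: nonminor dependents,  SB 9 "},
    {"text_matrix": [1, 0, 0, 1, 114, 497], "text": "birth certificates,  AB 1302 "},
    {"text_matrix": [1, 0, 0, 1, 35, 486], "text": "contact agreements, postadoption— \n"},
]

levels = x_levels(records)
print(len(levels), levels)
```

For these six lines every record ends up in its own cluster, even though the screenshot shows only a heading and one indentation level among them.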

Environment

OS: Ubuntu 22.04.2 LTS

 % pip freeze | grep pdf
 pypdf==3.15.0

 $ python -m platform
 Linux-6.2.0-26-generic-x86_64-with-glibc2.35

 $ python -c "import pypdf;print(pypdf.__version__)"
 3.15.0

Code + PDF?

LegIndex-page6.pdf

from pypdf import PdfReader


def text_details(text, curr_trans_matrix, text_matrix, font_dict, font_size):
    info = {
        "text": text,
        "curr_trans_matrix": curr_trans_matrix,
        "text_matrix": text_matrix,
        "font_dict": font_dict,
        "font_size": font_size,
    }

    # put into a dictionary keyed by y value to enable sort.

    if info["text"] != "" and info["text"] != "\u200b" and info["text"] != "\n":
        global strings
        y_val = info["text_matrix"][5]
        if y_val not in strings:
            strings[y_val] = list()
        strings[y_val].append({"text_matrix": [int(el) for el in text_matrix], "text": text})


if __name__ == "__main__":
    path = "LegIndex-page6.pdf"

    strings = {}

    pdf = PdfReader(path)
    # Only the visitor callback is needed; the returned text is ignored.
    pdf.pages[0].extract_text(visitor_text=text_details)
    y_vals = sorted(strings, reverse=True)
    for y_val in y_vals:
        for string in strings[y_val]:
            print(string)
@rkiddy
Author

rkiddy commented Aug 10, 2023

Also, I get the exact same x values for:

 $ pip freeze | grep pdf
 pypdf @ git+https://github.com/py-pdf/PyPDF2.git@e81fbaefab18a5a9118f31eac6580824622b6ec6

@MartinThoma MartinThoma changed the title from "how off are x and y values in tm_matrix? in how complex a document?" to "x and y values in the tm_matrix are wrong" Aug 14, 2023
@MartinThoma MartinThoma added the is-bug From a users perspective, this is a bug - a violation of the expected behavior with a compliant PDF label Aug 14, 2023
@MartinThoma
Member

MartinThoma commented Aug 14, 2023

The relevant PR that might have fixed it: #2060 is included in pypdf >= 3.15.1

A related issue: #2059

@MartinThoma MartinThoma added the workflow-advanced-text-extraction Getting coordinates, font weight, font type, ... label Aug 14, 2023
@rkiddy
Author

rkiddy commented Aug 14, 2023

I tried uninstalling and re-installing with the command below and it did not immediately seem to fix the problem. I will double-check and update.

 $ pip install git+https://github.com/py-pdf/pypdf.git
 $
 $ python -c "import pypdf;print(pypdf.__version__)"
 3.15.1
 $ source .venv/bin/activate
 (.venv) $ pip freeze | grep pdf
 pypdf @ git+https://github.com/py-pdf/pypdf.git@0ab320ce75bceaf054771842e83fc03340b623c6
 $

@rkiddy
Author

rkiddy commented Aug 14, 2023

Well, the spaces are fixed but the x-values are still pretty wild.

(edit by Martin: Added code + PDF to the top comment)

@rkiddy rkiddy changed the title from "x and y values in the tm_matrix are wrong" to "x values in the tm_matrix are wrong" Aug 15, 2023
@py-pdf py-pdf deleted a comment from pubpub-zz Aug 15, 2023
@py-pdf py-pdf deleted a comment from rkiddy Aug 15, 2023
@py-pdf py-pdf deleted a comment from pubpub-zz Aug 15, 2023
@MartinThoma
Member

Checking the results of other libraries:

pdfminer.six

Code:

from pdfminer.layout import LAParams, LTTextBox
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfinterp import PDFResourceManager
from pdfminer.pdfinterp import PDFPageInterpreter
from pdfminer.converter import PDFPageAggregator

fp = open("LegIndex-page6.pdf", "rb")
rsrcmgr = PDFResourceManager()
laparams = LAParams()
device = PDFPageAggregator(rsrcmgr, laparams=laparams)
interpreter = PDFPageInterpreter(rsrcmgr, device)
pages = PDFPage.get_pages(fp)

for page in pages:
    interpreter.process_page(page)
    layout = device.get_result()
    for lobj in layout:
        if isinstance(lobj, LTTextBox):
            x, y, text = lobj.bbox[0], lobj.bbox[3], lobj.get_text()
            rt = text.replace('\n', '\\n')
            print(f"x={x:.0f}, y={y:.0f}: {rt}")

gives:

x=26, y=548: ADOPTION \n
x=35, y=537: adults, adoption of,  AB 1756 \nagencies, organizations, etc.: requirements, prohibitions, etc.,  SB 807 \nassistance programs, adoption: nonminor dependents,  SB 9 \nbirth certificates, \ncontact agreements, postadoption— \n
x=98, y=504:  AB 1302 \n
x=44, y=482: birth parents,  AB 1650 \nsiblings,  AB 20 \n
x=35, y=460: facilitators, adoption,  AB 120;  SB 120,  807 \nfailed adoptions: reproductive loss leave,  SB 848 \nhearings, adoption finalization: remote proceedings, technology, etc., \nnative american tribes,  AB 120;  SB 120 \nparental rights, reinstatement of,  AB 20 \nparents, prospective adoptive: criminal background checks,  SB 824 \n
x=285, y=438:  SB 21 \n
x=26, y=394: ADULT EDUCATION \n
x=35, y=383: services, adult educational,  SB 877 \nweek, adult education,  ACR 31 \n
x=26, y=361: ADVERTISING. See also MARKETING; and particular subject matter (e.g., \n
x=69, y=350: ELECTIONS). \n
x=35, y=339: alcoholic beverages: tied-house restrictions,  AB 546,  840,  1294;  SB 392,  430 \ncampaign re social equity, civil rights, etc.,  SB 447 \ncannabis,  AB 794,  1207 \nelections. See ELECTIONS. \nfalse, misleading, etc., advertising— \n
x=44, y=284: disgorgement, remedy of,  AB 1366 \nmaster of divinity: prohibited title displays,  AB 1564 \npregnancy-related services: civil penalties, etc.,  AB 315,  602 \npricing for goods and services,  SB 478 \n
x=35, y=240: hotels, short-term rentals, etc., advertised rates: mandatory fee disclosures,  SB 683 \nhousing rental properties advertised rates: disclosures,  SB 611 \n
x=25, y=196: *2023–24 First Extraordinary Session bills are designated (1X). \n

@MartinThoma
Member

So I guess the x=26 of "ADOPTION" and "ADULT EDUCATION" is correct.

But "adults, adoption of," should have x=35. pypdf currently reports x=123.

@MartinThoma
Member

MartinThoma commented Aug 15, 2023

Taking https://stackoverflow.com/a/69151177/562769:

element                        x1  y1  x2  y2   text
------------------------------ --- --- --- ---- -----
    LTTextBoxHorizontal        26  539 74  548  ADOPTION
      LTTextLineHorizontal     26  539 74  548  ADOPTION
        LTChar                 26  539 32  548  A
        LTChar                 32  539 39  548  D
        LTChar                 39  539 45  548  O
        LTChar                 45  539 50  548  P
        LTChar                 50  539 56  548  T
        LTChar                 56  539 59  548  I
        LTChar                 59  539 65  548  O
        LTChar                 65  539 72  548  N



    LTTextBoxHorizontal        35  484 289 537  adults, adoption of,  AB 1756 
      LTTextLineHorizontal     35  528 144 537  adults, adoption of,  AB 1756
      LTTextLineHorizontal     35  517 289 526  agencies, organizations, etc.: requirements, prohibitions, etc.,  SB 807
      LTTextLineHorizontal     35  506 252 515  assistance programs, adoption: nonminor dependents,  SB 9

@MartinThoma
Member

@rkiddy May I add LegIndex-page6.pdf to https://github.com/py-pdf/sample-files ? Then I would add a (currently failing) test so that we get the right values eventually.

@MartinThoma
Member

PyMuPDF:

import fitz

with fitz.open("LegIndex-page6.pdf") as document:
    for page_number, page in enumerate(document):
        for x1, y1, x2, y2, text, a, b, c in page.get_text("words"):
            if "adoption" in text.lower() or "adults," in text.lower() or "agencies" in text.lower():
                print(f"{x1:>3.0f} {y1:>3.0f} {text}")

gives:

 26 241 ADOPTION
 35 252 adults,
 35 263 agencies,

@rkiddy
Author

rkiddy commented Aug 15, 2023

@MartinThoma Please do add the pdf to whatever tests you wish to. It was from a document created by the Legislative Analyst Office of the California State Legislature, so it can be used (AFAIK) for this purpose without worry.

@rkiddy
Author

rkiddy commented Aug 16, 2023

I know that you guys are not bored, sitting around with nothing to do. But I wanted to show you what I may try to interpret next. I usually use tabula for tables but I am not sure it will be able to handle this one. And, really, if this module worked for it, I would prefer it, because it gives me more control.

To download, go to https://leginfo.legislature.ca.gov/, click on the "Publications" tab and click "Table of Sections Affected[PDF]". I cannot just point you to a URL because the LAO uses unnecessarily complex url handling. And it is stupid that they have to put this into a PDF but there it is.

Screenshot from 2023-08-16 11-15-41

@troethe
Contributor

troethe commented Aug 18, 2023

@MartinThoma Hey, since I'm already familiar with this part of the code, I debugged this a bit yesterday and think I know the reason for this behavior.
What pypdf currently returns as tm_matrix for each line of text is the internal tm_prev at the point when it recognizes that a new line is being started. This delivers the expected results if the same tm_matrix was active for the whole line. In the example PDF provided here, however, many lines consist of multiple Tj operators with Tm operators in between to create horizontal spacing.
For example, after removing some unnecessary operators, these are the operations printing "complaints, investigations, etc., AB 1264":

([], b'BT')
([35, 673.19500000000005], b'Td')
([['complaints, in', 40, 'v', 15, 'estig', 5, 'ations, etc., ']], b'TJ')
([1, 0, 0, 1, 150.209, 673.19500000000005], b'Tm')
([[' ', 55, 'AB ']], b'TJ')
([1, 0, 0, 1, 166.715, 673.19500000000005], b'Tm')
(['1264 '], b'Tj')

The coordinates the visitor_text gets for this line are x=166.715 y=673.195, that is, the ones from the tm_matrix that was active while printing "1264".

I think PR #2060, where we now return tm_prev instead of the current tm_matrix, got us "closer" to the correct matrix we should return here, but fell short of doing what we should be doing, which is to return the tm_matrix active during the first Tj of a line.
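To make the difference concrete, here is a minimal sketch in plain Python that replays the operations listed above and records the text matrix at the first and at the last text-showing operator. It is not pypdf's implementation: `first_and_last_matrix` is a hypothetical helper, it only handles BT, Td, Tm, Tj and TJ, and it ignores the glyph advance inside a Tj/TJ, so it reports the matrix at the start of each text-showing operator only.

```python
IDENTITY = [1.0, 0.0, 0.0, 1.0, 0.0, 0.0]


def first_and_last_matrix(operations):
    """Return the matrix active at the first and at the last Tj/TJ.

    Simplified model (assumption): only the text line matrix is tracked,
    and the horizontal advance caused by showing glyphs is ignored.
    """
    tlm = list(IDENTITY)
    first = last = None
    for operands, operator in operations:
        if operator == b"BT":
            # A new text object resets the matrix to identity.
            tlm = list(IDENTITY)
        elif operator == b"Tm":
            # Tm replaces the matrix outright.
            tlm = [float(v) for v in operands]
        elif operator == b"Td":
            # Td translates the line matrix by (tx, ty).
            tx, ty = operands
            a, b, c, d, e, f = tlm
            tlm = [a, b, c, d, tx * a + ty * c + e, tx * b + ty * d + f]
        elif operator in (b"Tj", b"TJ"):
            if first is None:
                first = list(tlm)
            last = list(tlm)
    return first, last


# The operations quoted above for one line of text.
OPERATIONS = [
    ([], b"BT"),
    ([35, 673.195], b"Td"),
    ([["complaints, in", 40, "v", 15, "estig", 5, "ations, etc., "]], b"TJ"),
    ([1, 0, 0, 1, 150.209, 673.195], b"Tm"),
    ([[" ", 55, "AB "]], b"TJ"),
    ([1, 0, 0, 1, 166.715, 673.195], b"Tm"),
    (["1264 "], b"Tj"),
]

first, last = first_and_last_matrix(OPERATIONS)
print("first:", first)
print("last: ", last)
```

In this model the first matrix has x=35, which is the indentation one would expect for the line, while the last has x=166.715, which is what the visitor currently receives.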

@pubpub-zz
Collaborator

@rkiddy
I've proposed a new PR to fix the issue. I've used your test file https://github.com/py-pdf/pypdf/files/12318042/LegIndex-page6.pdf for the test successfully. Can you try it on other files to confirm that it solves all cases?

pubpub-zz added a commit to pubpub-zz/pypdf that referenced this issue Sep 20, 2023
@rkiddy
Author

rkiddy commented Sep 20, 2023

@pubpub-zz
I will try it later today. Many thanks.

@rkiddy
Author

rkiddy commented Sep 21, 2023

Sorry about this. Before I mess it up and make things more confused, can you tell me if there is a way to 'pip install' the PR? I saw something like this in your docs but cannot find it now.

@pubpub-zz
Collaborator

pubpub-zz commented Sep 21, 2023

You can install a version directly from a Git repo:
pip install git+https://github.com/pubpub-zz/pypdf.git@iss2200

Note: I've added some documentation to explain how to do it.

MartinThoma pushed a commit that referenced this issue Oct 8, 2023
Reworks and is still valid to close #2059

Closes #2200
Closes #2075