x values in the tm_matrix are wrong #2075

rkiddy · 2023-08-10T01:01:15Z

I am trying to read what seems to be a not very complex pdf. Here is a bit from one page:

I am pulling out the y and then x value from the tm_matrix and the text from the visitor_text. I am getting this:

[...]
{'text_matrix': [1, 0, 0, 1, 26, 541], 'text': 'ADOPTION \n'}
{'text_matrix': [1, 0, 0, 1, 123, 530], 'text': 'adults, adoption of,  AB 1756 '}
{'text_matrix': [1, 0, 0, 1, 273, 519], 'text': 'agencies, organizations, etc.: requirements, prohibitions, etc.,  SB 807 '}
{'text_matrix': [1, 0, 0, 1, 245, 508], 'text': 'assistance programs, adoption: nonminor dependents,  SB 9 '}
{'text_matrix': [1, 0, 0, 1, 114, 497], 'text': 'birth certificates,  AB 1302 '}
{'text_matrix': [1, 0, 0, 1, 35, 486], 'text': 'contact agreements, postadoption— \n'}
{'text_matrix': [1, 0, 0, 1, 110, 474], 'text': 'birth parents,  AB 1650 '}
{'text_matrix': [1, 0, 0, 1, 93, 463], 'text': 'siblings,  AB 20 '}
{'text_matrix': [1, 0, 0, 1, 130, 452], 'text': 'facilitators, adoption,  AB 120'}
{'text_matrix': [1, 0, 0, 1, 164, 452], 'text': ';  SB 120'}
{'text_matrix': [1, 0, 0, 1, 184, 452], 'text': ',  807 '}
{'text_matrix': [1, 0, 0, 1, 199, 441], 'text': 'failed adoptions: reproductive loss leave,  SB 848 '}
{'text_matrix': [1, 0, 0, 1, 300, 430], 'text': 'hearings, adoption finalization: remote proceedings, technology, etc.,  SB 21 '}
{'text_matrix': [1, 0, 0, 1, 135, 419], 'text': 'native american tribes,  AB 120'}
{'text_matrix': [1, 0, 0, 1, 168, 419], 'text': ';  SB 120 '}
{'text_matrix': [1, 0, 0, 1, 170, 408], 'text': 'parental rights, reinstatement of,  AB 20 '}
{'text_matrix': [1, 0, 0, 1, 265, 397], 'text': 'parents, prospective adoptive: criminal background checks,  SB 824 '}
{'text_matrix': [1, 0, 0, 1, 26, 386], 'text': 'ADULT EDUCATION \n'}
{'text_matrix': [1, 0, 0, 1, 150, 375], 'text': 'services, adult educational,  SB 877 '}
{'text_matrix': [1, 0, 0, 1, 140, 364], 'text': 'week, adult education,  ACR 31 '}
{'text_matrix': [1, 0, 0, 1, 26, 353], 'text': 'ADVERTISING. See also MARKETING; and particular subject matter (e.g., \n'}
{'text_matrix': [1, 0, 0, 1, 68, 342], 'text': 'ELECTIONS). \n'}
{'text_matrix': [1, 0, 0, 1, 211, 331], 'text': 'alcoholic beverages: tied-house restrictions,  AB 546'}
{'text_matrix': [1, 0, 0, 1, 231, 331], 'text': ',  840'}
{'text_matrix': [1, 0, 0, 1, 251, 331], 'text': ',  1294'}
{'text_matrix': [1, 0, 0, 1, 290, 331], 'text': ' ;  SB 392'}
{'text_matrix': [1, 0, 0, 1, 310, 331], 'text': ',  430 '}
{'text_matrix': [1, 0, 0, 1, 206, 320], 'text': 'campaign re social equity, civil rights, etc.,  SB 447 '}
{'text_matrix': [1, 0, 0, 1, 87, 309], 'text': 'cannabis,  AB 794'}
{'text_matrix': [1, 0, 0, 1, 107, 309], 'text': ',  1207 '}
{'text_matrix': [1, 0, 0, 1, 35, 298], 'text': 'elections. See ELECTIONS. \n'}
{'text_matrix': [1, 0, 0, 1, 35, 287], 'text': 'false, misleading, etc., advertising— \n'}
{'text_matrix': [1, 0, 0, 1, 155, 276], 'text': 'disgorgement, remedy of,  AB 1366 '}
{'text_matrix': [1, 0, 0, 1, 218, 265], 'text': 'master of divinity: prohibited title displays,  AB 1564 '}
{'text_matrix': [1, 0, 0, 1, 232, 254], 'text': 'pregnancy-related services: civil penalties, etc.,  AB 315'}
{'text_matrix': [1, 0, 0, 1, 253, 254], 'text': ',  602 '}
{'text_matrix': [1, 0, 0, 1, 172, 243], 'text': 'pricing for goods and services,  SB 478 '}
{'text_matrix': [1, 0, 0, 1, 321, 232], 'text': 'hotels, short-term rentals, etc., advertised rates: mandatory fee disclosures,  SB 683 '}
{'text_matrix': [1, 0, 0, 1, 247, 221], 'text': 'housing rental properties advertised rates: disclosures,  SB 611 '}
{'text_matrix': [1, 0, 0, 1, 25, 190], 'text': '*2023–24 First Extraordinary Session bills are designated (1X). '}

There are only 2 levels of indentation in the text, as you can see from the screenshot. And the x values are all over the place.

The amount of error in the x value seems to be somewhat proportional to the number of spaces that have been lost. I wonder if this is significant.

Environment

OS: Ubuntu 22.04.2 LTS

 % pip freeze | grep pdf
 pypdf==3.15.0

 $ python -m platform
 Linux-6.2.0-26-generic-x86_64-with-glibc2.35

 $ python -c "import pypdf;print(pypdf.__version__)"
 3.15.0

Code + PDF?

LegIndex-page6.pdf

from pypdf import PdfReader


def text_details(text, curr_trans_matrix, text_matrix, font_dict, font_size):
    info = {
        "text": text,
        "curr_trans_matrix": curr_trans_matrix,
        "text_matrix": text_matrix,
        "font_dict": font_dict,
        "font_size": font_size,
    }

    # put into a dictionary keyed by y value to enable sort.

    if info["text"] != "" and info["text"] != "\u200b" and info["text"] != "\n":
        global strings
        y_val = info["text_matrix"][5]
        if y_val not in strings:
            strings[y_val] = list()
        strings[y_val].append({"text_matrix": [int(el) for el in text_matrix], "text": text})


if __name__ == "__main__":
    path = "LegIndex-page6.pdf"

    strings = {}

    pdf = PdfReader(path)
    text_list = pdf.pages[0].extract_text().split("\n")
    pdf.pages[0].extract_text(visitor_text=text_details)
    y_vals = reversed(sorted(list(strings.keys())))
    for y_val in y_vals:
        for string in strings[y_val]:
            print(string)


rkiddy · 2023-08-10T01:06:41Z

Also, I get the exact same x values for:

 $ pip freeze | grep pdf
 pypdf @ git+https://github.com/py-pdf/PyPDF2.git@e81fbaefab18a5a9118f31eac6580824622b6ec6

MartinThoma · 2023-08-14T06:58:15Z

The relevant PR that might have fixed it is #2060, which is included in pypdf >= 3.15.1.

A related issue: #2059

rkiddy · 2023-08-14T17:29:03Z

I tried uninstalling and re-installing with the command below and it did not immediately seem to fix the problem. I will double-check and update.

 $ pip install git+https://github.com/py-pdf/pypdf.git
 $
 $ python -c "import pypdf;print(pypdf.__version__)"
 3.15.1
 $ source .venv/bin/activate
 (.venv) $ pip freeze | grep pdf
 pypdf @ git+https://github.com/py-pdf/pypdf.git@0ab320ce75bceaf054771842e83fc03340b623c6
 $

rkiddy · 2023-08-14T23:43:55Z

Well, the spaces are fixed but the x-values are still pretty wild.

(edit by Martin: Added code + PDF to the top comment)

MartinThoma · 2023-08-15T07:28:14Z

Checking the results of other libraries:

pdfminer.six

Code:

from pdfminer.layout import LAParams, LTTextBox
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfinterp import PDFResourceManager
from pdfminer.pdfinterp import PDFPageInterpreter
from pdfminer.converter import PDFPageAggregator

fp = open("LegIndex-page6.pdf", "rb")
rsrcmgr = PDFResourceManager()
laparams = LAParams()
device = PDFPageAggregator(rsrcmgr, laparams=laparams)
interpreter = PDFPageInterpreter(rsrcmgr, device)
pages = PDFPage.get_pages(fp)

for page in pages:
    interpreter.process_page(page)
    layout = device.get_result()
    for lobj in layout:
        if isinstance(lobj, LTTextBox):
            x, y, text = lobj.bbox[0], lobj.bbox[3], lobj.get_text()
            rt = text.replace('\n', '\\n')
            print(f"x={x:.0f}, y={y:.0f}: {rt}")

gives:

x=26, y=548: ADOPTION \n
x=35, y=537: adults, adoption of,  AB 1756 \nagencies, organizations, etc.: requirements, prohibitions, etc.,  SB 807 \nassistance programs, adoption: nonminor dependents,  SB 9 \nbirth certificates, \ncontact agreements, postadoption— \n
x=98, y=504:  AB 1302 \n
x=44, y=482: birth parents,  AB 1650 \nsiblings,  AB 20 \n
x=35, y=460: facilitators, adoption,  AB 120;  SB 120,  807 \nfailed adoptions: reproductive loss leave,  SB 848 \nhearings, adoption finalization: remote proceedings, technology, etc., \nnative american tribes,  AB 120;  SB 120 \nparental rights, reinstatement of,  AB 20 \nparents, prospective adoptive: criminal background checks,  SB 824 \n
x=285, y=438:  SB 21 \n
x=26, y=394: ADULT EDUCATION \n
x=35, y=383: services, adult educational,  SB 877 \nweek, adult education,  ACR 31 \n
x=26, y=361: ADVERTISING. See also MARKETING; and particular subject matter (e.g., \n
x=69, y=350: ELECTIONS). \n
x=35, y=339: alcoholic beverages: tied-house restrictions,  AB 546,  840,  1294;  SB 392,  430 \ncampaign re social equity, civil rights, etc.,  SB 447 \ncannabis,  AB 794,  1207 \nelections. See ELECTIONS. \nfalse, misleading, etc., advertising— \n
x=44, y=284: disgorgement, remedy of,  AB 1366 \nmaster of divinity: prohibited title displays,  AB 1564 \npregnancy-related services: civil penalties, etc.,  AB 315,  602 \npricing for goods and services,  SB 478 \n
x=35, y=240: hotels, short-term rentals, etc., advertised rates: mandatory fee disclosures,  SB 683 \nhousing rental properties advertised rates: disclosures,  SB 611 \n
x=25, y=196: *2023–24 First Extraordinary Session bills are designated (1X). \n

MartinThoma · 2023-08-15T07:31:11Z

So I guess the x=26 of "ADOPTION" and "ADULT EDUCATION" is correct.

But "adults, adoption of," should have x=35. It currently has x=123.

MartinThoma · 2023-08-15T07:38:52Z

Taking https://stackoverflow.com/a/69151177/562769:

element                        x1  y1  x2  y2   text
------------------------------ --- --- --- ---- -----
    LTTextBoxHorizontal        26  539 74  548  ADOPTION
      LTTextLineHorizontal     26  539 74  548  ADOPTION
        LTChar                 26  539 32  548  A
        LTChar                 32  539 39  548  D
        LTChar                 39  539 45  548  O
        LTChar                 45  539 50  548  P
        LTChar                 50  539 56  548  T
        LTChar                 56  539 59  548  I
        LTChar                 59  539 65  548  O
        LTChar                 65  539 72  548  N



    LTTextBoxHorizontal        35  484 289 537  adults, adoption of,  AB 1756 
      LTTextLineHorizontal     35  528 144 537  adults, adoption of,  AB 1756
      LTTextLineHorizontal     35  517 289 526  agencies, organizations, etc.: requirements, prohibitions, etc.,  SB 807
      LTTextLineHorizontal     35  506 252 515  assistance programs, adoption: nonminor dependents,  SB 9

MartinThoma · 2023-08-15T07:40:19Z

@rkiddy May I add LegIndex-page6.pdf to https://github.com/py-pdf/sample-files ? Then I would add a (currently failing) test so that we get the right values eventually.

MartinThoma · 2023-08-15T11:21:28Z

PyMuPDF:

import fitz

with fitz.open("LegIndex-page6.pdf") as document:
    for page_number, page in enumerate(document):
        for x1, y1, x2, y2, text, a, b, c in page.get_text("words"):
            if "adoption" in text.lower() or "adults," in text.lower() or "agencies" in text.lower():
                print(f"{x1:>3.0f} {y1:>3.0f} {text}")

gives:

 26 241 ADOPTION
 35 252 adults,
 35 263 agencies,

rkiddy · 2023-08-15T21:23:59Z

@MartinThoma Please do add the pdf to whatever tests you wish to. It was from a document created by the Legislative Analyst Office of the California State Legislature, so it can be used (AFAIK) for this purpose without worry.

rkiddy · 2023-08-16T18:16:37Z

I know that you guys are not bored, sitting around with nothing to do. But I wanted to show you what I may try to interpret next. I usually use tabula for tables, but I am not sure it will be able to handle this one. And, really, if this module worked for it, I would prefer it. This module provides more control.

To download, go to https://leginfo.legislature.ca.gov/, click on the "Publications" tab and click "Table of Sections Affected[PDF]". I cannot just point you to a URL because the LAO uses unnecessarily complex url handling. And it is stupid that they have to put this into a PDF but there it is.

troethe · 2023-08-18T08:05:25Z

@MartinThoma Hey, since I'm already familiar with this part of the code, I debugged this a bit yesterday and think I know the reason for this behavior.
What pypdf currently returns as tm_matrix for each line of text is the internal tm_prev at the point when it recognizes that a new line is being started. This delivers the expected results if the same tm_matrix was active for the whole line. In the example PDF provided here, however, many lines consist of multiple Tj operators with Tm operators in between to create horizontal spacing.
For example, after removing some unnecessary operators, these are the operations printing "complaints, investigations, etc., AB 1264":

([], b'BT')
([35, 673.19500000000005], b'Td')
([['complaints, in', 40, 'v', 15, 'estig', 5, 'ations, etc., ']], b'TJ')
([1, 0, 0, 1, 150.209, 673.19500000000005], b'Tm')
([[' ', 55, 'AB ']], b'TJ')
([1, 0, 0, 1, 166.715, 673.19500000000005], b'Tm')
(['1264 '], b'Tj')

The coordinates the visitor_text gets for this line are x=166.715 y=673.195. That is, the ones from the tm_matrix that was active while printing "1264".

I think PR #2060, where we now return tm_prev instead of the current tm_matrix, got us "closer" to the correct matrix we should return here, but fell short of doing what we should be doing, which is to return the tm_matrix that was active during the first Tj of a line.
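To make the difference concrete, here is a minimal sketch that replays the operators quoted above and records the x position at the first and at the last text-showing operator of each line. It is not pypdf's implementation: `fragments` and `line_origins` are hypothetical helper names, only translations are tracked (scale, rotation and glyph advance are ignored), and the y values are shortened to 673.195.

```python
# Simplified replay of the content-stream operators quoted above.
# Assumption: only the translation part of the text matrix matters here.
OPS = [
    ([], b"BT"),
    ([35, 673.195], b"Td"),
    ([["complaints, in", 40, "v", 15, "estig", 5, "ations, etc., "]], b"TJ"),
    ([1, 0, 0, 1, 150.209, 673.195], b"Tm"),
    ([[" ", 55, "AB "]], b"TJ"),
    ([1, 0, 0, 1, 166.715, 673.195], b"Tm"),
    (["1264 "], b"Tj"),
]


def fragments(ops):
    """Yield (x, y, text) for every text-showing operator (Tj / TJ)."""
    x = y = 0.0
    for operands, op in ops:
        if op == b"BT":
            # A new text object resets the text matrix to identity.
            x = y = 0.0
        elif op == b"Td":
            # Move relative to the start of the current line.
            x += operands[0]
            y += operands[1]
        elif op == b"Tm":
            # Set the text matrix directly; e and f are the translation.
            x, y = operands[4], operands[5]
        elif op == b"Tj":
            yield x, y, operands[0]
        elif op == b"TJ":
            # TJ mixes strings with kerning numbers; keep only the strings.
            yield x, y, "".join(p for p in operands[0] if isinstance(p, str))


def line_origins(ops):
    """Map each y to (x of first fragment, x of last fragment, joined text)."""
    first, last, text = {}, {}, {}
    for x, y, chunk in fragments(ops):
        first.setdefault(y, x)  # only the first fragment on a line sets this
        last[y] = x             # overwritten by every later fragment
        text[y] = text.get(y, "") + chunk
    return {y: (first[y], last[y], text[y]) for y in first}


if __name__ == "__main__":
    for y, (first_x, last_x, line) in line_origins(OPS).items():
        print(f"y={y}: first x={first_x}, last x={last_x}, text={line!r}")
```

For this line the first fragment starts at x=35 and the last one at x=166.715, which matches the value the visitor currently receives. Reporting the first value per line is the behavior described above as the desired one.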

pubpub-zz · 2023-09-20T19:03:10Z

@rkiddy
I've proposed a new PR to fix the issue. I've used your test file https://github.com/py-pdf/pypdf/files/12318042/LegIndex-page6.pdf for the test successfully. Can you try it on other files to confirm that it solves all cases?


rkiddy · 2023-09-20T21:17:32Z

@pubpub-zz
I will try it later today. Much thanx.

rkiddy · 2023-09-21T00:01:16Z

Sorry about this. Before I mess it up and make things more confused, can you tell me if there is a way to 'pip install' the PR? I saw something like this in your doc but cannot find it now.

pubpub-zz · 2023-09-21T17:00:50Z

You can directly install a version from a git repo:
pip install git+https://github.com/pubpub-zz/pypdf.git@iss2200

Note: I've added some documentation to explain how to do it.


MartinThoma changed the title ~~how off are x and y values in tm_matrix? in how complex a document?~~ x and y values in the tm_matrix are wrong Aug 14, 2023

MartinThoma added the is-bug From a users perspective, this is a bug - a violation of the expected behavior with a compliant PDF label Aug 14, 2023

MartinThoma added the workflow-advanced-text-extraction Getting coordinates, font weight, font type, ... label Aug 14, 2023

rkiddy changed the title ~~x and y values in the tm_matrix are wrong~~ x values in the tm_matrix are wrong Aug 15, 2023

py-pdf deleted a comment from pubpub-zz Aug 15, 2023

py-pdf deleted a comment from rkiddy Aug 15, 2023

py-pdf deleted a comment from pubpub-zz Aug 15, 2023

troethe mentioned this issue Aug 18, 2023

visitor_text method of extract_text method could be given each word separately #2094

Closed

pubpub-zz self-assigned this Sep 3, 2023

pubpub-zz mentioned this issue Sep 19, 2023

BUG: invalid cm/tm in visitor functions #2206

Merged

pubpub-zz added a commit to pubpub-zz/pypdf that referenced this issue Sep 20, 2023

complete test

37485df

for py-pdf#2075

pubpub-zz added a commit to pubpub-zz/pypdf that referenced this issue Sep 21, 2023

DOC: explain how to install from dev branch

fd82f29

addressed in py-pdf#2075

pubpub-zz mentioned this issue Sep 21, 2023

DOC: How to install pypi from any branch #2209

Merged

MartinThoma pushed a commit that referenced this issue Sep 22, 2023

DOC: How to install pypi from any branch (#2209)

8cbe5e7

See #2075

MartinThoma closed this as completed in #2206 Oct 8, 2023

MartinThoma pushed a commit that referenced this issue Oct 8, 2023

BUG: invalid cm/tm in visitor functions (#2206)

bcd85c4

Reworks and is still valid to close #2059 Closes #2200 Closes #2075
