x values in the tm_matrix are wrong #2075
Comments
Also, I get the exact same x values for:
|
I tried uninstalling and re-installing with the command below and it did not immediately seem to fix the problem. I will double-check and update.
|
Well, the spaces are fixed but the x-values are still pretty wild. (edit by Martin: Added code + PDF to the top comment) |
Checking the results of other libraries:
pdfminer.six
Code:
from pdfminer.layout import LAParams, LTTextBox
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfinterp import PDFResourceManager
from pdfminer.pdfinterp import PDFPageInterpreter
from pdfminer.converter import PDFPageAggregator
fp = open("LegIndex-page6.pdf", "rb")
rsrcmgr = PDFResourceManager()
laparams = LAParams()
device = PDFPageAggregator(rsrcmgr, laparams=laparams)
interpreter = PDFPageInterpreter(rsrcmgr, device)
pages = PDFPage.get_pages(fp)
for page in pages:
    interpreter.process_page(page)
    layout = device.get_result()
    for lobj in layout:
        if isinstance(lobj, LTTextBox):
            x, y, text = lobj.bbox[0], lobj.bbox[3], lobj.get_text()
            rt = text.replace('\n', '\\n')
            print(f"x={x:.0f}, y={y:.0f}: {rt}")
gives:
|
So I guess the But the |
Taking https://stackoverflow.com/a/69151177/562769:
|
@rkiddy May I add |
PyMuPDF:
import fitz
with fitz.open("LegIndex-page6.pdf") as document:
    for page_number, page in enumerate(document):
        for x1, y1, x2, y2, text, a, b, c in page.get_text("words"):
            if "adoption" in text.lower() or "adults," in text.lower() or "agencies" in text.lower():
                print(f"{x1:>3.0f} {y1:>3.0f} {text}")
gives:
|
@MartinThoma Please do add the pdf to whatever tests you wish to. It was from a document created by the Legislative Analyst Office of the California State Legislature, so it can be used (AFAIK) for this purpose without worry. |
I know that you guys are not bored, sitting around with nothing to do. But I wanted to show you what I may try to interpret next. I usually use tabula for tables, but I am not sure it will be able to handle it. And, really, if this module worked for it, I would prefer it. This module provides for control smarts. To download, go to https://leginfo.legislature.ca.gov/, click on the "Publications" tab and click "Table of Sections Affected[PDF]". I cannot just point you to a URL because the LAO uses unnecessarily complex URL handling. And it is stupid that they have to put this into a PDF, but there it is. |
@MartinThoma Hey, since I'm already familiar with this part of the code, I debugged this a bit yesterday and think I know the reason for this behavior.
([], b'BT')
([35, 673.19500000000005], b'Td')
([['complaints, in', 40, 'v', 15, 'estig', 5, 'ations, etc., ']], b'TJ')
([1, 0, 0, 1, 150.209, 673.19500000000005], b'Tm')
([[' ', 55, 'AB ']], b'TJ')
([1, 0, 0, 1, 166.715, 673.19500000000005], b'Tm')
(['1264 '], b'Tj')
The coordinates the I think the PR #2060, where we now return |
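For readers following the operator dump above, here is a minimal sketch of how the PDF text-positioning operators move the text origin: `BT` resets the text matrix, `Td` translates relative to the current text line matrix, and `Tm` sets the matrix absolutely. The function name `track_text_origins` and the `OPERATIONS` list are hypothetical and only mirror the dump; this is not pypdf's implementation, and it deliberately ignores the advance caused by showing glyphs (which needs font widths), so it reports only where each text-showing operator starts.

```python
from typing import Any, List, Sequence, Tuple

# Hypothetical sample mirroring the (operands, operator) dump in the comment above.
OPERATIONS: List[Tuple[Sequence[Any], bytes]] = [
    ([], b"BT"),
    ([35, 673.195], b"Td"),
    ([["complaints, in", 40, "v", 15, "estig", 5, "ations, etc., "]], b"TJ"),
    ([1, 0, 0, 1, 150.209, 673.195], b"Tm"),
    ([[" ", 55, "AB "]], b"TJ"),
    ([1, 0, 0, 1, 166.715, 673.195], b"Tm"),
    (["1264 "], b"Tj"),
]

IDENTITY = [1.0, 0.0, 0.0, 1.0, 0.0, 0.0]


def track_text_origins(operations):
    """Return (x, y, text) for the start of every Tj/TJ operator.

    Assumption: glyph advances are NOT applied, so only the effect of
    BT, Td and Tm on the text matrix is modelled.
    """
    tm = list(IDENTITY)   # text matrix
    tlm = list(IDENTITY)  # text line matrix
    origins = []
    for operands, operator in operations:
        if operator == b"BT":
            # Beginning a text object resets both matrices to identity.
            tm, tlm = list(IDENTITY), list(IDENTITY)
        elif operator == b"Td":
            # Translate the line matrix by (tx, ty), then copy it into tm.
            tx, ty = operands
            a, b, c, d, e, f = tlm
            tlm = [a, b, c, d, tx * a + ty * c + e, tx * b + ty * d + f]
            tm = list(tlm)
        elif operator == b"Tm":
            # Tm replaces the matrices outright; it is not relative.
            tm = [float(v) for v in operands]
            tlm = list(tm)
        elif operator == b"Tj":
            origins.append((tm[4], tm[5], operands[0]))
        elif operator == b"TJ":
            # TJ mixes strings with kerning adjustments; keep the strings only.
            text = "".join(p for p in operands[0] if isinstance(p, str))
            origins.append((tm[4], tm[5], text))
    return origins


for x, y, text in track_text_origins(OPERATIONS):
    print(f"x={x:.3f} y={y:.3f} {text!r}")
```

Under these assumptions the three text runs start at three different x positions on the same baseline, which is a useful reference point when comparing against the x values a library reports.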
@rkiddy |
@pubpub-zz |
Sorry about this. Before I mess it up and make things more confused, can you tell me if there is a way to 'pip install' the PR? I saw something like this in your doc but cannot find it now. |
You can directly install a version from a git repo: Note: I've added some documentation to explain how to do it. |
I am trying to read what seems to be a not very complex pdf. Here is a bit from one page:
I am pulling out the y and then x value from the tm_matrix and the text from the visitor_text. I am getting this:
There are only 2 levels of indentation in the text, as you can see from the screenshot. And the x values are all over the place.
The amount of error in the x value seems to be somewhat proportional to the number of spaces that have been lost. I wonder if this is significant.
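The approach described above (reading y and x from the text matrix inside a `visitor_text` callback) can be sketched roughly as follows. This assumes pypdf's documented visitor signature `(text, cm, tm, font_dict, font_size)`, where `tm[4]` and `tm[5]` hold the x and y translation; the helper names `make_collector` and `extract_positions` are hypothetical, and the sketch is not the exact script used for this report.

```python
def make_collector():
    """Build a (rows, visitor) pair; rows fills up with (y, x, text) tuples."""
    rows = []

    def visitor(text, cm, tm, font_dict, font_size):
        # tm is the text matrix [a, b, c, d, e, f]; e and f are x and y.
        # Whitespace-only chunks are skipped to keep the output readable.
        if text.strip():
            rows.append((tm[5], tm[4], text))

    return rows, visitor


def extract_positions(path, page_index=0):
    """Collect (y, x, text) for one page of the PDF at `path`."""
    # Imported lazily so the collector can be exercised without pypdf installed.
    from pypdf import PdfReader

    reader = PdfReader(path)
    rows, visitor = make_collector()
    reader.pages[page_index].extract_text(visitor_text=visitor)
    return rows


if __name__ == "__main__":
    import os

    # The file name comes from this issue; the guard avoids failing without it.
    if os.path.exists("LegIndex-page6.pdf"):
        for y, x, text in extract_positions("LegIndex-page6.pdf"):
            print(f"{y:.1f} {x:.1f} {text!r}")
```

Note that `tm` alone is expressed in text space; if the page applies a non-identity `cm`, the two matrices would have to be combined to get page coordinates.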
Environment
OS: Ubuntu 22.04.2 LTS
Code + PDF?
LegIndex-page6.pdf