DOC: Change extract-text.md example codes from using cm to tm #2432

etern4l-white · 2024-02-01T10:30:12Z

TL;DR

Fixes #2431
Changed cm (current_matrix) to tm (text matrix).

Problem

In the extract-text documentation here, the example code snippets do not produce the correct output.

For example, the first code snippet should output the text of the table of contents, but it outputs nothing. The second code snippet is supposed to convert page 4 of the PDF to an SVG, including the text, but it only outputs empty fields.

Reason

The coordinates should be read from the text matrix (tm) instead of the current matrix (cm).

Solution/update

Just changed the cm to tm in the code snippets.
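
As a rough sketch of the change, assuming the docs example filters text by its vertical position: the visitor passed to `page.extract_text(visitor_text=...)` receives `(text, cm, tm, font_dict, font_size)`, and the fix reads the y offset from `tm[5]` rather than `cm[5]`. The helper name `make_body_visitor`, the bounds 50 and 720, and the simulated calls are illustrative assumptions, not taken from the diff.

```python
def make_body_visitor(parts, y_min=50, y_max=720):
    """Return a visitor that keeps only text whose y offset lies in (y_min, y_max)."""

    def visitor_body(text, cm, tm, font_dict, font_size):
        # tm is the text matrix [a, b, c, d, e, f]; index 5 is the y translation.
        y = tm[5]
        if y_min < y < y_max:
            parts.append(text)

    return visitor_body


# Simulated calls standing in for what extract_text would feed the visitor.
identity = [1.0, 0.0, 0.0, 1.0, 0.0, 0.0]
parts = []
visitor = make_body_visitor(parts)
visitor("Header", identity, [1, 0, 0, 1, 72, 760], None, 10)     # above the body
visitor("Body text", identity, [1, 0, 0, 1, 72, 400], None, 10)  # inside the body
visitor("Footer", identity, [1, 0, 0, 1, 72, 30], None, 10)      # below the body
print("".join(parts))
```

With a real page this would be used as `page.extract_text(visitor_text=make_body_visitor(parts))`. If cm stays at the identity matrix, `cm[5]` is 0 and no text passes the filter, which would match the empty output described above. Whether that is what happens in the example PDF is an assumption.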

etern4l-white · 2024-02-01T10:49:02Z

This is my first pull request ever. Any comments?

pubpub-zz · 2024-02-01T18:43:08Z

I personally do not recommend extracting text using cm: we have observed many cases where the resulting text and its ordering are not valid.
Maybe you should try the new layout mode for text extraction.

stefan6419846 · 2024-02-03T09:33:25Z

@pubpub-zz This is an example inside our docs which somehow stopped working correctly, maybe due to internal changes, or maybe because it never worked in the past. If anything, we should explain the new layout mode alongside the "classic" mode in the docs. But this can be part of another PR - our examples should work and yield correct results to avoid frustration.

etern4l-white · 2024-02-03T09:37:55Z

> @pubpub-zz This is an example inside our docs which somehow stopped working correctly, maybe due to internal changes, or maybe because it never worked in the past. If anything, we should explain the new layout mode alongside the "classic" mode in the docs. But this can be part of another PR - our examples should work and yield correct results to avoid frustration.

By the way, the pypdf version 3 documentation (I guess) also uses tm instead of cm. See here. I don't know the exact underlying reason.

stefan6419846 · 2024-02-03T09:39:39Z

This was changed by @pubpub-zz in bcd85c4.

etern4l-white · 2024-02-03T09:48:36Z

Hm, so tm was causing issues, and that change fixed them? That makes sense, but I believe the example in the docs should work, too. I'm kind of confused to be honest, since I'm not that familiar with the internals.

Maybe the example PDF is the problem? If that change was made to solve a problem, then maybe tm was causing issues in most PDFs but not in the example PDF, so switching to cm might have fixed the issue for a lot of PDFs, but not for the one in the example.

To validate this assumption I should test it on many PDFs. I will do that later today, because I have an assignment to finish first 😅.

stefan6419846 · 2024-02-20T19:31:51Z

Did you have a chance to further check this already?

MartinThoma · 2024-07-20T08:38:26Z

Hey @etern4l-white are there any updates? 😇

stefan6419846 · 2024-10-07T18:53:14Z

Should fix #2881 as well.

pubpub-zz · 2024-10-07T19:23:20Z

As recommended in #2881 (comment), shouldn't we propose using mult(cm,tm)?
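
For readers unfamiliar with the idea, combining the two matrices amounts to a product of two PDF matrices, each stored as `[a, b, c, d, e, f]`. The function below is a standalone stand-in named `multiply_matrices`, not pypdf's own helper. The PDF specification composes the text matrix with the current transformation matrix as Tm × CTM, so this sketch applies tm first. Which argument order the shorthand `mult(cm,tm)` intends is not spelled out in this thread, and the sample values are made up.

```python
def multiply_matrices(m, n):
    """Multiply two PDF matrices [a, b, c, d, e, f] using the row-vector convention.

    Each list stands for the 3x3 matrix [[a, b, 0], [c, d, 0], [e, f, 1]];
    the result maps a point through m first and then through n.
    """
    return [
        m[0] * n[0] + m[1] * n[2],
        m[0] * n[1] + m[1] * n[3],
        m[2] * n[0] + m[3] * n[2],
        m[2] * n[1] + m[3] * n[3],
        m[4] * n[0] + m[5] * n[2] + n[4],
        m[4] * n[1] + m[5] * n[3] + n[5],
    ]


tm = [1, 0, 0, 1, 72, 400]  # text placed at (72, 400) in text space
cm = [2, 0, 0, 2, 10, 20]   # content scaled by 2 and shifted by (10, 20)
combined = multiply_matrices(tm, cm)
print(combined)  # [2, 0, 0, 2, 154, 820]
```

In this made-up case neither `tm[5]` (400) nor `cm[5]` (20) alone gives the position on the page. The combined matrix puts the text at y = 820.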

stefan6419846 · 2024-10-08T07:42:45Z

Does this have any effect on the text size? According to the quote in #2881 (comment), the text size is somehow affected by the multiplied matrix.
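
One way to reason about the size question, as a sketch only: a PDF matrix `[a, b, c, d, e, f]` maps the vertical unit vector to `(c, d)`, so a matrix that scales the content also scales the rendered glyph height. The helper `effective_font_size` is hypothetical and assumes that the `font_size` passed to the visitor is the unscaled value, which this thread does not confirm.

```python
import math


def effective_font_size(font_size, matrix):
    """Scale font_size by the vertical stretch of a PDF matrix [a, b, c, d, e, f]."""
    _a, _b, c, d, _e, _f = matrix
    # The unit vector (0, 1) is mapped to (c, d); its length is the vertical scale.
    return font_size * math.hypot(c, d)


unscaled = effective_font_size(10, [1, 0, 0, 1, 72, 400])  # pure translation
scaled = effective_font_size(10, [2, 0, 0, 2, 154, 820])   # uniform scale by 2
print(unscaled, scaled)  # 10.0 20.0
```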

Updated extract-text.md in docs

9b58fc1

stefan6419846 reviewed Feb 1, 2024

Update docs/user/extract-text.md

bcf5d33

Added by @stefan6419846. Co-authored-by: Stefan

stefan6419846 added the on-hold label (PR requests that need clarification before they can be merged. A comment must give details.) Feb 23, 2024
