Can I convert from PDF to text preserving the original layout? #290

gildofabregat · 2024-01-12T23:12:08Z

gildofabregat
Jan 12, 2024

Hi! I'm not very familiar with this library, but I'd like to know whether it can convert a two-column PDF document to text while preserving the original layout. Thanks for your help!

Answered by mara004

Jan 13, 2024


mara004 · 2024-01-13T00:16:01Z

mara004
Jan 13, 2024
Maintainer

pdfium provides range-based and bounds-based text extraction APIs. It can also report the position of individual chars on the page.
However, it does not expose APIs for layout analysis, such as detecting words, lines, or paragraphs/columns.

It should not mix up the two columns when extracting the text.
But it doesn't format the text output to visually reflect the original (in the sense of keeping the two columns side by side), i.e. it can't do what pymupdf's `python -m fitz gettext -mode layout` can. So the answer to your question is probably "no".
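
As a rough illustration of the bounds-based extraction and char position APIs mentioned above, here is a minimal sketch using pypdfium2's text page helpers (`get_text_bounded`, `count_chars`, `get_charbox`). The helper names `half_page_rects` and `extract_by_halves` are hypothetical, and splitting the page at its horizontal midpoint is an assumption that only suits simple two-column pages.

```python
def half_page_rects(width, height):
    """Return (left, bottom, right, top) rects for the left and right page halves.

    Assumption: the column gutter sits at the horizontal midpoint of the page.
    """
    mid = width / 2
    return (0, 0, mid, height), (mid, 0, width, height)


def extract_by_halves(path, page_index=0):
    """Extract the text of each page half separately, plus the first char's box."""
    import pypdfium2 as pdfium  # third-party; imported lazily

    pdf = pdfium.PdfDocument(path)
    page = pdf[page_index]
    textpage = page.get_textpage()
    width, height = page.get_size()

    # Bounds-based extraction: one call per rectangle, in PDF coordinates
    # (origin at the bottom-left corner of the page).
    halves = [
        textpage.get_text_bounded(left=l, bottom=b, right=r, top=t)
        for (l, b, r, t) in half_page_rects(width, height)
    ]

    # Per-char position: (left, bottom, right, top) of the first char, if any.
    first_box = textpage.get_charbox(0) if textpage.count_chars() > 0 else None
    return halves, first_box
```

This returns the two column texts one after the other; it does not place them side by side, which is the part pdfium leaves to the caller.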

2 replies

mara004 Jan 13, 2024
Maintainer

Note that it would theoretically be possible to implement this from scratch on top of the char info and rect-based extraction provided by pdfium, but it would be very difficult.

If anyone's interested in the subject, the following notebook might be worth a look, though:
https://github.com/pmbaumgartner/pdf-sketches/blob/main/page-segmentation.ipynb
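
To give a feel for what "from scratch on top of the char info" could mean, below is a deliberately naive sketch that places chars on a fixed-size text grid according to their boxes. It is not taken from the notebook or from any library: the function name `render_layout`, the `(char, left, bottom, right, top)` tuple shape, and the fixed cell size are all assumptions made for illustration. Real documents with proportional fonts, mixed sizes, or rotated text need far more than this, which is where the difficulty lies.

```python
def render_layout(chars, page_height, cell_w=6.0, cell_h=12.0):
    """Place chars on a text grid based on their boxes.

    chars: iterable of (char, left, bottom, right, top) in PDF coordinates
           (origin at the bottom-left corner, y growing upwards).
    cell_w, cell_h: assumed size of one grid cell in PDF units.
    If two chars map to the same cell, the later one overwrites the earlier.
    """
    rows = {}
    for ch, left, bottom, right, top in chars:
        row = int((page_height - top) // cell_h)  # flip y: first row is the page top
        col = int(left // cell_w)
        rows.setdefault(row, {})[col] = ch

    if not rows:
        return ""

    lines = []
    for r in range(max(rows) + 1):
        cells = rows.get(r, {})
        width = max(cells) + 1 if cells else 0
        # Fill gaps with spaces so columns stay visually aligned.
        lines.append("".join(cells.get(c, " ") for c in range(width)))
    return "\n".join(lines)
```

With a backend such as pdfium, the input tuples would come from per-char box queries, one call per char.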

gildofabregat Jan 13, 2024
Author

Thank you very much for your help.

mara004 · 2024-01-19T14:07:41Z

mara004
Jan 19, 2024
Maintainer

Incidentally, this just popped up in my GH feed:
https://github.com/py-pdf/pypdf/releases/tag/4.0.0
py-pdf/pypdf#2388

We finally have a layout-mode text extraction.

Note that pypdf seems to be permissively licensed.
(Disclaimer: I have never tried pypdf or this new feature, but it definitely seems interesting.)
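
For anyone who wants to try it, pypdf selects layout mode through the `extraction_mode` argument of `extract_text` (pypdf 4.0.0 or later). The following is a minimal sketch, not a tested recipe: the helper names `pages_to_layout_text` and `pdf_to_layout_text` are hypothetical, and joining pages with a form feed is just one possible convention.

```python
def pages_to_layout_text(pages, page_break="\f"):
    """Join the layout-mode text of each page.

    pages: any iterable of objects with an
           extract_text(extraction_mode=...) method, e.g. pypdf page objects.
    """
    return page_break.join(
        page.extract_text(extraction_mode="layout") for page in pages
    )


def pdf_to_layout_text(path):
    """Read a PDF with pypdf and return its layout-mode text."""
    from pypdf import PdfReader  # third-party; needs pypdf >= 4.0.0

    return pages_to_layout_text(PdfReader(path).pages)
```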

3 replies

gildofabregat Jan 21, 2024
Author

Thanks! I'll check that out. It looks promising.

gildofabregat Jan 23, 2024
Author

From what I've read, many libraries consider the option to preserve the original layout experimental. I have tested the PyMuPDF, pypdf, and pdfplumber libraries, and my impression (from a qualitative evaluation only) is that pdfplumber gives the most consistent output. In practice, my goal is to preserve a logical reading order after the conversion. Judging by the related open questions in many libraries, this seems to be a common request.

mara004 Jan 23, 2024
Maintainer

Interesting. I think pdfplumber is powered by (or rather, implements higher-level logic on top of) pdfminer.six, and both are MIT-licensed.

I don't have time to look into this, but I wonder if it would be possible to detach the layout logic from pdfplumber or pypdf and create an abstracted API into which you could plug in char info from an arbitrary PDF backend, including pdfium?

But that raises a new problem: performance. pypdfium2 uses ABI-mode FFI bindings, which is convenient for packaging, but ABI-mode calls are said to be slower than API-mode calls, so operating on a per-char basis might be unfortunate because it leads to many FFI calls.

That leads me to wonder whether we could add a cffi printer to ctypesgen, since cffi can be used in API mode, so that high-performance embedders could compile an extension module locally...
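
To make the idea of a backend-agnostic layout API more concrete, here is a small sketch of what such an interface might look like. Everything in it is hypothetical: `Char`, `CharSource`, `ListSource`, and `reading_order` are illustrative names, not part of pdfplumber, pypdf, or pypdfium2, and the two-column split at the page midpoint is a strong simplifying assumption. It also assumes that chars on the same line share the same `top` value, which real fonts do not guarantee.

```python
from dataclasses import dataclass
from typing import Iterable, List, Protocol, Tuple


@dataclass(frozen=True)
class Char:
    text: str
    left: float
    bottom: float
    right: float
    top: float


class CharSource(Protocol):
    """What a PDF backend would need to provide (hypothetical interface)."""

    def page_size(self) -> Tuple[float, float]: ...
    def chars(self) -> Iterable[Char]: ...


@dataclass
class ListSource:
    """Trivial in-memory backend, standing in for pdfium, pdfminer.six, etc."""

    width: float
    height: float
    items: List[Char]

    def page_size(self) -> Tuple[float, float]:
        return (self.width, self.height)

    def chars(self) -> Iterable[Char]:
        return self.items


def reading_order(source: CharSource) -> str:
    """Return chars in a naive two-column reading order.

    Order: left column top to bottom, then right column top to bottom,
    left to right within a line.
    """
    width, _ = source.page_size()
    mid = width / 2

    def key(c: Char):
        column = 0 if c.left < mid else 1
        return (column, -c.top, c.left)

    return "".join(c.text for c in sorted(source.chars(), key=key))
```

A pdfium-backed `CharSource` would have to fetch every char box through a separate FFI call, which is exactly where the per-char call overhead discussed above would show up.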
