Can I convert from PDF to text preserving the original layout? #290
Hi! I don't know much about this, but I'd like to know whether this library can convert a two-column document from PDF to text while preserving the original layout. Thanks for your help!
Replies: 2 comments 5 replies
Incidentally, this just popped up in my GH feed:
Note that pypdf appears to be permissively licensed.
pdfium provides range-based and bounds-based text extraction APIs. It can also tell you the position of individual characters on the page.
However, it does not expose APIs for layout analysis, such as detecting words, lines, paragraphs or columns.
That said, it should not mix up the two columns when extracting the text.
But it doesn't format the text output to visually reflect the original (in the sense of keeping the two columns side by side), i.e. it can't do what PyMuPDF's
python -m fitz gettext -mode layout
can. So the answer to your question is probably "no".
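If you still want a side-by-side result, one option is to take the per-character positions and place the characters on a text grid yourself. Below is a rough sketch of that idea, not something the library does for you. It assumes you have already collected `(char, x, y)` tuples in PDF coordinates (origin at the bottom left, units in points); how you obtain them is left out. The names `render_layout` and `place`, and the fixed `char_width` / `line_height` values, are hypothetical and only there for illustration. Real documents with proportional fonts would need tuning or a smarter mapping.

```python
from collections import defaultdict


def render_layout(chars, char_width=6.0, line_height=12.0):
    """Place characters on a fixed-pitch grid based on their page position.

    chars: iterable of (char, x, y) tuples, with x measured from the left
    and y measured from the bottom of the page (assumed input format).
    """
    rows = defaultdict(dict)
    for ch, x, y in chars:
        row = round(y / line_height)  # quantize the vertical position to a text row
        col = round(x / char_width)   # quantize the horizontal position to a column
        rows[row][col] = ch

    lines = []
    # The y axis points upwards, so the largest row index is the top line.
    for row in sorted(rows, reverse=True):
        cells = rows[row]
        line = [" "] * (max(cells) + 1)
        for col, ch in cells.items():
            line[col] = ch
        lines.append("".join(line).rstrip())
    return "\n".join(lines)


def place(word, x, y, char_width=6.0):
    """Hypothetical helper: fake per-character positions for a word (demo only)."""
    return [(c, x + i * char_width, y) for i, c in enumerate(word)]


if __name__ == "__main__":
    # Two short "columns": one starting at x=0, the other at x=60.
    demo = (
        place("Left", 0, 700) + place("Right", 60, 700)
        + place("col", 0, 688) + place("col", 60, 688)
    )
    print(render_layout(demo))
```

Note that this sketch only emits rows that contain characters, so vertical gaps between paragraphs are collapsed, and two characters that land on the same grid cell overwrite each other.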