Text extraction incomplete on Linux (resolution: missing system font) #288

maivan-hoa · 2024-01-06T05:37:49Z

The library's text extraction produces different results on Windows 11 and Ubuntu. I use the same version on both systems: pypdfium2==4.22.0
Here is the code that I use:

import pypdfium2

docs = []
path_file = '...'
pdf_reader = pypdfium2.PdfDocument(path_file, autoclose=True)

# Extract the full text of every page, in page order.
for page in pdf_reader:
    text_page = page.get_textpage()
    content = text_page.get_text_range()
    docs.append(content)

    # Release the pdfium handles explicitly once the page is processed.
    text_page.close()
    page.close()

pdf_reader.close()

Result on Win 11

Tín hiệu chuông được đánh giá là đạt khi có tín hiệu chuông ở hai hướng.

Result on Ubuntu

Tín hiệu chuông đ ợc đánh giá là đạt khi có tín hiệu chuông ở hai h ớng.

How can I fix this error?
Thanks so much for taking a look!

mara004 · 2024-01-06T15:11:48Z

Maintainer

Our get_text_range() wrapper just forwards the output from a pdfium API function and decodes using UTF-16LE¹ as indicated by pdfium docs.

Unfortunately I can't really comment on the behavior of external code.
Yet, here are some questions that might help debug the issue:

- Can you share the document and page index in question?
- Am I right to assume the Windows output is correct but the Ubuntu output is wrong (missing letter ư)?
- Can you verify that the raw (binary) output of FPDFText_GetText() also differs, to rule out a decoding or shell/viewer issue?
- Can you check whether get_text_bounded() behaves correctly?
- Can you update to the latest version (4.25.0 / 6219 AOTW) and retry?
- Which version of Ubuntu are you using?

¹ That is, Python's implementation of UTF-16LE.
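To rule out a decoding or shell/viewer issue, it can help to compare the two outputs by code point rather than by eye. The following is a minimal, standard-library-only sketch and is not part of pypdfium2: the helper names `decode_utf16le_buffer`, `codepoints`, and `diff_spans` are hypothetical, and the sample strings are taken from the report above under the assumption that the missing letter shows up as a plain space.

```python
import unicodedata
from difflib import SequenceMatcher


def decode_utf16le_buffer(raw):
    """Decode a UTF-16LE byte buffer and drop trailing NUL terminators.

    Hypothetical helper: mirrors the decoding step described above,
    but does not call pdfium itself.
    """
    return raw.decode("utf-16-le").rstrip("\x00")


def codepoints(text):
    """Describe each character as 'U+XXXX NAME' so invisible differences show up."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch, '<unnamed>')}" for ch in text]


def diff_spans(expected, actual):
    """Return (expected_part, actual_part) pairs for every span that differs."""
    matcher = SequenceMatcher(None, expected, actual, autojunk=False)
    return [
        (expected[i1:i2], actual[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


if __name__ == "__main__":
    # Sample words from the report, written with escapes to avoid
    # normalization ambiguity: "được" (Windows) vs. "đ ợc" (Ubuntu).
    windows_word = "\u0111\u01b0\u1ee3c"
    ubuntu_word = "\u0111 \u1ee3c"  # assumption: the gap is a plain space
    for expected_part, actual_part in diff_spans(windows_word, ubuntu_word):
        print("expected:", codepoints(expected_part))
        print("actual:  ", codepoints(actual_part))
```

Running the same comparison on the real extraction results from both machines would show which code point, if any, replaces ư on Ubuntu.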

9 replies

mara004 Jan 6, 2024
Maintainer

That said, I wonder if this issue could possibly be handled by the end user installing some system font package?

maivan-hoa Jan 6, 2024
Author

Yes, I think so too. I will try installing some more fonts on the system to test. Thank you!

mara004 Jan 6, 2024
Maintainer

"Have you tried it on the Windows operating system?"

No, I don't have access to Windows (except CI), but according to your report it works there, right?
Maybe Windows ships fonts that Ubuntu doesn't by default, or maybe pdfium's font search behaves differently on Linux. I don't know.

mara004 Jan 6, 2024
Maintainer

I believe the culprit might be proprietary Windows fonts such as Arial and Times New Roman, which distros can't ship for licensing reasons. The PDF viewer then substitutes some other available font, which might not have the special characters.
Presumably it would work if you copied the Windows fonts over to Ubuntu.

See the attached inspection screenshot from Okular

Answer selected by maivan-hoa

maivan-hoa Jan 6, 2024
Author

I installed the font into my Ubuntu system and it worked. Thank you for your help!
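For anyone hitting the same problem, a quick way to see which installed fonts cover a given character is to ask fontconfig. The sketch below is only an illustration under stated assumptions: it relies on the `fc-list` command-line tool being available, the helper names `fc_list_command` and `fonts_covering` are hypothetical, and it assumes that fontconfig's view of the system fonts is a reasonable proxy for what pdfium finds, which this thread does not confirm.

```python
import shutil
import subprocess


def fc_list_command(codepoint):
    """Build an fc-list call that lists font families covering one code point."""
    # fontconfig patterns take the charset as a hexadecimal code point.
    return ["fc-list", f":charset={codepoint:x}", "family"]


def fonts_covering(codepoint):
    """Return sorted font family names covering the code point.

    Returns None when fc-list is not installed (e.g. on Windows).
    """
    if shutil.which("fc-list") is None:
        return None
    result = subprocess.run(
        fc_list_command(codepoint),
        capture_output=True,
        text=True,
        check=False,
    )
    families = {line.strip() for line in result.stdout.splitlines() if line.strip()}
    return sorted(families)


if __name__ == "__main__":
    # U+01B0 is the letter ư that went missing in the Ubuntu output.
    families = fonts_covering(0x01B0)
    if families is None:
        print("fc-list not found; cannot query fontconfig")
    else:
        print(f"{len(families)} font families cover U+01B0")
        for name in families:
            print(" ", name)
```

Which font the document actually requests has to be read from the PDF itself, for example with a viewer's font inspection as in the Okular screenshot mentioned above.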
