Text extraction incomplete on Linux (resolution: missing system font) #288
The library's text extraction produces different results on Windows 11 and Ubuntu. I use the same version on both, pypdfium2==4.22.0.

    import pypdfium2

    docs = []
    path_file = '...'
    pdf_reader = pypdfium2.PdfDocument(path_file, autoclose=True)
    # Extract the full text of every page, closing each object after use
    for page_number, page in enumerate(pdf_reader, start=1):
        text_page = page.get_textpage()
        content = text_page.get_text_range()
        docs.append(content)
        text_page.close()
        page.close()
    pdf_reader.close()

How can I fix this error?
Answered by
mara004
Jan 6, 2024
Unfortunately I can't really comment on the behavior of external code.
I believe the culprit might be proprietary Windows fonts such as Arial and Times New Roman, which distros can't ship for licensing reasons. The PDF viewer then substitutes some other available font, which might not have the special chars.
Presumably it would work if you copy the Windows fonts over to Ubuntu.
See the attached inspection screenshot from Okular
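
To check whether the fonts in question are actually present on the Linux machine, something like the following might help. It is only a rough sketch: the directories in FONT_DIRS are common Linux font locations that I'm assuming here (not necessarily the ones the PDF library searches), and the file-name fragments in WANTED are hypothetical guesses, so a match on "times" could also hit an unrelated font.

```python
from pathlib import Path

# Assumed font locations on a typical Linux system; adjust as needed.
FONT_DIRS = [
    Path("/usr/share/fonts"),
    Path("/usr/local/share/fonts"),
    Path.home() / ".fonts",
    Path.home() / ".local/share/fonts",
]

# Hypothetical file-name fragments for the fonts mentioned above.
WANTED = {"Arial": "arial", "Times New Roman": "times"}

FONT_SUFFIXES = {".ttf", ".otf", ".ttc"}


def find_fonts(font_dirs, wanted):
    """Map each font name to the font files whose name contains its fragment."""
    found = {name: [] for name in wanted}
    for directory in font_dirs:
        if not directory.is_dir():
            continue  # skip locations that do not exist on this machine
        for path in directory.rglob("*"):
            if not path.is_file() or path.suffix.lower() not in FONT_SUFFIXES:
                continue
            stem = path.stem.lower()
            for name, fragment in wanted.items():
                if fragment in stem:
                    found[name].append(path)
    return found


if __name__ == "__main__":
    for name, paths in find_fonts(FONT_DIRS, WANTED).items():
        status = "found" if paths else "not found"
        print(f"{name}: {status}")
```

If a font is reported as not found, copying the corresponding files from the Windows machine into one of those directories and re-running the extraction would be the next thing to try.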