specifying dpi/duplicate text #48

Open
burrelvannjr opened this issue Jul 29, 2016 · 1 comment
Comments

@burrelvannjr

Hi Virantha,

I'm in the process of OCRing newspaper article PDFs, but the module seems to be doubling the text of the document.

For example, if the document reads:
``XXXXXX
YYYYY
ZZZZZZZZZ"

the output of pypdfocr reads:

``XXXXXX
XXXXXX
YYYYY
YYYYY
ZZZZZZZZZ
ZZZZZZZZZ"

Any idea how to fix this problem? Is there a way to increase/decrease the resolution that pypdfocr (Tesseract) employs?
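
On the resolution question, a first step might be to check what resolution the page images are embedded at. The following is a minimal sketch, assuming poppler's `pdfimages` is installed and recent enough that `pdfimages -list` prints `x-ppi` and `y-ppi` columns. The helper names `parse_ppi` and `embedded_image_ppi` are made up for illustration and are not part of pypdfocr; the sketch only reports resolution and does not change what pypdfocr or Tesseract uses.

```python
import subprocess  # capture_output below needs Python 3.7+


def parse_ppi(listing):
    """Return [(x_ppi, y_ppi), ...] parsed from `pdfimages -list` output.

    Columns are located by header name, so an older listing format
    without x-ppi/y-ppi columns simply yields an empty list.
    """
    lines = [ln for ln in listing.splitlines() if ln.strip()]
    if len(lines) < 2:
        return []
    header = lines[0].split()
    if "x-ppi" not in header or "y-ppi" not in header:
        return []
    xi, yi = header.index("x-ppi"), header.index("y-ppi")
    result = []
    # lines[1] is the dashed separator; image rows start at lines[2]
    for row in lines[2:]:
        cols = row.split()
        if len(cols) != len(header):
            continue  # skip rows that do not match the header layout
        try:
            result.append((int(cols[xi]), int(cols[yi])))
        except ValueError:
            continue
    return result


def embedded_image_ppi(pdf_path):
    """Run `pdfimages -list` (assumed to be on PATH) and parse its output."""
    out = subprocess.run(
        ["pdfimages", "-list", pdf_path],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_ppi(out)
```

An empty result would mean the listing contained no image rows, which may be the situation behind a "cannot determine dpi" style warning.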

@gregorskii

I am seeing this as well.

Starting conversion of ./Testing Double Text.pdf
WARNING: Empty pdf, cannot determine dpi using pdfimages
   **** Warning: considering '0000000000 XXXXX n' as a free entry.

   **** This file had errors that were repaired or ignored.
   **** The file was produced by:
   **** >>>> Mac OS X 10.12.6 Quartz PDFContext <<<<
   **** Please notify the author of the software that produced this
   **** file that it does not conform to Adobe's published PDF
   **** specification.
pdf2txt.py Testing\ Double\ Text_ocr.pdf
Testing Double Text
Testing Double Text


Testing Double Text
Testing Double Text


Testing Double Text
Testing Double Text

It looks to me as though pypdfocr may be adding an OCR layer to a file that already has a text layer, which would double the text?
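
One way to test that guess is to extract text from the original PDF before running OCR, and to measure how often consecutive lines repeat in the OCR output. This is a rough sketch, assuming poppler's `pdftotext` is on the PATH; the function names `extract_text`, `has_text_layer`, and `doubled_line_ratio` are hypothetical helpers, not pypdfocr functions.

```python
import subprocess  # capture_output below needs Python 3.7+


def extract_text(pdf_path):
    """Dump the PDF's text to stdout with `pdftotext <file> -` and return it."""
    return subprocess.run(
        ["pdftotext", pdf_path, "-"],
        capture_output=True, text=True, check=True,
    ).stdout


def has_text_layer(text):
    """True if the extracted text contains anything besides whitespace."""
    return bool(text.strip())


def doubled_line_ratio(text):
    """Fraction of adjacent non-blank line pairs that are identical."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if len(lines) < 2:
        return 0.0
    repeats = sum(1 for a, b in zip(lines, lines[1:]) if a == b)
    return repeats / (len(lines) - 1)
```

If `has_text_layer(extract_text(original))` is already true for the input file, skipping OCR for that file would be one way to avoid a second text layer. A high `doubled_line_ratio` on the `_ocr.pdf` output would be consistent with the doubling described in this thread, though it does not by itself prove the cause.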
