specifying dpi/duplicate text #48

burrelvannjr · 2016-07-29T23:14:40Z

Hi Virantha,

I'm in the process of OCRing newspaper article PDFs, but the module seems to be doubling the text of the document.

For example, if the document reads:
"XXXXXX
YYYYY
ZZZZZZZZZ"

The output of pypdfocr will read:

"XXXXXX
XXXXXX
YYYYY
YYYYY
ZZZZZZZZZ
ZZZZZZZZZ"

Any idea how to fix this problem? Is there a way to increase/decrease the resolution that pypdfocr (Tesseract) employs?

gregorskii · 2018-01-30T23:49:22Z

I am seeing this as well.

Starting conversion of ./Testing Double Text.pdf
WARNING: Empty pdf, cannot determine dpi using pdfimages
   **** Warning: considering '0000000000 XXXXX n' as a free entry.

   **** This file had errors that were repaired or ignored.
   **** The file was produced by:
   **** >>>> Mac OS X 10.12.6 Quartz PDFContext <<<<
   **** Please notify the author of the software that produced this
   **** file that it does not conform to Adobe's published PDF
   **** specification.

pdf2txt.py Testing\ Double\ Text_ocr.pdf
Testing Double Text
Testing Double Text


Testing Double Text
Testing Double Text


Testing Double Text
Testing Double Text

It looks to me like it may be adding an OCR layer to a file that already has one, which would double the text?
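One way to confirm the symptom on a larger batch is to check the extracted text for identical adjacent lines. The following is only a rough sketch that works on plain text (for example, whatever pdf2txt.py prints). The function names `doubled_line_ratio` and `collapse_doubled_lines` are made up for illustration and are not part of pypdfocr, and the cleanup is a heuristic that assumes each duplicate sits directly after its original.

```python
def doubled_line_ratio(text):
    """Fraction of non-blank lines that sit in identical adjacent pairs."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if not lines:
        return 0.0
    paired = 0
    i = 0
    while i < len(lines) - 1:
        if lines[i] == lines[i + 1]:
            paired += 2   # count both members of the pair
            i += 2        # skip past the pair
        else:
            i += 1
    return paired / len(lines)


def collapse_doubled_lines(text):
    """Drop the second line of each identical adjacent pair (heuristic).

    Blank lines are always kept and break a pair, so repeated
    headings on separate pages are not merged together.
    """
    kept = []
    previous = None
    just_dropped = False
    for line in text.splitlines():
        stripped = line.strip()
        if stripped and stripped == previous and not just_dropped:
            just_dropped = True   # this line is the duplicate half of a pair
            continue
        kept.append(line)
        previous = stripped
        just_dropped = False
    return "\n".join(kept)


if __name__ == "__main__":
    sample = "XXXXXX\nXXXXXX\nYYYYY\nYYYYY\nZZZZZZZZZ\nZZZZZZZZZ"
    print(doubled_line_ratio(sample))
    print(collapse_doubled_lines(sample))
```

A ratio close to 1.0 would suggest that nearly every line was written twice, which fits the idea of a second text layer being added on top of an existing one. This only cleans up the extracted text; it does not remove the extra layer from the PDF itself.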
