Is always generating a file with 306 bytes #60

caitifbrito · 2017-02-15T02:50:53Z

Hello,
I'm testing this program to convert medical books in Brazilian Portuguese. The books run to around 500-700 pages at good quality. I installed everything needed to run pypdfocr (a box dedicated to this :), tesseract 3.03 and some of the other requirements). When I run it [1], everything looks fine, but the result is a file with the suffix _ocr.pdf that is only 306 bytes. Its content [2] shows nothing useful.

What could be wrong?

[1] Generating OCR of MyBookInPortuguese.pdf (227 MB)

root@vagrant-ubuntu-trusty-64:/vagrant# pypdfocr -v -l por MyBookInPortuguese.pdf

Starting conversion of MyBookInPortuguese.pdf
Running pdfimages to figure out DPI...
Using 300 DPI
Detected color
gs -q -dNOPAUSE -sDEVICE=jpeg -dJPEGQ=75 -r300 -sOutputFile="MyBookInPortuguese.pdf - 9ª Ed [ptbr+foto]_%d.jpg" "MyBookInPortuguese.pdf" -c quit
Skipping preprocess step
Checking tesseract version
tesseract -v
Created OCR'ed pdf as MyBookInPortuguese.pdf - 9ª Ed [ptbr+foto]_ocr.pdf
Cleaning up []
Cleaning up []
Cleaning up []
Cleaning up []
Cleaning up []
Completed conversion successfully to MyBookInPortuguese.pdf_ocr.pdf

[2] MyBookInPortuguese.pdf_ocr.pdf (306 bytes)

%PDF-1.3
1 0 obj
<<
/Kids [ ]
/Type /Pages
/Count 0
>>
endobj
2 0 obj
<<
/Producer (PyPDF2)
>>
endobj
3 0 obj
<<
/Type /Catalog
/Pages 1 0 R
>>
endobj
xref
0 4
0000000000 65535 f 
0000000009 00000 n 
0000000062 00000 n 
0000000102 00000 n 
trailer
<<
/Size 4
/Root 3 0 R
/Info 2 0 R
>>
startxref
151
%%EOF
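
For anyone reading the dump above: the page tree declares `/Count 0` with an empty `/Kids` array, so the output is a valid PDF with no pages in it. A minimal sketch of checking for this automatically, assuming the file is small and uncompressed like this one (the helper name `declared_page_count` is made up for illustration and is not part of pypdfocr or PyPDF2):

```python
import re


def declared_page_count(pdf_bytes: bytes):
    """Return the first /Count value found in the raw PDF bytes, or None.

    Assumption: the page tree is stored as plain text, as in the 306-byte
    file above. This is a quick sanity check, not a general PDF parser.
    """
    match = re.search(rb"/Count\s+(\d+)", pdf_bytes)
    return int(match.group(1)) if match else None


# The page-tree object copied from the dump above.
sample = b"1 0 obj\n<<\n/Kids [ ]\n/Type /Pages\n/Count 0\n>>\nendobj\n"
print(declared_page_count(sample))  # prints 0
```

A result of 0 means no page was ever added to the output, which points at a step before the final merge rather than at the PDF writer itself.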


virantha · 2017-02-15T15:00:28Z

Hi. Can't do anything without a test case. Please upload a pdf so I can try to reproduce.

DiegoAscanio · 2017-03-23T00:59:43Z

Did you install tesseract-data-por (the data files for the Portuguese language) on your distro?

At least on Arch Linux, the tesseract package supports only English by default. To support other languages, you need to install the corresponding tesseract-data- package for your distro.

For Portuguese language support on Arch Linux, you'll need to run the following command:

#pacman -S tesseract-data-por

Sorry for my poor English.
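
To check whether the Portuguese data is actually visible to tesseract, independent of the distro's package manager, something like the following could help. It is only a sketch: it assumes the `tesseract` binary is on the PATH and supports the `--list-langs` option, and the function names `parse_langs` and `has_language` are hypothetical helpers, not part of pypdfocr.

```python
import subprocess


def parse_langs(output: str) -> set:
    """Extract language codes from `tesseract --list-langs` output.

    The listing starts with a header line ending in ':' (for example
    'List of available languages (2):'), which is skipped here.
    """
    langs = set()
    for line in output.splitlines():
        line = line.strip()
        if line and not line.endswith(":"):
            langs.add(line)
    return langs


def has_language(lang: str = "por") -> bool:
    """Return True if tesseract reports `lang` among its installed languages."""
    try:
        result = subprocess.run(
            ["tesseract", "--list-langs"],
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            universal_newlines=True,
        )
    except FileNotFoundError:
        # tesseract itself is not installed or not on the PATH.
        return False
    # Depending on the release, the list may arrive on stdout or stderr,
    # so both streams are scanned.
    return lang in parse_langs(result.stdout + "\n" + result.stderr)
```

Calling `has_language("por")` before running `pypdfocr -l por` would show whether the language pack is the missing piece.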
