Is always generating a file with 306 bytes #60

Open
caitifbrito opened this issue Feb 15, 2017 · 2 comments

Comments

@caitifbrito

Hello,
I'm testing this program to convert medical books in Brazilian Portuguese. The books are around 500-700 pages each, at good quality. I installed everything needed to run pypdfocr (a box dedicated to this :), tesseract 3.03 and the other requirements). When I run it [1], everything looks fine, but the output is a file with the suffix _ocr.pdf that is only 306 bytes. Its content [2] shows nothing useful.

What could be wrong?

1 - Generating OCR of MyBookInPortuguese.pdf - 227 MegaBytes

root@vagrant-ubuntu-trusty-64:/vagrant# pypdfocr -v -l por MyBookInPortuguese.pdf

Starting conversion of MyBookInPortuguese.pdf
Running pdfimages to figure out DPI...
Using 300 DPI
Detected color
gs -q -dNOPAUSE -sDEVICE=jpeg -dJPEGQ=75 -r300 -sOutputFile="MyBookInPortuguese.pdf - 9ª Ed [ptbr+foto]_%d.jpg" "MyBookInPortuguese.pdf" -c quit
Skipping preprocess step
Checking tesseract version
tesseract -v
Created OCR'ed pdf as MyBookInPortuguese.pdf - 9ª Ed [ptbr+foto]_ocr.pdf
Cleaning up []
Cleaning up []
Cleaning up []
Cleaning up []
Cleaning up []
Completed conversion successfully to MyBookInPortuguese.pdf_ocr.pdf
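
The thread does not confirm a cause, but two details in the log above are worth ruling out: the gs output name contains `[ptbr+foto]`, and every `Cleaning up` line prints an empty list. If the tool collects the page images with a glob pattern built from the file name (an assumption; the pypdfocr source is not quoted in this thread), the square brackets would be read as a character class and the pattern would match nothing. The sketch below reproduces that effect with Python 3's standard `glob` module. The function name `find_page_images` and the sample file name are hypothetical.

```python
import glob
import os
import tempfile


def find_page_images(directory, basename, escape=True):
    """Return files named '<basename>_<n>.jpg' in directory (hypothetical helper)."""
    # glob.escape() neutralises '[', '*' and '?' so they match literally.
    stem = glob.escape(basename) if escape else basename
    pattern = os.path.join(glob.escape(directory), stem + "_*.jpg")
    return sorted(glob.glob(pattern))


def demo():
    """Create two fake page images and count matches with and without escaping."""
    with tempfile.TemporaryDirectory() as tmp:
        name = "Book - 9 Ed [ptbr+foto]"  # illustrative name with brackets
        for page in (1, 2):
            open(os.path.join(tmp, "%s_%d.jpg" % (name, page)), "w").close()
        unescaped = find_page_images(tmp, name, escape=False)
        escaped = find_page_images(tmp, name, escape=True)
        return len(unescaped), len(escaped)
```

If this is what happens, renaming the input to a name without `[`, `]`, `*` or `?` before running the conversion would be a quick way to test the idea.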

2 - MyBookInPortuguese.pdf_ocr.pdf - 306 bytes

%PDF-1.3
1 0 obj
<<
/Kids [ ]
/Type /Pages
/Count 0
>>
endobj
2 0 obj
<<
/Producer (PyPDF2)
>>
endobj
3 0 obj
<<
/Type /Catalog
/Pages 1 0 R
>>
endobj
xref
0 4
0000000000 65535 f 
0000000009 00000 n 
0000000062 00000 n 
0000000102 00000 n 
trailer
<<
/Size 4
/Root 3 0 R
/Info 2 0 R
>>
startxref
151
%%EOF
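
The dump above is a structurally valid PDF whose page tree is empty (`/Kids [ ]`, `/Count 0`), so the tool wrote a document with no pages rather than a corrupt one. A minimal sketch of how to detect that case from the raw bytes follows. It is a regex heuristic that only suits small, uncompressed files like this one; a real PDF parser would be more robust. The name `declared_page_count` and the `SAMPLE` constant (the first object of the dump) are illustrative.

```python
import re

# First object of the 306-byte file shown in the issue.
SAMPLE = b"%PDF-1.3\n1 0 obj\n<<\n/Kids [ ]\n/Type /Pages\n/Count 0\n>>\nendobj\n"


def declared_page_count(pdf_bytes):
    """Return /Count from the first flat /Pages dictionary, or None if absent."""
    # Non-greedy match of '<< ... >>'; nested dictionaries are not handled.
    for match in re.finditer(rb"<<(.*?)>>", pdf_bytes, re.DOTALL):
        body = match.group(1)
        if re.search(rb"/Type\s*/Pages\b", body):
            count = re.search(rb"/Count\s+(\d+)", body)
            if count:
                return int(count.group(1))
    return None


def is_empty_pdf(pdf_bytes):
    """True when the page tree declares zero pages."""
    return declared_page_count(pdf_bytes) == 0
```

A check like this after the conversion would flag the empty output even though the log reports success.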

@virantha
Owner

Hi. Can't do anything without a test case. Please upload a pdf so I can try to reproduce.

@DiegoAscanio

DiegoAscanio commented Mar 23, 2017

Did you install tesseract-data-por (the data files for the Portuguese language) on your distro?

At least on Arch Linux, the tesseract package supports only English by default. If you need to support other languages, you need to install the tesseract-data- package for your distro.

For Portuguese language support on Arch Linux you'll need to run the following command:

#pacman -S tesseract-data-por

Sorry for my poor English.
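
Whether the `por` data is installed can be checked before running the conversion. The sketch below parses the output of `tesseract --list-langs`. It assumes that output is a header line starting with "List of available languages" followed by one language code per line, and it merges stderr into stdout because some tesseract versions print the list on stderr. The function names are illustrative.

```python
import subprocess


def parse_list_langs(output):
    """Extract language codes from the text printed by `tesseract --list-langs`."""
    langs = []
    for line in output.splitlines():
        line = line.strip()
        # Skip blank lines and the header line.
        if not line or line.lower().startswith("list of available languages"):
            continue
        langs.append(line)
    return langs


def installed_languages(tesseract="tesseract"):
    """Run tesseract and return the language codes it reports."""
    proc = subprocess.run(
        [tesseract, "--list-langs"],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # older versions write the list to stderr
        universal_newlines=True,
    )
    return parse_list_langs(proc.stdout)


def has_language(code, langs):
    """True if the given language code (e.g. 'por') is in the list."""
    return code in langs
```

Calling `has_language("por", installed_languages())` on the box that runs pypdfocr shows whether `-l por` has data to work with.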
