bug(non-ascii): Cyrillic symbols in generated PDF #27

Open
Kristinita opened this issue Aug 19, 2019 · 0 comments

1. Summary

I can't get Cyrillic symbols in the OCR layer of PDF files generated by k2pdfopt.

I can't reproduce the issue for PDF documents with Latin symbols.

2. Data

  • KiraIdeal.jpg: an image file with the 2 Russian words "Кира Идеал!"

  • KiraIdeal.pdf: a PDF without an OCR layer, converted from the .jpg above

3. Steps to reproduce

  1. I download and install Tesseract (see section 7 of this issue for the version).
  2. I download k2pdfopt v2.51a for 64-bit Windows.
  3. I add the folder containing k2pdfopt.exe to the user PATH environment variable.
  4. I set the TESSDATA_PREFIX environment variable to the Tesseract tessdata folder (see section 7).
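
The setup above can be sanity-checked before running k2pdfopt. This is a minimal sketch, assuming (as in section 7 of this issue) that TESSDATA_PREFIX points directly at the folder holding the .traineddata files; some Tesseract versions expect the parent folder instead. The helper name find_traineddata is hypothetical and not part of k2pdfopt or Tesseract.

```python
import os
from pathlib import Path


def find_traineddata(lang, env=None):
    """Return the path to <lang>.traineddata under TESSDATA_PREFIX, or None.

    Assumption: TESSDATA_PREFIX is the tessdata folder itself, matching the
    k2pdfopt output quoted in this issue.
    """
    env = os.environ if env is None else env
    prefix = env.get("TESSDATA_PREFIX")
    if not prefix:
        return None  # variable missing or empty
    candidate = Path(prefix) / f"{lang}.traineddata"
    return candidate if candidate.is_file() else None
```

Calling find_traineddata("rus") returns None when the variable is unset or the Russian data file is missing, which would rule out the simplest setup mistakes before looking at the PDF text layer.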

I run the command:

k2pdfopt -mode copy -ocr -ocrlang rus KiraIdeal.pdf

4. Actual behavior

Text copied and pasted from KiraIdeal_k2_opt.pdf:

#$%& +02&3!
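
The pasted text has the same length as the expected text, and the repeated letter "а" comes out as "&" both times. The sketch below checks whether the garbled string is a consistent one-to-one substitution of the expected one. If it is, that hints (but does not prove) that the recognition itself worked and only the character mapping of the PDF text layer is wrong. The function name substitution_table is hypothetical.

```python
expected = "Кира Идеал!"
actual = "#$%& +02&3!"


def substitution_table(expected, actual):
    """Return {expected_char: actual_char} if the mapping is one-to-one, else None."""
    if len(expected) != len(actual):
        return None
    table = {}
    for e, a in zip(expected, actual):
        # The same source character must always map to the same output character.
        if table.setdefault(e, a) != a:
            return None
    # Two different source characters must not collapse into one output character.
    if len(set(table.values())) != len(table):
        return None
    return table


table = substitution_table(expected, actual)
print(table)
```

For the two strings in this issue the function returns a table rather than None, so every Cyrillic letter is replaced by one fixed ASCII character.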

5. Expected behavior

The same text that I get from the tesseract command:

tesseract KiraIdeal.jpg stdout -l rus
  • output:
Кира Идеал!
♀

or from the tesseract command that generates a PDF:

tesseract KiraIdeal.jpg KiraIdealTesseract -l rus pdf

Text copied and pasted from KiraIdealTesseract.pdf:

Кира Идеал!

or from the k2pdfopt command with -ocrout:

k2pdfopt -mode copy -ocr -ocrlang rus -ocrout KiraIdeal KiraIdeal.pdf
  • KiraIdeal.txt:
Кира Идеал!
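
The expected behavior can be expressed as a small check on the copied text. This is a sketch under two assumptions: that the "♀" shown after the Tesseract output is the console rendering of a trailing form feed character, and that "correct" means every letter in the text is Cyrillic. The names normalize and is_cyrillic_text are hypothetical.

```python
import unicodedata


def normalize(text):
    """Strip surrounding whitespace, including a trailing form feed (\\x0c)."""
    return text.strip()


def is_cyrillic_text(text):
    """True if the text contains letters and every letter is Cyrillic."""
    letters = [ch for ch in text if ch.isalpha()]
    return bool(letters) and all(
        unicodedata.name(ch, "").startswith("CYRILLIC") for ch in letters
    )


tesseract_stdout = "Кира Идеал!\n\x0c"
k2pdfopt_pdf_text = "#$%& +02&3!"

print(is_cyrillic_text(normalize(tesseract_stdout)))
print(is_cyrillic_text(normalize(k2pdfopt_pdf_text)))
```

The same check passes for KiraIdeal.txt written by -ocrout, so only the text layer embedded in KiraIdeal_k2_opt.pdf fails it.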

6. What did not help

I can't find how to solve this problem on the official site pages:

  1. OCR
  2. command-line options

7. Environment

  • Windows 10 Enterprise LTSB 64-bit EN
  • tesseract v5.0.0-alpha.20190708
  • k2pdfopt version and Tesseract data, as reported by k2pdfopt -ocrlang ?:
D:\SashaDebugging\k2pdfoptOCR>k2pdfopt -ocrlang ?
k2pdfopt v2.51a (w/MuPDF,DjVuLibre,OCR) © 2019, GPLv3, http://willus.com
    Compiled Jan  4 2019 with Gnu C (Mingw64) v7.3.0 for Win64 on x64.

TESSDATA_PREFIX environment variable:  D:\SashaPrograms\Tesseract-OCR\tessdata
Tesseract data folder:  D:\SashaPrograms\Tesseract-OCR\tessdata

Contents of D:\SashaPrograms\Tesseract-OCR\tessdata:
File name                          Size         Date      Type*
---------------------------------------------------------------------
eng.traineddata                     3.92 MB   8-JUL-2019  [LSTM]
osd.traineddata                    10.07 MB   8-JUL-2019  (not valid)
rus.traineddata [Def]               3.68 MB  17-AUG-2019  [LSTM]
* - LSTM = "Long Short-Term Memory" training data.
    LSTM is the latest, most accurate OCR method used by Tesseract v4.x.
    TESS = Tesseract v3.x compatible (can be used by v4.x).

Thanks.
