bug(non-ascii): Cyrillic symbols in generated PDF #27

Kristinita · 2019-08-19T09:30:28Z

1. Summary

I can't get Cyrillic symbols in the OCR layer of PDF files generated by k2pdfopt.

I can't reproduce the issue for PDF documents with Latin symbols.

2. Data

KiraIdeal.jpg: image file with 2 Russian words, Кира Идеал!
KiraIdeal.pdf: PDF without OCR, which I converted from the .jpg above

3. Steps to reproduce

I downloaded and installed Tesseract (see section 7 of this issue) → I downloaded the v2.51a version of k2pdfopt for 64-bit Windows from here → I added the path to k2pdfopt.exe to the user PATH environment variable → I set the TESSDATA_PREFIX environment variable, as described here.
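
The TESSDATA_PREFIX step above can be sanity-checked before running k2pdfopt. This is a minimal Python sketch, assuming that TESSDATA_PREFIX points directly at the folder holding the .traineddata files (as in section 7 of this issue); the function name tessdata_has_language is hypothetical and not part of k2pdfopt or Tesseract.

```python
import os
from pathlib import Path


def tessdata_has_language(prefix, lang):
    """Return True if <prefix>/<lang>.traineddata exists as a regular file.

    Assumption: `prefix` is the tessdata folder itself, not its parent.
    """
    if not prefix:
        return False
    return (Path(prefix) / f"{lang}.traineddata").is_file()


if __name__ == "__main__":
    # Read the variable the same way any child process would see it.
    prefix = os.environ.get("TESSDATA_PREFIX")
    print("TESSDATA_PREFIX =", prefix)
    print("rus.traineddata found:", tessdata_has_language(prefix, "rus"))
```

If this reports that rus.traineddata is missing, the problem would be in the setup rather than in the generated PDF.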

I ran the command:

k2pdfopt -mode copy -ocr -ocrlang rus KiraIdeal.pdf

4. Actual behavior

KiraIdeal_k2_opt.pdf:

Text copied and pasted from KiraIdeal_k2_opt.pdf:

#$%& +02&3!
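
Comparing the pasted text with the text in the image character by character suggests the corruption is not random: each Cyrillic letter seems to map to one fixed ASCII character (both occurrences of а become &). The Python sketch below checks this under the assumption that the two strings align position by position; the function name substitution_map is hypothetical.

```python
def substitution_map(expected, actual):
    """Build a char-to-char mapping from expected to actual text.

    Returns None if the lengths differ or if one expected character
    maps to two different actual characters.
    """
    if len(expected) != len(actual):
        return None
    mapping = {}
    for exp_ch, act_ch in zip(expected, actual):
        # setdefault stores the first pairing and returns it afterwards,
        # so any later conflicting pairing is detected here.
        if mapping.setdefault(exp_ch, act_ch) != act_ch:
            return None
    return mapping


expected = "Кира Идеал!"   # text in KiraIdeal.jpg
actual = "#$%& +02&3!"     # text pasted from KiraIdeal_k2_opt.pdf

mapping = substitution_map(expected, actual)
consistent = mapping is not None
# One-to-one means no two source characters share a target character.
one_to_one = consistent and len(set(mapping.values())) == len(mapping)
print("consistent:", consistent, "one-to-one:", one_to_one)
```

A consistent one-to-one mapping like this might point to a character-mapping problem in the PDF text layer rather than to the recognition step itself, but that is only a guess.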

5. Expected behavior

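I expect the OCR layer of KiraIdeal_k2_opt.pdf to contain the same text that the following commands produce.

The tesseract command:

tesseract KiraIdeal.jpg stdout -l rus

outputs: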

Кира Идеал!
♀

The tesseract command that generates a PDF:

tesseract KiraIdeal.jpg KiraIdealTesseract -l rus pdf

Text copied and pasted from KiraIdealTesseract.pdf:

Кира Идеал!

The k2pdfopt command with the -ocrout option:

k2pdfopt -mode copy -ocr -ocrlang rus -ocrout KiraIdeal KiraIdeal.pdf

KiraIdeal.txt:

Кира Идеал!
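
To compare the outputs above without reading them by eye, text pasted from each PDF can be checked for Cyrillic letters. This is a minimal Python sketch using only the standard library; the function name has_cyrillic is hypothetical, and it relies on Unicode character names containing the word CYRILLIC.

```python
import unicodedata


def has_cyrillic(text):
    """Return True if any character's Unicode name mentions CYRILLIC."""
    # unicodedata.name() takes a default, so unnamed characters
    # (for example control characters) do not raise ValueError.
    return any("CYRILLIC" in unicodedata.name(ch, "") for ch in text)


samples = {
    "KiraIdealTesseract.pdf": "Кира Идеал!",
    "KiraIdeal.txt": "Кира Идеал!",
    "KiraIdeal_k2_opt.pdf": "#$%& +02&3!",
}

for source, text in samples.items():
    print(source, "->", "Cyrillic" if has_cyrillic(text) else "no Cyrillic")
```

Only the text pasted from KiraIdeal_k2_opt.pdf fails the check, which matches the behavior described in this issue.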

6. Not helped

I couldn't find how to solve this problem on the official site pages.

7. Environment

Windows 10 Enterprise LTSB 64-bit EN
tesseract v5.0.0-alpha.20190708

D:\SashaDebugging\k2pdfoptOCR>k2pdfopt -ocrlang ?
k2pdfopt v2.51a (w/MuPDF,DjVuLibre,OCR) © 2019, GPLv3, http://willus.com
    Compiled Jan  4 2019 with Gnu C (Mingw64) v7.3.0 for Win64 on x64.

TESSDATA_PREFIX environment variable:  D:\SashaPrograms\Tesseract-OCR\tessdata
Tesseract data folder:  D:\SashaPrograms\Tesseract-OCR\tessdata

Contents of D:\SashaPrograms\Tesseract-OCR\tessdata:
File name                          Size         Date      Type*
---------------------------------------------------------------------
eng.traineddata                     3.92 MB   8-JUL-2019  [LSTM]
osd.traineddata                    10.07 MB   8-JUL-2019  (not valid)
rus.traineddata [Def]               3.68 MB  17-AUG-2019  [LSTM]
* - LSTM = "Long Short-Term Memory" training data.
    LSTM is the latest, most accurate OCR method used by Tesseract v4.x.
    TESS = Tesseract v3.x compatible (can be used by v4.x).

Thanks.

Kristinita added the need-maintainer label Aug 19, 2019
