Tesseract

Software for optical character recognition (OCR)

Contents

1 Installation
- 1.1 OCR engine
- 1.2 Language data
2 Command-line usage
- 2.1 Use case
3 From a screen selection
- 3.1 From the command line
- 3.2 With a graphical interface under GNOME (X / Wayland)
4 Additional configuration
5 Sources
6 See also

Installation

OCR engine

Arch Linux

pacman -S tesseract

Ubuntu

apt install tesseract-ocr

Language data

Language data must also be installed, for example for Japanese:

Arch Linux

pacman -S tesseract-data-jpn

These files are installed under /usr/share/tessdata.

Ubuntu

apt install tesseract-ocr-jpn
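
To check that the language data was picked up, the languages known to tesseract can be listed with `tesseract --list-langs`. The helper below is a minimal sketch: `has_lang` is a hypothetical name, not part of tesseract, and it assumes the list contains one language code per line.

```shell
#!/bin/bash
# has_lang CODE: succeed if CODE is exactly one of the lines read from stdin.
# Hypothetical helper; expects one language code per line, as printed
# by `tesseract --list-langs`.
has_lang() {
    grep -qxF -- "$1"
}

# Only query tesseract when it is actually installed.
if command -v tesseract > /dev/null 2>&1; then
    # stderr is merged in case the list is printed there
    if tesseract --list-langs 2>&1 | has_lang jpn; then
        echo "jpn data is installed"
    else
        echo "jpn data is missing"
    fi
fi
```

The header line of the listing never matches, since `grep -x` only accepts a line that is exactly the language code.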

Command-line usage

tesseract image_name [options] [output_name] [configfile]

An important option is the language (-l).

Config files are located under /usr/share/tessdata/configs.

Use case

Given an image ~/Pictures/jpn.png containing only Japanese text.

Conversion to text (default config), written to out.txt:

tesseract ~/Pictures/jpn.png -l jpn out

Using the pdf config to convert to PDF, written to out.pdf:

tesseract ~/Pictures/jpn.png -l jpn out pdf
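
The same command can be applied to a whole directory of images. The following is a sketch under stated assumptions: `out_base` and `ocr_dir` are hypothetical helper names, the images are assumed to be PNG files, and the options are placed after the output base, as in the synopsis of the tesseract man page.

```shell
#!/bin/bash
# out_base IMAGE: print the output base for IMAGE, i.e. the path without
# its last extension (tesseract appends .txt or .pdf itself).
out_base() {
    printf '%s\n' "${1%.*}"
}

# ocr_dir DIR [LANG]: run tesseract on every PNG in DIR (hypothetical helper).
ocr_dir() {
    local dir=$1 lang=${2:-eng} img
    for img in "$dir"/*.png; do
        [ -e "$img" ] || continue      # no match: the glob stays literal
        tesseract "$img" "$(out_base "$img")" -l "$lang"
    done
}

# Example call, assuming ~/Pictures holds the images:
# ocr_dir ~/Pictures jpn
```

Each text file ends up next to its image, for example ~/Pictures/jpn.txt for ~/Pictures/jpn.png.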

From a screen selection

From the command line

The following Bash script runs tesseract on a screen selection and copies the result to the clipboard.

#!/bin/bash
# Dependencies: tesseract-ocr imagemagick scrot xsel
# Usage: pass the OCR language as the first argument (default: eng)
SRC_LANG=eng
if [ -n "$1" ]; then
    echo "set lang to $1"
    SRC_LANG=$1
fi

SCR_IMG=$(mktemp)
trap 'rm -f "$SCR_IMG"*' EXIT   # remove temporary files at exit

read -r -p "Make the text to select visible on screen and press enter to continue"
echo "Select text on screen now"

# increase image quality with option -q from default 75 to 100
scrot -s "$SCR_IMG.png" -q 100

# convert to grayscale and enlarge 4x; should increase detection rate
mogrify -modulate 100,0 -resize 400% "$SCR_IMG.png"

tesseract "$SCR_IMG.png" -l "$SRC_LANG" "$SCR_IMG" &> /dev/null

# copy to clipboard, then print the recognized text
xsel -bi < "$SCR_IMG.txt"
cat "$SCR_IMG.txt"

exit

Source: https://askubuntu.com/questions/280475/how-can-instantaneously-extract-text-from-a-screen-area-using-ocr-tools
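
Since the script relies on several external tools, it can help to verify that they are available before running it. A minimal sketch, where `missing_deps` is a hypothetical helper name and the command list is taken from the script's dependency comment:

```shell
#!/bin/bash
# missing_deps CMD...: print each command that is not found in PATH.
missing_deps() {
    local cmd
    for cmd in "$@"; do
        command -v "$cmd" > /dev/null 2>&1 || printf '%s\n' "$cmd"
    done
}

# mogrify is the ImageMagick command used by the script above.
missing=$(missing_deps tesseract scrot mogrify xsel)
if [ -n "$missing" ]; then
    # $missing is left unquoted so the names are printed on one line
    echo "Missing commands:" $missing >&2
fi
```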

With a graphical interface under GNOME (X / Wayland)

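#!/bin/bash
# Dependencies: tesseract-ocr gnome-screenshot yad xsel

TESS_LANG=eng
TESS_PSM=6   # page segmentation mode; defined but not passed to tesseract below

# Capture a screen area, run OCR on it, copy the text to the clipboard and print it
run() {
    local img result
    img=$(mktemp)
    gnome-screenshot -a -f "$img.png"
    tesseract "$img.png" -l "$TESS_LANG" "$img" &> /dev/null
    result=$(cat "$img.txt")
    rm -f "$img" "$img.txt" "$img.png"
    echo -n "$result" | xsel -bi
    echo "$result"
}

# Show the last result in a yad dialog until the window is closed
while true; do
    cmd=(yad --button="select":"1" \
             --text="$result" --selectable-labels --no-markup \
             --title "Gnome Tesseract UI" --on-top)
    "${cmd[@]}"
    case $? in
        1) result=$(run);;       # "select" button: OCR a new selection
        252) break;;             # dialog closed
    esac
done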
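
xsel talks to the X11 clipboard. Under a Wayland session, a tool such as wl-copy (from the wl-clipboard package) is commonly used instead. The sketch below picks one or the other; `clip_cmd` and `to_clipboard` are hypothetical helper names, and it assumes that WAYLAND_DISPLAY is set only inside a Wayland session.

```shell
#!/bin/bash
# clip_cmd: print the clipboard command to use for the current session.
# Assumption: WAYLAND_DISPLAY is set only inside a Wayland session, and
# wl-copy (from wl-clipboard) is installed there.
clip_cmd() {
    if [ -n "${WAYLAND_DISPLAY:-}" ]; then
        echo "wl-copy"
    else
        echo "xsel -bi"
    fi
}

# to_clipboard: copy stdin to the clipboard with the selected command.
to_clipboard() {
    # Intentionally unquoted so "xsel -bi" splits into command and option.
    $(clip_cmd)
}

# Example: echo -n "$result" | to_clipboard
```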

Additional configuration

For Japanese:

chop_enable                         T
use_new_state_cost                  F
segment_segcost_rating              F
enable_new_segsearch                0
language_model_ngram_on             0
textord_force_make_prop_words       F
edges_max_children_per_outline      40

Or, as command-line options:

-c chop_enable=T -c use_new_state_cost=F -c segment_segcost_rating=F -c enable_new_segsearch=0 -c language_model_ngram_on=0 -c textord_force_make_prop_words=F -c edges_max_children_per_outline=40
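
The translation from "name value" lines to -c options is mechanical, so it can be scripted. A minimal sketch, where `config_to_args` is a hypothetical helper name and jpn.conf is a hypothetical file holding the parameter lines listed above:

```shell
#!/bin/bash
# config_to_args: read "name value" lines on stdin and print the
# equivalent "-c name=value" options on a single line.
# Lines with fewer than two fields (e.g. blank lines) are skipped.
config_to_args() {
    awk 'NF >= 2 { printf "%s-c %s=%s", sep, $1, $2; sep = " " }
         END     { print "" }'
}

# Example, assuming the parameters were saved to a file named jpn.conf
# (hypothetical name); the substitution is unquoted on purpose so that
# each option becomes a separate argument:
# tesseract ~/Pictures/jpn.png out -l jpn $(config_to_args < jpn.conf)
```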

Sources

https://github.com/tesseract-ocr/tesseract/wiki/ControlParams

https://github.com/tesseract-ocr/tesseract/tree/master/doc

http://manpages.ubuntu.com/manpages/xenial/man1/tesseract.1.html

See also

Nhocr: OCR for Japanese only