Handwriting recognition

HANDWRITTEN TEXT RECOGNITION: STATE OF THE ART AND IMPLEMENTATION

Paris - 29 May 2018 Christopher Kermorvant - A2iA - TEKLIA

HANDWRITING RECOGNITION?

When I say that I am a researcher in handwriting recognition:

Nobody writes by hand anymore.

I thought that was a solved problem.

MNIST: 0.23% error

No machine can decipher handwriting

SHORT HISTORY OF HANDWRITING RECOGNITION

Handwriting recognition is one of the oldest challenges in AI

RAND Corporation, 1960

The MNIST database: “The drosophila of machine learning” (Geoffrey Hinton)

Machine performance is still far behind human performance

DOES IT WORK?

https://cloud.google.com/vision/docs/drag-and-drop

Part one: with off-the-shelf tools

Google Cloud Vision API

Microsoft Cognitive Services



Printed text (born-digital)

Different fonts

Different colours

Logos

Recent handwritten text (1981)

Robert Badinter. Speech on the abolition of the death penalty.

http://gallica.bnf.fr/ark:/12148/btv1b8571107s/f9.item


Text localization

Text recognition

Archives nationales, JJ 090

Register of the Trésor des chartes, years [1354] 1357-1360 1361

http://bvmm.irht.cnrs.fr/mirador/index.php?manifest=http://bvmm.irht.cnrs.fr/iiif/23678/manifest





Text localization

Text recognition



A BRIEF HISTORY OF HANDWRITING RECOGNITION

Segment, Measure, Classify

1970 - 1990: Analytical approaches

1990 - 2000: Statistical approaches

Segment, Measure, Learn/Classify

2010 - : Deep Learning approaches


Part two: in research laboratories

Isolated words

Full pages

International evaluations


[Figure: word error rate (%), axis 0 to 40, by evaluation year (2005, 2007, 2009, 2010, 2011, 2013, 2015), for isolated words and full pages, compared with 1 human and 2 humans.]

Recognition of handwritten Arabic words and pages

International evaluations

➤ Documents from the Ratsprotokolle collection (1470 to 1805), in German.

➤ Letters from the Alfred Escher collection, in German

LATEST READ EVALUATIONS: 2016 AND 2017

Figure 1. Document samples. The image on the left belongs to the Alfred Escher Letter Collection and the image on the right belongs to a different collection.

II. COMPETITION CHALLENGES AND DATASET DESCRIPTION

This competition aims to bring together researchers working on off-line Handwritten Text Recognition (HTR) for historical documents and provides them a suitable benchmark to compare their techniques on the task of transcribing typical historical handwritten documents.

New challenges were considered in this fourth edition taking into account the experience from previous editions [1], [2], [3] and from the requirements in the READ project.

The competition proposed for ICDAR 2017 aimed at introducing a usual scenario for some collections in which there exist transcripts at page level for many pages, but these transcripts are not aligned with line images. The problem is then to automatically detect the lines and align these transcripts with the corresponding line images for subsequently training an HTR system. In this scenario, it is feasible to annotate accurately line images with their transcripts for a few pages that can be used for training an initial HTR system. This initial HTR system is then used for automatically getting new training material from the whole training dataset. This challenge was proposed also in the HTR ICDAR 2015 competition [2], but in that edition only 313 pages were provided as new training dataset and this time 10,000 pages were made available (see details below).

There are several reasons that could be interesting for attracting research groups to participate in this competition:

a) the scenario described previously is quite usual and poses interesting problems that can be very challenging for HTR researchers. The collection that we used in this competition has some characteristics that allowed us to introduce several problems that are present in many datasets: existing transcripts that are not aligned at line level but can be used for training optical models if the correct alignment is automatically discovered.

b) HTR has been extensively studied in the last years mainly for English. One of the reasons for this situation is the existence of English public databases. In ICFHR 2016, we proposed an HTR competition on German language [3]. The obtained results in that competition were really good at character level but the error at word level was very high. In this new edition we decided to continue with German language to give continuity to HTR research for this language.

Part of the Alfred Escher Letter Collection (AEC) was used in this competition. This collection is written in German but it includes also documents in other languages, like French, Italian and Greek. Format of selected GT data for training and for submitting the results was chosen to be PAGE format [5]. TEI (footnote 8) marks were removed and ignored for the competition.

The training set was divided into two batches, named respectively Train-A and Train-B.

The Train-A set consisted of 50 page images, each encompassing one or more text blocks. These pages entailed several line detection and transcription difficulties and the corresponding GT was produced semi-automatically and manually reviewed at line level.

Column “Train-A” in Table I summarizes the basic statistics of these pages.

Table I
STATISTICS OF THE DATASET USED IN THE COMPETITION. SUBSET OF AEC AND DOCUMENTS FROM OTHER GERMAN COLLECTIONS.

Number of:            Train-A    Train-B      Total Train   Test-A    Test-B2
Pages                 50         10,000       10,050        65        57
Lines                 1,386      204,775      206,161       1,573     1,412
Running words         15,169     1,754,026    1,769,195     14,880    14,460
Lexicon               4,637      98,993       99,530        4,635     4,648
Running OOV           -          -            -             816       866
OOV Lexicon           -          -            -             739       771
Character set size    102        168          168           0         0
Running Characters    70,268     8,290,607    8,360,875     0         0

The Train-A dataset had images from AEC but also a few images from other collections (see Figure 1). The images from the other collections were written by different hands but all the documents were from the same period. These few images were included in the training set for testing how well the HTR systems could adapt to other writing styles and other sort of documents. Some of the images that did not belong to the AEC collections were of poor quality and/or low resolution. Regarding the resolution of the images, low resolution images are very frequent in archives (thousands of images, according to archives involved in READ). This is because many collections were scanned some time ago and they are not being scanned again because of several reasons: the documents are not currently available, low budgets in archives, different priorities, etc. So, this is a real problem for many collections residing in archives that needs to be addressed. The competitors were informed about this situation. The same situation happened for the test set.

The second train batch (Train-B) did not include geometric information about the location of the text lines. The

8 http://www.tei-c.org/index.xml



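HANDWRITING RECOGNITION: DOES IT WORK?

[Figure: word error rate (%), axis 0 to 50, per participating system. 2016: RWTH, BYU, A2iA, LITIS, PARISTECH (bar labels as extracted: 46, 26, 22, 21, 21). 2017: BYU, LITIS, PARISTECH (bar labels as extracted: 22, 45, 19).]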

➤ All systems are based on Deep Learning

➤ Systems can be trained with less data

➤ An error-rate floor of 20%

Less training data

HANDWRITING RECOGNITION: IS IT USEFUL (WITH THIS ERROR RATE)?

Jeremy Bentham's manuscripts: 70,000 pages

Transcription by human experts: 67 years

Crowdsourcing (2010)

Transcription by volunteers: 27 years


Transcription by a machine: 24 hours

But the text must still be validated manually
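To put the three durations on a common scale, the short sketch below converts them to pages per day. It only uses the figures quoted on the slides; the 365-day year and the variable names are assumptions made for the illustration, and manual validation time is not counted.

```python
PAGES = 70_000  # Bentham manuscripts, figure quoted on the slide

# Durations quoted on the slides, converted to days (365-day years assumed).
durations_days = {
    "human experts": 67 * 365,
    "volunteers": 27 * 365,
    "machine": 1,  # 24 hours
}


def pages_per_day(pages, days):
    """Average throughput in pages per day."""
    return pages / days


rates = {who: pages_per_day(PAGES, days) for who, days in durations_days.items()}

for who, rate in rates.items():
    print(f"{who}: {rate:,.1f} pages/day")
```

The gap is roughly four orders of magnitude, which is why a 20% word error rate can still be worth it when the alternative is no transcription at all.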

Automatic recognition (2014)

[Figure: Word Error Rate (%), axis 0 to 15, for the systems LIMSI1, CITLAB, LIMSI2 and A2IA. ICFHR 2014 evaluation.]

https://read.transkribus.eu


THE HIMANIS PROJECT

➤ Goals: give historians new tools to exploit a corpus of handwritten registers: the registers of the French chancery (1300 - 1483)

➤ The technology:

  ➤ Automatic handwriting recognition and deep indexing

  ➤ User-controlled search interface

European Research Project (JPI), 2015-2018

DEEP LEARNING FOR HANDWRITING RECOGNITION

Text line images

descort, c’est assavoir que li diz sires de Harecourt dysoit et maintenoit que deux miles livres à tournois de rente annuelle et perpetu elle que li diz sires de Partenay li avoit promises et données pour cause dou dit mariage, à avoir du dit seigneur de Harecourt après le deceps du dit seigneur de Partenay, ou cas que il auroit hoir ou hoirs mâles, et le cas se estoit jà offert, li devoient estre assises à Chas

Text

Optical model: recognizes the shapes

Text line images

Back-propagation

Text


Unigram: S e i g n e u r

Bigram: # S e i g n e u r

Trigram: # # S e i g n e u r

Language model: models the linguistic statistics

Represented as a stochastic automaton

Trained on a corpus of electronic text (no images needed)
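As an illustration of the character n-gram idea shown with "Seigneur", here is a minimal sketch of counting padded character n-grams and estimating next-character probabilities. The function names and the tiny corpus are hypothetical, and the estimate is a plain maximum-likelihood ratio without smoothing; a production language model would smooth the counts and be compiled into an automaton.

```python
from collections import Counter, defaultdict

PAD = "#"  # padding symbol, as on the slide


def char_ngrams(word, n):
    """Return the character n-grams of `word`, left-padded with n-1 '#'."""
    padded = PAD * (n - 1) + word
    return [padded[i:i + n] for i in range(len(word))]


def train(corpus, n):
    """For each (n-1)-character history, count which character follows it."""
    counts = defaultdict(Counter)
    for word in corpus:
        for gram in char_ngrams(word, n):
            counts[gram[:-1]][gram[-1]] += 1
    return counts


def prob(counts, history, char):
    """Unsmoothed estimate of P(char | history); 0.0 for an unseen history."""
    total = sum(counts[history].values())
    return counts[history][char] / total if total else 0.0


# Hypothetical three-word text corpus; no images are involved.
corpus = ["Seigneur", "Seigneur", "Sire"]
bigrams = train(corpus, 2)
trigram_list = char_ngrams("Seigneur", 3)
p_e_after_S = prob(bigrams, "S", "e")
```

Here `trigram_list` starts with `##S`, `#Se`, `Sei`, matching the padded trigram row above.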


ABBREVIATIONS

➤ ~30% of the words are abbreviated (60% for Latin)

➤ The ground truth is generally modernized, with abbreviations expanded

➤ The language (French/Latin) is not marked explicitly

Chevalier

Guillaume

Optical model + language model:

➤ Good news no. 1: the system learns the abbreviations

➤ Good news no. 2: the language model is multilingual


                  Transcribed (0.5%)          Not transcribed
                  Training      Evaluation    Corpus
Period            1302-1361                   1302-1483
Number of acts    341           95            68 000
Lines             6 061         1 733         2 692 339†
Words             117 709       33 097        39 623 581†

† estimated

Character error rate on the test set: 18%

Error breakdown: insertions 11%, deletions 52%, substitutions 37%
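A character error rate and its insertion/deletion/substitution breakdown are usually obtained from an edit-distance alignment between the reference and the recognized text. The sketch below is a generic Levenshtein implementation written for this note, not the evaluation tool used in the project; the example strings are made up.

```python
def edit_ops(reference, hypothesis):
    """Return (substitutions, deletions, insertions) of one minimal edit script."""
    n, m = len(reference), len(hypothesis)
    # dist[i][j]: edit distance between reference[:i] and hypothesis[:j]
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = i
    for j in range(1, m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j - 1] + cost,  # match or substitution
                             dist[i - 1][j] + 1,         # deletion
                             dist[i][j - 1] + 1)         # insertion
    # Walk back from the end to count each kind of operation.
    subs = dels = ins = 0
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            if dist[i][j] == dist[i - 1][j - 1] + cost:
                subs += cost
                i, j = i - 1, j - 1
                continue
        if i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            dels += 1
            i -= 1
        else:
            ins += 1
            j -= 1
    return subs, dels, ins


def cer(reference, hypothesis):
    """Character error rate: edit operations divided by reference length."""
    return sum(edit_ops(reference, hypothesis)) / len(reference)


# Made-up example: the recognizer dropped one character.
example_ops = edit_ops("maintenoit", "maintnoit")
example_cer = cer("maintenoit", "maintnoit")
```

A high share of deletions, as in the breakdown above, means the recognizer more often drops characters than invents them.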


HOW TO INDEX WITH 20% ERROR?

Indexing & Search HIMANIS Paris-2017 meeting

Probabilistic Indexing & Search: Precision-Recall Tradeoff Model

Indexing and search quality can be assessed by means of precision (π) and recall (ρ) performance.

Precision is high if most of the retrieved results are correct, while recall is high if most of the existing correct results are retrieved.

If perfectly correct text were indexed, you’d get a single, “ideal” point with ρ = π = 1.

[Figure: recall-precision plot (precision π against recall ρ, both from 0 to 1) showing two single points: Perfect (AP=1.0) and Aut. Transcript (AP=0.6).]

If automatic (typically noisy) handwritten text transcripts are naively indexed just as plain text, precision and recall are also fixed values, albeit not “ideal” (perhaps something like ρ = 0.75, π = 0.8, with Average Precision AP=0.6).
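The two measures can be made concrete with a few lines of code. This is a minimal sketch in which search results and ground truth are sets of hypothetical line identifiers; returning 1.0 when nothing is retrieved is a convention chosen here, since precision is otherwise undefined in that case.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 1.0
    recall = hits / len(relevant) if relevant else 1.0
    return precision, recall


# Hypothetical query: 16 lines really contain the word,
# the plain-text index returns 15 lines, 12 of them correct.
relevant_lines = set(range(16))
retrieved_lines = set(range(12)) | {100, 101, 102}
pi, rho = precision_recall(retrieved_lines, relevant_lines)
```

With these made-up counts the result is π = 0.8 and ρ = 0.75, the single fixed point used as an example in the text above.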

J. Puigcerver, A.H. Toselli and E. Vidal, October-2017 Page 5

First solution: use the raw transcription and index the characters

High precision: the results are good but few

High recall: the results are numerous but often wrong

Give the user control over precision/recall

[Figure: recall-precision plot (precision π against recall ρ, both from 0 to 1) with the points Perfect (AP=1.0) and Aut. Transcript (AP=0.6), plus the curve Prob. Index (AP=0.8) running from a high confidence threshold to a low confidence threshold.]

In contrast, probabilistic indexing allows for arbitrary precision-recall tradeoffs by setting a threshold on the system confidence (relevance probability).

This flexible “precision-recall tradeoff model” obviously allows for better search and retrieval performance than naive plaintext searching on automatic noisy transcripts.
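The effect of the threshold can be sketched as follows: each index entry carries a confidence, and lowering the threshold admits more entries, which raises recall and usually lowers precision. The data and function names are hypothetical, confidences are assumed distinct, and the average precision computed here is the usual non-interpolated one, which may differ slightly from the interpolated measure used in the cited evaluations.

```python
def pr_curve(scored_hits, n_relevant):
    """scored_hits: (confidence, is_correct) pairs.

    Returns one (threshold, precision, recall) triple per entry, where the
    threshold is that entry's confidence and all entries at or above it count
    as retrieved.
    """
    ranked = sorted(scored_hits, key=lambda hit: hit[0], reverse=True)
    curve, correct = [], 0
    for rank, (confidence, is_correct) in enumerate(ranked, start=1):
        correct += 1 if is_correct else 0
        curve.append((confidence, correct / rank, correct / n_relevant))
    return curve


def average_precision(scored_hits, n_relevant):
    """Mean of the precision values measured at each correct entry."""
    ranked = sorted(scored_hits, key=lambda hit: hit[0], reverse=True)
    correct, total = 0, 0.0
    for rank, (_, is_correct) in enumerate(ranked, start=1):
        if is_correct:
            correct += 1
            total += correct / rank
    return total / n_relevant


# Hypothetical index entries for one keyword; 4 true occurrences exist.
hits = [(0.9, True), (0.8, True), (0.6, False), (0.4, True), (0.2, False)]
curve = pr_curve(hits, n_relevant=4)
ap = average_precision(hits, n_relevant=4)
```

In this toy case a threshold of 0.8 gives precision 1.0 with recall 0.5, while a threshold of 0.2 gives precision 0.6 with recall 0.75.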



Setting a threshold

1st idea: ask the recognizer for multiple recognition hypotheses

➤ Increase recall

➤ Increase precision through a better score


Multiple hypotheses: character lattice

MULTIPLE HYPOTHESES

Best hypothesis: et maintenoit que deux miles

PROBABILITY MAP


Probabilistic Word Indexing from the Posteriorgram

[Figure: posteriorgram P of the text image X.]

Directly computing and using a full pixel-level posteriorgram would entail a formidable computational load and would require prohibitive amounts of indexing storage.

But, for each word, image region relevance probabilities and locations are easily derived from the Posteriorgram – and used to probabilistically index the word in an efficient way.


2nd idea: compute a probability map of the presence of the searched characters

PROBABILITY COMPUTED ON THE HYPOTHESIS LATTICE

Fig. 1. Top: A small, partial example of CL for a handwritten text “to be for”. The edge sequence paths {(q1, q2), (q2, q3)}, {(q4, q2), (q2, q3)}, {(q5, q7), (q7, q8)} and {(q6, q7), (q7, q8)} (in dashed red lines) correspond to the query character sequence cv = “to”. The “@” symbol stands for the white space character. Bottom: Corresponding posteriorgram-like frame-level function S(cv, x, i). Note that the score S(“to”, x, i) is not negligible in the interval where the word “for” appears in the image. However, it is much lower than in the correct interval where “to” is actually written.

Algorithm 1 Computation of the frame-level score: S(cv, x, i)

Input:  query word: cv ≡ c1, c2, ..., cL.
        CL generated from decoding an input image of length M.
        α(qj), β(qj) : j = 1, 2, ..., |Q| computed for all CL nodes.
Output: S ≡ S1, ..., SM, frame-level score vector

Let P: stack of tuples (ϕ ∈ ℜ, k ∈ [0, M], q ∈ Q, p ∈ [1, L])
P.clear(); S ← 0
for all (q′, q) : c1 = ω(q′, q) do
    P.push(α(q′)·s(q′, q), t(q′), q, 2)
end for
while not P.empty() do
    (ϕ, k, q, p) ← P.top(); P.pop()
    if p ≤ L then
        for all q′ : cp = ω(q, q′) do
            P.push(ϕ·s(q, q′), k, q′, p+1)
        end for
    else
        ϕ ← ϕ·β(q) / β(qI)
        for i ← k to t(q) do
            Si ← Si + ϕ
        end for
    end if
end while
return S

where, as in Eq. (8), L and σ are the number of characters of cv and a weighting parameter respectively.

As commented, this method also provides the spotted word location associated to the maximum i used to compute P(R | cv, x); for instance, in the example shown by the Fig. 1, the best scores of S(cv, x, i) for the keyword “to” (whose maximum is P(R | cv, x)) span positions t(q1) through t(q3).

The overall cost of computing the final confidence score given by Eq. (11), excluding CL generation and forward-backward computation, is determined by the costs of computing the frame-level character sequence score S for all i and the final maximization of Eq. (11). According to Alg. 1, the worst-case cost of computing S(cv, x, i), 1 ≤ i ≤ n, is O(n + l · B^L), where l is the average length (in frames) of the sub-paths corresponding to cv, and B = |E|/(|Q| · |Σ|) is the CL average branching factor per character.

Real computing times of this KWS approach, compared with the one presented in Sec. IV, will be reported in Tab. II.
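For readers who prefer code to pseudocode, here is a minimal Python sketch of the stack-based procedure of Algorithm 1 on a toy lattice. Several things are assumptions made for the illustration: the lattice is a list of (source, destination, character, score) edges with a single initial and a single final node, the forward and backward scores are computed by a simple pass over nodes given in topological order, and all names are hypothetical. It is not the authors' implementation.

```python
from collections import defaultdict


def forward_backward(edges, order):
    """Forward (alpha) and backward (beta) scores; `order` is topological."""
    alpha, beta = defaultdict(float), defaultdict(float)
    alpha[order[0]] = 1.0   # single initial node assumed
    beta[order[-1]] = 1.0   # single final node assumed
    for q in order:
        for src, dst, _, s in edges:
            if src == q:
                alpha[dst] += alpha[src] * s
    for q in reversed(order):
        for src, dst, _, s in edges:
            if dst == q:
                beta[src] += beta[dst] * s
    return alpha, beta


def frame_scores(query, edges, t, alpha, beta, initial, n_frames):
    """Frame-level score S(query, x, i) for i = 0..n_frames."""
    outgoing = defaultdict(list)
    for src, dst, char, s in edges:
        outgoing[src].append((dst, char, s))
    scores = [0.0] * (n_frames + 1)
    # Stack entries: (path score, start frame, current node, characters matched).
    stack = [(alpha[src] * s, t[src], dst, 1)
             for src, dst, char, s in edges if char == query[0]]
    while stack:
        phi, k, q, p = stack.pop()
        if p < len(query):
            # Extend the partial match with edges carrying the next character.
            for dst, char, s in outgoing[q]:
                if char == query[p]:
                    stack.append((phi * s, k, dst, p + 1))
        else:
            # Full match: normalise and spread the score over its frames.
            phi *= beta[q] / beta[initial]
            for i in range(k, t[q] + 1):
                scores[i] += phi
    return scores


# Toy lattice over 4 frames: the first character is "t" (0.6) or "f" (0.4),
# the second is always "o". t maps each node to its frame position.
edges = [(0, 1, "t", 0.6), (0, 1, "f", 0.4), (1, 2, "o", 1.0)]
t = {0: 0, 1: 2, 2: 4}
alpha, beta = forward_backward(edges, order=[0, 1, 2])
scores_to = frame_scores("to", edges, t, alpha, beta, initial=0, n_frames=4)
scores_o = frame_scores("o", edges, t, alpha, beta, initial=0, n_frames=4)
```

On this toy lattice the query "to" gets a score of 0.6 on every frame, while the query "o" scores 1.0 only on frames 2 to 4, which shows how the score also localizes the match.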

VI. EXPERIMENTS

To compare the effectiveness and efficiency of both computing confidence score methods, the forward and the posteriorgram-like, several experiments were carried out. The evaluation measures, corpus, experimental setup and the results are presented next.

A. Evaluation Measures

The standard recall and interpolated precision measures [13] are used here. Interpolated precision is widely used to avoid cases in which plain precision can be ill-defined. Results are reported in terms of average precision (AP) [14], which is computed as the area under the Recall-Precision curve and is a popular scalar assessment measure for KWS.

On the other hand, for efficiency assessment, real computing times are reported in terms of total elapsed times measured on a dedicated single core of an Intel® Xeon™ running at 2.5GHz.

B. Corpora Description

Experiments were carried out with the IAMDB dataset. It is a publicly available, well known modern English handwritten text corpus, compiled by the FKI-IAM Research Group on the base of the Lancaster-Oslo/Bergen Corpus (LOB). The last released version (3.0) is composed of 1 539 scanned text pages, handwritten by 657 different writers and partitioned into writer-independent training, validation and test sets. The line segmentation provided with the corpus [15] is used here. Basic statistics appear in Table I.


INDEXING WITH RECALL/PRECISION CONTROL

Chancery Probabilistic Index Evaluation: Laboratory Results

[Figure: two recall-precision plots (precision π against recall ρ, both from 0 to 1).]

Left: Recall-Precision results for different character N-gram models (0grm, 3grm, 5grm). A single R-P point (HTR) is also shown for the 1-best recognition hypotheses with character 5-grams.

  5grm    AP=0.75    mAP=0.68
  3grm    AP=0.69    mAP=0.61
  0grm    AP=0.62    mAP=0.52
  HTR     AP=0.58    mAP=0.44

Right: Recall-Precision results for (only) abbreviated keywords using character 5-gram models: Latin-only (5gr-la), French-only (5gr-fr) and both Latin and French (5gr). A single R-P point (HTR) is also shown for the 1-best recognition hypotheses with character 5-grams.

  5gr-la  AP=0.86    mAP=0.73
  5gr     AP=0.84    mAP=0.74
  5gr-fr  AP=0.80    mAP=0.74
  HTR     AP=0.69    mAP=0.52



TEXT DETECTION: STILL TO BE IMPROVED

Missed lines

Poorly detected lines

CAN A COMPUTER READ HANDWRITING?

➤ A system based on a deep neural network can read medieval handwriting

➤ Full indexing is possible even with a high error rate, thanks to multiple hypotheses

➤ Text detection is a critical step and still needs improvement

➤ The evolution of the handwriting across the registers must be taken into account

➤ Quality evaluation is a major issue

DOES THIS TRANSFER TO OTHER CORPORA?

In principle, yes

But with what performance?

The favourable and unfavourable factors must be assessed


Criterion                                                                 Points

Corpus available in IIIF                                                  +2

Uniform digitization (microfilm, black-and-white image, colour images)    +1

Uniform digitization in single/double page                                +1

Uniform layout across the corpus                                          +1

Layout complexity                                                         Text pages: +1; Tables: 0; Mixed: -1


Criterion                          Points

Language: single or multiple       Little impact

Transcriptions available           1 point per batch of 100 pages

Text line locations available      +1

Similar text corpus available      +1
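The scoring grid can be turned into a small checklist function. This is only a sketch of the arithmetic: the parameter names are hypothetical, the language criterion is left out because the grid gives it little impact, and counting only complete batches of 100 transcribed pages is an interpretation of "1 point per batch".

```python
LAYOUT_POINTS = {"text": 1, "tables": 0, "mixed": -1}


def corpus_score(iiif, uniform_scan, uniform_page_mode, uniform_layout,
                 layout_kind, transcribed_pages, line_locations,
                 similar_text_corpus):
    """Sum the points of the grid for one corpus (higher is more favourable)."""
    score = 2 if iiif else 0
    score += 1 if uniform_scan else 0
    score += 1 if uniform_page_mode else 0
    score += 1 if uniform_layout else 0
    score += LAYOUT_POINTS[layout_kind]
    score += transcribed_pages // 100  # complete batches of 100 pages
    score += 1 if line_locations else 0
    score += 1 if similar_text_corpus else 0
    return score


# Hypothetical corpus: IIIF, uniform scans and layout, mixed single/double
# pages, text-only pages, 250 transcribed pages, no line locations,
# a similar text corpus available.
example_score = corpus_score(
    iiif=True, uniform_scan=True, uniform_page_mode=False, uniform_layout=True,
    layout_kind="text", transcribed_pages=250, line_locations=False,
    similar_text_corpus=True,
)
```

For this made-up corpus the total is 8 points; the grid gives no pass mark, so the number is mainly useful for comparing corpora with each other.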

In every case, the machine learns from examples annotated by humans.

The more heterogeneous the documents, the more examples are needed to learn the diversity.

HOW TO PUT THESE TECHNIQUES INTO PRACTICE?

The gain comes from scaling up: there is no gain on small corpora.

Do not underestimate the human work: the machine learns from annotated examples.


Build a mixed team of computer scientists and historians

If possible in the same place


The development process is iterative and centred on the data

TOWARDS A REUSABLE ARCHITECTURE

Version 1: custom developments

  HIMANIS beta

  Transcription: A2iA

  Indexing: UPV

Version 2: IIIF and open source

  Transcription: A2iA

  Indexing: UPV

  IIIF viewer: Mirador

  IIIF image server

  Indexing engine: Elasticsearch

DEVELOPMENT OF AN API

Results per act

Integration of act numbers and their location on the pages

DEVELOPMENT OF AN API - IIIF

Display in Mirador

Position of the acts

Indexed words
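Once word positions are stored in the index, a IIIF image server can deliver just the corresponding image region. The sketch below builds such a request following the IIIF Image API URI pattern (identifier, region, size, rotation, quality, format). The server address, the identifier and the coordinates are hypothetical, and the `max` size keyword assumes a server implementing version 2.1 or later of the API (older servers use `full`).

```python
from urllib.parse import quote


def iiif_region_url(base_url, identifier, x, y, w, h, size="max", fmt="jpg"):
    """Build a IIIF Image API URL for the pixel region x,y,w,h of an image."""
    region = f"{x},{y},{w},{h}"
    encoded_id = quote(identifier, safe="")  # slashes in identifiers are escaped
    return f"{base_url.rstrip('/')}/{encoded_id}/{region}/{size}/0/default.{fmt}"


# Hypothetical server, page identifier and word bounding box.
word_url = iiif_region_url(
    "https://iiif.example.org/images/", "JJ090/f12", x=100, y=200, w=300, h=50
)
```

A viewer such as Mirador issues requests of this kind itself; building them by hand is mainly useful for showing search-result snippets outside the viewer.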
