Text detection in images from video streams for semantic indexing
Christian Wolf
Thesis supervisor: Jean-Michel Jolion
Laboratoire d'Informatique en Images et Systèmes d'information (LIRIS), FRE 2672 CNRS
Bât. Jules Verne, INSA de Lyon, 69621 Villeurbanne cedex
http://rfv.insa-lyon.fr/~wolf
5 December 2003
The framework of the thesis
2 industrial contracts with France Télécom: ECAV I and ECAV II ("Enrichissement du Contenu Audio-Visuel", i.e. audio-visual content enrichment)
Collaboration with the Language and Media Processing Laboratory, University of Maryland.
2 research internships:
• 2001: character segmentation
• 2002: video indexing (TREC)
Indexing using Text
Keyword-based search
Example of detected caption text: "Patrick Mayhew, Min. chargé de l'Irlande du Nord" (Minister for Northern Ireland); other detected strings: ISRAEL, Jerusalem, montage, T. Nouel.
Keyword → Result
Indexing phase
Introduction | Still images | Videos | Character segmentation | Results | Conclusion
Plan
Introduction
Detection in still images
Detection in video sequences
Character segmentation
Experimental results
Conclusion
Videos vs. scanned documents
Temporal aspects
Complex and moving background
Artificial shadows
Low resolution
What is text? - character segmentation
Artificial text
Scene text
What is text? - texture
Example: Gabor energy features on a text image
Original image Filter tuned to the example text
Gabor energy Thresholded Gabor energy
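The Gabor energy computation above can be sketched as follows; a minimal numpy implementation, where the kernel size, wavelength and toy stripe image are illustrative choices, not the parameters used in the thesis:

```python
import numpy as np

def gabor_kernel(ksize, wavelength, theta, sigma):
    """Real (even) Gabor kernel: a cosine carrier under a Gaussian envelope."""
    half = ksize // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    xr = x * np.cos(theta) + y * np.sin(theta)          # rotated coordinate
    env = np.exp(-(x**2 + y**2) / (2.0 * sigma**2))
    return env * np.cos(2.0 * np.pi * xr / wavelength)

def gabor_energy(img, wavelength, theta, sigma):
    """Convolve with a Gabor filter and return the squared response (energy)."""
    k = gabor_kernel(9, wavelength, theta, sigma)
    h, w = img.shape
    half = k.shape[0] // 2
    padded = np.pad(img, half, mode='edge')
    out = np.zeros((h, w))
    for i in range(h):
        for j in range(w):
            out[i, j] = np.sum(padded[i:i + k.shape[0], j:j + k.shape[1]] * k)
    return out**2

# Toy "text-like" image: vertical strokes whose period matches the filter.
img = np.zeros((20, 40))
img[:, ::4] = 1.0
energy = gabor_energy(img, wavelength=4.0, theta=0.0, sigma=2.0)
binary = energy > energy.mean()       # crude threshold on the energy map
```

A filter tuned to the stroke frequency responds strongly over the striped region, which is the texture cue the slide illustrates.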
What is text? - contrast & geometry
Example image Accumulated horizontal Sobel edges
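Accumulating horizontal Sobel responses along each row makes text lines stand out as high-gradient bands; a minimal sketch (window size and the toy image are illustrative):

```python
import numpy as np

SOBEL_X = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]], dtype=float)

def accumulated_sobel(img, window=5):
    """|horizontal Sobel| accumulated over a horizontal window at each pixel:
    rows containing text show up as bands of high accumulated gradient."""
    h, w = img.shape
    padded = np.pad(img, 1, mode='edge')
    grad = np.zeros((h, w))
    for i in range(h):
        for j in range(w):
            grad[i, j] = abs(np.sum(padded[i:i+3, j:j+3] * SOBEL_X))
    acc = np.zeros_like(grad)
    for j in range(w):                 # sum gradients over a sliding row window
        lo, hi = max(0, j - window), min(w, j + window + 1)
        acc[:, j] = grad[:, lo:hi].sum(axis=1)
    return acc

# Toy image: a band of vertical strokes (rows 8..12) on a flat background.
img = np.zeros((20, 40))
img[8:13, ::3] = 1.0
acc = accumulated_sobel(img)
row_strength = acc.sum(axis=1)         # row profile: peaks where the text band is
```

The row profile `row_strength` is the kind of signal from which the text rectangle height can then be estimated.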
A text detection system for videos
• Detection per single frame
• Tracking → text occurrences
• Initial frame integration (averaging)
• Suppression of false alarms
• Image enhancement: multiple frame integration
• Binarization
• OCR (example output: “Soukaina Oufkir”)
Plan
Introduction
Detection in still images
Detection in video sequences
Character segmentation
Experimental results
Conclusion
2 Algorithms for still images
Algorithm 1 (local contrast):
• Calculate a text probability image according to a text model (1 value/pixel)
• Separate the probability values into 2 classes (find the optimal threshold)
• Post-processing

Algorithm 2 (learning):
• Calculate a text feature image (N values/pixel)
• Classify each pixel in the feature image
• Post-processing
The local contrast method
• Calculate a text probability image according to a text model (1 value/pixel; text model after F. LeBourgeois)
• Separate the probability values into 2 classes (Fisher/Otsu)
• Post-processing:
  • Mathematical morphology
  • Geometrical constraints
  • Verification of special cases
  • Combination of rectangles
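The Fisher/Otsu step separates the probability values into two classes by maximizing the between-class variance; a compact sketch (histogram resolution and the toy bimodal data are arbitrary):

```python
import numpy as np

def otsu_threshold(values, nbins=64):
    """Otsu/Fisher criterion: pick the cut maximizing between-class variance."""
    hist, edges = np.histogram(values, bins=nbins)
    p = hist.astype(float) / hist.sum()
    centers = 0.5 * (edges[:-1] + edges[1:])
    best_t, best_var = centers[0], -1.0
    for k in range(1, nbins):
        w0, w1 = p[:k].sum(), p[k:].sum()     # class weights
        if w0 == 0 or w1 == 0:
            continue
        mu0 = (p[:k] * centers[:k]).sum() / w0
        mu1 = (p[k:] * centers[k:]).sum() / w1
        var_between = w0 * w1 * (mu0 - mu1) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, centers[k - 1]
    return best_t

# Bimodal toy data: background probabilities near 0.1, text near 0.9.
rng = np.random.default_rng(0)
vals = np.concatenate([rng.normal(0.1, 0.03, 500), rng.normal(0.9, 0.03, 100)])
t = otsu_threshold(vals)
```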
Properties of the local contrast method
+ High detection accuracy (accurate localization).
+ Not very sensitive to the type of text.
+ Low computational complexity (very fast!).
– False alarms due to the assumption of text presence.
Geometrical constraints are imposed in the post-processing step.
Method 2: why learning?
+ Hope to increase the precision (decrease the number of false alarms) of the detection algorithm by learning the characteristics of text.
+ More complex text models are very difficult to derive analytically.
+ The advent of support vector machine (SVM) learning, and its ability to generalize even in high-dimensional spaces, opened the door to complex decision functions and feature models.
Drawback:
– Specialization to a specific type of text (generalization)? Text exists in a wide variety of forms, fonts, sizes, orientations and deformations (especially scene text).
Geometrical features
Learning gray values and edge maps alone may not generalize enough.
Texture alone is not reliable, especially if the text is short.
Geometry is a valuable feature.
State of the art: enforce geometrical constraints in the post-processing step (mathematical morphology)
We propose using geometrical features very early in the detection process, rather than only during post-processing.
Geometrical features: baseline
Text consists of:
• a high density of strokes in the direction of the text baseline;
• a consistent baseline (a rectangular region with an upper and a lower border).

Two detection philosophies:
• detection of the baseline directly, before detecting the text region;
• detection of the baseline as the boundary area of the detected text region, in order to refine the detection quality.
Estimation of the text rectangle height
Original image Accumulated gradients
Features:
• Mode width (= rectangle height)
• Mode height (= contrast)
• Difference in height left/right
• Mode mean
• Mode standard deviation
• Difference in mode width
Learning with Support Vector Machines
Training image database: positive samples, negative samples.
Classification step: a reduction of the computational complexity is necessary:
• Sub-sampling of the pixels to classify (4x4)
• Approximation of the SVM model by SVM regression
Bootstrapping, cross-validation
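The classification step can be sketched with a tiny linear SVM trained on the hinge loss; this is only a stand-in for the kernel SVM and its regression approximation used in the thesis, and the toy features and hyper-parameters are illustrative:

```python
import numpy as np

def train_linear_svm(X, y, lam=0.01, epochs=200, lr=0.1):
    """Minimal linear SVM: sub-gradient descent on the regularized hinge loss.
    Labels y must be in {-1, +1}."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        margins = y * (X @ w + b)
        mask = margins < 1                     # margin-violating samples
        if mask.any():
            grad_w = lam * w - (X[mask] * y[mask, None]).mean(axis=0)
            grad_b = -y[mask].mean()
        else:
            grad_w, grad_b = lam * w, 0.0
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Toy feature vectors: "text" pixels clustered high, "non-text" pixels low.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(2.0, 0.3, (50, 2)), rng.normal(-2.0, 0.3, (50, 2))])
y = np.concatenate([np.ones(50), -np.ones(50)])
w, b = train_linear_svm(X, y)
pred = np.sign(X @ w + b)
accuracy = (pred == y).mean()
```

In the real system the classifier is evaluated on a 4x4 sub-sampled grid of pixels to keep the cost manageable.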
Plan
Introduction
Detection in still images
Detection in video sequences
Character segmentation
Experimental results
Conclusion
Tracking the text appearances
Text occurrences are followed over the frame number (time).
The list of rectangles detected for the current frame is matched against the list containing the most recent rectangle of each text occurrence; the integration is done using greedy search in the overlap matrix.
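The greedy search in the overlap matrix can be sketched as follows (the rectangle format and the sample boxes are illustrative):

```python
import numpy as np

def overlap(a, b):
    """Intersection area of two rectangles given as (x1, y1, x2, y2)."""
    w = min(a[2], b[2]) - max(a[0], b[0])
    h = min(a[3], b[3]) - max(a[1], b[1])
    return max(0, w) * max(0, h)

def greedy_match(current, tracked):
    """Greedily pair current-frame rectangles with tracked occurrences by
    repeatedly taking the largest remaining entry of the overlap matrix."""
    M = np.array([[overlap(c, t) for t in tracked] for c in current], dtype=float)
    pairs = []
    while M.size and M.max() > 0:
        i, j = np.unravel_index(np.argmax(M), M.shape)
        pairs.append((i, j))
        M[i, :] = 0          # each rectangle participates in at most one pair
        M[:, j] = 0
    return pairs

current = [(0, 0, 10, 5), (20, 0, 30, 5)]    # detections in this frame
tracked = [(21, 0, 31, 5), (1, 0, 11, 5)]    # most recent rectangle per occurrence
pairs = greedy_match(current, tracked)
```

Unmatched current-frame rectangles start new text occurrences; unmatched tracked rectangles end theirs.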
Tracking: content verification
Verification of the text box contents: L2 comparison of a signature vector (the vertical projection profile of the Sobel edges).
Frequently, text occurrences appear at the same location without a significant temporal pause between them.
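A minimal sketch of the signature comparison, using a plain horizontal difference as a stand-in for the Sobel operator; the tolerance value is illustrative:

```python
import numpy as np

def signature(gray):
    """Vertical projection profile of horizontal edge magnitudes
    (simple finite difference standing in for the Sobel operator)."""
    gx = np.abs(np.diff(gray, axis=1))
    return gx.sum(axis=0)

def same_text(sig_a, sig_b, tol=1.0):
    """L2 comparison of two signature vectors, normalized by length."""
    n = min(len(sig_a), len(sig_b))
    d = np.linalg.norm(sig_a[:n] - sig_b[:n]) / np.sqrt(n)
    return d < tol

box = np.zeros((10, 30)); box[:, ::5] = 1.0      # a text box with strokes
same = signature(box)                            # identical content
other = signature(np.roll(box, 2, axis=1))       # shifted strokes: different profile
```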
[Signature profiles over time for three cases: same text, different text, fading text]
Enhancement
Detected text occurrence
Multiple frame integration: averaging
Super-resolution (interpolation): bi-linear interpolation, bi-cubic splines
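The averaging step of the multiple-frame integration can be sketched as follows (toy frames with synthetic Gaussian noise; the noise level is illustrative):

```python
import numpy as np

def integrate_frames(frames):
    """Average the aligned frames of one text occurrence: the stationary text
    is reinforced while noise and moving background pixels are attenuated."""
    return np.mean(np.stack(frames, axis=0), axis=0)

rng = np.random.default_rng(2)
text = np.zeros((8, 20)); text[3:5, 2:18] = 1.0      # static text pixels
frames = [text + rng.normal(0, 0.3, text.shape) for _ in range(20)]
avg = integrate_frames(frames)

noise_single = np.abs(frames[0] - text).mean()       # error of one frame
noise_avg = np.abs(avg - text).mean()                # error after integration
```

Averaging N frames reduces the standard deviation of independent noise by a factor of sqrt(N), which is why the integrated image is a better input for binarization and OCR.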
Plan
Introduction
Detection in still images
Detection in video sequences
Character segmentation
Experimental results
Conclusion
Adaptive binarization
Niblack’s adaptive method: T(x,y) = m(x,y) + k·s(x,y), where m and s are the mean and standard deviation of the gray values in a window centered on (x,y).
Sauvola’s improvement: T(x,y) = m(x,y)·(1 + k·(s(x,y)/R − 1)), which damps the threshold in low-contrast (background) windows; R is the dynamic range of the standard deviation.
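Both thresholds can be sketched directly from the formulas; the window size, the k values and the dark-text-on-light-background assumption are illustrative:

```python
import numpy as np

def local_stats(img, w):
    """Mean and standard deviation over a (2w+1)x(2w+1) window at each pixel."""
    h, ww = img.shape
    m = np.zeros((h, ww)); s = np.zeros((h, ww))
    for i in range(h):
        for j in range(ww):
            win = img[max(0, i - w):i + w + 1, max(0, j - w):j + w + 1]
            m[i, j], s[i, j] = win.mean(), win.std()
    return m, s

def niblack(img, w=7, k=-0.2):
    """Niblack: T = m + k*s (k < 0 for dark text on a light background)."""
    m, s = local_stats(img, w)
    return img < m + k * s              # True = text (dark) pixel

def sauvola(img, w=7, k=0.5, R=128.0):
    """Sauvola: T = m * (1 + k*(s/R - 1)), R = dynamic range of the std. dev."""
    m, s = local_stats(img, w)
    return img < m * (1.0 + k * (s / R - 1.0))

# Dark "strokes" (gray value 30) on a light page (gray value 200).
img = np.full((20, 20), 200.0); img[5:15, ::4] = 30.0
text_n = niblack(img)
text_s = sauvola(img)
```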
Our solution: contrast maximization
The threshold is derived from:
• the contrast at the center of the window,
• the maximum local contrast,
• the contrast of the window,
• the set of pixels to keep, and the resulting threshold.
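A sketch of the resulting threshold, assuming the commonly cited published form of this method, T = m − k·(1 − s/R)·(m − M), where m and s are the local mean and standard deviation, M is the minimum gray level of the image and R the maximum of s over all windows; the constants and the dark-text assumption are illustrative:

```python
import numpy as np

def contrast_max_threshold(img, w=7, k=0.5):
    """Contrast-maximizing binarization: the threshold is pulled toward the
    local mean in high-contrast (text) windows and toward the global minimum
    gray level in low-contrast (background) windows."""
    h, ww = img.shape
    m = np.zeros((h, ww)); s = np.zeros((h, ww))
    for i in range(h):
        for j in range(ww):
            win = img[max(0, i - w):i + w + 1, max(0, j - w):j + w + 1]
            m[i, j], s[i, j] = win.mean(), win.std()
    M, R = img.min(), s.max()
    T = m - k * (1.0 - s / R) * (m - M)
    return img < T                      # True = text (dark) pixel

# Dark "strokes" (gray value 30) on a light page (gray value 200).
img = np.full((20, 20), 200.0); img[5:15, ::4] = 30.0
text = contrast_max_threshold(img)
```

Because T never rises above the local mean and never falls below the midpoint between the local mean and the global minimum, flat background windows are not binarized into noise, which Niblack's formula tends to do.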
Character segmentation: examples
Methods compared on the original image: Fisher/Otsu, Fisher/Otsu (windowed), Yanowitz-Bruckstein, Yanowitz-Bruckstein + post-processing, Niblack, Sauvola et al., contrast maximization.
Modeling text with a Markov random field
Binarization as a Bayesian maximum a posteriori estimation problem using a Markov random field model.
Prior: models the prior knowledge on the spatial relationships in the image as a MRF.
Likelihood of the observation: depends on the observation and noise model; in our case, Gaussian noise corrected by Niblack’s threshold surface.
Collaboration with the Language and Media Processing Laboratory, University of Maryland (David Doermann).
Still imagesIntroduction Videos ConclusionCharacter segmentation Results
The prior knowledge

Clique energies of the repaired pixel (before → after flipping):
1.05 → 0.95    1.82 → 1.38    1.48 → 1.15    1.85 → 1.30
2.00 → 1.36    2.14 → 1.40    1.80 → 1.79    1.77 → 1.52
1.87 → 1.16    1.84 → 1.57    1.72 → 1.32    1.66 → 1.42
2.00 → 1.28    2.08 → 1.57    1.89 → 1.50    1.93 → 1.69
The clique labelings of the repaired pixel before and after flipping it. All 16 cliques favor the change of the pixel.
• The clique energies (4x4) are learned and interpolated from training data.
• Optimization of the energy function with simulated annealing.
Plan
Introduction
Detection in still images
Detection in video sequences
Character segmentation
Experimental results
Conclusion

System     Recall  Precision  H. mean
Ashida     46      55         50
HWDavid    46      44         45
Wolf       44      30         36
Todoran    18      19         18
Full       6       1          2
Evaluation measures
ICDAR:
• 1-1 matches
• overlap information only
CRISP:
• 1-1, 1-M, M-1 matches
• thresholded matches
• no overlap information
AREA:
• 1-1, 1-M, M-1 matches
• thresholded matches
• overlap information
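All three evaluation schemes report recall, precision and their harmonic mean ("H. mean" in the result tables), which can be computed as:

```python
def harmonic_mean(recall, precision):
    """H = 2*R*P / (R + P): the combined measure used in the result tables."""
    if recall + precision == 0:
        return 0.0
    return 2.0 * recall * precision / (recall + precision)

# Example: recall 46, precision 55 (in %) give a harmonic mean of about 50.
h = harmonic_mean(46.0, 55.0)
```

Unlike the arithmetic mean, the harmonic mean is dragged down by whichever of recall and precision is worse, so a detector cannot score well by maximizing only one of the two.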
Detection Ground truth
AIM2: commercials
AIM3: news
AIM4: cartoons, news
AIM5: news
Detection in still images
Local contrast:
Dataset                              #    G     Eval.  Recall  Precision  H. mean
Artificial text + no text            144  1.49  ICDAR  70.2    18.0       28.6
                                                CRISP  81.2    20.1       32.3
                                                AREA   83.5    26.3       40.0
Artificial + scene text + no text    384  1.84  ICDAR  55.9    17.3       26.4
                                                CRISP  59.1    18.1       27.7
                                                AREA   60.8    21.9       32.2

SVM learning:
Dataset                              #    G     Eval.  Recall  Precision  H. mean
Artificial text + no text            144  1.49  ICDAR  54.8    23.2       32.6
                                                CRISP  59.7    23.9       34.2
                                                AREA   68.8    25.5       37.3
Artificial + scene text + no text    384  1.84  ICDAR  45.1    21.7       29.3
                                                CRISP  47.5    21.5       29.6
                                                AREA   53.6    24.1       33.3
[Example detection results on still images: local contrast vs. SVM learning]
The influence of falling generality
Local contrast SVM learning
Detection in video sequences
Videos                    Contrast   SVM learn.
Classified as text           301        284
Classified as non-text        21         38
Total in ground truth        322        322
Positives                    350        384
False alarms                 947        171
Logos                         75         39
Scene text                    72         90
Total - false alarms         497        513
Total                       1444        684
Recall (%)                  93.5       88.2
Precision (%)               34.4       75.0
Harmonic mean (%)           50.3       81.1
OCR results: local contrast based binarization (recognition by ABBYY FineReader 5.0)

Bin. method     Recall  Precision  H. mean  N. cost
Otsu            47.3    90.5       62.1     56.8
Niblack         80.5    80.4       80.4     40.0
Sauvola         72.4    81.2       76.5     42.3
Max. contrast   85.4    90.7       88.0     23.0
Bayesian estimation using a Markov random field prior

Character recognition rate (%):
Document   1      2      3      4      5      Total
Sauvola    77.1   39.8   77.1   99.0   98.7   79.0
MRF        81.0   40.5   87.3   99.3   98.8   82.0
TREC 2002
Example queries: "Dance", "Energy/Gas", "Music", "Oil", "Airline"/"Air plane".
The type of videos present in the collection does not favor the use of recognized text: text is only rarely present.
Conclusion
We developed a new system for the detection, tracking, enhancement and binarization of text.
Detection performance is high due to the integration of several types of features at a very early stage. The learning method is less sensitive to textured noise in the image.
We proposed a new evaluation method which takes into account several measures of detection quality.
We derived a new binarization method adapted to the type of text found in videos.
2 patents
2 publications in international journals (+1 submitted)
3 publications in international conferences
6 publications in national conferences
Outlook
Possible improvement of the features (e.g. contrast normalization, non-linear texture filters).
Integration of different feature types (statistical, structural, ...)
Multi-orientation processing is not yet complete (new training set, implementation of the post-processing).
Adaptation of the tracking algorithm to general types of motion.
OCR on low-resolution grayscale images.
Usage of a priori knowledge on text in order to decrease the number of false alarms.
Integration of the detected text into an indexing/browsing/segmentation framework.