GALATEAS: making sense of log files to better understand
the customers and users of websites
http://www.galateas.eu
Frédérique Segond, Manager, Research & Development Unit, Viseo Research & Innovation
Domoina RABARIJAONA, Engineer, Objet Direct, Viseo
With the contribution of the GALATEAS partners
GALATEAS partners
• Project coordinator: Xerox Research Centre Europe (France)
• Objet Direct (France)
• CELI (Italy)
• University of Trento (Italy)
• Gonetwork (Italy)
• Bridgeman Art (England)
• Humboldt University (Germany)
• University of Amsterdam (Netherlands)
The GALATEAS project offers digital content providers
●an innovative approach to understanding users' behaviour by analysing language-based information from transaction logs
●technologies facilitating improved navigation and search for multilingual content access
Today content providers cannot customize content and indexing as they don’t know their users.
GALATEAS - EU Project
GALATEAS develops two web services:
Understanding with LangLog: it analyzes transaction logs containing the queries sent to the search engine of a given content provider.
By applying statistical technologies coupled with language-oriented services, it produces reports on the information needs of the users accessing that particular aggregation. LangLog provides generalizations of the actions that information seekers perform in order to find content inside a searchable collection of digital objects.
Globalizing with QueryTrans: it translates queries coming from an external search engine into several target languages, so that the external search engine can return results in languages other than the one in which the query was formulated.
GALATEAS - web services
OBJECTIVES
●Understanding web visitors’ intentions is key to converting them into customers or users
●Understanding web users’ behavior: what users are really looking for, how they look for information, whether they find it or not, and whether I have it in my content
APPROACH
●Applying semantics and semantic technologies to an on-line system for interpreting search and web log data
●Applying reporting tools to cross-reference and visualize the data obtained in the phase above
●Developing an on-line service, independent of or integrable with the client’s web analytics, that provides actionable reports on users’ behavior
RESULTS
●The customer has immediate access to the visitors’ needs
●Improve the conversion rate
●Optimize sales on e-commerce sites
●Optimize communication on PA
●Decide on product innovation
●Improve search techniques
GALATEAS – Search Semantic Analytics
LangLog : Objectives
• To provide an intelligent platform for analyzing user queries sent to a site's resident search engine.
• Problem-driven research: mainstream query analysis systems are completely language-unaware.
Category                  Keyword                                 Searches  Category total
Retirement                pensione casalinga                           188             400
                          pensioni casalinghe                           87
                          pensione casalinghe                           73
                          pensione sociale casalinghe                   52
Transport                 obbligo gomme termiche piemonte 2011         108             307
                          obbligo pneumatici invernali piemonte         61
                          albo autotrasportatori                        50
                          obbligo catene a bordo 2011 piemonte          45
                          gtt                                           43
Environment               clorofluorocarburi                           132             164
                          combustibili fossili                          32
Elderly and handicapped   assistenza domiciliare integrata              53             151
                          accompagnamento                               49
                          assegno mensile di assistenza                 49
School                    cesedi                                        39             109
                          bosso monti                                   36
                          cascina falchera                              34
[Pie chart]
wastes and pollution: 30%
energy: 25%
flora and fauna: 23%
other: 13%
waterworks: 8%
Standard web analytics results
Galateas results
GALATEAS - Search Semantic Analytics
Methodology: Intelligence 1.0
Ex. Google
●Normalize: identify queries that are different on the surface but refer to the same concept
●Cluster: discover natural groupings and unexpected “events”
●Classify: understand the mapping of queries onto “corporate categories”
Technological approach
LogFile parsing/structuring
Linguistic processing
Semantic Enrichment
Morphology
Tagging
NER
Clustering
Classification
Reporting
Automatic download of log files from the web server
Parsing different log file formats (e.g. Apache transaction logs, Solr log files, etc.)
Automatic normalization/correction (format inconsistencies, encoding, etc.)
Filling the database with structured information about query, user sessions, click-through, etc.
LogFile parsing/structuring
Linguistic processing
Language Identification: As a prerequisite of any linguistic processing
Morphological analysis and lemmatization: For better aggregation and further processing
Detection and normalization of special tokens (e.g. abbreviations, spelling variations, etc.)
Named-entity recognition (e.g. persons, products, places, etc.): to understand the mapping of a query to an “object”
Multi-word expression recognition (e.g. “pro loco”, “mezzi pesanti”, etc.)
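As a rough illustration of why lemmatization helps aggregation, the sketch below merges query variants that share the same lemmas. The lemma dictionary is a hypothetical stand-in for a real morphological analyser; the counts are those of the keyword table shown earlier in these slides.

```python
from collections import Counter

# Hypothetical lemma dictionary; a real system would call a morphological analyser.
LEMMAS = {
    "pensioni": "pensione",
    "casalinghe": "casalinga",
}

def normalize(query):
    """Lowercase the query and map each token to its lemma when one is known."""
    return " ".join(LEMMAS.get(tok, tok) for tok in query.lower().split())

def aggregate(counts):
    """Sum the search counts of all queries that share a normalized form."""
    totals = Counter()
    for query, n in counts.items():
        totals[normalize(query)] += n
    return totals

raw_counts = {
    "pensione casalinga": 188,
    "pensioni casalinghe": 87,
    "pensione casalinghe": 73,
}
totals = aggregate(raw_counts)
print(totals)
```

The three surface variants collapse into a single entry, which is the kind of grouping a language-unaware analytics tool would miss.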
LangLog – Linguistic Processing
Query 1: Tableau Mona Lisa (FR)
Query 2: Oil painting la Gioconda (EN)
Query 3: La Gioconda pitturi da Vinci (IT)
Challenge: recognise named entities and deal with multilingual terms in very short texts
Identify appropriate index terms according to what the user is looking for:
Index term 1: La Gioconda
Index term 2: Oil
Index term 3: Painting
Semantic Enrichment
Adding semantic information based on:
Generic ontologies
Domain specific semantic networks
Semantic enrichment makes it possible to analyse the words used in user queries from a “meaning” point of view, at various levels of abstraction
Classification
Query classification according to a given, predefined taxonomy, generally corresponding to the organisation of the website content
Two classification systems:
• Unsupervised, for customers without a learning set (or click-through information).
• Supervised, for customers with a learning set (or click-through information).
Unsupervised Classification
[Diagram. Labels: Wikipedia; Category1, Category2, … Categoryn; Documents1, Documents2, … Documentsn; “Learner”; Vector1, Vector2, Vector3; Query; Documents; Vector; Similarity.]
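One way to read the diagram labels is that each category is represented by a vector built from documents, and a query is assigned to the category whose vector is most similar to its own. The sketch below assumes bag-of-words vectors and cosine similarity; the category documents are toy strings standing in for texts that would come from Wikipedia, and none of this is confirmed as the project's actual implementation.

```python
import math
from collections import Counter

def vectorize(text):
    """Bag-of-words term-frequency vector."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Toy stand-ins for the documents gathered for each category (assumption).
CATEGORY_DOCS = {
    "Art": ["oil painting portrait museum gioconda", "painting fresco canvas artist"],
    "Science": ["hydraulics meteorology anatomy experiment", "engineering hydrometer measurement"],
}

# One vector per category, built from all of its documents.
category_vectors = {cat: vectorize(" ".join(docs)) for cat, docs in CATEGORY_DOCS.items()}

def classify(query):
    """Return the category whose vector is most similar to the query vector."""
    qv = vectorize(query)
    return max(category_vectors, key=lambda cat: cosine(qv, category_vectors[cat]))

print(classify("oil painting la gioconda"))
print(classify("leonardo da vinci hydraulics hydrometer"))
```

No labelled queries are needed here, which is why this style of classifier suits customers without a learning set.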
Supervised Classification
[Diagram. Labels: Query; Feature Computation Module; Features (Meta-Data…, Topic Models…, CT-Info); Learning Algorithm (Decision Tree, Naïve Bayes, SVM); “Model”; Category1, Category2, … Categoryn.]
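Naïve Bayes is one of the learning algorithms named in the diagram. The following is a minimal sketch of a multinomial Naïve Bayes query classifier with add-one smoothing, using only word features; the metadata, topic-model and click-through features from the diagram are left out, and the training set reuses example queries from these slides purely for illustration.

```python
import math
from collections import Counter, defaultdict

def train(examples):
    """Count classes and per-class words from (query, label) pairs."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for query, label in examples:
        class_counts[label] += 1
        for tok in query.lower().split():
            word_counts[label][tok] += 1
            vocab.add(tok)
    return class_counts, word_counts, vocab

def predict(model, query):
    """Return the label with the highest log-posterior for the query."""
    class_counts, word_counts, vocab = model
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n in class_counts.items():
        score = math.log(n / total)  # class prior
        denom = sum(word_counts[label].values()) + len(vocab)
        for tok in query.lower().split():
            if tok in vocab:  # words never seen in training are ignored
                score += math.log((word_counts[label][tok] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

TRAINING = [
    ("leonardo da vinci la gioconda", "Art"),
    ("leonardo da vinci vitruvian man", "Science"),
    ("oil painting la gioconda", "Art"),
    ("la gioconda pitturi da vinci", "Art"),
    ("leonardo da vinci meteorology", "Science"),
]
model = train(TRAINING)
print(predict(model, "leonardo da vinci hydraulics hydrometer"))
print(predict(model, "oil painting gioconda"))
```

With so few words per query, the class prior and a handful of shared tokens decide the outcome, which illustrates why short query texts are hard to classify.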
Clustering
Automatic grouping of semantically homogeneous queries in order to:
• Dynamically cluster sets of queries selected by the user.
• Provide static clusters of large sets of queries (e.g. last week, last month, last year).
“Topic models”: an approach using the LDA (Latent Dirichlet Allocation) algorithm with Gibbs sampling
A "topic" consists of a cluster of words that frequently occur together.
Using words and information added at the semantic enrichment step, topic models can connect words with similar meanings and distinguish between uses of words with multiple meanings.
Queries are clustered on the basis of topical words
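To make the topic-model idea concrete, here is a compact sketch of collapsed Gibbs sampling for LDA, followed by assigning each query to its dominant topic. The hyperparameters (`alpha`, `beta`), the iteration count and the toy queries are assumptions chosen for illustration; the semantic-enrichment features mentioned above are not modelled.

```python
import random

def lda_gibbs(docs, n_topics, n_iter=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA over tokenised queries.

    Returns (doc_topic, topic_word, vocab); doc_topic[d][k] is the number
    of tokens of query d currently assigned to topic k.
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    n_words = len(vocab)
    doc_topic = [[0] * n_topics for _ in docs]
    topic_word = [[0] * n_words for _ in range(n_topics)]
    topic_total = [0] * n_topics
    assignments = []
    # Start from a random topic for every token.
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            z = rng.randrange(n_topics)
            zs.append(z)
            doc_topic[d][z] += 1
            topic_word[z][word_id[w]] += 1
            topic_total[z] += 1
        assignments.append(zs)
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                wid, z = word_id[w], assignments[d][i]
                # Remove the token's current assignment from the counts.
                doc_topic[d][z] -= 1
                topic_word[z][wid] -= 1
                topic_total[z] -= 1
                # Unnormalised conditional probability of each topic.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][wid] + beta)
                    / (topic_total[k] + n_words * beta)
                    for k in range(n_topics)
                ]
                z = rng.choices(range(n_topics), weights=weights)[0]
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][wid] += 1
                topic_total[z] += 1
    return doc_topic, topic_word, vocab

queries = [
    "leonardo da vinci la gioconda",
    "oil painting la gioconda",
    "la gioconda pitturi da vinci",
    "leonardo da vinci meteorology",
    "leonardo da vinci hydraulics hydrometer",
]
docs = [q.split() for q in queries]
doc_topic, topic_word, vocab = lda_gibbs(docs, n_topics=2)
# Each query joins the cluster of its dominant topic.
clusters = [row.index(max(row)) for row in doc_topic]
print(clusters)
```

The sampler is stochastic, so the grouping depends on the seed and is not guaranteed to match the intuitive art/science split on such a tiny corpus.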
Clustering
• The service implements two kinds of clustering:
–Exclusive clustering (or hard clustering): the queries are partitioned into k distinct groups.
–Hierarchical clustering: this solution is built on top of the hard clustering solution by successively merging the most similar clusters into a tree.
–The service returns a Gephi graph representation of the hierarchical solution.
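The step from hard clusters to a hierarchy can be sketched as repeated merging of the two most similar clusters. The example assumes each hard cluster is summarised by a word-count vector and that similarity is cosine similarity; the cluster names and counts are invented, and the nested tuple is only a stand-in for the Gephi graph the service returns.

```python
import math
from collections import Counter
from itertools import combinations

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def build_hierarchy(clusters):
    """Merge the two most similar clusters until a single tree remains."""
    nodes = [(name, Counter(vec)) for name, vec in clusters.items()]
    while len(nodes) > 1:
        i, j = max(
            combinations(range(len(nodes)), 2),
            key=lambda pair: cosine(nodes[pair[0]][1], nodes[pair[1]][1]),
        )
        # The merged node keeps both subtrees and the summed word counts.
        merged = ((nodes[i][0], nodes[j][0]), nodes[i][1] + nodes[j][1])
        nodes = [n for k, n in enumerate(nodes) if k not in (i, j)] + [merged]
    return nodes[0][0]

# Hypothetical hard clusters, each summarised by word counts.
hard_clusters = {
    "gioconda": {"painting": 3, "gioconda": 2},
    "oil paintings": {"painting": 2, "oil": 2},
    "science": {"hydraulics": 2, "meteorology": 1},
}
tree = build_hierarchy(hard_clusters)
print(tree)
```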
LangLog – Classification and clustering
Challenge: perform classification and clustering with short query texts

Query ID   Query                                Class
Query 1    Leonardo da Vinci, La Gioconda       Art
Query 2    Leonardo da Vinci, Vitruvian Man     Science
Query 2    Oil painting, la Gioconda            Art
Query 3    La Gioconda, pitturi, da Vinci       Art
Query 5    Leonardo da Vinci, meteorology       Science

Assign to previously unseen queries a class from your indexing hierarchy:

Query ID   Query                                        Class
Query X    Leonardo da Vinci, hydraulics, hydrometer    Science
Reporting
Data and graphical representations that are easy to read:
Charts for classification and clustering
Temporal view of the most relevant topics
Classification of frequent (normalised) queries
Click-through statistics on results
Overlap analysis of classification/clustering
Analysis of next-page and next-page-click
All technologies are incorporated in a web services framework that allows easy integration of third-party technologies and great extensibility
General framework
[Diagram. Labels: Customer; GALATEAS core; Natural Language Processing services (Named Entity Recognition / Part of Speech Tagging); Semantic services (Semantic similarity); Web services (shown twice). Flows: query logs, customised reports, original query, translated query.]
Various data sources
Extracting knowledge from these data
Business Intelligence and its application to GALATEAS
Goals of BI
Decision support
Standard architecture of a BI system
Data source Integration Datawarehouse Reporting
ETL ETL
ODS
DSA
DWH
DataMining
CDW
BI tools
Methodology
Requirements analysis
●What does the decision-maker need to know in order to improve performance?
Data collection
●Inventory of the data sources
●Setting up a common reference repository (MDM)
Define the data required as output
Design
●Data models
●Processing to apply to the data
Implementation
GALATEAS architecture
ETL ETL
ODS
DSA DWH
DB
Data source Integration Datawarehouse Reporting
ETL
1. Preparation
●identifying the data to extract from the sources
●extracting these data
2. Integration
●cleaning the extracted data
●archiving, if needed
●defining a common format
●transforming the data into this format
ODS
DSA
ETL
Data source Integration
ETL
ODS
DSA
DB
ETL
From the ODS to the DSA
●Filter the lines
●Parse each line according to its format
• Session ID
• Query sent
• Resource accessed
●Format the values
●Distinguish searches from clicks
Common Log Format (Apache; the sample below is the “combined” variant, with referrer and user agent)
88.126.32.37 - - [03/Feb/2012:14:38:46 +0100] "GET /extension/ezaepi/design/ezaepi/images/favicon.ico HTTP/1.1" 200 1406 "-" "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0; .NET CLR 1.1.4322; .NET CLR 2.0.50727; .NET CLR 3.0.4506.2152; .NET CLR 3.5.30729)"
Extended Log File Format (W3C)
[john]&thumb=x150&num=15&page=2&img=24557f6ff4244839ba7b253ff2083194 80 - 208.115.111.242 Mozilla/5.0+(compatible;+DotBot/1.1;+http://www.dotnetdotcom.org/,[email protected]) 200 0 0
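The filtering and parsing steps can be sketched for the Apache-style sample line as follows. The regular expression, the field names and the list of static-file extensions used for filtering are assumptions made for this illustration (the user agent in the sample is shortened); the project's real parser and its rule for separating searches from clicks are not shown in the slides.

```python
import re
from datetime import datetime

# Pattern for one line in Apache "combined" format (field names are illustrative).
COMBINED = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\S+) "(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)

# Requests for static resources carry no search information (assumed filter rule).
STATIC_SUFFIXES = (".ico", ".css", ".js", ".png", ".jpg", ".gif")

def parse_line(line):
    """Return a dict of typed fields, or None when the line does not match."""
    match = COMBINED.match(line)
    if match is None:
        return None
    record = match.groupdict()
    # %b expects English month abbreviations (default C locale).
    record["time"] = datetime.strptime(record["time"], "%d/%b/%Y:%H:%M:%S %z")
    record["status"] = int(record["status"])
    return record

def is_static(record):
    """True when the requested resource is a static file to be filtered out."""
    return record["path"].lower().endswith(STATIC_SUFFIXES)

SAMPLE = (
    '88.126.32.37 - - [03/Feb/2012:14:38:46 +0100] '
    '"GET /extension/ezaepi/design/ezaepi/images/favicon.ico HTTP/1.1" '
    '200 1406 "-" "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1)"'
)
record = parse_line(SAMPLE)
print(record["host"], record["path"], is_static(record))
```

A line such as this favicon request would be dropped at the "filter the lines" step before anything is loaded into the DSA.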
Log enrichment
DB
Language Identification
Lemmatization
Classification
Clustering
NER
Data warehouse
Construction methods
●Top-down
• Build all the dimensions and facts
●Bottom-up
• Build star by star, then consolidate
●Middle-out
• Design the whole, build part by part
The importance of modelling
Modelling problems
●different data models are used
Terminology problems
●one object is referred to by two different names
●the same name refers to two different objects
Constraint incompatibilities
●two equivalent concepts have incompatible constraints
Semantic conflicts, representation conflicts…
Updating the data warehouse
●Periodic rebuild
●Periodic update
●Immediate update
Data warehouse and database

Data warehouse             Database
OLAP                       OLTP
Decision-support system    Operational system
Bulk insert and select     Insert, update, delete, select
Historized                 Volatile
OLAP approach
MOLAP (Multidimensional)
- Multidimensional database
- Pre-aggregated data
- Limits on the volume of data
- MDX
ROLAP (Relational)
- Relational database
- Simulated cubes
- Aggregation tables
- Large data capacity
- SQL
HOLAP (Hybrid)
- Hybrid approach between MOLAP and ROLAP
SOLAP (Spatial)
- Multidimensional model optimized for spatio-temporal analysis
DOLAP (Desktop)
- Small amount of data stored directly on the client machine
OLAP approach
SQL:
  SELECT *
  FROM table
  WHERE x = y
  GROUP BY z
MDX:
  SELECT
    {[Measures].[Search Count]} ON COLUMNS,
    {[DIM1].[Value1]} ON ROWS
  FROM Fact
  WHERE ([DIM2].[Value2])
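For the relational (ROLAP) side, a cube-style aggregation is an ordinary SQL `GROUP BY` over a fact table joined to its dimensions. The sketch below runs such a query against an in-memory SQLite database; the table and column names (`dim_category`, `fact_search`, `search_count`) are hypothetical, and the rows reuse counts from the keyword table shown earlier in these slides.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical star schema: one dimension table and one fact table.
conn.executescript("""
CREATE TABLE dim_category (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE fact_search (category_id INTEGER, keyword TEXT, search_count INTEGER);
""")
conn.executemany(
    "INSERT INTO dim_category VALUES (?, ?)",
    [(1, "Retirement"), (2, "Environment")],
)
conn.executemany(
    "INSERT INTO fact_search VALUES (?, ?, ?)",
    [
        (1, "pensione casalinga", 188),
        (1, "pensioni casalinghe", 87),
        (1, "pensione casalinghe", 73),
        (1, "pensione sociale casalinghe", 52),
        (2, "clorofluorocarburi", 132),
        (2, "combustibili fossili", 32),
    ],
)
# Aggregate the measure along the category dimension.
rows = conn.execute("""
SELECT c.name, SUM(f.search_count)
FROM fact_search AS f
JOIN dim_category AS c ON c.id = f.category_id
GROUP BY c.name
ORDER BY c.name
""").fetchall()
print(rows)
conn.close()
```

An MDX query expresses the same aggregation against a cube definition instead of naming the tables and joins explicitly.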
OLAP approach
Operations on a hypercube
●Rotate
●Drill down
●Roll up
●Slice & dice
●Drill through
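Three of these operations can be illustrated on a tiny in-memory cube: roll up aggregates to a coarser level, drill down returns to a finer one, and slice fixes one dimension to a single value. The fact rows below, including the monthly split and the counts, are invented for the example and are not project data.

```python
from collections import defaultdict

# Invented fact rows with three dimensions (category, keyword, month) and one measure.
FACTS = [
    {"category": "Retirement", "keyword": "pensione casalinga", "month": "2012-01", "count": 3},
    {"category": "Retirement", "keyword": "pensione casalinga", "month": "2012-02", "count": 2},
    {"category": "Retirement", "keyword": "pensioni casalinghe", "month": "2012-01", "count": 1},
    {"category": "Environment", "keyword": "clorofluorocarburi", "month": "2012-02", "count": 4},
]

def aggregate(facts, dims):
    """Sum the measure over every combination of the requested dimensions."""
    totals = defaultdict(int)
    for fact in facts:
        totals[tuple(fact[d] for d in dims)] += fact["count"]
    return dict(totals)

def slice_cube(facts, dim, value):
    """Keep only the facts where one dimension has the given value."""
    return [fact for fact in facts if fact[dim] == value]

# Roll up: aggregate to the coarser category level.
by_category = aggregate(FACTS, ["category"])
# Drill down: break the same totals down by keyword.
by_keyword = aggregate(FACTS, ["category", "keyword"])
# Slice: restrict the cube to one month, then aggregate.
january = aggregate(slice_cube(FACTS, "month", "2012-01"), ["category"])

print(by_category)
print(by_keyword)
print(january)
```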
Reporting
Web portal
Static reports: iReport
OLAP cube (hypercube): Mondrian (server), JPivot (client)
Demo
BI trends
Mobile BI
In-memory functions
Collaboration features
Big Data
Thank you!
Thanks to natural language processing technologies, we now have a clear view of customers' needs!