
Post on 04-Sep-2020


GALATEAS: making sense of log files to better understand the customers and users of websites

http://www.galateas.eu

Frédérique Segond, Manager, Research & Development Unit, Viseo Research & Innovation

Domoina RABARIJAONA, Engineer, Objet Direct, Viseo

With the contribution of the GALATEAS partners

GALATEAS partners

• Project coordinator: Xerox Research Centre Europe (France)

• Objet Direct (France)

• CELI (Italy)

• University of Trento (Italy)

• Gonetwork (Italy)

• Bridgeman Art (England)

• Humboldt University (Germany)

• University of Amsterdam (Netherlands)

The GALATEAS project offers digital content providers

● an innovative approach to understanding users' behaviour by analysing language-based information from transaction logs

● technologies facilitating improved navigation and search for multilingual content access

Today content providers cannot customize content and indexing as they don’t know their users.

GALATEAS - EU Project

GALATEAS develops two web services:

Understanding with LangLog: It analyzes transaction logs containing the queries submitted to a given content provider's search engine.

By applying statistical technologies coupled with language-oriented services, it produces reports on the informational needs of the users accessing that particular aggregation. LangLog provides generalizations of the actions that information seekers perform in order to find content inside a searchable collection of digital objects.

Globalizing with QueryTrans: It translates queries coming from an external search engine into several target languages, so that the external search engine can return results to the user in languages other than the one in which the query was formulated.

GALATEAS - web services


Understanding web visitors’ intentions is key to converting them into customers or users

OBJECTIVES

Understanding web users’ behavior: what users are really looking for, how they look for information, whether they find it or not, and whether the content provider actually has it

APPROACH

Applying semantics and semantic technologies to an on-line system for interpreting the search and web log data

Applying reporting tools to cross and visualize data obtained in the above phase

RESULTS

Development of an on-line service, independent of or integrable with the client’s web analytics, that provides actionable reports on users’ behavior

The customer accesses the visitors’ needs immediately

• Improving conversion rate
• Optimizing sales on e-commerce sites
• Optimizing communication on PA
• Deciding on product innovation
• Improving search techniques

GALATEAS – Search Semantic Analytics


LangLog : Objectives

• To provide an intelligent platform for analyzing the user queries submitted to a site's resident search engine.

• Problem-driven research: mainstream query analysis systems are completely language-unaware.

Searches per keyword, grouped by category (category totals in parentheses)

Retirement (400)
• pensione casalinga: 188
• pensioni casalinghe: 87
• pensione casalinghe: 73
• pensione sociale casalinghe: 52

Transport (307)
• obbligo gomme termiche piemonte 2011: 108
• obbligo pneumatici invernali piemonte: 61
• albo autotrasportatori: 50
• obbligo catene a bordo 2011 piemonte: 45
• gtt: 43

Environment (164)
• clorofluorocarburi: 132
• combustibili fossili: 32

Elderly and handicapped (151)
• assistenza domiciliare integrata: 53
• accompagnamento: 49
• assegno mensile di assistenza: 49

School (109)
• cesedi: 39
• bosso monti: 36
• cascina falchera: 34

Pie chart: wastes and pollution 30%; energy 25%; flora and fauna 23%; other 13%; waterworks 8%

Standard web analytics results

Galateas results

GALATEAS - Search Semantic Analytics


Ex. Google

Methodology: Intelligence 1.0

• Normalize: identify queries that differ on the surface but refer to the same concept
• Cluster: discover natural groupings and unexpected “events”
• Classify: understand how queries map onto “corporate categories”

Technological approach

LogFile parsing/structuring

Linguistic processing

Semantic Enrichment

Morphology

Tagging

NER

Clustering

Classification

Reporting

Automatic download of logFiles from web server

Parsing different logFile formats (e.g. Apache transaction log, Solr logfiles, etc.)

Automatic normalization/correction (format inconsistencies, encoding, etc.)

Filling the database with structured information about query, user sessions, click-through, etc.

LogFile parsing/structuring


Linguistic processing

Language Identification: As a prerequisite of any linguistic processing

Morphological analysis and lemmatization: For better aggregation and further processing

Detection and normalization of special tokens (e.g. abbreviations, spelling variations, etc.)

Named-entity recognition (e.g. persons, products, places, etc.): To understand mapping of query to “object”

Multi-word expression recognition (e.g. “pro loco”, “mezzi pesanti”, etc.)
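The multi-word expression step can be illustrated with a greedy longest-match lookup. This is only a minimal sketch: the lexicon below is hypothetical and contains just the two expressions quoted on the slide, and the function name `tokenize_with_multiwords` is an assumption, not part of LangLog.

```python
# Hypothetical lexicon of multi-word expressions (the two slide examples).
MULTIWORD_LEXICON = {("pro", "loco"), ("mezzi", "pesanti")}
MAX_LEN = max(len(entry) for entry in MULTIWORD_LEXICON)


def tokenize_with_multiwords(query: str) -> list[str]:
    """Split a query on whitespace, merging known multi-word expressions."""
    words = query.lower().split()
    tokens = []
    i = 0
    while i < len(words):
        # Try the longest candidate first so that longer expressions win.
        for n in range(min(MAX_LEN, len(words) - i), 1, -1):
            if tuple(words[i:i + n]) in MULTIWORD_LEXICON:
                tokens.append(" ".join(words[i:i + n]))
                i += n
                break
        else:
            # No expression starts here: keep the single word.
            tokens.append(words[i])
            i += 1
    return tokens
```

With this lexicon, a query such as "divieto mezzi pesanti piemonte" is split into three tokens, with "mezzi pesanti" kept as one unit for later aggregation.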


LangLog – Linguistic Processing

Query 1 Tableau Mona Lisa (F)

Query 2 Oil painting la Gioconda (EN)

Query 3 La Gioconda pitturi da Vinci (IT)

Challenge - Recognise named entities and deal with multilingual terms in very short texts

Index term 1: La Gioconda

Index term 2: Oil

Index term 3: Painting

Identify appropriate index terms according to what the user is looking for


Semantic Enrichment

Adding semantic information based on:

Generic ontologies

Domain specific semantic networks

Semantic enrichment makes it possible to analyse the words used in user queries from the point of view of their meaning, at different levels of abstraction

Classification

Query classification according to a given/predefined taxonomy, generally corresponding to the organisation of web site content

Two classification systems:
• Unsupervised, for customers without a learning set (or click-through information).
• Supervised, for customers with a learning set (or click-through information)

Unsupervised Classification

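[Diagram: categories Category1, Category2, …, Categoryn, each paired with a document set (Documents1, Documents2, …, Documentsn) drawn from Wikipedia. A “Learner” turns the document sets into vectors (Vector1, Vector2, Vector3). The query is likewise mapped to documents and to a vector, which is compared with the category vectors by similarity.]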
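The similarity step of the unsupervised approach can be sketched with plain bag-of-words vectors and cosine similarity. This is an illustration under stated assumptions, not the GALATEAS implementation: the short texts in `CATEGORY_DOCS` are hypothetical stand-ins for the Wikipedia documents attached to each category, and the function names are invented for the example.

```python
import math
from collections import Counter

# Hypothetical category descriptions standing in for the Wikipedia documents.
CATEGORY_DOCS = {
    "Art": "painting oil canvas portrait museum artist gioconda",
    "Science": "physics hydraulics meteorology anatomy experiment measurement",
}


def vectorize(text):
    """Bag-of-words vector: term -> number of occurrences."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def classify(query):
    """Return the category whose document vector is most similar to the query."""
    query_vector = vectorize(query)
    scores = {
        category: cosine(query_vector, vectorize(doc))
        for category, doc in CATEGORY_DOCS.items()
    }
    return max(scores, key=scores.get)
```

No labelled queries are needed here: the only supervision is the text attached to each category, which is why this route suits customers without a learning set.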

Supervised Classification

[Diagram: a Feature Computation Module derives features for each query (Meta-Data, Topic Models, CT-Info, …). A Learning Algorithm (Decision Tree, Naïve Bayes, SVM) produces a “Model”, which assigns the query to one of Category1, Category2, …, Categoryn.]
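Of the learning algorithms named on this slide, Naïve Bayes is the simplest to sketch. The example below is a minimal multinomial Naïve Bayes with add-one smoothing that uses only query words as features. The training pairs are the labelled example queries that appear in the classification table later in this deck; the function names and the choice to ignore words never seen in training are assumptions made for the illustration.

```python
import math
from collections import Counter, defaultdict

# Labelled queries taken from the example table later in the deck.
TRAINING = [
    ("leonardo da vinci la gioconda", "Art"),
    ("leonardo da vinci vitruvian man", "Science"),
    ("oil painting la gioconda", "Art"),
    ("la gioconda pitturi da vinci", "Art"),
    ("leonardo da vinci meteorology", "Science"),
]


def train(examples):
    """Count classes and per-class word frequencies."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for text, label in examples:
        class_counts[label] += 1
        for word in text.lower().split():
            word_counts[label][word] += 1
            vocabulary.add(word)
    return class_counts, word_counts, vocabulary


def predict(model, query):
    """Return the class with the highest log-posterior for the query."""
    class_counts, word_counts, vocabulary = model
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in class_counts.items():
        score = math.log(count / total)  # class prior
        denominator = sum(word_counts[label].values()) + len(vocabulary)
        for word in query.lower().split():
            if word in vocabulary:  # assumption: unseen words are ignored
                score += math.log((word_counts[label][word] + 1) / denominator)
        if score > best_score:
            best_label, best_score = label, score
    return best_label
```

A production feature module would add the other feature families listed on the slide (meta-data, topic models, click-through information); they are left out here to keep the sketch short.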


Clustering

Automatic grouping of semantically homogeneous queries in order to
• Dynamically cluster sets of queries selected by the user.
• Provide static clusters of large sets of queries (e.g. last week, last month, last year)

“Topic Models” approach, using the LDA (Latent Dirichlet Allocation) algorithm with Gibbs sampling

A "topic" consists of a cluster of words that frequently occur together.

Using words and information added at the semantic enrichment step, topic models can connect words with similar meanings and distinguish between uses of words with multiple meanings.

Queries are clustered on the basis of topical words
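To make the topic-model idea concrete, here is a compact collapsed Gibbs sampler for LDA, followed by a helper that assigns each query to its dominant topic. It is a teaching sketch, not the GALATEAS service: the hyperparameters `alpha` and `beta`, the iteration count, and the function names are assumptions, and the sample queries come from the keyword table earlier in the deck.

```python
import random


def lda_gibbs(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA over tokenised queries.

    Returns (doc_topic counts, topic_word counts, vocabulary).
    """
    rng = random.Random(seed)
    vocab = sorted({word for doc in docs for word in doc})
    word_id = {word: i for i, word in enumerate(vocab)}
    vocab_size = len(vocab)

    doc_topic = [[0] * num_topics for _ in docs]
    topic_word = [[0] * vocab_size for _ in range(num_topics)]
    topic_total = [0] * num_topics
    assignments = []

    # Random initial topic for every token.
    for d, doc in enumerate(docs):
        topics = []
        for word in doc:
            k = rng.randrange(num_topics)
            topics.append(k)
            doc_topic[d][k] += 1
            topic_word[k][word_id[word]] += 1
            topic_total[k] += 1
        assignments.append(topics)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, word in enumerate(doc):
                w = word_id[word]
                k = assignments[d][i]
                # Remove the token's current assignment from the counts.
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                # Conditional distribution over topics for this token.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1
    return doc_topic, topic_word, vocab


def cluster_queries(docs, num_topics, **kwargs):
    """Assign each query to the topic that covers most of its tokens."""
    doc_topic, _, _ = lda_gibbs(docs, num_topics, **kwargs)
    return [max(range(num_topics), key=row.__getitem__) for row in doc_topic]


QUERIES = [query.split() for query in [
    "pensione casalinga",
    "pensione sociale casalinghe",
    "obbligo gomme termiche piemonte",
    "obbligo pneumatici invernali piemonte",
]]
```

Calling `cluster_queries(QUERIES, num_topics=2)` returns one topic index per query. Because sampling is random, the grouping depends on the seed and, with queries this short, benefits from the extra words added at the semantic enrichment step.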


Clustering

• The service implements two kinds of clustering:
– An Exclusive Clustering (or hard clustering). The queries are partitioned into k distinct groups
– A Hierarchical Clustering Solution. This solution is built on top of the hard clustering solution by successively merging the most similar clusters into a tree.
– The service returns a Gephi graph representation of the Hierarchical Solution
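The step from a hard clustering to a hierarchy can be sketched as repeated merging of the two most similar clusters. The example below assumes each hard cluster is represented by the summed word counts of its queries and that similarity is the cosine between those vectors; both choices, and the cluster names, are assumptions made for illustration, since the slides do not say which similarity measure the service uses.

```python
import math
from collections import Counter


def centroid(queries):
    """Summed bag-of-words vector of all queries in a cluster."""
    return Counter(word for query in queries for word in query.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_hierarchy(hard_clusters):
    """Merge the two most similar clusters until a single root remains.

    hard_clusters maps a cluster name to its list of queries.
    Returns the merges in order as (left, right, merged_name, similarity).
    """
    nodes = {name: centroid(queries) for name, queries in hard_clusters.items()}
    merges = []
    while len(nodes) > 1:
        names = sorted(nodes)
        similarity, left, right = max(
            (cosine(nodes[a], nodes[b]), a, b)
            for i, a in enumerate(names)
            for b in names[i + 1:]
        )
        merged = f"({left}+{right})"
        nodes[merged] = nodes.pop(left) + nodes.pop(right)
        merges.append((left, right, merged, similarity))
    return merges


# Hypothetical hard clusters built from queries in the keyword table.
HARD_CLUSTERS = {
    "tyres_2011": ["obbligo gomme termiche piemonte 2011"],
    "tyres_winter": ["obbligo pneumatici invernali piemonte"],
    "pensions": ["pensione casalinga", "pensione sociale casalinghe"],
}
```

Each merge record corresponds to one internal node of the tree, so the list can be written out as parent and child edges for a graph tool.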

Clustering

Query ID, query and class:

• Query 1: Leonardo da Vinci, La Gioconda (class: Art)
• Query 2: Leonardo da Vinci, Vitruvian Man (class: Science)
• Query 2: Oil painting, la Gioconda (class: Art)
• Query 3: La Gioconda, pitturi, da Vinci (class: Art)
• Query 5: Leonardo da Vinci, meteorology (class: Science)

Previously unseen query:

• Query X: Leonardo da Vinci, hydraulics, hydrometer (class: Science)

LangLog – Classification and clustering

Challenge – Perform classification and clustering with short query texts

Assign to previously unseen queries a class from your indexing hierarchy


Reporting

Data and graphical representations that are easy to read:

Charts for classification and clustering

Temporal view of most relevant topics

Classification of frequent queries, normalised

click-through statistics on results

Overlapping analysis of classification/clustering

Analysis of next-page and next-page-click

All technologies are incorporated in a web services framework that allows easy integration of third-party technologies and great extensibility

General framework

[Diagram: the Customer exchanges data with the GALATEAS core: query logs and original queries go in, customised reports and translated queries come out. Through web services, the core calls Natural Language Processing services (Named Entity Recognition / Part of Speech Tagging) and Semantic services (semantic similarity).]

Different data sources

Extracting knowledge from these data

Business Intelligence and its application to GALATEAS

Goals of BI

Decision support

Standard architecture of a BI system

[Diagram labels: Data source, Integration, Datawarehouse, Reporting; ETL, ETL, ODS, DSA, DWH, Data Mining, CDW]

BI tools

Methodology

Requirements analysis
● What does the decision-maker need to know in order to improve performance?

Data collection
● Inventory of the data sources
● Setting up a common repository (MDM)

Defining the data expected as output

Design
● Data models
● Processing to apply to the data

Implementation

GALATEAS architecture

[Diagram labels: Data source, Integration, Datawarehouse, Reporting; ETL, ETL, ODS, DSA, DWH, DB]

ETL

1. Preparation
● identifying the data to extract from the sources
● extracting these data

2. Integration
● cleaning the extracted data
● archiving, if needed
● defining a common format
● transforming the data into this format

ODS

DSA

ETL

[Diagram labels: Data source, Integration; ETL, ODS, DSA, DB]

ETL

From the ODS to the DSA

● Filter the lines
● Parse each line according to its format
  • Session ID
  • Query sent
  • Resource accessed
● Format the values
● Distinguish searches from clicks

Common Log Format, combined variant (Apache)

88.126.32.37 - - [03/Feb/2012:14:38:46 +0100] "GET /extension/ezaepi/design/ezaepi/images/favicon.ico HTTP/1.1" 200 1406 "-" "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0; .NET CLR 1.1.4322; .NET CLR 2.0.50727; .NET CLR 3.0.4506.2152; .NET CLR 3.5.30729)"

Extended Log File Format (W3C)

[john]&thumb=x150&num=15&page=2&img=24557f6ff4244839ba7b253ff2083194 80 - 208.115.111.242 Mozilla/5.0+(compatible;+DotBot/1.1;+http://www.dotnetdotcom.org/,+crawler@dotnetdotcom.org) 200 0 0
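The ODS-to-DSA steps listed above (filter, parse, format, distinguish searches from clicks) can be sketched for the first sample line, which follows the Apache combined layout. Several points are assumptions made for the illustration: the name of the search parameter (`q`), the list of static-file suffixes used for filtering, and the function name. Session identifiers are not extracted because the sample lines do not show where they are stored.

```python
import re
from urllib.parse import urlparse, parse_qs

# Pattern for the Apache combined layout of the first sample line.
COMBINED = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) (?P<protocol>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\S+) "(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)

# Assumptions: which requests count as noise, and how a search is recognised.
STATIC_SUFFIXES = (".ico", ".png", ".jpg", ".gif", ".css", ".js")
SEARCH_PARAM = "q"  # hypothetical name of the query-string parameter


def parse_line(line):
    """Return a structured record, or None if the line is filtered out."""
    match = COMBINED.match(line)
    if match is None:
        return None  # unknown format
    record = match.groupdict()
    url = urlparse(record["url"])
    if url.path.lower().endswith(STATIC_SUFFIXES):
        return None  # static resource, not a user action
    params = parse_qs(url.query)
    if SEARCH_PARAM in params:
        record["kind"] = "search"
        record["query"] = params[SEARCH_PARAM][0]
    else:
        record["kind"] = "click"
        record["query"] = None
    record["resource"] = url.path
    record["status"] = int(record["status"])  # format the value as a number
    return record
```

Applied to the favicon request shown above, `parse_line` returns `None`, since that line is a static resource and would be filtered before loading into the DSA.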


Log enrichment

DB

Language Identification

Lemmatization

Classification

Clustering

NER


Datawarehouse

Construction methods

● Top-down
• All dimensions and facts are built at once

● Bottom-up
• Built star by star, then consolidated

● Middle-out
• Designed as a whole, built part by part

Importance of modelling

Modelling problems
● different data models are in use

Terminology problems
● one object is referred to by two different names
● the same name refers to two different objects

Constraint incompatibilities
● two equivalent concepts have incompatible constraints

Semantic conflicts, representation conflicts…

Updating the datawarehouse

● Periodic rebuild

● Periodic update

● Immediate update

Datawarehouse and Database

• Datawarehouse: OLAP; decision-support system; bulk insert and select; historised
• Database: OLTP; operational system; insert, update, delete, select; volatile

OLAP approach

MOLAP (Multidimensional)
- Multidimensional database
- Pre-aggregated data
- Limits on data volume
- MDX

ROLAP (Relational)
- Relational database
- Cube simulation
- Aggregation tables
- Large data capacity
- SQL

HOLAP (Hybrid) - Hybrid approach between MOLAP and ROLAP

SOLAP (Spatial) - Multidimensional model optimised for spatio-temporal analysis

DOLAP (Desktop) - Small amount of data stored directly on the client machine

OLAP approach

SQL

SELECT z, COUNT(*)
FROM fact_table
WHERE x = y
GROUP BY z

MDX

SELECT
  {[Measures].[Search Count]} ON COLUMNS,
  {[DIM1].[Value1]} ON ROWS
FROM [Fact]
WHERE ([DIM2].[Value2])

OLAP approach

Operations on a hypercube
● Rotate
● Drill down
● Roll up
● Slice & dice
● Drill through
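Two of these operations, roll up and drill down, can be simulated over a relational table in the ROLAP style described earlier. The sketch below uses Python's built-in `sqlite3` module and a hypothetical `search_fact` table filled with figures from the keyword table near the start of the deck. The table name, column names and function names are assumptions; the actual GALATEAS reporting stack uses Mondrian and is not reproduced here.

```python
import sqlite3

# Search counts taken from the keyword table earlier in the deck.
FACTS = [
    ("Retirement", "pensione casalinga", 188),
    ("Retirement", "pensioni casalinghe", 87),
    ("Retirement", "pensione casalinghe", 73),
    ("Retirement", "pensione sociale casalinghe", 52),
    ("Environment", "clorofluorocarburi", 132),
    ("Environment", "combustibili fossili", 32),
]

conn = sqlite3.connect(":memory:")
# Hypothetical fact table: two dimension levels and one measure.
conn.execute(
    "CREATE TABLE search_fact (category TEXT, keyword TEXT, searches INTEGER)"
)
conn.executemany("INSERT INTO search_fact VALUES (?, ?, ?)", FACTS)


def roll_up():
    """Aggregate from keyword level up to category level."""
    return conn.execute(
        "SELECT category, SUM(searches) FROM search_fact "
        "GROUP BY category ORDER BY category"
    ).fetchall()


def drill_down(category):
    """Go back from one category to the keywords it contains."""
    return conn.execute(
        "SELECT keyword, searches FROM search_fact "
        "WHERE category = ? ORDER BY searches DESC",
        (category,),
    ).fetchall()
```

Here `roll_up()` yields the category totals (164 for Environment, 400 for Retirement), and `drill_down("Environment")` returns the two keywords behind the Environment total. Restricting to a single category, as the `WHERE` clause does, is also the simplest form of a slice.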

Reporting

Web Portal

Static reports: iReport

OLAP cube (hypercube): Mondrian (server), JPivot (client)

Demo

BI trends

Mobile BI

In-memory functions

Collaboration features

Big Data

Thank you!

Thanks to natural language processing technologies, we now have a clear view of customers' needs!
