ALGORITHMS FOR INFERRING USER PROFILES ON SOCIAL NETWORKS
USING A PARTIAL SOCIAL GRAPH
THESIS PRESENTED TO THE FACULTY OF GRADUATE STUDIES AND
RESEARCH IN FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF
MASTER OF SCIENCE IN COMPUTER SCIENCE
RAÏSSA YAPAN DOUGNON
DEPARTMENT OF COMPUTER SCIENCE
FACULTY OF SCIENCES
UNIVERSITÉ DE MONCTON
September 2015
COMPOSITION OF THE JURY
President of the jury:
Julien Chiasson
Professor of Computer Science, Université de Moncton
External examiner:
Usef Faghihi
Professor of Computer Science, Cameron University, United States
Internal examiner:
Mustapha Kardouchi
Professor of Computer Science, Université de Moncton
Thesis supervisor:
Philippe Fournier-Viger
Professor of Computer Science, Université de Moncton
ACKNOWLEDGEMENTS
My most sincere thanks go to Philippe Fournier-Viger, my thesis supervisor. Throughout this work, Philippe was always available and enthusiastic, and he helped me a great deal. He guided me, supported me and, above all, encouraged me in my research. Thank you, Philippe, for your generosity in research training and supervision.
To my parents Wifried and Angèle DOUGNON, for their love and their unwavering commitment to my success on life's path.
To my sisters and my brother, whose expectations I hope to live up to. Thank you for the moral support and encouragement you have given me.
A deep thank you to Harouna M Coulibaly, for his constant and unconditional love, as well as his support and encouragement throughout my studies.
To my uncle and friend Hamadoun Baba Touré, who gave me his support at the right moments, my warm gratitude and thanks.
Finally, I also wish to thank the members of the jury who agreed to evaluate my work: Usef Faghihi, Mustapha Kardouchi and Philippe Fournier-Viger.
TABLE DES MATIÈRES
RÉSUMÉ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viii
RÉSUMÉ (anglais) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ix
INTRODUCTION GÉNÉRALE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
CHAPITRE I
PGPI : ALGORITHME D’INFÉRENCE DE PROFILS UTILISATEURS SUR LES RÉSEAUX SOCIAUX EN UTILISANT UN GRAPHE SOCIAL PARTIEL . . . . . . . . . . 5
1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2 Travaux antérieurs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
3 Définition du problème . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
4 L’algorithme proposé . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
4.1 Inférence de profils utilisateurs en utilisant les noeuds et les liens d’amitié . . . 12
4.2 Inférence de profils utilisateurs en utilisant les groupes et les publications . . . . 13
4.3 Inférence de profils utilisateurs en utilisant les noeuds, les liens, les groupes et les publications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
5 Évaluation expérimentale . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
CHAPITRE II
PGPI+ : ALGORITHME AMÉLIORÉ D’INFÉRENCE DE PROFILS UTILISATEURS SUR LES RÉSEAUX SOCIAUX À PARTIR D’UN GRAPHE SOCIAL PARTIEL . . . . . . 25
1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2 Travaux antérieurs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
3 Définition du problème . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
4 L’algorithme proposé PGPI+ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
4.1 L’algorithme PGPI . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
4.2 Optimisation de la couverture et de l’exactitude . . . . . . . . . . . . . . . . . . 33
4.3 Extension de PGPI+ pour évaluer les attributs numériques . . . . . . . . . . . . 34
4.4 Extension de PGPI+ pour évaluer la certitude des prédictions . . . . . . . . . . . 35
5 Évaluation expérimentale . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
CONCLUSION GÉNÉRALE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
RÉFÉRENCES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
TABLE OF CONTENTS
ABSTRACT (French) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viii
ABSTRACT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ix
GENERAL INTRODUCTION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
CHAPTER I
PGPI: INFERRING USER PROFILES IN SOCIAL NETWORKS USING A PARTIAL SOCIAL GRAPH . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
3 Problem Definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
4 The Proposed Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
4.1 Inferring user profiles using nodes and links . . . . . . . . . . . . . . . . . . . . 12
4.2 Inferring user profiles using groups and publications . . . . . . . . . . . . . . . 13
4.3 Inferring user profiles using nodes, links, groups and publications . . . . . . . . 15
5 Experimental Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
CHAPTER II
PGPI+: ACCURATE SOCIAL NETWORK USER PROFILING UNDER THE CONSTRAINT OF A PARTIAL SOCIAL GRAPH . . . . . . . . . . . . . . . . . . . . . 25
1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2 Related work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
3 Problem definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
4 The proposed PGPI+ algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
4.1 The PGPI algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
4.2 Optimizations to improve accuracy and coverage . . . . . . . . . . . . . . . . . 33
4.3 Extension to handle numerical attributes . . . . . . . . . . . . . . . . . . . . . . 34
4.4 Extension to evaluate the certainty of predictions . . . . . . . . . . . . . . . . . 35
5 Experimental evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
GENERAL CONCLUSION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
REFERENCES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
LISTE DES TABLEAUX
I-1 Best accuracy/coverage results for each attribute . . . . . . . . . . . . . . . . . . . . 24
II-1 Best accuracy results for nominal attributes on Facebook . . . . . . . . . . . . . . . 41
II-2 Best accuracy results for nominal attributes on Pokec . . . . . . . . . . . . . . . . . 42
II-3 Average error and standard deviation for numeric attributes . . . . . . . . . . . . . . 43
II-4 Best accuracy for nominal attributes using the full social graph . . . . . . . . . . . . 44
LISTE DES FIGURES
I-1 Product of accuracy and coverage w.r.t. number of accessed facts . . . . . . . . . . . 17
I-2 Accuracy and coverage w.r.t. number of accessed facts . . . . . . . . . . . . . . . . 18
I-3 Number of accessed facts w.r.t. maxFacts . . . . . . . . . . . . . . . . . . . . . . . 19
I-4 Product of accuracy and coverage w.r.t. ratioFacts . . . . . . . . . . . . . . . . . . 20
I-5 Product of accuracy and coverage (left) and number of accessed facts (right) w.r.t. maxDistance for PGPI-N . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
II-1 Accuracy w.r.t. number of accessed facts for nominal attributes . . . . . . . . . . . . 39
II-2 Average error, standard deviation and coverage w.r.t. certainty for the "year" attribute . . 44
II-3 Average error, standard deviation and coverage w.r.t. certainty for the "age" attribute . . 45
LISTE DES ALGORITHMES
1 The PGPI-N algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2 The PGPI-G algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
3 The PGPI algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
4 The PGPI algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
5 Certainty assessment of a nominal attribute value . . . . . . . . . . . . . . . . . . . 37
RÉSUMÉ
Les réseaux sociaux en ligne occupent une place importante dans la société d’aujourd’hui. Un
problème important pour le ciblage publicitaire dans les réseaux sociaux est que les utilisateurs y
divulguent souvent peu d’informations personnelles. Pour pallier ce problème, un certain nombre
d’algorithmes ont été proposés pour inférer automatiquement les profils des utilisateurs sur la base
d’informations publiques. Cependant, ces algorithmes présentent tous au moins deux des quatre limites suivantes : (1) ils supposent que le graphe social complet est disponible, (2) ils n’exploitent pas le contenu riche disponible sur certains réseaux sociaux, comme l’appartenance à des groupes et les "J’aime", (3) ils traitent les attributs numériques comme des attributs nominaux et (4) ils n’évaluent pas la certitude des prédictions. Afin de pallier les limites (1) et (2), nous présentons un algorithme appelé Partial Graph Profile Inference (PGPI). Cet algorithme permet l’inférence des profils utilisateurs en l’absence d’un graphe social complet. PGPI ne requiert aucun entraînement pour être appliqué, et permet de contrôler finement le compromis entre la quantité d’informations à laquelle on peut accéder pour faire une inférence et l’exactitude de l’inférence. De plus, PGPI est conçu
pour utiliser des informations variées sur les utilisateurs (liens d’amitié, profils utilisateurs, apparte-
nance à des groupes d’intérêt, et les « vues » et les « J’aime ») présents sur certains réseaux sociaux
comme Facebook. Afin de pallier les limites (3) et (4), nous présentons une version améliorée de
PGPI, appelée PGPI+ (Partial Graph Profile Inference+). Cette version ajoute la prise en compte des attributs numériques à celle des attributs nominaux. Elle permet également l’évaluation de la certitude des prédictions. Une évaluation expérimentale avec plus de 30 000 profils utilisateurs des réseaux sociaux Facebook et Pokec montre que PGPI et PGPI+ infèrent les profils utilisateurs avec une plus grande exactitude que plusieurs approches de la littérature, tout en accédant à moins d’information du graphe social. De plus, un résultat intéressant est que certains attributs, comme le statut (étudiant/professeur) et le genre, peuvent être prédits avec plus de 95 % d’exactitude.
Mots clés : réseaux sociaux, inférence, profils utilisateurs, graphe partiel
ABSTRACT
Online social networks have an important place in today’s society. An important problem for
ad targeting on social networks is that users often do not reveal much personal information. To
address this issue, various algorithms have been proposed to automatically infer user profiles using
the publicly disclosed information. Nevertheless, all these algorithms suffer from at least two of
the following limitations: (1) they assume that the full social graph is available for training, (2) they do not exploit the rich information available on some social networks, such as group memberships, "views" and "likes", (3) they treat numeric attributes as nominal attributes, and (4) they do not assess the certainty of predictions.
To address limitations 1 and 2, we propose an algorithm named Partial Graph Profile Inference
(PGPI). The PGPI algorithm can accurately infer user profiles under the constraint of a partial social
graph. PGPI does not require training, and it lets the user select the trade-off between the amount
of information used for inference and the accuracy of predictions. In addition, PGPI is designed to use
rich information about users when available: user profiles, friendship links, group memberships,
and the "views" and "likes" from social networks such as Facebook.
Moreover, to also address limitations 3 and 4, we present an improved version of PGPI named
PGPI+ (Partial Graph Profile Inference+). PGPI+ considers numeric attributes in addition to nominal attributes, and can evaluate the certainty of its predictions. An experimental evaluation with more than 30,000 user profiles from the Facebook and Pokec social networks shows that PGPI and PGPI+ predict user profiles with higher accuracy than several state-of-the-art algorithms, while accessing less information from the social graph. Furthermore, an interesting result is that some profile attributes, such as status (student/professor) and gender, can be predicted with more than 95% accuracy using PGPI+.
Keywords: social networks, inference, user profiles, partial graph
GENERAL INTRODUCTION
In today's society, online social networks are extremely popular. A recent study indicates that there are 3.025 billion Internet users worldwide, that 2.060 billion of them are active on social networks (that is, 68% of Internet users and 28% of the world population), and that a social network user spends an average of two hours per day on these sites [15]. There are various types of social networks, such as friendship networks (e.g., Facebook and Google Plus), professional networks (e.g., ResearchGate and LinkedIn) and networks dedicated to specific interests (e.g., Flickr).
An important problem for ad targeting on social networks is that users generally share little information publicly [1, 12]. To remedy this, several researchers in the field of social network mining are now developing automatic algorithms that infer detailed user profiles from the information publicly disclosed by users [3, 6, 8, 11, 13, 14]. Inferring a user's profile consists of discovering the values of its attributes, such as political opinion, place of residence, age, personality traits, sexual orientation, ethnic origin, etc.
To solve this problem, researchers have proposed several user profile inference algorithms, including relational naïve Bayes classifiers [14], label propagation [11, 13], majority voting [6], linear regression [12], Latent Dirichlet Allocation [3] and community detection [16]. These algorithms can predict the private attributes of user profiles. However, they all suffer from at least one of the following four important limitations.
The first limitation is that the vast majority of these algorithms assume that the full social graph, or a large part of it, is available for training [11]. This is a convenient assumption in research, when researchers wish to compare the performance of algorithms using a large training dataset. In practice, however, the full social graph is generally unavailable, or can be very costly to obtain or keep up to date [3, 8]. As a result, several of these algorithms are inapplicable or provide predictions with low accuracy. A few approaches, such as majority voting [6], do not assume a full social graph. However, they are limited to exploring the immediate neighbors of a node, and do not let the user decide the trade-off between the number of nodes accessed in the social graph and prediction accuracy, which can lead to predictions with low accuracy.
Concerning the second limitation, most of these algorithms do not use the rich information often available on social networks. On the one hand, several of them consider the links between users and the attributes of user profiles, but not other information about users such as group memberships, "likes" and "views", which are available on social networks like Facebook [1, 6, 9, 11, 13, 14, 16]. On the other hand, several algorithms consider the similarity between user profiles and information such as "likes", but not the friendship links between users or the other types of information available on social networks [3, 12, 20].
The third limitation is that several of these algorithms offer no inference procedure specific to numeric attributes. They therefore treat a numeric attribute (e.g., age) as a nominal attribute [1, 6], which can cause a loss of accuracy when inferring a user's profile. Some algorithms are specifically designed to handle numeric attributes, but they still assume training with the full social graph or a large part of it, which is impractical [7, 11, 12, 16].
The fourth important limitation is the lack of assessment of prediction certainty. Few approaches evaluate the certainty of their predictions, even though this information is essential to determine whether a prediction is reliable. For example, in the context of ad targeting, if the certainty that a prediction is correct is low, it is preferable not to use that prediction for targeting [4], since an advertisement aimed at a different category of users could otherwise be shown.
Objective. The main objective of this work is to propose a new algorithm for inferring user profiles in social networks that overcomes the aforementioned limitations, that is, an algorithm that (1) lets the user decide the trade-off between the amount of information used for a prediction and its accuracy, (2) takes into account the rich information found in social networks, such as group memberships and Facebook's "likes" and "views", (3) handles numeric attributes, and (4) evaluates the certainty of its predictions. Moreover, the algorithm should offer high accuracy compared to algorithms from the literature, while accessing a limited number of nodes of the social graph.
Hypotheses. This work is based on the following hypotheses:
— An algorithm that predicts the attribute values of a user profile in a social network using a partial social graph can offer accuracy comparable or superior to that of several approaches from the literature that use the full social graph.
— Using a lazy algorithm (one without training) will let the user decide the amount of information used for inference, and thereby the accuracy of predictions.
— Using rich information about users (group memberships, "likes" and "views") will yield more accurate predictions when this information is available.
— Treating users' numeric attributes (e.g., age, weight and height) differently from nominal attributes will also yield more accurate predictions.
— An algorithm can be designed that infers user profiles from a partial social graph while evaluating the certainty of its predictions.
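The intuition behind treating numeric attributes differently can be illustrated as follows (a sketch of the general idea only, not the PGPI+ procedure; the unweighted average is an assumption): a nominal attribute is predicted by a vote, whereas a numeric attribute is predicted by aggregating observed values, so that close-but-distinct values reinforce each other instead of splitting the vote.

```python
from collections import Counter

def predict_nominal(values):
    """Nominal attribute: the most frequent observed value wins."""
    return Counter(values).most_common(1)[0][0]

def predict_numeric(values):
    """Numeric attribute: average the observed values instead of
    voting, so that 24, 25 and 26 reinforce each other rather than
    producing a meaningless three-way tie."""
    return sum(values) / len(values)

print(predict_nominal(["student", "student", "professor"]))  # student
print(predict_numeric([24, 25, 26]))  # 25.0
```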
Organization of the thesis. This is a thesis by publications. Chapters 2 and 3 respectively contain two articles published in the proceedings of international artificial intelligence conferences.
— The first article, "Inferring User Profiles in Online Social Networks using a Partial Social Graph", was published in the proceedings of AI 2015: The 28th Canadian Conference on Artificial Intelligence. It proposes an algorithm named PGPI that addresses the first two limitations mentioned above. PGPI infers user profiles by exploiting a partial social graph and rich information about users, such as their group memberships, "likes" and "views". Experimental results show that PGPI offers higher accuracy than several algorithms from the literature. Moreover, PGPI predicts some attributes, such as status and gender, with an accuracy above 90%.
— The second article, "PGPI+: Accurate Social Network User Profiling under the Constraint of a Partial Social Graph", was accepted for publication in the proceedings of MICAI 2015: The 14th Mexican International Conference on Artificial Intelligence. It presents an improved version of PGPI named PGPI+, which introduces new optimizations to increase prediction accuracy, handles numeric attributes specifically, and evaluates the certainty of its predictions.
Note that the second publication was slightly modified to add details that had been omitted from the published version for lack of space.
Finally, Chapter 4 of this thesis presents the general conclusion as well as perspectives for future work.
CHAPITRE I
PGPI: INFERRING USER PROFILES IN SOCIAL NETWORKS USING A
PARTIAL SOCIAL GRAPH
PGPI: ALGORITHME D’INFÉRENCE DE PROFILS UTILISATEURS SUR LES
RÉSEAUX SOCIAUX EN UTILISANT UN GRAPHE SOCIAL PARTIEL
Abstract
Most algorithms for user profile inference in online social networks assume that the full social
graph is available for training. This assumption is convenient in a research setting. However, in
real-life, the full social graph is generally unavailable or may be very costly to obtain or update.
Thus, several of these algorithms may be inapplicable or provide poor accuracy. Moreover, current
approaches often do not exploit all the rich information that is available in social networks. In this
paper, we address these challenges by proposing an algorithm named PGPI (Partial Graph Profile
Inference) to accurately infer user profiles under the constraint of a partial social graph and without
training. It is, to our knowledge, the first algorithm that lets the user control the trade-off between the
amount of information accessed from the social graph and the accuracy of predictions. Moreover,
it is also designed to use rich information about users such as group memberships, views and likes.
An experimental evaluation with 11,247 Facebook user profiles shows that PGPI predicts user
profiles more accurately than four state-of-the-art algorithms, while accessing a smaller part of the social graph. Moreover, an interesting result is that profile attributes such as status (student/professor)
and gender can be predicted with more than 90% accuracy using PGPI.
Keywords: social networks, inference, user profiles, partial graph
1 Introduction
In today’s society, online social networks have become extremely popular. Various types of
social networks are used such as friendship networks (e.g. Facebook and Twitter), professional
networks (e.g. LinkedIn and ResearchGate) and networks dedicated to specific interests (e.g. Flickr
and IMDB). A concern that is often raised about social networks is how to protect users’ personal
information [10]. To address this concern, much work has been done to help users select how their
personal information is shared and with whom, and to propose automatic anonymization techniques
[10]. The result is that personal information is better protected but also that less information is now
available to companies to provide targeted advertisements and services to their users. To address
this issue, an important sub-field of social network mining is now interested in developing automatic
techniques to infer user profiles using the disclosed public information.
Various methods have been proposed to solve this problem such as relational Naïve Bayes
classifiers [14], label propagation [11, 13], majority voting [6] and approaches based on linear
regression [12], Latent-Dirichlet Allocation [3] and community detection [16]. It was shown that
these methods can accurately predict hidden attributes of user profiles in many cases. However,
these work suffers from two important limitations. First, the great majority of these approaches
assumes the unrealistic assumption that the full social graph is available for training (e.g. label
propagation). This is a convenient assumption used by researchers for testing algorithms using large
public datasets. However, in real-life, the full social graph is often unavailable (e.g. on Facebook,
LinkedIn and Google Plus) or may be very costly to obtain [3]. Moreover, even if the full social
graph is available, it may be unpractical to keep it up to date. The result is that several of these
methods are inapplicable or provide poor accuracy if the full social graph is not available (see
the experiment section of this paper). A few approaches, such as majority voting [6], do not assume a full social graph. However, the latter is limited to exploring only the immediate neighbors of a node, and thus does not let the user control the trade-off between the number of nodes accessed and prediction accuracy, which may lead to low accuracy. It is thus an important challenge to develop
algorithms that let the user choose the best trade-off between the number of nodes accessed and
prediction accuracy.
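One way to picture this trade-off is a neighborhood exploration capped by an explicit budget (a simplified sketch; PGPI's actual parameter, maxFacts, bounds the number of facts accessed rather than nodes, and the helper names below are assumptions):

```python
from collections import Counter, deque

def budgeted_vote(neighbors, profiles, user, attribute, max_nodes):
    """Explore the graph breadth-first around `user`, stop once
    `max_nodes` nodes have been accessed, then take a majority vote
    over the attribute values observed so far."""
    votes = []
    visited = {user}
    queue = deque(neighbors.get(user, []))
    while queue and len(visited) - 1 < max_nodes:  # budget on accessed nodes
        n = queue.popleft()
        if n in visited:
            continue
        visited.add(n)  # one more node accessed from the social graph
        value = profiles.get(n, {}).get(attribute)
        if value is not None:
            votes.append(value)
        queue.extend(neighbors.get(n, []))  # go beyond one-hop neighbors
    return Counter(votes).most_common(1)[0][0] if votes else None
```

A larger budget generally improves accuracy at the cost of accessing more of the graph; exposing that dial to the user is precisely the trade-off discussed above.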
Second, another important limitation is that several algorithms do not consider the rich infor-
mation that is generally available on social networks. On one hand, several algorithms consider
links between users and user attributes but do not consider other information such as group mem-
berships, "likes" and "views" that are available on social networks such as Facebook and Twit-
ter [1, 6, 9, 11, 13, 14, 16]. On the other hand, several algorithms consider the similarity between
user profiles and information such as "likes", but not the links between users or other rich information [3, 12, 20]. Yet exploiting more information may help to further increase accuracy.
In this paper, we address these challenges. Our contributions are as follows. First, we propose
a new lazy algorithm named PGPI (Partial Graph Profile Inference) that can accurately infer user
profiles under the constraint of a partial graph and without training. The algorithm is, to our knowledge, the first that lets the user select how many nodes of the social graph can be accessed
to infer a user profile. This lets the user choose the best trade-off between accuracy versus number
of nodes visited.
Second, the algorithm is also designed to be able to use not only information about friendship
links and profiles but also group memberships, likes and views, when the information is available.
Third, we report results from an extensive empirical evaluation against four state-of-the-art algo-
rithms when applied to 11,247 Facebook user profiles. Results show that the proposed algorithm
can provide a considerably higher accuracy while accessing a much smaller number of nodes from
the social graph. Moreover, an interesting result is that profile attributes such as status (student/pro-
fessor) and gender can be predicted with more than 90% accuracy using PGPI.
The rest of this paper is organized as follows. Sections 2, 3, 4, 5 and 6 respectively present the
related work, the problem definition, the proposed algorithm, the experimental evaluation and the
conclusion.
2 Related Work
Much work has been done on inferring user profiles in online social networks. We briefly review recent work on this topic. Davis Jr. et al. [6] inferred the location of Twitter users based on relationships
to other users with known locations. They used a dataset of about 50,000 users. Their approach
consists of performing a majority vote over the locations of directly connected users. An important
limitation is that this approach is designed to predict a single attribute based on a single attribute
from other user profiles. Having the same goal, Jurgens [11] designed a variation of the popular la-
bel propagation approach to predict user locations on Twitter/Foursquare social networks. A major
limitation is that it is an iterative algorithm that requires the full social graph since it propagates
known labels to unlabeled nodes through links between nodes. Li et al. [13] developed an iterative
algorithm to deduce LinkedIn user profiles based on relation types. The algorithm is similar to label
propagation and requires a large training set to discover relation types.
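As a rough illustration of why propagation-style methods need the whole graph, a single-attribute label propagation sweep can be sketched as follows (a deliberately simplified sketch, not the exact algorithms of [11] or [13]):

```python
from collections import Counter

def propagate_labels(neighbors, labels, iterations=10):
    """Single-attribute label propagation: each unlabeled node takes
    the majority label of its neighbors, sweep after sweep. Every
    sweep iterates over all nodes, which is why the method needs the
    full social graph rather than a partial one."""
    labels = dict(labels)  # known labels; unlabeled nodes are absent
    for _ in range(iterations):
        updates = {}
        for node, nbrs in neighbors.items():
            if node in labels:
                continue  # known labels stay fixed
            votes = Counter(labels[n] for n in nbrs if n in labels)
            if votes:
                updates[node] = votes.most_common(1)[0][0]
        if not updates:
            break  # converged: no node gained a label this sweep
        labels.update(updates)
    return labels
```

Each sweep touches every edge of the graph, so the approach is inapplicable when only a small neighborhood around the queried user can be fetched.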
He et al. [9] proposed an approach consisting of building a Bayesian network based on the
full social graph and then predicting user attribute values. The approach considers the similarity between
user profiles and links between users to perform predictions, and was applied to data collected from
LiveJournal. Recently, Dong et al. [7] proposed a similar approach based on graphical models
to predict the age and gender of users. Their study was performed with 1 billion phone and SMS
data and 7 million user profiles. Chaudhari [4] also proposed a graphical model-based approach to
infer user profiles, designed to only perform predictions when the probability of making an accurate
prediction is high. The approach has shown high accuracy on a dataset of more than 1M Twitter
users and a dataset from the Pokec social network. A limitation of these approaches, however, is that they all assume that the full social graph is available for training. Furthermore, they only
consider user attributes and links but do not consider additional information such as likes, views
and group membership.
A community detection based approach for user profile inference was proposed by Mislove
[16]. It was applied to more than 60K Facebook profiles with friendship links. It consists of applying a community detection algorithm and then inferring user profiles based on their similarity to other
members of the same community. Lindamood et al. [14] also inferred user profiles from Facebook
data. They applied a modified Naïve Bayes classifier on 167K Facebook profiles with friendship
links, and concluded that if links or attributes are erased, accuracy of the approach can greatly
decrease. This raises the challenge of performing accurate predictions using little data. Recently,
Blenn et al. [1] utilized bird flocking, association rule mining and statistical analysis to infer user
profiles in a dataset of 3 million Hyves.nl users. However, all these works assume that the full social graph is available for training, and they only use profile information and social links to perform
predictions.
Chaabane et al. [3] proposed an approach based on Latent Dirichlet Allocation (LDA) to in-
fer Facebook user profiles. The approach extracts a probabilistic model from music interests and
additional information provided from Wikipedia. A drawback is that it requires a large training
dataset, which was very difficult and time-consuming to obtain according to the authors [3]. Kosin-
ski et al. [12] also utilized information about user preferences to infer Facebook user profiles. The
approach consists of applying Singular Value Decomposition to a large matrix of users/likes and
then applying regression to perform prediction. A limitation of this work is that it does not utilize
information about links between users and requires a very large training dataset.
Quercia et al. [20] developed an approach to predict the personality of Twitter users. The approach
consists of training a decision tree using a large training set of users tagged with their
personality traits, and then using it to predict the personality traits of other users. Although this approach
was successful, it uses a very limited set of information to perform predictions: number of follow-
ers, number of persons followed and list membership count.
3 Problem Definition
As outlined above, most approaches assume a large training set or full social graph. The most
common definition of the problem of inferring user profiles is the following [1, 4, 9, 11, 13, 14, 16].
Definition 1. Social graph. A social graph G is a quadruplet G = {N,L, V, A}. N is the set of
nodes in the social graph. L ⊆ N × N is a binary relation representing the link (edges) between
nodes. Let there be m attributes describing the users of the social network, such that V = {V1, V2, ...Vm}
contains, for each attribute i, the set of possible attribute values Vi. Finally, A = {A1, A2, ...Am}
contains, for each attribute i, a relation assigning an attribute value to nodes, that is, Ai ⊆ N × Vi.
Example. Consider a social graph with three nodes N = {John, Alice, Mary} and friendship
links L = {(John, Mary), (Mary, John), (Mary, Alice), (Alice, Mary)}. Consider
two attributes gender and status, respectively called attributes 1 and 2, to describe users. The sets
of possible attribute values for gender and status are respectively V1 = {male, female} and V2 =
{professor, student}. The relations assigning attribute values to nodes are A1 = {(John, male),
(Alice, female), (Mary, female)} and A2 = {(John, student), (Alice, student),
(Mary, professor)}.
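As an illustration, this toy social graph can be encoded with plain Python sets and dictionaries (an illustrative encoding of our own, not part of the formal definition):

```python
# The toy social graph G = {N, L, V, A} of Definition 1, encoded with
# plain Python sets and dicts. Attribute 1 is gender, attribute 2 is status.

N = {"John", "Alice", "Mary"}

# L is symmetric: both directions of every friendship link are stored.
L = {("John", "Mary"), ("Mary", "John"),
     ("Mary", "Alice"), ("Alice", "Mary")}

# V[i] is the set of possible values for attribute i.
V = {1: {"male", "female"}, 2: {"professor", "student"}}

# A[i] maps each node to its value for attribute i (the relation A_i,
# stored as a function since each node has one value per attribute).
A = {1: {"John": "male", "Alice": "female", "Mary": "female"},
     2: {"John": "student", "Alice": "student", "Mary": "professor"}}

def neighbors(n):
    """Nodes directly linked to n through L."""
    return {b for (a, b) in L if a == n}

print(sorted(neighbors("Mary")))  # ['Alice', 'John']
```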
Definition 2. Problem of inferring user profiles in a social graph. The problem of inferring
the user profile of a node n ∈ N in a social graph G is to correctly guess the attribute values of n
using the other information provided in the social graph.
We also consider an extended problem definition where we consider additional information
available in social networks such as Facebook (views, likes and group memberships).
Definition 3. Extended social graph. An extended social graph E is a tuple
E = {N,L, V,A,G,NG, P, PG,LP, V P} where N,L, V,A are defined as previously. G is a set
of groups that a user can be a member of. The relation NG ⊆ N ×G indicates the membership of
users to groups. P is a set of publications such as pictures, texts, videos that are posted in groups.
PG is a relation PG ⊆ P ×G, which associates a publication to the group(s) where it was posted.
LP is a relation LP ⊆ N × P , which indicates the publication(s) liked by each user (e.g. the
"likes" on Facebook). V P is a relation V P ⊆ N ×P , which indicates the publication(s) viewed by
each user (e.g. the "views" on Facebook). An observation is that LP ⊆ V P .
Example. Consider two groups G = {book_club, music_lovers} such that NG = {(John,
book_club), (Mary, book_club), (Alice, music_lovers)}, and two publications P = {picture1,
picture2} published in the groups PG = {(picture1, book_club), (picture2, music_lovers)}.
The publications viewed by users are V P = {(John, picture1), (Mary, picture1), (Alice,
picture2)}, while the publications liked by users are LP = {(John, picture1), (Alice, picture2)}.
Definition 4. Problem of inferring user profiles in an extended social graph. The problem
of inferring the user profile of a node n ∈ N in an extended social graph E is to correctly guess the
attribute values of n using the information in the social graph.
However, the above definitions assume that the full social graph may be used to perform predictions.
In this paper, we define the problem of inferring user profiles using a limited amount of information
as follows.
Definition 5. Problem of inferring user profiles using a partial (extended) social graph. Let
maxFacts ∈ N+ be a parameter set by the user. The problem of inferring the user profile of a
node n ∈ N using a partial (extended) social graph E is to accurately predict the attribute values
of n by accessing no more than maxFacts facts from the social graph. A fact is a node, group or
publication from N , G or P (excluding n).
4 The Proposed Algorithm
In this section, we present the proposed PGPI algorithm. We first describe a version PGPI-N
that infers user profiles using only nodes and links. Then, we present an alternative version PGPI-G
designed for predicting user profiles using only group and publication information (views and
likes). Finally, we explain how these two versions are combined in the full PGPI algorithm.
4.1 Inferring user profiles using nodes and links
Our proposed algorithm PGPI-N for inferring profiles using nodes and links is inspired by the
label propagation family of algorithms, which was shown to provide high accuracy [11, 13]. However,
these algorithms suffer from an important limitation: they are iterative algorithms that require
the full social graph for training to propagate attribute values [11].
PGPI-N adapts the idea of label propagation for the case where at most maxFacts facts from
the social graph can be accessed to make a prediction. This is possible because PGPI-N is a lazy
algorithm (it does not require training), unlike label propagation. To predict an attribute value
of a node n, PGPI-N only explores the neighborhood of n, which is restricted by a parameter
maxDistance.
The PGPI-N algorithm (Algorithm 1) takes as parameters a node ni, an attribute k to be predicted,
the maxFacts and maxDistance parameters, and a social graph G. It outputs a predicted
value v for attribute k of node ni. The algorithm first initializes a map M so that it contains a key-
value pair (v, 0) for each possible value v for attribute k. The algorithm then performs a breadth-first
search. It first initializes a queue Q to store nodes and a set seen to remember the already visited
nodes, and the node ni is pushed in the queue.
Then, while the queue is not empty and the number of accessed facts is less than maxFacts,
the algorithm pops the first node nj in the queue. Then, the formula Fi,j = Wi,j/dist(ni, nj) is
calculated. Wi,j and dist(ni, nj) are respectively used to weight the influence of node nj by its
similarity and distance to ni. Wi,j is defined as the number of attribute values common to ni and
nj divided by the number of attributes m. dist(ni, nj) is the number of edges in the shortest path
between ni and nj . Then, Fi,j is added to the entry in map M for the attribute value of nj for
attribute k. Then, if dist(ni, nj) ≤ maxDistance, each node nh linked to nj that was not already
visited is pushed in the queue and added to the set seen. Finally, when the while loop terminates,
the attribute value v associated to the largest value in M is returned as the predicted value for ni.
Note that in our implementation, if the maxFacts limit is reached, the algorithm does not perform a
prediction, to reduce the probability of making an inaccurate prediction.
Algorithm 1: The PGPI-N algorithm
input : ni: a node, k: the attribute to be predicted, maxFacts and maxDistance: the
user-defined thresholds, G: a social graph
output: the predicted attribute value v
1 M = {(v, 0) | v ∈ Vk};
2 Initialize a queue Q;
3 Q.push(ni);
4 seen = {ni};
5 while Q is not empty and |accessedFacts| < maxFacts do
6     nj = Q.pop();
7     Fi,j ← Wi,j / dist(ni, nj);
8     Add Fi,j to the entry of value v in M such that (nj, v) ∈ Ak;
9     if dist(ni, nj) ≤ maxDistance then
10        foreach node nh ≠ ni such that (nh, nj) ∈ L and nh ∉ seen do
11            Q.push(nh);
12            seen ← seen ∪ {nh};
13        end
14    end
15 end
16 return a value v such that (v, z) ∈ M ∧ ∄(v′, z′) ∈ M | z′ > z;
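Algorithm 1 can be sketched in Python as follows, under the same illustrative encoding of G as sets and dictionaries (L for links, V for possible values, A for attribute assignments). The fact counter and tie-breaking details are our own simplifications, not the authors' implementation:

```python
from collections import deque

def pgpi_n(ni, k, max_facts, max_distance, L, V, A):
    """Sketch of PGPI-N (Algorithm 1). A fact is counted each time a
    node other than ni is popped from the queue."""
    m = len(V)
    M = {v: 0.0 for v in V[k]}            # line 1: one score per possible value
    Q = deque([(ni, 0)])                  # queue of (node, dist(ni, node))
    seen = {ni}
    accessed = 0
    while Q and accessed < max_facts:     # line 5
        nj, d = Q.popleft()               # line 6
        if nj != ni:
            accessed += 1
            # W_{i,j}: fraction of the m attributes on which ni and nj agree
            w = sum(A[a][ni] == A[a][nj] for a in A) / m
            M[A[k][nj]] += w / d          # lines 7-8: F_{i,j} = W_{i,j}/dist
        if d <= max_distance:             # line 9: expand the neighborhood
            for (x, nh) in L:             # line 10
                if x == nj and nh not in seen:
                    seen.add(nh)
                    Q.append((nh, d + 1))
    return max(M, key=M.get)              # line 16: highest-scoring value
```

For instance, on the toy graph of Definition 1, `pgpi_n("John", 2, 10, 3, L, V, A)` scores Mary (distance 1, W = 0) and Alice (distance 2, W = 0.5) and returns "student".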
4.2 Inferring user profiles using groups and publications
PGPI-G is a lazy algorithm designed to predict user attribute values using only group and pub-
lication information (views and likes). The algorithm is inspired by majority voting algorithms
(e.g. [6]), which have been used for predicting user profiles based on links between users. PGPI-G
adapts this idea for groups and publications and also to handle the constraint that at most maxFacts
facts from the social graph can be accessed to make a prediction.
The PGPI-G algorithm (Algorithm 2) takes as parameters a node ni, an attribute k to be predicted,
the maxFacts parameter and an extended social graph E . It outputs
a predicted value v for attribute k of node ni. The algorithm first initializes a map M so that it
contains a key-value pair (v, 0) for each possible value v for attribute k. Then, the algorithm
iterates over each member nj ≠ ni of each group g where ni is a member, while the number of
accessed facts is less than maxFacts. For each such node nj , the formula Fgi,j is calculated to
estimate the similarity between ni and nj and the similarity between ni and g. In this formula,
commonLikes(ni, nj) and commonV iews(ni, nj) respectively denote the number of publications
liked and viewed by both ni and nj , while commonGroups(ni, nj) is the number of groups common to
ni and nj . Finally, commonPopularAttributes(ni, g) is the number of attribute values of ni that
are the same as the most popular attribute values for members of g. Then, Fgi,j is added to the entry
in map M for the attribute value of nj for attribute k. Finally, the attribute value v associated to the
largest value in M is returned as the predicted value for ni. Note that in our implementation, if the
maxFacts limit is reached, the algorithm does not perform a prediction, to reduce the probability of
making an inaccurate prediction.
Algorithm 2: The PGPI-G algorithm
input : ni: a node, k: the attribute to be predicted, maxFacts: a user-defined threshold, E :
an extended social graph
output: the predicted attribute value v
1 M = {(v, 0) | v ∈ Vk};
2 foreach group g | (ni, g) ∈ NG s.t. |accessedFacts| < maxFacts do
3     foreach person nj ≠ ni ∈ g s.t. |accessedFacts| < maxFacts do
4         Fgi,j ← Wi,j × commonLikes(ni, nj) × commonV iews(ni, nj) ×
              (commonGroups(ni, nj) / |{(ni, x) | (ni, x) ∈ NG}|) ×
5             commonPopularAttributes(ni, g);
6         Add Fgi,j to the entry of value v in M such that (nj, v) ∈ Ak;
7     end
8 end
9 return a value v such that (v, z) ∈ M ∧ ∄(v′, z′) ∈ M | z′ > z;
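To make the score of lines 4-5 concrete, here is a sketch under an encoding of our own (NG as a set of (node, group) pairs, LP and VP as dictionaries mapping each node to its set of liked/viewed publications), not the authors' code:

```python
from collections import Counter

def fg_score(ni, nj, g, NG, LP, VP, A):
    """Sketch of the similarity F^g_{i,j} used by PGPI-G (Algorithm 2)."""
    m = len(A)

    def groups(n):
        return {h for (x, h) in NG if x == n}

    # W_{i,j}: fraction of the m attributes on which ni and nj agree
    w = sum(A[a][ni] == A[a][nj] for a in A) / m
    common_likes = len(LP.get(ni, set()) & LP.get(nj, set()))
    common_views = len(VP.get(ni, set()) & VP.get(nj, set()))
    common_groups = len(groups(ni) & groups(nj)) / len(groups(ni))
    # commonPopularAttributes(ni, g): attributes where ni holds the most
    # frequent value among the members of g
    members = {x for (x, h) in NG if h == g}
    cpa = sum(A[a][ni] == Counter(A[a][x] for x in members).most_common(1)[0][0]
              for a in A)
    return w * common_likes * common_views * common_groups * cpa
```

Note that with this multiplicative form, a single zero factor (e.g. no publication liked by both users) zeroes the whole score.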
4.3 Inferring user profiles using nodes, links, groups and publications
The PGPI-N and PGPI-G algorithms have a similar design. Both of them are lazy algorithms
that update a map M containing key-value pairs and then return the attribute value v associated
to the highest value in M as the prediction. Because of this similarity, the algorithms PGPI-N
and PGPI-G can be easily combined. We name this combination PGPI. The pseudocode of PGPI
is shown in Algorithm 3, and is obtained by inserting lines 2 to 15 of PGPI-N before line 8 of
PGPI-G. Moreover, a new parameter named ratioFacts is added. It specifies what fraction of
the maxFacts facts may be used by PGPI-G to make a prediction, the remainder being used by PGPI-N.
For example, ratioFacts = 0.3 means that PGPI-G may use up to 30% of the facts, and thus that
PGPI-N may use the other 70%. In our experiments, the best value for this parameter was 0.5.
5 Experimental Evaluation
We performed several experiments to assess the accuracy of the proposed PGPI-N, PGPI-G and
PGPI algorithms for predicting attribute values of nodes in a social network. Experiments were
performed on a computer with a fourth generation 64 bit Core i5 processor running Windows 8.1
and 8 GB of RAM.
We compared the performance of the proposed algorithms with four state-of-the-art algorithms.
The first three are Naïve Bayes classifiers [10]. Naïve Bayes (NB) infers user profiles strictly based
on correlations between attribute values. Relational Naïve Bayes (RNB) considers the probability
of having friends with specific attribute values. Collective Naïve Bayes (CNB) combines NB and
RNB. To be able to compare NB, RNB and CNB with the proposed algorithms, we have adapted
them to work with a partial graph. This is simply done by training them with maxFacts users
chosen randomly instead of the full social graph. The last algorithm is label propagation (LP) [11].
Because LP requires the full social graph and does not consider the maxFacts parameter, its results
are only used as a baseline. For algorithm-specific parameters, the values providing the best results
were found empirically.
Experiments were carried on a real-life dataset containing 11,247 user profiles collected from
the Facebook social network in 2005 [23]. Each user is described according to seven attributes:
a student/faculty status flag, gender, major, second major/minor (if applicable), dorm/house, year,
and high school. These attributes respectively have 6, 2, 65, 66, 66, 15 and 1,420 possible values.
The dataset is a social graph rather than an extended social graph, i.e., it does not contain information
about group memberships, views and likes. But this information is needed by PGPI-G and PGPI.
To address this issue, we have generated synthetic data about group memberships, views and likes.
Our data generator takes several parameters as input. To find realistic values for the parameters, we
have observed more than 100 real groups on the Facebook network. Parameters are the following.
We set the number of groups to 500 and each group to contain between 20 and 250 members, since
we observed that most groups are small. Groups are randomly generated, but similar users are more
likely to be members of the same group. Furthermore, we limit the number of groups per person to
25. We generated 10,000 publications that are used to represent content shared by users in groups
such as pictures, text and videos. A publication may appear in three to five groups. A user has a
50% probability of viewing a publication and a 30% probability of liking a publication that he has
viewed. Lastly, a user can like and view respectively no more than 50 and 100 publications, and a
user is more likely to like a publication that users with a similar profile also like.
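The generation process described above can be sketched as follows (a simplification of our own: the per-user caps of 25 groups, 50 likes and 100 views, and the similarity-based biases, are omitted for brevity):

```python
import random

def generate(users, n_groups=500, n_pubs=10_000):
    """Simplified sketch of the synthetic-data generator: groups of 20 to
    250 members, publications posted in 3 to 5 groups, a 50% chance of
    viewing and a 30% chance of liking a viewed publication.
    Returns (groups, pubs, views, likes)."""
    groups = {g: set(random.sample(users, random.randint(20, 250)))
              for g in range(n_groups)}
    pubs = {p: random.sample(range(n_groups), random.randint(3, 5))
            for p in range(n_pubs)}
    views, likes = set(), set()
    for p, gs in pubs.items():
        for u in {u for g in gs for u in groups[g]}:   # potential readers
            if random.random() < 0.5:                  # 50% chance of a view
                views.add((u, p))
                if random.random() < 0.3:              # 30% chance of a like
                    likes.add((u, p))
    return groups, pubs, views, likes
```

By construction, a publication can only be liked after being viewed, so the property LP ⊆ V P of Definition 3 holds.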
To compare algorithms, the prediction of an attribute value is said to be a success if the predic-
tion is accurate, a failure if the prediction is inaccurate or otherwise a no match if the algorithm was
unable to perform a prediction. Three performance measures are used. The accuracy is the number
of successful predictions divided by the total number of prediction opportunities. The coverage is
the number of successes or failures divided by the number of prediction opportunities. The product
of accuracy and coverage (PAC) is the product of the accuracy and coverage and is used to as-
sess the trade-off obtained for these two measures. In the following, the accuracy and coverage are
measured as the average for all seven attributes, except when indicated otherwise.
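Following the definitions above, the three measures can be computed from the outcome of each prediction opportunity as follows (a trivial sketch; the outcome labels are our own naming):

```python
def evaluate(outcomes):
    """Compute (accuracy, coverage, PAC) from a list of outcomes,
    each being "success", "failure" or "nomatch"."""
    n = len(outcomes)
    accuracy = outcomes.count("success") / n
    coverage = (outcomes.count("success") + outcomes.count("failure")) / n
    return accuracy, coverage, accuracy * coverage

print(evaluate(["success", "success", "failure", "nomatch"]))
# (0.5, 0.75, 0.375)
```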
PAC w.r.t number of facts. We have run all algorithms while varying the maxFacts parameter
to assess the influence of the number of accessed facts on the product of prediction accuracy and
coverage. Fig. I-1 shows the overall results. In this figure, the PAC is measured with respect to
the number of accessed facts. It can be observed that PGPI provides the best results when 132 to
700 facts are accessed. For less than 132 facts, CNB provides the best results followed by PGPI.
PGPI-N provides the second best results for 226 to 306 facts. Note that no results are provided for
PGPI-N for more than 306 facts. The reason is that PGPI-N relies solely on links between nodes
to perform predictions and the dataset does not contain enough links to let us further increase the
number of facts accessed by this algorithm.
Figure I-1 Product of accuracy and coverage w.r.t. number of accessed facts
The algorithm providing the worst results is LP (not shown in figures). LP provides a maximum
accuracy and coverage of respectively 43.2% and 100%. This is not good considering that LP uses
the full social graph of more than 10,000 nodes. For the family of Naïve Bayes algorithms, CNB
provides the best results until 90 facts are accessed. Then, NB is the best. RNB always provides the
worst results among Naïve Bayes algorithms.
Accuracy and coverage w.r.t number of facts. The left and right parts of Fig. I-2 separately
show the results for accuracy and coverage. In terms of accuracy, it can be observed that PGPI-N,
PGPI-G and PGPI always provide much higher accuracy than NB, RNB, CNB and LP (from 9.2%
to 36.8% higher accuracy). Thus, the reason why PGPI algorithms have a lower PAC than NB for
132 facts or less is because of coverage and not accuracy. The coverage is only shown in Fig. I-2
for PGPI-N, PGPI-G and PGPI because the coverage of the other algorithms always remains the
same (100% for NB, RNB and CNB, and 94.4% for LP). PGPI has a much
higher coverage than PGPI-N and PGPI-G, even when it accesses much less facts. For example,
PGPI has a coverage of 87% when it accesses 132 facts, while PGPI-N and PGPI-G have a similar
coverage when using respectively 267 and 654 facts. The coverage of PGPI exceeds 95% when
using about 400 facts.
Figure I-2 Accuracy and coverage w.r.t. number of accessed facts
Best results for each attribute. We also analyzed each attribute individually. The best results
in terms of accuracy and coverage obtained for each attribute and each algorithm are shown in Table
I-1. The last column of the table indicates the number of accessed facts needed to obtain these results.
In terms of accuracy, the best accuracy was achieved by PGPI algorithms for all attributes. For
status, gender, major, minor, residence, year and school, PGPI algorithms respectively obtained a
maximum accuracy of 90.9%, 94.6%, 35.2%, 75.1%, 60.5%, 70.5% and 12.9%. The reason for the low
accuracy for school is that there are 1,420 possible values and that users are rarely linked to other
users with the same value.
The best results in terms of PAC are written in bold. PGPI provides the best PAC for all but one
attribute. The second best algorithm is PGPI-N, which provides the second or third best PAC for
five of seven attributes. Moreover, it is interesting to see that PGPI-N achieves this result using less
than half the amount of facts used by PGPI. The third best algorithm is PGPI-G, which provides
the second or third best PAC for four attributes.
Influence of maxFacts and ratioFacts parameters. In the previous experiments, we have
analyzed the relationship between the number of accessed facts and the PAC, accuracy and cover-
age. However, we did not discuss how to choose the maxFacts parameter and how it influences
the number of accessed facts for each algorithm.
For Naïve Bayes algorithms and LP, the maxFacts parameter determines the number of nodes
to be used for training, and thus directly determines the number of accessed facts.
For PGPI algorithms, we show the influence of maxFacts on the number of accessed facts in
Fig. I-3. It can be observed that PGPI algorithms generally explore fewer nodes than maxFacts. For
PGPI-N, the reason is that some social network users have fewer than maxFacts friends within
maxDistance and thus, PGPI-N cannot access maxFacts facts to perform a prediction.
For PGPI-G, the reason is that some users are members of just a few groups. For PGPI, the number
of accessed facts is greater than for PGPI-N but less than for PGPI-G. The reason is that PGPI
combines both algorithms and uses the ratioFacts parameter, as previously explained, to decide how
many facts can be used by PGPI-N and PGPI-G. The value that we used in previous experiments
for this parameter is 0.5, which means using 50% of the facts for PGPI-G, and thus the other
half for PGPI-N.
Figure I-3 Number of accessed facts w.r.t. maxFacts
A question that can be raised is how to set the ratioFacts parameter for the PGPI algorithms.
In Fig. I-4, we show the influence of the ratioFacts parameter on the product of accuracy and
coverage. We have measured the PAC w.r.t the number of accessed facts when ratioFacts is respectively
set to 0.3, 0.5 and 0.7. Note that setting ratioFacts to 0 or 1 would be respectively
equivalent to using PGPI-N or PGPI-G. In Fig. I-4, it can be observed that setting ratioFacts to
0.5 yields the highest PAC. Furthermore, in general a value of 0.5 also gives a high PAC
w.r.t the number of accessed facts. A value of 0.3 sometimes gives a higher PAC than 0.5. However,
the highest PAC value for 0.3 overall is lower than that for 0.5. A value of 0.7 almost always
gives a lower PAC.
Figure I-4 Product of accuracy and coverage w.r.t. ratioFacts
Best results when using the full social graph. We also performed an experiment using the
full social graph (maxFacts = ∞). The best algorithm in this case is PGPI with an accuracy and
coverage of 64.7% and 100%, while the second best algorithm (PGPI-G) achieves 62% and 98.4%.
Influence of the maxDistance parameter. Recall that the PGPI-N and PGPI algorithms uti-
lize a parameter called maxDistance. This parameter indicates that two nodes are in the same
neighborhood if there are no more than maxDistance edges in the shortest path between them. In
the previous experiments, we have empirically set this parameter to obtain the best results. We now
discuss the influence of this parameter on the results of PGPI-N and PGPI.
We have investigated the influence of maxDistance on the product of accuracy and coverage,
and also on the number of accessed facts. Results are shown in Fig. I-5 for PGPI-N. It can be
observed that as maxDistance is increased, the PAC also increases until it reaches its peak at
around maxDistance = 6. After that, it remains unchanged. The main reason why it remains
unchanged is that the influence of a user on another user is divided by their distance in the PGPI-
N algorithm. Thus, farther nodes have a very small influence on each other. Another interesting
observation is that as maxDistance increases, the number of accessed facts also increases. Thus,
if our goal is to use fewer facts, then maxDistance should be set to a small value, whereas if our goal
is to achieve the maximum PAC, maxDistance should be set to a large value. Lastly, we can also
observe that the number of accessed facts does not increase after maxDistance = 7. The reason is
that the Facebook dataset is quite sparse (i.e. users have few friendship links).
Figure I-5 Product of accuracy and coverage (left) and number of accessed facts (right) w.r.t.
maxDistance for PGPI-N
Discussion. Overall, PGPI algorithms provide the best accuracy. A key reason is that they are
lazy algorithms. They use local facts close to the node to be predicted rather than relying on a
general training phase performed beforehand. A second important reason is that richer information
is used by PGPI-G and PGPI (groups, publications). A third reason is that PGPI-N calculates the
similarity between users while label propagation does not.
It is interesting to note that although PGPI-G relies on synthetic data in our experiment, PGPI-N
does not and performs very well.
6 Conclusion
We have proposed a lazy algorithm named PGPI that can accurately infer user profiles under
the constraint of a partial social graph. It is, to our knowledge, the first algorithm that lets the user
control the trade-off between the amount of information accessed from the social graph and the accuracy of predictions.
Moreover, the algorithm is designed to use not only node links and profiles as a basis to perform
predictions but also group memberships, likes and views, if the information is available. An extensive
experimental evaluation against four state-of-the-art algorithms shows that PGPI can provide
a considerably higher accuracy while accessing a much smaller number of nodes from the social
graph to perform predictions. Moreover, an interesting result is that profile attributes such as status
(student/professor) and gender can be predicted with more than 90% accuracy using PGPI.
For future work, we plan to further improve the PGPI algorithms by proposing new optimiza-
tions and to perform experiments on additional datasets.
Algorithm 3: The PGPI algorithm
input : ni: a node, k: the attribute to be predicted, maxFacts: a user-defined threshold, E :
an extended social graph, ratioFacts: the ratio of facts to be used by PGPI-G
output: the predicted attribute value v
1 M = {(v, 0) | v ∈ Vk};
2 foreach group g | (ni, g) ∈ NG s.t. |accessedFacts| < maxFacts × ratioFacts do
3     foreach person nj ≠ ni ∈ g s.t. |accessedFacts| < maxFacts do
4         Fgi,j ← Wi,j × commonLikes(ni, nj) × commonV iews(ni, nj) ×
              (commonGroups(ni, nj) / |{(ni, x) | (ni, x) ∈ NG}|) ×
5             commonPopularAttributes(ni, g);
6         Add Fgi,j to the entry of value v in M such that (nj, v) ∈ Ak;
7     end
8 end
9 Initialize a queue Q;
10 Q.push(ni);
11 seen = {ni};
12 while Q is not empty and |accessedFacts| < maxFacts do
13     nj = Q.pop();
14     Fi,j ← Wi,j / dist(ni, nj);
15     Add Fi,j to the entry of value v in M such that (nj, v) ∈ Ak;
16     if dist(ni, nj) ≤ maxDistance then
17         foreach node nh ≠ ni such that (nh, nj) ∈ L and nh ∉ seen do
18             Q.push(nh);
19             seen ← seen ∪ {nh};
20         end
21     end
22 end
23 return a value v such that (v, z) ∈ M ∧ ∄(v′, z′) ∈ M | z′ > z;
algorithm status gender major minor residence year school |facts|
PGPI acc. 90.8% 87.7% 31.5% 75.1% 59.59% 66.86% 10.3% 684
PGPI cov. 100% 98.9% 99% 99% 99% 99% 99% 684
PGPI-N acc. 89.4% 90.9% 24.5% 73.9% 60.5% 66.77% 12.9% 272
PGPI-N cov. 94.4% 94.0% 94.4% 94.4% 94.4% 94.4% 94.4% 272
PGPI-G acc. 90.9% 94.6% 35.2% 70.8% 53.0% 70.5% 8.8% 689
PGPI-G cov. 96.6% 72% 72% 72% 72% 72% 72% 689
NB acc. 88.0% 51.1% 16.6% 74.4% 55.6% 26.0% 10.6% 189
NB cov. 100% 100% 100% 100% 100% 100% 100% 189
RNB acc. 80.2% 57.7% 15.0% 74.2% 55.0% 26.2% 9.8% 580
RNB cov. 100% 100% 100% 100% 100% 100% 100% 580
CNB acc. 88.0% 47.3% 9.0% 74.0% 54.8% 26.0% 10.6% 189
CNB cov. 100% 100% 100% 100% 100% 100% 100% 189
LP acc. 83.0% 50.4% 16.7% 56.3% 49.1% 35.3% 10.1% 10K
LP cov. 94.4% 94.4% 94.4% 94.4% 94.4% 94.4% 94.4% 10K
Table I-1 Best accuracy/coverage results for each attribute
CHAPITRE II
PGPI+: ACCURATE SOCIAL NETWORK USER PROFILING UNDER THE
CONSTRAINT OF A PARTIAL SOCIAL GRAPH
PGPI+: ALGORITHME AMÉLIORÉ D’INFÉRENCE DE PROFILS
UTILISATEURS SUR LES RÉSEAUX SOCIAUX À PARTIR D’UN GRAPHE
SOCIAL PARTIEL
Abstract
Algorithms for social network user profiling suffer from one or more of the following limita-
tions: (1) assuming that the full social graph is available for training, (2) not exploiting the rich
information that is available in social networks such as group memberships and likes, (3) treating
numeric attributes as nominal attributes, and (4) not assessing the certainty of their predictions. In
this paper, we address these challenges by proposing an improved algorithm named PGPI+ (Partial
Graph Profile Inference+). PGPI+ accurately infers user profiles under the constraint of a partial
social graph using rich information about users (e.g. group memberships, views and likes), handles
nominal and numeric attributes, and assesses the certainty of predictions. An experimental evalua-
tion with more than 30,000 user profiles from the Facebook and Pokec social networks shows that
PGPI+ predicts user profiles with considerably more accuracy and by accessing a smaller part of
the social graph than five state-of-the-art algorithms.
Keywords: social networks, inference, user profiles, partial graph
1 Introduction
Online social networks have become extremely popular. Various types of social networks are
used such as friendship networks (e.g. Facebook), professional networks (e.g. ResearchGate) and
interest-based networks (e.g. Flickr). An important problem for ad targeting [25] and for building
recommender systems [5, 17, 19, 24, 27] that rely on social networks is that users often disclose little
information publicly [1, 12]. To address this issue, an important sub-field of social network mining
is now interested in developing algorithms to infer detailed user profiles using publicly disclosed in-
formation [2,21]. Various approaches have been used to solve this problem such as relational Naïve
Bayes classifiers [14], label propagation [11,13], majority voting [6], linear regression [12], Latent-
Dirichlet Allocation [3] and community detection [16]. It was shown that these approaches can
accurately predict hidden attributes of user profiles in many cases. However, all these approaches
suffer from at least two of the following four limitations.
1. Assuming a full social graph. Many approaches assume that the full social graph is available
for training (e.g. [11]). However, in real-life, it is generally unavailable or may be very
costly to obtain or update [3, 8]. A few approaches do not assume a full social graph such
as majority-voting [6]. However, they do not let the user control the trade-off between the
number of nodes accessed and prediction accuracy, which may lead to low accuracy.
2. Not using rich information. Several algorithms do not consider the rich information that is
available on social networks. For example, several algorithms consider links between users
and user attributes but do not consider other information such as group memberships, "likes"
and "views" that are available on some social networks [1, 6, 11, 13, 14, 16].
3. Not handling numeric attributes. Many approaches treat numeric attributes (e.g. age) as
nominal attributes [1, 6], which may decrease inference accuracy. Others are designed to
handle numeric attributes but require the full social graph, which is often impractical [7,
11, 12, 16].
4. Not assessing certainty. Few approaches assess the certainty of their predictions. However, this
information is essential to determine whether a prediction is reliable and whether actions should be taken
based on the prediction. For example, if there is a low certainty that a user profile attribute
is correctly inferred, it may be better to not use this attribute for ad targeting, rather than
showing an ad that is targeted to a different audience [4].
To address limitations 1 and 2, the PGPI algorithm (Partial Graph Profile Inference) was re-
cently proposed [8]. PGPI lets the user select how many nodes of the social graph can be accessed
to infer a user profile, and can use not only information about friendship links and profiles but also
about group memberships, likes and views, when available. In this paper, we present an extended
version of PGPI named PGPI+ to also address limitations 3 and 4. Contributions are threefold.
First, we design a new procedure for predicting values of numeric attributes. Second, we introduce
a mechanism to assess the certainty of predictions for both nominal and numeric attributes. Third,
we introduce four optimizations that considerably improve the overall prediction accuracy of PGPI.
We report results from an extensive empirical evaluation against PGPI and four other state-of-
the-art algorithms, for more than 30,000 user profiles from the Facebook and Pokec social networks.
Results show that the proposed PGPI+ algorithm can provide a considerably higher accuracy for
both numeric and nominal attributes while accessing a much smaller number of nodes from the
social graph. Moreover, results show that the calculated certainty is a good indicator of the reliability
of predictions. Finally, an interesting result is that profile attributes such as status (student/profes-
sor) and gender can be predicted with more than 95% accuracy using PGPI+.
The rest of this paper is organized as follows. Sections 2, 3, 4, 5 and 6 respectively present the
related work, the problem definition, the proposed algorithm, the experimental evaluation and the
conclusion.
2 Related work
We review recent work on social network user profile inference. Davis Jr et al. [6] inferred
locations of Twitter users by performing a majority vote over the locations of directly connected
users. A major limitation of this approach is that a single attribute is considered. Jurgens [11]
predicted locations of Twitter/Foursquare users using a label propagation approach. However, it
is an iterative algorithm that requires the full social graph since it propagates known labels to
unlabeled nodes through links between nodes. Li et al. [13] proposed an iterative algorithm to
deduce LinkedIn user profiles based on relation types. This algorithm also requires a large training
set to discover relation types.
Mislove [16] applied community detection on more than 60k user profiles with friendship links,
and then inferred user profiles based on similarity of members from the same community. Lin-
damood et al. [14] applied a Naïve Bayes classifier on 167k Facebook profiles with friendship
links, and concluded that if links or attributes are erased, accuracy of the approach can greatly
decrease. This study highlights the challenges of performing accurate predictions using few data.
Recently, Blenn et al. [1] utilized bird flocking, association rule mining and statistical analysis to
infer user profiles in a dataset of 3 million Hyves.nl users. However, all these works assume that a
large training set is available and they only use profile information and social links to
perform predictions.
Chaabane et al. [3] inferred Facebook user profiles using Latent Dirichlet Allocation (LDA) and
majority voting. The approach extracts a probabilistic model from music interests and additional
information provided from Wikipedia. But it requires a large training set, which was difficult and
time-consuming to obtain [3]. Kosinski et al. [12] also utilized information about user preferences
to infer Facebook user profiles. Kosinski et al. applied Singular Value Decomposition to a huge
matrix of users/likes and then used regression to perform prediction. A limitation of this work is
that it does not utilize information about links between users and requires a very large training
dataset.
He et al. [9] proposed an approach consisting of building a Bayesian network based on the full
social graph to then predict user attribute values. The approach considers similarity between user
profiles and links between users to perform predictions, and was applied to data collected from
LiveJournal. Recently, Dong et al. [7] used graphical models to predict the age and gender of users.
Their study was performed with data from 1 billion phone calls and SMS messages and 7M user profiles. Chaudhari [4]
also used graphical models to infer user profiles. The approach has shown high accuracy on datasets
of more than 1M users from the Twitter and Pokec social networks. A limitation of these approaches,
however, is that they assume that a large training set is available. Furthermore, they only consider user
attributes and links but not additional information such as likes, views and group membership.
Some of the above approaches infer numeric attributes, either by treating them as nominal
attributes [1, 3, 6], or by using specific inference procedures. However, the latter require a large
training set [7, 11, 12, 16].
Besides, current approaches generally do not assess the certainty of predictions. Yet this in-
formation is essential to determine whether a prediction is reliable and whether actions should be
taken based on it. To our knowledge, only Chaudhari [4] provides this information. However, this
approach is designed to use the full social graph.
3 Problem definition
The problem of user profiling is commonly defined as follows [1, 4, 11, 13, 14, 16].
Definition 1. Social graph. A social graph G is a quadruplet G = {N,L, V, A}. N is the set
of nodes in G. L ⊆ N × N is a binary relation representing the links (edges) between nodes. Let
there be m attributes describing users of the social network, such that V = {V1, V2, ...Vm} contains for
each attribute i, the set of possible attribute values Vi. Finally, A = {A1, A2, ...Am} contains for
each attribute i a relation assigning an attribute value to nodes, that is Ai ⊆ N × Vi.
Example. Consider a social graph with three nodes N = {Tom, Amy, Lea} and friendship links
L = {(Tom, Lea), (Lea, Tom), (Lea, Amy), (Amy, Lea)}. Consider two attributes gender and
status, respectively called attribute 1 and 2 to describe users. The set of possible attribute values for
gender and status are respectively V1 = {male, female} and V2 = {professor, student}. The re-
lations assigning attribute values to nodes are A1 = {(Tom,male), (Amy, female), (Lea, female)}
and A2 = {(Tom, student), (Amy, student), (Lea, professor)}.
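For concreteness, the example above can be written out in a few lines (a minimal Python sketch; the set- and dictionary-based encoding is our own, not prescribed by the definition):

```python
# Sketch of the example social graph G = {N, L, V, A}.
# The encoding is illustrative, not prescribed by the definition.
N = {"Tom", "Amy", "Lea"}
L = {("Tom", "Lea"), ("Lea", "Tom"), ("Lea", "Amy"), ("Amy", "Lea")}
V = {1: {"male", "female"},         # attribute 1: gender
     2: {"professor", "student"}}   # attribute 2: status
A = {1: {"Tom": "male", "Amy": "female", "Lea": "female"},
     2: {"Tom": "student", "Amy": "student", "Lea": "professor"}}

def neighbors(n):
    """Nodes directly linked to n (friendship links are stored symmetrically)."""
    return {b for (a, b) in L if a == n}

print(sorted(neighbors("Lea")))  # ['Amy', 'Tom']
```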
Definition 2. Problem of inferring user profiles in a social graph. The problem of inferring
the user profile of a node n ∈ N in a social graph G is to guess the attribute values of n using the
other information provided in G.
The problem definition can be extended to consider additional information from social networks
such as Facebook (views, likes and group memberships).
Definition 3. Extended social graph. An extended social graph E is a tuple
E = {N,L, V,A,G,NG, P, PG,LP, V P} where N,L, V,A are defined as previously. G is a set
of groups that a user can be a member of. The relation NG ⊆ N × G indicates the membership
of users to groups. P is a set of publications such as pictures, texts, videos that are posted in
groups. PG is a relation PG ⊆ P × G, which associates a publication to the group(s) where it
was posted. LP is a relation LP ⊆ N ×P indicating publication(s) liked by each user (e.g. "likes"
on Facebook). V P is a relation V P ⊆ N × P indicating publication(s) viewed by each user (e.g.
"views" on Facebook), such that LP ⊆ V P .
Example. Consider two groups G = {book_club, music_lovers} such that NG = {(Tom,
book_club), (Lea, book_club), (Amy, music_lovers)}. Consider also two publications P = {picture1,
picture2} published in the groups PG = {(picture1, book_club), (picture2, music_lovers)}.
The publications viewed by users are V P = {(Tom, picture1), (Lea, picture1), (Amy, picture2)}
while the publications liked by users are LP = {(Tom, picture1), (Amy, picture2)}.
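The extended structures from this example can likewise be encoded (an illustrative Python sketch; the set-of-tuples encoding is our own, not prescribed by the definition):

```python
# Sketch of the extended social graph structures from the example.
G  = {"book_club", "music_lovers"}                                   # groups
NG = {("Tom", "book_club"), ("Lea", "book_club"),
      ("Amy", "music_lovers")}                                       # memberships
P  = {"picture1", "picture2"}                                        # publications
PG = {("picture1", "book_club"), ("picture2", "music_lovers")}       # posted in
VP = {("Tom", "picture1"), ("Lea", "picture1"), ("Amy", "picture2")} # views
LP = {("Tom", "picture1"), ("Amy", "picture2")}                      # likes

# The definition requires LP ⊆ VP: every liked publication was also viewed.
assert LP <= VP

def groups_of(n):
    """Groups that user n belongs to."""
    return {g for (u, g) in NG if u == n}

print(sorted(groups_of("Lea")))  # ['book_club']
```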
Definition 4. Problem of inferring user profiles in an extended social graph. The problem
of inferring the user profile of a node n ∈ N in an extended social graph E is to guess the attribute
values of n using the information in E .
But the above definitions assume that the full social graph may be used to perform predictions.
The problem of inferring user profiles using a limited amount of information is defined as follows
[8].
Definition 5. Problem of inferring user profiles using a partial (extended) social graph. Let
maxFacts ∈ N+ be a parameter set by the user. The problem of inferring the user profile of a
node n ∈ N using a partial (extended) social graph E is to accurately predict the attribute values
of n by accessing no more than maxFacts facts from the social graph. A fact is a node, group or
publication from N , G or P (excluding n).
The above definition can be extended for numeric attributes. For those attributes, instead of
aiming at predicting an exact attribute value, the goal is to predict a value that is as close as possible
to the real value. Moreover, in this paper, we also extend the problem to consider the certainty of
predictions. In this setting, a prediction algorithm must assign a certainty value in the [0,1] interval
to each predicted value, such that a high certainty value indicates that a prediction is likely to be
correct.
4 The proposed PGPI+ algorithm
We next present the proposed PGPI+ algorithm. Subsection 4.1 briefly introduces PGPI. Then,
subsections 4.2, 4.3 and 4.4 respectively present optimizations to improve its prediction accuracy
and coverage, and how it is extended to handle numerical attributes and assess the certainty of
predictions.
4.1 The PGPI algorithm
The PGPI algorithm [8] is a lazy algorithm designed to perform predictions under the constraint
of a partial social graph, where at most maxFacts facts from the social graph can be accessed to
make a prediction. PGPI (Algorithm 4) takes as parameter a node ni, an attribute k to be predicted,
the maxFacts parameter, a parameter named maxDistance, and an (extended) social graph E .
PGPI outputs a predicted value v for attribute k of node ni. To predict the value of an attribute k,
PGPI relies on a map M. This map stores pairs of the form (v, f), where v is a possible value for
attribute k, and f is a positive real number called the weight of v. PGPI automatically calculates the
weights by applying two procedures named PGPI-G and PGPI-N, which respectively update the
weights by considering (1) the views, likes and group memberships of ni, and (2) its friendship
links. After applying these procedures, PGPI returns the value v associated with the highest weight in
M as the prediction. In PGPI, half of the maxFacts facts that can be used to make a prediction are
used by PGPI-G and the other half by PGPI-N. If the global maxFacts limit is reached, PGPI
does not perform a prediction. PGPI-N or PGPI-G can be deactivated. If PGPI-N is deactivated,
only views, likes and group memberships are considered to make a prediction. If PGPI-G is deacti-
vated, only friendship links are considered. In the following, we respectively refer to these versions
of PGPI as PGPI-N and PGPI-G (and as PGPI-N+/PGPI-G+ for PGPI+).
PGPI-N works as follows. To predict an attribute value of a node ni, it explores the neighbor-
hood of ni restricted by the parameter maxDistance using a breadth-first search. It first initializes
a queue Q and pushes ni in the queue. Then, while Q is not empty and the number of accessed
facts is less than maxFacts, the first node nj in Q is popped. Then, Fi,j = Wi,j/dist(ni, nj) is
calculated. Wi,j = Ci,j/Ci, where Ci,j is the number of attribute values common to ni and nj , and
Ci is the number of known attribute values for node ni. dist(ni, nj) is the number of edges in the
shortest path between ni and nj . Then, Fi,j is added to the weight of the attribute value of nj for
attribute k, in map M . Then, if dist(ni, nj) ≤ maxDistance, each unvisited node nh linked to nj is
pushed in Q and marked as visited. PGPI-G is similar to PGPI-N. It is also a lazy algorithm. But
it uses a majority voting approach to update weights based on group and publication information
(views and likes). Due to space limitations, we do not describe it. The reader may refer to [8] for
more details.
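The PGPI-N traversal described above can be sketched as follows (an illustrative Python sketch; the function and variable names are ours, and tie-breaking and bookkeeping details are simplified):

```python
from collections import deque

def pgpi_n(ni, k, max_facts, max_distance, A, neighbors):
    """Sketch of PGPI-N. A maps attribute -> {node: value};
    neighbors(n) returns the nodes linked to n."""
    M = {}                                   # candidate value -> weight
    dist = {ni: 0}                           # BFS distance from ni
    Q = deque([ni])
    known_i = [a for a in A if ni in A[a]]   # attributes known for ni
    facts = 0
    while Q and facts < max_facts:
        nj = Q.popleft()
        if nj != ni:
            facts += 1                       # one fact (node) accessed
            c_ij = sum(1 for a in known_i if A[a].get(nj) == A[a][ni])
            w_ij = c_ij / max(len(known_i), 1)
            f_ij = w_ij / dist[nj]           # F_{i,j} = W_{i,j} / dist(n_i, n_j)
            if nj in A[k]:
                v = A[k][nj]
                M[v] = M.get(v, 0.0) + f_ij  # add F_{i,j} to v's weight
        if dist[nj] <= max_distance:
            for nh in neighbors(nj):
                if nh not in dist:           # unvisited
                    dist[nh] = dist[nj] + 1
                    Q.append(nh)
    return max(M, key=M.get) if M else None
```

For instance, a user whose profile matches a friend's profile on known attributes inherits that friend's value with full weight, while dissimilar or distant nodes contribute little.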
Algorithm 4: The PGPI algorithm
input: ni: a node, k: the attribute to be predicted, maxFacts: a user-defined threshold,
maxDistance: a user-defined threshold, E : an extended social graph
output: the predicted attribute value v
1 M = {(v, 0)|v ∈ Vk};
2 // Apply PGPI-G
3 // ...
4 // Apply PGPI-N
5 Initialize a queue Q and add ni to Q;
6 while Q is not empty and |accessedFacts| < maxFacts do
7 nj = Q.pop();
8 Fi,j ← Wi,j/dist(ni, nj);
9 Update (v, f) as (v, f + Fi,j) in M , where (nj, v) ∈ Ak;
10 if dist(ni, nj) ≤ maxDistance then for each node nh ≠ ni such that (nh, nj) ∈ L and
nh is unvisited, push nh in Q and mark nh as visited ;
11 end
12 return a value v such that (v, z) ∈ M ∧ ∄(v′, z′) ∈ M | z′ > z;
4.2 Optimizations to improve accuracy and coverage
In PGPI+, we redefine the formula Fi,j used by PGPI-N by adding three optimizations. The new
formula is Fi,j = Wi,j × (Ti,j + 1) × R/newdist(ni, nj). The first optimization is to add the term
Ti,j + 1, where Ti,j is the number of common friends between ni and nj , divided by the number
of friends of ni. This term is added to consider that two persons having common friends (forming
a triad) are more likely to have similar attribute values. The constant 1 is used so that if ni and nj
have no common friends, Fi,j is not zero.
The second optimization is based on the observation that the term dist(ni, nj) makes Fi,j de-
crease too rapidly. Thus, nodes that are not immediate neighbors but were still close in the so-
cial graph had a negligible influence on their respective profile inference. To address this issue,
dist(ni, nj) is replaced by newdist(ni, nj) = 3 − (0.2 × dist(ni, nj)), where it is assumed that
maxDistance < 15. It was empirically found that this formula provides higher accuracy.
The third optimization is based on the observation that PGPI-G had too much influence on
predictions compared to PGPI-N. To address this issue, we multiply the weights calculated using
the formula Fi,j by a new constant R. This increases the influence of PGPI-N+ on the choice of
predicted values. In our experiments, we found that setting R to 10 provides the best accuracy.
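Putting the three optimizations together, the revised F value can be computed by a small helper (a sketch; the helper name and argument list are ours, with R = 10 by default as found empirically):

```python
def f_plus(w_ij, common_friends, n_friends_i, d_ij, R=10.0):
    """Sketch of the revised F_{i,j} used by PGPI-N+ (names are ours).
    w_ij           : profile similarity W_{i,j}
    common_friends : number of friends shared by n_i and n_j
    n_friends_i    : number of friends of n_i (T_{i,j} = common/total)
    d_ij           : dist(n_i, n_j); assumes maxDistance < 15 so that
                     newdist(n_i, n_j) = 3 - 0.2 * d_ij stays positive."""
    t_ij = common_friends / n_friends_i if n_friends_i else 0.0
    newdist = 3 - 0.2 * d_ij
    return w_ij * (t_ij + 1) / newdist * R
```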
Furthermore, a fourth optimization is integrated in the main procedure of PGPI+. It is based
on the observation that PGPI does not make a prediction for up to 50% of users when maxFacts
is set to a small value [8]. The reason is that PGPI does not make a prediction when it reaches
the maxFacts limit. However, it may have collected enough information to make an accurate
prediction. In PGPI+, a prediction is always performed. This optimization was shown to greatly
increase the number of predictions.
4.3 Extension to handle numerical attributes
The PGPI algorithm is designed to handle nominal attributes. In PGPI+, we performed the
following modifications to handle numeric attributes. First, we modified how the predicted value
is chosen. Recall that the value predicted by PGPI for nominal attributes is the one having the
highest weight in M (line 12). However, for numeric attributes, this approach provides poor accuracy
because few users have exactly the same attribute value. For example, for the attribute "weight",
few users may have the same weight, although they may have similar weights. To address this issue,
PGPI+ calculates the predicted values for numeric attributes as the weighted sum of all values in
M .
Second, we adapted the weighted sum so that it ignores outliers because if unusually large
values are in M , the weighted sum provides inaccurate predictions. For example, if a young user
has friendship links to a few 20-year-old friends but also a link to his 90-year-old grandmother,
the prediction may be inaccurate. Our solution to this problem is to ignore values in M that are
more than one standard deviation away from the mean. In our experiments, this greatly
improved prediction accuracy for numeric attributes.
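The outlier-robust weighted sum can be sketched as follows (our own helper, assuming values more than one standard deviation from the mean of the values in M are dropped before averaging):

```python
from statistics import mean, pstdev

def predict_numeric(M):
    """Sketch of PGPI+'s prediction for a numeric attribute: a weighted
    average of the candidate values in M = {value: weight}, ignoring values
    more than one standard deviation from the mean of the values."""
    values = list(M)
    mu, sd = mean(values), pstdev(values)
    kept = {v: w for v, w in M.items() if abs(v - mu) <= sd} or M
    return sum(v * w for v, w in kept.items()) / sum(kept.values())
```

For instance, with candidate ages {20, 21, 22, 90} and equal weights, the value 90 (the grandmother of the example) is discarded and the prediction is 21.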
Third, we change how Wi,j is calculated. Recall that in PGPI, Wi,j = Ci,j/Ci, where Ci,j is the
number of attribute values common to ni and nj , and Ci is the number of known attribute values
for node ni. This definition does not work well for numeric attributes because numeric attributes
rarely have the same value. To consider that numeric values may not be equal but still be close, Ci,j
is redefined as follows in PGPI+. The value Ci,j is the number of values common to ni and nj for
nominal attributes, plus a value CNi,j,k for each numeric attribute k. The value CNi,j,k is calculated
as 1 − |vi − vj|/αk if |vi − vj| < αk, and is otherwise 0, where αk is a user-defined constant. Because
CNi,j,k is a value in [0,1], numeric attributes may not have more influence than nominal attributes
on Wi,j .
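The revised similarity can be sketched as follows (an illustrative helper; the profile dictionaries and the alpha map are our own conventions, and we use CN = 1 − |vi − vj|/αk so that identical numeric values contribute 1 and distant values 0, consistent with the [0,1] claim above):

```python
def similarity(profile_i, profile_j, alpha):
    """Sketch of W_{i,j} in PGPI+ (our own conventions).
    profile_* map attribute -> known value; alpha maps each numeric
    attribute k to the user-defined constant alpha_k. A numeric attribute
    contributes CN_{i,j,k} = 1 - |v_i - v_j| / alpha_k, a value in [0, 1],
    when |v_i - v_j| < alpha_k, and 0 otherwise."""
    c_ij = 0.0
    for k, vi in profile_i.items():
        if k not in profile_j:
            continue
        vj = profile_j[k]
        if k in alpha:                      # numeric attribute
            diff = abs(vi - vj)
            if diff < alpha[k]:
                c_ij += 1 - diff / alpha[k]
        elif vi == vj:                      # nominal attribute
            c_ij += 1
    return c_ij / len(profile_i)            # C_i = known values of n_i
```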
4.4 Extension to evaluate the certainty of predictions
PGPI+ also extends PGPI with the capability of calculating a certainty value CV (v) for each
predicted value v.
As previously mentioned, few inference algorithms assess the certainty of their predictions.
However, this feature is desirable to know how good a prediction is and whether or not actions
should be taken based on the prediction. For example, in the context of ad targeting, if there is a
low certainty that a user profile attribute is correctly inferred, it may be better to not use this at-
tribute for ad targeting, rather than showing an ad that is targeted to a different audience [4]. To
address this issue, we have added the capability of calculating a certainty value CV (v) for each
value v predicted by PGPI+. A certainty value is a value in the [0, 1] interval. A high certainty value
indicates that a prediction is likely to be correct. Certainty values are calculated differently for nu-
meric and nominal attributes because different inference procedures are used to predict attribute
values.
Certainty of numeric values. Recall that values of numeric attributes are predicted using a
weighted sum. Thus, calculating the certainty value of a predicted value v for a numeric attribute k
requires to find a way to assess how "accurate" the weighted sum for calculating v is. Intuitively,
we can expect the weighted sum to be accurate if (1) the amount of information taken into account
by the weighted sum is large, and (2) if values considered in the weighted sum are close to each
other.
These ideas are captured in our approach by using the relative standard error. Let EM =
{v1, v2, ...vp} be the set of values stored in the map M that were used to calculate the weighted
sum. The amount of information used by the weighted sum is measured as the number of updates
made by PGPI-N+/PGPI-G+ to the map M , denoted as updates. The relative standard error is de-
fined as RSE(v) = stdev(EM)/(√updates× avg(EM)). The RSE is a value in the [0,1] interval
that assesses how close the average of a sample might be to the average of the population. Be-
cause we want a certainty value rather than an error value, we calculate the certainty value of v as:
CV (v) = 1−RSE(v) = 1− [stdev(EM)/(√updates× avg(EM))] .
The proposed approach provides an assessment of the certainty of a prediction. However, a
drawback of this RSE-based approach is that it is sensitive to outliers. To address this issue, we
ignore values that are more than one standard deviation away from the mean when calculating
CV (v).
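This certainty computation for numeric attributes can be sketched as follows (a helper of our own, combining the RSE formula with the outlier filtering described above):

```python
from math import sqrt
from statistics import mean, pstdev

def certainty_numeric(em, updates):
    """Sketch of CV(v) = 1 - RSE(v) for a numeric prediction (names ours).
    em      : the values EM from map M used in the weighted sum
    updates : number of updates made to M by PGPI-N+/PGPI-G+.
    Values more than one standard deviation from the mean are dropped
    before computing the RSE, as described above."""
    mu, sd = mean(em), pstdev(em)
    kept = [v for v in em if abs(v - mu) <= sd] or em
    rse = pstdev(kept) / (sqrt(updates) * mean(kept))
    return 1 - rse
```

Identical values yield CV = 1; widely dispersed values or few updates pull CV down.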
Certainty of nominal values. The procedure for calculating the certainty of a predicted value
v for a nominal attribute is different than for numeric attributes since the inference procedure is not
the same. Our idea is to evaluate how likely the weight z of value v in M is to be as large as it is,
compared to other weights in M .
To estimate this, we use a simulation-based approach where "larger" is defined in terms of a
number of standard deviations away from the mean. The procedure for calculating the certainty of
a value for a nominal attribute is shown in Algorithm 5. It takes as input a value v for an attribute
k and the map of values/weights M . Let ZM = {z1, z2, ...zp} denote the weights of values in
the map M , and f(M, v) be the weight of v in map M . We initialize a variable count to 0 and
perform 1,000 simulations. During the j-th simulation, the algorithm creates a new map B so that
it contains a key-value pair (v, 0) for each possible value v for attribute k in map M . These weights
are initialized to zero. Then, the algorithm performs updates random updates to B. A random
update is performed as follows. First, a pair (v′, z′) is randomly chosen from B. Then, a value x is
randomly selected from the [0, 1] interval. Finally, (v′, z′) is replaced by (v′, z′ + x) in B. At the
end of the j-th simulation, we increase the count variable by 1 if (f(B, v)−avg(ZB)) /stdev(ZB)
≥ (f(M, v)− avg(ZM)) /stdev(ZM). After performing the 1,000 simulations, the certainty value
is calculated as CV (v) = count/1000, which gives a value in the [0,1] interval. This value CV (v)
can be interpreted as follows. It represents the probability that the weight of v in map M would be
as far as it is, in terms of standard deviations, from the average weight of values in map M, for the
attribute k to be predicted.
Thus, by the above mechanisms, a certainty value can be provided for predicted values of both
numeric and nominal attributes.
Algorithm 5: Certainty assessment of a nominal attribute value
input: M : the map of values/weights, v: an attribute value, k: the attribute, updates: the
number of updates to M
output: the certainty value CV (v)
1 count = 0;
2 for 1 to 1000 do
3 B = {(v, 0)|(v, z) ∈M};
4 for 1 to updates do
5 Randomly choose a pair (v′, z′) from B;
6 x = random([0, 1]);
7 Replace (v′, z′) by (v′, z′ + x) in B;
8 end
9 if (f(B, v)− avg(ZB)) /stdev(ZB) ≥ (f(M, v)− avg(ZM)) /stdev(ZM) then
10 count = count+ 1;
11 end
12 end
13 CV (v) = count/1000;
14 return CV(v);
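Algorithm 5 translates almost line by line into Python (a sketch; because the updates are random, returned values vary between runs unless the random seed is fixed):

```python
import random
from statistics import mean, pstdev

def certainty_nominal(M, v, updates, simulations=1000):
    """Sketch of Algorithm 5 (assumes the weights in M are not all equal
    and that updates >= 1). Counts in how many simulations of `updates`
    random updates the weight of v sits at least as many standard
    deviations from the mean weight as it does in the real map M."""
    zm = list(M.values())
    target = (M[v] - mean(zm)) / pstdev(zm)
    count = 0
    for _ in range(simulations):
        B = dict.fromkeys(M, 0.0)          # same keys as M, weights at zero
        keys = list(B)
        for _ in range(updates):
            u = random.choice(keys)        # randomly choose a pair (v', z')
            B[u] += random.random()        # replace (v', z') by (v', z' + x)
        zb = list(B.values())
        if (B[v] - mean(zb)) / pstdev(zb) >= target:
            count += 1
    return count / simulations
```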
5 Experimental evaluation
We compared the accuracy of the proposed PGPI+, PGPI-N+ and PGPI-G+ algorithms with the
original PGPI, PGPI-N and PGPI-G algorithms and four additional state-of-the-art algorithms for
predicting attribute values of nodes in a social network. The first three are Naïve Bayes classifiers
[10]. Naïve Bayes (NB) infers user profiles strictly based on correlations between attribute values.
Relational Naïve Bayes (RNB) considers the probability of having friends with specific attribute
values. Collective Naïve Bayes (CNB) combines NB and RNB. To be able to compare NB, RNB
and CNB with the proposed algorithms, we have adapted them to work with a partial graph. This is
done by training them with maxFacts users chosen randomly instead of the full social graph. The
last algorithm is label propagation (LP) [11]. Because LP requires the full social graph, its results
are only used as a baseline. Each algorithm was tuned with optimal parameter values.
Datasets. Two datasets are used. The first one is 11,247 user profiles collected from Facebook
in 2005 [23]. Each user is described according to seven attributes: a student/faculty status flag,
gender, major, second major/minor (if applicable), dorm/house, year, and high school, where year
is a numerical attribute. The second dataset is 20,000 user profiles from the Pokec social network
obtained at https://snap.stanford.edu/data/. It contains 17 attributes, including three
numeric attributes: age, weight and height. Because neither dataset contains information about
groups, and this information is needed by PGPI-G and PGPI, synthetic data about groups was
generated using the generator proposed in [8], with the same parameters. This generator is
designed to generate groups having characteristics similar to real-life groups.
Accuracy for nominal attributes w.r.t number of facts. We first ran all algorithms while vary-
ing the maxFacts parameter to assess the influence of the number of accessed facts on accuracy
for nominal attributes. The accuracy for nominal attributes is defined as the number of correctly
predicted values, divided by the number of prediction opportunities. Fig. II-1 shows the overall re-
sults for the Facebook and Pokec datasets. Note that the PGPI algorithms are not shown in this figure
due to lack of space. It can be observed that PGPI+/PGPI-N+/PGPI-G+ provide the best results.
For example, on Facebook, PGPI+ and PGPI-G+ provide the best results when 66 to 700 facts are
accessed, and for less than 66 facts, PGPI-N+ provides the best results followed by PGPI+. No
results are provided for PGPI-N+ for more than 306/6 facts on Facebook/Pokec because PGPI-N+
relies solely on links between nodes to perform predictions and the datasets do not contain enough
links. It is also interesting to note that PGPI-N+ only uses real data (contrary to PGPI+/PGPI-G+)
and still performs better than all other algorithms. The algorithm providing the worst results is LP
(not shown in the figure). LP provides an accuracy of 43.2%/47.31% on Facebook/Pokec. This is
not good considering that LP uses the full social graph of more than 10,000 nodes. For the family of
Naïve Bayes algorithms, NB has the best overall accuracy. It can be further observed that the accu-
racy of the PGPI+ algorithm is up to 34% higher than that of PGPI, which shows that the proposed
optimizations have a major influence on accuracy.
[Figure: accuracy (%) vs. number of accessed facts on the Pokec and Facebook datasets, with curves for PGPI+, PGPI-G+, PGPI-N+, PGPI, PGPI-G, PGPI-N, NB, RNB and CNB]
Figure II-1 Accuracy w.r.t. number of accessed facts for nominal attributes
Best results for each nominal attribute. To better understand the previous results, a second
experiment was performed where we analyzed the accuracy obtained for each nominal attribute
separately. The goal of this analysis is to see how well each attribute can be predicted and also if
some approaches perform better than others for some specific attributes.
The best results in terms of accuracy for each attribute and algorithm for the Facebook and
Pokec datasets are respectively shown in Tables II-1 and II-2. The last row of each table indicates the
number of accessed facts used to obtain these results. Note that in these tables, the results for the PGPI
algorithms are not shown because they are always worse than those of the optimized PGPI+ algorithms.
Overall, it can be observed that the best accuracy is in general achieved by the PGPI+ algorithms
for all attributes. In several cases, the PGPI+ algorithms provide an accuracy that is more than 10%
higher than that of the compared state-of-the-art algorithms, and the gap can be greater than 30% in
some cases.
A second important observation is that the PGPI+ algorithms can predict some attributes with
very high accuracy. For example, the gender of a Facebook user can be predicted with up to 95%
accuracy by PGPI+. Another example is that PGPI-N+ can guess with up to 94% accuracy whether a
Pokec user speaks a language such as Spanish or Italian. The high accuracy that is obtained may be
very useful for real applications of user profiling such as ad targeting.
A third observation is that some attributes are more difficult to predict. For example, most algo-
rithms have a relatively low accuracy for the school attribute on the Facebook dataset. A reason is
that unlike binary attributes such as gender, the school attribute has a very high number of possible
values (1,420 values, each corresponding to a possible high-school attended by a user). Another
reason is that users are rarely linked to other users with the same value for the school attribute.
Since in 2005, users on Facebook were mostly university students, it seems reasonable that users
may often not come from the same high-school as their Facebook friends attending the same uni-
versity. A similar phenomenon occurs with the region attribute for the Pokec dataset, which indicates
the region where a given user lives.
Best results for each numeric attribute. Previous subsections presented accuracy results for
nominal attributes. In this subsection, an experiment is performed to compare the accuracy of PGPI
algorithms and PGPI+ algorithms for numeric attributes. The goal is to evaluate the improvement
provided in PGPI+ for handling numeric attributes. Furthermore, we also compared accuracy ob-
tained with the NB, RNB, CNB and LP algorithms. Because these latter algorithms and PGPI treat
numeric attributes as nominal attributes, they are used as baselines. Note that we attempted to also
include the algorithm of Kosinski [12] in our comparison because it is designed for inferring numeric
attributes. However, it relies on linear regression, and the training failed when using a partial social
graph.
attribute PGPI+ PGPI-N+ PGPI-G+ NB RNB CNB LP
status 92.0% 92.6% 93.0% 88.0% 80.2% 88.0% 83.0%
gender 96.1% 84.7% 95.8% 51.1% 57.7% 47.3% 50.4%
major 33.8% 30.4% 32.8% 16.6% 15.0% 9.0% 16.7%
minor 76.0% 76.4% 76.6% 74.4% 74.2% 74.0% 56.3%
residence 64.6% 62.4% 64.4% 55.6% 55.0% 54.8% 49.1%
school 7.4% 16.8% 7.0% 10.6% 9.8% 10.6% 10.1%
|facts| 482 226 431 189 580 189 10K
Table II-1 Best accuracy results for nominal attributes on Facebook
In this experiment, the accuracy is measured differently than in the experiments on nominal
attributes since the goal for numeric attributes is to predict attribute values that are as close as
possible to the real values rather than predicting the exact value. Thus, the accuracy for numeric
attributes is measured in terms of average error and standard deviation of predicted values from the
real values.
Table II-3 shows the best results obtained by the PGPI and PGPI+ algorithms for each numeric
attribute for the Facebook and Pokec datasets, and the number of facts accessed to obtain these
results. It can first be observed that PGPI+ performs best overall. For example, the PGPI-N+
algorithm performs better than the PGPI-N algorithm both in terms of average error and standard
deviation for almost all attributes. For some attributes, the improvement provided by PGPI+ over
PGPI is greater than for other attributes. For example, for the height attribute, the average error
obtained by PGPI-N+ is 10.32 instead of 14.0 for PGPI-N, and the standard deviation is 12.55 for
PGPI-N+ instead of 36.52 for PGPI-N, a considerable improvement. The reason why PGPI+ per-
forms better than PGPI in general is that it has specific inference procedures for numeric attributes,
while PGPI treats them as nominal attributes.
attribute PGPI+ PGPI-N+ PGPI-G+ NB RNB CNB LP
Gender 95.60% 61.40% 95.77% 52.80% 53.80% 53.60% 49.20%
English 76.35% 63.79% 76.00% 69.74% 69.74% 69.74% 65.40%
French 87.46% 84.48% 87.42% 86.91% 85.60% 86.87% 67.15%
German 62.39% 54.31% 62.85% 47.83% 48.12% 47.83% 50.00%
Italian 94.87% 94.25% 94.85% 94.65% 95.38% 95.41% 85.75%
Spanish 95.15% 94.54% 95.14% 94.38% 95.08% 94.29% 80.52%
Smoker 65.21% 62.34% 65.42% 63.43% 63.43% 63.12% 60.19%
Drink 71.65% 63.36% 71.47% 70.41% 70.41% 70.41% 49.16%
Marital status 76.57% 70.86% 76.40% 76.11% 76.02% 76.07% 69.92%
Hip-hop 86.51% 82.20% 86.47% 86.01% 85.83% 85.93% 61.82%
Rap 69.33% 63.78% 69.52% 69.08% 69.35% 69.35% 45.78%
Rock 77.93% 73.80% 78.09% 76.33% 74.93% 74.93% 53.69%
Disco 58.40% 52.50% 58.56% 50.07% 53.18% 53.46% 47.28%
Metal 86.19% 83.52% 86.15% 84.75% 84.61% 84.61% 63.79%
Region 18.60% 10.20% 18.71% 6.20% 6.20% 6.20% 10.00%
|facts| 334 6 347 375 378 278 10k
Table II-2 Best accuracy results for nominal attributes on Pokec
Assessment of certainty values. Another experiment was performed to assess the certainty
values calculated by PGPI+, PGPI-N+ and PGPI-G+. Fig. II-2 and Fig. II-3 respectively show
the best accuracy obtained for the numeric attributes year and age for each of those algorithms. Each
line represents the accuracy of predictions having at least a given certainty value. It can be seen
that a high certainty value generally means a low average error, standard deviation and coverage
(percentage of predictions made), as expected.
For nominal attributes, the accuracy of predictions made by PGPI-N+, PGPI, PGPI-G+ having
a certainty no less than 0 and no less than 0.7 are respectively 59%, 59% and 61% on Pokec, and
91%, 91% and 93% on Facebook, which also shows that the proposed certainty assessment is a
Table II-3 Average error and standard deviation for numeric attributes
Facebook Pokec
algorithm year |facts| age weight height |facts|
PGPI+ 0.95 (0.85) 241 2.94 (4.55) 9.83 (10.32) 7.70 (11.75) 189
PGPI-N+ 0.68 (0.72) 44 3.92 (4.56) 14.60 (12.56) 10.32 (12.55) 6
PGPI-G+ 0.99 (0.89) 252 2.89 (4.45) 9.83 (10.37) 7.71 (11.76) 290
PGPI 0.46 (0.93) 148 2.55 (4.80) 11.67 (11.53) 8.75 (12.43) 251
PGPI-N 0.46 (0.87) 103 4.35 (5.11) 17.28 (15.61) 14.0 (36.52) 6
PGPI-G 0.39 (0.93) 223 2.20 (4.78) 10.75 (10.86) 8.35 (12.45) 344
NB 1.30 (0.81) 1,743 5.78 (4.34) 14.13 (12.40) 13.71 (11.17) 8
RNB 1.30 (0.81) 1,746 3.47 (4.71) 13.74 (13.91) 15.65 (13.92) 83
CNB 1.30 (0.81) 1,746 3.33 (4.40) 13.38 (14.34) 15.30 (12.47) 84
LP 1.12 (1.30) 11,247 5.07 (5.38) 18.42 (59.35) 19.29 (15.80) 20,000
good indicator of accuracy.
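For numeric attributes, "average error (standard deviation)" entries such as those in Table II-3 can be computed roughly as follows. This is a minimal sketch with made-up predicted/actual pairs, not the thesis data or the exact evaluation code.

```python
import math

# Sketch: mean absolute error and its standard deviation for a numeric
# attribute such as "age". The predicted/actual pairs are made-up values.

def error_stats(pairs):
    """Return (mean absolute error, population standard deviation of the
    absolute errors) for a list of (predicted, actual) pairs."""
    errors = [abs(p - a) for p, a in pairs]
    mean = sum(errors) / len(errors)
    variance = sum((e - mean) ** 2 for e in errors) / len(errors)
    return mean, math.sqrt(variance)

pairs = [(25, 27), (31, 30), (22, 22), (40, 35)]
mean_err, std_err = error_stats(pairs)
print(f"{mean_err:.2f} ({std_err:.2f})")  # → 2.00 (1.87)
```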
Best results using the full social graph. We also compared the accuracy of the algorithms
using the full social graph. The best accuracy obtained for each algorithm on the Facebook and
Pokec datasets is shown in Table II-4. It can be observed that the proposed PGPI+ algorithms
provide an accuracy that is considerably higher than the accuracy of the compared algorithms, even
when using the full social graph.
6 Conclusion
We proposed an improved algorithm named PGPI+ for user profiling in online social networks
under the constraint of a partial social graph, using rich information. PGPI+ extends the PGPI
algorithm with new optimizations to improve its prediction accuracy and coverage, to handle
numerical attributes, and to assess the certainty of predictions. An experimental evaluation with
more than 30,000 user profiles from the Facebook and Pokec social networks shows that PGPI+ predicts
user profiles with considerably more accuracy, and by accessing a smaller part of the social graph,
than five state-of-the-art algorithms. Moreover, an interesting result is that profile attributes such as
status (student/professor) and gender can be predicted with more than 95% accuracy using PGPI+.

[Figure II-2 Average error, standard deviation and coverage w.r.t. certainty for the "year" attribute.
Panels a) and b) plot, for PGPI+, PGPI-N+ and PGPI-G+, the average error, standard deviation and
coverage against the number of accessed facts, for certainty thresholds 0.3/0.5/0.7 and 0.5/0.7/0.9.]

Table II-4 Best accuracy for nominal attributes using the full social graph

algorithm  Facebook  Pokec     algorithm  Facebook  Pokec
PGPI+      96.6      73.8      PGPI-G     62.8      56.2
PGPI-N+    96.4      73.9      NB         48.67     57.48
PGPI-G+    84.9      65.9      RNB        50.11     56.37
PGPI       63.8      62.0      CNB        50.11     56.40
PGPI-N     63.8      62.1      LP         48.03     47.31
[Figure II-3 Average error, standard deviation and coverage w.r.t. certainty for the "age" attribute.
Panels a) and b) plot, for PGPI+, PGPI-N+ and PGPI-G+, the average error, standard deviation and
coverage against the number of accessed facts, for certainty thresholds 0.3/0.5/0.7 and 0.5/0.7/0.9.]
There are several opportunities to improve the proposed algorithms. For future work, we plan to
propose new optimizations for PGPI+, to extend the proposed algorithms with text analysis
capabilities, and to consider other types of information for making predictions. We also wish to
perform experiments with other social networks and to compare accuracy with more algorithms.
Lastly, we intend to develop algorithms for community detection [18,22,26] that are optimized to
use a small number of facts from social networks.
GENERAL CONCLUSION
In this work, we proposed two novel algorithms for inferring user profiles in social networks
using a partial social graph. To our knowledge, these algorithms are the first to let the user decide
how much information may be accessed from the social graph to make a prediction.
The first algorithm, PGPI, infers user profiles in the absence of a complete social graph and
requires no training before being applied. It consists of two procedures named PGPI-N (Partial
Graph Profile Inference - Node) and PGPI-G (Partial Graph Profile Inference - Group), which
infer user profiles based respectively on (1) friendship links and (2) group memberships, "likes"
and "views" (when this information is available). An experimental evaluation was presented
comparing the performance of PGPI to four user profile inference algorithms from the literature
(NB, CNB, RNB and LP), using a real dataset from the Facebook social network. In this
experiment, three performance measures were used: accuracy, the number of nodes accessed in the
social graph, and coverage. The results showed that PGPI achieves an accuracy up to 34% higher
than that of the compared methods, while using less information to make predictions. It is also
interesting to note that PGPI can predict some attributes with an accuracy above 90%.
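As a rough illustration of this kind of link-based inference, consider a simplified majority vote over the known attribute values of a user's friends. This sketch is only an analogy; it is not the actual PGPI-N procedure.

```python
from collections import Counter

# Simplified sketch of link-based profile inference: predict a user's
# attribute as the majority value among friends whose value is known,
# with the agreeing fraction as a crude certainty. This is an
# illustration, not the actual PGPI-N procedure.

def infer_from_friends(friend_values):
    """friend_values: known attribute values of the user's friends.
    Returns (predicted value, certainty = fraction of friends agreeing)."""
    if not friend_values:
        return None, 0.0
    counts = Counter(friend_values)
    value, votes = counts.most_common(1)[0]
    return value, votes / len(friend_values)

print(infer_from_friends(["fr", "fr", "en", "fr"]))  # → ('fr', 0.75)
```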
Although PGPI provides very high accuracy and accesses fewer nodes in the social graph, it
treats numeric attributes as nominal attributes and does not take the certainty of predictions into
account.
To address these limitations, we introduced an improved version of PGPI named PGPI+. It
introduces new optimizations to increase the prediction accuracy and coverage of PGPI. Moreover,
PGPI+ handles numeric attributes in an appropriate way and offers a mechanism to estimate the
certainty of predictions. Thanks to these novelties, PGPI+ offers a considerable gain in prediction
accuracy and coverage.
We carried out an experiment to compare the performance of PGPI+ with that of PGPI and of
the same four algorithms from the literature, using user profiles from the Pokec and Facebook
social networks. The results show that PGPI+ provides higher accuracy on all the datasets used,
while accessing fewer nodes in the social graph. We also compared the best accuracy of
PGPI/PGPI+ for numeric attributes in terms of the average error and standard deviation of the
predicted values with respect to the real values. The results indicate that PGPI+ generally provides
more accurate inference. Finally, experimental results also show that the certainty computed for
the predictions is a good indicator of prediction accuracy.
This research opens the way to several avenues of future work. Among others, we plan to
compare PGPI+ with other user profile inference algorithms specialized in handling numeric
attributes or in estimating the certainty of predictions. In addition, we plan to improve the PGPI+
algorithm by proposing new optimizations for handling numeric attributes. Finally, experiments
with data from other social networks are envisioned.
REFERENCES
[1] Blenn, N., Doerr, C., Shadravan, N., Van Mieghem, P.: How much do your friends know about you?: reconstructing private information from the friendship graph. In: Proc. of the Fifth Workshop on Social Network Systems, pp. 1–6. ACM (2012)
[2] Bonchi, F., Castillo, C., Gionis, A., Jaimes, A.: Social network analysis and mining for business applications. ACM Transactions on Intelligent Systems and Technology, 2(3), 1–37 (2011)
[3] Chaabane, A., Acs, G., Kaafar, M. A.: You are what you like! Information leakage through users' interests. In: Proc. of the 19th Annual Network and Distributed System Security Symposium. The Internet Society (2012)
[4] Chaudhari, G., Avadhanula, V., Sarawagi, S.: A few good predictions: selective node labeling in a social network. In: Proc. of the 7th ACM International Conference on Web Search and Data Mining, pp. 353–362. ACM (2014)
Chen, P. T., Hsieh, H. P.: Personalized mobile advertising: its key attributes, trends, and social impact. Technological Forecasting and Social Change, 79(3), 543–557 (2012)
[5] Dahimene, R., Constantin, C., Du Mouza, C.: RecLand: a recommender system for social networks. In: Proc. 23rd ACM International Conference on Information and Knowledge Management, pp. 2063–2065 (2014)
[6] Davis Jr, C. A. et al.: Inferring the location of Twitter messages based on user relationships. Transactions in GIS, 15(6), pp. 735–751 (2011)
[7] Dong, Y., Yang, Y., Tang, J., Yang, Y., Chawla, N. V.: Inferring user demographics and social strategies in mobile social networks. In: Proc. of the 20th ACM International Conference on Knowledge Discovery and Data Mining, pp. 15–24. ACM (2014)
[8] Dougnon, Y. R., Fournier-Viger, P., Nkambou, R.: Inferring user profiles in social networks using a partial social graph. In: Proc. 28th Canadian Conference on Artificial Intelligence, Springer, LNAI 9091, pp. 84–99 (2015)
[9] He, J., Chu, W. W., Liu, Z. V.: Inferring privacy information from social networks. In: Proc. of the 2006 IEEE International Conference on Intelligence and Security Informatics, pp. 154–165. Springer, Heidelberg (2006)
[10] Heatherly, R., Kantarcioglu, M., Thuraisingham, B.: Preventing private information inference attacks on social networks. IEEE Transactions on Knowledge and Data Engineering, 25(8), pp. 1849–1862 (2013)
[11] Jurgens, D.: That's what friends are for: inferring location in online social media platforms based on social relationships. In: Proc. of the 7th International AAAI Conference on Weblogs and Social Media, pp. 273–282. AAAI Press (2013)
[12] Kosinski, M., Stillwell, D., Graepel, T.: Private traits and attributes are predictable from digital records of human behavior. Proceedings of the National Academy of Sciences, 110(15), pp. 5802–5805 (2013)
[13] Li, R., Wang, C., Chang, K. C. C.: User profiling in an ego network: co-profiling attributes and relationships. In: Proc. of the 23rd International Conference on World Wide Web, pp. 819–830. ACM (2014)
[14] Lindamood, J., Heatherly, R., Kantarcioglu, M., Thuraisingham, B.: Inferring private information using social network data. In: Proc. of the 18th International Conference on World Wide Web, pp. 1145–1146. ACM (2009)
[15] Compendium of global digital statistics. We Are Social Ltd. (2015)
[16] Mislove, A., Viswanath, B., Gummadi, K. P., Druschel, P.: You are who you know: inferring user profiles in online social networks. In: Proc. of the 3rd ACM International Conference on Web Search and Data Mining, pp. 251–260. ACM (2010)
[17] Mazhari, S., Fakhrahmad, S. M., Sadeghbeygi, H.: A user-profile-based friendship recommendation solution in social networks. Journal of Information Science, 41(3), 284–295 (2015)
[18] Papadopoulos, S., Kompatsiaris, Y., Vakali, A., Spyridonos, P.: Community detection in social media. Data Mining and Knowledge Discovery, 24(3), pp. 515–554 (2012)
[19] Petrovic, I., Perkovic, P., Stajduhar, I.: A profile- and community-driven book recommender system. In: Proc. 38th International Convention on Information and Communication Technology, Electronics and Microelectronics, pp. 631–635 (2015)
[20] Quercia, D., Kosinski, M., Stillwell, D., Crowcroft, J.: Our Twitter profiles, our selves: predicting personality with Twitter. In: Proc. of the 3rd IEEE International Conference on Social Computing, pp. 180–185. IEEE Press (2011)
[21] Scott, J., Carrington, P. J.: The SAGE Handbook of Social Network Analysis. SAGE Publications (2011)
[22] Takaffoli, M., Sangi, F., Fagnan, J., Zaïane, O. R.: Community evolution mining in dynamic social networks. Procedia - Social and Behavioral Sciences, 22, pp. 49–58 (2011)
[23] Traud, A. L., Mucha, P. J., Porter, M. A.: Social structure of Facebook networks. Physica A: Statistical Mechanics and its Applications, 391(16), pp. 4165–4180 (2012)
[24] Tiroshi, A., Kuflik, T., Kay, J., Kummerfeld, B.: Recommender systems and the social web. In: Proc. 19th International Conference on Advances in User Modeling, pp. 60–70 (2012)
[25] Tucker, C. E.: Social networks, personalized advertising, and privacy controls. Journal of Marketing Research, 51(5), 546–562 (2014)
[26] Wakita, K., Tsurumi, T.: Finding community structure in mega-scale social networks. In: Proc. 16th International Conference on World Wide Web, pp. 1275–1276 (2007)
[27] Zhou, X., Xu, Y., Li, Y., Josang, A., Cox, C.: The state-of-the-art in personalized recommender systems for social networking. Artificial Intelligence Review, 37(2), 119–132 (2012)