
Transcript
Scalable infrastructures for personalization
Anne-Marie Kermarrec

Inria, France: 8 research centers, 150 research teams
The eight Inria research centers:
- Inria RENNES Bretagne Atlantique
- Inria BORDEAUX Sud-Ouest
- Inria PARIS - Rocquencourt
- Inria LILLE Nord Europe
- Inria NANCY Grand Est
- Inria SACLAY Île-de-France
- Inria GRENOBLE Rhône-Alpes
- Inria SOPHIA ANTIPOLIS Méditerranée
3 June 2014, A.-M. Kermarrec (Inria)

A cry for personalization

Why is personalization so difficult?
- Huge volume of data, of which only a small portion is of interest
- Interests are dynamic
- Interesting content does not always come from friends
- Classical notification systems filter either too little or too much
Scalable personalization infrastructures

KNN computation over large data
Basic building block for many applications:
- Similarity search
- Machine learning
- Data mining
- Image processing
- Collaborative filtering

KNN-based user-centric collaborative filtering
- Provide each user with her k closest neighbors (each user owns a profile; the system has its favorite similarity metric)
- Use this topology for personalized notifications and recommendation
[Diagram: users Alice, Bob, Carl, Dave, Ellie]

Dealing with truly big data
- Want to scale? Think P2P

Do not look exhaustively

The key to scalability in KNN graph construction
- Consider a partial set of candidates
- Sampling-based approach

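The idea of looking only at a partial candidate set can be illustrated with a minimal Python sketch. It is not the construction used in the talk: the candidate set (current neighbors, their neighbors, and a few random users), the function name `refine_knn`, and its parameters are assumptions made for illustration. Profiles are assumed to be sets of liked items, and the two metrics named in the slides (cosine and Jaccard) are written for such binary profiles.

```python
import math
import random


def jaccard(a, b):
    """Jaccard similarity between two item sets."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def cosine(a, b):
    """Cosine similarity for binary profiles stored as item sets."""
    if not a or not b:
        return 0.0
    return len(a & b) / math.sqrt(len(a) * len(b))


def refine_knn(profiles, knn, k, sim=jaccard, n_random=2, rng=random):
    """One sampling round: each user ranks a partial candidate set only.

    Assumed candidate set: current neighbors, neighbors of neighbors,
    and n_random users drawn uniformly at random.
    """
    new_knn = {}
    for u, neighbors in knn.items():
        candidates = set(neighbors)
        for v in neighbors:
            candidates.update(knn[v])  # neighbors of neighbors
        others = [w for w in profiles if w != u]
        candidates.update(rng.sample(others, min(n_random, len(others))))
        candidates.discard(u)
        # Highest similarity first; ties broken by name for determinism.
        ranked = sorted(candidates, key=lambda v: (-sim(profiles[u], profiles[v]), v))
        new_knn[u] = ranked[:k]
    return new_knn


# Hypothetical profiles for the five users shown on the slides.
demo_profiles = {
    "Alice": {1, 2, 3},
    "Bob": {2, 3, 4},
    "Carl": {3, 4, 5},
    "Dave": {7, 8},
    "Ellie": {1, 8, 9},
}
demo_rng = random.Random(0)
demo_knn = {
    u: demo_rng.sample([w for w in demo_profiles if w != u], 2)
    for u in demo_profiles
}
for _ in range(10):
    demo_knn = refine_knn(demo_profiles, demo_knn, k=2, rng=demo_rng)
print(demo_knn)
```

Because the current neighbors are always part of the candidate set, a round can never make a neighborhood worse; the random samples are what let a user eventually meet peers outside its current cluster.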
P2P KNN graph construction
- Which nodes are close? Similarity metric
- How to discover them? Sampling

Which nodes are close?
- Model: U(sers), I(tems)
- Profile(u) = vector of liked/shared/viewed items
- Cosine similarity metric, Jaccard metric
- Minimal information: no tags, no user input, generic

How to discover them: Gossip-based computing
- Each node maintains a set of neighbors (c entries)
- Peer exchange (shuffle) between two nodes P and Q
- Result: a random graph
- Highly resilient against churn and partitions
- Small diameter
[JGKVV, ACM TOCS 2007]

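A gossip shuffle of this kind can be sketched in a few lines of Python. This is a simplified illustration, not the protocol of [JGKVV, ACM TOCS 2007]: the function names, the `n_exchange` parameter, and the merge policy (received entries first, truncated to c entries) are assumptions.

```python
import random


def shuffle(view_p, view_q, p, q, c, n_exchange, rng=random):
    """One exchange between peers p and q; returns their two new views."""
    # Each side sends a random subset of its view plus its own identifier.
    sent_p = rng.sample(view_p, min(n_exchange, len(view_p))) + [p]
    sent_q = rng.sample(view_q, min(n_exchange, len(view_q))) + [q]

    def merge(own, received, self_id):
        # Received entries first, duplicates removed, own id excluded.
        merged = [x for x in dict.fromkeys(received + own) if x != self_id]
        return merged[:c]  # keep at most c entries

    return merge(view_p, sent_q, p), merge(view_q, sent_p, q)


def run_cycle(views, c, n_exchange, rng=random):
    """Every node picks one random neighbor and shuffles with it."""
    for p in list(views):
        q = rng.choice(views[p])
        views[p], views[q] = shuffle(views[p], views[q], p, q, c, n_exchange, rng)


# Start from a ring, a deliberately poor topology, and let gossip mix it.
demo_n = 12
demo_views = {i: [(i + 1) % demo_n] for i in range(demo_n)}
demo_rng = random.Random(1)
for _ in range(20):
    run_cycle(demo_views, c=4, n_exchange=2, rng=demo_rng)
print(demo_views)
```

Each node only ever stores c entries and talks to one peer per cycle, which is why the cost per node stays constant as the system grows.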
KNN construction
- Similarity computation
- Exchange of neighbor lists
- Neighborhood optimization
[Diagram: steps 1 and 2 among users Alice, Bob, Carl, Dave, Ellie, Frank]

Decentralized KNN selection [FGKL Middleware 2010]
- RPS layer providing random sampling
- KPS clustering layer: gossip-based topology clustering
[Diagram: interest-based links and random links among Alice, Bob, Carl, Dave, Ellie]

Convergence
[Figure: c current neighbors versus the c closest, over cycles; curves: biased sampling, random sampling]

Applications
- Decentralized news recommendation [BFGJK, IPDPS 2013]
- Top-K [BGKL, ACM TODS 2011] [BGK, ACM TOIT 2014]
- Geo recommendation [BKKT, ICDCS 2012]

DECENTRALIZED NEWS RECOMMENDER

Notification is taking over

What's wrong with news feeds
- Interests are dynamic
- Classical notification systems filter at the wrong granularity
- Only a small portion of the available information is of interest
- Interesting content does not always come from friends

WhatsUp in a nutshell
- KNN selection
- Dissemination

Dissemination: orientation and amplification
- Orientation: to whom?
  - Exploit: forward to friends
  - Explore: forward to random users
- Amplification: to how many?
  - Increase fanout (log(n))
  - Decrease fanout (1)

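The two decisions above can be combined into one forwarding rule. The sketch below is one reading of the slide, not the actual WhatsUp code: it assumes that a liked item is exploited (sent to neighbors from the KNN view with a fanout of about log(n)) and a disliked item is explored (sent to a single random user). The function name, the natural logarithm, and the rounding are assumptions.

```python
import math
import random


def forward_targets(liked, knn_view, random_view, n_users, rng=random):
    """Pick the peers a news item is forwarded to after the user's opinion."""
    if liked:
        # Exploit and amplify: similar users, fanout of about log(n).
        fanout = max(1, math.ceil(math.log(n_users)))
        pool = knn_view
    else:
        # Explore and dampen: one random user.
        fanout = 1
        pool = random_view
    return rng.sample(pool, min(fanout, len(pool)))


demo_rng = random.Random(3)
demo_knn_view = [f"friend{i}" for i in range(10)]
demo_random_view = [f"peer{i}" for i in range(10)]
print(forward_targets(True, demo_knn_view, demo_random_view, 480, demo_rng))
print(forward_targets(False, demo_knn_view, demo_random_view, 480, demo_rng))
```

With 480 users, as in the survey used in the talk, this rule forwards a liked item to 7 neighbors and a disliked item to 1 random user.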
WhatsUp in action on the survey (480 users)

Approach         Precision  Recall  F1-Score  Messages
Gossip (f=4)     0.34       0.99    0.51      2.3 M
Cosine-CF        0.50       0.65    0.57      5,9k
WhatsUp (f=10)   0.471      0.83    0.60      2,4k

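The F1-Score column is the harmonic mean of precision and recall, F1 = 2PR / (P + R). A short Python check, with a helper name chosen here for illustration, recomputes the three F1 values from the precision and recall figures in the table:

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall copied from the survey table.
survey_rows = {
    "Gossip (f=4)": (0.34, 0.99),
    "Cosine-CF": (0.50, 0.65),
    "WhatsUp (f=10)": (0.471, 0.83),
}
for name, (p, r) in survey_rows.items():
    print(f"{name}: F1 = {f1_score(p, r):.2f}")
```

The check makes the trade-off visible: plain gossip reaches almost every interested user (recall 0.99) but its low precision pulls F1 down, while WhatsUp gives up some recall for a better balance.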
[Figure 7: Cold start and dynamics in WHATSUP. Panels: (b) Similarity in WUP view (WHATSUP-Cos); (c) Reception of liked news items (WHATSUP). x-axis: Cycle]

[...] nodes (machines and users) deployed on a 25-node cluster equipped with the ModelNet network emulator. For practical reasons we consider a shorter trace and very fast gossip and news-generation cycles of 30 sec, with 5 news items per cycle. These gossip frequencies are higher than those we use in our prototype, but they were chosen to be able to run a large number of experiments in reasonable time. We also use a profile window of 4 min, compatible with the duration of our experiments (1 to 2 hours each).

[Figure (a) Survey: F1-Score versus fanout (Flike); curves: Simulation, PlanetLab, ModelNet]

Orientation (survey)
News items received through a dislike forward

Number of dislikes      0    1    2    3    4
Fraction of liked news  54%  31%  10%  3%   2%

[Figure 6: Survey (f LIKE = 5): Impact of amplification of BEEP. x-axis: NB Hops; y-axis: NB Nodes; curves: Forward by like, Infection by like, Forward by dislike, Infection by dislike]

WhatsUp versus Pub/Sub

Approach   Precision  Recall  F1-Score
Pub/Sub    0.40       1.0     0.58
WhatsUp    0.47       0.83    0.60

WhatsUp versus cascading

Approach   Precision  Recall  F1-Score
Cascading  0.57       0.09    0.16
WhatsUp    0.56       0.57    0.57

PRIVACY MATTERS

Privacy issues
- During user clustering: exchange of profiles in clear
- During item dissemination: predictive nature of the protocol
- Countermeasures: profile obfuscation, randomized dissemination

Privacy
- Obfuscation
  - Does not reveal the exact profile
  - Reveals only the least sensitive information
- Randomized dissemination
  - Flips the opinion with a given probability (pf)

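Flipping an opinion with probability pf is a form of randomized response, and can be sketched in Python as follows. The function name is hypothetical, and the sketch covers only the flip itself, not the rest of the dissemination protocol.

```python
import random


def randomized_opinion(liked, pf, rng=random):
    """Return the opinion used for dissemination: flipped with probability pf."""
    if rng.random() < pf:
        return not liked
    return liked


# Empirical flip rate for pf = 0.2 over many independent decisions.
demo_rng = random.Random(42)
demo_trials = 10_000
demo_flips = sum(randomized_opinion(True, 0.2, demo_rng) is False for _ in range(demo_trials))
print(f"flipped {demo_flips / demo_trials:.1%} of opinions")
```

An observer who sees a forward can no longer be sure whether the user liked the item; the price is that some items are forwarded against the user's real opinion, which affects recommendation quality as pf grows.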
Structure profiles
- Private profile: in clear, full information about the interests
- Compact profile: aggregate signatures of liked items
- Filter profile: interests of users that like similar items
- Item profile: aggregate interests of users that liked it
- Obfuscated profile: least sensitive information about interests

Obfuscation mechanism
[Diagram: News item (received), Private profile, Compact profile, Filter profile, Obfuscated profile, News item (forwarded). Regions: profiles kept locally, profiles exchanged with others. Edge labels: signature, item profile, mask of popularity, system parameter. Operators: +, x]

Randomized dissemination
- Flips the opinion with a given probability (pf)
- An attacker could still learn from the profile
  - The private profile contains a field with the result of the randomized decision
  - Generate a randomized compact profile
  - Users still use their non-randomized profile locally for clustering
- Differentially private protocol

Experimental setup
- Simulations and PlanetLab
- Alternatives: cleartext profile (CT); 2DP (DP dissemination and randomized profile for clustering)
- Metrics
  - Recommendation: recall/precision
  - Privacy: distance between the obfuscated profile and the real profile
- Dataset: real survey, 120 users on 200 news items (4 instances)

Impact of randomization
- Decrease of precision with increasing pf

Operational prototype
- http://131.254.213.98:8080/wup/
- Tested on 500 users @ TrentoRise last year
- Try it

Take away message
- Personalization is needed
- Decentralization is healthy
- Gossip-based computing is one (the) way to go

For those who are afraid of P2P

Hybrid recommendation engine

HyRec: Taking the best of both worlds
- Online KNN selection
- Restricted candidate set (k)
- No data stored at the client
- HyRec client: JavaScript (widget) running in the browser

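The split of work between server and client can be sketched as follows. This is a hypothetical illustration rather than HyRec's actual interface: the class and function names, the composition of the candidate set (current neighbors, their neighbors, and a few random users), and the use of cosine similarity are all assumptions. Python is used only for consistency with the other sketches; the slides state that the real client is a JavaScript widget.

```python
import math
import random


def cosine(a, b):
    """Cosine similarity for binary profiles stored as item sets."""
    if not a or not b:
        return 0.0
    return len(a & b) / math.sqrt(len(a) * len(b))


class Server:
    """Stores every profile and KNN list; the ranking itself runs on clients."""

    def __init__(self, profiles, k):
        self.profiles = profiles
        self.k = k
        self.knn = {u: [] for u in profiles}

    def candidate_set(self, user, n_random, rng=random):
        """Restricted set of candidate profiles shipped to the client."""
        ids = set(self.knn[user])
        for v in self.knn[user]:
            ids.update(self.knn[v])
        others = [w for w in self.profiles if w != user]
        ids.update(rng.sample(others, min(n_random, len(others))))
        ids.discard(user)
        return {v: self.profiles[v] for v in ids}

    def store(self, user, neighbors):
        """Keep the result server-side so that no data stays at the client."""
        self.knn[user] = list(neighbors)


def client_select_knn(own_profile, candidates, k):
    """Client side: rank the received candidates and return the k closest."""
    ranked = sorted(candidates, key=lambda v: (-cosine(own_profile, candidates[v]), v))
    return ranked[:k]


demo_profiles = {"Alice": {1, 2, 3}, "Bob": {1, 2, 3, 4}, "Carl": {7, 8}, "Dave": {3, 9}}
demo_server = Server(demo_profiles, k=1)
demo_candidates = demo_server.candidate_set("Alice", n_random=3, rng=random.Random(1))
demo_server.store("Alice", client_select_knn(demo_profiles["Alice"], demo_candidates, demo_server.k))
print(demo_server.knn["Alice"])
```

The server keeps a global view and persistent storage, while the similarity computations, which dominate the cost, are spread over the users' browsers.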
View similarity

Recommendation quality

Dataset     Users   Items   Ratings
MovieLens1  943     1700    100,000
MovieLens2  6,040   4000    1,000,000
MovieLens3  69,878  10,000  10,000,000
Digg        59,167  7724    782,807

HyRec versus the client load
- Impact of HyRec
- Impact of the client load
- Negligible disruption of HyRec
- 50% load