apéro rubybdx - mongodb - 8-11-2011
Pierre-Louis Gottfrois, Bastien Murzeau
Apéro Ruby Bordeaux, 8 November 2011
• Brief introduction
• Case study
• Map / Reduce
What is mongoDB?
mongoDB is a NoSQL database that is schemaless and document-oriented
Schemaless
• Very useful in 'agile' development (iterations, fast changes, flexibility for developers)
• Supports features that, in relational databases, would be:
  • nearly impossible (storing open-ended sets of elements, e.g. tags)
  • too complex for what they are (migrations)
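To make the "tags without a migration" point concrete, here is a minimal sketch in plain JavaScript. The two documents and their field names are hypothetical and only illustrate that documents in one collection need not share the same fields.

```javascript
// Two hypothetical documents from the same collection (field names are illustrative only).
const posts = [
  { _id: 1, title: "Hello" },
  { _id: 2, title: "Map/Reduce", tags: ["mongodb", "ruby"] }
];

// No migration is needed to start storing tags:
// readers simply handle the documents where the field is missing.
const tagCounts = posts.map(function (post) {
  return (post.tags || []).length;
});

console.log(tagCounts); // [ 0, 2 ]
```

The trade-off is that the application, not the database, becomes responsible for coping with missing fields.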
Document-oriented
• mongoDB stores documents, not rows
• documents are stored as BSON (binary JSON)
• the query syntax is as rich as SQL's
• the 'embedded' documents mechanism solves many of the problems commonly encountered
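As an illustration of embedded documents, the sketch below shows one possible shape for a conversation that carries its messages inside itself. The field names are assumptions chosen for the example, not a schema taken from the talk.

```javascript
// Hypothetical shape of a conversation document with embedded messages.
const conversation = {
  _id: 1,
  public: false,
  messages: [
    { author: "alice", body: "Hi" },
    { author: "bob", body: "Hello" }
  ]
};

// The messages travel with their parent: a single read returns everything, no join.
const authors = conversation.messages.map(function (message) {
  return message.author;
});

console.log(authors.join(", ")); // alice, bob
```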
Document-oriented
• Documents are stored in a collection; in Ruby on Rails, a collection maps to a model
• part of this data is indexed to optimize performance
• a document is not a trash can!
Storing large volumes of data
• mongoDB (and other NoSQL databases) performs better when scaling horizontally
• servers are added to increase storage capacity ("sharding")
• which also provides better availability
• optimized load balancing between nodes
• growth is transparent to the application
Case study
• The ORM becomes an ODM; the reference gem is mongoid
• alternatives: MongoMapper, DataMapper
• Creating an application backed by MongoDB
• rails new nosql
• edit the Gemfile
• gem 'mongoid'
• gem 'bson_ext'
• bundle install
• rails generate mongoid:config
Case study
• edit config/application.rb: comment out rails/all and require the railties individually, leaving out Active Record
• #require 'rails/all'
• require "action_controller/railtie"
• require "action_mailer/railtie"
• require "active_resource/railtie"
• require "rails/test_unit/railtie"
Case study
class Conversation
  include Mongoid::Document
  include Mongoid::Timestamps
  field :public, :type => Boolean, :default => false
  has_many :scores, :as => :scorable, :dependent => :delete
  has_and_belongs_to_many :subjects
  belongs_to :timeline
  embeds_many :messages
end
class Subject
  include Mongoid::Document
  include Mongoid::Timestamps
  has_many :scores, :as => :scorable, :dependent => :delete, :autosave => true
  has_many :requests, :dependent => :delete
  belongs_to :author, :class_name => 'User'
end
Map Reduce
Example
{ "id": 1, "day": 20111017, "checkout": 100 }
{ "id": 2, "day": 20111017, "checkout": 42 }
{ "id": 3, "day": 20111017, "checkout": 215 }
{ "id": 4, "day": 20111017, "checkout": 73 }
A "tickets" collection
Problem
• We want to
  • compute the sum of the 'checkout' values of all objects in our tickets collection
  • be able to distribute this operation over the network
  • be fast!
• We don't want to
  • go over all objects again when an update is made
Map: emit(checkout)
{ "id": 1, "day": 20111017, "checkout": 100 }
{ "id": 2, "day": 20111017, "checkout": 42 }
{ "id": 3, "day": 20111017, "checkout": 215 }
{ "id": 4, "day": 20111017, "checkout": 73 }
100 42 215 73
The 'map' function emits (selects) the checkout value of every object in our collection.
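The map step can be tried outside MongoDB with a minimal sketch like the following. Note that `emit` here is a local stand-in written for this example; inside MongoDB, `emit` is provided by the server and `this` is the current document.

```javascript
const tickets = [
  { _id: 1, day: 20111017, checkout: 100 },
  { _id: 2, day: 20111017, checkout: 42 },
  { _id: 3, day: 20111017, checkout: 215 },
  { _id: 4, day: 20111017, checkout: 73 }
];

// Local stand-in for the emit() that MongoDB provides inside a map function.
const emitted = [];
function emit(key, value) {
  emitted.push({ key: key, value: value });
}

// Same shape as a mongo shell map function: `this` is the current document.
function map() {
  emit(null, this.checkout);
}

// Run map once per document, binding `this` to that document.
tickets.forEach(function (ticket) {
  map.call(ticket);
});

console.log(emitted.map(function (e) { return e.value; })); // [ 100, 42, 215, 73 ]
```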
Reduce: sum(checkout)
{ "id": 1, "day": 20111017, "checkout": 100 }
{ "id": 2, "day": 20111017, "checkout": 42 }
{ "id": 3, "day": 20111017, "checkout": 215 }
{ "id": 4, "day": 20111017, "checkout": 73 }
100 42 215 73
100 + 42 = 142, 215 + 73 = 288
142 + 288 = 430
Reduce function
The 'reduce' function applies the aggregation logic to the values received from the 'map' function for each key.
This function has to be commutative and 'idempotent' so that it can be called recursively or in a distributed system:
reduce(k, [A, B]) == reduce(k, [B, A])
reduce(k, [A, B]) == reduce(k, [reduce(k, [A, B])])
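These properties can be checked by hand for a summing reducer. The sketch below is plain JavaScript with a `reduce` written for this example; it also checks associativity, which is an addition here but follows the same reasoning.

```javascript
// A summing reducer with the same signature as a MongoDB reduce function.
function reduce(key, values) {
  var sum = 0;
  for (var i = 0; i < values.length; i++) sum += values[i];
  return sum;
}

const A = 100, B = 42, C = 215;

// Order of the values must not matter.
const commutative = reduce(null, [A, B]) === reduce(null, [B, A]);

// Reducing an already reduced value must not change it.
const idempotent =
  reduce(null, [A, B]) === reduce(null, [reduce(null, [A, B])]);

// A partial result can be mixed with raw values.
const associative =
  reduce(null, [C, reduce(null, [A, B])]) === reduce(null, [C, A, B]);

console.log(commutative, idempotent, associative); // true true true
```

A reducer that returned, say, the number of values it received would fail the second check, because re-reducing a partial result would count it as a single value.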
Inherently Distributed
{ "id": 1, "day": 20111017, "checkout": 100 }
{ "id": 2, "day": 20111017, "checkout": 42 }
{ "id": 3, "day": 20111017, "checkout": 215 }
{ "id": 4, "day": 20111017, "checkout": 73 }
100 42 215 73
100 + 42 = 142, 215 + 73 = 288
142 + 288 = 430
Distributed
Since the 'map' function emits the objects to be reduced and the 'reduce' function processes each
group of emitted objects independently, the work can be distributed across multiple workers.
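A minimal sketch of that distribution, in plain JavaScript: the split into two "workers" is hypothetical and simply mirrors the two halves of the diagram.

```javascript
function reduce(key, values) {
  var sum = 0;
  for (var i = 0; i < values.length; i++) sum += values[i];
  return sum;
}

// Values emitted by map for the four tickets.
const emittedValues = [100, 42, 215, 73];

// Hypothetical split: each worker reduces its own half independently.
const worker1 = reduce(null, emittedValues.slice(0, 2)); // 142
const worker2 = reduce(null, emittedValues.slice(2));    // 288

// A final reduce combines the partial results.
const total = reduce(null, [worker1, worker2]);

console.log(worker1, worker2, total); // 142 288 430
```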
Logarithmic Update
For the same reason, when an object is updated, we don't have to reprocess every object.
We can call the 'map' function on the updated objects only, then re-reduce just the partial results that depend on them.
Logarithmic Update
Object 3 is updated: its checkout goes from 215 to 210. The emitted value and the sums are not refreshed yet.
{ "id": 1, "day": 20111017, "checkout": 100 }
{ "id": 2, "day": 20111017, "checkout": 42 }
{ "id": 3, "day": 20111017, "checkout": 210 }
{ "id": 4, "day": 20111017, "checkout": 73 }
100 42 215 73
142 288
430
Logarithmic Update
'map' is called again on object 3 only: the emitted value becomes 210.
{ "id": 1, "day": 20111017, "checkout": 100 }
{ "id": 2, "day": 20111017, "checkout": 42 }
{ "id": 3, "day": 20111017, "checkout": 210 }
{ "id": 4, "day": 20111017, "checkout": 73 }
100 42 210 73
142 288
430
Logarithmic Update
Only the partial sum that depends on object 3 is reduced again: 210 + 73 = 283.
{ "id": 1, "day": 20111017, "checkout": 100 }
{ "id": 2, "day": 20111017, "checkout": 42 }
{ "id": 3, "day": 20111017, "checkout": 210 }
{ "id": 4, "day": 20111017, "checkout": 73 }
100 42 210 73
142 283
430
Logarithmic Update
Finally the total is reduced again: 142 + 283 = 425.
{ "id": 1, "day": 20111017, "checkout": 100 }
{ "id": 2, "day": 20111017, "checkout": 42 }
{ "id": 3, "day": 20111017, "checkout": 210 }
{ "id": 4, "day": 20111017, "checkout": 73 }
100 42 210 73
142 283
425
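The update sequence above can be sketched as a small tree of partial sums in plain JavaScript. This is an illustration of the idea only, under the assumption that partial results are kept in pairs; it is not how MongoDB stores or refreshes its results, and the `update` helper is hypothetical.

```javascript
function reduce(key, values) {
  return values.reduce(function (sum, value) { return sum + value; }, 0);
}

// Levels of partial results, from the emitted values up to the total.
const levels = [[100, 42, 215, 73]];
while (levels[levels.length - 1].length > 1) {
  const previous = levels[levels.length - 1];
  const next = [];
  for (let i = 0; i < previous.length; i += 2) {
    next.push(reduce(null, previous.slice(i, i + 2)));
  }
  levels.push(next);
}

// Hypothetical helper: change one emitted value and re-reduce only its ancestors.
// Returns the number of reduce calls that were needed.
function update(index, value) {
  let calls = 0;
  levels[0][index] = value;
  for (let level = 1; level < levels.length; level++) {
    index = Math.floor(index / 2);
    const start = index * 2;
    levels[level][index] = reduce(null, levels[level - 1].slice(start, start + 2));
    calls++;
  }
  return calls;
}

const calls = update(2, 210);
console.log(levels[1], levels[2], calls); // [ 142, 283 ] [ 425 ] 2
```

With four values, two reduce calls are enough; the number of calls grows with the height of the tree rather than with the number of objects, hence the "logarithmic" in the title.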
Let’s do some code!
$> mongo
> db.tickets.save({ "_id": 1, "day": 20111017, "checkout": 100 })
> db.tickets.save({ "_id": 2, "day": 20111017, "checkout": 42 })
> db.tickets.save({ "_id": 3, "day": 20111017, "checkout": 215 })
> db.tickets.save({ "_id": 4, "day": 20111017, "checkout": 73 })
> db.tickets.count()
4
> db.tickets.find()
{ "_id" : 1, "day" : 20111017, "checkout" : 100 }
...
> db.tickets.find({ "_id": 1 })
{ "_id" : 1, "day" : 20111017, "checkout" : 100 }
> var map = function() {
...   emit(null, this.checkout)
... }
> var reduce = function(key, values) {
...   var sum = 0
...   for (var index in values) sum += values[index]
...   return sum
... }
Temporary Collection
> sumOfCheckouts = db.tickets.mapReduce(map, reduce)
{ "result" : "tmp.mr.mapreduce_123456789_4", "timeMillis" : 8, "counts" : { "input" : 4, "emit" : 4, "output" : 1 }, "ok" : 1 }
> db.getCollectionNames()
[ "tickets", "tmp.mr.mapreduce_123456789_4" ]
> db[sumOfCheckouts.result].find()
{ "_id" : null, "value" : 430 }
Persistent Collection
> db.tickets.mapReduce(map, reduce, { "out" : "sumOfCheckouts" })
> db.getCollectionNames()
[ "sumOfCheckouts", "tickets", "tmp.mr.mapreduce_123456789_4" ]
> db.sumOfCheckouts.find()
{ "_id" : null, "value" : 430 }
> db.sumOfCheckouts.findOne().value
430
Reduce by Date
> var map = function() {
...   emit(this.day, this.checkout)
... }
> var reduce = function(key, values) {
...   var sum = 0
...   for (var index in values) sum += values[index]
...   return sum
... }
> db.tickets.mapReduce(map, reduce, { "out" : "sumOfCheckouts" })
> db.sumOfCheckouts.find()
{ "_id" : 20111017, "value" : 430 }
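To see the grouping by key without a running server, here is a minimal in-memory sketch. The `mapReduce` helper is hypothetical and is not the MongoDB API: it passes `emit` to the map function as an argument, whereas MongoDB exposes it as a global. The fifth ticket, on another day, is also an assumption added to show a second group.

```javascript
// Hypothetical in-memory stand-in for db.collection.mapReduce (not the MongoDB API).
function mapReduce(docs, map, reduce) {
  const groups = new Map();
  const emit = function (key, value) {
    if (!groups.has(key)) groups.set(key, []);
    groups.get(key).push(value);
  };
  // Run map once per document, with `this` bound to the document.
  docs.forEach(function (doc) { map.call(doc, emit); });
  // Reduce each group of emitted values to a single result.
  const results = [];
  groups.forEach(function (values, key) {
    results.push({ _id: key, value: reduce(key, values) });
  });
  return results;
}

const tickets = [
  { _id: 1, day: 20111017, checkout: 100 },
  { _id: 2, day: 20111017, checkout: 42 },
  { _id: 3, day: 20111017, checkout: 215 },
  { _id: 4, day: 20111017, checkout: 73 },
  { _id: 5, day: 20111018, checkout: 10 } // hypothetical extra ticket, not in the slides
];

const sums = mapReduce(
  tickets,
  function (emit) { emit(this.day, this.checkout); },
  function (key, values) {
    return values.reduce(function (sum, value) { return sum + value; }, 0);
  }
);

console.log(sums);
// [ { _id: 20111017, value: 430 }, { _id: 20111018, value: 10 } ]
```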
What we can do
Scored Subjects per User
Subject User Score
1 1 2
1 1 2
1 2 2
2 1 2
2 2 10
2 2 5
Scored Subjects per User (reduced)
Subject User Score
1 1 4
1 2 2
2 1 2
2 2 15
$> mongo
> db.scores.save({ "_id": 1, "subject_id": 1, "user_id": 1, "score": 2 })
> db.scores.save({ "_id": 2, "subject_id": 1, "user_id": 1, "score": 2 })
> db.scores.save({ "_id": 3, "subject_id": 1, "user_id": 2, "score": 2 })
> db.scores.save({ "_id": 4, "subject_id": 2, "user_id": 1, "score": 2 })
> db.scores.save({ "_id": 5, "subject_id": 2, "user_id": 2, "score": 10 })
> db.scores.save({ "_id": 6, "subject_id": 2, "user_id": 2, "score": 5 })
> db.scores.count()
6
> db.scores.find()
{ "_id": 1, "subject_id": 1, "user_id": 1, "score": 2 }
...
> db.scores.find({ "_id": 1 })
{ "_id": 1, "subject_id": 1, "user_id": 1, "score": 2 }
> var map = function() {
...   emit([this.user_id, this.subject_id].join("-"),
...        { subject_id: this.subject_id, user_id: this.user_id, score: this.score });
... }
> var reduce = function(key, values) {
...   var result = { user_id: "", subject_id: "", score: 0 };
...   values.forEach(function (value) {
...     result.score += value.score;
...     result.user_id = value.user_id;
...     result.subject_id = value.subject_id;
...   });
...   return result
... }
ReducedScores Collection
> db.scores.mapReduce(map, reduce, { "out" : "reduced_scores" })
> db.getCollectionNames()
[ "reduced_scores", "scores" ]
> db.reduced_scores.find()
{ "_id" : "1-1", "value" : { "user_id" : 1, "subject_id" : 1, "score" : 4 } }
{ "_id" : "1-2", "value" : { "user_id" : 1, "subject_id" : 2, "score" : 2 } }
{ "_id" : "2-1", "value" : { "user_id" : 2, "subject_id" : 1, "score" : 2 } }
{ "_id" : "2-2", "value" : { "user_id" : 2, "subject_id" : 2, "score" : 15 } }
> db.reduced_scores.findOne().value.score
4
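One detail worth noticing in this reducer is that it returns an object with the same shape as the values emitted by map, which is what allows a partial result to be reduced again. A minimal plain JavaScript check, using the two scores of user 2 on subject 2 from the table:

```javascript
// Same logic as the reduce function defined in the shell session.
function reduce(key, values) {
  var result = { user_id: "", subject_id: "", score: 0 };
  values.forEach(function (value) {
    result.score += value.score;
    result.user_id = value.user_id;
    result.subject_id = value.subject_id;
  });
  return result;
}

// Values that map would emit under the key "2-2".
const emittedScores = [
  { subject_id: 2, user_id: 2, score: 10 },
  { subject_id: 2, user_id: 2, score: 5 }
];

// Reduce everything at once...
const direct = reduce("2-2", emittedScores);

// ...or reduce a partial result together with a remaining raw value.
const partial = reduce("2-2", [emittedScores[0]]);
const rereduced = reduce("2-2", [partial, emittedScores[1]]);

console.log(direct.score, rereduced.score); // 15 15
```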
Dealing with Rails Query
ruby-1.9.2-p180 :007 > ReducedScores.first
 => #<ReducedScores _id: 1-1, _type: nil, value: {"user_id"=>BSON::ObjectId('...'), "subject_id"=>BSON::ObjectId('...'), "score"=>4.0}>
ruby-1.9.2-p180 :008 > ReducedScores.where("value.user_id" => u1.id).count
 => 2
ruby-1.9.2-p180 :009 > ReducedScores.where("value.user_id" => u1.id).first.value['score']
 => 4.0
ruby-1.9.2-p180 :010 > ReducedScores.where("value.user_id" => u1.id).last.value['score']
 => 2.0
Questions?