Experience report - data aggregation in Ruby (learnivore.com)
TRANSCRIPT
What does Learnivore look like?
Technical building blocks
• ThinkingSphinx + Sphinx
• TinyTL
• Hpricot
• TagCleaner
• Ramaze
• UrlRewriter
• ActiveRecord
• ActsAsTaggableOnSteroids
• Craken
• WillPaginate
• Thin
class Item < ActiveRecord::Base
  acts_as_taggable

  validates_presence_of :title, :url, :source, :summary, :thumbnail_img_tag, :pricing
  validates_uniqueness_of :url # our key

  define_index do
    indexes title
    indexes summary
    indexes source, :facet => true
    indexes tags.name, :as => :tag, :facet => true
    indexes pricing, :facet => true
    indexes language, :facet => true
    has update_date
    where "update_date <= curdate()"
    set_property :delta => true
  end
end
The model
Import process
• DSL (TinyTL)
• Runs every hour
• For each source, fetches a feed or an HTML page
• Normalizes and conforms each item
• Adds and conforms the tags
• Upsert (update or insert) - key = target URL
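The upsert step above can be sketched in plain Ruby. The in-memory Store class below is purely illustrative (the real import goes through ActiveRecord), but it shows the key idea: the target URL is the unique key, so re-importing an item updates the existing record instead of creating a duplicate.

```ruby
# Illustrative sketch of the hourly upsert, keyed on the item URL.
# The Store class and its API are assumptions for this example only.
class Store
  def initialize
    @items = {} # url => attributes hash
  end

  # Upsert: update the record if the URL is already known, insert otherwise.
  def upsert(row)
    url = row.fetch(:url)
    if @items.key?(url)
      @items[url].merge!(row)   # update existing record in place
    else
      @items[url] = row.dup     # insert new record
    end
    @items[url]
  end

  def size
    @items.size
  end
end
```

Running the import twice on the same source is then harmless: the second pass only refreshes the existing rows.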
Declaring a source
source(:peepcode) {
  fresh_get(PEEPCODE_HOST + '/products').
    at('ul.products').search('li a').
    map { |e| PEEPCODE_HOST + e['href'] }
}
Source-specific processing (1/2)

each_row(:bddcasts) { |row|
  # resource from rss
  row[:title] = grab(row[:content], 'title')
  row[:update_date] = Date.parse(grab(row[:content], 'published'))
  row[:url] = grab(row[:content], 'feedburner:origlink')
  row[:summary] = SimpleSanitizer.sanitize(
    CGI.unescapeHTML(grab(row[:content], 'content')))

  # resource from page
  page = get(row[:url])
  img_src = "http://bddcasts.com" +
    page.at("div.episode div.image img").attributes['src']
  tags = grab_all(page, "div.episode div.details p.tags a", ', ')
  episode_css_classes = page.at("div.episode").attributes['class']
  pricing = case episode_css_classes
            when /orange/; 'paid'
            when /green/;  'free'
            else raise "Bddcasts pricing guess failed: episode css class: '#{episode_css_classes}'"
            end
  row[:thumbnail_img_tag] = "<img src='#{img_src}' height='90' style='border: 1px solid #aaa'/>"
  row[:pricing] = pricing
  row[:tag_list] = tags
}
Source-specific processing (2/2)
each_row(:thinkcode_tv) { |row|
  row[:title] = grab(row[:content], 'title')
  row[:url] = grab(row[:content], 'guid').gsub('//catalogo', '/catalogo')
  row[:update_date] = Date.parse(grab(row[:content], 'pubdate'))
  row[:summary] = SimpleSanitizer.sanitize(CGI.unescapeHTML(grab(row[:content], 'description')))
  img_src = Hpricot(CGI.unescapeHTML(grab(row[:content], 'description'))).
              at('img').attributes['src']
  row[:thumbnail_img_tag] = "<img height='90' src='#{THINKCODE_HOST}#{img_src}'/>"
  row[:pricing] = 'paid'
  row[:language] = 'italian'
}
Importing / cleaning tags
• partner tags + delicious tags
• conformation / cleanup
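A minimal sketch of what that tag conformation might look like: lowercase, strip whitespace, collapse known synonyms onto one canonical tag, drop empties and duplicates. The synonym table and the helper name are assumptions for illustration; the talk's actual cleanup lives in TagCleaner.

```ruby
# Hypothetical tag-cleanup helper; the synonym table is illustrative only.
SYNONYMS = {
  'rails' => 'ruby-on-rails',
  'ror'   => 'ruby-on-rails',
}

def clean_tags(raw_tags)
  raw_tags.map { |t| t.to_s.strip.downcase }.  # normalize case and whitespace
           reject { |t| t.empty? }.            # drop blank tags
           map { |t| SYNONYMS.fetch(t, t) }.   # collapse synonyms
           uniq                                # drop duplicates
end
```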
Screens (= data tests)
assert_include %w(free paid), row[:pricing]
assert_match(/https{0,1}:\/\//, row[:url], :url)
assert_not_nil row[:update_date]
assert row[:update_date].is_a?(Date)
...
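These assert_* helpers are not standard Ruby outside a test framework; a hand-rolled version of the two custom ones used above might look like this (names and failure messages are assumed from context).

```ruby
# Hypothetical reimplementation of the screening helpers used above.

# Fails loudly when value is not among the allowed set.
def assert_include(allowed, value)
  unless allowed.include?(value)
    raise "expected one of #{allowed.inspect}, got #{value.inspect}"
  end
end

# Fails loudly when value does not match pattern; label identifies the field.
def assert_match(pattern, value, label = nil)
  unless value =~ pattern
    raise "#{label}: #{value.inspect} does not match #{pattern.inspect}"
  end
end
```

Because the screens raise on bad rows, a malformed partner item stops the import instead of silently polluting the index.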
Consumption
• via the web => facets + full-text
• RSS => sorted by freshness desc (FeedBurner MyBrand)
• Twitter => semi-automatic via bit.ly
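The RSS-side ordering ("sorted by freshness desc") amounts to a plain sort on the update date before the feed is rendered. The items-as-hashes shape here is an assumption for illustration; in the app this ordering would be done in the ThinkingSphinx/ActiveRecord query.

```ruby
require 'date'

# Illustrative sketch: newest items first, keyed on :update_date.
def freshest_first(items)
  items.sort_by { |i| i[:update_date] }.reverse
end
```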
Conclusions
• Contacting partners = good
• ThinkingSphinx = good
• Code = a small fraction of the work
• Nearly all data needs cleaning before it can be consumed