TIE-22306 Data-intensive Programming – tunidip/slides/slides1.1.pdf · 2016-09-01 · data storage...
TRANSCRIPT
TIE-22306 Data-intensive Programming
Dr. Timo Aaltonen, Department of Pervasive Computing
Data-Intensive Programming
• Lecturer: Timo Aaltonen – [email protected]
• Assistants
  – Adnan Mushtaq (MSc)
  – Antti Luoto (MSc)
  – Antti Kallonen
Lecturer
• University Lecturer
• Doctoral degree in Software Engineering, TUT, 2005
• Work history
  – Various positions, TUT, 1995–2010
  – Principal Researcher, System Software Engineering, Nokia Research Center, 2010–2012
  – University lecturer, TUT
Working on the Course
• Lectures on Fridays
• Weekly exercises
  – beginning from week #2
• Course work
  – announced next Friday
• Communication
  – http://www.cs.tut.fi/~dip/
• Exam
Weekly Exercises
• Linux class TC217
• At the beginning of the course: hands-on training
• At the end of the course: reception for problems with the course work
• Enrolment is open
• Not compulsory, no credit points
• Two more instances will be added
Course Work
• Using Hadoop tools and frameworks to solve a typical Big Data problem (in Java)
• Groups of three
• Hardware
  – Your own laptop with self-installed Hadoop
  – Your own laptop with VirtualBox 5.1 and an Ubuntu VM
  – A TUT virtual machine
Exam
• Electronic exam after the course
• Tests understanding rather than exact syntax
  – "Use pseudocode to write a MapReduce program which…"
• General questions on Hadoop and related technologies
Today
• Big data
• Data Science
• Hadoop
• HDFS
• Apache Flume
1: Big Data
• The world is drowning in data
  – clickstream data is collected by web servers
  – NYSE generates 1 TB of trade data every day
  – MTC collects 5,000 attributes for each call
  – smart marketers collect purchasing habits
• "More data usually beats better algorithms"
Three Vs of Big Data
• Volume: amount of data
  – transaction data stored through the years, unstructured data streaming in from social media, increasing amounts of sensor and machine-to-machine data
• Velocity: speed of data in and out
  – streaming data from RFID, sensors, …
• Variety: range of data types and sources
  – structured, unstructured
Big Data
• Variability
  – data flows can be highly inconsistent, with periodic peaks
• Complexity
  – data comes from multiple sources
  – linking, matching, cleansing and transforming data across systems is a complex task
Data Science
• Definition: data science is an activity that extracts insights from messy data
• Facebook analyzes location data
  – to identify global migration patterns
  – to find out the fan bases of different sports teams
• A retailer might track purchases both online and in-store for targeted marketing
Data Science
New Challenges
• Compute-intensiveness
  – raw computing power
• Challenges of data-intensiveness
  – amount of data
  – complexity of data
  – speed at which data is changing
Data Storage and Analysis
• Hard drive from 1990
  – stores 1,370 MB
  – speed 4.4 MB/s
• Hard drive from the 2010s
  – stores 1 TB
  – speed 100 MB/s
Scalability
• Grows without requiring developers to re-architect their algorithms/applications
• Horizontal scaling
• Vertical scaling
Parallel Approach
• Reading from multiple disks in parallel
  – 100 drives each holding 1/100 of the data => 1/100 reading time
• Problem: hardware failures
  – replication
• Problem: most analysis tasks need to be able to combine data in some way
  – MapReduce
• Hadoop
2: Apache Hadoop
• Hadoop is a framework of tools
  – libraries and methodologies
• Operates on large unstructured data sets
• Open source (Apache License)
• Simple programming model
• Scalable
Hadoop
• A scalable, fault-tolerant distributed system for data storage and processing (open source under the Apache license)
• Core Hadoop has two main systems:
  – Hadoop Distributed File System: self-healing, high-bandwidth clustered storage
  – MapReduce: distributed, fault-tolerant resource management and scheduling coupled with a scalable data programming abstraction
Hadoop
• Administrators
  – installation
  – monitor/manage systems
  – tune systems
• End users
  – design MapReduce applications
  – import and export data
  – work with various Hadoop tools
Hadoop
• Developed by Doug Cutting and Michael J. Cafarella
• Based on Google MapReduce technology
• Designed to handle large amounts of data and be robust
• Donated to the Apache Foundation in 2006 by Yahoo
Hadoop Design Principles
• Moving computation is cheaper than moving data
• Hardware will fail
• Hide execution details from the user
• Use streaming data access
• Use a simple file system coherency model
• Hadoop is not a replacement for SQL, and is not always fast and efficient at quick ad-hoc querying
Hadoop MapReduce
• MapReduce (MR) is the original programming model for Hadoop
• Colocate data with the compute node
  – data access is fast since it is local (data locality)
• Network bandwidth is the most precious resource in the data center
  – MR implementations explicitly model the network topology
Hadoop MapReduce
• MR operates at a high level of abstraction
  – the programmer thinks in terms of functions of key and value pairs
• MR is a shared-nothing architecture
  – tasks do not depend on each other
  – failed tasks can be rescheduled by the system
• MR was introduced by Google
  – used for producing search indexes
  – applicable to many other problems too
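The key-and-value style of MR programming can be sketched in plain Java (the course language) with a word count that simulates the map and reduce phases in a single process. This illustrates the model only, not the actual Hadoop API; class and method names are illustrative:

```java
import java.util.*;

public class WordCountModel {
    // Map phase: emit a (word, 1) pair for every word in a line.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String w : line.toLowerCase().split("\\s+")) {
            if (!w.isEmpty()) out.add(new AbstractMap.SimpleEntry<>(w, 1));
        }
        return out;
    }

    // Shuffle + reduce phase: group the pairs by key and sum the values.
    static Map<String, Integer> reduce(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs)
            counts.merge(p.getKey(), p.getValue(), Integer::sum);
        return counts;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : new String[] {"big data", "big fast data"})
            pairs.addAll(map(line));
        System.out.println(reduce(pairs)); // {big=2, data=2, fast=1}
    }
}
```

In real Hadoop the map and reduce calls run on different machines and the framework performs the shuffle, which is why the two functions must not share any state (the shared-nothing property above).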
Hadoop Components
• Hadoop Common
  – a set of components and interfaces for distributed file systems and general I/O
• Hadoop Distributed Filesystem (HDFS)
• Hadoop YARN
  – a resource-management platform, scheduling
• Hadoop MapReduce
  – distributed programming model and execution environment
Hadoop Stack Transition
Hadoop Ecosystem
• HBase – a scalable data warehouse with support for large tables
• Hive – a data warehouse infrastructure that provides data summarization and ad hoc querying
• Pig – a high-level data-flow language and execution framework for parallel computation
• Spark – a fast and general compute engine for Hadoop data; wide range of applications
  – ETL, machine learning, stream processing, and graph analytics
Flexibility: Complex Data Processing
1. Java MapReduce: most flexibility and performance, but tedious development cycle (the assembly language of Hadoop).
2. Streaming MapReduce (aka Pipes): allows you to develop in any programming language of your choice, but slightly lower performance and less flexibility than native Java MapReduce.
3. Crunch: a library for multi-stage MapReduce pipelines in Java (modeled after Google's FlumeJava).
4. Pig Latin: a high-level language out of Yahoo, suitable for batch data-flow workloads.
5. Hive: a SQL interpreter out of Facebook, which also includes a meta-store mapping files to their schemas and associated SerDes.
6. Oozie: a workflow engine that enables creating a workflow of jobs composed of any of the above.
3: Hadoop Distributed File System
• Hadoop comes with a distributed file system called HDFS (Hadoop Distributed File System)
• Based on Google's GFS (Google File System)
• HDFS provides redundant storage for massive amounts of data
  – using commodity hardware
• Data in HDFS is distributed across all data nodes
  – efficient for MapReduce processing
HDFS Design
• File system on commodity hardware
  – survives even high failure rates of the components
• Supports lots of large files
  – file sizes of hundreds of GB or several TB
• Main design principles
  – write once, read many times
  – streaming reads rather than frequent random access
  – high throughput is more important than low latency
HDFS Architecture
• HDFS operates on top of an existing file system
• Files are stored as blocks (default size 128 MB, different from file system blocks)
• File reliability is based on block-based replication
  – each block of a file is typically replicated across several DataNodes (default replication is 3)
• NameNode stores metadata, manages replication and provides access to files
• No data caching (because of large data sets), but direct reading/streaming from DataNode to client
HDFS Architecture
• NameNode stores HDFS metadata
  – file names, locations of blocks, file attributes
  – metadata is kept in RAM for fast lookups
• The number of files in HDFS is limited by the amount of available RAM in the NameNode
  – HDFS NameNode federation can help with RAM issues: several NameNodes, each of which manages a portion of the file system namespace
HDFS Architecture
• DataNode stores file contents as blocks
  – different blocks of the same file are stored on different DataNodes
  – the same block is typically replicated across several DataNodes for redundancy
  – periodically sends a report of all existing blocks to the NameNode
  – DataNodes exchange heartbeats with the NameNode
HDFS Architecture
• Built-in protection against DataNode failure
• If the NameNode does not receive any heartbeat from a DataNode within a certain time period, the DataNode is assumed to be lost
• In case of a failing DataNode, block replication is actively maintained
  – the NameNode determines which blocks were on the lost DataNode
  – the NameNode finds other copies of these lost blocks and replicates them to other nodes
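The heartbeat-timeout decision can be sketched as follows. The class, node names and timeout value are illustrative only, not Hadoop's actual implementation:

```java
import java.util.*;

public class HeartbeatMonitor {
    // Last heartbeat time (ms) per DataNode name.
    private final Map<String, Long> lastHeartbeat = new HashMap<>();
    private final long timeoutMs;

    HeartbeatMonitor(long timeoutMs) { this.timeoutMs = timeoutMs; }

    void heartbeat(String dataNode, long nowMs) {
        lastHeartbeat.put(dataNode, nowMs);
    }

    // DataNodes whose last heartbeat is older than the timeout are assumed
    // lost; the NameNode would then re-replicate the blocks they held.
    List<String> lostNodes(long nowMs) {
        List<String> lost = new ArrayList<>();
        for (Map.Entry<String, Long> e : lastHeartbeat.entrySet())
            if (nowMs - e.getValue() > timeoutMs) lost.add(e.getKey());
        Collections.sort(lost);
        return lost;
    }

    public static void main(String[] args) {
        HeartbeatMonitor m = new HeartbeatMonitor(10_000);
        m.heartbeat("dn1", 0);
        m.heartbeat("dn2", 8_000);
        System.out.println(m.lostNodes(15_000)); // [dn1]
    }
}
```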
HDFS
• HDFS Federation
  – multiple NameNode servers
  – multiple namespaces
• High Availability
  – redundant NameNodes
• Heterogeneous Storage and Archival Storage
  – ARCHIVE, DISK, SSD, RAM_DISK
High-Availability (HA) Issues: NameNode Failure
• NameNode failure corresponds to losing all files on a file system
  % sudo rm --dont-do-this /
• For recovery, Hadoop provides two options
  – backup files that make up the persistent state of the file system
  – secondary NameNode
• Also some more advanced techniques exist
HA Issues: the Secondary NameNode
• The secondary NameNode is not a mirrored NameNode
• It runs memory-intensive administrative functions
  – the NameNode keeps metadata in memory and writes changes to an edit log
  – the secondary NameNode periodically combines the previous namespace image and the edit log into a new namespace image, preventing the log from becoming too large
• Keeps a copy of the merged namespace image, which can be used in the event of NameNode failure
Network Topology
• HDFS is aware of how close two nodes are in the network
• From closer to further:
  – 0: processes on the same node
  – 2: different nodes in the same rack
  – 4: nodes in different racks in the same data center
  – 6: nodes in different data centers
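The distance scale above can be expressed as a small function. A sketch, assuming nodes are identified by hypothetical data-center, rack and node names:

```java
public class NetworkDistance {
    // Distance per the scale above: 0 same node, 2 same rack,
    // 4 same data center (different racks), 6 different data centers.
    static int distance(String dc1, String rack1, String node1,
                        String dc2, String rack2, String node2) {
        if (!dc1.equals(dc2)) return 6;   // different data centers
        if (!rack1.equals(rack2)) return 4; // different racks, same DC
        if (!node1.equals(node2)) return 2; // different nodes, same rack
        return 0;                          // processes on the same node
    }

    public static void main(String[] args) {
        System.out.println(distance("d1", "r1", "n1", "d1", "r1", "n1")); // 0
        System.out.println(distance("d1", "r1", "n1", "d1", "r2", "n2")); // 4
    }
}
```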
Network Topology
File Block Placement
• Clients always read from the closest node
• Default placement strategy
  – first replica in the same local node as the client
  – second replica in a different rack
  – third replica in a different, randomly selected, node in the same rack as the second replica
• Additional (3+) replicas are random
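A rough sketch of the default placement strategy, with the random choices simplified to deterministic iteration over a hypothetical node-to-rack map (it assumes the cluster spans at least two racks):

```java
import java.util.*;

public class ReplicaPlacement {
    // Pick three target nodes for a block following the default strategy:
    // 1st on the writing client's node, 2nd on a node in a different rack,
    // 3rd on another node in the 2nd replica's rack.
    static List<String> place(String clientNode, Map<String, String> rackOf) {
        List<String> replicas = new ArrayList<>();
        replicas.add(clientNode);                     // 1st: local node
        String clientRack = rackOf.get(clientNode);
        String second = null;
        for (String n : rackOf.keySet())              // 2nd: off-rack node
            if (!rackOf.get(n).equals(clientRack)) { second = n; break; }
        replicas.add(second);
        for (String n : rackOf.keySet())              // 3rd: same rack as 2nd
            if (rackOf.get(n).equals(rackOf.get(second)) && !n.equals(second)) {
                replicas.add(n);
                break;
            }
        return replicas;
    }

    public static void main(String[] args) {
        Map<String, String> rackOf = new LinkedHashMap<>();
        rackOf.put("n1", "rackA"); rackOf.put("n2", "rackA");
        rackOf.put("n3", "rackB"); rackOf.put("n4", "rackB");
        System.out.println(place("n1", rackOf)); // [n1, n3, n4]
    }
}
```

This layout tolerates the loss of a whole rack (two replicas survive in the other rack) while keeping two of the three replicas on the same rack to save cross-rack bandwidth.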
Balancing
• Hadoop works best when blocks are evenly spread out
• Support for DataNodes of different sizes
  – in the optimal case the disk usage percentage is at approximately the same level in all DataNodes
• Hadoop provides a balancer daemon
  – re-distributes blocks
  – should be run when new DataNodes are added
Running Hadoop
• Three configurations
  – standalone
  – pseudo-distributed
  – fully-distributed
  – https://hadoop.apache.org/docs/r2.7.2/hadoop-project-dist/hadoop-common/SingleCluster.html
Configuring HDFS
• The variable HADOOP_CONF_DIR defines the directory for the Hadoop configuration files
• core-site.xml
  <configuration>
    <property>
      <name>fs.defaultFS</name>
      <value>hdfs://localhost:9001</value>
    </property>
  </configuration>
• hdfs-site.xml
  <configuration>
    <property>
      <name>dfs.replication</name>
      <value>1</value>
    </property>
    <property>
      <name>dfs.namenode.name.dir</name>
      <value>file:///home/NN/hadoop/namenode</value>
    </property>
    <property>
      <name>dfs.datanode.data.dir</name>
      <value>file:///home/NN/hadoop/datanode</value>
    </property>
  </configuration>
Accessing Data
• Data can be accessed using various methods
  – Java API
  – C API
  – command line / POSIX (FUSE mount)
  – command line / HDFS client: demo
  – HTTP
  – various tools
HDFS URI
• All HDFS (CLI) commands take path URIs as arguments
• URI example
  – hdfs://localhost:9000/user/hduser/log-data/file1.log
• The scheme and authority are optional
  – /user/hduser/log-data/file1.log
• Home directory
  – log-data/file1.log
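The resolution rules above can be sketched roughly like this, assuming a default file system of hdfs://localhost:9000. This is a hypothetical helper for illustration, not the Hadoop Path API:

```java
public class HdfsPath {
    // Resolve a path the way the slide describes: full URIs are kept,
    // absolute paths get the default scheme and authority, and relative
    // paths are taken against the user's home directory /user/<username>.
    static String resolve(String path, String user) {
        if (path.startsWith("hdfs://"))
            return path;                                   // full URI
        if (path.startsWith("/"))
            return "hdfs://localhost:9000" + path;         // absolute path
        return "hdfs://localhost:9000/user/" + user + "/" + path; // relative
    }

    public static void main(String[] args) {
        System.out.println(resolve("log-data/file1.log", "hduser"));
        // hdfs://localhost:9000/user/hduser/log-data/file1.log
    }
}
```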
RDBMS vs HDFS
• Schema-on-Write (RDBMS)
  – schema must be created before any data can be loaded
  – an explicit load operation transforms data to the DB-internal structure
  – new columns must be added explicitly before new data for such columns can be loaded into the DB
• Schema-on-Read (HDFS)
  – data is simply copied to the file store; no transformation is needed
  – a SerDe (Serializer/Deserializer) is applied during read time to extract the required columns (late binding)
  – new data can start flowing any time and will appear retroactively once the SerDe is updated to parse it
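Late binding can be illustrated with a toy deserializer: the raw lines stay untouched in storage, and only the requested column is extracted at read time. A sketch only, not an actual Hive/Hadoop SerDe:

```java
import java.util.*;

public class LateBinding {
    // Raw CSV lines are stored as-is; a (hypothetical) deserializer
    // extracts just the requested column when the data is read.
    static List<String> readColumn(List<String> rawCsvLines, int column) {
        List<String> values = new ArrayList<>();
        for (String line : rawCsvLines)
            values.add(line.split(",")[column]);
        return values;
    }

    public static void main(String[] args) {
        List<String> raw = Arrays.asList("2016-09-01,login,alice",
                                         "2016-09-01,logout,bob");
        System.out.println(readColumn(raw, 2)); // [alice, bob]
    }
}
```

Because parsing happens at read time, changing how the lines are interpreted (e.g. adding a column) requires no reload of the stored data, unlike the schema-on-write model.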
Conclusions
• Pros
  – support for very large files
  – designed for streaming data
  – commodity hardware
• Cons
  – not designed for low-latency data access
  – architecture does not support lots of small files
  – no support for multiple writers / arbitrary file modifications (writes always at the end of the file)
Reading data
Flume
4: Data Modeling
• HDFS is a schema-on-read system
  – allows storing all of your raw data
• Still, the following must be considered
  – data storage formats
  – multitenancy
  – schema design
  – metadata management
Data Storage Options
• No standard data storage format
  – Hadoop allows storing data in any format
• Major considerations for data storage include
  – file format (e.g. plain text, SequenceFile, or more complex but more functionally rich options, such as Avro and Parquet)
  – compression (splittability)
  – data storage system (HDFS, HBase, Hive, Impala)
File Formats: Text File
• Common use case: web logs and server logs
  – come in many formats
• Organization of the files in the file system
• Text files consume space -> compression
• Overhead for conversion ('123' -> 123)
• Structured text data
  – XML and JSON present challenges to Hadoop
    • hard to split
  – dedicated libraries exist
File Formats: Binary Data
• Hadoop can be used to process binary files
  – e.g. images
• A container format is preferred
  – e.g. SequenceFile
• If the splittable unit of binary data is larger than 64 MB, you may consider putting the data in its own file, without using a container format
Hadoop File Types
• Hadoop-specific file formats are specifically created to work well with MapReduce
  – file-based data structures such as sequence files,
  – serialization formats like Avro, and
  – columnar formats such as RCFile and Parquet
• Splittable compression
  – these formats support common compression formats and are also splittable
• Agnostic compression
  – the codec is stored in the header metadata of the file format -> the file can be compressed with any compression codec, without readers having to know the codec
File-Based Data Structures
• The SequenceFile format is one of the most commonly used file-based formats in Hadoop
  – other formats: MapFiles, SetFiles, ArrayFiles, BloomMapFiles, …
  – stores data as binary key-value pairs
  – three formats available for records: uncompressed, record-compressed, block-compressed
SequenceFile
• Header metadata
  – compression codec, key and value class names, user-defined metadata, randomly generated sync marker
• Often used as a container for smaller files
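The container idea — many small records packed as binary key-value pairs in one file — can be sketched with plain Java streams. This is a simplified illustration; the real SequenceFile format also carries the header metadata, sync markers and compression options listed above:

```java
import java.io.*;
import java.util.*;

public class TinyContainer {
    // Pack many small key-value records into one binary stream.
    static byte[] write(Map<String, byte[]> records) {
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(buf);
            for (Map.Entry<String, byte[]> e : records.entrySet()) {
                out.writeUTF(e.getKey());          // key, e.g. a small file's name
                out.writeInt(e.getValue().length); // value length prefix
                out.write(e.getValue());           // value bytes
            }
            return buf.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    static Map<String, byte[]> read(byte[] data) {
        try {
            DataInputStream in = new DataInputStream(new ByteArrayInputStream(data));
            Map<String, byte[]> records = new LinkedHashMap<>();
            while (in.available() > 0) {
                String key = in.readUTF();
                byte[] value = new byte[in.readInt()];
                in.readFully(value);
                records.put(key, value);
            }
            return records;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Map<String, byte[]> recs = new LinkedHashMap<>();
        recs.put("img1.png", new byte[] {1, 2, 3});
        recs.put("img2.png", new byte[] {4, 5});
        System.out.println(read(write(recs)).keySet()); // [img1.png, img2.png]
    }
}
```

Packing small files like this avoids the NameNode memory cost of tracking each small file as a separate HDFS object.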
Compression
• Also for speeding up MapReduce
  – not only for reducing storage requirements
• Compression must be splittable
  – the MapReduce framework splits data for input to multiple tasks
HDFS Schema Design
• Hadoop is often a data hub for the entire organization
  – data is shared by many departments and teams
• A carefully structured and organized repository has several benefits
  – a standard directory structure makes it easier to share data between teams
  – allows for enforcing access rights and quotas
  – conventions regarding e.g. staging data lead to fewer errors
  – code reuse
  – Hadoop tools make assumptions about data placement
Recommended Locations of Files
• /user/<username>
  – data, JARs, and config files of a specific user
• /etl
  – data in all phases of an ETL workflow
  – /etl/<group>/<application>/<process>/{input,processing,output,bad}
• /tmp
  – temporary data
Recommended Locations of Files
• /data
  – data sets shared across the organization
  – data is written by automated ETL processes
  – read-only for users
  – subdirectories for each data set
• /app
  – JARs, Oozie workflow definitions, Hive HQL files, …
  – /app/<group>/<application>/<version>/<artifact directory>/<artifact>
Recommended Locations of Files
• /metadata
  – the metadata required by some tools
Partitioning
• HDFS has no indexes
  – pro: fast to ingest data
  – con: might lead to a full table scan (FTS), even when only a portion of the data is needed
• Solution: break the data set into smaller subsets (partitions)
  – an HDFS subdirectory for each partition
  – allows queries to read only the specific partitions
Partitioning: Example
• Assume data sets for all orders for various pharmacies
• Without partitioning, checking the order history for just one physician over the past three months leads to a full table scan
• medication_orders/date=20160824/{order1.csv, order2.csv}
  – only ~90 directories must be scanned
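Partition pruning over date=YYYYMMDD directories can be sketched as follows; the directory names follow the example above and the helper is hypothetical:

```java
import java.util.*;

public class PartitionPruning {
    // With one directory per day (date=YYYYMMDD), a three-month query only
    // touches the matching partitions (~90 dirs) instead of the whole data set.
    static List<String> partitionsToScan(List<String> allDirs,
                                         String fromDate, String toDate) {
        List<String> selected = new ArrayList<>();
        for (String dir : allDirs) {
            String date = dir.substring(dir.indexOf("date=") + 5);
            // YYYYMMDD strings compare correctly as plain strings.
            if (date.compareTo(fromDate) >= 0 && date.compareTo(toDate) <= 0)
                selected.add(dir);
        }
        return selected;
    }

    public static void main(String[] args) {
        List<String> dirs = Arrays.asList(
            "medication_orders/date=20160523",
            "medication_orders/date=20160824",
            "medication_orders/date=20161001");
        System.out.println(partitionsToScan(dirs, "20160601", "20160831"));
        // [medication_orders/date=20160824]
    }
}
```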
5: Data Movement
• File system client for simple usage
• Common data sources for Hadoop include
  – traditional data management systems such as relational databases and mainframes
  – logs, machine-generated data, and other forms of event data
  – files being imported from existing enterprise data storage systems
Data Movement: Considerations
• Timeliness of data ingestion and accessibility
  – What are the requirements around how often data needs to be ingested? How soon does data need to be available to downstream processing?
• Incremental updates
  – How will new data be added? Does it need to be appended to existing data? Or overwrite existing data?
Data Movement: Considerations
• Data access and processing
  – Will the data be used in processing? If so, will it be used in batch processing jobs? Or is random access to the data required?
• Source system and data structure
  – Where is the data coming from? A relational database? Logs? Is it structured, semistructured, or unstructured data?
Data Movement: Considerations
• Partitioning and splitting of data
  – How should data be partitioned after ingest? Does the data need to be ingested into multiple target systems (e.g., HDFS and HBase)?
• Storage format
  – What format will the data be stored in?
• Data transformation
  – Does the data need to be transformed in flight?
Timeliness of Data Ingestion
• Time lag from when data is available for ingestion to when it's accessible in Hadoop
• Classifications of ingestion requirements:
• Macrobatch
  – anything over 15 minutes to hours, or even a daily job
• Microbatch
  – fired off every 2 minutes or so, but no more than 15 minutes in total
• Near-Real-Time Decision Support
  – "immediately actionable" by the recipient of the information
  – delivered in less than 2 minutes but greater than 2 seconds
• Near-Real-Time Event Processing
  – under 2 seconds, and can be as fast as a 100-millisecond range
• Real Time
  – anything under 100 milliseconds
Incremental Updates
• Data is either appended to an existing data set or modified
  – HDFS works fine for append-only implementations
• The downside to HDFS is the inability to do updates or random writes to files after they're created
• HDFS is optimized for large files
  – if the requirements call for a two-minute append process that ends up producing lots of small files, then a periodic process to combine smaller files will be required to get the benefits from larger files
Original Source System and Data Structure
• Original file type
  – any format: delimited, XML, JSON, Avro, fixed length, variable length, copybooks, …
• Hadoop can accept any file format
  – not all formats are optimal for particular use cases
  – not all file formats can work with all tools in the Hadoop ecosystem; example: variable-length files
Compression
• Pro
  – transferring a compressed file over the network requires less I/O and network bandwidth
• Con
  – most compression codecs applied outside of Hadoop are not splittable (e.g., Gzip)
Misc
• RDBMS
  – tool: Sqoop
• Streaming data
  – Twitter feeds, a Java Message Service (JMS) queue, events firing from a web application server
  – tools: Flume or Kafka
• Log files
  – an anti-pattern is to read the log files from disk as they are written, because this is almost impossible to implement without losing data
  – the correct way of ingesting log files is to stream the logs directly to a tool like Flume or Kafka, which will write directly to Hadoop instead
Transformations
• Modifications on incoming data, distributing the data into partitions or buckets, sending the data to more than one store or location
  – transformation: XML or JSON is converted to delimited data
  – partitioning: incoming data is stock trade data and partitioning by ticker is required
  – splitting: the data needs to land in HDFS and HBase for different access patterns
Data Ingestion Options
• File transfers
• Tools like Flume, Sqoop, and Kafka