TIE-22306 Data-intensive Programming — tunidip/slides/slides1.1.pdf · 2016-09-01


TIE-22306 Data-intensive Programming

Dr. Timo Aaltonen, Department of Pervasive Computing

Data-Intensive Programming

• Lecturer: Timo Aaltonen – timo.aaltonen@tut.fi

• Assistants
– Adnan Mushtaq, MSc Antti Luoto, MSc Antti Kallonen

Lecturer

• University Lecturer
• Doctoral degree in Software Engineering, TUT, 2005

• Work history
– Various positions, TUT, 1995–2010
– Principal Researcher, System Software Engineering, Nokia Research Center, 2010–2012

– University lecturer, TUT

Working on the Course

• Lectures on Fridays
• Weekly exercises
– beginning from week #2

• Course work
– announced next Friday

• Communication
– http://www.cs.tut.fi/~dip/

• Exam

Weekly Exercises

• Linux class TC217
• At the beginning of the course: hands-on training

• At the end of the course: reception for problems with the course work

• Enrolment is open
• Not compulsory, no credit points
• Two more instances will be added

Course Work

• Using Hadoop tools and frameworks to solve a typical Big Data problem (in Java)

• Groups of three
• Hardware
– Your own laptop with self-installed Hadoop
– Your own laptop with VirtualBox 5.1 and an Ubuntu VM
– A TUT virtual machine

Exam

• Electronic exam after the course
• Tests understanding rather than exact syntax
• ”Use pseudocode to write a MapReduce program which…”

• General questions on Hadoop and related technologies
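An answer of the kind the exam asks for might look like this word-count sketch in pseudocode (a hypothetical example question, not an actual exam question):

```
map(key, value):
    # value is one line of input text
    for each word w in value.split():
        emit(w, 1)

reduce(key, values):
    # values holds every count emitted for the word `key`
    emit(key, sum(values))
```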

Today

• Big data
• Data Science
• Hadoop
• HDFS
• Apache Flume

1: Big Data

• The world is drowning in data
– clickstream data is collected by web servers
– NYSE generates 1 TB of trade data every day
– MTC collects 5,000 attributes for each call
– smart marketers collect purchasing habits

• “More data usually beats better algorithms”

Three Vs of Big Data

• Volume: amount of data
– transaction data stored through the years, unstructured data streaming in from social media, increasing amounts of sensor and machine-to-machine data

• Velocity: speed of data in and out
– streaming data from RFID, sensors, …

• Variety: range of data types and sources
– structured, unstructured

Big Data

• Variability
– data flows can be highly inconsistent, with periodic peaks

• Complexity
– data comes from multiple sources
– linking, matching, cleansing and transforming data across systems is a complex task

Data Science

• Definition: data science is an activity that extracts insights from messy data

• Facebook analyzes location data
– to identify global migration patterns
– to find out the fan bases of different sports teams

• A retailer might track purchases both online and in-store for targeted marketing

Data Science

New Challenges

• Compute-intensiveness
– raw computing power

• Challenges of data intensiveness
– amount of data
– complexity of data
– speed at which data is changing

Data Storage and Analysis

• Hard drive from 1990
– stores 1,370 MB
– transfer speed 4.4 MB/s

• Hard drive from the 2010s
– stores 1 TB
– transfer speed 100 MB/s

Scalability

• Grows without requiring developers to re-architect their algorithms/applications

• Horizontal scaling
• Vertical scaling

Parallel Approach

• Reading from multiple disks in parallel
– 100 drives, each holding 1/100 of the data => 1/100 of the reading time

• Problem: hardware failures
– replication

• Problem: most analysis tasks need to be able to combine data in some way
– MapReduce

• Hadoop

2: Apache Hadoop

• Hadoop is a framework of tools
– libraries and methodologies

• Operates on large unstructured data sets
• Open source (Apache License)
• Simple programming model
• Scalable

Hadoop

• A scalable, fault-tolerant, distributed system for data storage and processing (open source under the Apache license)

• Core Hadoop has two main systems:
– Hadoop Distributed File System: self-healing, high-bandwidth clustered storage

– MapReduce: distributed, fault-tolerant resource management and scheduling coupled with a scalable data programming abstraction

Hadoop

• Administrators
– install systems
– monitor/manage systems
– tune systems

• End users
– design MapReduce applications
– import and export data
– work with various Hadoop tools

Hadoop

• Developed by Doug Cutting and Michael J. Cafarella

• Based on Google’s MapReduce technology
• Designed to handle large amounts of data and to be robust

• Donated to the Apache Foundation in 2006 by Yahoo

Hadoop Design Principles

• Moving computation is cheaper than moving data
• Hardware will fail
• Hide execution details from the user
• Use streaming data access
• Use a simple file system coherency model

• Hadoop is not a replacement for SQL, is not always fast and efficient, and is not meant for quick ad-hoc querying

Hadoop MapReduce

• MapReduce (MR) is the original programming model for Hadoop

• Collocates data with the compute node
– data access is fast since it is local (data locality)

• Network bandwidth is the most precious resource in a data center
– MR implementations explicitly model the network topology

Hadoop MapReduce

• MR operates at a high level of abstraction
– the programmer thinks in terms of functions over key-value pairs

• MR is a shared-nothing architecture
– tasks do not depend on each other
– failed tasks can be rescheduled by the system

• MR was introduced by Google
– used for producing search indexes
– applicable to many other problems too
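To illustrate the key-value model, word count can be simulated in plain Java as two functions. This is a single-process sketch of the programming model only — real Hadoop MR distributes the map and reduce tasks across the cluster and shuffles the pairs over the network:

```java
import java.util.*;

public class WordCountModel {
    // "Map" phase: emit a (word, 1) pair for every word in every input line.
    static List<Map.Entry<String, Integer>> map(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines)
            for (String word : line.toLowerCase().split("\\s+"))
                if (!word.isEmpty())
                    pairs.add(new AbstractMap.SimpleEntry<>(word, 1));
        return pairs;
    }

    // "Shuffle" + "reduce" phase: group pairs by key and sum the values.
    static Map<String, Integer> reduce(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs)
            counts.merge(p.getKey(), p.getValue(), Integer::sum);
        return counts;
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList("hello hadoop", "hello world");
        System.out.println(reduce(map(input))); // {hadoop=1, hello=2, world=1}
    }
}
```

Because the map calls touch each line independently and the reduce groups share nothing, both phases parallelize naturally — which is exactly why failed tasks can simply be rescheduled.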

Hadoop Components

• Hadoop Common
– a set of components and interfaces for distributed file systems and general I/O

• Hadoop Distributed Filesystem (HDFS)
• Hadoop YARN
– a resource-management and scheduling platform

• Hadoop MapReduce
– distributed programming model and execution environment

Hadoop Stack Transition

Hadoop Ecosystem

• HBase – a scalable data warehouse with support for large tables

• Hive – a data warehouse infrastructure that provides data summarization and ad hoc querying

• Pig – a high-level data-flow language and execution framework for parallel computation

• Spark – a fast and general compute engine for Hadoop data, with a wide range of applications – ETL, machine learning, stream processing, and graph analytics

Flexibility: Complex Data Processing

1. Java MapReduce: most flexibility and performance, but a tedious development cycle (the assembly language of Hadoop).

2. Streaming MapReduce (aka Pipes): allows development in any programming language of your choice, but slightly lower performance and less flexibility than native Java MapReduce.

3. Crunch: a library for multi-stage MapReduce pipelines in Java (modeled after Google’s FlumeJava).

4. Pig Latin: a high-level language out of Yahoo, suitable for batch data-flow workloads.

5. Hive: a SQL interpreter out of Facebook, which also includes a meta-store mapping files to their schemas and associated SerDes.

6. Oozie: a workflow engine that enables creating a workflow of jobs composed of any of the above.

3: Hadoop Distributed File System

• Hadoop comes with a distributed file system called HDFS (Hadoop Distributed File System)

• Based on Google’s GFS (Google File System)
• HDFS provides redundant storage for massive amounts of data
– using commodity hardware

• Data in HDFS is distributed across all data nodes
– efficient for MapReduce processing

HDFS Design

• File system on commodity hardware
– survives even high failure rates of the components

• Supports lots of large files
– file sizes of hundreds of gigabytes or several terabytes

• Main design principles
– write once, read many times
– streaming reads rather than frequent random access
– high throughput is more important than low latency

HDFS Architecture

• HDFS operates on top of an existing file system
• Files are stored as blocks (default size 128 MB, different from file system blocks)

• File reliability is based on block-based replication
– each block of a file is typically replicated across several DataNodes (default replication is 3)

• The NameNode stores metadata, manages replication, and provides access to files

• No data caching (because of the large data sets), but direct reading/streaming from a DataNode to the client

HDFS Architecture

• The NameNode stores HDFS metadata
– file names, locations of blocks, file attributes
– metadata is kept in RAM for fast lookups

• The number of files in HDFS is limited by the amount of RAM available on the NameNode
– HDFS NameNode federation can help with RAM issues: several NameNodes, each of which manages a portion of the file system namespace

HDFS Architecture

• A DataNode stores file contents as blocks
– different blocks of the same file are stored on different DataNodes

– the same block is typically replicated across several DataNodes for redundancy

– periodically sends a report of all existing blocks to the NameNode

– DataNodes exchange heartbeats with the NameNode

HDFS Architecture

• Built-in protection against DataNode failure
• If the NameNode does not receive a heartbeat from a DataNode within a certain time period, the DataNode is assumed to be lost

• In case of a failing DataNode, block replication is actively maintained
– the NameNode determines which blocks were on the lost DataNode

– the NameNode finds other copies of these lost blocks and replicates them to other nodes

HDFS

• HDFS Federation
– multiple NameNode servers
– multiple namespaces

• High availability – redundant NameNodes
• Heterogeneous storage and archival storage: ARCHIVE, DISK, SSD, RAM_DISK

High-Availability (HA) Issues: NameNode Failure

• NameNode failure corresponds to losing all files on a file system

% sudo rm --dont-do-this /

• For recovery, Hadoop provides two options
– backup files that make up the persistent state of the file system

– the secondary NameNode
• Some more advanced techniques also exist

HA Issues: the Secondary NameNode

• The secondary NameNode is not a mirrored NameNode
• It performs memory-intensive administrative functions
– the NameNode keeps metadata in memory and writes changes to an edit log

– the secondary NameNode periodically combines the previous namespace image and the edit log into a new namespace image, preventing the log from becoming too large

• Keeps a copy of the merged namespace image, which can be used in the event of NameNode failure

Network Topology

• HDFS is aware of how close two nodes are in the network

• From closest to furthest:
0: processes on the same node
2: different nodes in the same rack
4: nodes in different racks in the same data center
6: nodes in different data centers
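These distance levels can be modeled with a toy function. This is an illustration of the metric only — real HDFS derives node locations from its configured rack-awareness topology script:

```java
public class NetworkDistance {
    // Distance between two processes located by (datacenter, rack, node) ids.
    static int distance(String dc1, String rack1, String node1,
                        String dc2, String rack2, String node2) {
        if (!dc1.equals(dc2))     return 6; // different data centers
        if (!rack1.equals(rack2)) return 4; // different racks, same data center
        if (!node1.equals(node2)) return 2; // different nodes, same rack
        return 0;                           // same node
    }

    public static void main(String[] args) {
        System.out.println(distance("dc1", "r1", "n1", "dc1", "r1", "n1")); // 0
        System.out.println(distance("dc1", "r1", "n1", "dc1", "r2", "n3")); // 4
    }
}
```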

Network Topology

File Block Placement

• Clients always read from the closest node
• Default placement strategy
– first replica on the same local node as the client
– second replica in a different rack
– third replica on a different, randomly selected node in the same rack as the second replica

• Additional (3+) replicas are random
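A minimal sketch of that default strategy, with nodes named "rack/node" (a hypothetical model for illustration — the real block placement policy also weighs node load and free space):

```java
import java.util.*;

public class BlockPlacement {
    // Extract the rack part of a "rack/node" name.
    static String rack(String node) { return node.split("/")[0]; }

    // Pick three replica locations: first on the writer's own node,
    // second on a random node in a different rack, third on another
    // node in the same rack as the second.
    static List<String> place(String writer, List<String> cluster, Random rnd) {
        List<String> replicas = new ArrayList<>();
        replicas.add(writer);                         // 1st: local node
        List<String> offRack = new ArrayList<>();
        for (String n : cluster)
            if (!rack(n).equals(rack(writer))) offRack.add(n);
        String second = offRack.get(rnd.nextInt(offRack.size()));
        replicas.add(second);                         // 2nd: different rack
        List<String> sameRack = new ArrayList<>();
        for (String n : cluster)
            if (rack(n).equals(rack(second)) && !n.equals(second)) sameRack.add(n);
        replicas.add(sameRack.get(rnd.nextInt(sameRack.size()))); // 3rd
        return replicas;
    }

    public static void main(String[] args) {
        List<String> cluster = Arrays.asList("r1/n1", "r1/n2", "r2/n1", "r2/n2");
        System.out.println(place("r1/n1", cluster, new Random()));
    }
}
```

Note how the rule spends only one cross-rack network transfer for the first two replicas while still surviving the loss of an entire rack.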

Balancing

• Hadoop works best when blocks are evenly spread out

• Support for DataNodes of different sizes
– in the optimal case the disk usage percentage is at approximately the same level on all DataNodes

• Hadoop provides a balancer daemon
– re-distributes blocks
– should be run when new DataNodes are added

Running Hadoop

• Three configurations
– standalone
– pseudo-distributed
– fully distributed
– https://hadoop.apache.org/docs/r2.7.2/hadoop-project-dist/hadoop-common/SingleCluster.html

Configuring HDFS

• The variable HADOOP_CONF_DIR defines the directory for the Hadoop configuration files

• core-site.xml

<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9001</value>
  </property>
</configuration>

• hdfs-site.xml

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:///home/NN/hadoop/namenode</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:///home/NN/hadoop/datanode</value>
  </property>
</configuration>

Accessing Data

• Data can be accessed using various methods
– Java API
– C API
– command line / POSIX (FUSE mount)
– command line / HDFS client: demo
– HTTP
– various tools

HDFS URI

• All HDFS (CLI) commands take path URIs as arguments

• URI example
– hdfs://localhost:9000/user/hduser/log-data/file1.log

• The scheme and authority are optional
– /user/hduser/log-data/file1.log

• Relative to the home directory
– log-data/file1.log
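The scheme/authority/path split of the example URI can be inspected with the standard java.net.URI class:

```java
import java.net.URI;

public class HdfsUriDemo {
    public static void main(String[] args) {
        URI uri = URI.create("hdfs://localhost:9000/user/hduser/log-data/file1.log");
        System.out.println(uri.getScheme());    // hdfs
        System.out.println(uri.getAuthority()); // localhost:9000
        System.out.println(uri.getPath());      // /user/hduser/log-data/file1.log
    }
}
```

When the scheme and authority are omitted, the HDFS client fills them in from fs.defaultFS in core-site.xml.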

RDBMS vs HDFS

• Schema-on-Write (RDBMS)
– a schema must be created before any data can be loaded
– an explicit load operation transforms data into the DB-internal structure
– new columns must be added explicitly before new data for such columns can be loaded into the DB

• Schema-on-Read (HDFS)
– data is simply copied to the file store; no transformation is needed
– a SerDe (Serializer/Deserializer) is applied at read time to extract the required columns (late binding)
– new data can start flowing in at any time and will appear retroactively once the SerDe is updated to parse it

Conclusions

• Pros
– support for very large files
– designed for streaming data
– commodity hardware

• Cons
– not designed for low-latency data access
– the architecture does not support lots of small files
– no support for multiple writers or arbitrary file modifications (writes always go to the end of the file)

Reading Data

Flume

4: Data Modeling

• HDFS is a schema-on-read system
– allows storing all of your raw data

• Still, the following must be considered
– data storage formats
– multitenancy
– schema design
– metadata management

Data Storage Options

• No standard data storage format
– Hadoop allows storing data in any format

• Major considerations for data storage include
– file format (e.g. plain text, SequenceFile, or more complex but functionally richer options, such as Avro and Parquet)

– compression (splittability)
– data storage system (HDFS, HBase, Hive, Impala)

File Formats: Text File

• Common use case: web logs and server logs
– come in many formats

• Organization of the files in the file system matters
• Text files consume space -> compression
• Overhead for conversion (‘123’ -> 123)
• Structured text data
– XML and JSON present challenges to Hadoop

• hard to split
– dedicated libraries exist

File Formats: Binary Data

• Hadoop can be used to process binary files
– e.g. images

• A container format is preferred
– e.g. SequenceFile

• If the splittable unit of binary data is larger than 64 MB, you may consider putting the data in its own file, without using a container format

Hadoop File Types

• Hadoop-specific file formats are created specifically to work well with MapReduce
– file-based data structures such as sequence files,
– serialization formats like Avro, and
– columnar formats such as RCFile and Parquet

• Splittable compression
– these formats support common compression formats and are also splittable

• Agnostic compression
– the codec is stored in the header metadata of the file format -> the file can be compressed with any compression codec, without readers having to know the codec

File-Based Data Structures

• The SequenceFile format is one of the most commonly used file-based formats in Hadoop
– other formats: MapFiles, SetFiles, ArrayFiles, BloomMapFiles, …

– stores data as binary key-value pairs
– three formats are available for records: uncompressed, record-compressed, block-compressed

SequenceFile

• Header metadata
– compression codec, key and value class names, user-defined metadata, randomly generated sync marker

• Often used as a container for smaller files

Compression

• Also for speeding up MapReduce
– not only for reducing storage requirements

• Compression must be splittable
– the MapReduce framework splits data for input to multiple tasks

HDFS Schema Design

• Hadoop is often a data hub for the entire organization
– data is shared by many departments and teams

• A carefully structured and organized repository has several benefits
– a standard directory structure makes it easier to share data between teams

– allows for enforcing access rights and quotas
– conventions regarding e.g. staging data lead to fewer errors
– code reuse
– Hadoop tools make assumptions about data placement

Recommended Locations of Files

• /user/<username>
– data, JARs, and config files of a specific user

• /etl
– data in all phases of an ETL workflow
– /etl/<group>/<application>/<process>/{input,processing,output,bad}

• /tmp
– temporary data

Recommended Locations of Files

• /data
– data sets shared across the organization
– data is written by automated ETL processes
– read-only for users
– subdirectories for each data set

• /app
– JARs, Oozie workflow definitions, Hive HQL files, …
– /app/<group>/<application>/<version>/<artifact directory>/<artifact>

Recommended Locations of Files

• /metadata
– the metadata required by some tools

Partitioning

• HDFS has no indexes
– pro: fast to ingest data
– con: might lead to a full table scan (FTS), even when only a portion of the data is needed

• Solution: break the data set into smaller subsets (partitions)
– an HDFS subdirectory for each partition
– allows queries to read only the specific partitions they need

Partitioning: Example

• Assume data sets of all orders for various pharmacies

• Without partitioning, checking the order history of just one physician over the past three months leads to a full table scan

• medication_orders/date=20160824/{order1.csv,order2.csv}

– only 90 directories must be scanned
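With the slide's date=YYYYMMDD layout, a three-month query only needs to enumerate the 90 wanted partition directories instead of scanning the whole data set. A sketch of building that directory list (the dataset name and convention follow the slide; the helper itself is hypothetical):

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.util.*;

public class PartitionPaths {
    // Build the partition directories to scan for the last `days` days,
    // oldest first, ending at `end`.
    static List<String> partitionsFor(LocalDate end, int days) {
        DateTimeFormatter fmt = DateTimeFormatter.ofPattern("yyyyMMdd");
        List<String> dirs = new ArrayList<>();
        for (int i = days - 1; i >= 0; i--)
            dirs.add("medication_orders/date=" + end.minusDays(i).format(fmt) + "/");
        return dirs;
    }

    public static void main(String[] args) {
        List<String> dirs = partitionsFor(LocalDate.of(2016, 8, 24), 90);
        System.out.println(dirs.size());               // 90
        System.out.println(dirs.get(dirs.size() - 1)); // medication_orders/date=20160824/
    }
}
```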

5: Data Movement

• File system client for simple usage
• Common data sources for Hadoop include
– traditional data management systems such as relational databases and mainframes

– logs, machine-generated data, and other forms of event data

– files being imported from existing enterprise data storage systems

Data Movement: Considerations

• Timeliness of data ingestion and accessibility
– What are the requirements around how often data needs to be ingested? How soon does data need to be available to downstream processing?

• Incremental updates
– How will new data be added? Does it need to be appended to existing data? Or overwrite existing data?

Data Movement: Considerations

• Data access and processing
– Will the data be used in processing? If so, will it be used in batch processing jobs? Or is random access to the data required?

• Source system and data structure
– Where is the data coming from? A relational database? Logs? Is it structured, semi-structured, or unstructured data?

Data Movement: Considerations

• Partitioning and splitting of data
– How should data be partitioned after ingest? Does the data need to be ingested into multiple target systems (e.g., HDFS and HBase)?

• Storage format
– What format will the data be stored in?

• Data transformation
– Does the data need to be transformed in flight?

Timeliness of Data Ingestion

• Time lag from when data is available for ingestion to when it is accessible in Hadoop
• Classification of ingestion requirements:

• Macro batch
– anything over 15 minutes to hours, or even a daily job

• Micro batch
– fired off every 2 minutes or so, but no more than 15 minutes in total

• Near-real-time decision support
– “immediately actionable” by the recipient of the information
– delivered in less than 2 minutes but greater than 2 seconds

• Near-real-time event processing
– under 2 seconds, and can be as fast as the 100-millisecond range

• Real time
– anything under 100 milliseconds

Incremental Updates

• Data is either appended to an existing data set or it is modified
– HDFS works fine for append-only implementations

• The downside of HDFS is the inability to do updates or random writes to files after they are created

• HDFS is optimized for large files
– if the requirements call for a two-minute append process that ends up producing lots of small files, then a periodic process to combine smaller files will be required to get the benefits of larger files

Original Source System and Data Structure

• Original file type
– any format: delimited, XML, JSON, Avro, fixed length, variable length, copybooks, …

• Hadoop can accept any file format
– not all formats are optimal for particular use cases
– not all file formats can work with all tools in the Hadoop ecosystem; example: variable-length files

Compression

• Pro
– transferring a compressed file over the network requires less I/O and network bandwidth

• Con
– most compression codecs applied outside of Hadoop are not splittable (e.g., Gzip)

Misc

• RDBMS
– tool: Sqoop

• Streaming data
– Twitter feeds, a Java Message Service (JMS) queue, events firing from a web application server

– tools: Flume or Kafka

• Log files
– an anti-pattern is to read the log files from disk as they are written, because this is almost impossible to implement without losing data

– the correct way of ingesting log files is to stream the logs directly to a tool like Flume or Kafka, which will write directly to Hadoop instead

Transformations

• Modifications on incoming data, distributing the data into partitions or buckets, or sending the data to more than one store or location
– transformation: XML or JSON is converted to delimited data

– partitioning: incoming data is stock trade data and partitioning by ticker is required

– splitting: the data needs to land in both HDFS and HBase for different access patterns

Data Ingestion Options

• File transfers
• Tools like Flume, Sqoop, and Kafka
