tie-22306 data-intensive programming - tunidip/slides/slides1.1.pdf · 2016-09-01 · data storage...

TIE-22306 Data-intensive Programming Dr. Timo Aaltonen Department of Pervasive Computing



Data-Intensive Programming

• Lecturer: Timo Aaltonen
  – [email protected]
• Assistants
  – Adnan Mushtaq
  – MSc Antti Luoto
  – MSc Antti Kallonen

Lecturer

• University Lecturer
• Doctoral degree in Software Engineering, TUT, 2005
• Work history
  – Various positions, TUT, 1995–2010
  – Principal Researcher, System Software Engineering, Nokia Research Center, 2010–2012
  – University lecturer, TUT

Working on the Course

• Lectures on Fridays
• Weekly exercises
  – beginning from week 2
• Course work
  – announced next Friday
• Communication
  – http://www.cs.tut.fi/~dip/
• Exam

Weekly Exercises

• Linux class TC217
• At the beginning of the course: hands-on training
• At the end of the course: reception for problems with the course work
• Enrolment is open
• Not compulsory, no credit points
• Two more instances will be added

Course Work

• Using Hadoop tools and framework to solve a typical Big Data problem (in Java)
• Groups of three
• Hardware
  – Your own laptop with self-installed Hadoop
  – Your own laptop with VirtualBox 5.1 and Ubuntu VM
  – A TUT virtual machine

Exam

• Electronic exam after the course
• Tests understanding rather than exact syntax
  – "Use pseudocode to write a MapReduce program which …"
• General questions on Hadoop and related technologies

Today

• Big data
• Data Science
• Hadoop
• HDFS
• Apache Flume

1: Big Data

• The world is drowning in data
  – click stream data is collected by web servers
  – NYSE generates 1 TB of trade data every day
  – MTC collects 5000 attributes for each call
  – Smart marketers collect purchasing habits
• "More data usually beats better algorithms"

Three Vs of Big Data

• Volume: amount of data
  – Transaction data stored through the years, unstructured data streaming in from social media, increasing amounts of sensor and machine-to-machine data
• Velocity: speed of data in and out
  – streaming data from RFID, sensors, …
• Variety: range of data types and sources
  – structured, unstructured

Big Data

• Variability
  – Data flows can be highly inconsistent with periodic peaks
• Complexity
  – Data comes from multiple sources
  – Linking, matching, cleansing and transforming data across systems is a complex task

Data Science

• Definition: data science is an activity that extracts insights from messy data
• Facebook analyzes location data
  – to identify global migration patterns
  – to find out the fan bases of different sports teams
• A retailer might track purchases both online and in-store for targeted marketing


New Challenges

• Compute-intensiveness
  – raw computing power
• Challenges of data-intensiveness
  – amount of data
  – complexity of data
  – speed at which data is changing

Data Storage Analysis

• Hard drive from 1990
  – capacity 1,370 MB
  – transfer speed 4.4 MB/s
• Hard drive from the 2010s
  – capacity 1 TB
  – transfer speed 100 MB/s
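The point of these figures is that capacity has grown much faster than transfer speed, so reading a whole drive takes far longer than it used to. A minimal sketch of the arithmetic, assuming 1 TB is taken as 10^6 MB (the function name is illustrative):

```python
def full_read_seconds(capacity_mb: float, speed_mb_per_s: float) -> float:
    """Time needed to read an entire drive sequentially."""
    return capacity_mb / speed_mb_per_s


# Figures from the slide; 1 TB is approximated as 1,000,000 MB (assumption).
drive_1990 = full_read_seconds(1_370, 4.4)
drive_2010 = full_read_seconds(1_000_000, 100)

print(f"1990 drive:  {drive_1990 / 60:.1f} minutes")   # about 5.2 minutes
print(f"2010s drive: {drive_2010 / 3600:.1f} hours")   # about 2.8 hours
```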

Scalability

• The system grows without requiring developers to re-architect their algorithms or applications
• Horizontal scaling
• Vertical scaling

Parallel Approach

• Reading from multiple disks in parallel
  – 100 drives, each having 1/100 of the data => 1/100 of the reading time
• Problem: hardware failures
  – replication
• Problem: most analysis tasks need to be able to combine data in some way
  – MapReduce
• Hadoop
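The 1/100 claim can be checked with the same arithmetic, under the idealized assumption that the data is spread evenly and all drives read at full speed simultaneously (function name and figures are illustrative):

```python
def parallel_read_seconds(total_mb: float, drives: int, speed_mb_per_s: float) -> float:
    """Read time when the data is split evenly and all drives read at once."""
    per_drive_mb = total_mb / drives
    return per_drive_mb / speed_mb_per_s


single = parallel_read_seconds(1_000_000, 1, 100)      # one 1 TB drive
parallel = parallel_read_seconds(1_000_000, 100, 100)  # 100 drives, 1/100 each

print(single, parallel)  # 10000.0 seconds vs 100.0 seconds
```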

2: Apache Hadoop

• Hadoop is a framework of tools
  – libraries and methodologies
• Operates on large unstructured data sets
• Open source (Apache License)
• Simple programming model
• Scalable

Hadoop

• A scalable fault-tolerant distributed system for data storage and processing (open source under the Apache license)
• Core Hadoop has two main systems:
  – Hadoop Distributed File System: self-healing high-bandwidth clustered storage
  – MapReduce: distributed fault-tolerant resource management and scheduling coupled with a scalable data programming abstraction

Hadoop

• Administrators
  – Installation
  – Monitor/Manage Systems
  – Tune Systems
• End Users
  – Design MapReduce Applications
  – Import and export data
  – Work with various Hadoop Tools

Hadoop

• Developed by Doug Cutting and Michael J. Cafarella
• Based on Google MapReduce technology
• Designed to handle large amounts of data and be robust
• Donated to Apache Foundation in 2006 by Yahoo

Hadoop Design Principles

• Moving computation is cheaper than moving data
• Hardware will fail
• Hide execution details from the user
• Use streaming data access
• Use a simple file system coherency model
• Hadoop is not a replacement for SQL, and it is not always fast and efficient for quick ad hoc querying

Hadoop MapReduce

• MapReduce (MR) is the original programming model for Hadoop
• Collocate data with the compute node
  – data access is fast since it is local (data locality)
• Network bandwidth is the most precious resource in the data center
  – MR implementations explicitly model the network topology

Hadoop MapReduce

• MR operates at a high level of abstraction
  – programmer thinks in terms of functions of key and value pairs
• MR is a shared-nothing architecture
  – tasks do not depend on each other
  – failed tasks can be rescheduled by the system
• MR was introduced by Google
  – used for producing search indexes
  – applicable to many other problems too
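To make "functions of key and value pairs" concrete, here is a small single-process sketch of the model using word counting. It is not Hadoop code: `map_fn`, `reduce_fn` and `run_job` are illustrative names, and the in-memory grouping step only imitates what the framework's shuffle does across machines.

```python
from collections import defaultdict


def map_fn(_offset, line):
    """Map: (offset, line) -> a stream of (word, 1) pairs."""
    for word in line.lower().split():
        yield word, 1


def reduce_fn(word, counts):
    """Reduce: (word, [1, 1, ...]) -> (word, total)."""
    return word, sum(counts)


def run_job(lines):
    """Imitate the framework: map every record, group by key, then reduce."""
    groups = defaultdict(list)
    for offset, line in enumerate(lines):
        for key, value in map_fn(offset, line):
            groups[key].append(value)          # the "shuffle" step
    return dict(reduce_fn(k, vs) for k, vs in sorted(groups.items()))


result = run_job(["big data", "big clusters process big data"])
print(result)  # {'big': 3, 'clusters': 1, 'data': 2, 'process': 1}
```

Because each call to `map_fn` and `reduce_fn` depends only on its own arguments, a failed call can simply be run again, which is the shared-nothing property listed above.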

Hadoop Components

• Hadoop Common
  – A set of components and interfaces for distributed file systems and general I/O
• Hadoop Distributed Filesystem (HDFS)
• Hadoop YARN
  – a resource-management platform, scheduling
• Hadoop MapReduce
  – Distributed programming model and execution environment

Hadoop Stack Transition

Hadoop Ecosystem

• HBase – a scalable, distributed database with support for large tables
• Hive – a data warehouse infrastructure that provides data summarization and ad hoc querying
• Pig – a high-level data-flow language and execution framework for parallel computation
• Spark – a fast and general compute engine for Hadoop data, with a wide range of applications: ETL, machine learning, stream processing, and graph analytics

Flexibility: Complex Data Processing

1. Java MapReduce: Most flexibility and performance, but tedious development cycle (the assembly language of Hadoop).
2. Streaming MapReduce (aka Pipes): Allows you to develop in any programming language of your choice, but slightly lower performance and less flexibility than native Java MapReduce.
3. Crunch: A library for multi-stage MapReduce pipelines in Java (modeled after Google's FlumeJava).
4. Pig Latin: A high-level language out of Yahoo, suitable for batch data flow workloads.
5. Hive: A SQL interpreter out of Facebook, also includes a meta-store mapping files to their schemas and associated SerDes.
6. Oozie: A workflow engine that enables creating a workflow of jobs composed of any of the above.

3: Hadoop Distributed File System

• Hadoop comes with a distributed file system called HDFS (Hadoop Distributed File System)
• Based on Google's GFS (Google File System)
• HDFS provides redundant storage for massive amounts of data
  – using commodity hardware
• Data in HDFS is distributed across all data nodes
  – Efficient for MapReduce processing

HDFS Design

• File system on commodity hardware
  – Survives even with high failure rates of the components
• Supports lots of large files
  – File sizes of hundreds of GB or several TB
• Main design principles
  – Write once, read many times
  – Streaming reads rather than frequent random access
  – High throughput is more important than low latency

HDFS Architecture

• HDFS operates on top of an existing file system
• Files are stored as blocks (default size 128 MB, different from file system blocks)
• File reliability is based on block-based replication
  – Each block of a file is typically replicated across several DataNodes (default replication is 3)
• NameNode stores metadata, manages replication and provides access to files
• No data caching (because of large data sets), but direct reading/streaming from DataNode to client
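A quick way to see what the defaults mean for one file is to count its blocks and replicas. The sketch below uses the 128 MB block size and replication factor 3 from the slide; the helper names are illustrative, and it assumes the last block only occupies as much space as the data it holds.

```python
import math

BLOCK_SIZE_MB = 128   # default block size quoted on the slide
REPLICATION = 3       # default replication factor quoted on the slide


def block_count(file_size_mb: float) -> int:
    """Number of HDFS blocks needed for a file of the given size."""
    return math.ceil(file_size_mb / BLOCK_SIZE_MB)


def replica_count(file_size_mb: float) -> int:
    """Total block replicas stored across the DataNodes."""
    return block_count(file_size_mb) * REPLICATION


def raw_storage_mb(file_size_mb: float) -> float:
    """Raw disk space consumed, assuming a partial last block is not padded."""
    return file_size_mb * REPLICATION


print(block_count(1024), replica_count(1024), raw_storage_mb(1024))  # 8 24 3072
print(block_count(200))  # 2: one full block and one partial block
```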

HDFS Architecture

• NameNode stores HDFS metadata
  – filenames, locations of blocks, file attributes
  – Metadata is kept in RAM for fast lookups
• The number of files in HDFS is limited by the amount of available RAM in the NameNode
  – HDFS NameNode federation can help with RAM issues: several NameNodes, each of which manages a portion of the file system namespace

HDFS Architecture

• DataNode stores file contents as blocks
  – Different blocks of the same file are stored on different DataNodes
  – The same block is typically replicated across several DataNodes for redundancy
  – Periodically sends a report of all existing blocks to the NameNode
  – DataNodes exchange heartbeats with the NameNode

HDFS Architecture

• Built-in protection against DataNode failure
• If the NameNode does not receive any heartbeat from a DataNode within a certain time period, the DataNode is assumed to be lost
• In case of a failing DataNode, block replication is actively maintained
  – The NameNode determines which blocks were on the lost DataNode
  – The NameNode finds other copies of these lost blocks and replicates them to other nodes
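The failure-handling steps above can be sketched as two small functions. This is a simplified illustration, not the NameNode's actual algorithm: the timeout value, the data structures and the function names are all assumptions made for the example.

```python
def find_lost_nodes(last_heartbeat, now, timeout_s):
    """DataNodes whose most recent heartbeat is older than the timeout."""
    return {node for node, seen in last_heartbeat.items() if now - seen > timeout_s}


def plan_rereplication(block_locations, lost, live_nodes, target=3):
    """For each under-replicated block, pick a surviving source and new targets."""
    plan = {}
    for block, holders in block_locations.items():
        survivors = holders - lost
        if not survivors or len(survivors) >= target:
            continue  # block is either gone entirely or still fully replicated
        candidates = sorted(live_nodes - survivors)
        plan[block] = (sorted(survivors)[0], candidates[: target - len(survivors)])
    return plan


# Hypothetical cluster state: dn3 stopped sending heartbeats at t=40.
heartbeats = {"dn1": 100, "dn2": 95, "dn3": 40, "dn4": 99}
lost = find_lost_nodes(heartbeats, now=100, timeout_s=30)
blocks = {"blk_1": {"dn1", "dn2", "dn3"}, "blk_2": {"dn1", "dn2", "dn4"}}
plan = plan_rereplication(blocks, lost, live_nodes=set(heartbeats) - lost)

print(lost)  # {'dn3'}
print(plan)  # {'blk_1': ('dn1', ['dn4'])}
```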

HDFS

• HDFS Federation
  – Multiple NameNode servers
  – Multiple namespaces
• High Availability
  – redundant NameNodes
• Heterogeneous Storage and Archival Storage
  – ARCHIVE, DISK, SSD, RAM_DISK

High-Availability (HA) Issues: NameNode Failure

• NameNode failure corresponds to losing all files on a file system

  % sudo rm --dont-do-this /

• For recovery, Hadoop provides two options
  – Backup files that make up the persistent state of the file system
  – Secondary NameNode
• Also some more advanced techniques exist

HA Issues: the Secondary NameNode

• The secondary NameNode is not a mirror of the NameNode
• It performs required memory-intensive administrative functions
  – The NameNode keeps metadata in memory and writes changes to an edit log
  – The secondary NameNode periodically combines the previous namespace image and the edit log into a new namespace image, preventing the log from becoming too large
• Keeps a copy of the merged namespace image, which can be used in the event of NameNode failure
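The checkpoint idea, replaying an edit log on top of the previous namespace image, can be sketched in a few lines. The operation names and the dictionary-based image below are illustrative assumptions; the real on-disk formats are different.

```python
def apply_edit(namespace, edit):
    """Apply one logged change to an in-memory namespace (path -> block list)."""
    op, path = edit[0], edit[1]
    if op == "create":
        namespace[path] = edit[2]
    elif op == "delete":
        namespace.pop(path, None)
    elif op == "rename":
        namespace[edit[2]] = namespace.pop(path)
    else:
        raise ValueError(f"unknown operation: {op}")


def checkpoint(image, edit_log):
    """Merge the previous image with the edit log; return new image and empty log."""
    merged = dict(image)
    for edit in edit_log:
        apply_edit(merged, edit)
    return merged, []


old_image = {"/a.txt": ["blk_1"]}
edits = [
    ("create", "/b.txt", ["blk_2"]),
    ("rename", "/a.txt", "/c.txt"),
    ("delete", "/b.txt"),
]
new_image, new_log = checkpoint(old_image, edits)
print(new_image)  # {'/c.txt': ['blk_1']}
```

After a checkpoint the log can start again from empty, which is what keeps it from growing without bound.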

Network Topology

• HDFS is aware of how close two nodes are in the network
• From closer to further:
  – 0: Processes in the same node
  – 2: Different nodes in the same rack
  – 4: Nodes in different racks in the same data center
  – 6: Nodes in different data centers
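The values 0, 2, 4 and 6 fall out naturally if each node is placed in a tree (data center, rack, node) and the distance is the number of hops from both nodes up to their closest common ancestor. A minimal sketch, assuming locations are written as `/datacenter/rack/node` strings (the naming is hypothetical):

```python
def distance(a: str, b: str) -> int:
    """Hops from each node up to the closest common ancestor, added together."""
    path_a = a.strip("/").split("/")
    path_b = b.strip("/").split("/")
    common = 0
    for x, y in zip(path_a, path_b):
        if x != y:
            break
        common += 1
    return (len(path_a) - common) + (len(path_b) - common)


print(distance("/d1/r1/n1", "/d1/r1/n1"))  # 0: same node
print(distance("/d1/r1/n1", "/d1/r1/n2"))  # 2: same rack
print(distance("/d1/r1/n1", "/d1/r2/n3"))  # 4: same data center
print(distance("/d1/r1/n1", "/d2/r3/n4"))  # 6: different data centers
```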


File Block Placement

• Clients always read from the closest node
• Default placement strategy
  – First replica on the same node as the client
  – Second replica in a different rack
  – Third replica on a different, randomly selected node in the same rack as the second replica
• Additional (3+) replicas are placed randomly
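The default strategy for the first three replicas can be sketched as follows. This is an illustration of the rule as stated on the slide, not Hadoop's placement policy code: the topology dictionary and function name are assumptions, the client is assumed to run on a cluster node, and every other rack is assumed to hold at least two nodes.

```python
import random


def place_replicas(client_node, topology, rng):
    """Pick nodes for three replicas; topology maps rack name -> list of nodes."""
    rack_of = {node: rack for rack, nodes in topology.items() for node in nodes}
    first = client_node                                   # same node as the client
    other_racks = sorted(r for r in topology if r != rack_of[first])
    second_rack = rng.choice(other_racks)                 # a different rack
    second = rng.choice(topology[second_rack])
    third = rng.choice([n for n in topology[second_rack] if n != second])
    return [first, second, third]


topology = {"rack1": ["n1", "n2"], "rack2": ["n3", "n4"], "rack3": ["n5", "n6"]}
replicas = place_replicas("n1", topology, random.Random(0))
print(replicas)  # first is 'n1'; the other two share one rack other than rack1
```

Placing two replicas in one remote rack keeps a copy safe from a whole-rack failure while limiting cross-rack write traffic to a single transfer.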

Balancing

• Hadoop works best when blocks are evenly spread out
• Support for DataNodes of different sizes
  – In the optimal case, the disk usage percentage is at approximately the same level on all DataNodes
• Hadoop provides a balancer daemon
  – Re-distributes blocks
  – Should be run when new DataNodes are added

Running Hadoop

• Three configurations
  – standalone
  – pseudo-distributed
  – fully-distributed
  – https://hadoop.apache.org/docs/r2.7.2/hadoop-project-dist/hadoop-common/SingleCluster.html

Configuring HDFS

• Variable HADOOP_CONF_DIR defines the directory for the Hadoop configuration files
• core-site.xml

  <configuration>
    <property>
      <name>fs.defaultFS</name>
      <value>hdfs://localhost:9001</value>
    </property>
  </configuration>

• hdfs-site.xml

  <configuration>
    <property>
      <name>dfs.replication</name>
      <value>1</value>
    </property>
    <property>
      <name>dfs.namenode.name.dir</name>
      <value>file:///home/NN/hadoop/namenode</value>
    </property>
    <property>
      <name>dfs.datanode.data.dir</name>
      <value>file:///home/NN/hadoop/datanode</value>
    </property>
  </configuration>
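Hadoop configuration files are plain XML lists of name/value properties, so they are easy to inspect with a script. The sketch below parses a fragment shaped like the hdfs-site.xml above with Python's standard library; `load_properties` is an illustrative helper, not part of Hadoop.

```python
import xml.etree.ElementTree as ET

# A fragment with the same shape as the hdfs-site.xml shown on the slide.
HDFS_SITE = """
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:///home/NN/hadoop/namenode</value>
  </property>
</configuration>
"""


def load_properties(xml_text: str) -> dict:
    """Return the <property> entries of a Hadoop config file as a dict."""
    root = ET.fromstring(xml_text)
    return {p.findtext("name"): p.findtext("value") for p in root.iter("property")}


props = load_properties(HDFS_SITE)
print(props["dfs.replication"])  # 1 (as a string)
```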

Accessing Data

• Data can be accessed using various methods
  – Java API
  – C API
  – Command line / POSIX (FUSE mount)
  – Command line / HDFS client: Demo
  – HTTP
  – Various tools

HDFS URI

• All HDFS (CLI) commands take path URIs as arguments
• URI example
  – hdfs://localhost:9000/user/hduser/log-data/file1.log
• The scheme and authority are optional
  – /user/hduser/log-data/file1.log
• A relative path is resolved against the user's home directory
  – log-data/file1.log
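The three spellings above all name the same file. A small sketch of the resolution rule, assuming the default file system is `hdfs://localhost:9000` and the home directory is `/user/<username>` as in the slide's example (`qualify` is an illustrative helper, not an HDFS API):

```python
from urllib.parse import urlsplit

DEFAULT_FS = "hdfs://localhost:9000"  # assumed value of fs.defaultFS


def qualify(path: str, user: str = "hduser") -> str:
    """Expand a path argument into a full HDFS URI."""
    if urlsplit(path).scheme:
        return path                          # already a full URI
    if not path.startswith("/"):
        path = f"/user/{user}/{path}"        # relative to the home directory
    return DEFAULT_FS + path                 # add scheme and authority


full = "hdfs://localhost:9000/user/hduser/log-data/file1.log"
print(qualify(full) == full)                               # True
print(qualify("/user/hduser/log-data/file1.log") == full)  # True
print(qualify("log-data/file1.log") == full)               # True
```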

RDBMS vs HDFS

• Schema-on-Write (RDBMS)
  – Schema must be created before any data can be loaded
  – An explicit load operation which transforms data to DB internal structure
  – New columns must be added explicitly before new data for such columns can be loaded into the DB
• Schema-on-Read (HDFS)
  – Data is simply copied to the file store, no transformation is needed
  – A SerDe (Serializer/Deserializer) is applied during read time to extract the required columns (late binding)
  – New data can start flowing anytime and will appear retroactively once the SerDe is updated to parse it
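Late binding is easiest to see with a toy example: the raw lines are stored once and never rewritten, and only the deserializer changes. The field layout and function names below are made up for illustration and do not correspond to any real SerDe implementation.

```python
# Raw records are stored exactly as they arrived; nothing is transformed on write.
RAW_LINES = ["2016-09-01,alice,42", "2016-09-02,bob,17"]


def serde_v1(line):
    """Original deserializer: only knows about the first two fields."""
    date, user = line.split(",")[:2]
    return {"date": date, "user": user}


def serde_v2(line):
    """Updated deserializer: also parses the third field as an amount."""
    date, user, amount = line.split(",")
    return {"date": date, "user": user, "amount": int(amount)}


def read(lines, serde):
    """Schema-on-read: the schema is applied only when the data is read."""
    return [serde(line) for line in lines]


before = read(RAW_LINES, serde_v1)
after = read(RAW_LINES, serde_v2)   # the same stored data, new column appears
print(before[0])  # {'date': '2016-09-01', 'user': 'alice'}
print(after[0])   # {'date': '2016-09-01', 'user': 'alice', 'amount': 42}
```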

Conclusions

• Pros
  – Support for very large files
  – Designed for streaming data
  – Commodity hardware
• Cons
  – Not designed for low-latency data access
  – Architecture does not support lots of small files
  – No support for multiple writers or arbitrary file modifications (writes always go to the end of the file)

Reading Data

Flume

4: Data Modeling

• HDFS is a schema-on-read system
  – allows storing all of your raw data
• Still, the following must be considered
  – Data storage formats
  – Multitenancy
  – Schema design
  – Metadata management

Data Storage Options

• No standard data storage format
  – Hadoop allows storing of data in any format
• Major considerations for data storage include
  – File format (e.g. plain text, SequenceFile or more complex but more functionally rich options, such as Avro and Parquet)
  – Compression (splittability)
  – Data storage system (HDFS, HBase, Hive, Impala)

File Formats: Text File

• Common use case: web logs and server logs
  – come in many formats
• Organization of the files in the file system
• Text files consume space -> compression
• Overhead for conversion ('123' -> 123)
• Structured text data
  – XML and JSON present challenges to Hadoop: hard to split
  – Dedicated libraries exist

File Formats: Binary Data

• Hadoop can be used to process binary files
  – e.g. images
• Container format is preferred
  – e.g. SequenceFile
• If the splittable unit of binary data is larger than 64 MB, you may consider putting the data in its own file, without using a container format

Hadoop File Types

• Hadoop-specific file formats are specifically created to work well with MapReduce
  – file-based data structures such as sequence files,
  – serialization formats like Avro, and
  – columnar formats such as RCFile and Parquet
• Splittable compression
  – These formats support common compression formats and are also splittable
• Agnostic compression
  – codec is stored in the header metadata of the file format -> the file can be compressed with any compression codec, without readers having to know the codec

File-Based Data Structures

• SequenceFile format is one of the most commonly used file-based formats in Hadoop
  – other formats: MapFiles, SetFiles, ArrayFiles, BloomMapFiles, …
  – stores data as binary key-value pairs
  – three formats available for records: uncompressed, record-compressed, block-compressed

SequenceFile

• Header metadata
  – compression codec, key and value class names, user-defined metadata, randomly generated sync marker
• Often used as a container for smaller files

Compression

• Also for speeding up MapReduce
  – Not only for reducing storage requirements
• Compression must be splittable
  – MapReduce framework splits data for input to multiple tasks

HDFS Schema Design

• Hadoop is often a data hub for the entire organization
  – data is shared by many departments and teams
• A carefully structured and organized repository has several benefits
  – standard directory structure makes it easier to share data between teams
  – allows for enforcing access rights and quota
  – conventions regarding e.g. staging data lead to fewer errors
  – code reuse
  – Hadoop tools make assumptions about the data placement

Recommended Locations of Files

• /user/<username>
  – data, JARs, and config files of a specific user
• /etl
  – data in all phases of an ETL workflow
  – /etl/<group>/<application>/<process>/{input,processing,output,bad}
• /tmp
  – temporary data
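A convention like the /etl layout is most useful when jobs build their paths from it instead of hard-coding strings. A minimal sketch of such a helper; the function name and the example group, application and process names are hypothetical.

```python
ETL_STAGES = ("input", "processing", "output", "bad")


def etl_dirs(group: str, application: str, process: str) -> dict:
    """Build the per-stage directories of one ETL process from the convention."""
    base = f"/etl/{group}/{application}/{process}"
    return {stage: f"{base}/{stage}" for stage in ETL_STAGES}


dirs_for_job = etl_dirs("bi", "clickstream", "aggregate_preferences")
print(dirs_for_job["input"])  # /etl/bi/clickstream/aggregate_preferences/input
print(dirs_for_job["bad"])    # /etl/bi/clickstream/aggregate_preferences/bad
```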

Recommended Locations of Files

• /data
  – data sets shared across the organization
  – data is written by automated ETL processes
  – read-only for users
  – subdirectories for each data set
• /app
  – JARs, Oozie workflow definitions, Hive HQL files, …
  – /app/<group>/<application>/<version>/<artifact directory>/<artifact>

Recommended Locations of Files

• /metadata
  – the metadata required by some tools

Partitioning

• HDFS has no indexes
  – pro: fast to ingest data
  – con: might lead to a full table scan (FTS), even when only a portion of the data is needed
• Solution: break the data set into smaller subsets (partitions)
  – an HDFS subdirectory for each partition
  – allows queries to read only the specific partitions

Partitioning: Example

• Assume a data set of all orders for various pharmacies
• Without partitioning, checking the order history for just one physician over the past three months leads to a full table scan
• medication_orders/date=20160824/{order1.csv,order2.csv}
  – only 90 directories must be scanned
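With one directory per day, a three-month query only needs to list the directories whose dates fall inside the window. A sketch of that pruning step, taking "three months" as 90 days and reusing the directory naming from the example (the function name is illustrative):

```python
from datetime import date, timedelta


def partition_dirs(dataset: str, end: date, days: int) -> list:
    """Partition directories covering the last `days` days up to and including `end`."""
    return [
        f"{dataset}/date={(end - timedelta(days=i)):%Y%m%d}"
        for i in range(days)
    ]


dirs = partition_dirs("medication_orders", date(2016, 8, 24), 90)
print(len(dirs))   # 90
print(dirs[0])     # medication_orders/date=20160824
print(dirs[-1])    # medication_orders/date=20160527
```

Everything outside these 90 directories is never opened, no matter how many years of orders the data set holds.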

5: Data Movement

• File system client for simple usage
• Common data sources for Hadoop include
  – traditional data management systems such as relational databases and mainframes
  – logs, machine-generated data, and other forms of event data
  – files being imported from existing enterprise data storage systems

Data Movement: Considerations

• Timeliness of data ingestion and accessibility
  – What are the requirements around how often data needs to be ingested? How soon does data need to be available to downstream processing?
• Incremental updates
  – How will new data be added? Does it need to be appended to existing data? Or overwrite existing data?

Data Movement: Considerations

• Data access and processing
  – Will the data be used in processing? If so, will it be used in batch processing jobs? Or is random access to the data required?
• Source system and data structure
  – Where is the data coming from? A relational database? Logs? Is it structured, semistructured, or unstructured data?

Data Movement: Considerations

• Partitioning and splitting of data
  – How should data be partitioned after ingest? Does the data need to be ingested into multiple target systems (e.g., HDFS and HBase)?
• Storage format
  – What format will the data be stored in?
• Data transformation
  – Does the data need to be transformed in flight?

Timeliness of Data Ingestion

• Time lag from when data is available for ingestion to when it's accessible in Hadoop
• Classifications of ingestion requirements:
  – Macrobatch: anything over 15 minutes to hours, or even a daily job
  – Microbatch: fired off every 2 minutes or so, but no more than 15 minutes in total
  – Near-Real-Time Decision Support: "immediately actionable" by the recipient of the information; delivered in less than 2 minutes but greater than 2 seconds
  – Near-Real-Time Event Processing: under 2 seconds, and can be as fast as a 100-millisecond range
  – Real Time: anything under 100 milliseconds
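The classification amounts to a lookup on the acceptable time lag. A small sketch using the thresholds from the slide; how the exact boundary values (2 seconds, 2 minutes, 15 minutes) are assigned is an assumption, since the slide leaves them open.

```python
def ingestion_class(lag_seconds: float) -> str:
    """Map an acceptable ingestion lag to the slide's timeliness classes."""
    if lag_seconds < 0.1:
        return "real time"
    if lag_seconds < 2:
        return "near-real-time event processing"
    if lag_seconds < 120:
        return "near-real-time decision support"
    if lag_seconds <= 15 * 60:
        return "microbatch"
    return "macrobatch"


print(ingestion_class(0.05))      # real time
print(ingestion_class(30))        # near-real-time decision support
print(ingestion_class(5 * 60))    # microbatch
print(ingestion_class(24 * 3600)) # macrobatch
```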

Incremental Updates

• Data is either appended to an existing data set or it is modified
  – HDFS works fine for append-only implementations
• The downside to HDFS is the inability to do appends or random writes to files after they're created
• HDFS is optimized for large files
  – If the requirements call for a two-minute append process that ends up producing lots of small files, then a periodic process to combine smaller files will be required to get the benefits from larger files

Original Source System and Data Structure

• Original file type
  – any format: delimited, XML, JSON, Avro, fixed length, variable length, copybooks, …
• Hadoop can accept any file format
  – not all formats are optimal for particular use cases
  – not all file formats can work with all tools in the Hadoop ecosystem, example: variable-length files

Compression

• Pro
  – transferring a compressed file over the network requires less I/O and network bandwidth
• Con
  – most compression codecs applied outside of Hadoop are not splittable (e.g., Gzip)
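Why a gzip file is not splittable can be demonstrated locally: a reader that is handed only the second half of the compressed bytes has no header and no decompressor state to start from, whereas the second half of the plain text is perfectly readable. A small sketch with Python's standard library (the record contents are made up):

```python
import gzip
import zlib

records = "".join(f"record {i}\n" for i in range(10_000)).encode()
compressed = gzip.compress(records)


def can_decode(chunk: bytes) -> bool:
    """True if the chunk can be decompressed on its own."""
    try:
        gzip.decompress(chunk)
        return True
    except (OSError, EOFError, zlib.error):
        return False


whole_ok = can_decode(compressed)
second_half_ok = can_decode(compressed[len(compressed) // 2:])
plain_second_half = records[len(records) // 2:]

print(whole_ok)                      # True: the complete file decodes
print(second_half_ok)                # False: a split in the middle is unusable
print(len(plain_second_half) > 0)    # True: uncompressed text can be split
```

A MapReduce job over such a file therefore has to give the whole file to a single task, which removes the parallelism.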

Misc

• RDBMS
  – Tool: Sqoop
• Streaming Data
  – Twitter feeds, a Java Message Service (JMS) queue, events firing from a web application server
  – Tools: Flume or Kafka
• Log files
  – an anti-pattern is to read the log files from disk as they are written because this is almost impossible to implement without losing data
  – The correct way of ingesting log files is to stream the logs directly to a tool like Flume or Kafka, which will write directly to Hadoop instead

Transformations

• Modifications on incoming data, distributing the data into partitions or buckets, sending the data to more than one store or location
  – Transformation: XML or JSON is converted to delimited data
  – Partitioning: incoming data is stock trade data and partitioning by ticker is required
  – Splitting: the data needs to land in HDFS and HBase for different access patterns
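The first two kinds of in-flight change can be sketched together: each incoming JSON record is converted to a delimited line and routed to a partition named after its ticker. The field names, the `|` delimiter and the `ticker=` directory naming are assumptions made for the example.

```python
import json
from collections import defaultdict

FIELDS = ("ticker", "price", "volume")   # hypothetical trade record layout


def to_delimited(record: dict, sep: str = "|") -> str:
    """Transformation: one JSON object -> one delimited text line."""
    return sep.join(str(record[field]) for field in FIELDS)


def partition_by_ticker(json_lines):
    """Partitioning: group the converted lines by ticker symbol."""
    partitions = defaultdict(list)
    for line in json_lines:
        record = json.loads(line)
        partitions[f"ticker={record['ticker']}"].append(to_delimited(record))
    return dict(partitions)


incoming = [
    '{"ticker": "NOK", "price": 5.1, "volume": 100}',
    '{"ticker": "AAPL", "price": 106.0, "volume": 20}',
    '{"ticker": "NOK", "price": 5.2, "volume": 50}',
]
partitions = partition_by_ticker(incoming)
print(partitions["ticker=NOK"])   # ['NOK|5.1|100', 'NOK|5.2|50']
print(partitions["ticker=AAPL"])  # ['AAPL|106.0|20']
```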

Data Ingestion Options

• File transfers
• Tools like Flume, Sqoop, and Kafka