introduction to apache apex - files.meetup.com · apache apex meetup windowing tuples divided into...
TRANSCRIPT
Apache Apex Meetup
Introduction to Apache ApexReal time streaming.. Really!!!
Chinmay [email protected]
February 13, 2016
Apache Apex Meetup
Agenda➔ Project History
➔ What is Apache Apex?
➔ Directed Acyclic Graph (DAG)
➔ Components of DAG
➔ Windowing
➔ Operator Lifecycle
➔ Apache Apex Architecture
➔ Other features
Apache Apex Meetup
Project History➔ Started development at DataTorrent in 2012
➔ Open-sourced under ASF in 2015
➔ Currently Have 50+ committers
➔ Free to use Streaming Application platform
Apache Apex Meetup
What is Apache Apex?➔ Apex project is under Apache Software Foundation
➔ Apex is a Streaming Application platform
➔ YARN-native application
➔ Complete implementation is done in Java
➔ Consist of 2 primary components
◆ Apex Core - Engine which facilitates Real time processing
◆ Apex Malhar - Out-of-the-box operators that can be used with Apex
Core
Apache Apex Meetup
➔ Defines compute stages
➔ Defined how tuple flow over compute stages over stream
Directed Acyclic Graph (DAG)
Filtered
Stream
Output StreamTuple Tuple
Filtered Stream
Enriched Stream
Enriched
Stream
er
Operator
er
Operator
er
Operator
er
Operator
Apache Apex Meetup
➔ Smallest atomic data that flows over a
stream
➔ Emitted by Operators after processing
➔ Received by next Operator for
processing
➔ Java objects which are serializable
➔ Types:
◆ Data Tuple
◆ Control Tuple
Components of DAG - Tuple
Apache Apex Meetup
➔ Logical compute unit
➔ Java code which processes a tuple
➔ Runs inside a JVM
➔ Types
◆ Input Adapter
◆ Generic Operator
◆ Output Adapter
Components of DAG - Operator
Apache Apex Meetup
➔ Connect operators
➔ Channel that carries the tuples from
one operator to another
Components of DAG - Stream
Apache Apex Meetup
➔ Ends of a stream
➔ Part of operator
➔ Types of ports
◆ Input Port
◆ Output Port
Components of DAG - Ports
Apache Apex Meetup
Windowing
➔ Tuples divided into time slices
➔ Windows are given ids (type:long)
➔ Also called as Streaming Window
● Default 500ms
Apache Apex Meetup
➔ Input Operator inserts control tuple
➔ Control tuple marks window boundary
➔ Different operator may be processing
different windows
➔ All management activities of data
happens at the boundary of window
Windowing (contd…)BeginWindowControl Tuple
EndWindowControl Tuple
Data Tuples
Window nWindow n+1 OutputAdapter
InputAdapter
GenericOperator
Apache Apex Meetup
➔ Called by Apex Platform
➔ Simple unit test like lifecycle
➔ Governed by control tuples
➔ All operators in DAG go through
this life-cycle
Operator Lifecycle
Apache Apex Meetup
➔ Setup
◆ Start of operator lifecycle
◆ Do any initialization here
➔ beginWindow
◆ Marks starting of window
➔ endWindow
◆ Marks end of window
➔ teardown
◆ Do any finalization here
◆ End of operator lifecycle
Operator Lifecycle (contd...)
Apache Apex Meetup
➔ emitTuples
◆ Called for Input Adapters
◆ Called in an infinite while
loop by platform
➔ process
◆ Called for Generic Operators
and Output Adapters
◆ Associated to to a port
◆ Called for every incoming
tuple
Operator Lifecycle (contd...)
Apache Apex Meetup
➔ OutputPort::emit
◆ Special method not part of
operator lifecycle
◆ To be called by operator
code
◆ Emits the tuples to next
operator
◆ Bound by Window
Operator Lifecycle (contd...)
Apache Apex Meetup
Apache Apex Architecture
Machine nodes (Physical or Virtual)
Hadoop (YARN) Distributed File System (e.g. HDFS)
Apache Apex Core (Streaming Engine)
Streaming Application Streaming ApplicationR
EST API
External Data
SourcesApache Apex Malhar
(Reusable Operators, Connectors)Custom Operators
Apache Apex Meetup
➔ Ease of Use
➔ Locality
➔ Fault Tolerance
➔ Scalability
◆ Partitioning
◆ Auto-scaling
Other features of platform
Apache Apex Meetup
● Apache Apex Page○ http://apex.incubator.apache.org
● Mailing Lists○ [email protected]
● Repository○ https://github.com/apache/incubator-apex-core
○ https://github.com/apache/incubator-apex-malhar
● Issue Tracking○ https://issues.apache.org/jira/browse/APEXCORE
○ https://issues.apache.org/jira/browse/APEXMALHAR
Resources● @ApacheApex
● /groups/7020520
Apache Apex Meetup
Apex in Distributed Environment
Hadoop Edge Node
dtManage (Web UI)
Hadoop Node
YARN Container
App Master
Hadoop Node
YARN ContainerYARN Container
YARN Container
Thread1
Op2
Op1
Thread-N
Op3
Worker Container
Hadoop Node
YARN ContainerYARN Container
YARN Container
Thread1
Op2
Op1
Thread-N
Op3
Worker Container
CLI
dtGateway (REST API)
Part of DataTorrent RTS
dtGateway (REST API)
dtManage (Web UI)
Web Browser
Apache Apex Meetup
➔ AT_LEAST_ONCE (default)
◆ Windows are processed at least once
➔ AT_MOST_ONCE
◆ Windows are processed at most once
➔ EXACTLY_ONCE
◆ Windows are processed exactly once
Processing Modes
Apache Apex Meetup
➔ Saves operator state on HDFS
➔ Each operator undergoes checkpointing
➔ Done by platform
➔ Happens every 60 streaming windows by default i.e. 30 sec.
➔ Checkpoint is named by the windowId at which it happens
➔ If all operators gets checkpointed at same window, that checkpointed state
becomes “committed” state of application
➔ Committed state is used for recovery in case of failure
Checkpointing