loden_302 bo uc-bods.pdf
Post on 14-Apr-2018
220 Views
Preview:
TRANSCRIPT
-
7/27/2019 Loden_302 BO UC-bods.pdf
1/44
]Session Code: 302
Using BusinessObjects Data Services to combat ETL
problems arising from bad source Data
Don Loden
SARAH MERTZ[ASUG INSTALLATION MEMBER
MEMBER SINCE: 1998
DAVID SWIERENGA[ASUG INSTALLATION MEMBER
MEMBER SINCE: 2005
GREG REISCHLEIN[ASUG INSTALLATION MEMBER
MEMBER SINCE: 2007
Business Intelligence ConsultantDecision First Technologies
-
7/27/2019 Loden_302 BO UC-bods.pdf
2/44
[
Discovering and demonstrating best practices in data flow andETL design
The water bucket brigade
Comment to enhance transparency and documentation
Approach job development with an object-oriented focus
Learning Points
Real Experience. Real Advantage. 2
ange a a ap ure Got good delta process? Recovery, performance, and overlap.
Using source- and target-based CDC
Wrapping complex database functions seamlessly
A multi-user approach to keep the project rolling while waitingfor complex code
-
7/27/2019 Loden_302 BO UC-bods.pdf
3/44
[
Discovering and Demonstrating Best Practices in Data Flow and ETL Design
Examining Change Data Capture (CDC)
Keeping the Project Rolling: A Multi-User Approach Wrap-Up
What Well Cover
Real Experience. Real Advantage. 3
-
7/27/2019 Loden_302 BO UC-bods.pdf
4/44
[
Discovering best practices in Data Flow and ETL design
The water bucket brigade
Comment to enhance transparency and documentation Approach job development with an object-oriented focus
In This Section
Real Experience. Real Advantage. 4
-
7/27/2019 Loden_302 BO UC-bods.pdf
5/44
[
SAP BusinessObjects Data Integrator is an object-orientedtool: Exploit the nature of the tool
Try to approach job design with the idea of reusability When you have two tasks that are primarily the same, try to
control the flow with variables and conditionals
Dont be afraid of sub- obs
Best Practices in ETL: Think Reusability
Real Experience. Real Advantage. 5
Sometimes sub-jobs are the only way to accomplish a task such asswitching system configurations
If you perform an operation more than two or three times,consider writing a function
-
7/27/2019 Loden_302 BO UC-bods.pdf
6/44
[
Benefits of reusability in a team environment: Using functions and variable-driven conditional transforms
Demo
Real Experience. Real Advantage. 6
-
7/27/2019 Loden_302 BO UC-bods.pdf
7/44
[ Reusability Through Conditionals
Use conditionals to reuse
logic. This makes code
flexible, but takes
Real Experience. Real Advantage. 7
-
oriented nature of Data
Services.
All objects have comments, but note the workflow. This is a multi-use
object so an annotation was used. This way the description could be
specific to the current task.
-
7/27/2019 Loden_302 BO UC-bods.pdf
8/44
[ Reusability: Whats Inside the Conditional?
If the variable test from the
last workflow does not
contain a value, then print to
the log.
Real Experience. Real Advantage. 8
If the variable test from the
last workflow does contain a
value, then run the data flowto update the INVDETL table.
-
7/27/2019 Loden_302 BO UC-bods.pdf
9/44
[
Group your workflows into subject areas
Approach job design with the idea of organization
Place logical components together and order them in the way theywill need to execute according to database constraints
Utilize a group-based numbering system
Number our sub ect areas in the order of execution
Best Practices in ETL: Think Organization
Real Experience. Real Advantage. 9
Group 1 Lookup tables Group 2 Address
Group 3 Product
Group 4 Customer
Group 5 Sales
-
7/27/2019 Loden_302 BO UC-bods.pdf
10/44
[ Organizational Work flow Example for a Job
Always use a fully commented
header for the job.
Real Experience. Real Advantage. 10
Notice the comments in the object
descriptions. These expand on the
numeric groupings to share the
subject area name.
-
7/27/2019 Loden_302 BO UC-bods.pdf
11/44
[
Avoid SQL Transforms
Dont lose your metadata by using an SQL Transform when a
query transform will suffice The Metadata Reports are powerful tools; do not short-circuit
them by removing metadata
Best Practices in ETL: Maintain Metadata
Real Experience. Real Advantage. 11
Avoid using database views for data stores
You lose all aspects of the table and are constrained only to thecolumns in the view for metadata reporting
Often these situations present performance problems
Preserve metadata by using built-in transforms when at all possible.
Avoid SQL Transforms!
-
7/27/2019 Loden_302 BO UC-bods.pdf
12/44
[
Divide your data flows into meaningful work sections:
Driver Stage
Limit your data set with your first query transform Push down all possible operations to the database
The database will be much more efficient withmost operations
The Water Bucket Brigade
Real Experience. Real Advantage. 12
Data Services is designed with the intent ofstratifying the work load between the databaseserver and the job engine
Parsing Stage Parsing will mostly occur in the job server
Make this happen after driving down the amount of data Lookup Stage Lookups are expensive and powerful
Perform lookups after data is limited and parsed
The key is to always limit the result set before evoking expensive
operations in the job engine
-
7/27/2019 Loden_302 BO UC-bods.pdf
13/44
[
Divide your data flows into meaningful work sections: (cont.)
Business Rules Stage Validate your data using transforms such
as validation, case logic, and error trapping Always strive to provide metrics around your data
Validation transforms provide great ways to gather statistics
Validation provides statistics of errors that are automatically collected
The Water Bucket Brigade (cont.)
Real Experience. Real Advantage. 13
n e e a a a epor s
To access the Metadata Reports:
Data Services Management Console MetadataReports
To ensure that Validation efforts are not lost in the Metadata
Reports, always give your validations meaningful names while
setting up the transforms.
-
7/27/2019 Loden_302 BO UC-bods.pdf
14/44
[ Driver Stage Example
Real Experience. Real Advantage. 14
Use the first query transform as a driver stage to limit the data set.
Push this query down to the database whenever possible.
-
7/27/2019 Loden_302 BO UC-bods.pdf
15/44
[ Parsing Stage Example
Real Experience. Real Advantage. 15
After limiting the data set, perform the parsing that will most likely occur
in the job server engine.
-
7/27/2019 Loden_302 BO UC-bods.pdf
16/44
[ Lookup Stage Example
Real Experience. Real Advantage. 16
Lookups, by nature, almost always happen in the engine and are very
expensive iterative operations.
Lookups need to be downstream from the driver logic.
-
7/27/2019 Loden_302 BO UC-bods.pdf
17/44
[ Business Rule Enforcement Stage Example
Real Experience. Real Advantage. 17
This example shows case logic as well as validation.
Business rules are enforced last. These may take multiple forms.
Though both could have been hidden in query transforms, strive to show
logic graphically using the built-in transforms.
-
7/27/2019 Loden_302 BO UC-bods.pdf
18/44
[
Sprinkle comments throughout your job to enrichunderstanding as well as provide documentation
Use comments: At the job level
Describe variables and the purpose of the job
Use Comments Generously
Real Experience. Real Advantage. 18
At the workflow level
Describe grouping, parallelism, and logical flow of the job
At the data flow level
Describe the transformation logic and target table options
-
7/27/2019 Loden_302 BO UC-bods.pdf
19/44
[
Always use object-level descriptions on single and reusableobjects
Single-use objects are the most important and should alwayspossess an object description
Be careful with object descriptions on reusable objects
Make the descri tion eneric to the basic function of the ob ect
Use Annotations and Object Descriptions
Real Experience. Real Advantage. 19
Generally as a rule, make object descriptions match anannotation used as a header
Use annotations when you want to call attention to something
Annotations are also great to use for target table options
Use an annotation for the specifics on a reusable object
-
7/27/2019 Loden_302 BO UC-bods.pdf
20/44
[Job-Level Comments
Job-level comments should show variable values,
schedule, execution time, and dependencies.
Real Experience. Real Advantage. 20
-
7/27/2019 Loden_302 BO UC-bods.pdf
21/44
[Work Flow-Level Comments
Real Experience. Real Advantage. 21
Use an annotation instead of an object description when
either the object can change or the comment only
applies to this particular situation.
-
7/27/2019 Loden_302 BO UC-bods.pdf
22/44
[ Data Flow-Level Comments
Real Experience. Real Advantage. 22
Always explicitly state unique source or target table options inannotations. This adds meaningful documentation but also avoids
unnecessary clicking through the job.
-
7/27/2019 Loden_302 BO UC-bods.pdf
23/44
[
Always approach job development with an object-orientedfocus
Think of logic in self-contained pieces If you have to perform an operation multiple times, then use the
same code and drive the behavior with a variable
Turn sin le-use ob ects into multi-use ob ects
Object-Oriented Job Development
Real Experience. Real Advantage. 23
Envelop single-use objects like conditionals in workflows
Dont wait on complex logic that holds up development
Use placeholders in your code while waiting for the logic
Dont hold up an entire flow for one or two columns
-
7/27/2019 Loden_302 BO UC-bods.pdf
24/44
[ Demo: Using Placeholder Logic
Real Experience. Real Advantage. 24
-
7/27/2019 Loden_302 BO UC-bods.pdf
25/44
[ ETL Best Practices Summary
Think Reusability Object-Oriented development
Think Organization Organize code into groups
Maintain Metadata Avoid SQL transforms
Real Experience. Real Advantage. 25
-
7/27/2019 Loden_302 BO UC-bods.pdf
26/44
[What Well Cover
Discovering and Demonstrating Best Practices in Data Flowand ETL Design
Examining Change Data Capture (CDC)
Keeping the Project Rolling: A Multi-User Approach
Wra -U
Real Experience. Real Advantage. 26
-
7/27/2019 Loden_302 BO UC-bods.pdf
27/44
[
Change Data Capture (CDC)
Got good delta process? Recovery, performance, and overlap.
Using source- and target-based CDC Wrapping complex database functions seamlessly
In This Section
Real Experience. Real Advantage. 27
-
7/27/2019 Loden_302 BO UC-bods.pdf
28/44
[
Why bother with change data capture?
Delta loads only load-changed records where initial loads load all
covered data Performance is always better when loading only changed records
Recover a job when there is not time to run the full load
Change Data Capture (CDC)
Real Experience. Real Advantage. 28
- -
Strive for date- or process-ID-based CDC
Use target when no other option presents itself
Use table compares or Auto Correct on lesser data volumes
Wrap database functionality such as hash or checksums with largevolumes
-
7/27/2019 Loden_302 BO UC-bods.pdf
29/44
[
Recovery
What is a job execution function, and why do I want one?
The job execution function determines your delta strategy and isgenerally based upon source-based date stamp or process-IDcomparisons
Performance:
Got Good Delta Process
Real Experience. Real Advantage. 29
Use the job execution function to drive source-based CDC This limits the data set up front to maximize performance
Overlap
What is your overlap strategy?
Can you avoid overlap, must you reconcile, or can you pre-sample?
-
7/27/2019 Loden_302 BO UC-bods.pdf
30/44
[Job Execution Function
The Job Execution Function and Table extend the recoverycapabilities of Data Services:
Note the Job_Execution table below. The job execution functionis used to create the records in this table.
Real Experience. Real Advantage. 30
CDC is handled with the extractlowand extracthigh dates
Data Services Recovery is extended with the status value:
Done creates a new record
Started reuses the dates and process ID to recover from a failure
Min and max dates are derived and attached to a process ID
-
7/27/2019 Loden_302 BO UC-bods.pdf
31/44
[ Demo: Job Execution Function and CDC
Here well step through source- and target-based CDC andhow the job execution table and job execution function play
into this scenario
Real Experience. Real Advantage. 31
-
7/27/2019 Loden_302 BO UC-bods.pdf
32/44
[What Type of Overlap Strategy Suits You?
Overlap strategies: avoidance, reconciliation, and pre-sampling
Which strategy suits your needs?
Overlap avoidance: If the job is small enough, you may avoidperforming overlap at all
If your load window is one day and the initial runs in a day
Delta loads would begin after this point and there is no need to overlap
Real Experience. Real Advantage. 32
Requires a process to apply changes that occurred during your initialload
Load the initial with a termination value, such as amax date
Run the reconciliation job to cover the range fromthe max date to the current date
Delta load starts from the current date
-
7/27/2019 Loden_302 BO UC-bods.pdf
33/44
[ Overlap Strategy (cont.)
Overlap strategies: avoidance, reconciliation, and pre-sampling
Which strategy suits your needs? (cont.)
Pre-sampling: Preferable if possible First sample the max timestamp and record this value, then load data
until this value. Your first delta load then becomes the min value.
Real Experience. Real Advantage. 33
,
and the job
-
7/27/2019 Loden_302 BO UC-bods.pdf
34/44
[ Source- or Target-Based CDC?
Source-based is optimal for performance and clean execution
Strive for date- or process-ID-based CDC
Date is best for jobs from operational source systems
Process ID is useful for date warehousing applications that need tobe removed from artificial date constructs
Real Experience. Real Advantage. 34
, -
Use table compares or Auto Correct on smaller data sets
If data sets are larger, then look to integrate databasefunctionality such as a hash or check-sum value
These are used only when no other option is available
-
7/27/2019 Loden_302 BO UC-bods.pdf
35/44
[
Bring complex functionality into a Data Services job using DataServices functions as wrappers
Take advantage of the strengths of your team by dividingdevelopment
Have one developer create the database function, then import thatfunction into Data Services for the rest of the team
Wrapping Complex Data Functions
Real Experience. Real Advantage. 35
Now the function is fully reusable by any member of the teamwithout special knowledge of the native database, coding, or
store procedure/function creation
-
7/27/2019 Loden_302 BO UC-bods.pdf
36/44
[ Demo: Wrap Database Functions
To demonstrate multi-use and ease of sharing, Ill wrapdatabase check-sum functions for complex CDC into a job
Real Experience. Real Advantage. 36
-
7/27/2019 Loden_302 BO UC-bods.pdf
37/44
[What Well Cover
Discovering and Demonstrating Best Practices in Data Flowand ETL Design
Examining Change Data Capture (CDC) Keeping the Project Rolling: A Multi-User Approach
Wrap-Up
Real Experience. Real Advantage. 37
-
7/27/2019 Loden_302 BO UC-bods.pdf
38/44
[ Multi-User Approach
Try a multi-user approach to keep the project rolling whilewaiting for complex code
Dont make your developers wait on complex logic Split tasks
Example If complex database logic is needed, have the functionsdevelo ed b one team member then wra ed into functions so that all
Real Experience. Real Advantage. 38
can use in Data Services
Use empty tables as placeholders if code is not finished to populatethem
If you have the physical schema for the table, then that is enough toimport into Data Services to construct your code
[
-
7/27/2019 Loden_302 BO UC-bods.pdf
39/44
[ Multi-User Approach (cont.)
Try a multi-user approach to keep the project rolling whilewaiting for complex code (cont.)
Divide up the project by the group-based subject areas todistribute the workload among the team
Group 4 (Customer) development should be going on in tandemwith Grou 5 Sales
Real Experience. Real Advantage. 39
Both groups will need to be run in order; however for development,there is nothing stopping tandem development
Have the ETL Architect integrate the project and maintain
the holistic design for tandem development
[
-
7/27/2019 Loden_302 BO UC-bods.pdf
40/44
[What Well Cover
Discovering and Demonstrating Best Practices in Data Flowand ETL Design
Examining Change Data Capture (CDC)
Keeping the Project Rolling: A Multi-User Approach
Wrap-Up
Real Experience. Real Advantage. 40
[
-
7/27/2019 Loden_302 BO UC-bods.pdf
41/44
[ Key Learnings
Document your flows with comments to maintain transparencyand improve documentation
Remember Auto Documentation will capture these comments toprovide useful additions to your documentation reports
Simplify complex flows for other developers on the team
Real Experience. Real Advantage. 41
n uc e r ga e es gn ows n o sma c un s o
logic Data Services performs best with well-divided flows
Divide your flows for more chances of reusability
Try to maintain metadata by using built-in transformswhenever possible
Avoid SQL Transforms
[
-
7/27/2019 Loden_302 BO UC-bods.pdf
42/44
[ Key Learnings (cont.)
Design jobs from an object-oriented point of view
Use source-based CDC where possible
Performance is best with source-based CDC Wrap database functionality into functions when there is no
other CDC option or function equivalent
Real Experience. Real Advantage. 42
Functions provide excellent wrappers to reuse complex databaselogic
Reuse functions without team members having to re-writecomplex stored procedures and database functions
While waiting in a multi-user environment, use placeholders inyour data flows
Use the object-oriented nature of Data Services to keep theproject moving
[
-
7/27/2019 Loden_302 BO UC-bods.pdf
43/44
[ Resources
SAP Community Network DI Performance Blog:
https://wiki.sdn.sap.com/wiki/display/BOBJ/Performance
SAP Community Network DI Tips and Tricks: https://wiki.sdn.sap.com/wiki/display/BOBJ/BusinessObjects%20D
ata%20Services%20Tips%20and%20Tricks
Real Experience. Real Advantage. 43
BusinessObjects Board forum (not affiliated with SAP)
www.forumtopics.com/busobj/index.php
Useful Data Integrator forum with many categories such asadministration, design, performance, and real time
[
-
7/27/2019 Loden_302 BO UC-bods.pdf
44/44
[
]
Thank you for participating.
SESSION CODE: 302
Please remember to complete and return your
evaluation form following this session.
For ongoing education on this area of focus, visit the Year-Round
Community page at www.asug.com/yrc
Real Experience. Real Advantage. 44
Services to combat ETL problems
arising from bad source data
How to contact me:Don Lodendon.loden@decisionfirst.com
top related