
ETL - Extract, Transform, Load (page under construction - you may still find it very useful)
On This Page

- intro
- performance
- parallel processing
- rerunability, recoverability
- good practices
- etl tools

Intro ------------------------------

ETL (Extract, Transform, Load) - the process of handling data loads for big databases and massive transaction streams.

The typical real-life ETL cycle consists of the following execution steps (see the sketch after the list):

  1. Cycle initiation
  2. Build reference data
  3. Extract (from sources)
  4. Validate
  5. Transform (clean, apply business rules, check for data integrity, create aggregates)
  6. Stage (load into staging tables - if they are used)
  7. Audit reports (Are business rules met? Also in case of failure - helps to diagnose/repair).
  8. Publish (to target tables)
  9. Archive
  10. Clean up
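Below is a minimal sketch of such a cycle as a driver loop, assuming each phase is just a Python callable. The step names mirror the list above; the lambda stubs stand in for real extract/transform/load code and are not from any specific tool.

    # Sketch of an ETL cycle driver: run each phase in order and log progress,
    # so a failed cycle is easy to diagnose and restart. Step bodies are stubs.
    import logging
    logging.basicConfig(level=logging.INFO)

    def run_cycle(run_id, steps):
        for name, step in steps:
            logging.info("run %s: starting %s", run_id, name)
            step()                      # a raised exception stops the cycle here
            logging.info("run %s: finished %s", run_id, name)

    steps = [
        ("build reference data", lambda: None),
        ("extract",              lambda: None),
        ("validate",             lambda: None),
        ("transform",            lambda: None),
        ("stage",                lambda: None),
        ("audit",                lambda: None),
        ("publish",              lambda: None),
        ("archive",              lambda: None),
        ("clean up",             lambda: None),
    ]
    run_cycle("2024_01_15_001", steps)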

Performance --------------------

ETL vendors benchmark their record systems at multiple TB (terabytes) per hour (or ~1 GB per second) using powerful servers with multiple CPUs, multiple hard drives, multiple gigabit network connections, and lots of memory. In an average day-to-day system you will see something closer to ~1 MB/sec.

In real life the slowest part of an ETL process is usually the database load phase. The database is slow because it has to handle concurrency, maintain integrity, and update indexes. For better performance it therefore makes sense to do most of the ETL processing outside of the database, and to use bulk load operations whenever possible. Still, even with bulk operations, database access is usually the bottleneck in the ETL process. Here are some common tricks used to increase performance:

Partition tables (and indexes). Try to keep partitions similar in size (watch for "null" values, which can skew the partitioning).
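As a sketch of the "null" caveat: if the partition key is computed in the ETL layer, NULL keys can be routed to their own bucket so they do not all pile into one partition. The hashing scheme and partition count below are made up for illustration.

    # Compute a partition bucket in the ETL layer; NULL keys get a dedicated
    # bucket, everything else is hashed evenly across the remaining buckets.
    import zlib

    N_PARTITIONS = 16

    def partition_bucket(key):
        if key is None:
            return 0                                   # the "null" bucket
        return 1 + zlib.crc32(str(key).encode()) % (N_PARTITIONS - 1)

    print(partition_bucket(None), partition_bucket("customer_42"))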

Do all validation in the ETL layer. Disable integrity checking (disable constraint ...) in the target database tables during the load.

Disable triggers (disable trigger ...) in the target database tables during the load. Simulate their effect as a separate step.

Generate IDs in the ETL layer (not in the database).
Drop the indexes (on a table or partition) before the load - and recreate them after the load (drop index ...; create index ...).
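A rough sketch tying several of these tricks together - disable triggers and constraints, drop the index, bulk load, then restore everything. It assumes PostgreSQL-flavored SQL and a DB-API connection; all table, constraint, and index names are made up.

    # Load wrapper: validation was already done in the ETL layer, so integrity
    # checking, triggers and indexes are switched off for the duration of the load.
    def bulk_load(conn, copy_sql):
        cur = conn.cursor()
        cur.execute("ALTER TABLE facts DISABLE TRIGGER ALL")
        cur.execute("ALTER TABLE facts DROP CONSTRAINT facts_dim_fk")
        cur.execute("DROP INDEX IF EXISTS facts_load_date_ix")
        try:
            cur.execute(copy_sql)              # e.g. a COPY / bulk-insert statement
            conn.commit()
        finally:
            cur.execute("CREATE INDEX facts_load_date_ix ON facts (load_date)")
            cur.execute("ALTER TABLE facts ADD CONSTRAINT facts_dim_fk "
                        "FOREIGN KEY (dim_id) REFERENCES dims (id)")
            cur.execute("ALTER TABLE facts ENABLE TRIGGER ALL")
            conn.commit()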

Use parallel bulk load when possible - it works well when the table is partitioned or there are no indexes. Note: attempting parallel loads into the same table (partition) usually causes locks - if not on the data rows, then on the indexes.

If you need to do insert/update/delete - determine in the ETL layer which rows should be processed in which way, and then process these 3 operations in the database separately. You can often bulk load the inserts, but updates and deletes commonly go through an API (using SQL).
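A sketch of sorting rows into the three operations in the ETL layer, assuming the target's existing keys have already been extracted; column and key names are illustrative.

    # Classify each incoming row as insert/update/delete by comparing its key
    # against the keys already present in the target table.
    def split_changes(source_rows, target_keys):
        src = {r["id"]: r for r in source_rows}
        inserts = [r for k, r in src.items() if k not in target_keys]
        updates = [r for k, r in src.items() if k in target_keys]
        deletes = [k for k in target_keys if k not in src]
        return inserts, updates, deletes       # bulk load inserts; SQL for the rest

    ins, upd, dele = split_changes([{"id": 1}, {"id": 3}], target_keys={1, 2})
    print(ins, upd, dele)                      # [{'id': 3}] [{'id': 1}] [2]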

Whether to do a certain operation in the database or outside of it may be a tradeoff. For example, removing duplicates using "distinct" may be slow in the database - thus it makes sense to do it outside. On the other hand, if using distinct significantly (say, 100x) decreases the number of rows to be extracted, then it makes sense to de-dupe as early as possible - in the database, before unloading the data.
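A sketch of de-duplication in the ETL layer (streaming, one pass over the rows); when "distinct" would shrink the extract dramatically, the same effect is better pushed into the source query instead. The key function is illustrative.

    # De-duplicate rows by key while streaming, keeping the first occurrence.
    def dedupe(rows, key=lambda r: r["id"]):
        seen = set()
        for r in rows:
            k = key(r)
            if k not in seen:
                seen.add(k)
                yield r

    print(list(dedupe([{"id": 1}, {"id": 1}, {"id": 2}])))   # two rows remain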

A common source of problems in ETL is a big number of interdependencies between ETL jobs - for example, job "B" cannot start until job "A" is finished. You can usually achieve better performance by visualizing all processes on a graph and trying to reduce the graph, making maximum use of parallelism and making "chains" of consecutive processing as short as possible. Again, partitioning big tables and their indexes can really help.
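A sketch of driving jobs from such a graph: every job whose prerequisites are finished is started immediately, so independent chains run in parallel. The graph, job names, and run_job stub are made up, and the graph is assumed to be acyclic.

    # Run jobs from a dependency graph with maximum parallelism.
    from concurrent.futures import ThreadPoolExecutor, wait, FIRST_COMPLETED

    def run_job(name):
        print("running", name)                 # stand-in for launching the real job

    def run_graph(deps):                       # deps: job -> set of prerequisite jobs
        done, running = set(), {}
        with ThreadPoolExecutor(max_workers=4) as pool:
            while len(done) < len(deps):
                for job, reqs in deps.items():
                    if job not in done and job not in running and reqs <= done:
                        running[job] = pool.submit(run_job, job)
                finished, _ = wait(running.values(), return_when=FIRST_COMPLETED)
                for job, fut in list(running.items()):
                    if fut in finished:
                        fut.result()           # re-raise a failed job's exception
                        done.add(job)
                        del running[job]

    run_graph({"A": set(), "B": {"A"}, "C": set(), "D": {"B", "C"}})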

Another common example is a situation when the data is spread between several databases, and processing is done in those databases sequentially. Sometimes database replication may be involved as a method of copying data between databases - and this can significantly slow down the whole process. The common solution is to reduce the processing graph to only 3 layers: sources, a central ETL/staging layer, and targets.

This allows you to take maximum advantage of parallel processing. For example, if you need to load data into 2 databases, you can run the loads in parallel (instead of loading into the 1st and then replicating into the 2nd).
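For the two-database example, a minimal sketch of fanning the same prepared data out to both targets in parallel; the target names and load_into stub are placeholders.

    # Load the same rows into several targets at once instead of load-then-replicate.
    from concurrent.futures import ThreadPoolExecutor

    def load_into(target, rows):
        print("loading", len(rows), "rows into", target)   # stand-in for a bulk load

    rows = [{"id": 1}, {"id": 2}]                          # already transformed data
    with ThreadPoolExecutor() as pool:
        list(pool.map(lambda t: load_into(t, rows), ["db_reporting", "db_archive"]))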

Of course, sometimes sequential processing is required. For example, you usually need to get dimensional (reference) data before you can get and validate the rows for the main "fact" tables.

Parallel processing

There are 3 main types of parallelism in ETL applications:

- Data: split a single sequential file into smaller data files to provide parallel access.
- Pipeline: run several components simultaneously on the same data stream (different transformation steps work on different records at the same time).
- Component: run multiple processes simultaneously on different data streams in the same job.
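A sketch of the first type (data parallelism): split the input into chunks and transform them in parallel worker processes; transform_chunk is an illustrative stub.

    # Split the rows into chunks and transform the chunks in parallel processes.
    from multiprocessing import Pool

    def transform_chunk(chunk):
        return [line.strip().upper() for line in chunk]    # stand-in for real transforms

    def parallel_transform(lines, n_workers=4):
        size = max(1, len(lines) // n_workers)
        chunks = [lines[i:i + size] for i in range(0, len(lines), size)]
        with Pool(n_workers) as pool:
            results = pool.map(transform_chunk, chunks)
        return [row for chunk in results for row in chunk]

    if __name__ == "__main__":
        print(parallel_transform([" ny ", " ca ", " tx ", " fl "]))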

Rerunability, Recoverability

A big ETL process is usually subdivided into smaller pieces running sequentially or in parallel. To keep track of data flows, it makes sense to tag each data row with a "row_id" and tag each piece of the process with a "run_id". In case of a failure, having these IDs will help to roll back and re-run the failed piece. It is also a good idea to have "checkpoints" - states when certain phases of the process are completed. Once at a checkpoint, it is a good idea to write everything to disk, clean out some temporary files, log the state, etc.
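A minimal sketch of the tagging and checkpointing idea; the file name and checkpoint format are made up.

    # Tag every row with run_id/row_id and append a checkpoint record after each
    # completed phase, so a failed piece can be identified, rolled back and re-run.
    import json, time

    def tag_rows(rows, run_id):
        for i, row in enumerate(rows, start=1):
            row["run_id"] = run_id
            row["row_id"] = f"{run_id}-{i}"
            yield row

    def checkpoint(run_id, phase, path="etl_checkpoints.log"):
        with open(path, "a") as f:
            f.write(json.dumps({"run_id": run_id, "phase": phase,
                                "ts": time.time()}) + "\n")

    rows = list(tag_rows([{"amount": 10}, {"amount": 20}], run_id="20240115_001"))
    checkpoint("20240115_001", "transform done")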

Good Practices

Four-Layered Approach for ETL Architecture Design

Use file-based ETL processing where possible.

Use data driven methods and minimize custom ETL coding.
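A sketch of what "data driven" can mean in practice: the source-to-target column mapping lives in configuration (here a dict, but it could be a table or file), so adding a feed means editing data rather than code. The mapping below is invented for illustration.

    # Apply a configurable source-to-target column mapping to one row.
    MAPPING = {                                  # target column -> (source column, transform)
        "customer_name": ("cust_name", str.strip),
        "country":       ("cust_ctry", str.upper),
    }

    def apply_mapping(source_row, mapping=MAPPING):
        return {tgt: fn(source_row[src]) for tgt, (src, fn) in mapping.items()}

    print(apply_mapping({"cust_name": " Acme Inc ", "cust_ctry": "us"}))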

Qualities of a good ETL architecture design:

 

ETL Tools

You can implement ETL with simple shell/Perl scripts or with 3rd-party products: