
ETL - Extract, Transform, Load (page under construction - you may still find it very useful)
On This Page

- intro
- performance
- parallel processing
- rerunability, recoverability
- good practices
- etl tools

Intro ------------------------------

ETL (Extract, Transform, Load) - the process of handling data loads for big databases and massive transaction streams.

The typical real-life ETL cycle consists of the following execution steps (see the sketch after the list):

  1. Cycle initiation
  2. Build reference data
  3. Extract (from sources)
  4. Validate
  5. Transform (clean, apply business rules, check for data integrity, create aggregates)
  6. Stage (load into staging tables - if they are used)
  7. Audit reports (Are business rules met? Also in case of failure - helps to diagnose/repair).
  8. Publish (to target tables)
  9. Archive
  10. Clean up
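Below is a minimal sketch of such a cycle as a driver loop, assuming each phase is just a Python callable. The step names mirror the list above; the lambda stubs stand in for real extract/transform/load code and are not from any specific tool.

    # Sketch of an ETL cycle driver: run each phase in order and log progress,
    # so a failed cycle is easy to diagnose and restart. Step bodies are stubs.
    import logging
    logging.basicConfig(level=logging.INFO)

    def run_cycle(run_id, steps):
        for name, step in steps:
            logging.info("run %s: starting %s", run_id, name)
            step()                      # a raised exception stops the cycle here
            logging.info("run %s: finished %s", run_id, name)

    steps = [
        ("build reference data", lambda: None),
        ("extract",              lambda: None),
        ("validate",             lambda: None),
        ("transform",            lambda: None),
        ("stage",                lambda: None),
        ("audit",                lambda: None),
        ("publish",              lambda: None),
        ("archive",              lambda: None),
        ("clean up",             lambda: None),
    ]
    run_cycle("2024_01_15_001", steps)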

Performance --------------------

ETL vendors benchmark their record systems at multiple TB (terabytes) per hour (or ~1 GB per second) using powerful servers with multiple CPUs, multiple hard drives, multiple gigabit network connections, and lots of memory. In an average day-to-day system you will see something closer to ~1 MB/sec.

In real life the slowest part of an ETL process is usually the database load phase. The database is slow because it has to handle concurrency, maintain integrity, and update indexes. For better performance it therefore makes sense to do most of the ETL processing outside of the database, and to use bulk load operations whenever possible. Still, even with bulk operations, database access is usually the bottleneck in the ETL process. Here are some common tricks used to increase performance:

Partition tables (and indexes). Try to keep partitions similar in size (watch for "null" values, which can skew the partitioning).
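As a sketch of the "null" caveat: if the partition key is computed in the ETL layer, NULL keys can be routed to their own bucket so they do not all pile into one partition. The hashing scheme and partition count below are made up for illustration.

    # Compute a partition bucket in the ETL layer; NULL keys get a dedicated
    # bucket, everything else is hashed evenly across the remaining buckets.
    import zlib

    N_PARTITIONS = 16

    def partition_bucket(key):
        if key is None:
            return 0                                   # the "null" bucket
        return 1 + zlib.crc32(str(key).encode()) % (N_PARTITIONS - 1)

    print(partition_bucket(None), partition_bucket("customer_42"))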

Do all validation in the ETL layer. Disable integrity checking (disable constraint ...) in the target database tables during the load.

Disable triggers (disable trigger ...) in the target database tables during the load. Simulate their effect as a separate step.

Generate IDs in the ETL layer (not in the database).
Drop the indexes (on a table or partition) before the load - and recreate them after the load (drop index ...; create index ...).
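A rough sketch tying several of these tricks together - disable triggers and constraints, drop the index, bulk load, then restore everything. It assumes PostgreSQL-flavored SQL and a DB-API connection; all table, constraint, and index names are made up.

    # Load wrapper: validation was already done in the ETL layer, so integrity
    # checking, triggers and indexes are switched off for the duration of the load.
    def bulk_load(conn, copy_sql):
        cur = conn.cursor()
        cur.execute("ALTER TABLE facts DISABLE TRIGGER ALL")
        cur.execute("ALTER TABLE facts DROP CONSTRAINT facts_dim_fk")
        cur.execute("DROP INDEX IF EXISTS facts_load_date_ix")
        try:
            cur.execute(copy_sql)              # e.g. a COPY / bulk-insert statement
            conn.commit()
        finally:
            cur.execute("CREATE INDEX facts_load_date_ix ON facts (load_date)")
            cur.execute("ALTER TABLE facts ADD CONSTRAINT facts_dim_fk "
                        "FOREIGN KEY (dim_id) REFERENCES dims (id)")
            cur.execute("ALTER TABLE facts ENABLE TRIGGER ALL")
            conn.commit()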

Use parallel bulk load when possible - it works well when the table is partitioned or there are no indexes. Note: attempting parallel loads into the same table (partition) usually causes locks - if not on the data rows, then on the indexes.

If you need to do insert/update/delete - determine in the ETL layer which rows should be processed in which way, and then process these 3 operations in the database separately. You can often bulk load the inserts, but updates and deletes commonly go through an API (using SQL).
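A sketch of sorting rows into the three operations in the ETL layer, assuming the target's existing keys have already been extracted; column and key names are illustrative.

    # Classify each incoming row as insert/update/delete by comparing its key
    # against the keys already present in the target table.
    def split_changes(source_rows, target_keys):
        src = {r["id"]: r for r in source_rows}
        inserts = [r for k, r in src.items() if k not in target_keys]
        updates = [r for k, r in src.items() if k in target_keys]
        deletes = [k for k in target_keys if k not in src]
        return inserts, updates, deletes       # bulk load inserts; SQL for the rest

    ins, upd, dele = split_changes([{"id": 1}, {"id": 3}], target_keys={1, 2})
    print(ins, upd, dele)                      # [{'id': 3}] [{'id': 1}] [2]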

Whether to do a certain operation in the database or outside of it may be a tradeoff. For example, removing duplicates using "distinct" may be slow in the database - thus it makes sense to do it outside. On the other hand, if using distinct significantly (say, 100x) decreases the number of rows to be extracted, then it makes sense to de-dupe as early as possible - in the database, before unloading the data.
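A sketch of de-duplication in the ETL layer (streaming, one pass over the rows); when "distinct" would shrink the extract dramatically, the same effect is better pushed into the source query instead. The key function is illustrative.

    # De-duplicate rows by key while streaming, keeping the first occurrence.
    def dedupe(rows, key=lambda r: r["id"]):
        seen = set()
        for r in rows:
            k = key(r)
            if k not in seen:
                seen.add(k)
                yield r

    print(list(dedupe([{"id": 1}, {"id": 1}, {"id": 2}])))   # two rows remain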

A common source of problems in ETL is a big number of interdependencies between ETL jobs - for example, job "B" cannot start until job "A" is finished. You can usually achieve better performance by visualizing all processes on a graph and trying to reduce the graph, making maximum use of parallelism and making "chains" of consecutive processing as short as possible. Again, partitioning big tables and their indexes can really help.
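A sketch of driving jobs from such a graph: every job whose prerequisites are finished is started immediately, so independent chains run in parallel. The graph, job names, and run_job stub are made up, and the graph is assumed to be acyclic.

    # Run jobs from a dependency graph with maximum parallelism.
    from concurrent.futures import ThreadPoolExecutor, wait, FIRST_COMPLETED

    def run_job(name):
        print("running", name)                 # stand-in for launching the real job

    def run_graph(deps):                       # deps: job -> set of prerequisite jobs
        done, running = set(), {}
        with ThreadPoolExecutor(max_workers=4) as pool:
            while len(done) < len(deps):
                for job, reqs in deps.items():
                    if job not in done and job not in running and reqs <= done:
                        running[job] = pool.submit(run_job, job)
                finished, _ = wait(running.values(), return_when=FIRST_COMPLETED)
                for job, fut in list(running.items()):
                    if fut in finished:
                        fut.result()           # re-raise a failed job's exception
                        done.add(job)
                        del running[job]

    run_graph({"A": set(), "B": {"A"}, "C": set(), "D": {"B", "C"}})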

Another common example is a situation when the data is spread between several databases, and processing is done in those databases sequentially. Sometimes database replication may be involved as a method of copying data between databases - and this can significantly slow down the whole process. The common solution is to reduce the processing graph to only 3 layers: sources, a central ETL/staging layer, and targets.

This allows you to take maximum advantage of parallel processing. For example, if you need to load data into 2 databases, you can run the loads in parallel (instead of loading into the 1st and then replicating into the 2nd).
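For the two-database example, a minimal sketch of fanning the same prepared data out to both targets in parallel; the target names and load_into stub are placeholders.

    # Load the same rows into several targets at once instead of load-then-replicate.
    from concurrent.futures import ThreadPoolExecutor

    def load_into(target, rows):
        print("loading", len(rows), "rows into", target)   # stand-in for a bulk load

    rows = [{"id": 1}, {"id": 2}]                          # already transformed data
    with ThreadPoolExecutor() as pool:
        list(pool.map(lambda t: load_into(t, rows), ["db_reporting", "db_archive"]))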

Of course, sometimes sequential processing is required. For example, you usually need to get dimensional (reference) data before you can get and validate the rows for the main "fact" tables.

Parallel processing

There are 3 main types of parallelism in ETL applications:

- Data: split a single sequential file into smaller data files to provide parallel access.
- Pipeline: run several components simultaneously on the same data stream (different transformation steps work on different records at the same time).
- Component: run multiple processes simultaneously on different data streams in the same job.
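A sketch of the first type (data parallelism): split the input into chunks and transform them in parallel worker processes; transform_chunk is an illustrative stub.

    # Split the rows into chunks and transform the chunks in parallel processes.
    from multiprocessing import Pool

    def transform_chunk(chunk):
        return [line.strip().upper() for line in chunk]    # stand-in for real transforms

    def parallel_transform(lines, n_workers=4):
        size = max(1, len(lines) // n_workers)
        chunks = [lines[i:i + size] for i in range(0, len(lines), size)]
        with Pool(n_workers) as pool:
            results = pool.map(transform_chunk, chunks)
        return [row for chunk in results for row in chunk]

    if __name__ == "__main__":
        print(parallel_transform([" ny ", " ca ", " tx ", " fl "]))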

Rerunability, Recoverability

A big ETL process is usually subdivided into smaller pieces running sequentially or in parallel. To keep track of data flows, it makes sense to tag each data row with a "row_id" and tag each piece of the process with a "run_id". In case of a failure, having these IDs will help to roll back and re-run the failed piece. It is also a good idea to have "checkpoints" - states when certain phases of the process are completed. Once at a checkpoint, it is a good idea to write everything to disk, clean out some temporary files, log the state, etc.
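A minimal sketch of the tagging and checkpointing idea; the file name and checkpoint format are made up.

    # Tag every row with run_id/row_id and append a checkpoint record after each
    # completed phase, so a failed piece can be identified, rolled back and re-run.
    import json, time

    def tag_rows(rows, run_id):
        for i, row in enumerate(rows, start=1):
            row["run_id"] = run_id
            row["row_id"] = f"{run_id}-{i}"
            yield row

    def checkpoint(run_id, phase, path="etl_checkpoints.log"):
        with open(path, "a") as f:
            f.write(json.dumps({"run_id": run_id, "phase": phase,
                                "ts": time.time()}) + "\n")

    rows = list(tag_rows([{"amount": 10}, {"amount": 20}], run_id="20240115_001"))
    checkpoint("20240115_001", "transform done")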

Good Practices

Four-Layered Approach for ETL Architecture Design

Use file-based ETL processing where possible.

Use data driven methods and minimize custom ETL coding.
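A sketch of what "data driven" can mean in practice: the source-to-target column mapping lives in configuration (here a dict, but it could be a table or file), so adding a feed means editing data rather than code. The mapping below is invented for illustration.

    # Apply a configurable source-to-target column mapping to one row.
    MAPPING = {                                  # target column -> (source column, transform)
        "customer_name": ("cust_name", str.strip),
        "country":       ("cust_ctry", str.upper),
    }

    def apply_mapping(source_row, mapping=MAPPING):
        return {tgt: fn(source_row[src]) for tgt, (src, fn) in mapping.items()}

    print(apply_mapping({"cust_name": " Acme Inc ", "cust_ctry": "us"}))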

Qualities of a good ETL architecture design:

 

ETL Tools

You can implement ETL with simple shell/Perl scripts or with 3rd-party products: