Extract, transform, load
|
Extract, transform, and load (ETL) is a process in data warehousing that involves
- extracting data from outside sources,
- transforming it to fit business needs, and ultimately
- loading it into the data warehouse.
ETL is important, as it is the way data actually gets loaded into the warehouse. This article assumes that data is always loaded into a data warehouse, whereas the term ETL can in fact refer to a process that loads any database.
Contents |
Extract
The first phase of an ETL process is to extract the data from the source systems. Most data warehousing projects consolidate data from different source systems. Each separate system may also use a different data organization / format. Common data source formats are relational databases, and flat files, but other source formats exist. Extraction converts the data into records and columns (aka fields).
Transform
The transform phase applies a series of rules or functions to the extracted data to derive the data to be loaded. Some data sources will require very little manipulation of data. However, in other cases any combination of the following transformations types may be required:
- Select only certain columns to load (or if you prefer, cull columns not to load)
- Translate coded values (e.g. If the source system stores M for male and F for female but the warehouse stores 1 for male and 2 for female)
- Derive a new calculated value (e.g. sale_amount = qty * unit_price)
- Join together data from multiple sources (e.g. lookup, merge, etc)
- Summarize multiple rows of data (e.g. total sales for each region)
- Generate a Surrogate_key value
- Transpose / cross tabluate (turn multiple columns into mutiple rows or vice versa)
Load
The load phase loads the data into the data warehouse. Depending on the requirements of the organization, this process ranges widely. Some data warehouses merely overwrite old information with new data. More complex systems can maintain a history and audit trail of all changes to the data.
Challenges
ETL processes can be quite complex, and significant problems can occur. Improperly designed ETL systems or an unexpected change in format of one of the source systems can cause serious problems in the ETL process potentially destroying or corrupting significant amounts of data in the target system. An additional difficulty is making sure the data being uploaded is relatively consistent. Since multiple source databases all have different update cycles (some may be updated every few minutes, while others may take days or weeks), an ETL system may be required to hold back certain data until all sources are synchronized.
Tools
While an ETL process can be created using almost any programming language, creating them from scratch is quite complex. Increasingly, companies are buying ETL tools to help in the creation of ETL processes.
A good ETL tool must be able to communicate with the many different relational databases and read the various file formats used throughout an organization. ETL tools have started to migrate into Enterprise Application Integration (EAI), or even Enterprise Service Bus, systems that now cover much more than just the extraction transformation and loading of data. Many ETL vendors now have data profiling, data quality and metadata capabilities.
Some ETL tools
- Octopus (http://octopus.objectweb.org/) an open source ETL released by the ObjectWeb consortium
- IBM WebSphere DataStage (http://www.ascential.com/)
- Ab Initio (http://www.abinitio.com/)
- BusinessObjects Data Integrator (http://www.businessobjects.com/products/dataintegration/)
- DataMirror (http://www.datamirror.com/)
- Informatica PowerCenter (http://www.informatica.com/)
- Microsoft SQL Server Integration Services (http://www.sqlis.com/)
- Microsoft SQL Server Data Transformation Services (http://www.sqldts.com/)
- Oracle Warehouse Builder (http://www.oracle.com/)
- Pervasive Data Integrator (http://www.pervasive.com/dataintegrator/)
- SAS (http://www.sas.com/)
- Sunopsis (http://www.sunopsis.com/)
- DT/Studio (http://www.embarcadero.com/products/dtstudio/dtdatasheet.html)
See also
External links
- the ETL Guy (http://etlguy.com)de:ETL-Prozess