Monday, June 15, 2015

ETL Data Flow Architectures

ETLing is the way of getting data from one or more sources into your reporting environment or to a repository that holds enterprise-wide data for all reporting and analytical requirements. It is a common and well-known term in data warehousing because a data warehouse solution is impossible to exist without an ETL implementation.

ETL stands for extract, transform and loading, describing the data flow. Generally we extract data from sources, do validations on extracted data, and load the destination, most of the time, destination is a data warehouse. If the ETL solution is very small and less complex, data flow is always from sources to destination without any middle components. However if it is not, the architecture related to data flow is different and it includes additional components. The architecture of the data flow can be determined by considering following;
  • The number of data sources to be accessed.
  • The amount of data to be accessed and transferred in a single execution.
  • The type of loading, whether it is complete refresh or incremental loading.
  • Complexity of validations to be applied (transformations).
  • Data generation frequency at sources.
  • The extraction windows.
  • The accessibility of data sources.
Considering above factors, there are three possible ETL data flow architectures;

Single-stage ETL architecture
The single-stage ETL architecture is suitable for small and less complex data flows. With this architecture, transformation is done at the extraction and in-flight. 


Combining First Name, Middle Name and Last Name into Full Name at the extraction and finding duplicates of Customer Entity are some examples for transformations performed with this architecture.


Two-stage ETL architecture
This data flow architecture maintains an additional environment called Staging. It can be a database (in most cases), set of text files or even Excel files. The staging is used as a temporary container for extracted data and it is implemented based on few factors such as different data acquisition windows, identifying modifications done with previously loaded data and auditing purposes. For more info o staging refer my post: Do we need a staging database for warehousing projects?.


Transformation on extracted data can be done in several stages. It can be performed at the extraction from sources and staging, and in-flight between sources and staging, and between staging and destination.

Three-stage ETL architecture
This extends two-stage architecture by introducing another layer called Landing Zone


This helps to reduce the workload (including transformation at extraction) with the sources. We can extract data as available in the source without adding any complexities for extraction queries, minimizing time it takes for completing the extraction. Then possible transformation can be applied to Landing Zone or later stages. Sometime, sources pushes data (via reports) without letting us to connect with sources. In a scenario like that, Landing Zone is the place where all source files are placed.


1 comment:

Unknown said...

what is the average time needed for an expert to rebuild 10 mappings to migrate them from OWB to ODI for example,