Thursday, April 23, 2015

Do we need a staging database for warehousing projects?

Deciding an architecture for a data warehouse project and applying it is a challenge. The designer, or the architect has to decide many things, what to be included, what to be avoided, what, when and where the processes have to be implemented, so many decisions to be taken, and staging is another thing to be decided. This post speaks about it.

Do we always need a staging environment? Is it a must that has to be implemented as a part of data warehouse solution? Or can we simply ignore it?

Staging is not something that has to be considered as use-it-if-you-like or ignore-it-if-you-dont-like. The necessity of staging comes with few factors. Here are some factors that can be used for determining whether you need a staging environment for your solution.

  • Source data availability during non-processing hours
    Usually, we schedule the ETL process and it is generally off-peak hours. However, if acquisition window for data extraction from the source is during peak hours, then no options, data has to be extracted and stored in a temporary environment until the process starts. In a situation like this, it is better to use a staging database for holding data temporarily.

  • Difference partial-data acquisition windows for data extraction.
    There are some instances that you need a dataset for running an ETL for loading some facts but the full dataset is not available for one-time extraction. Assume that the source processes some data and loads the processed data into another table. If this process happens every two hours and the process flushes existing data before loading new data, and if this is what we need for loading facts, we need to continuously access the source table (that contains processed data) before the next process and hold them until we have the complete set. This requires staging environment for holding extracted data until the dataset is ready for running the ETL.
  • Source pushes data, we do not pull.
    Some sources do not support establishing connections to it and it pushes required data out. Most of the time, sources dump data as either csv or xml files but it might write data directly to a given destination. In order to support this, staging environment is needed because we never allow sources to write data directly into the data warehouse.
  • Multiple data acquisition windows for data extraction.
    Multiple data acquisition windows are common with single-source-extraction as well as multi-source-extractions. It is common to extract from multiple sources (or multiple entities in the same source) and process them together as a part of ETL operation. Since data extraction happens with multiple acquisition windows and all need to be processed at a scheduled time, data extracted has to be held until ETL starts. Because of this, staging environment is required for holding data.
  • Simple transformations.
    Generally transformation happens as part of standard ETL process but some transformation including validation can be done at a different level, either at the source level or an intermediate level before starting the standard ETL process. Good example is, combining all names columns such as first name, middle name, last name and forming a new column named Full Name. If this type of transformation cannot be done at source level, best place is then staging. Converting string-typed-date to date-type, removing nulls and correcting numeric values are common with this.
  • Auditing purposes.
    What if we need to keep records on what we extract from the sources for auditing purposes? Although data warehouse contains all extracted, it cannot be considered as the exact set extracted because transformation makes changes on them. Therefore, best option is, keeping them in the staging environment for a certain period.
  • Identifying data changes in sources.
    It is always better to extract only changes from the sources in order to reduce the load and process them only. But what if, source does not offer any way to identify changes and forces us to extract all. Situation like this is complex but we often use staging database to handle this up to some extent. For example, staging can help us to hold at least three months data for facts. When extract, ETL extracts for last three months data from the source and do the comparison between extracted dataset and data in the staging rather than comparing extracted dataset and data in the data warehouse for identifying changes. Once identified, filtered dataset from staging can be used as the source for continuing the ETL operation.
One more thing to be remember. Staging does not necessarily need to be a database. It can be a CSV file, XML file or SSIS Raw file too.

No comments: