Sunday, July 31, 2016

Dimension Loading: Source sends same record with different Source ID

An interesting question was raised during my lecture today and it was related to dimension loading. The question was; how can we identify and update the dimension record that is already exist if the key of the entity has been changed in the source.

Let me explain the question with more details. Assume that we have a dimension named DimCustomer and it is loaded with following two records;


And with the next loading, what if we get some new records (last three records - 150, 151, 152) and Id 151 is actually the old record which had Id of 100. This can happen with various business scenario, and, since the source id is new, we always consider it as a new record without checking.


This has to be checked, we need to make sure that SourceKey of the dimension record is changed with new SourceId (CustomerId) without inserting a new record, duplicating this customer: Jane Knight. How do we do it with our SSIS package?

Generally, if we cannot find changes in the source, we get all records from the source to ETL package and then check for new records using a Lookup configured for the data warehouse. With a scenario like this, what we have to do is, once new records are identified, use FuzzyLookup for checking and seeing whether records that are similar are exist with records found as "New". We can use exact matching (equi-join) without using Fuzzy Lookup but there can be slight changes which cannot be found using exact matches. If we find some matching rows, we need to update them without inserting.

Have a look data flow in this ETL package;


This package extracts Customers from the source using Get Customers from the source. I have used the query shown above for that. Then the Lookup: Lookup - Find new Customers is used for matching with existing records in DimCustomer and finding new customers. The Lokkup No Match Output gives us new records.


Next is the Fuzzy Lookup for finding whether records identified as "New" are already exist with old records. For this, we have to connect again with the data warehouse and look for similarities on Customer Name and Town. If exist, need to get the existing CustomerKey as CurrentCustomerKey. Not that the Similarity Threshold is set to 95%, not to 99%.


If we get a value for CurrentCustomerKey, and similarity is closer to 100%, we can consider the records as an old records though we get it as a new record from the source. The added Conditional Split checks it and allows us to create two flows for "real new" records and "old but new" records.


After that, we can have two destinations, one for inserting new records, and one for updating old records with new CustomerId. For update, as you see, Script Component is used as below;

xxx

When run the package, as you see, 4 records are coming from the source, three records are identified as new records, Fuzzly Look along with Conditional Split gives 2 records as new and 1 record as a new record but related to an old record. Therefore package inserts two new records and update one records. Here is the result now;






No comments: