Use Case Accelerators > How to Implement Type 2 Slowly Changing Dimensions

How to Implement Type 2 Slowly Changing Dimensions

Article #: Product: Version:

Summary

A dimension table in data warehousing is a logical grouping of data such as customer or patient information. This type of data changes slowly over time rather than on a regular schedule, and thus is referred to as a slowly changing dimension, or SCD.

One way to handle changes to a dimension table is via Type 2 SCD, which allows you to track the history of updates to the table. Whenever the value of any column you are tracking changes for a given record, a new record is created and marked as the current record for that entity, and the previously current record for that entity is updated in a way that denotes it’s no longer the current record. This is typically done using record begin and end dates, which allows for chronological tracking of changes.

With this method, in addition to a natural key, based on some attribute of the record that uniquely identifies the entity, you also need a surrogate key to uniquely identify each record, since the natural keys will be repeated.

Type 2 SCD can be implemented in DMExpress with a copy task to get the current records from a database table, and a join task to join the current records with the update records and output them back to the database table. An additional copy (or aggregate) task gets the current surrogate key value so that it can be incremented for every new record.

Resolution

The attached example demonstrates Type 2 SCD processing for a patient information dimension table.

Consider the following patient_dimension table, consisting of patient data along with record begin and end date fields:

where:

PATIENT_ID is a natural key which uniquely identifies a patient. The PATIENT_ID is the same for all of the records for a specific patient.
PATIENT_KEY is a surrogate key which uniquely identifies any record in the table. Its value is incremented for every new record in the table. More than one record can have the same PATIENT_ID, but not the same PATIENT_KEY.
REC_BEGIN_DATE denotes the date when a specific version of a record for a given natural key was inserted
REC_END_DATE denotes the end date for that version of the record. The end date signifies when any of the columns in the record that you are tracking are updated so that this is no longer the latest version of the record. A null REC_END_DATE denotes the latest or current version of the row for a given natural key.

We want to track the history of changes made to the ADDRESS, CITY, STATE, and ZIP fields. Whenever any of these fields is updated in a record, we add a new record with the new values and make it the active or current record, and we update the REC_END_DATE of the previously current record for that patient so that it is no longer the active record.

In the sample data set above, one record for each patient has the REC_END_DATE set to NULL; these are the active records for all the patients. For patient Scott (PATIENT_ID PHC002), there are two records:

The patient_dimension table is updated daily via a text file referred to as an update file. In the following update file, received on 08/03/2012, the records with PATIENT_ID PHC005 and PHC006 are new patient records, and the record with PATIENT_ID PHC002 is an existing patient record with a new address:

DMExpress processes the records in the update file by inserting new records for new patients, inserting new records for existing patients whose address has changed, and updating existing records for those patients in the patient_dimension table.

Attachments

62_SCD_Type2.zip, compatible with DMExpress version 7.13.15 or higher

Additional Information

Last updated: