Open topic with navigation
A dimension table in data warehousing is a logical grouping of data such as customer or patient information. This type of data changes slowly over time rather than on a regular schedule, and thus is referred to as a slowly changing dimension, or SCD.
One way to handle changes to a dimension table is via Type 2 SCD, which allows you to track the history of updates to the table. Whenever the value of any column you are tracking changes for a given record, a new record is created and marked as the current record for that entity, and the previously current record for that entity is updated in a way that denotes it’s no longer the current record. This is typically done using record begin and end dates, which allows for chronological tracking of changes.
With this method, in addition to a natural key, based on some attribute of the record that uniquely identifies the entity, you also need a surrogate key to uniquely identify each record, since the natural keys will be repeated.
Type 2 SCD can be implemented in DMExpress with a copy task to get the current records from a database table, and a join task to join the current records with the update records and output them back to the database table. An additional copy (or aggregate) task gets the current surrogate key value so that it can be incremented for every new record.
The attached example demonstrates Type 2 SCD processing for a patient information dimension table.
Consider the following patient_dimension table, consisting of patient data along with record begin and end date fields:
||is a natural key which uniquely identifies a patient. The PATIENT_ID is the same for all of the records for a specific patient.|
||is a surrogate key which uniquely identifies any record in the table. Its value is incremented for every new record in the table. More than one record can have the same PATIENT_ID, but not the same PATIENT_KEY.|
||denotes the date when a specific version of a record for a given natural key was inserted|
||denotes the end date for that version of the record. The end date signifies when any of the columns in the record that you are tracking are updated so that this is no longer the latest version of the record. A null REC_END_DATE denotes the latest or current version of the row for a given natural key.|
We want to track the history of changes made to the ADDRESS, CITY, STATE, and ZIP fields. Whenever any of these fields is updated in a record, we add a new record with the new values and make it the active or current record, and we update the REC_END_DATE of the previously current record for that patient so that it is no longer the active record.
In the sample data set above, one record for each patient has the REC_END_DATE set to NULL; these are the active records for all the patients. For patient Scott (PATIENT_ID PHC002), there are two records:
The patient_dimension table is updated daily via a text file referred to as an update file. In the following update file, received on 08/03/2012, the records with PATIENT_ID PHC005 and PHC006 are new patient records, and the record with PATIENT_ID PHC002 is an existing patient record with a new address:
DMExpress processes the records in the update file by inserting new records for new patients, inserting new records for existing patients whose address has changed, and updating existing records for those patients in the patient_dimension table.
Included with the solution is a SQL script that drops and creates the patient_dimension table and loads the seed data required by the sample DMExpress job, so it must be run before running the job. The script is tested on Oracle databases; you may need to modify the script if you are planning to run it on any other RDBMS.
There are different ways to manage surrogate keys depending on the database being used. In this example, using an Oracle database, the DMExpress copy task, GetMaxPatientKey, gets the current maximum value of the surrogate key, PATIENT_KEY, from the database, and stores it in a local file for the subsequent join task.
The source is a user-defined SQL statement against the patient_dimension table that gets the maximum value of patient_key:
select max(patient_key) from patient_dimension
The task writes this maximum value to a text file, MaxPatientKey.txt, with the following DMExpress syntax:
/DERIVEDFIELD MaxPatientKey <maxvalue>
Using this syntax enables the automatic definition of the named value, MaxPatientKey, in the subsequent join task when it specifies MaxPatientKey.txt as linked external metadata.
The database connection information in this task needs to be modified to point to the same database with the same credentials as was used when running the SQL script in the previous section to seed the table with data.
The DMExpress copy task, GetCurrentPatientDimensionData, extracts the current patient_dimension records (only those whose REC_END_DATE field is NULL) from the database and passes them to the subsequent join task. This copy task is needed because the join task cannot use the same table as both a source and a target, but using direct data flow between the two tasks improves performance by passing the data in memory.
The DMExpress join task, JoinPatientDimensionWithStageData, joins the incoming patient records with the existing patient records in the patient_dimension table and processes the data.
If an update record meets one of the following conditions:
Then a new record is inserted into the patient_dimension table as follows:
In the case of a changed patient record, the prior current record for that patient in the patient_dimension table is updated to set its REC_END_DATE to yesterday’s date, marking it as no longer the current record for that patient.
After all the records in the update file are processed on 08/03/2012, the patient_dimension table appears as follows:
62_SCD_Type2.zip, compatible with DMExpress version 7.13.15 or higher
In some databases, such as SQL Server, you can define a PATIENT_KEY column to be of type SERIAL or IDENTITY so that the column’s value is automatically generated and inserted into the record by the RDBMS. In this case, you would not need the GetMaxPatientKey task, and when inserting a new record in the join task, you would skip the surrogate key in the target record mapping.
The example used in this article is based on an Oracle database, which has no SERIAL/IDENTITY column, hence we have used a copy task to get the max value of the surrogate key. In the subsequent join task, we generate a new value for the surrogate key by adding the max value to the target record number, essentially mimicking the Oracle sequence number generation. This copy task could be replaced with an aggregate task that gets the maximum value of patient_key from the table.
Some Oracle customers use an insert trigger on the dimension table to insert the current value of a sequence into the table for each new record. Again, in this situation, you can skip the surrogate key in the target record mapping of the join task.
If the dimension table is very large, it may not be feasible to read all the data from the table and join it with the update file. In that case, you could extract only the data for the rows in the update file, and then join those rows with the update file.
Copyright © 2016 Syncsort All rights reserved.