Wednesday, July 23, 2008

The Importance of Data Quality

The key concern in data warehousing is quality data. If you have seen me present, you probably won't forget my emphasis on having data quality mechanisms built into ETL processes. Let me tell you about a scenario I was dealing with at a client recently.

The client wanted an ETL process built that would do the following:
1. Pull all records from table A with an insert date of yesterday into a flat file
2. Delete all records from table A
3. Run this process daily

Okay, so this is simple enough. I set up an ETL job that read the qualifying rows from table A and wrote them to a flat file. I then prepared a second job that performed the delete using the same date criteria. The purpose of splitting this into two jobs was to ensure the backup file was created and populated before any delete process executed.
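The two-job design can be sketched roughly as follows. This is a minimal illustration, not the client's actual code: the table name, columns, and use of sqlite/CSV are all assumptions made for the example.

```python
import csv
import sqlite3
from datetime import date, timedelta

# Hypothetical stand-in for table A; schema and data are invented.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table_a (id INTEGER, payload TEXT, insert_date TEXT)")
yesterday = (date.today() - timedelta(days=1)).isoformat()
conn.executemany(
    "INSERT INTO table_a VALUES (?, ?, ?)",
    [(1, "row one", yesterday), (2, "row two", yesterday)],
)

# Job 1: archive yesterday's rows to a flat file.
rows = conn.execute(
    "SELECT id, payload, insert_date FROM table_a WHERE insert_date = ?",
    (yesterday,),
).fetchall()
with open("table_a_backup.csv", "w", newline="") as f:
    csv.writer(f).writerows(rows)

# Job 2: delete with the same date criteria, run only after the
# backup file exists and is populated.
cur = conn.execute("DELETE FROM table_a WHERE insert_date = ?", (yesterday,))
conn.commit()

archived_count = len(rows)
deleted_count = cur.rowcount
```

The point of the split is sequencing: if job 1 fails, job 2 never runs, so no row is deleted that was not first written to the backup file.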

I tested this process and the counts looked great. I implemented it in production, checked the counts of the two processes in the morning, and they were different. I thought something must be horribly wrong because I had tested this process. After a little research with the client, it was determined that table A was being loaded on a 24-hour window, so by the time my process ran in the morning, not all of the previous night's records had been loaded yet. Late-arriving rows were then deleted without ever making it into the backup file.

The solution was to set the date criteria back to two days prior, and voila, the record counts between the archive and delete processes matched. The handful of records missing from earlier archives was not a problem, because the information was replicated on the source system.
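The fix amounts to widening the lag in the date criterion so the 24-hour source load has time to complete before the archive and delete run. A sketch, reusing the same invented table name as an assumption:

```python
from datetime import date, timedelta

# Target rows from two days ago instead of yesterday, so the
# source system's 24-hour load window has fully closed before
# the archive/delete pair touches those records.
cutoff = (date.today() - timedelta(days=2)).isoformat()

extract_sql = "SELECT * FROM table_a WHERE insert_date = ?"
delete_sql = "DELETE FROM table_a WHERE insert_date = ?"
# Both statements are executed with the same (cutoff,) parameter,
# so the archive and the delete always see the same set of rows.
```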

The purpose of this posting is to reinforce that simple record-count checks between processes (especially delete processes) are a fundamental part of data quality. As data systems architects and developers, it is our responsibility to build checks into our processes and verify that requirements are met. If I had never checked the process after moving it into production, the discrepancy might never have been discovered. That kind of data quality stewardship is essential in today's business intelligence systems.
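A count check of the kind described above can be a few lines guarding the pipeline. The function name and error message here are my own invention, just to show the shape of the check:

```python
def verify_archive_vs_delete(archived_count: int, deleted_count: int) -> None:
    """Fail loudly if the backup and the delete touched different row counts."""
    if archived_count != deleted_count:
        raise RuntimeError(
            f"Data quality check failed: archived {archived_count} rows "
            f"but deleted {deleted_count} rows"
        )

# Matching counts pass silently; a mismatch raises before any
# downstream step can run against inconsistent data.
verify_archive_vs_delete(100, 100)
```

Raising an exception (rather than just logging) is the important design choice: it stops the daily schedule instead of letting the discrepancy silently accumulate.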

1 comment:

Christian Loris said...

This brings up a good point. I read your post twice until I finally understood what you meant by 'insert date'. At first glance, I assumed the date was being populated at the time of insert into the table, which it sounds like you assumed too. In your scenario, you've got a distributed/disconnected data store, which changes what an 'insert date' is: it's when the record was entered into the system, not when it was persisted to the central data store. I am guilty of this mistake too, making assumptions about the nature of the data. I've so often run into scenarios where I assume something and no one who owns the source system points out the nuances. A good reason to try to understand the data and its flow first, and an even better reason to ensure data quality through the techniques you described. Thanks for the good advice.