The key concern in data warehousing is quality data. If you have seen me present, you probably won't forget my emphasis on having data quality mechanisms built into ETL processes. Let me tell you about a scenario I was dealing with at a client recently.
The client wanted and ETL process built that would do the following:
1. Pull all records from table A with a insert date of yesterday into a flat file
2. Delete all records from table A
3. Run this process daily
Okay, so this is simple enough. I set up an ETL job that read the rows from a table and put them into a flat file. I then prepared a second job that would do the delete with the same date criteria. The purpose of doing this in two jobs was to ensure the backup file was created and populated before any delete process executes.
I tested this process and the counts looked great. I implemented in production and checked the counts of the two processes in the morning and they were different. I thought something must be horribly wrong because I had tested this process. After doing a little research with the client it was determined that Table A was being updated on a 24-hour window process, so by the time my process ran in the morning, all the records had not been loaded from the previous night, resulting in records being deleted that were not in the backup file.
The solution to the problem was to set the date back to 2 days prior and voila, all the record counts between archive and delete matched up. The issue with some records not being in the archive was not a problem due to the information being replicated on the source system.
The purpose of this posting is to describe again that data quality and doing simple checks of record counts between processes (especially delete processes) is a pressing and fundamental part of the data quality process. As data systems architects and developers, it is our responsibility to ensure that we build and check processes to ensure requirements are met. It might have been an issue if I have never checked the process after moving it into production. It may have never been discovered. It doesn't matter though because it's part of the data quality stewardship that is so important in today's business intelligence systems.