Tuesday, March 07, 2006

Data Quality

Today's discussion will center around the absolute most important concept in data warehousing (in my opinion); data quality.

Users might notice if you don't have data but you can bet they will notice if your data is incorrect and probably consider you, your team, and your processes to be incompotent.

Here's a three guidelines I have regarding data quality
1. Data quality is not an accidental result of a process; it is a planned result of efficient and correct processes, THUS
2. Having no data is better than incorrect data, THUS
3. If data quality is not important in your processes, you will not have quality data

Sounds like a no-brainer doesn't it? I beg to differ - of ALL the IT projects I have worked on quality is the biggest issue that generally gets the least effort. Think about it - a project is running long. What's the first part to get cut? Testing time. Quality review. UAT.

A previous project I worked on dealt with prescription information. Wouldn't you think data quality is important there? Of course you would - knowing that if data was incorrect a drug interaction could be missed potentially causing tragic consequences including death. So you don't get prescriptions and thus you are not affected by data quality? Okay, what if the erroneous code because of developer oversight was used for your parents prescriptions? I bet you're a lot more interested in the far reaching effects of data quality now.... (to be fair, not all data quality is a life or death matter, but it can have serious consequences to an organization)

At the client I'm working with now we had a severe problem with quality in the past couple of weeks. The datawarehouse is a marvelous tool but in the past things have been rushed into production without considering quality issues until a user squawks that numbers don't look right or some report doesn't balance out with what the source system says.

Let's get it clear right away that source systems won't always balance out with the data warehouse. The source system for my client accepts orders/returns/shipments. The datawarehouse will reject a return that doesn't have a shipment. The source system says its balanced so the datawarehouse must be wrong in the eye of the business user, right? WRONG! Many times a simple explanation is all that's needed but problems crop up when there is no explanation.

Taking a stab at fixing our quality issues I prepared a short PowerPoint presentation to give to the team and manager. The first slide had a single question.

Question #1: Do we want to load bad data?

The answer might seem obvious but it's more murky depending on the priorities of the organization. For this organization the answer was a resounding 'NO'. Good, now we are making progress.

Question #2: How do we want to handle bad data?

This is a little tricker. We can and do ship bad data records back to the sending systems for them to clean up. In addition we perform data cleansing and validation pre-load on key fields. It doesn't seem to be enough. What is enough, and what is too much? That is question #3 to which I am still thinking about.

The brief synoposis is I'm involved in determining what kind of effort can be made to clean up data.

In my next post covering data quality we will talk about some data cleansing methods. For now try and answer question #1 & #2 above as they relate to your organization.

No comments: