Monday, March 13, 2006

Project Estimation

I've just returned from a very relaxing three-day fishing nirvana that gave me plenty of time to think about today's topic. For the record, I did catch two very nice largemouth bass and a bunch of smaller ones I hope to catch again next year. WES: 2, BASS: 0.

Project estimation has to be one of the most difficult areas of project management. I've worked with people who have a standard "6-months" answer. I ask "How long will it take to create 100 jobs?" The answer: "6-months". "Two jobs?" "6-months". Really making progress there...I could probably make this into a Dilbert cartoon.

Project estimation applies to all areas of development, not just data warehousing. As any of us who have ever spec'd out work units knows, estimating the time it takes to accomplish a specific task is easier said than done. It depends on many variables, including worker motivation, worker ability, timeline, and difficulty.

Let's start out by determining a unit of work. A unit of work, or work package, is the smallest piece of work you can estimate and assign on its own. For a data warehouse this might be extracting rows from an Oracle table; a full load project would include the other jobs as well (transformation, cleansing, load).
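To make that concrete, here's a minimal sketch of how such a breakdown might look for a single table load, with each package carrying the optimistic / most likely / pessimistic estimates the PERT formula below expects. The package names and numbers are made up for illustration:

```python
# Hypothetical work packages for loading one table into the warehouse.
# Each carries an (optimistic, most_likely, pessimistic) estimate in days.
work_packages = [
    ("Extract rows from the Oracle source table", (1, 2, 4)),
    ("Transform and conform to warehouse keys",   (2, 3, 6)),
    ("Cleanse and validate key fields",           (1, 2, 5)),
    ("Load the target fact table",                (1, 1, 3)),
]

for name, (optimistic, most_likely, pessimistic) in work_packages:
    print(f"{name}: O={optimistic} ML={most_likely} P={pessimistic}")
```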

Worker motivation also influences timelines. I think there is a good compromise between being over-motivated and under-motivated; I'd rather have someone between the two extremes on my team, avoiding both burn-out and boredom. Happy, challenged employees will do much better work than someone bored with the job. From my very first job in high school I have continually heard a mantra I'll call labor shuffling: the theory that any one person can do any job in a given organization. While I believe this is a great idea in theory, I am not convinced it works in practice unless executed with almost impossible precision. Never mind that, putting on my economist hat, it really bothers me to be operating in a region where we are inefficient.

Ability is difficult to ascertain. Enough said.

Tonight I had a discussion with an associate regarding some opportunities he is having on a project. He's trying to put together an accurate timeline without having defined milestones. He has dates and estimated durations for the project, but they are mostly wishful thinking. I recommended he work out an objective list broken down by work packages, then use the PMI formula for estimating project timelines. I'm hoping our discussion helped him out.

Here's the formula PMI uses to estimate projects. (I'm a Project Management Professional (PMP) and a member of PMI, so I feel it's appropriate to share this here...let me know if you feel otherwise. Taken from the PMBOK Guide, 2000 edition.)

PERT Analysis
Program Evaluation and Review Technique - weights three estimates for each activity to determine its expected duration

PERT Weighted Average = (Optimistic + 4 × Most Likely + Pessimistic) / 6

Let's put this into play thinking of a DW load process. I'll give you the numbers and you figure out the weighted average.
O=5 ML=10 P=15

Time for you to do the math.......don't cheat.

Plugging it into the formula, WA = (5 + 4(10) + 15) / 6 = 60/6 = 10
It's probably going to take 10 days to complete, so the most likely estimate happens to be correct in this case - when the optimistic and pessimistic estimates sit symmetrically around the most likely, the weighted average lands right on it. Hmmmm, maybe I should pick harder numbers. I think you get the idea.
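If you'd rather let the computer check your math, here's a minimal sketch of the same calculation in plain Python:

```python
def pert_weighted_average(optimistic, most_likely, pessimistic):
    """PERT three-point estimate: (O + 4*ML + P) / 6."""
    return (optimistic + 4 * most_likely + pessimistic) / 6

# The DW load example from above: O=5, ML=10, P=15 (days).
print(pert_weighted_average(5, 10, 15))  # 10.0
```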

If there is further interest I'll dive a little deeper into estimation techniques.

This post is a little short on content, but my goal is to get you thinking about estimating timelines. Write down the answer to the following question on the same piece of paper you used to figure out the PERT formula (you did write it down, didn't you?): How do you estimate your timelines, and are they accurate?

Tuesday, March 07, 2006

Data Quality

Today's discussion will center on the absolute most important concept in data warehousing (in my opinion): data quality.

Users might notice if you don't have data, but you can bet they will notice if your data is incorrect, and they'll probably consider you, your team, and your processes to be incompetent.

Here are three guidelines I have regarding data quality:
1. Data quality is not an accidental result of a process; it is a planned result of efficient and correct processes, THUS
2. Having no data is better than incorrect data, THUS
3. If data quality is not important in your processes, you will not have quality data

Sounds like a no-brainer, doesn't it? I beg to differ - of ALL the IT projects I have worked on, quality is the biggest issue that generally gets the least effort. Think about it - a project is running long. What's the first thing to get cut? Testing time. Quality review. UAT.

A previous project I worked on dealt with prescription information. Wouldn't you think data quality is important there? Of course you would - knowing that if the data were incorrect, a drug interaction could be missed, potentially causing tragic consequences including death. So you don't take prescriptions, and thus you aren't affected by data quality? Okay, what if code made erroneous by developer oversight was used for your parents' prescriptions? I bet you're a lot more interested in the far-reaching effects of data quality now.... (To be fair, not all data quality is a life-or-death matter, but it can have serious consequences for an organization.)

At the client I'm working with now, we had a severe problem with quality in the past couple of weeks. The data warehouse is a marvelous tool, but in the past things have been rushed into production without considering quality issues until a user squawks that numbers don't look right or some report doesn't balance out with what the source system says.

Let's get it clear right away that source systems won't always balance out with the data warehouse. The source system for my client accepts orders/returns/shipments. The data warehouse will reject a return that doesn't have a shipment. The source system says it's balanced, so the data warehouse must be wrong in the eyes of the business user, right? WRONG! Many times a simple explanation is all that's needed, but problems crop up when there is no explanation.
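To show the kind of explanation I mean, here's a minimal sketch - with made-up record layouts - of how the return/shipment rule accounts for the difference the business user sees:

```python
# Hypothetical returns from the source feed and shipments already in the warehouse.
source_returns = [
    {"return_id": "R1", "shipment_id": "S100", "amount": 40.0},
    {"return_id": "R2", "shipment_id": None,   "amount": 25.0},  # no shipment
]
warehouse_shipments = {"S100"}

# The warehouse rejects any return it cannot tie to a shipment.
rejected = [r for r in source_returns if r["shipment_id"] not in warehouse_shipments]

source_total = sum(r["amount"] for r in source_returns)
rejected_total = sum(r["amount"] for r in rejected)
print(f"Source return total:    {source_total:.2f}")
print(f"Rejected (no shipment): {rejected_total:.2f}")
print(f"Loaded into warehouse:  {source_total - rejected_total:.2f}")
# The gap between source and warehouse totals is exactly the rejected
# amount: a documented explanation, not a data quality defect.
```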

Taking a stab at fixing our quality issues, I prepared a short PowerPoint presentation to give to the team and manager. The first slide had a single question.

Question #1: Do we want to load bad data?

The answer might seem obvious, but it gets murkier depending on the priorities of the organization. For this organization the answer was a resounding 'NO'. Good, now we are making progress.

Question #2: How do we want to handle bad data?

This one is a little trickier. We can and do ship bad records back to the sending systems for them to clean up. In addition, we perform data cleansing and validation on key fields before the load. It doesn't seem to be enough. What is enough, and what is too much? That is question #3, which I am still thinking about.
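For the curious, here's a minimal sketch of what I mean by pre-load validation on key fields - splitting a batch into rows that load and rejects that go back to the sending system. The field names and rules are made up for illustration:

```python
def validate_record(record):
    """Return the list of key-field problems found in one incoming record."""
    problems = []
    if not record.get("customer_id"):
        problems.append("missing customer_id")
    if record.get("order_date") is None:
        problems.append("missing order_date")
    if record.get("amount", 0) < 0:
        problems.append("negative amount")
    return problems

def split_batch(records):
    """Separate a batch into loadable rows and rejects to ship back."""
    clean, rejects = [], []
    for record in records:
        problems = validate_record(record)
        if problems:
            rejects.append({"record": record, "reasons": problems})
        else:
            clean.append(record)
    return clean, rejects
```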

The brief synopsis is that I'm involved in determining what kind of effort can be made to clean up the data.

In my next post covering data quality we will talk about some data cleansing methods. For now, try to answer questions #1 and #2 above as they relate to your organization.