Thursday, December 28, 2006

2006 – A Year of the Damn Data

It’s been almost exactly a year since I started “The Damn Data” (TDD). I have a great memory and love to reminisce about good times in the past, so I’ve decided to put a data year-in-review summary below and list my professional resolutions at the end.

January 2006 – Came up with the idea for “The Damn Data” website while sitting in a boring meeting at work arguing about data quality, thinking that the problem ‘is just the damn data’. The goal was (is?) to develop a brand about data of all types stored in all types of structures and manipulated by all types of tools.

February 3 – Presented my first-ever session at the South Florida Code Camp, titled “It’s all about the Data: Building a Data Warehouse using SQL Server 2005”. I’ll never admit it, but I was kind of nervous about the appeal of the topic; the 65 people in attendance convinced me that I was onto something. The post-party at Tarpon Bend was quite enjoyable. I especially enjoyed talking with the pretty ladies who are also techies; I think they were from Ft. Lauderdale. Oh well.

March – Much to the angst of Joe Healy, I skipped the Orlando code camp to watch my beloved St. Louis Cardinals play three spring training games in Jupiter, unknowingly watching the precursor to Game 7 of the NLCS (Cardinals @ Mets).

June – Roadtripped in the ‘shaggin’ wagon’ (since sold God bless it) with Healy and Tom Fuller to the Tallahassee Code Camp. Presented “Data Warehousing Essentials with SQL Server 2005” with 14 people in attendance. Slightly disappointed in the post-party at Paradise Bar and Grill due to lack of college students in town during the summer. Strongly suggested to Keith Rowe that TLH Code camp be moved back to September (which I’m told it will be). The best part was seeing city-boy AKA Fuller in his Buffalo Bills visor eating at “Hog Heaven” bbq on the side of U.S. 19, a place where only a redneck would feel at home (like Healy and I are at heart).

July – Wasted a good weekend at the Homewood Suites in Tallahassee working on a contract project and arguing with a batch load guy about the appropriate way to load data into tables. Even worse was the lack of a good bar in close proximity to the hotel. Jumped out of a perfectly good airplane with a fellow techie (Laurie) against my own judgment, but it was well worth it. Attended the Tampa code camp and presented two sessions, “Data for Dummies: Parts 1 & 2”. The first session went well, with about 32 people in the room and some good questions, but a problem with my SSIS demo. DOH!!!!! I must have scared a bunch of people away because only 4 came to the second session. Enjoyed the candid feedback received from the evaluation forms, some of which said “speak more slowly, covered material too fast” and “speak faster, spent too much time on basic material”. MORAL – can’t please ‘em all.

August – Attended the Tampa Bay Technology Forum Summer tech jam (www.tbtf.org) at Channelside. It was a lot of fun and I got a chance to meet up with a lot of recruiters (many of them were very pretty and you know who you are, and I’ll still take you up on that dinner with your friend, Michelle). I also socialized with old coworkers and friends who I hadn’t seen in a long time. At the end of the month, I roadtripped with Healy, Kabza, Fuller, Schultes, and Zimmy to Jacksonville for the code camp (thankfully I did not drive this time). Shared a room with Kabza and Zimmy at the Hyatt on the St. Johns River. JAX is a beautiful town and I highly recommend it. Drank a couple with the guys at the Landing and ended up presenting “Data Warehousing Essentials with SQL Server 2005” as a fill-in in the afternoon with a pretty good hangover. Received an eval comment that said “too relaxed” contrasting with a comment saying “slow down and relax”. Went to the Bucs @ Jags preseason game with Kabza, riding the water taxi to the game; it was fun and worthwhile and one of the highlights of my summer.

September – Presented “Introduction to SQL Server Integration Services” at the Tampa SQL user group meeting with about 16 in attendance. The presentation went well and I enjoyed socializing with the group at Courtside Grille afterwards. Spent too much time chasing women and socializing to think a lot about data…….

October – Didn’t do a damn thing related to data, but I did hop on American Airlines and found myself at Game 5 of the World Series, watching the St. Louis Cardinals decimate the Detroit Tigers for the championship. I thought about data warehousing after noticing that it took 3 days for my AAdvantage miles to post to my account after my trip. In contrast, Delta posts the miles to my account by the time I get home from the airport. Could this be the difference between a batch load and a trickle feed (right time) data warehouse? Interesting…. Oh, and thanks to Kabza and Schultes for joining me at the Green Iguana (after Russ’ MSDN event) and Courtside Grille to watch the NLCS game 7.

November – Courtesy of my employer, attended The Data Warehouse Institute (www.tdwi.org) World Conference at Universal Studios in Orlando. I was able to walk around CityWalk during lunch and found it to be enjoyable. I learned a lot about ETL architecture and data warehouse design considerations. At the end of the month, went out with Tim, Kabza, Vinay, Schultes, and Healy to a good bar (The Independent) and Indian food restaurant in downtown St. Pete. It’s always a treat to socialize with these guys and I value those moments highly.

December – Attended the Visual Studio Team System for Database Professionals launch in Tampa at the Grand Hyatt. Thanks to the Microsoft guys, I was unexpectedly recognized in front of the 120 people there for my tribal knowledge and my wonderful website. Later in the month I attended the Hot Spaces.net Tech Social, where my nametag reading WES @ www.thedamndata.com drew quite a few stares. Met some great people and hopefully lined up a hookup for some Tampa Bay Lightning tickets, although she flaked on my first request.

Wow – now that I have written that I feel great and remember what a wonderful year it really has been. The goal now is to make 2007 even better, and tomorrow I will write a post with my thoughts on those goals.

Wednesday, December 27, 2006

Windows Vista – A New experience

Those who have seen my presentations know that my life is on my laptop. Therefore, it was fitting that I was exceptionally frustrated when a 10-month-old Dell Inspiron 9300 laptop began to chug along like a 4-cylinder truck.

I reformatted the laptop and, using my MSDN Universal subscription, decided to install Windows Vista Ultimate as my operating system of choice. Thanks to my excellent file management practices (nicely complementing my good data design practices), backing up my projects and personal files took all of 30 minutes. I downloaded Windows Vista and I was off to the races.

The install process was especially smooth. Friends reported that Vista installs quickly; my experience did not reflect that, and it seemed to take a good hour to fully install. Most of the hardware had built-in Vista drivers.

The new look and feel is beautiful. Thankfully I purchased the laptop with 1 GB of memory and the dedicated (and at the time, kick ass) Nvidia GeForce 6800 256 MB video card. The fade look and shadows are very nice.

Vista is a bit of a memory hog, and I regularly find that 500 MB of my memory is used even before I open up any programs. It will be interesting to see how Oracle 10g and SQL Server 2005 run on Vista when I am running my demonstrations. With XP Professional, running Oracle and SQL Server 2005 together was risky business.

The only issues I have had so far have to do with my legacy programs. My HP printer will not install a driver, thus it’s practically worthless to me right now. I also use a statistical analysis package called STATA for my economics research, and that program would not install correctly on Vista. Even running as administrator with XP SP 2 compatibility did not work.

Thankfully I can still use my trusty Dell Dimension 8200 (4 ½ years old) to run my legacy apps and print documents. I certainly recommend that anyone upgrading to Vista consider 2 GB of RAM as a minimum. I am loath to spend $200 upgrading the lappy right now, as this machine will be replaced in 11 months with an Apple MacBook Pro unless Dell has a more compelling product.

Along with Vista I installed Office 2007 Professional. I love the new look and features of this program and will probably do a separate write up of it later after I have used it more.

All in all, Vista appears to be a great step forward and I look forward to showing it off at the Miami code camp on February 10, 2007.

Monday, December 04, 2006

Data Warehouse opportunities in Tampa Bay

I have been asked to help find some good talent to work with in the Tampa Bay area, and the benefit is that you will be able to work with me, thus learning from the best (*HA*). Keeping in line with my blogging policies, I will not name the employer, but if you've seen me speak you know where I work.

Position requirements:
-------------------------
College degree - no experience is okay, these are entry level positions
Willingness to learn
Good attitude

I've been speaking all over the state of FL this year, and certainly there are 1 or 2 people who want to come work in a high-performance data warehouse environment, learn from people at the top of their game, and add a very strong resume-building position to their resumes.

Here's the technology you will be using:
-----------------------------------------
Oracle 9i/10g
IBM Websphere DataStage ETL
Quest TOAD for Oracle

We're also willing to consider entry-level people with experience in other ETL tools such as SSIS and Informatica.

If you are interested in learning more or want to send me your resume, my email is thedamndata AT yahoo DOT com. If you can't figure that one out, then don't bother.

Friday, September 15, 2006

Tampa SQL Presentation

On Tuesday, September 19, I will be presenting "Introduction to SQL Server Integration Services" at the monthly Tampa Bay SQL user group meeting.

The presentation will cover the SSIS tool, and we will look at ETL design and then develop and deploy some small ETL jobs to move data between different databases. I'll also take questions from the audience and attempt to answer them or make up some kind of answer to salvage my reputation!

Hope to see you there.

Tampa SQL Website

Saturday, August 26, 2006

Jacksonville Code Camp

The second annual Jacksonville code camp is August 26, 2006 in downtown Jacksonville. I didn't post this earlier because I wasn't sure of my attendance until the very last minute. A sequence of events (one is a pattern Mr. Rawls!!) transpired that made my travel to Jacksonville possible. My continuous travel has put a damper on housekeeping tasks and my session submission was not turned in on time, but nevertheless I am giving a 75-minute session today titled "Introduction to Data Warehousing".

Tonight we will be at the Copper Cellar in the Jacksonville Landing for the post-party event.

Come support our 'brothers-in-arms' in the Florida developer community at the Jacksonville code camp.

Wednesday, August 16, 2006

JumpStart TV

Brian Knight of the Jacksonville developer community has created a great web site that contains instructional videos on technology topics.

I am working on an 8-part series of videos titled "Introduction to Data Warehousing". Each video is a piece in a cumulative series that gives a beginner overview of data warehousing. Each video is 10 minutes in length, and in the future I am going to create an "Advanced Data Warehousing" series. I've also queued up some videos on SSIS, Database Performance Considerations, and other interesting data topics. Comment on anything else you would like to see.

The site is www.jumpstarttv.com

Wednesday, August 09, 2006

SQL Server and Surrogate Keys, Part 2

Last month I posted an article on Microsoft's miss in not including a run-time surrogate key generator in the SSIS tool suite.

I received a couple of comments asking why I do not use the identity column in the SQL Server database as the de facto surrogate key. This is certainly a good plan of action when using a SQL Server database, but the ETL jobs I am creating require the use of an Oracle database. Thus the identity column idea is not a workable solution for Oracle.
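
For illustration, here is a minimal sketch of the two approaches side by side, assuming a hypothetical DIM_CUSTOMER dimension (the table and column names are mine, not my client's):

-- SQL Server: the database assigns the surrogate key via an identity column
CREATE TABLE DIM_CUSTOMER
(
CUSTOMER_KEY INT IDENTITY(1,1) PRIMARY KEY,
CUSTOMER_ID VARCHAR(20),
CUSTOMER_NAME VARCHAR(100)
)

-- Oracle 9i/10g: no identity column, so a sequence plays the same role and must
-- be referenced explicitly (or through a trigger) on every insert
-- (same table created without the IDENTITY clause on the Oracle side)
CREATE SEQUENCE DIM_CUSTOMER_SEQ START WITH 1 INCREMENT BY 1

INSERT INTO DIM_CUSTOMER (CUSTOMER_KEY, CUSTOMER_ID, CUSTOMER_NAME)
VALUES (DIM_CUSTOMER_SEQ.NEXTVAL, 'C1001', 'ACME CORP')

The sequence call works fine for database-resident loads, but as I noted in the original post, making a database round trip for every row is exactly what a high-volume ETL job wants to avoid.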

In my realm I have created an ETL methodology that is designed to be as database-agnostic as possible; however, there will always be a need to customize the approach to the specific situation and requests of the client.

Saturday, July 15, 2006

Data for Dummies - Part 1

Thanks to all those who attended my Data for Dummies session at the Tampa code camp. I very much appreciated all the constructive comments and feedback. This post will address some of the comments.

A couple of comments were directed towards the concern that the course was too advanced for an introductory class. I will take that concern into future consideration - one of the problems is that everyone has a different idea of what is introductory and what is advanced, and I structured the session to hit on the most important parts of the data warehouse architecture. This topic could easily fill 8 hours.

So for the future - I will do an introduction and advanced class. It has been my intention to do this for the past couple code camps but due to the high number of speakers and the logistical nightmares this was not possible.

Another thing I will include is a working data warehouse, built around my methodology, for my full demonstration. That has been the plan for a long time, but due to time constraints (grad school and work keep me very busy), I haven't been able to complete this vision.

But all in all I hope you enjoyed the session. Thanks for the constructive and positive feedback. If you have any questions, feel free to email me and I will be happy to put together a quick-hits presentation to send to you, filling in anything you wanted but did not get from the session.

WD

Wednesday, July 12, 2006

High Performance Data Testing

My goal was to conduct some performance tests to determine the difference between using Oracle and SQL Server 2005 as data feeds for SSIS transformations.

I created an identical table in both Oracle and SQL Server. The DDL is this:

CREATE TABLE EXTREME_DATA_TEST
(
ORDER_DT_KEY NUMBER(10),
ORDER_ID NUMBER(9),
OTHER_DATA NUMBER(2)
)
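
(The DDL above is the Oracle form; NUMBER is not a SQL Server type. The SQL Server 2005 copy of the table would look roughly like this, with NUMERIC standing in at the same precision - a sketch, not the exact script I ran.)

CREATE TABLE EXTREME_DATA_TEST
(
ORDER_DT_KEY NUMERIC(10),
ORDER_ID NUMERIC(9),
OTHER_DATA NUMERIC(2)
)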

I loaded the table with 9.5 million rows of unique data. I created the job below in SSIS to process the rows. The job simply reads the data through an OLE DB connection and writes it out to a flat file with no transformations.



I then fired up the Oracle 10g instance on my personal laptop. Yes, it's a high-performance laptop, but I don't expect anywhere near the performance of an EMC disk farm. I started the job and let it run for 10 seconds. The following screen shot is after 10 seconds.



My second test was to run the same job connecting to SQL Server 2005. Remember that the SQL Server 2005 table has exactly the same data. Notice the drastic difference in data speeds.



Did you notice what I noticed? SQL Server 2005 is pulling data at 10x the speed of the Oracle OLE DB connection. For further tests, I created a non-unique bitmap index in Oracle 10g on the order_dt_key column. This had no apparent effect on the reading speed.
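
(For anyone repeating the bitmap index test, the statement is a one-liner; the index name here is just my own choice. Since the job reads every row with no WHERE clause, a full table scan is expected either way, which would explain why the index made no difference.)

CREATE BITMAP INDEX EXTREME_DATA_TEST_BX1 ON EXTREME_DATA_TEST (ORDER_DT_KEY)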

All I can say is that I am certainly leaning towards doing more ETL development in SQL Server 2005.

Tuesday, July 11, 2006

Surrogate Keys in SSIS

Surrogate keys are a way to build referential integrity into a snowflake schema in an EDW environment without having the database enforce foreign key constraints.

A surrogate key is merely an arbitrary number that has no value other than to uniquely identify the row. A good example of a surrogate key (SK) is creating a sequence and having the sequence.nextval() routine called when inserting a row. This introduces the problem of making a DB call to transform a row, which will drastically slow down high-volume processing. In the IBM WebSphere DataStage product it is very easy to create surrogate keys by using freeware functions. At my employer we load surrogate keys in a high-performance method.

Using Microsoft SQL Server Integration Services (SSIS) is a completely different story. Although the SSIS tool is very powerful, it appears to be a major oversight by Microsoft to not include a function that creates surrogate keys on the fly. In building some ETL jobs for my current client I ran into the problem of how to create surrogate keys. Remember that a key idea of an SK is for it to be unique. Indexes will be built on the SK field to allow quick retrieval of data.

I started out my quest to develop a workaround where I always begin my searches - the Yahoo homepage. I was quickly redirected to www.sqlis.com which recommends two methods of creating SKs. The first method requires using a script variable to store the SK value during processing; the second method is a plug-in component that was created for SSIS. The component worked but didn't (from a quick glance) look to be what I wanted.

Therefore I was challenged to come up with my own method for creating an SK. This is very easy the first time a table is populated: initialize a variable with 1 and increment. But what happens when daily processing ends and a second run begins? The variable is initialized back to 1, which results in duplicate SKs. My workaround is this (a rough SQL sketch follows the list):
1) Create the first ETL job to select the max SK in each dimension table
2) Store this value in an ETL processing table with the SK and dimension
3) When doing processing for the dimension, pull this value, initialize it into a variable, and then load, incrementing it in the script during processing.
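
Here is what the seed step looks like in rough SQL, assuming a hypothetical ETL_SK_SEED control table and a DIM_CUSTOMER dimension (illustrative names, not my client's actual objects):

-- Control table holding the last-used surrogate key per dimension
CREATE TABLE ETL_SK_SEED
(
DIMENSION_NAME VARCHAR(50),
MAX_SK NUMERIC(10)
)

-- Seed job: clear the table, then capture the current max SK for each dimension
TRUNCATE TABLE ETL_SK_SEED

INSERT INTO ETL_SK_SEED (DIMENSION_NAME, MAX_SK)
SELECT 'DIM_CUSTOMER', COALESCE(MAX(CUSTOMER_KEY), 0)
FROM DIM_CUSTOMER

-- The dimension load reads MAX_SK into a package variable and increments it
-- in the script for every new row it inserts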

Below is the screen shot of what my seed job looks like. Note there is an OLE DB command that truncates the table each time the job is run.





Here's a screenshot of the data grid in Toad for SQL Server (highly recommend this product).





During processing this value is placed into a variable that is incremented allowing for successful SK handling.

I will submit a product suggestion to Microsoft for the next release to include a built-in SK generator. I'm willing to bet it's already on their radar.

Hope this explanation was helpful. Drop me a comment with suggestions.

Tuesday, June 20, 2006

Tampa Code Camp

Time for the 2nd annual Tampa Code Camp. The Tampa code camp was my first code camp and I can easily recall the great time that it provided. A code camp is a professional conference, by and for the community, and free for all. It is sponsored by Microsoft with contributions from other local groups. Top notch speakers from the state, region, and nation descend on Tampa on July 15 to speak at USF.

I'll be doing two sessions. The first session, "Data for Dummies: Introduction to Data Warehousing", is the entry-level session covering data warehouse practices and principles. The second session, "Data for Dummies Part 2: Advanced Data Warehousing", will cover intermediate and advanced data warehousing concepts including advanced data transformations, slowly changing dimensions, partitioned spaces, and some DBA tasks. It's not imperative to attend one session to get anything from the other, although they are designed to be cumulative.

For my friends (coworkers) who want to attend my session but not the entire code camp, please DO NOT register. The code camp is over 1/2 full already before the official open registration. I will let you know when and where my sessions will be held and you can attend my sessions only. If you would like to attend other sessions, please DO register. Without registration, you will be unable to attend the lunch and register for the drawings.

The code camp will be held at the University of South Florida (my future graduate alma mater) in the Business Administration building (my second home).

If I've personally invited you from out of town, drop me a line and I may (no promises) secure you an invite to the Friday night gathering. The Saturday night post-party is open to every attendee of the code camp and I will be there guaranteed. I'm always willing to chat and if you buy the beer I'll be more than happy to talk your ear off.

Here's the link to the website: Tampa Code Camp 2006

If you look through the video of the South Florida code camp on the website you might just see my presentation......

Monday, June 19, 2006

Tallahassee Code Camp Review

I have returned safely from the Tallahassee code camp. All in all it was a great weekend; here are my comments:

1) The session ("Data Warehouse Essentials") went really well. There were ~20 people in attendance. Got some great questions from the group, and if you personally requested me to mail you the materials, you will get them this week as soon as I attach my copyright and legal notice to the slides.

2) The code camp was good. A little smaller than last year, but I think all are in agreement that it should be moved back to September. Thanks to Keith Rowe and Jose Fuentes for putting on a great time. Paradise Bar was a good hangout place and I appreciated the hospitality shown by the TLH locals.

3) Thanks to Joe and Tom for being great passengers on the ride to/fro the event.

4) If you're ever in Suwanee (on U.S. 19), check out the bbq shack there. It was a great place and reminded me so much that we are in the beautiful South, better known as God's country.

5) I highly recommend a visit to the old state capitol if you're ever in Tallahassee. It is now a museum full of interesting facts about the state of FL. It doesn't hold a candle to the Missouri state capitol in beauty or size but it is a Southern building and is worth the look.

It was a successful event and thanks to all those who helped produce it and especially those who attended my session.

Friday, June 09, 2006

Come see me speak......

....at the second annual Tallahassee (FL) Code Camp on June 17, 2006. It's at Florida State University (looks like I'll be getting another hat for my hat collection) and is an all-day geeky event. Seriously, it's a professional event sponsored by Microsoft that features some of the best minds in the south giving presentations about current topics in technology. I know a lot of the presenters and it's always enjoyable socializing at the event.

My session is titled "Data Warehousing Essentials with SQL Server 2005". This session will cover historical and current trends in data warehousing, the fundamentals of building a data warehouse, and a discussion of ETL (extract, transform, and load) techniques. The tool I'll be using to demonstrate the project will be the Visual Studio 2005 Business Intelligence suite, also known as SSIS (SQL Server Integration Services). I'll use SQL Server 2005 for the database as well as Oracle 10g.

Here's the link for the site to read about and register for the event: TLH Code Camp.

I'll be available at the invite-only social on Friday night as well as the open post-party on Saturday night if you have questions about data warehousing or just want to chat. I can also schedule some 1-on-1 time during the event on Saturday to talk about your specific application and answer questions. Leave me a comment and I'll email you.

Hope to see you there.

Thursday, May 25, 2006

Data Warehouses in the news

The following link points to a Business Week story. Just a shameless plug for BW, but it's my favorite magazine because of the in-depth articles.

Stopping The Sprawl at HP, Courtesy of Business Week

Reading during lunch today, I found an interesting article about how HP is building an enterprise data warehouse, consolidating 46 data centers into 6. This is of great interest to me because it confirms my assertion that many companies have too much data and are so overwhelmed that they don't know what to do with it. Even worse, many companies I've talked to don't seem to know exactly what data they store and need, although they have an idea of how they would like to use the data.

Sorry I haven't posted much lately - I don't want this blog to die on the vine so I will make a profound effort to think up some good articles. Have a good Memorial weekend.

Tuesday, April 25, 2006

Where has The Damn Data man been?

I'm still around just very busy with a couple things. I spent a weekend fishing, one weekend in Atlanta, and another weekend watching my beloved St. Louis Cardinals baseball team.

Never fear though I have completed the first draft of my ETL Design Methodology document. Check back on Wednesday, April 26 (tomorrow) and it will be posted for comments.

I'm also preparing a column on ETL audit processes. Stay tuned.

For those friends at work wondering, here's a pic of my home office.

Sunday, April 02, 2006

Data Extraction

First off, sorry about the lack of posts recently. Between dealing with a couple projects, a home, a parental visit and subsequent road trip, and everything else I haven't made time to write lately. Today's post will be a little short because I tore a hole in my hand today removing the oil filter on my truck.

ETL stands for Extract, Transform, and Load. Let's dive into the Extract portion of data warehousing.

Extract, as defined by the Merriam-Webster dictionary, is "to draw forth (as by research) extract data". Wow, they actually have data extraction listed in the definition (way cool). Data warehouses can either extract data from the source system itself or receive files of extracted data. The point is that input data in the data warehouse must be extracted from another source, and the method is irrelevant to the goal.

At the project I'm working on we have jobs that both receive extracted data and pull data, performing the extraction process ourselves.

Let's explore receiving first. Data is received in a flat file of pre-determined layout (or it may come in XML or some other message format, depending on the architecture you are using). The layouts are delimited (comma, pipe, or some other delimiter) or fixed-width. Fixed-width files harken back to mainframe COBOL days when file definitions were hardcoded into the programs themselves. At work we prefer pipe-delimited files because it's a lot easier to change things around and maintain a working audit and log process.

The other type is to have some kind of extract job. I've seen it done both in actual DataStage jobs and in Oracle procedures. We have a couple of daily jobs that extract data directly from the source systems. The systems have an agreed-upon time at which we will connect and pull data. A little business knowledge is needed here to pull the correct data, so it's important to keep in mind that performing the extraction from the data warehouse side shifts the responsibility for getting the correct data back to the ETL developers.

Once the data is pulled or received from the source systems, it's placed in files or staging tables (again, depending on your specific approach). In a good system there will be an audit mechanism keeping track of what data was received and when, preferably including some kind of row count and timestamp.
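
As a rough illustration of that audit mechanism, here is a sketch assuming a hypothetical EXTRACT_AUDIT table (your shop's names, columns, and datatypes will differ):

-- One row per file or extract pull received by the warehouse
CREATE TABLE EXTRACT_AUDIT
(
AUDIT_ID NUMERIC(10),
SOURCE_SYSTEM VARCHAR(30),
FILE_NAME VARCHAR(100),
ROW_COUNT NUMERIC(10),
RECEIVED_TS TIMESTAMP -- use DATETIME on SQL Server
)

-- Recorded as each file lands in the staging area
INSERT INTO EXTRACT_AUDIT (AUDIT_ID, SOURCE_SYSTEM, FILE_NAME, ROW_COUNT, RECEIVED_TS)
VALUES (1, 'ORDERS', 'orders_20060402.dat', 152344, CURRENT_TIMESTAMP)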

Alright folks, that's pretty much the lowdown on the extraction process. In my next post I will discuss staging areas and audit processes for the extraction process and then we will move on to the transform part. Stay tuned.

As a teaser, I'm giving a presentation to a few people this week on Oracle database performance tuning. I'll post the presentation up here and highlight a couple of areas for in-depth discussion.

Monday, March 13, 2006

Project Estimation

I've just returned from a very relaxing three-day fishing nirvana that gave me plenty of time to think about today's topic. For the record, I did catch two very nice largemouth bass and a bunch of smaller ones I hope to catch again next year. WES: 2, BASS: 0.

Project estimation has to be one of the most difficult areas of project management. I've worked with people who have a standard "6-months" answer. I ask "How long will it take to create 100 jobs?" The answer: "6-months". "Two jobs?" "6-months". Really making progress there...I could probably make this into a Dilbert cartoon.

Project estimation is related to all areas of development, not just data warehousing. As any of us who have ever spec'd out work units know, estimating the time that it takes to accomplish a specific task is easier said than done. It depends on many variables including worker motivation, worker ability, timeline, and difficulty.

Let's start out by determining a unit of work. A unit of work, or work package, is the smallest piece of work. For a data warehouse this might be extracting rows from an Oracle table. The full project includes the other jobs (transformation, cleansing, load).

Worker motivation also influences timelines. I think there is a good compromise between being over-motivated and under-motivated. I'd rather have someone between the two extremes on my team, eliminating burn-out and boredom. Happy, challenged employees will do much better work than someone bored with work. From my very first job in high school I have continually heard the mantra that I'll call labor shuffling. It's the theory that any one person can do any job in a certain organization. While I believe this is a great idea in theory, I am not convinced that it works in practice unless executed to almost impossible precision. Never mind that, putting on my economist hat, it really bothers me to be operating in a region where we are inefficient.

Ability is difficult to ascertain. Enough said.

Tonight I had a discussion with an associate regarding some opportunities he is having on a project. He's trying to put together an accurate timeline without having defined milestones. He has dates and estimated durations for the project, but they are mostly a stroke of wishful thinking. I recommended he work out an objective list broken down by work packages and then use the PMI formula for estimating project timelines. I'm hoping our discussion helped him out.

Here's the formula PMI uses to estimate projects (I'm a Project Management Professional (PMP) and a member of PMI so I feel it's appropriate to share this here...let me know if you feel otherwise....taken from PMBOK, version 2000)

PERT Analysis
Program Evaluation and Review Technique - weights activities to determine the most likely duration

PERT Weighted Average = (Optimistic + 4 x Most Likely + Pessimistic) / 6

Let's put this into play thinking of a DW load process. I'll give you the numbers and you figure out the weighted average.
O=5 ML=10 P=15

Time for you to do the math.......don't cheat.

Plugging it into the formula, WA = (5 + 4(10) + 15) / 6 = 60/6 = 10
It's probably going to take 10 days to complete so most likely is correct in this case. Hmmmm, maybe I should pick harder numbers. I think you get the idea.

If there is further interest I'll dive a little deeper into estimation techniques.

This post is a little short on content, but my goal is to get you thinking about estimating timelines. Write down the answer to the following question on the same piece of paper you used to figure out the PERT formula (you did write it down, didn't you??): How do you estimate your timelines, and are they accurate?

Tuesday, March 07, 2006

Data Quality

Today's discussion will center around the absolute most important concept in data warehousing (in my opinion): data quality.

Users might notice if you don't have data, but you can bet they will notice if your data is incorrect, and they will probably consider you, your team, and your processes to be incompetent.

Here are three guidelines I have regarding data quality:
1. Data quality is not an accidental result of a process; it is a planned result of efficient and correct processes, THUS
2. Having no data is better than incorrect data, THUS
3. If data quality is not important in your processes, you will not have quality data

Sounds like a no-brainer, doesn't it? I beg to differ - of ALL the IT projects I have worked on, quality is the biggest issue that generally gets the least effort. Think about it - a project is running long. What's the first part to get cut? Testing time. Quality review. UAT.

A previous project I worked on dealt with prescription information. Wouldn't you think data quality is important there? Of course you would - knowing that if data was incorrect, a drug interaction could be missed, potentially causing tragic consequences including death. So you don't get prescriptions and thus you are not affected by data quality? Okay, what if code that was erroneous because of developer oversight was used for your parents' prescriptions? I bet you're a lot more interested in the far-reaching effects of data quality now.... (to be fair, not all data quality is a life or death matter, but it can have serious consequences for an organization)

At the client I'm working with now, we had a severe problem with quality in the past couple of weeks. The data warehouse is a marvelous tool, but in the past things have been rushed into production without considering quality issues until a user squawks that numbers don't look right or some report doesn't balance out with what the source system says.

Let's get it clear right away that source systems won't always balance out with the data warehouse. The source system for my client accepts orders/returns/shipments. The data warehouse will reject a return that doesn't have a shipment. The source system says it's balanced, so the data warehouse must be wrong in the eyes of the business user, right? WRONG! Many times a simple explanation is all that's needed, but problems crop up when there is no explanation.

Taking a stab at fixing our quality issues I prepared a short PowerPoint presentation to give to the team and manager. The first slide had a single question.

Question #1: Do we want to load bad data?

The answer might seem obvious but it's more murky depending on the priorities of the organization. For this organization the answer was a resounding 'NO'. Good, now we are making progress.

Question #2: How do we want to handle bad data?

This is a little trickier. We can and do ship bad data records back to the sending systems for them to clean up. In addition, we perform data cleansing and validation on key fields before the load. It doesn't seem to be enough. What is enough, and what is too much? That is question #3, which I am still thinking about.

The brief synopsis is that I'm involved in determining what kind of effort can be made to clean up the data.

In my next post covering data quality we will talk about some data cleansing methods. For now, try to answer questions #1 and #2 above as they relate to your organization.

Sunday, February 19, 2006

DW Load Methods

A data warehouse is a warehouse that contains data. Sounds funny, doesn't it? Imagine a warehouse as a storage structure that may be physical or logical. In the DW it's both...logical as in the data model representation and physical as in the hard disks and other computer equipment that support the data warehouse.

The first question should be "How does the data get there?" Excellent question and I'll cover some different load methods in my column today.

Here are three key areas to keep in mind....

Availability - will someone create a report with transactions completed 20 minutes ago or will it be with yesterday's data?

ETL Process - does our ETL process handle constant loads or is it a once a day process?

Hardware limitations - do we have the hardware resources to allow people to access data and create reports while we are inserting/updating data?

Types of ETL Processing:

1. Batch load - load all data once a day. At the company I work with we do a batch load process that begins at midnight and hopefully a majority is complete by 7 am. (pretty common)

2. Trickle feed - rows are loaded in the EDW as they are received in the collector (the hardest)

3. Microbatch - small batches of data are processed on either a time threshold (every 2 hours) or a row threshold (20,000 records in the collector). A rough sketch of the threshold check follows below.
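
For the microbatch case, the trigger logic can be as simple as polling the collector. Here is a sketch assuming a hypothetical COLLECTOR_QUEUE table with a processed flag (names are illustrative only):

-- Poll the collector for rows waiting to be loaded
SELECT COUNT(*) AS PENDING_ROWS, MIN(ARRIVED_TS) AS OLDEST_ROW
FROM COLLECTOR_QUEUE
WHERE PROCESSED_FLAG = 'N'

-- The control job (SSIS, DataStage, or a scheduler script) compares PENDING_ROWS
-- against the 20,000-row threshold and OLDEST_ROW against the 2-hour window,
-- then launches the load if either threshold is exceeded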

There you have it.....a brief overview of common DW load methods. In a future column we'll further explore each of these methods.

DW Resources

Here are some data warehouse resources that I find helpful. I'm going to put this on the sidebar, but until I take the time to figure that out this will have to do.

General DW information and white papers:
The Data Warehouse Information Center

The Data Warehouse Institute is a professional EDW organization. I am a member of this group and they put on quarterly conferences. The fall one is always in Orlando and I highly recommend that you attend if you get a chance. This is the conference I attended in Spring 2005.
The Data Warehouse Institute

There are two schools of thought in the data warehouse world. Two of the original fathers of the DW are Bill Inmon and Ralph Kimball. They have a similar goal but different approaches, and I'll try to write a little about their differences in a future column. For now, though, here are their websites. I've attended a Ralph Kimball presentation and it was very good; he's a great speaker.

The Kimball Group

Bill Inmon web site:

Inmon Data Systems

Last but not least, here are some books I highly recommend. I have all the books listed below and will be happy to discuss them if you email me. For those who attended my class, you have my email, but I also have an email (thedamndata) at Yahoo. Figure out the syntax!





Thursday, February 16, 2006

ETL Methodology

I want to take a little time to discuss something that is extremely important in data warehousing: ETL methodology.

ETL methodology can best be defined as the way ETL is done in an environment.

Everything regarding ETL has a process. Here's some pieces:
- Table naming conventions
- Job flow (source, transform, load)
- Job parameters
- Job options (Resurrection, pause, kill)
- General ETL job layout (sourcing on top, transform in middle, loading on bottom)
- Use of Staging areas, temporary files, and temporary tables

I'm developing a complete methodology for SSIS (SQL Server Integration Services). When complete, my methodology will be able to be applied in any environment at any company with only minor modifications.

Once I get a couple of things done around the house I'll start working on my ETL Methodology document. If anyone would like to review it let me know; I could use a couple reviewers.

I'd also like to give a shoutout to Ken Bland. Ken is a very talented ETL architect and developer who has jointly developed a very popular methodology for the IBM WebSphere DataStage(TM) product. Over 90% of my EDW knowledge was a direct result of working under Ken's guidance. His website is listed below if you'd like to take a look at his booming consulting business. Best of luck Ken.

KBA Associates

Code Camp Presentation Materials, Part II

Here's the materials you've been waiting for:

Slides (with phone/email addy removed - you've got that info from the presentation)

Here's the Solution (I'm working on making this better so give it some time)

File Briefcase on Yahoo

Both of my economics tests are over for this month. I did better than I thought on both of them so my quote of the day is "it's better to be overprepared than underprepared"

Thanks for the well wishes....

Tuesday, February 07, 2006

Code Camp Presentation Materials

I had a discussion with a confidant today regarding whether or not to post my code camp presentation materials (slides and code). My main concern is this: I have a distinct name. I prefer to remain anonymous on the internet for multiple reasons (I'm a private person anyway), but my confidant "Guy" convinced me that it would be for the common good of the community and that nothing is anonymous anyway. He made a good point, so I will post them.

First, I had a killer test in Economics class tonight and I have another one on Thursday night. I will post the materials sometime later this week and update this posting. Wish me luck on the tests!

I'm also preparing a booklist of recommended reading and websites so I'll get that up as well.

Sunday, February 05, 2006

Simple Performance Testing

While preparing my demonstration for the code camp listed below, I came across something interesting. The SSIS package read rows from SQL Server 2005 about twice as fast as from Oracle 10g.

Let me start off by describing my environment. I have a Dell Inspiron 9300 laptop with a 2 GHz processor, 1 GB of RAM, and an 80 GB hard drive. I have a full install of Oracle 10g Standard Edition and a full install of SQL Server 2005 Standard Edition. I created the same tables in both systems and loaded them with the same rows (the same data), totaling about 9.5 million rows.

My data flow tasks simply used the OLE DB stage to pull data out of SQL Server and Oracle and write it directly to a flat file stored on my C:\ drive. (I didn't have them both running at once; I ran the Oracle job first and then the SQL Server job.) Neither table was indexed.

After 5 seconds the Oracle job had pulled ~300k rows while SQL Server had pulled over 700k rows. There was obvious latency as the output file was written to disk.

Someone mentioned to me that the OLE DB stage in SSIS has an error communicating with Oracle. I will research this more and find out what effect, if any, it has on my test.

After I determine if the OLE stage is affecting the performance, I'll run actual tests with a stopwatch and post the full results on here.

SQL Server 2005 is looking very promising as a key tool for a data warehouse.

South Florida Code Camp

Yesterday I spoke at the South Florida code camp on data warehousing and SQL Server 2005. The session seemed to go well; I received a lot of positive comments and would appreciate more constructive comments (positive or negative).

I have not posted my presentation or code examples online, but if there is demand to do so I don't have a problem with it. Just leave a comment or shoot me an email (the address I gave out during the session) and I'll get them to you or post them on here.

A couple of the other presenters who spoke about SQL Server Integration Services and SQL Server 2005 and I have agreed in principle to put together a full-day track covering databases, data warehousing, beginning transformations, and advanced transformations, concluding with a wrap-up session. Hopefully we can get this put together for the Tampa code camp in July 2006.

I'm also available for any questions, just drop me a line.

Thanks to the Code Camp staff for putting on an excellent event. The DeVry location was perfect and the facilities couldn't have been better. Tarpon Bend was an excellent restaurant. You guys sure know how to put on a heck of a party.

With a couple exceptions I'd do it again. First, I wouldn't stay at the Wellesley. Their pillows are thinner than my laptop. (Luckily I keep a good pillow in my truck.) I paid $126 a night for a glorified Super 8. Second, I'll try and ride down there with some people. All of my peers are very knowledgeable and generally a whole lot of fun and I would like more opportunity to speak to them in depth regarding IT issues in general.

Future (potential) speaking engagements:
------------------------------------------
Tampa Code Camp - July 2006
Jacksonville Camp - August 2006
Tallahassee Camp - Sept. 2006
B'ham (Dreamland, anyone?) - Oct 2006

Wednesday, January 25, 2006

Welcome

Welcome to The Damn Data. Funny website name, you say? I work in technology with high-volume data warehouses. Every issue I've seen is related to the data. We're storing data, transforming data, loading data, reporting on data, massaging data, etc. If we do not have accurate data at the end of the day, our processes and the hard work I've put in for slave wages are moot!

The purpose of this blog is to discuss data issues. Data quality, databases, data designs, metadata, well, you get it. If it involves data, we will discuss it here. Please feel free to email or post comments.

Enjoy; now let's talk about the damn data.