Saturday, July 15, 2006

Data for Dummies - Part 1

Thanks to all those who attended my Data for Dummies session at the Tampa code camp. I very much appreciated all the constructive comments and feedback. This post will address some of the comments.

A couple comments were directed towards the concern that the course was too advanced for an introductory class. I will take that concern into future consideration - one of the problems is that everyone has a different idea of what is introductory and what is advanced, and I have structured the session to hit on the most important parts of the datawarehouse architecture. It would be no problem to cover this topic in 8 hours.

So for the future - I will do an introduction and advanced class. It has been my intention to do this for the past couple code camps but due to the high number of speakers and the logistical nightmares this was not possible.

Another thing I will include is to build a working data warehouse for my full demonstration built around my methodology. That has been the plan for a long time but due to time constraints (grad school and work keeps me very busy), I haven't been able to complete this vision.

But all in all I hope you enjoyed the session. Thanks for the constructive and positive feedback comments. If you have any questions, feel free to email me and I will be happy to put together a quick hits presentation to send to you to fill in any thoughts you wanted but did not receive from the session.


Wednesday, July 12, 2006

High Performance Data Testing

My goal was to conduct some performance tests to determine the difference using Oracle vs. SQL Server 2005 as data feeds for SSIS transformations.

I created an identical table in both Oracle and SQL Server. The DDL is this:


I loaded the table with 9.5 million rows of unique data. I created the below job in SSIS to process the rows. The job simply reads the data through and OLE DB connection and writes it out to a flat file with no transformations.

I then fired up the Oracle 10g instance on my personal laptop. Yes it's a high performance laptop but I don't expect near the performance of a EMC disk farm. I started the job and let it run 10 seconds. The following screen shot is after 10 seconds.

My second test was to run the same job connecting to SQL Server 2005. Remember that the SQL Server 2005 table has exactly the same data. Notice the drastic difference in data speeds.

Did you notice what I noticed? SQL Server 2005 is pulling data at 10x the speed of the Oracle OLE DB connection. For further tests, I created a non-unique bitmap index in Oracle 10g on the order_dt_key column. This had no apparent effect on the reading speed.

All I can say is that I am certainly leaning towards doing more ETL development in SQL Server 2005.

Tuesday, July 11, 2006

Surrogate Keys in SSIS

Surrogate keys are a way to build in referential integrity within a EDW environment snowflake schema without having the database enforce foreign key constraints.

A surrogate key is merely a random number that has no value other than to uniquely identify the row. A good example of a surrogate key (SK) is creating a sequence and having the sequence.nextval() routine called when inserting a row. This introduces the problem of making a DB call to transform a row which will drastically slow down high volumn processing. In IBM Websphere DataStage product it is very easy to create surrogate keys by using freeware functions. At my employer we load surrogate keys in a high performance method.

Using Microsoft SQL Server Intergration Services (SSIS) is a completely different story. Although the SSIS tool is very powerful, it appears to be a major oversight by Microsoft to not include a function that creates surrogate keys on the fly. In building some ETL jobs for my current client I ran into the problem of how to create surrogate keys. Remember that a key idea of a SK is for it to be unique. Indexes will be built on the SK field to allow quick retrieval of data.

I started out my quest to develop a workaround where I always begin my searches - the Yahoo homepage. I was quickly redirected to which recommends two methods of creating SKs. The first method requires using a script variable to store the SK value during processing; the second method is a plug-in component that was created for SSIS. The component worked but didn't (from a quick glance) look to be what I wanted.

Therefore I was challenged to come up with my own method for creating a SK. This is very easy the first time a table is populated. Initialize a variable with 1 and increment. But what happens when daily processing is ended and a second run begins? The variable is then initialized back to 1 which results in duplicate SKs. My workaround is this:
1) Create the first ETL job to select the max SK in each dimension table
2) Store this value in an ETL processing table with the SK and dimension
3) When doing processing for the dimension, pull this value, initialize into a
variable, and then load, incrementing scriptally during processing.

Below is the screen shot of what my seed job looks like. Note there is a OLE DB command that truncates the table each time the job is ran.

Here's a screenshot of the data grid in Toad (for SQL Server. Highly recommend this product).

During processing this value is placed into a variable that is incremented allowing for successful SK handling.

I will submit a product suggestion to Microsoft for the next release to include a built in SK generator. I'm willing to be it's already on their radar.

Hope this explanation was helpful. Drop me a comment with suggestions.