Tuesday, July 11, 2006

Surrogate Keys in SSIS

Surrogate keys are a way to build in referential integrity within a EDW environment snowflake schema without having the database enforce foreign key constraints.

A surrogate key is merely a random number that has no value other than to uniquely identify the row. A good example of a surrogate key (SK) is creating a sequence and having the sequence.nextval() routine called when inserting a row. This introduces the problem of making a DB call to transform a row which will drastically slow down high volumn processing. In IBM Websphere DataStage product it is very easy to create surrogate keys by using freeware functions. At my employer we load surrogate keys in a high performance method.

Using Microsoft SQL Server Intergration Services (SSIS) is a completely different story. Although the SSIS tool is very powerful, it appears to be a major oversight by Microsoft to not include a function that creates surrogate keys on the fly. In building some ETL jobs for my current client I ran into the problem of how to create surrogate keys. Remember that a key idea of a SK is for it to be unique. Indexes will be built on the SK field to allow quick retrieval of data.

I started out my quest to develop a workaround where I always begin my searches - the Yahoo homepage. I was quickly redirected to www.sqlis.com which recommends two methods of creating SKs. The first method requires using a script variable to store the SK value during processing; the second method is a plug-in component that was created for SSIS. The component worked but didn't (from a quick glance) look to be what I wanted.

Therefore I was challenged to come up with my own method for creating a SK. This is very easy the first time a table is populated. Initialize a variable with 1 and increment. But what happens when daily processing is ended and a second run begins? The variable is then initialized back to 1 which results in duplicate SKs. My workaround is this:
1) Create the first ETL job to select the max SK in each dimension table
2) Store this value in an ETL processing table with the SK and dimension
3) When doing processing for the dimension, pull this value, initialize into a
variable, and then load, incrementing scriptally during processing.

Below is the screen shot of what my seed job looks like. Note there is a OLE DB command that truncates the table each time the job is ran.





Here's a screenshot of the data grid in Toad (for SQL Server. Highly recommend this product).





During processing this value is placed into a variable that is incremented allowing for successful SK handling.

I will submit a product suggestion to Microsoft for the next release to include a built in SK generator. I'm willing to be it's already on their radar.

Hope this explanation was helpful. Drop me a comment with suggestions.

7 comments:

Anonymous said...

Why did you search for another workaround when you have Identity columns in SQL Server? How Identity columns are not useful in your scenario? could you please elaborate?

-Chandra

Anonymous said...

i also wanted to ask the same question, why not used the Identity

Wes D said...

Sure, Identity columns are not used in an Oracle database which is the DB I am using for the project.

You are correct they would be useful with a SQL Server backend.

I wrote a post to address your question too.

Thanks for the comments.

Harsh V said...

The article below is worth reading:

http://www.sqljunkies.com/WebLog/sqlbi/archive/2005/05/30/15684.aspx

Anonymous said...

Then I would contend that it's an Oracle shortcoming. Microsoft isn't going to write anything that caters to any other RDBMS. Use the scripting component , if you can't figure it out from there then you’re not trying very hard.

Anonymous said...

This is great solution ,however I sat in a similar position abd more or less did exactly what you did ,but we have 60 different broker databases (Trading System) all feeding me source data from them thru a batch process every night ,and i wanted to do parralell loads in my ETL.I use the same package just different params ,i ran in to big concurrency issues and duplicated SK's ,had to implement a Sequence table

Anonymous said...

Indetity columns fail when using slowly changing dimensions. they o0nce again result in duplicate keys. also the integer key will be much shorter than a guid with a corrosponding space saving.