Surrogate keys are a way to maintain referential integrity in an EDW (enterprise data warehouse) snowflake schema without having the database enforce foreign key constraints.
A surrogate key (SK) is simply an arbitrary number with no meaning other than to uniquely identify the row. A common way to generate one is to create a database sequence and call sequence.nextval() when inserting a row. The drawback is that making a database call to transform each row will drastically slow down high-volume processing. In IBM's WebSphere DataStage product it is very easy to create surrogate keys by using freeware functions, and at my employer we load surrogate keys in a high-performance way.
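The in-memory alternative to a per-row sequence call can be sketched in a few lines of Python (a hypothetical illustration, not SSIS or DataStage code): itertools.count gives the same next-value semantics as sequence.nextval() without a database round trip for every row.

```python
from itertools import count

# Stand-in for sequence.nextval(): each call returns the next integer,
# but from memory rather than via a per-row database round trip.
next_sk = count(start=1)

# Illustrative rows; the column names are made up for this sketch.
rows = [{"name": n} for n in ("Smith", "Jones", "Lee")]
for row in rows:
    row["customer_sk"] = next(next_sk)  # assign a unique SK to each row

print([r["customer_sk"] for r in rows])  # [1, 2, 3]
```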
Using Microsoft SQL Server Integration Services (SSIS) is a completely different story. Although the SSIS tool is very powerful, it appears to be a major oversight by Microsoft not to include a function that creates surrogate keys on the fly. While building some ETL jobs for my current client, I ran into the problem of how to create surrogate keys. Remember that the key requirement of an SK is uniqueness; indexes will be built on the SK column to allow quick retrieval of data.
I started my quest for a workaround where I always begin my searches - the Yahoo homepage. I was quickly redirected to www.sqlis.com, which recommends two methods of creating SKs: the first uses a script variable to store the SK value during processing; the second is a plug-in component built for SSIS. The component worked, but at a quick glance it didn't look like what I wanted.
So I was challenged to come up with my own method for creating an SK. This is easy the first time a table is populated: initialize a variable to 1 and increment it for each row. But what happens when daily processing ends and a second run begins? The variable is initialized back to 1, which results in duplicate SKs. My workaround is this:
1) Create a first ETL job that selects the max SK from each dimension table
2) Store this value in an ETL processing table along with the dimension name
3) When processing a dimension, pull this value into a variable, then load the data, incrementing the variable in script for each row
Below is a screenshot of what my seed job looks like. Note the OLE DB command that truncates the processing table each time the job is run.
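The seed job's logic can be sketched with Python's sqlite3 standing in for SQL Server (all table and column names here are made up for illustration): clear the processing table, then record the current max SK for each dimension.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# A dimension table with some rows already loaded from a prior run.
cur.execute("CREATE TABLE dim_customer (customer_sk INTEGER, name TEXT)")
cur.executemany("INSERT INTO dim_customer VALUES (?, ?)",
                [(1, "Smith"), (2, "Jones"), (3, "Lee")])

# The ETL processing table that holds the last-used SK per dimension.
cur.execute("CREATE TABLE etl_max_sk (dimension TEXT, max_sk INTEGER)")

# Seed job: empty the processing table (DELETE stands in for the
# OLE DB truncate command), then store the current max SK.
cur.execute("DELETE FROM etl_max_sk")
cur.execute("""
    INSERT INTO etl_max_sk (dimension, max_sk)
    SELECT 'dim_customer', COALESCE(MAX(customer_sk), 0) FROM dim_customer
""")
conn.commit()

print(cur.execute("SELECT * FROM etl_max_sk").fetchall())
# [('dim_customer', 3)]
```

COALESCE covers the very first run, when the dimension table is empty and MAX() would otherwise return NULL.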
Here's a screenshot of the data grid in Toad for SQL Server (I highly recommend this product).
During processing this value is placed into a variable that is incremented, allowing for successful SK handling.
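Continuing the same sketch (same made-up tables, sqlite3 standing in for SQL Server), the dimension load pulls the stored max into a variable and increments it per row - the role the script variable plays inside the SSIS data flow:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE dim_customer (customer_sk INTEGER, name TEXT)")
cur.execute("CREATE TABLE etl_max_sk (dimension TEXT, max_sk INTEGER)")
cur.execute("INSERT INTO etl_max_sk VALUES ('dim_customer', 3)")

# Pull the stored max SK for this dimension into a variable...
sk = cur.execute(
    "SELECT max_sk FROM etl_max_sk WHERE dimension = 'dim_customer'"
).fetchone()[0]

# ...then increment it in script for every incoming row, so a second
# daily run continues from 4 instead of restarting at 1.
for name in ("Kim", "Patel"):
    sk += 1
    cur.execute("INSERT INTO dim_customer VALUES (?, ?)", (sk, name))
conn.commit()

print(cur.execute(
    "SELECT customer_sk, name FROM dim_customer ORDER BY customer_sk"
).fetchall())
# [(4, 'Kim'), (5, 'Patel')]
```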
I will submit a product suggestion to Microsoft to include a built-in SK generator in the next release. I'm willing to bet it's already on their radar.
Hope this explanation was helpful. Drop me a comment with suggestions.