Yesterday I wrote about the arduous task of installing IBM Websphere DataStage. I mentioned that DataStage has been the tool of choice for high-performance ETL, and I wanted to elaborate on this topic today.
In my talks that I do at Code camps, I always take a minute or two to talk about ETL tools. This is a very important part of data warehousing, perhaps the most important part (aside from the data). Too often, the ETL tool choice is made after hearing sales pitches but before really studying the differences in the tools.
There are three main tools on the market: Informatica, DataStage, and SSIS. I haven't directly worked from Informatica, but I'm told that it's similar to DataStage without the hashed files (Server edition). DataStage has some very, very high performance characteristics when using Enterprise edition with large data volumes on a partitioned tables/database design.
However, Microsoft SSIS is really coming up on the inside performance-wise. The 2005 version was very good and offered bulk loaders to load data, and the 2007 release of the native connection packages for different databases is a huge step in this direction. The 2008 version of SSIS is even better and has the cached lookup, which is a big leap for performance as well.
I've written before (in 2006) about some misses in SSIS such as the lack of an easy way to create surrogate keys. I know it can be done using scripting, but there isn't a transform function to do this (or use sequences/identity columns). However, these small misses shouldn't be enough to not consider the tool.
I believe that SSIS is the future of ETL as far as a platform because the price point and the features combine to create a powerful platform. As SSIS grows, other tools will become obsolete and SSIS will take a larger and larger portion of the ETL product market.