Friday, May 18, 2007

Differences in ETL Tools

Here's an interesting topic, today let's look at some of the key differences between the IBM Websphere DataStage ETL product and the Microsoft SQL Server Integration Services suite. This topic has been written in my notebook for a while but an email I received from a code camp attendee pushed me to finally publish it.

First, DataStage (DS) has been around longer and traditionally has been more accepted within the corporate world. DS is now on version 7.5 with version 8 to be released in the fall. DS 8 will be the first all new version for quite a while and there are some excellent, much needed enhancements in the product.

SSIS got it's start as DTS (Data Transformation Services). DTS is a decent tool but not on the scale of DataStage, and the SSIS product certainly makes it a viable, and in some cases, better competitor to DataStage.

Main differences in the product out of the box is that DS has connectors to many different source systems, while SSIS is limited in the provided connectivity, although that is going to change shortly. DS is a powerful tool that provides for very simple ETL building jobs, utilizing Basic code as the supporting language. SSIS is a very powerful tool with many "stages" and options available, but in some cases I've found it harder than I'd like to create simple jobs. SSIS uses Visual Basic for the supporting language, and has great support for scripts. DS has hash files as high performance data stores, and SSIS has raw files. Both products have bulk loader capabilities, and both products are designed for high performance systems. One significant difference is that, not surprisingly, DataStage works with both Unix and Windows OS while SSIS is only available on a Windows platform.

The primary difference is cost, with DS costing 4-5x as much as SSIS for some applications. This is a key reason I believe that SSIS is going to be the future of ETL tools. Not only does it do a majority of the same functionality as DS, it costs much less. With the growing acceptance of ETL tools as business process drivers, look for SSIS to expand the market for ETL and business intelligence tools.

In short, while both DS and SSIS are very powerful, SSIS's cost advantage makes up for some of the weaknesses of the product.


Vincent McBurney said...

Enjoying your damn data blogging.

DataStage also runs on Linux and on the mainframe (DataStage MVS) and on Unix System Services on the mainframe. Access to mainframe data sources is a strength.

DataStage has a true end to end parallel architecture, almost every stage in a parallel job can handle partitioned data and multiple parallel instances of that stage without much additional coding. Parallel jobs can store data in partitioned datasets for staging or reference lookups. Ab Initio was probably the first massively parallel ETL tool, DataStage and Informatica followed. Not many other tools are on the same level.

DataStage has native database connectivity and for the biggest DBMS it has enterprise parallel connectivity that can be significantly faster than native or ODBC.

What SSIS does well is price, accessibility (it's easy to download and add to SQL Server) and interaction with SQL Server objects. As is the case with a lot of Microsoft BI and information management software some of the strengths are also weaknesses if you don't run a Windows environment.

But I am a bit biased since I blog about the IBM products. You can see an interesting Forrester analysis via my Forrester ETL Wave from 2004 to 2007 post.


Wes D said...


Thanks for your input and I totally agree that DataStage is a high performance parallel environment, unfortunately at my shop we use Server edition and create our own parallelism.

I've read your posts on DSXchange and hopefully one day we'll do some work together. Write anytime.