Sunday, July 27, 2008

One, Two, or More Packages?

For today's topic we will look at a a major ETL architecure consideration: how many tasks should we have in an ETL package?

One of the beautiful things about ETL is visual data flows that really help the architect consider the best way to move and transform data. Alas, each ETL architect has a different opinion on when to split tasks out into a separate package or to put everything in a single package.

I'll use another consulting experience that I have seen - clients who want to put all kinds of data derivations in a single package. My general rule of thumb is split off packages where there is a logical split in the data - have a separate package to handle order data and sales data. Sometimes there is other reasons to have separate packages - maybe you want to have a package that loads up a SQL Server database and another package to write to a Oracle instance. Normally I would just use Expressions to keep from creating multiple jobs, but consider the case where there are two processes that are similar but not exact, where using an Expression would not be practical.

Another key consideration is for restart capability in event of failure. I don't want to design a package that loads up data but has no ability to restart in case there is a referential integrity violation in the middle of the load.

For more information, see my ETL Architecture design document and/or ETL Architecture considerations presentation I did in June 2007 on the Durable Impact website to get a better idea of some of the considerations used to determine how to break up processes into separate pacakges.

As a final note, don't have a package that looks like the one above.

No comments: