Today’s topic is a deep dive into ETL architecture considerations for the IBM WebSphere DataStage product. Specifically, I am going to discuss the difference between the Lookup and Join stages in version 7.5. For those who might be confused, "version 7.5" covers 7.5.1, 7.5.2, 7.5.x2, 7.5.3, etc. I've worked with clients on different variations of the tool, but the functionality of these stages is the same.
This is a real-life case study of a problem I had designing a job for a client. The job read from a large source table, with a driving table identifying which rows were to be pulled. A generic SQL query to do the same thing would look like:
select * from BIG_TABLE where record_type in (select record_type from REF_TABLE where process_date = current date)
In DataStage I could have done a couple of different things. I could have used a single database stage to pull the rows, with the join pushed into the SQL, and gotten good performance by forcing the database engine to do the majority of the work. However, for maintenance purposes I chose to use separate stages, because the client liked having the logic visible on the palette so the people who support the jobs can easily see what is happening. Fair enough....
That left me the option of using the Join, Lookup, or Merge stages. Merge is designed to take two similar datasets (an original set and an update set) and merge them together. Since my two tables had different metadata, this was not the best tool for the job. Eliminated from consideration.
The Join stage has options to perform a left, right, inner, or full outer join. Without going into extreme detail on the four types of joins (research SQL joins if they are new to you), I could set up the driver as the left table and the reference as the right and match the rows based on the date column. Okay, so this solution would work, but it wasn't really what I wanted to do, so I went to the Lookup stage.
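To make the mechanics concrete, here is a minimal sketch of what a left join does under the hood, written in Python with hypothetical column names and data (the real Join stage sorts and partitions its inputs for you; this just illustrates the row-matching behavior):

```python
from collections import defaultdict

def left_join(left_rows, right_rows, key):
    """Emulate a left outer join: every left row survives,
    matched right columns are merged in where the key matches."""
    # Index the right input by the join key, then probe it per left row.
    index = defaultdict(list)
    for r in right_rows:
        index[r[key]].append(r)
    for l in left_rows:
        matches = index.get(l[key])
        if matches:
            for r in matches:
                yield {**l, **r}
        else:
            yield dict(l)  # unmatched left rows still come through

# Hypothetical sample data
driver = [{"k": 1, "a": "x"}, {"k": 2, "a": "y"}]
reference = [{"k": 1, "b": "z"}]
joined = list(left_join(driver, reference, "k"))
# joined: [{'k': 1, 'a': 'x', 'b': 'z'}, {'k': 2, 'a': 'y'}]
```

An inner join would simply drop the `else` branch, which is closer to what my filtering scenario needed.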
Lookup is similar to Join except that it looks up a row in the reference dataset for each row in the source dataset. This is memory intensive and should generally be avoided with large reference datasets, and I was dealing with millions of source rows. But I knew my reference dataset would have a maximum of 9 rows per run, and there is this nifty little option to have the lookup table loaded into memory. Ah ha!!! Now I could load the reference table into memory, perform a lookup against those in-memory rows, and get some great performance. For very large reference datasets a Join will use less memory, but Lookup worked perfectly for this application.
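The idea behind the in-memory lookup can be sketched in a few lines of Python. The column names and dates below are hypothetical; the point is that the tiny reference set becomes a hash set, so each of the millions of source rows costs one O(1) probe instead of a sort or a scan:

```python
def load_reference(ref_rows, process_date):
    """Build the small in-memory lookup set: the record types
    active for the given processing date (at most a handful of rows)."""
    return {r["record_type"] for r in ref_rows
            if r["process_date"] == process_date}

def lookup_filter(source_rows, ref_keys):
    """One hash probe per source row; the big input is streamed,
    never sorted or held in memory."""
    for row in source_rows:
        if row["record_type"] in ref_keys:
            yield row

# Hypothetical sample data
ref = load_reference(
    [{"record_type": "A", "process_date": "2007-06-01"},
     {"record_type": "B", "process_date": "2007-05-31"}],
    "2007-06-01",
)
matched = list(lookup_filter(
    [{"record_type": "A", "id": 1}, {"record_type": "C", "id": 2}],
    ref,
))
# matched keeps only the id=1 row
```

This is essentially the trade-off the Lookup stage makes: memory proportional to the reference set in exchange for a single streaming pass over the source.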
With some further tuning, a job that initially ran 45 minutes per run came down to 6 minutes. That is a huge performance increase for a little time spent thinking about better ways to architect the job.
Do you have a good example of where a simple ETL architecture change saved you a large amount of processing time?