Tuesday, May 22, 2007

Tampa Business Intelligence Group - Revision

The Tampa SQL BI Group meeting has been moved to June 11 to accommodate those going to TechEd. I'll see you on the 11th.

Monday, May 21, 2007

Konesans’ File Watcher Task in SSIS


Last week I was asked to help a friend redesign some of his ETL processes. It was a great exercise, and quite fun knowing that we had to have it completed the same day due to time constraints from his customers.

File watchers are programs that ‘listen’ for a file to be placed in a directory. I’ve worked with file watchers written in C and Perl, as well as manual kick-off processes. There are many ways to build a file watcher programmatically, and today we are going to look at how it is done in SSIS.
The need was to create a file watcher that waited for two files to arrive, then processed them in a loop.
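
Conceptually, the 'watching' part boils down to a simple polling loop. Here is a minimal, standalone sketch in VB.NET to make the idea concrete (the folder path, file mask, and five-second poll interval are all made-up illustration values); the Konesans task gives you this behavior as a configurable SSIS task instead of hand-rolled code.

    ' A bare-bones polling file watcher - checks a folder until a file matching
    ' the mask appears, then reports it. Illustrative only; not the Konesans task.
    Imports System.IO
    Imports System.Threading

    Module SimpleFileWatcher

        Sub Main()
            Dim watchFolder As String = "C:\Inbound"   ' hypothetical drop directory
            Dim fileMask As String = "*.induct"        ' hypothetical file mask
            Dim foundFile As String = Nothing

            ' Poll the directory until a matching file shows up.
            Do While foundFile Is Nothing
                Dim matches As String() = Directory.GetFiles(watchFolder, fileMask)
                If matches.Length > 0 Then
                    foundFile = matches(0)
                Else
                    Thread.Sleep(5000)   ' wait five seconds before checking again
                End If
            Loop

            Console.WriteLine("Found file: " & foundFile)
        End Sub

    End Module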

The file watcher task was created by Konesans – they have a good write-up on their website about the properties, so check it out for more information. This post merely shows the product in a production environment.

The task has a number of configurable properties; I’ll explain the key ones here.

1. Name = the name of the task.
2. Description = put a meaningful description here (‘watches for the files’ is not meaningful and states the obvious).
3. Filter = *.* picks up every file; *.induct would only pick up files like TEST.induct.
4. FindExistingFiles = True if there may already be files in the directory; if False, existing files are ignored.
5. OutputVariableName = the variable name we assign in the variables box below.



Then I created a package-level variable called User::FileName to store the file name. You must also point the task at a connection manager for the directory it will watch (or create a new one).


The file number variable was used for the looping mechanism – to loop until two files were loaded. Once a file was found, its name was written into the FileName variable and was available for access further down the chain.
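
For reference, here is a rough sketch of how a Script Task further down the chain might pick up the file name and bump the loop counter (SSIS 2005, VB.NET). The FileNumber name is my own label for the file number variable mentioned above, and both variables have to be listed in the Script Task's ReadOnlyVariables/ReadWriteVariables for this to work.

    ' Lives inside the Script Task's ScriptMain class.
    ' Assumes FileName is listed under ReadOnlyVariables and
    ' FileNumber (a hypothetical name for the loop counter) under ReadWriteVariables.
    Public Sub Main()

        ' The file name the File Watcher Task wrote into its output variable.
        Dim fileName As String = CStr(Dts.Variables("FileName").Value)

        ' Bump the counter the looping mechanism checks (loop until two files are handled).
        Dim fileNumber As Integer = CInt(Dts.Variables("FileNumber").Value)
        Dts.Variables("FileNumber").Value = fileNumber + 1

        ' Surface the pick-up in the package log.
        Dim fireAgain As Boolean = True
        Dts.Events.FireInformation(0, "FileWatcherDemo", "Picked up " & fileName, String.Empty, 0, fireAgain)

        Dts.TaskResult = Dts.Results.Success

    End Sub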

Here’s a screen shot of the process running: it looped twice, running the data flow tasks below each time, and the data flow inserted the file name into an audit table. Once the loop was finished, the last file inducted was processed.
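
In the package that audit insert happens inside the data flow; purely as an illustration, an equivalent insert fired straight from a Script Task might look something like the sketch below. The connection string and the dbo.FileAudit table are made-up examples.

    ' Illustrative only - the real package does this in a data flow.
    ' The connection string and dbo.FileAudit table are hypothetical.
    Public Sub Main()

        Dim fileName As String = CStr(Dts.Variables("FileName").Value)

        Using conn As New System.Data.OleDb.OleDbConnection( _
                "Provider=SQLNCLI;Data Source=(local);Initial Catalog=ETL_Audit;Integrated Security=SSPI;")
            conn.Open()

            ' OLE DB uses positional '?' placeholders rather than named parameters.
            Using cmd As New System.Data.OleDb.OleDbCommand( _
                    "INSERT INTO dbo.FileAudit (FileName, LoadDate) VALUES (?, GETDATE())", conn)
                cmd.Parameters.AddWithValue("FileName", fileName)
                cmd.ExecuteNonQuery()
            End Using
        End Using

        Dts.TaskResult = Dts.Results.Success

    End Sub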

Now you should have a better understanding of the excellent Konesans File Watcher Task and a real-world application of this tool.

Friday, May 18, 2007

Differences in ETL Tools

Here's an interesting topic: today let's look at some of the key differences between the IBM WebSphere DataStage ETL product and the Microsoft SQL Server Integration Services (SSIS) suite. This topic has been sitting in my notebook for a while, but an email I received from a code camp attendee pushed me to finally publish it.

First, DataStage (DS) has been around longer and traditionally has been more accepted within the corporate world. DS is now on version 7.5, with version 8 to be released in the fall. DS 8 will be the first all-new version in quite a while, and there are some excellent, much-needed enhancements in the product.

SSIS got its start as DTS (Data Transformation Services). DTS is a decent tool but not on the scale of DataStage, and SSIS certainly makes for a viable, and in some cases better, competitor to DataStage.

The main difference out of the box is that DS ships with connectors to many different source systems, while SSIS is more limited in the connectivity it provides, although that is going to change shortly. DS is a powerful tool that makes building simple ETL jobs very easy, using BASIC code as the supporting language. SSIS is a very powerful tool with many "stages" and options available, but in some cases I've found it harder than I'd like to create simple jobs. SSIS uses Visual Basic as the supporting language and has great support for scripts. DS has hash files as high-performance data stores, and SSIS has raw files. Both products have bulk loader capabilities, and both are designed for high-performance systems. One significant difference is that, not surprisingly, DataStage runs on both Unix and Windows, while SSIS is only available on the Windows platform.

The primary difference is cost, with DS costing 4-5x as much as SSIS for some applications. This is a key reason I believe SSIS is going to be the future of ETL tools. Not only does it provide a majority of the same functionality as DS, it costs much less. With the growing acceptance of ETL tools as business process drivers, look for SSIS to expand the market for ETL and business intelligence tools.

In short, while both DS and SSIS are very powerful, SSIS's cost advantage makes up for some of the weaknesses of the product.

Thursday, May 17, 2007

Systems Analysis and Design

Most people who follow my blog know that I have an affinity for economics and am finishing up a master's degree in Economics as we speak. One common question I receive is "how does economics relate to software engineering?" Well, that's a pretty easy one for me. Let me explain.

Our entire capitalist system is built on economics: the buying and selling of goods, creating efficient markets and mediums of exchange. Software is very similar. The goal is to create an efficient program that uses hardware to accomplish a stated goal. Economics comes into play in software: I get to use economics on a daily basis to prioritize my workload. If it's more efficient to use my time tracking down a bug that is causing an overstatement of sales, that is how I will use my time (providing immediate payoff) instead of working on a development project with a 6-month payoff. Another example is using rules to cleanse data. In our data warehouse, we often get dirty data in, and checking every column of every row is a hugely intensive process, cumulatively adding hours to our daily run times. Using economics, I've identified some rows that have only marginal value to our processing and should not be run through the cleansing, shaving some time off in the aggregate.

Now that I have gone completely off target, my elective class this summer is "Systems Analysis and Design". This class looks at building processes, data models, use cases, normalizing data, etc. – all the good stuff that I do professionally today. That's the good news: I can test my brain, add something to the class discussion, and hopefully learn a lot in the process. Interestingly enough, I took a similar undergraduate class about 10 years ago, and we used version 3 of the same book that the class is using now. Good thing I kept the old one; I have found it to be quite useful in the past couple of years.

I'm going to post regular updates on my class here and share some knowledge in the process.

Wednesday, May 16, 2007

Links on Left

I've edited the links on the left of the website to point to local user groups and SQL resources.

In addition, I've added links to four companies that deserve special recognition due to my positive experiences with them.

First, American Airlines. Their line is "We know why you fly"(SM). Why yes AA, you certainly do. Thank you for providing such excellent service on the flights I've been on in the past years and I look forward to many more years as an AA Platinum member. One day I hope to be a big shot CEO and you can bet my company will support AA if I have any say in the matter.

Sprint PCS - I've had Sprint phone service for 6 cumulative years. There was a dark 18 months where I used inferior service from another cell company, but the excellent coverage and pricing of Sprint plans is right up my alley. The PPC-6700 is an excellent phone.

Dell Computer - both of my computers are Dells and I've not had any problems in 5 years, 'nuff said. Amazon - good service and tech books at reasonable prices.

Hopefully I won't receive any hate mail requesting removal from my site, and I recommend everyone who reads my blog support the companies I have listed.

Introduction to Data Warehousing

As some will recall, I started an 8-part series titled "Introduction to Data Warehousing" last summer on JumpStartTV.com. Based on comments I've found using Google and feedback from attendees, I'm putting my presentation from Orlando Code Camp on JumpStartTV. This presentation will be a complete overview of the material minus the actual code demonstration.

I completed parts 3 & 4 of the aforementioned series tonight, so look for them to be up on JumpStart shortly.

At the Tampa code camp, I hope to do a webcast of my session and a podcast for people to access remotely at their leisure. I also have some surprises in store, so stay tuned.

Tuesday, May 15, 2007

Tampa Code Camp Correction

Keith called today and told me that the Tampa Code Camp has been moved to St. Petersburg College's EPIC Business Center on Ulmerton Road in Largo. A new venue, but it's more central to the area than USF, plus it's closer to home for me so I'm not complaining. I'm going to set up the post party at my local hangout so it should be a lot of fun. Hope to see you there.

Monday, May 14, 2007

Tampa Bay SQL BI User Group

I am speaking at the June 4th meeting of the Tampa Bay SQL BI User Group.

Topic Name: "Designing ETL Jobs: A Generalized Lesson in ETL Architecture"

Lesson Description: Learn all about using ETL to populate a data warehouse, some of the pitfalls of designing ETL jobs, and about how to design ETL jobs for maximum performance and maintainability.

During this session we will create ETL jobs in SSIS that meet business requirements and follow a standardized methodology. It should be an interesting and informative session.

For more information, see the group website: http://tampasqlbi.sqlservercentral.com/

Tuesday, May 08, 2007

Vacations


An interesting note - I am writing this post from my tent at a Florida state park, where I have come to enjoy a couple of days of solitude (yeah right) and catch up on some reading and minor work tasks. It seems to be the only place I can get anything done. Hopefully I'll get some kayaking and bass fishing in as well.


Earlier in the week I was in Puerto Rico, and I can tell you it is a great place to visit. There is nothing quite as relaxing as sitting under a palm tree on the beach listening to the waves. Sure, I could do that in Florida, but then I'd feel the pressing need to be doing things at home and visiting people, so I decided to take an island getaway and brush up on my Spanish. It was absolutely thrilling touring around San Juan, Old San Juan, Condado beach, Isla Verde, and all the other sights that PR has to offer. I'm looking forward to a return trip to enjoy some scuba diving and a rain forest tour.


Here's a picture I took in Old San Juan of the forts.

Upgrading IBM WebSphere DataStage

It's official - I have converted my employer's data warehouse to IBM WebSphere DataStage 7.5.2; we were running DataStage 7.1 previously. We use DataStage (DS) as our ETL tool, processing over 20 million records daily in a multi-terabyte data warehouse.

I'm proud to say that I led the project, presenting status reports to the CIO and vice presidents, communicating between the groups, and actually directing the install and doing a majority of the testing.

The upgrade process ran pretty smoothly, all said and done. We did encounter some issues, though, so I'll list them here in case someone else hits the same problems and somehow finds this post.

First, we are running Sun Solaris 5.8 on our current production server. This is a seven-year-old, 22-CPU machine, and needless to say, it's pretty slow. Our test environment was our brand-new Sun Solaris 10 machine with eight dual-core CPUs at 2 GHz. Therefore, we have substantially greater performance on our development machine than we do on our production machine - for now. This kind of performance disparity makes it difficult to test, because DataStage has many tunable parameters, and while I wanted to mimic the prod environment, it would almost be criminal to dumb down our development machine to that level of DS performance.

DS has a file (uvconfig) that contains the specific settings for the application. Many variables in this file (NFILES, MFILES, etc.) need to be set up to allow a high level of performance, utilizing the machine and software to their full potential. Verify that the NOFILES (number of concurrent open files) parameter in the operating system is set high (at least 2048). Then set the NFILES and TFILES parameters to at least 100-130 in the uvconfig file. Also make sure the dsenv file sources correctly and populates the library path variables.
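
Purely as an illustration, the relevant uvconfig entries end up looking something like the excerpt below, using the values mentioned above. Parameter names and defaults vary by release, so check the comments in the uvconfig file shipped with your own install.

    # Excerpt from uvconfig (under the DataStage engine home directory) - illustrative values only.
    # The OS-level NOFILES limit should already be at least 2048 before raising these.
    NFILES 130
    TFILES 130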

In the end, our DataStage upgrade went smoothly because of good coordination and testing; documenting every problem encountered during testing helped us a great deal when performing the actual install.

Now I'm looking forward to moving to DataStage 8 (Hawk) next year - the 8 release looks to have some great toolsets and improvements that will go a long way toward ensuring the data quality and functionality of ETL processes.

Tuesday, May 01, 2007

Tampa Code Camp


Save the date: the third annual Tampa Code Camp (www.tampacodecamp.com) will be held on July 14, 2007 at the Business College at the University of South Florida.

I plan on doing two sessions this year: the first will be the usual presentation on data warehousing, and the second will be an advanced data topics class, where we will talk about ETL design and architecture, database performance, tuning a data warehouse, and data modeling. As usual, the tool of choice will be SQL Server 2005.

This year I am going to videotape the presentation and post it on www.jumpstarttv.com as a training session, and we'll also be making some podcasts during the event for those who aren't able to attend.

Let me know if there is anything you would like to see covered specifically and I'll see if I can work it into the presentation. See you at the code camp.