Sunday, February 19, 2006

DW Load Methods

A data warehouse is a warehouse that contains data. Sounds funny doesn't it? Imagine a warehouse as a storage structure that may be physical or logical. In the DW it's both...logical as in the data model represenatation and physical as to the hard disks and other computer equipment that support the data warehouse.

The first question should be "How does the data get there?" Excellent question and I'll cover some different load methods in my column today.

Here are three key areas to keep in mind....

Availability - will someone create a report with transactions completed 20 minutes ago or will it be with yesterday's data?

ETL Process - does our ETL process handle constant loads or is it a once a day process?

Hardware limitations - do we have the hardware resources to allow people to access data and create reports while we are inserting/updating data?

Types of ETL Processing:

1. Batch load - load all data once a day. At the company I work with we do a batch load process that begins at midnight and hopefully a majority is complete by 7 am. (pretty common)

2. Trickle feed - rows are loaded in the EDW as they are received in the collector (the hardest)

3. Microbatch - small batches of data are processed on either a time threshold (every 2 hours) or a row threshold (20,000 records in the collector).

There you have it.....a brief overview of common DW load methods. In a future column we'll further explore each of these methods.

DW Resources

Here's some Data warehouse resources that I find helpful. I'm going to put this on the sidebar but until I take the time to figure that out this will have to do.

General DW information and white papers:
The Data Warehouse Information Center

The Data Warehouse Institute professional EDW organization. I am a member of this group and they put on quarterly conferences. The fall one is always in Orlando and I highly recommend that you attend if you get a chance. This is the conference I attended in Spring 2005.
The Data Warehouse Institute

There are two schools of thought in the data warehouse world. Two of the original fathers of the DW are Bill Inmon and Ralph Kimball. They have a similar goal but different approach and I'll try to write a little about their differences in a future column. For now though here's their websites. I've attended a Ralph Kimball presentation and it was very good; he's a great speaker.

The Kimball Group

Bill Inmon web site:

Inmon Data Systems

Last but not least, here's some books I highly recommend. I have all the books listed below and will be happy to discuss them if you email me. For those who attended my class you have my email but I also have an email (thedamndata) at Yahoo. Figure out the syntax!





Thursday, February 16, 2006

ETL Methodology

I want to take a little time to discuss something that is extremely important in data warehousing: ETL methodology.

ETL methodology can best be defined as the way ETL is done in an environment.

Everything regarding ETL has a process. Here's some pieces:
- Table naming conventions
- Job flow (source, transform, load)
- Job parameters
- Job options (Resurrection, pause, kill)
- General ETL job layout (sourcing on top, transform in middle, loading on bottom)
- Use of Staging areas, temporary files, and temporary tables

I'm developing a complete methodology for SSIS (SQL Server Integration Systems). When complete my methodology will be able to be applied into any environment at any company with only minor modifications.

Once I get a couple of things done around the house I'll start working on my ETL Methodology document. If anyone would like to review it let me know; I could use a couple reviewers.

I'd also like to give a shoutout to Ken Bland. Ken is a very talented ETL architect and developer who has jointly developed a very popular methodology for the IBM WebSphere DataStage(TM) product. Over 90% of my EDW knowledge was a direct result of working under Ken's guidance. His website is listed below if you'd like to take a look at his booming consulting business. Best of luck Ken.

KBA Associates

Code Camp Presentation Materials, Part II

Here's the materials you've been waiting for:

Slides (with phone/email addy removed - you've got that info from the presentation)

Here's the Solution (I'm working on making this better so give it some time)

File Briefcase on Yahoo

Both of my economics tests are over for this month. I did better than I thought on both of them so my quote of the day is "it's better to be overprepared than underprepared"

Thanks for the well wishes....

Tuesday, February 07, 2006

Code Camp Presentation Materials

I had a discussion with a confidant today regarding whether or not to post my Code camp presentation materials (slide and code). My main concern is this: I have a distinct name. I prefer to remain anon. on the internet for multiple reasons (I'm a private person anyway), but my confidant "Guy" convinced me that it would be for the common good of the community and that nothing is anonymous anyway. He made a good point so I will post it.

First, I had a killer test in Economics class tonight and I have another one on Thursday night. I will post the materials sometime later this week and update this posting. Wish me luck on the tests!

I'm also preparing a booklist of recommended reading and websites so I'll get that up as well.

Sunday, February 05, 2006

Simple Performance Testing

While preparing my demonstration for the code camp listed below, I came across something interesting. The SSIS package read rows from SQL Server 2005 about twice as fast as from Oracle 10g.

Let me start off by describing my environment. I have a Dell Inspiron 9300 laptop with 2Ghz processor, 1 GB ram, 80 GB hd. I have a full install of Oracle 10g Standard edition and a full install of SQL Server 2005 standard edition. I created the same tables in both systems, loaded them with the same count of rows (the same data) totaling about 9.5 million rows.

My data flow tasks simply used the OLE stage to pull data out of SQL Server and Oracle and write directly to a flat file stored on my C:\ drive. (I didn't have them both running at once, I ran the oracle job first, and then the SQL server job next). Neither table was indexed.

After 5 seconds the Oracle job had pulled ~300k rows while SQL server had pulled over 700k rows. There was obvious latency as the output file was written to disk.

Someone mentioned to me that the OLE stage in SSIS has an error communicating with Oracle. I will research this more and find out what effect, if any it has on my test.

After I determine if the OLE stage is affecting the performance, I'll run actual tests with a stopwatch and post the full results on here.

SQL Server 2005 is looking very promising as a key tool for a datawarehouse.

South Florida Code Camp

Yesterday I spoke at the south Florida code camp on data warehousing and SQL Server 2005. The session seemed to go well; I received a lot of positive comments and would appreciate more constructive comments (positive or negative).

I have not posted my presentation or code examples online, but if there is demand to do so I don't have a problem with it. Just leave a comment or shoot me an email (the address I gave out during the session) and I'll get them to you or post them on here.

A couple of the other presenters who spoke about SQL Server Integration Systems and SQL Server 2005 and I have agreed in principle to put together a full day track from start to finish covering databases, datawarehousing, beginning transformations, advanced transformations, and then concluding with a wrapping up session. Hopefully we can get this put together for the Tampa code camp in July 2006.

I'm also available for any questions, just drop me a line.

Thanks to the Code Camp staff for putting on an excellent event. The DeVry location was perfect and the facilities couldn't have been better. Tarpon Bend was an excellent restaurant. You guys sure know how to put on a heck of a party.

With a couple exceptions I'd do it again. First, I wouldn't stay at the Wellesley. Their pillows are thinner than my laptop. (Luckily I keep a good pillow in my truck.) I paid $126 a night for a glorified Super 8. Second, I'll try and ride down there with some people. All of my peers are very knowledgeable and generally a whole lot of fun and I would like more opportunity to speak to them in depth regarding IT issues in general.

Future (potential) speaking engagements:
------------------------------------------
Tampa Code Camp - July 2006
Jacksonville Camp - August 2006
Tallahassee Camp - Sept. 2006
B'ham (Dreamland, anyone?) - Oct 2006