Tuesday, July 29, 2008

Continued: How many packages do I need?

The other day we looked at a scenario that described the difference between one and many packages. I also put a ghoulish picture of a nasty-looking package in that post to scare you away from building one large package when organized, smaller packages are the better choice.

Let's look at the following scenario. Great Plains Foods, Inc. has hired us to create a database load process. Their requirements are that the data be loaded from flat files every day by 10 AM and that any failure generate an email to their production support group. We are to load 5 data files, and the information below shows the filenames, expected daily row counts, and destination tables.

File Name     Expected Daily Count     Target Table
Orders A      5,500                    DAILY_ORDERS
Orders B      1,200                    DAILY_ORDERS
Production    7,500                    DAILY_PRODUCTION
Employees     350                      DAILY_EMPLOYEES
Shipments     900                      DAILY_SHIPMENTS

From observing the data, we see that the Orders A and B files load into the same table. A question for the client should pop into your head at this point: "What is the difference between the Orders A and Orders B files?" Our business associate at Great Plains tells us that the files share the same layout, but A comes from computer orders and B comes from telephone orders. Now, from an architecture standpoint, how many packages would you create for this job?

I would create 4: Orders, Production, Shipments, and Employees. The Orders package will read from two files while each of the others will read from a single source file. I have seen one package load all of these tables in a single job, but that kind of architecture leaves no room to manually load a single file or to handle data-related issues and failures.
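To make the Orders design concrete, here is a minimal T-SQL sketch of what that package's core work might look like if the file loads were pushed down to the database engine. The file paths, format options, and wrapping transaction are all assumptions for illustration, not details from the actual engagement:

-- Hypothetical sketch: the Orders package loads both source files
-- into the same target table, atomically, so a partial load never
-- leaves DAILY_ORDERS half-populated.
BEGIN TRY
    BEGIN TRANSACTION;

    -- File A: computer orders (path and format assumed)
    BULK INSERT dbo.DAILY_ORDERS
    FROM 'D:\feeds\orders_a.txt'
    WITH (FIELDTERMINATOR = ',', ROWTERMINATOR = '\n', FIRSTROW = 2);

    -- File B: telephone orders, same layout, same table
    BULK INSERT dbo.DAILY_ORDERS
    FROM 'D:\feeds\orders_b.txt'
    WITH (FIELDTERMINATOR = ',', ROWTERMINATOR = '\n', FIRSTROW = 2);

    COMMIT TRANSACTION;
END TRY
BEGIN CATCH
    IF @@TRANCOUNT > 0 ROLLBACK TRANSACTION;
    DECLARE @msg NVARCHAR(2048);
    SELECT @msg = ERROR_MESSAGE();
    RAISERROR(@msg, 16, 1); -- surface the failure to the scheduler
END CATCH;

Keeping Orders as its own package means this unit can be re-run by itself when only one of the two files has a problem.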

What would you do in this scenario?

Monday, July 28, 2008

Package Failure Notification

One way to guarantee disagreement among a room of ETL developers is to begin a discussion on how notifications should be sent when a package fails - and that contentious topic is the subject of our discussion today.

When I'm working with clients designing ETL jobs, one of the first questions I ask from my fact-finding checklist is "What are the package failure notification requirements?" The answer varies not only by client but also among team members within the same organization, and this simple question often has to be escalated to a senior manager to get a definitive answer. Developers often want to be notified by email, preferably the next day, while team leads and managers want people notified by pager or phone call when the package performs a critical data task. I've even found developers on the same team split between wanting a next-day email and wanting an immediate phone call from operations control, even at 2 AM.

There has to be a middle ground in notifications. When business-critical metrics are at risk of not being met, immediate notification is a reasonable requirement. If no one will be looking at the data until Monday and it's Wednesday, an email should be sufficient. Thankfully, SSIS has a Send Mail task that can be used to generate emails for failures or abnormal operation. A good example of using the Send Mail task is when a process runs but only processes 10,000 records when the daily average is 100,000. In this case, a proactive email can be sent to the developers to do some research before the business metric is published incorrectly and senior management notices. DataStage 7.x also has notification capabilities via the Notification Activity in a Sequencer job.
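As a rough illustration of that proactive check, here is a minimal T-SQL sketch using SQL Server's Database Mail procedure; the table, date column, threshold, mail profile, and recipient are all hypothetical:

-- Hypothetical sketch: alert the team when today's load volume falls
-- far below the daily average, before anyone publishes the metric.
DECLARE @todays_count INT;

SELECT @todays_count = COUNT(*)
FROM dbo.DAILY_ORDERS
WHERE load_date >= DATEADD(DAY, DATEDIFF(DAY, 0, GETDATE()), 0); -- since midnight

IF @todays_count < 50000  -- half of the assumed 100,000 daily average
BEGIN
    EXEC msdb.dbo.sp_send_dbmail
        @profile_name = 'ETL Alerts',            -- assumed mail profile
        @recipients   = 'etl-team@example.com',  -- assumed distribution list
        @subject      = 'Daily load volume abnormally low',
        @body         = 'Row count is under half the daily average; please investigate.';
END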

The decision on notification activities should be reached with as much consensus as possible across an organization, and thankfully the major ETL players in the market make it easy to build notification processes into ETL jobs.

Tomorrow we will do a deeper dive into yesterday's topic of splitting up ETL jobs, so stay tuned.

Sunday, July 27, 2008

One, Two, or More Packages?


For today's topic we will look at a major ETL architecture consideration: how many tasks should we have in an ETL package?

One of the beautiful things about ETL tools is the visual data flow, which really helps the architect consider the best way to move and transform data. Alas, every ETL architect has a different opinion on when to split tasks out into a separate package and when to put everything in a single package.

I'll use another consulting experience I have seen firsthand - clients who want to put all kinds of data derivations in a single package. My general rule of thumb is to split off packages where there is a logical split in the data - for example, separate packages to handle order data and sales data. Sometimes there are other reasons to have separate packages - maybe you want one package that loads a SQL Server database and another that writes to an Oracle instance. Normally I would just use Expressions to keep from creating multiple jobs, but consider the case where two processes are similar but not identical, where using an Expression would not be practical.

Another key consideration is restart capability in the event of failure. I don't want to design a package that loads data but has no ability to restart if there is a referential integrity violation in the middle of the load.
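As a sketch of what that restartability can look like at the database layer (a minimal example with assumed table and column names, not a prescription), stage the rows first and apply them in one transaction, so a failure mid-load leaves the target untouched and the package can simply be re-run:

-- Hypothetical sketch: apply staged rows transactionally so an RI
-- violation mid-load rolls everything back cleanly.
BEGIN TRY
    BEGIN TRANSACTION;

    INSERT INTO dbo.ORDERS (order_id, customer_id, order_amt)
    SELECT s.order_id, s.customer_id, s.order_amt
    FROM dbo.STG_ORDERS AS s
    WHERE NOT EXISTS (SELECT 1
                      FROM dbo.ORDERS AS o
                      WHERE o.order_id = s.order_id); -- keeps re-runs idempotent

    COMMIT TRANSACTION;
END TRY
BEGIN CATCH
    IF @@TRANCOUNT > 0 ROLLBACK TRANSACTION;
    -- Nothing was partially applied, so the job can restart from the top.
    DECLARE @msg NVARCHAR(2048);
    SELECT @msg = ERROR_MESSAGE();
    RAISERROR(@msg, 16, 1);
END CATCH;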

For more information, see my ETL Architecture design document and/or the ETL Architecture Considerations presentation I gave in June 2007, both on the Durable Impact website, to get a better idea of some of the considerations used to determine how to break up processes into separate packages.

As a final note, don't have a package that looks like the one above.


Saturday, July 26, 2008

Laptop as primary work computer?

Today I'll diverge a bit from talking about software and discuss hardware, and a topic everyone has a strong opinion about: does a laptop suffice as a primary work computer for us techies?

For my situation the answer is "yes, but." The "but" includes a nice external widescreen LCD monitor plus an external keyboard and mouse at the house. My personal config is a 24" Dell LCD, and ideally when I get a bigger desk I'll add a second 24" LCD, though two monitors can be problematic with laptops, which generally have a single VGA output. Some of the nicer laptops have both VGA and DVI, but more on that later.

I prefer to use laptops because * ding ding ding * I bet you can guess the answer - they are portable. The newer generation of laptops is sufficiently powerful to run every piece of software I need, and it keeps me from having a server stack in my office, which is good because I sold those servers last year.

This topic came up today because I am looking for another DTR laptop (DTR = desktop replacement). My choices are the Dell Studio line and the HP dv7t. I'm not overly excited about either one because both cut important things like DVI output and high-end speakers and removed the option for matte screens, which is nice in an office setting. I don't buy the business machines because they do not offer a 17" screen, for whatever reason.

My ultimate goal is to have two laptops ready to go at any time: one set up with the Microsoft BI suite and another set up with the IBM BI suite. Once I choose a computer and get them set up, I will write a review.

Do you use a laptop in place of a desktop?

Friday, July 25, 2008

DataStage 8 New Features

I have recently taken a good look at the IBM WebSphere DataStage version 8 product offering. v8 integrates very tightly with the WebSphere product line and is the first true IBM iteration of the former Ascential tool suite - and it shows. Installation on Windows Server 2003 was a really painful process. First, I read the entire manual, which has a lot of steps to create users and set permissions on files and folders. Then I installed the software and had repeated failures when configuring the WebSphere component. The installation was failing for unknown reasons, and each time I had to remove DB2 and WebSphere and restart the process. Finally I was able to complete the installation and get the software running.

v8 installs both DB2 and WebSphere server, but it's possible to use SQL Server or Oracle for the metadata server instance. DB2 is not a bad database for those who haven't used it, but there are definitely some quirks (especially with dates and times) for those used to the other platforms. For right now I am using v8 with DB2 to keep things simple, but I will demonstrate the v8 platform using SQL Server, DB2, and Oracle as the metadata repository to test the performance of each platform.
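As a quick illustration of the kind of dialect difference I mean, here is how something as simple as "give me the current date" varies across the three platforms:

-- DB2: special registers, and every query needs a FROM, hence SYSDUMMY1
SELECT CURRENT DATE FROM SYSIBM.SYSDUMMY1;

-- SQL Server: a function call; no FROM clause required
SELECT GETDATE();

-- Oracle: the SYSDATE pseudo-column selected from DUAL
SELECT SYSDATE FROM DUAL;

Date arithmetic and formatting diverge even further, which is where people used to one platform tend to get tripped up.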

After I finally got the software installed and rebooted the machine in preparation for taking screenshots for this blog, I was unable to get the WebSphere server running, which means I cannot log into the Information Server portal. Once I troubleshoot this issue I'll post a fix, as I suspect I'm not the only person to have this problem.

Thursday, July 24, 2008

Business Intelligence Development Studio, or BIDS for Short

Part of the allure of the Microsoft business intelligence platform is BIDS, the development environment used across the platform. BIDS is built directly on Visual Studio and makes for an easy transition for anyone familiar with VS. I actually like BIDS, although I do have some gripes, and I understand there are a number of changes in store for BIDS 2008 that I will detail in a later post.

One of my main beefs with BIDS is that it's sometimes overly complicated to do simple things. Microsoft has always been known for giving developers full control of the environment, and BIDS is no exception, almost to the point that there are so many options in the tool that it's easy to get overwhelmed. I always put the options I use most often on the toolbar so I don't have to wade through menu after menu. It would be great if Microsoft used the ribbon in BIDS 2008 as they have in Office 2007. The first time I saw the ribbon I was intrigued; I soon became a fan and don't even like to use older versions of Office anymore.

The main positive about BIDS is the full control over nearly every aspect of the environment. A key annoyance is all the docked windows, though it's really a love/hate relationship for me: I love having access to them and like being able to hide them, but I find myself constantly pinning and unpinning windows when working on my laptop. It's not so bad when I'm working on the 24" LCD in the office.

What changes are in store for BIDS 2008? You'll have to wait a couple of weeks to find out - I am giving a presentation on August 11 about the features of SQL Server 2008 that you should find interesting.

Wednesday, July 23, 2008

The Importance of Data Quality

The key concern in data warehousing is quality data. If you have seen me present, you probably won't forget my emphasis on having data quality mechanisms built into ETL processes. Let me tell you about a scenario I dealt with at a client recently.

The client wanted an ETL process built that would do the following:
1. Pull all records from Table A with an insert date of yesterday into a flat file
2. Delete all records from Table A
3. Run this process daily

Okay, so this is simple enough. I set up an ETL job that read the rows from the table and put them into a flat file. I then prepared a second job that would do the delete using the same date criteria. The purpose of doing this in two jobs was to ensure the backup file was created and populated before any delete process executed.
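Here is a hedged T-SQL sketch of the delete job's shape (the archive job used the same predicate); the table and column names are hypothetical, and the date math is the classic midnight-truncation idiom:

-- Hypothetical sketch: delete exactly the 'yesterday' window that the
-- archive job extracted. Names and schema are assumptions.
DECLARE @day_start DATETIME, @day_end DATETIME;
SET @day_start = DATEADD(DAY, DATEDIFF(DAY, 0, GETDATE()) - 1, 0); -- yesterday 00:00
SET @day_end   = DATEADD(DAY, 1, @day_start);                      -- today 00:00

DELETE FROM dbo.TABLE_A
WHERE insert_date >= @day_start
  AND insert_date <  @day_end;

-- The fix described below amounted to subtracting 2 instead of 1, so
-- the source's 24-hour load had fully landed before the delete ran.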

I tested this process and the counts looked great. I implemented it in production, checked the counts of the two processes in the morning, and they were different. I thought something must be horribly wrong because I had tested this process. After doing a little research with the client, it was determined that Table A was being updated by a process running on a 24-hour window, so by the time my process ran in the morning, not all of the records from the previous night had been loaded, resulting in records being deleted that were not in the backup file.

The solution was to set the date back to 2 days prior and, voila, the record counts between archive and delete matched up. The records that had been deleted without being archived turned out not to be a problem, because the information was replicated on the source system.
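The reconciliation itself can be as simple as the following sketch (again with assumed names, including the ETL_RUN_LOG control table): count what the delete is about to remove, compare it to what the archive job reported writing, and abort on a mismatch:

-- Hypothetical sketch: guard the delete with a record-count check.
DECLARE @rows_archived INT, @rows_to_delete INT;
DECLARE @day_start DATETIME, @day_end DATETIME;
SET @day_start = DATEADD(DAY, DATEDIFF(DAY, 0, GETDATE()) - 2, 0);
SET @day_end   = DATEADD(DAY, 1, @day_start);

-- In practice this count would be logged by the archive job.
SELECT @rows_archived = rows_written
FROM dbo.ETL_RUN_LOG
WHERE job_name = 'ARCHIVE_TABLE_A' AND run_date = @day_start;

SELECT @rows_to_delete = COUNT(*)
FROM dbo.TABLE_A
WHERE insert_date >= @day_start AND insert_date < @day_end;

IF @rows_archived IS NULL OR @rows_archived <> @rows_to_delete
    RAISERROR('Archive/delete counts differ; delete aborted.', 16, 1);
ELSE
    DELETE FROM dbo.TABLE_A
    WHERE insert_date >= @day_start AND insert_date < @day_end;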

The purpose of this posting is to reiterate that simple checks of record counts between processes (especially delete processes) are a pressing and fundamental part of data quality. As data systems architects and developers, it is our responsibility to build and check processes to ensure requirements are met. Had I never checked the process after moving it into production, the issue might never have been discovered. Checking, though, is part of the data quality stewardship that is so important in today's business intelligence systems.

Friday, July 11, 2008

DataStage Lookup vs. Join

Today's topic is a deep dive into ETL architecture considerations for the IBM WebSphere DataStage product. Specifically, I am going to discuss the difference between the Lookup and Join stages in version 7.5. For those who might be confused, "version 7.5" here covers 7.5.1, 7.5.2, 7.5x2, and so on. I've worked with clients on different variations of the tool, but the functionality of these stages is the same.

This is a real-life case study of a problem I had designing a job for a client. The job read from a large source table, with a driving table containing the row types to be pulled. A generic SQL query to do the same thing would look like this:

select * from BIG_TABLE where record_type in (select record_type from REF_TABLE where process_date = current date)

In DataStage I could have done a couple of different things. I could have used a database stage to pull the rows with the join in a single stage and gotten good performance by forcing the DB engine to do the majority of the work. However, for maintenance purposes I chose to use separate stages, because the client liked having the stages visible on the canvas so the people who support the jobs can easily see what is happening. Fair enough.

Thus I had the option of using the Join, Lookup, or Merge stages. Merge is designed to take two similar datasets (a master set and update sets) and merge them together. Since my two tables had different metadata, it was not the best tool for the job. Eliminated from consideration.

The Join stage has options to perform a left, right, inner, or full outer join. Without going into extreme detail on the four types of joins (research SQL joins if you need a refresher), I could set up the driver as the left table, the reference as the right, and match the rows on the record type key. This solution would work, but it wasn't really what I wanted to do, so I moved on to the Lookup stage.

Lookup is similar to Join except that it looks up a row in the reference dataset for each row in the source dataset. This is memory intensive because the reference data is held in memory, so it should be avoided with large reference datasets - and I was dealing with millions of source rows. However, I knew my reference dataset would have a maximum of 9 rows per run, and there is a nifty little option to have the lookup table loaded into memory. Aha! Now I could load the reference table into memory, perform the lookups against the rows in memory, and get great performance. For very large reference datasets the Join will use less memory, but the Lookup worked perfectly for this application.
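For readers who think in SQL, the Join-stage design corresponds roughly to the following (a sketch mirroring the earlier generic query, not the client's actual tables), while the Lookup stage behaves more like probing a small in-memory reference for each source row:

-- Rough SQL equivalent of the Join-stage design: driver and source
-- both treated as relational inputs, matched on the key column.
SELECT b.*
FROM BIG_TABLE b
INNER JOIN REF_TABLE r
        ON r.record_type = b.record_type
WHERE r.process_date = CURRENT DATE;

With only 9 reference rows held in memory, the millions of source rows stream through with one cheap probe each, which is where the speedup came from.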

With some further tuning of a job that initially ran 45 minutes per run, I was able to get it down to 6 minutes. That's a huge performance increase for a little time spent thinking about better ways to architect the job.

Do you have a good example of where a simple ETL architecture change saved you a large amount of processing time?

Tuesday, July 01, 2008

The Great Debate: Employee vs. Consultant, Part 3

This is part 3 of my series on the debate between being a W2 employee and a 1099 contractor as it relates to the technology world. In my previous posts, I concentrated on the following points:

- There is too much focus on provided benefits and no easy way to attach a dollar value to them

- Without a dollar value on benefits, it's hard to determine your real wage

- It's hard to make a good decision without knowing your real wage

My earlier conclusion was that many people would be better off as contractors instead of employees if they truly looked at the dollar value of their benefits, but this summation comes with a major caveat: the person should be entrepreneurial in nature and willing to accept the uncertainty of continued employment. Working as a W2 employee, you have some protections against zealous bosses and other events outside your control that you do not have as a 1099. Furthermore, many companies have a vested interest in promoting a stable workforce for support purposes, and having a staff of consultants who come and go does not help with that goal. This is a key concern of many hiring managers I've talked to - they worry they will get a 1099 who will come in, develop some stuff, and bail quickly to make $10 more an hour on another project. It's a valid concern because this has happened a lot in the past, as has the consultant who walks away without documenting what he or she did, leaving the rest of the team in a lurch. These kinds of situations can be rectified by open, honest communication between the supervising manager and the consultant, but the key point to remember as a consultant is that unhappy clients do not result in additional work.

Some of the feedback I received from parts 1 & 2 revolved around the uncertainty (risk) factor. Using economic techniques, risk can be modeled if you know the probability of each outcome, but there is no real way to know the outcomes in most job situations, so it's best to plan on billing the client a rate that takes your desired risk premium into account. For example, let's look at a DBA who can work as a W2 for $80,000 or as a 1099 for an hourly wage. An $80,000 salary works out to roughly $38/hr, and the billing rate also has to cover benefits, overhead, and unbillable time before any premium goes on top. This DBA decides that he needs a risk premium of $50/hr to do the job, putting his hourly rate at $150/hr. This is a pretty high rate for a DBA, so unless special skills are involved, it's not likely this DBA will find many clients, and he should probably stick to a W2 job. Another example (quite common) is the consultant who likes to work 6-9 months a year with some downtime to travel the world or do whatever. In this case, the risk premium would be substantially less because the person is hoping to be between projects for a time. Someone wanting to work 6-9 months a year is definitely not a candidate for a W2 position.

The biggest and most desired benefit is health insurance. If I were to put on my economics hat, I could come up with 10 reasons that employers should NOT provide health insurance (and a few reasons they should, mainly adverse selection), but since this is a technology blog, let's accept that most employers make health insurance the key benefit in their compensation package. My first W2 job provided 'free' health insurance, and it was a good plan. I've seen good plans and I've seen bad plans, but this is definitely a consideration for anyone who has a family to support. I'm not going to say a whole lot more about this benefit except this: look at the plan closely, try to attach a free-market price to it, and use that in your job search. Two plans with a similar bi-weekly deduction may have drastically different out-of-pocket expenses or doctor networks. Caveat emptor!

By and large, the biggest benefit I see for companies is the ability to have flexible staffing. Having a flexible, motivated workforce is key to economic growth. This is why you will mostly see large development projects staffed with 1099s who have the specialized knowledge to perform the work, leaving post-implementation support to the employee base. The best projects I've worked on have done exactly that - design, develop, and build the solution, then train the employees and move on. These projects are very rewarding and a key part of the reason I enjoy working as a 1099.

In summary, working as a W2 employee versus a 1099 worker can be drastically different depending on your personal values and your preferred wage. I like to manage my own career, benefits, and time off rather than follow a corporate template. Good luck in whatever YOU decide to do, but I hope this candid discussion has informed you of some of the considerations.