Tuesday, December 16, 2008

Extracting SQL in Data Warehouses

In working with potential clients, I often get the following question in tech interviews, and the question is "what is the best way to extract data from a data source?". To use the name of Andy Warren's blog, the answer is "It Depends".

I was asked this specific question today and gave the response above. Knowing that's an incomplete answer, I elaborated. The response was something like "It depends on the machines that are running both the database and the ETL process."

It goes something like this. Let's say our source system is an order entry system that is a mainframe DB2 system, and our data warehouse is a Windows-based SSIS implementation running 16 dual-core processors with 32 GB of RAM. We can run the extraction a couple of different ways, including:
- Perform a simple select statement and pull all rows out of the database on the source
- Perform a simple select statement and pull rows with a constraint
- Run a compplex query on the database engine to pull the minimum amount of data.

The key differentiator in this dilemna is how much data do we need? I've worked on ETL processes that pulled 20 million records in a single ETL jobstream only to use a majority for lookups and then updated only about 1 million records. I've also worked on ETL jobs that looked like the third case above, running a complex query with where clauses out the wazoo and returning just a couple rows to the ETL process. I generally try to avoid the first choice above unless we need all those rows. It's also possible and common to constrain the query and have indexes and other tweaks on the database side to speed query performance.

The way I would do it is dependent on the performance of the two machines. I always try to minimize long running network pulls, but if we run a significant query on the source system we may negatively impact their performance. Thus, it's such a localized question that is dependent on the environment.

Some good benchmark rules for extracting data are:
- Don't run queries on source systems that have negative performance impacts
- Don't have long running network connections to pull data
- Do utilize the machine with the most power for complex tasks
- Use all processor available to complete tasks faster
- Use specialized database connections (vs plain ODBC) to improve performance

Finally, always remember the correct answer is "It depends".

No comments: