Sunday, April 02, 2006

Data Extraction

First off, sorry about the lack of posts recently. Between dealing with a couple of projects, a home, a parental visit and the road trip that followed, and everything else, I haven't made time to write lately. Today's post will be a little short because I tore a hole in my hand today removing the oil filter on my truck.

ETL stands for Extract, Transform, and Load. Let's dive into the Extract portion of data warehousing.

Extract, as defined by the Merriam-Webster dictionary, is "to draw forth (as by research): extract data". Wow, they actually have data extraction listed in the definition (way cool). A data warehouse can either extract data from the source system itself or receive files of already-extracted data. Either way, the point is the same: input data for the warehouse must be extracted from another source, and the method is secondary to that goal.

At the project I'm working on, we have jobs that do both: some receive extracted data, and some pull the data, performing the extraction process ourselves.

Let's explore receiving first. Data is received in a flat file of pre-determined layout (or it may come in as XML or some other message, depending on the architecture you are using). The layouts are either delimited (comma, pipe, or some other delimiter) or fixed-width. Fixed-width files hark back to mainframe COBOL days, when file definitions were hardcoded into the programs themselves. At work we prefer pipe-delimited files because it's a lot easier to change the layout around and maintain a working audit and log process.
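To make that concrete, here's a minimal sketch in Python of reading one of these pipe-delimited files. The four-column layout is hypothetical, just something a source system might agree to; our real jobs run in DataStage, but the idea is the same:

    import csv

    # Hypothetical layout agreed on with the source system:
    # customer_id | name | status | updated_date
    EXPECTED_FIELDS = 4

    def read_extract(path):
        """Read a pipe-delimited extract file, yielding one record per row."""
        with open(path, newline="") as f:
            reader = csv.reader(f, delimiter="|")
            for line_num, row in enumerate(reader, start=1):
                if len(row) != EXPECTED_FIELDS:
                    # Reject malformed rows instead of loading bad data.
                    raise ValueError(
                        f"line {line_num}: expected {EXPECTED_FIELDS} "
                        f"fields, got {len(row)}"
                    )
                yield row

Notice how little the reader cares about field positions. With a fixed-width file, adding a column means recalculating every offset after it, which is exactly why we lean toward pipe-delimited.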

The other approach is to have some kind of extract job. I've seen it done both as actual DataStage jobs and as Oracle procedures. We have a couple of daily jobs that extract data directly from the source systems. The source systems have an agreed-upon time when we will connect and pull data. A little business knowledge is required here to pull the correct data, so it's important to keep in mind that performing the extraction FROM the data warehouse side shifts the responsibility for getting the correct data onto the ETL developers.
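A rough sketch of what a pull-style extract might look like, again in Python for illustration. The orders table, the last_updated column, and the DB-API connection are all assumptions made up for this example:

    from datetime import datetime

    def pull_daily_extract(conn, since: datetime):
        """Pull source rows changed since the last extract window.

        conn is any DB-API connection to the source system; the orders
        table and last_updated column are hypothetical. Parameter style
        varies by driver (qmark here; Oracle drivers use :1).
        """
        cur = conn.cursor()
        cur.execute(
            "SELECT order_id, customer_id, amount, last_updated"
            " FROM orders WHERE last_updated >= ?",
            (since,),
        )
        return cur.fetchall()

The business knowledge lives in that WHERE clause: pick the wrong window or the wrong change indicator and you quietly pull the wrong data, and nobody on the source side is going to catch it for you.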

Once the data is pulled or received from the source systems, it's placed in files or staging tables (again, depending on your specific approach). In a good system there will be an audit mechanism keeping track of what data was received and when, preferably including some kind of row count and timestamp.
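A bare-bones version of that audit step might look like the following; the etl_audit table and its columns are hypothetical, but every shop I've seen has something shaped like it:

    from datetime import datetime, timezone

    def record_receipt(conn, source_name, file_name, row_count):
        """Log one extract receipt to a hypothetical etl_audit table."""
        cur = conn.cursor()
        cur.execute(
            "INSERT INTO etl_audit (source_name, file_name, row_count,"
            " received_at) VALUES (?, ?, ?, ?)",
            (source_name, file_name, row_count, datetime.now(timezone.utc)),
        )
        conn.commit()

With that in place, reconciling a bad load becomes a quick query against etl_audit instead of an archaeology project.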

Alright folks, that's pretty much the lowdown on the extraction process. In my next post I will discuss staging areas and audit processes for extraction, and then we will move on to the transform part. Stay tuned.

As a teaser, I'm giving a presentation to a few people this week on Oracle Database performance tuning. I'll post the presentation up here and highlight a couple of areas for in-depth discussion.
