ETL Method – Fastest Way To Get Data from DB2 to Microsoft SQL Server

For a while, I have been working on figuring out a “better” way to get data from DB2 to Microsoft SQL Server. There are many different options, approaches, and environments; this one is mine, and your mileage may vary.

Usually, when pulling data from DB2 to any Windows box, the first thing you might think of is ODBC. You can use the Microsoft DB2 driver (which works, if you are lucky enough to get it configured and working), the IBM iSeries Client Access ODBC driver (which works well), or another 3rd-party ODBC driver. Using ODBC, you can access DB2 from a ton of different clients: Excel, WinSQL, any 3rd-party SQL tool, an MSSQL linked server, SSIS, etc. ODBC connects just fine and will work for “querying” needs. Also, with the drivers you install, you can usually set up an OLE DB connection if your client supports it (SSIS, for example) and query the data using OLE DB. This works as well, but there are some caveats, which I will talk about below.

In comes SSIS, the go-to ETL tool for MSFT BI developers. You want to get data from DB2 into your SQL Server data warehouse, or wherever. You try an OLE DB connection source, but it is clunky, weird, and sometimes doesn’t work at all (PrimeOutput errors, anyone?). Even if you do manage to get OLE DB configured and working, you will probably still miss out on some performance compared to the method I am going to describe.

Back to SSIS, using ODBC. It works. You have to create an ADO.NET ODBC connection and use a DataReader source instead of an OLE DB source. Everything works fine, except one thing: it is slow! Further proof?

http://social.msdn.microsoft.com/Forums/en-US/sqlintegrationservices/thread/162e55e5-b64b-423e-94c1-dd764ca1f683

http://www.sqlteam.com/forums/topic.asp?TOPIC_ID=96977

http://social.msdn.microsoft.com/Forums/en-US/sqlintegrationservices/thread/cfade7e7-50d5-4447-9821-35c5d5ae1b66

http://www.sqlservercentral.com/Forums/Topic702042-148-1.aspx

http://www.sqlservercentral.com/Forums/Topic666993-148-1.aspx

OK, enough links. But if you do read those, you’ll see that SQL 2000 DTS is faster than SQL 2005/2008 SSIS. WTF? The best I can guess is that it is because of the .NET wrapper around ODBC; DTS uses “native” ODBC.

So, now what? Do we want to use DTS 2000? No. What to do though?

Well, after a few days of research and just exploring around, I think I have found a good answer… replace DB2 with SQL Server… just kidding. Here is what you need to do:

Install the IBM Client Access tools. There is a tool called “Data Transfer From iSeries Server”; the actual exe is "C:\Program Files\IBM\Client Access\cwbtf.exe"

[Screenshot: the Data Transfer From iSeries Server GUI]

This little tool allows you to set up data transfers from your DB2 system to multiple output choices (Display, Printer, HTML, and Text). We want to export to a text file on our filesystem. You have to set up a few options, like the filename, etc. In “Data Options” you can set up a WHERE statement, aggregates, etc.

If you output to a file, you can go into “Details” and choose a file type, etc. I use ASCII Text, and then in the “ASCII file details” I uncheck all the checkboxes. You set up your options, hit the “Transfer data from iSeries” button, and it will extract the data to the file you chose in the filename field. Pretty sweet. But this is a GUI; how can I use this tool? I am not going to run it manually. Well, you are in luck.

If you hit the “Save” button, it will save a .dtf file for you. If you open this .dtf file in a text editor, you will see all the options defined as text, in a faux-INI style. Awesome, we are getting somewhere.
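For illustration, a saved .dtf looks something along these lines when opened in Notepad. The section and key names below are made up just to show the shape, not copied from a real file; crack open your own .dtf for the actual names your Client Access version writes:

```ini
[DataTransferFrom]
HostFile = MYLIB/TESTFILE
ClientFile = C:\extracts\TESTFILE.txt

[DataOptions]
Select = *
Where = UPDDT >= 20090101
```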

Now, how do you run this from a cmd prompt? Well, we are in luck again. Dig around in C:\Program Files\IBM\Client Access and you will find a little exe called “rxferpcb”

[Screenshot: rxferpcb.exe in the Client Access folder]

What this tool allows you to do is pass in a “request” (aka a DTF file) and a userid/password for your DB2 system, and it will execute the transfer for you. Sweet!
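The call itself is just positional arguments: the .dtf request file, a user ID, and a password. Here is a minimal sketch of wrapping that call from a script; the install path is the default Client Access location, and you should check rxferpcb on your own box for the exact syntax:

```python
import subprocess

# Path to rxferpcb on a default Client Access install (adjust for yours).
RXFERPCB = r"C:\Program Files\IBM\Client Access\rxferpcb.exe"

def build_command(exe, dtf_path, userid, password):
    """Assemble the rxferpcb command line: request file, user ID, password."""
    return [exe, dtf_path, userid, password]

def run_transfer(dtf_path, userid, password, exe=RXFERPCB):
    """Execute the saved .dtf transfer request; returns the process exit code."""
    return subprocess.run(build_command(exe, dtf_path, userid, password)).returncode
```

This is essentially what the Execute Process task in the SSIS package does for you, with the exe path, .dtf path, and credentials supplied as task properties or expressions.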

Now what do we do from here?

1) Create an SSIS package

2) Create an Execute Process task; call rxferpcb and pass in your arguments.

3) Create a BULK INSERT task and load the file that the Execute Process task created. (Note: you have to create a .FMT format file for the fixed-width import. I created a .NET app that loads the FDF file (the transfer description) and auto-creates a .FMT file for me, and a SQL CREATE statement as well, saving time and tedious work.)
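I never posted that .NET app, but the output side of it is easy to sketch. Assuming you already know each column’s name and width (whether parsed out of the FDF or typed in by hand), generating a non-XML BCP format file for the fixed-width text plus a matching CREATE TABLE statement looks roughly like this (all-VARCHAR columns for simplicity; real code would map DB2 types properly):

```python
def make_fmt(columns, fmt_version="9.0"):
    """Build a non-XML BCP format file for a fixed-width text file.

    columns: list of (name, width) tuples in file order. Each row is:
    field order, host type, prefix length, data length, terminator,
    server column order, server column name, collation.
    """
    lines = [fmt_version, str(len(columns))]
    for i, (name, width) in enumerate(columns, start=1):
        # The last field's terminator consumes the row delimiter; the
        # others have no terminator because the file is fixed-width.
        term = r'"\r\n"' if i == len(columns) else '""'
        lines.append(f'{i} SQLCHAR 0 {width} {term} {i} {name} ""')
    return "\n".join(lines) + "\n"

def make_create_table(table, columns):
    """Emit a matching CREATE TABLE statement (everything as VARCHAR)."""
    cols = ",\n".join(f"    [{name}] VARCHAR({width})" for name, width in columns)
    return f"CREATE TABLE [{table}] (\n{cols}\n);"
```

Point BULK INSERT (or the SSIS Bulk Insert task) at the extract file with `FORMATFILE = '...'` and you have the load side of the package.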

Now take 2 minutes and think about how you could make everything generic/expression/variable driven, and you have yourself a sweet little SSIS package to extract any table from DB2 to text and bulk load it.

[Screenshot: the full load SSIS package]

What is so great about .DTF files is that you can modify them with a text editor, which means you can create/modify them programmatically. Think: setting WHERE statements for incremental loads, etc.
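A quick sketch of that idea: rewrite one key in the INI-style .dtf text before each run. The actual key name that holds the WHERE clause varies by Client Access version, so it is a parameter here rather than hard-coded:

```python
import re

def set_dtf_option(dtf_text, key, value):
    """Set 'key = value' in INI-style DTF text, replacing the line if the
    key already exists and appending it otherwise."""
    pattern = re.compile(rf"^{re.escape(key)}\s*=.*$", flags=re.MULTILINE)
    new_line = f"{key} = {value}"
    if pattern.search(dtf_text):
        # Use a callable so backslashes in the value aren't treated as
        # regex replacement escapes.
        return pattern.sub(lambda _: new_line, dtf_text)
    return dtf_text.rstrip("\n") + "\n" + new_line + "\n"
```

Point it at whatever key your .dtf uses for its WHERE clause, write the result back to disk, and the next rxferpcb run picks up the new predicate. That is the whole incremental-load trick.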

[Screenshot: the incremental load SSIS package]

 

You can see from the two screenshots above that that is all there is. Everything is expression/variable driven: full load and incremental load. Using nothing but .dtf files, rxferpcb, and a little .NET app I wrote to automatically create DTFs for incrementals (WHERE statements), plus truncate, delete, and BULK INSERT tasks, I can load any table from DB2 to SQL Server by setting just 3 variables in a parent package.

After you wrap your head around everything I just went over, stop and think about this: the whole DTF/Data Transfer process is exposed via a COM API, “Data Transfer Automation Objects”:

http://www-912.ibm.com/s_dir/slkbase.NSF/643d2723f2907f0b8625661300765a2a/0c637d6b03f927ff86256a710076ab22?OpenDocument

With that information at your disposal, you could do some really cool things. Why not create an SSIS source adapter that wraps that COM object, dumps the rows directly into the SSIS buffer, and then does an OLE DB insert or a bulk insert using the SQL Server destination?

I have found in my tests that I can do a full load of a table with over 100 million rows in about 6-7 hours, and 30-40 million row tables in about 4 hours (2 to extract, 2 to BULK INSERT). Again, your mileage may vary depending on the width of your table, network speed, disk I/O, etc. To compare: with ODBC, pulling and inserting just 2 million records was taking over 2 hours, and I didn’t wait around for it to finish. Pulling 2 million records with the method described in this post takes about 3-5 minutes, or less!

I know I have skimmed over most of the nitty-gritty details in this post, but I hope to convey, at a high level, that ODBC/OLE DB just aren’t as fast as the method here. I have spent a lot of time over the last few weeks comparing and contrasting performance and manageability. Now, if I could just get that DB2 server upgraded to SQL Server 2008… Happy ETL’ing!


21 thoughts on “ETL Method – Fastest Way To Get Data from DB2 to Microsoft SQL Server”

  1. @Duane – sure. When I made my packages generic (allowed them to work for any table) I created a table called TESTFILE and set it up that way, just so it works. On a normal run, I pass in the actual table name from DB2 as a variable, and it changes the names on the tasks (replaces TESTFILE with the variable) so if I log execution I can see what is doing what. On the full load package, there are 3 steps, extract, truncate, load. The truncate is truncating that table in SQL server before I do a bulk load.


1. I can’t share it exactly; that is why I detailed it in the blog post. Your environment might be set up differently than mine as well, so it’s best to follow the post and recreate it from that.


1. I know my environment is different, but I would like to see the details of how you set up the packages so I have a model to start from.


  2. I’m having some issues wrapping my head around how you handled the incremental import. Looks like you are comparing an update field which you tagged during the previous load to a date in the DB2 database. I’ve been looking through the AS400 tables which I import for a field to compare to but cannot for the life of me find one.

    I’d like to run the incrementals every 10 minutes so not sure the best way to approach this. Any insight would be greatly appreciated. Thanks!


1. @Bryan – I am using an update date field. The tables I pull have a date field, and my ETL only has to handle updates/inserts, no deletes. I take the MAX update date in my SQL table, subtract 3 days, pull everything greater than that (but just the PK columns) into a table, then do a delete by joining the PKs to the main table in SQL Server. I have found that faster than looping through each row and doing a lookup/delete. Then I grab all the columns I want from DB2 using the same logic (max update date minus 3) and insert into my table in SQL. Doing it that way, I get updates/inserts handled. You could use some kind of checksum method as well – I have blogged earlier about that, but for populating dimensions. There are other ways too, like EXCEPT in SQL 2005, etc.


  3. The output file contains fixed position columns with leading blanks on numeric fields. SQL Server format files do not provide for skipping these extra blanks. How do you handle this?


4. This is a fine article. I got it all to work, but I haven’t tried the COM object part. Have you done this? Did it create a strong performance increase?


  5. Great article Steve! I followed the instructions and managed to extract and load an 18M row table in 45 minutes compared to over 7 hours using Datareader and ODBC! Noted though that either one of the IBM or corresponding Microsoft OLEDB providers for DB2 also gives a huge performance boost compared to using ODBC (about 45% faster when I have tested), although not at all as good as the method described in this article of course :-)


  6. Have you done any comparisons to a straight insert into…select openquery? Our ETL process essentially uses the insert into…select openquery statement and it takes about 4 hours to load 35 million table rows. We’re looking for ways to reduce that time. Your opinion?


7. Install HitOLEDB drivers for AS400 instead – they are the fastest from SSIS. I tried the above technique myself, but it was still much slower than DTS. After switching to the HitOLEDB drivers, I am getting the same (or even better) speed in SSIS as in DTS.


  8. Hi,

Thanks for your post. I could pull 14.5M rows in 7 minutes; now I’ll need to test the bulk insert step. Could you help me by providing the script for generating the FMT file from the FDF file?

    Thanks,


  9. Steve — Did you ever post this code? ” I create a .NET app to load the FDF file ” I understand everything else, but am not much of a .NET coder and really don’t want to create the FDF files for 00+ tables by hand.

