Tag Archives: SSIS

SQL Server 2012: Data Quality Services

With the release of SQL Server 2012, I am looking more into Master Data Services (MDS) and Data Quality Services (DQS). A brief overview of DQS.

You install DQS with SQL, and you have to configure it. The server configuration is a cmd line process that runs to create some databases on your server (DQS_MAIN, DQS_PROJECTS, DQS_STAGING_DATA).

I ran into one issue with the running of the configuration, not sure if this happens everywhere, I am running Windows 8, but nonetheless, I ran into. After running the tool and getting error after error, and trying as admin, etc. I dug deeper into the error message and found that I there were some security/permission issues I had to resolve. It ended up being that I had to change permissions on

C:WindowsMicrosoft.NETFramework64v4.0.30319Configmachine.config

to allow write access. Once I did that, the configuration tool worked and I could get into DQS.

DQS gives you a “Data Quality Services Client” to work with. When you open it, connect to the database where you configured the three databases I talk about above. Once you, you have 3 panes.

You have Knowledge Bases, Data Quality Projects, and Administration.

Knowledge Bases: datasets of known data that you can use in your Data Quality Projects. You get a default Knowledge Base – state names and some other data similar to that.

Data Quality Projects: Here is where the magic happens. You can choose some source data (Excel xls – xlsx wouldn’t work or SQL table) and then apply your knowledge base on it. Then you can reimport your data at the end back into SQL or export it, and update your Knowledge Base with learned values.

Administration: Not a ton of options, but you can set some thresholds, and also setup your Azure data market settings.

Azure Data Market https://datamarket.azure.com/browse/Data?Category=dqs – Lots of data you can use to combine with your Knowledge Bases. Much more here and I won’t go into detail – it could be its own post in itself.

As a test, I took an excel file, added a few records with columns first, last, city, state (I actually imported into a staging SQL table to work with it) – But in the state field I put different variations of state, WI, Wis, Wisconsin, MN, Minn, Minn., etc.

I then ran the file through creating a new data quality project and ran it against the default Knowledge Base, and it corrected the values it could. Got a weird error clicking next on the project, it seems the button is touchy. Hopefully they come out with a fix soon.

Once you build up and get your Knowledge Base stable, you can use from SSIS packages or in Master Data Services. I see many useful applications for DQS. Either around your corporate data or pulling in data from Azure data market to cleanse existing data you might have (think: looking up gender from first/last name).

This post is a brief look at DQS and how it works, but there is so much more. I hope to get more in depth in the near future.

Dev and Prod Systems, Using a HOST file to ensure your datasource is pointing the right system

In many shops, I am guessing there are multiple servers. Development, Production, maybe a Staging, etc.

With SQL Server Analysis Services (SSAS) and SQL Server Integration Services (SSIS), you set up data sources, connection to databases. In SSAS you usually have a connection which then you build your data source view off of, and in SSIS you have connections from which you push data to and pull data from.

Another thing, in SSAS you can “deploy” right from Visual Studio (BIDS). All these things have a server name. What we have run into is this:

You develop on your local machine, pointing at development server. You deploy to development, your connections are pointing to development, and everything works great. When you deploy to production (usually planned, every 2 weeks, or whatever) you deploy your stuff and what ends up happening?

In SSIS your config files should have a connection string (or however you store it) and it should point to production. But in SSAS, if you deploy from BIDS, your data source will have to change and in the cube project properties you need to change your deployment server.

I have seen countless times, a cube or a connection in SSIS without a config that is running in production, yet pointing at development. We keep our dev data as fresh or very close so sometimes we don’t even notice, but then it happens, something weird is reported and we dig into it, and we find the erroneous connection string.

Here is my solution to the problem:

Developers – go to C:WindowsSystem32DriversEtc and open your Hosts file with notepad or text editor. You then add a couple of entries

#production
#xxx.xxx.xxx.xxx datawarehouse

#development
yyy.yyy.yyy.yyy datawarehouse

where xxx is the ip of your production system, and yyy is the ip of your dev system. the # is the rem/comment out symbol. You can see above I have everything commented out but the line for the dev system. But notice each is pointed to “datawarehouse” so if I ping or connect to “datawarehouse” from Management Studio, or whatever, it goes to the IP I have commented out.

Now, go on to each server, but only add the line that corresponds to that server in the hosts file, or better yet just

127.0.0.1 datawarehouse

Now, when you deploy to either server, and your connections, etc are set to connect to “datawarehouse” you ensure it will always connect to the local server. Brilliant!

SSIS – Custom Control Flow Component – Execute SQL Job And Wait

Sometimes you have some pretty complex ETL’s going in SSIS, and you might have multiple projects/solutions that need to call other SSIS Packages or SQL Agent Jobs and you have a pretty big production going on. You might have an ETL solution that needs to kick off other packages, and you can either import those into your solution or call them where they lie on the file system/SQL server, etc. You might have to call some SQL agent jobs, and most often they are async calls (you dont need to wait for them to come back) and this works nicely, I do this all the time. The Execute SQL Agent Task in SSIS works nice, or you can just call the SQL statement to execute a job, either way, it kicks off the job and then just comes back successful right away, and doesn’t care if the job actually succeeds. You might want this in some scenarios, and the built in functionality works great.

But what if you want to just call an existing SQL Agent job and actually wait for the job to finish (success or failure)? There isn’t anything that I could see built in to SSIS to do this, sp_start_job is asynchronous, so you are out of luck there. I figured I could call sp_start_job, then create a for loop in SSIS and just check the status every X seconds/minutes, but I would have to either make this a package I could use everywhere or reproduce the same logic in multiple solutions, so I shied away from that solution.

What I decided to do was build a custom SSIS control flow task in .NET that will execute a SQL agent job and check the status and wait until it finishes. A disclaimer: This is going to be a lot of code :) also, it could be improved (but what couldn’t?) – this was a 1.5-2 hour experiment.

First, I created a VS2008 C# class library. I tried adding a UI to my task, but I couldn’t get it working so there is some code there for that but it’s commented out.

here is what my solution looks like:

Capture

import the correct namespaces:

using System;
using System.Collections.Generic;
using System.Text;
using Microsoft.SqlServer.Dts.Runtime;
using System.Net;
using System.Net.NetworkInformation;
using System.Xml;
using Microsoft.SqlServer.Dts.Runtime.Design;
using System.Data.SqlClient;

Next, you need to create the actual skeleton/wrapper for your component. You can see I have two properties, job name, server name. It could be expanded to have the connection string or use an existing connection in SSIS, I wasn’t that ambitious. The “Execute” method basically just calls some functions and waits for result.

namespace ExecuteSQLJobAndWaitControlTask
{

    [DtsTask(
        Description = "Execute SQL Job And Wait", 
        DisplayName = "Execute SQL Job And Wait", 
        TaskContact = "Steve Novoselac",
        TaskType = "SSIS Helper Task",
        RequiredProductLevel = DTSProductLevel.None)]
    public class ExecuteSQLJobAndWaitControlTask : Task, IDTSComponentPersist
     
    {
        private string _jobName;
        private string _serverName;

        ///

        /// The sql job name
        ///

public string JobName { get { return _jobName; } set { _jobName = value; } } ///
/// The sql server name ///

public string ServerName { get { return _serverName; } set { _serverName = value; } } public override DTSExecResult Execute(Connections connections, VariableDispenser variableDispenser, IDTSComponentEvents componentEvents, IDTSLogging log, object transaction) { try { StartJob(); System.Threading.Thread.Sleep(5000); do { System.Threading.Thread.Sleep(5000); } while (IsJobRunning()); if (DidJobSucceed()) { return DTSExecResult.Success; } else { return DTSExecResult.Failure; } } catch (Exception ex) { Console.WriteLine(ex.Message); return DTSExecResult.Failure; } } public override DTSExecResult Validate(Connections connections, VariableDispenser variableDispenser, IDTSComponentEvents componentEvents, IDTSLogging log) { if (string.IsNullOrEmpty(_serverName) || string.IsNullOrEmpty(_jobName)) { componentEvents.FireError(0, “You must specify a JobName and ServerName in the properties”, “”, “”, 0); return DTSExecResult.Failure; } else { return DTSExecResult.Success; } } void IDTSComponentPersist.LoadFromXML(System.Xml.XmlElement node, IDTSInfoEvents infoEvents) { if (node.Name != “ExecuteSQLJobAndWaitTask”) { throw new Exception(string.Format(“Unexpected task element when loading task – {0}.”, “ExecuteSQLJobAndWaitTask”)); } else { this._jobName = node.Attributes.GetNamedItem(“JobName”).Value; this._serverName = node.Attributes.GetNamedItem(“ServerName”).Value; } } void IDTSComponentPersist.SaveToXML(System.Xml.XmlDocument doc, IDTSInfoEvents infoEvents) { XmlElement taskElement = doc.CreateElement(string.Empty, “ExecuteSQLJobAndWaitTask”, string.Empty); XmlAttribute jobNameAttribute = doc.CreateAttribute(string.Empty, “JobName”, string.Empty); jobNameAttribute.Value = this._jobName.ToString(); taskElement.Attributes.Append(jobNameAttribute); XmlAttribute serverNameAttribute = doc.CreateAttribute(string.Empty, “ServerName”, string.Empty); serverNameAttribute.Value = this._serverName.ToString(); taskElement.Attributes.Append(serverNameAttribute); doc.AppendChild(taskElement); }

And then I have some helper methods, this is where the meat and potatoes are for this task. Now of course I could have the connection string once, etc. Like I said, it was a quick thing :). The heart of it is though, starting the job, checking if it is still running, and then after, if it succeeded. Pretty simple.


        private bool DidJobSucceed()
        {
            SqlConnection dbConn = new SqlConnection("Integrated Security=SSPI;Persist Security Info=False;Initial Catalog=msdb;Data Source=" + ServerName);
            SqlCommand dbCmd = new SqlCommand("exec msdb.dbo.sp_help_job @job_name = N'" + JobName + "' ;", dbConn);
            dbConn.Open();

            SqlDataReader dr = dbCmd.ExecuteReader();
            dr.Read();
            int status = Convert.ToInt32(dr["last_run_outcome"]);
            dr.Close();

            dbConn.Close();

            if (status == 1)
            {
                return true;
            }
            else
            {
                return false;
            }
        }

        private bool IsJobRunning()
        {

            SqlConnection dbConn = new SqlConnection("Integrated Security=SSPI;Persist Security Info=False;Initial Catalog=msdb;Data Source=" + ServerName);
            SqlCommand dbCmd = new SqlCommand("exec msdb.dbo.sp_help_job @job_name = N'" + JobName + "' ;", dbConn);
            dbConn.Open();

            SqlDataReader dr = dbCmd.ExecuteReader();
            dr.Read();
            int status = Convert.ToInt32(dr["current_execution_status"]);
            dr.Close();

            dbConn.Close();

            if (status == 1)
            {
                return true;
            }
            else
            {
                return false;
            }
           
        }

        private void StartJob()
        {
            SqlConnection dbConn = new SqlConnection("Integrated Security=SSPI;Persist Security Info=False;Initial Catalog=msdb;Data Source=" + ServerName);
            SqlCommand dbCmd = new SqlCommand("EXEC dbo.sp_start_job N'" + JobName + "' ;", dbConn);
            dbConn.Open();
            dbCmd.ExecuteNonQuery();
            dbConn.Close();
        }
   }

Now, to install this you need to register it in the GAC (global assembly cache), and then copy to the DTS/Tasks folder. Depending if you have VS2005 or VS2008 (or both) your gacutil path might be different.

cd
c:
cd C:Program FilesMicrosoft SDKsWindowsv6.0Abin
gacutil /uf "ExecuteSQLJobAndWaitTask"
gacutil /if "C:ProjectsSSISCustomTasksExecuteSQLJobAndWaitbinDebugExecuteSQLJobAndWaitTask.dll"
copy "C:ProjectsSSISCustomTasksExecuteSQLJobAndWaitbinDebugExecuteSQLJobAndWaitTask.dll" "C:Program FilesMicrosoft SQL Server90DTSTasks"

I have found once you have done that, you need to actually restart your SSIS service to make it work, but then you can use it in new Visual Studio SSIS packages.

Capture

Once you drag it on your package, you can set the JobName and ServerName property (from the properties window – remember, no GUI). and it should run.

Some notes:

If you kill the job, the SSIS task will fail (obviously). If you kill the SSIS package, the job will keep running. Maybe a future enhancement will be to capture the SSIS package fail/cancel and kill the job. Maybe :)

Attached is the source code for the task (Vs2008 C#) https://onedrive.live.com/redir?resid=ac05d3c752d3b50a!187358&authkey=!APJ06uEMqWiLlIM&ithint=file%2crar

This has been testing with BIDS VS2005. I take no responsibility if this blows up your system, computer, server, the world, etc.

Happy ETL’ing!

SQL 2005: SSIS – Pushing Data to MySQL using Script Component Destination

Sometimes, I just wonder why things that are useful in previous versions of things get removed. In SQL 2000, in DTS, there was an ODBC destination, you could write (insert, update, etc) to a foreign (non MSFT SQL) system easily. DB2 – no prob, MySQL – heck yea. For whatever reason, in SQL 2005 Integration Services (SSIS), that ability was removed (in SQL 2008 SSIS there is an ADO.NET Destination that can update ODBC sources, so they brought some functionality back).

I need to write to a MySQL database pulling data from a SQL Database, using SSIS 2005. What are the options? Well, the best I could come up with was a Script Component Destination in my DataFlow, this is how I did it:

1) Create a new SSIS Package, throw a DataFlow on the Control Flow, Add your connections (let’s say SourceDB_SQL – your source data, and then DestDB_MySQL – your MySQL destination, it needs to be a ADO.NET Connection, you need to install the MySQL connection – I installed this http://dev.mysql.com/downloads/connector/odbc/5.1.html)

2) In your DataFlow, create your OLEDB Source and get your query returning data, throw a Script Component on the Data Flow and make it a Destination.

ms_01

3) Open the Script Component, set your input columns up, and then your Connection Manager

ms_02

4) Open the actual script, and you just have to add a few lines of code:

' Microsoft SQL Server Integration Services user script component
' This is your new script component in Microsoft Visual Basic .NET
' ScriptMain is the entrypoint class for script components

Imports System
Imports System.Data
Imports System.Math
Imports Microsoft.SqlServer.Dts.Pipeline.Wrapper
Imports Microsoft.SqlServer.Dts.Runtime.Wrapper
Imports System.Data.Odbc

Public Class ScriptMain
    Inherits UserComponent

    Dim mySQLConn As OdbcConnection
    Dim sqlCmd As OdbcCommand

    Public Overrides Sub AcquireConnections(ByVal Transaction As Object)
        mySQLConn = CType(Me.Connections.MySQLDatabase.AcquireConnection(Nothing), OdbcConnection)
    End Sub

    Public Overrides Sub Input0_ProcessInputRow(ByVal Row As Input0Buffer)

        sqlCmd = New OdbcCommand("INSERT INTO steve_test(ShipTo, YearValue) VALUES(" & Row.ShipTo & ", '" & Row.YearValue & "')", mySQLConn)
        sqlCmd.ExecuteNonQuery()

    End Sub

    Public Overrides Sub ReleaseConnections()
        Me.Connections.MySQLDatabase.ReleaseConnection(mySQLConn)
    End Sub
End Class

Run it and you are done! Easy. Now you can write data into MySQL using SQL 2005 SSIS (or any ODBC destination if you can get it to work :))

Microsoft Business Intelligence Development in a Team Environment

Today I received an email asking to some extent best practices on development with SQL Server Integration Studio (SSIS) and Business Intelligence Developer Studio (BIDS) in a team environment. Here is part of the email:

Me and another DBA belong to the same team, we have a SQL server with SSIS running. We use the SSIS transfer data among multiple data sources. In SQL 2000 DTS, both of us can save the package on the server and open/edit it in the enterprise manager. In SQL 2005, I can see the package on server, but can’t open it directly. We came out a solution: create a shared folder on the server called ‘SSIS Projects’, both of us can access to it. We run the ‘SQL Server Business Intelligence Development Studio’ on local PC, to open the project in that shared folder. When done with the change, save the package to the SSIS server. Now, we have more than 50 packages in a project. Problem is: it’s very slow when open a project, ‘Business Intelligence Development Studio’ tends to open/verify every single package inside a project, takes up to 10 mins and getting worse. We really miss the SQL 2000 DTS, but we have to turn to SQL 2005.

  1. Are we doing the right thing? Is there any better solution for SSIS developing in a team environment?

  2. When open a project, does ‘Business Intelligence Development Studio’ has to open/verify every package?

 

This got me thinking, and I figured instead of write an email back, it would be good info for a blog post. So here is what I think and some things I have done that have worked.

First, yes, SQL 2000 DTS allows you to just edit on the server, do more than SSIS, is just way better than SSIS. Wait, what? Well, yeah some people will say that, because it does one thing that might be a little rigmarole in SSIS, but no, SQL 2000 DTS is not better than SSIS, just wanted to clear that up.

So, the is meant to be a starting point, by no means all encompassing, and as always, YMMV.

One thing that I first thought about is this: Yeah, if BI devs and SQL devs have never really worked in a team environment, developing software, how would they know what to do, or best practices? They would just go about “making it work” until everything breaks or who know what.

 

So how to develop Microsoft Business Intelligence Solutions in a team environment?

 

1) Standardize on Versions

 

First, figure out what “versions” you are going to support, and what you are going to use, and get standardized on them. I am guessing majority of BI devs right now are on the 2005 stack. Yeah, there is still probably a bit of 2000 legacy stuff out there, and some people are now getting into the 2008 stuff, but 2005 is pretty much the norm from what I see, at least at this point.

So, 2005. Get all your dev’s on 2005 on their machine – same patch level, etc. Get BIDS up to the same level. Get BIDS helper installed everywhere. Strive to get all your ETL packages in SSIS 2005, get all your cubes to SSAS 2005, etc, etc. Come to a consensus on things like config files for SSIS, naming conventions, within your development and on disk – folder structure is key! With a smaller number of versions of things floating around, it makes it easy for anyone on the team to open up a solution and start hammering away without tons of setup.

2) Get Source Control

 

This is crucial! I have talked about source control in the past, and also about some that aren’t so great. Really it doesn’t matter what you use, I prefer SVN. I install Tortoise SVN, SVN proper (to do scripting etc if I need to using cmd line) and also purchase Visual SVN, an add on to Visual Studio that integrates with SVN. for 50 bucks you have your source control system. Visual Source Safe works but is outdated, honestly I hate it. Team Foundation Server is good, but expensive. Other solutions might be using something like GIT, etc. Whatever you do, just get a source control system going, and learn it well. Learn how to create repos, commit, update, revert, merge, etc. Set up a user for each BI dev and make sure they commit often, and make sure they leave comments in the source control log when they commit, history is your lifeline to go back to something if you need to! Note: exclude .suo, bin, obj directories, .user files, etc. Anything that changes every time you build, open, etc, you want to exclude from source control.

 

3) Development Box

 

You now have your version standardized, and your source control setup. You can get most of your work done on your machine, but you need somewhere to test deployments, run scenarios, etc, etc. Make sure you have a comparable box to your production server. Set it up the same, same software etc. Make sure its backed up. Let all the devs know its a dev box, it can be wiped at any time for any reason if need be. It can be rebooted 5 times a day if need be. Its a dev box! But you can test and develop and tweak and change settings to your hearts content and not have to worry about breaking Mr. Executives reports.

 

4) Developing, Merging, Committing, Collaborating, Communicating.

So now you have your setup, well.. setup. Start creating stuff. SSIS Packages, ETL’s, SSAS Cubes, SSRS Reports, the whole MSFT BI Solution. This is where stuff can start to get tricky in a team environment though. SSIS/SSAS/SSRS isn’t as clean cut as something like C#/VB.NET, etc. Everything is in some form of XML behind the scenes, and with graphical based editing, you can move stuff around and it changes the files. Things like that are going to be your enemy. This is why you need to collaborate and communicate. Usually one person should be working on one project at a time. You can get really good at communicating and then in SSIS at least have multiple people working on different packages. Also in SSAS dimension editing and stuff can be done by multiple people at the same time as long as the dim is already hooked up to the cube. But you want to make sure that you communicate, “Hey, I am checking this in, you might want to do an update”, or “Is anyone working on this or are they going to? I want to modify something, and I will check it in so you all can see it”

You want to make sure you have your folder structure, and solution/project structure set up well. C:Projects  ..  and then maybe a folder for each major project “CompanySales” and under that, “ETL”, “Cube”, “Reports” and have a solution under each with 1 project of each type. You can also have a generic SSRS solution with many projects, which might work well for you. In any case, just come up with a standard and stick to it. Trust me it will make your life easier. The question from the email above, it sounds like they have every package in one solution, one project. Sounds like it needs to be split, multiple solutions, multiple projects.

 

5) Deployment Scenarios and Strategies

Now that you have everything developed, tested, checked in, what do you do?

Personally for SSIS I like xcopy deployments. One folder on the server, not on the C drive, but another drive, lets say “E:SSIS” under that a folder for each project. Put your dtsx and configs in the same folder. 99% of the time you are going to call the dtsx from a SQL Agent Job, and most likely you are going to run into a scenario where you need uber rights to execute it, so learn how to create a proxy/credential in SQL security so you can run the step as that. Once you have this folder and subfolders setup, you can use something like Beyond Compare to compare the folder on the server to the one you have locally that matches. Remember to copy files from the bin directory of your project after you build it, not the files directly on your project. As far as BIDS validating every package, there are workarounds out there you can do, here is one.

For SSAS, I try to lean towards using the Deployment Wizard that comes with SQL Server. You can use BIDS deployment, but if you start doing anything advanced with roles, partitions, etc, you are going to run into trouble. Take control and use the deployment wizard. I usually like to deploy, and then process manually when developing. And then later use SQL Agent or and SSIS package to actually do processing when it comes to a scheduled processing.

SSRS, I have become used to the auto deployment from Visual Studio. To really do this though, you need a project for every folder in SSRS, which can become a pain. You can always just upload the .RDL file and connection and do it manually, but if you start off right with using the deployment from BIDS, it can make your life easier.

 

So that is just a 10 minute overview of everything to kind of get started. Everything depends on your infrastructure and the way your team is setup, etc. But I think the biggest thing to take from all of the above is to standardize on things. If you standardize on as much as possible, SQL versions, setup of machines, naming conventions, layouts, design patterns, etc, everyone can do things faster and pretty soon it will start running like a well oiled machine!

ETL Method – Fastest Way To Get Data from DB2 to Microsoft SQL Server

For a while, I have been working on figuring out a “better” way to get data from DB2 to Microsoft SQL Server. There are many different options and approaches and environments, and this one is mine, your mileage may vary.

Usually, when pulling data from DB2 to any Windows box, the first thing you might think of is ODBC. You can either use the Microsoft DB2 driver (which works, if you are lucky enough to get it configured and working), or the IBM iSeries Client Access ODBC Driver (which works well), or another 3rd party ODBC driver. Using ODBC, you can access DB2 with a ton of different clients. Excel, WinSQL, any 3rd party SQL Tool, a MSSQL linked server, SSIS, etc. ODBC connects just fine, and will work for “querying” needs. Also, with the drivers you might install, you can usually set up an OLE DB connection if your client supports it (SSIS for example) and query the data using OLEDB – this works as well, but there are some caveats, which I will talk about.

In comes SSIS, the go to ETL tool for MSFT BI developers. You want to get data from DB2 to your SQL Server Data Warehouse, or whatever. You try with an OLEDB connection source, but it is clunky, weird, and sometimes doesn’t work at all (PrimeOutput Errors Anyone?). If you do manage to get OLEDB configured and working, you still probably will be missing out on some performance gains compared to the method I am going describe.

Back to SSIS, using ODBC. It works. You have to create an ADO.NET ODBC connection, and use a DataReader source instead of an OLEDB source. Everything works fine, except one thing. It is slow! Further proof?

http://social.msdn.microsoft.com/Forums/en-US/sqlintegrationservices/thread/162e55e5-b64b-423e-94c1-dd764ca1f683

http://www.sqlteam.com/forums/topic.asp?TOPIC_ID=96977

http://social.msdn.microsoft.com/Forums/en-US/sqlintegrationservices/thread/cfade7e7-50d5-4447-9821-35c5d5ae1b66

http://www.sqlservercentral.com/Forums/Topic702042-148-1.aspx

http://www.sqlservercentral.com/Forums/Topic666993-148-1.aspx

Ok, enough links. But if you do read those. SQL 2000 DTS is faster than using SQL 2005/2008 SSIS. WTF? The best I can guess is that it is because of the .NET wrapper around ODBC. DTS is using “native” ODBC.

So, now what? Do we want to use DTS 2000? No. What to do though?

Well, after a few days of research, and just exploring around, I think I have found a good answer.. Replace DB2 with SQL Server.. just kidding. Here is what you need to do:

Install the IBM Client Access tools. There is a tool called “Data Transfer From iSeries Server” which the actual exe is "C:Program FilesIBMClient Accesscwbtf.exe"

image

This little tool allows you to set up data transfers from your DB2 system to multiple output choices (Display, Printer, Html, and Text). We want to export to Text file on our filesystem. You have to set up a few options, like the FileName, etc. In “Data Options” you can set up a where statement, aggregates, etc.

If you output to a file, you can go into “Details” and choose a file type, etc. I use ASCII Text, and then in the  “ascii file details” I uncheck all checkboxes. You set up your options and then hit the “Transfer data from iSeries” button and it will extract data to the file you chose in the filename field. Pretty sweet. But this is a GUI, how can I use this tool? I am not going to run this manually. Well, you are in luck.

If you hit the “Save” button, it will save a .dtf file for you. If you open this .dtf file in a text editor, you will see all options are defined in text, in a faux ini style. Awesome, we are getting somewhere.

Now, how do you run this from a cmd prompt? Well, we are in luck again. Dig around in C:Program FilesIBMClient Access and you will find a little exe called “rxferpcb”

image

What this tool allows you to do, is pass in a “request” (aka a DTF file), and a userid/password for your DB2 system, and it will execute the transfer for you. Sweet!

Now what do we do from here?

1) Create an SSIS package

2) Create an execute process task, call rxferpcb and pass in your arguements.

3) Create a BULK Insert task, and load up the file that the execute process task created. (note you have to create a .FMT file for fixed with import. I create a .NET app to load the FDF file (the transfer description) which will auto create a .FMT file for me, and a SQL Create statement as well – saving time and tedious work)

Now take 2 minutes and think how you could make everything generic/expression/variable driven, and you have yourself a sweet little SSIS package to extract any table from DB2 to text and bulk load it.

image

What is so great about the .DTF files is that you can modify them with a text editor, which means you can create/modify them programmatically. Think – setting where statements for incremental loads, etc.

image

 

You can see from the two screenshots above, that is all there is. Everything is expression/variable drive. Full Load, and Incremental Load. Using nothing but .dtf files, rxferpcb, a little .NET app I wrote to automatically create DTF’s for incremental (where statements), truncate, delete, and bulk insert. I can load up any table from DB2 to SQL by just setting 3 variables in a parent package.

After you wrap your head around everything I just went over, then stop to think about this. The whole DTF/Data Transfer/etc is all exposed in a COM API for “Data Transfer Automation Objects’”

http://www-912.ibm.com/s_dir/slkbase.NSF/643d2723f2907f0b8625661300765a2a/0c637d6b03f927ff86256a710076ab22?OpenDocument

With that information at your disposal, you could really do some cool things. Why not just create a SSIS Source Adapter that wraps that COM object and dumps the rows directly to the SSIS Buffer, and then does an OLEDB insert or Bulk Insert using the SQL Server Destination?

I have found in my tests that I can load over 100 million row tables – doing a full complete load, in about 6-7 hours. 30-40 million row tables in 4 hours. 2 to extract, 2 to BULK insert. Again, your mileage may vary depending on the width of your table, network speed, disk I/O, etc. To compare, with ODBC, just pulling and inserting 2 million records was taking over 2 hours, I didn’t wait around for it to finish. Pulling 2 Million records with my method described in this blog takes about 3-5 minutes (or less!)

I know I have skimmed over most of the nitty gritty details in this post, but I hope to convey from a high level that ODBC/OLE DB just aren’t as fast as the method here, I have spent a lot of time over the last few weeks comparing and contrasting performance and manageability. Now, if I could just get that DB2 server upgrade to SQL Server 2008. . . Happy ETL’ing!

Early Arriving Facts, Late Arriving Dimensions, Inferred Dimensions

Most ETL systems (at least that I have seen/studied/worked on) that populate data warehouses run something like

 

1) Load Dims

a) populate an unknown

b) populate dim data

2) Load Facts

a) join/lookup to dim’s, and if no match, set as “unknown” dimension record

3) Process Cube

 

This type of system works in many cases, but there are some flaws that bubble up over time. First, unless you reload your fact table, or update your unknown dimension keys on your fact, you could end up with unknowns, that will be unknowns forever. The system described above also means you need to run it in that order. Dims first, Facts second.

1155499_the_blends__4

Early Arriving Facts/Late Arriving Dimension – If you are an optimist, we have the fact data before we have the dimension data. Or, if you are a pessimist, we don’t have the dimension data when we load the fact. You choose, but in either scenario, we have data missing somewhere.

Like I mentioned earlier, many systems will just set the early arriving fact as “unknown” and set it to a unknown dimension key (usually –1) in the fact table. Some people might just ignore the fact record completely. You probably don’t want to do that.

But what if we have the “business” key in our fact data select. What can we do with that?

One option is to modify your dimension data select to UNION in all the distinct business keys from your fact data that aren’t in your dimension data. This works in a small data set. If you fact table is 500 million rows, you won’t like the performance of this option.

Another option we can use is the idea of an inferred dimension. As you load your fact table data (preferably through SSIS) you do a lookup to your dimension. If you have a match, cool, take that key and move on. If you don’t have a match, instead of setting the key to –1 (unknown), do this:

1) Insert a new dimension record with your business key from your fact table

2) Grab the newly created dimension key from the record you just inserted

3) Merge the key back into your fact data pipeline.

Awesome. Now, sometime in the future, your Dimension process can come through, and if you are doing Slowly Changing Dim’s, it should just update your inferred dimension records with data. If your inferred dimension records are some one offs that might never get updated, you might be able to get someone to manually update them through some interface, or whatever, in any event you aren’t stuck with tons of fact records that are set to –1/unknown.

Of course, the method above works best using SSIS, with a “Get Data -> Lookup Pattern –> Insert” method.

Happy ETL’ing!