SQL Server Data Synchronization Automation Using Tablediff Utility and SSIS – Part 1

February 19th, 2014 / by admin

Note: The second part of this series can be found HERE

On one of my recent projects, a client required that a data synchronization routine between database objects be developed with minimal complexity i.e. as little SQL as possible, definitely no .NET/Python and preferably using an in-house ETL tool such as MS SQL Server Integration Services. The solution had to be easily serviceable by their internal group of analysts, with no elaborate code base to maintain. Given that SSIS had become their de facto tool for all data transformation and loading, and that most staff were familiar with its intuitive drag-and-drop style of development, the constraint of using SSIS in its purest form was the primary success-defining factor. I have written at length in my previous posts (mainly HERE, HERE and HERE) on different methods of synchronizing database objects and their content; however, I found that when the KISS (Keep It Simple Stupid) methodology is a preference and no elaborate features are required, a little tool called tablediff, in combination with a simple SSIS package, gets the job done very nicely.

Tablediff lets you compare the data in two tables in order to identify any discrepancies and, if discrepancies are found, generate the T-SQL script necessary to bring the data in the second table into line with the first. Though originally intended primarily as a utility for troubleshooting non-convergence in a replication topology, tablediff can be used to perform a row-by-row comparison of any two tables that share a similar schema. The tool has a number of limitations which prevent it from serving as a more enterprise-geared option, e.g. the source table must have a primary key, identity or rowguid column, and only tables sharing the same schema can be compared. Tablediff also enforces certain restrictions around datatype support – the synchronization feature doesn’t work with varchar(max), nvarchar(max), varbinary(max), timestamp, xml, text, ntext or image datatypes. In its defence, however, tablediff is free and very simple to use, either on its own or integrated with other services.

In this post I will not cover every option and characteristic that tablediff offers, but will instead explain how a simple SSIS package can be built and configured to take advantage of its functionality and automate the data synchronization process.

Let’s start with a sample dataset to demonstrate how data differences can be resolved. Below is a simple SQL script which, when run, creates 2 databases and 4 tables – 2 in each database. Once executed, the Source_DB database and its 2 tables contain data which serves as the master copy, whereas the Target_DB database contains data which requires ‘alignment’ with the source so that it reflects the source in an identical way. Even though I could demonstrate the tablediff tool’s functionality using only two tables (the script creates a pair for each database, totalling 4 tables altogether), I decided to multiply it by a factor of 2 to showcase how we can loop through each table and so encompass a larger collection of objects. In this way, rather than running tablediff for each table separately, we can specify an assortment of tables which can be as small as a few hand-picked objects or as large as a whole database with all its tables included.

USE [master]
GO
IF EXISTS (SELECT name FROM sys.databases WHERE name = N'Source_DB')
BEGIN
-- Close connections to the Source_DB database
ALTER DATABASE [Source_DB] SET SINGLE_USER WITH ROLLBACK IMMEDIATE
DROP DATABASE [Source_DB]
END
GO
CREATE DATABASE [Source_DB]

IF EXISTS (SELECT name FROM sys.databases WHERE name = N'Target_DB')
BEGIN
-- Close connections to the Target_DB database
ALTER DATABASE [Target_DB] SET SINGLE_USER WITH ROLLBACK IMMEDIATE
DROP DATABASE [Target_DB]
END
GO
CREATE DATABASE [Target_DB]

USE Source_DB
CREATE TABLE Tbl1 (
ID int NOT NULL,
Sample_Data_Col1 varchar (50) NOT NULL,
Sample_Data_Col2 datetime NOT NULL)
GO

INSERT INTO Tbl1
(ID, Sample_Data_Col1, Sample_Data_Col2)
SELECT 1, 'OddValue1', '9999/12/31'
UNION ALL
SELECT 2, 'OddValue2', '9999/12/31'

SELECT * INTO Tbl2 FROM Tbl1
GO

ALTER TABLE Tbl1
ADD CONSTRAINT Source_pk_ID1 PRIMARY KEY (ID)
ALTER TABLE Tbl2
ADD CONSTRAINT Source_pk_ID2 PRIMARY KEY (ID)

USE Target_DB
CREATE TABLE Tbl1 (
ID int NOT NULL,
Sample_Data_Col1 varchar (50) NOT NULL,
Sample_Data_Col2 datetime NOT NULL)
GO

INSERT INTO Tbl1
(ID, Sample_Data_Col1, Sample_Data_Col2)
SELECT 1, 'OddValue1', GetDate()
UNION ALL
SELECT 3, 'OddValue3', '1900/01/01'

SELECT * INTO Tbl2 FROM Tbl1
GO

You can see the differences by running a simple SELECT statement against both databases, as per the image below.

Tablediff_select_all_tables
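For reference, the side-by-side comparison in the image is nothing more than four plain SELECT statements run against the newly created objects (a trivial sketch):

SELECT * FROM Source_DB.dbo.Tbl1
SELECT * FROM Target_DB.dbo.Tbl1
SELECT * FROM Source_DB.dbo.Tbl2
SELECT * FROM Target_DB.dbo.Tbl2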

Now that we have our source and target databases and tables defined, we need to create a separate table which will act as a placeholder for the objects we wish to synchronize and their corresponding details. We will refer to this table during package execution while looping through each of its records to fetch relevant information e.g. schema name, database name etc.

The code for this table, which I called Params_details, is as per below, but you can populate it using other methods which reflect your environment setup more adequately.

IF OBJECT_ID('dbo.Params_details', 'U') IS NOT NULL
  DROP TABLE dbo.Params_details

CREATE TABLE Params_details
(ID            int IDENTITY (1,1),
Source_DB      varchar (100) NOT NULL,
Target_DB      varchar (100) NOT NULL,
Source_Schema  varchar (50)  NOT NULL,
Target_Schema  varchar (50)  NOT NULL,
Source_Tbl     varchar (100) NOT NULL,
Target_Tbl     varchar (100) NOT NULL)
GO

DECLARE @source_db_name varchar		(100)	= 'Source_DB'
DECLARE @target_db_name varchar		(100)	= 'Target_DB'
DECLARE @source_schema_name varchar (50)	= 'dbo'
DECLARE @target_schema_name varchar (50)	= 'dbo'
DECLARE @sql1 varchar (1000) =
	'INSERT INTO Params_details
	(Source_DB, Target_DB, Source_Schema, Target_Schema, Source_Tbl, Target_Tbl)
	SELECT DISTINCT '''+@source_db_name+''', '''+@target_db_name+''',
	'''+@source_schema_name+''', '''+@target_schema_name+''', a.name, b.name
	FROM '+@source_db_name+'.sys.tables a
	JOIN '+@target_db_name+'.sys.tables b ON a.name = b.name
	WHERE a.type = ''U'' and b.type = ''U'''

exec (@sql1)

Running a simple select against it, you can see that its content simply reflects the collection of objects with their schema and database details which we identified as the candidates for data merging.

Tablediff_select_all_params_details
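To give an idea of what the package will iterate over, with the two sample databases in place a SELECT against the table returns rows along these lines (inferred from the population script above, so treat the exact values as illustrative):

SELECT * FROM dbo.Params_details

-- ID  Source_DB  Target_DB  Source_Schema  Target_Schema  Source_Tbl  Target_Tbl
-- 1   Source_DB  Target_DB  dbo            dbo            Tbl1        Tbl1
-- 2   Source_DB  Target_DB  dbo            dbo            Tbl2        Tbl2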

For the purpose of this exercise, I have also created a directory on my C:\ drive called DiffOutput, which will house the batch file used by the package, the SQL files created during its execution, and another directory called Archived_Files, into which the generated files will be moved and stored for further reference, keeping DiffOutput less cluttered. The directory structure and content should resemble the image below.

Tablediff_dir_structure

Lastly, before I start exploring the SSIS package components in the NEXT post in this series, I would like to touch on the batch file code and its functionality. As you can see in the above image, the DiffOutput directory contains a small batch file which is called by the package’s Execute Process Task. The file executes the tablediff.exe utility which, provided you have SQL Server installed, should already be present in the SQL Server directory sub-tree. A number of parameters are passed down from the package when the file is called, as tablediff requires, at a minimum, the source and target database, source and target schema and source and target table to be specified. As the script is configured to generate a SQL file containing the modification code to be applied in order to synchronize source and target, the -f option, followed by the full path and file name of the script file, is also used to output the file into the nominated directory. Also, notice that the tablediff call is a single-line entry; if you break it into multiple lines, tablediff will complain and throw an error. The remaining lines of the batch file simply capture the current date and time, which are used to rename the generated file(s) by appending a date and time string, allowing for a more effective archiving strategy. The batch file content is as per below.

"C:\Program Files\Microsoft SQL Server\110\COM\tablediff.exe" -sourceserver "SHERLOCK\MSSQL12_Tabular" -sourcedatabase [%1]  -sourceschema [%2]  -sourcetable [%3]   -destinationserver SHERLOCK\MSSQL12_TABULAR  -destinationdatabase [%4]  -destinationschema [%5]  -destinationtable [%6] -f c:\DiffOutput\difference.sql
set d=%date:~-4,4%%date:~-7,2%%date:~0,2%
set d=%d: =_%
set t=%time:~0,2%%time:~3,2%%time:~6,2%
set t=%t: =0%
RENAME "c:\diffoutput\difference.sql" "diff_%d%_%t%.sql"
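For context, the file produced by the -f switch is a plain T-SQL script. With the sample data created earlier, the script generated for Tbl1 would contain statements roughly of the following shape (a sketch only – the exact formatting and value literals tablediff emits will differ slightly):

-- The source's version of row 1 has a different Sample_Data_Col2 value, so the target row is updated
UPDATE [dbo].[Tbl1] SET [Sample_Data_Col2]='9999-12-31 00:00:00.000' WHERE [ID] = 1
-- Row 2 exists only in the source, so it is inserted into the target
INSERT INTO [dbo].[Tbl1] ([ID],[Sample_Data_Col1],[Sample_Data_Col2]) VALUES (2,'OddValue2','9999-12-31 00:00:00.000')
-- Row 3 exists only in the target, so it is deleted
DELETE FROM [dbo].[Tbl1] WHERE [ID] = 3

Running the generated script against Target_DB brings its copy of Tbl1 back in line with the source.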

In the NEXT post I will outline the SSIS package development steps and present the complete solution ‘at work’ in a short video.


Which One Should I Use – NOT EXISTS, NOT IN, OUTER APPLY, EXCEPT and LEFT OUTER JOIN Performance Optimization

February 12th, 2014 / by admin

Introduction

Lately I have been involved in a large enterprise data warehouse deployment project, and in the last few weeks we have reached a point where the team has slowly begun transitioning from dimension table development to fact table development. Working with many different teams, e.g. contractors, consultants, internal data professionals etc., what caught my attention was that different developers tend to use different techniques to differentiate between ‘new’ vs. ‘old’ data, e.g. source OLTP system vs. target fact table in the OLAP schema, to determine whether a given set of transactions has already been inserted or not. Given that most data flow logic was embedded in SQL stored procedures, with SQL Server Integration Services only controlling execution flow, a typical scenario for source vs. target comparison would involve applying a dataset-differentiating statement, e.g. NOT EXISTS or NOT IN, to account for any new transactions. There are a number of different options available here, so I thought it would be a good idea to put them to the test and find out how performance is affected when using the most common ones, i.e. NOT EXISTS, NOT IN, OUTER APPLY, EXCEPT and LEFT OUTER JOIN, on a well-structured dataset, with and without an index and with columns defined as NULLable vs. non-NULLable.

The short version is that, when presented with a choice of using any of those 5 statements, you should preferably stay away from NOT IN, for the reasons described below. This will depend heavily on your schema, data and the resources at your disposal but, as a rule of thumb, NOT IN should never be the first option you consider when other alternatives are possible.

TL;DR Version

Let’s start with two sample datasets – Tbl1 and Tbl2 – where Tbl2 differs from its archetype by having a small number of records (in this case around 50) removed. The two tables and their data were created using the following SQL.

USE [master]
GO
IF EXISTS (SELECT name FROM sys.databases WHERE name = N'TestDB')
BEGIN
-- Close connections to the TestDB database
ALTER DATABASE [TestDB] SET SINGLE_USER WITH ROLLBACK IMMEDIATE
DROP DATABASE [TestDB]
END
GO
CREATE DATABASE [TestDB]
GO
USE [TestDB]
GO

CREATE TABLE Tbl1
(ID int identity (1,1),
ID_NonIdentity int NOT NULL DEFAULT 0,
object_id int NOT NULL)
GO

CREATE TABLE Tbl2
(ID int identity (1,1),
ID_NonIdentity int NOT NULL,
object_id int NOT NULL)
GO

INSERT INTO Tbl1 (object_id)
SELECT c1.object_id FROM sys.objects c1
CROSS JOIN (SELECT Top 100 name FROM sys.objects) c2
CROSS JOIN (SELECT Top 100 type_desc FROM sys.objects) c3
GO 25

UPDATE Tbl1 SET ID_NonIdentity = ID

INSERT INTO Tbl2 (ID_NonIdentity, object_id)
SELECT ID_NonIdentity, object_id FROM Tbl1

SET NOCOUNT ON
DECLARE @start int = 0
DECLARE @finish int = (SELECT MAX(id) FROM Tbl2)
WHILE @start <= @finish
	BEGIN
	DELETE FROM Tbl2 WHERE id = @start
	SET @start = @start+250000
	END

CREATE INDEX idx_Tbl1_ID_NonIdentity
ON Tbl1 (ID_NonIdentity)
CREATE INDEX idx_Tbl2_ID_NonIdentity
ON Tbl2 (ID_NonIdentity)

Given that the two objects’ data is now slightly different, we can compare their content and extract the differences using the ID_NonIdentity attribute, or simply run an EXCEPT statement across the two tables. Note that at this stage both tables are defined with non-NULLable columns and have indexes created on the ID_NonIdentity column. Let’s run the sample SELECT statements using NOT EXISTS, NOT IN, OUTER APPLY, EXCEPT and LEFT OUTER JOIN and look at execution times and plans in more detail.

--QUERY 1
DBCC FREEPROCCACHE
DBCC DROPCLEANBUFFERS
SET STATISTICS TIME ON
SELECT ID
FROM tbl1 a
WHERE a.ID_NonIdentity NOT IN
	(SELECT b.ID_NonIdentity FROM tbl2 b)
SET STATISTICS TIME OFF

--QUERY 2
DBCC FREEPROCCACHE
DBCC DROPCLEANBUFFERS
SET STATISTICS TIME ON
SELECT ID
FROM tbl1 a WHERE NOT EXISTS
	(SELECT ID_NonIdentity
	FROM tbl2 b
	WHERE a.ID_NonIdentity = b.ID_NonIdentity)
SET STATISTICS TIME OFF

--QUERY 3
DBCC FREEPROCCACHE
DBCC DROPCLEANBUFFERS
SET STATISTICS TIME ON
SELECT a.ID FROM Tbl1 a
LEFT OUTER JOIN Tbl2 b ON a.ID_NonIdentity = b.ID_NonIdentity
WHERE b.ID_NonIdentity IS NULL
SET STATISTICS TIME OFF

--QUERY 4
DBCC FREEPROCCACHE
DBCC DROPCLEANBUFFERS
SET STATISTICS TIME ON
SELECT a.ID
FROM tbl1 a OUTER APPLY
			(SELECT ID_NonIdentity FROM Tbl2 b
			WHERE a.ID_NonIdentity=b.ID_NonIdentity) z
			WHERE z.ID_NonIdentity IS NULL
SET STATISTICS TIME OFF

--QUERY 5
DBCC FREEPROCCACHE
DBCC DROPCLEANBUFFERS
SET STATISTICS TIME ON
SELECT ID
FROM tbl1 a
EXCEPT
SELECT ID
FROM tbl2 b
SET STATISTICS TIME OFF

AllQueries_ExecTime_FirstPass

Looking closely at the execution plans, the queries utilizing NOT EXISTS and NOT IN produced identical plans, with Right Anti Semi Join being the most expensive operation here.

Q1andQ2_ExecPlan

The plans for OUTER APPLY and LEFT OUTER JOIN were somewhat similar; however, for those two query types the optimizer chose a Right Outer Join, which proved slightly more expensive than the Right Anti Semi Join because the query brings in all matching and non-matching records first and then applies a filter to eliminate the matches, as per the image below.

Q3andQ4_ExecPlan

Using EXCEPT yielded a similar execution plan to NOT EXISTS and NOT IN, with the exception of the optimizer utilizing a Hash Match (Aggregate) to build a hash table in order to remove duplicates. This is an important point to make, as EXCEPT includes an implicit DISTINCT – in cases where multiple rows with the same value are found, they will be eliminated from the left ‘side of the equation’, much like UNION vs. UNION ALL. Not an issue in this specific instance, but something to watch out for when planning to query data differences.
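A minimal, self-contained illustration of that implicit DISTINCT behaviour (unrelated to the test tables above):

SELECT x FROM (VALUES (1), (1), (2)) AS a(x)
EXCEPT
SELECT x FROM (VALUES (2)) AS b(x)
-- returns a single row containing 1, even though the left input held that value twice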

Regardless of the slight differences in execution plans, all queries, with the exception of the one using EXCEPT, ran in a comparable time. Typically, such statements in a production environment would run over potentially larger datasets with much more complex logic involved, so larger variances can be expected; generally though, performance stays on par and disparities should be minimal. Also, removing the indexes from both tables did little to increase execution time for the above queries. But what happens if we allow NULL values in the ID_NonIdentity attribute? Let’s execute the following SQL to change the column’s NULLability and run the representative SELECT statements again to see whether any change can be attributed to ID_NonIdentity accepting NULL values. Notice that there is no change to the underlying data and the previously created indexes are still in place.

ALTER TABLE Tbl1
ALTER COLUMN ID_NonIdentity int NULL
GO
ALTER TABLE Tbl2
ALTER COLUMN ID_NonIdentity int NULL
GO

Execution times are as per the chart below, and it’s clear that while queries 2, 3, 4 and 5 behaved in a predictable manner and returned all records within a respectable time, query 1 failed abominably.

AllQueries_ExecTime_SecondPass

The main problem here is that the results can be surprising if the target column is NULLable, as SQL Server cannot reliably tell whether a NULL on the right side is or isn’t equal to the reference record on the left side when executing the query using the NOT IN clause – and that’s regardless of whether the column actually contains any NULL values or not. I let query 1 run for 30 minutes, after which it was clear that it was going to take a long time to complete – a good indication of the problems the NOT IN clause was causing the optimizer as it tried to select the most suitable plan. You can also tell that the query plan generated after the column modification is quite a bit more involved, with Nested Loops, Row Count Spool and heavy tempdb usage to accomplish what seemed straightforward in the first pass.

Q1_ExecPlan_PostColumnMod
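To see the NULL semantics behind this behaviour in isolation (separate from the plan-shape issue), consider this small standalone example – with default ANSI_NULLS behaviour, a single NULL on the right-hand side makes NOT IN return nothing at all:

SELECT 'row returned' WHERE 1 NOT IN (2, 3)        -- returns one row
SELECT 'row returned' WHERE 1 NOT IN (2, 3, NULL)  -- returns no rows: 1 <> NULL evaluates to UNKNOWN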

Conclusion

The upshot of this quick exercise in query performance is that whenever you plan to compare data in table A against data in table B, looking for rows where some condition does not exist in table B, the NOT IN clause should be avoided as much as possible. This will, of course, depend on your workloads, hardware, data, schema and environment in general, but it is safe to say that using NOT EXISTS instead of NOT IN will most likely result in the best query performance and execution time.
