Hadoop Musing – How to Install Hortonworks Hadoop Distribution (Sandbox and HDP) Locally On Microsoft Windows

April 6th, 2014 / No Comments » / by admin

Introduction

The venerable Hadoop obsession has taken its toll on me so I thought that in the spirit of the current data management trends it would be a good idea to start looking into the whole Hadoop ecosystem. Every executive I have recently spoken to about their data strategy has been asking about it (whether they actually understand it or not is a different story) and I start to feel that if I don’t succumb to this craze I will miss out on this worldwide phenomenon that’s been evolving behind my back for a few years now – the ‘Big Data’ movement. I went to a few seminars and meet-ups where the folks from Claudera, Hortonworks and others drum on about how nearly every single major RDBMS vendor will either rot away clinging to its archaic ways of storing structured data and disappear into the abyss of relational dismay unless they join this new data liberation movement and start integrating NoSQL, Big Data, Hadoop-like paradigms into their application pool to stay somewhat relevant in this new world order. Regardless of whether this PR web they’re spinning has any merit or not or what the future actually holds, there is nothing better than me getting my hands dirty and trying this thing for myself. So here I am, looking at my first Hadoop deployment and writing this post to help other folks out there make up their mind on the Hadoop vs. OLTP/OLAP technology for data analysis and management.

Given that it’s my first post on this topic (plenty more in the pipeline so make sure you stay tuned) I thought it would be a good idea to start with something simple and outline the installation process for Hortonworks Hadoop distribution on a Windows platform. Hortonworks also provides a ready-made VM image which can be deployed with just a few click but given that the actual installation process is slightly more involved, I though it’s a good idea to also step through it providing more details.

Installing Hortonworks Sandbox v. 2.0 (on premise)

Let’s start with the sandbox which can be downloaded from the Hortonworks website and deployed using either a free VirtualBox application (Hortonworks recommended option) or other virtualization software e.g. VMWare, Hyper-V (just make sure that virtualization is enabled in the BIOS, minimum RAM can be allocated if you plan to run Ambari and Hbase (8GB) and a compatible browser is installed on your box e.g. Chrome 25+, IE 9+ etc.). After downloading the VM’s image, the ‘installation’ process is very straightforward i.e. providing you have a suitable application to mount the image installed, it’s just a matter of double-clicking on the downloaded file which starts the mounting process, pointing your browser at the I.P. address as per the installation output, filling out the registration form and deepdiving into the sandbox. Below are the sample screenshots depicting the installation session at its completion and the browser interface for the sandbox environment when first initiated.

HDP_Local_Install_VMSessionIPAddress

HDP_Local_Install_HDPSandboxWebInterface

Installing HDP 2.0 for Windows (on premise)

Installing HDP is a little bit more involved and given the fact that the final product does not include native/out-of-the-box Apache Ambari or Hue components, it’s a little underwhelming to know that you cannot use GUI to manage HDP or query its data without additional configuration required that can be quite complex for someone uninitiated.

The following is a collection of step-by-step instructions on how to provision a single node installation of HDP on Windows Server 2012 R2. However, before we can commence the installation process, several prerequisites need to be fulfilled without which the process will fail so make sure all of those are met prior to cluster deployment.

  • Install Microsoft Visual C++ 2010 Redistributable package (64-bit)
  • Install Microsoft .NET Framework 4.0 (unless installing on Windows Server 2012 or higher in which case it should already be a part of the base installation)
  • Install Java Development Kit (JDK)
  • Install Python 2.7 (or higher if superseded by the newer version)

Installing the first two should not be a problem and usually no additional configuration is required. However, when installing Java JDK and Python, further tweaks are required so I think it’s a good idea to outline those in more details.

Java JDK Installation and Configuration

After Java JDK has been downloaded (the supported version can be fetched from HERE), install it to a directory which has no white space characters in its path. This is very important as the default option/path will not work! You can, for example, nominate the following as the new installation directory – C:\Java (as per image below) – which has no spaces in its directory path.

HDP_Local_Install_JavaDirectory

Next, create a system variable called JAVA_HOME and assign the full path to the installation directory defined earlier for JDK as its value. This is done by opening up Control Panel’s System pane and under Advanced system settings and Advanced tab clicking on Environment Variables button. Next, under System variables, click New and enter the variable name as JAVA_HOME. Finally, enter the variable value as the installation path for JDK e.g. C:\Java as per image below and confirm your input/selection by clicking on OK/Apply Changes when finished.

HDP_Local_Install_JavaPathConfig

Python 2.7 Installation and Configuration

After you downloaded the Windows specific python distribution (the binaries can be fetched from HERE), install it using the default configuration. Next, update the path settings in a similar way we did when configuring Java JDK i.e. go to Control Panel -> System pane -> Advanced settings -> Advanced system tab -> Environment Variables button and under System Variables find PATH and click Edit. In the Edit windows, modify path by appending the installation path for your Python directory to the value of PATH. For example, if the Python executable is installed in C:\Python27, you must append this value to PATH as per image below.

HDP_Local_Install_Python27Path

You can validate Python installation by opening up PowerShell terminal and typing in python. If the installation and configuration were successful you should get a response similar to the image below.

HDP_Local_Install_Python27CorrectPSOutput

Server Configuration

Now it’s time to configure the server before we can begin with HDP installation. The below steps outline Windows Server configuration and HDP provisioning.

Firstly, obtain the HOSTNAME for the server you’re installing to. This can be achieved by typing hostname in the command line or Windows PowerShell environment and capturing the output as per image below.

HDP_Local_Install_HostnamePSCheck

Next, disable the firewall and open up all ports. The simplest way around it is to go to the Control Panel, type in Firewall into the search box and click on Check Firewall Status. Next, click on Turn Windows Firewall On or Off to turn it off for all available networks. Alternatively, individual ports will require to be opened. The below table outlines the default ports used by various services so unless you are in a position to turn the firewall off completely, these will need to be opened before HDP installation.

HDP_Local_Install_PortsOverviewNext, download HDP from Hortonworks website (currently distributed as a zipped up file) and go to the folder where the downloaded file was saved. Extract its content to a different directory. For the purpose of this demo I decided to simply extract it into the Downloads directory, where the downloaded package was stored under another directory called hdp-2.0.6-GA_extract as per image below.

HDP_Local_Install_DownloadDirectory

At this point, under the hdp-2.0.6-GA_extract directory, you should see a number of files, one of them called clusterproperties.txt. Open it up and replace server names for all the nodes under #Hosts and for the DB_HOSTNAME under #Database Host with the host name of your machine (obtained as per point 1 instructions). Also, ensure that you change the Log and Data directories to point to the appropriate drive on your server (most likely C drive). The modified clusterproperties.txt file’s content should resemble the image below.

HDP_local_Install_ClusterPropsTextFile

Next, save the file and from an elevated command prompt with administrative privileges, run the following command ensuring that directory path reflects your environment/folders setup.

msiexec /i "C:\Users\dingopole\downloads\hdp-2.0.6-GA_extract\hdp-2.0.6.0.winpkg.msi" /lv "hdp.log" HDP_LAYOUT="C:\Users\dingopole\Downloads\hdp-2.0.6-GA_extract\clusterproperties.txt" HDP_DIR="C:\hdp\hadoop" DESTROY_DATA="yes"

Alternatively, open up the PowerShell prompt in Administrator mode and execute the extracted MSI through this command, ensuring that you have navigated to the directory where the MSI file is stored.

msiexec /i "hdp-2.0.6.0.winpkg.msi"

The below image depicts the command input in PowerShell as per my VM’s directory structure and the HDP MSI file’s saved location.

HDP_Local_Install_MSIFileExecInPS

This should provide you with the following output where many parameters should appear pre-populated with default values. Complete the form with the parameters which reflect your environment ensuring that DERBY and not MSSQL is selected from the drop-down for ‘DB Flavour’ selection and that ‘Delete Existing HDP data’ is ticked as per image below.

HDP_Local_Install_MSIFileExecForm

A successful installation should conclude with the following message displayed on its conclusion.

HDP_Local_Install_SuccessMsg

On the desktop of your server you should now see three new icons. Click the Hadoop command prompt and navigate back one directory by executing the following.

cd..

Next, type in the following to start services associated with HDP installation.

Start_local_HDP_services

Wait for a few minutes for this job to complete and validate the services start-up by issuing the following from the same command prompt.

Run-SmokeTests

This command runs a variety of tests against the new installation to ensure that MapReduce, Hive, Pig, ZooKeeper etc have been provisioned correctly and are running as expected.

Conclusion

As you can see, unless you’re happy to use a VM image provided by Hortonworks, installing Hadoop can be a quite involving process, even though this is only a single node installation. There have been whole books written on Hadoop environment provisioning so this post is only meant to serve as an introduction to a much complex issue of the Hadoop environment deployment. At this stage, regardless of which distribution you choose to go with, the ecosystem is still in a state of flux with new features rolled out nearly every month. The status quo around the companies supporting those distributions also fluctuates e.g. Intel abandoning their proprietary Hadoop release in favor of investing in Cloudera instead so even though it hurts me to say it, I wouldn’t necessarily take these instructions for granted. At the same time, any forthcoming releases should become more coherent, integrated and easier to deploy and manage so the ease of breaking into the Hadoop ‘black box’ should also become more straightforward.

As this is my first post on Hadoop I have merely managed to scratch the surface of this technology. In the next few posts I will dive deeper into the HDP/Hadoop paradigm, exploring its features and functionality in more details so stay tuned!

Tags: , ,

SQL Server Data Synchronization Automation Using Tablediff Utility and SSIS – Part 2

February 19th, 2014 / No Comments » / by admin

In the FIRST post this short series I briefly outlined tablediff utility and set the stage (databases, objects, directories and batch file) for the SSIS package responsible for data synchronization workflow. In this post I will continue with the solution development and show you how to build and configure the package to take advantage of the tablediff tool and merge data across the 2 databases and their objects deployed as per PART 1 prelude. The complete package can be downloaded from my SkyDrive folder HERE if you wish to use it as a template – just make sure you adjust all necessary parameters e.g. database connection details etc. to correspond to your environment. I have also attached a short video footage depicting the package execution and the outcome on the tables’ data synchronization and SQL files management at the end of this post.

The complete package looks as per image below, with the core functionality encapsulated in the first two sections. The remaining Script Task and the second ForEach Loop Container are there to provide additional features i.e. the Script Task checking whether the file(s) have been created in the nominated directories and the second ForEach Loop Container to execute scripts generated by the tablediff utility and archive the files processed into a separate directory.

Tablediff_complete_SSIS_package

Let’s start with creating a bunch of variables which will house the data temporarily used during package runtime or provide value-dependant execution topology. Most of the variables are pretty self-explanatory with names reflecting their function. Common_Var variable of the System.Object type will house the results of the query against the Params_details table (see previous post HERE) whereas the Boolean File_Exists variable is populated by the Script Task and its value determines execution pathway depending on whether the file(s) have been created or not. Finally, Temp_FileName is used by the second ForEach Loop Container in iterating through SQL files created.

Tablediff_SSIS_variables_collection

Looking at the Connection Manager, there is one database connection pointing to the Source_DB database and 2 file connections – Archived_Files which points to the archive directory (C:\Diffoutput\Archived_Files) and Temp_FileName one which is using an expression for connection string as per image below.

Tablediff_Temp_Filename_conn_expression

Let’s go through the Control Flow pane now and outline the individual containers roles. Starting with Execute SQL Task 1 transformation, it simply executes the SELECT SQL statement, returning the data as a full results set, which in turn is mapped to the Common_Var variable as per below.

Tablediff_SSIS_Exec_SQL1

Tablediff_SSIS_Exec_SQL1_Var_Assignment

Next, the first ForEach Loop Container is configured to use the ADO object source variable Common_Var and the following variables mappings.

Tablediff_SSIS_ForEach_Loop1

Tablediff_SSIS_ForEach_Loop1_Var_Mapping

Note that the column order used in the SELECT SQL statement from the first transformation is corresponding to the variables assignment in the first ForEach Loop Container, otherwise the package will fail.

Execute Process Task simply runs the batch file saved in the nominated directory, passing the variables as arguments in a form of an expression as per below. Also, note that the order in which those appear in the expression corresponds to the SELECT SQL statement attributes order and variable mapping in the ForEach Loop Container order.

Tablediff_exec_process_process_property

Tablediff_exec_process_expression_property

Notice that the FailTaskIfReturnCodeIsNotSuccessValue property is set to False. This is due to the fact that if no changes are detected, the tablediff will return SucessValue of 1 which effectively fails package execution. As we don’t want to treat this as an error or exception FailTaskIfReturnCodeIsNotSuccessValue property needs to be changed from its default setting to False value.

Following on, we have a Script Component with a very simply VB.NET code snippet checking for SQL files existence. I know in the first post I said that no code will be required but this section is not compulsory, uses default SSIS functionality and is only there as an enhancement so you can remove it if you wish. The File_Exists variable is also listed in the ReadOnlyVariables property.

Tablediff_SSIS_script_task_config

#Region "Imports"
Imports System
Imports System.Data
Imports System.Math
Imports Microsoft.SqlServer.Dts.Runtime
Imports System.IO
#End Region

<Microsoft.SqlServer.Dts.Tasks.ScriptTask.SSISScriptTaskEntryPointAttribute()> _
<System.CLSCompliantAttribute(False)> _
Partial Public Class ScriptMain
    Inherits Microsoft.SqlServer.Dts.Tasks.ScriptTask.VSTARTScriptObjectModelBase

    Public Sub Main()
        Dim directory As DirectoryInfo = New DirectoryInfo("c:\DiffOutput")
        Dim file As FileInfo() = directory.GetFiles("*.sql")

        If file.Length > 0 Then
            Dts.Variables("User::File_Exists").Value = True
        Else
            Dts.Variables("User::File_Exists").Value = False
        End If

        Dts.TaskResult = ScriptResults.Success
        Dts.TaskResult = ScriptResults.Success
    End Sub

#Region "ScriptResults declaration"
    Enum ScriptResults
        Success = Microsoft.SqlServer.Dts.Runtime.DTSExecResult.Success
        Failure = Microsoft.SqlServer.Dts.Runtime.DTSExecResult.Failure
    End Enum
#End Region
End Class

This code also populates File_Exists variable which is being used in the Precedence Constraints joining the subsequent transformations. It contains the conditional expression which when evaluated, returns a Boolean value altering the package execution flow i.e. pointing it in either of the 2 directions – Execute SQL Task 1 or ForEach Loop Container 2. Below is the Precedence Constraint editor view for executing ForEach Loop Container 2 section.

Tablediff_SSIS_precedence_constraint_formula

In the last section, ForEach Loop Container 2 iterates through SQL files created using the following enumerator configuration and variable mapping.

Tablediff_ForEach_Loop2_Var_Mapping

Inside the ForEach Loop Container 2, Execute SQL Task 3 transformation runs the SQL statements saved in the files, with SQL Source Type property configured as a File Connection and Temp_FileName connection selected as a value for File Connection property. File System Task simply archives the processed files into the archive sub-directory as the last step of this package’s execution.

Tablediff_SSIS_ForEach_Loop2

Tablediff_SSIS_file_sys_task1

The complete package can be downloaded from my SkyDrive folder HERE so you can use it as a template – just make sure you adjust all necessary parameters e.g. database connection details etc. to correspond to your environment. I have also attached a short video footage depicting the package execution and the outcome on the tables’ data synchronization and SQL files management.

So there you go, a fairly simple and easy to manage data synchronization solution using tablediff utility and Integration Services tool. If you take away the VB.NET code snippet, the whole solution can be build using the default tablediff functionality, pure SSIS and one very simple SELECT SQL statement so if your project can live with the tablediff functionality constraints and you don’t require a great deal of performance, this tool can become a simple and straightforward option for your data synchronization project.

Tags: ,

SQL Server Data Synchronization Automation Using Tablediff Utility and SSIS – Part 1

February 19th, 2014 / No Comments » / by admin

Note: Second part to this series can be found HERE

On one of the recent project a client posed a requirement that a data synchronization routine between database objects is to be developed with minimal complexities i.e. as little SQL as possible, definitely no .NET/Python and preferably using in-house ETL tool such as MS SQL Server Integration Services. The solution was to be exceptionally serviceable by their internal group of analysts with no elaborate code base to maintain. Given the fact that SSIS has become their de facto tool for all data transformations and loading and most staff were familiar with the intuitive drag-and-drop style of development, the constraint of using SSIS in its purest form was the primary success-defining factor. I have written in length in my previous posts (mainly HERE, HERE and HERE) on different methods of synchronizing database objects and their content, however, I found that when KISS (Keep It Simple Stupid) methodology is a preference and no elaborate features and required, a little tool called tablediff, in combination with a simple SSIS package, gets the job done very nicely.

Tablediff lets you compare the data in two tables in order to identify any discrepancies and generate the T-SQL script necessary to update the data in the second table to conform to the first table, if discrepancies are found. Though originally intended primarily as a utility for troubleshooting non-convergence in a replication topology, tablediff can actually be used to perform a row-by-row comparison of any two tables that share a similar schema. The tool has a number of limitations which prohibit it from serving as a more enterprise-geared option e.g. the source table must be configured with a primary key/identity/rowguid column or only tables sharing same schema can be compared. Also, tablediff enforces certain restrictions around datatype support – synchronization feature doesn’t work with varchar(max) nvarchar(max), varbinary(max), timestamp, xml, text, ntext or image datatypes. However, in its defence, tablediff is free and very simple to use on its own or alternatively integrate with other services.

In this post I will not focus on all options and characteristics that tablediff offers but explain how a simple SSIS package can be built and configured to take advantage of its functionality and features to automate data synchronization process.

Let’s start with a sample dataset to demonstrate how those data differences can be resolved. Below is a simple SQL script which when run, creates 2 databases and 4 tables – 2 in each database. When executed, Source_DB database and its 2 tables contains data which serves as the master copy, whereas Target_DB database contains data which requires ‘alignment’ with the source to reflect its copy in an identical way. Even though I could contain the tablediff tool’s functionality using only two tables (the script creates a pair for each database, totaling to 4 tables all together), I decided to multiply it by the factor of 2 to showcase how we can loop through each table to encompass a large collection of objects. In this way, rather than running tablediff for each table separately, we can specify an assortment of tables which can be as small as a few hand-picked objects all the way up to the whole database with all tables included.

USE [master]
GO
IF EXISTS (SELECT name FROM sys.databases WHERE name = N'Source_DB')
BEGIN
-- Close connections to the DW_Sample database
ALTER DATABASE [Source_DB] SET SINGLE_USER WITH ROLLBACK IMMEDIATE
DROP DATABASE [Source_DB]
END
GO
CREATE DATABASE [Source_DB]

IF EXISTS (SELECT name FROM sys.databases WHERE name = N'Target_DB')
BEGIN
-- Close connections to the DW_Sample database
ALTER DATABASE [Target_DB] SET SINGLE_USER WITH ROLLBACK IMMEDIATE
DROP DATABASE [Target_DB]
END
GO
CREATE DATABASE [Target_DB]

USE Source_DB
CREATE TABLE Tbl1 (
ID int NOT NULL,
Sample_Data_Col1 varchar (50) NOT NULL,
Sample_Data_Col2 datetime NOT NULL)
GO

INSERT INTO Tbl1
(ID, Sample_Data_Col1, Sample_Data_Col2)
SELECT 1, 'OddValue1', '9999/12/31'
UNION ALL
SELECT 2, 'OddValue2', '9999/12/31'

SELECT * INTO Tbl2 FROM Tbl1
GO

ALTER TABLE Tbl1
ADD CONSTRAINT Source_pk_ID1 PRIMARY KEY (ID)
ALTER TABLE Tbl2
ADD CONSTRAINT Source_pk_ID2 PRIMARY KEY (ID)

USE Target_DB
CREATE TABLE Tbl1 (
ID int NOT NULL,
Sample_Data_Col1 varchar (50) NOT NULL,
Sample_Data_Col2 datetime NOT NULL)
GO

INSERT INTO Tbl1
(ID, Sample_Data_Col1, Sample_Data_Col2)
SELECT 1, 'OddValue1', GetDate()
UNION ALL
SELECT 3, 'OddValue3', '1900/01/01'

SELECT * INTO Tbl2 FROM Tbl1
GO

You can notice the difference when running a simple SELECT statement against both databases as per image below.

Tablediff_select_all_tables

Now that we have our source and target databases and tables defined, we need to create a separate table which will act as a placeholder for the objects we wish to synchronize and their corresponding details. We will refer to this table during package execution while looping through each of its records to fetch relevant information e.g. schema name, database name etc.

The code for this table which I called Params_details is as per below but you can populate it using other methods which reflect your environment setup more adequately.

IF OBJECT_ID('dbo.Params_details', 'U') IS NOT NULL
  DROP TABLE dbo.Params_details

CREATE TABLE Params_details
(ID	int IDENTITY		(1,1),
Source_DB	  varchar	(100)	NOT NULL,
Target_DB     varchar	(100)	NOT NULL,
Source_Schema varchar	(50)	NOT NULL,
Target_Schema varchar	(50)	NOT NULL,
Source_Tbl    varchar	(100)	NOT NULL,
Target_Tbl    varchar	(100)	NOT NULL)
GO

DECLARE @source_db_name varchar		(100)	= 'Source_DB'
DECLARE @target_db_name varchar		(100)	= 'Target_DB'
DECLARE @source_schema_name varchar (50)	= 'dbo'
DECLARE @target_schema_name varchar (50)	= 'dbo'
DECLARE @sql1 varchar (1000) =
	'INSERT INTO Params_details
	(Source_DB, Target_DB, Source_Schema, Target_Schema, Source_Tbl, Target_Tbl)
	SELECT DISTINCT '''+@source_db_name+''', '''+@target_db_name+''',
	'''+@source_schema_name+''', '''+@target_schema_name+''', a.name, b.name
	FROM '+@source_db_name+'.sys.tables a
	JOIN '+@target_db_name+'.sys.tables b ON a.name = b.name
	WHERE a.type = ''U'' and b.type = ''U'''

exec (@sql1)

Running a simple select against it, you can see that its content simply reflects the collection of objects with their schema and database details which we identified as the candidates for data merging.

Tablediff_select_all_params_details

For the purpose of this exercise, I have also created a directory on my C:\ drive called DiffOutput which will house the batch file used by the package, SQL files created during its execution and another directory called Archived_Files where files generated will be moved into and stored for further reference and in order to keep DiffOutput less cluttered. The directory structure and content should resemble the image below.

Tablediff_dir_structure

Lastly, before I start exploring the SSIS package components in the NEXT post to this series, I would like to touch on the batch file code and its functionality. As you can see in the above image the DiffOutput directory contains a small batch file which is called by the package’s Execute Process Task. The file executes tablediff.exe utility which providing you have SQL Server installed should also be created in SQL Server directory sub-tree. A number of parameters are also passed down from the package when the file is called as, at the minimum, tablediff requires source and target database, source and target schema and source and target table to be specified. As the script is configured to generate a SQL file containing modification code to be applied in order to synchronize source and target, -f option, followed by the full path and file name of the script file is also used to output the file into the nominated directory. Also, notice that all information is a single line entry only, if you choose to break it down into multiple lines tablediff will complain and throw an error. The remaining lines (lines 2-6) simply define the current date and time which are used to rename the file(s) generated, appending date and time string thus allowing for more effective archiving strategy. The batch file content is as per below.

"C:\Program Files\Microsoft SQL Server\110\COM\tablediff.exe" -sourceserver "SHERLOCK\MSSQL12_Tabular" -sourcedatabase [%1]  -sourceschema [%2]  -sourcetable [%3]   -destinationserver SHERLOCK\MSSQL12_TABULAR  -destinationdatabase [%4]  -destinationschema [%5]  -destinationtable [%6] -f c:\DiffOutput\difference.sql
set d=%date:~-4,4%%date:~-7,2%%date:~0,2%
set d=%d: =_%
set t=%time:~0,2%%time:~3,2%%time:~6,2%
set t=%t: =0%
RENAME "c:\diffoutput\difference.sql" "diff_%d%_%t%.sql"

In the NEXT post I will outline the SSIS package development steps and present the complete solution ‘at work’ in a short video footage.

Tags: ,