Hadoop Musing – How to Install Hortonworks Hadoop Distribution (Sandbox and HDP) Locally On Microsoft Windows

April 6th, 2014 / No Comments » / by admin

Introduction

The venerable Hadoop obsession has taken its toll on me, so I thought that, in the spirit of current data management trends, it would be a good idea to start looking into the whole Hadoop ecosystem. Every executive I have recently spoken to about their data strategy has been asking about it (whether they actually understand it or not is a different story) and I am starting to feel that if I don't succumb to this craze I will miss out on a worldwide phenomenon that's been evolving behind my back for a few years now – the 'Big Data' movement. I went to a few seminars and meet-ups where the folks from Cloudera, Hortonworks and others drummed on about how nearly every major RDBMS vendor will rot away clinging to its archaic ways of storing structured data and disappear into the abyss of relational dismay unless it joins this new data liberation movement and starts integrating NoSQL, Big Data and Hadoop-like paradigms into its application pool to stay somewhat relevant in this new world order. Regardless of whether the PR web they're spinning has any merit, or what the future actually holds, there is nothing better than getting my hands dirty and trying this thing for myself. So here I am, looking at my first Hadoop deployment and writing this post to help other folks out there make up their minds on Hadoop vs. OLTP/OLAP technologies for data analysis and management.

Given that it's my first post on this topic (plenty more in the pipeline so make sure you stay tuned) I thought it would be a good idea to start with something simple and outline the installation process for the Hortonworks Hadoop distribution on a Windows platform. Hortonworks also provides a ready-made VM image which can be deployed with just a few clicks but, given that the actual installation process is slightly more involved, I thought it would also be worth stepping through it in more detail.

Installing Hortonworks Sandbox v. 2.0 (on premise)

Let's start with the sandbox, which can be downloaded from the Hortonworks website and deployed using either the free VirtualBox application (Hortonworks' recommended option) or other virtualization software e.g. VMware or Hyper-V. Just make sure that virtualization is enabled in the BIOS, that the minimum RAM can be allocated (8GB if you plan to run Ambari and HBase) and that a compatible browser is installed on your box e.g. Chrome 25+, IE 9+ etc. After downloading the VM image, the 'installation' process is very straightforward i.e. providing you have a suitable application installed to mount the image, it's just a matter of double-clicking on the downloaded file, which starts the mounting process, pointing your browser at the IP address shown in the installation output, filling out the registration form and diving into the sandbox. Below are sample screenshots depicting the installation session at its completion and the browser interface for the sandbox environment when first initiated.

HDP_Local_Install_VMSessionIPAddress

HDP_Local_Install_HDPSandboxWebInterface

Installing HDP 2.0 for Windows (on premise)

Installing HDP is a little more involved and, given that the final product does not include native/out-of-the-box Apache Ambari or Hue components, it's a little underwhelming to discover that you cannot use a GUI to manage HDP or query its data without additional configuration, which can be quite complex for the uninitiated.

The following is a collection of step-by-step instructions on how to provision a single-node installation of HDP on Windows Server 2012 R2. Before we can commence the installation process, however, several prerequisites need to be fulfilled, without which the process will fail, so make sure all of them are met prior to cluster deployment.

  • Install Microsoft Visual C++ 2010 Redistributable package (64-bit)
  • Install Microsoft .NET Framework 4.0 (unless installing on Windows Server 2012 or higher in which case it should already be a part of the base installation)
  • Install Java Development Kit (JDK)
  • Install Python 2.7 (or a newer release if one has superseded it)

Installing the first two should not be a problem and usually no additional configuration is required. However, when installing the Java JDK and Python, further tweaks are required, so I think it's a good idea to outline those in more detail.

Java JDK Installation and Configuration

After Java JDK has been downloaded (the supported version can be fetched from HERE), install it to a directory which has no white space characters in its path. This is very important as the default option/path will not work! You can, for example, nominate the following as the new installation directory – C:\Java (as per image below) – which has no spaces in its directory path.

HDP_Local_Install_JavaDirectory

Next, create a system variable called JAVA_HOME and assign it the full path to the JDK installation directory defined earlier as its value. This is done by opening up Control Panel's System pane, clicking on Advanced system settings and, under the Advanced tab, clicking the Environment Variables button. Next, under System variables, click New and enter JAVA_HOME as the variable name. Finally, enter the JDK installation path e.g. C:\Java as the variable value, as per the image below, and confirm your input/selection by clicking OK/Apply Changes when finished.

HDP_Local_Install_JavaPathConfig
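If you prefer to script this step, the same variable can also be created from an elevated PowerShell prompt. This is a minimal sketch, assuming the C:\Java path nominated above – adjust it to your own JDK directory.

# Set JAVA_HOME machine-wide (run from an elevated PowerShell prompt);
# C:\Java is the example path used above - substitute your own JDK directory
[Environment]::SetEnvironmentVariable("JAVA_HOME", "C:\Java", "Machine")

Note that variables set this way only become visible in sessions started after the change.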

Python 2.7 Installation and Configuration

After you have downloaded the Windows-specific Python distribution (the binaries can be fetched from HERE), install it using the default configuration. Next, update the path settings in a similar way to the Java JDK configuration i.e. go to Control Panel -> System pane -> Advanced system settings -> Advanced tab -> Environment Variables button and, under System variables, find PATH and click Edit. In the Edit window, modify PATH by appending the installation path for your Python directory to its value. For example, if the Python executable is installed in C:\Python27, you must append this value to PATH as per the image below.

HDP_Local_Install_Python27Path
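Again, this can be scripted if you prefer; a minimal sketch from an elevated PowerShell prompt, assuming the default C:\Python27 location used above.

# Append the Python directory to the machine-wide PATH
# (assumes Python was installed to C:\Python27 as per the example above)
$current = [Environment]::GetEnvironmentVariable("Path", "Machine")
[Environment]::SetEnvironmentVariable("Path", "$current;C:\Python27", "Machine")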

You can validate the Python installation by opening up a PowerShell terminal and typing in python. If the installation and configuration were successful you should get a response similar to the image below.

HDP_Local_Install_Python27CorrectPSOutput

Server Configuration

Now it’s time to configure the server before we can begin with HDP installation. The below steps outline Windows Server configuration and HDP provisioning.

Firstly, obtain the HOSTNAME for the server you’re installing to. This can be achieved by typing hostname in the command line or Windows PowerShell environment and capturing the output as per image below.

HDP_Local_Install_HostnamePSCheck

Next, disable the firewall and open up all ports. The simplest way around it is to go to the Control Panel, type Firewall into the search box and click on Check Firewall Status. Next, click on Turn Windows Firewall On or Off to turn it off for all available networks (this can also be scripted – see the sketch below). Alternatively, individual ports will need to be opened. The table below outlines the default ports used by the various services, so unless you are in a position to turn the firewall off completely, these will need to be opened before HDP installation.
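If you do opt to disable the firewall outright, a minimal sketch from an elevated PowerShell prompt, assuming the NetSecurity cmdlets available on Windows Server 2012 and later:

# Turn the Windows Firewall off for all profiles (elevated PowerShell prompt);
# lab/demo boxes only - never do this in production
Set-NetFirewallProfile -Profile Domain,Private,Public -Enabled False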

HDP_Local_Install_PortsOverview

Next, download HDP from the Hortonworks website (currently distributed as a zipped-up file) and go to the folder where the downloaded file was saved. Extract its content to a separate directory. For the purpose of this demo I decided to simply extract it into the Downloads directory, where the package content ended up under a directory called hdp-2.0.6-GA_extract as per the image below.

HDP_Local_Install_DownloadDirectory

At this point, under the hdp-2.0.6-GA_extract directory, you should see a number of files, one of them called clusterproperties.txt. Open it up and replace the server names for all the nodes under #Hosts, and for DB_HOSTNAME under #Database Host, with the host name of your machine (obtained earlier via the hostname command). Also, ensure that you change the Log and Data directories to point to the appropriate drive on your server (most likely the C drive). The modified clusterproperties.txt file's content should resemble the image below.

HDP_local_Install_ClusterPropsTextFile
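As a rough illustration only – the authoritative key names come from the template shipped inside the HDP package you extracted, so treat the entries below as an assumption and edit the actual file – a single-node configuration ends up looking something like this (WIN-HDP01 is a hypothetical hostname):

#Log directory
HDP_LOG_DIR=C:\hadoop\logs

#Data directory
HDP_DATA_DIR=C:\hdp\data

#Hosts - on a single-node install every role points at the same machine
NAMENODE_HOST=WIN-HDP01
SECONDARY_NAMENODE_HOST=WIN-HDP01
RESOURCEMANAGER_HOST=WIN-HDP01
HIVE_SERVER_HOST=WIN-HDP01
OOZIE_SERVER_HOST=WIN-HDP01
WEBHCAT_HOST=WIN-HDP01
SLAVE_HOSTS=WIN-HDP01

#Database host
DB_FLAVOR=derby
DB_HOSTNAME=WIN-HDP01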

Next, save the file and, from a command prompt with administrative privileges, run the following command, ensuring that the directory paths reflect your environment/folder setup.

msiexec /i "C:\Users\dingopole\downloads\hdp-2.0.6-GA_extract\hdp-2.0.6.0.winpkg.msi" /lv "hdp.log" HDP_LAYOUT="C:\Users\dingopole\Downloads\hdp-2.0.6-GA_extract\clusterproperties.txt" HDP_DIR="C:\hdp\hadoop" DESTROY_DATA="yes"

Alternatively, open up the PowerShell prompt in Administrator mode and execute the extracted MSI through this command, ensuring that you have navigated to the directory where the MSI file is stored.

msiexec /i "hdp-2.0.6.0.winpkg.msi"

The below image depicts the command input in PowerShell as per my VM’s directory structure and the HDP MSI file’s saved location.

HDP_Local_Install_MSIFileExecInPS

This should present you with the following form, where many parameters appear pre-populated with default values. Complete the form with the parameters which reflect your environment, ensuring that DERBY and not MSSQL is selected from the drop-down for the 'DB Flavour' selection and that 'Delete Existing HDP data' is ticked as per the image below.

HDP_Local_Install_MSIFileExecForm

A successful installation should conclude with the following message.

HDP_Local_Install_SuccessMsg

On the desktop of your server you should now see three new icons. Click the Hadoop command prompt and navigate back one directory by executing the following.

cd..

Next, type in the following to start services associated with HDP installation.

Start_local_HDP_services

Wait for a few minutes for this job to complete and validate the services start-up by issuing the following from the same command prompt.

Run-SmokeTests

This command runs a variety of tests against the new installation to ensure that MapReduce, Hive, Pig, ZooKeeper etc. have been provisioned correctly and are running as expected.

Conclusion

As you can see, unless you're happy to use the VM image provided by Hortonworks, installing Hadoop can be quite an involved process, even though this is only a single-node installation. Whole books have been written on Hadoop environment provisioning, so this post is only meant to serve as an introduction to the much more complex topic of Hadoop environment deployment. At this stage, regardless of which distribution you choose to go with, the ecosystem is still in a state of flux, with new features rolled out nearly every month. The status quo around the companies supporting those distributions also fluctuates e.g. Intel abandoning its proprietary Hadoop release in favor of investing in Cloudera instead, so even though it hurts me to say it, I wouldn't necessarily take these instructions for granted. At the same time, any forthcoming releases should become more coherent, integrated and easier to deploy and manage, so breaking into the Hadoop 'black box' should also become easier.

As this is my first post on Hadoop I have merely managed to scratch the surface of this technology. In the next few posts I will dive deeper into the HDP/Hadoop paradigm, exploring its features and functionality in more detail so stay tuned!


SQL Server Data Synchronization Automation Using Tablediff Utility and SSIS – Part 2

February 19th, 2014 / No Comments » / by admin

In the FIRST post of this short series I briefly outlined the tablediff utility and set the stage (databases, objects, directories and batch file) for the SSIS package responsible for the data synchronization workflow. In this post I will continue with the solution development and show you how to build and configure the package to take advantage of the tablediff tool and merge data across the two databases and their objects deployed as per the PART 1 prelude. The complete package can be downloaded from my SkyDrive folder HERE if you wish to use it as a template – just make sure you adjust all necessary parameters e.g. database connection details etc. to correspond to your environment. I have also attached short video footage depicting the package execution and its outcome on the tables' data synchronization and SQL files management at the end of this post.

The complete package looks as per the image below, with the core functionality encapsulated in the first two sections. The remaining Script Task and the second ForEach Loop Container are there to provide additional features i.e. the Script Task checks whether the file(s) have been created in the nominated directories, while the second ForEach Loop Container executes the scripts generated by the tablediff utility and archives the processed files into a separate directory.

Tablediff_complete_SSIS_package

Let's start by creating a bunch of variables which will house data temporarily used during package runtime or provide value-dependent execution topology. Most of the variables are pretty self-explanatory, with names reflecting their function. The Common_Var variable of the System.Object type will house the results of the query against the Params_details table (see the previous post HERE), whereas the Boolean File_Exists variable is populated by the Script Task and its value determines the execution pathway depending on whether the file(s) have been created or not. Finally, Temp_FileName is used by the second ForEach Loop Container when iterating through the SQL files created.

Tablediff_SSIS_variables_collection

Looking at the Connection Manager, there is one database connection pointing to the Source_DB database and two file connections – Archived_Files, which points to the archive directory (C:\Diffoutput\Archived_Files), and Temp_FileName, which uses an expression for its connection string as per the image below.

Tablediff_Temp_Filename_conn_expression
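The expression itself is not reproduced here in full, but it essentially boils down to binding the connection's ConnectionString property to the variable populated by the loop; a minimal sketch, using the Temp_FileName variable described above:

@[User::Temp_FileName]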

Let's now go through the Control Flow pane and outline the individual containers' roles. Starting with the Execute SQL Task 1 transformation, it simply executes the SELECT SQL statement, returning the data as a full result set, which in turn is mapped to the Common_Var variable as per below.

Tablediff_SSIS_Exec_SQL1

Tablediff_SSIS_Exec_SQL1_Var_Assignment
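The statement itself is nothing more than a straight SELECT against the Params_details table set up in Part 1. The column names below are purely hypothetical (the real ones come from your Part 1 schema); the point to note is that the column order must match the variable mappings configured further down.

-- Hypothetical column names - substitute the ones defined in Part 1;
-- only the column ORDER matters to the downstream variable mappings
SELECT source_server, source_db, source_table,
       destination_server, destination_db, destination_table
FROM dbo.Params_details;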

Next, the first ForEach Loop Container is configured to use the ADO object source variable Common_Var and the following variables mappings.

Tablediff_SSIS_ForEach_Loop1

Tablediff_SSIS_ForEach_Loop1_Var_Mapping

Note that the column order used in the SELECT SQL statement from the first transformation must correspond to the variable assignment in the first ForEach Loop Container, otherwise the package will fail.

The Execute Process Task simply runs the batch file saved in the nominated directory, passing the variables as arguments in the form of an expression as per below. Also, note that the order in which those appear in the expression corresponds to the SELECT SQL statement's attribute order and the variable mapping order in the ForEach Loop Container.

Tablediff_exec_process_process_property

Tablediff_exec_process_expression_property
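To put those arguments in context, the batch file from PART 1 ultimately issues a tablediff call along these lines. This is only a sketch – the server, database and table names are placeholders, and the tablediff.exe path will vary with your SQL Server version – but the -f switch is what makes the utility script the detected differences out to a .sql file in C:\DiffOutput:

rem Placeholder names throughout - substitute your own servers, databases and tables
"C:\Program Files\Microsoft SQL Server\110\COM\tablediff.exe" ^
 -sourceserver "MYSERVER" -sourcedatabase "Source_DB" -sourceschema "dbo" -sourcetable "Tbl1" ^
 -destinationserver "MYSERVER" -destinationdatabase "Target_DB" -destinationschema "dbo" -destinationtable "Tbl1" ^
 -f "C:\DiffOutput\Tbl1_Diff"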

Notice that the FailTaskIfReturnCodeIsNotSuccessValue property is set to False. This is due to the fact that if no changes are detected, tablediff will return an exit code of 1 which, not matching the default SuccessValue of 0, would effectively fail package execution. As we don't want to treat this as an error or exception, the FailTaskIfReturnCodeIsNotSuccessValue property needs to be changed from its default setting to False.

Following on, we have a Script Task with a very simple VB.NET code snippet checking for the existence of the SQL files. I know in the first post I said that no code would be required, but this section is not compulsory – it uses default SSIS functionality and is only there as an enhancement, so you can remove it if you wish. The File_Exists variable also needs to be listed in the ReadWriteVariables property (not ReadOnlyVariables), as the script assigns a value to it.

Tablediff_SSIS_script_task_config

#Region "Imports"
Imports System
Imports System.Data
Imports System.Math
Imports Microsoft.SqlServer.Dts.Runtime
Imports System.IO
#End Region

<Microsoft.SqlServer.Dts.Tasks.ScriptTask.SSISScriptTaskEntryPointAttribute()> _
<System.CLSCompliantAttribute(False)> _
Partial Public Class ScriptMain
    Inherits Microsoft.SqlServer.Dts.Tasks.ScriptTask.VSTARTScriptObjectModelBase

    Public Sub Main()
        Dim directory As DirectoryInfo = New DirectoryInfo("c:\DiffOutput")
        Dim file As FileInfo() = directory.GetFiles("*.sql")

        If file.Length > 0 Then
            Dts.Variables("User::File_Exists").Value = True
        Else
            Dts.Variables("User::File_Exists").Value = False
        End If

        Dts.TaskResult = ScriptResults.Success
        Dts.TaskResult = ScriptResults.Success
    End Sub

#Region "ScriptResults declaration"
    Enum ScriptResults
        Success = Microsoft.SqlServer.Dts.Runtime.DTSExecResult.Success
        Failure = Microsoft.SqlServer.Dts.Runtime.DTSExecResult.Failure
    End Enum
#End Region
End Class

This code also populates the File_Exists variable, which is used in the Precedence Constraints joining the subsequent transformations. Each constraint contains a conditional expression which, when evaluated, returns a Boolean value altering the package execution flow i.e. pointing it in either of the two directions – Execute SQL Task 1 or ForEach Loop Container 2. Below is the Precedence Constraint editor view for the branch executing the ForEach Loop Container 2 section.

Tablediff_SSIS_precedence_constraint_formula
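The expression itself is just a Boolean check against the variable populated by the Script Task; with the constraint's evaluation operation set to Expression (or Expression and Constraint), it looks along these lines:

@[User::File_Exists] == True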

In the last section, ForEach Loop Container 2 iterates through the SQL files created, using the following enumerator configuration and variable mapping.

Tablediff_ForEach_Loop2_Var_Mapping

Inside ForEach Loop Container 2, the Execute SQL Task 3 transformation runs the SQL statements saved in the files, with the SQLSourceType property configured as File Connection and the Temp_FileName connection selected as the value of the FileConnection property. The File System Task simply archives the processed files into the archive sub-directory as the last step of this package's execution.

Tablediff_SSIS_ForEach_Loop2

Tablediff_SSIS_file_sys_task1


So there you go – a fairly simple and easy-to-manage data synchronization solution using the tablediff utility and the Integration Services tool. If you take away the VB.NET code snippet, the whole solution can be built using the default tablediff functionality, pure SSIS and one very simple SELECT SQL statement, so if your project can live with tablediff's functionality constraints and you don't require a great deal of performance, this tool can become a simple and straightforward option for your data synchronization project.
