{"id":1868,"date":"2013-09-27T01:10:31","date_gmt":"2013-09-27T01:10:31","guid":{"rendered":"http:\/\/bicortex.com\/?p=1868"},"modified":"2013-09-27T02:42:42","modified_gmt":"2013-09-27T02:42:42","slug":"decision-trees-solving-customer-classification-marketing-problem-with-excel-and-microsoft-sql-server-data-mining-tools","status":"publish","type":"post","link":"http:\/\/bicortex.com\/bicortex\/decision-trees-solving-customer-classification-marketing-problem-with-excel-and-microsoft-sql-server-data-mining-tools\/","title":{"rendered":"Decision Trees &#8211; Solving Customer Classification Marketing Problem With Excel And Microsoft SQL Server Data Mining Tools"},"content":{"rendered":"<p style=\"text-align: justify;\">A few months ago I started playing around with a bunch of statistical applications and built a very simple Twitter sentiment analysis engine for feeds classification using RapidMiner \u2013 you can see the full post <strong><a href=\"https:\/\/skydrive.live.com\/redir?resid=715AEF07A82832E1!55887&amp;authkey=!AJ80I9RifRP0haw\" target=\"_blank\">HERE<\/a><\/strong>. Since then I have been quite busy and haven\u2019t had a chance to play with data mining applications any further but given my recent time off work and a few extra hours on my hands I was eager to replicate a proof-of-concept solution I built a long time ago for a marketing department in response to their issue with customer classification for catalogue mail-out. For that project I put together a rudimentary SSIS, SSAS and SSRS solution which, in its core functionality utilised Microsoft data mining features. In this post I will show you how to build a similar solution (highly simplified though \u2013 no reporting or complex ETL\/data cleansing features) using Excel. I will also use the same dataset to compare my results to those generated by SAP Predictive Analysis \u2013 another application which I recently attended a demo of. SAP Predictive Analysis is a \u2018new kid on the block\u2019 of data mining software crated to allow analysts and business users, not quants with PhDs, to build simple models to gather insight from their data.<\/p>\n<p style=\"text-align: justify;\">Excel, being the lovechild of any commercial department can be utilised as a powerful tool for many different domains of data mining, especially in conjunction with Analysis Services engine running as part of Microsoft SQL Server deployment. Yes, it is true that Excel contains many mathematical and statistical functions, however, on its own, Excel does not provide the functionality for robust data cleansing, model testing, scripting languages such as DMX, deployment options etc. which can be achieved in conjunction with SQL Server engine. Also, in order to take advantage of the highly simplified data mining techniques mostly driven by step-by-step wizard dialogs, Excel requires a connection to Analysis Services so if you\u2019re keen to replicate this exercise in your own environment, make sure you can connect to SSAS database and have Excel data mining add-in installed.<\/p>\n<p style=\"text-align: justify;\">\u00a0Let\u2019s go ahead and start with a problem definition to help us determine the possible course of action and two sample datasets \u2013 one for model training and another one which will be used for scoring.<\/p>\n<h3 align=\"center\">Problem Definition<\/h3>\n<p style=\"text-align: justify;\">Let\u2019s envisage that a telecom company has struck a deal with a telephone manufacturer and is about to release a new model of a highly anticipated smart phone on a competitive plan. They know that the phone is going to be a hit among its customers but given that their marketing department is very cutting-edge and would like to maximise the odds of addressing advertising campaign budget towards the right group of clients, they want to use the customer data to pinpoint those individuals who are happy not to upgrade their existing phones. Chances are that there will almost certainly be a large group of die-hard fans who are even happy to camp outside the store just to be the first ones to get their hands on the latest gadget when it eventually gets released \u2013 these individuals do not need the power of persuasion or marketing dollars spent on. It\u2019s the ones who are on the fence, undecided or doubtful that may need just a little nudge to sway them over the line. This group will unknowingly become the focus of the campaign but in order to determine their profile, we first must determine the method we can employ to single out customers who fall into this category.<\/p>\n<p style=\"text-align: justify;\">The decision tree is probably the most popular data mining technique because of fast training performance, a high degree of accuracy, and easily understandable patterns. The most common data mining task for a decision tree is classification \u2014 that is, determining whether or not a set of data belongs to a specific type, or class. For example, loan applicants can be classified as high risk or low risk, and decision trees help determine the rules to perform that classification based on historical data. The principal idea of a decision tree is to split your data recursively into subsets. Each input attribute is evaluated to determine how cleanly it divides the data across the classes (or states) of your target variable (predictable attribute). The process of evaluating all inputs is then repeated on each subset. When this recursive process is completed, a decision tree is formed. For this exercise I will use Microsoft decision trees algorithm as it fits the intended classification purpose very well. There is lots of information on the internet pertaining to decision tree algorithm applications so rather than theorising, let\u2019s look at some concrete examples of data that will be used for this post.<\/p>\n<p style=\"text-align: justify;\">Let\u2019s assume that marketing department has provided us with a training dataset (downloadable <a href=\"https:\/\/skydrive.live.com\/redir?resid=715AEF07A82832E1!55887&amp;authkey=!AJ80I9RifRP0haw\" target=\"_blank\"><b>HERE<\/b><\/a>) of customers (identified with unique IDs for privacy reasons) in conjunction with a series of additional attributes such as Gender, Age, Marital Status, Contract Length, whether the customer has kids, whether he or she is married etc. Amongst all those there is also one attribute &#8211; Adopter Class &#8211; which categorises clients based on whether they are likely to adopt the new product relatively quickly after it has been released or whether they are happy to wait. This attribute has four distinctive values \u2013 Late, Very Late, Early, Very Early \u2013 which indicate customer adoption promptness. Obviously, if a customer has been tagged with Very Early in the past based on their previous behaviour patterns, he or she is highly likely to procure a new phone the moment it becomes available. On the other side of the spectrum are those customers marked as Very Late, meaning they are not as savvy to peruse the latest and greatest and are happy to wait. All those attributes in the training dataset will be used to train our model to learn about the most deterministic factors that influence clients\u2019 behaviours.<\/p>\n<p style=\"text-align: justify;\">We were also given a second dataset (downloadable <a href=\"https:\/\/skydrive.live.com\/redir?resid=715AEF07A82832E1!55887&amp;authkey=!AJ80I9RifRP0haw\" target=\"_blank\"><b>HERE<\/b><\/a>) which contains a different set of customers with same descriptive attributes attached to their IDs but this time without Adopter Class populated. The goal is to mine the already pre-classified customers from our first dataset and based on the findings apply the \u2018knowledge\u2019 to determine the category the new set of customers will most likely fall into.<\/p>\n<h3 align=\"center\">Datasets and SQL Code<\/h3>\n<p style=\"text-align: justify;\">As far as the individual datasets, you can download them <a href=\"https:\/\/skydrive.live.com\/redir?resid=715AEF07A82832E1!55887&amp;authkey=!AJ80I9RifRP0haw\" target=\"_blank\"><b>HERE<\/b><\/a> or alternatively, you can re-create them from scratch using the SQL code I wrote to populate our training and scoring dataset files as per below. Please note that if you wish to run the code yourself, ensure that you have SampleData directory created on your C:\\ drive or alternatively change the code to output the files into your nominated directory structure. Also, in relation to datasets content, note that this data was synthesised using SQL and is not a true representation of the customer data in a sense that the values are not as random or representative of a group of individuals as they would be if the data was coming from a data warehouse or CRM sources.<\/p>\n<pre class=\"brush: sql; title: ; notranslate\" title=\"\">\r\nUSE &#x5B;master]\r\nGO\r\nIF EXISTS (SELECT name FROM sys.databases WHERE name = N'Temp_DB')\r\nBEGIN\r\n-- Close connections to the DW_Sample database\r\nALTER DATABASE &#x5B;Temp_DB] SET SINGLE_USER WITH ROLLBACK IMMEDIATE\r\nDROP DATABASE &#x5B;Temp_DB]\r\nEND\r\nGO\r\nCREATE DATABASE &#x5B;Temp_DB] \r\nGO\r\nUSE  &#x5B;Temp_DB] \r\n\r\nDECLARE @Gender Table\r\n(Gender char(1))\r\nDECLARE @Age Table\r\n(Age int)\r\nDECLARE @Marital_Status Table\r\n(Marital_Status varchar (20))\r\nDECLARE @Yes_No_Flag Table\r\n(Flag char (1))\r\nDECLARE @Current_Plan Table\r\n(CPlan varchar (20))\r\nDECLARE @Payment_Method Table\r\n(Payment_Method varchar (50))\r\nDECLARE @Contract_Length Table\r\n(Contract_Length varchar (30))\r\nDECLARE @Adopter_Class Table\r\n(Adopter_Class varchar (20))\r\nDECLARE @Age_Range Table\r\n(Age int)\r\n\r\nDECLARE @Years int = 18\r\n\r\nWHILE @Years &lt; 65\r\nBEGIN\r\n\tINSERT INTO @Age_Range\r\n\tSELECT @Years\r\n\tSET @Years = @Years +1\r\nEND\r\n\r\nINSERT INTO @Gender \r\nVALUES ('M'), ('F')\r\nINSERT INTO @Age \r\nSELECT * FROM @Age_Range\r\nINSERT INTO @Marital_Status\r\nVALUES ('Married'), ('Single')\r\nINSERT INTO @Yes_No_Flag \r\nVALUES('Y'), ('N')\r\nINSERT INTO @Current_Plan\r\nVALUES ('Low'), ('Medium'), ('Heavy'), ('PrePaid')\r\nINSERT INTO @Payment_Method\r\nVALUES ('Automatic'), ('Non-Automatic')\r\nINSERT INTO @Contract_Length\r\nVALUES ('12 Months'), ('24 months'), ('36 Months'), ('No Contract')\r\nINSERT INTO @Adopter_Class\r\nVALUES ('Very Early'), ('Early'), ('Very Late'), ('Late')\r\n\r\nCREATE TABLE Temp_Results\r\n(UserID\t\t\t\t\tint\t\t\t\tNOT NULL,\r\nGender\t\t\t\t\tchar(1)\t\t\tNOT NULL,\r\nAge\t\t\t\t\t\tint\t\t\t\tNOT NULL,\r\nMarital_Status\t\t\tvarchar (20)\tNOT NULL,\r\nCurrent_Plan\t\t\tvarchar (20)\tNOT NULL,\r\nPayment_Method\t\t\tvarchar (50)\tNOT NULL, \r\nContract_Length\t\t\tvarchar (30)\tNOT NULL,\r\nHas_Kids\t\t\t\tchar (1)\t\tNOT NULL,\r\nOther_Services_Bundled\tchar(1)\t\t\tNOT NULL,\r\nAdopter_Class\t\t\tvarchar(20)\t\tNOT NULL)\r\n\r\nSET NOCOUNT ON\r\nDECLARE @Random INT;\r\nDECLARE @Upper INT;\r\nDECLARE @Lower INT\r\nSET @Lower = 10000\r\nSET @Upper = 50000\r\nDECLARE @Records_Count int = 0\r\nWHILE @Records_Count &lt; 5000\r\nBEGIN\r\n\tINSERT INTO Temp_Results\r\n\tSELECT  ROUND(((@Upper - @Lower -1) * RAND() + @Lower), 0), \r\n\t(SELECT TOP (1) Gender\t\t\tFROM @Gender\t\t\tORDER BY NewID()),\r\n\t(SELECT TOP (1) Age\t\t\t\tFROM @Age\t\t\t\tORDER BY NewID()),\r\n\t(SELECT TOP (1) Marital_Status\tFROM @Marital_Status\tORDER BY NewID()),\r\n\t(SELECT TOP (1) CPlan\t\t\tFROM @Current_Plan\t\tORDER BY NewID()),\r\n\t(SELECT TOP (1) Payment_Method\tFROM @Payment_Method\tORDER BY NewID()),\r\n\t(SELECT TOP (1) Contract_Length FROM @Contract_Length\tORDER BY NewID()),\r\n\t(SELECT TOP (1) Flag\t\t\tFROM @Yes_No_Flag\t\tORDER BY NewID()),\r\n\t(SELECT TOP (1) Flag\t\t\tFROM @Yes_No_Flag\t\tORDER BY NewID()),\r\n\t(SELECT TOP (1) Adopter_Class\tFROM @Adopter_Class\t\tORDER BY NewID())\r\n\tSET @Records_Count = @Records_Count + 1\r\nEND\r\n\r\nUPDATE Temp_Results\r\nSET Payment_Method\t=\tCASE\tWHEN Current_Plan = 'PrePaid'\tTHEN 'Non-Automatic' ELSE Payment_Method END\r\nUPDATE Temp_Results\r\nSET Contract_Length =\tCASE\tWHEN Current_Plan = 'PrePaid'\tTHEN 'No Contract' ELSE Contract_Length END\r\nUPDATE Temp_Results\r\nSET Adopter_Class\t=\tCASE\tWHEN\tAge BETWEEN 18 and 35 AND\r\n\t\t\t\t\t\t\t\t\t\tOther_Services_Bundled = 'Y' AND\r\n\t\t\t\t\t\t\t\t\t\tCurrent_Plan IN ('Medium', 'Heavy', 'Low') AND\r\n\t\t\t\t\t\t\t\t\t\tContract_Length IN ('12 Months', '24 Months')\r\n\t\t\t\t\t\t\t\t\t\tTHEN 'Very Early'\r\n\t\t\t\t\t\t\t\tWHEN\tAge BETWEEN 36 and 45 AND\r\n\t\t\t\t\t\t\t\t\t\tOther_Services_Bundled = 'Y' AND\r\n\t\t\t\t\t\t\t\t\t\tCurrent_Plan IN ('Medium', 'Heavy', 'Low') AND\r\n\t\t\t\t\t\t\t\t\t\tContract_Length IN ('12 Months', '24 Months', '36 Months')\r\n\t\t\t\t\t\t\t\t\t\tTHEN 'Early'\r\n\t\t\t\t\t\t\t\tWHEN\tAge BETWEEN 46 and 55 AND\r\n\t\t\t\t\t\t\t\t\t\tOther_Services_Bundled = 'N' AND\r\n\t\t\t\t\t\t\t\t\t\tCurrent_Plan IN ('Medium','Low') AND\r\n\t\t\t\t\t\t\t\t\t\tContract_Length IN ('12 Months', '24 Months', '36 Months')\r\n\t\t\t\t\t\t\t\t\t\tTHEN 'Late'\r\n\t\t\t\t\t\t\t\tWHEN\tAge &gt; 55 AND\r\n\t\t\t\t\t\t\t\t\t\tOther_Services_Bundled ='N' AND\r\n\t\t\t\t\t\t\t\t\t\tCurrent_PLan IN ('PrePaid', 'Low') AND\r\n\t\t\t\t\t\t\t\t\t\tContract_Length IN ('No Contract', '36 Months', '24 Months')\r\n\t\t\t\t\t\t\t\t\t\tTHEN 'Very Late' ELSE Adopter_Class END\r\n\r\nDECLARE @ID varchar (200)\r\nDECLARE @COUNT int\r\nDECLARE CUR_DELETE CURSOR FOR\r\n\tSELECT UserID,COUNT(&#x5B;UserID]) FROM &#x5B;Temp_Results] \r\n\tGROUP BY &#x5B;UserID] HAVING COUNT(&#x5B;UserID]) &gt; 1\r\n\tOPEN CUR_DELETE\r\n\t\tFETCH NEXT FROM CUR_DELETE INTO @ID, @COUNT\r\n\t\tWHILE @@FETCH_STATUS = 0\r\n\t\tBEGIN\r\n\t\tDELETE TOP(@COUNT -1) FROM &#x5B;Temp_Results] WHERE &#x5B;UserID] = @ID\r\n\t\tFETCH NEXT FROM CUR_DELETE INTO @ID, @COUNT\r\n\tEND\r\nCLOSE CUR_DELETE\r\nDEALLOCATE CUR_DELETE\r\n\r\nDECLARE\r\n@saveloc\t\tVARCHAR(2048),\r\n@query\t\t\tVARCHAR(2048),\r\n@bcpquery\t\tVARCHAR(2048),\r\n@bcpconn\t\tVARCHAR(64),\r\n@bcpdelim\t\tVARCHAR(2)\r\n\r\n\/*\r\nTo enable CMD_SHELL IF DISABLED\r\nEXEC sp_configure 'show advanced options', 1\r\nGO\r\nRECONFIGURE\r\nGO\r\nEXEC sp_configure 'xp_cmdshell', 1\r\nGO\r\nRECONFIGURE\r\nGO\r\n*\/\r\n\r\nSET @query =\t'USE Temp_DB \r\n\t\t\t\tSELECT \r\n\t\t\t\t''UserID''\t\t\t\t\tas H1,\r\n\t\t\t\t''Gender''\t\t\t\t\tas H2,\r\n\t\t\t\t''Age''\t\t\t\t\t\tas H3,\r\n\t\t\t\t''Marital_Status''\t\t\tas H4,\r\n\t\t\t\t''Current_Plan''\t\t\tas H5,\r\n\t\t\t\t''Payment_Method''\t\t\tas H6,\r\n\t\t\t\t''Contract_Length''\t\t\tas H7,\r\n\t\t\t\t''Has_Kids''\t\t\t\tas H8,\r\n\t\t\t\t''Other_Services_Bundled''\tas H9,\r\n\t\t\t\t''Adopter_Class''\t\t\tas H10\r\n\t\t\t\tUNION ALL\r\n\t\t\t\tSELECT \r\n\t\t\t\tCAST(UserID as Varchar(10)),\r\n\t\t\t\tGender,\r\n\t\t\t\tCAST(Age as Varchar (10)),\r\n\t\t\t\tMarital_Status,\r\n\t\t\t\tCurrent_Plan,\r\n\t\t\t\tPayment_Method,\r\n\t\t\t\tContract_Length,\r\n\t\t\t\tHas_Kids,\r\n\t\t\t\tOther_Services_Bundled,\r\n\t\t\t\tAdopter_Class\r\n\t\t\t\tFROM Temp_Results'\r\nSET @saveloc = 'c:\\SampleData\\Training_DataSet.csv'\r\nSET @bcpdelim   = ','\r\nSET @bcpconn    = '-T' -- Trusted\r\n--SET @bcpconn    = '-U &lt;username&gt; -P &lt;password&gt;' -- SQL authentication\r\nSET @bcpquery = 'bcp &quot;' + replace(@query, char(10), '') + '&quot; QUERYOUT &quot;' + @saveloc + '&quot; -c -t^' + @bcpdelim + ' ' + @bcpconn + ' -S ' + @@servername\r\nEXEC master..xp_cmdshell @bcpquery  \r\n\r\nTRUNCATE TABLE Temp_Results\r\n\r\nALTER TABLE Temp_Results\r\nDROP COLUMN Adopter_Class\r\n\r\nSET @Lower = 50001\r\nSET @Upper = 99999\r\nSET @Records_Count = 0\r\nWHILE @Records_Count &lt; 5000\r\nBEGIN\r\n\tINSERT INTO Temp_Results\r\n\tSELECT  ROUND(((@Upper - @Lower -1) * RAND() + @Lower), 0),  \r\n\t(SELECT TOP (1) Gender\t\t\tFROM @Gender\t\t\tORDER BY NewID()),\r\n\t(SELECT TOP (1) Age\t\t\t\tFROM @Age\t\t\t\tORDER BY NewID()),\r\n\t(SELECT TOP (1) Marital_Status\tFROM @Marital_Status\tORDER BY NewID()),\r\n\t(SELECT TOP (1) CPlan\t\t\tFROM @Current_Plan\t\tORDER BY NewID()),\r\n\t(SELECT TOP (1) Payment_Method\tFROM @Payment_Method\tORDER BY NewID()),\r\n\t(SELECT TOP (1) Contract_Length FROM @Contract_Length\tORDER BY NewID()),\r\n\t(SELECT TOP (1) Flag\t\t\tFROM @Yes_No_Flag\t\tORDER BY NewID()),\r\n\t(SELECT TOP (1) Flag\t\t\tFROM @Yes_No_Flag\t\tORDER BY NewID())\r\n\tSET @Records_Count = @Records_Count + 1\r\nEND\r\n\r\nSET @ID =''\r\nSET @COUNT = 0\r\nDECLARE CUR_DELETE CURSOR FOR\r\n\tSELECT UserID,COUNT(&#x5B;UserID]) FROM &#x5B;Temp_Results] \r\n\tGROUP BY &#x5B;UserID] HAVING COUNT(&#x5B;UserID]) &gt; 1\r\n\tOPEN CUR_DELETE\r\n\t\tFETCH NEXT FROM CUR_DELETE INTO @ID, @COUNT\r\n\t\tWHILE @@FETCH_STATUS = 0\r\n\t\tBEGIN\r\n\t\tDELETE TOP(@COUNT -1) FROM &#x5B;Temp_Results] WHERE &#x5B;UserID] = @ID\r\n\t\tFETCH NEXT FROM CUR_DELETE INTO @ID, @COUNT\r\n\tEND\r\nCLOSE CUR_DELETE\r\nDEALLOCATE CUR_DELETE\r\n\r\nSET @query =\t'USE Temp_DB \r\n\t\t\t\tSELECT \r\n\t\t\t\t''UserID''\t\t\t\t\tas H1,\r\n\t\t\t\t''Gender''\t\t\t\t\tas H2,\r\n\t\t\t\t''Age''\t\t\t\t\t\tas H3,\r\n\t\t\t\t''Marital_Status''\t\t\tas H4,\r\n\t\t\t\t''Current_Plan''\t\t\tas H5,\r\n\t\t\t\t''Payment_Method''\t\t\tas H6,\r\n\t\t\t\t''Contract_Length''\t\t\tas H7,\r\n\t\t\t\t''Has_Kids''\t\t\t\tas H8,\r\n\t\t\t\t''Other_Services_Bundled''\tas H9\r\n\t\t\t\tUNION ALL\r\n\t\t\t\tSELECT \r\n\t\t\t\tCAST(UserID as Varchar(10)),\r\n\t\t\t\tGender,\r\n\t\t\t\tCAST(Age as Varchar (10)),\r\n\t\t\t\tMarital_Status,\r\n\t\t\t\tCurrent_Plan,\r\n\t\t\t\tPayment_Method,\r\n\t\t\t\tContract_Length,\r\n\t\t\t\tHas_Kids,\r\n\t\t\t\tOther_Services_Bundled\r\n\t\t\t\tFROM Temp_Results'\r\nSET @saveloc = 'c:\\SampleData\\Scoring_DataSet.csv'\r\nSET @bcpdelim   = ','\r\nSET @bcpconn    = '-T' -- Trusted\r\n--SET @bcpconn    = '-U &lt;username&gt; -P &lt;password&gt;' -- SQL authentication\r\nSET @bcpquery = 'bcp &quot;' + replace(@query, char(10), '') + '&quot; QUERYOUT &quot;' + @saveloc + '&quot; -c -t^' + @bcpdelim + ' ' + @bcpconn + ' -S ' + @@servername\r\nEXEC master..xp_cmdshell @bcpquery \r\n\r\nDROP TABLE Temp_Results\r\n<\/pre>\n<h3 align=\"center\">Decision Trees Model Deployment and Application<\/h3>\n<p style=\"text-align: justify;\">Once the two files have been created (or saved if you opted out of running the script and just downloaded them), we can open the one prefixed with \u2018training\u2019 and highlight all records. Having data mining plug-in installed, Classify button should be one of the options available form Data Mining tab on Excel\u2019s ribbon. Clicking it should start the wizard which will take us through the steps where we will adjust model&#8217;s properties and variables before deployment.<\/p>\n<p style=\"text-align: center;\"><a href=\"http:\/\/bicortex.com\/bicortex\/wp-content\/post_content\/\/2013\/09\/ExcelTrainingData_ExcelDM_DecisionTrees.png\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter  wp-image-1871\" alt=\"ExcelTrainingData_ExcelDM_DecisionTrees\" src=\"http:\/\/bicortex.com\/bicortex\/wp-content\/post_content\/\/2013\/09\/ExcelTrainingData_ExcelDM_DecisionTrees.png\" width=\"580\" height=\"275\" \/><\/a><\/p>\n<p style=\"text-align: justify;\">Step one provides a brief description of its features and functionality, let\u2019s click Next and to proceed to data range selection dialog. As per image above you will notice that I have selected over 4600 records generated by the SQL query. Alternatively, you can also select external data source. Let\u2019s click Next to advance to Classification window where we have an option to choose the columns participating in the model creation as well as the column to analyse. I have intentionally removed the UserID column which has no relevance or dependencies on other columns and selected Adopter_Class column as an analysis target. Adopter_Class column contains the data which is the focus of this exercise and marketing wants to know how this attribute gets affected by other columns&#8217; data.<\/p>\n<p style=\"text-align: center;\"><a href=\"http:\/\/bicortex.com\/bicortex\/wp-content\/post_content\/\/2013\/09\/ColumnsSelection_ExcelDM_DecisionTrees.png\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter  wp-image-1873\" alt=\"ColumnsSelection_ExcelDM_DecisionTrees\" src=\"http:\/\/bicortex.com\/bicortex\/wp-content\/post_content\/\/2013\/09\/ColumnsSelection_ExcelDM_DecisionTrees.png\" width=\"580\" height=\"290\" \/><\/a><\/p>\n<p style=\"text-align: justify;\">Clicking Next we will advance to Split Data Into Training And Testing step where we can specify the percentage of our dataset to get allocated to training and testing the model. We will leave this value at default and advance to the last step where we have an option to change some of the final properties such as Model Name or Model Description. Again, we will leave everything at default and finish the wizard deploying the model as per image below.<\/p>\n<p style=\"text-align: center;\"><a href=\"http:\/\/bicortex.com\/bicortex\/wp-content\/post_content\/\/2013\/09\/ModelCreation_ExcelDM_DecisionTrees.png\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter  wp-image-1876\" alt=\"ModelCreation_ExcelDM_DecisionTrees\" src=\"http:\/\/bicortex.com\/bicortex\/wp-content\/post_content\/\/2013\/09\/ModelCreation_ExcelDM_DecisionTrees.png\" width=\"580\" height=\"350\" \/><\/a><\/p>\n<p style=\"text-align: justify;\">After data reading and training has completed and our decision trees model has been deployed we can see the breakdown of individual attributes participating in how their values and affinities or dependencies relate to Adopter Class values. In the below image, the Background property has been set to Very Late, which is the group marketing campaign wants to potentially address when devising their next campaign. These people are the most resistant to adopting new technology and with the release of the new smart phone, marketing is hoping to gather insight into their profile.\u00a0 Rectangles shaded in darkest blue are representative of the target group i.e. very late adopters (click on image to enlarge).<\/p>\n<p style=\"text-align: left;\"><a href=\"http:\/\/bicortex.com\/bicortex\/wp-content\/post_content\/\/2013\/09\/VeryLate_ExcelDM_DecisionTrees.png\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter  wp-image-1877\" alt=\"VeryLate_ExcelDM_DecisionTrees\" src=\"http:\/\/bicortex.com\/bicortex\/wp-content\/post_content\/\/2013\/09\/VeryLate_ExcelDM_DecisionTrees.png\" width=\"580\" height=\"300\" \/><\/a><\/p>\n<p style=\"text-align: left;\"><span style=\"text-align: justify;\">Following this tree from left to right, we can deduce the following.<br \/>\n<\/span><\/p>\n<ul>\n<li style=\"text-align: justify;\"><span style=\"text-align: justify;\">Individuals who are 58 or older, with no other services bundled, on a Pre-paid plan fall into Vary Late adopters category.<\/span><\/li>\n<li style=\"text-align: justify;\">Individuals who are 58 or older, with no other services bundled, on a Low plan and not on a 12 months contract fall into Very Late adopters category.<\/li>\n<\/ul>\n<p style=\"text-align: justify;\">If we were to review the profile of customers opposite to very late adopters i.e. individuals how are keen to update sooner and more frequently, the model would look as per below (click on image to enlarge).<\/p>\n<p style=\"text-align: center;\"><a href=\"http:\/\/bicortex.com\/bicortex\/wp-content\/post_content\/\/2013\/09\/VeryEarly_ExcelDM_DecisionTrees.png\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter  wp-image-1878\" alt=\"VeryEarly_ExcelDM_DecisionTrees\" src=\"http:\/\/bicortex.com\/bicortex\/wp-content\/post_content\/\/2013\/09\/VeryEarly_ExcelDM_DecisionTrees.png\" width=\"580\" height=\"300\" \/><\/a><\/p>\n<p style=\"text-align: justify;\">This tree diagram indicates that it\u2019s the customers who are likely to be less than 38 years old, with other services bundled and on either 24 or 12 months contracts who are the most likely ones to crave new technology.<\/p>\n<p style=\"text-align: justify;\">That is all well and god but how can apply this knowledge to our other customers? As you remember, our second dataset contains another group of individuals who have not been classified as yet. The ultimate purpose of this type of exercise is to apply such model to another group of clients and predict with a certain level of confidence what category they may fall into, based on their attributes\u2019 values and what our model has \u2018learnt\u2019. Let\u2019s apply this knowledge to our scoring dataset and open up the second file the script has generated \u2013 Scoring_DataSet.csv file. After highlighting all records and columns let\u2019s execute the data mining DMX query clicking on the Query button which starts another wizard. In step number 2 of the wizard (first one is just a process description) we can select the model we have just deployed and proceeding through steps 3 and 4 we have an option to select the data source and specify columns relationships. Let\u2019s leave those with default options selected i.e. data source will be the data range selected previously and columns relationships will be mapped out automatically for us and advance to adding an output column step. Here, we will proceed as per the image below.<\/p>\n<p style=\"text-align: center;\"><a href=\"http:\/\/bicortex.com\/bicortex\/wp-content\/post_content\/\/2013\/09\/OutputDefinition_ExcelDM_DecisionTrees.png\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter  wp-image-1879\" alt=\"OutputDefinition_ExcelDM_DecisionTrees\" src=\"http:\/\/bicortex.com\/bicortex\/wp-content\/post_content\/\/2013\/09\/OutputDefinition_ExcelDM_DecisionTrees.png\" width=\"580\" height=\"310\" \/><\/a><\/p>\n<p style=\"text-align: justify;\">When completed, click OK and Next to advance to the next step where we have an option to choose a destination for the query result. Let\u2019s go ahead and select Append To The Input Data from the options provided and click Finish to let Excel and DMX query which has built in the background run our scoring data through the model to determine the Adopter Class output for each observation. When finished, we should be presented with an additional column added to our workbook which indicates which category decision tree model we applied to this dataset \u2018thinks\u2019 the customer should belong to.<\/p>\n<p style=\"text-align: center;\"><a href=\"http:\/\/bicortex.com\/bicortex\/wp-content\/post_content\/\/2013\/09\/PredictedValues_ExcelDM_DecisionTrees.png\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter  wp-image-1881\" alt=\"PredictedValues_ExcelDM_DecisionTrees\" src=\"http:\/\/bicortex.com\/bicortex\/wp-content\/post_content\/\/2013\/09\/PredictedValues_ExcelDM_DecisionTrees.png\" width=\"580\" height=\"320\" \/><\/a><\/p>\n<p style=\"text-align: justify;\">If we filter this dataset on the records with adopter class equal to Very Late, we would effectively shrink our data to a quarter of the original size allowing the business to make conscious and data-backed decision on who to direct the marketing campaign to. Not only does this approach prevent a \u2018stab-in \u2013the-dark\u2019 approach but it can also allow the business to save a lot of money e.g. if campaign was relying on printing special offers catalogues etc. Based on this final list marketing can correlate customer ids with other customer descriptive attributes such as addresses, names, email address or location coordinates to drive the campaign dollars further.<\/p>\n<p style=\"text-align: justify;\">Just as a reference, I also compared the output from the model with another application capable of classifying based on decision tree algorithm \u2013 SAP Predictive Analysis. After installing R to integrate R_CNR Tree algorithm I re-created the model and this is what it appeared as (click on image to enlarge).<\/p>\n<p style=\"text-align: center;\"><a href=\"http:\/\/bicortex.com\/bicortex\/wp-content\/post_content\/\/2013\/09\/SAPDTOutput_ExcelDM_DecisionTrees.png\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter  wp-image-1882\" alt=\"SAPDTOutput_ExcelDM_DecisionTrees\" src=\"http:\/\/bicortex.com\/bicortex\/wp-content\/post_content\/\/2013\/09\/SAPDTOutput_ExcelDM_DecisionTrees.png\" width=\"580\" height=\"290\" \/><\/a><\/p>\n<p style=\"text-align: justify;\">Arguably, I think that Excel output is much easier to read, analyse and interpret, especially that it provides a filtering option to highlight the groups of interests rather simply relying on color-coding. Also, as a secondary exercise I run the scoring dataset through the newly created model in SAP Predictive Analytics, outputting the newly tagged data into a separate file for the purpose of comparison (you can download all the files from <strong><a href=\"https:\/\/skydrive.live.com\/redir?resid=715AEF07A82832E1!55887&amp;authkey=!AJ80I9RifRP0haw\" target=\"_blank\">HERE<\/a><\/strong>).<\/p>\n<p style=\"text-align: justify;\"><a href=\"http:\/\/bicortex.com\/bicortex\/wp-content\/post_content\/\/2013\/09\/SAPModel_ExcelDM_DecisionTrees.png\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-full wp-image-1883\" alt=\"SAPModel_ExcelDM_DecisionTrees\" src=\"http:\/\/bicortex.com\/bicortex\/wp-content\/post_content\/\/2013\/09\/SAPModel_ExcelDM_DecisionTrees.png\" width=\"580\" height=\"93\" srcset=\"http:\/\/bicortex.com\/bicortex\/wp-content\/post_content\/2013\/09\/SAPModel_ExcelDM_DecisionTrees.png 580w, http:\/\/bicortex.com\/bicortex\/wp-content\/post_content\/2013\/09\/SAPModel_ExcelDM_DecisionTrees-300x48.png 300w\" sizes=\"auto, (max-width: 580px) 100vw, 580px\" \/><\/a><\/p>\n<p style=\"text-align: justify;\">Quick, analysis revealed that out of close to 5000 records, overlapping results constituted around 55 percent or nearly 2600 observations. That means that for nearly half of all the records analysed the results varied depending on which application was used i.e. Excel with SSAS or SAP&#8217;s Predictive Analysis. Also, for comparison, I run the predicted values through Classification Matrix and Accuracy Chart tools to evaluate the models performance for false positive, true positive, false negative, and true negative. Given that I only run it in Excel, predicted data produced by Excel was 100 percent spot on whereas SAP&#8217;s Predictive Analysis output was a complete hit and miss (you can look at the results in the &#8216;Quick Analysis SAP vs Excel&#8217; spreadsheet <strong><a href=\"https:\/\/skydrive.live.com\/redir?resid=715AEF07A82832E1!55887&amp;authkey=!AJ80I9RifRP0haw\" target=\"_blank\">HERE<\/a><\/strong>). These are pretty dramatic variances and without further investigation I would be hesitant to provide a definite answer as to why these discrepancies are so large but chances are these tools and their corresponding algorithms differ in how they handle the data even though they both bear the same name &#8211; Decision Trees. This only reinforces the premise that data mining tools and the results they produce, as easy and fool-proof as they may seem on the face value, need to be treated with caution and consideration. Sometimes, depending on variables such as the data we use, applied algorithm and associated parameters, mining application etc. the results can vary significantly hence requiring more extensive analysis and interpretation.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>A few months ago I started playing around with a bunch of statistical applications and built a very simple Twitter sentiment analysis engine for feeds classification using RapidMiner \u2013 you can see the full post HERE. Since then I have been quite busy and haven\u2019t had a chance to play with data mining applications any [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[36,55,32],"tags":[57,34,39,54],"class_list":["post-1868","post","type-post","status-publish","format-standard","hentry","category-data-mining","category-excel","category-how-tos","tag-data-mining","tag-excel","tag-sap","tag-ssas"],"aioseo_notices":[],"jetpack_featured_media_url":"","_links":{"self":[{"href":"http:\/\/bicortex.com\/bicortex\/wp-json\/wp\/v2\/posts\/1868","targetHints":{"allow":["GET"]}}],"collection":[{"href":"http:\/\/bicortex.com\/bicortex\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"http:\/\/bicortex.com\/bicortex\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"http:\/\/bicortex.com\/bicortex\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"http:\/\/bicortex.com\/bicortex\/wp-json\/wp\/v2\/comments?post=1868"}],"version-history":[{"count":23,"href":"http:\/\/bicortex.com\/bicortex\/wp-json\/wp\/v2\/posts\/1868\/revisions"}],"predecessor-version":[{"id":1900,"href":"http:\/\/bicortex.com\/bicortex\/wp-json\/wp\/v2\/posts\/1868\/revisions\/1900"}],"wp:attachment":[{"href":"http:\/\/bicortex.com\/bicortex\/wp-json\/wp\/v2\/media?parent=1868"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"http:\/\/bicortex.com\/bicortex\/wp-json\/wp\/v2\/categories?post=1868"},{"taxonomy":"post_tag","embeddable":true,"href":"http:\/\/bicortex.com\/bicortex\/wp-json\/wp\/v2\/tags?post=1868"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}