{"id":1956,"date":"2013-11-02T11:31:49","date_gmt":"2013-11-02T11:31:49","guid":{"rendered":"http:\/\/bicortex.com\/?p=1956"},"modified":"2018-07-23T08:21:40","modified_gmt":"2018-07-23T08:21:40","slug":"amazon-redshift-review-data-warehouse-in-the-cloud-is-here-part-3","status":"publish","type":"post","link":"http:\/\/bicortex.com\/bicortex\/amazon-redshift-review-data-warehouse-in-the-cloud-is-here-part-3\/","title":{"rendered":"Amazon Redshift Review &#8211; Data Warehouse in the Cloud Is Here, Part 3"},"content":{"rendered":"<p style=\"text-align: justify;\">In the <a href=\"http:\/\/bicortex.com\/amazon-redshift-review-data-warehouse-in-the-cloud-is-here-part-1\/\" target=\"_blank\"><b>FIRST<\/b><\/a> installment to this series I briefly outlined the environment setup as well as Redshift data loading process. This was followed by my <a href=\"http:\/\/bicortex.com\/amazon-redshift-review-data-warehouse-in-the-cloud-is-here-part-2\/\" target=\"_blank\"><b>SECOND<\/b><\/a> post where I analyzed performance findings when comparing query execution times on 4 different environments. In this post I would like to explore Redshift database connectivity options for data visualization applications &#8211; Tableau and Microsoft Excel &#8211; as well as provide a brief summary of my experience with Redshift up until this point.<\/p>\n<h3 style=\"text-align: center;\">Microsoft Excel<\/h3>\n<p style=\"text-align: justify;\">Let\u2019s start with the venerable Excel. There is probably no tool on the market which can beat Excel for its popularity, functionality and number of users who have had some experience with it. It is the love child of many corporate departments and a staple, go-to application when impromptu analysis and visualisation is required. Furthermore, with the latest release of Excel 2013 it has an added benefit of creating fancy dashboards or geospatial mappings using plugins such as PowerMap and PowerView with relative ease.<\/p>\n<p style=\"text-align: justify;\">Connecting to Redshift via Excel is quite straightforward. As I do not want to repeat the step-by-step details on how to link your spreadsheet to Redshift data source \u2013 someone else has already done it for me &#8211; I will just post a link to the webpage where this information can be found. Under this <a href=\"https:\/\/github.com\/snowplow\/snowplow\/wiki\/Setting-up-Excel-to-analyze-Snowplow-data\" target=\"_blank\"><b>LINK<\/b><\/a> you will find comprehensive tutorial on how to achieve this. Alternatively, in case the link is no longer valid, head over to my SkyDrive folder <a href=\"https:\/\/skydrive.live.com\/redir?resid=715AEF07A82832E1!56633&amp;authkey=!AJUkTUZp-7rajqQ\" target=\"_blank\"><b>HERE<\/b><\/a> where among other files I saved a screen dump as an image file.<\/p>\n<p style=\"text-align: justify;\">In order to visualize data stored in Redshift, I used the following two SQL queries returning aggregated results running on a multi-node deployment (2 nodes) on High Storage Extra Large (XL) DW instance.<\/p>\n<pre class=\"brush: sql; title: ; notranslate\" title=\"\">\r\nSELECT\r\nSUM(s.qtysold) as Quantity_Sold\r\n,SUM(s.commission) as Commision\r\n,caldate as Event_Date\r\n,e.eventname as Event_Name\r\n,v.venuecity as Venue_City\r\n,v.venuestate as Venue_State\r\nFROM Sales s\r\n  join Event e on s.eventid = s.eventid\r\n  join Venue v on e.venueid = v.venueid\r\n  join Date d on d.dateid = s.dateid\r\n  WHERE e.eventname IN ('Eddie Vedder', 'Stone Temple Pilots', 'Rush')\r\n  GROUP BY e.eventname, caldate,\r\n  v.venuecity, v.venuestate\r\n<\/pre>\n<pre class=\"brush: sql; title: ; notranslate\" title=\"\">\r\nSELECT\r\nSUM(s.pricepaid) as Price_Paid,\r\nd.day as Sales_Date,\r\ne.eventname as Event_Name,\r\nu1.city as City,\r\nu1.state as State\r\nFROM\r\nsales s join users u1 on u1.userid = s.sellerid\r\njoin date d on d.dateid = s.dateid\r\njoin event e on s.eventid = e.eventid\r\nWHERE d.holiday &lt;&gt; 1 and e.eventname &lt;&gt; 'Shrek the Musical' and u1.state &lt;&gt; 'IL' and EXISTS\r\n\t\t(SELECT 1\r\n\t\tFROM users u2 WHERE u2.userid = u1.userid\r\n\t\tand COALESCE(u2.likesports, u2.liketheatre, u2.likeconcerts, u2.likejazz, u2.likeclassical,\r\n\t\tu2.likeopera, u2.likerock, u2.likevegas, u2.likebroadway, u2.likemusicals) = 1)\r\nGROUP BY\r\nd.day,d.caldate,e.eventname, u1.state, u1.city\r\nORDER BY d.caldate ASC\r\n<\/pre>\n<p style=\"text-align: justify;\">On connection imported data from two queries above into my worksheets to observe cluster\u2019s performance during execution. I wasn\u2019t looking for anything specific and no particular variables were taken into consideration. The aim was simply to determine how flexible and issue-free using Excel as visual output can be as the Redshift front end. During data import, Excel becomes pretty much unresponsive, which depending on the query execution time can be frustrating. The only indication that data is being sourced or waited on was this little message in the bottom pane of the spreadsheet.<\/p>\n<p><a href=\"http:\/\/bicortex.com\/bicortex\/wp-content\/post_content\/\/2013\/11\/Excel_Data_Load_Status_ART.png\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-full wp-image-1958\" src=\"http:\/\/bicortex.com\/bicortex\/wp-content\/post_content\/\/2013\/11\/Excel_Data_Load_Status_ART.png\" alt=\"Excel_Data_Load_Status_ART\" width=\"580\" height=\"52\" srcset=\"http:\/\/bicortex.com\/bicortex\/wp-content\/post_content\/2013\/11\/Excel_Data_Load_Status_ART.png 580w, http:\/\/bicortex.com\/bicortex\/wp-content\/post_content\/2013\/11\/Excel_Data_Load_Status_ART-300x26.png 300w\" sizes=\"auto, (max-width: 580px) 100vw, 580px\" \/><\/a><\/p>\n<p style=\"text-align: justify;\">Also, in terms of network activity, I recorded fairly minimal traffic, with most of the time spent on waiting for the data to process before the final output.<\/p>\n<p style=\"text-align: center;\"><a href=\"http:\/\/bicortex.com\/bicortex\/wp-content\/post_content\/\/2013\/11\/Excel_Data_Load_Network_Stats_ART.png\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter wp-image-1968\" src=\"http:\/\/bicortex.com\/bicortex\/wp-content\/post_content\/\/2013\/11\/Excel_Data_Load_Network_Stats_ART.png\" alt=\"Excel_Data_Load_Network_Stats_ART\" width=\"580\" height=\"100\" \/><\/a><\/p>\n<p style=\"text-align: justify;\">In the case of the first query , SQL code execution time, regardless of the application connecting to Redshift (Excel, SQL Manager, Tableau etc.), was quite long with just over 2 minutes to crunch my sales and commission data from the extended TICKIT database. Also, as you can see from the images below, CPU spiked to 100 percent for a short period of time during query 1 processing.<\/p>\n<p style=\"text-align: center;\"><a href=\"http:\/\/bicortex.com\/bicortex\/wp-content\/post_content\/\/2013\/11\/Excel_Data_Load_Q1_ART.png\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter wp-image-1960\" src=\"http:\/\/bicortex.com\/bicortex\/wp-content\/post_content\/\/2013\/11\/Excel_Data_Load_Q1_ART.png\" alt=\"Excel_Data_Load_Q1_ART\" width=\"580\" height=\"260\" srcset=\"http:\/\/bicortex.com\/bicortex\/wp-content\/post_content\/2013\/11\/Excel_Data_Load_Q1_ART.png 925w, http:\/\/bicortex.com\/bicortex\/wp-content\/post_content\/2013\/11\/Excel_Data_Load_Q1_ART-300x133.png 300w\" sizes=\"auto, (max-width: 580px) 100vw, 580px\" \/><\/a><a href=\"http:\/\/bicortex.com\/bicortex\/wp-content\/post_content\/\/2013\/11\/Excel_Data_Load_Q1_Perf_ART.png\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter wp-image-1961\" src=\"http:\/\/bicortex.com\/bicortex\/wp-content\/post_content\/\/2013\/11\/Excel_Data_Load_Q1_Perf_ART.png\" alt=\"Excel_Data_Load_Q1_Perf_ART\" width=\"580\" height=\"185\" \/><\/a><\/p>\n<p style=\"text-align: justify;\">Things were much snappier with query 2 which took hardly any time to run and populate my worksheet. Also, CPU utilisation was kept below 10 percent for both nodes involved.<\/p>\n<p style=\"text-align: center;\"><a href=\"http:\/\/bicortex.com\/bicortex\/wp-content\/post_content\/\/2013\/11\/Excel_Data_Load_Q2_ART.png\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter wp-image-1962\" src=\"http:\/\/bicortex.com\/bicortex\/wp-content\/post_content\/\/2013\/11\/Excel_Data_Load_Q2_ART.png\" alt=\"Excel_Data_Load_Q2_ART\" width=\"580\" height=\"260\" \/><\/a><a href=\"http:\/\/bicortex.com\/bicortex\/wp-content\/post_content\/\/2013\/11\/Excel_Data_Load_Q2_Perf_ART.png\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter wp-image-1963\" src=\"http:\/\/bicortex.com\/bicortex\/wp-content\/post_content\/\/2013\/11\/Excel_Data_Load_Q2_Perf_ART.png\" alt=\"Excel_Data_Load_Q2_Perf_ART\" width=\"580\" height=\"180\" \/><\/a><\/p>\n<p style=\"text-align: justify;\">Overall ease of use and flexibility Excel comes with is very good and it literally took me 5 minutes from the moment I run the first query to putting a small Sales dashboard report together (click on image to enlarge).<\/p>\n<p style=\"text-align: center;\"><a href=\"http:\/\/bicortex.com\/bicortex\/wp-content\/post_content\/\/2013\/11\/Excel_2008_sales_report_ART.png\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter wp-image-1964\" src=\"http:\/\/bicortex.com\/bicortex\/wp-content\/post_content\/\/2013\/11\/Excel_2008_sales_report_ART.png\" alt=\"Excel_2008_sales_report_ART\" width=\"580\" height=\"335\" \/><\/a><\/p>\n<h3 style=\"text-align: center;\">Tableau<\/h3>\n<p style=\"text-align: justify;\">Tableau is quickly becoming the go-to application for rapid data analysis and I can see more and more businesses using it as a primary tool when it comes to visualisation and data exploration. It is also known for its vast connectivity options and Amazon Redshift is also included by default. Connecting to Redshift cluster is quite straightforward \u2013 given that Redshift integration is there out-of-the-box, all that\u2019s required is filling in the blanks with authentication credentials and choosing between importing all database objects or running a selective query as per below.<\/p>\n<p style=\"text-align: center;\"><a href=\"http:\/\/bicortex.com\/bicortex\/wp-content\/post_content\/\/2013\/11\/Tableau_Conn_Details_ART.png\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter wp-image-1970\" src=\"http:\/\/bicortex.com\/bicortex\/wp-content\/post_content\/\/2013\/11\/Tableau_Conn_Details_ART.png\" alt=\"Tableau_Conn_Details_ART\" width=\"580\" height=\"535\" \/><\/a><\/p>\n<p style=\"text-align: justify;\">Next, Tableau imports database and objects metadata and provides 3 options for how the data interface between Redshift and the application should be managed \u2013 connect live, import all data or import some data.<\/p>\n<p style=\"text-align: justify;\"><a href=\"http:\/\/bicortex.com\/bicortex\/wp-content\/post_content\/\/2013\/11\/Tableau_Data_Conn_Choices_ART.png\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-full wp-image-1971\" src=\"http:\/\/bicortex.com\/bicortex\/wp-content\/post_content\/\/2013\/11\/Tableau_Data_Conn_Choices_ART.png\" alt=\"Tableau_Data_Conn_Choices_ART\" width=\"580\" height=\"328\" srcset=\"http:\/\/bicortex.com\/bicortex\/wp-content\/post_content\/2013\/11\/Tableau_Data_Conn_Choices_ART.png 580w, http:\/\/bicortex.com\/bicortex\/wp-content\/post_content\/2013\/11\/Tableau_Data_Conn_Choices_ART-300x169.png 300w\" sizes=\"auto, (max-width: 580px) 100vw, 580px\" \/><\/a><\/p>\n<p style=\"text-align: justify;\">Unfortunately, working with the &#8216;active connection&#8217; turned out to be quite problematic and only highlights why remotely hosted data warehouse can potentially be a big issue, at least when using Tableau in \u2018always connected\u2019 mode. Network latency is not a problem here as only a small subset of data is being returned to the client. Also, I could live with the fact that query execution time can, in some cases, be longer than anticipated thus adding to the increasing frustration. However, I would not be able to put up with a setup which provides the flexibility of real-time data interrogation for the price of great inconvenience and deficiency in how the current architecture is resolved. What I am referring to here is the way Tableau works with data, aggregating, sorting, and manipulating it when conducting the analysis. It appeared to me that, for query 1, for example, which as I have already mentioned took around 2 minutes to process, not only did I have to wait 2 minutes just for the metadata to be returned to the client but each time the application was required to reference the source dataset, additional 2 minutes were wasted for further execution due to lack of adequate optimisation. To explain further, let\u2019s look at the image below outlining creating the initial data analysis sheet.<\/p>\n<p style=\"text-align: center;\"><a href=\"http:\/\/bicortex.com\/bicortex\/wp-content\/post_content\/\/2013\/11\/Tableau_Commission_Query_Exec.png\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter wp-image-1972\" src=\"http:\/\/bicortex.com\/bicortex\/wp-content\/post_content\/\/2013\/11\/Tableau_Commission_Query_Exec.png\" alt=\"Tableau_Commission_Query_Exec\" width=\"580\" height=\"345\" \/><\/a><\/p>\n<p style=\"text-align: justify;\">Nearly the same query was executed when attempting to drag \u2018commission\u2019 measure onto the summary field as we saw in the case of metadata fetching, which means that every time a new aggregation is created in Tableau, additional 2 minutes were added (read: wasted) for the query to hit Redshift\u2019s processing engine, crunch the data and return it over the network. This means that utilizing live connection to the data source was virtually prohibitive and could severely impact the \u2018what was intended as the quick analysis project\u2019 to the point of time-unaffordable lag. Looking at the query which was passed to Redshift from Tableau when trying to summarize \u2018commission\u2019 figures I could tell that there\u2019s literally zero optimization done (server catching, local memory staging etc.) to prevent from this inefficient, repetitive \u2018SELECT OUT OF SELECT\u2019 execution (see image below).<\/p>\n<p style=\"text-align: center;\"><a href=\"http:\/\/bicortex.com\/bicortex\/wp-content\/post_content\/\/2013\/11\/Tableau_Commission_SQL.png\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter wp-image-1973\" src=\"http:\/\/bicortex.com\/bicortex\/wp-content\/post_content\/\/2013\/11\/Tableau_Commission_SQL.png\" alt=\"Tableau_Commission_SQL\" width=\"580\" height=\"240\" \/><\/a><\/p>\n<p style=\"text-align: justify;\">I can hear you saying that I could always mitigate all those problems by importing all or some data into my local environment. I concur. Given that the datasets are quite small I tried to import all the records and everything worked really well, however, given that Redshift\u2019s primary market is high volume data deployments, I would not be able to stage millions or billions of records on my machine, I wouldn\u2019t even want to. This conundrum proves that even though cloud data warehousing is making strides towards wider audience and higher adoption, there are still some big issues that can render it useless for some specyfic applications.<\/p>\n<p style=\"text-align: justify;\">Working in off-line mode, with all data imported and staged on my local machine the experience was much more rewarding and Tableau was a pleasure to work with.<\/p>\n<p style=\"text-align: justify;\">Below are the queries&#8217; CPU performance metrics as well as SQL statements passed for execution. Notice how Tableau formats SQL code with prefixing it with &#8216;SELECT TOP 1 * FROM&#8230;&#8217; statement.<\/p>\n<h3 style=\"text-align: center;\"><a href=\"http:\/\/bicortex.com\/bicortex\/wp-content\/post_content\/\/2013\/11\/Tableau_Data_Load_Q1_ART.png\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter wp-image-1974\" src=\"http:\/\/bicortex.com\/bicortex\/wp-content\/post_content\/\/2013\/11\/Tableau_Data_Load_Q1_ART.png\" alt=\"Tableau_Data_Load_Q1_ART\" width=\"580\" height=\"265\" \/><\/a><a href=\"http:\/\/bicortex.com\/bicortex\/wp-content\/post_content\/\/2013\/11\/Tableau_Data_Load_Q1_Perf_ART.png\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter wp-image-1975\" src=\"http:\/\/bicortex.com\/bicortex\/wp-content\/post_content\/\/2013\/11\/Tableau_Data_Load_Q1_Perf_ART.png\" alt=\"Tableau_Data_Load_Q1_Perf_ART\" width=\"580\" height=\"180\" \/><\/a><a href=\"http:\/\/bicortex.com\/bicortex\/wp-content\/post_content\/\/2013\/11\/Tableau_Data_Load_Q2_ART.png\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter wp-image-1976\" src=\"http:\/\/bicortex.com\/bicortex\/wp-content\/post_content\/\/2013\/11\/Tableau_Data_Load_Q2_ART.png\" alt=\"Tableau_Data_Load_Q2_ART\" width=\"580\" height=\"250\" \/><\/a><a href=\"http:\/\/bicortex.com\/bicortex\/wp-content\/post_content\/\/2013\/11\/Tableau_Data_Load_Q2_Perf_ART.png\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter wp-image-1977\" src=\"http:\/\/bicortex.com\/bicortex\/wp-content\/post_content\/\/2013\/11\/Tableau_Data_Load_Q2_Perf_ART.png\" alt=\"Tableau_Data_Load_Q2_Perf_ART\" width=\"580\" height=\"185\" \/><\/a><\/h3>\n<h3 style=\"text-align: center;\">Final Conclusion<\/h3>\n<p style=\"text-align: justify;\">So what is the overall verdict on Amazon Redshift? In my books, even though the product is still rough around the edges and a bit more polish and finesse needs to be applied here and there, Redshift is a real deal. It performs very well, is easy to provision and manage, it has good connectivity options and the price is competitive. I would never though that I could deploy such a powerful architecture on an impromptu basis, perform my analysis and scrap it \u2013 all within less than one hour with the price paid cheaper than a cup of coffee. On one hand, this is a welcomed development as it has never been so easy and inexpensive to process huge volumes of data with minimal input. No longer does one need to pay for exorbitant licensing costs, employ legions of DBAs and system administrators or have a very good handle on the esoteric subjects or database programming and administration. Such deployment takes away most of the cost, human input and associated complexities thus empowering end users and making such technologies accessible to a wider audience. At the same time, it\u2019s not the elusive real life unicorn of the data warehousing world and it won\u2019t solve all your BI problems. With all its benefits and advantages revolving primarily around speed and cost, it does not support many common database features, your data is stored in the cloud (encrypted or not) which can pose security issues for some, more tightly regulated workshops and finally, its management flexibility is still on the \u201cI wish I could do such and such\u2026\u201d side. Throw in data latency issues with many businesses still operating on barely adequate internet speed and somewhat inferior analytical applications support when working with \u2018live\u2019 data and the whole concepts can lose its shine pretty quickly. Moreover, I can anticipate that some IT professionals will be reluctant to recommend such solution to their managers in fear of falling into obsolescence. Quoting my unenthusiastic friend, \u2018the fond years of database administrators being the center of all business activities are slowly getting reduced to documentation management, mostly taking away the mental challenge and hard-core problem solving opportunities\u2019. I would imagine that many professionals who can slowly feel the \u2018cloud pinch\u2019 will boycott (at least initially) the business transition to such technology, foreseeing their demise if no skills adjustments are made on their part. I may be overdramatising here but regardless of what your attitude is, cloud data warehousing is here to stay and Amazon Redshift is a perfect example of how fast, cheap, scalable and manageable such technology could be. From my point of view, given you have a problem which definition can fit into its capabilities, Redshift is a force to be reckoned for other vendors and a powerful and cost effective solution for those who manage and use it.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>In the FIRST installment to this series I briefly outlined the environment setup as well as Redshift data loading process. This was followed by my SECOND post where I analyzed performance findings when comparing query execution times on 4 different environments. In this post I would like to explore Redshift database connectivity options for data [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[61],"tags":[40,14,62,34,73,19,29],"class_list":["post-1956","post","type-post","status-publish","format-standard","hentry","category-cloud-computing","tag-amazon-redshift","tag-big-data","tag-cloud-computing","tag-excel","tag-mpp-rdbms","tag-sql-server","tag-tableau"],"aioseo_notices":[],"jetpack_featured_media_url":"","_links":{"self":[{"href":"http:\/\/bicortex.com\/bicortex\/wp-json\/wp\/v2\/posts\/1956","targetHints":{"allow":["GET"]}}],"collection":[{"href":"http:\/\/bicortex.com\/bicortex\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"http:\/\/bicortex.com\/bicortex\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"http:\/\/bicortex.com\/bicortex\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"http:\/\/bicortex.com\/bicortex\/wp-json\/wp\/v2\/comments?post=1956"}],"version-history":[{"count":19,"href":"http:\/\/bicortex.com\/bicortex\/wp-json\/wp\/v2\/posts\/1956\/revisions"}],"predecessor-version":[{"id":2646,"href":"http:\/\/bicortex.com\/bicortex\/wp-json\/wp\/v2\/posts\/1956\/revisions\/2646"}],"wp:attachment":[{"href":"http:\/\/bicortex.com\/bicortex\/wp-json\/wp\/v2\/media?parent=1956"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"http:\/\/bicortex.com\/bicortex\/wp-json\/wp\/v2\/categories?post=1956"},{"taxonomy":"post_tag","embeddable":true,"href":"http:\/\/bicortex.com\/bicortex\/wp-json\/wp\/v2\/tags?post=1956"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}