Amazon Redshift Review – Data Warehouse in the Cloud Is Here, Part 3

November 2nd, 2013 / by admin

In the FIRST installment of this series I briefly outlined the environment setup as well as the Redshift data loading process. This was followed by my SECOND post, where I analyzed performance findings comparing query execution times across 4 different environments. In this post I would like to explore Redshift database connectivity options for data visualization applications – Tableau and Microsoft Excel – as well as provide a brief summary of my experience with Redshift up until this point.

Microsoft Excel

Let’s start with the venerable Excel. There is probably no tool on the market which can beat Excel for popularity, functionality and the sheer number of users who have had some experience with it. It is the darling of many corporate departments and a staple, go-to application when impromptu analysis and visualisation is required. Furthermore, the latest release of Excel 2013 has the added benefit of creating fancy dashboards or geospatial mappings with relative ease, using plugins such as Power Map and Power View.

Connecting to Redshift via Excel is quite straightforward. As I do not want to repeat the step-by-step details on how to link your spreadsheet to a Redshift data source – someone else has already done it for me – I will just post a link to the webpage where this information can be found. Under this LINK you will find a comprehensive tutorial on how to achieve this. Alternatively, in case the link is no longer valid, head over to my SkyDrive folder HERE where, among other files, I have saved a screen dump as an image file.
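
For reference, Redshift speaks the PostgreSQL wire protocol, so Excel talks to it through a standard PostgreSQL ODBC driver. Below is a minimal sketch of the sort of DSN-less connection string involved – the cluster endpoint, database name and credentials are placeholders for illustration, not values from my setup:

Driver={PostgreSQL Unicode};
Server=examplecluster.abc123xyz789.us-east-1.redshift.amazonaws.com;
Port=5439;
Database=dev;
UID=masteruser;
PWD=your_password_here;
SSLmode=require;

With the ODBC source in place, the data comes in via Data > From Other Sources > From Microsoft Query, like any other relational source.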

In order to visualize data stored in Redshift, I used the following two SQL queries returning aggregated results, running on a multi-node deployment (2 nodes) of High Storage Extra Large (XL) DW instances.

--QUERY 1: quantity sold and commission per event, date and venue
SELECT
	SUM(s.qtysold) AS Quantity_Sold,
	SUM(s.commission) AS Commission,
	d.caldate AS Event_Date,
	e.eventname AS Event_Name,
	v.venuecity AS Venue_City,
	v.venuestate AS Venue_State
FROM Sales s
	JOIN Event e ON s.eventid = e.eventid
	JOIN Venue v ON e.venueid = v.venueid
	JOIN Date d ON d.dateid = s.dateid
WHERE e.eventname IN ('Eddie Vedder', 'Stone Temple Pilots', 'Rush')
GROUP BY e.eventname, d.caldate, v.venuecity, v.venuestate;

--QUERY 2: non-holiday sales per day, event and seller location,
--restricted to sellers with at least one declared interest
SELECT
	SUM(s.pricepaid) AS Price_Paid,
	d.day AS Sales_Date,
	e.eventname AS Event_Name,
	u1.city AS City,
	u1.state AS State
FROM sales s
	JOIN users u1 ON u1.userid = s.sellerid
	JOIN date d ON d.dateid = s.dateid
	JOIN event e ON s.eventid = e.eventid
WHERE d.holiday <> 1
	AND e.eventname <> 'Shrek the Musical'
	AND u1.state <> 'IL'
	AND EXISTS
		(SELECT 1
		FROM users u2
		WHERE u2.userid = u1.userid
		AND COALESCE(u2.likesports, u2.liketheatre, u2.likeconcerts, u2.likejazz, u2.likeclassical,
			u2.likeopera, u2.likerock, u2.likevegas, u2.likebroadway, u2.likemusicals) = 1)
GROUP BY d.day, d.caldate, e.eventname, u1.state, u1.city
ORDER BY d.caldate ASC;

Upon connecting, I imported the data from the two queries above into my worksheets to observe the cluster’s performance during execution. I wasn’t looking for anything specific and no particular variables were taken into consideration; the aim was simply to determine how flexible and issue-free Excel can be as a Redshift front end. During data import, Excel becomes pretty much unresponsive, which, depending on the query execution time, can be frustrating. The only indication that data was being sourced or waited on was this little message in the bottom pane of the spreadsheet.

Excel_Data_Load_Status_ART

Also, in terms of network activity, I recorded fairly minimal traffic, with most of the time spent waiting for the server to process the data before the final output arrived.

Excel_Data_Load_Network_Stats_ART

In the case of the first query, SQL code execution time, regardless of the application connecting to Redshift (Excel, SQL Manager, Tableau etc.), was quite long, with just over 2 minutes needed to crunch my sales and commission data from the extended TICKIT database. Also, as you can see from the images below, CPU spiked to 100 percent for a short period of time during query 1 processing.

Excel_Data_Load_Q1_ARTExcel_Data_Load_Q1_Perf_ART

Things were much snappier with query 2, which took hardly any time to run and populate my worksheet. Also, CPU utilisation was kept below 10 percent on both nodes involved.

Excel_Data_Load_Q2_ARTExcel_Data_Load_Q2_Perf_ART

The overall ease of use and flexibility Excel comes with is very good, and it literally took me 5 minutes from the moment I ran the first query to putting a small Sales dashboard report together (click on image to enlarge).

Excel_2008_sales_report_ART

Tableau

Tableau is quickly becoming the go-to application for rapid data analysis and I can see more and more businesses using it as a primary tool for visualisation and data exploration. It is also known for its vast connectivity options, with Amazon Redshift included by default. Connecting to a Redshift cluster is quite straightforward – given that Redshift integration is there out of the box, all that’s required is filling in the blanks with authentication credentials and choosing between importing all database objects or running a selective query, as per below.

Tableau_Conn_Details_ART

Next, Tableau imports the database and object metadata and provides 3 options for how the data interface between Redshift and the application should be managed – connect live, import all data or import some data.

Tableau_Data_Conn_Choices_ART

Unfortunately, working with the ‘active connection’ turned out to be quite problematic, and it only highlights why a remotely hosted data warehouse can potentially be a big issue, at least when using Tableau in ‘always connected’ mode. Network latency is not a problem here, as only a small subset of data is returned to the client. Also, I could live with the fact that query execution time can, in some cases, be longer than anticipated, adding to the mounting frustration. However, I would not be able to put up with a setup which provides the flexibility of real-time data interrogation at the price of great inconvenience, given how the current architecture handles query execution. What I am referring to here is the way Tableau works with data – aggregating, sorting and manipulating it when conducting the analysis. It appeared to me that for query 1, for example, which as I have already mentioned took around 2 minutes to process, not only did I have to wait 2 minutes just for the metadata to be returned to the client, but each time the application needed to reference the source dataset, an additional 2 minutes were wasted on further execution due to the lack of adequate optimisation. To explain further, let’s look at the image below outlining the creation of the initial data analysis sheet.

Tableau_Commission_Query_Exec

Nearly the same query was executed when attempting to drag the ‘commission’ measure onto the summary field as we saw in the case of metadata fetching, which means that every time a new aggregation is created in Tableau, an additional 2 minutes were added (read: wasted) for the query to hit Redshift’s processing engine, crunch the data and return it over the network. This means that utilizing a live connection to the data source was virtually prohibitive and could drag what was intended as a quick analysis project into time-unaffordable lag. Looking at the query which was passed to Redshift from Tableau when trying to summarize the ‘commission’ figures, I could tell that there is literally zero optimization done (server caching, local memory staging etc.) to prevent this inefficient, repetitive ‘SELECT OUT OF SELECT’ execution (see image below).

Tableau_Commission_SQL
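
To make the pattern concrete without relying on the screenshot, below is a hedged sketch of the general shape of the statement Tableau sends – the derived-table alias “TableauSQL” is Tableau’s convention for wrapped custom SQL, but the exact text and aliases here are illustrative, not copied from the screen:

-- Illustrative only: the whole source query is re-executed as a derived
-- table every time a single aggregate (here, total commission) is requested
SELECT SUM("TableauSQL".commission) AS sum_commission
FROM
	(SELECT SUM(s.qtysold) AS quantity_sold,
		SUM(s.commission) AS commission,
		d.caldate AS event_date,
		e.eventname AS event_name,
		v.venuecity AS venue_city,
		v.venuestate AS venue_state
	FROM Sales s
		JOIN Event e ON s.eventid = e.eventid
		JOIN Venue v ON e.venueid = v.venueid
		JOIN Date d ON d.dateid = s.dateid
	WHERE e.eventname IN ('Eddie Vedder', 'Stone Temple Pilots', 'Rush')
	GROUP BY e.eventname, d.caldate, v.venuecity, v.venuestate) "TableauSQL";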

I can hear you saying that I could always mitigate all those problems by importing all or some of the data into my local environment. I concur. Given that the datasets are quite small, I tried importing all the records and everything worked really well. However, given that Redshift’s primary market is high-volume data deployments, I would not be able to stage millions or billions of records on my machine – I wouldn’t even want to. This conundrum proves that even though cloud data warehousing is making strides towards a wider audience and higher adoption, there are still some big issues that can render it useless for some specific applications.

Working in off-line mode, with all data imported and staged on my local machine, the experience was much more rewarding and Tableau was a pleasure to work with.

Below are the queries’ CPU performance metrics as well as the SQL statements passed for execution. Notice how Tableau formats the SQL code by prefixing it with a ‘SELECT TOP 1 * FROM…’ statement.
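
For completeness, that probe looks roughly like this – again a hedged sketch, with the wrapped query and alias being illustrative:

-- Metadata probe: fetch a single row so the client can discover
-- column names and types before issuing any real work
SELECT TOP 1 *
FROM
	(SELECT s.salesid, s.qtysold, s.commission, s.saletime
	FROM sales s) "TableauSQL";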

Tableau_Data_Load_Q1_ARTTableau_Data_Load_Q1_Perf_ARTTableau_Data_Load_Q2_ARTTableau_Data_Load_Q2_Perf_ART

Final Conclusion

So what is the overall verdict on Amazon Redshift? In my books, even though the product is still rough around the edges and a bit more polish and finesse needs to be applied here and there, Redshift is the real deal. It performs very well, it is easy to provision and manage, it has good connectivity options and the price is competitive. I would never have thought that I could deploy such a powerful architecture on an impromptu basis, perform my analysis and scrap it – all within less than one hour and for less than the price of a cup of coffee. On one hand, this is a welcome development, as it has never been so easy and inexpensive to process huge volumes of data with minimal input. No longer does one need to pay exorbitant licensing costs, employ legions of DBAs and system administrators or have a very good handle on the esoteric subjects of database programming and administration. Such a deployment takes away most of the cost, human input and associated complexities, thus empowering end users and making such technologies accessible to a wider audience.

At the same time, it is not the elusive real-life unicorn of the data warehousing world and it won’t solve all your BI problems. With all its benefits and advantages revolving primarily around speed and cost, it does not support many common database features; your data is stored in the cloud (encrypted or not), which can pose security issues for some, more tightly regulated workshops; and its management flexibility is still on the “I wish I could do such and such…” side. Throw in data latency issues, with many businesses still operating on barely adequate internet speed, and somewhat inferior analytical application support when working with ‘live’ data, and the whole concept can lose its shine pretty quickly.

Moreover, I can anticipate that some IT professionals will be reluctant to recommend such a solution to their managers in fear of falling into obsolescence. Quoting my unenthusiastic friend, ‘the fond years of database administrators being the center of all business activities are slowly getting reduced to documentation management, mostly taking away the mental challenge and hard-core problem solving opportunities’. I would imagine that many professionals who can slowly feel the ‘cloud pinch’ will boycott (at least initially) the business transition to such technology, foreseeing their demise if no skills adjustments are made on their part. I may be overdramatising here, but regardless of what your attitude is, cloud data warehousing is here to stay and Amazon Redshift is a perfect example of how fast, cheap, scalable and manageable such technology can be. From my point of view, given you have a problem whose definition fits its capabilities, Redshift is a force to be reckoned with for other vendors and a powerful and cost-effective solution for those who manage and use it.


Amazon Redshift Review – Data Warehouse in the Cloud Is Here, Part 2

November 1st, 2013 / by admin

Background And Environment Specs Recap

In the FIRST part of this series I briefly explored the background of Amazon’s Redshift data warehouse offering, as well as outlining the environments and tools used for managing test datasets. I also recreated the TICKIT database schema on all instances provisioned for testing and preloaded them with dummy data. In this post I would like to explore the performance findings from running a bunch of sample SQL queries to find out how they behave under load. The final post, outlining how to connect to and use data stored in Redshift, can be found HERE.

As mentioned in my FIRST post, all the datasets which populated the TICKIT sample database were quite small, so I needed to synthetically expand one of the tables in order to increase data volumes and make the whole exercise viable. Using nothing but SQL, I ‘stretched out’ my Sales fact table to around 50,000,000 records, which is hardly enough for any database to break a sweat; but given that the free storage limit on Amazon S3 is only 5 GB (you can also load Redshift from DynamoDB) and that my home internet bandwidth is still stuck on ADSL2, I had to get a little creative with the SQL queries rather than try to push for higher data volumes.

The testing methodology was quite simple: using mocked-up SQL SELECT statements, for each query executed I recorded the execution time on all environments and plotted them on a graph for comparison. None of those statements represent any business function, so I did not focus on any specific output or result; the idea was simply to make them run and wait until completion, taking note of how long it took from beginning to end. Each time a query was executed, the instance was restarted to ensure a cold cache and full system resources available to the database engine. I also tried to keep the queries as generic and universal as possible; most of the core syntax principles apply to both PG/SQL and Microsoft T-SQL alike, but there are some small distinctions which can get ‘lost in translation’.
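
For the curious, this kind of synthetic inflation needs nothing more exotic than repeated self-inserts. A minimal sketch of the idea (the column list follows the TICKIT Sales table; my actual script is not reproduced here):

-- Each run doubles the row count; repeat until the desired volume is reached.
-- The MAX(salesid) offset keeps the copied keys unique.
INSERT INTO sales (salesid, listid, sellerid, buyerid, eventid, dateid,
	qtysold, pricepaid, commission, saletime)
SELECT salesid + (SELECT MAX(salesid) FROM sales),
	listid, sellerid, buyerid, eventid, dateid,
	qtysold, pricepaid, commission, saletime
FROM sales;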

As far as the physical infrastructure composition goes, the image below contains a quick recap of the testing environment used for this evaluation. Initially, I planned to test my queries on 3 different Redshift deployments – a single and a multi-node Redshift XL instance as well as a single Redshift 8XL instance – however, after casually running a few queries and seeing the performance level Redshift is capable of, I decided to scale back and run only with the Redshift XL class in single and multi-node configurations.

Test_Env_Specs_ART

Graphing the results, for simplicity’s sake, I assigned the following monikers to each environment:

  • ‘Local Instance’ – my local SQL Server 2012 deployment
  • ‘RDS SQL Server’ – High-Memory Quadruple Extra Large DB Instance of SQL Server 2012
  • ‘1 x Redshift XL’ – Single High Storage Extra Large (XL) Redshift DW node (1 x dw.hs1.xlarge)
  • ‘4 x Redshift XL’ – Four High Storage Extra Large (XL) Redshift DW nodes (4 x dw.hs1.xlarge)

Test Results

All the SQL queries run to collect execution times (expressed in seconds), as well as graphical representations of their durations, are as per below.

--QUERY 1
SELECT COUNT(*) FROM
	(SELECT COUNT(*) AS counts, salesid, listid, sellerid, buyerid,
	eventid, dateid, qtysold, pricepaid, commission, saletime
	FROM sales
	GROUP BY salesid, listid, sellerid, buyerid, eventid,
	dateid, qtysold, pricepaid, commission, saletime) sub

Query1_Results_ART

--QUERY 2
WITH cte AS (
	SELECT
		SUM(s.pricepaid) AS pricepaid,
		sub.*,
		u1.email,
		DENSE_RANK() OVER (PARTITION BY sub.totalprice ORDER BY d.caldate DESC) AS daterank
	FROM sales s  JOIN
		(SELECT l.totalprice, l.listid, v.venuecity, c.catdesc
		FROM listing l
		LEFT JOIN event e on l.eventid = e.eventid
		INNER JOIN category c on e.catid = c.catid
		INNER JOIN venue v on e.venueid = v.venueid
		GROUP BY v.venuecity, c.catdesc, l.listid, l.totalprice) sub
	ON s.listid = sub.listid
	LEFT JOIN users u1 on s.buyerid = u1.userid
	LEFT JOIN Date d on d.dateid = s.dateid
	GROUP BY sub.totalprice, sub.listid, sub.venuecity, sub.catdesc, u1.email, d.caldate)
SELECT COUNT(*) FROM cte

Query2_Results_ART

--QUERY 3
SELECT COUNT(*) FROM(
SELECT s.*, u.* FROM sales s
		LEFT JOIN event e on s.eventid = e.eventid
		LEFT JOIN date d on d.dateid = s.dateid
		LEFT JOIN users u on u.userid = s.buyerid
		LEFT JOIN listing l on l.listid = s.listid
		LEFT JOIN category c on c.catid = e.catid
		LEFT JOIN venue v on v.venueid = e.venueid
	INTERSECT
SELECT s.*, u.* FROM sales s
		LEFT JOIN event e on s.eventid = e.eventid
		LEFT JOIN date d on d.dateid = s.dateid
		LEFT JOIN users u on u.userid = s.buyerid
		LEFT JOIN listing l on l.listid = s.listid
		LEFT JOIN category c on c.catid = e.catid
		LEFT JOIN venue v on v.venueid = e.venueid
		WHERE d.holiday =0) sub

Query3_Results_ART

--QUERY 4
WITH Temp (eventname, starttime, eventid) AS
	(SELECT ev.eventname, ev.starttime, ev.eventid
	FROM Event ev
	WHERE	NOT EXISTS	
			(SELECT 1 FROM Sales s
			WHERE s.pricepaid  > 500 AND s.pricepaid < 900 AND s.eventid = ev.eventid) 	
			AND  EXISTS	(SELECT 1 FROM Sales s 
			WHERE s.qtysold  >= 4 AND s.eventid = ev.eventid))
SELECT COUNT(t.eventid) AS EvCount,
SUM(l.totalprice)  - SUM((CAST(l.numtickets AS Decimal(8,2)) * l.priceperticket * 0.9)) as TaxPaid
FROM Temp t JOIN Listing l ON l.eventid = t.eventid
WHERE t.starttime <= l.listtime

Query4_Results_ART

--QUERY 5
SELECT COUNT(*) FROM
	(SELECT
		SUM(sub1.pricepaid) AS price_paid,
		sub1.caldate AS date,
		sub1.username
	FROM
		(select  s.pricepaid, d.caldate, u.username
		FROM Sales s
		LEFT JOIN date d on d.dateid = s.dateid
		LEFT JOIN users u on s.buyerid = u.userid
		WHERE s.saletime BETWEEN '2008-12-01' AND '2008-12-31'
		) sub1
	GROUP BY sub1.caldate, sub1.username
	) as sub2

Query5_Results_ART

--QUERY 6
WITH CTE1 (c1) AS
		(SELECT COUNT(1) as c1 FROM Sales s
		WHERE saletime BETWEEN '2008-12-01' AND '2008-12-31' 
		GROUP BY salesid 
		HAVING COUNT(*)=1),
	CTE2 (c2) AS
		(SELECT COUNT(1) as c2 FROM Listing e
		WHERE listtime BETWEEN '2008-12-01' AND '2008-12-31' 
		GROUP BY listid	
		HAVING COUNT(*)=1),
	CTE3 (c3) AS
		(SELECT COUNT(1) as c3 FROM Date d	
		GROUP BY dateid 
		HAVING COUNT(*)=1)
SELECT COUNT(*) FROM Cte1
					RIGHT JOIN CTE2 ON CTE1.c1 <> CTE2.c2
					RIGHT JOIN CTE3 ON CTE3.c3 = CTE2.c2 and CTE3.c3 <> CTE1.c1
					GROUP BY CTE1.c1
					HAVING COUNT(*)>1

Query6_Results_ART

--QUERY 7
SELECT COUNT(*) FROM
(SELECT COUNT(*) as c FROM sales s
WHERE s.saletime BETWEEN '2008-12-01' AND '2008-12-31'
GROUP BY salesid
HAVING COUNT(*)=1) a
LEFT JOIN
(SELECT COUNT(*) as c FROM date d
WHERE d.week = 1
GROUP BY dateid
HAVING COUNT(*)=1) b
ON a.c <> b.c
LEFT JOIN
(SELECT COUNT(*) as c FROM listing l
WHERE l.listtime BETWEEN '2008-12-01' AND '2008-12-31'
GROUP BY listid
HAVING COUNT(*)=1) c
ON a.c <> c.c AND b.c <> c.c
LEFT JOIN
(SELECT COUNT(*) as c FROM venue v
GROUP BY venueid
HAVING COUNT(*)=1) d
ON a.c <> d.c AND b.c <> d.c AND c.c <> d.c

Query7_Results_ART
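
As a side note, client-side stopwatch readings can be cross-checked against the server’s own accounting. A minimal sketch using Redshift’s SVL_QLOG system view (the LIMIT value is arbitrary):

-- Most recent queries with elapsed time converted from microseconds to seconds
SELECT query, substring, starttime, elapsed / 1000000.0 AS seconds
FROM svl_qlog
ORDER BY starttime DESC
LIMIT 10;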

Finally, before diving into the conclusion section, a short mention of a few unexpected hiccups along the way. On the single-node deployment, 2 queries failed for two different reasons. The first one returned an error message stating that ‘This type of correlated subquery pattern is not supported yet’ (see image below).

Query1_FAILED_ART
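
I have not reproduced the offending statement here, but as a hypothetical illustration of one class of pattern Redshift’s documentation flags as unsupported – a correlated reference that skips a query block (‘skip-level correlation’) – consider something of this shape:

-- Hypothetical example, not the query I removed: the innermost subquery
-- references s.sellerid, skipping past the intermediate query block
SELECT s.salesid
FROM sales s
WHERE s.pricepaid >
	(SELECT AVG(l.priceperticket)
	FROM listing l
	WHERE l.listid IN
		(SELECT l2.listid
		FROM listing l2
		WHERE l2.sellerid = s.sellerid));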

Because of this, I decided to remove that query from the overall results tally, leaving me with 7 SQL statements for testing. Query 3, on the other hand, failed with an ‘Out of memory’ message on the client side and another cryptic error showing up in the management console, as per the images below (click on image to enlarge).

Query2_FAILED_ART

Query2_1_FAILED_ART

I think that the inadequate memory allocation issue was due to the lack of resources provisioned, as it only appeared when running on a single node – the multi-node deployment executed without issues. Nevertheless, it was disappointing to see Redshift not coping very well with memory allocation and management, especially given that the same query ran fine (albeit slowly) on both of the SQL Server databases.

For reference, below is a resource consumption graph (3 parameters included) for the multi-node deployment of Amazon Redshift during query execution.

Perf_Monitor_ALLSQL_ART

Performance Verdict

I don’t think I have to interpret the results in much detail here – they are pretty much self-explanatory. Redshift, with the exception of a few quibbles where a SQL query was either unsupported or too much for its default memory management to handle, completely outran the opposition. As I already stated in PART 1 of this series, a relational database setup is probably not the best option for reading vast volumes of data with heavy aggregations thrown into the mix (that’s what SSAS is for when using Microsoft as a vendor), but at the same time I never expected such blazing performance (per dollar) from the Amazon offering. At this point I wish I had more data to throw at it, as 50,000,000 records in my largest table is not really MPP-database territory, but nevertheless, Redshift screamed through what a gutsy, more expensive SQL Server RDS deployment completely choked on. Quite impressive, although having heard other people’s comments and opinions, some even going as far as saying that Redshift is a great alternative to Hadoop (provided the data is structured and homogeneous), I was not surprised.

In the next part of this series – PART 3 – I will be looking at connecting to the Redshift database using some popular reporting/data visualization tools, e.g. Tableau, to see if this technology makes it a viable backend solution not just for data storage but also for querying through third-party applications.
