{"id":2150,"date":"2014-02-12T01:42:53","date_gmt":"2014-02-12T01:42:53","guid":{"rendered":"http:\/\/bicortex.com\/?p=2150"},"modified":"2015-05-27T10:16:58","modified_gmt":"2015-05-27T10:16:58","slug":"which-one-should-i-use-not-exists-not-in-outer-apply-except-and-left-outer-join-performance-optimization","status":"publish","type":"post","link":"https:\/\/bicortex.com\/bicortex\/which-one-should-i-use-not-exists-not-in-outer-apply-except-and-left-outer-join-performance-optimization\/","title":{"rendered":"Which One Should I Use &#8211; NOT EXISTS, NOT IN, OUTER APPLY, EXCEPT and LEFT OUTER JOIN Performance Optimization"},"content":{"rendered":"<h3 style=\"text-align: center;\">Introduction<\/h3>\n<p style=\"text-align: justify;\">Lately I have been involved in a large enterprise data warehouse deployment project and the in last few weeks we have been at a point where the team\u00a0slowly begun transitioning from dimension tables development to fact tables development. Working with many different teams e.g. contractors, consultants, internal data professionals etc. what caught my attention was the fact that different developers tend to use different techniques to differentiate between the &#8216;new&#8217; vs. &#8216;old&#8217; data e.g. source OLTP system vs. target fact table in the OLAP schema to determine whether a given set of transactions have already been inserted or not. Given that most data flow logic was embedded into SQL stored procedures, with SQL Server Integration Services only controlling execution flow, a typical scenario for source vs. target comparison would involve applying dataset differentiating statement e.g. NOT EXISTS or NOT IN to account for any new transactions. There is a number of different options available here so I thought it would be a good idea to put them to the test and find out how performance is affected when using the most common ones i.e. NOT EXISTS, NOT IN, OUTER APPLY, EXCEPT and LEFT OUTER JOIN, on a well-structured dataset, with and without an index and with columns defined as NULLable vs. Non-NULLable.<\/p>\n<p style=\"text-align: justify;\">The short-winded version is that when presented with a choice of using either one of those 5 statements you should preferably stay away from NOT IN for reasons I described below. This will depend heavily on your schema, data and resources at your disposal but as a rule of thumb, NOT IN should never be the first option to give consideration to when other alternatives are possible.<\/p>\n<h3 style=\"text-align: center;\">TL;DR Version<\/h3>\n<p style=\"text-align: justify;\">Let\u2019s start with two sample datasets \u2013 Tbl1 and Tbl2 \u2013 where Tbl2 differs from its archetype by removing a small number of records (in this case around 50). The two tables and their data were created using the following SQL.<\/p>\n<pre class=\"brush: sql; title: ; notranslate\" title=\"\">\r\nUSE &#x5B;master]\r\nGO\r\nIF EXISTS (SELECT name FROM sys.databases WHERE name = N'TestDB')\r\nBEGIN\r\n-- Close connections to the DW_Sample database\r\nALTER DATABASE &#x5B;TestDB] SET SINGLE_USER WITH ROLLBACK IMMEDIATE\r\nDROP DATABASE &#x5B;TestDB]\r\nEND\r\nGO\r\nCREATE DATABASE &#x5B;TestDB]\r\nGO\r\nUSE &#x5B;TestDB]\r\nGO\r\n\r\nCREATE TABLE Tbl1\r\n(ID int identity (1,1),\r\nID_NonIdentity int NOT NULL DEFAULT 0,\r\nobject_id int NOT NULL)\r\nGO\r\n\r\nCREATE TABLE Tbl2\r\n(ID int identity (1,1),\r\nID_NonIdentity int NOT NULL,\r\nobject_id int NOT NULL)\r\nGO\r\n\r\nINSERT INTO Tbl1 (object_id)\r\nSELECT c1.object_id FROM sys.objects c1\r\nCROSS JOIN (SELECT Top 100 name FROM sys.objects) c2\r\nCROSS JOIN (SELECT Top 100 type_desc FROM sys.objects) c3\r\nGO 25\r\n\r\nUPDATE Tbl1 SET ID_NonIdentity = ID\r\n\r\nINSERT INTO Tbl2 (ID_NonIdentity, object_id)\r\nSELECT ID_NonIdentity, object_id FROM Tbl1\r\n\r\nSET NOCOUNT ON\r\nDECLARE @start int = 0\r\nDECLARE @finish int = (SELECT MAX(id) FROM Tbl2)\r\nWHILE @start &lt;= @finish\r\n\tBEGIN\r\n\tDELETE FROM Tbl2 WHERE id = @start\r\n\tSET @start = @start+250000\r\n\tEND\r\n\r\nCREATE INDEX idx_Tbl1_ID_NonIdentity\r\nON Tbl1 (ID_NonIdentity)\r\nCREATE INDEX idx_Tbl2_ID_NonIdentity\r\nON Tbl2 (ID_NonIdentity)\r\n<\/pre>\n<p style=\"text-align: justify;\">Given that the two objects\u2019 data is slightly different, we can now compare their content and extract the dichotomies using ID_NonIdentity attribute or simply run EXCEPT statement across the two tables. You can also notice that at this stage both tables are defined using non-NULLable data types and have indexes created on ID_NonIdentity column. Let\u2019s run the sample SELECT statements using NOT EXISTS, NOT IN, OUTER APPLY, EXCEPT AND LEFT OUTER JOIN and look at execution times and plans in more detail.<\/p>\n<pre class=\"brush: sql; title: ; notranslate\" title=\"\">\r\n--QUERY 1\r\nDBCC FREEPROCCACHE\r\nDBCC DROPCLEANBUFFERS\r\nSET STATISTICS TIME ON\r\nSELECT ID\r\nFROM tbl1 a\r\nWHERE a.ID_NonIdentity NOT IN\r\n\t(SELECT b.ID_NonIdentity FROM tbl2 b)\r\nSET STATISTICS TIME OFF\r\n\r\n--QUERY 2\r\nDBCC FREEPROCCACHE\r\nDBCC DROPCLEANBUFFERS\r\nSET STATISTICS TIME ON\r\nSELECT ID\r\nFROM tbl1 a WHERE NOT EXISTS\r\n\t(SELECT ID_NonIdentity\r\n\tFROM tbl2 b\r\n\tWHERE a.ID_NonIdentity = b.ID_NonIdentity)\r\nSET STATISTICS TIME OFF\r\n\r\n--QUERY 3\r\nDBCC FREEPROCCACHE\r\nDBCC DROPCLEANBUFFERS\r\nSET STATISTICS TIME ON\r\nSELECT a.ID FROM Tbl1 a\r\nLEFT OUTER JOIN Tbl2 b ON a.ID_NonIdentity = b.ID_NonIdentity --11sec\r\nWHERE b.ID_NonIdentity IS NULL\r\nSET STATISTICS TIME OFF\r\n\r\n--QUERY 4\r\nDBCC FREEPROCCACHE\r\nDBCC DROPCLEANBUFFERS\r\nSET STATISTICS TIME ON\r\nSELECT a.ID\r\nFROM tbl1 a OUTER APPLY\r\n\t\t\t(SELECT ID_NonIdentity FROM Tbl2 b\r\n\t\t\tWHERE a.ID_NonIdentity=b.ID_NonIdentity) z\r\n\t\t\tWHERE z.ID_NonIdentity IS NULL\r\nSET STATISTICS TIME OFF\r\n\r\n--QUERY 5\r\nDBCC FREEPROCCACHE\r\nDBCC DROPCLEANBUFFERS\r\nSET STATISTICS TIME ON\r\nSELECT ID\r\nFROM tbl1 a\r\nEXCEPT\r\nSELECT ID\r\nFROM tbl2 b\r\nSET STATISTICS TIME OFF\r\n<\/pre>\n<p style=\"text-align: justify;\"><a href=\"http:\/\/bicortex.com\/bicortex\/wp-content\/post_content\/\/2014\/02\/AllQueries_ExecTime_FirstPass.png\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-full wp-image-2154\" src=\"http:\/\/bicortex.com\/bicortex\/wp-content\/post_content\/\/2014\/02\/AllQueries_ExecTime_FirstPass.png\" alt=\"AllQueries_ExecTime_FirstPass\" width=\"580\" height=\"376\" srcset=\"https:\/\/bicortex.com\/bicortex\/wp-content\/post_content\/2014\/02\/AllQueries_ExecTime_FirstPass.png 580w, https:\/\/bicortex.com\/bicortex\/wp-content\/post_content\/2014\/02\/AllQueries_ExecTime_FirstPass-300x194.png 300w\" sizes=\"auto, (max-width: 580px) 100vw, 580px\" \/><\/a><\/p>\n<p style=\"text-align: justify;\">Looking closely at the execution plans, utilizing NOT EXISTS and NOT IN produced identical query plans with Right Anti Semi Join being the most expensive operation here.<\/p>\n<p style=\"text-align: justify;\"><a href=\"http:\/\/bicortex.com\/bicortex\/wp-content\/post_content\/\/2014\/02\/Q1andQ2_ExecPlan.png\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-full wp-image-2153\" src=\"http:\/\/bicortex.com\/bicortex\/wp-content\/post_content\/\/2014\/02\/Q1andQ2_ExecPlan.png\" alt=\"Q1andQ2_ExecPlan\" width=\"580\" height=\"200\" srcset=\"https:\/\/bicortex.com\/bicortex\/wp-content\/post_content\/2014\/02\/Q1andQ2_ExecPlan.png 580w, https:\/\/bicortex.com\/bicortex\/wp-content\/post_content\/2014\/02\/Q1andQ2_ExecPlan-300x103.png 300w\" sizes=\"auto, (max-width: 580px) 100vw, 580px\" \/><\/a><\/p>\n<p style=\"text-align: justify;\">This was somewhat similar to OUTER APPLY and LEFT OUTER JOIN, however for those two query types the optimizer chose Right Outer Join which seemed a little bit more expensive compared to Right Anti Semi Join due to the query bringing in all matching and non-matching records first and then applying a filter to eliminate matches as per image below.<\/p>\n<p style=\"text-align: center;\"><a href=\"http:\/\/bicortex.com\/bicortex\/wp-content\/post_content\/\/2014\/02\/Q3andQ4_ExecPlan.png\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-full wp-image-2170\" src=\"http:\/\/bicortex.com\/bicortex\/wp-content\/post_content\/\/2014\/02\/Q3andQ4_ExecPlan.png\" alt=\"Q3andQ4_ExecPlan\" width=\"580\" height=\"173\" srcset=\"https:\/\/bicortex.com\/bicortex\/wp-content\/post_content\/2014\/02\/Q3andQ4_ExecPlan.png 580w, https:\/\/bicortex.com\/bicortex\/wp-content\/post_content\/2014\/02\/Q3andQ4_ExecPlan-300x89.png 300w\" sizes=\"auto, (max-width: 580px) 100vw, 580px\" \/><\/a><\/p>\n<p style=\"text-align: justify;\">Using EXCEPT yielded similar execution plan to NOT EXISTS and NOT IN with the exception of optimizer utilizing Hash Match (Aggregate) to build a hash table in order to remove duplicates. This is an important point to make as EXCEPT includes implicit DISTINCT \u2013 if cases multiple rows with the same value are found, they will be eliminated from the left \u2018side of the equation\u2019 much like UNION vs. UNION ALL operators. Not an issue in this specific instance but something to watch out for when planning to query data differences.<\/p>\n<p style=\"text-align: justify;\">Regardless of the slight differences in execution plans, all queries with the exception of the one using EXCEPT run in a comparable time. Typically, such statements in a production environment would run over a potentially larger datasets with much more complex logic involved so larger variances can be expected. Generally though, performance is maintained on par and disparities should be minimal. Also, removing indexes from both tables did little to increase execution time for the above queries. But what happens if we enable NULL values ID_NonIdentity attribute? Let\u2019s execute the following SQL to change column NULLability and run the representative SQL SELECT statements again to see if a change can be attributed to ID_NonIdentity accepting NULL values. Notice that there is no change to the underlying data and previously created indexes are still in place.<\/p>\n<pre class=\"brush: sql; title: ; notranslate\" title=\"\">\r\nALTER TABLE Tbl1\r\nALTER COLUMN ID_NonIdentity int NULL\r\nGO\r\nALTER TABLE Tbl2\r\nALTER COLUMN ID_NonIdentity int NULL\r\nGO\r\n<\/pre>\n<p style=\"text-align: justify;\">Execution times are as per the chart below and it\u2019s clear to see that while query 2, 3, 4 and 5 behaved in a predictable manner and returned all records within respectable time, query 1 failed abominably.<\/p>\n<p style=\"text-align: justify;\"><a href=\"http:\/\/bicortex.com\/bicortex\/wp-content\/post_content\/\/2014\/02\/AllQueries_ExecTime_SecondPass.png\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-full wp-image-2156\" src=\"http:\/\/bicortex.com\/bicortex\/wp-content\/post_content\/\/2014\/02\/AllQueries_ExecTime_SecondPass.png\" alt=\"AllQueries_ExecTime_SecondPass\" width=\"580\" height=\"376\" srcset=\"https:\/\/bicortex.com\/bicortex\/wp-content\/post_content\/2014\/02\/AllQueries_ExecTime_SecondPass.png 580w, https:\/\/bicortex.com\/bicortex\/wp-content\/post_content\/2014\/02\/AllQueries_ExecTime_SecondPass-300x194.png 300w\" sizes=\"auto, (max-width: 580px) 100vw, 580px\" \/><\/a><\/p>\n<p style=\"text-align: justify;\">The main problem here is that the results can be surprising if the target column is NULLable as SQL Server cannot reliably tell if a NULL on the right side is or isn\u2019t equal to the reference record on the left side when executing the query using NOT IN clause. And that\u2019s regardless whether the column actually contain any NULL values or not. I let query 1 to run for 30 minutes after which it was clear that it was going to take a while to complete and was a good indication of the problems the NOT IN clause was causing to optimizer trying to select the most representative plan. You can also tell that the query plan generated after the column modification is quite a bit more involved with Nested Loops, Row Count Spool and heavy tempdb database usage to accomplish what seemed straightforward in the first pass.<\/p>\n<p style=\"text-align: center;\"><a href=\"http:\/\/bicortex.com\/bicortex\/wp-content\/post_content\/\/2014\/02\/Q1_ExecPlan_PostColumnMod.png\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter  wp-image-2157\" src=\"http:\/\/bicortex.com\/bicortex\/wp-content\/post_content\/\/2014\/02\/Q1_ExecPlan_PostColumnMod.png\" alt=\"Q1_ExecPlan_PostColumnMod\" width=\"580\" height=\"230\" \/><\/a><\/p>\n<h3 style=\"text-align: center;\">Conclusion<\/h3>\n<p style=\"text-align: justify;\">The upshot to this quick exercise in query performance is that whenever you plan to compare data in table A against data in table B where some condition does not exists in table B, using NOT IN clause should be avoided as much as possible. This, of course will be dependent on your workloads, hardware, data, schema and environment in general but it would be safe to say that using NOT EXISTS instead of NOT IN would most likely result in best query performance and execution time.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Introduction Lately I have been involved in a large enterprise data warehouse deployment project and the in last few weeks we have been at a point where the team\u00a0slowly begun transitioning from dimension tables development to fact tables development. Working with many different teams e.g. contractors, consultants, internal data professionals etc. what caught my attention [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[5],"tags":[18,49,19],"class_list":["post-2150","post","type-post","status-publish","format-standard","hentry","category-sql","tag-microsoft","tag-sql","tag-sql-server"],"aioseo_notices":[],"jetpack_featured_media_url":"","_links":{"self":[{"href":"https:\/\/bicortex.com\/bicortex\/wp-json\/wp\/v2\/posts\/2150","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/bicortex.com\/bicortex\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/bicortex.com\/bicortex\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/bicortex.com\/bicortex\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/bicortex.com\/bicortex\/wp-json\/wp\/v2\/comments?post=2150"}],"version-history":[{"count":16,"href":"https:\/\/bicortex.com\/bicortex\/wp-json\/wp\/v2\/posts\/2150\/revisions"}],"predecessor-version":[{"id":2543,"href":"https:\/\/bicortex.com\/bicortex\/wp-json\/wp\/v2\/posts\/2150\/revisions\/2543"}],"wp:attachment":[{"href":"https:\/\/bicortex.com\/bicortex\/wp-json\/wp\/v2\/media?parent=2150"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/bicortex.com\/bicortex\/wp-json\/wp\/v2\/categories?post=2150"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/bicortex.com\/bicortex\/wp-json\/wp\/v2\/tags?post=2150"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}