Lately I have been involved in a large enterprise data warehouse deployment project, and in the last few weeks the team has slowly begun transitioning from dimension table development to fact table development. Working with many different teams e.g. contractors, consultants, internal data professionals etc., what caught my attention was that different developers tend to use different techniques to differentiate between the 'new' and 'old' data e.g. the source OLTP system vs. the target fact table in the OLAP schema, in order to determine whether a given set of transactions has already been inserted or not. Given that most of the data flow logic was embedded in SQL stored procedures, with SQL Server Integration Services only controlling execution flow, a typical source vs. target comparison would involve applying a dataset-differentiating statement e.g. NOT EXISTS or NOT IN to account for any new transactions. There are a number of different options available here, so I thought it would be a good idea to put them to the test and find out how performance is affected when using the most common ones i.e. NOT EXISTS, NOT IN, OUTER APPLY, EXCEPT and LEFT OUTER JOIN, on a well-structured dataset, with and without an index and with columns defined as NULLable vs. non-NULLable.
TL;DR Version
The short-winded version is that, when presented with a choice of any of these five statements, you should preferably stay away from NOT IN, for the reasons described below. Much depends on your schema, data and the resources at your disposal, but as a rule of thumb NOT IN should never be the first option you consider when other alternatives are available.
Let's start with two sample datasets – Tbl1 and Tbl2 – where Tbl2 differs from its archetype only in that a small number of records (in this case around 50) have been removed. The two tables and their data were created using the following SQL.
USE [master]
GO
IF EXISTS (SELECT name FROM sys.databases WHERE name = N'TestDB')
BEGIN
-- Close connections to the TestDB database
ALTER DATABASE [TestDB] SET SINGLE_USER WITH ROLLBACK IMMEDIATE
DROP DATABASE [TestDB]
END
GO
CREATE DATABASE [TestDB]
GO
USE [TestDB]
GO
CREATE TABLE Tbl1
(ID int identity (1,1),
ID_NonIdentity int NOT NULL DEFAULT 0,
object_id int NOT NULL)
GO
CREATE TABLE Tbl2
(ID int identity (1,1),
ID_NonIdentity int NOT NULL,
object_id int NOT NULL)
GO
INSERT INTO Tbl1 (object_id)
SELECT c1.object_id FROM sys.objects c1
CROSS JOIN (SELECT Top 100 name FROM sys.objects) c2
CROSS JOIN (SELECT Top 100 type_desc FROM sys.objects) c3
GO 25
UPDATE Tbl1 SET ID_NonIdentity = ID
INSERT INTO Tbl2 (ID_NonIdentity, object_id)
SELECT ID_NonIdentity, object_id FROM Tbl1
SET NOCOUNT ON
DECLARE @start int = 0
DECLARE @finish int = (SELECT MAX(id) FROM Tbl2)
WHILE @start <= @finish
BEGIN
DELETE FROM Tbl2 WHERE id = @start
SET @start = @start+250000
END
CREATE INDEX idx_Tbl1_ID_NonIdentity
ON Tbl1 (ID_NonIdentity)
CREATE INDEX idx_Tbl2_ID_NonIdentity
ON Tbl2 (ID_NonIdentity)
Given that the two objects' data is slightly different, we can now compare their content and extract the differences using the ID_NonIdentity attribute, or simply run an EXCEPT statement across the two tables. Notice that at this stage both tables are defined with non-NULLable columns and have indexes created on the ID_NonIdentity column. Let's run the sample SELECT statements using NOT EXISTS, NOT IN, OUTER APPLY, EXCEPT and LEFT OUTER JOIN and look at the execution times and plans in more detail.
--QUERY 1
DBCC FREEPROCCACHE
DBCC DROPCLEANBUFFERS
SET STATISTICS TIME ON
SELECT ID
FROM tbl1 a
WHERE a.ID_NonIdentity NOT IN
(SELECT b.ID_NonIdentity FROM tbl2 b)
SET STATISTICS TIME OFF
--QUERY 2
DBCC FREEPROCCACHE
DBCC DROPCLEANBUFFERS
SET STATISTICS TIME ON
SELECT ID
FROM tbl1 a WHERE NOT EXISTS
(SELECT ID_NonIdentity
FROM tbl2 b
WHERE a.ID_NonIdentity = b.ID_NonIdentity)
SET STATISTICS TIME OFF
--QUERY 3
DBCC FREEPROCCACHE
DBCC DROPCLEANBUFFERS
SET STATISTICS TIME ON
SELECT a.ID FROM Tbl1 a
LEFT OUTER JOIN Tbl2 b ON a.ID_NonIdentity = b.ID_NonIdentity --11sec
WHERE b.ID_NonIdentity IS NULL
SET STATISTICS TIME OFF
--QUERY 4
DBCC FREEPROCCACHE
DBCC DROPCLEANBUFFERS
SET STATISTICS TIME ON
SELECT a.ID
FROM tbl1 a OUTER APPLY
(SELECT ID_NonIdentity FROM Tbl2 b
WHERE a.ID_NonIdentity=b.ID_NonIdentity) z
WHERE z.ID_NonIdentity IS NULL
SET STATISTICS TIME OFF
--QUERY 5
DBCC FREEPROCCACHE
DBCC DROPCLEANBUFFERS
SET STATISTICS TIME ON
SELECT ID
FROM tbl1 a
EXCEPT
SELECT ID
FROM tbl2 b
SET STATISTICS TIME OFF
Looking closely at the execution plans, NOT EXISTS and NOT IN produced identical query plans, with the Right Anti Semi Join being the most expensive operation.
This was broadly similar to OUTER APPLY and LEFT OUTER JOIN; however, for those two query types the optimizer chose a Right Outer Join, which seemed a little more expensive than the Right Anti Semi Join because the query first brings in all matching and non-matching records and then applies a filter to eliminate the matches, as per the image below.
Using EXCEPT yielded an execution plan similar to NOT EXISTS and NOT IN, with the exception of the optimizer utilizing a Hash Match (Aggregate) to build a hash table in order to remove duplicates. This is an important point to make, as EXCEPT includes an implicit DISTINCT – if multiple rows with the same value are found, they will be eliminated from the left 'side of the equation', much like the UNION vs. UNION ALL operators. Not an issue in this specific instance, but something to watch out for when planning to query data differences.
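As a quick illustration of that side effect – separate from the test tables above and using throwaway table variables – a duplicated value comes back once from EXCEPT but twice from NOT EXISTS:
--illustrative only: EXCEPT de-duplicates the left-hand side, NOT EXISTS does not
DECLARE @Source TABLE (val int NOT NULL)
DECLARE @Target TABLE (val int NOT NULL)
INSERT INTO @Source (val) VALUES (5), (5), (7)
INSERT INTO @Target (val) VALUES (7)
SELECT val FROM @Source
EXCEPT
SELECT val FROM @Target --returns a single row with the value 5
SELECT s.val FROM @Source s
WHERE NOT EXISTS
(SELECT 1 FROM @Target t WHERE t.val = s.val) --returns the value 5 twice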
Regardless of the slight differences in execution plans, all queries, with the exception of the one using EXCEPT, ran in a comparable time. Typically, such statements in a production environment would run over potentially larger datasets with much more complex logic involved, so larger variances can be expected; generally though, performance stays on par and the disparities should be minimal. Also, removing the indexes from both tables did little to change the execution times of the above queries. But what happens if we allow NULL values in the ID_NonIdentity attribute? Let's execute the following SQL to change the column's NULLability and run the representative SELECT statements again to see whether any change can be attributed to ID_NonIdentity accepting NULL values. Notice that there is no change to the underlying data and the previously created indexes are still in place.
ALTER TABLE Tbl1
ALTER COLUMN ID_NonIdentity int NULL
GO
ALTER TABLE Tbl2
ALTER COLUMN ID_NonIdentity int NULL
GO
Execution times are as per the chart below and it's clear that while queries 2, 3, 4 and 5 behaved in a predictable manner and returned all records in a respectable time, query 1 failed abysmally.
The main problem here is that the results can be surprising when the target column is NULLable: with NOT IN, every value on the left has to be compared against every value returned by the subquery, and any comparison involving a NULL evaluates to UNKNOWN rather than TRUE or FALSE, so SQL Server cannot reliably tell whether a NULL on the right-hand side is or isn't equal to the reference record on the left-hand side. And that's regardless of whether the column actually contains any NULL values or not. I let query 1 run for 30 minutes, after which it was clear that it was going to take a long while to complete – a good indication of the problems the NOT IN clause was causing the optimizer as it tried to select the most appropriate plan. You can also tell that the query plan generated after the column modification is quite a bit more involved, with Nested Loops, a Row Count Spool and heavy tempdb usage to accomplish what seemed straightforward in the first pass.
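To see that three-valued logic at work in isolation, here is a minimal, stand-alone illustration using throwaway table variables rather than the test tables above – a single NULL on the right-hand side stops NOT IN from ever evaluating to TRUE, while NOT EXISTS is unaffected:
--illustrative only: a NULL in the subquery result makes NOT IN return nothing
DECLARE @Right TABLE (val int NULL)
INSERT INTO @Right (val) VALUES (2), (3), (NULL)
SELECT src.val
FROM (VALUES (1), (2)) AS src(val)
WHERE src.val NOT IN (SELECT val FROM @Right) --returns no rows: 1 <> NULL evaluates to UNKNOWN, not TRUE
SELECT src.val
FROM (VALUES (1), (2)) AS src(val)
WHERE NOT EXISTS
(SELECT 1 FROM @Right r WHERE r.val = src.val) --returns the value 1, as expected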
Conclusion
The upshot of this quick exercise in query performance is that whenever you plan to compare data in table A against data in table B, looking for rows where some condition does not exist in table B, the NOT IN clause should be avoided as much as possible. This will, of course, depend on your workloads, hardware, data, schema and environment in general, but it is safe to say that using NOT EXISTS instead of NOT IN will most likely result in the best query performance and execution time.
Word counting applications, even though rather ordinary and conventional at face value, are a great exercise in programming language exploration and learning, typically encompassing many core aspects of a language such as conditionals, loops, data structures, data types, and database and/or file support if you choose to store the results outside memory – not to mention that there is something inherently childish and playful about counting words or letters and seeing the result automagically computed for your pleasure. Such a small application can be as trivial as counting letters in a short sentence or paragraph, or as involved as a scenario where, for example, word counting is used by translators or writers to estimate effort and billing requirements, complemented with features such as importing different document types, counting Asian characters etc. My personal requirement for the small script this post is based on was to be able to count words on any Wikipedia page and display the results graphically. Additionally, I wanted the option of removing certain words e.g. the most commonly used ones in the English language, removing certain sections from the parsed Wikipedia page, and storing the page as an HTML document on my hard drive.
The app's flow is rather procedural and should be easy to follow, especially as it runs in console mode and is written in Python (Python 3.3 specifically). I may wrap it in a GUI later on, most likely using Tkinter, but for now it serves its purpose quite well. The web scraping part uses the fantastic BeautifulSoup module, whereas graphing is done with the equally brilliant Matplotlib library, with a little bit of help from NumPy. Let's look at the individual sections of the code and the functionality each represents.
The first thing to do is to import all relevant libraries and define certain variables used later on in the script's execution. I used the sqlite3 module to store results in a database table, as SQLite is already included by default in the standard Python distribution (using a file would probably work just as well). I also decided to use the collections module, which supports rapid tallying. For web scraping BeautifulSoup is used, whereas urllib provides a high-level interface for fetching data from the nominated Wiki page. Finally, NumPy and Matplotlib are used for arranging the values and producing the graphical output.
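The import block itself is not reproduced in the snippets below, but judging by the modules referenced throughout the functions it looks roughly like this sketch:
#imports - a sketch reconstructed from the modules referenced in the functions below
import os #directory creation and path checks
import re #stripping punctuation from scraped words
import sys #clean exits and exception details
import sqlite3 #lightweight storage for the word counts
import urllib.request #fetching the nominated Wikipedia page
from collections import Counter #rapid tallying of word occurrences
from bs4 import BeautifulSoup #HTML parsing/scraping
import numpy as np #positioning values on the bar chart
import matplotlib.pyplot as plt #plotting the word frequency chart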
As for the two global variables declared at the start, i.e. 'undesirables' and 'common_words', these contain all the extra unwanted Wikipedia bits and bobs to be removed from the parsed HTML and the most commonly used words in the English language (also to be removed), respectively. I could potentially scrape the list of common words from another Wikipedia page e.g. from HERE, but having them all encapsulated in an easy-to-modify list is probably a more flexible approach.
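The full lists are fairly long, so below is only an abbreviated sketch of their shape – the entries shown are illustrative placeholders, but the structure matches the way grabPage() consumes them further down:
#Wikipedia page elements to strip out before counting (abbreviated, illustrative sketch)
undesirables = [{'element': 'table', 'attr': {'class': 'infobox'}},
                {'element': 'div', 'attr': {'class': 'reflist'}},
                {'element': 'span', 'attr': {'class': 'mw-editsection'}},
                {'element': 'sup', 'attr': {'class': 'reference'}}]
#most commonly used English words, removed on request (abbreviated, illustrative sketch)
common_words = ['the', 'be', 'to', 'of', 'and', 'a', 'in', 'that', 'have', 'it',
                'for', 'not', 'on', 'with', 'he', 'as', 'you', 'do', 'at', 'this']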
The final declarations before the main method define two SQL statements for creating and dropping the 'words_count' table. Those two statements are used at a later stage, when the doDbWork() function is executed.
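The two statements are not shown here either, but judging by the columns referenced later on (id, word and occurrence_count) they would be along these lines:
#DDL used by doDbWork() - a sketch inferred from the columns referenced later on
create_schema = """CREATE TABLE words_count
                   (id INTEGER PRIMARY KEY AUTOINCREMENT,
                    word TEXT,
                    occurrence_count INTEGER)"""
drop_schema = "DROP TABLE IF EXISTS words_count"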
Next up is the main() function, which simply dictates the application's execution flow while gathering input from the end user and storing it in variables, e.g. the Wikipedia URL used for scraping.
#determine execution flow
def main():
    url = str(input('Please enter Wiki web address you would like to scrape below (starting with http://)...\n-->'))
    isValidLink(url)
    checkConnectivity(url)
    global file_dir
    file_dir = str(input('Please enter directory path below where '
                         'database and html files will be stored e.g. '
                         'C:\\YourDirectory (if it does not exist it will be created)...\n-->'))
    createDir(file_dir)
    global db_file
    db_file = file_dir + '\\temp_db.db' #database file location
    doDbWork(db_file)
    remove_commons = str(input('Would you like to remove most commonly '
                               'used English words from the result set? (Y/N)...\n-->'))
    while remove_commons.lower() not in ('y', 'n'):
        remove_commons = str(input('Please select either Y (yes) or N (no) as an option for this input.\n-->'))
    url_save = str(input('Would you like to save scraped HTML file for '
                         'reference in the nominated directory? (Y/N)...\n-->'))
    while url_save.lower() not in ('y', 'n'):
        url_save = str(input('Please select either Y (yes) or N (no) as an option for this input.\n-->'))
    print('Attempting to scrape {}...'.format(url))
    grabPage(url, url.split("/wiki/")[1].strip().replace("_", " "), db_file, url_save.lower(), remove_commons.lower())
    plotWords(url)
Following on, we have a number of functions responsible for different aspects of the script's execution, starting with isValidLink() and checkConnectivity(). isValidLink() simply checks the string passed in as the URL variable to ensure that only an appropriate Wikipedia page is used as input; if an incorrect string format is supplied, the code terminates. checkConnectivity(), on the other hand, ensures that the page can be accessed, potentially highlighting problems such as internet connectivity or firewall issues.
#check if the URL link submitted is a valid one
def isValidLink(url):
    if "/wiki/" in url and ":" in url and "http://" in url and "wikibooks" not in url \
            and "#" not in url and "wikiquote" not in url and "wiktionary" not in url \
            and "wikiversity" not in url and "wikivoyage" not in url and "wikisource" not in url \
            and "wikinews" not in url and "wikidata" not in url:
        print('Wiki link is a valid string...continuing...')
    else:
        print('This is not a valid Wiki URL address. Press any key to exit.')
        input()
        sys.exit(0)
#check if the website is responding to a 'http' call
def checkConnectivity(url):
    try:
        print('Connecting...')
        urllib.request.urlopen(url, timeout=5)
        print("Connection to '{}' succeeded".format(url))
    except urllib.request.URLError:
        print("Connection to '{}' DID NOT succeed. You may want to check the following to resolve this issue:".format(url))
        print('1. Internet connection is enabled\n'
              '2. You entered a valid address\n'
              '3. The website is operational\n'
              '...exiting now.')
        input()
        sys.exit(0)
The next two functions deal with the user being prompted for a directory path where the database file will be created and, optionally, the HTML file stored after successful execution. This functionality is provided by createDir(), which is closely followed by doDbWork(), a function that simply creates an SQLite database file in the nominated directory.
#create database and text file directory
def createDir(file_dir):
    if not os.path.exists(file_dir):
        try:
            print('Attempting to create directory in the path specified...')
            os.makedirs(file_dir)
            print('Directory created successfully...')
        except IOError:
            print("Directory COULD NOT be created in the location specified.")
            sys.exit(0)
    else:
        print('Directory specified already exists....moving on...')
#create database file and schema using the scripts above
def doDbWork(db_file):
    try:
        db_is_new = not os.path.exists(db_file)
        with sqlite3.connect(db_file) as conn:
            if db_is_new:
                print("Creating temp database schema on " + db_file + " database ...")
                conn.execute(create_schema)
            else:
                print("Database schema may already exist. Dropping database schema on " + db_file + "...")
                #os.remove(db_filename)
                conn.execute(drop_schema)
                print("Creating temporary database schema...")
                conn.execute(create_schema)
    except:
        print("Unexpected error:", sys.exc_info()[0])
    finally:
        conn.commit()
        conn.close()
The grabPage() function is where most of the heavy lifting is done. First, the URL is passed in, the web page opened and scraped, and all unwanted elements listed in the 'undesirables' variable removed. Wikipedia has a lot of nodes that I don't want to parse, so by iterating through the list for each div/section/node I can 'trim the fat' and get rid of the sections I'm not interested in. Next, the end user is asked to confirm whether he/she wishes to remove the most commonly used English words, after which a connection to the database is made and individual words with their corresponding counts are inserted into the table. The count is initially restricted to the top 40 words, which then get whittled down to the top 30 via a SQL DELETE statement (any more than that and the graph looks a bit congested). At this stage, a regular expression also removes certain characters from the dataset to avoid counting full stops, commas, question marks etc., and three SQL DELETE statements perform the final clean-up e.g. removing numerical characters, NULLs and duplicates. The user also has the option to save the page as an HTML file in the nominated directory for further reference.
# process URL page, exporting the HTML file into a directory nominated (optional)
# and inserting most commonly used words into a database file
def grabPage(url, name, db_file, url_save, remove_commons):
    try:
        opener = urllib.request.urlopen(url)
        page = opener.read()
        s = BeautifulSoup(page)
        s = s.find(id="mw-content-text")
        if hasattr(s, 'find_all'):
            #strip out all unwanted page elements defined in 'undesirables'
            for notWanted in undesirables:
                removal = s.find_all(notWanted['element'], notWanted['attr'])
                if len(removal) > 0:
                    for el in removal:
                        el.extract()
            #remove the 'See also' heading and everything that follows it
            also = s.find(id="See_also")
            if also is not None:
                tail = also.find_all_next()
                if len(tail) > 0:
                    for element in tail:
                        element.extract()
                also.extract()
        text = s.get_text(" ", strip=True)
        opener.close()
        conn = sqlite3.connect(db_file)
        cursor = conn.cursor()
        words = [word.lower() for word in text.split()]
        c = Counter(words)
        if remove_commons == 'y':
            for key in common_words:
                if key in c:
                    del c[key]
        #insert the top 40 words, stripping punctuation characters as we go
        for word, count in c.most_common(40):
            cursor.execute("INSERT INTO words_count (word, occurrence_count) SELECT (?), (?)",
                           (re.sub('[–{@#!;+=_,$<(^)>?.:%/&}''"''-]', '', word.lower()), count))
        #delete numerical characters, NULLs and empty spaces
        cursor.execute("DELETE FROM words_count WHERE word glob '[0-9]*' or word ='' or word IS NULL")
        #delete duplicate records where the same word is repeated more than once
        cursor.execute("DELETE FROM words_count WHERE id NOT IN("
                       "SELECT MIN(id) FROM words_count GROUP BY word)")
        #delete records outside top 30
        cursor.execute("DELETE FROM words_count WHERE occurrence_count NOT IN("
                       "SELECT occurrence_count FROM words_count ORDER BY 1 DESC LIMIT 30)")
        if url_save == 'y':
            #re-parse the page and strip the same elements before saving it to disk
            soup = BeautifulSoup(page, "html5lib", from_encoding="UTF-8")
            content = soup.find(id="mw-content-text")
            if hasattr(content, 'find_all'):
                for notWanted in undesirables:
                    removal = content.find_all(notWanted['element'], notWanted['attr'])
                    if len(removal) > 0:
                        for el in removal:
                            el.extract()
                also = content.find(id="See_also")
                if also is not None:
                    tail = also.find_all_next()
                    if len(tail) > 0:
                        for element in tail:
                            element.extract()
                    also.extract()
            fileName = str(name)
            doctype = "<!DOCTYPE html>"
            head = "<head><meta charset=\"UTF-8\" /><title>" + fileName + "</title></head>"
            f = open(file_dir + "/" + fileName.replace('/', '_') + ".html", 'w', encoding='utf-8')
            f.write(
                doctype + "<html lang=\"en\">" + head + "<body><h1>" + fileName + "</h1>" + str(content) + "</body></html>")
            f.close()
        print('Scraped HTML file and database file have been saved in "{0}\\" directory '
              'with a bar chart displayed in a separate window'.format(file_dir))
    except:
        print("Unexpected error:", sys.exc_info()[0])
        conn.rollback()
    finally:
        conn.commit()
        conn.close()
Finally, the wordsOutput() and plotWords() functions create the graphical representation of each word's occurrence frequency as a bar chart. This is followed by the main() function call, which executes the whole script.
#fetch database data
def wordsOutput():
    try:
        arr = []
        conn = sqlite3.connect(db_file)
        cursor = conn.cursor()
        cursor.execute('SELECT word, occurrence_count FROM words_count ORDER BY occurrence_count DESC')
        #column_names = [d[0] for d in cursor.description] # extract column names
        for row in cursor:
            arr.append(row)
        return arr
    except:
        print("Unexpected error:", sys.exc_info()[0])
    finally:
        conn.close()
#plot data onto the bar chart
def plotWords(url):
    data = wordsOutput()
    N = len(data)
    x = np.arange(1, N+1)
    y = [num for (s, num) in data]
    labels = [s for (s, num) in data]
    width = 0.7
    plt.bar(x, y, width, color="r")
    plt.ylabel('Frequency')
    plt.xlabel('Words')
    plt.title('Word Occurrence Frequency For '"'{}'"' Wiki Page'.format(url.split("/wiki/")[1].strip().replace("_", " ")))
    plt.xticks(x + width/2.0, labels)
    plt.xticks(rotation=45)
    plt.show()
#run from here!
if __name__ == '__main__':
    main()
Below is a short video depicting script execution in PyScripter (hence the prompt boxes rather than command line input) and the final graph output. You can fetch the complete script from my SkyDrive folder HERE and customize it to enable scraping of websites other than Wikipedia with few changes required.
So there you go – a fairly easy way of scraping Wikipedia pages to compute word frequency and display the results on a bar chart using Python in conjunction with a few third-party modules. Enjoy playing around with web scraping and word counting, and if you found this post useful (or otherwise), please leave me a comment.