Web Scraping With Python – Wikipedia Words Frequency Analysis Using Matplotlib
February 6th, 2014 / 1 Comment » / by admin
Words counting applications, even though rather ordinary and conventional on the face value, are a great exercise in programming language exploration and learning process, typically encompassing many core aspects of any language development such as conditionals, loops, data structures, data types, database and/or files support if you choose to store the result outside the memory etc. not to mention that there is something inherently childish and playful about counting words or letters and seeing the result automagically computed for your pleasure. Such small application can be as trivial as counting letters in a small sentence or paragraph down to a more complex scenario where, for example, word counting is used by translators or writers to estimate effort and billing requirements thus complemented with features such as being able to import different document types, count Asian characters etc. My personal requirements for the small script this post is based on was to be able to count words on any Wikipedia page and display it graphically. Additionally, I wanted to have the option of removing certain characters e.g. the most commonly used ones in English language, remove certain sections out of the Wikipedia page parsed and store the page as a HTML document on my hard drive.
The app’s syntax is rather procedural and should be easy to follow, especially that it runs in a console mode and uses Python as a development language (Python 3.3 specifically). I may be able to wrap it around in a GUI interface later on, most likely using Tkinter but for now, it serves its purpose quite well. The web scraping part uses the fantastic BeautifulSoup module whereas graphing is done utilising equally brilliant Matplotlib library with a little bit of help from Numpy. Let’s look at the individual sections of the code and the functionality it represents.
The first thing to do is to import all relevant libraries and define certain variables used later on in the script execution. I used sqllite3 module to store results in a database table as sqlite is already included by default in any Python distribution (using a file would probably work just as well). I also decided to call collections module which supports rapid tallying. For web scraping BeautifulSoup is used whereas Urllib provides a high-level interface for fetching data from the nominated Wiki page. Finally, Numpy and Matplotlib are used for values arrangement and graphical output.
As far as the two global variables declared at the start i.e. ‘undesirables’ and ‘common_words’, these contain all the extra unwanted Wikipedia bits and bobs to be removed from HTML parsed and most commonly used words in English language (also to be removed) respectively. I could potentially scrape the unwanted words from another Wikipedia site e.g. from HERE but having all those encapsulated in an easy to modify list is probably a more flexible approach.
Final declaration before the main method execution defines two SQL statements for table ‘words_count’ creation and dropping. Those two statements are used at the later stage when doDbWork() function is executed.
#import relevant modules import sys, os, sqlite3, re, urllib.request from collections import Counter import numpy as np from bs4 import BeautifulSoup from matplotlib import pyplot as plt #define global variables global undesirables undesirables = [{"element": "table", "attr": {'class': 'infobox'}}, {"element": "table", "attr": {'class': 'vertical-navbox'}}, {"element": "span", "attr": {'class': 'mw-editsection'}}, {"element": "div", "attr": {'class': 'thumb'}}, {"element": "sup", "attr": {'class': 'reference'}}, {"element": "div", "attr": {'class': 'reflist'}}, {"element": "table", "attr": {'class': 'nowraplinks'}}, {"element": "table", "attr": {'class': 'ambox-Refimprove'}}, {"element": "img", "attr": None}, {"element": "script", "attr": None}, {"element": "table", "attr": {'class': 'mbox-small'}}, {"element": "span", "attr": {"id": "coordinates"}}, {"element": "table", "attr": {"class": "ambox-Orphan"}}, {"element": "div", "attr": {"class": "mainarticle"}}, {"element": None, "attr": {"id": "References"}}] global common_words common_words = ['a','able','about','across','after','all','almost','also', 'am','among','an','and','any','are','as','at','be','because', 'been','but','by','can','cannot','could','dear','did','do', 'does','either','else','ever','every','for','from','get','got', 'had','has','have','he','her','hers','him','his','how','however', 'i','if','in','into','is','it','its','just','least','let','like', 'likely','may','me','might','most','must','my','neither','no','nor', 'not','of','off','often','on','only','or','other','our','out','own', 'rather','said','say','says','she','should','since','so','some', 'such','than','that','the','their','them','then','there','these', 'they','this','to','too','us','wants','was','we','were','what', 'when','where','which','while','who','whom','why','will','with', 'would','yet','you','your'] #define database table and database dropping query create_schema = "CREATE TABLE words_count \ (id integer primary key autoincrement not null,word text,occurrence_count int)" drop_schema = "DROP TABLE words_count"
Next up is the main() function which simply dictates application execution flow while gathering input from end user and storing it in variables e.g. Wikipedia URL address used for scraping.
#determine execution flow def main(): url = str(input('Please enter Wiki web address you would like to scrape below (starting with http://)...\n-->')) isValidLink(url) checkConnectivity(url) global file_dir file_dir = str(input('Please enter directory path below where ' 'database and html files will be stored e.g. ' 'C:\\YourDirectory (if it does not exists it will be created)...\n-->')) createDir(file_dir) global db_file db_file = file_dir + '\\temp_db.db' #database file location doDbWork(db_file) remove_commons = str(input('Would you like to remove most commonly ' 'used English words from the result set? (Y/N)...\n-->')) while remove_commons.lower() not in ('Y','y','N','n'): remove_commons = str(input('Please select either Y (yes) or N (no) as an option for this input.\n-->')) url_save = str(input('Would you like to save scraped HTML file for ' 'reference in the nominated directory? (Y/N)...\n-->')) while url_save.lower() not in ('Y','y','N','n'): url_save = str(input('Please select either Y (yes) or N (no) as an option for this input.\n-->')) print ('Attempting to scrape {}...'.format(url)) grabPage(url, url.split("/wiki/")[1].strip().replace("_", " "),db_file, url_save.lower(), remove_commons.lower()) plotWords(url)
Following on, we have a bunch of functions responsible for different aspects of the script execution starting with isValidLink() and checkConnectivity() functions. isValidLink() simply checks for the string being passed as the URL variable to ensure that only appropriate Wikipedia page is being used as an input. If incorrect string format is used, the code terminates. checkConnectivity() function, on the other hand, ensures that the page can be accessed, potentially highlighting problems such as internet connectivity or firewall issues.
#check if the URL link submitted is a valid one def isValidLink(url): if "/wiki/" in url and ":" in url and "http://" in url and "wikibooks" not in url \ and "#" not in url and "wikiquote" not in url and "wiktionary" not in url and "wikiversity" not in url \ and "wikivoyage" not in url and "wikisource" not in url and "wikinews" not in url and "wikiversity" not in url \ and "wikidata" not in url: print('Wiki link is a valid string...continuing...') else: print('This is not a valid Wiki URL address. Press any key to exit.') input() sys.exit(0) #check if the website is responding to a 'http' call def checkConnectivity(url): try: print('Connecting...') urllib.request.urlopen(url, timeout=5) print("Connection to '{}' succeeded".format(url)) except: urllib.request.URLError print("Connection to '{}' DID NOT succeeded. You may want to check the following to resolve this issue:".format(url)) print( '1. Internet connection is enabled\n' '2. You entered a valid address\n' '3. The website is operational\n' '...exiting now.') input() sys.exit(0)
Next two functions correspond to the user being prompted for a directory path selection where the database file will be created and HTML file stored after code successful execution (optionally). This functionality is provided by createDir() function which is closely followed by doDbWork() function which simply creates a sqlite database file in the nominated directory.
#create database and text file directory def createDir(file_dir): if not os.path.exists(file_dir): try: print('Attempting to create directory in the path specified...') os.makedirs(file_dir) print('Directory created successfully...') except: IOError print("Directory COULD NOT be created in the location specified.") sys.exit(0) else: print('Directory specified already exists....moving on...') #create database file and schema using the scripts above def doDbWork(db_file): try: db_is_new = not os.path.exists(db_file) with sqlite3.connect(db_file) as conn: if db_is_new: print("Creating temp database schema on " + db_file + " database ...") conn.execute(create_schema) else: print("Database schema may already exist. Dropping database schema on " + db_file + "...") #os.remove(db_filename) conn.execute(drop_schema) print("Creating temporary database schema...") conn.execute(create_schema) except: print("Unexpected error:", sys.exc_info()[0]) finally: conn.commit() conn.close()
grapPage() function is where most of heavy lifting is done. First, the URL is passed, web page opened and scraped with all unwanted elements represented by ‘undesirables’ variable removed. Wikipedia has a lot of nodes that I don’t want to parse so iterating through the list for each div/section/node I can ‘trim the fat’ and get rid of unnecessary sections I’m not interested in. Next, end user is required to confirm if he/she wishes to remove the most commonly used English words after which a connection to the database is made and individual words with their corresponding counts are inserted into the table. The default count is restricted to 40 which then gets whittled down to top 30 via the SQL DELETE statement (any more than that and the graph looks a bit congested). At this stage, regex functionality also removes certain characters from the dataset to disallow counting of full stops, commas, question marks etc. and 3 SQL DELETE statements are executed to perform final ‘clean up’ e.g. remove numerical characters, NULLs, duplicates etc. The user also has the option to save the URL as a file in the directory nominated for further reference.
# process URL page, exporting the HTML file into a directory nominated (optional) # and inserting most commonly used words into a database file def grabPage(url, name, db_file, url_save, remove_commons): try: opener = urllib.request.urlopen(url) page = opener.read() s = BeautifulSoup(page) s = s.find(id="mw-content-text") if hasattr(s, 'find_all'): for notWanted in undesirables: removal = s.find_all(notWanted['element'], notWanted['attr']) if len(removal) > 0: for el in removal: el.extract() also = s.find(id="See_also") if (also != None): also.extract() tail = also.find_all_next() if (len(tail) > 0): for element in tail: element.extract() text = s.get_text(" ", strip=True) opener.close() conn = sqlite3.connect(db_file) cursor = conn.cursor() words = [word.lower() for word in text.split()] c = Counter(words) if remove_commons == 'y': for key in common_words: if key in c: del c[key] for word, count in c.most_common(40): cursor.execute("INSERT INTO words_count (word, occurrence_count)\ SELECT (?), (?)", (re.sub('[–{@#!;+=_,$<(^)>?.:%/&}''"''-]', '', word.lower()), count)) #delete numerical characters, NULLs and empty spaces cursor.execute("DELETE FROM words_count WHERE word glob '[0-9]*' or word ='' or word IS NULL") #delete duplicate records where the same word is repeated more then once cursor.execute("DELETE FROM words_count WHERE id NOT IN(\ SELECT MIN(id) FROM words_count GROUP BY word)") #delete records outside top 30 cursor.execute("DELETE FROM words_count WHERE occurrence_count NOT IN(\ SELECT occurrence_count FROM words_count ORDER BY 1 DESC LIMIT 30)") if url_save == 'y': soup = BeautifulSoup(page, "html5lib", from_encoding="UTF-8") content = soup.find(id="mw-content-text") if hasattr(content, 'find_all'): for notWanted in undesirables: removal = content.find_all(notWanted['element'], notWanted['attr']) if len(removal) > 0: for el in removal: el.extract() also = content.find(id="See_also") if (also != None): also.extract() tail = also.find_all_next() if (len(tail) > 0): for element in tail: element.extract() fileName = str(name) doctype = "<!DOCTYPE html>" head = "<head><meta charset=\"UTF-8\" /><title>" + fileName + "</title></head>" f = open(file_dir + "/" + fileName.replace('/', '_') + ".html", 'w', encoding='utf-8') f.write( doctype + "<html lang=\"en\">" + head + "<body><h1>" + fileName + "</h1>" + str(content) + "</body></html>") f.close() print ('Scraped HTML file and database file have been saved in "{0}\\" directory ' 'with a bar chart displayed in a separate window'.format(file_dir)) except: print("Unexpected error:", sys.exc_info()[0]) conn.rollback() finally: conn.commit() conn.close()
Finally, wordsOutput() and plotWords() functions create the graphic representation of each word occurrence frequency as a bar chart. This is followed by the main() function call which executes the whole script
#fetch database data def wordsOutput(): try: arr = [] conn = sqlite3.connect(db_file) cursor = conn.cursor() cursor.execute('SELECT word, occurrence_count FROM words_count ORDER BY occurrence_count DESC') #column_names = [d[0] for d in cursor.description] # extract column names for row in cursor: arr.append(row) return arr except: print("Unexpected error:", sys.exc_info()[0]) finally: conn.close() #plot data onto the bar chart def plotWords(url): data = wordsOutput() N = len(data) x = np.arange(1, N+1) y = [num for (s, num) in data] labels = [s for (s, num) in data] width = 0.7 plt.bar(x, y, width, color="r") plt.ylabel('Frequency') plt.xlabel('Words') plt.title('Word Occurrence Frequency For '"'{}'"' Wiki Page'.format(url.split("/wiki/")[1].strip().replace("_", " "))) plt.xticks(x + width/2.0, labels) plt.xticks(rotation=45) plt.show() #run from here! if __name__ == '__main__': main()
Below is a short footage depicting script execution in PyScripter (hence prompt boxes rather than command line input) and final graph output. You can fetch the complete script from my Skydrive folder HERE and customize it enable web page scraping for websites other than Wikipedia with little changes required.
So there you go, a fairly easy way of scraping Wikipedia pages to compute words frequency and displaying the results on a bar chart using Python in conjunction with a few third-party modules. Enjoy playing around with web scraping and words counting and if you found this post useful (or otherwise), please leave me a comment.