Web Scraping With Python – Wikipedia Words Frequency Analysis Using Matplotlib

Word counting applications, ordinary and conventional as they may seem at face value, are a great exercise when exploring and learning a programming language. They typically touch on many core aspects of development: conditionals, loops, data structures, data types, and database and/or file support if you choose to store the results outside of memory. Not to mention that there is something inherently childish and playful about counting words or letters and seeing the result automagically computed for your pleasure. Such a small application can be as trivial as counting letters in a short sentence or paragraph, or as involved as a tool used by translators or writers to estimate effort and billing, complete with features such as importing different document types, counting Asian characters and so on. My personal requirement for the small script this post is based on was to be able to count the words on any Wikipedia page and display the result graphically. Additionally, I wanted the option of removing certain words (e.g. the most commonly used ones in the English language), stripping certain sections out of the parsed Wikipedia page, and storing the page as an HTML document on my hard drive.

The app's structure is rather procedural and should be easy to follow, especially as it runs in console mode and is written in Python (Python 3.3 specifically). I may wrap it in a GUI later on, most likely using Tkinter, but for now it serves its purpose quite well. The web scraping part uses the fantastic BeautifulSoup module, whereas graphing is done using the equally brilliant Matplotlib library with a little bit of help from Numpy. Let's look at the individual sections of the code and the functionality they represent.

The first thing to do is to import all the relevant libraries and define certain variables used later on during script execution. I used the sqlite3 module to store results in a database table, as sqlite is already included by default in any Python distribution (using a file would probably work just as well). I also decided to use the collections module, whose Counter class supports rapid tallying. For web scraping BeautifulSoup is used, whereas urllib provides a high-level interface for fetching data from the nominated Wiki page. Finally, Numpy and Matplotlib are used for arranging the values and producing the graphical output.
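For anyone new to the Counter class, here is a tiny standalone illustration of the 'rapid tallying' mentioned above (not part of the script itself):

#standalone illustration of how collections.Counter tallies words
from collections import Counter

sample_text = "the cat sat on the mat and the cat slept"
tally = Counter(sample_text.lower().split())
#most_common() returns (word, count) pairs ordered by descending count
print(tally.most_common(3))  #e.g. [('the', 3), ('cat', 2), ('sat', 1)] - ties come out in arbitrary order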

As for the two global variables declared at the start, 'undesirables' and 'common_words', these contain, respectively, all the extra unwanted Wikipedia bits and bobs to be removed from the parsed HTML, and the most commonly used words in the English language (also to be removed). I could potentially scrape the unwanted words from another Wikipedia page, e.g. from HERE, but having them all encapsulated in an easy-to-modify list is probably a more flexible approach.
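Should you prefer to build the stop-word list dynamically rather than hard-code it, a rough sketch along these lines might do the job (purely hypothetical: it assumes a page at stopwords_url that lists one word per <li> element, which may well not match the actual page structure):

#hypothetical sketch: scrape a stop-word list instead of hard-coding common_words
#assumes the target page lists one word per <li> element - adjust to the real markup
import urllib.request
from bs4 import BeautifulSoup

def fetchCommonWords(stopwords_url):
    page = urllib.request.urlopen(stopwords_url).read()
    soup = BeautifulSoup(page)
    words = [li.get_text(strip=True).lower() for li in soup.find_all('li')]
    return [w for w in words if w.isalpha()]  #keep single alphabetic words only

#common_words = fetchCommonWords('http://example.org/english-stopwords')  #placeholder URL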

The final declaration before the main method defines two SQL statements for creating and dropping the 'words_count' table. Those two statements are used at a later stage when the doDbWork() function is executed.

#import relevant modules
import sys, os, sqlite3, re, urllib.request
from collections import Counter
import numpy as np
from bs4 import BeautifulSoup
from matplotlib import pyplot as plt

#define global variables
global undesirables
undesirables = [{"element": "table", "attr": {'class': 'infobox'}},
                {"element": "table", "attr": {'class': 'vertical-navbox'}},
                {"element": "span", "attr": {'class': 'mw-editsection'}},
                {"element": "div", "attr": {'class': 'thumb'}},
                {"element": "sup", "attr": {'class': 'reference'}},
                {"element": "div", "attr": {'class': 'reflist'}},
                {"element": "table", "attr": {'class': 'nowraplinks'}},
                {"element": "table", "attr": {'class': 'ambox-Refimprove'}},
                {"element": "img", "attr": None}, {"element": "script", "attr": None},
                {"element": "table", "attr": {'class': 'mbox-small'}},
                {"element": "span", "attr": {"id": "coordinates"}},
                {"element": "table", "attr": {"class": "ambox-Orphan"}},
                {"element": "div", "attr": {"class": "mainarticle"}},
                {"element": None, "attr": {"id": "References"}}]

global common_words
common_words = ['a','able','about','across','after','all','almost','also',
                'am','among','an','and','any','are','as','at','be','because',
                'been','but','by','can','cannot','could','dear','did','do',
                'does','either','else','ever','every','for','from','get','got',
                'had','has','have','he','her','hers','him','his','how','however',
                'i','if','in','into','is','it','its','just','least','let','like',
                'likely','may','me','might','most','must','my','neither','no','nor',
                'not','of','off','often','on','only','or','other','our','out','own',
                'rather','said','say','says','she','should','since','so','some',
                'such','than','that','the','their','them','then','there','these',
                'they','this','to','too','us','wants','was','we','were','what',
                'when','where','which','while','who','whom','why','will','with',
                'would','yet','you','your']

#define database table and database dropping query
create_schema = "CREATE TABLE words_count \
                (id integer primary key autoincrement not null,word text,occurrence_count int)"
drop_schema = "DROP TABLE words_count"

Next up is the main() function, which simply dictates the application's execution flow while gathering input from the end user and storing it in variables, e.g. the Wikipedia URL address used for scraping.

#determine execution flow
def main():
    url = str(input('Please enter Wiki web address you would like to scrape below (starting with http://)...\n-->'))
    isValidLink(url)
    checkConnectivity(url)
    global file_dir
    file_dir = str(input('Please enter directory path below where '
                         'database and html files will be stored e.g. '
                         'C:\\YourDirectory (if it does not exist it will be created)...\n-->'))
    createDir(file_dir)
    global db_file
    db_file = file_dir + '\\temp_db.db' #database file location
    doDbWork(db_file)
    remove_commons = str(input('Would you like to remove most commonly '
                               'used English words from the result set? (Y/N)...\n-->'))
    while remove_commons.lower() not in ('Y','y','N','n'):
        remove_commons = str(input('Please select either Y (yes) or N (no) as an option for this input.\n-->'))
    url_save = str(input('Would you like to save scraped HTML file for '
                         'reference in the nominated directory? (Y/N)...\n-->'))
    while url_save.lower() not in ('Y','y','N','n'):
        url_save = str(input('Please select either Y (yes) or N (no) as an option for this input.\n-->'))
    print ('Attempting to scrape {}...'.format(url))
    grabPage(url, url.split("/wiki/")[1].strip().replace("_", " "),db_file, url_save.lower(), remove_commons.lower())
    plotWords(url)

Following on, we have a bunch of functions responsible for different aspects of the script's execution, starting with isValidLink() and checkConnectivity(). isValidLink() simply checks the string passed in as the URL to ensure that only an appropriate Wikipedia page is used as input; if an incorrect string format is used, the code terminates. The checkConnectivity() function, on the other hand, ensures that the page can be accessed, potentially highlighting problems such as internet connectivity or firewall issues.

#check if the URL link submitted is a valid one
def isValidLink(url):
    if "/wiki/" in url and ":" in url and "http://"  in url and "wikibooks" not in url \
        and "#" not in url and "wikiquote" not in url and "wiktionary" not in url and "wikiversity" not in url \
        and "wikivoyage" not in url and "wikisource" not in url and "wikinews" not in url and "wikiversity" not in url \
        and "wikidata" not in url:
        print('Wiki link is a valid string...continuing...')
    else:
        print('This is not a valid Wiki URL address. Press any key to exit.')
        input()
        sys.exit(0)

#check if the website is responding to a 'http' call
def checkConnectivity(url):
    try:
        print('Connecting...')
        urllib.request.urlopen(url, timeout=5)
        print("Connection to '{}' succeeded".format(url))
    except Exception:
        #covers urllib.request.URLError, timeouts and other connection problems
        print("Connection to '{}' DID NOT succeed. You may want to check the following to resolve this issue:".format(url))
        print(  '1. Internet connection is enabled\n'
                '2. You entered a valid address\n'
                '3. The website is operational\n'
                '...exiting now.')
        input()
        sys.exit(0)

The next two functions deal with the directory path the user is prompted for, where the database file will be created and, optionally, the HTML file stored after successful execution. This functionality is provided by the createDir() function, closely followed by doDbWork(), which simply creates a sqlite database file in the nominated directory.

#create database and text file directory
def createDir(file_dir):
    if not os.path.exists(file_dir):
        try:
            print('Attempting to create directory in the path specified...')
            os.makedirs(file_dir)
            print('Directory created successfully...')
        except OSError:
            print("Directory COULD NOT be created in the location specified.")
            sys.exit(0)
    else:
        print('Directory specified already exists....moving on...')


#create database file and schema using the scripts above
def doDbWork(db_file):
    try:
        db_is_new = not os.path.exists(db_file)
        with sqlite3.connect(db_file) as conn:
            if db_is_new:
                print("Creating temp database schema on " + db_file + " database ...")
                conn.execute(create_schema)
            else:
                print("Database schema may already exist. Dropping database schema on " + db_file + "...")
                #os.remove(db_filename)
                conn.execute(drop_schema)
                print("Creating temporary database schema...")
                conn.execute(create_schema)
    except:
        print("Unexpected error:", sys.exc_info()[0])
    finally:
        conn.commit()
        conn.close()

The grabPage() function is where most of the heavy lifting is done. First, the URL is passed in, the web page opened and scraped, and all the unwanted elements represented by the 'undesirables' variable removed. Wikipedia pages contain a lot of nodes I don't want to parse, so by iterating through the list for each div/section/node I can 'trim the fat' and get rid of the sections I'm not interested in. Next, if the end user opted to remove the most commonly used English words, those are deleted from the tally, after which a connection to the database is made and the individual words with their corresponding counts are inserted into the table. The count is initially restricted to the top 40 words, which then gets whittled down to the top 30 via a SQL DELETE statement (any more than that and the graph looks a bit congested). At this stage, a regular expression also strips certain characters from the dataset so that full stops, commas, question marks etc. are not counted, and three SQL DELETE statements perform the final 'clean up', e.g. removing numerical characters, NULLs and duplicates. The user also has the option to save the page as an HTML file in the nominated directory for further reference.

# process URL page, exporting the HTML file into a directory nominated (optional)
# and inserting most commonly used words into a database file
def grabPage(url, name, db_file, url_save, remove_commons):
    try:
        opener = urllib.request.urlopen(url)
        page = opener.read()
        s = BeautifulSoup(page)
        s = s.find(id="mw-content-text")
        if hasattr(s, 'find_all'):
            for notWanted in undesirables:
                removal = s.find_all(notWanted['element'], notWanted['attr'])
                if len(removal) > 0:
                    for el in removal:
                        el.extract()
            also = s.find(id="See_also")
            if also is not None:
                also.extract()
                tail = also.find_all_next()
                if len(tail) > 0:
                    for element in tail:
                        element.extract()
        text = s.get_text(" ", strip=True)
        opener.close()
        conn = sqlite3.connect(db_file)
        cursor = conn.cursor()
        words = [word.lower() for word in text.split()]
        c = Counter(words)
        if remove_commons == 'y':
            for key in common_words:
                if key in c:
                    del c[key]
        for word, count in c.most_common(40):
            cursor.execute("INSERT INTO words_count (word, occurrence_count)\
                           SELECT (?), (?)", (re.sub('[–{@#!;+=_,$<(^)>?.:%/&}"-]', '', word.lower()), count))
        #delete numerical characters, NULLs and empty spaces
        cursor.execute("DELETE FROM words_count WHERE word glob '[0-9]*' or word ='' or word IS NULL")
        #delete duplicate records where the same word is repeated more than once
        cursor.execute("DELETE  FROM words_count WHERE id NOT IN(\
                        SELECT  MIN(id) FROM  words_count GROUP BY word)")
        #delete records outside top 30
        cursor.execute("DELETE FROM words_count WHERE occurrence_count NOT IN(\
                        SELECT occurrence_count FROM words_count ORDER BY 1 DESC LIMIT 30)")
        if url_save == 'y':
            soup = BeautifulSoup(page, "html5lib", from_encoding="UTF-8")
            content = soup.find(id="mw-content-text")
            if hasattr(content, 'find_all'):
                for notWanted in undesirables:
                    removal = content.find_all(notWanted['element'], notWanted['attr'])
                    if len(removal) > 0:
                        for el in removal:
                            el.extract()
                also = content.find(id="See_also")
                if also is not None:
                    also.extract()
                    tail = also.find_all_next()
                    if (len(tail) > 0):
                        for element in tail:
                            element.extract()
                fileName = str(name)
                doctype = "<!DOCTYPE html>"
                head = "<head><meta charset=\"UTF-8\" /><title>" + fileName + "</title></head>"
                f = open(file_dir + "/" + fileName.replace('/', '_') + ".html", 'w', encoding='utf-8')
                f.write(
                    doctype + "<html lang=\"en\">" + head + "<body><h1>" + fileName + "</h1>" + str(content) + "</body></html>")
                f.close()
                print ('Scraped HTML file and database file have been saved in "{0}\\" directory '
                       'with a bar chart displayed in a separate window'.format(file_dir))
    except:
        print("Unexpected error:", sys.exc_info()[0])
        conn.rollback()
    finally:
        conn.commit()
        conn.close()

Finally, the wordsOutput() and plotWords() functions create the graphical representation of each word's occurrence frequency as a bar chart. This is followed by the call to main(), which executes the whole script.

#fetch database data
def wordsOutput():
    try:
        arr = []
        conn = sqlite3.connect(db_file)
        cursor = conn.cursor()
        cursor.execute('SELECT word, occurrence_count FROM words_count ORDER BY occurrence_count DESC')
        #column_names = [d[0] for d in cursor.description] # extract column names
        for row in cursor:
            arr.append(row)
        return arr
    except:
        print("Unexpected error:", sys.exc_info()[0])
    finally:
        conn.close()

#plot data onto the bar chart
def plotWords(url):
    data = wordsOutput()
    N = len(data)
    x = np.arange(1, N+1)
    y = [num for (s, num) in data]
    labels = [s for (s, num) in data]
    width = 0.7
    plt.bar(x, y, width, color="r")
    plt.ylabel('Frequency')
    plt.xlabel('Words')
    plt.title("Word Occurrence Frequency For '{}' Wiki Page".format(url.split("/wiki/")[1].strip().replace("_", " ")))
    plt.xticks(x + width/2.0, labels, rotation=45)
    plt.show()

#run from here!
if __name__ == '__main__':
    main()

Below is a short video depicting the script execution in PyScripter (hence the prompt boxes rather than command line input) and the final graph output. You can fetch the complete script from my Skydrive folder HERE and customise it to enable web page scraping for websites other than Wikipedia with only a few changes required.
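To give you an idea of what those changes would involve, here is a minimal, hedged sketch of a generic version of the scraping step - essentially you swap the Wikipedia-specific 'mw-content-text' container and the 'undesirables' list for whatever matches the target site's markup (the 'main-content' id and the nav/footer tags below are placeholders, not taken from any particular site):

#hedged sketch of adapting the scraping step to a non-Wikipedia page
#the container id and the removed tags are placeholders - inspect the target site's HTML first
import urllib.request
from collections import Counter
from bs4 import BeautifulSoup

def grabGenericPage(url, content_id="main-content"):
    page = urllib.request.urlopen(url).read()
    s = BeautifulSoup(page)
    content = s.find(id=content_id)  #replaces the Wikipedia-specific "mw-content-text" lookup
    if content is None:
        return Counter()
    for tag in content.find_all(["script", "nav", "footer"]):  #site-specific 'undesirables' go here
        tag.extract()
    words = [w.lower() for w in content.get_text(" ", strip=True).split()]
    return Counter(words)

You would also need to relax (or bypass) isValidLink() and drop the url.split("/wiki/") calls used to derive the page title.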

So there you go, a fairly easy way of scraping Wikipedia pages to compute word frequencies and display the results on a bar chart using Python in conjunction with a few third-party modules. Enjoy playing around with web scraping and word counting, and if you found this post useful (or otherwise), please leave me a comment.
