{"id":2133,"date":"2014-02-06T01:29:46","date_gmt":"2014-02-06T01:29:46","guid":{"rendered":"http:\/\/bicortex.com\/?p=2133"},"modified":"2014-03-04T15:03:51","modified_gmt":"2014-03-04T15:03:51","slug":"web-scraping-with-python-wikipedia-words-frequency-analysis-using-matplotlib","status":"publish","type":"post","link":"http:\/\/bicortex.com\/bicortex\/web-scraping-with-python-wikipedia-words-frequency-analysis-using-matplotlib\/","title":{"rendered":"Web Scraping With Python &#8211; Wikipedia Words Frequency Analysis Using Matplotlib"},"content":{"rendered":"<p style=\"text-align: justify;\">Word-counting applications, even though rather ordinary and conventional at face value, are a great exercise in exploring and learning a programming language. They typically touch on many core aspects of a language such as conditionals, loops, data structures, data types and database and\/or file support (if you choose to store the result outside of memory), not to mention that there is something inherently childish and playful about counting words or letters and seeing the result automagically computed for your pleasure. Such a small application can range from something as trivial as counting letters in a short sentence or paragraph to a more complex scenario where, for example, word counting is used by translators or writers to estimate effort and billing requirements and is therefore complemented with features such as importing different document types, counting Asian characters etc. My personal requirements for the small script this post is based on were to be able to count words on any Wikipedia page and display the result graphically. Additionally, I wanted to have the option of removing certain words e.g. 
the most commonly used ones in the English language, remove certain sections from the parsed Wikipedia page and store the page as an HTML document on my hard drive.<\/p>\n<p style=\"text-align: justify;\">The app\u2019s syntax is rather procedural and should be easy to follow, especially as it runs in console mode and uses Python as the development language (Python 3.3 specifically). I may be able to wrap it in a GUI later on, most likely using Tkinter, but for now it serves its purpose quite well. The web scraping part uses the fantastic BeautifulSoup module whereas graphing is done with the equally brilliant Matplotlib library and a little bit of help from NumPy. Let\u2019s look at the individual sections of the code and the functionality each represents.<\/p>\n<p style=\"text-align: justify;\">The first thing to do is to import all relevant libraries and define certain variables used later on in the script execution. I used the sqlite3 module to store results in a database table as it is part of the Python standard library (using a file would probably work just as well). I also decided to use the collections module, whose Counter class supports rapid tallying. For web scraping BeautifulSoup is used whereas urllib provides a high-level interface for fetching data from the nominated Wiki page. Finally, NumPy and Matplotlib are used for arranging the values and for graphical output.<\/p>\n<p style=\"text-align: justify;\">As for the two global variables declared at the start i.e. &#8216;undesirables&#8217; and &#8216;common_words&#8217;, these contain, respectively, all the extra unwanted Wikipedia bits and bobs to be removed from the parsed HTML and the most commonly used words in the English language (also to be removed). I could potentially scrape the unwanted words from another Wikipedia page e.g. 
from <a href=\"http:\/\/en.wikipedia.org\/wiki\/Most_common_words_in_English\" target=\"_blank\"><b>HERE<\/b><\/a> but having all those encapsulated in an easy to modify list is probably a more flexible approach.<\/p>\n<p style=\"text-align: justify;\">Final declaration before the main method execution defines two SQL statements for table &#8216;words_count&#8217; creation and dropping. Those two statements are used at the later stage when doDbWork() function is executed.<\/p>\n<pre class=\"brush: python; title: ; notranslate\" title=\"\">\r\n#import relevant modules\r\nimport sys, os, sqlite3, re, urllib.request\r\nfrom collections import Counter\r\nimport numpy as np\r\nfrom bs4 import BeautifulSoup\r\nfrom matplotlib import pyplot as plt\r\n\r\n#define global variables\r\nglobal undesirables\r\nundesirables = &#x5B;{&quot;element&quot;: &quot;table&quot;, &quot;attr&quot;: {'class': 'infobox'}},\r\n                {&quot;element&quot;: &quot;table&quot;, &quot;attr&quot;: {'class': 'vertical-navbox'}},\r\n                {&quot;element&quot;: &quot;span&quot;, &quot;attr&quot;: {'class': 'mw-editsection'}},\r\n                {&quot;element&quot;: &quot;div&quot;, &quot;attr&quot;: {'class': 'thumb'}},\r\n                {&quot;element&quot;: &quot;sup&quot;, &quot;attr&quot;: {'class': 'reference'}},\r\n                {&quot;element&quot;: &quot;div&quot;, &quot;attr&quot;: {'class': 'reflist'}},\r\n                {&quot;element&quot;: &quot;table&quot;, &quot;attr&quot;: {'class': 'nowraplinks'}},\r\n                {&quot;element&quot;: &quot;table&quot;, &quot;attr&quot;: {'class': 'ambox-Refimprove'}},\r\n                {&quot;element&quot;: &quot;img&quot;, &quot;attr&quot;: None}, {&quot;element&quot;: &quot;script&quot;, &quot;attr&quot;: None},\r\n                {&quot;element&quot;: &quot;table&quot;, &quot;attr&quot;: {'class': 'mbox-small'}},\r\n                {&quot;element&quot;: &quot;span&quot;, &quot;attr&quot;: {&quot;id&quot;: 
&quot;coordinates&quot;}},\r\n                {&quot;element&quot;: &quot;table&quot;, &quot;attr&quot;: {&quot;class&quot;: &quot;ambox-Orphan&quot;}},\r\n                {&quot;element&quot;: &quot;div&quot;, &quot;attr&quot;: {&quot;class&quot;: &quot;mainarticle&quot;}},\r\n                {&quot;element&quot;: None, &quot;attr&quot;: {&quot;id&quot;: &quot;References&quot;}}]\r\n\r\nglobal common_words\r\ncommon_words = &#x5B;'a','able','about','across','after','all','almost','also',\r\n                'am','among','an','and','any','are','as','at','be','because',\r\n                'been','but','by','can','cannot','could','dear','did','do',\r\n                'does','either','else','ever','every','for','from','get','got',\r\n                'had','has','have','he','her','hers','him','his','how','however',\r\n                'i','if','in','into','is','it','its','just','least','let','like',\r\n                'likely','may','me','might','most','must','my','neither','no','nor',\r\n                'not','of','off','often','on','only','or','other','our','out','own',\r\n                'rather','said','say','says','she','should','since','so','some',\r\n                'such','than','that','the','their','them','then','there','these',\r\n                'they','this','to','too','us','wants','was','we','were','what',\r\n                'when','where','which','while','who','whom','why','will','with',\r\n                'would','yet','you','your']\r\n\r\n#define database table and database dropping query\r\ncreate_schema = &quot;CREATE TABLE words_count \\\r\n                (id integer primary key autoincrement not null,word text,occurrence_count int)&quot;\r\ndrop_schema = &quot;DROP TABLE words_count&quot;\r\n<\/pre>\n<p style=\"text-align: justify;\">Next up is the main() function which simply dictates application execution flow while gathering input from end user and storing it in variables e.g. 
Wikipedia URL address used for scraping.<\/p>\n<pre class=\"brush: python; title: ; notranslate\" title=\"\">\r\n#determine execution flow\r\ndef main():\r\n    url = str(input('Please enter Wiki web address you would like to scrape below (starting with http:\/\/)...\\n--&gt;'))\r\n    isValidLink(url)\r\n    checkConnectivity(url)\r\n    global file_dir\r\n    file_dir = str(input('Please enter directory path below where '\r\n                         'database and html files will be stored e.g. '\r\n                         'C:\\\\YourDirectory (if it does not exist it will be created)...\\n--&gt;'))\r\n    createDir(file_dir)\r\n    global db_file\r\n    db_file = os.path.join(file_dir, 'temp_db.db') #database file location (portable across operating systems)\r\n    doDbWork(db_file)\r\n    remove_commons = str(input('Would you like to remove most commonly '\r\n                               'used English words from the result set? (Y\/N)...\\n--&gt;'))\r\n    while remove_commons.lower() not in ('y','n'):\r\n        remove_commons = str(input('Please select either Y (yes) or N (no) as an option for this input.\\n--&gt;'))\r\n    url_save = str(input('Would you like to save scraped HTML file for '\r\n                         'reference in the nominated directory? (Y\/N)...\\n--&gt;'))\r\n    while url_save.lower() not in ('y','n'):\r\n        url_save = str(input('Please select either Y (yes) or N (no) as an option for this input.\\n--&gt;'))\r\n    print ('Attempting to scrape {}...'.format(url))\r\n    grabPage(url, url.split(&quot;\/wiki\/&quot;)&#x5B;1].strip().replace(&quot;_&quot;, &quot; &quot;),db_file, url_save.lower(), remove_commons.lower())\r\n    plotWords(url)\r\n<\/pre>\n<p style=\"text-align: justify;\">Following on, we have a bunch of functions responsible for different aspects of the script execution, starting with the isValidLink() and checkConnectivity() functions. 
isValidLink() simply checks the string passed in as the URL to ensure that only an appropriate Wikipedia page address is used as input. If an incorrect string format is used, the code terminates. The checkConnectivity() function, on the other hand, ensures that the page can be accessed, potentially highlighting problems such as internet connectivity or firewall issues.<\/p>\n<pre class=\"brush: python; title: ; notranslate\" title=\"\">\r\n#check if the URL link submitted is a valid one\r\ndef isValidLink(url):\r\n    if &quot;\/wiki\/&quot; in url and &quot;:&quot; in url and &quot;http:\/\/&quot;  in url and &quot;wikibooks&quot; not in url \\\r\n        and &quot;#&quot; not in url and &quot;wikiquote&quot; not in url and &quot;wiktionary&quot; not in url and &quot;wikiversity&quot; not in url \\\r\n        and &quot;wikivoyage&quot; not in url and &quot;wikisource&quot; not in url and &quot;wikinews&quot; not in url \\\r\n        and &quot;wikidata&quot; not in url:\r\n        print('Wiki link is a valid string...continuing...')\r\n    else:\r\n        print('This is not a valid Wiki URL address. Press any key to exit.')\r\n        input()\r\n        sys.exit(0)\r\n\r\n#check if the website is responding to a 'http' call\r\ndef checkConnectivity(url):\r\n    try:\r\n        print('Connecting...')\r\n        urllib.request.urlopen(url, timeout=5)\r\n        print(&quot;Connection to '{}' succeeded&quot;.format(url))\r\n    except Exception: #covers URLError, timeouts and malformed URLs\r\n        print(&quot;Connection to '{}' DID NOT succeed. You may want to check the following to resolve this issue:&quot;.format(url))\r\n        print(  '1. Internet connection is enabled\\n'\r\n                '2. You entered a valid address\\n'\r\n                '3. 
The website is operational\\n'\r\n                '...exiting now.')\r\n        input()\r\n        sys.exit(0)\r\n<\/pre>\n<p style=\"text-align: justify;\">The next two functions deal with the directory path the user was prompted for, where the database file will be created and (optionally) the HTML file stored after the code executes successfully. The directory is handled by the createDir() function, which is closely followed by the doDbWork() function, which simply creates a SQLite database file in the nominated directory.<\/p>\n<pre class=\"brush: python; title: ; notranslate\" title=\"\">\r\n#create database and text file directory\r\ndef createDir(file_dir):\r\n    if not os.path.exists(file_dir):\r\n        try:\r\n            print('Attempting to create directory in the path specified...')\r\n            os.makedirs(file_dir)\r\n            print('Directory created successfully...')\r\n        except OSError: #raised by os.makedirs e.g. on invalid path or missing permissions\r\n            print(&quot;Directory COULD NOT be created in the location specified.&quot;)\r\n            sys.exit(0)\r\n    else:\r\n        print('Directory specified already exists....moving on...')\r\n\r\n\r\n#create database file and schema using the scripts above\r\ndef doDbWork(db_file):\r\n    try:\r\n        db_is_new = not os.path.exists(db_file)\r\n        with sqlite3.connect(db_file) as conn:\r\n            if db_is_new:\r\n                print(&quot;Creating temp database schema on &quot; + db_file + &quot; database ...&quot;)\r\n                conn.execute(create_schema)\r\n            else:\r\n                print(&quot;Database schema may already exist. 
Dropping database schema on &quot; + db_file + &quot;...&quot;)\r\n                #os.remove(db_filename)\r\n                conn.execute(drop_schema)\r\n                print(&quot;Creating temporary database schema...&quot;)\r\n                conn.execute(create_schema)\r\n    except:\r\n        print(&quot;Unexpected error:&quot;, sys.exc_info()&#x5B;0])\r\n    finally:\r\n        conn.commit()\r\n        conn.close()\r\n<\/pre>\n<p style=\"text-align: justify;\">The grabPage() function is where most of the heavy lifting is done. First, the URL is passed in, the web page is opened and scraped, and all unwanted elements represented by the &#8216;undesirables&#8217; variable are removed. Wikipedia has a lot of nodes that I don\u2019t want to parse, so by iterating through the list for each div\/section\/node I can &#8216;trim the fat&#8217; and get rid of the sections I\u2019m not interested in. Next, if the end user chose earlier to remove the most commonly used English words, these are dropped from the tally, and individual words with their corresponding counts are inserted into the database table. The initial count is restricted to the 40 most frequent words, which then gets whittled down to the top 30 via a SQL DELETE statement (any more than that and the graph looks a bit congested). At this stage, a regular expression also strips certain characters from each word to prevent full stops, commas, question marks etc. from being counted, and three SQL DELETE statements are executed to perform the final &#8216;clean up&#8217; e.g. remove entries starting with a digit, NULLs, duplicates etc. 
The user also has the option to save the scraped page as an HTML file in the nominated directory for further reference.<\/p>\n<pre class=\"brush: python; title: ; notranslate\" title=\"\">\r\n# process URL page, exporting the HTML file into a directory nominated (optional)\r\n# and inserting most commonly used words into a database file\r\ndef grabPage(url, name, db_file, url_save, remove_commons):\r\n    try:\r\n        opener = urllib.request.urlopen(url)\r\n        page = opener.read()\r\n        s = BeautifulSoup(page)\r\n        s = s.find(id=&quot;mw-content-text&quot;)\r\n        if hasattr(s, 'find_all'):\r\n                for notWanted in undesirables:\r\n                    removal = s.find_all(notWanted&#x5B;'element'], notWanted&#x5B;'attr'])\r\n                    if len(removal) &gt; 0:\r\n                        for el in removal:\r\n                            el.extract()\r\n                also = s.find(id=&quot;See_also&quot;)\r\n                if also is not None:\r\n                    #collect everything after 'See also' before detaching it,\r\n                    #as an extracted element no longer links to what followed it\r\n                    tail = also.find_all_next()\r\n                    also.extract()\r\n                    for element in tail:\r\n                        element.extract()\r\n        text = s.get_text(&quot; &quot;, strip=True)\r\n        opener.close()\r\n        conn = sqlite3.connect(db_file)\r\n        cursor = conn.cursor()\r\n        words = &#x5B;word.lower() for word in text.split()]\r\n        c = Counter(words)\r\n        if remove_commons == 'y':\r\n                for key in common_words:\r\n                    if key in c:\r\n                        del c&#x5B;key]\r\n        for word, count in c.most_common(40):\r\n                cursor.execute(&quot;INSERT INTO words_count (word, occurrence_count)\\\r\n                              SELECT (?), (?)&quot;, (re.sub('&#x5B;\u2013{@#!;+=_,$&lt;(^)&gt;?.:%\/&amp;}''&quot;''-]', '', word.lower()), count))\r\n        #delete numerical characters, NULLs and empty spaces\r\n        
cursor.execute(&quot;DELETE FROM words_count WHERE word glob '&#x5B;0-9]*' or word ='' or word IS NULL&quot;)\r\n        #delete duplicate records where the same word is repeated more than once\r\n        cursor.execute(&quot;DELETE  FROM words_count WHERE id NOT IN(\\\r\n                        SELECT  MIN(id) FROM  words_count GROUP BY word)&quot;)\r\n        #delete records outside top 30\r\n        cursor.execute(&quot;DELETE FROM words_count WHERE occurrence_count NOT IN(\\\r\n                        SELECT occurrence_count FROM words_count ORDER BY 1 DESC LIMIT 30)&quot;)\r\n        if url_save == 'y':\r\n            soup = BeautifulSoup(page, &quot;html5lib&quot;, from_encoding=&quot;UTF-8&quot;)\r\n            content = soup.find(id=&quot;mw-content-text&quot;)\r\n            if hasattr(content, 'find_all'):\r\n                for notWanted in undesirables:\r\n                    removal = content.find_all(notWanted&#x5B;'element'], notWanted&#x5B;'attr'])\r\n                    if len(removal) &gt; 0:\r\n                        for el in removal:\r\n                            el.extract()\r\n                also = content.find(id=&quot;See_also&quot;)\r\n                if also is not None:\r\n                    #collect everything after 'See also' before detaching it,\r\n                    #as an extracted element no longer links to what followed it\r\n                    tail = also.find_all_next()\r\n                    also.extract()\r\n                    for element in tail:\r\n                        element.extract()\r\n                fileName = str(name)\r\n                doctype = &quot;&lt;!DOCTYPE html&gt;&quot;\r\n                head = &quot;&lt;head&gt;&lt;meta charset=\\&quot;UTF-8\\&quot; \/&gt;&lt;title&gt;&quot; + fileName + &quot;&lt;\/title&gt;&lt;\/head&gt;&quot;\r\n                f = open(file_dir + &quot;\/&quot; + fileName.replace('\/', '_') + &quot;.html&quot;, 'w', encoding='utf-8')\r\n                f.write(\r\n                    doctype + &quot;&lt;html lang=\\&quot;en\\&quot;&gt;&quot; + head + &quot;&lt;body&gt;&lt;h1&gt;&quot; 
+ fileName + &quot;&lt;\/h1&gt;&quot; + str(content) + &quot;&lt;\/body&gt;&lt;\/html&gt;&quot;)\r\n                f.close()\r\n                print ('Scraped HTML file and database file have been saved in &quot;{0}\\\\&quot; directory '\r\n                       'with a bar chart displayed in a separate window'.format(file_dir))\r\n    except:\r\n        print(&quot;Unexpected error:&quot;, sys.exc_info()&#x5B;0])\r\n        conn.rollback()\r\n    finally:\r\n        conn.commit()\r\n        conn.close()\r\n<\/pre>\n<p style=\"text-align: justify;\">Finally, wordsOutput() and plotWords() functions create the graphic representation of each word occurrence frequency as a bar chart. This is followed by the main() function call which executes the whole script<\/p>\n<pre class=\"brush: python; title: ; notranslate\" title=\"\">\r\n#fetch database data\r\ndef wordsOutput():\r\n    try:\r\n        arr = &#x5B;]\r\n        conn = sqlite3.connect(db_file)\r\n        cursor = conn.cursor()\r\n        cursor.execute('SELECT word, occurrence_count FROM words_count ORDER BY occurrence_count DESC')\r\n        #column_names = &#x5B;d&#x5B;0] for d in cursor.description] # extract column names\r\n        for row in cursor:\r\n            arr.append(row)\r\n        return arr\r\n    except:\r\n        print(&quot;Unexpected error:&quot;, sys.exc_info()&#x5B;0])\r\n    finally:\r\n        conn.close()\r\n\r\n#plot data onto the bar chart\r\ndef plotWords(url):\r\n    data = wordsOutput()\r\n    N = len(data)\r\n    x = np.arange(1, N+1)\r\n    y = &#x5B;num for (s, num) in data]\r\n    labels = &#x5B;s for (s, num) in data]\r\n    width = 0.7\r\n    plt.bar(x, y, width, color=&quot;r&quot;)\r\n    plt.ylabel('Frequency')\r\n    plt.xlabel('Words')\r\n    plt.title('Word Occurrence Frequency For '&quot;'{}'&quot;' Wiki Page'.format(url.split(&quot;\/wiki\/&quot;)&#x5B;1].strip().replace(&quot;_&quot;, &quot; &quot;)))\r\n    plt.xticks(x + width\/2.0, labels)\r\n    
plt.xticks(rotation=45)\r\n    plt.show()\r\n\r\n#run from here!\r\nif __name__ == '__main__':\r\n    main()\r\n<\/pre>\n<p style=\"text-align: justify;\">Below is a short video depicting the script execution in PyScripter (hence prompt boxes rather than command line input) and the final graph output. You can fetch the complete script from my Skydrive folder <a href=\"https:\/\/skydrive.live.com\/redir?resid=715AEF07A82832E1!57745&amp;authkey=!ACruXruT60dpCvk&amp;ithint=folder%2c.py\" target=\"_blank\"><b>HERE<\/b><\/a> and customize it to enable web page scraping for websites other than Wikipedia with few changes required.<\/p>\n<p><iframe loading=\"lazy\" src=\"http:\/\/www.youtube.com\/embed\/9NV5DiQz7IY\" frameborder=\"0\" width=\"580\" height=\"330\"><\/iframe><\/p>\n<p style=\"text-align: justify;\">So there you go, a fairly easy way of scraping Wikipedia pages to compute word frequency and display the results on a bar chart using Python in conjunction with a few third-party modules. 
Enjoy playing around with web scraping and words counting and if you found this post useful (or otherwise), please leave me a comment.<\/p>\n<p style=\"text-align: justify;\">\n","protected":false},"excerpt":{"rendered":"<p>Words counting applications, even though rather ordinary and conventional on the face value, are a great exercise in programming language exploration and learning process, typically encompassing many core aspects of any language development such as conditionals, loops, data structures, data types, database and\/or files support if you choose to store the result outside the memory [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[32,56,53],"tags":[24,41,15],"class_list":["post-2133","post","type-post","status-publish","format-standard","hentry","category-how-tos","category-programming","category-visualisation","tag-programming","tag-python","tag-visualisation"],"aioseo_notices":[],"jetpack_featured_media_url":"","_links":{"self":[{"href":"http:\/\/bicortex.com\/bicortex\/wp-json\/wp\/v2\/posts\/2133","targetHints":{"allow":["GET"]}}],"collection":[{"href":"http:\/\/bicortex.com\/bicortex\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"http:\/\/bicortex.com\/bicortex\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"http:\/\/bicortex.com\/bicortex\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"http:\/\/bicortex.com\/bicortex\/wp-json\/wp\/v2\/comments?post=2133"}],"version-history":[{"count":17,"href":"http:\/\/bicortex.com\/bicortex\/wp-json\/wp\/v2\/posts\/2133\/revisions"}],"predecessor-version":[{"id":2228,"href":"http:\/\/bicortex.com\/bicortex\/wp-json\/wp\/v2\/posts\/2133\/revisions\/2228"}],"wp:attachment":[{"href":"http:\/\/bicortex.com\/bicortex\/wp-json\/wp\/v2\/media?parent=2133"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"http:\/\/bi
cortex.com\/bicortex\/wp-json\/wp\/v2\/categories?post=2133"},{"taxonomy":"post_tag","embeddable":true,"href":"http:\/\/bicortex.com\/bicortex\/wp-json\/wp\/v2\/tags?post=2133"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}