Web Scraping With Python – Wikipedia Words Frequency Analysis Using Matplotlib

February 6th, 2014 / by admin

Word counting applications, even though rather ordinary and conventional on the face of it, are a great exercise in programming language exploration and learning. They typically touch on many core aspects of a language such as conditionals, loops, data structures, data types, and database and/or file support if you choose to store the results outside of memory. Not to mention that there is something inherently childish and playful about counting words or letters and seeing the result automagically computed for your pleasure. Such a small application can be as trivial as counting the letters in a short sentence or paragraph, or grow into a more complex scenario where, for example, word counting is used by translators or writers to estimate effort and billing requirements, complemented with features such as importing different document types, counting Asian characters and so on. My personal requirements for the small script this post is based on were to be able to count the words on any Wikipedia page and display the result graphically. Additionally, I wanted the option of removing certain words, e.g. the most commonly used ones in the English language, removing certain sections from the parsed Wikipedia page, and storing the page as an HTML document on my hard drive.

The app’s syntax is rather procedural and should be easy to follow, especially as it runs in console mode and uses Python as the development language (Python 3.3 specifically). I may wrap it in a GUI later on, most likely using Tkinter, but for now it serves its purpose quite well. The web scraping part uses the fantastic BeautifulSoup module, whereas graphing is done using the equally brilliant Matplotlib library with a little bit of help from NumPy. Let’s look at the individual sections of the code and the functionality they represent.
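
Before walking through the individual sections, here is a minimal sketch of the core scrape-and-count idea the script is built around (a rough outline only – the URL is merely an example and all of the clean-up, storage and plotting logic is deliberately omitted):

#minimal sketch of the scrape-and-count idea (example URL, no clean-up or plotting)
import urllib.request
from collections import Counter
from bs4 import BeautifulSoup

page = urllib.request.urlopen("http://en.wikipedia.org/wiki/Web_scraping").read()
content = BeautifulSoup(page).find(id="mw-content-text") #main article body only
text = content.get_text(" ", strip=True)
print(Counter(word.lower() for word in text.split()).most_common(10))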

The first thing to do is to import all relevant libraries and define certain variables used later on in the script execution. I used the sqlite3 module to store results in a database table as SQLite is already included by default in any Python distribution (using a file would probably work just as well). I also decided to use the collections module, which supports rapid tallying. For web scraping BeautifulSoup is used, whereas urllib provides a high-level interface for fetching data from the nominated Wiki page. Finally, NumPy and Matplotlib are used for arranging the values and for the graphical output.

As far as the two global variables declared at the start are concerned, i.e. ‘undesirables’ and ‘common_words’, these contain, respectively, all the extra unwanted Wikipedia bits and bobs to be removed from the parsed HTML, and the most commonly used words in the English language (also to be removed). I could potentially scrape the unwanted words from another Wikipedia site, e.g. from HERE, but having them all encapsulated in an easy-to-modify list is probably a more flexible approach.

The final declaration before the main method defines two SQL statements for creating and dropping the ‘words_count’ table. These two statements are used at a later stage when the doDbWork() function is executed.

#import relevant modules
import sys, os, sqlite3, re, urllib.request
from collections import Counter
import numpy as np
from bs4 import BeautifulSoup
from matplotlib import pyplot as plt

#define global variables
global undesirables
undesirables = [{"element": "table", "attr": {'class': 'infobox'}},
                {"element": "table", "attr": {'class': 'vertical-navbox'}},
                {"element": "span", "attr": {'class': 'mw-editsection'}},
                {"element": "div", "attr": {'class': 'thumb'}},
                {"element": "sup", "attr": {'class': 'reference'}},
                {"element": "div", "attr": {'class': 'reflist'}},
                {"element": "table", "attr": {'class': 'nowraplinks'}},
                {"element": "table", "attr": {'class': 'ambox-Refimprove'}},
                {"element": "img", "attr": None}, {"element": "script", "attr": None},
                {"element": "table", "attr": {'class': 'mbox-small'}},
                {"element": "span", "attr": {"id": "coordinates"}},
                {"element": "table", "attr": {"class": "ambox-Orphan"}},
                {"element": "div", "attr": {"class": "mainarticle"}},
                {"element": None, "attr": {"id": "References"}}]

global common_words
common_words = ['a','able','about','across','after','all','almost','also',
                'am','among','an','and','any','are','as','at','be','because',
                'been','but','by','can','cannot','could','dear','did','do',
                'does','either','else','ever','every','for','from','get','got',
                'had','has','have','he','her','hers','him','his','how','however',
                'i','if','in','into','is','it','its','just','least','let','like',
                'likely','may','me','might','most','must','my','neither','no','nor',
                'not','of','off','often','on','only','or','other','our','out','own',
                'rather','said','say','says','she','should','since','so','some',
                'such','than','that','the','their','them','then','there','these',
                'they','this','to','too','us','wants','was','we','were','what',
                'when','where','which','while','who','whom','why','will','with',
                'would','yet','you','your']

#define database table and database dropping query
create_schema = "CREATE TABLE words_count \
                (id integer primary key autoincrement not null,word text,occurrence_count int)"
drop_schema = "DROP TABLE words_count"

Next up is the main() function, which simply dictates the application execution flow while gathering input from the end user and storing it in variables, e.g. the Wikipedia URL address used for scraping.

#determine execution flow
def main():
    url = str(input('Please enter Wiki web address you would like to scrape below (starting with http://)...\n-->'))
    isValidLink(url)
    checkConnectivity(url)
    global file_dir
    file_dir = str(input('Please enter directory path below where '
                         'database and html files will be stored e.g. '
                         'C:\\YourDirectory (if it does not exists it will be created)...\n-->'))
    createDir(file_dir)
    global db_file
    db_file = file_dir + '\\temp_db.db' #database file location
    doDbWork(db_file)
    remove_commons = str(input('Would you like to remove most commonly '
                               'used English words from the result set? (Y/N)...\n-->'))
    while remove_commons.lower() not in ('Y','y','N','n'):
        remove_commons = str(input('Please select either Y (yes) or N (no) as an option for this input.\n-->'))
    url_save = str(input('Would you like to save scraped HTML file for '
                         'reference in the nominated directory? (Y/N)...\n-->'))
    while url_save.lower() not in ('Y','y','N','n'):
        url_save = str(input('Please select either Y (yes) or N (no) as an option for this input.\n-->'))
    print ('Attempting to scrape {}...'.format(url))
    grabPage(url, url.split("/wiki/")[1].strip().replace("_", " "),db_file, url_save.lower(), remove_commons.lower())
    plotWords(url)

Following on, we have a bunch of functions responsible for different aspects of the script execution, starting with isValidLink() and checkConnectivity(). isValidLink() simply checks the string passed as the URL variable to ensure that only an appropriate Wikipedia page is used as input. If an incorrect string format is supplied, the code terminates. The checkConnectivity() function, on the other hand, ensures that the page can be accessed, potentially highlighting problems such as internet connectivity or firewall issues.

#check if the URL link submitted is a valid one
def isValidLink(url):
    if "/wiki/" in url and ":" in url and "http://"  in url and "wikibooks" not in url \
        and "#" not in url and "wikiquote" not in url and "wiktionary" not in url and "wikiversity" not in url \
        and "wikivoyage" not in url and "wikisource" not in url and "wikinews" not in url and "wikiversity" not in url \
        and "wikidata" not in url:
        print('Wiki link is a valid string...continuing...')
    else:
        print('This is not a valid Wiki URL address. Press any key to exit.')
        input()
        sys.exit(0)

#check if the website is responding to a 'http' call
def checkConnectivity(url):
    try:
        print('Connecting...')
        urllib.request.urlopen(url, timeout=5)
        print("Connection to '{}' succeeded".format(url))
    except Exception: #covers urllib.error.URLError as well as timeouts
        print("Connection to '{}' DID NOT succeed. You may want to check the following to resolve this issue:".format(url))
        print(  '1. Internet connection is enabled\n'
                '2. You entered a valid address\n'
                '3. The website is operational\n'
                '...exiting now.')
        input()
        sys.exit(0)

The next two functions deal with the directory path the user is prompted for, where the database file will be created and, optionally, where the HTML file will be stored after successful execution. This functionality is provided by the createDir() function, closely followed by the doDbWork() function which simply creates an SQLite database file in the nominated directory.

#create database and text file directory
def createDir(file_dir):
    if not os.path.exists(file_dir):
        try:
            print('Attempting to create directory in the path specified...')
            os.makedirs(file_dir)
            print('Directory created successfully...')
        except OSError:
            print("Directory COULD NOT be created in the location specified.")
            sys.exit(0)
    else:
        print('Directory specified already exists....moving on...')


#create database file and schema using the scripts above
def doDbWork(db_file):
    try:
        db_is_new = not os.path.exists(db_file)
        with sqlite3.connect(db_file) as conn:
            if db_is_new:
                print("Creating temp database schema on " + db_file + " database ...")
                conn.execute(create_schema)
            else:
                print("Database schema may already exist. Dropping database schema on " + db_file + "...")
                #os.remove(db_filename)
                conn.execute(drop_schema)
                print("Creating temporary database schema...")
                conn.execute(create_schema)
    except:
        print("Unexpected error:", sys.exc_info()[0])
    finally:
        conn.commit()
        conn.close()

The grabPage() function is where most of the heavy lifting is done. First, the URL is passed, the web page opened and scraped, with all the unwanted elements represented by the ‘undesirables’ variable removed. Wikipedia has a lot of nodes that I don’t want to parse, so by iterating through the list for each div/section/node I can ‘trim the fat’ and get rid of the unnecessary sections I’m not interested in. Next, the end user is asked to confirm whether he/she wishes to remove the most commonly used English words, after which a connection to the database is made and individual words with their corresponding counts are inserted into the table. The default count is restricted to 40, which then gets whittled down to the top 30 via a SQL DELETE statement (any more than that and the graph looks a bit congested). At this stage, a regex also removes certain characters from the dataset to avoid counting full stops, commas, question marks etc., and three SQL DELETE statements are executed to perform the final ‘clean up’, e.g. removing numerical characters, NULLs, duplicates etc. The user also has the option to save the page as a file in the nominated directory for further reference.

# process URL page, exporting the HTML file into a directory nominated (optional)
# and inserting most commonly used words into a database file
def grabPage(url, name, db_file, url_save, remove_commons):
    try:
        opener = urllib.request.urlopen(url)
        page = opener.read()
        s = BeautifulSoup(page)
        s = s.find(id="mw-content-text")
        if hasattr(s, 'find_all'):
                for notWanted in undesirables:
                    removal = s.find_all(notWanted['element'], notWanted['attr'])
                    if len(removal) > 0:
                        for el in removal:
                            el.extract()
                also = s.find(id="See_also")
                if (also != None):
                    also.extract()
                    tail = also.find_all_next()
                    if (len(tail) > 0):
                        for element in tail:
                            element.extract()
        text = s.get_text(" ", strip=True)
        opener.close()
        conn = sqlite3.connect(db_file)
        cursor = conn.cursor()
        words = [word.lower() for word in text.split()]
        c = Counter(words)
        if remove_commons == 'y':
                for key in common_words:
                    if key in c:
                        del c[key]
        for word, count in c.most_common(40):
                cursor.execute("INSERT INTO words_count (word, occurrence_count)\
                              SELECT (?), (?)", (re.sub('[–{@#!;+=_,$<(^)>?.:%/&}''"''-]', '', word.lower()), count))
        #delete numerical characters, NULLs and empty spaces
        cursor.execute("DELETE FROM words_count WHERE word glob '[0-9]*' or word ='' or word IS NULL")
        #delete duplicate records where the same word is repeated more then once
        cursor.execute("DELETE  FROM words_count WHERE id NOT IN(\
                        SELECT  MIN(id) FROM  words_count GROUP BY word)")
        #delete records outside top 30
        cursor.execute("DELETE FROM words_count WHERE occurrence_count NOT IN(\
                        SELECT occurrence_count FROM words_count ORDER BY 1 DESC LIMIT 30)")
        if url_save == 'y':
            soup = BeautifulSoup(page, "html5lib", from_encoding="UTF-8")
            content = soup.find(id="mw-content-text")
            if hasattr(content, 'find_all'):
                for notWanted in undesirables:
                    removal = content.find_all(notWanted['element'], notWanted['attr'])
                    if len(removal) > 0:
                        for el in removal:
                            el.extract()
                also = content.find(id="See_also")
                if (also != None):
                    also.extract()
                    tail = also.find_all_next()
                    if (len(tail) > 0):
                        for element in tail:
                            element.extract()
                fileName = str(name)
                doctype = "<!DOCTYPE html>"
                head = "<head><meta charset=\"UTF-8\" /><title>" + fileName + "</title></head>"
                f = open(file_dir + "/" + fileName.replace('/', '_') + ".html", 'w', encoding='utf-8')
                f.write(
                    doctype + "<html lang=\"en\">" + head + "<body><h1>" + fileName + "</h1>" + str(content) + "</body></html>")
                f.close()
                print ('Scraped HTML file and database file have been saved in "{0}\\" directory '
                       'with a bar chart displayed in a separate window'.format(file_dir))
    except:
        print("Unexpected error:", sys.exc_info()[0])
        conn.rollback()
    finally:
        conn.commit()
        conn.close()

Finally, the wordsOutput() and plotWords() functions create a graphical representation of each word’s occurrence frequency as a bar chart. This is followed by the main() function call which executes the whole script.

#fetch database data
def wordsOutput():
    try:
        arr = []
        conn = sqlite3.connect(db_file)
        cursor = conn.cursor()
        cursor.execute('SELECT word, occurrence_count FROM words_count ORDER BY occurrence_count DESC')
        #column_names = [d[0] for d in cursor.description] # extract column names
        for row in cursor:
            arr.append(row)
        return arr
    except:
        print("Unexpected error:", sys.exc_info()[0])
    finally:
        conn.close()

#plot data onto the bar chart
def plotWords(url):
    data = wordsOutput()
    N = len(data)
    x = np.arange(1, N+1)
    y = [num for (s, num) in data]
    labels = [s for (s, num) in data]
    width = 0.7
    plt.bar(x, y, width, color="r")
    plt.ylabel('Frequency')
    plt.xlabel('Words')
    plt.title('Word Occurrence Frequency For '"'{}'"' Wiki Page'.format(url.split("/wiki/")[1].strip().replace("_", " ")))
    plt.xticks(x + width/2.0, labels)
    plt.xticks(rotation=45)
    plt.show()

#run from here!
if __name__ == '__main__':
    main()

Below is a short video depicting the script execution in PyScripter (hence prompt boxes rather than command line input) and the final graph output. You can fetch the complete script from my SkyDrive folder HERE and customize it to enable web page scraping for websites other than Wikipedia with only minor changes required.
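
For what it is worth, the Wikipedia-specific parts are largely confined to isValidLink(), the ‘undesirables’ list and the hard-coded id="mw-content-text" container, so a generic version of the scraping routine could look something like the hedged sketch below (the container attributes and the skip list are hypothetical examples that would need to be tailored to the target site):

#a hedged sketch of a generic scraping routine (container attributes and skip list are examples only)
import urllib.request
from bs4 import BeautifulSoup

def grabGenericPage(url, container_attrs, skip_tags=("script", "img")):
    page = urllib.request.urlopen(url).read()
    content = BeautifulSoup(page).find(attrs=container_attrs) #replaces the hard-coded id="mw-content-text"
    for tag in skip_tags: #replaces the Wikipedia-specific 'undesirables' list
        for element in content.find_all(tag):
            element.extract()
    return content.get_text(" ", strip=True)

#e.g. grabGenericPage("http://example.com/article", {"id": "content"})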

So there you go, a fairly easy way of scraping Wikipedia pages to compute word frequency and displaying the results on a bar chart using Python in conjunction with a few third-party modules. Enjoy playing around with web scraping and word counting and, if you found this post useful (or otherwise), please leave me a comment.


Introduction to MongoDB NoSQL database for SQL Developers – Part 4

January 30th, 2014 / by admin

In the last three posts in this series (post 1 HERE, post 2 HERE and post 3 HERE) I outlined the core aspects of provisioning a MongoDB instance, as well as a few fundamental concepts relating to CRUD database operations using Mongo’s syntax. In this post I would like to elaborate further on that information, which should hopefully bring those details together into a concise and simple tutorial.

I assume you already have the MongoDB files downloaded and extracted on your machine (if not, please see post 1 HERE), so let’s start the server and Mongo’s Shell and create a sample collection with a few documents in it. Below you can see my command line in Microsoft Windows with all the start-up entries included. The steps outlined here are responsible for the following:

  • Session 1 – change the working directory to where the mongod daemon is located and start the MongoDB server.
  • Session 2 – change the working directory to where MongoDB is extracted, start Mongo’s Shell and connect to database ‘foo’ (notice that the database does not need to be created prior to issuing ‘use foo’ to connect to it). The equivalent commands are sketched below the list.
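
In case the screenshot below is hard to read, the two sessions boil down to something like the following (the installation and data directory paths are hypothetical – adjust them to wherever you extracted MongoDB):

:: Session 1 – start the MongoDB server
cd C:\mongodb\bin
mongod --dbpath C:\mongodb\data

:: Session 2 – start Mongo's Shell and switch to the 'foo' database
cd C:\mongodb\bin
mongo
> use foo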

[Screenshot: MongoDB server and shell start-up sessions in the Windows command prompt]

Next, we will insert 10 sample records (documents) into a ‘purchase_hist’ collection in the ‘foo’ database. If you remember from reading post 2 and post 3, MongoDB uses the JSON format to structure its data and an ‘insert’ function to add data to an existing or new collection. Let’s continue with Session 2 and run the following to populate the new ‘purchase_hist’ collection:

db.purchase_hist.insert([{
  "_id" : ObjectId("52ce3648a4a0bb15a8e07b75"),
  "CustomerID" : 23424,
  "First_Name" : "Mary",
  "Middle_Name" : "Joeanne",
  "Last_Name" : "Black",
  "Email_Address" : "mary0120@yahoo.com.au",
  "Phone_Number" : "03-8573-9455",
  "Purchased_Units" : 1,
  "Purchased_Value" : 29.99,
  "Item_SKU" : "RMK973",
  "Rec_Created_Date" : ISODate("2011-12-31T13:00:00Z")
},
{
  "_id" : ObjectId("52ce3648a4a0bb15a8e07b76"),
  "CustomerID" : 84933,
  "First_Name" : "John",
  "Middle_Name" : "Lorance",
  "Last_Name" : "Moore",
  "Email_Address" : "johnnym@awol.com",
  "Phone_Number" : "03-3423-1155",
  "Purchased_Units" : 1,
  "Purchased_Value" : 49.0,
  "Item_SKU" : "MKK833",
  "Rec_Created_Date" : ISODate("2011-12-31T13:00:00Z")
},
{
  "_id" : ObjectId("52ce3648a4a0bb15a8e07b77"),
  "CustomerID" : 19583,
  "First_Name" : "Martin",
  "Middle_Name" : null,
  "Last_Name" : "Laser",
  "Email_Address" : "mlaser91@aol.com",
  "Phone_Number" : "03-2355-1109",
  "Purchased_Units" : 5,
  "Purchased_Value" : 2099.49,
  "Item_SKU" : "HHY009",
  "Rec_Created_Date" : ISODate("2012-03-31T13:00:00Z")
},
{
  "_id" : ObjectId("52ce3648a4a0bb15a8e07b78"),
  "CustomerID" : 34311,
  "First_Name" : "Spencer",
  "Middle_Name" : "Chris",
  "Last_Name" : "McEvans",
  "Email_Address" : "spencerdude@hotmail.com",
  "Phone_Number" : "03-1122-0007",
  "Purchased_Units" : 32,
  "Purchased_Value" : 0.99,
  "Item_SKU" : "YNK227",
  "Rec_Created_Date" : ISODate("2012-03-31T13:00:00Z")
},
{
  "_id" : ObjectId("52ce3648a4a0bb15a8e07b79"),
  "CustomerID" : 74998,
  "First_Name" : "Mark",
  "Middle_Name" : null,
  "Last_Name" : "King",
  "Email_Address" : "mk038722@gmail.com",
  "Phone_Number" : "03-3423-1155",
  "Purchased_Units" : 1,
  "Purchased_Value" : 88.0,
  "Item_SKU" : "JKL777",
  "Rec_Created_Date" : ISODate("2011-12-31T13:00:00Z")
},
{
  "_id" : ObjectId("52ce3648a4a0bb15a8e07b7a"),
  "CustomerID" : 30002,
  "First_Name" : "Mary",
  "Middle_Name" : "Susan",
  "Last_Name" : "Grey",
  "Email_Address" : "mmgrey@gmail.com",
  "Phone_Number" : "03-1299-3859",
  "Purchased_Units" : 2,
  "Purchased_Value" : 22.99,
  "Item_SKU" : "AAN199",
  "Rec_Created_Date" : ISODate("2011-12-31T13:00:00Z")
},
{
  "_id" : ObjectId("52ce3648a4a0bb15a8e07b7b"),
  "CustomerID" : 22374,
  "First_Name" : "Luis",
  "Middle_Name" : "Blake",
  "Last_Name" : "Shimaro",
  "Email_Address" : "shimarolou@yahoo.com.au",
  "Phone_Number" : "03-0385-3999",
  "Purchased_Units" : 77,
  "Purchased_Value" : 1229.49,
  "Item_SKU" : "NMN800",
  "Rec_Created_Date" : ISODate("2012-11-10T13:00:00Z")
},
{
  "_id" : ObjectId("52ce3648a4a0bb15a8e07b7c"),
  "CustomerID" : 99931,
  "First_Name" : "Natalie",
  "Middle_Name" : "G",
  "Last_Name" : "Chin",
  "Email_Address" : "nataliechin@mediasmarts.com.au",
  "Phone_Number" : "03-3759-1001",
  "Purchased_Units" : 1,
  "Purchased_Value" : 123.0,
  "Item_SKU" : "HZX201",
  "Rec_Created_Date" : ISODate("2012-07-22T14:00:00Z")
},
{
  "_id" : ObjectId("52ce3648a4a0bb15a8e07b7d"),
  "CustomerID" : 22374,
  "First_Name" : "Marian",
  "Middle_Name" : null,
  "Last_Name" : "McErin",
  "Email_Address" : "marianmcerin@gmail.com",
  "Phone_Number" : "03-3400-3331",
  "Purchased_Units" : 10,
  "Purchased_Value" : 999.99,
  "Item_SKU" : "YQP232",
  "Rec_Created_Date" : ISODate("2012-12-18T13:00:00Z")
},
{
  "_id" : ObjectId("52ce3648a4a0bb15a8e07b7e"),
  "CustomerID" : 23424,
  "First_Name" : "Rick",
  "Middle_Name" : "Tony",
  "Last_Name" : "Webster",
  "Email_Address" : "rikky69@gmail.com",
  "Phone_Number" : "03-9459-1112",
  "Purchased_Units" : 3,
  "Purchased_Value" : 239.99,
  "Item_SKU" : "AAL155",
  "Rec_Created_Date" : ISODate("2012-01-22T13:00:00Z")
}])

Once the insert has completed, let’s display the data to confirm that the collection and its documents have indeed been created, using the db.collection.find() function as per below:

[Screenshot: db.purchase_hist.find() output in Mongo’s Shell]
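
For reference, the calls behind that output look roughly like this (pretty() simply formats the documents for readability):

db.purchase_hist.find().pretty() //list all documents in a readable format
db.purchase_hist.count() //should return 10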

Now that we have our sample database, collection and a number of documents created, we can run a few queries against them to output the data we’re interested in. For example:

/*=================================================================================
To select all customers where their middle name is missing in a sorted order:
=================================================================================*/
db.purchase_hist.find({"Middle_Name": null}).sort({"CustomerID": 1});

--Which is equivalent to the following SQL in RDBMS:
SELECT * FROM purchase_hist WHERE Middle_Name IS NULL
ORDER BY CustomerID;

/*=================================================================================
To sum up purchased value and purchased units (where there was more than 1),
grouping by when the record was first created:
=================================================================================*/
db.purchase_hist.group({
    "key": {
        "Rec_Created_Date": true
    },
    "initial": {
        "Units": 0,
        "Value": 0
    },
    "reduce": function(obj, prev) {
        prev.Units = prev.Units + obj.Purchased_Units - 0;
        prev.Value = prev.Value + obj.Purchased_Value - 0;
    },
    "cond": {
        "Purchased_Units": {
            "$gt": 1
        }
    }
});

--Which is equivalent to the following SQL in RDBMS:
SELECT
SUM(Purchased_Units) as Units, SUM(Purchased_Value) as Value, Rec_Created_Date
FROM purchase_hist
WHERE Purchased_Units > 1
GROUP BY Rec_Created_Date;

/*=================================================================================
To select average purchase value grouped by customer ID where their last name
is either 'Webster' or 'Grey' and averaged purchased value is greater than 0:
=================================================================================*/
db.purchase_hist.group({
    "key": {
        "CustomerID": true
    },
    "initial": {
        "sumforaverageAvg_Purchased_Value": 0,
        "countforaverageAvg_Purchased_Value": 0
    },
    "reduce": function(obj, prev) {
        prev.sumforaverageAvg_Purchased_Value += obj.Purchased_Value;
        prev.countforaverageAvg_Purchased_Value++;
    },
    "finalize": function(prev) {
        prev.Avg_Purchased_Value = prev.sumforaverageAvg_Purchased_Value / prev.countforaverageAvg_Purchased_Value;
        delete prev.sumforaverageAvg_Purchased_Value;
        delete prev.countforaverageAvg_Purchased_Value;
    },
    "cond": {
        "Last_Name": {
            "$in": ["Webster", "Grey"]
        }
    }
});

--Which is equivalent to the following SQL in RDBMS:
SELECT AVG(Purchased_Value) as Avg_Purchased_Value, CustomerID
  FROM purchase_hist
  WHERE Last_Name IN ('Webster', 'Grey')
  GROUP BY CustomerID
  HAVING AVG(Purchased_Value) > 0;
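
As a side note, the group() function shown above has been complemented by MongoDB’s aggregation framework (the aggregate() function, available from version 2.2 onwards), which many SQL developers find closer to the familiar GROUP BY/HAVING construct. A hedged equivalent of the last query might look like this:

db.purchase_hist.aggregate([
    { "$match": { "Last_Name": { "$in": ["Webster", "Grey"] } } },
    { "$group": { "_id": "$CustomerID", "Avg_Purchased_Value": { "$avg": "$Purchased_Value" } } },
    { "$match": { "Avg_Purchased_Value": { "$gt": 0 } } }
]);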

If you’re working with RDBMS systems such as Microsoft SQL Server, MySQL or PostgreSQL and need a tool to migrate your relational data into MongoDB, write some queries in an interactive manner, or simply relieve yourself from the ‘burden’ of working with Mongo’s Shell all the time, there are a few tools that can help with that. I personally tried a few different applications but eventually settled on the two I consider the best: MongoVue and RoboMongo. Both are fairly simplistic in their design, but making the transition from Mongo’s Shell to either one of them is like going from playing Pong on a monochrome Atari from 1972 to the latest version of Angry Birds on a current smartphone. Having said that, if you come from an SQL Server Management Studio or Oracle SQL Developer background, chances are you will be underwhelmed. Each tool has its strengths and advantages and they somewhat complement each other. I, for example, much prefer RoboMongo’s Shell implementation, which also includes syntax highlighting, auto-completion, different view modes (text, tree, custom) etc. Below is a sample view of RoboMongo’s interface.

[Screenshot: RoboMongo interface]

On the other hand, MongoVue (the paid version) has a relational database data import tool (providing you’re using MS SQL Server, MySQL or PostgreSQL) which, in the case of MS SQL Server, worked really well. It also provides some basic database monitoring/profiling capabilities, a GridFS explorer and a Map Reduce feature. The MongoVue interface looks as per below.

[Screenshot: MongoVue interface]

Final Thoughts

This is the last post in this four-part series on MongoDB for SQL developers – the other three can be found HERE, HERE and HERE. So far I have only scratched the surface of MongoDB’s features and functionality, not to mention the whole NoSQL paradigm. Concepts such as indexing, MapReduce, replication, sharding, monitoring, backups or application design are unfortunately out of scope for this series, but there are plenty of great resources (including MongoDB’s own documentation) to be found – as always, Google is your friend. Also, Amazon has a nice collection of books covering both the admin and development sides of using MongoDB, so with a little bit of reading/hands-on training it’s not as overwhelming as it may initially seem.

RDBMS systems have long been the staple, go-to choice for the data storage layer; however, with the recent rise in data volume, velocity and variety, a new crop of technologies was required to adapt to these changes. NoSQL databases, in specific contexts such as big data and real-time web applications, provide better simplicity of design, horizontal scaling and finer control over availability when compared to relational databases. At present (January 2014), MongoDB is the most popular NoSQL database on the market, and the fact that it has been adopted as backend software by a number of major websites and services, including Craigslist, eBay, Foursquare, SourceForge and The New York Times, among others, is a testament to its credibility.

[Chart: database popularity rankings, January 2014]

Working with MongoDB is not without challenges though and, depending on how you look at it, it has a number of faults. For starters, Mongo requires a lot of memory, preferring to put as much data as possible into working memory for fast access. Furthermore, Mongo doesn’t support the notion of ACID transactions, which are a touchstone of the RDBMS world. Data isn’t immediately written to disk upon an insert (although you can optionally require this via a flag) – a background process eventually writes unsaved data to disk. This makes writes extremely fast, but corresponding reads can occasionally be inconsistent. Coming from a relational database background, it is also hard to wrap your head around the fact that queries don’t support joins between collections, not to mention the sparse security features and the lack of Business Intelligence tools integration. You can read some interesting insights into why MongoDB is not a great fit for all data storage needs HERE or HERE. There is also an interesting video on migrating the Diaspora database from MongoDB to MySQL, due to issues the developers encountered when using it for social network data, HERE. Finally, Andrew C. Oliver from Open Software Integrators has written a short post on the challenges of porting an RDBMS schema to Mongo HERE.

With all those points considered, the relational database is still the staple data store for the vast majority of applications built today. But for some applications, the flexibility offered by Mongo provides advantages with respect to development speed and overall application performance. Additionally, Mongo’s relative ease-of-use and low price tag make it an attractive option for companies whose project objective can fit into the NoSQL space and MongoDB’s document storage architecture.
