Introduction to MongoDB NoSQL database for SQL Developers – Part 1

January 30th, 2014 / No Comments » / by admin

Note: This is the first part of a series of posts describing MongoDB NoSQL database key features and functionality. Further posts can be found HERE (part 2), HERE (part 3) and HERE (part 4).

Introduction

Coming from a mainly relational database background, I regarded the NoSQL movement as a bit of a fad when it first started making the rounds in the early 2000s; my immediate thought was that, perhaps with the exception of a few hippy Silicon Valley startups, it would end up being just a flash in the pan. It is 2014 and it turns out that not only has it proven to be a viable data storage solution for many enterprise-level and beyond (not sure how one would refer to the likes of Google or Amazon) companies, but it has also been slowly making inroads into more conservative environments, where long-term reliance on RDBMSs from vendors such as Oracle, IBM or Microsoft seemed undisturbed.

Working in IT, it has been both a curse and an asset to watch change proliferate throughout the industry, with new programming languages and frameworks popping up like wild mushrooms. However, for a typical database professional, things have been pretty steady for quite some time since the inception of Codd’s famous relational model. SQL has been the gold standard for all major database vendors and, provided one could master the basic methodologies and principles of database development and management, working as a DBA or database developer has been a steady and predictable field to be in. That is, until recently. Not long ago I wrote about the new data (warehouse) management frontier emerging in the form of cloud deployment (link HERE). In this post I would like to explore one variant of the NoSQL database plethora – the document-oriented MongoDB.

MongoDB Database Characteristics

A quick Wikipedia lookup under NoSQL outlines a rich ecosystem of NoSQL databases, with major categories such as column, key-value, document and graph stores. MongoDB falls under the document-oriented category, characterised by high performance, scalability and flexibility, combined with low complexity and variable functionality. MongoDB is free and open-source software and has been adopted as the backend database by a number of major websites and services, including Craigslist, eBay, Foursquare, SourceForge and The New York Times, among others. Main features include indexing support, MapReduce support for batch processing of data and aggregation operations, file storage, as well as built-in replication and load balancing. Programmatic access is available in a number of languages such as C, C++, C#, Erlang, Java, Python, Ruby and more.
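For a taste of that programmatic access, below is a minimal sketch of connecting to MongoDB from Python and creating a secondary index. It assumes the pymongo driver and a local MongoDB instance listening on the default port; the database, collection and field names are made up purely for illustration.

from pymongo import MongoClient, ASCENDING

#connect to a local MongoDB instance (adjust host/port to your environment)
client = MongoClient("localhost", 27017)
db = client["blog"]            #databases are created lazily on first write
posts = db["posts"]            #a 'collection' is MongoDB's rough counterpart of a table

#secondary index on the 'author' field, comparable to CREATE INDEX in SQL
posts.create_index([("author", ASCENDING)])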

MongoDB makes the transition (arguably) easy for someone accustomed to thinking in terms of tables, columns and rows, as the mapping between those concepts is quite straightforward. A ‘table’, in relational terms, equates to a ‘collection’ in MongoDB, whereas a ‘row’ is represented by a ‘document’; everything else, i.e. database, index and primary key definitions, tends to overlap, with a few distinct differences. Another distinctive feature is the fact that MongoDB is a schema-less database – a document’s keys and values are not of fixed types or sizes, so adding and removing fields (columns) becomes easier. A small sketch illustrating that mapping follows below.
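The example below (again assuming pymongo; method names such as insert_one may differ between driver versions, and the collection is hypothetical) inserts two ‘rows’ as documents into a collection – note that no schema has to be declared up front and the two documents do not have to share the same fields.

from pymongo import MongoClient

client = MongoClient("localhost", 27017)
customers = client["shop"]["customers"]   #the 'customers' collection plays the role of a table

#each document plays the role of a row; no CREATE TABLE is required beforehand
customers.insert_one({"first_name": "Jane", "last_name": "Doe", "city": "Sydney"})

#a second document may carry a different set of fields - the schema-less aspect
customers.insert_one({"first_name": "John", "last_name": "Smith", "loyalty_points": 120})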

Finally, queries executed on MongoDB follow different syntax rules than those created using declarative SQL. MongoDB, by default, uses JSON-style notation, which can be quite confusing at first, regardless of your SQL expertise level. There have been a number of projects aiming to simplify or standardise querying NoSQL data stores, and a few services exist where you can paste a SQL query and have it converted into MongoDB syntax, e.g. HERE. For me personally, however, it was easier to pick up MDX or DMX (languages used for OLAP and data mining queries), which by virtue of their syntax were quite a departure from my comfort zone, i.e. ‘classic’ SQL, than to run with MongoDB’s native query syntax. That’s not to say that with a little bit of persistence and time it cannot be mastered; additionally, the occasional paradigm shift away from archetypical SQL provides a breath of fresh air to an otherwise very conventional field.
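To give a flavour of that JSON-style notation, the sketch below shows a simple SQL query alongside what I understand to be its MongoDB equivalent, issued through pymongo (the collection and field names are purely illustrative):

from pymongo import MongoClient

orders = MongoClient("localhost", 27017)["shop"]["orders"]

#SQL:  SELECT customer, total FROM orders WHERE total > 100 ORDER BY total DESC
#MongoDB expresses the same intent as a JSON-style filter, projection and sort:
cursor = (orders
          .find({"total": {"$gt": 100}},               #WHERE total > 100
                {"customer": 1, "total": 1, "_id": 0}) #SELECT customer, total
          .sort("total", -1))                          #ORDER BY total DESC

for doc in cursor:
    print(doc)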

MongoDB Database Installation Overview

Without further ado, let’s look at MongoDB in more detail. In this video I will show you how to download, install and perform basic configuration of a MongoDB instance.

Also, if you’re interested in Mongo’s query syntax essentials and would like to use it with something a little bit less primitive than the console window, please check out the remaining three posts of this series HERE (part 2), HERE (part 3) and HERE (part 4).


Twitter Data Sentiment Analysis Using etcML and Python

January 5th, 2014 / 4 Comments » / by admin

A while ago I put together a few posts describing Twitter sentiment analysis using a few different tools and services, e.g. Zapier, RapidMiner, SQL etc., which can be found HERE, HERE and HERE. All those techniques, although presented in a rudimentary fashion on my blog, require an analyst to go through a comprehensive list of processes which are a necessary prelude to getting any valid information out of the datasets processed. Data mining is still a discipline very much entrenched in the fields of mathematics and supporting software and algorithms; however, recently I have witnessed the emergence of web services which tend to simplify and reduce the complexities (at least on the face of it) involved in the traditional approach to advanced data mining. One such service comes from a group of Stanford machine learning researchers and serves as a sentiment analysis tool for textual inputs, e.g. Twitter feeds. It’s called etcML (link HERE), and even though there is no information provided on the technical aspects of the project, since the development is led by Andrew Ng and Richard Socher my guess would be that it uses the NaSent (Neural Analysis of Sentiment) deep learning algorithm, which you can read more about on the Stanford Engineering blog HERE.

Currently, the most widely used methods of sentiment analysis have been limited to so-called ‘bag of words’ models, which don’t take word order into account. They simply parse through a collection of words, mark each as positive or negative, and use that count to estimate whether a sentence or paragraph has a positive or negative meaning.
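As a toy illustration of how crude such a bag-of-words count is, the sketch below scores text against two tiny, made-up word lists; because word order is ignored, the two example sentences discussed in the NaSent write-up further down end up with exactly the same score.

#toy bag-of-words sentiment scorer: count positive vs negative words, word order is ignored
POSITIVE = {"good", "great", "likeable", "excellent", "love"}
NEGATIVE = {"bad", "weird", "awful", "boring", "hate"}

def bag_of_words_sentiment(text):
    words = [w.strip(".,!?'") for w in text.lower().split()]
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

#both sentences contain 'weird' and 'likeable', so both come out 'neutral',
#even though their overall sentiment differs
print(bag_of_words_sentiment("Unlike the surreal Leon, this movie is weird but likeable."))
print(bag_of_words_sentiment("Unlike the surreal but likeable Leon, this movie is weird."))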

NaSent claims to be different. It can identify changes in the polarity of each word as it interacts with other words around it. In evaluating sample sentences, NaSent organizes words into what the scientists call grammatical tree structures (as per image below) that put the words into context.

[Image: NaSent1_Sentence_Breakdown]

For example, take NaSent’s analysis of the sentence: ‘Unlike the surreal Leon, this movie is weird but likeable.’ Red nodes indicate that the model assigned a negative sentiment to the underlying word or phrase, as in the case of ‘weird’, while the phrase ‘is weird but likeable’ is correctly assigned a positive sentiment (indicated by blue). Similarly, in NaSent’s analysis of the sentence ‘Unlike the surreal but likeable Leon, this movie is weird.’, the model correctly focuses on the overall negative sentiment, even though this sentence uses exactly the same words as the one above. This is represented by the grammatical tree structure in the image below.

[Image: NaSent2_Sentence_Breakdown]

To build NaSent, Socher and his team used 12,000 sentences taken from the movie reviews website Rotten Tomatoes. They split these sentences into roughly 214,000 phrases that were labelled as very negative, negative, neutral, positive or very positive, and then fed this labelled data into the system, which NaSent used to predict whether new sentences were positive, neutral or negative on its own. NaSent, the researchers say, was about 85 percent accurate, an improvement over the 80 percent accuracy of previous models. The algorithm’s effectiveness on sample text can be tested and analysed using the online demo set up HERE.

Back to etcML and specific use cases for analysing text data. The service is very easy to use indeed, with step-by-step instructions provided. In fact, the biggest hassle as far as utilising it as a classification tool goes is that data needs to be submitted in either a text (tab-separated values) or zipped format. Everything else is straightforward and anyone looking at trialling it will be pleasantly surprised. At the moment there is no publicly available API, so unfortunately manual dragging and dropping is as good as it gets. Even though the tool allows you to upload your own data for further classification (labelled or unlabelled) as well as to train classifiers on your own datasets, I think that many people will be using it interactively to perform sentiment analysis on Twitter data. You simply click on ‘Search for tweets’, enter a hashtag, keyword or handle, choose a classifier (it will automatically suggest one designed for sentiment analysis on tweets) and etcML does the rest. It returns an interactive timeline visualisation, a collection of tweets that scored strongly for each label, and a searchable table for sorting the results or finding specific words/tweets. I had a quick play with some of the more popular Twitter handles and hashtags and have to say that sentiment tagging was mostly accurate, though not without a few odd cases where I felt the message was somewhat ‘misunderstood’. In etcML’s defence, the majority of feeds that the algorithm failed to tag accurately were sarcastic in nature.

Below is a short video outlining etcML’s application and functionality, so if text (sentiment) classification is what you need, I think this service can meet some basic requirements, particularly for ad hoc analysis.

Also, if you intend to use this tool for Twitter sentiment analysis on your own collection of feeds, you can use the Python code below to harvest the tweets (the text file and Python code file used in the above demo can be downloaded from HERE).

"""
Application:
This Python script fetches Twitter feeds for further analysis (sentiment analysis using etCML web service).
It executes multiple times (number of iterations set by the variable), stores tweets temporarily in a SQLite
database (mainly for duplicates removal functionality) and outputs relevant tweets as a text format document,
removing URL shortening links and certain characters.
For a full script application and solution overview please visit www.bicortex.com

Pre-requisites:
1.  Python version 3 with Twython wrapper for Twitter API installed. Twython wrapper can be downloaded from here:
    https://github.com/ryanmcgrath/twython
2.  Twitter registered application in order to provide consumer_key, consumer_secret, access_token
    and access_token_secret. All these can be sourced by registering an application through the following link:
    https://dev.twitter.com/apps

Things to adjust:
1.  Insert your own consumer_key, consumer_secret, access_token and access_token_secret keys.
2.  Populate 'harvest_list' with the key words of interest which will be reflected in the Twitter
    feeds search content output.
3.  Adjust 'file_location' variable to correspond to the folder path of your choice
"""

#import necessary modules
from twython import Twython
import sys
import os
import time
import sqlite3
import re

#set up some of the variables/lists
db_filename = r"c:\Twitter_Test_Data_Folder\temp_db" #database file name
consumer_key = ""  #application's consumer key
consumer_secret = "" #application's consumer secret
access_token = "" #application's access_token
access_token_secret = ""  #application's access_token_secret
harvest_keyword = 'Microsoft' #searches Twitter feeds with this keyword
file_location = r'c:\Twitter_Test_Data_Folder\Twitter_Test_Data.txt'
exec_times = 100  #number of times Twitter feeds will be fetched
pause_for = 15   #sleep for 15 seconds between intervals

#define database table and database dropping query
create_schema = "CREATE TABLE twitter_feeds \
                (id integer primary key autoincrement not null,\
                twitter_feed text)"
drop_schema = "DROP TABLE twitter_feeds"

#create database file and schema using the scripts above
db_is_new = not os.path.exists(db_filename)
with sqlite3.connect(db_filename) as conn:
    if db_is_new:
        print("Creating temp database schema on " + db_filename + " database ...\n")
        conn.executescript(create_schema)
    else:
        print("Database schema may already exist. Dropping database schema on " + db_filename + "...")
        #os.remove(db_filename)
        conn.executescript(drop_schema)
        print("Creating temporary database schema...\n")
        conn.executescript(create_schema)

#fetch Twitter feeds and format them, removing unwanted characters/strings
def fetch_twitter_feeds():
    twitter = Twython(consumer_key, consumer_secret, access_token, access_token_secret)
    search_results = twitter.search(q=harvest_keyword, rpp="100")
    for tweet in search_results['statuses']:
        try:
            #the following can be enabled to see Twitter feeds content being fetched
            #print("-" * 250)
            #print("Tweet from @%s -->" % (tweet['user']['screen_name']))
            #print("    ", tweet['text'], "\n")
            #print("-" * 250)
            feed = tweet['text']
            feed = str(feed.replace("\n", ""))                                        #concatenate if tweet spans multiple lines
            feed = re.sub(r'http://[\w.]+/+[\w.]+', "", feed, flags=re.IGNORECASE)    #remove http:// URL shortening links
            feed = re.sub(r'https://[\w.]+/+[\w.]+', "", feed, flags=re.IGNORECASE)   #remove https:// URL shortening links
            feed = re.sub('[@#$<>:%&]', '', feed)                               #remove certain characters
            cursor = conn.cursor()
            cursor.execute("INSERT INTO twitter_feeds (twitter_feed) SELECT (?)", [feed])  #populate database table with the feeds collected
        except:
            print("Unexpected error:", sys.exc_info()[0])
            conn.rollback()
        finally:
            conn.commit()


#delete duplicated, too short or empty records from 'twitter_feeds' table
def drop_dups_and_short_strs():
    try:
        cursor = conn.cursor()
        cursor.execute("DELETE  FROM twitter_feeds WHERE id NOT IN(\
                        SELECT  MIN(id) FROM  twitter_feeds GROUP BY twitter_feed)")
        cursor.execute("DELETE FROM twitter_feeds WHERE LENGTH(twitter_feed) < 10")
        cursor.execute("DELETE FROM twitter_feeds WHERE twitter_feed IS NULL OR twitter_feed = ''")
    except:
        print("Unexpected error:", sys.exc_info()[0])
        conn.rollback()
    finally:
        conn.commit()

#display progress bar in a console
def progressbar(it, prefix = "", size = 60):
    count = len(it)
    def _show(_i):
        x = int(size*_i/count)
        print("%s[%s%s] %i/%i\r"  % (prefix, "#"*x, "."*(size-x), _i, count), end='')
        sys.stdout.flush()
    _show(0)
    for i, item in enumerate(it):
        yield item
        _show(i+1)
    print()


if __name__ == '__main__':
    try:
        for i, z in zip(range(exec_times+1), progressbar(range(exec_times), "Fetching Twitter Feeds: ", 100)):
            fetch_twitter_feeds()
            time.sleep(pause_for)
        drop_dups_and_short_strs()
        cursor = conn.cursor()
        tweets = cursor.execute("SELECT twitter_feed tw FROM twitter_feeds")
        f = open(file_location, 'w', encoding='utf-8')
        for row in tweets:
            row = ''.join(row)
            f.write(row)
            f.write("\n")
    except:
        print("Unexpected error:", sys.exc_info()[0])
    finally:
        conn.close()
        f.close()
        os.remove(db_filename)
