Twitter Data Sentiment Analysis Using etcML and Python

January 5th, 2014 / 4 Comments » / by admin

A while ago I put together a few posts describing Twitter sentiment analysis using a few different tools and services, e.g. Zapier, RapidMiner, SQL etc., which can be found HERE, HERE and HERE. All those techniques, although presented in a rudimentary fashion on my blog, require an analyst to work through a comprehensive list of processes before any valid information can be extracted from the datasets. Data mining is still a discipline very much entrenched in mathematics and its supporting software and algorithms; recently, however, I have witnessed the emergence of web services which simplify and reduce the complexities (at least on the face of it) involved in the traditional approach to advanced data mining. One such service comes from a group of Stanford machine learning researchers and serves as a sentiment analysis tool for textual inputs, e.g. Twitter feeds. It’s called etcML (link HERE) and even though no information is provided on the technical aspects of the project, since the development is led by Andrew Ng and Richard Socher my guess would be that it uses the NaSent (Neural Analysis of Sentiment) deep learning algorithm, which you can read more about on the Stanford Engineering blog HERE.

To date, the most widely used methods of sentiment analysis have been limited to so-called ‘bag of words’ models, which don’t take word order into account. They simply parse through a collection of words, mark each as positive or negative, and use that count to estimate whether a sentence or paragraph has a positive or negative meaning.
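
To see why word order matters, here is a tiny, made-up sketch of a bag-of-words scorer (the word lists are obviously not a real sentiment lexicon). Because each word is scored in isolation, ‘weird but likeable’ and ‘likeable but weird’ look identical to it:

#toy bag-of-words scorer - the word lists below are made up for illustration only
POSITIVE = {"good", "great", "likeable", "love"}
NEGATIVE = {"bad", "awful", "weird", "hate"}

def bag_of_words_sentiment(text):
    words = text.lower().split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

#'weird' and 'likeable' cancel each other out - word order and the 'but' are ignored
print(bag_of_words_sentiment("this movie is weird but likeable"))  #neutral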

NaSent claims to be different. It can identify changes in the polarity of each word as it interacts with other words around it. In evaluating sample sentences, NaSent organizes words into what the scientists call grammatical tree structures (as per image below) that put the words into context.

NaSent1_Sentence_Breakdown

For example, consider NaSent’s analysis of the sentence ‘Unlike the surreal Leon, this movie is weird but likeable.’ Red nodes indicate that the model assigned a negative sentiment to the underlying word or phrase, as in the case of ‘weird’, while the phrase ‘is weird but likeable’ is correctly assigned a positive sentiment (indicated by blue). Similarly, in its analysis of the sentence ‘Unlike the surreal but likeable Leon, this movie is weird.’, the model correctly focuses on the overall negative sentiment, even though this sentence uses exactly the same words as the one above. This is represented by the grammatical tree structure in the image below.

NaSent2_Sentence_Breakdown
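
NaSent itself learns vector representations of words and phrases with a recursive neural network, so the toy sketch below is in no way the actual algorithm; the node structure and scores are my own made-up illustration of the underlying idea, namely composing sentiment bottom-up over a parse tree rather than counting words independently:

#toy illustration only - NaSent uses a recursive neural network, not hand-written rules
def node_sentiment(node):
    #a node is either ("leaf", score) or ("but", left_subtree, right_subtree)
    if node[0] == "leaf":
        return node[1]
    if node[0] == "but":
        return node_sentiment(node[2])  #the clause after 'but' dominates the phrase polarity
    return sum(node_sentiment(child) for child in node[1:])

#'weird but likeable' comes out positive even though 'weird' alone is negative
tree = ("but", ("leaf", -1), ("leaf", 1))
print(node_sentiment(tree))  #1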

To build NaSent, Socher and his team used 12,000 sentences taken from the movie reviews website Rotten Tomatoes. They split these sentences into roughly 214,000 phrases that were labelled as very negative, negative, neutral, positive, or very positive, and then fed this labelled data into the system, which NaSent then used to predict whether new sentences were positive, neutral or negative on its own. NaSent, the researchers say, was about 85 percent accurate, an improvement over the 80 percent accuracy of previous models. The effectiveness of the algorithm can be tested against your own sample text using the online demo set up HERE.

Back to etcML and specific use cases for analysing text data. The service is very easy to use indeed, with step-by-step instructions provided. In fact, the biggest hassle in using it as a classification tool is that data needs to be submitted in either text (tab-separated values) or zipped format. Everything else is straightforward and anyone looking at trialling it will be pleasantly surprised. At the moment there is no publicly available API so, unfortunately, manual dragging and dropping is as good as it gets. Even though the tool allows for uploading your own data for further classification (labelled or unlabelled) as well as training the classifiers on your own datasets, I think that many people will be using it interactively to perform sentiment analysis on Twitter data. You simply click on ‘Search for tweets’, enter a hashtag, keyword or handle, choose a classifier (it will automatically suggest one designed for sentiment analysis on tweets) and etcML does the rest. It returns an interactive timeline visualization, a collection of tweets that scored strongly for each label, and a searchable table for sorting the results or finding specific words/tweets. I had a quick play with some of the more popular Twitter handles and hashtags and have to say that sentiment tagging was mostly accurate, though not without a few odd cases where I felt the message was somewhat ‘misunderstood’. In etcML’s defence, the majority of feeds that the algorithm failed to tag accurately were sarcastic in nature.
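
If you do decide to train your own classifier rather than rely on the built-in tweet search, the snippet below is a rough sketch of how labelled training data could be written out as a tab-separated text file for upload. The labels, example tweets and column order are my assumptions, so check the upload instructions on the etcML site for the exact format it expects:

#hypothetical example - labels and tweets are made up, verify the expected column order with etcML
labelled_tweets = [
    ("positive", "Loving the new update, works like a charm"),
    ("negative", "Third crash today, this app is unusable"),
]

with open("etcml_training_data.txt", "w", encoding="utf-8") as out:
    for label, text in labelled_tweets:
        #strip tabs and newlines from the tweet text so each example stays on one line
        out.write(label + "\t" + text.replace("\t", " ").replace("\n", " ") + "\n")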

Below is a short video outlining etcML’s application and functionality, so if text (sentiment) classification is what you need, I think this service can meet some basic requirements, particularly for ad hoc analysis.

Also, if you intend to use this tool for Twitter sentiment analysis with your own feeds collection, you can use the Python code below to harvest the tweets (the text file and Python code file used in the above demo can be downloaded from HERE).

"""
Application:
This Python script fetches Twitter feeds for further analysis (sentiment analysis using the etcML web service).
It executes multiple times (number of iterations set by the variable), stores tweets temporarily in a SQLite
database (mainly for duplicates removal functionality) and outputs relevant tweets as a text format document,
removing URL shortening links and certain characters.
For a full script application and solution overview please visit www.bicortex.com

Pre-requisites:
1.  Python version 3 with Twython wrapper for Twitter API installed. Twython wrapper can be downloaded from here:
    https://github.com/ryanmcgrath/twython
2.  A registered Twitter application in order to provide consumer_key, consumer_secret, access_token
    and access_token_secret. All these can be sourced by registering an application through the following link:
    https://dev.twitter.com/apps

Things to adjust:
1.  Insert your own consumer_key, consumer_secret, access_token and access_token_secret keys.
2.  Populate 'harvest_keyword' with the keyword of interest which will be reflected in the Twitter
    feeds search output.
3.  Adjust 'file_location' variable to correspond to the folder path of your choice
"""

#import necessary modules
from twython import Twython
import sys
import os
import time
import sqlite3
import re

#set up some of the variables/lists
db_filename = r"c:\Twitter_Test_Data_Folder\temp_db" #database file name
consumer_key = ""  #application's consumer key
consumer_secret = "" #application's consumer secret
access_token = "" #application's access_token
access_token_secret = ""  #application's access_token_secret
harvest_keyword = 'Microsoft' #searches Twitter feeds with this keyword
file_location = r'c:\Twitter_Test_Data_Folder\Twitter_Test_Data.txt'
exec_times = 100  #number of times Twitter feeds will be fetched
pause_for = 15   #sleep for 15 seconds between fetch intervals

#define database table and database dropping query
create_schema = "CREATE TABLE twitter_feeds \
                (id integer primary key autoincrement not null,\
                twitter_feed text)"
drop_schema = "DROP TABLE twitter_feeds"

#create database file and schema using the scripts above
db_is_new = not os.path.exists(db_filename)
with sqlite3.connect(db_filename) as conn:
    if db_is_new:
        print("Creating temp database schema on " + db_filename + " database ...\n")
        conn.executescript(create_schema)
    else:
        print("Database schema may already exist. Dropping database schema on " + db_filename + "...")
        #os.remove(db_filename)
        conn.executescript(drop_schema)
        print("Creating temporary database schema...\n")
        conn.executescript(create_schema)

#fetch Twitter feeds and format them, removing unwanted characters/strings
def fetch_twitter_feeds():
    twitter = Twython(consumer_key, consumer_secret, access_token, access_token_secret)
    search_results = twitter.search(q=harvest_keyword, count="100")  #'count' is the Twitter search API v1.1 parameter for tweets returned per call (max 100)
    for tweet in search_results['statuses']:
        try:
            #the following can be enabled to see Twitter feeds content being fetched
            #print("-" * 250)
            #print("Tweet from @%s -->" % (tweet['user']['screen_name']))
            #print("    ", tweet['text'], "\n")
            #print("-" * 250)
            feed = tweet['text']
            feed = str(feed.replace("\n", ""))                                  #concatenate if tweet spans multiple lines
            feed = re.sub(r'http://[\w.]+/+[\w.]+', "", feed, flags=re.IGNORECASE)   #remove http:// URL shortening links
            feed = re.sub(r'https://[\w.]+/+[\w.]+', "", feed, flags=re.IGNORECASE)  #remove https:// URL shortening links
            feed = re.sub('[@#$<>:%&]', '', feed)                               #remove certain characters
            cursor = conn.cursor()
            cursor.execute("INSERT INTO twitter_feeds (twitter_feed) SELECT (?)", [feed])  #populate database table with the feeds collected
        except:
            print("Unexpected error:", sys.exc_info()[0])
            conn.rollback()
        finally:
            conn.commit()


#delete duplicated, too short or empty records from 'twitter_feeds' table
def drop_dups_and_short_strs():
    try:
        cursor = conn.cursor()
        cursor.execute("DELETE  FROM twitter_feeds WHERE id NOT IN(\
                        SELECT  MIN(id) FROM  twitter_feeds GROUP BY twitter_feed)")
        cursor.execute("DELETE FROM twitter_feeds WHERE LENGTH(twitter_feed) < 10")
        cursor.execute("DELETE FROM twitter_feeds WHERE twitter_feed IS NULL OR twitter_feed = ''")
    except:
        print("Unexpected error:", sys.exc_info()[0])
        conn.rollback()
    finally:
        conn.commit()

#display progress bar in a console
def progressbar(it, prefix = "", size = 60):
    count = len(it)
    def _show(_i):
        x = int(size*_i/count)
        print("%s[%s%s] %i/%i\r"  % (prefix, "#"*x, "."*(size-x), _i, count), end='')
        sys.stdout.flush()
    _show(0)
    for i, item in enumerate(it):
        yield item
        _show(i+1)
    print()


if __name__ == '__main__':
    try:
        for _ in progressbar(range(exec_times), "Fetching Twitter Feeds: ", 100):
            fetch_twitter_feeds()
            time.sleep(pause_for)
        drop_dups_and_short_strs()
        cursor = conn.cursor()
        tweets = cursor.execute("SELECT twitter_feed FROM twitter_feeds")
        #write the de-duplicated tweets out to a plain text file, one tweet per line
        with open(file_location, 'w', encoding='utf-8') as f:
            for row in tweets:
                f.write(''.join(row))
                f.write("\n")
    except:
        print("Unexpected error:", sys.exc_info()[0])
    finally:
        conn.close()
        os.remove(db_filename)


Introduction to Microsoft SQL Server 2014 Clustered Columnstore Indexes Performance Testing

December 11th, 2013 / No Comments » / by admin

Introduction

If you are an avid Microsoft SQL Server database user and/or had a chance to work with the SQL Server 2012 version and its new indexing options, you’ve probably heard of the columnstore index concept. The columnstore index, first introduced in SQL Server 2012, stores data in a columnar fashion (unlike the traditional B-tree structures used for clustered and non-clustered rowstore indexes) to speed up the processing time of common data warehousing queries. Typical data warehousing workloads involve summarizing large amounts of data. The techniques typically used in data warehousing and decision support systems to improve performance are pre-computed summary tables, indexed views or OLAP cubes. Although these can greatly speed up query processing, they can also prove inflexible, difficult to maintain, must be designed specifically for each query problem and require a certain level of OLAP expertise. Leveraging their existing SQL skills, developers and DBAs can now achieve near OLAP-like performance by utilising columnstore indexes. A columnstore index organizes the data in individual columns that are joined together to form the index. This structure can offer significant performance gains for queries that summarize large quantities of data, the sort typically used for business intelligence (BI) and data warehousing.

Architecture and Performance Testing

Just like a normal clustered index, a clustered columnstore index defines how the data is physically stored on disk. A columnstore-backed table is initially organized into segments known as rowgroups, each holding from 102,400 to 1,048,576 rows. Once a rowgroup is populated it is broken up into column segments, which are then compressed and inserted into the actual columnstore. When dealing with small amounts of data, small being defined as fewer than about a hundred thousand rows, the data is staged in a section known as the deltastore. Once it reaches the minimum size, the deltastore can be drained and its data processed as a new rowgroup. This is depicted in the MSDN diagram below.

ColumnStore_Data_Storage_Architecture_MSDN_Img

A deltastore will be closed while it is being converted. This, however, is not a blocking operation for the table as a whole. An additional deltastore can be created for a given table when the current deltastores are unavailable due to locking. And if the table is partitioned, then each partition gets its own set of deltastores.

Columnstore indexes are not a panacea for all potential performance issues. While it’s possible to build a system that stores all data in columnar format, rowstores still have advantages in some situations. A B-tree is a very efficient data structure for looking up or modifying a single row of data, so if your workload entails many single-row lookups and many updates and deletes, which is common for OLTP workloads, you will probably continue to use rowstore technology. Data warehouse workloads typically scan, aggregate and join large amounts of data, and in those scenarios column stores really shine.

Let’s look at some performance gains which can be achieved by using columnstore index technology. For the purpose of this demonstration I used a couple of text files (all files and SQL queries for this presentation can be downloaded from HERE) holding the subset of data I intend to load into my tables. These files have been placed in a folder on my c:\ drive and their data loaded into a ‘sandbox’ database for performance testing. I also ‘stretched out’ one of the tables to take advantage of much larger volume of data to run a few test queries, comparing execution times between tables with ‘classic’ b-tree indexes vs. columnstore indexes.  Again, all scripts used in this presentation as well as additional files can be downloaded from HERE.
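
For anyone who prefers to drive the comparison from Python (in keeping with the script from the previous post) rather than run the T-SQL scripts directly, the sketch below shows roughly how the two table variants could be timed with pyodbc. The connection string, database, table and column names are placeholders of my own, not the objects from the downloadable scripts, so adjust them to your environment:

#rough timing sketch using pyodbc - server, database, table and column names are placeholders
import time
import pyodbc

conn = pyodbc.connect("DRIVER={SQL Server Native Client 11.0};SERVER=localhost;"
                      "DATABASE=SandboxDB;Trusted_Connection=yes;")
cursor = conn.cursor()

#the columnstore copy of the table would be created once beforehand, e.g.:
#cursor.execute("CREATE CLUSTERED COLUMNSTORE INDEX cci_SalesBig ON dbo.SalesBig_CS")
#conn.commit()

query = "SELECT ProductID, SUM(LineTotal) AS total FROM dbo.{0} GROUP BY ProductID"
for table in ("SalesBig_Rowstore", "SalesBig_CS"):
    start = time.time()
    cursor.execute(query.format(table))
    cursor.fetchall()
    print("%s: %.2f seconds" % (table, time.time() - start))

conn.close()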

The quick video footage below demonstrates the process of test environment preparation as well as all performance tests executed with their corresponding results. At the very last moment I also decided to include some rudimentary data I managed to collect regarding table compression efficiency (by the way, great achievement by SQL Server team at MSFT!) when implemented using a columnstore index. Enjoy and leave me a comment if you find it useful or otherwise!
