Analysis of Twitter Data

Twitter represents a fundamentally new instrument to make social measurements. Millions of people voluntarily express opinions across any topic imaginable --- this data source is incredibly valuable for both research and business.

In this project we access the twitter Application Programming Interface (API) using python. We estimate the public's perception (sentiment) of terms and phrases, and analyze the relationship between location and mood based on a sample of twitter data.

This code was written with the constraints that it would run in a protected environment where one could rely on only the standard Python libraries. Therefore it does not use any external libraries or web services that could otherwise make the code better.

Getting the Twitter Data

The first thing to do is import some basic libraries.

In [1]:
import sys
import oauth2 as oauth
import urllib2 as urllib
import json
import re

Next, it is necessary to get access to the twitter API. To do so requires access keys, which are available as "API Key", "API secret","Access token" and "Access token secret" to any developer with a twitter app. All four should be visible on the API Keys page once a dummy app has been created on twitter.

In [ ]:
_debug = 0

oauth_token    = oauth.Token(key=access_token_key, secret=access_token_secret)
oauth_consumer = oauth.Consumer(key=api_key, secret=api_secret)

signature_method_hmac_sha1 = oauth.SignatureMethod_HMAC_SHA1()

http_method = "GET"
#https://dev.twitter.com/rest/reference/get/search/tweets

http_handler  = urllib.HTTPHandler(debuglevel=_debug)
https_handler = urllib.HTTPSHandler(debuglevel=_debug)

The function below allows one to construct, sign, and open a twitter request using the credentials above.

In [2]:
def twitterreq(url, method, parameters):
    req = oauth.Request.from_consumer_and_token(oauth_consumer,
                                             token=oauth_token,
                                             http_method=http_method,
                                             http_url=url, 
                                             parameters=parameters)

    req.sign_request(signature_method_hmac_sha1, oauth_consumer, oauth_token)

    headers = req.to_header()

    if http_method == "POST":
        encoded_post_data = req.to_postdata()
    else:
        encoded_post_data = None
        url = req.to_url()

    opener = urllib.OpenerDirector()
    opener.add_handler(http_handler)
    opener.add_handler(https_handler)

    response = opener.open(url, encoded_post_data)

    return response

Finally, we will be obtaining 'random' tweets from the API, which means we will be listening to the twitter stream at "https://stream.twitter.com/1.1/statuses/sample.json". It's important to note what version of the twitter API you are using (here 1.1) to make sure everything works. One can also use the search API, which has examples in the commented code below, but we won't be using that. Here we print to stdout, which we can simply assign to be a file if we wish. Before assign stdout to the file, it's good to create a copy of stdout so we can easily switch back whenever we want.

In [4]:
#stdout = sys.stdout
#file = open('output.txt', 'w')
#sys.stdout = file
In [5]:
def fetchsamples():
    #url = "https://api.twitter.com/1.1/search/tweets.json?q=microsoft "
    #url = "https://search.twitter.com/search.json?q=%23baseball&result_type=recent"
    
    url = "https://stream.twitter.com/1.1/statuses/sample.json"
    parameters = []
    response = twitterreq(url, "GET", parameters)
    for line in response:
        print line.strip()

Next, we compute the sentiment of each tweet based on the sentiment scores of the terms in the tweet. The sentiment of a tweet is equivalent to the sum of the sentiment scores for each term in the tweet. We use a file called AFINN-111.txt containing a list of pre-computed sentiment scores. Each line in the file contains a word or phrase followed by a sentiment score. Each word or phrase that is found in a tweet but not found in AFINN-111.txt is given a sentiment score of 0. To use the data in the AFINN-111.txt file, we build a dictionary.

In [2]:
afinnfile = "AFINN-111.txt"
scores = {} # initialize an empty dictionary
for line in open(afinnfile):
    term, score  = line.split("\t")  # The file is tab-delimited. "\t" means "tab character"
    scores[term] = int(score)  # Convert the score to an integer.

Tweets have many characters in them, including emoticons with eyes, optional noses, html tags, @-mentions, hashtags, URLs, numbers, and made up words. It's therefore very difficult to parse many tweets perfectly, so we will settle with the best quick and dirty solution. There are nice libraries available for parsing tweets, including the nltk TweetTokenizer. Here we will compile a regular expressions string from a very nice example here to deal with all of these problems.

In [3]:
emoticons_str = r"""
    (?:
        [:=;] # Eyes
        [oO\-]? # Nose (optional)
        [D\)\]\(\]/\\OpP] # Mouth
    )"""
 
regex_str = [
    emoticons_str,
    r'<[^>]+>', # HTML tags
    r'(?:@[\w_]+)', # @-mentions
    r"(?:\#+[\w_]+[\w\'_\-]*[\w_]+)", # hash-tags
    r'http[s]?://(?:[a-z]|[0-9]|[$-_@.&amp;+]|[!*\(\),]|(?:%[0-9a-f][0-9a-f]))+', # URLs
 
    r'(?:(?:\d+,?)+(?:\.?\d+)?)', # numbers
    r"(?:[a-z][a-z'\-_]+[a-z])", # words with - and '
    r'(?:[\w_]+)', # other words
    r'(?:\S)' # anything else
]
    
tokens_re = re.compile(r'('+'|'.join(regex_str)+')', re.VERBOSE | re.IGNORECASE)
emoticon_re = re.compile(r'^'+emoticons_str+'$', re.VERBOSE | re.IGNORECASE)

This helps to recognize where all of these messy characters appear so the text of tweets can be split up into individual components, or tokens. We can use the findall function to find every instance where a given regex appears in our sets of tweets.

In [4]:
def tokenize(s):
    return tokens_re.findall(s)

Sentiment of Tweets

We can evaluate the sentment of each tweet by summing up the sentiment score for each token which appears in our dictionary scores. Strings in the twitter data are unicode strings. Unicode is a standard for representing a much larger variety of characters beyond the roman alphabet. In most circumstances, we will be able to use a unicode object just like a string, but in some cases there will be problems and therefore it is desirable to use the encode method to properly print the international characters.

In [5]:
def evaluate_sentiment(tweet, scores):
    tokens = tokenize(tweet)
    score = sum([scores[term] for term in tokens 
                 if term.encode('utf-8') in scores.keys()])
    return score

In passing, we note that we could have used the nltk toolkit in this way

In [32]:
#from nltk.tokenize import TweetTokenizer
#tknzr = TweetTokenizer()
#tokens = tknzr.tokenize(tweet)

We can use json.loads to load each tweet in our collection of tweets (json file). The 'text' field of each tweet corresponds to the content of the tweet. The Twitter documentation describes how to access the information in each tweet in greater detail for those who are interested. We do have to include a try, except block to ensure that the text of the tweet is in english and that the tweet contains the information we want.

In [35]:
file = open('output.txt')
count = 0
for n, line in enumerate(file):
    parsed = json.loads(line)
    try:
        if parsed['lang'] != 'en':
            continue
    except:
        continue
    
    count += 1
    if count > 15:
        break
    
    tweet = parsed['text'].strip()
    print tweet, evaluate_sentiment(tweet, scores)
@akosibattman218 @ALDub_RTeam @ALDUB_inARTeam @imcr8d4u @yodabuda @jophie30 @by_nahjie @WHairedFairy @WhilczelCanlas 
Needs
#ALDUBinEurope 0
RT @meanpIastic: THIS IS WORSE THAN PAPAW AND HIS BURGERS https://t.co/pR57jmIBv7 0
@earthtomermer I get told I'm tall literally everyday 0
#comcast email stopped working contact -1
RT @krystaloveyou: Lol damn RT @goddywest: Waiting for the glow up man -4
RT @DrakeDirect_: Drake and 21 Savage. https://t.co/39zY8YOsVT 0
Rich The Kid Type Beat "Keep Up" (Prod.) BrandonThePro: https://t.co/LcrhC9bPhw via @YouTube 0
11:11 make a wish 1
By 2020, a pack of cigarettes will cost $40 in Australia. 0
Been playing 2k all day 0
RT @KenndaIlJenner: These Rich Snapchat Kids Are Really The Worst..
https://t.co/8iAu5XmE8B 0
RT @YG_BLACKPINK: #GDRAGON #BIGBANG IG updates

#BLACKPINK #SQUARETWOinYOURAREA https://t.co/SoiAGELMTQ 0
Dinner tonight actually went very well β˜ΊοΈπŸŽƒπŸ‘» 0
RT @iliveforfacts: If it's worth it, fight for it. If it's not, move on. 1
#hot pictures of woman losing their virginity two women and one men sex https://t.co/kR59kxsYIq -3

Sentiment of New Terms

The next step in the analysis is to derive the sentiment of new terms. We will be creating a script that computes the sentiment for the terms that do not appear in the file AFINN-111.txt. We know we can use the sentiment-carrying words in AFINN-111.txt to deduce the overall sentiment of a tweet. Once we deduce the sentiment of a tweet, we can work backwards to deduce the sentiment of the non-sentiment carrying words that do not appear in AFINN-111.txt. For example, if the word soccer always appears in proximity with positive words like great and fun, then we can deduce that the term soccer itself carries a positive sentiment.

We'll use a basic library called collections which is useful to solve this problem. We will need to count the total number of times each word occurs in the corpus of tweets so that we can easily compute the average sentiment score of tweets in which each word occurs. The code below may make this more clear.

In [6]:
from collections import Counter
In [34]:
counter = Counter()
In [ ]:
total_sentiment_score = {}
normalized_scores = {}

file = open('output.txt')
for n, line in enumerate(file):
    parsed = json.loads(line)
    try:
        if parsed['lang'] != 'en':
            continue
    except:
        continue
    
    tweet = parsed['text'].strip()
    #get the score of the tweet
    score = evaluate_sentiment(tweet, scores)
    
    tokens = tokenize(tweet)
    #get the set of tokens which appear in the tweet
    #which are not already in our list of words with known
    #sentiment scores
    new_tokens = set(tokens) - set(scores.keys())
    sentiment_tokens = set(tokens) - new_tokens
    #update the number of tweets that each token has appeared in
    counter.update(new_tokens)
    
    #Update the total sentiment for each token
    #We weight sentiment by the number of 
    #sentiment carrying words in the tweet
    for token in new_tokens:
        if token not in total_sentiment_score.keys():
            total_sentiment_score[token] = float(score)/(len(sentiment_tokens)+1)
        else:
            total_sentiment_score[token] += int(score)/(len(sentiment_tokens)+1)

#compute the average sentiment for each word.
for word in total_sentiment_score.keys():

    normalized_scores[word] = float(total_sentiment_score[word])/counter[word]   
In [45]:
count = 0
for word in normalized_scores.keys():
    count += 1
    if count > 25:
        break
    print word + " " + str(normalized_scores[word]) 
@imrraven 0.0
#TrumpPence16 0.0
@diinnaad 1.0
EXPLAIN 0.333333333333
🎍 0.0
https://t.co/0DN15uBRlw 0.5
woods 0.0
hanging -0.25
LAST 0.0714285714286
#tatasteel 1.5
Gday 0.75
https://t.co/ShaIbAqVxH 0.0
Western 0.375
hermana 0.0
arranged 0.0
@spurs 0.0
https://t.co/cAafY9nvjY 0.0
⚾ 0.0
@sexiestmanalive 0.0
https://t.co/TYziPNOJQq 0.25
eZ 1.0
Secularism 0.0
appropriation -1.0
@BornWitaCharm 1.0
BOOST 0.0

Computing Term Frequency

Next, we will compute the percentage of tweets that each term appears in our corpus of tweets.

The frequency of a term can be calculated as [# of occurrences of the term in all tweets]/[# of occurrences of all terms in all tweets]

In [46]:
counter_all = Counter()
term_freq = {}

file = open('output.txt')
for n, line in enumerate(file):

    parsed = json.loads(line)
    try:
        if parsed['lang'] != 'en':
            continue
    except:
        continue
    tweet = parsed['text'].strip()

    tokens = tokenize(tweet)
    counter_all.update(tokens)
    
normalizer = sum(counter_all.values()) #number of occurrences of all words in all tweets
#normalize the frequency of terms
for key in counter_all.keys():
    term_freq[key] = float(counter_all[key])/normalizer

We are interested in the n most common terms which occur in our corpus. We can use the library heapq to extract this information. heapq is a highly optimized library for sorting lists.

In [7]:
import heapq

def nlargest(n, word_scores):
    return heapq.nlargest(n, word_scores, key=lambda x: x[1]) 
In [58]:
ten_largest = nlargest(10, term_freq.items())
for term in ten_largest:
    print term[0] + " " + str(term[1])
: 0.0407534401764
. 0.0356177212513
RT 0.0347466630415
the 0.0143749779718
to 0.0140779118771
! 0.0125170561254
, 0.0124717409584
I 0.0116560679526
a 0.0110770408189
you 0.00962695547533

This list is not surprising or particularly impressive since we didn't eliminate words that appear commonly in all collections of documents and tweets. These are called stopwords, and we can see what happens if we include tools for nltk, including stopwords, and repeat the analysis.

In [64]:
from nltk.tokenize import TweetTokenizer
from nltk.corpus import stopwords
from nltk.tokenize import wordpunct_tokenize
import string
 
punctuation = list(string.punctuation)
stop_words = set(stopwords.words('english') + punctuation + ['rt', 'via', 'https', '://', 'co', '...', '. . .'])

tknzr = TweetTokenizer()
counter_all = Counter()
term_freq = {}

file = open('output.txt')
for n, line in enumerate(file):

    parsed = json.loads(line)
    try:
        if parsed['lang'] != 'en':
            continue
    except:
        continue
    tweet = parsed['text'].strip()

    #tokens = tokenize(tweet)
    tokens = tknzr.tokenize(tweet)
    tokens =  [i.lower() for i in wordpunct_tokenize(' '.join(tokens)) if i.lower() not in stop_words]
    
    counter_all.update(tokens)
    
normalizer = sum(counter_all.values())
for key in counter_all.keys():
    term_freq[key] = float(counter_all[key])/normalizer
    
ten_largest = nlargest(10, term_freq.items())
for term in ten_largest:
    print term[0] + " " + str(term[1])
… 0.0160572234461
πŸ˜‚ 0.00688561763161
halloween 0.00644316830588
like 0.0045904117544
love 0.00415718012296
happy 0.00392673776582
get 0.00372394849152
one 0.00329071686008
u 0.00313401605722
️ 0.00299575064293

We did a better job this time around. In general people seem to be happy about halloween, which is around when the corpus was mined, so this makes sense.

Which US State is the happiest?

We are interested in using our sentiment scores from before to determine which US state is the happiest, based on the highest average sentiment of tweets grouped by which state they are coming from.

We will take advantage of the coordinates field to geocode the tweet. This means that we will not have much data because this field is not always available. We will also show a few ways that we can go about determining the location of a tweet from other metadata in the tweet.

We don't actually need any of the libraries below to determine which state a tweet originates, but we will demonstrate how to use each of them in the context of this problem.

In [8]:
import matplotlib.path as mplPath
import shapefile
from geopy.geocoders import Nominatim
import matplotlib.pyplot as plt
%matplotlib inline

We can use geopy to determine the location given a pair of coordinates, but there are other methods available as well. It can also be used to standardize location given messy location data, such as various representations or misspellings of San Diego, like san diego, san diegoe, or "San Diego, CA". In a tweet parsed loaded in with json.load, this would appear in parsed['user']['location'] if it existed. This field tends to be more commonly available in tweets, but the disadvantage is that you must have access to the internet in order to use geopy or other similar geolocating services. With coordinates it is possible to use offline methods to determine the state of origin of a tweet.

In [12]:
geolocator = Nominatim()
l1 = geolocator.geocode("San Diego")
l2 = geolocator.geocode("san diego")
l3 = geolocator.geocode("san diegoe")
l4 = geolocator.geocode("San Diego, CA")

In [14]:
print l1.address
print l2.address
print l3.address
print l4.address
San Diego, San Diego County, California, United States of America
San Diego, San Diego County, California, United States of America
San Diego, San Diego County, California, United States of America
San Diego, San Diego County, California, United States of America

We might then extract information about the state as

In [16]:
state_info = l1.address.split(',')[-2]
state_info
Out[16]:
u' California'

Here is how we can get location information from a set of coordinates

In [24]:
l5 = geolocator.reverse("32.7, -117.2")
print(l5.address)
East B Road, Coronado, San Diego County, California, 92135, United States of America

We could then get information about the state by cross referencing with a list of states

In [26]:
import us
states = [state.name for state in us.states.STATES]
In [29]:
[address_item.strip() for address_item in l5.address.split(',') if address_item.strip() in states][0]
Out[29]:
u'California'

Or if we are thorough and check that the state always tends to appear in a specific place in the coordinates query, we can just use that list item

In [28]:
l5.address.split(',')[-3]
Out[28]:
u' California'

Using geolocating services is probably the most straightforward way to extract information about a tweet, but another approach is to use information about the boundaries of states, which can be found in shapefiles. These shapefiles can be found at the census website. With the information about the boundaries of states, we can then use a point in polygon test to determine what state a given coordinate is located. We can use the shapefile library in python to read in shapefiles and extract this information. Here the file is called 'tl_2016_us_state', and the information about the format of the file doesn't need to be specified.

In [32]:
sf = shapefile.Reader("tl_2016_us_state")

shapes = sf.shapes()

state_names = [state[6] for state in sf.records()]
state_names_abbreviations = [state[5] for state in sf.records()]

We can put up the skeleton of all of the US states and territories using this information

In [33]:
for j in range(len(list(sf.iterShapes()))):
    state_shape = shapes[j].points
    x = [i[0] for i in state_shape]
    y = [i[1] for i in state_shape]
    plt.plot(x,y)

Clearly this would need some work, such as translating the longitude by some fixed amount so that the United States appears centered and Alaska doesn't appear so messed up. We could also get rid of Alaska and the other territories, but we won't consider this further here.

We also note that each state has a bounding box which can be used as a quick and dirty way to determine whether a coordinate is in a state or not.

In [35]:
shapes[0].bbox
Out[35]:
[-82.64459099999999, 37.20154, -77.71951899999999, 40.638801]

Here we note that the coordinates for the bounding box would be as follows

bx1 = state_shape[0]

by1 = state_shape[1]

bx2 = state_shape[2]

by2 = state_shape[3]

state_shape = [(bx1, by1),(bx1, by2), (bx2, by2), (bx2, by1), (bx1, by1)]

We will implement our point in polygon test using matplotlib.path, which we import as mplPath here. mplPath accepts latitude and longitude in the reverse order from the the way they are represented in our shapefiles. We will include in the comments the corresponding code for using the user entered information about location as well.

In [ ]:
tweet_file = open('output.txt')
sentiment_file = open('AFINN-111.txt')

tweet_data = []
state_score = {}
state_count = {}
for n, line in enumerate(tweet_file):
    parsed = json.loads(line)
    try:
        #if parsed['lang'] != 'en' or parsed['geo'] == None:
        if parsed['lang'] != 'en' or parsed['user']['location'] == None:
            continue
    except:
        continue
    """
    place = parsed['user']['location']
    try:
        
        place = geolocator.geocode(place).address
    
        state_name = [address_item.strip() for address_item in place.split(',') if address_item.strip() in state_names][0]
    
        tweet_data.append({'text':tweet, 'sentiment score':score, 'state name':state_name})
    except:
        continue
    """
    
    coordinates = parsed['geo']['coordinates']
    tweet = parsed['text'].strip()
    score = evaluate_sentiment(tweet, scores)
    
    for j in range(len(list(sf.iterShapes()))):
        
        state_shape = shapes[j].points
        bbPath = mplPath.Path(state_shape)
        
        contains_point = bbPath.contains_point((coordinates[1], coordinates[0]))

        if contains_point:
            state_name = state_names[j]
            
            tweet_data.append({'text':tweet, 'sentiment score':score, 'state name':state_name})
            
            if state_abb not in state_score.keys():
                state_score[state_name] = score #start total sentiment score for each state
                state_count[state_name] = 1 #start a count for total number of tweets for each state
            else:
                state_score[state_name] += score
                state_count[state_name] += 1
            
            break
    

We find it most convenient to find the state with the highest

In [47]:
import pandas as pd

max_state = -100000
max_state_name = None

#Get the state with highest sentiment score

df = pd.DataFrame(tweet_data)
max_state = list(df.groupby('state name').mean().idxmax())[0]
df[df['state name'] == max_state]
Out[47]:
sentiment score state name text
8 5 Tennessee I want to wish a very Happy Birthday to this b...

If we didn't want to use pandas for some reason, we could find the max with this code instead

In [ ]:
#for state in state_score.keys():
#    state_score[state] = float(state_score[state])/state_count[state]
    
#    if state_score[state] > max_state:
#        max_state = state_score[state]
#        max_state_name = state

#print str(max_state_name)

Top 10 most common hashtags

Finally, we'd like to know what the top ten most frequently occurring hashtags are in the data we've gathered. The hashtags, fortunately, have already been extracted by twitter, which we can find in the parsed['entities']['hashtags'] field, where parsed is just a tweet that we have loaded in with json.

In [10]:
tweet_file = open('output.txt')

hashtag_counter = Counter()

tweet_data = []

for n, line in enumerate(tweet_file):
    parsed = json.loads(line)
    try:
        if parsed['lang'] != 'en' or parsed['entities']['hashtags'] == []:
            continue
    except:
        continue
        
    hashtags = [entity['text'] for entity in parsed['entities']['hashtags']]
    hashtag_counter.update(hashtags)

ten_largest = nlargest(10, hashtag_counter.items())

for hashtag in ten_largest:
    print hashtag[0] + " " + str(hashtag[1])
ALDUBinEurope 64
HappyHalloween 61
EMABiggestFansArianaGrande 44
instagram 43
Instagram 37
InstagramForBusiness 37
Jewelry 35
Halloween 26
free 24
halloween 24

Not surprisingly, we find a bit of an emphasis on halloween, pictures, and some other trending topics.