This article demonstrates how to perform sentiment analysis on Twitter tweets using Python and the Natural Language Toolkit (NLTK).
Sentiment Analysis examines the sentiment expressed in a text or document and assigns it a label such as positive or negative. Finer-grained categories, such as neutral, highly positive, or highly negative, can also be used.
Sentiment Analysis, also known as Opinion Mining, is commonly applied to social media data and customer reviews.
Supervised Classification
In this article, we will focus on supervised text classification, where the classifier is trained using labeled data.
We will use the twitter_samples corpus from NLTK as our labeled training dataset. This corpus consists of tweets, 10,000 of which carry a predefined sentiment polarity.
Our classification task involves two categories: positive and negative. The twitter_samples corpus already categorizes the tweets into these two sentiment classes.
The twitter_samples corpus includes three files:
- negative_tweets.json: Contains 5,000 negative tweets.
- positive_tweets.json: Contains 5,000 positive tweets.
- tweets.20150430-223406.json: Contains 20,000 tweets without sentiment labels.
# If the corpus is missing, download it once with:
#   import nltk; nltk.download('twitter_samples')
from nltk.corpus import twitter_samples
print (twitter_samples.fileids())
'''
Output:
['negative_tweets.json', 'positive_tweets.json', 'tweets.20150430-223406.json']
'''
pos_tweets = twitter_samples.strings('positive_tweets.json')
print (len(pos_tweets)) # Output: 5000
neg_tweets = twitter_samples.strings('negative_tweets.json')
print (len(neg_tweets)) # Output: 5000
all_tweets = twitter_samples.strings('tweets.20150430-223406.json')
print (len(all_tweets)) # Output: 20000
for tweet in pos_tweets[:5]:
    print (tweet)
'''
Output:
#FollowFriday @France_Inte @PKuchly57 @Milipol_Paris for being top engaged members in my community this week :)
@Lamb2ja Hey James! How odd :/ Please call our Contact Centre on 02392441234 and we will be able to assist you :) Many thanks!
@DespiteOfficial we had a listen last night :) As You Bleed is an amazing track. When are you in Scotland?!
@97sides CONGRATS :)
yeaaaah yippppy!!! my accnt verified rqst has succeed got a blue tick mark on my fb profile :) in 15 days
'''
Tokenize Tweets
NLTK provides a TweetTokenizer class that splits a tweet into a list of tokens while keeping Twitter-specific items such as hashtags and emoticons intact.
When initializing the TweetTokenizer class, you can specify three parameters:
- preserve_case: If False, the tokenizer converts the tweet to lowercase. If True, it keeps the original capitalization.
- strip_handles: If True, the tokenizer removes Twitter handles from the tweet. If False, it keeps the handles in the text.
- reduce_len: If True, the tokenizer shortens elongated words such as "hurrayyyy" and "yipppiieeee". If False, it leaves them unchanged.
from nltk.tokenize import TweetTokenizer
tweet_tokenizer = TweetTokenizer(preserve_case=False, strip_handles=True, reduce_len=True)
for tweet in pos_tweets[:5]:
    print (tweet_tokenizer.tokenize(tweet))
'''
Output:
['#followfriday', 'for', 'being', 'top', 'engaged', 'members', 'in', 'my', 'community', 'this', 'week', ':)']
['hey', 'james', '!', 'how', 'odd', ':/', 'please', 'call', 'our', 'contact', 'centre', 'on', '02392441234', 'and', 'we', 'will', 'be', 'able', 'to', 'assist', 'you', ':)', 'many', 'thanks', '!']
['we', 'had', 'a', 'listen', 'last', 'night', ':)', 'as', 'you', 'bleed', 'is', 'an', 'amazing', 'track', '.', 'when', 'are', 'you', 'in', 'scotland', '?', '!']
['congrats', ':)']
['yeaaah', 'yipppy', '!', '!', '!', 'my', 'accnt', 'verified', 'rqst', 'has', 'succeed', 'got', 'a', 'blue', 'tick', 'mark', 'on', 'my', 'fb', 'profile', ':)', 'in', '15', 'days']
'''
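To make the reduce_len option concrete, the sketch below approximates it with a single regular expression that caps any run of a repeated character at three. The function name shorten_elongated is ours, not part of NLTK, and the real tokenizer may differ in details.

```python
import re

# Any character followed by two or more copies of itself (a run of 3+).
_REPEATS = re.compile(r"(.)\1{2,}")

def shorten_elongated(text):
    """Cap every run of a repeated character at three occurrences."""
    return _REPEATS.sub(r"\1\1\1", text)

print(shorten_elongated("yeaaaah yippppy!!!"))  # yeaaah yipppy!!!
```

This matches the 'yeaaah' and 'yipppy' tokens in the output above: runs are shortened to three characters, not collapsed to one.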
Cleaning Tweets
During the tweet cleaning process, we will perform the following steps:
- Remove stock market tickers such as $GE.
- Remove the retweet marker RT.
- Remove hyperlinks.
- Remove the hash sign (#) from hashtags while keeping the word itself.
- Remove common stop words such as "a", "and", "the", "is", and "are".
- Remove emoticons such as :), :D, :(, and :-).
- Remove punctuation marks such as periods, commas, and exclamation points.
- Reduce words to their stems using the Porter Stemming Algorithm. For example, "working", "works", and "worked" are all reduced to "work".
We will implement a function called clean_tweets that returns a list of words from a given tweet after removing the specified elements.
import string
import re
from nltk.corpus import stopwords
stopwords_english = stopwords.words('english')
from nltk.stem import PorterStemmer
stemmer = PorterStemmer()
from nltk.tokenize import TweetTokenizer
# Happy Emoticons
emoticons_happy = set([
':-)', ':)', ';)', ':o)', ':]', ':3', ':c)', ':>', '=]', '8)', '=)', ':}',
':^)', ':-D', ':D', '8-D', '8D', 'x-D', 'xD', 'X-D', 'XD', '=-D', '=D',
'=-3', '=3', ':-))', ":'-)", ":')", ':*', ':^*', '>:P', ':-P', ':P', 'X-P',
'x-p', 'xp', 'XP', ':-p', ':p', '=p', ':-b', ':b', '>:)', '>;)', '>:-)',
'<3'
])
# Sad Emoticons
emoticons_sad = set([
':L', ':-/', '>:/', ':S', '>:[', ':@', ':-(', ':[', ':-||', '=L', ':<',
':-[', ':-<', '=\\', '=/', '>:(', ':(', '>.<', ":'-(", ":'(", ':\\', ':-c',
':c', ':{', '>:\\', ';('
])
# all emoticons (happy + sad)
emoticons = emoticons_happy.union(emoticons_sad)
def clean_tweets(tweet):
    # remove stock market tickers like $GE
    tweet = re.sub(r'\$\w*', '', tweet)
    # remove old-style retweet text "RT"
    tweet = re.sub(r'^RT[\s]+', '', tweet)
    # remove hyperlinks
    tweet = re.sub(r'https?:\/\/.*[\r\n]*', '', tweet)
    # remove only the hash sign (#), keeping the hashtag word
    tweet = re.sub(r'#', '', tweet)
    # tokenize: lowercase, strip @handles, shorten elongated words
    tokenizer = TweetTokenizer(preserve_case=False, strip_handles=True, reduce_len=True)
    tweet_tokens = tokenizer.tokenize(tweet)
    tweets_clean = []
    for word in tweet_tokens:
        if (word not in stopwords_english and  # remove stopwords
                word not in emoticons and  # remove emoticons
                word not in string.punctuation):  # remove punctuation
            stem_word = stemmer.stem(word)  # stem the word
            tweets_clean.append(stem_word)
    return tweets_clean
custom_tweet = "RT @Twitter @mavenbird Hello There! Have a great day. :) #good #morning http://mavenbird.com.np"
# print cleaned tweet
print (clean_tweets(custom_tweet))
'''
Output:
['hello', 'great', 'day', 'good', 'morn']
'''
print (pos_tweets[5])
'''
Output:
@User1 @User2 This one is irresistible :)
#FlipkartFashionFriday http://t.co/EbZ0L2VENM
'''
print (clean_tweets(pos_tweets[5]))
'''
Output:
['one', 'irresist', 'flipkartfashionfriday']
'''
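One detail of the hyperlink pattern is worth checking on your own data: because `.*` is greedy, it removes everything from the URL to the end of the line, including any words that follow the link. The sketch below compares it with a narrower pattern, `https?://\S+`, which is our suggestion rather than part of the original code and stops at the first whitespace. The sample sentence is made up.

```python
import re

text = "Loving this http://t.co/abc123 so much"

# Pattern used in clean_tweets: greedy, runs to the end of the line.
greedy = re.sub(r'https?:\/\/.*[\r\n]*', '', text)

# Narrower alternative (an assumption, not the article's code):
# match only the non-whitespace characters that make up the URL.
narrow = re.sub(r'https?://\S+', '', text)

print(repr(greedy))  # 'Loving this '
print(repr(narrow))  # 'Loving this  so much'
```

Switching patterns changes which tokens survive cleaning, so the accuracy figures reported later would no longer match exactly.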
Feature Extraction
We define a basic bag_of_words function to extract unigram features from the tweets.
# feature extractor function
def bag_of_words(tweet):
    words = clean_tweets(tweet)
    # mark each distinct word as present
    words_dictionary = {word: True for word in words}
    return words_dictionary
custom_tweet = "RT @Twitter @mavenbird Hello There! Have a great day. :) #good #morning https://www.mavenbird.com/"
print (bag_of_words(custom_tweet))
'''
Output:
{'great': True, 'good': True, 'morn': True, 'hello': True, 'day': True}
'''
# positive tweets feature set
pos_tweets_set = []
for tweet in pos_tweets:
    pos_tweets_set.append((bag_of_words(tweet), 'pos'))
# negative tweets feature set
neg_tweets_set = []
for tweet in neg_tweets:
    neg_tweets_set.append((bag_of_words(tweet), 'neg'))
print (len(pos_tweets_set), len(neg_tweets_set)) # Output: 5000 5000
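Because each word maps to True, the feature dictionary records only whether a word occurs, not how often. A minimal illustration with a hand-written token list (the tokens are made up for the example):

```python
# Tokens as they might come back from clean_tweets; "good" appears twice.
tokens = ["good", "day", "good", "morn"]

# Same construction as bag_of_words: one key per distinct word.
features = {word: True for word in tokens}

print(features)       # {'good': True, 'day': True, 'morn': True}
print(len(features))  # 3
```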
Create Train and Test Set
We have 5,000 positive tweets and 5,000 negative tweets. We will use 20% of each set (1,000 positive and 1,000 negative tweets) as the test set and the remaining 8,000 tweets as the training set.
# randomize pos_tweets_set and neg_tweets_set
# because of the shuffle, the accuracy differs slightly on every run
from random import shuffle
shuffle(pos_tweets_set)
shuffle(neg_tweets_set)
test_set = pos_tweets_set[:1000] + neg_tweets_set[:1000]
train_set = pos_tweets_set[1000:] + neg_tweets_set[1000:]
print(len(test_set), len(train_set)) # Output: 2000 8000
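If you want the same split, and therefore the same accuracy, on every run, one option is to shuffle with a seeded random.Random instance. The sketch below uses small stand-in lists instead of the real feature sets; the helper name split and the seed value 42 are arbitrary choices of ours.

```python
from random import Random

def split(items, test_size, seed=42):
    """Shuffle a copy of items reproducibly and return (test, train)."""
    shuffled = list(items)          # leave the caller's list untouched
    Random(seed).shuffle(shuffled)  # same seed -> same order every run
    return shuffled[:test_size], shuffled[test_size:]

# Stand-ins for pos_tweets_set and neg_tweets_set.
pos = [("pos_%d" % i, "pos") for i in range(10)]
neg = [("neg_%d" % i, "neg") for i in range(10)]

pos_test, pos_train = split(pos, test_size=2)
neg_test, neg_train = split(neg, test_size=2)

test_set_demo = pos_test + neg_test
train_set_demo = pos_train + neg_train
print(len(test_set_demo), len(train_set_demo))  # 4 16
```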
Training Classifier and Calculating Accuracy
We train the Naive Bayes Classifier with the training set and then evaluate its classification accuracy using the test set.
from nltk import classify
from nltk import NaiveBayesClassifier
classifier = NaiveBayesClassifier.train(train_set)
accuracy = classify.accuracy(classifier, test_set)
print(accuracy) # Output: 0.765
classifier.show_most_informative_features(10) # prints the table itself and returns None
'''
Output:
Most Informative Features
via = True pos : neg = 37.0 : 1.0
glad = True pos : neg = 25.0 : 1.0
sad = True neg : pos = 22.6 : 1.0
aw = True neg : pos = 21.7 : 1.0
bam = True pos : neg = 21.0 : 1.0
x15 = True neg : pos = 19.7 : 1.0
appreci = True pos : neg = 17.7 : 1.0
arriv = True pos : neg = 15.0 : 1.0
ugh = True neg : pos = 14.3 : 1.0
justin = True neg : pos = 13.0 : 1.0
'''
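The ratios in this table compare how likely a feature is under one label versus the other. As a rough sketch, assume each probability is estimated with add-0.5 smoothing over two possible feature values (present or absent), which is our understanding of NLTK's default estimator. The counts below are hypothetical and chosen only to show how a ratio such as 37.0 : 1.0 can arise.

```python
def smoothed_prob(count, total, bins=2, gamma=0.5):
    """Expected-likelihood estimate: (count + gamma) / (total + bins * gamma)."""
    return (count + gamma) / (total + bins * gamma)

# Hypothetical counts: a word seen in 55 of 4,000 positive training
# tweets and in 1 of 4,000 negative training tweets.
p_word_given_pos = smoothed_prob(55, 4000)
p_word_given_neg = smoothed_prob(1, 4000)

ratio = p_word_given_pos / p_word_given_neg
print("pos : neg = %.1f : 1.0" % ratio)  # pos : neg = 37.0 : 1.0
```

Smoothing keeps the ratio finite even when a word never appears under one of the labels.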
Testing Classifier with Custom Tweet
We input a custom tweet and observe how the trained classifier labels it. In the two examples below, the classifier labels the negative tweet as negative and the positive tweet as positive.
custom_tweet = "I hated the film. It was a disaster. Poor direction, bad acting."
custom_tweet_set = bag_of_words(custom_tweet)
print (classifier.classify(custom_tweet_set)) # Output: neg
# Negative tweet correctly classified as negative
# probability result
prob_result = classifier.prob_classify(custom_tweet_set)
print (prob_result) # Output: <ProbDist with 2 samples>
print (prob_result.max()) # Output: neg
print (prob_result.prob("neg")) # Output: 0.941844352481
print (prob_result.prob("pos")) # Output: 0.0581556475194
custom_tweet = "It was a wonderful and amazing movie.Best direction, good acting."
custom_tweet_set = bag_of_words(custom_tweet)
print (classifier.classify(custom_tweet_set)) # Output: pos
# Positive tweet correctly classified as positive
# probability result
prob_result = classifier.prob_classify(custom_tweet_set)
print (prob_result) # Output: <ProbDist with 2 samples>
print (prob_result.max()) # Output: pos
print (prob_result.prob("neg")) # Output: 0.00131055449755
print (prob_result.prob("pos")) # Output: 0.998689445502
Precision, Recall & F1-Score
Accuracy is calculated as the ratio of correctly predicted observations to the total number of observations.
Precision measures how accurate the predictions are:
- It indicates how many of the predicted positive results were actually correct.
- For instance, if you attempt only 1 of 100 questions and answer it correctly, your precision is 100%, even though you left the other 99 unanswered.
- Precision assesses how often the classifier's predictions are correct.
Recall, in contrast to precision, focuses on the classifier's ability to identify all relevant instances:
- It measures how well the classifier detects all positive cases.
- It evaluates how often the classifier correctly predicts "yes" when the actual result is "yes."
The F1 Score, or F-measure, is the harmonic mean of precision and recall, providing a single metric that balances both aspects.
These metrics are built from four counts. Consider a cancer screening test, where "positive" means the patient is diagnosed with cancer:
True Positive (TP): The number of patients who have cancer and were correctly diagnosed as having cancer.
True Negative (TN): The number of patients who do not have cancer and were correctly identified as not having cancer.
False Positive (FP): The number of patients who do not have cancer but were mistakenly diagnosed as having cancer (also known as Type I error).
False Negative (FN): The number of patients who have cancer but were incorrectly diagnosed as not having cancer (also known as Type II error).
- Accuracy: (TP + TN) / (TP + TN + FP + FN)
- Precision: TP / (TP + FP)
- Recall: TP / (TP + FN)
- F1 Score: 2 * (precision * recall) / (precision + recall)
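As a quick check, the formulas can be applied by hand to the four counts from the confusion matrix reported later in this article (TP = 769, TN = 761, FP = 239, FN = 231, with "pos" as the positive class). The variable names are ours.

```python
# Counts taken from the article's confusion matrix ("pos" is the positive class).
tp, tn, fp, fn = 769, 761, 239, 231

accuracy_pos = (tp + tn) / (tp + tn + fp + fn)
precision_pos = tp / (tp + fp)
recall_pos = tp / (tp + fn)
f1_pos = 2 * precision_pos * recall_pos / (precision_pos + recall_pos)

print(round(accuracy_pos, 4))   # 0.765
print(round(precision_pos, 4))  # 0.7629
print(round(recall_pos, 4))     # 0.769
print(round(f1_pos, 4))         # 0.7659
```

These values agree with the accuracy of 0.765 and the "pos" precision, recall, and F-measure that NLTK reports for the test set.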
from collections import defaultdict
from nltk.metrics import precision, recall, f_measure, ConfusionMatrix
# indices of test items, grouped by actual label and by predicted label
actual_set = defaultdict(set)
predicted_set = defaultdict(set)
# label sequences used later for the confusion matrix
actual_set_cm = []
predicted_set_cm = []
for index, (feature, actual_label) in enumerate(test_set):
    actual_set[actual_label].add(index)
    actual_set_cm.append(actual_label)
    predicted_label = classifier.classify(feature)
    predicted_set[predicted_label].add(index)
    predicted_set_cm.append(predicted_label)
print('pos precision:', precision(actual_set['pos'], predicted_set['pos'])) # Output: pos precision: 0.762896825397
print('pos recall:', recall(actual_set['pos'], predicted_set['pos'])) # Output: pos recall: 0.769
print('pos F-measure:', f_measure(actual_set['pos'], predicted_set['pos'])) # Output: pos F-measure: 0.76593625498
print('neg precision:', precision(actual_set['neg'], predicted_set['neg'])) # Output: neg precision: 0.767137096774
print('neg recall:', recall(actual_set['neg'], predicted_set['neg'])) # Output: neg recall: 0.761
print('neg F-measure:', f_measure(actual_set['neg'], predicted_set['neg'])) # Output: neg F-measure: 0.7640562249
Confusion Matrix
The Confusion Matrix is a table used to describe the performance of a classifier.
The Confusion Matrix is represented in the following format:
'''
           | Predicted NO        | Predicted YES       |
-----------+---------------------+---------------------+
Actual NO  | True Negative (TN)  | False Positive (FP) |
Actual YES | False Negative (FN) | True Positive (TP)  |
-----------+---------------------+---------------------+
'''
The output of the confusion matrix below illustrates the performance of our trained classifier.
- 761 negative tweets were correctly classified as negative (TN).
- 239 negative tweets were incorrectly classified as positive (FP).
- 231 positive tweets were incorrectly classified as negative (FN).
- 769 positive tweets were correctly classified as positive (TP).
# Confusion Matrix for the test set
# rows = actual labels (actual_set_cm)
# columns = predicted labels (predicted_set_cm)
cm = ConfusionMatrix(actual_set_cm, predicted_set_cm)
print (cm)
'''
Output:
    |   n   p |
    |   e   o |
    |   g   s |
----+---------+
neg |<761>239 |
pos | 231<769>|
----+---------+
(row = reference; col = test)
'''
print (cm.pretty_format(sort_by_count=True, show_percents=True, truncate=9))
'''
Output:
    |      n      p |
    |      e      o |
    |      g      s |
----+---------------+
neg | <38.0%> 11.9% |
pos |  11.6% <38.5%>|
----+---------------+
(row = reference; col = test)
'''
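For intuition, the same kind of table can be tallied without NLTK by counting (actual, predicted) pairs. The sketch below uses a handful of made-up labels rather than the real test set.

```python
from collections import Counter

# Made-up labels for six tweets.
actual    = ["pos", "pos", "pos", "neg", "neg", "neg"]
predicted = ["pos", "neg", "pos", "neg", "neg", "pos"]

# Count each (actual, predicted) pair.
pairs = Counter(zip(actual, predicted))

# One row per actual label, one column per predicted label.
labels = ["neg", "pos"]
for a in labels:
    print(a, [pairs[(a, p)] for p in labels])
# neg [2, 1]
# pos [1, 2]
```

The diagonal entries are the correct predictions; everything off the diagonal is an error.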