How to Master Text Classification and Sentiment Analysis with Python and NLTK

This article explores the application of various feature sets to train three distinct classifiers: Naive Bayes Classifier, Maximum Entropy (MaxEnt) Classifier, and Support Vector Machine (SVM) Classifier.

Feature set generation utilizes methods including Bag of Words, Stopword Filtering, and Bigram Collocations.

The training dataset is created from text reviews in the Yelp Academic Dataset.

The evaluation process involves cross-validation to assess model performance.

The code referenced in this article is adapted from a StreamHacker article.

The implementation uses Python together with the Natural Language Toolkit (NLTK) library. The code is written for Python 2, as its print statements show.

For the dataset, 500 positive reviews (5-star ratings) and 500 negative reviews (1-star ratings) were selected from the Yelp dataset. The positive reviews are stored in a CSV file named positive-data.csv, while the negative reviews are kept in negative-data.csv.

Download: Positive and Negative Training Data

In the process, certain words (such as "no," "not," "more," "most," "below," "over," "too," "very," etc.) have been removed from the standard stopwords list available in NLTK. This adjustment is made because these words may influence sentiment in our review dataset.

stopset = set(stopwords.words('english')) - set(('over', 'under', 'below', 'more', 'most', 'no', 'not', 'only', 'such', 'few', 'so', 'too', 'very', 'just', 'any', 'once'))
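To illustrate why these words are kept, here is a minimal Python 3 sketch. It assumes a tiny hand-written stopword list (`BASE_STOPWORDS`, a hypothetical stand-in for the much longer NLTK list) and a hypothetical helper `filtered_feats`; it only shows how the negation in a review survives once "not" is taken out of the stopword set.

```python
# Illustrative stand-in for NLTK's English stopword list (assumption: the
# real list is much longer and comes from stopwords.words('english')).
BASE_STOPWORDS = {"the", "was", "is", "a", "and", "not", "very", "too"}

# Words that can carry sentiment are subtracted from the stopword set.
SENTIMENT_WORDS = {"not", "very", "too"}
stopset = BASE_STOPWORDS - SENTIMENT_WORDS


def filtered_feats(words, stopwords):
    """Bag-of-words features that skip every word found in `stopwords`."""
    return {word: True for word in words if word not in stopwords}


review = "the food was not good".split()

with_full_list = filtered_feats(review, BASE_STOPWORDS)
with_adjusted_list = filtered_feats(review, stopset)

print(sorted(with_full_list))      # ['food', 'good']
print(sorted(with_adjusted_list))  # ['food', 'good', 'not']
```

With the unmodified list, "not good" collapses to "good", which is the opposite of what the reviewer meant.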

One quarter (1/4) of the data is allocated as the test feature set, while the remaining three quarters (3/4) is used as the training feature set.

  • Test Set: 25% of positive data + 25% of negative data
  • Training Set: 75% of positive data + 75% of negative data
negfeats = [(featx(f), 'neg') for f in word_split(negdata)]
posfeats = [(featx(f), 'pos') for f in word_split(posdata)]

# Integer division keeps the cutoff usable as a slice index
negcutoff = len(negfeats)*3//4
poscutoff = len(posfeats)*3//4

trainfeats = negfeats[:negcutoff] + posfeats[:poscutoff]
testfeats = negfeats[negcutoff:] + posfeats[poscutoff:]
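The split can be checked in isolation with a small Python 3 sketch. The feature dictionaries below are placeholders and `split_three_quarters` is a hypothetical helper name, not part of the article's code; only the slicing logic mirrors the lines above.

```python
def split_three_quarters(feats):
    """Return (train, test), with the first 75% of the items in train."""
    cutoff = len(feats) * 3 // 4  # integer division, so it is a valid slice index
    return feats[:cutoff], feats[cutoff:]


# Placeholder feature sets: 500 labelled items per class, as in the article.
negfeats = [({"word%d" % i: True}, "neg") for i in range(500)]
posfeats = [({"word%d" % i: True}, "pos") for i in range(500)]

neg_train, neg_test = split_three_quarters(negfeats)
pos_train, pos_test = split_three_quarters(posfeats)

trainfeats = neg_train + pos_train
testfeats = neg_test + pos_test

print(len(trainfeats), len(testfeats))  # 750 250
```

With 500 reviews per class this gives 375 training and 125 test items for each label.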

Three separate classifiers are trained and evaluated on the same feature sets.

Training Naive Bayes Classifier

classifier = NaiveBayesClassifier.train(trainfeats)

Training Maximum Entropy Classifier

I used the Generalized Iterative Scaling (GIS) algorithm for training. Other available algorithms include Improved Iterative Scaling (IIS) and LM-BFGS, the latter with training performed by the external Megam package (megam). For further details, visit: NLTK MaxEnt Documentation.

classifier = MaxentClassifier.train(trainfeats, 'GIS', trace=0, encoding=None, labels=None, sparse=True, gaussian_prior_sigma=0, max_iter = 1)

Training Support Vector Machine Classifier

I used the Linear Support Vector Classification model (LinearSVC) through NLTK's scikit-learn wrapper. BernoulliNB or LogisticRegression can be used in place of LinearSVC. For additional information, visit: NLTK Scikit-Learn Documentation.

classifier = SklearnClassifier(LinearSVC(), sparse=False)
classifier.train(trainfeats)

The effectiveness of the classification is assessed through several metrics including overall Accuracy, Precision, Recall, and F-measure.
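As a rough illustration of these metrics, the Python 3 sketch below computes them from sets of item indices, the same reference-set and test-set representation the full code builds before calling nltk.metrics. The three functions are plain re-implementations written for this example, and the labels are made-up toy data.

```python
def precision(reference, test):
    """Share of items predicted for a label that really have that label."""
    return len(reference & test) / len(test) if test else None


def recall(reference, test):
    """Share of items with a label that were predicted as that label."""
    return len(reference & test) / len(reference) if reference else None


def f_measure(reference, test):
    """Harmonic mean of precision and recall."""
    p, r = precision(reference, test), recall(reference, test)
    if p is None or r is None:
        return None
    if p == 0 or r == 0:
        return 0.0
    return 2 * p * r / (p + r)


# Toy data: true labels and classifier output for eight reviews.
gold = ["pos", "pos", "pos", "pos", "neg", "neg", "neg", "neg"]
pred = ["pos", "pos", "pos", "pos", "pos", "pos", "neg", "neg"]

ref_pos = {i for i, label in enumerate(gold) if label == "pos"}
ref_neg = {i for i, label in enumerate(gold) if label == "neg"}
test_pos = {i for i, label in enumerate(pred) if label == "pos"}
test_neg = {i for i, label in enumerate(pred) if label == "neg"}

accuracy = sum(g == p for g, p in zip(gold, pred)) / len(gold)
avg_precision = (precision(ref_pos, test_pos) + precision(ref_neg, test_neg)) / 2
avg_recall = (recall(ref_pos, test_pos) + recall(ref_neg, test_neg)) / 2

print("accuracy:", accuracy)
print("precision", round(avg_precision, 3))
print("recall", avg_recall)
```

When both classes are the same size, as in the 125/125 test split used here, the class-averaged recall equals accuracy, which is why those two numbers coincide in the single-fold results.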

Additionally, cross-validation is employed for evaluation. This process involves combining the positive and negative features, then randomly shuffling them. Shuffling is crucial to ensure that each test set contains a mix of both positive and negative data. In the provided code, n represents the number of folds, where n = 5 indicates a 5-fold cross-validation.

trainfeats = negfeats + posfeats	
random.shuffle(trainfeats)	
n = 5
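The fold slicing can be sketched on its own in Python 3. The generator `k_fold_slices` is a hypothetical helper written for this example, and plain labels stand in for the real feature tuples; it shows both the fold sizes and what goes wrong without shuffling.

```python
import random


def k_fold_slices(items, n):
    """Yield (train, test) pairs for n-fold cross-validation."""
    subset_size = len(items) // n
    for fold in range(n):
        start = fold * subset_size
        stop = start + subset_size
        test = items[start:stop]
        train = items[:start] + items[stop:]
        yield train, test


# Without shuffling, the first fold contains negative items only.
unshuffled = ["neg"] * 500 + ["pos"] * 500
first_train, first_test = next(k_fold_slices(unshuffled, 5))
print(set(first_test))  # {'neg'}

# After shuffling, every fold holds a mix of both labels.
labels = list(unshuffled)
random.seed(0)  # fixed seed only so the sketch is repeatable
random.shuffle(labels)

folds = list(k_fold_slices(labels, 5))
for train, test in folds:
    print(len(train), len(test), test.count("pos"))
```

Each of the five rounds trains on 800 items and tests on the remaining 200.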

Evaluation can be performed using various feature sets, such as:

  • All words feature set
  • All words feature set with stopword filtering
  • Bigram word feature set
  • Bigram word feature set with stopword filtering
evaluate_classifier(word_feats)
evaluate_classifier(stopword_filtered_word_feats)
evaluate_classifier(bigram_word_feats)	
evaluate_classifier(bigram_word_feats_stopwords)
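To make the difference between the word and bigram feature sets concrete, here is a simplified Python 3 sketch. Note the swap: `all_bigram_word_feats` is a hypothetical helper that adds every adjacent word pair, whereas the article's bigram_word_feats keeps only the 200 best collocations ranked by chi-square.

```python
import itertools


def word_feats(words):
    """Bag of words: every word becomes a feature set to True."""
    return {word: True for word in words}


def all_bigram_word_feats(words):
    """Single-word features plus every adjacent word pair (no scoring)."""
    bigrams = zip(words, words[1:])
    return {ngram: True for ngram in itertools.chain(words, bigrams)}


words = "not very good".split()

print(sorted(word_feats(words)))  # ['good', 'not', 'very']

feats = all_bigram_word_feats(words)
print(("not", "very") in feats, ("very", "good") in feats)  # True True
print(len(feats))  # 5
```

The bigram features are tuples stored next to the single-word string features, so a classifier can learn from a pair such as ("not", "very") as well as from the individual words.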

Here is the full code:

import collections
import nltk.classify.util, nltk.metrics
from nltk.classify import NaiveBayesClassifier, MaxentClassifier, SklearnClassifier
import csv
from sklearn.svm import LinearSVC
import random
from nltk.corpus import stopwords
import itertools
from nltk.collocations import BigramCollocationFinder
from nltk.metrics import BigramAssocMeasures

posdata = []
with open('positive-data.csv', 'rb') as myfile:	
	reader = csv.reader(myfile, delimiter=',')
	for val in reader:
		posdata.append(val[0])		

negdata = []
with open('negative-data.csv', 'rb') as myfile:	
	reader = csv.reader(myfile, delimiter=',')
	for val in reader:
		negdata.append(val[0])			

def word_split(data):	
	data_new = []
	for word in data:
		word_filter = [i.lower() for i in word.split()]
		data_new.append(word_filter)
	return data_new

def word_split_sentiment(data):
	data_new = []
	for (word, sentiment) in data:
		word_filter = [i.lower() for i in word.split()]
		data_new.append((word_filter, sentiment))
	return data_new
	
def word_feats(words):	
	return dict([(word, True) for word in words])

stopset = set(stopwords.words('english')) - set(('over', 'under', 'below', 'more', 'most', 'no', 'not', 'only', 'such', 'few', 'so', 'too', 'very', 'just', 'any', 'once'))
     
def stopword_filtered_word_feats(words):
    return dict([(word, True) for word in words if word not in stopset])

def bigram_word_feats(words, score_fn=BigramAssocMeasures.chi_sq, n=200):
    # Add the n highest-scoring bigram collocations to the single-word features
    bigram_finder = BigramCollocationFinder.from_words(words)
    bigrams = bigram_finder.nbest(score_fn, n)
    return dict([(ngram, True) for ngram in itertools.chain(words, bigrams)])
    
def bigram_word_feats_stopwords(words, score_fn=BigramAssocMeasures.chi_sq, n=200):
    # Same as bigram_word_feats, but single words found in stopset are dropped
    # (bigram tuples never equal a stopword string, so all bigrams are kept)
    bigram_finder = BigramCollocationFinder.from_words(words)
    bigrams = bigram_finder.nbest(score_fn, n)
    return dict([(ngram, True) for ngram in itertools.chain(words, bigrams) if ngram not in stopset])

# Calculating Precision, Recall & F-measure
def evaluate_classifier(featx):
	
	negfeats = [(featx(f), 'neg') for f in word_split(negdata)]
	posfeats = [(featx(f), 'pos') for f in word_split(posdata)]
	    
	# Integer division keeps the cutoff usable as a slice index
	negcutoff = len(negfeats)*3//4
	poscutoff = len(posfeats)*3//4

	trainfeats = negfeats[:negcutoff] + posfeats[:poscutoff]
	testfeats = negfeats[negcutoff:] + posfeats[poscutoff:]
	
	# using 3 classifiers
	classifier_list = ['nb', 'maxent', 'svm'] 	
		
	for cl in classifier_list:
		if cl == 'maxent':
			classifierName = 'Maximum Entropy'
			classifier = MaxentClassifier.train(trainfeats, 'GIS', trace=0, encoding=None, labels=None, sparse=True, gaussian_prior_sigma=0, max_iter = 1)
		elif cl == 'svm':
			classifierName = 'SVM'
			classifier = SklearnClassifier(LinearSVC(), sparse=False)
			classifier.train(trainfeats)
		else:
			classifierName = 'Naive Bayes'
			classifier = NaiveBayesClassifier.train(trainfeats)
			
		refsets = collections.defaultdict(set)
		testsets = collections.defaultdict(set)

		for i, (feats, label) in enumerate(testfeats):
				refsets[label].add(i)
				observed = classifier.classify(feats)
				testsets[observed].add(i)

		accuracy = nltk.classify.util.accuracy(classifier, testfeats)
		pos_precision = nltk.metrics.precision(refsets['pos'], testsets['pos'])
		pos_recall = nltk.metrics.recall(refsets['pos'], testsets['pos'])
		pos_fmeasure = nltk.metrics.f_measure(refsets['pos'], testsets['pos'])
		neg_precision = nltk.metrics.precision(refsets['neg'], testsets['neg'])
		neg_recall = nltk.metrics.recall(refsets['neg'], testsets['neg'])
		neg_fmeasure =  nltk.metrics.f_measure(refsets['neg'], testsets['neg'])
		
		print ''
		print '---------------------------------------'
		print 'SINGLE FOLD RESULT ' + '(' + classifierName + ')'
		print '---------------------------------------'
		print 'accuracy:', accuracy
		print 'precision', (pos_precision + neg_precision) / 2
		print 'recall', (pos_recall + neg_recall) / 2
		print 'f-measure', (pos_fmeasure + neg_fmeasure) / 2	
				
		#classifier.show_most_informative_features()
	
	print ''
	
	## CROSS VALIDATION
	
	trainfeats = negfeats + posfeats	
	
	# SHUFFLE TRAIN SET
	# As in cross validation, the test chunk might have only negative or only positive data	
	random.shuffle(trainfeats)	
	n = 5 # 5-fold cross-validation	
	
	for cl in classifier_list:
		
		subset_size = len(trainfeats) // n
		accuracy = []
		pos_precision = []
		pos_recall = []
		neg_precision = []
		neg_recall = []
		pos_fmeasure = []
		neg_fmeasure = []
		cv_count = 1
		for i in range(n):		
			testing_this_round = trainfeats[i*subset_size:][:subset_size]
			training_this_round = trainfeats[:i*subset_size] + trainfeats[(i+1)*subset_size:]
			
			if cl == 'maxent':
				classifierName = 'Maximum Entropy'
				classifier = MaxentClassifier.train(training_this_round, 'GIS', trace=0, encoding=None, labels=None, sparse=True, gaussian_prior_sigma=0, max_iter = 1)
			elif cl == 'svm':
				classifierName = 'SVM'
				classifier = SklearnClassifier(LinearSVC(), sparse=False)
				classifier.train(training_this_round)
			else:
				classifierName = 'Naive Bayes'
				classifier = NaiveBayesClassifier.train(training_this_round)
					
			refsets = collections.defaultdict(set)
			testsets = collections.defaultdict(set)
			# Use j here so the fold index i of the outer loop is not shadowed
			for j, (feats, label) in enumerate(testing_this_round):
				refsets[label].add(j)
				observed = classifier.classify(feats)
				testsets[observed].add(j)
			
			cv_accuracy = nltk.classify.util.accuracy(classifier, testing_this_round)
			cv_pos_precision = nltk.metrics.precision(refsets['pos'], testsets['pos'])
			cv_pos_recall = nltk.metrics.recall(refsets['pos'], testsets['pos'])
			cv_pos_fmeasure = nltk.metrics.f_measure(refsets['pos'], testsets['pos'])
			cv_neg_precision = nltk.metrics.precision(refsets['neg'], testsets['neg'])
			cv_neg_recall = nltk.metrics.recall(refsets['neg'], testsets['neg'])
			cv_neg_fmeasure =  nltk.metrics.f_measure(refsets['neg'], testsets['neg'])
					
			accuracy.append(cv_accuracy)
			pos_precision.append(cv_pos_precision)
			pos_recall.append(cv_pos_recall)
			neg_precision.append(cv_neg_precision)
			neg_recall.append(cv_neg_recall)
			pos_fmeasure.append(cv_pos_fmeasure)
			neg_fmeasure.append(cv_neg_fmeasure)
			
			cv_count += 1
				
		print '---------------------------------------'
		print 'N-FOLD CROSS VALIDATION RESULT ' + '(' + classifierName + ')'
		print '---------------------------------------'
		print 'accuracy:', sum(accuracy) / n
		print 'precision', (sum(pos_precision)/n + sum(neg_precision)/n) / 2
		print 'recall', (sum(pos_recall)/n + sum(neg_recall)/n) / 2
		print 'f-measure', (sum(pos_fmeasure)/n + sum(neg_fmeasure)/n) / 2
		print ''
	
		
evaluate_classifier(word_feats)
#evaluate_classifier(stopword_filtered_word_feats)
#evaluate_classifier(bigram_word_feats)	
#evaluate_classifier(bigram_word_feats_stopwords)

Below are the results for the all words feature set, obtained by calling evaluate_classifier(word_feats).

---------------------------------------
SINGLE FOLD RESULT (Naive Bayes)
---------------------------------------
accuracy: 0.712
precision 0.808857808858
recall 0.712
f-measure 0.6875

---------------------------------------
SINGLE FOLD RESULT (Maximum Entropy)
---------------------------------------
accuracy: 0.696
precision 0.801753867376
recall 0.696
f-measure 0.666806958474

---------------------------------------
SINGLE FOLD RESULT (SVM)
---------------------------------------
accuracy: 0.884
precision 0.884221311475
recall 0.884
f-measure 0.883983293594

---------------------------------------
N-FOLD CROSS VALIDATION RESULT (Naive Bayes)
---------------------------------------
accuracy: 0.742
precision 0.820669619013
recall 0.742023637087
f-measure 0.724301799825

---------------------------------------
N-FOLD CROSS VALIDATION RESULT (Maximum Entropy)
---------------------------------------
accuracy: 0.723
precision 0.808616815505
recall 0.725142220446
f-measure 0.702686890214

---------------------------------------
N-FOLD CROSS VALIDATION RESULT (SVM)
---------------------------------------
accuracy: 0.855
precision 0.854878928286
recall 0.855295825428
f-measure 0.854608585556

Below are the results for the bigram word feature set, obtained by calling evaluate_classifier(bigram_word_feats).

---------------------------------------
SINGLE FOLD RESULT (Naive Bayes)
---------------------------------------
accuracy: 0.812
precision 0.863372093023
recall 0.812
f-measure 0.805111874077

---------------------------------------
SINGLE FOLD RESULT (Maximum Entropy)
---------------------------------------
accuracy: 0.784
precision 0.849162011173
recall 0.784
f-measure 0.773429108485

---------------------------------------
SINGLE FOLD RESULT (SVM)
---------------------------------------
accuracy: 0.884
precision 0.884024577573
recall 0.884
f-measure 0.88399814397

---------------------------------------
N-FOLD CROSS VALIDATION RESULT (Naive Bayes)
---------------------------------------
accuracy: 0.827
precision 0.861070429762
recall 0.827596155604
f-measure 0.822565942413

---------------------------------------
N-FOLD CROSS VALIDATION RESULT (Maximum Entropy)
---------------------------------------
accuracy: 0.8
precision 0.84954715877
recall 0.802450691272
f-measure 0.792653687642

---------------------------------------
N-FOLD CROSS VALIDATION RESULT (SVM)
---------------------------------------
accuracy: 0.868
precision 0.86793244094
recall 0.868212258492
f-measure 0.867717178745

The results could likely be improved by enlarging the training dataset, which currently contains only 1000 reviews (500 positive and 500 negative).

Additionally, accuracy may improve with a feature set restricted to the most significant words and bigrams, rather than all words and bigrams. Here, "most significant" refers to the most frequently occurring words or bigrams.
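One possible way to build such a restricted feature set is sketched below in Python 3. The helper names `most_frequent_words` and `best_word_feats` and the toy reviews are assumptions made for this example, and frequency is used as the ranking because it is the criterion named above; another scoring function, such as the chi-square measure already used for bigrams, could be substituted.

```python
from collections import Counter


def most_frequent_words(tokenized_reviews, top_n):
    """Return the set of the top_n most frequent words across all reviews."""
    counts = Counter(word for review in tokenized_reviews for word in review)
    return {word for word, _ in counts.most_common(top_n)}


def best_word_feats(words, bestwords):
    """Bag-of-words features restricted to the words in `bestwords`."""
    return {word: True for word in words if word in bestwords}


# Toy corpus standing in for the tokenized Yelp reviews.
reviews = [
    "great food great service".split(),
    "terrible food".split(),
    "great place".split(),
]

bestwords = most_frequent_words(reviews, top_n=2)
print(sorted(bestwords))  # ['food', 'great']

feats = best_word_feats("great food but slow".split(), bestwords)
print(feats)  # {'great': True, 'food': True}
```

In practice the word counts would be computed on the training portion only, so that the test reviews do not influence which features are kept.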


