How to Build a Recommender System with Python and the python-recsys Library

python-recsys is a Python library designed for implementing recommender systems.

Currently, python-recsys supports two recommender algorithms: Singular Value Decomposition (SVD) and Neighborhood SVD.

The library's QuickStart tutorial demonstrates how to build recommender systems with python-recsys. It uses the MovieLens movie ratings dataset to show examples of computing item similarity and recommending movies to users.

In this article, I will provide code examples for using a custom CSV dataset and evaluating a recommender system with the SVD algorithm.

The data is read from a CSV file containing three fields: user_id, item_id, and star_rating. The item_id can represent any kind of entity (hotels, movies, books, and so on), while star_rating is the user's rating for the item on a scale from 1 (lowest) to 5 (highest).

For this demonstration, I have created a sample CSV file named dataset-recsys.csv, which includes the three columns (user_id, item_id, and star_rating).

Below is the code to create the CSV file:

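import csv
import os
import random

# The loading code later in this article reads ./data/dataset-recsys.csv,
# so make sure the data directory exists before writing the file
if not os.path.isdir('data'):
    os.makedirs('data')

fieldnames = ['user_id', 'item_id', 'star_rating']
with open('data/dataset-recsys.csv', 'w') as myfile:  # write data to a new CSV file
    writer = csv.DictWriter(myfile, delimiter=',', fieldnames=fieldnames)
    writer.writeheader()

    for user_id in range(1, 21):  # 20 users
        # each user rates 20 distinct items drawn from item ids 1 to 40
        items = random.sample(range(1, 41), 20)
        for item in items:
            writer.writerow({'user_id': user_id, 'item_id': item, 'star_rating': random.randint(1, 5)})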
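
To sanity-check a file in this layout, the rows can be read back into a nested dictionary keyed by item and then by user. This is a minimal standalone sketch that does not use python-recsys; the function name load_ratings and the in-memory sample are hypothetical, chosen only for illustration.

```python
import csv
import io
from collections import defaultdict


def load_ratings(fileobj):
    """Read user_id,item_id,star_rating rows into {item_id: {user_id: rating}}."""
    matrix = defaultdict(dict)
    for row in csv.DictReader(fileobj):
        user_id = int(row['user_id'])
        item_id = int(row['item_id'])
        matrix[item_id][user_id] = float(row['star_rating'])
    return dict(matrix)


# Hypothetical sample with the same columns as dataset-recsys.csv
sample = io.StringIO(
    "user_id,item_id,star_rating\n"
    "1,10,4\n"
    "1,11,2\n"
    "2,10,5\n"
)
ratings = load_ratings(sample)
print(ratings[10])  # ratings of item 10, keyed by user id
```

To inspect the real file, pass an open file handle instead of the in-memory sample.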

Creating a Data Model

The python-recsys library utilizes matrix factorization techniques, such as SVD and Neighborhood SVD. These algorithms take input data represented in matrix form and decompose it, reducing it to a lower-dimensional space.
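
To make the idea of reducing a matrix to a lower-dimensional space concrete, here is a small standalone sketch that approximates a ratings matrix with a single latent dimension (the equivalent of k=1). It uses plain power iteration as a stand-in technique; it is not how python-recsys computes its factorization, and the function names and the sample matrix are hypothetical.

```python
import math


def dominant_singular_triplet(matrix, iterations=200):
    """Approximate the largest singular value and its vectors by power iteration.

    Assumes a non-zero matrix with non-negative entries (such as ratings).
    """
    n_rows, n_cols = len(matrix), len(matrix[0])
    v = [1.0 / math.sqrt(n_cols)] * n_cols  # start from a uniform unit vector
    for _ in range(iterations):
        # u = M v
        u = [sum(matrix[i][j] * v[j] for j in range(n_cols)) for i in range(n_rows)]
        # w = M^T u, then normalize to get the next estimate of v
        w = [sum(matrix[i][j] * u[i] for i in range(n_rows)) for j in range(n_cols)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    u = [sum(matrix[i][j] * v[j] for j in range(n_cols)) for i in range(n_rows)]
    sigma = math.sqrt(sum(x * x for x in u))
    u = [x / sigma for x in u]
    return sigma, u, v


def rank_one(matrix):
    """Rebuild the matrix from a single latent dimension: sigma * u * v^T."""
    sigma, u, v = dominant_singular_triplet(matrix)
    return [[sigma * ui * vj for vj in v] for ui in u]


# Hypothetical item-by-user ratings (rows are items, columns are users)
ratings = [
    [5.0, 4.0, 1.0],
    [4.0, 5.0, 1.0],
    [1.0, 1.0, 5.0],
]
for row in rank_one(ratings):
    print([round(x, 2) for x in row])
```

A full SVD keeps k such dimensions instead of one; the fewer dimensions are kept, the coarser the reconstruction of the original ratings.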

import recsys.algorithm
recsys.algorithm.VERBOSE = True
    
from recsys.algorithm.factorize import SVD
from recsys.datamodel.data import Data

# Load the ratings into a Data object so they can be split into train and test sets
data = Data()
data.load('./data/dataset-recsys.csv', sep=',', format={'col':0, 'row':1, 'value':2, 'ids': int})

# About the format parameter:
#   'row': 1 -> Rows in matrix come from second column in dataset-recsys.csv file
#   'col': 0 -> Cols in matrix come from first column in dataset-recsys.csv file
#   'value': 2 -> Values (Mij) in matrix come from third column in dataset-recsys.csv file
#   'ids': int -> Ids (row and col ids) are integers (not strings)
    
train, test = data.split_train_test(percent=70) # 70% train, 30% test
    
svd = SVD()
svd.set_data(train)
    
k = 100
svd.compute(k=k, min_values=10, pre_normalize=None, mean_center=True, post_normalize=True)
    
# min_values=10 removes items rated by fewer than 10 users and users who rated fewer than 10 items
    
# Parameters:
#   k (int) – number of dimensions
#   min_values (int) – min. number of non-zeros (or non-empty values) any row or col must have
#   pre_normalize (string) – normalize input matrix. Possible values are tfidf, rows, cols, all.
#   mean_center (Boolean) – center the input matrix (aka mean subtraction)
#   post_normalize (Boolean) – normalize every row of U Sigma to be a unit vector. Thus, row similarity (using cosine distance) returns [-1.0 .. 1.0]
#   savefile (string) – path to save the SVD factorization (U, Sigma and V matrices)
    
# The computed SVD model can also be saved to a zip file
svd.compute(k=k, min_values=10, pre_normalize=None, mean_center=True, post_normalize=True, savefile='/tmp/datamodel')
svd.similarity(ITEMID1, ITEMID2)

# The zipped model can then be loaded from the same path
svd2 = SVD(filename='/tmp/datamodel')
svd2.similarity(ITEMID1, ITEMID2)
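
The 70/30 split used above can be pictured as shuffling the ratings and cutting the list at the requested percentage. The sketch below is one plausible way to do this in plain Python; it is an assumption about the general idea, not a copy of the library's split_train_test, and the seed parameter and sample ratings are hypothetical additions for reproducibility.

```python
import random


def split_train_test(ratings, percent=70, seed=None):
    """Shuffle (rating, item_id, user_id) tuples and split them percent / (100 - percent)."""
    shuffled = list(ratings)
    random.Random(seed).shuffle(shuffled)
    cut = len(shuffled) * percent // 100
    return shuffled[:cut], shuffled[cut:]


# Hypothetical ratings: 2 users x 5 items, stored as (rating, item_id, user_id)
ratings = [(float(1 + (u + i) % 5), i, u) for u in range(1, 3) for i in range(1, 6)]
train, test = split_train_test(ratings, percent=70, seed=42)
print(len(train), len(test))  # 7 3
```

Because the split is random, evaluation results differ from run to run unless the shuffle is seeded.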

Computing Similarity

Similarity between two items:

svd.similarity(ITEMID1, ITEMID2)

Similar items to a particular item:

svd.similar(ITEMID1, 5) # show 5 similar items

The similar and similarity functions operate on the row values of the matrix M.

In the call below, the second column of our CSV dataset (item_id) is designated as the row index when the data is loaded.

svd.load_data(filename='./data/dataset-recsys.csv', sep=',', format={'col':0, 'row':1, 'value':2, 'ids': int})    

In our CSV file, the first column represents user_id and the second column represents item_id. Therefore, we need to pass ITEMID as a parameter to the similar and similarity functions.

If we wish to compute similarity between users, we should first load the data by specifying the first column (user_id) as the row index, as shown below:

svd.load_data(filename='./data/dataset-recsys.csv', sep=',', format={'col':1, 'row':0, 'value':2, 'ids': int})    

Now we can compute similarity between users.

# Similarity between two users
svd.similarity(USERID1, USERID2)

# Similar users to a particular user 
svd.similar(USERID1, 5) # show 5 similar users
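
The parameter notes earlier describe row similarity as a cosine measure in the range [-1.0 .. 1.0]. The following standalone sketch shows that measure and a simple "most similar rows" ranking on plain Python lists. The vectors are hypothetical, and the helper most_similar excludes the target row itself, which is a choice made for this illustration rather than a statement about what svd.similar returns.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors, in [-1.0, 1.0]."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # no direction to compare against
    return dot / (norm_a * norm_b)


def most_similar(target, vectors, n=5):
    """Return up to n (row_id, similarity) pairs, most similar first."""
    scores = [
        (other, cosine_similarity(vectors[target], vec))
        for other, vec in vectors.items()
        if other != target
    ]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:n]


# Hypothetical reduced row vectors, keyed by item id
item_vectors = {
    1: [1.0, 0.0],
    2: [2.0, 0.0],
    3: [0.0, 1.0],
    4: [-1.0, 0.0],
}
print(cosine_similarity(item_vectors[1], item_vectors[2]))
print(most_similar(1, item_vectors, n=2))
```

The same functions work for users when the vectors are keyed by user id, which mirrors swapping the row and col columns when loading the data.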

Predicting rating for a particular user and item

MIN_RATING = 0.0
MAX_RATING = 5.0
ITEMID = 1
USERID = 1
svd.predict(ITEMID, USERID, MIN_RATING, MAX_RATING) # predicted rating value
svd.get_matrix().value(ITEMID, USERID) # real rating value
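
The MIN_RATING and MAX_RATING arguments suggest that the predicted value is kept within the rating scale. Assuming that is the intent, the idea amounts to clamping a raw reconstructed score, as in this small standalone sketch (the function name clamp_rating is hypothetical and not part of python-recsys):

```python
def clamp_rating(raw_score, min_rating=0.0, max_rating=5.0):
    """Keep a raw predicted score inside the [min_rating, max_rating] scale."""
    return max(min_rating, min(max_rating, raw_score))


# Hypothetical raw scores from a reconstructed matrix
for raw in (5.7, 3.2, -0.3):
    print(raw, clamp_rating(raw))
```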

Recommend items to a particular user

Recommend items that the user has not rated before.

# cols are users and rows are items, thus we set is_row=False
# n = 5, recommend 5 items
# only_unknowns = True, only return unknown values in matrix M, i.e. items not rated by the user
svd.recommend(USERID, n=5, only_unknowns=True, is_row=False)
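
Conceptually, a recommendation of this kind ranks the predicted scores for one user and, when only unknown values are wanted, skips the items the user has already rated. A minimal standalone sketch of that logic follows; the function recommend_items, its arguments, and the sample scores are hypothetical and only mirror the n and only_unknowns options described above.

```python
def recommend_items(predicted_scores, rated_items, n=5, only_unknowns=True):
    """Return the n highest-scoring (item_id, score) pairs for one user.

    predicted_scores: {item_id: predicted rating} for the user
    rated_items: set of item ids the user has already rated
    """
    candidates = [
        (item_id, score)
        for item_id, score in predicted_scores.items()
        if not (only_unknowns and item_id in rated_items)
    ]
    candidates.sort(key=lambda pair: pair[1], reverse=True)
    return candidates[:n]


# Hypothetical predictions for one user who has already rated item 1
predicted = {1: 4.8, 2: 3.1, 3: 4.2, 4: 2.0}
already_rated = {1}
print(recommend_items(predicted, already_rated, n=2))
print(recommend_items(predicted, already_rated, n=2, only_unknowns=False))
```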

Evaluation

The following code evaluates the model using two prediction-based metrics, Root Mean Square Error (RMSE) and Mean Absolute Error (MAE), and two rank-based metrics, Spearman’s rho and Kendall’s tau.

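from recsys.evaluation.prediction import RMSE, MAE
from recsys.evaluation.ranking import SpearmanRho, KendallTau

rmse = RMSE()
mae = MAE()
spearman = SpearmanRho()
kendall = KendallTau()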
for rating, item_id, user_id in test.get():
    try:
        pred_rating = svd.predict(item_id, user_id)
        rmse.add(rating, pred_rating)
        mae.add(rating, pred_rating)
        spearman.add(rating, pred_rating)
        kendall.add(rating, pred_rating) 
    except KeyError:
        continue
    
print 'RMSE=%s' % rmse.compute()
print 'MAE=%s' % mae.compute()
print 'Spearman\'s rho=%s' % spearman.compute()
print 'Kendall-tau=%s' % kendall.compute()
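
For reference, the two prediction-based metrics are simple to compute by hand. The standalone sketch below applies the standard RMSE and MAE definitions to a list of (real rating, predicted rating) pairs; the sample pairs are hypothetical and the functions are independent of the library's RMSE and MAE classes.

```python
import math


def rmse(pairs):
    """Root mean square error over (real, predicted) pairs."""
    return math.sqrt(sum((real - pred) ** 2 for real, pred in pairs) / len(pairs))


def mae(pairs):
    """Mean absolute error over (real, predicted) pairs."""
    return sum(abs(real - pred) for real, pred in pairs) / len(pairs)


# Hypothetical (real rating, predicted rating) pairs
pairs = [(4.0, 3.0), (2.0, 4.0), (5.0, 5.0)]
print('RMSE=%s' % rmse(pairs))
print('MAE=%s' % mae(pairs))
```

RMSE penalizes large errors more heavily than MAE, so a noticeably higher RMSE than MAE hints at a few badly mispredicted ratings.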

Here’s the full source code:

# Uncomment the VERBOSE line to show progress messages:
import recsys.algorithm
#recsys.algorithm.VERBOSE = True

from recsys.algorithm.factorize import SVD
from recsys.datamodel.data import Data
from recsys.evaluation.prediction import RMSE, MAE
from recsys.evaluation.decision import PrecisionRecallF1
from recsys.evaluation.ranking import SpearmanRho, KendallTau

#Dataset
PERCENT_TRAIN = 70
data = Data()
data.load('./data/dataset-recsys.csv', sep=',', format={'col':0, 'row':1, 'value':2, 'ids':int})

#Train & Test data
train, test = data.split_train_test(percent=PERCENT_TRAIN)

#Create SVD
K=100
svd = SVD()
svd.set_data(train)

svd.compute(k=K, min_values=1, pre_normalize=None, mean_center=True, post_normalize=True)
#svd.compute(k=K, min_values=5, pre_normalize=None, mean_center=True, post_normalize=True)
#svd.compute(k=K, pre_normalize=None, mean_center=True, post_normalize=True)

print ''
print 'COMPUTING SIMILARITY'
print svd.similarity(1, 2) # similarity between items
print svd.similar(1, 5) # show 5 similar items

print ''
print 'GENERATING PREDICTION'
MIN_RATING = 0.0
MAX_RATING = 5.0
ITEMID = 1
USERID = 1
print svd.predict(ITEMID, USERID, MIN_RATING, MAX_RATING) # predicted rating value
print svd.get_matrix().value(ITEMID, USERID) # real rating value

print ''
print 'GENERATING RECOMMENDATION'
print svd.recommend(USERID, n=5, only_unknowns=True, is_row=False) 

#Evaluation using prediction-based and rank-based metrics
rmse = RMSE()
mae = MAE()
spearman = SpearmanRho()
kendall = KendallTau()
#decision = PrecisionRecallF1()
for rating, item_id, user_id in test.get():
    try:
        pred_rating = svd.predict(item_id, user_id)
        rmse.add(rating, pred_rating)
        mae.add(rating, pred_rating)
        spearman.add(rating, pred_rating)
        kendall.add(rating, pred_rating)         
    except KeyError:
        continue

print ''
print 'EVALUATION RESULT'
print 'RMSE=%s' % rmse.compute()
print 'MAE=%s' % mae.compute()
print 'Spearman\'s rho=%s' % spearman.compute()
print 'Kendall-tau=%s' % kendall.compute()
#print decision.compute()
print ''

