Making movie recommendations

July 17, 2016

I did my summer internship at Microsoft Technology Center, Bangalore. There, I worked on the problem of recommending movies. This post talks about how the solution that I developed works.

A screenshot of the final app displaying movies similar to Cinderella

The Problem Statement

Given a set of user ratings for a given set of movies, predict the rating of a user for a movie.

The problem has three cases:

Types of Recommender Systems

By the type of data, Recommender Systems can be classified into two types depending on the type of data that they use. They are :

I worked on Explicit Movie Rating Data from GroupLens. The solution has been tested on the same.

The Dataset

The GroupLens MovieLens 100k dataset consists of 1682 movies rated by 943 users. Each user has rated at least 20 movies, a total of 10,000 ratings. Ratings are positive integers from 1 to 5.

Sno. UserId MovieId Rating Timestamp
0 196 242 3 881250949
1 186 302 3 891717742
2 22 377 1 878887116
3 244 51 2 880606923
4 166 346 1 886397596

The frequency distribution of the movie ratings data

The evaluation metric

The accuracy of the predictions was measured by the Root Mean Squared Error(RMSE).

The solution


The ratings table was converted into a movie by user matrix of dimension (1682, 943). Since, not every user rated every movie, there were a lot of missing values. They were designated by a 0.

As a result, more than 97 % of the matrix consists of zeros. So, to efficiently store the matrix, a sparse representation was used.

Predicting Movie Ratings

Now, given the matrix representation, by computing distance based scores, we can find :

Similarity is related to distance by :

Similar Users

If we can find users similar to the current one, we can recommend movies that the similar users liked and have not been watched by the current user.

Also, a reasonable way of predicting the current user’s rating is by taking an average of the ratings of k - similar users weighted by their similarity to current user.


This method has the following shortcomings :

Similar Movies

To overcome the shortcomings of user-user similarity based predictions, movie-movie similarity is used.

Here, the same similarity metric is used to compute similar movies. These similar movies can be shown along with the description of the movie.

To make a rating prediction, we can take an average of k - similar movies that have been rated by the current user weighted by their similarity to the current movie.


This approach is better as :

Choosing a similarity metric

Following distance metrics are widely used and were considered :

Euclidean Distance

The most common way of finding distance between two data points.

In two dimensions, distance between and ,

Generally, in n-dimensions, distance between and ,

This however does not work well in our case. This is because there are a large number of missing ratings which are represented by a zero. The euclidean distance interprets the zero as a rating that is lower than 1.

Manhattan Distance

This is another common way of finding distance.

In two dimensions, distance between and ,

Generally, in n-dimensions, distance between and ,

Manhattan distance fails too, due to the same reason.

Cosine Distance

This metric basically finds the cosine of the angle between the lines joining the data points to the origin.

In two dimensions, distance between and ,

Generally, in n-dimensions, distance between and ,

This cosine distance takes a different approach to the problem and relies on multiplication instead of addition or subtraction. Thus, if either of the movies have not been rated by a given user that set of data points is discarded due to incompleteness.

Cosine similarity can be defined as,


In the implementation, I used the cosine similarity metric to find movies similar to a given movie to improve discovery of movies. The metric was implented in python and deployed on Azure ML as a web service.

Movies Similar to Cinderella

For the movie Cinderella, the movie simililarity service predicts

as similar movies. This seems to works pretty well - even better than a genre match for surfacing movies.

The ratings predicted by this method gives an RMSE of 1.05 which, however, is not good enough.

Making movie recommendations - July 17, 2016 - Saurabh Mathur