How to Write Your Own Recommendation System, Part 1: Small Scale

Image for post
Image for post

A recommendation system is used to, well… recommend stuff. Netflix is a great example of this; after you rate several titles it will be able to recommend new titles to you. Have you ever wondered how this really works? Or even apply it to own site or application. Well… you’re about to find out.

I’ll be working through a complete, simple solution and provide code examples in Python. I will be using movie ratings (1–5 stars) as the data set, but this same method should work for percentages, or anything else that is can be numerically represented.

Small Scale

Before we can ramp it up into thousands of ratings and get some meaningful results we need to understand how it works on a smaller scale.

I will be using some sample data provided by this book. HPrepresents the trilogy of Harry Potter movies, TWis for Twilight and SWis for the first three Star Wars movies.

Image for post
Image for post

A to D are four users that have provided movie ratings. I’m not sure why D watched the second Harry Potter movie before watching the first, however let’s try and estimate how much he will like the first movie with the data provided. If you use this same process to fill in all of the missing values you could then make recommendations — but we will get into this later.

The simplest way to work out this missing rating (D, HP1) is to find the person with the most similar taste that did see HP1 and use their rating. This isn’t too accurate on a system with 4 users, but the more users and amount of data available then more accurate this becomes.

So how do we determine who has the most similar taste to D? There is a formula called the cosine similarity or distance that takes in the ratings of two users and outputs a number between 0.0 and 1.0 where 1.0 would represent that they have exactly the same ratings for all of their movies. So we’re looking for the user with the greatest cosine similarity.

Cosine Distance/Similarity

Image for post
Image for post

Where a and b represent the users. Let’s calculate the cosine similarity of A (red) and B (blue):

Image for post
Image for post

The zero values are all cancelled out since they have no effect on the result itself, so a more simple form is:

Image for post
Image for post

We can only compare with two other users: Aand Bbecause Cdoes not have a rating for HP1 so it would not be helpful. Using the cosine distance we can derive:

  • A and D= 0.0
  • B and D= 0.435

We don’t have a lot of choice but Bis the clear winner with the greatest cosine similarity. Here is the code to produce the answer:

_ = 0 # A missing value.
users = {
'A': [4, _, _, 5, 1, _, _],
'B': [5, 5, 4, _, _, _, _],
'C': [_, _, _, 2, 4, 5, _],
'D': [_, 3, _, _, _, _, 3],
def cosine_distance(user1, user2):
top = 0
user1bottom = 0
user2bottom = 0
for i in range(0, len(user1)):
top += user1[i] * user2[i]
user1bottom += user1[i] * user1[i]
user2bottom += user2[i] * user2[i]
return top / (sqrt(user1bottom) * sqrt(user2bottom))print cosine_distance(users['A'], users['D'])
print cosine_distance(users['B'], users['D'])

An important note here is that missing values are treated as 0. In reality this would mean the people that have not rated the movie reallydislike it. It is often better to substitute a balanced value for the missing values while calculating, in this case 2.5might be a good choice. After changing the value of _and rerunning it we get:

  • A and D= 0.915
  • B and D= 0.952

From this point we can lookup what score they (B) gave for the movie. In this case, 5 stars and say that Dshouldreally like it and we think they would rate it 5 stars.

Here is a more general function:

def estimate_rating(users, userName, movie):
user_best_match = None
user_best_match_dist = 0
for user in users:
# We don't want to calculate ourself.
if user == userName:
# Ignore users that haven't rated the movie.
if users[user][movie] == 0:
dist = cosine_distance(users[userName], users[user])
print '%s -> %s = %.3f' % (userName, user, dist)
if dist >= user_best_match_dist:
user_best_match_dist = dist
user_best_match = user
# Return the rating of the best matched user.
return users[user_best_match][movie]
HP1, HP2, HP3, TW, SW1, SW2, SW3 = range(0, 7)
print estimate_rating(users, 'D', HP1)

To get more accurate results (if we had more data) we would take an average of a fixed number of similar users rather than just the most similar.

Stay Tuned!

In part 2 we will be looking at much larger and real world data set. Tips on performance, increasing quality of the recommendations and being able to objectively calculate how accurate our algorithm is as we make changes to it.

Originally published at on July 16, 2016.

Written by

I’m a data nerd and TDD enthusiast originally from Sydney. Currently working for Uber in New York. My thoughts here are my own. 🤓

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store