Recommendation systems are a family of algorithms that suggest items to users based on information gathered about those users. These systems have become ubiquitous, and can commonly be seen in online stores, movie databases, and job finders. In this blog post, we will explore content-based and collaborative filtering recommendation systems.
The dataset we’ll be working on has been acquired from GroupLens. It consists of 27 million ratings and 1.1 million tag applications applied to 58,000 movies by 280,000 users.
# import libraries
import pandas as pd
from math import sqrt
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
# store the movie information into a pandas dataframe
movies_df = pd.read_csv('movies1.csv')

# store the ratings information into a pandas dataframe
ratings_df = pd.read_csv('ratings.csv')

movies_df.head()
 | movieId | title | genres |
---|---|---|---|
0 | 1 | Toy Story (1995) | Adventure|Animation|Children|Comedy|Fantasy |
1 | 2 | Jumanji (1995) | Adventure|Children|Fantasy |
2 | 3 | Grumpier Old Men (1995) | Comedy|Romance |
3 | 4 | Waiting to Exhale (1995) | Comedy|Drama|Romance |
4 | 5 | Father of the Bride Part II (1995) | Comedy |
Each movie has a unique ID, a title with the release year appended to it (possibly containing unicode characters), and several different genres packed into a single field.
# dimensions of the dataframes
print(movies_df.shape)
print(ratings_df.shape)
(58097, 3)
(27753444, 4)
Preprocessing the data
Let’s remove the year from the ‘title’ column and store it in a new ‘year’ column.
# use regular expressions to find a year stored between parentheses
# we specify the parentheses so we don't conflict with movies that have years in their titles
movies_df['year'] = movies_df.title.str.extract(r'(\(\d\d\d\d\))', expand=False)

# remove the parentheses
movies_df['year'] = movies_df.year.str.extract(r'(\d\d\d\d)', expand=False)

# remove the years from the 'title' column
movies_df['title'] = movies_df.title.str.replace(r'(\(\d\d\d\d\))', '', regex=True)

# apply the strip function to get rid of any trailing whitespace characters
movies_df['title'] = movies_df['title'].apply(lambda x: x.strip())

movies_df.head()
 | movieId | title | genres | year |
---|---|---|---|---|
0 | 1 | Toy Story | Adventure|Animation|Children|Comedy|Fantasy | 1995 |
1 | 2 | Jumanji | Adventure|Children|Fantasy | 1995 |
2 | 3 | Grumpier Old Men | Comedy|Romance | 1995 |
3 | 4 | Waiting to Exhale | Comedy|Drama|Romance | 1995 |
4 | 5 | Father of the Bride Part II | Comedy | 1995 |
Let’s also split the values in the ‘genres’ column into a list of genres to simplify future use, by applying Python’s string split function to the column.
# every genre is separated by a |, so call the split function on |
movies_df['genres'] = movies_df.genres.str.split('|')

movies_df.head()
 | movieId | title | genres | year |
---|---|---|---|---|
0 | 1 | Toy Story | [Adventure, Animation, Children, Comedy, Fantasy] | 1995 |
1 | 2 | Jumanji | [Adventure, Children, Fantasy] | 1995 |
2 | 3 | Grumpier Old Men | [Comedy, Romance] | 1995 |
3 | 4 | Waiting to Exhale | [Comedy, Drama, Romance] | 1995 |
4 | 5 | Father of the Bride Part II | [Comedy] | 1995 |
Since keeping genres in a list format isn’t optimal for the content-based recommendation technique, we will use one-hot encoding to convert the list of genres into a vector where each column corresponds to one possible value of the feature. This encoding is needed to feed the categorical genre data into the numerical computations that follow.
In this case, we store every distinct genre in its own column containing either 1 or 0: 1 shows that a movie has that genre and 0 shows that it doesn’t. We store the result in a separate dataframe, since the expanded genre columns are only needed for the content-based recommender and we want to keep the original movies_df compact.
# copy the movie dataframe into a new one
moviesWithGenres_df = movies_df.copy()

# for every row in the dataframe, iterate through the list of genres and place a 1 in the corresponding column
for index, row in movies_df.iterrows():
    for genre in row['genres']:
        moviesWithGenres_df.at[index, genre] = 1

# fill in the NaN values with 0 to show that a movie doesn't have that column's genre
moviesWithGenres_df = moviesWithGenres_df.fillna(0)

moviesWithGenres_df.head()
 | movieId | title | genres | year | Adventure | Animation | Children | Comedy | Fantasy | Romance | … | Horror | Mystery | Sci-Fi | IMAX | Documentary | War | Musical | Western | Film-Noir | (no genres listed) |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | Toy Story | [Adventure, Animation, Children, Comedy, Fantasy] | 1995 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 0.0 | … | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
1 | 2 | Jumanji | [Adventure, Children, Fantasy] | 1995 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | … | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
2 | 3 | Grumpier Old Men | [Comedy, Romance] | 1995 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | … | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
3 | 4 | Waiting to Exhale | [Comedy, Drama, Romance] | 1995 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | … | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
4 | 5 | Father of the Bride Part II | [Comedy] | 1995 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | … | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
5 rows × 24 columns
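As a side note, pandas can build an equivalent genre indicator matrix without the explicit loop above. The sketch below uses Series.str.get_dummies and assumes movies_df['genres'] already holds lists of genre strings (as produced by the split earlier); the only difference is that the indicators come out as integers rather than floats.

# alternative sketch: one-hot encode the genres without an explicit loop
# (assumes movies_df['genres'] contains lists of genre strings)
genre_dummies = movies_df['genres'].str.join('|').str.get_dummies(sep='|')
moviesWithGenres_alt = pd.concat([movies_df, genre_dummies], axis=1)
moviesWithGenres_alt.head()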
Now, let’s focus on the ratings dataframe.
ratings_df.head()
 | userId | movieId | rating | timestamp |
---|---|---|---|---|
0 | 1 | 307 | 3.5 | 1256677221 |
1 | 1 | 481 | 3.5 | 1256677456 |
2 | 1 | 1091 | 1.5 | 1256677471 |
3 | 1 | 1257 | 4.5 | 1256677460 |
4 | 1 | 1449 | 4.5 | 1256677264 |
Every row in the ratings dataframe consists of a userId, a movieId, the rating the user gave, and a timestamp showing when the review was made. We won’t be needing the timestamp column, so let’s drop it.
ratings_df = ratings_df.drop(columns='timestamp')
ratings_df.head()
 | userId | movieId | rating |
---|---|---|---|
0 | 1 | 307 | 3.5 |
1 | 1 | 481 | 3.5 |
2 | 1 | 1091 | 1.5 |
3 | 1 | 1257 | 4.5 |
4 | 1 | 1449 | 4.5 |
Content-based recommendation system
This technique attempts to figure out what a user’s favorite aspects of an item are, and then recommends items that present those aspects. In our case, we’re going to try to figure out the input’s favorite genres from the movies and ratings given.
Advantages of content-based filtering:

- It learns the user's preferences.
- It's highly personalized for the user.

Disadvantages of content-based filtering:

- It doesn't take into account what others think of the item, so low-quality items might be recommended.
- Extracting meaningful features from items is not always intuitive.
- Determining which characteristics of an item the user likes or dislikes is not always obvious.
Let's create an input user to recommend movies to.
userInput = [
    {'title':'Mission: Impossible - Fallout', 'rating':5},
    {'title':'Top Gun', 'rating':4.5},
    {'title':'Jerry Maguire', 'rating':3},
    {'title':'Vanilla Sky', 'rating':2.5},
    {'title':'Minority Report', 'rating':4},
]

inputMovies = pd.DataFrame(userInput)
inputMovies
 | title | rating |
---|---|---|
0 | Mission: Impossible - Fallout | 5.0 |
1 | Top Gun | 4.5 |
2 | Jerry Maguire | 3.0 |
3 | Vanilla Sky | 2.5 |
4 | Minority Report | 4.0 |
Add movieId to the input user.
Extract each input movie's ID from the movies dataframe and add it to the input.
# filter the movies by title
inputId = movies_df[movies_df['title'].isin(inputMovies['title'].tolist())]

# merge it to get the movieId
inputMovies = pd.merge(inputId, inputMovies)

# drop information we won't use from the input dataframe
inputMovies = inputMovies.drop(columns=['genres', 'year'])

# final input dataframe
inputMovies
 | movieId | title | rating |
---|---|---|---|
0 | 1101 | Top Gun | 4.5 |
1 | 1393 | Jerry Maguire | 3.0 |
2 | 4975 | Vanilla Sky | 2.5 |
3 | 5445 | Minority Report | 4.0 |
4 | 189333 | Mission: Impossible - Fallout | 5.0 |
We will learn the input’s preferences. So let’s get the subset of movies that the input has watched from the dataframe containing genres defined with binary values.
# filter out the movies from the input
userMovies = moviesWithGenres_df[moviesWithGenres_df['movieId'].isin(inputMovies['movieId'].tolist())]
userMovies
 | movieId | title | genres | year | Adventure | Animation | Children | Comedy | Fantasy | Romance | … | Horror | Mystery | Sci-Fi | IMAX | Documentary | War | Musical | Western | Film-Noir | (no genres listed) |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1079 | 1101 | Top Gun | [Action, Romance] | 1986 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | … | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
1361 | 1393 | Jerry Maguire | [Drama, Romance] | 1996 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | … | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
4879 | 4975 | Vanilla Sky | [Mystery, Romance, Sci-Fi, Thriller] | 2001 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | … | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
5348 | 5445 | Minority Report | [Action, Crime, Mystery, Sci-Fi, Thriller] | 2002 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | … | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
56349 | 189333 | Mission: Impossible - Fallout | [Action, Adventure, Thriller] | 2018 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | … | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
5 rows × 24 columns
We only need the actual genre table. Reset the index and drop the unnecessary columns.
# reset the index
userMovies = userMovies.reset_index(drop=True)

# drop unnecessary columns
userGenreTable = userMovies.drop(columns=['movieId', 'title', 'genres', 'year'])
userGenreTable
 | Adventure | Animation | Children | Comedy | Fantasy | Romance | Drama | Action | Crime | Thriller | Horror | Mystery | Sci-Fi | IMAX | Documentary | War | Musical | Western | Film-Noir | (no genres listed) |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
2 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
3 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
4 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
Now we learn the input's preferences. We turn each genre into a weight by multiplying the input's ratings into the input's genre table and then summing the resulting table by column; in other words, we take the dot product of the transposed genre table and the rating vector.
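Written out, the profile weight for a genre $g$ is

$$\text{profile}(g) = \sum_{m \in \text{input}} r_m \, G_{m,g},$$

where $r_m$ is the input's rating of movie $m$ and $G_{m,g}$ is 1 if movie $m$ has genre $g$ and 0 otherwise.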
inputMovies['rating']
0 4.5
1 3.0
2 2.5
3 4.0
4 5.0
Name: rating, dtype: float64
# dot product to get the weights
userProfile = userGenreTable.transpose().dot(inputMovies['rating'])

# the user profile
userProfile
Adventure 5.0
Animation 0.0
Children 0.0
Comedy 0.0
Fantasy 0.0
Romance 10.0
Drama 3.0
Action 13.5
Crime 4.0
Thriller 11.5
Horror 0.0
Mystery 6.5
Sci-Fi 6.5
IMAX 0.0
Documentary 0.0
War 0.0
Musical 0.0
Western 0.0
Film-Noir 0.0
(no genres listed) 0.0
dtype: float64
Now we have the weights for each of the user’s preferences. This is the User Profile. Using this, we can recommend movies that satisfy the user’s preferences.
Let’s start by extracting the genre table from the original dataframe.
# get the genres of every movie in our original dataframe
genreTable = moviesWithGenres_df.set_index(moviesWithGenres_df['movieId'])

# drop unnecessary columns
genreTable = genreTable.drop(columns=['movieId', 'title', 'genres', 'year'])
genreTable.head()
movieId | Adventure | Animation | Children | Comedy | Fantasy | Romance | Drama | Action | Crime | Thriller | Horror | Mystery | Sci-Fi | IMAX | Documentary | War | Musical | Western | Film-Noir | (no genres listed) |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
2 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
3 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
4 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
5 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
genreTable.shape
(58097, 20)
With the input’s profile and the complete list of movies and their genres in hand, we’re going to take the weighted average of every movie based on the input profile and recommend the top twenty movies that most satisfy it.
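For each movie $m$, the recommendation score is the profile-weighted average of its genre indicators:

$$\text{score}(m) = \frac{\sum_{g} G_{m,g}\,\text{profile}(g)}{\sum_{g} \text{profile}(g)},$$

so a movie covering more of the heavily weighted genres gets a higher score.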
# multiply the genres by the weights and then take the weighted average
recommendationTable_df = ((genreTable*userProfile).sum(axis=1)) / (userProfile.sum())
recommendationTable_df.head()
movieId
1 0.083333
2 0.083333
3 0.166667
4 0.216667
5 0.000000
dtype: float64
Sort the recommendation scores in descending order and look up the top 20 movies.

# sort the scores so the highest come first, then look up the corresponding movies
recommendationTable_df = recommendationTable_df.sort_values(ascending=False)
movies_df.loc[movies_df['movieId'].isin(recommendationTable_df.head(20).keys())]
 | movieId | title | genres | year |
---|---|---|---|---|
0 | 1 | Toy Story | [Adventure, Animation, Children, Comedy, Fantasy] | 1995 |
1 | 2 | Jumanji | [Adventure, Children, Fantasy] | 1995 |
2 | 3 | Grumpier Old Men | [Comedy, Romance] | 1995 |
3 | 4 | Waiting to Exhale | [Comedy, Drama, Romance] | 1995 |
4 | 5 | Father of the Bride Part II | [Comedy] | 1995 |
5 | 6 | Heat | [Action, Crime, Thriller] | 1995 |
6 | 7 | Sabrina | [Comedy, Romance] | 1995 |
7 | 8 | Tom and Huck | [Adventure, Children] | 1995 |
8 | 9 | Sudden Death | [Action] | 1995 |
9 | 10 | GoldenEye | [Action, Adventure, Thriller] | 1995 |
10 | 11 | American President, The | [Comedy, Drama, Romance] | 1995 |
11 | 12 | Dracula: Dead and Loving It | [Comedy, Horror] | 1995 |
12 | 13 | Balto | [Adventure, Animation, Children] | 1995 |
13 | 14 | Nixon | [Drama] | 1995 |
14 | 15 | Cutthroat Island | [Action, Adventure, Romance] | 1995 |
15 | 16 | Casino | [Crime, Drama] | 1995 |
16 | 17 | Sense and Sensibility | [Drama, Romance] | 1995 |
17 | 18 | Four Rooms | [Comedy] | 1995 |
18 | 19 | Ace Ventura: When Nature Calls | [Comedy] | 1995 |
19 | 20 | Money Train | [Action, Comedy, Crime, Drama, Thriller] | 1995 |
These are the top 20 movies to recommend to the user based on a content-based recommendation system.
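For convenience, the whole content-based pipeline can be collected into a single function. The sketch below is purely illustrative (recommend_by_genre is a name we made up, not a library function) and assumes the movies_df and moviesWithGenres_df dataframes built earlier in this post.

# illustrative helper that wraps the content-based steps above into one function
def recommend_by_genre(user_ratings, n=20):
    # user_ratings: a dataframe with 'title' and 'rating' columns
    ids = movies_df[movies_df['title'].isin(user_ratings['title'])]
    rated = pd.merge(ids, user_ratings).drop(columns=['genres', 'year'])
    # genre table of the rated movies, aligned with the ratings
    watched = moviesWithGenres_df[moviesWithGenres_df['movieId'].isin(rated['movieId'])].reset_index(drop=True)
    genre_cols = watched.drop(columns=['movieId', 'title', 'genres', 'year'])
    # user profile: rating-weighted sum of the genre indicators
    profile = genre_cols.transpose().dot(rated['rating'])
    # score every movie against the profile and return the top n
    all_genres = moviesWithGenres_df.set_index('movieId').drop(columns=['title', 'genres', 'year'])
    scores = (all_genres * profile).sum(axis=1) / profile.sum()
    top = scores.sort_values(ascending=False).head(n)
    return movies_df[movies_df['movieId'].isin(top.index)]

# usage: the same five input movies as before
recommend_by_genre(pd.DataFrame(userInput), n=20)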
Collaborative Filtering
This technique uses other users to recommend items to the input user. It attempts to find users who have preferences and opinions similar to the input's and then recommends items that they have liked. There are several methods of finding similar users; the one we will use here is based on the Pearson correlation function.

The process for creating a user-based recommendation system is as follows:

- Select a user and the movies that user has watched.
- Based on their ratings of those movies, find the top X neighbours.
- Get the watched-movie record of each neighbour.
- Calculate a similarity score using some formula.
- Recommend the items with the highest scores.

Advantages of collaborative filtering:

- It takes other users' ratings into consideration.
- It doesn't need to study or extract information from the recommended item.
- It adapts to the user's interests, which might change over time.

Disadvantages of collaborative filtering:

- The approximation function can be slow.
- There might be a small number of users to approximate from.
- There might be privacy issues when trying to learn the user's preferences.
Let’s create an input user to recommend movies to.
userInput = [
    {'title':'Mission: Impossible - Fallout', 'rating':5},
    {'title':'Top Gun', 'rating':4.5},
    {'title':'Jerry Maguire', 'rating':3},
    {'title':'Vanilla Sky', 'rating':2.5},
    {'title':'Minority Report', 'rating':4},
]

inputMovies = pd.DataFrame(userInput)
inputMovies
 | title | rating |
---|---|---|
0 | Mission: Impossible - Fallout | 5.0 |
1 | Top Gun | 4.5 |
2 | Jerry Maguire | 3.0 |
3 | Vanilla Sky | 2.5 |
4 | Minority Report | 4.0 |
# filter the movies by title
inputId = movies_df[movies_df['title'].isin(inputMovies['title'].tolist())]

# merge it to get the movieId
inputMovies = pd.merge(inputId, inputMovies)

# drop information we won't use from the input dataframe
inputMovies = inputMovies.drop(columns=['genres', 'year'])

# final input dataframe
inputMovies
 | movieId | title | rating |
---|---|---|---|
0 | 1101 | Top Gun | 4.5 |
1 | 1393 | Jerry Maguire | 3.0 |
2 | 4975 | Vanilla Sky | 2.5 |
3 | 5445 | Minority Report | 4.0 |
4 | 189333 | Mission: Impossible - Fallout | 5.0 |
The users who have seen the same movies
Now, with the movie IDs in our input, we can get the subset of users that have watched and reviewed the movies in our input.
# filter out users that have watched the movies that the input has watched, and store them
userSubset = ratings_df[ratings_df['movieId'].isin(inputMovies['movieId'].tolist())]
userSubset.head()
 | userId | movieId | rating |
---|---|---|---|
214 | 4 | 1101 | 4.0 |
248 | 4 | 1393 | 2.5 |
586 | 4 | 4975 | 4.0 |
610 | 4 | 5445 | 4.5 |
935 | 8 | 1393 | 4.0 |
Group the rows by userId.
# groupby creates several sub-dataframes, all of which share the same value in the specified column
userSubsetGroup = userSubset.groupby(['userId'])
Let’s look at one of these users - userId = 4
userSubsetGroup.get_group(4)
 | userId | movieId | rating |
---|---|---|---|
214 | 4 | 1101 | 4.0 |
248 | 4 | 1393 | 2.5 |
586 | 4 | 4975 | 4.0 |
610 | 4 | 5445 | 4.5 |
Let’s sort these groups so the users that share the most movies in common with the input have higher priority. This provides a richer recommendation since we won’t go through every single user.
userSubsetGroup = sorted(userSubsetGroup, key=lambda x: len(x[1]), reverse=True)
Now let's look at the first three users.
userSubsetGroup[0:3]
[(214,
userId movieId rating
20548 214 1101 2.0
20638 214 1393 3.0
21122 214 4975 2.0
21160 214 5445 4.0
21933 214 189333 3.0),
(6264,
userId movieId rating
616485 6264 1101 5.0
616574 6264 1393 4.0
617440 6264 4975 3.0
617480 6264 5445 3.0
618666 6264 189333 4.0),
(19924,
userId movieId rating
1945179 19924 1101 3.5
1945273 19924 1393 4.0
1946065 19924 4975 2.0
1946152 19924 5445 4.0
1948193 19924 189333 3.5)]
Next, we are going to compare these users to our input user and find the ones that are most similar.
We’re going to find out how similar each user is to the input through the Pearson Correlation Coefficient. It is used to measure the strength of a linear association between two variables.
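For two rating vectors $x$ (the input user) and $y$ (another user), taken over their $n$ movies in common, the coefficient is

$$r_{xy} = \frac{S_{xy}}{\sqrt{S_{xx}\,S_{yy}}}, \qquad S_{xy} = \sum_i x_i y_i - \frac{\sum_i x_i \sum_i y_i}{n}, \qquad S_{xx} = \sum_i x_i^2 - \frac{\left(\sum_i x_i\right)^2}{n}, \qquad S_{yy} = \sum_i y_i^2 - \frac{\left(\sum_i y_i\right)^2}{n},$$

which is exactly what the Sxx, Syy and Sxy terms in the loop below compute. A value of 1 means the two users' ratings move together perfectly, -1 means they move in opposite directions, and 0 means there is no linear relationship.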
We will select a subset of users to iterate through. The limit is imposed because we don’t want to waste too much time going through every single user.
userSubsetGroup = userSubsetGroup[0:100]
Calculate the Pearson Correlation between the input user and the subset group, and store it in a dictionary, where the key is the userId and the value is the coefficient.
pearsonCorrelationDict = {}

# for every user group in our subset
for name, group in userSubsetGroup:
    # sort the input and the current user group so the values aren't mixed up later on
    group = group.sort_values(by='movieId')
    inputMovies = inputMovies.sort_values(by='movieId')
    # get the N for the formula
    nRatings = len(group)
    # get the review scores for the movies that they both have in common
    temp_df = inputMovies[inputMovies['movieId'].isin(group['movieId'].tolist())]
    # store them in a temporary list to facilitate future calculations
    tempRatingList = temp_df['rating'].tolist()
    # also put the current user group's reviews in a list
    tempGroupList = group['rating'].tolist()
    # now calculate the Pearson correlation between the two users, x and y
    Sxx = sum([i**2 for i in tempRatingList]) - pow(sum(tempRatingList), 2)/float(nRatings)
    Syy = sum([i**2 for i in tempGroupList]) - pow(sum(tempGroupList), 2)/float(nRatings)
    Sxy = sum(i*j for i, j in zip(tempRatingList, tempGroupList)) - sum(tempRatingList)*sum(tempGroupList)/float(nRatings)

    # if the denominator is non-zero, divide; otherwise the correlation is 0
    if Sxx != 0 and Syy != 0:
        pearsonCorrelationDict[name] = Sxy/sqrt(Sxx*Syy)
    else:
        pearsonCorrelationDict[name] = 0

pearsonCorrelationDict.items()
dict_items([(214, 0.23055616708169335), (6264, 0.518751375933811), (19924, 0.48424799847909467), (21962, 0.7190233885442843), (22361, 0.6163156344279349), (24518, -0.48424799847909017), (28244, -0.22258705026211378), (30387, 0.8339502495593619), (31727, -0.6163156344279349), (32728, -0.26413527189768593), (33550, 0.3774147062120368), (36202, 0.9510441892119876), (38778, 0.5906244232186185), (43227, -0.1968748077395395), (43264, -0.9021937088963177), (48109, 0.0), (50016, 0.04402254531627891), (59611, 0.24946109012559378), (62705, 0.5187513759338097), (63353, -0.4799585206127619), (64733, -0.8524929243380921), (69860, 0.43133109281375515), (70271, -0.08524929243380922), (71857, 0.7781270639007126), (72194, 0.24112141108520613), (75629, 0.6016946526766817), (77609, 0.48224282217041226), (80398, 0.7776587696250218), (81924, -0.32283199898606263), (93997, 0.7771889263740438), (94749, 0.0), (98561, -0.5619806572616304), (99014, -0.23055616708169688), (102101, 0.2516098041413576), (104322, -0.43133109281375515), (105397, 0.8859366348279278), (112491, -0.6469966392206334), (116632, 0.20173664619648324), (117053, 0.5917813771642448), (124357, 0.7635511351031528), (125365, 0.7009130258223497), (128610, 0.45109685444815883), (131687, -0.22874785549890708), (133546, 0.6163156344279386), (148144, 0.10783277320344019), (153921, 0.12056070554260306), (161582, -0.3616821166278092), (167427, 0.10783277320343922), (167835, 0.10783277320343156), (171745, -0.6995593008237843), (173280, -0.2516098041413576), (175811, 0.616315634427937), (184822, 0.07421560439929334), (186859, -0.6163156344279386), (187056, 0.8439249387982215), (189464, 0.2017366461964786), (194365, -0.05547950410915026), (195892, 0.17049858486761843), (199011, 0.6340294594746541), (205765, 0.6163156344279422), (209798, 0.836059669922064), (210651, -0.057639041770424365), (220709, 0.8364283610093444), (221882, -0.18485618263446638), (233580, 0.7009130258223497), (240712, 0.700913025822351), (242708, 0.04876920665717847), (247867, -0.4528033232531783), (248019, 0.393749615479079), (261170, 0.518751375933811), (261224, 0.5114957546028552), (263973, -0.12888481555661682), (267699, 0.17049858486761843), (271364, 0.7043607250605002), (275841, -0.8364283610093444), (280868, 0.09843740386976975), (4, 0.4216370213557839), (56, 0.12909944487358055), (81, 0.2581988897471611), (147, 0.5502760564641688), (235, 0.0), (239, -0.7302967433402214), (313, -0.7302967433402214), (332, 0.848528137423857), (458, 0.0), (601, 0.6708203932499369), (605, 0.32071349029490925), (864, -0.31622776601683794), (930, -0.5163977794943222), (1073, -0.4242640687119285), (1153, -0.5262348115842176), (1191, 0.0), (1263, 0.0), (1312, 0.9621404708847278), (1367, -0.1414213562373095), (1419, -0.3651483716701107), (1440, 0.6708203932499369), (1513, 0.3651483716701107), (1519, -0.38138503569823695), (1523, -0.38138503569823695)])
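As an optional sanity check (not part of the original pipeline), numpy's corrcoef should agree with the hand-rolled formula for any group whose ratings aren't constant, for example the first group in the subset; the result should match the first value stored in pearsonCorrelationDict above.

# optional check: compare the hand-rolled Pearson formula with numpy's corrcoef
name, group = userSubsetGroup[0]
common = inputMovies[inputMovies['movieId'].isin(group['movieId'])].sort_values('movieId')
np.corrcoef(common['rating'], group.sort_values('movieId')['rating'])[0, 1]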
pearsonDF = pd.DataFrame.from_dict(pearsonCorrelationDict, orient='index')
pearsonDF.columns = ['similarityIndex']
pearsonDF['userId'] = pearsonDF.index
pearsonDF.index = range(len(pearsonDF))
pearsonDF.head()
 | similarityIndex | userId |
---|---|---|
0 | 0.230556 | 214 |
1 | 0.518751 | 6264 |
2 | 0.484248 | 19924 |
3 | 0.719023 | 21962 |
4 | 0.616316 | 22361 |
The top x similar users to the input user
Let’s get the top 50 users that are most similar to the input.
topUsers = pearsonDF.sort_values(by='similarityIndex', ascending=False)[0:50]
topUsers.head()
 | similarityIndex | userId |
---|---|---|
93 | 0.962140 | 1312 |
11 | 0.951044 | 36202 |
35 | 0.885937 | 105397 |
83 | 0.848528 | 332 |
54 | 0.843925 | 187056 |
Now let’s start recommending movies to the input user.
Ratings of the selected users for all movies
We're going to do this by taking the weighted average of the ratings of the movies, using the Pearson correlation as the weight. To do that, we first need to get the movies watched by the users in topUsers from the ratings dataframe, keeping each user's correlation alongside them in the 'similarityIndex' column.
# merge the two tables
topUsersRating = topUsers.merge(ratings_df, left_on='userId', right_on='userId', how='inner')
topUsersRating.head()
 | similarityIndex | userId | movieId | rating |
---|---|---|---|---|
0 | 0.96214 | 1312 | 6 | 3.0 |
1 | 0.96214 | 1312 | 19 | 3.5 |
2 | 0.96214 | 1312 | 32 | 2.5 |
3 | 0.96214 | 1312 | 110 | 2.5 |
4 | 0.96214 | 1312 | 150 | 3.0 |
Now we multiply each movie rating by its weight (the similarity index), then sum the weighted ratings per movie and divide by the sum of the weights. We can do this by multiplying two columns, grouping the dataframe by movieId, and then dividing two columns. This aggregates the opinions of all the similar users into a score for each candidate movie.
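In formula form, the predicted score for a movie $m$ is

$$\text{score}(m) = \frac{\sum_{u} \text{sim}(u)\, r_{u,m}}{\sum_{u} \text{sim}(u)},$$

where the sums run over the top users who rated $m$, $\text{sim}(u)$ is user $u$'s Pearson similarity to the input, and $r_{u,m}$ is that user's rating of movie $m$.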
# multiply the similarity by the user's ratings
topUsersRating['weightedRating'] = topUsersRating['similarityIndex']*topUsersRating['rating']
topUsersRating.head()
 | similarityIndex | userId | movieId | rating | weightedRating |
---|---|---|---|---|---|
0 | 0.96214 | 1312 | 6 | 3.0 | 2.886421 |
1 | 0.96214 | 1312 | 19 | 3.5 | 3.367492 |
2 | 0.96214 | 1312 | 32 | 2.5 | 2.405351 |
3 | 0.96214 | 1312 | 110 | 2.5 | 2.405351 |
4 | 0.96214 | 1312 | 150 | 3.0 | 2.886421 |
# apply a sum to topUsersRating after grouping it by movieId
tempTopUsersRating = topUsersRating.groupby('movieId').sum()[['similarityIndex','weightedRating']]
tempTopUsersRating.columns = ['sum_similarityIndex','sum_weightedRating']
tempTopUsersRating.head()
movieId | sum_similarityIndex | sum_weightedRating |
---|---|---|
1 | 24.947499 | 100.721637 |
2 | 22.262128 | 70.826453 |
3 | 8.242517 | 25.223362 |
4 | 2.427828 | 6.840441 |
5 | 12.595882 | 33.904291 |
# create an empty dataframe
recommendation_df = pd.DataFrame()

# take the weighted average
recommendation_df['weighted average recommendation score'] = tempTopUsersRating['sum_weightedRating']/tempTopUsersRating['sum_similarityIndex']
recommendation_df['movieId'] = tempTopUsersRating.index
recommendation_df.head()
movieId | weighted average recommendation score | movieId |
---|---|---|
1 | 4.037344 | 1 |
2 | 3.181477 | 2 |
3 | 3.060153 | 3 |
4 | 2.817515 | 4 |
5 | 2.691696 | 5 |
Let’s sort this and see the top 20 movies that the algorithm recommended.
recommendation_df = recommendation_df.sort_values(by='weighted average recommendation score', ascending=False)
recommendation_df.head()
movieId | weighted average recommendation score | movieId |
---|---|---|
4863 | 5.0 | 4863 |
5641 | 5.0 | 5641 |
3777 | 5.0 | 3777 |
3205 | 5.0 | 3205 |
3847 | 5.0 | 3847 |
movies_df.loc[movies_df['movieId'].isin(recommendation_df.head(20)['movieId'].tolist())]
 | movieId | title | genres | year |
---|---|---|---|---|
3118 | 3205 | Black Sunday (La maschera del demonio) | [Horror] | 1960 |
3686 | 3777 | Nekromantik | [Comedy, Horror] | 1987 |
3754 | 3847 | Ilsa, She Wolf of the SS | [Horror] | 1974 |
3876 | 3970 | Beyond, The (E tu vivrai nel terrore - L’aldilà) | [Horror] | 1981 |
4767 | 4863 | Female Trouble | [Comedy, Crime] | 1975 |
5542 | 5641 | Moderns, The | [Drama] | 1988 |
5681 | 5780 | Polyester | [Comedy] | 1981 |
5810 | 5909 | Visitor Q (Bizita Q) | [Comedy, Drama, Horror] | 2001 |
8549 | 26007 | Unknown Soldier, The (Tuntematon sotilas) | [Drama, War] | 1955 |
12542 | 58425 | Heima | [Documentary] | 2007 |
12713 | 59684 | Lake of Fire | [Documentary] | 2006 |
16931 | 85181 | Pooh’s Grand Adventure: The Search for Christo… | [Adventure, Animation, Children, Musical] | 1997 |
20049 | 98198 | OMG Oh My God! | [Comedy, Drama] | 2012 |
21237 | 102666 | Ivan Vasilievich: Back to the Future (Ivan Vas… | [Adventure, Comedy] | 1973 |
22406 | 106561 | Krrish 3 | [Action, Adventure, Fantasy, Sci-Fi] | 2013 |
46195 | 167248 | Kedi | [(no genres listed)] | 2016 |
50636 | 176753 | Bingo - The King of the Mornings | [Comedy, Drama] | 2017 |
51187 | 177951 | Happy! | [Fantasy] | 2017 |
53314 | 182723 | Cosmos: A Spacetime Odissey | [(no genres listed)] | NaN |
54462 | 185227 | Brief History of Disbelief | [Documentary] | 2004 |
These are the top 20 movies to recommend to the user based on a collaborative filtering recommendation system.