{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
""
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"# Lecture 8: Recommender Systems Part 2\n",
"\n",
"UBC Master of Data Science program, 2023-24\n",
"\n",
"Instructor: Varada Kolhatkar"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Lecture plan, imports, and LOs"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"### Imports "
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"outputs": [],
"source": [
"import os\n",
"import random\n",
"import sys\n",
"import time\n",
"\n",
"import numpy as np\n",
"import pandas as pd\n",
"\n",
"sys.path.append(os.path.join(os.path.abspath(\".\"), \"code\"))\n",
"from sklearn.decomposition import PCA\n",
"from sklearn.model_selection import cross_validate, train_test_split\n",
"\n",
"pd.set_option(\"display.max_colwidth\", 0)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
""
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"### Lecture plan\n",
"\n",
"- Recap (~15 mins)\n",
"- Content-based filtering (25 mins)\n",
"- Break (~5 mins)\n",
"- Questions for class discussion ( ~5 mins)\n",
"- Miscellaneous topics (~10 mins)\n",
"- Final comments, summary, reflection (~5 mins)\n",
"- Course evaluations (time-permitting) (~5 mins)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
""
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"### Learning outcomes\n",
"\n",
"From this lecture, students are expected to be able to:\n",
"- Formulate the rating prediction problem as a supervised machine learning problem. \n",
"- Create a content-based filter given ratings data and item features to predict missing ratings in the utility matrix. \n",
"- Discuss differences, advantages and disadvantages between content-based filtering and collaborative filtering.\n",
"- Explain the idea of hybrid approaches at a high level. \n",
"- Create and work with a sparse utility matrix for datasets with large number of items and users. \n",
"- Name different kinds of data which occur in the context of recommendation systems. \n",
"- Discuss important considerations in recommendation systems beyond error rate. "
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## 1. Recap\n",
""
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"### 1.1 Recap: Recommender systems problem \n",
"\n",
"- We are usually given ratings data. \n",
"- We use this data to create a **utility matrix** which encodes interactions between users and items. \n",
"- The utility matrix has many missing entries. \n",
"- We defined the recommender systems problem as a **matrix completion problem**.\n",
"\n",
" "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's load some toy data. "
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" user_id movie_id rating\n",
"0 Sam Lion King 4 \n",
"1 Sam Jerry Maguire 4 \n",
"2 Sam Roman Holidays 5 \n",
"3 Sam Downfall 1 \n",
"4 Eva Titanic 2 \n",
"5 Eva Jerry Maguire 1 \n",
"6 Eva Inception 4 \n",
"7 Eva Man on Wire 5 \n",
"8 Eva The Social Dilemma 5 \n",
"9 Pat Titanic 3 \n",
"10 Pat Lion King 4 \n",
"11 Pat Bambi 4 \n",
"12 Pat Cast Away 3 \n",
"13 Pat Jerry Maguire 5 \n",
"14 Pat Downfall 2 \n",
"15 Pat A Beautiful Mind 3 \n",
"16 Jim Titanic 2 \n",
"17 Jim Lion King 3 \n",
"18 Jim The Social Dilemma 5 \n",
"19 Jim Malcolm x 4 \n",
"20 Jim Man on Wire 5 "
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"toy_ratings = pd.read_csv(\"data/toy_ratings.csv\")\n",
"toy_ratings"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Number of users (N): 4\n",
"Number of movies (M): 12\n"
]
}
],
"source": [
"N = len(np.unique(toy_ratings[\"user_id\"]))\n",
"M = len(np.unique(toy_ratings[\"movie_id\"]))\n",
"print(f\"Number of users (N): {N}\")\n",
"print(f\"Number of movies (M): {M}\")"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"user_key = \"user_id\"\n",
"item_key = \"movie_id\"\n",
"user_mapper = dict(zip(np.unique(toy_ratings[user_key]), list(range(N))))\n",
"item_mapper = dict(zip(np.unique(toy_ratings[item_key]), list(range(M))))\n",
"user_inverse_mapper = dict(zip(list(range(N)), np.unique(toy_ratings[user_key])))\n",
"item_inverse_mapper = dict(zip(list(range(M)), np.unique(toy_ratings[item_key])))\n",
"\n",
"def create_Y_from_ratings(data, N, M):\n",
" Y = np.zeros((N, M))\n",
" Y.fill(np.nan)\n",
" for index, val in data.iterrows():\n",
" n = user_mapper[val[user_key]]\n",
" m = item_mapper[val[item_key]]\n",
" Y[n, m] = val[\"rating\"]\n",
"\n",
" return Y"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" A Beautiful Mind Bambi Cast Away Downfall Inception Jerry Maguire \\\n",
"Eva NaN NaN NaN NaN 4.0 1.0 \n",
"Jim NaN NaN NaN NaN NaN NaN \n",
"Pat 3.0 4.0 3.0 2.0 NaN 5.0 \n",
"Sam NaN NaN NaN 1.0 NaN 4.0 \n",
"\n",
" Lion King Malcolm x Man on Wire Roman Holidays The Social Dilemma \\\n",
"Eva NaN NaN 5.0 NaN 5.0 \n",
"Jim 3.0 4.0 5.0 NaN 5.0 \n",
"Pat 4.0 NaN NaN NaN NaN \n",
"Sam 4.0 NaN NaN 5.0 NaN \n",
"\n",
" Titanic \n",
"Eva 2.0 \n",
"Jim 2.0 \n",
"Pat 3.0 \n",
"Sam NaN "
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"Y = create_Y_from_ratings(toy_ratings, N, M)\n",
"utility_mat = pd.DataFrame(Y, columns=item_mapper.keys(), index=user_mapper.keys())\n",
"utility_mat"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
""
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"### 1.2 Recap: Main approaches\n",
"\n",
"- Collaborative filtering (last lecture)\n",
"    - \"Unsupervised\" learning \n",
"    - We only have labels $y_{ij}$ (rating of user $i$ for item $j$). \n",
"    - We learn latent features. \n",
"- **Content-based recommenders (today's focus)**\n",
"    - Supervised learning\n",
"    - Extract features $x_i$ of users and/or items and build a model to predict rating $y_i$ given $x_i$. \n",
" - Apply model to predict for new users/items. \n",
"- Hybrid \n",
" - Combining collaborative filtering with content-based filtering\n",
" "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 1.3 Recap: Collaborative filtering \n",
"In the last lecture we talked about collaborative filtering.\n",
"- **People who agreed in the past are likely to agree again in the future.**\n",
"- \"Unsupervised\" learning \n",
"    - We only have labels $Y$ (ratings $y_{ij}$ for user $i$ and item $j$) but no features of items or users. \n",
"- We use a PCA-like approach to learn the latent features of users and items.\n",
"- You can think of the $Z$ matrix as user embeddings and the $W$ matrix as item embeddings. \n",
"- The predicted rating is then the dot product of the corresponding user and item embeddings. \n",
"- The key idea is that the loss function only includes the available ratings, plus regularization terms for $W$ and $Z$ to avoid overfitting (a minimal sketch of this loss is shown after this cell). \n",
"- So instead of using regular PCA or `TruncatedSVD`, we implement our own loss function or use a package which implements this loss function (e.g., the `surprise` package). \n",
""
]
},
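{
"cell_type": "markdown",
"metadata": {},
"source": [
"Here is a minimal NumPy sketch of that idea (not the `surprise` implementation): the squared error is computed only over the observed entries of `Y`, plus L2 regularization on `W` and `Z`. The names `Z`, `W`, and `lambda_reg` are illustrative.\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"def cf_loss(Z, W, Y, lambda_reg=1.0):\n",
"    \"\"\"Squared error over observed ratings only, plus L2 regularization.\n",
"\n",
"    Z: (N, k) user embeddings, W: (M, k) item embeddings,\n",
"    Y: (N, M) utility matrix with np.nan for missing ratings.\n",
"    \"\"\"\n",
"    Y_hat = Z @ W.T              # predicted ratings for every user/item pair\n",
"    mask = ~np.isnan(Y)          # only observed ratings contribute to the loss\n",
"    err = (Y_hat - Y)[mask]\n",
"    return np.sum(err ** 2) + lambda_reg * (np.sum(Z ** 2) + np.sum(W ** 2))\n",
"```"
]
},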
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 1.4 Measuring similarity/distances\n",
"\n",
"- After training a collaborative filtering model, we have user embeddings and item embeddings.\n",
"- We use these embeddings to make recommendations.\n",
"- In particular, these embeddings are used to match users with items or to find users with similar preferences and items with similar characteristics.\n",
"- So the notion of **similarity** or **distances** plays a crucial role in this context"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"- There are two primary similarity measures used in this context: \n",
"\n",
"    - **Dot products** measure the similarity through the alignment of vectors in a high-dimensional space. Higher dot product means a greater degree of similarity, assuming that the vectors are non-negative. \n",
"    $$similarity_{dot\\_product}(u, v) = u \\cdot v$$ \n",
"\n",
"    - **Cosine similarity** measures the cosine of the angle between two vectors.\n",
"    $$similarity_{cosine}(u,v) = cosine(u,v) = \\frac{u \\cdot v}{\\left\\lVert u\\right\\rVert_2 \\left\\lVert v\\right\\rVert_2}$$"
]
},
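{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make these measures concrete, here is a small NumPy sketch with made-up vectors (`u` could be a user embedding and `v` an item embedding):\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"u = np.array([1.0, 2.0, 3.0])  # e.g., a user embedding (illustrative values)\n",
"v = np.array([2.0, 0.0, 1.0])  # e.g., an item embedding (illustrative values)\n",
"\n",
"dot_sim = u @ v                                              # dot product similarity\n",
"cos_sim = (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))  # cosine similarity\n",
"euc_dist = np.linalg.norm(u - v)                             # Euclidean distance (smaller = more similar)\n",
"print(dot_sim, cos_sim, euc_dist)\n",
"```"
]
},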
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Discussion question**\n",
"\n",
"Suppose you are recommending items based on similarity between items. Given a query vector \"Query\" in the picture below and the three item vectors, determine the ranking of the items for the three similarity measures below: \n",
"- **Example: Similarity based on Euclidean distance: item 3 > item 1 > item 2**\n",
"- Similarity based on dot product: **item 2 > item 3 > item 1** \n",
"- Cosine similarity: **item 1 > item 2 > item 3**\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"- Adapted from [here](https://developers.google.com/machine-learning/recommendation/overview/candidate-generation)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Which similarity metric to use in what context?**\n",
"\n",
"- The choice of similarity metric can have a significant impact on the behavior of the recommendation system.\n",
"\n",
"**dot product similarity**\n",
"- Larger norms can lead to higher similarity scores, which means that items with larger norms are more likely to be recommended.\n",
"- If popularity is an important factor for recommendations in your system, this characteristic of the dot product can be beneficial because it naturally boosts recommendations of popular items.\n",
"- That said, this can also lead to a lack of diversity in recommendations because popular items might overshadow less popular, yet still relevant, items.\n",
" \n",
"**Cosine Similarity:**\n",
"\n",
"- Cosine similarity measures the cosine of the angle between two vectors, which corresponds to their directional alignment, independent of their magnitude.\n",
"- This means that even if an item has a large norm, it won't be recommended unless it's directionally similar to the user's preference vector.\n",
"- Cosine similarity is often used when the scale of the embeddings should not influence the recommendation, providing a more balanced field for both popular and less popular items."
]
},
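{
"cell_type": "markdown",
"metadata": {},
"source": [
"A tiny NumPy sketch of this effect, with illustrative numbers: the item embedding with the larger norm wins under the dot product, while cosine similarity prefers the item pointing in the same direction as the user.\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"user = np.array([1.0, 1.0, 0.0])          # user embedding (illustrative)\n",
"popular_item = np.array([5.0, 0.0, 5.0])  # large norm, only partially aligned with the user\n",
"niche_item = np.array([0.6, 0.6, 0.0])    # small norm, same direction as the user\n",
"\n",
"def cosine(u, v):\n",
"    return (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))\n",
"\n",
"print(\"dot:   \", user @ popular_item, user @ niche_item)                 # popular item scores higher\n",
"print(\"cosine:\", cosine(user, popular_item), cosine(user, niche_item))  # niche item scores higher\n",
"```"
]
},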
{
"cell_type": "markdown",
"metadata": {},
"source": [
""
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## 2. Content-based filtering\n",
""
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"- What if a new item or a new user shows up? \n",
"    - You won't have any ratings information for that item or user. \n",
"- Content-based filtering is well suited to predicting ratings for new items and new users.\n",
"- Content-based filtering is a **supervised machine learning** approach to recommender systems. \n",
"- In collaborative filtering we assumed that we only have ratings data. \n",
"- Usually there is some information available about items and users. \n",
"- Examples\n",
" - Netflix can describe movies as action, romance, comedy, documentaries.\n",
" - Netflix has some demographic and preference information on users. \n",
" - Amazon could describe books according to topics: math, languages, history. \n",
" - Tinder could describe people according to age, location, employment.\n",
"- Can we use this information to predict ratings in the utility matrix? \n",
" - Yes! Using content-based filtering! "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Overview**\n",
"\n",
"In content-based filtering, \n",
"- We assume that we are given item or user features. \n",
"- Given movie information, for instance, we **create a user profile for each user**.\n",
"- We treat the rating prediction problem as **a set of regression problems** and build a regression model for each user.\n",
"- Once we have trained a regression model for each user, we **complete the utility matrix by predicting ratings for each user** using their corresponding models. "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's look into each of these steps one by one with a toy example. "
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"### 2.1 Movie features\n",
"\n",
"- Suppose we also have movie features. In particular, suppose we have information about the genre of each movie. "
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" Action Romance Drama Comedy Children Documentary\n",
"A Beautiful Mind 0 1 1 0 0 0 \n",
"Bambi 0 0 1 0 1 0 \n",
"Cast Away 0 1 1 0 0 0 \n",
"Downfall 0 0 0 0 0 1 \n",
"Inception 1 0 1 0 0 0 \n",
"Jerry Maguire 0 1 1 1 0 0 \n",
"Lion King 0 0 1 0 1 0 \n",
"Malcolm x 0 0 0 0 0 1 \n",
"Man on Wire 0 0 0 0 0 1 \n",
"Roman Holidays 0 1 1 1 0 0 \n",
"The Social Dilemma 0 0 0 0 0 1 \n",
"Titanic 0 1 1 0 0 0 "
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"movie_feats_df = pd.read_csv(\"data/toy_movie_feats.csv\", index_col=0)\n",
"movie_feats_df"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(12, 6)"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"Z = movie_feats_df.to_numpy()\n",
"Z.shape"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"- How can we use these features to predict missing ratings? \n",
"- Using the ratings data and movie features: \n",
" - Build **profiles for different users**.\n",
" - Train a **supervised machine learning model for each user**.\n",
" - Predict ratings using the trained models"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"- Let's consider an example user: **Pat**. \n",
"- We don't know anything about Pat, but we know her ratings for movies. "
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"slideshow": {
"slide_type": "-"
}
},
"outputs": [
{
"data": {
"text/plain": [
"A Beautiful Mind 3.0\n",
"Bambi 4.0\n",
"Cast Away 3.0\n",
"Downfall 2.0\n",
"Inception NaN \n",
"Jerry Maguire 5.0\n",
"Lion King 4.0\n",
"Malcolm x NaN \n",
"Man on Wire NaN \n",
"Roman Holidays NaN \n",
"The Social Dilemma NaN \n",
"Titanic 3.0\n",
"Name: Pat, dtype: float64"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"utility_mat.loc[\"Pat\"]"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"- We also know about movies and their features. \n",
"- If Pat gave a high rating to _Lion King_, it means that she liked the features of the movie. "
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Action 0\n",
"Romance 0\n",
"Drama 1\n",
"Comedy 0\n",
"Children 1\n",
"Documentary 0\n",
"Name: Lion King, dtype: int64"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"movie_feats_df.loc[\"Lion King\"]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
""
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"### 2.2 Building user profiles "
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"For each user $i$ create a user profile as follows. \n",
"\n",
"- Consider all movies rated by $i$ and create `X` and `y` for the user: \n",
" - Each row in `X` contains the movie features of movie $j$ rated by $i$. \n",
" - Each value in `y` is the corresponding rating given to the movie $j$ by user $i$. \n",
"- Fit a regression model using `X` and `y`. \n",
"- Apply the model to predict ratings for new items! "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"As an example, let's build a profile for Pat."
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" movie_id rating\n",
"9 Titanic 3 \n",
"10 Lion King 4 \n",
"11 Bambi 4 \n",
"12 Cast Away 3 \n",
"13 Jerry Maguire 5 \n",
"14 Downfall 2 \n",
"15 A Beautiful Mind 3 "
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Which movies are rated by Pat? \n",
"\n",
"movies_rated_by_pat = toy_ratings[toy_ratings['user_id']=='Pat'][['movie_id', 'rating']]\n",
"movies_rated_by_pat"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(12, 6)"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"Z.shape"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [],
"source": [
"# Get feature vectors of movies rated by Pat. \n",
"\n",
"pat_X = []\n",
"pat_y = []\n",
"for (index, val) in movies_rated_by_pat.iterrows():\n",
" # Get the id of this movie rated by Pat \n",
" m = item_mapper[val['movie_id']]\n",
" \n",
" # Get the feature vector for the movie \n",
" pat_X.append(Z[m])\n",
" \n",
" # Get the rating for the movie\n",
" pat_y.append(val['rating'])"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" Action Romance Drama Comedy Children Documentary\n",
"Titanic 0 1 1 0 0 0 \n",
"Lion King 0 0 1 0 1 0 \n",
"Bambi 0 0 1 0 1 0 \n",
"Cast Away 0 1 1 0 0 0 \n",
"Jerry Maguire 0 1 1 1 0 0 \n",
"Downfall 0 0 0 0 0 1 \n",
"A Beautiful Mind 0 1 1 0 0 0 "
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.DataFrame(pat_X, index=movies_rated_by_pat['movie_id'].tolist(), columns = movie_feats_df.columns)"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[3, 4, 4, 3, 5, 2, 3]"
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pat_y"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Similar to how we created `X` and `y` for Pat above, the function below builds `X` and `y` for all users. "
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {
"slideshow": {
"slide_type": "-"
}
},
"outputs": [],
"source": [
"from collections import defaultdict\n",
"\n",
"def get_lr_data_per_user(ratings_df, d):\n",
" lr_y = defaultdict(list)\n",
" lr_X = defaultdict(list)\n",
" lr_items = defaultdict(list)\n",
"\n",
" for index, val in ratings_df.iterrows():\n",
" n = user_mapper[val[user_key]]\n",
" m = item_mapper[val[item_key]]\n",
" lr_X[n].append(Z[m])\n",
" lr_y[n].append(val[\"rating\"])\n",
" lr_items[n].append(m)\n",
"\n",
" for n in lr_X:\n",
" lr_X[n] = np.array(lr_X[n])\n",
" lr_y[n] = np.array(lr_y[n])\n",
"\n",
" return lr_X, lr_y, lr_items"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [],
"source": [
"d = movie_feats_df.shape[1]\n",
"X_train_usr, y_train_usr, rated_items = get_lr_data_per_user(toy_ratings, d)"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"defaultdict(list,\n",
" {3: array([[0, 0, 1, 0, 1, 0],\n",
" [0, 1, 1, 1, 0, 0],\n",
" [0, 1, 1, 1, 0, 0],\n",
" [0, 0, 0, 0, 0, 1]]),\n",
" 0: array([[0, 1, 1, 0, 0, 0],\n",
" [0, 1, 1, 1, 0, 0],\n",
" [1, 0, 1, 0, 0, 0],\n",
" [0, 0, 0, 0, 0, 1],\n",
" [0, 0, 0, 0, 0, 1]]),\n",
" 2: array([[0, 1, 1, 0, 0, 0],\n",
" [0, 0, 1, 0, 1, 0],\n",
" [0, 0, 1, 0, 1, 0],\n",
" [0, 1, 1, 0, 0, 0],\n",
" [0, 1, 1, 1, 0, 0],\n",
" [0, 0, 0, 0, 0, 1],\n",
" [0, 1, 1, 0, 0, 0]]),\n",
" 1: array([[0, 1, 1, 0, 0, 0],\n",
" [0, 0, 1, 0, 1, 0],\n",
" [0, 0, 0, 0, 0, 1],\n",
" [0, 0, 0, 0, 0, 1],\n",
" [0, 0, 0, 0, 0, 1]])})"
]
},
"execution_count": 18,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"X_train_usr"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Do you think the shape of `X` and `y` for all users would be the same?"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"**Examining user profiles**\n",
"\n",
"- Let's examine some user profiles. "
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {
"slideshow": {
"slide_type": "-"
}
},
"outputs": [],
"source": [
"def get_user_profile(user_name):\n",
" X = X_train_usr[user_mapper[user_name]]\n",
" y = y_train_usr[user_mapper[user_name]]\n",
" items = rated_items[user_mapper[user_name]]\n",
" movie_names = [item_inverse_mapper[item] for item in items]\n",
" print(\"Profile for user: \", user_name)\n",
" profile_df = pd.DataFrame(X, columns=movie_feats_df.columns, index=movie_names)\n",
" profile_df[\"ratings\"] = y\n",
" return profile_df"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {
"scrolled": true,
"slideshow": {
"slide_type": "-"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Profile for user: Pat\n"
]
},
{
"data": {
"text/plain": [
" Action Romance Drama Comedy Children Documentary \\\n",
"Titanic 0 1 1 0 0 0 \n",
"Lion King 0 0 1 0 1 0 \n",
"Bambi 0 0 1 0 1 0 \n",
"Cast Away 0 1 1 0 0 0 \n",
"Jerry Maguire 0 1 1 1 0 0 \n",
"Downfall 0 0 0 0 0 1 \n",
"A Beautiful Mind 0 1 1 0 0 0 \n",
"\n",
" ratings \n",
"Titanic 3 \n",
"Lion King 4 \n",
"Bambi 4 \n",
"Cast Away 3 \n",
"Jerry Maguire 5 \n",
"Downfall 2 \n",
"A Beautiful Mind 3 "
]
},
"execution_count": 20,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"get_user_profile(\"Pat\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"- Pat seems to like Children's movies and movies with Comedy. \n",
"- Seems like she's not so much into romantic movies. \n"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {
"slideshow": {
"slide_type": "-"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Profile for user: Eva\n"
]
},
{
"data": {
"text/plain": [
" Action Romance Drama Comedy Children Documentary \\\n",
"Titanic 0 1 1 0 0 0 \n",
"Jerry Maguire 0 1 1 1 0 0 \n",
"Inception 1 0 1 0 0 0 \n",
"Man on Wire 0 0 0 0 0 1 \n",
"The Social Dilemma 0 0 0 0 0 1 \n",
"\n",
" ratings \n",
"Titanic 2 \n",
"Jerry Maguire 1 \n",
"Inception 4 \n",
"Man on Wire 5 \n",
"The Social Dilemma 5 "
]
},
"execution_count": 21,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"get_user_profile(\"Eva\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"- Eva hasn't rated many movies. There are not many rows. \n",
"- Eva seems to like documentaries and action movies. \n",
"- Seems like she's not so much into romantic movies. "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
""
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"### 2.3 Supervised approach to rating prediction"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"Given `X` and `y` for each user, we can now build a regression model for each user. "
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {
"slideshow": {
"slide_type": "-"
}
},
"outputs": [],
"source": [
"from sklearn.linear_model import Ridge\n",
"\n",
"\n",
"def train_for_usr(user_name, model=Ridge()):\n",
" X = X_train_usr[user_mapper[user_name]]\n",
" y = y_train_usr[user_mapper[user_name]]\n",
" model.fit(X, y)\n",
" return model\n",
"\n",
"\n",
"def predict_for_usr(model, movie_names):\n",
" feat_vecs = movie_feats_df.loc[movie_names].values\n",
" preds = model.predict(feat_vecs)\n",
" return preds"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"**A regression model for Pat**"
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {},
"outputs": [],
"source": [
"user_name = \"Pat\"\n",
"pat_model = train_for_usr(user_name)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"- Since we are training a ridge model, we can examine the coefficients. \n",
"- What are the regression weights learned for Pat? "
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" Coefficients for Pat\n",
"Action 0.000000 \n",
"Romance -0.020833 \n",
"Drama 0.437500 \n",
"Comedy 0.854167 \n",
"Children 0.458333 \n",
"Documentary -0.437500 "
]
},
"execution_count": 24,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"col = \"Coefficients for %s\" % user_name\n",
"pd.DataFrame(pat_model.coef_, index=movie_feats_df.columns, columns=[col])"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"- How would Pat rate some movies she hasn't seen? "
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" Action Romance Drama Comedy Children Documentary\n",
"Roman Holidays 0 1 1 1 0 0 \n",
"Malcolm x 0 0 0 0 0 1 "
]
},
"execution_count": 25,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"movies_to_pred = [\"Roman Holidays\", \"Malcolm x\"]\n",
"pred_df = movie_feats_df.loc[movies_to_pred]\n",
"pred_df"
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" Action Romance Drama Comedy Children Documentary \\\n",
"Roman Holidays 0 1 1 1 0 0 \n",
"Malcolm x 0 0 0 0 0 1 \n",
"\n",
" Pat's predicted ratings \n",
"Roman Holidays 4.145833 \n",
"Malcolm x 2.437500 "
]
},
"execution_count": 26,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"user_name = \"Pat\"\n",
"preds = predict_for_usr(pat_model, movies_to_pred)\n",
"pred_df[user_name + \"'s predicted ratings\"] = preds\n",
"pred_df"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"**A regression model for Eva**"
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" Coefficients for Eva\n",
"Action 0.333333 \n",
"Romance -1.000000 \n",
"Drama -0.666667 \n",
"Comedy -0.666667 \n",
"Children 0.000000 \n",
"Documentary 0.666667 "
]
},
"execution_count": 27,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"user_name = \"Eva\"\n",
"eva_model = train_for_usr(user_name)\n",
"col = \"Coefficients for %s\" % user_name\n",
"pd.DataFrame(eva_model.coef_, index=movie_feats_df.columns, columns=[col])"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"- What are the predicted ratings for Eva for a list of movies?"
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" Action Romance Drama Comedy Children Documentary \\\n",
"Roman Holidays 0 1 1 1 0 0 \n",
"Malcolm x 0 0 0 0 0 1 \n",
"\n",
" Pat's predicted ratings Eva's predicted ratings \n",
"Roman Holidays 4.145833 1.666667 \n",
"Malcolm x 2.437500 4.666667 "
]
},
"execution_count": 28,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"user_name = \"Eva\"\n",
"preds = predict_for_usr(eva_model, movies_to_pred)\n",
"pred_df[user_name + \"'s predicted ratings\"] = preds\n",
"pred_df"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 2.4 Completing the utility matrix with content-based filtering\n",
"\n",
"Here is the original utility matrix. "
]
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" A Beautiful Mind Bambi Cast Away Downfall Inception Jerry Maguire \\\n",
"Eva NaN NaN NaN NaN 4.0 1.0 \n",
"Jim NaN NaN NaN NaN NaN NaN \n",
"Pat 3.0 4.0 3.0 2.0 NaN 5.0 \n",
"Sam NaN NaN NaN 1.0 NaN 4.0 \n",
"\n",
" Lion King Malcolm x Man on Wire Roman Holidays The Social Dilemma \\\n",
"Eva NaN NaN 5.0 NaN 5.0 \n",
"Jim 3.0 4.0 5.0 NaN 5.0 \n",
"Pat 4.0 NaN NaN NaN NaN \n",
"Sam 4.0 NaN NaN 5.0 NaN \n",
"\n",
" Titanic \n",
"Eva 2.0 \n",
"Jim 2.0 \n",
"Pat 3.0 \n",
"Sam NaN "
]
},
"execution_count": 29,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"utility_mat"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"- Using predictions per user, we can fill in missing entries in the utility matrix. "
]
},
{
"cell_type": "code",
"execution_count": 30,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.linear_model import Ridge\n",
"\n",
"models = dict()\n",
"pred_lin_reg = np.zeros((N, M))\n",
"\n",
"for n in range(N):\n",
" models[n] = Ridge()\n",
" models[n].fit(X_train_usr[n], y_train_usr[n])\n",
" pred_lin_reg[n] = models[n].predict(Z)"
]
},
{
"cell_type": "code",
"execution_count": 31,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" A Beautiful Mind Bambi Cast Away Downfall Inception \\\n",
"Eva 2.333333 3.333333 2.333333 4.666667 3.666667 \n",
"Jim 2.575000 3.075000 2.575000 4.450000 3.150000 \n",
"Pat 3.291667 3.770833 3.291667 2.437500 3.312500 \n",
"Sam 3.810811 3.675676 3.810811 1.783784 3.351351 \n",
"\n",
" Jerry Maguire Lion King Malcolm x Man on Wire Roman Holidays \\\n",
"Eva 1.666667 3.333333 4.666667 4.666667 1.666667 \n",
"Jim 2.575000 3.075000 4.450000 4.450000 2.575000 \n",
"Pat 4.145833 3.770833 2.437500 2.437500 4.145833 \n",
"Sam 4.270270 3.675676 1.783784 1.783784 4.270270 \n",
"\n",
" The Social Dilemma Titanic \n",
"Eva 4.666667 2.333333 \n",
"Jim 4.450000 2.575000 \n",
"Pat 2.437500 3.291667 \n",
"Sam 1.783784 3.810811 "
]
},
"execution_count": 31,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.DataFrame(pred_lin_reg, columns=item_mapper.keys(), index=user_mapper.keys())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"- In this toy example, we assumed access to item features. Frequently, we also have access to user features, including demographic information.\n",
"\n",
"- With this data, we can construct item profiles similar to user profiles and train a unique regression model for each item.\n",
"\n",
"- These models enable us to predict ratings for each item individually.\n",
"\n",
"- Typically, the final rating is derived from a weighted average that combines the ratings suggested by both item features and user features."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
""
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"### 2.5 Miscellaneous comments on content-based filtering\n",
"\n",
"**Collaborative filtering vs. content-based filtering**\n",
"\n",
"- Latent-factor approach to collaborative filtering, where we reconstruct rating for user $i$ and item $j$ as: \n",
"$$\\hat{y}_{ij} = w_j^T z_{i}$$\n",
"\n",
" - $w_j^T$ are \"hidden\" features or embedding of item $j$\n",
" - $z_i$ are \"hidden\" features or embedding of user $i$\n",
"\n",
"\n",
"- A linear model approach to content-based filtering, where we reconstruct rating for user $i$ and item $j$ as: \n",
"$$\\hat{y}_{ij} = w_i^T x_{ij}$$\n",
" - $x_{ij}$ is a feature vector for user $i$ and item $j$\n",
" - $w$ are the weights learned for user $i$\n",
" - Our usual supervised learning setup for linear regression. \n"
]
},
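{
"cell_type": "markdown",
"metadata": {},
"source": [
"Here is a tiny NumPy sketch contrasting the two prediction rules above. All numbers are made up: in collaborative filtering both `z_i` and `w_j` are learned latent vectors, whereas in content-based filtering `x_j` is a known item feature vector and `w_i` is the learned weight vector of user $i$'s regression model (intercept ignored for simplicity).\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"# Collaborative filtering: both factors are learned (latent)\n",
"z_i = np.array([0.5, -1.2, 0.3])    # latent embedding of user i (illustrative)\n",
"w_j = np.array([1.0, 0.2, -0.4])    # latent embedding of item j (illustrative)\n",
"y_hat_cf = z_i @ w_j                # predicted rating for user i, item j\n",
"\n",
"# Content-based filtering: item features are known, user weights are learned\n",
"x_j = np.array([0, 1, 1, 1, 0, 0])  # known genre features of item j (illustrative)\n",
"w_i = np.array([0.0, -0.02, 0.44, 0.85, 0.46, -0.44])  # learned weights for user i (illustrative)\n",
"y_hat_cb = w_i @ x_j                # predicted rating for user i, item j\n",
"\n",
"print(y_hat_cf, y_hat_cb)\n",
"```"
]
},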
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"**Fine-tuning your regression models**\n",
"\n",
"- The feature matrix for movies can contain different types of features.\n",
" - Example: Plot of the movie (text features), actors (categorical features), year of the movie, budget and revenue of the movie (numerical features). \n",
" - You'll apply our usual preprocessing techniques to these features. \n",
"- If you have enough data, you could also carry out hyperparameter tuning with cross-validation for each model.\n",
"- Finally, although we have been talking about linear models above, you can use any regression model of your choice. "
]
},
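{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a sketch of what this could look like for a single user, assuming a hypothetical item feature DataFrame with columns `plot` (text), `actor` (categorical), and `year`, `budget` (numeric), and `X_user`/`y_user` built for one user analogously to `X_train_usr`/`y_train_usr` above:\n",
"\n",
"```python\n",
"from sklearn.compose import make_column_transformer\n",
"from sklearn.feature_extraction.text import CountVectorizer\n",
"from sklearn.linear_model import Ridge\n",
"from sklearn.model_selection import GridSearchCV\n",
"from sklearn.pipeline import make_pipeline\n",
"from sklearn.preprocessing import OneHotEncoder, StandardScaler\n",
"\n",
"# Usual preprocessing for mixed feature types (column names are hypothetical)\n",
"preprocessor = make_column_transformer(\n",
"    (CountVectorizer(max_features=100), \"plot\"),          # text feature\n",
"    (OneHotEncoder(handle_unknown=\"ignore\"), [\"actor\"]),  # categorical feature\n",
"    (StandardScaler(), [\"year\", \"budget\"]),               # numeric features\n",
")\n",
"pipe = make_pipeline(preprocessor, Ridge())\n",
"param_grid = {\"ridge__alpha\": [0.01, 0.1, 1.0, 10.0]}\n",
"# With enough ratings for this user, tune alpha with cross-validation:\n",
"# search = GridSearchCV(pipe, param_grid, cv=5).fit(X_user, y_user)\n",
"```"
]
},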
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"**Advantages of content-based filtering**\n",
"\n",
"- We don't need many users to provide ratings for an item. \n",
"- Each user is modeled separately, so you might be able to capture uniqueness of taste. \n",
"- Since you can obtain the features of the items, you can immediately recommend new items. \n",
" - This would not have been possible with collaborative filtering. \n",
"- Recommendations are more interpretable (if you use linear models)\n",
" - You can explain to the user why you are recommending an item because you have learned weights."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"**Disadvantages of content-based filtering**\n",
"\n",
"- Feature acquisition and feature engineering\n",
"    - What features should we use to explain the difference in ratings? \n",
"    - Obtaining those features for each item might be very expensive. \n",
"- Less diversity: it hardly recommends items outside the user's profile. "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
""
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"### 2.6 (Optional) Hybrid approaches"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"- Both collaborative filtering and content-based filtering have their own advantages and suffer from their shortcomings. \n",
"- Collaborative filtering exploits social information but it does not predict well for new movies/users.\n",
" - New movies don't yet have ratings, and new users haven't rated anything.\n",
"- Content-based approaches do not have this problem but they are less diverse and do not exploit information about similarity between users. \n",
"- Can we combine the best of the two worlds? \n",
"- There are several ways to combine collaborative filtering and content-based filtering: \n",
" - Build separate models for collaborative filtering and content-based filtering and combine their results. For example, the predicted rating can be a weighted average of ratings predicted by each model. \n",
" - Include content-based item features in the collaborative filtering loss function. (Check out [SVDfeature](https://www.jmlr.org/papers/v13/chen12a.html) (won \"KDD Cup\" in 2011 and 2012).) "
]
},
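{
"cell_type": "markdown",
"metadata": {},
"source": [
"A minimal sketch of the first option above, assuming `pred_cf` and `pred_cb` are predicted rating matrices of the same shape from a collaborative filtering model and a content-based model (e.g., `pred_lin_reg` from Section 2.4), and `alpha` is a mixing weight you would tune on validation data:\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"def hybrid_predictions(pred_cf, pred_cb, alpha=0.5):\n",
"    \"\"\"Weighted average of collaborative filtering and content-based predictions.\"\"\"\n",
"    return alpha * np.asarray(pred_cf) + (1 - alpha) * np.asarray(pred_cb)\n",
"\n",
"# e.g., hybrid = hybrid_predictions(pred_from_cf_model, pred_lin_reg, alpha=0.5)\n",
"# where pred_from_cf_model is a hypothetical N x M prediction matrix from collaborative filtering\n",
"```"
]
},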
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
""
]
},
{
"cell_type": "markdown",
"metadata": {
"tags": []
},
"source": [
"## ❓❓ Questions for you\n",
"**iClicker cloud join link: https://join.iclicker.com/NGJD**\n",
"\n",
"### Exercise 8.1 Select all of the following statements which are **True** (iClicker)\n",
"\n",
"- (A) In content-based filtering we leverage available item features in addition to similarity between users.\n",
"- (B) In content-based filtering you represent each user in terms of **known** features of items, whereas in collaborative filtering each user is represented with **latent** features of items. \n",
"- (C) In the setup of content-based filtering we discussed, if you have a new movie, you would have problems predicting ratings for that movie. \n",
"- (D) Interpretation of recommendations might be easier with content-based filtering compared to collaborative filtering. \n",
"- (E) In content-based filtering, if a user has a number of ratings in the training utility matrix but does not have any ratings in the validation utility matrix, then we won't be able to calculate RMSE for the validation utility matrix.\n",
"\n",
""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"```{admonition} V's Solutions!\n",
":class: tip, dropdown\n",
"- B, D\n",
"```"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 3. Miscellaneous topics\n",
""
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"### 3.1 Types of data \n",
"\n",
"- Explicit data: ratings, thumbs up, etc. \n",
"- Implicit data: collected from the users' behaviour (e.g., mouse clicks, purchases, time spent doing something)\n",
"- Trust implicit data that costs something, like time or even money. \n",
"    - This makes it harder to commit fraud. "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
""
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"### 3.2 Sparse utility matrix"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"- Recommender systems work best when there is a large amount of data. \n",
"- So far we've been working with small datasets. "
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"Let's use the [Amazon product data set](http://jmcauley.ucsd.edu/data/amazon/). The authors of the data set have asked for the following citations:\n",
"\n",
"> Ups and downs: Modeling the visual evolution of fashion trends with one-class collaborative filtering.\n",
"> R. He, J. McAuley.\n",
"> WWW, 2016.\n",
"> \n",
"> Image-based recommendations on styles and substitutes.\n",
"> J. McAuley, C. Targett, J. Shi, A. van den Hengel.\n",
"> SIGIR, 2015.\n",
"\n",
"We will focus on the Patio, Lawn, and Garden section. You can download the [ratings here](http://snap.stanford.edu/data/amazon/productGraph/categoryFiles/ratings_Patio_Lawn_and_Garden.csv). "
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"Let's load the data. "
]
},
{
"cell_type": "code",
"execution_count": 32,
"metadata": {
"slideshow": {
"slide_type": "-"
}
},
"outputs": [
{
"data": {
"text/plain": [
" user item rating timestamp\n",
"0 A2VNYWOPJ13AFP 0981850006 5.0 1259798400\n",
"1 A20DWVV8HML3AW 0981850006 5.0 1371081600\n",
"2 A3RVP3YBYYOPRH 0981850006 5.0 1257984000\n",
"3 A28XY55TP3Q90O 0981850006 5.0 1314144000\n",
"4 A3VZW1BGUQO0V3 0981850006 5.0 1308268800"
]
},
"execution_count": 32,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"filename = \"ratings_Patio_Lawn_and_Garden.csv\"\n",
"\n",
"with open(os.path.join(\"data\", filename), \"rb\") as f:\n",
" ratings = pd.read_csv(f, names=(\"user\", \"item\", \"rating\", \"timestamp\"))\n",
"ratings.head()"
]
},
{
"cell_type": "code",
"execution_count": 33,
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"<class 'pandas.core.frame.DataFrame'>\n",
"RangeIndex: 993490 entries, 0 to 993489\n",
"Data columns (total 4 columns):\n",
" # Column Non-Null Count Dtype \n",
"--- ------ -------------- ----- \n",
" 0 user 993490 non-null object \n",
" 1 item 993490 non-null object \n",
" 2 rating 993490 non-null float64\n",
" 3 timestamp 993490 non-null int64 \n",
"dtypes: float64(1), int64(1), object(2)\n",
"memory usage: 30.3+ MB\n"
]
}
],
"source": [
"ratings.info()"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"We'd like to construct the utility matrix `Y`. However, let's see how big it would be:"
]
},
{
"cell_type": "code",
"execution_count": 34,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Number of ratings: 993490\n",
"The average rating: 4.006400668350965\n",
"Number of users: 714791\n",
"Number of items: 105984\n",
"Fraction nonzero: 1.3114269915944552e-05\n",
"Size of full Y matrix (GB): 606.051274752\n"
]
}
],
"source": [
"def get_stats(ratings, item_key=\"item\", user_key=\"user\"):\n",
" print(\"Number of ratings:\", len(ratings))\n",
" print(\"The average rating:\", np.mean(ratings[\"rating\"]))\n",
" N = len(set(ratings[user_key]))\n",
" M = len(set(ratings[item_key]))\n",
" print(\"Number of users:\", N)\n",
" print(\"Number of items:\", M)\n",
" print(\"Fraction nonzero:\", len(ratings) / (N * M))\n",
" print(\"Size of full Y matrix (GB):\", (N * M) * 8 / 1e9)\n",
" return N, M\n",
"\n",
"\n",
"N, M = get_stats(ratings)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"606 GB! That is way too big. We don't want to create that matrix! On the other hand, we see that we only have about 1 million ratings, which would be 8 MB or so ($10^6$ numbers $\\times$ 8 bytes per number). Much more manageable!"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"Let's create a sparse representation of our utility matrix $Y$. "
]
},
{
"cell_type": "code",
"execution_count": 35,
"metadata": {},
"outputs": [],
"source": [
"from scipy.sparse import csr_matrix as sparse_matrix\n",
"\n",
"user_key = \"user\"\n",
"item_key = \"item\"\n",
"user_mapper = dict(zip(np.unique(ratings[user_key]), list(range(N))))\n",
"item_mapper = dict(zip(np.unique(ratings[item_key]), list(range(M))))\n",
"\n",
"user_inverse_mapper = dict(zip(list(range(N)), np.unique(ratings[user_key])))\n",
"item_inverse_mapper = dict(zip(list(range(M)), np.unique(ratings[item_key])))"
]
},
{
"cell_type": "code",
"execution_count": 36,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [],
"source": [
"def create_Y(ratings, N, M, user_key=\"user\", item_key=\"item\"):\n",
" \"\"\"\n",
" Creates a sparse matrix using scipy.csr_matrix and mappers to relate indexes to items' id.\n",
"\n",
" Parameters:\n",
" -----------\n",
" ratings: pd.DataFrame\n",
" the ratings to be stored in the matrix;\n",
" N: int\n",
" the number of users\n",
" M: int\n",
" the number of items\n",
" user_key: string\n",
" the column in ratings that contains the users id\n",
" item_key: string\n",
" the column in ratings that contains the items id\n",
"\n",
" Returns:\n",
" --------\n",
" Y: np.sparse\n",
" the sparse matrix containing the ratings.\n",
" \"\"\"\n",
" user_ind = [user_mapper[i] for i in ratings[user_key]]\n",
" item_ind = [item_mapper[i] for i in ratings[item_key]]\n",
" Y = sparse_matrix((ratings[\"rating\"], (user_ind, item_ind)), shape=(N, M))\n",
" return Y"
]
},
{
"cell_type": "code",
"execution_count": 37,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"data": {
"text/plain": [
"<714791x105984 sparse matrix of type '<class 'numpy.float64'>'\n",
"\twith 993490 stored elements in Compressed Sparse Row format>"
]
},
"execution_count": 37,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"Y = create_Y(ratings, N, M)\n",
"Y"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Note the shape of `Y`: our rows are the users, and the columns are products."
]
},
{
"cell_type": "code",
"execution_count": 38,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"(714791, 105984)\n",
"993490\n",
"Using sparse matrix data structure, the size of X is: 7.94792mb\n"
]
}
],
"source": [
"# sanity check\n",
"print(Y.shape)  # should be number of users by number of items\n",
"print(Y.nnz) # number of nonzero elements -- should equal number of ratings\n",
"print(f\"Using sparse matrix data structure, the size of X is: {Y.data.nbytes/1e6}mb\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"Let's try `surprise` package on this."
]
},
{
"cell_type": "code",
"execution_count": 39,
"metadata": {},
"outputs": [],
"source": [
"ratings = ratings.drop(columns=[\"timestamp\"])"
]
},
{
"cell_type": "code",
"execution_count": 40,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Evaluating RMSE, MAE of algorithm SVD on 5 split(s).\n",
"\n",
" Fold 1 Fold 2 Fold 3 Fold 4 Fold 5 Mean Std \n",
"RMSE (testset) 1.2882 1.2891 1.2916 1.2912 1.2945 1.2909 0.0022 \n",
"MAE (testset) 1.0183 1.0186 1.0204 1.0200 1.0230 1.0200 0.0017 \n",
"Fit time 4.32 4.61 4.74 4.53 4.50 4.54 0.14 \n",
"Test time 0.81 0.56 0.89 0.84 0.83 0.79 0.11 \n"
]
},
{
"data": {
"text/plain": [
" test_rmse test_mae fit_time test_time\n",
"0 1.288163 1.018274 4.317557 0.812768 \n",
"1 1.289117 1.018571 4.605982 0.561459 \n",
"2 1.291561 1.020410 4.735909 0.886149 \n",
"3 1.291203 1.020018 4.530619 0.840300 \n",
"4 1.294515 1.022971 4.504056 0.830336 "
]
},
"execution_count": 40,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import surprise\n",
"from surprise import SVD, Dataset, Reader\n",
"from surprise.model_selection import cross_validate\n",
"\n",
"reader = Reader()\n",
"data = Dataset.load_from_df(ratings, reader) # Load the data\n",
"k = 10\n",
"algo = SVD(n_factors=k, random_state=42)\n",
"pd.DataFrame(cross_validate(algo, data, measures=[\"RMSE\", \"MAE\"], cv=5, verbose=True))"
]
},
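{
"cell_type": "markdown",
"metadata": {},
"source": [
"Note that `cross_validate` fits the model on cross-validation folds. If we want predictions from a model trained on all available ratings, we can refit explicitly; here is a minimal sketch reusing the `data` and `algo` objects defined above."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Refit the SVD model on a trainset built from all available ratings.\n",
"full_trainset = data.build_full_trainset()\n",
"algo.fit(full_trainset)"
]
},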
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's get some predictions. "
]
},
{
"cell_type": "code",
"execution_count": 41,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Prediction(uid='A2VNYWOPJ13AFP', iid='B00IJB5MCS', r_ui=None, est=4.771523282269237, details={'was_impossible': False})"
]
},
"execution_count": 41,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"user_id = 'A2VNYWOPJ13AFP'\n",
"item_id = 'B00IJB5MCS'\n",
"pred = algo.predict(user_id, item_id)\n",
"pred"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"It's hard to interpret the recommendations based on item ids alone. How can we examine the quality of these recommendations? One quick check is to append the item id to an amazon.com product URL (`https://www.amazon.com/dp/<item_id>`) and look at the product. "
]
},
{
"cell_type": "code",
"execution_count": 42,
"metadata": {},
"outputs": [],
"source": [
"from IPython.display import display\n",
"\n",
"url_amazon = \"https://www.amazon.com/dp/%s\"\n",
"\n",
"\n",
"def disp_url(item_id):\n",
"    # Build the amazon.com product URL for this item id and display it.\n",
"    url = url_amazon % item_id\n",
"    display(url)"
]
},
{
"cell_type": "code",
"execution_count": 43,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'https://www.amazon.com/dp/B00IJB5MCS'"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"grill_spatula = \"B00IJB5MCS\"\n",
"grill_spatula_ind = item_mapper[grill_spatula]\n",
"grill_spatula_vec = Y.T[grill_spatula_ind]  # items are columns of Y, so index the transpose\n",
"disp_url(grill_spatula)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This is the product. \n",
"- [Mr Grill - 18\" Luxury Oak Barbecue Spatula / Turner](https://www.amazon.com/dp/B00IJB5MCS). "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**(Optional) Practice exercise for you**\n",
"\n",
"Use scikit-learn's [NearestNeighbors](http://scikit-learn.org/stable/modules/generated/sklearn.neighbors.NearestNeighbors.html) object to find the 10 most similar items to [Mr Grill - 18\" Luxury Oak Barbecue Spatula / Turner](https://www.amazon.com/dp/B00IJB5MCS), once with Euclidean distance (the default) and once with cosine distance. Which distance metric gives you better recommendations? A starter sketch is provided below.\n",
"\n",
"> Try it out on your own or with your friends. I might not get a chance to post solutions for these questions. "
]
},
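{
"cell_type": "markdown",
"metadata": {},
"source": [
"If you want a starting point for this exercise, here is a minimal sketch (assuming the sparse matrix `Y`, `item_mapper`, and `item_inverse_mapper` defined earlier). It only sets up the neighbour search with the default Euclidean metric; comparing against `metric=\"cosine\"` is left to you."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.neighbors import NearestNeighbors\n",
"\n",
"# Items live in the columns of Y, so work with the transpose (items x users).\n",
"items_mat = Y.T.tocsr()\n",
"query_ind = item_mapper[\"B00IJB5MCS\"]  # Mr Grill spatula\n",
"\n",
"# 11 neighbours = the query item itself + its 10 nearest items.\n",
"nn = NearestNeighbors(n_neighbors=11)  # Euclidean by default; also try metric=\"cosine\"\n",
"nn.fit(items_mat)\n",
"_, inds = nn.kneighbors(items_mat[query_ind])\n",
"similar_items = [item_inverse_mapper[ind] for ind in inds[0] if ind != query_ind]\n",
"print(similar_items[:10])"
]
},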
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## 4. Beyond error rate in recommendation systems \n",
"\n",
"- A system with the best RMSE does not necessarily give the best recommendations. \n",
"- In recommendation systems we do not have ground truth in the sense that there is no notion of \"perfect\" recommendations. \n",
"- Training your model and evaluating it offline is not ideal. \n",
"- Other aspects such as simplicity, interpretability, and code maintainability are as important as (if not more important than) the best validation error. \n",
"- The winning system of the Netflix Challenge was never adopted.\n",
"    - The big ensemble of models was not really maintainable. \n",
"- There are other considerations. "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Diversity**\n",
"\n",
"You are looking at [Education Solar Robot Toy](https://www.amazon.ca/Sillbird-Education-Building-Science-Experiment/dp/B07XRN6TJ8). Are these good recommendations? \n",
"\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now suppose you've recently bought the Education Solar Robot Toy and rated it highly. Are these good recommendations now? \n",
"\n",
"- Not really. Even though you really liked the item, you don't need more similar items anymore. \n",
"- **Diversity** is about how different the recommendations are from each other. \n",
"    - Another example: Even if you really really like Star Wars, you might want non-Star-Wars suggestions. \n",
"- But be careful. We need a balance here. "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Freshness**"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Are these good recommendations? \n",
"\n",
"\n",
"\n",
"- Some of these books don't have many ratings but it might be a good idea to recommend \"fresh\" things. \n",
"- **Freshness**: people tend to get more excited about new/surprising things. "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Trust**\n",
"\n",
"- But again you need a balance here. What would happen if you keep surprising users all the time? \n",
"- There might be **trust** issues. \n",
"- Another aspect of trust is explaining your recommendation, i.e., telling the user why you made a recommendation. This gives the user an opportunity to understand why your recommendations could be interesting to them. \n",
"- [Injecting GPT-4's reasoning into recommendation systems](https://www.linkedin.com/pulse/injecting-gpt-4s-reasoning-recommendation-algorithms-peter-gostev/)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Persistence**: \n",
"\n",
"- How long should recommendations last?\n",
"- If the user does not click on a recommendation for a while, should it remain a recommendation?"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"**Social recommendation**: \n",
"\n",
"- What did your friends watch?\n",
"- Many recommenders are now connected to social networks.\n",
"- \"Log in using your Facebook account\".\n",
"- Often, people like movies similar to the ones their friends like.\n",
"- If we get a new user, recommendations can be based on their friends' preferences. "
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"### Final comments and summary"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"### What did we cover? \n",
"\n",
"- There is a big world of recommendation systems out there. We talked about some basic traditional approaches to recommender systems. \n",
" - collaborative filtering \n",
" - content-based filtering "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If you want to know more advanced approaches to recommender systems, watch this 4-hour summer school tutorial by Xavier Amatriain, Research/Engineering Director @ Netflix. \n",
"\n",
"- [Part1](https://www.youtube.com/watch?v=bLhq63ygoU8)\n",
"- [Part2](https://www.youtube.com/watch?v=mRToFXlNBpQ)\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"### Reminder\n",
"\n",
"- Recommendation systems can have terrible consequences, especially in the context of politics and extremism.\n",
"- They can cause the phenomenon called \"filter bubbles\".\n",
"- Ask hard and uncomfortable questions to yourself (and to your employer if possible) before implementing and deploying a recommendation system. "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## ❓❓ Questions for you (time-permitting)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"- Discuss memory-related problems that may arise when dealing with a large number of users and items. \n",
"- We have been ignoring the timestamp column in ratings datasets. How might you use this information when making recommendations? "
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## Course roadmap\n",
"\n",
"- Week 1 ✅\n",
" - Clustering \n",
"- Week 2 ✅\n",
" - Dimensionality reduction\n",
"- Week 3 ✅\n",
" - Word embeddings, t-SNE\n",
"- Week 4 ✅\n",
" - Recommendation systems "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Time for course evaluations\n",
"\n",
"That's all for this course! It was fun teaching you this material. Thanks for your support, feedback, and great questions ❤️!\n",
"\n",
"I would love to hear your thoughts on this course. When you get a chance, it would be great if you could fill out the evaluation survey for this course on [Canvas](https://canvas.ubc.ca/courses/106525/external_tools/4732). \n",
"\n",
"The evaluation closing date is: **March 24, 2023**"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Resources\n",
"\n",
"- If you want to know more advanced approaches to recommender systems, watch this 4-hour summer school tutorial by Xavier Amatriain, Research/Engineering Director @ Netflix. ([Part1](https://www.youtube.com/watch?v=bLhq63ygoU8), [Part2](https://www.youtube.com/watch?v=mRToFXlNBpQ))\n",
"\n",
"- [10 lessons of the Quora recommendation system](https://sudonull.com/post/65548-10-lessons-of-the-Quora-recommendation-system-Retail-Rocket-Blog)"
]
}
],
"metadata": {
"celltoolbar": "Slideshow",
"kernelspec": {
"display_name": "Python [conda env:563]",
"language": "python",
"name": "conda-env-563-py"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.8"
}
},
"nbformat": 4,
"nbformat_minor": 4
}