In this project, we look at the relationship between movie reviews and their IMDB scores (ranging from 0 ~ 10). Positive reviews are often related to high IMDB scores, while negative reviews indicate the opposite. While it is easy for humans to understand a piece of review and guess the scores, we wonder if machines could understand it as well. Furthermore, we would like to automate this process, so that given a bunch of movie reviews, we are able to predict their corresponding IMDB scores easily.

In this project, most of the data wrangling in this project is powered by the pandas (McKinney 2010) library. The machine learning models and frameworks utilize sklearn (Pedregosa et al. 2011). We use Ridge Regressor as our prediction model.

II. Where Do We Get the Data? ¶

We obtained our data from an open-sourced github repository:

https://github.com/nproellochs/SentimentDictionaries/blob/master/Dataset_IMDB.csv

The repository was originally used for sentiment analysis related to movie reviews. Here we are using the Dataset_IMDB.csv as our main data source.

Automation¶

To automate data retrieval, we have written a script to obtain the dataset with Python. The script can be accessed here.

III. What does the data look like? ¶

Let's look into the dataset by performing some Exploratory Data Analysis (EDA).

1. Data Columns:¶

Column Name	Column Type	Description
Id	Numeric	Unique ID assigned to each observation.
Text	Free Text	Body of the review content.
Author	Categorical	Author's name of the review
Rating	Numeric	Ratings given along with the review (normalized)

For this project, we look primarily into the Text and Rating columns.

We realized that Author may have significant relation to ratings, but since we are making a generalized model for reviews from any audience, we have decided to discard it for this analysis. Therefore, we will drop both the Author and Id columns.

2. The `Text` feature¶

The Text feature contains all the movie reviews. This will be our primarily input feature.
Below are the top 10 most frequent words in the reviews:

In [1]:

import pandas as pd

freq = pd.read_csv('../results/top_20_frequent_words.csv')
freq[:10]

Out[1]:

	Word	Count
0	film	17220
1	movie	9868
2	like	7170
3	story	6923
4	director	5525
5	time	5373
6	just	4689
7	life	4302
8	good	3464
9	man	3392

Note that when we use sklearn's CountVectorizer to generate the word counts, we discarded the most frequent literals such as "the", "an", and so on by using the stop_words='English' argument. This allows us to ignore frequently used but meaningless words in English, as they have little implication to our ML model and could cause overfitting.

3. The `Rating` Class¶

Ratings will be our target class. Let's look at a distribution of Rating.

ratings_histo

The ratings seem roughly normally distributed, with a little skewness to the left. Most of the ratings cluster around 5 ~ 8.

4. Correlation between `Text` length and `Rating`¶

We suspect that people more passionate about certain movies tend to write longer reviews to express feelings. This could also be true for very negative reviews.

A bar plot of Text length vs Rating is presented below.

textlength_vs_rating

There doesn't seem to be a strong correlation between reviews length and rating. However, it is notable that for the most positive ratings (from 7 ~ 10), the reviews tend to be higher.

IV. Pre-processing ¶

New Columns¶

In addition to Text feature alone, we extracted two potentially useful columns that could enhance our machine learning model.

n_words: As mentioned above, we suspect some correlation between review lengths and ratings. Therefore we created an n_words feature, which counts the number of words in each review.
sentiment: We utilized the NLTK (Bird 2009) package to assist us in extracting the sentiment of each review. This sentiment feature will have four ordinal categories - ['neg', 'compound', 'neu', 'pos'].

Column Transformers¶

Now we have three columns to transform before fitting - Review, n_words and sentiment.

Review: since it is text feature, we will use sklearn's CountVectorizer to transform the text into bag of words. As mentioned in the EDA, some of the most frequent words are often meaningless. We added stop_words='English' to ignore these common English words.
n_words: we standardize the data with sklearn's StandardScaler to avoid its scaling effect on the estimator.
sentiment: since it is ordinal data, we utilize sklearn's OrdinalEncoder to encode the data with one-hot method.

V. Model Selection & Fitting ¶

Now we have our training data ready, we will consider which model to use to fit.
We selected 3 candidate models that are suitable for regression prediction:

Support Vector Regressor (SVR)
Random Forest Regressor
Ridge Regressor

We tuned the hyperparameters of each model using sklearn's GridSearchCV.
Below are comparative cross-validation scores of the 3 models.

model	fit_time	score_time	test_score	train_score
Ridge	2.198422	0.511951	0.533365	0.841871
SVR	28.988741	8.086322	0.453478	0.714765
RandomForestRegressor	287.661862	0.499031	0.359660	0.906179

Winner: `Ridge`¶

We selected Ridge with a combindation of two considerations - cross-validation scores and the risk of overfitting. Ridge was selected because it has the highest CV score. Although all three models show risk of overfitting, the degree to which Ridge and SVR overfit are around the same, while RandomForest is the worst. Therefore, we consider Ridge as best estimator. For more details on the model selection process, see the model comparison report.

Below is a detailed specification of our final model.

Model Name	Hyperparameter - alpha	Mean Fit Time	Mean Scoring Time	Mean CV Score
Ridge	500	0.52s	0.53s	0.533365

VI. Prediction Results ¶

Below is our model's prediction scores on the test dataset.

	r2	rmse
model	0.543102	1.233982

1. $R^2$, RMSE Scores¶

$R^2$ and RMSE (Root-mean-squared-error) are two common metrics for evaluating a regression model's accuracy.

The obtained $R^2$ score for the test set is 0.543102, which was comparable to, and even better than our validation score.
The RMSE score was 1.234. This could be large because it means our predicted score can have a margin of error of over 1.2 points when our overall scale is only 0-10.

2. Prediction vs. True Ratings¶

We have also created a scatterplot to compare our predicted ratings vs. the true ratings.

true_vs_predict

There is an obvious difference between the predicted ratings and true ratings. In the true ratings, people tend to give whole number rating, i.e. 3 instead of 3.247. Our model did not capture that.

Despite that, most of the points are somewhat clustered around the identity line. This indicates that our model didn't seem to under-fit or over-fit.

VII. Reflection and Improvements ¶

Now that we are done with our prediction and analysis, we can examine the quality of our work. In fact, there are a few areas of improvement.

We discarded the Author and id columns at the beginning. In fact, these columns could be influential features, especially Author. However, an inherent issue with our dataset is that all reviews come from four critics (authors). Including it could add to the difficulties of generalizing our model to the broader audience. It would be ideal if we can obtain reviews from the general audience.

There is risk of overfitting in our models. Based on the model selection results, all three candidate models have large gaps between training score and test scores. We have tuned the hyperparameters the best we can, yet the gap is still apparent. This might have to do with the nature of our dataset and model compatibility, and it requires further analysis.

As mentioned above, our model did not capture the pattern where humans tend to give whole number scores. In the future, we can probably give more emphasis to predicting whole number scores, so that our model resembles more human behaviors.

The sentiment feature generated with the NLTK package contained only neu and compound in our dataset. This is confusing and we have yet to understand this behavior. We included the feature regardless because it may still provide useful information. However, this is definitely a place we need to investigate further into.

References ¶

Dua, Dheeru, and Casey Graff. 2017. “UCI Machine Learning Repository.” University of California, Irvine, School of Information; Computer Sciences. (http://archive.ics.uci.edu/ml)
Bird, Steven, Edward Loper and Ewan Klein (2009). Natural Language Processing with Python. O'Reilly Media Inc. (https://www.nltk.org/)
Scikit-learn: Machine Learning in Python, Pedregosa et al., JMLR 12, pp. 2825-2830, 2011. (https://jmlr.csail.mit.edu/papers/v12/pedregosa11a.html)
McKinney, W., & others. (2010). Data structures for statistical computing in python. In Proceedings of the 9th Python in Science Conference (Vol. 445, pp. 51–56). (https://pandas.pydata.org/)
Gareth James, Daniela Witten, Trevor Hastie and Rob Tibshirani, 2009, “An Introduction to Statistical Learning with Application in R”, Springer Publishing

Predicting IMDB Ratings Based on Movie Reviews¶

Table of Contents¶

I. Project Overview ¶

II. Where Do We Get the Data? ¶

Automation¶

III. What does the data look like? ¶

1. Data Columns:¶

2. The `Text` feature¶

3. The `Rating` Class¶

4. Correlation between `Text` length and `Rating`¶

IV. Pre-processing ¶

New Columns¶

Column Transformers¶

V. Model Selection & Fitting ¶

Winner: `Ridge`¶

VI. Prediction Results ¶

1. $R^2$, RMSE Scores¶

2. Prediction vs. True Ratings¶

VII. Reflection and Improvements ¶

References ¶

Predicting IMDB Ratings Based on Movie Reviews¶

Table of Contents¶

I. Project Overview ¶

II. Where Do We Get the Data? ¶

Automation¶

III. What does the data look like? ¶

1. Data Columns:¶

2. The Text feature¶

3. The Rating Class¶

4. Correlation between Text length and Rating¶

IV. Pre-processing ¶

New Columns¶

Column Transformers¶

V. Model Selection & Fitting ¶

Winner: Ridge¶

VI. Prediction Results ¶

1. $R^2$, RMSE Scores¶

2. Prediction vs. True Ratings¶

VII. Reflection and Improvements ¶

References ¶

2. The `Text` feature¶

3. The `Rating` Class¶

4. Correlation between `Text` length and `Rating`¶

Winner: `Ridge`¶