# Import libraries
import re
import sys
from hashlib import sha1
from pandas_profiling import ProfileReport
import altair as alt
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
# train test split and cross validation
from sklearn.model_selection import (
train_test_split,
)
from IPython.display import IFrame
Here, we import the training data set produced by the processing script for analysis.
train_df = pd.read_csv('../data/processed/train.csv')
Column Name | Column Type | Description | Target? |
---|---|---|---|
Id | Numeric | Unique ID assigned to each observation. | No |
Text | Free Text | Body of the review content. | No |
Author | Categorical | Author's name of the review | No |
Rating | Numeric | Ratings given along with the review | Yes |
The data contains 4 columns, as documented in the table above. `Rating` is our target column, and we want to use the rest of the information to predict it.

The pandas profiling report gives us a detailed overview of what our data looks like. In particular, we can use it to view the general distribution of each column, as well as the `Interaction` and `Correlation` between all of these features.

As expected, we see both weak correlation and weak interaction between these features, because most of the information is embedded in the `Text` column, from which it still needs to be extracted.
IFrame(src='../results/profiling_report.html', width=900, height=700)
train_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4004 entries, 0 to 4003
Data columns (total 6 columns):
 #   Column     Non-Null Count  Dtype
---  ------     --------------  -----
 0   Id         4004 non-null   int64
 1   Text       4004 non-null   object
 2   Author     4004 non-null   object
 3   Rating     4004 non-null   float64
 4   n_words    4004 non-null   int64
 5   sentiment  4004 non-null   object
dtypes: float64(1), int64(2), object(3)
memory usage: 187.8+ KB
drop_features = ['Id', 'Author']
Both `Id` and `Author` are features that we don't want to include in our training model:

- `Id` is just a random identifier for each row. This is consistent with our observation from the profiling report that this feature has almost no correlation with the target.
- `Author` is excluded because we don't want it to affect our prediction. Our goal is to build a model that can predict ratings by looking at the review text alone, so adding the author would make the model unnecessarily biased.

Ratings seem to follow a bell-shaped, slightly left-skewed distribution.
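As a minimal sketch of this step (the three-row frame below is hypothetical, mirroring only the training data's column names), dropping these features might look like:

```python
import pandas as pd

# Hypothetical miniature frame with the same columns as the training data
train_df = pd.DataFrame({
    "Id": [1, 2, 3],
    "Text": ["great film", "dull plot", "fine acting"],
    "Author": ["alice", "bob", "carol"],
    "Rating": [0.9, 0.3, 0.6],
})

drop_features = ["Id", "Author"]
X = train_df.drop(columns=drop_features)  # only Text and the Rating target remain
```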
Most of the values cluster between 0.4 and 0.9, which suggests some imbalance in the target distribution.
The relationship between the rating and the length of the associated review text is another interesting plot: the mean review length increases as the rating increases.
This might indicate that when people rate a movie highly, they tend to write longer reviews, so text length can potentially be a useful feature.
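A quick sketch of how a word-count feature like `n_words` could be derived from the review text; the whitespace tokenization and the two sample reviews here are assumptions for illustration, not necessarily what the processing script does:

```python
import pandas as pd

reviews = pd.Series([
    "a short one",
    "a somewhat longer review of the film",
])

# Whitespace tokenization: number of words per review
n_words = reviews.str.split().str.len()
print(n_words.tolist())  # [3, 7]
```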
For `Text`, we want to use the bag-of-words technique to extract useful features from this raw column. Therefore, we used `CountVectorizer(stop_words='english')` to break each raw text into multiple word features.
The following table shows the 20 most frequent words extracted by `CountVectorizer(stop_words='english')`.
pd.read_csv('../results/top_20_frequent_words.csv', index_col=0)
Word | Count
---|---
film | 17220
movie | 9868
like | 7170
story | 6923
director | 5525
time | 5373
just | 4689
life | 4302
good | 3464
man | 3392
little | 3386
way | 3340
make | 3218
films | 3185
love | 3143
characters | 3108
best | 3031
new | 3002
doesn | 2875
does | 2874
From the table we can observe that many of the most frequent words are ones we would expect to see in a movie review. This suggests that the bag-of-words technique will provide useful features for our prediction task.