# Import libraries
import re
import sys
from hashlib import sha1
from pandas_profiling import ProfileReport
import altair as alt
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
# train test split and cross validation
from sklearn.model_selection import (
train_test_split,
)
from IPython.display import IFrame
Here, we import the training data set produced by the processing script for analysis.
train_df = pd.read_csv('../data/processed/train.csv')
Column Name | Column Type | Description | Target? |
---|---|---|---|
Id | Numeric | Unique ID assigned to each observation. | No |
Text | Free Text | Body of the review content. | No |
Author | Categorical | Author's name of the review | No |
Rating | Numeric | Ratings given along with the review | Yes |
The data contains 4 columns, as documented in the table above. `Rating` is our target column, and we want to use the rest of the information to predict it.

The pandas profiling report gives us a detailed overview of what our data looks like. In particular, we can use it to view the general distribution of each column, as well as the `Interaction` and `Correlation` between all of these features.

As expected, we see both weak correlation and weak interaction between these features, because most of the information is embedded in the `Text` column, from which it still needs to be extracted.
IFrame(src='../results/profiling_report.html', width=900, height=700)
train_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4004 entries, 0 to 4003
Data columns (total 6 columns):
 #   Column     Non-Null Count  Dtype
---  ------     --------------  -----
 0   Id         4004 non-null   int64
 1   Text       4004 non-null   object
 2   Author     4004 non-null   object
 3   Rating     4004 non-null   float64
 4   n_words    4004 non-null   int64
 5   sentiment  4004 non-null   object
dtypes: float64(1), int64(2), object(3)
memory usage: 187.8+ KB
drop_features = ['Id', 'Author']
Both `Id` and `Author` are features that we don't want to include in our training model:

- `Id` is just a random identifier for each row. This is consistent with our observation from the profiling report that this feature has almost no correlation with the target.
- `Author` is excluded because we don't want it to affect our prediction. Our goal is to build a model that can predict ratings by looking at the review text alone, so adding the author would make the model unnecessarily biased.

Ratings seem to follow a bell-shaped, slightly left-skewed distribution.
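As a minimal sketch of this step (the three-row frame below is hypothetical, mirroring only the training data's column names), dropping these features might look like:

```python
import pandas as pd

# Hypothetical miniature frame with the same columns as the training data
train_df = pd.DataFrame({
    "Id": [1, 2, 3],
    "Text": ["great film", "dull plot", "fine acting"],
    "Author": ["alice", "bob", "carol"],
    "Rating": [0.9, 0.3, 0.6],
})

drop_features = ["Id", "Author"]
X = train_df.drop(columns=drop_features)  # only Text and the Rating target remain
```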
Most of the values cluster between 0.4 and 0.9, which suggests some imbalance in the target distribution.
The relationship between the rating and the length of the associated review text is another interesting plot: the mean review length increases as the rating increases.
This might indicate that when people rate a movie highly, they tend to write longer reviews, so text length can potentially be a useful feature.
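A quick sketch of how a word-count feature like `n_words` could be derived from the review text; the whitespace tokenization and the two sample reviews here are assumptions for illustration, not necessarily what the processing script does:

```python
import pandas as pd

reviews = pd.Series([
    "a short one",
    "a somewhat longer review of the film",
])

# Whitespace tokenization: number of words per review
n_words = reviews.str.split().str.len()
print(n_words.tolist())  # [3, 7]
```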
For `Text`, we want to use the bag-of-words technique to extract useful features from this raw column. Therefore, we used `CountVectorizer(stop_words='english')` to break each raw text into multiple word features.
The following table shows the 20 most frequent words extracted by `CountVectorizer(stop_words='english')`.
pd.read_csv('../results/top_20_frequent_words.csv', index_col=0)
Word | Count
---|---
film | 17220
movie | 9868
like | 7170
story | 6923
director | 5525
time | 5373
just | 4689
life | 4302
good | 3464
man | 3392
little | 3386
way | 3340
make | 3218
films | 3185
love | 3143
characters | 3108
best | 3031
new | 3002
doesn | 2875
does | 2874
From the table we can observe that many of the most frequent words are ones we would expect to see in a movie review. This suggests that the bag-of-words technique will provide useful features for our prediction task.