Exploratory data analysis


Data import


Here, we import the training data set produced by the processing script for analysis.
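The import step could look roughly like the sketch below. The file path and the `load_train` helper are hypothetical, since the notebook's actual location and loading code are not shown here. The expected column names are taken from the data format table in this report.

```python
import pandas as pd

# Hypothetical path: the real output location of the processing script is not given.
TRAIN_PATH = "data/processed/train.csv"

# Column names as documented in the data format table of this report.
EXPECTED_COLUMNS = {"Id", "Text", "Author", "Rating"}


def load_train(path_or_buffer=TRAIN_PATH):
    """Read the training split and check that the documented columns are present."""
    df = pd.read_csv(path_or_buffer)
    missing = EXPECTED_COLUMNS - set(df.columns)
    if missing:
        raise ValueError(f"Missing expected columns: {sorted(missing)}")
    return df
```

Checking the columns on load is a design choice of this sketch: it makes a mismatch between the processing script and the analysis fail early.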

Data overview


Data Format

| Column Name | Column Type | Description | Target? |
| --- | --- | --- | --- |
| Id | Numeric | Unique ID assigned to each observation | No |
| Text | Free Text | Body of the review content | No |
| Author | Categorical | Name of the review's author | No |
| Rating | Numeric | Rating given along with the review | Yes |

The data contains four columns, as documented in the table above. Rating is our target column, and we want to use the rest of the information to predict it.

Profiling report

The pandas profiling report gives us a detailed overview of what our data looks like. In particular, we can use it to view the distribution of each column as well as the correlations between columns.

We are particularly interested in the interactions and correlations between the features. As expected, both the correlations and the interactions are weak, because most of the information is embedded in the Text column and still needs to be extracted.
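As a rough stand-in for the overview and correlation parts of the profiling report, the same quantities can be approximated with plain pandas. This is only a sketch, not the profiling library's own output. The `quick_profile` helper is a hypothetical name introduced for illustration.

```python
import pandas as pd


def quick_profile(df):
    """Return a per-column summary and the correlation matrix of the numeric columns."""
    summary = pd.DataFrame(
        {
            "dtype": df.dtypes.astype(str),
            "n_missing": df.isna().sum(),
            "n_unique": df.nunique(),
        }
    )
    # Only numeric columns (here Id and Rating) can enter a Pearson correlation.
    corr = df.select_dtypes(include="number").corr()
    return summary, corr
```

With only Id and Rating being numeric, the correlation matrix is necessarily small, which fits the observation that most of the signal sits in the Text column.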

Identify drop features


Both Id and Author are features that we do not want to include in our training model.
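Dropping these two columns can be sketched as follows, assuming the column names match the data format table. The `drop_unused` helper is hypothetical.

```python
import pandas as pd

# Columns identified above as not useful for training (names assumed from the table).
DROP_COLUMNS = ["Id", "Author"]


def drop_unused(df, columns=DROP_COLUMNS):
    """Return a copy of the data without the columns we do not train on."""
    return df.drop(columns=columns)
```

`DataFrame.drop` returns a new frame by default, so the original data stays available for the rest of the exploration.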

Visualizing features


The distribution of ratings (target)

Figure: The distribution of ratings

Ratings seem to follow a bell-shaped distribution that is slightly left-skewed.

Most of the values seem to cluster between 0.4 and 0.9, which suggests some imbalance in the target distribution.
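The two observations above can also be checked numerically. The sketch below assumes ratings lie on a 0 to 1 scale, which the report suggests but does not state outright. The `rating_summary` helper and its bin count are illustrative choices.

```python
import numpy as np
import pandas as pd


def rating_summary(ratings, bins=10):
    """Summarise the target: histogram counts, skewness, and share in [0.4, 0.9]."""
    ratings = pd.Series(ratings, dtype=float)
    # Assumption: ratings lie in [0, 1]; adjust `range` if the scale differs.
    counts, edges = np.histogram(ratings, bins=bins, range=(0.0, 1.0))
    return {
        "counts": counts,
        "edges": edges,
        # Negative skewness corresponds to a left-skewed distribution.
        "skew": ratings.skew(),
        # Fraction of ratings inside the band discussed above (inclusive bounds).
        "share_0.4_to_0.9": ratings.between(0.4, 0.9).mean(),
    }
```

A negative `skew` value and a large share inside the band would support the reading of the plot.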

The relationship between text length and ratings

Figure: The relationship between text length and ratings

The relationship between rating and review text length is another interesting plot. The mean review text length tends to increase as the rating increases.

This might indicate that when people rate a movie highly, they tend to write longer reviews. Text length could therefore be a useful feature.
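The trend described above could be computed along these lines. The sketch assumes text length is measured in characters and that ratings are grouped into equal-width bins. Neither detail is stated in the report, and the `mean_length_by_rating` helper is hypothetical.

```python
import pandas as pd


def mean_length_by_rating(df, bins=5):
    """Mean review length (in characters) within equal-width rating bins."""
    out = df.assign(text_length=df["Text"].str.len())
    # Assumption: ratings are continuous, so we bin them before averaging.
    out["rating_bin"] = pd.cut(out["Rating"], bins=bins)
    return out.groupby("rating_bin", observed=True)["text_length"].mean()
```

If the means rise from the lowest bin to the highest, that matches the pattern in the plot. The same `text_length` column could then be kept as a candidate feature.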

The top 20 most frequent words in the text

For Text, we want to use the bag-of-words technique to extract useful features from this raw column. We therefore used CountVectorizer(stop_words='english') to break each raw text into word-count features. The following table shows the 20 most frequent words extracted by CountVectorizer(stop_words='english').

From the table, we can observe that many of the words are ones we would expect to see in a movie review. This suggests that the bag-of-words technique is likely to provide useful features for our prediction.