Model Comparison

Authors: Yuanzhe Marco Ma, Arash Shamseddini, Kaicheng Tan, Zhenrui Yu

I. General Overview

To predict IMDB scores (on a scale from 0 to 10), we chose the best model among three scikit-learn (Pedregosa et al. 2011) candidates: Ridge from sklearn.linear_model, SVR with an RBF kernel from sklearn.svm, and RandomForestRegressor from sklearn.ensemble. We score every model with $R^2$.
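As a rough illustration of this comparison, the sketch below cross-validates the three candidates with $R^2$ scoring. It is not the project's actual pipeline: the data is a synthetic stand-in generated with make_regression, and the fold count and default hyperparameters are assumptions chosen only to keep the example small.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_validate
from sklearn.svm import SVR

# Synthetic stand-in for the preprocessed features and IMDB scores (assumption).
X, y = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=0)

# The three candidate estimators named in the overview, with untuned defaults.
candidates = {
    "Ridge": Ridge(),
    "SVR (RBF)": SVR(kernel="rbf"),
    "RandomForest": RandomForestRegressor(n_estimators=50, random_state=0),
}

results = {}
for name, model in candidates.items():
    # 5-fold cross-validation scored with R^2; training scores are kept
    # so the train/validation gap can be inspected later.
    cv = cross_validate(model, X, y, cv=5, scoring="r2", return_train_score=True)
    results[name] = {
        "train_r2": cv["train_score"].mean(),
        "valid_r2": cv["test_score"].mean(),
    }

for name, scores in results.items():
    print(f"{name}: train R^2 = {scores['train_r2']:.3f}, "
          f"validation R^2 = {scores['valid_r2']:.3f}")
```

Keeping both the training and validation scores makes it possible to judge overfitting as well as raw performance, which is how the models are compared in the sections that follow.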

II. Data preprocessing

III. Ridge

We tuned the alpha hyperparameter of Ridge using sklearn's GridSearchCV. Alpha sets the strength of the L2 regularization and therefore the complexity of the model: larger values shrink the coefficients more and give a simpler model. We want the alpha that predicts well while avoiding overfitting. Based on the tuning, alpha=800 gave the best result. The result is shown above.
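A minimal sketch of such an alpha search follows. The synthetic data and the candidate grid are assumptions for illustration; the grid the project actually searched is not given in this report.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the preprocessed training data (assumption).
X, y = make_regression(n_samples=200, n_features=20, noise=20.0, random_state=0)

# Illustrative grid; the real search range is not stated in the report.
param_grid = {"alpha": [0.1, 1, 10, 100, 800, 1000]}

ridge_search = GridSearchCV(
    Ridge(),
    param_grid,
    cv=5,
    scoring="r2",
    return_train_score=True,  # keep training scores to check for overfitting
)
ridge_search.fit(X, y)

print("best alpha:", ridge_search.best_params_["alpha"])
print("best mean CV R^2:", round(ridge_search.best_score_, 4))
```

On this toy data the selected alpha will generally differ from the value of 800 reported for the project data.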

IV. SVR

We tuned the gamma hyperparameter of SVR using sklearn's GridSearchCV. Gamma sets how far the influence of a single training example reaches under the RBF kernel and therefore the complexity of the model: larger values give a more flexible model that is more prone to overfitting. We want the gamma that predicts well while avoiding overfitting.

Below is our tuned, best-performing model, selected by cross-validation score.

| Model Name | Hyperparameter (Gamma) | Mean Fit Time (s) | Mean Scoring Time (s) | Mean CV Score |
|---|---|---|---|---|
| SVR | 0.0007 | 43.80 | 11.40 | 0.4700 |

For a more detailed GridSearchCV result, see this file.
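The columns of the table above correspond to fields that GridSearchCV records in cv_results_. The sketch below shows one way to read them out for the best gamma; the synthetic data and the gamma grid are assumptions, so the numbers it prints will not match the table.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR

# Synthetic stand-in for the preprocessed training data (assumption).
X, y = make_regression(n_samples=150, n_features=10, noise=10.0, random_state=0)

# Illustrative gamma grid; the project's actual grid is not given here.
param_grid = {"gamma": [1e-4, 7e-4, 1e-3, 1e-2]}

svr_search = GridSearchCV(SVR(kernel="rbf"), param_grid, cv=5, scoring="r2")
svr_search.fit(X, y)

# Pull the row for the best candidate out of cv_results_.
best = svr_search.best_index_
summary = {
    "gamma": svr_search.best_params_["gamma"],
    "mean_fit_time": svr_search.cv_results_["mean_fit_time"][best],
    "mean_score_time": svr_search.cv_results_["mean_score_time"][best],
    "mean_cv_score": svr_search.cv_results_["mean_test_score"][best],
}
print(summary)
```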

V. Random Forest

We tuned two major hyperparameters of RandomForestRegressor, max_depth and n_estimators, simultaneously. The results of the hyperparameter optimization indicate that increasing both max_depth and n_estimators keeps improving the training score, but the validation score barely changes (on the order of $10^{-4}$). According to the above results, the model overfits our data, as the gap between the training and validation scores is large.
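A joint search over both hyperparameters, together with the train/validation gap used to judge overfitting, could be sketched as follows. The synthetic data and the small grid are assumptions chosen to keep the example fast; they are not the values searched in the project.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the preprocessed training data (assumption).
X, y = make_regression(n_samples=200, n_features=10, noise=30.0, random_state=0)

# Both hyperparameters are searched jointly: 3 x 2 = 6 combinations.
param_grid = {"max_depth": [2, 5, 10], "n_estimators": [10, 50]}

rf_search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid,
    cv=3,
    scoring="r2",
    return_train_score=True,  # needed to compare train and validation scores
)
rf_search.fit(X, y)

res = rf_search.cv_results_
# A large positive gap between mean train and mean validation R^2
# is the overfitting signal discussed above.
gaps = res["mean_train_score"] - res["mean_test_score"]
for params, train, valid, gap in zip(
    res["params"], res["mean_train_score"], res["mean_test_score"], gaps
):
    print(f"{params}: train = {train:.3f}, validation = {valid:.3f}, gap = {gap:.3f}")
```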

VI. Conclusion

Random Forest is not a suitable model for our project because of its low validation score and severe overfitting. Between Ridge and SVR, Ridge has the higher validation score. Both Ridge and SVR appear to overfit, but to a similar degree.
As a result, we choose Ridge as the best model among the three.

References

Pedregosa et al. 2011. Scikit-learn: Machine Learning in Python. JMLR 12, pp. 2825-2830. https://jmlr.csail.mit.edu/papers/v12/pedregosa11a.html