# Red Wine Quality Prediction 

by Nicole Bidwell, Ruocong Sun, Alysen Townsley, Hongyang Zhang

In [67]:
import pandas as pd 
from myst_nb import glue
import pickle

In [109]:
comparison_df = pd.read_csv("../results/tables/comparison_df.csv", index_col=0).round(3)
wine_quality_df = pd.read_csv('../data/winequality-red.csv', sep = ';')
dummy_df = pd.read_csv('../results/tables/cv_results.csv').round(3)

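# Round after converting to percent so floating-point noise (e.g. 61.199999...) is not displayed
glue('test_set_score', (pd.read_csv("../results/tables/test_set_score.csv", index_col=0).round(3).loc[0, 'test_set_score'] * 100).round(3),
     display=False)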
glue("wine_quality_df", wine_quality_df.head(), display=False)
glue('wine_quality_df_nrows', wine_quality_df.shape[0], display=False)
glue('wine_quality_df_nfeatures', wine_quality_df.shape[1] - 1, display=False)
glue('min_wine_quality', wine_quality_df['quality'].min(), display=False)
glue('max_wine_quality', wine_quality_df['quality'].max(), display=False)
glue('dummy_df', dummy_df, display=False)
glue('dummy_valid_score', (dummy_df['test_score'].mean() * 100).round(3), display=False)
glue('comparison_df', comparison_df, display=False)
glue('logistic_gs_score', (comparison_df.loc['logistic', 'mean_test_score'] * 100).round(3), display=False)
glue('decision_tree_gs_score', (comparison_df.loc['decision_tree', 'mean_test_score'] * 100).round(3), display=False)
glue('knn_gs_score', (comparison_df.loc['knn', 'mean_test_score'] * 100).round(3), display=False)
glue('svc_gs_score', (comparison_df.loc['svc', 'mean_test_score'] * 100).round(3), display=False)

## Summary 

In this project, we use machine learning algorithms to predict red wine quality (on a scale of 0 to 10) from the physicochemical properties of the wine. We hold out a test set and use cross-validation on the training set to estimate how each model performs on unseen data. We tune the hyperparameters of several classification models (logistic regression, decision tree, kNN, and RBF SVM), select the one with the highest validation accuracy, and evaluate it on the test set. The final test set accuracy is around {glue:text}`test_set_score` percent. Whether this is decent or poor depends on the standard applied. More importantly, the model misclassifies many of the wines with extreme quality scores (below 5 or above 6), suggesting that it is not robust to these outliers. We close with a discussion of potential causes of this performance and proposed solutions for future analysis.

## Introduction 

Red wines have a long history that can be traced back to the ancient Greeks. Today, they are more accessible to the average person than ever, and the industry is estimated to be worth around 109.5 billion USD {cite}`market_trends`. Despite wine's ubiquity, most people struggle to tell a good wine from a bad one, and quality assessment is usually left to trained professionals (sommeliers). In this project, we seek to use machine learning algorithms to predict the quality of a wine from its physicochemical properties. Such a model, if effective, could give manufacturers and suppliers a more robust understanding of wine quality based on measurable properties.

## Methods & Results

### EDA
#### Dataset Description
The dataset is the "winequality-red.csv" file from the UC Irvine Machine Learning Repository {cite}`misc_wine_quality_186`, originally published in Decision Support Systems, Elsevier {cite}`cortez2009modeling`. The dataset contains physicochemical properties (features) of red vinho verde wine samples from the north of Portugal, along with an associated wine quality score from 0 (worst) to 10 (best).

```{glue:figure} wine_quality_df
:width: 400px
:height: 400px
:name: "wine_quality_df"
:align: left

First five rows of the red wine dataframe.
```

There are {glue:text}`wine_quality_df_nfeatures` feature columns representing physicochemical characteristics of the wines, such as fixed acidity, residual sugar, chlorides, and density. There are {glue:text}`wine_quality_df_nrows` rows (observations) in the dataset, with no missing values. The target is the quality column, a set of ordinal values from {glue:text}`min_wine_quality` to {glue:text}`max_wine_quality`. Scores could go as low as 0 or as high as 10, but this data set does not contain observations across the entire range. Most observations have an "average" quality of 5 or 6, with fewer below a score of 5 or above a score of 6.
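The counts described above can be checked with a few lines of pandas. This is a minimal sketch on a small made-up frame (`toy_df` and its values are hypothetical, not taken from the real file); with the actual data, the frame would come from the `pd.read_csv` call in the setup cell.

```python
import pandas as pd

# Hypothetical stand-in for the wine data: two feature columns plus the target.
toy_df = pd.DataFrame({
    "alcohol": [9.4, 9.8, 9.8, 10.0, 10.5, 11.2, 12.8, 9.0],
    "pH": [3.51, 3.20, 3.26, 3.16, 3.30, 3.39, 3.35, 3.60],
    "quality": [5, 5, 5, 6, 6, 6, 7, 4],
})

n_rows, n_cols = toy_df.shape
n_features = n_cols - 1                        # every column except the target
n_missing = int(toy_df.isna().sum().sum())     # total missing cells

# Number of observations per quality score, ordered by score.
quality_counts = toy_df["quality"].value_counts().sort_index()
quality_share = (quality_counts / n_rows).round(3)

print(f"{n_rows} rows, {n_features} features, {n_missing} missing values")
print(quality_share)
```

Looking at `quality_share` on the real data makes the class imbalance explicit before any model is trained.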

#### Columns
- `fixed acidity`: grams of tartaric acid per cubic decimeter.
- `volatile acidity`: grams of acetic acid per cubic decimeter.
- `citric acid`: grams of citric acid per cubic decimeter.
- `residual sugar`: grams of residual sugar per cubic decimeter.
- `chlorides`: grams of sodium chloride per cubic decimeter.
- `free sulfur dioxide`: milligrams of unreacted sulfur dioxide per cubic decimeter.
- `total sulfur dioxide`: milligrams of total sulfur dioxide per cubic decimeter.
- `density`: density of the wine in grams per cubic centimeter.
- `pH`: pH value of the wine.
- `sulphates`: grams of potassium sulphate per cubic decimeter.
- `alcohol`: percentage volume of alcohol content.
- `quality`: integer from 0 (representing low quality) to 10 (representing high quality).

#### Visualization

We first examine the distribution of each feature using summary statistics and histograms. The majority of features have skewed distributions, and many contain outliers. Volatile acidity, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, and sulphates all have very extreme outliers.

```{figure} ../results/figures/repeating_hists_plot.png
---
width: 1000px
name: repeating_hists_plot
---
Histograms showing the distribution of each feature in the red wine dataframe.
```
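Skew and outliers can also be quantified rather than only read off the histograms. The sketch below uses synthetic data (`toy_features` is made up) and the common 1.5 × IQR rule as an assumed outlier definition; the report itself does not state which rule, if any, was applied.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(123)

# Synthetic stand-ins: one roughly symmetric feature and one right-skewed feature.
toy_features = pd.DataFrame({
    "symmetric_feature": rng.normal(loc=3.3, scale=0.15, size=500),
    "skewed_feature": rng.lognormal(mean=0.0, sigma=1.0, size=500),
})

# Sample skewness per column: near 0 for symmetric data, positive for a long right tail.
skewness = toy_features.skew()

# Flag values outside [Q1 - 1.5 * IQR, Q3 + 1.5 * IQR] (assumed outlier rule).
q1 = toy_features.quantile(0.25)
q3 = toy_features.quantile(0.75)
iqr = q3 - q1
is_outlier = (toy_features < q1 - 1.5 * iqr) | (toy_features > q3 + 1.5 * iqr)
outlier_counts = is_outlier.sum()

print(skewness.round(2))
print(outlier_counts)
```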

### Model Training
#### Model Selection and Hyperparameter Tuning

Our method for model selection uses 5-fold cross-validation and hyperparameter tuning on several models: logistic regression, decision tree, kNN, and RBF SVM. We use validation accuracy as our metric. We first fit a dummy classifier to establish a baseline.

```{glue:figure} dummy_df
:figwidth: 400px
:name: "dummy_df"

Cross-validation results for the Dummy Classifier baseline model.
```
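A baseline of this kind can be produced with scikit-learn's `DummyClassifier` and `cross_validate`. The following is a minimal sketch on synthetic data; the `most_frequent` strategy and the toy class proportions are assumptions, since the report does not state the settings used.

```python
import numpy as np
import pandas as pd
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_validate

rng = np.random.default_rng(123)

# Synthetic stand-in: 4 uninformative features and an imbalanced 4-class target.
X_toy = rng.normal(size=(200, 4))
y_toy = rng.choice([4, 5, 6, 7], size=200, p=[0.1, 0.5, 0.3, 0.1])

# Assumed strategy: always predict the most frequent class seen in training.
dummy = DummyClassifier(strategy="most_frequent")

# 5-fold cross-validation; one row per fold with fit/score times and accuracies.
cv_results = pd.DataFrame(
    cross_validate(dummy, X_toy, y_toy, cv=5, return_train_score=True)
)
baseline_accuracy = cv_results["test_score"].mean()

print(cv_results.round(3))
print(f"Baseline validation accuracy: {baseline_accuracy:.3f}")
```

With this strategy the baseline accuracy is roughly the share of the majority class, which is the bar any tuned model has to clear.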

As we can see, the baseline obtains an accuracy of around {glue:text}`dummy_valid_score` percent. We now use cross-validation paired with hyperparameter tuning to identify the best-performing model.

```{glue:figure} comparison_df
:width: 400px
:name: "comparison_df"

Grid search results for the four models: Logistic Regression, Decision Tree, kNN, and SVC.
```

We see that logistic regression has a best validation score of {glue:text}`logistic_gs_score` percent; decision tree is {glue:text}`decision_tree_gs_score` percent; kNN is {glue:text}`knn_gs_score` percent, and RBF SVM is {glue:text}`svc_gs_score` percent. As a result, we will use the tuned RBF SVM as our model on the test set.
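The comparison above can be sketched as a loop of grid searches over the four model families. This is an illustration on synthetic data, not the project's actual script: the hyperparameter grids, the standard-scaling step, and the 80/20 split are assumptions. Only the row labels and the `mean_test_score` column name mirror the comparison table.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the wine data: 11 features, 3 classes.
X, y = make_classification(n_samples=300, n_features=11, n_informative=5,
                           n_classes=3, random_state=123)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=123)

# Hypothetical grids; parameter names are prefixed with the pipeline step name.
candidates = {
    "logistic": (LogisticRegression(max_iter=1000),
                 {"logisticregression__C": [0.1, 1, 10]}),
    "decision_tree": (DecisionTreeClassifier(random_state=123),
                      {"decisiontreeclassifier__max_depth": [3, 5, 10]}),
    "knn": (KNeighborsClassifier(),
            {"kneighborsclassifier__n_neighbors": [5, 15, 25]}),
    "svc": (SVC(kernel="rbf"),
            {"svc__C": [0.1, 1, 10], "svc__gamma": [0.01, 0.1]}),
}

searches, scores = {}, {}
for name, (estimator, grid) in candidates.items():
    pipe = make_pipeline(StandardScaler(), estimator)
    search = GridSearchCV(pipe, grid, cv=5, scoring="accuracy")
    search.fit(X_train, y_train)           # the test set is never touched here
    searches[name] = search
    scores[name] = search.best_score_      # mean validation accuracy of the best setting

comparison = pd.Series(scores, name="mean_test_score").to_frame()
best_name = comparison["mean_test_score"].idxmax()

# Only the selected model is scored on the held-out test set.
test_accuracy = searches[best_name].score(X_test, y_test)
print(comparison.round(3))
print(best_name, round(test_accuracy, 3))
```

Scaling inside the pipeline means the scaler is refit on each training fold, so no information from the validation fold leaks into preprocessing.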

#### Test Set Deployment

The best model's score on the test set is around {glue:text}`test_set_score` percent, a slight improvement over the validation score. We probe its performance further by looking at the confusion matrix.

```{figure} ../results/figures/confusion_matrix_plot.png
---
width: 600px
name: confusion_matrix_plot
---
Confusion matrix of the SVC model performance on the test data.
```

For the average wines (quality 5 and 6), the model predicts most observations correctly. However, it misclassifies a large proportion of the wines with extreme quality scores, suggesting that the model is not robust against these outliers.
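Per-class recall puts a number on this pattern. The labels and predictions below are made up purely for illustration (they are not the model's actual output); the point is that overall accuracy can look acceptable while recall on the rare classes is low.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score

# Hypothetical true and predicted quality scores.
y_true = np.array([4, 5, 5, 5, 5, 6, 6, 6, 6, 7, 7, 8])
y_pred = np.array([5, 5, 5, 5, 6, 6, 6, 6, 5, 6, 7, 6])
labels = [4, 5, 6, 7, 8]

# Rows are true classes, columns are predicted classes, both ordered as in `labels`.
cm = confusion_matrix(y_true, y_pred, labels=labels)

# Recall per class: the fraction of each true class that was predicted correctly.
per_class_recall = recall_score(y_true, y_pred, labels=labels,
                                average=None, zero_division=0)
overall_accuracy = accuracy_score(y_true, y_pred)

print(cm)
print(dict(zip(labels, per_class_recall)))
```

In this toy example the middle classes (5 and 6) have a recall of 0.75 each, while the extreme classes (4 and 8) have a recall of 0.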

## Discussion 

In this project, we built several machine learning classification models to predict wine quality from the physicochemical properties of the wine. After trying different models with different hyperparameters, we found that the best-performing model for our data set is the RBF SVM. However, despite being the best, its accuracy is only around {glue:text}`test_set_score` percent, which may be decent or poor depending on the application. More importantly, the model does not identify the outlying quality scores reliably, so it would not meet expectations in cases where the goal is to find really good or really bad wines. We identified several factors that may contribute to this:

### High Correlations

```{figure} ../results/figures/correlation_matrix_plot.png
---
width: 1000px
name: correlation_matrix_plot
---
Correlation matrix for all red wine physicochemical features in the dataframe.
```

Several pairs of variables in the data set are substantially correlated (coefficients in the range of 0.6), and this collinearity could have caused problems for some of our models. Given this and the number of features, we could have applied a dimensionality reduction algorithm (such as PCA) to reduce the number of features and thereby remove some of the collinearity.
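As a rough illustration of the PCA idea, the sketch below builds synthetic features in which one column is correlated with another at about 0.6, then projects them onto principal components. The data and the choice of three components are arbitrary assumptions; on the real data the number of components would need to be tuned.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(123)

# Three independent features plus a fourth that is correlated with the first.
base = rng.normal(size=(300, 3))
correlated = 0.6 * base[:, [0]] + rng.normal(scale=0.8, size=(300, 1))
X_toy = np.hstack([base, correlated])

original_corr = np.corrcoef(X_toy[:, 0], X_toy[:, 3])[0, 1]

# Standardize first so no feature dominates because of its units, then project.
reducer = make_pipeline(StandardScaler(), PCA(n_components=3))
X_reduced = reducer.fit_transform(X_toy)

# Principal components are uncorrelated with each other on the data they were fit to.
component_corr = np.corrcoef(X_reduced, rowvar=False)
explained = reducer.named_steps["pca"].explained_variance_ratio_

print(f"Correlation between feature 0 and feature 3: {original_corr:.2f}")
print("Explained variance ratio:", explained.round(3))
```

In a modelling pipeline, the PCA step would sit between the scaler and the classifier so that it is refit within each cross-validation fold.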

### Potential Interactions
In our logistic regression model we did not take any potential interactions into account. With this many features, it is possible that the effect of some features depends on the values of others {cite}`log_regression_PCA` {cite}`deciphering_interactions`.
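One way to let a logistic regression use interactions is to add pairwise product terms with scikit-learn's `PolynomialFeatures(interaction_only=True)`. The sketch below uses a synthetic target that depends only on the product of two features, an extreme case chosen for illustration. It is not a claim about how the wine features actually interact.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

rng = np.random.default_rng(123)

# 11 synthetic features; the label depends only on the sign of x0 * x1.
X_toy = rng.normal(size=(200, 11))
y_toy = (X_toy[:, 0] * X_toy[:, 1] > 0).astype(int)

main_effects = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
with_interactions = make_pipeline(
    StandardScaler(),
    # Adds all pairwise products x_i * x_j (i < j) but no squared terms.
    PolynomialFeatures(degree=2, interaction_only=True, include_bias=False),
    LogisticRegression(max_iter=1000),
)

main_score = cross_val_score(main_effects, X_toy, y_toy, cv=5).mean()
interaction_score = cross_val_score(with_interactions, X_toy, y_toy, cv=5).mean()

# 11 original columns + C(11, 2) = 55 pairwise products = 66 columns.
with_interactions.fit(X_toy, y_toy)
n_expanded = with_interactions.named_steps["polynomialfeatures"].n_output_features_

print(f"Main effects only: {main_score:.3f}")
print(f"With interactions: {interaction_score:.3f} ({n_expanded} columns)")
```

The expansion grows quadratically with the number of features, so regularization strength would likely need to be retuned alongside it.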

### Problem Formulation
The response variable could instead be treated as a number. Framing the task as a regression problem might better capture the ordered nature of the quality score and produce a better model. Additionally, due to the limited scope of our data set (no observation below {glue:text}`min_wine_quality` or above {glue:text}`max_wine_quality`), a classification model trained on this data set cannot predict any quality score outside that range. A regression model is less constrained in this respect, because its predictions are not limited to the labels seen during training.
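A regression formulation could look like the following sketch, which fits a ridge regression to a numeric quality score and then rounds and clips predictions to the 0 to 10 scale. The data, the coefficients used to generate it, and the choice of `Ridge` are all assumptions made for illustration; the example only shows that a regressor can output a score outside the range seen in training, not that such a prediction would be accurate.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(123)

# Synthetic data: quality is a noisy linear function of two of three features,
# rounded and restricted to scores 3..8 to mimic a limited observed range.
X_toy = rng.normal(size=(300, 3))
latent = 5.5 + 0.8 * X_toy[:, 0] - 0.5 * X_toy[:, 1] + rng.normal(scale=0.3, size=300)
y_toy = np.clip(np.rint(latent), 3, 8).astype(int)

regressor = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
regressor.fit(X_toy, y_toy)

# A hypothetical wine far outside the training feature range.
X_new = np.array([[4.0, -4.0, 0.0]])
raw_prediction = regressor.predict(X_new)

# Map the continuous output back onto the 0..10 integer quality scale.
predicted_quality = np.clip(np.rint(raw_prediction), 0, 10).astype(int)

print(f"Highest score in training data: {y_toy.max()}")
print(f"Raw prediction: {raw_prediction[0]:.2f}, final score: {predicted_quality[0]}")
```

A classifier fit to the same `y_toy` could only return one of the labels it was trained on, so it could never produce a score above the training maximum.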

### Infeasibility of the Problem
Even with all of the potential improvements identified above, it is possible that the accuracy would not improve by much. This would not be due to an incorrect setup of the analysis. Rather, uncontrollable factors in the winemaking process may make it impossible to detect patterns for really good or really bad wines from these measurements, in which case their quality can only be determined by tasting rather than by prediction from numerical representations of their properties. Of all the possible problems we have identified, this is the only one for which we have no proposed solution.

## Software Attributions

To complete this analysis, visualize data, and build the machine learning model, Python {cite}`10.5555/1593511` and associated libraries, including Pandas {cite}`mckinney2010data`, NumPy {cite}`numpy`, scikit-learn {cite}`scikit-learn`, Altair {cite}`vanderplas2018altair`, Seaborn {cite}`Waskom2021`, and Matplotlib {cite}`Hunter:2007` were used. 

We acknowledge the contributions of the open-source community and developers behind these tools, which significantly facilitated our analysis.

## References 

```{bibliography}
```