{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "tags": [
     "remove-cell"
    ]
   },
   "outputs": [],
   "source": [
    "import pandas as pd"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Model selection"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Metrics"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In determining the metric we use for our model, we need to first consider the context of Type I and Type II errors for our problem:\n",
    "\n",
    "- A type I (false positive) error would be predicting that a customer will make a purchase, when they in fact do not.\n",
    "- A type II (false negative) error would be predicting that a customer will not make a purchase, when in fact they do.\n",
    "\n",
    "Further, we need to consider the business objective of an e-commerce company who would potentially be using this model. We assume that the company will have the following objectives: \n",
    "\n",
    "1. Maximize revenue by increasing purchase conversion rate\n",
    "2. Minimize disruption to customer experience from targeted nudges\n",
    "\n",
    "Based on this, we note that relevant metrics include precision and recall. To be more specific, we want to maximize recall, while keeping a minimum threshold of 60% precision (threshold based on business requirement and tolerance)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Base model"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Our baseline model will be the `DummyClassifier` model from `sklearn` with the default parameters.  From the `sklearn` [documentation](https://scikit-learn.org/stable/modules/generated/sklearn.dummy.DummyClassifier.html), the default `strategy` parameter is `prior` which always predicts the class that maximizes the class prior."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Models tested"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In performing model selection, we will consider the following models:\n",
    "\n",
    "- Logistic Regression\n",
    "- Support Vector Machine w/ an RBF kernel\n",
    "- Random Forest Classifier\n",
    "- XGBoost Classifier\n",
    "\n",
    "In assessing these models we will:\n",
    "\n",
    "1. Perform five fold cross validation and look at the mean results\n",
    "2. Generate and analyze confuision matrices with cross validated predictions on the train set\n",
    "3. Generate and analyze precision recall curves with cross validated predictions (probabilities in this case) on the train set"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Cross validation"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We performed 5 fold cross validation for the above models on our training data and observed the following metrics:\n",
    "\n",
    "_Please note that the following metrics are the mean values over the 5 folds of cross validation_"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "tags": [
     "remove_input"
    ]
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<style type=\"text/css\">\n",
       "#T_835f9_ td:hover {\n",
       "  background-color: #ffffb3;\n",
       "}\n",
       "#T_835f9_ .index_name {\n",
       "  font-style: italic;\n",
       "  color: darkgrey;\n",
       "  font-weight: normal;\n",
       "}\n",
       "#T_835f9_ th:not(.index_name) {\n",
       "  background-color: #000066;\n",
       "  color: white;\n",
       "}\n",
       "</style>\n",
       "<table id=\"T_835f9_\">\n",
       "  <thead>\n",
       "    <tr>\n",
       "      <th class=\"blank level0\" >&nbsp;</th>\n",
       "      <th class=\"col_heading level0 col0\" >DummyClassifier</th>\n",
       "      <th class=\"col_heading level0 col1\" >LogisticRegression</th>\n",
       "      <th class=\"col_heading level0 col2\" >SVC</th>\n",
       "      <th class=\"col_heading level0 col3\" >RandomForest</th>\n",
       "      <th class=\"col_heading level0 col4\" >XGBoost</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th id=\"T_835f9_level0_row0\" class=\"row_heading level0 row0\" >fit_time</th>\n",
       "      <td id=\"T_835f9_row0_col0\" class=\"data row0 col0\" >0.00 (+/- 0.00)</td>\n",
       "      <td id=\"T_835f9_row0_col1\" class=\"data row0 col1\" >0.21 (+/- 0.02)</td>\n",
       "      <td id=\"T_835f9_row0_col2\" class=\"data row0 col2\" >1.27 (+/- 0.19)</td>\n",
       "      <td id=\"T_835f9_row0_col3\" class=\"data row0 col3\" >0.80 (+/- 0.06)</td>\n",
       "      <td id=\"T_835f9_row0_col4\" class=\"data row0 col4\" >0.66 (+/- 0.01)</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th id=\"T_835f9_level0_row1\" class=\"row_heading level0 row1\" >score_time</th>\n",
       "      <td id=\"T_835f9_row1_col0\" class=\"data row1 col0\" >0.01 (+/- 0.00)</td>\n",
       "      <td id=\"T_835f9_row1_col1\" class=\"data row1 col1\" >0.01 (+/- 0.00)</td>\n",
       "      <td id=\"T_835f9_row1_col2\" class=\"data row1 col2\" >1.46 (+/- 0.16)</td>\n",
       "      <td id=\"T_835f9_row1_col3\" class=\"data row1 col3\" >0.06 (+/- 0.00)</td>\n",
       "      <td id=\"T_835f9_row1_col4\" class=\"data row1 col4\" >0.01 (+/- 0.00)</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th id=\"T_835f9_level0_row2\" class=\"row_heading level0 row2\" >test_accuracy</th>\n",
       "      <td id=\"T_835f9_row2_col0\" class=\"data row2 col0\" >0.85 (+/- 0.00)</td>\n",
       "      <td id=\"T_835f9_row2_col1\" class=\"data row2 col1\" >0.88 (+/- 0.02)</td>\n",
       "      <td id=\"T_835f9_row2_col2\" class=\"data row2 col2\" >0.89 (+/- 0.03)</td>\n",
       "      <td id=\"T_835f9_row2_col3\" class=\"data row2 col3\" >0.89 (+/- 0.03)</td>\n",
       "      <td id=\"T_835f9_row2_col4\" class=\"data row2 col4\" >0.88 (+/- 0.04)</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th id=\"T_835f9_level0_row3\" class=\"row_heading level0 row3\" >train_accuracy</th>\n",
       "      <td id=\"T_835f9_row3_col0\" class=\"data row3 col0\" >0.85 (+/- 0.00)</td>\n",
       "      <td id=\"T_835f9_row3_col1\" class=\"data row3 col1\" >0.89 (+/- 0.01)</td>\n",
       "      <td id=\"T_835f9_row3_col2\" class=\"data row3 col2\" >0.91 (+/- 0.01)</td>\n",
       "      <td id=\"T_835f9_row3_col3\" class=\"data row3 col3\" >1.00 (+/- 0.00)</td>\n",
       "      <td id=\"T_835f9_row3_col4\" class=\"data row3 col4\" >0.99 (+/- 0.00)</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th id=\"T_835f9_level0_row4\" class=\"row_heading level0 row4\" >test_precision</th>\n",
       "      <td id=\"T_835f9_row4_col0\" class=\"data row4 col0\" >0.00 (+/- 0.00)</td>\n",
       "      <td id=\"T_835f9_row4_col1\" class=\"data row4 col1\" >0.73 (+/- 0.15)</td>\n",
       "      <td id=\"T_835f9_row4_col2\" class=\"data row4 col2\" >0.71 (+/- 0.13)</td>\n",
       "      <td id=\"T_835f9_row4_col3\" class=\"data row4 col3\" >0.69 (+/- 0.13)</td>\n",
       "      <td id=\"T_835f9_row4_col4\" class=\"data row4 col4\" >0.65 (+/- 0.16)</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th id=\"T_835f9_level0_row5\" class=\"row_heading level0 row5\" >train_precision</th>\n",
       "      <td id=\"T_835f9_row5_col0\" class=\"data row5 col0\" >0.00 (+/- 0.00)</td>\n",
       "      <td id=\"T_835f9_row5_col1\" class=\"data row5 col1\" >0.77 (+/- 0.02)</td>\n",
       "      <td id=\"T_835f9_row5_col2\" class=\"data row5 col2\" >0.78 (+/- 0.01)</td>\n",
       "      <td id=\"T_835f9_row5_col3\" class=\"data row5 col3\" >1.00 (+/- 0.00)</td>\n",
       "      <td id=\"T_835f9_row5_col4\" class=\"data row5 col4\" >1.00 (+/- 0.00)</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th id=\"T_835f9_level0_row6\" class=\"row_heading level0 row6\" >test_recall</th>\n",
       "      <td id=\"T_835f9_row6_col0\" class=\"data row6 col0\" >0.00 (+/- 0.00)</td>\n",
       "      <td id=\"T_835f9_row6_col1\" class=\"data row6 col1\" >0.38 (+/- 0.08)</td>\n",
       "      <td id=\"T_835f9_row6_col2\" class=\"data row6 col2\" >0.48 (+/- 0.11)</td>\n",
       "      <td id=\"T_835f9_row6_col3\" class=\"data row6 col3\" >0.55 (+/- 0.11)</td>\n",
       "      <td id=\"T_835f9_row6_col4\" class=\"data row6 col4\" >0.56 (+/- 0.12)</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th id=\"T_835f9_level0_row7\" class=\"row_heading level0 row7\" >train_recall</th>\n",
       "      <td id=\"T_835f9_row7_col0\" class=\"data row7 col0\" >0.00 (+/- 0.00)</td>\n",
       "      <td id=\"T_835f9_row7_col1\" class=\"data row7 col1\" >0.41 (+/- 0.04)</td>\n",
       "      <td id=\"T_835f9_row7_col2\" class=\"data row7 col2\" >0.55 (+/- 0.07)</td>\n",
       "      <td id=\"T_835f9_row7_col3\" class=\"data row7 col3\" >1.00 (+/- 0.00)</td>\n",
       "      <td id=\"T_835f9_row7_col4\" class=\"data row7 col4\" >0.94 (+/- 0.02)</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th id=\"T_835f9_level0_row8\" class=\"row_heading level0 row8\" >test_f1</th>\n",
       "      <td id=\"T_835f9_row8_col0\" class=\"data row8 col0\" >0.00 (+/- 0.00)</td>\n",
       "      <td id=\"T_835f9_row8_col1\" class=\"data row8 col1\" >0.50 (+/- 0.10)</td>\n",
       "      <td id=\"T_835f9_row8_col2\" class=\"data row8 col2\" >0.57 (+/- 0.12)</td>\n",
       "      <td id=\"T_835f9_row8_col3\" class=\"data row8 col3\" >0.61 (+/- 0.12)</td>\n",
       "      <td id=\"T_835f9_row8_col4\" class=\"data row8 col4\" >0.60 (+/- 0.13)</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th id=\"T_835f9_level0_row9\" class=\"row_heading level0 row9\" >train_f1</th>\n",
       "      <td id=\"T_835f9_row9_col0\" class=\"data row9 col0\" >0.00 (+/- 0.00)</td>\n",
       "      <td id=\"T_835f9_row9_col1\" class=\"data row9 col1\" >0.53 (+/- 0.04)</td>\n",
       "      <td id=\"T_835f9_row9_col2\" class=\"data row9 col2\" >0.64 (+/- 0.05)</td>\n",
       "      <td id=\"T_835f9_row9_col3\" class=\"data row9 col3\" >1.00 (+/- 0.00)</td>\n",
       "      <td id=\"T_835f9_row9_col4\" class=\"data row9 col4\" >0.97 (+/- 0.01)</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th id=\"T_835f9_level0_row10\" class=\"row_heading level0 row10\" >test_average_precision</th>\n",
       "      <td id=\"T_835f9_row10_col0\" class=\"data row10 col0\" >0.15 (+/- 0.00)</td>\n",
       "      <td id=\"T_835f9_row10_col1\" class=\"data row10 col1\" >0.61 (+/- 0.14)</td>\n",
       "      <td id=\"T_835f9_row10_col2\" class=\"data row10 col2\" >0.65 (+/- 0.17)</td>\n",
       "      <td id=\"T_835f9_row10_col3\" class=\"data row10 col3\" >0.68 (+/- 0.18)</td>\n",
       "      <td id=\"T_835f9_row10_col4\" class=\"data row10 col4\" >0.65 (+/- 0.18)</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th id=\"T_835f9_level0_row11\" class=\"row_heading level0 row11\" >train_average_precision</th>\n",
       "      <td id=\"T_835f9_row11_col0\" class=\"data row11 col0\" >0.15 (+/- 0.00)</td>\n",
       "      <td id=\"T_835f9_row11_col1\" class=\"data row11 col1\" >0.67 (+/- 0.04)</td>\n",
       "      <td id=\"T_835f9_row11_col2\" class=\"data row11 col2\" >0.78 (+/- 0.03)</td>\n",
       "      <td id=\"T_835f9_row11_col3\" class=\"data row11 col3\" >1.00 (+/- 0.00)</td>\n",
       "      <td id=\"T_835f9_row11_col4\" class=\"data row11 col4\" >1.00 (+/- 0.00)</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n"
      ],
      "text/plain": [
       "<pandas.io.formats.style.Styler at 0x10b32bdc0>"
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "model_selection_results_df = pd.read_csv(\"../results/model_selection/model_selection_results.csv\", index_col=0)\n",
    "\n",
    "# set pandas table styles\n",
    "s = model_selection_results_df.style\n",
    "\n",
    "cell_hover = { \n",
    "    'selector': 'td:hover',\n",
    "    'props': [('background-color', '#ffffb3')]\n",
    "}\n",
    "\n",
    "index_names = {\n",
    "    'selector': '.index_name',\n",
    "    'props': 'font-style: italic; color: darkgrey; font-weight:normal;'\n",
    "}\n",
    "headers = {\n",
    "    'selector': 'th:not(.index_name)',\n",
    "    'props': 'background-color: #000066; color: white;'\n",
    "}\n",
    "\n",
    "s.set_table_styles([cell_hover, index_names, headers])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Table.2 Cross-validation results of models\n",
    "\n",
    "From Table 2 we can see that the:\n",
    "\n",
    "- Logistic regression model had the best precision scores on the validation tests during cross validation.  However, the recall scores of this model were quite poor.\n",
    "- Support vector machine had similar precision scores to the logistic regression model, and slightly better recall scores.\n",
    "- Random forest model and XGBoost models both severely overfit the training set, which can be seen by perfect accuracy and precision scores on the training set.  However we note that it is common for ensemble models to overfit the training set with no hyperparameter tuning.\n",
    "- The test set recall scores of the Random forest and XGBoost models are higher than the logistic regression and support vector machine models.\n",
    "- The F1 scores of the models are consistent with the analysis above.\n",
    "\n",
    "Based on the above cross validation results, the Random Forest model or XGBoost model look promising."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Confusion matrices"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We used sklearn's `cross_val_predict` in combination with the `ConfusionMatrixDisplay.from_predictions` method to generate the Figure 5, which shows the confusion matrices for the models of interest.  We did not generate a confusion matrix for `DummyClassifier` as this information is not useful."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "![ConfusionMatrix](images/model_cm.png)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Fig.5 - Confusion Matrix for various models\n",
    "\n",
    "From the confusion matrices above, we can see that the:\n",
    "- Logistic regression model is outputing 218 false positives, and 928 false negatives.\n",
    "- Support vector machine is outputing 288 false positives, and 779 false negatives.\n",
    "- Random forest is outputing 384 false positives, and 674 false negatives.\n",
    "- XGBoost model is outputing 478 false positives, and 657 false negatives.\n",
    "\n",
    "Considering our goals of maximizing recall, with a budget of of 60% for precision, the Random forest and XGBoost model both look promising based on the confusion matrices."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Precision Recall Curves"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We used sklearn's `cross_val_predict` in combination with the `PrecisionRecallDisplay.from_predictions` method to generate Figure 6, which shows the precision recall curves for the models of interest."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "![PRCurves](images/model_pr_curves.png)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Fig.6 - Precision recall curve for various models\n",
    "\n",
    "In assessing precision and recall curves, the following traits indicate a \"better\" curve:\n",
    "\n",
    "- A perfect curve would go from the top left of the plot, to the top right, and then to the bottom right.  This would indicate a perfect scenario where a model can attain both 100% precision and recall (mostly impossible in real life)\n",
    "- A higher area under the curve\n",
    "- A higher average precision value (this is an average of the precision values at the different operating points in the curve and is shown in the legend as `AP`)\n",
    "\n",
    "Keeping the above points in mind, we can see that the Random forest and XGboost models appear to be able to generate our minimum budget of 60% precision while acheiving a 80% recall.  We note that these curves have been generated with no hyperparameter tuning.  In the model tuning section, we will determine if we need to manually set the operating point of our final model in order to increase or decrease a precision/recall score."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Model selected for further tuning"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Above, we analyzed what can be considered to be \"quick and dirty\" models.  This is the case since we simply fit these models to our pre-processed training data with no hyperparameter tuning, and then analyzed their results.  The main purpose of training models in this fashion is to identify a promising model to tune further in order to solve our classification problem.\n",
    "\n",
    "Based on these results alone, the Random forest and XGBoost models both appear to be promising.  We note that both of these models are ensemble models, and share some similarities.  As our dataset is not that large, we will select the Random forest model to tune further.\n",
    "\n",
    "Finally, we note that Random forests are often a great model for classification problems as they inject randomness into a problem in the form of bagging and random features {cite}`breiman2001random`."
   ]
  }
 ],
 "metadata": {
  "celltoolbar": "Tags",
  "interpreter": {
   "hash": "224f349dae0cd7f483ba6ec89ede81d9b2e295608a2b61b0158a425d5ea0bda6"
  },
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.7"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}