# Bank Marketing Analysis

by Runtian Li, Rafe Chang, Sid Grover, Anu Banga

**Repo Link:** https://github.com/UBC-MDS/dsci_522_group_8.git

In [1]:
# Import necessary packages
import pandas as pd
import numpy as np
from myst_nb import glue

## Summary
Here we build a balanced SVC model to predict whether a new client will subscribe to a term deposit. We tested five classification models (a dummy classifier, unbalanced and balanced logistic regression, and unbalanced and balanced SVC) and chose the balanced SVC based on its recall. It has the highest test recall score, 0.82, which indicates that it makes the fewest false negative predictions among the five models.

The balanced support vector machine model considers 13 numerical and categorical customer features. After hyperparameter optimization, the model's test recall increased from 0.82 to 0.875, and its test accuracy increased from 0.82 to 0.86. The results were somewhat expected, given SVC's known efficacy in classification tasks, particularly when there's a clear margin of separation. The high recall score of 0.875 indicates that the model is particularly adept at identifying clients likely to subscribe, which was the primary goal. It's noteworthy that such a high recall was achieved, as it suggests the model is highly sensitive to true positive cases.

## Introduction

Term deposits are valuable to banks because they ensure a stable stream of income that banks can utilize. Banks usually invest this money in higher-return financial products or lend it to other customers at a higher interest rate to make a profit. With term deposits, banks can also better predict their cash flow.

While banks' marketing strategies nowadays usually focus on attracting new customers, banks must target the right potential customers. This research aims to identify the right audience so that banks can design their marketing strategies accordingly {cite}`dooley2023what`.

### Background
The data relates to the direct marketing campaigns of a Portuguese banking institution. The campaigns were based on phone calls. Often, more than one contact with the same client was required in order to assess whether the product (a bank term deposit) would be subscribed ('yes') or not ('no'). The data set used in this project was created by S. Moro, P. Rita, and P. Cortez and was sourced from the UCI Machine Learning Repository {cite}`moro2012bank`. We use bank-full.csv, which contains all examples and 17 inputs, ordered by date (an older version of this dataset with fewer inputs). The raw data file can be found [here](https://archive.ics.uci.edu/dataset/222/bank+marketing).

### Research Question

We are working on a binary classification problem. The goal is to predict whether a client will subscribe to a term deposit: "yes" if the client will subscribe and "no" if the client will not.

## Analysis

### Data Preprocessing
We first confirmed that the data has no missing values and removed the columns we did not need: "contact", "day", and "month". We then used StandardScaler to standardize numerical features such as "age" and "balance" and applied one-hot encoding to the categorical features, making the data compatible with the machine learning models we compare.

### Model Selection and Evaluation
We used five models for classification, starting with a basic Dummy Classifier as the baseline. The other models were Logistic Regression and Support Vector Classifier (SVC), each in an unbalanced and a balanced variant, with different strengths in accuracy and recall {cite}`moro2014data`. Notably, our balanced models, Balanced Logistic Regression (logreg_bal) and Balanced Support Vector Classifier (svc_bal), performed best at identifying clients likely to subscribe to a term deposit.

### Model Comparison
An extensive evaluation, considering accuracy, precision, recall, and F1 scores, highlighted the Balanced Support Vector Classifier (svc_bal) as the standout performer. This model excelled with high recall, crucial for identifying potential term deposit subscribers in our specific context.

### Hyperparameter Optimization
We tuned the hyperparameters of the balanced Support Vector Classifier (SVC) on a reduced dataset to keep the search computationally efficient. The final model reaches a test accuracy of 0.86 and a test recall of 0.875.

### Recall - The Preferred Metric
In our bank marketing dataset, we prioritize recall. Recall measures the model's ability to identify true positive cases, that is, clients who subscribe to a term deposit. In our context, missing a potential subscriber (a false negative) is more costly than a false positive, because it means a missed opportunity. Prioritizing recall therefore steers the model toward capturing as many potential subscribers as possible, in line with our main goal.
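
As a small illustration of the metric, the sketch below computes recall and precision with scikit-learn on a handful of made-up labels. The labels are invented for this example and are not taken from the bank data.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical labels: 1 = client subscribed ("yes"), 0 = did not ("no")
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]

# For binary labels, ravel() returns the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

recall = recall_score(y_true, y_pred)        # tp / (tp + fn)
precision = precision_score(y_true, y_pred)  # tp / (tp + fp)

print(f"TP={tp}, FN={fn}, FP={fp}, TN={tn}")
print(f"recall={recall:.2f}, precision={precision:.2f}")
```

In this toy case one of four subscribers is missed, so recall is 0.75, while two false positives pull precision down to 0.60.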

## Modeling and Results

### Exploratory Data Analysis 

According to the discussion above, we decided to keep the following features as numerical features: "age", "balance", "duration", "campaign", "pdays", "previous" {cite}`vajiramedhin2014feature`. Their distributions are shown in {numref}`Figure {number} <numerical_features_distribution>`.

```{figure} ../results/figures/numerical_dist_by_feat.png
---
width: 800px
name: numerical_features_distribution
---
Distribution of all the numerical features after feature selection
```

We decided to keep the following features as categorical features: "job", "marital", "education", "default", "housing", "loan", "poutcome". Their distributions are shown in {numref}`Figure {number} <categorical_features_distribution>`.

```{figure} ../results/figures/categorical_dist_by_feat.png
---
width: 800px
name: categorical_features_distribution
---
Distribution of all the categorical features after feature selection
```

In {numref}`Figure {number} <corr_matx>`, we explore the Spearman correlation between the numerical features.

```{figure} ../results/figures/corr_matx.png
---
width: 800px
name: corr_matx
---
Correlation Matrix of numerical features
```
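
A correlation matrix of this kind can be computed directly with pandas. The following is a minimal sketch on randomly generated stand-in data (`toy_df` is hypothetical); the project's own plotting code is not shown here.

```python
import numpy as np
import pandas as pd

numerical_features = ["age", "balance", "duration", "campaign", "pdays", "previous"]

# Synthetic stand-in for the training data (random values, illustration only)
rng = np.random.default_rng(0)
toy_df = pd.DataFrame(
    rng.normal(size=(200, len(numerical_features))), columns=numerical_features
)

# Rank-based (Spearman) correlation between every pair of numerical features
corr_matrix = toy_df[numerical_features].corr(method="spearman")
print(corr_matrix.round(2))
```

Spearman correlation works on ranks, so it captures monotonic relationships and is less sensitive to skewed features than Pearson correlation.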

### Preprocessing

<div class="alert alert-info">
    
- Since there are no missing values in our dataset, we don't need to do imputation or drop NAs.
- We drop the "contact", "day" and "month" columns since they do not help us identify useful underlying patterns for the model.
- We take "age", "balance", "duration", "campaign", "pdays", "previous" as numerical features and apply a StandardScaler transformation to them.
- We take "job", "marital", "education", "default", "housing", "loan", "poutcome" as categorical features and apply one-hot encoding to them. We drop one of the encoded columns only when the categorical feature is binary.
    
</div>

The transformed dataframe after applying `StandardScaler` to the numerical features and `OneHotEncoder` to the categorical features is shown below. The number of features after preprocessing is 32.

In [2]:
# Lists of feature names
numerical_features = ["age", "balance", "duration", "campaign", "pdays", "previous"]
categorical_features = ["job", "marital", "education", "default", "housing", "loan", "poutcome"]
drop_features = ["contact", "day", "month"]

In [3]:
# Display the transformed X_train
X_train_enc = pd.read_csv("../data/processed/X_train_enc.csv")
X_train_enc

|  | age | balance | duration | campaign | pdays | previous | job_admin. | job_blue-collar | job_entrepreneur | job_housemaid | ... | education_secondary | education_tertiary | education_unknown | default_yes | housing_yes | loan_yes | poutcome_failure | poutcome_other | poutcome_success | poutcome_unknown |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -1.125848 | -0.525607 | -0.250585 | -0.568295 | -0.411533 | -0.245565 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 1 | 1.136220 | -0.457253 | 0.100475 | -0.245219 | -0.411533 | -0.245565 | 0.0 | 1.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 2 | 1.324725 | 0.405335 | 0.266360 | -0.245219 | 0.537396 | 0.600341 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 3 | -1.031595 | -0.457253 | -0.173429 | -0.245219 | -0.411533 | -0.245565 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 4 | -1.031595 | -0.280868 | -0.586213 | 0.077857 | -0.411533 | -0.245565 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 36163 | 0.853461 | 0.767775 | 2.419011 | 0.077857 | -0.411533 | -0.245565 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 36164 | -0.466078 | -0.245524 | 0.385952 | -0.568295 | -0.411533 | -0.245565 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 36165 | 0.193692 | 0.764441 | 0.058039 | -0.568295 | -0.411533 | -0.245565 | 0.0 | 1.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 36166 | 1.324725 | 2.405259 | -0.223580 | -0.245219 | -0.411533 | -0.245565 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |


### Model Selection

In [4]:
# Load the model-selection scores (mean and std for each model) and label the metric column
scoring_df = pd.read_csv("../results/metrics/model_selection_scores.csv")
scoring_df = scoring_df.rename(columns={"Unnamed: 0": "scoring"})
scoring_df

| scoring | dummy mean | dummy std | logreg mean | logreg std | svc mean | svc std | logreg_bal mean | logreg_bal std | svc_bal mean | svc_bal std |
|---|---|---|---|---|---|---|---|---|---|---|
| fit_time | 0.119 | 0.014 | 0.652 | 0.038 | 16.451 | 1.298 | 0.804 | 0.067 | 30.036 | 1.635 |
| score_time | 0.072 | 0.013 | 0.099 | 0.013 | 1.78 | 0.076 | 0.134 | 0.007 | 3.353 | 0.167 |
| test_accuracy | 0.883 | 0.0 | 0.9 | 0.004 | 0.899 | 0.003 | 0.829 | 0.005 | 0.814 | 0.007 |
| train_accuracy | 0.883 | 0.0 | 0.9 | 0.0 | 0.906 | 0.0 | 0.83 | 0.001 | 0.824 | 0.001 |
| test_precision | 0.0 | 0.0 | 0.653 | 0.041 | 0.657 | 0.035 | 0.386 | 0.007 | 0.369 | 0.011 |
| train_precision | 0.0 | 0.0 | 0.655 | 0.003 | 0.724 | 0.004 | 0.387 | 0.001 | 0.387 | 0.001 |
| test_recall | 0.0 | 0.0 | 0.314 | 0.022 | 0.291 | 0.016 | 0.778 | 0.013 | 0.823 | 0.016 |
| train_recall | 0.0 | 0.0 | 0.314 | 0.004 | 0.324 | 0.004 | 0.779 | 0.001 | 0.863 | 0.004 |
| test_f1 | 0.0 | 0.0 | 0.423 | 0.025 | 0.403 | 0.016 | 0.516 | 0.007 | 0.509 | 0.013 |


<div class="alert alert-info">

`Dummy Classifier` has an accuracy of 0.883 but zero precision, recall, and F1 scores, indicating it never predicts the positive class (in this case, that the client subscribed to a term deposit). This is expected, as it always predicts the most frequent class, so its accuracy simply reflects the share of non-subscribers.

`logreg` shows improved accuracy over the dummy model (0.900 versus 0.883). However, its recall is low (0.314), suggesting it misses a significant number of true positive cases. `svc` performed almost the same as the logistic regression model across all metrics.

`logreg_bal` and `svc_bal` have lower accuracy than their unbalanced counterparts but significantly higher recall. This indicates they are better at identifying positive cases, at the cost of making more false positive errors.

Given the context of our bank marketing data set, we aim to detect the clients who will subscribe to a term deposit given the features. Missing a potential "yes" could be more costly than a false positive, as it represents a lost opportunity for the sales team to convert this potential customer. Therefore, we chose `svc_bal` because it has the highest `test_recall` score (0.823).
    
</div>
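
A comparison like the one above can be produced with `cross_validate`. The sketch below makes several assumptions: it runs on synthetic imbalanced data rather than the bank data, it assumes the "balanced" variants correspond to `class_weight="balanced"`, and it uses 5 folds, none of which is confirmed by the table itself.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic, imbalanced stand-in data (roughly 12% positives); not the bank data
X, y = make_classification(
    n_samples=600, n_features=8, weights=[0.88, 0.12], random_state=123
)

# Assumption: "balanced" means class_weight="balanced"
models = {
    "dummy": DummyClassifier(strategy="most_frequent"),
    "logreg": LogisticRegression(max_iter=1000),
    "svc": SVC(),
    "logreg_bal": LogisticRegression(max_iter=1000, class_weight="balanced"),
    "svc_bal": SVC(class_weight="balanced"),
}
scoring = ["accuracy", "precision", "recall", "f1"]

results = {}
for name, model in models.items():
    pipe = make_pipeline(StandardScaler(), model)
    # The dummy model may emit an undefined-precision warning,
    # because it never predicts the positive class.
    scores = cross_validate(pipe, X, y, cv=5, scoring=scoring, return_train_score=True)
    results[name] = pd.DataFrame(scores).mean().round(3)

summary = pd.DataFrame(results)
print(summary)
```

Each column of `summary` holds the mean cross-validation scores of one model, so the recall rows can be compared directly across the five candidates.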

In [5]:
# Load the scoring metrics of the selected model
scoring_metric_df = pd.read_csv("../results/metrics/scoring_metrics.csv")
scoring_metric_df

| train_accuracy | test_accuracy | train_precision | test_precision | train_recall | test_recall | train_f1 | test_f1 | fit_time | score_time |
|---|---|---|---|---|---|---|---|---|---|
| 0.82385 | 0.815659 | 0.386597 | 0.369435 | 0.861531 | 0.816462 | 0.533704 | 0.508694 | 59.114527 | 102.086006 |


### Hyperparameter Optimization

<div class="alert alert-info">

We optimize the SVC hyperparameters on a smaller sample of 10,000 instances to improve computational efficiency. This speeds up the exploration of the hyperparameter space and helps us find promising configurations. The results should be interpreted with the limits of the smaller sample in mind, since the best configuration on 10,000 instances is not guaranteed to be the best on the full training set.
    
</div>
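
One simple way to draw such a subsample is shown below. This is a sketch on synthetic data: the names `X_train` and `y_train`, the random seed, and the use of plain random sampling are assumptions, since the report does not state how the 10,000 instances were selected.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the encoded training set (random values, illustration only)
rng = np.random.default_rng(522)
X_train = pd.DataFrame(rng.normal(size=(36_000, 4)), columns=["f1", "f2", "f3", "f4"])
y_train = pd.Series(rng.random(36_000) < 0.12, name="y").astype(int)

# Draw 10,000 rows at random and keep the matching targets
X_small = X_train.sample(n=10_000, random_state=522)
y_small = y_train.loc[X_small.index]

# Compare the positive rate in the full data and in the subsample
print(X_small.shape, round(y_train.mean(), 3), round(y_small.mean(), 3))
```

Checking that the positive rate in the subsample is close to that of the full data is a quick sanity check when the classes are imbalanced.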

In [8]:
# Load the best hyperparameters and register them as a glue variable
best_hyperparameters = pd.read_csv("../results/metrics/best_params.csv")
bh = best_hyperparameters[["svc__C", "svc__gamma", "svc__kernel"]]
bh = bh.style.format().hide()
glue("bh", bh, display=False)

```{glue:figure} bh
:figwidth: 300px
:name: "Best Hyperparameters"

Best C, gamma and kernel parameters for svc_balanced model
```

We ran a randomized search over the SVC model with C values ranging from 0.1 to 10, gamma values ranging from 0.001 to 0.1, and the rbf, sigmoid, and linear kernels to optimize the model's performance. With 25 random combinations and 5-fold cross-validation, the best hyperparameter combination is approximately 4.33 for C and approximately 0.01 for gamma, with the rbf kernel.
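
A search of this shape can be sketched with `RandomizedSearchCV`. The parameter names follow the `svc__` prefix seen in the best-parameters table; the synthetic data, the log-uniform sampling distributions, `class_weight="balanced"`, and `scoring="recall"` are assumptions made for illustration.

```python
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic, imbalanced stand-in data; not the bank data
X, y = make_classification(
    n_samples=400, n_features=8, weights=[0.88, 0.12], random_state=123
)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("svc", SVC(class_weight="balanced")),  # assumption: balanced class weights
])

# Ranges taken from the text; the log-uniform distribution is an assumption
param_dist = {
    "svc__C": loguniform(0.1, 10),
    "svc__gamma": loguniform(0.001, 0.1),
    "svc__kernel": ["rbf", "sigmoid", "linear"],
}

search = RandomizedSearchCV(
    pipe,
    param_dist,
    n_iter=25,         # 25 random combinations
    cv=5,              # 5-fold cross-validation
    scoring="recall",  # assumption: optimize for recall
    random_state=123,
)
search.fit(X, y)

print(search.best_params_)
print(round(search.best_score_, 3))
```

Note that gamma has no effect when the linear kernel is drawn, so some of the 25 combinations differ only in C.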

### Test results after hyperparameter optimization

In [12]:
# Load the test scores and register accuracy and recall as a glue variable
accuracy_and_recall = pd.read_csv("../results/metrics/model_scores.csv")
aNr = accuracy_and_recall[["Accuracy", "Recall"]]
aNr = aNr.style.format().hide()
glue("aNr", aNr, display=False)

```{glue:figure} aNr
:figwidth: 300px
:name: "Test Results"

Accuracy and recall metrics on test data
```

After fitting the model on the training data with the optimized hyperparameters found above, we score it on the test data. The accuracy of the model is 0.86, while the recall (true positives / actual positives) is 0.88. With optimization, the model performed well on unseen data.

## Discussions

### Key Findings

In this bank marketing analysis project, we aimed to develop a binary classification model to predict client subscription to term deposits. We tested Logistic Regression and Support Vector Classifier (SVC) models, focusing on recall as the key performance metric. The balanced SVC model outperformed the balanced Logistic Regression in recall (0.823 versus 0.778 during model selection), and after hyperparameter optimization, it achieved a recall score of 0.875 on the test dataset, which is quite promising.

### Reflection on Expectations

The results were somewhat expected, given SVC's known efficacy in classification tasks, particularly when there's a clear margin of separation. The high recall score of 0.875 indicates that the model is particularly adept at identifying clients likely to subscribe, which was the primary goal. It's noteworthy that such a high recall was achieved, as it suggests the model is highly sensitive to true positive cases.

### Impact of Finding

The high recall score of this model has significant implications for targeted marketing strategies. It suggests that the bank can confidently use the model's predictions to focus its marketing efforts on clients predicted to subscribe, potentially increasing the efficiency and effectiveness of its campaigns {cite}`moura2020optimization`. This targeted approach could lead to higher conversion rates with lower marketing expenses. However, it's important to balance such a high recall with precision to ensure that the bank doesn't unnecessarily target unlikely prospects.

### Future Improvements

The success of this model leads to several potential areas for further exploration:

- Balancing Precision and Recall: Investigating methods to enhance precision without substantially reducing recall.
- Feature Analysis: Identifying which features most significantly influence subscription predictions.
- Model Interpretability: Improving the model's interpretability to better understand the basis for its predictions.
- Temporal Adaptability: Assessing the model's adaptability to evolving trends and customer behaviors over time.
- Testing Alternative Models: Exploring whether ensemble methods or more advanced machine learning algorithms could yield better or comparable results.
- Customer Segmentation: Evaluating the model's performance across different customer segments to tailor more specific marketing strategies.

## References

```{bibliography}
```