In the field of risk management, one of the most common problems is default prediction. Predicting defaults allows companies to assess each client's creditworthiness, analyze risk levels, and make better-informed decisions for better business economics. In this project, we aim to learn and predict a credit card holder's creditworthiness based on his/her basic personal information (gender, education, age, history of past payment, etc.).
Our final classifier, a Logistic Regression model, did not perform as well as we hoped on our unseen test data, with a final f1 score of 0.47. Of the 6000 clients in our test data, our model correctly predicted the default status of 4141 clients. The remaining 1859 predictions were incorrect: either predicting a client will default when they will not, or predicting a client will not default when they will. Incorrect predictions of either type can be costly for financial institutions, so we will continue to study our data and improve our model before it is put into production.
Through this project, we aim to answer the predictive question:
Given a credit card holder’s basic personal information (gender, education, age, history of past payment etc.), will the person default on next month’s payment?
A credit default occurs when someone who has borrowed money stops making the required payments. In the data set, target class 1 indicates that the person has committed a credit default (failed to pay), while 0 indicates the person is paying the debt as required. Our evaluation is of great importance because it helps us understand which sets of attributes relate to creditworthiness. We also aim to perform a comparative study of mainstream machine learning classification models to identify the best-performing model for predicting credit default.
We use a dataset hosted by the UCI Machine Learning Repository (Dua and Graff (2017)). It was originally collected by researchers from Chung Hua University and Tamkang University (Yeh and Lien (2009), (n.d.)). Because the true probability of default cannot be directly observed, the targets were obtained through estimation, as stated by the authors of the dataset. The dataset consists of 30000 instances, where each observation consists of 23 attributes and a target. The raw dataset is about 5.5 MB, and we split it into a training set (80%) and a test set (20%) for further use. The data attributes range from the client's gender, age, education, and previous payment history to credit amount, etc. You can access this data set by clicking here.
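The 80/20 split described above can be sketched with scikit-learn. Here we use a toy frame in place of the real UCI spreadsheet (which is an `.xls` file, readable with `pandas.read_excel`); stratifying on the target keeps the class ratio the same in both splits:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame standing in for the real 30000-row UCI dataset;
# the 78/22 class ratio mirrors the imbalance described later.
df = pd.DataFrame({"LIMIT_BAL": range(100), "default": [0] * 78 + [1] * 22})

# 80% train / 20% test, stratified so both splits keep the class ratio.
train_df, test_df = train_test_split(
    df, test_size=0.2, random_state=123, stratify=df["default"]
)
```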
- Education: 1 = graduate school; 2 = university; 3 = high school; 4 = other
- Marital status: 1 = married; 2 = single; 3 = others
- PAY_X: the history of monthly repayment status, tracked from April to September 2005: PAY_0 = repayment status in September 2005; PAY_2 = repayment status in August 2005; PAY_3 = repayment status in July 2005; PAY_4 = repayment status in June 2005; PAY_5 = repayment status in May 2005; PAY_6 = repayment status in April 2005
- Scale for PAY_X: -2 = no payment required; -1 = pay duly; 1 = payment delay for one month; 2 = payment delay for two months; … 9 = payment delay for nine months and above
- Sex: 1 = male; 2 = female
- LIMIT_BAL: the amount of given credit (in New Taiwan dollars), including both the individual consumer credit and his/her family (supplementary) credit
- Age: the age of the individual (years)
- BILL_AMTX: the amount of the bill statement (NT dollars): BILL_AMT1 = amount of bill statement in September 2005; BILL_AMT2 = August 2005; BILL_AMT3 = July 2005; BILL_AMT4 = June 2005; BILL_AMT5 = May 2005; BILL_AMT6 = April 2005
- PAY_AMTX: the amount of the previous payment (NT dollars): PAY_AMT1 = amount paid in September 2005; PAY_AMT2 = August 2005; PAY_AMT3 = July 2005; PAY_AMT4 = June 2005; PAY_AMT5 = May 2005; PAY_AMT6 = April 2005
Our data has been split into training and test sets, with 80% of the data (24000 examples) in the training set and 20% (6000 examples) in the test set.
There are no missing values in any rows or columns.
Upon our first look at the data, we found some features containing ambiguous categories, such as unlabeled category values. We cleaned the data to keep only the more meaningful categories.
After data cleaning, we identified 24 meaningful features, with one binary feature, eight categorical features, and fourteen numerical features. Our target is default_payment_next_month, which has two classes: class 0 represents a client paying their bill in the next month, and class 1 represents a client defaulting on their bill next month.
There is a class imbalance in our data, with 77.8% of examples as target class 0 and 22.2% as target class 1.
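The imbalance can be checked with pandas' `value_counts`; a minimal sketch on a toy target built with the same 77.8% / 22.2% ratio:

```python
import pandas as pd

# Toy target series with the same class ratio reported above.
y = pd.Series([0] * 778 + [1] * 222, name="default_payment_next_month")

# normalize=True returns proportions instead of raw counts.
dist = y.value_counts(normalize=True)
print(dist)  # 0 -> 0.778, 1 -> 0.222
```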
We have categorical features such as marriage, education, and monthly payment history. Below is the distribution of our target class across the various categories. From these visualizations, we can see that the proportion of default (class 1) is similar in most categories, except in the PAY features, where a high proportion of defaults occurred at labels 2 or above (meaning the person had missed at least two months of payments at the time of data collection).
Figure 1. Distribution of Categorical Features.
There is one binary feature in our data set: the sex of the client. A higher number of female clients defaulted on their payment.
Figure 2. Distribution of Binary Feature
Numeric features include bill amounts, payment amounts, and the age of the client. From the visualizations, the default rate does not appear to depend on the month; however, we will verify this using our prediction model. We also see a slight increase in the default ratio among middle-aged to older clients.
Figure 3. Distribution of Numeric Feature
Below is the correlation matrix for all of our features. We see a positive correlation between a history of missed payments (the PAY categorical features) and defaulting, and a negative correlation between the credit limit offered to the client and defaulting. Furthermore, we see negative correlations between past payment amounts (the PAY_AMT features) and defaulting. These correlations match intuition: clients who have missed payments before are riskier, while clients offered high limits or making large payments are less likely to default.
X | ID | LIMIT_BAL | SEX | EDUCATION | MARRIAGE | AGE | PAY_0 | PAY_2 | PAY_3 | PAY_4 | PAY_5 | PAY_6 | BILL_AMT1 | BILL_AMT2 | BILL_AMT3 | BILL_AMT4 | BILL_AMT5 | BILL_AMT6 | PAY_AMT1 | PAY_AMT2 | PAY_AMT3 | PAY_AMT4 | PAY_AMT5 | PAY_AMT6 | default.payment.next.month |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
ID | 1.00 | 0.03 | 0.02 | 0.03 | -0.03 | 0.03 | -0.02 | 0.00 | -0.01 | 0.00 | -0.01 | 0.00 | 0.02 | 0.01 | 0.02 | 0.04 | 0.02 | 0.02 | 0.02 | 0.06 | 0.09 | 0.02 | 0.01 | 0.04 | -0.01 |
LIMIT_BAL | 0.03 | 1.00 | 0.06 | -0.27 | -0.12 | 0.19 | -0.30 | -0.34 | -0.33 | -0.31 | -0.28 | -0.26 | 0.06 | 0.05 | 0.06 | 0.08 | 0.09 | 0.09 | 0.28 | 0.28 | 0.29 | 0.29 | 0.30 | 0.32 | -0.17 |
SEX | 0.02 | 0.06 | 1.00 | 0.02 | -0.03 | -0.09 | -0.06 | -0.08 | -0.07 | -0.07 | -0.06 | -0.05 | -0.05 | -0.05 | -0.04 | -0.03 | -0.02 | -0.01 | -0.01 | 0.00 | 0.02 | 0.01 | 0.01 | 0.03 | -0.04 |
EDUCATION | 0.03 | -0.27 | 0.02 | 1.00 | -0.16 | 0.16 | 0.13 | 0.17 | 0.16 | 0.15 | 0.14 | 0.12 | 0.09 | 0.09 | 0.08 | 0.07 | 0.06 | 0.05 | -0.05 | -0.05 | -0.05 | -0.05 | -0.05 | -0.05 | 0.05 |
MARRIAGE | -0.03 | -0.12 | -0.03 | -0.16 | 1.00 | -0.47 | 0.02 | 0.04 | 0.05 | 0.05 | 0.05 | 0.05 | 0.01 | 0.01 | 0.00 | 0.01 | 0.01 | 0.01 | 0.00 | -0.02 | -0.01 | -0.02 | -0.01 | -0.02 | -0.02 |
AGE | 0.03 | 0.19 | -0.09 | 0.16 | -0.47 | 1.00 | -0.07 | -0.09 | -0.09 | -0.08 | -0.09 | -0.08 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.04 | 0.05 | 0.03 | 0.04 | 0.04 | 0.04 | 0.00 |
PAY_0 | -0.02 | -0.30 | -0.06 | 0.13 | 0.02 | -0.07 | 1.00 | 0.63 | 0.55 | 0.52 | 0.48 | 0.46 | 0.31 | 0.33 | 0.31 | 0.30 | 0.30 | 0.29 | -0.10 | -0.07 | -0.06 | -0.04 | -0.03 | -0.05 | 0.29 |
PAY_2 | 0.00 | -0.34 | -0.08 | 0.17 | 0.04 | -0.09 | 0.63 | 1.00 | 0.80 | 0.71 | 0.67 | 0.63 | 0.57 | 0.55 | 0.52 | 0.49 | 0.48 | 0.46 | 0.02 | 0.08 | 0.08 | 0.09 | 0.10 | 0.08 | 0.21 |
PAY_3 | -0.01 | -0.33 | -0.07 | 0.16 | 0.05 | -0.09 | 0.55 | 0.80 | 1.00 | 0.80 | 0.72 | 0.67 | 0.52 | 0.59 | 0.55 | 0.53 | 0.51 | 0.48 | 0.21 | 0.03 | 0.10 | 0.12 | 0.12 | 0.10 | 0.19 |
PAY_4 | 0.00 | -0.31 | -0.07 | 0.15 | 0.05 | -0.08 | 0.52 | 0.71 | 0.80 | 1.00 | 0.82 | 0.73 | 0.51 | 0.56 | 0.62 | 0.59 | 0.56 | 0.53 | 0.18 | 0.24 | 0.07 | 0.14 | 0.16 | 0.14 | 0.17 |
PAY_5 | -0.01 | -0.28 | -0.06 | 0.14 | 0.05 | -0.09 | 0.48 | 0.67 | 0.72 | 0.82 | 1.00 | 0.82 | 0.50 | 0.53 | 0.58 | 0.65 | 0.62 | 0.58 | 0.17 | 0.22 | 0.26 | 0.11 | 0.19 | 0.17 | 0.16 |
PAY_6 | 0.00 | -0.26 | -0.05 | 0.12 | 0.05 | -0.08 | 0.46 | 0.63 | 0.67 | 0.73 | 0.82 | 1.00 | 0.48 | 0.52 | 0.56 | 0.60 | 0.67 | 0.63 | 0.17 | 0.20 | 0.23 | 0.28 | 0.14 | 0.20 | 0.14 |
BILL_AMT1 | 0.02 | 0.06 | -0.05 | 0.09 | 0.01 | 0.00 | 0.31 | 0.57 | 0.52 | 0.51 | 0.50 | 0.48 | 1.00 | 0.91 | 0.86 | 0.80 | 0.77 | 0.73 | 0.50 | 0.47 | 0.44 | 0.44 | 0.43 | 0.41 | -0.03 |
BILL_AMT2 | 0.01 | 0.05 | -0.05 | 0.09 | 0.01 | 0.00 | 0.33 | 0.55 | 0.59 | 0.56 | 0.53 | 0.52 | 0.91 | 1.00 | 0.91 | 0.85 | 0.80 | 0.76 | 0.63 | 0.50 | 0.47 | 0.46 | 0.45 | 0.43 | -0.02 |
BILL_AMT3 | 0.02 | 0.06 | -0.04 | 0.08 | 0.00 | 0.00 | 0.31 | 0.52 | 0.55 | 0.62 | 0.58 | 0.56 | 0.86 | 0.91 | 1.00 | 0.90 | 0.85 | 0.81 | 0.55 | 0.64 | 0.49 | 0.49 | 0.48 | 0.46 | -0.01 |
BILL_AMT4 | 0.04 | 0.08 | -0.03 | 0.07 | 0.01 | 0.00 | 0.30 | 0.49 | 0.53 | 0.59 | 0.65 | 0.60 | 0.80 | 0.85 | 0.90 | 1.00 | 0.90 | 0.85 | 0.51 | 0.55 | 0.63 | 0.51 | 0.50 | 0.48 | -0.01 |
BILL_AMT5 | 0.02 | 0.09 | -0.02 | 0.06 | 0.01 | 0.00 | 0.30 | 0.48 | 0.51 | 0.56 | 0.62 | 0.67 | 0.77 | 0.80 | 0.85 | 0.90 | 1.00 | 0.90 | 0.48 | 0.51 | 0.55 | 0.65 | 0.52 | 0.51 | -0.01 |
BILL_AMT6 | 0.02 | 0.09 | -0.01 | 0.05 | 0.01 | 0.00 | 0.29 | 0.46 | 0.48 | 0.53 | 0.58 | 0.63 | 0.73 | 0.76 | 0.81 | 0.85 | 0.90 | 1.00 | 0.45 | 0.48 | 0.52 | 0.57 | 0.67 | 0.53 | 0.00 |
PAY_AMT1 | 0.02 | 0.28 | -0.01 | -0.05 | 0.00 | 0.04 | -0.10 | 0.02 | 0.21 | 0.18 | 0.17 | 0.17 | 0.50 | 0.63 | 0.55 | 0.51 | 0.48 | 0.45 | 1.00 | 0.51 | 0.52 | 0.49 | 0.47 | 0.46 | -0.17 |
PAY_AMT2 | 0.06 | 0.28 | 0.00 | -0.05 | -0.02 | 0.05 | -0.07 | 0.08 | 0.03 | 0.24 | 0.22 | 0.20 | 0.47 | 0.50 | 0.64 | 0.55 | 0.51 | 0.48 | 0.51 | 1.00 | 0.52 | 0.52 | 0.50 | 0.49 | -0.15 |
PAY_AMT3 | 0.09 | 0.29 | 0.02 | -0.05 | -0.01 | 0.03 | -0.06 | 0.08 | 0.10 | 0.07 | 0.26 | 0.23 | 0.44 | 0.47 | 0.49 | 0.63 | 0.55 | 0.52 | 0.52 | 0.52 | 1.00 | 0.52 | 0.54 | 0.51 | -0.15 |
PAY_AMT4 | 0.02 | 0.29 | 0.01 | -0.05 | -0.02 | 0.04 | -0.04 | 0.09 | 0.12 | 0.14 | 0.11 | 0.28 | 0.44 | 0.46 | 0.49 | 0.51 | 0.65 | 0.57 | 0.49 | 0.52 | 0.52 | 1.00 | 0.54 | 0.55 | -0.13 |
PAY_AMT5 | 0.01 | 0.30 | 0.01 | -0.05 | -0.01 | 0.04 | -0.03 | 0.10 | 0.12 | 0.16 | 0.19 | 0.14 | 0.43 | 0.45 | 0.48 | 0.50 | 0.52 | 0.67 | 0.47 | 0.50 | 0.54 | 0.54 | 1.00 | 0.55 | -0.12 |
PAY_AMT6 | 0.04 | 0.32 | 0.03 | -0.05 | -0.02 | 0.04 | -0.05 | 0.08 | 0.10 | 0.14 | 0.17 | 0.20 | 0.41 | 0.43 | 0.46 | 0.48 | 0.51 | 0.53 | 0.46 | 0.49 | 0.51 | 0.55 | 0.55 | 1.00 | -0.12 |
default payment next month | -0.01 | -0.17 | -0.04 | 0.05 | -0.02 | 0.00 | 0.29 | 0.21 | 0.19 | 0.17 | 0.16 | 0.14 | -0.03 | -0.02 | -0.01 | -0.01 | -0.01 | 0.00 | -0.17 | -0.15 | -0.15 | -0.13 | -0.12 | -0.12 | 1.00 |
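A matrix like the one above comes directly from pandas' `DataFrame.corr`. A minimal sketch on a toy frame (hypothetical values, just three of the real columns), illustrating the positive PAY_0-to-default correlation:

```python
import pandas as pd

# Toy frame; in the report the matrix is computed over all columns.
df = pd.DataFrame({
    "LIMIT_BAL": [10, 20, 30, 40],
    "PAY_0":     [2, 1, 0, -1],
    "default":   [1, 1, 0, 0],
})

# Pairwise Pearson correlations between all numeric columns.
corr = df.corr()

# PAY_0 and default move together here, mirroring the positive
# correlation between repayment delays and defaulting in the report.
print(corr.loc["PAY_0", "default"])
```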
We are interested in finding clients who are likely to default on their next payment. We identified that we need to reduce both false positives (predicting a client will default when they will not) and false negatives (predicting a client will not default when they will), as both matter for client loyalty and for the bank not losing money.
Therefore, we chose to evaluate our model using the f1 score as our metric. The f1 score is calculated by:
\[\text{f1 score} = \frac{2 \times \text{precision} \times \text{recall}}{\text{precision} + \text{recall}}\]
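The formula above is what scikit-learn's `f1_score` computes; a small check with toy labels (hypothetical predictions, not our model's):

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Toy predictions: 2 of 3 true defaulters caught, plus 1 false alarm.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]

p = precision_score(y_true, y_pred)   # 2 true positives / 3 predicted positives
r = recall_score(y_true, y_pred)      # 2 true positives / 3 actual positives
f1 = f1_score(y_true, y_pred)

# f1 is the harmonic mean of precision and recall.
assert abs(f1 - 2 * p * r / (p + r)) < 1e-12
```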
The following models were included in initial model screening with default hyperparameters: Decision Tree, KNN, RBF SVM, Logistic Regression, Ridge Classifier, and Random Forest Classifier.
The cross-validation scores are shown below:
 | Decision Tree | KNN | RBF SVM | Logistic Regression | Ridge Classifier | Random Forest Classifier |
---|---|---|---|---|---|---|
fit_time | 0.27 | 0.01 | 4.78 | 0.07 | 0.01 | 3.33 |
score_time | 0.01 | 0.12 | 1.25 | 0.01 | 0.01 | 0.07 |
test_accuracy | 0.72 | 0.79 | 0.82 | 0.81 | 0.80 | 0.82 |
train_accuracy | 1.00 | 0.84 | 0.82 | 0.81 | 0.80 | 1.00 |
test_precision | 0.38 | 0.55 | 0.68 | 0.71 | 0.72 | 0.65 |
train_precision | 1.00 | 0.72 | 0.70 | 0.72 | 0.72 | 1.00 |
test_recall | 0.41 | 0.36 | 0.34 | 0.24 | 0.15 | 0.38 |
train_recall | 1.00 | 0.47 | 0.35 | 0.24 | 0.15 | 1.00 |
test_f1 | 0.40 | 0.43 | 0.46 | 0.36 | 0.24 | 0.48 |
train_f1 | 1.00 | 0.57 | 0.47 | 0.36 | 0.24 | 1.00 |
The mean cross-validation score is highest for the Random Forest Classifier, but it is over-fit, with a train accuracy of 1.00. The RBF SVM and KNN models are analogy-based methods that do not support our evaluation of feature importance. Ridge and Decision Tree classifiers are both reasonable models, but we chose the Logistic Regression classifier in the end, as it can easily export feature importances and has a decent f1 score compared to the other models.
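A screening pass like the one tabulated above can be sketched with scikit-learn's `cross_validate` (synthetic data stands in for our training set, and only two of the six models are shown):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced data (~78% negatives) in place of the real training set.
X, y = make_classification(n_samples=500, weights=[0.78], random_state=123)

models = {
    "Decision Tree": DecisionTreeClassifier(random_state=123),
    "Logistic Regression": LogisticRegression(max_iter=1000),
}
for name, model in models.items():
    # Same metrics as the screening table, on both train and validation folds.
    scores = cross_validate(
        model, X, y,
        scoring=["accuracy", "precision", "recall", "f1"],
        return_train_score=True,
    )
    print(name, round(scores["test_f1"].mean(), 2))
```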
We then performed hyperparameter optimization for the Logistic Regression model and found our optimal hyperparameters: C = 0.438 and class_weight = "balanced".
Using our model with these optimized hyperparameters, the mean f1 score in cross-validation was 0.48.
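A search of this kind can be sketched with `RandomizedSearchCV`. The search space below is hypothetical (the report only records the winning values, C ≈ 0.438 and class_weight = "balanced"), and synthetic data stands in for the training set:

```python
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV

# Synthetic imbalanced data in place of the real training set.
X, y = make_classification(n_samples=500, weights=[0.78], random_state=123)

# Hypothetical log-uniform range for C; class_weight="balanced" is fixed,
# matching the value the optimization settled on.
search = RandomizedSearchCV(
    LogisticRegression(class_weight="balanced", max_iter=1000),
    param_distributions={"C": loguniform(1e-3, 1e3)},
    n_iter=20,
    scoring="f1",
    random_state=123,
)
search.fit(X, y)
print(search.best_params_["C"], round(search.best_score_, 2))
```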
We used our optimized Logistic Regression model to predict the test data of 6000 clients. The f1 score on the test data was 0.47, which is comparable to the f1 score of the cross-validation data.
We then exported the regression coefficients for our features:
Feature | Coefficient |
---|---|
PAY_0 | 0.528 |
BILL_AMT3 | 0.145 |
BILL_AMT2 | 0.111 |
EDUCATION | 0.074 |
AGE | 0.073 |
PAY_3 | 0.069 |
PAY_2 | 0.065 |
BILL_AMT4 | 0.049 |
MARRIAGE_1 | 0.048 |
PAY_5 | 0.034 |
PAY_4 | 0.016 |
BILL_AMT6 | 0.002 |
PAY_6 | -0.015 |
BILL_AMT5 | -0.027 |
PAY_AMT6 | -0.029 |
PAY_AMT4 | -0.038 |
PAY_AMT5 | -0.038 |
MARRIAGE_2 | -0.097 |
PAY_AMT3 | -0.097 |
SEX_2 | -0.103 |
LIMIT_BAL | -0.108 |
MARRIAGE_3 | -0.165 |
PAY_AMT1 | -0.188 |
PAY_AMT2 | -0.205 |
BILL_AMT1 | -0.363 |
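A table like the one above can be built from a fitted model's `coef_` attribute; a minimal sketch with toy data and hypothetical feature names:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Toy data with four hypothetical features f0..f3.
X, y = make_classification(n_samples=300, n_features=4, random_state=123)
model = LogisticRegression(max_iter=1000).fit(X, y)

# Pair each feature name with its learned weight, most positive first.
coef_table = (
    pd.DataFrame({"Feature": [f"f{i}" for i in range(4)],
                  "Coefficient": model.coef_[0]})
    .sort_values("Coefficient", ascending=False)
    .reset_index(drop=True)
)
print(coef_table)
```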
Our most positive coefficient was for PAY_0: the client's most recent repayment status. This is expected because the longer a client had delayed their payments as of September 2005, when the data was collected, the more likely they are to default.
BILL_AMT3 and BILL_AMT2 also have positive coefficients. The higher the statement amounts in the previous months (July and August), the more likely the client is to default.
The most negative coefficient was for BILL_AMT1: in our model, a higher amount due on the September statement is associated with a lower predicted likelihood of defaulting the following month.
PAY_AMT1 and PAY_AMT2 also have negative coefficients in our model. This makes sense, as higher payments in recent months (August and September) mean a client is less likely to default.
When evaluated against the true default status of the test data, our model made 4141 correct predictions out of 6000 clients (69%).
Figure 4. Confusion Matrix
We falsely predicted that 494 clients would not default and would make their payment when, in fact, they did default (false negatives). These false predictions are costly for the institution in terms of opportunity cost, as it could have charged a higher interest rate to these clients.
On the other hand, we falsely predicted that 1365 clients would default when they did not (false positives). This is costly because false labeling and a possible unjustified interest rate increase can lead to client dissatisfaction.
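Counts like the 494 false negatives and 1365 false positives above come from a confusion matrix; a minimal sketch with toy labels:

```python
from sklearn.metrics import confusion_matrix

# Toy labels: rows of the matrix are true classes, columns are predictions,
# so cm[1, 0] counts false negatives and cm[0, 1] counts false positives.
y_true = [0, 0, 0, 1, 1, 0]
y_pred = [0, 1, 0, 1, 0, 0]
cm = confusion_matrix(y_true, y_pred)
print(cm)
```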
Our model did not perform as well as we hoped with an f1 test score of 0.47. The data contained a lot of noise, missing labels, and potentially non-linear relations that our model was not able to fit well. Further improvements are needed before this model can be put into production.
Several things could be done to further improve this model. First, further optimization through feature engineering and feature selection may be beneficial; we may be able to remove features that are noisy and have low correlation with our target. As mentioned earlier, some features contained ambiguous categories, and our model is not capturing the data that was sorted into "other" categories.
Proper data labelling is needed to account for this ambiguous data. If possible, we could consult the company that collected the data about the missing labels. Lastly, more informative features, such as income, size of household, and amount of existing debt, would improve this model. With more relevant features for our model to fit, our prediction accuracy should improve.
This report was constructed using Rmarkdown (Allaire et al. 2021), ReadR(Wickham, Hester, and Bryan 2022), Knitr (Xie 2022), kableExtra (Zhu 2021), and tidyverse (Wickham et al. 2019) in R (R Core Team 2019) and the following python (Python 2021) packages: pandas(Snider and Swedo 2004), numpy(Bressert 2012), scikit-learn(Pedregosa et al. 2011), altair(VanderPlas et al. 2018), matplotlib (Bisong 2019), and uci_ml_data_set (n.d.).