Summary

In risk management, one of the most common problems is default prediction. It allows companies to assess the credibility of each person, analyze the risk level, and optimize decisions for better business economics. In this project, we aim to learn and predict a credit card holder’s credibility based on their basic personal information (gender, education, age, history of past payment, etc.).

Our final classifier, a Logistic Regression model, did not perform as well as we hoped on our unseen test data, with a final f1 score of 0.47. Of the 6000 clients in our test data, our model correctly predicted the default status of 4141 clients. There were 1859 incorrect predictions: the model either predicted that a client would default when they did not, or predicted that a client would not default when they did. Incorrect predictions of either type can be costly for financial institutions, so we will continue to study our data and improve our model before it is put into production.

Introduction

Through this project, we aim to answer the predictive question:

Given a credit card holder’s basic personal information (gender, education, age, history of past payment etc.), will the person default on next month’s payment?

A credit default occurs when someone who borrowed money stops making the required payments. In the data set, the target class 1 indicates that the person has committed a credit default (failed to pay), while 0 indicates that the person is paying the debt as required. Our evaluation is important because it helps us understand which sets of attributes relate to credibility. We also aim to perform a comparative study of mainstream machine learning classification models to identify the best performing model for predicting credit default.

Methods

Dataset

We use a dataset hosted by the UCI Machine Learning Repository (Dua and Graff 2017). It was originally collected by researchers from Chung Hua University and Tamkang University (Yeh and Lien 2009). As the true probability of default cannot be observed, the targets were obtained through estimation, as stated by the authors of the dataset. The dataset consists of 30000 instances, each with 23 attributes and a target. The raw dataset is about 5.5 MB, and we split it into a training set (80%) and a test set (20%) for further use. The attributes include the client’s gender, age, education, previous payment history, credit amount, etc. The data set is available from the UCI Machine Learning Repository (n.d.).

Feature Descriptions

Categorical Features

Education : 1 = graduate school; 2 = university; 3 = high school; 4 = other

Marital status : 1 = married; 2 = single; 3 = others

PAY_X , the history of monthly payment tracked from April to September, 2005 :

PAY_0 = repayment status in September, 2005;
PAY_2 = repayment status in August, 2005;
PAY_3 = repayment status in July, 2005;
PAY_4 = repayment status in June, 2005;
PAY_5 = the repayment status in May, 2005;
PAY_6 = the repayment status in April, 2005

Scale for PAY_X :

-2 = no payment required;
-1 = pay duly;
1 = payment delay for one month;
2 = payment delay for two months;
… 9 = payment delay for nine months and above

Binary Features

Sex : 1 = male; 2 = female

Numeric features

LIMIT_BAL : the amount of given credit (in New Taiwan dollar), includes both the individual consumer credit and his/her family (supplementary) credit.

Age : the age of the individual (years).

BILL_AMTX : the amount of bill statement (NT dollar).

BILL_AMT1 = amount of bill statement in September, 2005;
BILL_AMT2 = amount of bill statement in August, 2005;
BILL_AMT3 = amount of bill statement in July, 2005;
BILL_AMT4 = amount of bill statement in June, 2005;
BILL_AMT5 = amount of bill statement in May, 2005;
BILL_AMT6 = amount of bill statement in April, 2005

PAY_AMTX : Amount of previous payment (NT dollar)

PAY_AMT1 = amount paid in September, 2005;
PAY_AMT2 = amount paid in August, 2005;
PAY_AMT3 = amount paid in July, 2005;
PAY_AMT4 = amount paid in June, 2005;
PAY_AMT5 = amount paid in May, 2005;
PAY_AMT6 = amount paid in April, 2005


Analysis

EDA

Our data has been split into training and testing splits, with 80% of the data (24000) in the training set and 20% (6000) in the test data.
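The split described above can be sketched as follows. This is a minimal illustration on synthetic stand-in data, not the original pipeline: the `random_state` value, the choice of columns, and the absence of stratification are all assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the 30000-row dataset (values are NOT the real data).
rng = np.random.default_rng(0)
data = pd.DataFrame({
    "LIMIT_BAL": rng.integers(10_000, 500_000, size=30_000),
    "AGE": rng.integers(21, 80, size=30_000),
    "default_payment_next_month": rng.binomial(1, 0.222, size=30_000),
})

# 80/20 split; random_state=123 is an arbitrary choice for reproducibility.
train_df, test_df = train_test_split(data, test_size=0.2, random_state=123)

print(train_df.shape[0], test_df.shape[0])
```

With 30000 rows and `test_size=0.2`, the two parts contain 24000 and 6000 rows, matching the sizes reported above.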

There are no missing values in any rows or columns.

Upon our first look at the data, we found some features containing ambiguous categories, such as unlabeled feature categories. We cleaned up the data to keep categories that were more meaningful.

After data cleaning, we identified 23 meaningful features: one binary feature, eight categorical features, and fourteen numeric features. Our target is default_payment_next_month, which has two classes: class 0 represents a client paying their bill in the next month and class 1 represents a client defaulting on their bill next month.

There is a class imbalance in our data, with 77.8% of examples as target class 0 and 22.2% as target class 1.
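Class proportions like these can be read off with pandas. The snippet below uses a small hand-built target column whose counts were chosen to reproduce the reported percentages; it is not the real data.

```python
import pandas as pd

# Hand-built target: 778 non-defaults and 222 defaults (illustrative only).
y = pd.Series([0] * 778 + [1] * 222, name="default_payment_next_month")

# Share of each class in the target column.
proportions = y.value_counts(normalize=True)
print(proportions.round(3))
```

An imbalance of this size means that a model predicting class 0 for everyone would already be right about 78% of the time, which is one reason accuracy alone is a weak metric here.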

We have categorical features such as marriage, education, and monthly payment history. Below is the distribution of our target class according to the various categories. From these visualizations, we can see that the proportion of defaults (class 1) is similar across most categories. The exception is the PAY features, where a high proportion of defaults occurs for values of 2 or above (meaning the person was at least two months behind on payment at the time of data collection).

**Figure 1.** Distribution of Categorical Features.


There is one binary feature in our data set: the sex of the client. More female clients than male clients defaulted on their payment.

**Figure 2.** Distribution of Binary Feature


Numeric features include bill amounts, payment amounts, and the age of the client. From the visualizations, default does not appear to depend on the month. However, we will verify this using our prediction model. We also see a slight increase in the default ratio among clients in the middle to late age groups.

**Figure 3.** Distribution of Numeric Feature


Below is the correlation matrix for all of our features. We see a positive correlation between the history of missed payments (the PAY categorical features, 0.14 to 0.29) and defaulting, and a negative correlation between the credit limit offered to the client and defaulting (-0.17). We also see negative correlations between past payment amounts (the PAY_AMT features, -0.12 to -0.17) and defaulting. These directions are plausible: clients who are already behind on payments default more often, while clients with higher credit limits or larger recent payments default less often.

Table 1. Correlation Matrix of all features
X ID LIMIT_BAL SEX EDUCATION MARRIAGE AGE PAY_0 PAY_2 PAY_3 PAY_4 PAY_5 PAY_6 BILL_AMT1 BILL_AMT2 BILL_AMT3 BILL_AMT4 BILL_AMT5 BILL_AMT6 PAY_AMT1 PAY_AMT2 PAY_AMT3 PAY_AMT4 PAY_AMT5 PAY_AMT6 default.payment.next.month
ID 1.00 0.03 0.02 0.03 -0.03 0.03 -0.02 0.00 -0.01 0.00 -0.01 0.00 0.02 0.01 0.02 0.04 0.02 0.02 0.02 0.06 0.09 0.02 0.01 0.04 -0.01
LIMIT_BAL 0.03 1.00 0.06 -0.27 -0.12 0.19 -0.30 -0.34 -0.33 -0.31 -0.28 -0.26 0.06 0.05 0.06 0.08 0.09 0.09 0.28 0.28 0.29 0.29 0.30 0.32 -0.17
SEX 0.02 0.06 1.00 0.02 -0.03 -0.09 -0.06 -0.08 -0.07 -0.07 -0.06 -0.05 -0.05 -0.05 -0.04 -0.03 -0.02 -0.01 -0.01 0.00 0.02 0.01 0.01 0.03 -0.04
EDUCATION 0.03 -0.27 0.02 1.00 -0.16 0.16 0.13 0.17 0.16 0.15 0.14 0.12 0.09 0.09 0.08 0.07 0.06 0.05 -0.05 -0.05 -0.05 -0.05 -0.05 -0.05 0.05
MARRIAGE -0.03 -0.12 -0.03 -0.16 1.00 -0.47 0.02 0.04 0.05 0.05 0.05 0.05 0.01 0.01 0.00 0.01 0.01 0.01 0.00 -0.02 -0.01 -0.02 -0.01 -0.02 -0.02
AGE 0.03 0.19 -0.09 0.16 -0.47 1.00 -0.07 -0.09 -0.09 -0.08 -0.09 -0.08 0.00 0.00 0.00 0.00 0.00 0.00 0.04 0.05 0.03 0.04 0.04 0.04 0.00
PAY_0 -0.02 -0.30 -0.06 0.13 0.02 -0.07 1.00 0.63 0.55 0.52 0.48 0.46 0.31 0.33 0.31 0.30 0.30 0.29 -0.10 -0.07 -0.06 -0.04 -0.03 -0.05 0.29
PAY_2 0.00 -0.34 -0.08 0.17 0.04 -0.09 0.63 1.00 0.80 0.71 0.67 0.63 0.57 0.55 0.52 0.49 0.48 0.46 0.02 0.08 0.08 0.09 0.10 0.08 0.21
PAY_3 -0.01 -0.33 -0.07 0.16 0.05 -0.09 0.55 0.80 1.00 0.80 0.72 0.67 0.52 0.59 0.55 0.53 0.51 0.48 0.21 0.03 0.10 0.12 0.12 0.10 0.19
PAY_4 0.00 -0.31 -0.07 0.15 0.05 -0.08 0.52 0.71 0.80 1.00 0.82 0.73 0.51 0.56 0.62 0.59 0.56 0.53 0.18 0.24 0.07 0.14 0.16 0.14 0.17
PAY_5 -0.01 -0.28 -0.06 0.14 0.05 -0.09 0.48 0.67 0.72 0.82 1.00 0.82 0.50 0.53 0.58 0.65 0.62 0.58 0.17 0.22 0.26 0.11 0.19 0.17 0.16
PAY_6 0.00 -0.26 -0.05 0.12 0.05 -0.08 0.46 0.63 0.67 0.73 0.82 1.00 0.48 0.52 0.56 0.60 0.67 0.63 0.17 0.20 0.23 0.28 0.14 0.20 0.14
BILL_AMT1 0.02 0.06 -0.05 0.09 0.01 0.00 0.31 0.57 0.52 0.51 0.50 0.48 1.00 0.91 0.86 0.80 0.77 0.73 0.50 0.47 0.44 0.44 0.43 0.41 -0.03
BILL_AMT2 0.01 0.05 -0.05 0.09 0.01 0.00 0.33 0.55 0.59 0.56 0.53 0.52 0.91 1.00 0.91 0.85 0.80 0.76 0.63 0.50 0.47 0.46 0.45 0.43 -0.02
BILL_AMT3 0.02 0.06 -0.04 0.08 0.00 0.00 0.31 0.52 0.55 0.62 0.58 0.56 0.86 0.91 1.00 0.90 0.85 0.81 0.55 0.64 0.49 0.49 0.48 0.46 -0.01
BILL_AMT4 0.04 0.08 -0.03 0.07 0.01 0.00 0.30 0.49 0.53 0.59 0.65 0.60 0.80 0.85 0.90 1.00 0.90 0.85 0.51 0.55 0.63 0.51 0.50 0.48 -0.01
BILL_AMT5 0.02 0.09 -0.02 0.06 0.01 0.00 0.30 0.48 0.51 0.56 0.62 0.67 0.77 0.80 0.85 0.90 1.00 0.90 0.48 0.51 0.55 0.65 0.52 0.51 -0.01
BILL_AMT6 0.02 0.09 -0.01 0.05 0.01 0.00 0.29 0.46 0.48 0.53 0.58 0.63 0.73 0.76 0.81 0.85 0.90 1.00 0.45 0.48 0.52 0.57 0.67 0.53 0.00
PAY_AMT1 0.02 0.28 -0.01 -0.05 0.00 0.04 -0.10 0.02 0.21 0.18 0.17 0.17 0.50 0.63 0.55 0.51 0.48 0.45 1.00 0.51 0.52 0.49 0.47 0.46 -0.17
PAY_AMT2 0.06 0.28 0.00 -0.05 -0.02 0.05 -0.07 0.08 0.03 0.24 0.22 0.20 0.47 0.50 0.64 0.55 0.51 0.48 0.51 1.00 0.52 0.52 0.50 0.49 -0.15
PAY_AMT3 0.09 0.29 0.02 -0.05 -0.01 0.03 -0.06 0.08 0.10 0.07 0.26 0.23 0.44 0.47 0.49 0.63 0.55 0.52 0.52 0.52 1.00 0.52 0.54 0.51 -0.15
PAY_AMT4 0.02 0.29 0.01 -0.05 -0.02 0.04 -0.04 0.09 0.12 0.14 0.11 0.28 0.44 0.46 0.49 0.51 0.65 0.57 0.49 0.52 0.52 1.00 0.54 0.55 -0.13
PAY_AMT5 0.01 0.30 0.01 -0.05 -0.01 0.04 -0.03 0.10 0.12 0.16 0.19 0.14 0.43 0.45 0.48 0.50 0.52 0.67 0.47 0.50 0.54 0.54 1.00 0.55 -0.12
PAY_AMT6 0.04 0.32 0.03 -0.05 -0.02 0.04 -0.05 0.08 0.10 0.14 0.17 0.20 0.41 0.43 0.46 0.48 0.51 0.53 0.46 0.49 0.51 0.55 0.55 1.00 -0.12
default.payment.next.month -0.01 -0.17 -0.04 0.05 -0.02 0.00 0.29 0.21 0.19 0.17 0.16 0.14 -0.03 -0.02 -0.01 -0.01 -0.01 0.00 -0.17 -0.15 -0.15 -0.13 -0.12 -0.12 1.00
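A correlation matrix like Table 1 can be computed with pandas. The sketch below uses synthetic columns that only borrow names from the data set, and it assumes Pearson correlation, which the report does not state explicitly.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 1_000

# Synthetic columns: BILL_AMT2 is built to track BILL_AMT1 closely,
# while LIMIT_BAL is generated independently of both.
bill_amt1 = rng.normal(50_000, 20_000, size=n)
toy = pd.DataFrame({
    "BILL_AMT1": bill_amt1,
    "BILL_AMT2": bill_amt1 + rng.normal(0, 5_000, size=n),
    "LIMIT_BAL": rng.normal(150_000, 50_000, size=n),
})

# Pairwise Pearson correlations, rounded to two decimals as in Table 1.
corr = toy.corr(method="pearson").round(2)
print(corr)
```

The result is symmetric with ones on the diagonal, which is why each value in Table 1 appears twice.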

Predictive Model

We are interested in finding clients who are likely to default on their next payment. We need to reduce both false positives (wrongly predicting that a client will default) and false negatives (wrongly predicting that a client will not default), as both matter for client loyalty and for the bank not to lose money.

Therefore, we chose to evaluate our model using the f1 score as our metric. The f1 score is calculated by:

\[\text{f1 score} = \frac{2\times precision \times recall}{precision + recall}\]
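As a worked example of the formula above, the helper below computes the f1 score from raw prediction counts. The function name and the counts are hypothetical and are not taken from our results.

```python
def f1_from_counts(tp: int, fp: int, fn: int) -> float:
    """Return the f1 score given true positives, false positives, false negatives."""
    if tp == 0:
        # No true positives: precision and recall are 0 (or undefined), so f1 is 0.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts: precision = 60/100 = 0.60, recall = 60/80 = 0.75.
example_f1 = f1_from_counts(tp=60, fp=40, fn=20)
print(round(example_f1, 3))
```

Because the f1 score is the harmonic mean of precision and recall, it sits closer to the smaller of the two and ignores true negatives entirely, which suits an imbalanced target.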

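The following models were included in the initial model screening with default hyperparameters: Decision Tree, KNN, RBF SVM, Logistic Regression, Ridge classifier, and Random Forest classifier.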

The cross-validation scores are shown below:

Table 2. Cross-Validation Scores during Initial Model Screening

| Metric | Decision Tree | KNN | RBF SVM | Logistic Regression | Ridge_cla | RandomForest_cla |
|---|---|---|---|---|---|---|
| fit_time | 0.27 | 0.01 | 4.78 | 0.07 | 0.01 | 3.33 |
| score_time | 0.01 | 0.12 | 1.25 | 0.01 | 0.01 | 0.07 |
| test_accuracy | 0.72 | 0.79 | 0.82 | 0.81 | 0.80 | 0.82 |
| train_accuracy | 1.00 | 0.84 | 0.82 | 0.81 | 0.80 | 1.00 |
| test_precision | 0.38 | 0.55 | 0.68 | 0.71 | 0.72 | 0.65 |
| train_precision | 1.00 | 0.72 | 0.70 | 0.72 | 0.72 | 1.00 |
| test_recall | 0.41 | 0.36 | 0.34 | 0.24 | 0.15 | 0.38 |
| train_recall | 1.00 | 0.47 | 0.35 | 0.24 | 0.15 | 1.00 |
| test_f1 | 0.40 | 0.43 | 0.46 | 0.36 | 0.24 | 0.48 |
| train_f1 | 1.00 | 0.57 | 0.47 | 0.36 | 0.24 | 1.00 |

The mean cross-validation f1 score is highest for the Random Forest classifier (0.48), but the model is overfit, with a train accuracy of 1.00. The RBF SVM and KNN models are analogy-based methods that do not support our evaluation of feature importance. The Ridge and Decision Tree classifiers are both good models, but we chose the Logistic Regression classifier in the end because its feature coefficients are easy to export and its f1 score is decent compared to the other models.
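A screening loop of this kind can be sketched with scikit-learn's `cross_validate`, whose output keys match the row names of Table 2. Everything specific here is an assumption: the data is synthetic, and the scaling step, the five folds, and the `random_state` values are illustrative choices rather than the settings we used.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression, RidgeClassifier
from sklearn.model_selection import cross_validate
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced data standing in for the training set.
X, y = make_classification(
    n_samples=600, n_features=10, weights=[0.778, 0.222], random_state=0
)

models = {
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "KNN": KNeighborsClassifier(),
    "RBF SVM": SVC(),
    "Logistic Regression": LogisticRegression(),
    "Ridge_cla": RidgeClassifier(),
    "RandomForest_cla": RandomForestClassifier(random_state=0),
}
scoring = ["accuracy", "precision", "recall", "f1"]

results = {}
for name, model in models.items():
    pipe = make_pipeline(StandardScaler(), model)
    scores = cross_validate(
        pipe, X, y, cv=5, scoring=scoring, return_train_score=True
    )
    # Average each metric over the five folds.
    results[name] = pd.DataFrame(scores).mean()

results = pd.DataFrame(results).round(2)
print(results)
```

Comparing the `train_*` and `test_*` rows side by side is what exposes overfitting, as with the tree-based models in Table 2.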

We then performed hyperparameter optimization for the Logistic Regression model and found the following optimal hyperparameters: a C value of 0.438 and class_weight set to balanced.

Using our model with these optimized hyperparameters, the mean f1 score in cross-validation was 0.48.
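A search over C and class_weight could look like the sketch below. It assumes a randomized search with a log-uniform range for C and runs on synthetic data; the search strategy, ranges, and number of iterations we actually used are not stated in this report, so treat all of them as illustrative.

```python
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic imbalanced data standing in for the training set.
X, y = make_classification(
    n_samples=600, n_features=10, weights=[0.778, 0.222], random_state=0
)

pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Assumed search space: C on a log scale, class weighting on or off.
param_dist = {
    "logisticregression__C": loguniform(1e-3, 1e3),
    "logisticregression__class_weight": [None, "balanced"],
}

search = RandomizedSearchCV(
    pipe, param_dist, n_iter=20, scoring="f1", cv=5, random_state=0
)
search.fit(X, y)

print(search.best_params_)
print(round(search.best_score_, 2))
```

Setting `scoring="f1"` makes the search optimize the same metric used for evaluation, and `class_weight="balanced"` reweights examples inversely to class frequency, which typically raises recall on the minority class.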

Results

We used our optimized Logistic Regression model to predict the test data of 6000 clients. The f1 score on the test data was 0.47, which is comparable to the f1 score of the cross-validation data.

We continued to export the regression coefficients for our features:

Table 3. Feature Coefficients

| Feature | Coefficient |
|---|---|
| PAY_0 | 0.528 |
| BILL_AMT3 | 0.145 |
| BILL_AMT2 | 0.111 |
| EDUCATION | 0.074 |
| AGE | 0.073 |
| PAY_3 | 0.069 |
| PAY_2 | 0.065 |
| BILL_AMT4 | 0.049 |
| MARRIAGE_1 | 0.048 |
| PAY_5 | 0.034 |
| PAY_4 | 0.016 |
| BILL_AMT6 | 0.002 |
| PAY_6 | -0.015 |
| BILL_AMT5 | -0.027 |
| PAY_AMT6 | -0.029 |
| PAY_AMT4 | -0.038 |
| PAY_AMT5 | -0.038 |
| MARRIAGE_2 | -0.097 |
| PAY_AMT3 | -0.097 |
| SEX_2 | -0.103 |
| LIMIT_BAL | -0.108 |
| MARRIAGE_3 | -0.165 |
| PAY_AMT1 | -0.188 |
| PAY_AMT2 | -0.205 |
| BILL_AMT1 | -0.363 |
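A coefficient table of this shape can be pulled from a fitted scikit-learn pipeline. The sketch below fits on synthetic data whose columns merely borrow four feature names, so the printed values carry no meaning; only the extraction and sorting steps are the point.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic features; the names are borrowed for illustration only.
feature_names = ["PAY_0", "LIMIT_BAL", "BILL_AMT1", "PAY_AMT1"]
X, y = make_classification(
    n_samples=500, n_features=4, n_informative=3, n_redundant=1, random_state=0
)
X = pd.DataFrame(X, columns=feature_names)

pipe = make_pipeline(
    StandardScaler(),
    LogisticRegression(C=0.438, class_weight="balanced"),
)
pipe.fit(X, y)

# coef_ has shape (1, n_features) for a binary target; take the single row.
coefs = (
    pd.DataFrame({
        "Feature": feature_names,
        "Coefficient": pipe.named_steps["logisticregression"].coef_[0],
    })
    .sort_values("Coefficient", ascending=False)
    .round(3)
)
print(coefs.to_string(index=False))
```

Standardizing the features before fitting puts the coefficients on a comparable scale, which is what makes ranking them by size reasonable.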

Our most positive coefficient was PAY_0, the client's repayment status in September 2005. This is expected because the longer a client has delayed their payments as of September 2005, when the data was collected, the more likely they are to default.

BILL_AMT3 and BILL_AMT2 also have positive coefficients. The higher the statement amount in the previous months (July and August), the more likely the client will default.

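The most negative coefficient was BILL_AMT1: holding the other features fixed, a higher amount due in the September statement lowers the predicted likelihood of default. This is less intuitive than the other coefficients. It may partly reflect the strong correlation among the BILL_AMT features (0.73 to 0.91 with BILL_AMT1 in Table 1), which makes their individual coefficients hard to interpret.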

PAY_AMT1 and PAY_AMT2 also have negative coefficients in our model. This makes sense, as higher payments in recent months (August and September) make a client less likely to default.

When evaluating with the default status of the test data, our model made 4141 correct predictions for our clients, out of 6000 (69%).

**Figure 4.** Confusion Matrix


We falsely predicted that 494 clients would make their payment when in fact they defaulted (false negatives). These false predictions would be costly for the institution in terms of opportunity cost, as it could have charged these clients a higher interest rate.

On the other hand, we falsely predicted that 1365 clients would default when in fact they did not (false positives). This is costly because a false label and a possibly unjustified interest rate increase can lead to client dissatisfaction.
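The counts reported above fit together as a simple arithmetic check. The snippet uses only the numbers stated in this section; the variable names are ours.

```python
total_clients = 6000
false_negatives = 494    # predicted "no default", client defaulted
false_positives = 1365   # predicted "default", client paid

incorrect = false_negatives + false_positives
correct = total_clients - incorrect
accuracy = correct / total_clients

print(incorrect, correct, round(accuracy, 3))  # 1859 4141 0.69
```

False positives make up roughly three quarters of the errors, so the model errs far more often on the side of flagging clients who go on to pay.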

Our model did not perform as well as we hoped, with an f1 test score of 0.47. The data contained a lot of noise, unlabeled categories, and potentially non-linear relations that our model was not able to fit well. Further improvements are needed before this model can be put into production.

Further Improvements

Several things could be done to further improve this model. First, further optimization through feature engineering and feature selection may be beneficial. We may be able to drop some features that are noisy and have low correlation with our target. In addition, as mentioned earlier, some features contained ambiguous categories, and our model is not capturing the data that was sorted into “other” categories.

Proper data labelling is needed to account for this ambiguous data. If possible, we could consult the institution that collected the data about the missing labels. Lastly, additional features such as income, household size, and amount of debt would likely improve this model. With more relevant features to fit, our prediction accuracy should improve.

References

This report was constructed using Rmarkdown (Allaire et al. 2021), readr (Wickham, Hester, and Bryan 2022), knitr (Xie 2022), kableExtra (Zhu 2021), and tidyverse (Wickham et al. 2019) in R (R Core Team 2019), and the following Python (Python 2021) packages: pandas (Snider and Swedo 2004), numpy (Bressert 2012), scikit-learn (Pedregosa et al. 2011), altair (VanderPlas et al. 2018), and matplotlib (Bisong 2019). The data set is described in the UCI Machine Learning Repository (n.d.).

n.d. UCI Machine Learning Repository: Default of Credit Card Clients Data Set. https://archive.ics.uci.edu/ml/datasets/default+of+credit+card+clients.
Allaire, JJ, Yihui Xie, Jonathan McPherson, Javier Luraschi, Kevin Ushey, Aron Atkins, Hadley Wickham, Joe Cheng, Winston Chang, and Richard Iannone. 2021. Rmarkdown: Dynamic Documents for r.
Bisong, Ekaba. 2019. “Matplotlib and Seaborn.” In Building Machine Learning and Deep Learning Models on Google Cloud Platform, 151–65. Springer.
Bressert, Eli. 2012. “SciPy and NumPy: An Overview for Developers.”
Dua, Dheeru, and Casey Graff. 2017. “UCI Machine Learning Repository.” University of California, Irvine, School of Information; Computer Sciences. http://archive.ics.uci.edu/ml.
Pedregosa, F., G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, et al. 2011. “Scikit-Learn: Machine Learning in Python.” Journal of Machine Learning Research 12: 2825–30.
Python, Why. 2021. “Python.” Python Releases for Windows 24.
R Core Team. 2019. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/.
Snider, LA, and SE Swedo. 2004. “PANDAS: Current Status and Directions for Research.” Molecular Psychiatry 9 (10): 900–907.
VanderPlas, Jacob, Brian Granger, Jeffrey Heer, Dominik Moritz, Kanit Wongsuphasawat, Arvind Satyanarayan, Eitan Lees, Ilia Timofeev, Ben Welsh, and Scott Sievert. 2018. “Altair: Interactive Statistical Visualizations for Python.” Journal of Open Source Software 3 (32): 1057.
Wickham, Hadley, Mara Averick, Jennifer Bryan, Winston Chang, Lucy D’Agostino McGowan, Romain François, Garrett Grolemund, et al. 2019. “Welcome to the tidyverse.” Journal of Open Source Software 4 (43): 1686. https://doi.org/10.21105/joss.01686.
Wickham, Hadley, Jim Hester, and Jennifer Bryan. 2022. Readr: Read Rectangular Text Data. https://CRAN.R-project.org/package=readr.
Xie, Yihui. 2022. Knitr: A General-Purpose Package for Dynamic Report Generation in r. https://yihui.org/knitr/.
Yeh, I-Cheng, and Che-hui Lien. 2009. “The Comparisons of Data Mining Techniques for the Predictive Accuracy of Probability of Default of Credit Card Clients.” Expert Systems with Applications 36 (2, Part 1): 2473–80. https://doi.org/10.1016/j.eswa.2007.12.020.
Zhu, Hao. 2021. kableExtra: Construct Complex Table with ’Kable’ and Pipe Syntax. https://CRAN.R-project.org/package=kableExtra.