Classification Problem using Logistic Regression, Decision Tree, Random Forest and AUC in R
Dataset Description
- DATASET: The dependent variable is 'Personal.Loan' (0 = loan not approved, 1 = loan approved).
- OBJECTIVE: Perform Exploratory Data Analysis, then fit Logistic Regression, Decision Tree and Random Forest models and compare them by AUC to find the best model.
Steps:
- Set the working directory and read the data
- Check the data types of all the variables
- DATA CLEANING
- We need to convert certain variables to factors
- Check for missing data, duplicate records and remove insignificant variables
- A new data frame called 'bank1' is created after dropping the 'ID' column.
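The cleaning steps above can be sketched as follows. The file name and most column names are assumptions for illustration (only 'Personal.Loan' and 'ID' are named in the text), so a small simulated data frame stands in for the real CSV:

```r
# Stand-in for read.csv("bank.csv") -- a small simulated data frame
# (column names other than ID and Personal.Loan are assumptions)
bank <- data.frame(
  ID            = 1:6,
  Age           = c(25, 40, 35, 50, 28, 40),
  Income        = c(49, 110, 72, 180, 55, 110),
  Personal.Loan = c(0, 1, 0, 1, 0, 1)
)

# Convert the 0/1 target to a factor so modelling functions treat it as a class
bank$Personal.Loan <- as.factor(bank$Personal.Loan)

# Check for missing data and duplicate records
colSums(is.na(bank))
sum(duplicated(bank))

# Drop the insignificant 'ID' column into a new data frame 'bank1'
bank1 <- bank[, !(names(bank) %in% "ID")]
str(bank1)
```

In the real project the same `as.factor` conversion would be applied to each categorical column identified in the data-type check.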
- EXPLORATORY DATA ANALYSIS
- We explore the data through bar charts and box plots to surface insights that can help the bank's management with decision making
- Load the required libraries
- Out of the total 5000 customers, 4520 have not been approved for a loan while 480 have been
- THIS INDICATES THAT INCOME IS HIGHER WHEN THERE ARE FEWER FAMILY MEMBERS
- THIS INDICATES PERSONAL LOAN HAS BEEN APPROVED FOR CUSTOMERS HAVING HIGHER INCOME
- THIS INDICATES THAT THE INCOME IS PRETTY SIMILAR FOR CUSTOMERS OWNING AND NOT OWNING A CREDIT CARD
- CUSTOMERS BELONGING TO THE RICH CLASS (INCOME GROUP : 150-200) HAVE THE HIGHEST MORTGAGE
- CC AVG IS PRETTY SIMILAR FOR THOSE WHO OPTED FOR ONLINE SERVICES AND THOSE WHO DID NOT
- MORE EDUCATED CUSTOMERS HAVE A HIGHER CC AVG
- CC AVG IS HIGHER IN THE AGE GROUP OF 22-30 AND 31-40
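A minimal sketch of the summaries behind these charts, again on simulated data (the real analysis would run these on the full 'bank1' data frame):

```r
# Simulated stand-in for bank1; column names are assumptions
set.seed(1)
bank1 <- data.frame(
  Family        = sample(1:4, 200, replace = TRUE),
  Income        = round(runif(200, 20, 200)),
  Personal.Loan = factor(sample(c(0, 1), 200, replace = TRUE, prob = c(0.9, 0.1)))
)

# Bar-chart input: count of approvals vs non-approvals
table(bank1$Personal.Loan)

# Median income by family size (the comparison behind the family-size insight)
tapply(bank1$Income, bank1$Family, median)

# Box plot comparing income for approved vs non-approved customers
boxplot(Income ~ Personal.Loan, data = bank1,
        main = "Income vs Personal Loan approval")
```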
- USING LOGISTIC REGRESSION
- The 'Zipcode' variable has been removed, so we create a new data frame without it.
- Age, Income and Age_range have VIF values greater than 5, so we drop Age_range first.
- We create a new data frame called 'bank3' by excluding the column 'Age_range'
- The VIF of 'Age' is now below 5. 'Income' still has a VIF above 5, but I keep it as I feel it is very important for further analysis.
- Column 'Mortgage' has been removed
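The VIF of a predictor is 1 / (1 - R²) from regressing it on the other predictors; `car::vif()` on a fitted model is the usual shortcut. A base-R sketch on simulated, deliberately collinear data (Age_range built from Age, mimicking the situation above):

```r
# VIF = 1 / (1 - R^2) where R^2 comes from regressing one predictor
# on all the others; high VIF signals multicollinearity
set.seed(2)
n         <- 300
age       <- rnorm(n, 45, 10)
age_range <- age + rnorm(n, 0, 1)   # derived from age, so strongly collinear
income    <- rnorm(n, 100, 30)      # independent of the other two

vif_one <- function(x, others) {
  r2 <- summary(lm(x ~ ., data = others))$r.squared
  1 / (1 - r2)
}
X    <- data.frame(age, age_range, income)
vifs <- sapply(names(X), function(nm) vif_one(X[[nm]], X[setdiff(names(X), nm)]))
round(vifs, 2)   # age and age_range far above 5; income near 1
```

Dropping `age_range` and recomputing would bring the VIF of `age` back down, which is exactly the effect described above.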
- Accuracy is 96.10%, Sensitivity is 70.83% and Specificity is 98.78%
- AS WE SAW EARLIER, THE DATA IS HEAVILY IMBALANCED: of the 5000 customers, 4520 were not approved for a loan while 480 were. WE NEED TO BALANCE THE DATA
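The text does not name the balancing function (`ROSE::ovun.sample` with `method = "over"`, `"under"` and `"both"` is a common choice); this dependency-free base-R sketch shows the three strategies directly on simulated data:

```r
# Simulated imbalanced training data (stand-in for the real train split)
set.seed(3)
train <- data.frame(
  Income        = round(runif(500, 20, 200)),
  Personal.Loan = factor(sample(c(0, 1), 500, replace = TRUE, prob = c(0.9, 0.1)))
)
minority <- train[train$Personal.Loan == 1, ]
majority <- train[train$Personal.Loan == 0, ]

# Over-sampling: replicate minority rows up to the majority count
over_data  <- rbind(majority,
                    minority[sample(nrow(minority), nrow(majority), replace = TRUE), ])
# Under-sampling: keep only as many majority rows as there are minority rows
under_data <- rbind(minority,
                    majority[sample(nrow(majority), nrow(minority)), ])
# Both: over-sample the minority and under-sample the majority to a middle size
target    <- round(nrow(train) / 2)
both_data <- rbind(minority[sample(nrow(minority), target, replace = TRUE), ],
                   majority[sample(nrow(majority), target), ])

table(over_data$Personal.Loan)   # classes now balanced
```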
- Predict the test data for over, under and both data using Logistic Regression
- Logistic Regression for over_data: Accuracy is 92.1%, Sensitivity is 94.79% and Specificity is 91.81%
- Logistic Regression for under_data: Accuracy is 92.1%, Sensitivity is 95.83% and Specificity is 91.70%
- Logistic Regression for both_data: Accuracy is 92.2%, Sensitivity is 93.75% and Specificity is 92.04%
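A sketch of fitting and scoring one of these logistic models. The data is simulated; `caret::confusionMatrix` is a common way to get the sensitivity/specificity figures quoted above, but a plain `table` is enough to show the idea:

```r
# Simulated data with a real income effect on approval
set.seed(4)
n      <- 400
income <- runif(n, 20, 200)
loan   <- rbinom(n, 1, plogis((income - 110) / 20))
d      <- data.frame(Income = income, Personal.Loan = factor(loan))

# 70/30 train/test split
idx   <- sample(n, 0.7 * n)
train <- d[idx, ]; test <- d[-idx, ]

# Fit the logistic regression and predict class probabilities on the test set
fit  <- glm(Personal.Loan ~ Income, data = train, family = binomial)
prob <- predict(fit, newdata = test, type = "response")
pred <- factor(ifelse(prob > 0.5, 1, 0), levels = c(0, 1))

# Confusion matrix and accuracy; sensitivity/specificity come from its cells
cm       <- table(Predicted = pred, Actual = test$Personal.Loan)
accuracy <- sum(diag(cm)) / sum(cm)
accuracy
```

In the report, the same fit/predict step is repeated once each for `over_data`, `under_data` and `both_data` as the training set.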
- Predict the test data for over, under and both data using Decision Tree
- Decision Tree for over_data: Accuracy is 92.8%, Sensitivity is 98.96% and Specificity is 92.15%
- Decision Tree for under_data: Accuracy is 93.7%, Sensitivity is 98.96% and Specificity is 93.14%
- Decision Tree for both_data: Accuracy is 94.5%, Sensitivity is 94.79% and Specificity is 94.47%
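The decision-tree step can be sketched with `rpart` (a recommended package shipped with R). The data is simulated, not the bank data:

```r
library(rpart)

# Simulated data: approval probability rises with income
set.seed(5)
n      <- 400
income <- runif(n, 20, 200)
d      <- data.frame(Income        = income,
                     Personal.Loan = factor(rbinom(n, 1, plogis((income - 110) / 20))))

idx   <- sample(n, 0.7 * n)
train <- d[idx, ]; test <- d[-idx, ]

# Fit a classification tree and predict classes on the held-out 30%
tree <- rpart(Personal.Loan ~ Income, data = train, method = "class")
pred <- predict(tree, newdata = test, type = "class")
acc  <- mean(pred == test$Personal.Loan)
acc
```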
- Predict the test data for over, under and both data using Random Forest
- Random Forest for over_data: Accuracy is 98.4%, Sensitivity is 92.71% and Specificity is 99.00%
- Random Forest for under_data: Accuracy is 95.7%, Sensitivity is 98.96% and Specificity is 95.35%
- Random Forest for both_data: Accuracy is 98.4%, Sensitivity is 96.88% and Specificity is 98.56%
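The `randomForest` package is the usual tool for this step; to keep the sketch dependency-free, the snippet below hand-rolls a tiny bagged-tree ensemble with `rpart`, which illustrates the core random-forest idea (many trees on bootstrap samples, majority vote) on simulated data:

```r
library(rpart)

set.seed(6)
n      <- 400
income <- runif(n, 20, 200)
d      <- data.frame(Income        = income,
                     Personal.Loan = factor(rbinom(n, 1, plogis((income - 110) / 20))))
idx   <- sample(n, 0.7 * n)
train <- d[idx, ]; test <- d[-idx, ]

# Grow 25 trees, each on a bootstrap sample of the training data
trees <- lapply(1:25, function(i) {
  boot <- train[sample(nrow(train), replace = TRUE), ]
  rpart(Personal.Loan ~ Income, data = boot, method = "class")
})

# Majority vote across the ensemble for each test row
votes <- sapply(trees, function(t) as.character(predict(t, test, type = "class")))
pred  <- apply(votes, 1, function(v) names(which.max(table(v))))
acc   <- mean(pred == test$Personal.Loan)
acc
```

A real random forest also samples a random subset of predictors at each split; with `randomForest::randomForest(Personal.Loan ~ ., data = train)` the call would be a one-liner.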
- ROC and AUC for over, under and both_data
- AUC 98.01% for over_data for Logistic Regression
- AUC 98.2% for under_data for Logistic Regression
- AUC 97.86% for both_data for Logistic Regression
- AUC 97.72% for over_data for Decision Tree
- AUC 98.01% for under_data for Decision Tree
- AUC 98.80% for both_data for Decision Tree
- AUC 99.83% for over_data for Random Forest
- AUC 99.71% for under_data for Random Forest
- AUC 99.82% for both_data for Random Forest
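`pROC::roc()`/`auc()` is the usual route to these numbers; the rank formula below computes the same statistic in base R (AUC is the probability that a random positive scores above a random negative). The scores here are simulated, not the report's:

```r
# AUC via the rank (Mann-Whitney) formula, equivalent to pROC::auc
auc_score <- function(scores, labels) {
  # labels must be 0/1; higher score should mean "more likely positive"
  r  <- rank(scores)
  n1 <- sum(labels == 1)
  n0 <- sum(labels == 0)
  (sum(r[labels == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

set.seed(7)
labels <- rbinom(300, 1, 0.3)
scores <- labels + rnorm(300)   # informative scores, so AUC is well above 0.5
a <- round(auc_score(scores, labels), 3)
a
```

In the report, `scores` would be each model's predicted probabilities on the test set, computed once per model and per balanced training set.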
- CONCLUSION: IF WE GO BY AUC, WE SHOULD MOVE AHEAD WITH RANDOM FOREST, AS IT HAS THE HIGHEST AUC OF ALL THE MODELS