Bank Loan Approval - LR, DT, RF And AUC

Classification Problem using LR, DT, RF and AUC in R Programming

@kaggle.vikramamin_bank_loan_approval_lr_dt_rf_and_auc

About this Dataset

  • DATASET: The dependent variable is 'Personal.Loan': 0 indicates the loan was not approved and 1 indicates it was approved.
  • OBJECTIVE: We will perform Exploratory Data Analysis and use Logistic Regression, Decision Tree, Random Forest and AUC to determine which model performs best.
    Steps:
  • Set the working directory and read the data
  • Check the data types of all the variables
  • DATA CLEANING
  • We need to change the data types of certain variables to factors
  • Check for missing data, duplicate records and remove insignificant variables
  • A new data frame called 'bank1' is created after dropping the 'ID' column.
  • EXPLORATORY DATA ANALYSIS
  • We will try to get some insights by digging into the data through bar charts and box plots, which can help the bank's management in decision making
  • Run the required libraries
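The cleaning steps above can be sketched in base R. Since the actual CSV is not shown here, a tiny synthetic data frame stands in for it; every column other than 'ID' and 'Personal.Loan' is illustrative.

```r
# Synthetic stand-in for the bank loan data (values are illustrative)
bank <- data.frame(
  ID            = 1:6,
  Age           = c(25, 45, 39, 35, 35, 60),
  Income        = c(49, 100, 11, 100, 45, 29),
  Education     = c(1, 1, 2, 2, 3, 1),
  Personal.Loan = c(0, 1, 0, 1, 0, 0)
)

# Change the data types of the categorical variables to factors
bank$Education     <- factor(bank$Education)
bank$Personal.Loan <- factor(bank$Personal.Loan)

# Check for missing data and duplicate records
colSums(is.na(bank))     # missing values per column
sum(duplicated(bank))    # number of duplicated rows

# Drop the insignificant 'ID' column into a new data frame 'bank1'
bank1 <- bank[, setdiff(names(bank), "ID")]
str(bank1)
```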

  • Out of the total 5000 customers, 4520 have not been approved for a loan while 480 have been


  • THIS INDICATES THAT INCOME IS HIGHER WHEN THERE ARE FEWER FAMILY MEMBERS

  • THIS INDICATES PERSONAL LOAN HAS BEEN APPROVED FOR CUSTOMERS HAVING HIGHER INCOME

  • THIS INDICATES THAT THE INCOME IS PRETTY SIMILAR FOR CUSTOMERS OWNING AND NOT OWNING A CREDIT CARD

  • CUSTOMERS BELONGING TO THE RICH CLASS (INCOME GROUP : 150-200) HAVE THE HIGHEST MORTGAGE

  • CC AVG IS PRETTY SIMILAR FOR THOSE WHO OPTED FOR ONLINE SERVICES AND THOSE WHO DID NOT

  • MORE EDUCATED CUSTOMERS HAVE A HIGHER CREDIT AVERAGE

  • CC AVG IS HIGHER IN THE AGE GROUPS OF 22-30 AND 31-40
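The charts behind insights like these can be sketched with base R graphics (the write-up's plotting libraries are not shown, so this is only one way to produce them; the data below are synthetic).

```r
# Synthetic stand-in for the cleaned data
bank1 <- data.frame(
  Personal.Loan = factor(c(0, 0, 0, 1, 0, 1)),
  Family        = factor(c(1, 2, 4, 1, 3, 2)),
  Income        = c(80, 60, 35, 150, 40, 120)
)

# Bar chart: counts of approved vs. not-approved loans
loan_counts <- table(bank1$Personal.Loan)
barplot(loan_counts, main = "Personal Loan Approval",
        names.arg = c("Not approved", "Approved"))

# Box plot: income by number of family members
boxplot(Income ~ Family, data = bank1, main = "Income by Family Size")

# Numeric version of the same comparison
income_by_family <- tapply(bank1$Income, bank1$Family, mean)
income_by_family
```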
  • USING LOGISTIC REGRESSION



  • The 'Zipcode' variable has been removed, so we create a new data frame without it.
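Fitting the logistic regression itself uses base R's glm() with a binomial family. A minimal sketch on synthetic data (the real model would use the cleaned data frame and all remaining predictors):

```r
# Synthetic training data: approval probability rises with income
set.seed(1)
n <- 200
train <- data.frame(
  Income = runif(n, 10, 200),
  Age    = runif(n, 22, 65)
)
p <- plogis(-6 + 0.05 * train$Income)   # illustrative relationship
train$Personal.Loan <- factor(rbinom(n, 1, p))

# Logistic regression via glm with a binomial family
model <- glm(Personal.Loan ~ Income + Age, data = train, family = binomial)
summary(model)

# Predicted probabilities, thresholded at 0.5
pred <- ifelse(predict(model, type = "response") > 0.5, 1, 0)
table(Predicted = pred, Actual = train$Personal.Loan)
```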


  • Age, Income and Age_range have VIF values greater than 5, so we drop 'Age_range' first.
  • We create a new data frame called 'bank3' by excluding the 'Age_range' column



  • As we can see, the VIF value of 'Age' is now below 5. 'Income' still has a VIF above 5, but I will keep it as I feel it is very important for further analysis.
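The VIF check is typically done with car::vif(); the same quantity can be computed by hand in base R as 1 / (1 - R²) from regressing each predictor on the others, which also shows why dropping 'Age_range' deflates the VIF of 'Age'. Synthetic, deliberately collinear data:

```r
# Age_range is nearly collinear with Age (synthetic illustration)
set.seed(2)
n <- 300
d <- data.frame(
  Age    = runif(n, 22, 65),
  Income = runif(n, 10, 200)
)
d$Age_range <- d$Age + rnorm(n, sd = 2)

# VIF of one variable = 1 / (1 - R^2) of that variable on the others
vif_manual <- function(df, var) {
  fml <- reformulate(setdiff(names(df), var), response = var)
  1 / (1 - summary(lm(fml, data = df))$r.squared)
}

sapply(names(d), function(v) vif_manual(d, v))   # Age, Age_range inflated

# Dropping Age_range brings Age's VIF back down
d2 <- d[, c("Age", "Income")]
sapply(names(d2), function(v) vif_manual(d2, v))
```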




  • Column 'Mortgage' has been removed


  • Accuracy is 96.10%, Sensitivity is 70.83% and Specificity is 98.78%
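These three metrics come straight off a confusion matrix (caret::confusionMatrix reports them directly; note its default positive class is the first factor level). A base-R sketch with illustrative counts, treating 1 = approved as the positive class:

```r
# Illustrative confusion matrix: rows = actual, columns = predicted
cm <- matrix(c(1130, 12,
                 35, 85),
             nrow = 2, byrow = TRUE,
             dimnames = list(Actual = c("0", "1"), Predicted = c("0", "1")))

TN <- cm["0", "0"]; FP <- cm["0", "1"]
FN <- cm["1", "0"]; TP <- cm["1", "1"]

accuracy    <- (TP + TN) / sum(cm)
sensitivity <- TP / (TP + FN)   # true-positive rate on approved loans
specificity <- TN / (TN + FP)   # true-negative rate on rejected loans
c(accuracy = accuracy, sensitivity = sensitivity, specificity = specificity)
```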
  • AS WE SAW EARLIER, THE DATA IS HEAVILY IMBALANCED: out of the total 5000 customers, 4520 have not been approved for a loan while 480 have. WE NEED TO BALANCE THE DATA
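The over/under/both naming suggests ROSE::ovun.sample() was used for balancing. The idea can be shown with plain base-R resampling on synthetic data (the "both" option simply mixes the two strategies):

```r
# Synthetic imbalanced training set: 452 rejections vs. 48 approvals
set.seed(3)
train <- data.frame(
  Income        = runif(500, 10, 200),
  Personal.Loan = factor(rep(c(0, 1), times = c(452, 48)))
)
minority <- which(train$Personal.Loan == 1)
majority <- which(train$Personal.Loan == 0)

# Over-sampling: resample the minority class up to the majority size
over_idx  <- c(majority, sample(minority, length(majority), replace = TRUE))
over_data <- train[over_idx, ]

# Under-sampling: sample the majority class down to the minority size
under_idx  <- c(minority, sample(majority, length(minority)))
under_data <- train[under_idx, ]

table(over_data$Personal.Loan)
table(under_data$Personal.Loan)
```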

  • Predict the test data for over, under and both data using Logistic Regression

  • Logistic Regression for over_data: Accuracy is 92.1%, Sensitivity is 94.79% and Specificity is 91.81%

  • Logistic Regression for under_data: Accuracy is 92.1%, Sensitivity is 95.83% and Specificity is 91.70%

  • Logistic Regression for both_data: Accuracy is 92.2%, Sensitivity is 93.75% and Specificity is 92.04%
  • Predict the test data for over, under and both data using Decision Tree

  • Decision Tree for over_data: Accuracy is 92.8%, Sensitivity is 98.96% and Specificity is 92.15%

  • Decision Tree for under_data: Accuracy is 93.7%, Sensitivity is 98.96% and Specificity is 93.14%

  • Decision Tree for both_data: Accuracy is 94.5%, Sensitivity is 94.79% and Specificity is 94.47%
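The decision tree step can be sketched with rpart (a recommended package that ships with R); the same fit-then-predict pattern is repeated for each of the three balanced training sets. Synthetic data again stand in for the real ones:

```r
library(rpart)

# Synthetic balanced training set: approval driven by income plus noise
set.seed(4)
n <- 400
train <- data.frame(Income = runif(n, 10, 200))
train$Personal.Loan <- factor(ifelse(train$Income + rnorm(n, sd = 20) > 110, 1, 0))

# Classification tree and predictions
tree <- rpart(Personal.Loan ~ Income, data = train, method = "class")
pred <- predict(tree, train, type = "class")
mean(pred == train$Personal.Loan)   # training accuracy
```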
  • Predict the test data for over, under and both data using Random Forest

  • Random Forest for over_data: Accuracy is 98.4%, Sensitivity is 92.71% and Specificity is 99.00%

  • Random Forest for under_data: Accuracy is 95.7%, Sensitivity is 98.96% and Specificity is 95.35%

  • Random Forest for both_data: Accuracy is 98.4%, Sensitivity is 96.88% and Specificity is 98.56%
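The write-up presumably uses the randomForest package. To stay dependency-light, this sketch illustrates the core mechanism instead: bootstrap-aggregated decision trees combined by majority vote, built with rpart on synthetic data.

```r
library(rpart)

# Synthetic balanced training set
set.seed(5)
n <- 400
train <- data.frame(
  Income = runif(n, 10, 200),
  Age    = runif(n, 22, 65)
)
train$Personal.Loan <- factor(ifelse(train$Income + rnorm(n, sd = 20) > 110, 1, 0))

# Grow trees on bootstrap samples of the training data
n_trees <- 25
forest <- lapply(seq_len(n_trees), function(i) {
  boot <- train[sample(n, n, replace = TRUE), ]
  rpart(Personal.Loan ~ Income + Age, data = boot, method = "class")
})

# Majority vote across the trees
votes <- sapply(forest, function(t) as.character(predict(t, train, type = "class")))
pred  <- factor(apply(votes, 1, function(v) names(which.max(table(v)))))
mean(as.character(pred) == as.character(train$Personal.Loan))
```

A full random forest additionally samples a random subset of predictors at each split (randomForest's `mtry`), which matters once there are many columns.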
  • ROC and AUC for over, under and both_data

  • AUC 98.01% for over_data for Logistic Regression

  • AUC 98.2% for under_data for Logistic Regression

  • AUC 97.86% for both_data for Logistic Regression

  • AUC 97.72% for over_data for Decision Tree

  • AUC 98.01% for under_data for Decision Tree

  • AUC 98.80% for both_data for Decision Tree

  • AUC 99.83% for over_data for Random Forest

  • AUC 99.71% for under_data for Random Forest
  • AUC 99.82% for both_data for Random Forest
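The ROC curves were likely drawn with a package such as pROC or ROCR, but the AUC itself has a direct base-R form: it equals the probability that a randomly chosen approved case scores higher than a randomly chosen rejected one, computable from ranks (the Mann-Whitney statistic).

```r
# AUC from ranks: P(score of a positive > score of a negative)
auc_manual <- function(scores, labels) {
  pos <- scores[labels == 1]
  neg <- scores[labels == 0]
  r <- rank(c(pos, neg))   # average ranks handle ties
  (sum(r[seq_along(pos)]) - length(pos) * (length(pos) + 1) / 2) /
    (length(pos) * length(neg))
}

# Perfectly separated scores give AUC = 1
labels <- c(1, 1, 1, 0, 0, 0, 0)
scores <- c(0.9, 0.8, 0.7, 0.4, 0.3, 0.2, 0.1)
auc_manual(scores, labels)   # 1
```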
  • CONCLUSION: IF WE DECIDE TO GO WITH AUC, THEN WE CAN MOVE AHEAD WITH RANDOM FOREST AS IT HAS THE HIGHEST AUC AMONG ALL THE MODELS
