
Bank Loan Approval - LR, DT, RF And AUC

@kaggle.vikramamin_bank_loan_approval_lr_dt_rf_and_auc


Classification Problem using LR, DT, RF and AUC in R Programming

  • DATASET: The dependent variable is 'Personal.Loan': 0 indicates the loan was not approved and 1 indicates it was approved.
  • OBJECTIVE: We will do exploratory data analysis and use Logistic Regression, Decision Tree, Random Forest and AUC to find out which model performs best.
    Steps:
  • Set the working directory and read the data
  • Check the data types of all the variables
  • DATA CLEANING
  • We need to change the data types of certain variables to factors
  • Check for missing data, duplicate records and remove insignificant variables
  • New data frame created called 'bank1' after dropping the 'ID' column.
  • EXPLORATORY DATA ANALYSIS
  • We will try to get some insights by digging into the data through bar charts and box plots, which can help the bank management in decision making
  • Run the required libraries
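The loading and cleaning steps above can be sketched in R as follows; the file name "bankloan.csv" and the exact list of factor columns are assumptions, not taken from the notebook.

```r
# Read the data from the working directory (file name assumed)
bank <- read.csv("bankloan.csv")
str(bank)                                   # check the data types of all variables

# Convert categorical columns to factors (column names assumed from the dataset)
factor_cols <- c("Personal.Loan", "Securities.Account", "CD.Account",
                 "Online", "CreditCard", "Education")
bank[factor_cols] <- lapply(bank[factor_cols], as.factor)

sum(is.na(bank))                            # check for missing data
sum(duplicated(bank))                       # check for duplicate records
bank1 <- subset(bank, select = -ID)         # drop the insignificant 'ID' column
```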

  • Out of the total 5000 customers, 4520 have not been approved for a loan while 480 have been
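A quick tally of the dependent variable confirms the counts reported above:

```r
table(bank1$Personal.Loan)
#    0    1
# 4520  480
prop.table(table(bank1$Personal.Loan))      # roughly 90.4% not approved vs 9.6% approved

# Bar chart of the class distribution
library(ggplot2)
ggplot(bank1, aes(Personal.Loan)) + geom_bar()
```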


  • Income is higher when there are fewer family members

  • Personal loans have been approved mostly for customers with higher income

  • Income is fairly similar for customers owning and not owning a credit card

  • Customers belonging to the rich class (income group 150-200) have the highest mortgage

  • CCAvg is fairly similar for those who opted for online services and those who did not

  • More educated customers have a higher credit card average (CCAvg)

  • CCAvg is higher in the age groups 22-30 and 31-40
  • USING LOGISTIC REGRESSION
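One of the box plots behind these insights can be sketched like this (aesthetics and labels are illustrative, not the notebook's exact plot):

```r
# Income by number of family members: medians fall as family size grows
library(ggplot2)
ggplot(bank1, aes(x = factor(Family), y = Income)) +
  geom_boxplot() +
  labs(x = "Family members", y = "Income",
       title = "Income by number of family members")
```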
  • USING LOGISTIC REGRESSION



  • The 'Zipcode' variable is insignificant and has been removed, so we create a new data frame without it.
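A sketch of the logistic regression fit, assuming a 70/30 train/test split and that the data frame without 'Zipcode' is called `bank2` (the split ratio and frame name are assumptions):

```r
set.seed(123)
idx   <- sample(nrow(bank2), 0.7 * nrow(bank2))
train <- bank2[idx, ]
test  <- bank2[-idx, ]

# Logistic regression on all remaining predictors
logit <- glm(Personal.Loan ~ ., data = train, family = binomial)
summary(logit)
```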


  • Age, Income and Age_range have VIF values greater than 5, so we will drop 'Age_range' first.
  • We create a new data frame called 'bank3' by excluding the column 'Age_range'



  • As we can see, the VIF value of column 'Age' is now below 5. The column 'Income' still has a VIF above 5, but I will keep it as I feel it is very important for further analysis.
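The multicollinearity check can be done with the `car` package; this sketch assumes `bank3` is the frame without 'Age_range':

```r
library(car)

# VIF (GVIF for factor predictors) of each variable in the model
vif(glm(Personal.Loan ~ ., data = bank3, family = binomial))
# 'Income' is kept despite a VIF above 5 because of its importance to the analysis
```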




  • Column 'Mortgage' has been removed


  • Accuracy is 96.10%, Sensitivity is 70.83% and Specificity is 98.78%
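These metrics can be obtained with `caret`; a 0.5 probability cutoff and `positive = "1"` (the approved class) are assumptions:

```r
library(caret)

pred    <- predict(logit, newdata = test, type = "response")
classes <- factor(ifelse(pred > 0.5, 1, 0), levels = c(0, 1))
confusionMatrix(classes, test$Personal.Loan, positive = "1")
```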
  • As we saw earlier, the data is heavily imbalanced: out of the total 5000 customers, 4520 have not been approved for a loan while 480 have been. We need to balance the data.
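One plausible way to build the over-, under- and both-sampled sets named below is `ovun.sample()` from the ROSE package (the notebook does not say which package it used, so treat this as an assumption):

```r
library(ROSE)

over_data  <- ovun.sample(Personal.Loan ~ ., data = train,
                          method = "over",  p = 0.5)$data
under_data <- ovun.sample(Personal.Loan ~ ., data = train,
                          method = "under", p = 0.5)$data
both_data  <- ovun.sample(Personal.Loan ~ ., data = train,
                          method = "both",  p = 0.5, N = nrow(train))$data

table(over_data$Personal.Loan)   # classes now roughly balanced
```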

  • Predict the test data for over, under and both data using Logistic Regression

  • Logistic Regression for over_data: Accuracy is 92.1%, Sensitivity is 94.79% and Specificity is 91.81%

  • Logistic Regression for under_data: Accuracy is 92.1%, Sensitivity is 95.83% and Specificity is 91.70%

  • Logistic Regression for both_data: Accuracy is 92.2%, Sensitivity is 93.75% and Specificity is 92.04%
  • Predict the test data for over, under and both data using Decision Tree

  • Decision Tree for over_data: Accuracy is 92.8%, Sensitivity is 98.96% and Specificity is 92.15%

  • Decision Tree for under_data: Accuracy is 93.7%, Sensitivity is 98.96% and Specificity is 93.14%

  • Decision Tree for both_data: Accuracy is 94.5%, Sensitivity is 94.79% and Specificity is 94.47%
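The decision tree fits can be sketched with `rpart` (shown for `over_data`; the other two sets are analogous):

```r
library(rpart)

tree_over <- rpart(Personal.Loan ~ ., data = over_data, method = "class")
tree_pred <- predict(tree_over, newdata = test, type = "class")
caret::confusionMatrix(tree_pred, test$Personal.Loan, positive = "1")
```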
  • Predict the test data for over, under and both data using Random Forest

  • Random Forest for over_data: Accuracy is 98.4%, Sensitivity is 92.71% and Specificity is 99.00%

  • Random Forest for under_data: Accuracy is 95.7%, Sensitivity is 98.96% and Specificity is 95.35%

  • Random Forest for both_data: Accuracy is 98.4%, Sensitivity is 96.88% and Specificity is 98.56%
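Likewise for the random forest models, using the `randomForest` package with default settings (an assumption; the notebook's tuning, if any, is not shown):

```r
library(randomForest)

set.seed(123)
rf_over <- randomForest(Personal.Loan ~ ., data = over_data)
rf_pred <- predict(rf_over, newdata = test)
caret::confusionMatrix(rf_pred, test$Personal.Loan, positive = "1")
```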
  • ROC and AUC for over, under and both_data

  • AUC 98.01% for over_data for Logistic Regression

  • AUC 98.2% for under_data for Logistic Regression

  • AUC 97.86% for both_data for Logistic Regression

  • AUC 97.72% for over_data for Decision Tree

  • AUC 98.01% for under_data for Decision Tree

  • AUC 98.80% for both_data for Decision Tree

  • AUC 99.83% for over_data for Random Forest

  • AUC 99.71% for under_data for Random Forest
  • AUC 99.82% for both_data for Random Forest
  • CONCLUSION: If we decide to go with AUC, then we can move ahead with Random Forest, as it has the highest AUC among all the models.
