Classification Problem using Logistic Regression, Decision Tree, Random Forest and AUC in R
Dataset Description
- DATASET: The dependent variable is 'Personal.Loan' (0 = loan not approved, 1 = loan approved).
- OBJECTIVE: Perform Exploratory Data Analysis, then fit Logistic Regression, Decision Tree and Random Forest models and compare them by AUC to find the best model.
Steps:
- Set the working directory and read the data
- Check the data types of all the variables
- DATA CLEANING
- We need to convert certain variables to factors
- Check for missing data, duplicate records and remove insignificant variables
- A new data frame called 'bank1' is created after dropping the 'ID' column.
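The cleaning steps above can be sketched as follows. The file name and most column names are assumptions for illustration (only 'Personal.Loan' and 'ID' are named in the text), so a small simulated data frame stands in for the real CSV:

```r
# Stand-in for read.csv("bank.csv") -- a small simulated data frame
# (column names other than ID and Personal.Loan are assumptions)
bank <- data.frame(
  ID            = 1:6,
  Age           = c(25, 40, 35, 50, 28, 40),
  Income        = c(49, 110, 72, 180, 55, 110),
  Personal.Loan = c(0, 1, 0, 1, 0, 1)
)

# Convert the 0/1 target to a factor so modelling functions treat it as a class
bank$Personal.Loan <- as.factor(bank$Personal.Loan)

# Check for missing data and duplicate records
colSums(is.na(bank))
sum(duplicated(bank))

# Drop the insignificant 'ID' column into a new data frame 'bank1'
bank1 <- bank[, !(names(bank) %in% "ID")]
str(bank1)
```

In the real project the same `as.factor` conversion would be applied to each categorical column identified in the data-type check.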
- EXPLORATORY DATA ANALYSIS
- We explore the data through bar charts and box plots to surface insights that can help the bank's management with decision making
- Load the required libraries
- Out of the total 5000 customers, 4520 have not been approved for a loan while 480 have been
- THIS INDICATES THAT INCOME IS HIGHER WHEN THERE ARE FEWER FAMILY MEMBERS
- THIS INDICATES PERSONAL LOAN HAS BEEN APPROVED FOR CUSTOMERS HAVING HIGHER INCOME
- THIS INDICATES THAT THE INCOME IS PRETTY SIMILAR FOR CUSTOMERS OWNING AND NOT OWNING A CREDIT CARD
- CUSTOMERS BELONGING TO THE RICH CLASS (INCOME GROUP : 150-200) HAVE THE HIGHEST MORTGAGE
- CC AVG IS PRETTY SIMILAR FOR THOSE WHO OPTED FOR ONLINE SERVICES AND THOSE WHO DID NOT
- MORE EDUCATED CUSTOMERS HAVE A HIGHER CC AVG
- CC AVG IS HIGHER IN THE AGE GROUP OF 22-30 AND 31-40
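A minimal sketch of the summaries behind these charts, again on simulated data (the real analysis would run these on the full 'bank1' data frame):

```r
# Simulated stand-in for bank1; column names are assumptions
set.seed(1)
bank1 <- data.frame(
  Family        = sample(1:4, 200, replace = TRUE),
  Income        = round(runif(200, 20, 200)),
  Personal.Loan = factor(sample(c(0, 1), 200, replace = TRUE, prob = c(0.9, 0.1)))
)

# Bar-chart input: count of approvals vs non-approvals
table(bank1$Personal.Loan)

# Median income by family size (the comparison behind the family-size insight)
tapply(bank1$Income, bank1$Family, median)

# Box plot comparing income for approved vs non-approved customers
boxplot(Income ~ Personal.Loan, data = bank1,
        main = "Income vs Personal Loan approval")
```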
- USING LOGISTIC REGRESSION
- The 'Zipcode' variable has been removed, so we create a new data frame without it.
- Age, Income and Age_range have VIF values greater than 5, so we drop Age_range first.
- We create a new data frame called 'bank3' by excluding the column 'Age_range'
- The VIF of 'Age' is now below 5. 'Income' still has a VIF above 5, but I keep it as I feel it is very important for further analysis.
- Column 'Mortgage' has been removed
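The VIF of a predictor is 1 / (1 - R²) from regressing it on the other predictors; `car::vif()` on a fitted model is the usual shortcut. A base-R sketch on simulated, deliberately collinear data (Age_range built from Age, mimicking the situation above):

```r
# VIF = 1 / (1 - R^2) where R^2 comes from regressing one predictor
# on all the others; high VIF signals multicollinearity
set.seed(2)
n         <- 300
age       <- rnorm(n, 45, 10)
age_range <- age + rnorm(n, 0, 1)   # derived from age, so strongly collinear
income    <- rnorm(n, 100, 30)      # independent of the other two

vif_one <- function(x, others) {
  r2 <- summary(lm(x ~ ., data = others))$r.squared
  1 / (1 - r2)
}
X    <- data.frame(age, age_range, income)
vifs <- sapply(names(X), function(nm) vif_one(X[[nm]], X[setdiff(names(X), nm)]))
round(vifs, 2)   # age and age_range far above 5; income near 1
```

Dropping `age_range` and recomputing would bring the VIF of `age` back down, which is exactly the effect described above.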
- Accuracy is 96.10%, Sensitivity is 70.83% and Specificity is 98.78%
- AS WE SAW EARLIER, THE DATA IS HEAVILY IMBALANCED: of the 5000 customers, 4520 were not approved for a loan while 480 were. WE NEED TO BALANCE THE DATA
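The text does not name the balancing function (`ROSE::ovun.sample` with `method = "over"`, `"under"` and `"both"` is a common choice); this dependency-free base-R sketch shows the three strategies directly on simulated data:

```r
# Simulated imbalanced training data (stand-in for the real train split)
set.seed(3)
train <- data.frame(
  Income        = round(runif(500, 20, 200)),
  Personal.Loan = factor(sample(c(0, 1), 500, replace = TRUE, prob = c(0.9, 0.1)))
)
minority <- train[train$Personal.Loan == 1, ]
majority <- train[train$Personal.Loan == 0, ]

# Over-sampling: replicate minority rows up to the majority count
over_data  <- rbind(majority,
                    minority[sample(nrow(minority), nrow(majority), replace = TRUE), ])
# Under-sampling: keep only as many majority rows as there are minority rows
under_data <- rbind(minority,
                    majority[sample(nrow(majority), nrow(minority)), ])
# Both: over-sample the minority and under-sample the majority to a middle size
target    <- round(nrow(train) / 2)
both_data <- rbind(minority[sample(nrow(minority), target, replace = TRUE), ],
                   majority[sample(nrow(majority), target), ])

table(over_data$Personal.Loan)   # classes now balanced
```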
- Predict the test data for over, under and both data using Logistic Regression
- Logistic Regression for over_data: Accuracy is 92.1%, Sensitivity is 94.79% and Specificity is 91.81%
- Logistic Regression for under_data: Accuracy is 92.1%, Sensitivity is 95.83% and Specificity is 91.70%
- Logistic Regression for both_data: Accuracy is 92.2%, Sensitivity is 93.75% and Specificity is 92.04%
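A sketch of fitting and scoring one of these logistic models. The data is simulated; `caret::confusionMatrix` is a common way to get the sensitivity/specificity figures quoted above, but a plain `table` is enough to show the idea:

```r
# Simulated data with a real income effect on approval
set.seed(4)
n      <- 400
income <- runif(n, 20, 200)
loan   <- rbinom(n, 1, plogis((income - 110) / 20))
d      <- data.frame(Income = income, Personal.Loan = factor(loan))

# 70/30 train/test split
idx   <- sample(n, 0.7 * n)
train <- d[idx, ]; test <- d[-idx, ]

# Fit the logistic regression and predict class probabilities on the test set
fit  <- glm(Personal.Loan ~ Income, data = train, family = binomial)
prob <- predict(fit, newdata = test, type = "response")
pred <- factor(ifelse(prob > 0.5, 1, 0), levels = c(0, 1))

# Confusion matrix and accuracy; sensitivity/specificity come from its cells
cm       <- table(Predicted = pred, Actual = test$Personal.Loan)
accuracy <- sum(diag(cm)) / sum(cm)
accuracy
```

In the report, the same fit/predict step is repeated once each for `over_data`, `under_data` and `both_data` as the training set.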
- Predict the test data for over, under and both data using Decision Tree
- Decision Tree for over_data: Accuracy is 92.8%, Sensitivity is 98.96% and Specificity is 92.15%
- Decision Tree for under_data: Accuracy is 93.7%, Sensitivity is 98.96% and Specificity is 93.14%
- Decision Tree for both_data: Accuracy is 94.5%, Sensitivity is 94.79% and Specificity is 94.47%
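The decision-tree step can be sketched with `rpart` (a recommended package shipped with R). The data is simulated, not the bank data:

```r
library(rpart)

# Simulated data: approval probability rises with income
set.seed(5)
n      <- 400
income <- runif(n, 20, 200)
d      <- data.frame(Income        = income,
                     Personal.Loan = factor(rbinom(n, 1, plogis((income - 110) / 20))))

idx   <- sample(n, 0.7 * n)
train <- d[idx, ]; test <- d[-idx, ]

# Fit a classification tree and predict classes on the held-out 30%
tree <- rpart(Personal.Loan ~ Income, data = train, method = "class")
pred <- predict(tree, newdata = test, type = "class")
acc  <- mean(pred == test$Personal.Loan)
acc
```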
- Predict the test data for over, under and both data using Random Forest
- Random Forest for over_data: Accuracy is 98.4%, Sensitivity is 92.71% and Specificity is 99.00%
- Random Forest for under_data: Accuracy is 95.7%, Sensitivity is 98.96% and Specificity is 95.35%
- Random Forest for both_data: Accuracy is 98.4%, Sensitivity is 96.88% and Specificity is 98.56%
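The `randomForest` package is the usual tool for this step; to keep the sketch dependency-free, the snippet below hand-rolls a tiny bagged-tree ensemble with `rpart`, which illustrates the core random-forest idea (many trees on bootstrap samples, majority vote) on simulated data:

```r
library(rpart)

set.seed(6)
n      <- 400
income <- runif(n, 20, 200)
d      <- data.frame(Income        = income,
                     Personal.Loan = factor(rbinom(n, 1, plogis((income - 110) / 20))))
idx   <- sample(n, 0.7 * n)
train <- d[idx, ]; test <- d[-idx, ]

# Grow 25 trees, each on a bootstrap sample of the training data
trees <- lapply(1:25, function(i) {
  boot <- train[sample(nrow(train), replace = TRUE), ]
  rpart(Personal.Loan ~ Income, data = boot, method = "class")
})

# Majority vote across the ensemble for each test row
votes <- sapply(trees, function(t) as.character(predict(t, test, type = "class")))
pred  <- apply(votes, 1, function(v) names(which.max(table(v))))
acc   <- mean(pred == test$Personal.Loan)
acc
```

A real random forest also samples a random subset of predictors at each split; with `randomForest::randomForest(Personal.Loan ~ ., data = train)` the call would be a one-liner.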
- ROC and AUC for over, under and both_data
- AUC 98.01% for over_data for Logistic Regression
- AUC 98.2% for under_data for Logistic Regression
- AUC 97.86% for both_data for Logistic Regression
- AUC 97.72% for over_data for Decision Tree
- AUC 98.01% for under_data for Decision Tree
- AUC 98.80% for both_data for Decision Tree
- AUC 99.83% for over_data for Random Forest
- AUC 99.71% for under_data for Random Forest
- AUC 99.82% for both_data for Random Forest
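`pROC::roc()`/`auc()` is the usual route to these numbers; the rank formula below computes the same statistic in base R (AUC is the probability that a random positive scores above a random negative). The scores here are simulated, not the report's:

```r
# AUC via the rank (Mann-Whitney) formula, equivalent to pROC::auc
auc_score <- function(scores, labels) {
  # labels must be 0/1; higher score should mean "more likely positive"
  r  <- rank(scores)
  n1 <- sum(labels == 1)
  n0 <- sum(labels == 0)
  (sum(r[labels == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

set.seed(7)
labels <- rbinom(300, 1, 0.3)
scores <- labels + rnorm(300)   # informative scores, so AUC is well above 0.5
a <- round(auc_score(scores, labels), 3)
a
```

In the report, `scores` would be each model's predicted probabilities on the test set, computed once per model and per balanced training set.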
- CONCLUSION: IF WE GO BY AUC, WE SHOULD MOVE AHEAD WITH RANDOM FOREST, AS IT HAS THE HIGHEST AUC OF ALL THE MODELS