Name: Loan Approval Classification Dataset
Creator: Kaggle
License: https://www.apache.org/licenses/

About this Dataset

1. Data Source

This dataset is a synthetic version inspired by the original Credit Risk dataset on Kaggle, enriched with additional variables based on the Financial Risk for Loan Approval data. SMOTENC was used to generate new synthetic data points and increase the number of instances. The dataset contains both categorical and continuous features.

2. Metadata

The dataset contains 45,000 records and 14 variables, each described below:

Column	Description	Type
`person_age`	Age of the person	Float
`person_gender`	Gender of the person	Categorical
`person_education`	Highest education level	Categorical
`person_income`	Annual income	Float
`person_emp_exp`	Years of employment experience	Integer
`person_home_ownership`	Home ownership status (e.g., rent, own, mortgage)	Categorical
`loan_amnt`	Loan amount requested	Float
`loan_intent`	Purpose of the loan	Categorical
`loan_int_rate`	Loan interest rate	Float
`loan_percent_income`	Loan amount as a percentage of annual income	Float
`cb_person_cred_hist_length`	Length of credit history in years	Float
`credit_score`	Credit score of the person	Integer
`previous_loan_defaults_on_file`	Indicator of previous loan defaults	Categorical
`loan_status` (target variable)	Loan approval status: 1 = approved; 0 = rejected	Integer
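
The column types listed above can be written down as a small schema mapping, which is handy when rows arrive as strings (for example from a CSV export). The following is a minimal sketch: `SCHEMA` and `parse_row` are illustrative names rather than part of the dataset, and the sample row is made up for demonstration.

```python
# Column name -> Python type, following the metadata table above.
SCHEMA = {
    "person_age": float,
    "person_gender": str,
    "person_education": str,
    "person_income": float,
    "person_emp_exp": int,
    "person_home_ownership": str,
    "loan_amnt": float,
    "loan_intent": str,
    "loan_int_rate": float,
    "loan_percent_income": float,
    "cb_person_cred_hist_length": float,
    "credit_score": int,
    "previous_loan_defaults_on_file": str,
    "loan_status": int,
}


def parse_row(raw):
    """Convert a dict of strings into typed values according to SCHEMA."""
    missing = set(SCHEMA) - set(raw)
    if missing:
        raise KeyError(f"missing columns: {sorted(missing)}")
    typed = {}
    for col, cast in SCHEMA.items():
        value = raw[col]
        # Integer columns may be serialized as "3.0", so go through float first.
        typed[col] = int(float(value)) if cast is int else cast(value)
    return typed


# Made-up sample row (not taken from the dataset).
sample = {
    "person_age": "30.0",
    "person_gender": "female",
    "person_education": "Bachelor",
    "person_income": "50000.0",
    "person_emp_exp": "5",
    "person_home_ownership": "RENT",
    "loan_amnt": "10000.0",
    "loan_intent": "EDUCATION",
    "loan_int_rate": "11.5",
    "loan_percent_income": "0.2",
    "cb_person_cred_hist_length": "5.0",
    "credit_score": "650",
    "previous_loan_defaults_on_file": "No",
    "loan_status": "1",
}

typed = parse_row(sample)
print(typed["person_age"], typed["credit_score"], typed["loan_status"])
```

The category values used in the sample (such as `RENT` or `No`) are assumptions; check the actual distinct values in the data before relying on them.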

3. Data Usage

The dataset can be used for multiple purposes:

Exploratory Data Analysis (EDA): Analyze key features, distribution patterns, and relationships to understand credit risk factors.
Classification: Build predictive models to classify the loan_status variable (approved/not approved) for potential applicants.
Regression: Develop regression models to predict the credit_score variable based on individual and loan-related attributes.
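
As a starting point for the EDA use case above, the approval rate can be broken down by any categorical column. The sketch below uses only the standard library; `approval_rate_by` is a hypothetical helper, and the toy rows are made up rather than drawn from the dataset.

```python
from collections import defaultdict


def approval_rate_by(rows, column):
    """Return {category: share of rows with loan_status == 1} for one column."""
    totals = defaultdict(int)
    approved = defaultdict(int)
    for row in rows:
        key = row[column]
        totals[key] += 1
        approved[key] += row["loan_status"]  # loan_status is 1 or 0
    return {key: approved[key] / totals[key] for key in totals}


# Made-up toy rows for illustration only.
toy_rows = [
    {"person_home_ownership": "RENT", "loan_status": 1},
    {"person_home_ownership": "RENT", "loan_status": 0},
    {"person_home_ownership": "RENT", "loan_status": 1},
    {"person_home_ownership": "OWN", "loan_status": 0},
    {"person_home_ownership": "MORTGAGE", "loan_status": 0},
]

rates = approval_rate_by(toy_rows, "person_home_ownership")
for category, rate in sorted(rates.items()):
    print(f"{category}: {rate:.2f}")
```

The same breakdown is a useful sanity check before classification, since a column with very different approval rates across categories is likely to be a strong predictor.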

Be aware of data quality issues inherited from the original data, such as records with a `person_age` above 100 years.
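
One way to handle the implausible ages is to separate those records before modeling instead of silently dropping them. This is a minimal sketch: the 100-year threshold follows the note above, while `split_by_age` and the example rows are assumptions made for illustration.

```python
MAX_PLAUSIBLE_AGE = 100  # threshold taken from the note above; adjust as needed


def split_by_age(rows, max_age=MAX_PLAUSIBLE_AGE):
    """Split rows into (kept, flagged), flagging any person_age above max_age."""
    kept, flagged = [], []
    for row in rows:
        if row["person_age"] > max_age:
            flagged.append(row)
        else:
            kept.append(row)
    return kept, flagged


# Made-up rows for illustration only.
example_rows = [
    {"person_age": 22.0, "loan_status": 1},
    {"person_age": 144.0, "loan_status": 0},
    {"person_age": 35.0, "loan_status": 0},
]

kept, flagged = split_by_age(example_rows)
print(len(kept), "kept;", len(flagged), "flagged")
```

Keeping the flagged rows around makes it possible to inspect them and decide whether to drop, cap, or correct the values.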

This dataset provides a rich basis for understanding financial risk factors and simulating predictive modeling processes for loan approval and credit scoring.

Tables

Loan Data

@kaggle.taweilo_loan_approval_classification_data.loan_data

662.5 kB
45,000 rows
14 columns

CREATE TABLE loan_data (
  "person_age" DOUBLE,
  "person_gender" VARCHAR,
  "person_education" VARCHAR,
  "person_income" DOUBLE,
  "person_emp_exp" BIGINT,
  "person_home_ownership" VARCHAR,
  "loan_amnt" DOUBLE,
  "loan_intent" VARCHAR,
  "loan_int_rate" DOUBLE,
  "loan_percent_income" DOUBLE,
  "cb_person_cred_hist_length" DOUBLE,
  "credit_score" BIGINT,
  "previous_loan_defaults_on_file" VARCHAR,
  "loan_status" BIGINT
);
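
The table definition above can be tried out locally with Python's built-in `sqlite3` module. This is a sketch under assumptions: SQLite is used here only for convenience (it accepts the `DOUBLE`, `VARCHAR`, and `BIGINT` type names through its type-affinity rules), and the three inserted rows are made up, not taken from the dataset.

```python
import sqlite3

DDL = """
CREATE TABLE loan_data (
  "person_age" DOUBLE,
  "person_gender" VARCHAR,
  "person_education" VARCHAR,
  "person_income" DOUBLE,
  "person_emp_exp" BIGINT,
  "person_home_ownership" VARCHAR,
  "loan_amnt" DOUBLE,
  "loan_intent" VARCHAR,
  "loan_int_rate" DOUBLE,
  "loan_percent_income" DOUBLE,
  "cb_person_cred_hist_length" DOUBLE,
  "credit_score" BIGINT,
  "previous_loan_defaults_on_file" VARCHAR,
  "loan_status" BIGINT
)
"""

# Made-up rows in column order; values are illustrative only.
rows = [
    (30.0, "female", "Bachelor", 50000.0, 5, "RENT", 10000.0,
     "EDUCATION", 11.5, 0.2, 5.0, 650, "No", 1),
    (45.0, "male", "Master", 90000.0, 20, "MORTGAGE", 15000.0,
     "MEDICAL", 9.0, 0.17, 12.0, 700, "Yes", 0),
    (27.0, "male", "High School", 30000.0, 3, "OWN", 6000.0,
     "PERSONAL", 13.0, 0.2, 4.0, 600, "No", 0),
]

conn = sqlite3.connect(":memory:")
conn.execute(DDL)

placeholders = ", ".join("?" * 14)  # one placeholder per column
conn.executemany(f"INSERT INTO loan_data VALUES ({placeholders})", rows)

# Row count and mean credit score per target class.
summary = conn.execute(
    """
    SELECT loan_status, COUNT(*), AVG(credit_score)
    FROM loan_data
    GROUP BY loan_status
    ORDER BY loan_status
    """
).fetchall()

for status, count, avg_score in summary:
    print(status, count, avg_score)

conn.close()
```

With the full dataset loaded in place of the made-up rows, the same query gives a quick view of class balance and how credit score differs between approved and rejected loans.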