Indians Diabetes V5
This dataset is originally from the National Institute of Diabetes and Digestive and Kidney Diseases.
@kaggle.willianoliveiragibin_indians_diabetes_v5
Mathematical Analysis of a Diabetes Prediction Dataset
Analyzing data to predict the occurrence of diabetes, based on the dataset provided by the National Institute of Diabetes and Digestive and Kidney Diseases, is a clear example of how mathematical and statistical methods can be applied to solve real-world public health problems. This dataset consists of medical predictor variables and a target variable, called Outcome, which indicates whether a patient was diagnosed with diabetes (value 1) or not (value 0). Below are mathematical examples explained in this context.
Example 1: Logistic Regression for Prediction
Logistic regression is widely used in medical studies to predict binary outcomes, such as the presence or absence of a disease. In the context of this dataset, the basic logistic regression formula can be represented as:
P(Y=1) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_n X_n)}}
Where:
P(Y=1) is the probability that a patient has diabetes.
β_0, β_1, …, β_n are the model coefficients.
X_1, X_2, …, X_n are the predictor variables from the dataset.
Using three of these predictors (X_1 = BMI, X_2 = age, X_3 = insulin level), the model can be written as:
P(Y=1) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 \cdot \text{BMI} + \beta_2 \cdot \text{age} + \beta_3 \cdot \text{insulin})}}
Using real data, a trained model could predict whether a patient has diabetes based on these factors.
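As a rough sketch of how such a model could be fitted in Python with scikit-learn, assuming the dataset has been downloaded locally as diabetes.csv and that the relevant columns are named BMI, Age, Insulin, and Outcome (these names are assumptions, not confirmed above):

```python
# Sketch: logistic regression on three predictors (BMI, age, insulin).
# Assumes a local "diabetes.csv" with columns named "BMI", "Age", "Insulin",
# and "Outcome" -- adjust the names to match the actual file.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

df = pd.read_csv("diabetes.csv")
X = df[["BMI", "Age", "Insulin"]]
y = df["Outcome"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# model.intercept_ corresponds to beta_0, model.coef_ to beta_1..beta_3.
print("Intercept (beta_0):", model.intercept_[0])
print("Coefficients (beta_1..beta_3):", model.coef_[0])

# Predicted probabilities P(Y=1) for the held-out patients.
probabilities = model.predict_proba(X_test)[:, 1]
print("First five predicted probabilities:", probabilities[:5])
```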
Example 2: Statistical Correlation Analysis
An essential part of exploratory data analysis is calculating the correlation between predictor variables and the target variable. The formula for Pearson's correlation coefficient is:
r = \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum (X_i - \bar{X})^2} \, \sqrt{\sum (Y_i - \bar{Y})^2}}
Suppose we want to analyze the correlation between Body Mass Index (BMI) and the likelihood of developing diabetes. When calculating r, a value close to 1 indicates a strong positive correlation, while a value near 0 suggests little or no correlation.
For example, given a small sample of patients with their BMI values and the corresponding outcomes Outcome = [0, 1, 1], calculating r would tell us whether higher BMI is associated with an increased chance of diabetes.
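A short sketch of this calculation, again assuming a local diabetes.csv with columns named BMI and Outcome (assumed names), computing r both with SciPy and directly from the formula above:

```python
# Sketch: Pearson correlation between BMI and Outcome.
# Assumes "diabetes.csv" with columns "BMI" and "Outcome" (assumed names).
import numpy as np
import pandas as pd
from scipy.stats import pearsonr

df = pd.read_csv("diabetes.csv")
x = df["BMI"].to_numpy(dtype=float)
y = df["Outcome"].to_numpy(dtype=float)

# Library version.
r, p_value = pearsonr(x, y)

# Direct application of the formula above.
r_manual = np.sum((x - x.mean()) * (y - y.mean())) / (
    np.sqrt(np.sum((x - x.mean()) ** 2)) * np.sqrt(np.sum((y - y.mean()) ** 2))
)

print(f"Pearson r (scipy):  {r:.3f}  (p = {p_value:.4g})")
print(f"Pearson r (manual): {r_manual:.3f}")
```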
Example 3: Model Evaluation Using a Confusion Matrix
After creating a predictive model, it is essential to evaluate its performance. A confusion matrix for this problem would look like this:
                          Predicted: Non-Diabetic (0)    Predicted: Diabetic (1)
Actual: Non-Diabetic (0)  True Negatives (TN)            False Positives (FP)
Actual: Diabetic (1)      False Negatives (FN)           True Positives (TP)
Using the values from the confusion matrix, metrics such as the following can be calculated:
\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}

\text{Precision} = \frac{TP}{TP + FP}

\text{Recall} = \frac{TP}{TP + FN}
For instance, if TP = 50, TN = 40, FP = 10, and FN = 20:

\text{Accuracy} = \frac{50 + 40}{50 + 40 + 10 + 20} = \frac{90}{120} = 0.75 \ (75\%)
These metrics are critical to fine-tune the model and make it more effective.
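The three metrics can be reproduced directly from the counts used in the worked example above (TP = 50, TN = 40, FP = 10, FN = 20); a minimal sketch:

```python
# Sketch: accuracy, precision, and recall from the confusion-matrix counts
# used in the worked example above (TP=50, TN=40, FP=10, FN=20).
tp, tn, fp, fn = 50, 40, 10, 20

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)

print(f"Accuracy:  {accuracy:.2f}")   # 0.75
print(f"Precision: {precision:.2f}")  # 0.83
print(f"Recall:    {recall:.2f}")     # 0.71
```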
Example 4: Clustering with K-Means
Although clustering is not directly used to predict diabetes, it can help identify patterns in the data. The K-Means algorithm groups patients into k clusters based on their features. The formula to update the centroid C_k of a cluster is:

C_k = \frac{1}{|S_k|} \sum_{x \in S_k} x
Where:
S_k is the set of points in cluster k.
x represents the data points.
If we wanted to cluster patients based on glucose levels and BMI, the algorithm would iteratively adjust the centroids until the sum of squared distances within clusters is minimized.
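A minimal sketch of such a clustering with scikit-learn, assuming the same local diabetes.csv with columns named Glucose and BMI (assumed names) and an arbitrary illustrative choice of k = 3:

```python
# Sketch: K-Means clustering of patients on glucose level and BMI.
# Assumes "diabetes.csv" with columns "Glucose" and "BMI" (assumed names);
# k = 3 is an arbitrary illustrative choice.
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("diabetes.csv")
features = df[["Glucose", "BMI"]]

# Standardize so both features contribute comparably to the distances.
scaled = StandardScaler().fit_transform(features)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(scaled)

# Each centroid C_k is the mean of the points assigned to cluster k (in scaled space).
print("Centroids (scaled space):")
print(kmeans.cluster_centers_)
print("Cluster sizes:", pd.Series(labels).value_counts().to_dict())
```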