Indians Diabetes V5
This dataset is originally from the National Institute of Diabetes and Digestive and Kidney Diseases.
@kaggle.willianoliveiragibin_indians_diabetes_v5
Mathematical Analysis of a Diabetes Prediction Dataset
Analyzing data to predict the occurrence of diabetes, based on the dataset provided by the National Institute of Diabetes and Digestive and Kidney Diseases, is a clear example of how mathematical and statistical methods can be applied to solve real-world public health problems. This dataset consists of medical predictor variables and a target variable, called Outcome, which indicates whether a patient was diagnosed with diabetes (value 1) or not (value 0). Below are mathematical examples explained in this context.
Example 1: Logistic Regression for Prediction
Logistic regression is widely used in medical studies to predict binary outcomes, such as the presence or absence of a disease. In the context of this dataset, the basic logistic regression formula can be represented as:
P(Y=1) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_n X_n)}}
Where:
P(Y=1) is the probability that a patient has diabetes.
β_0, β_1, …, β_n are the model coefficients.
X_1, X_2, …, X_n are the predictor variables from the dataset.
Using three of these predictors (X_1 = BMI, X_2 = age, X_3 = insulin level), the model can be written as:
P(Y=1) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 \cdot \text{BMI} + \beta_2 \cdot \text{age} + \beta_3 \cdot \text{insulin})}}
Using real data, a trained model could predict whether a patient has diabetes based on these factors.
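As a rough sketch of how such a model could be fitted in Python with scikit-learn, assuming the dataset has been downloaded locally as diabetes.csv and that the relevant columns are named BMI, Age, Insulin, and Outcome (these names are assumptions, not confirmed above):

```python
# Sketch: logistic regression on three predictors (BMI, age, insulin).
# Assumes a local "diabetes.csv" with columns named "BMI", "Age", "Insulin",
# and "Outcome" -- adjust the names to match the actual file.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

df = pd.read_csv("diabetes.csv")
X = df[["BMI", "Age", "Insulin"]]
y = df["Outcome"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# model.intercept_ corresponds to beta_0, model.coef_ to beta_1..beta_3.
print("Intercept (beta_0):", model.intercept_[0])
print("Coefficients (beta_1..beta_3):", model.coef_[0])

# Predicted probabilities P(Y=1) for the held-out patients.
probabilities = model.predict_proba(X_test)[:, 1]
print("First five predicted probabilities:", probabilities[:5])
```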
Example 2: Statistical Correlation Analysis
An essential part of exploratory data analysis is calculating the correlation between predictor variables and the target variable. The formula for Pearson's correlation coefficient is:
r = \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum (X_i - \bar{X})^2} \, \sqrt{\sum (Y_i - \bar{Y})^2}}
Suppose we want to analyze the correlation between Body Mass Index (BMI) and the likelihood of developing diabetes. When calculating r, a value close to 1 indicates a strong positive correlation, while a value near 0 suggests little or no correlation.
For example, given a small sample of patients with their BMI values and the corresponding outcomes Outcome = [0, 1, 1], calculating r would tell us whether higher BMI is associated with an increased chance of diabetes.
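A short sketch of this calculation, again assuming a local diabetes.csv with columns named BMI and Outcome (assumed names), computing r both with SciPy and directly from the formula above:

```python
# Sketch: Pearson correlation between BMI and Outcome.
# Assumes "diabetes.csv" with columns "BMI" and "Outcome" (assumed names).
import numpy as np
import pandas as pd
from scipy.stats import pearsonr

df = pd.read_csv("diabetes.csv")
x = df["BMI"].to_numpy(dtype=float)
y = df["Outcome"].to_numpy(dtype=float)

# Library version.
r, p_value = pearsonr(x, y)

# Direct application of the formula above.
r_manual = np.sum((x - x.mean()) * (y - y.mean())) / (
    np.sqrt(np.sum((x - x.mean()) ** 2)) * np.sqrt(np.sum((y - y.mean()) ** 2))
)

print(f"Pearson r (scipy):  {r:.3f}  (p = {p_value:.4g})")
print(f"Pearson r (manual): {r_manual:.3f}")
```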
Example 3: Model Evaluation Using a Confusion Matrix
After creating a predictive model, it is essential to evaluate its performance. A confusion matrix for this problem would look like this:
                          Predicted: Non-Diabetic (0)    Predicted: Diabetic (1)
Actual: Non-Diabetic (0)  True Negatives (TN)            False Positives (FP)
Actual: Diabetic (1)      False Negatives (FN)           True Positives (TP)
Using the values from the confusion matrix, metrics such as the following can be calculated:
\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}

\text{Precision} = \frac{TP}{TP + FP}

\text{Recall} = \frac{TP}{TP + FN}
For instance, if TP = 50, TN = 40, FP = 10, and FN = 20:

\text{Accuracy} = \frac{50 + 40}{50 + 40 + 10 + 20} = \frac{90}{120} = 0.75 \ (75\%)
These metrics are critical to fine-tune the model and make it more effective.
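The three metrics can be reproduced directly from the counts used in the worked example above (TP = 50, TN = 40, FP = 10, FN = 20); a minimal sketch:

```python
# Sketch: accuracy, precision, and recall from the confusion-matrix counts
# used in the worked example above (TP=50, TN=40, FP=10, FN=20).
tp, tn, fp, fn = 50, 40, 10, 20

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)

print(f"Accuracy:  {accuracy:.2f}")   # 0.75
print(f"Precision: {precision:.2f}")  # 0.83
print(f"Recall:    {recall:.2f}")     # 0.71
```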
Example 4: Clustering with K-Means
Although clustering is not directly used to predict diabetes, it can help identify patterns in the data. The K-Means algorithm groups patients into k clusters based on their features. The formula to update the centroid C_k of a cluster is:

C_k = \frac{1}{|S_k|} \sum_{x \in S_k} x
Where:
S_k is the set of points in cluster k.
x represents the data points.
If we wanted to cluster patients based on glucose levels and BMI, the algorithm would iteratively adjust the centroids until the sum of squared distances within clusters is minimized.
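A minimal sketch of such a clustering with scikit-learn, assuming the same local diabetes.csv with columns named Glucose and BMI (assumed names) and an arbitrary illustrative choice of k = 3:

```python
# Sketch: K-Means clustering of patients on glucose level and BMI.
# Assumes "diabetes.csv" with columns "Glucose" and "BMI" (assumed names);
# k = 3 is an arbitrary illustrative choice.
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("diabetes.csv")
features = df[["Glucose", "BMI"]]

# Standardize so both features contribute comparably to the distances.
scaled = StandardScaler().fit_transform(features)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(scaled)

# Each centroid C_k is the mean of the points assigned to cluster k (in scaled space).
print("Centroids (scaled space):")
print(kmeans.cluster_centers_)
print("Cluster sizes:", pd.Series(labels).value_counts().to_dict())
```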