Medical Student Dataset
The Medical Student Dataset is a simulated dataset of 100,000 rows and 12 columns. It is designed to mimic real-world data encountered in medical education and research, and it deliberately includes common data-quality problems such as missing values, duplicates, and inconsistencies.
Dataset Description
The dataset consists of the following columns:
- StudentID: Unique identifier for each medical student.
- Gender: Gender of the student (e.g., Male, Female).
- Age: Age of the student in years.
- Ethnicity: Ethnicity of the student.
- Year: Academic year of the student.
- University: Name of the university where the student is enrolled.
- GPA: Grade Point Average of the student.
- MCAT Score: Medical College Admission Test (MCAT) score of the student.
- Clinical Experience: Indicator of whether the student has previous clinical experience (Yes/No).
- Research Experience: Indicator of whether the student has previous research experience (Yes/No).
- Publication Count: Number of publications attributed to the student.
- Exam Score: Performance score on a standardized medical examination.
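As a quick sanity check after loading, the twelve column names listed above can be compared against the headers of a DataFrame. The sketch below assumes the file headers are spelled exactly as in this description, which may not hold for the actual file. The `check_columns` helper and the `sample` frame (including its `Height` column) are hypothetical and only there for illustration.

```python
import pandas as pd

# Column names as spelled in the description above; the real file headers
# may differ (assumption).
EXPECTED_COLUMNS = [
    "StudentID", "Gender", "Age", "Ethnicity", "Year", "University",
    "GPA", "MCAT Score", "Clinical Experience", "Research Experience",
    "Publication Count", "Exam Score",
]


def check_columns(df: pd.DataFrame) -> dict:
    """Report expected columns that are absent and columns that are not expected."""
    present = list(df.columns)
    return {
        "missing": [c for c in EXPECTED_COLUMNS if c not in present],
        "unexpected": [c for c in present if c not in EXPECTED_COLUMNS],
    }


# Tiny hand-made frame standing in for the real file (hypothetical values).
sample = pd.DataFrame(
    {"StudentID": [1, 2], "Gender": ["Male", "Female"], "Height": [170, 165]}
)
report = check_columns(sample)
print(report)
```

With the real data, `sample` would be replaced by the frame returned from `pd.read_csv` on the downloaded file.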
Data Preprocessing Issues
The dataset has been intentionally created to include various preprocessing issues, such as:
- Missing values: Some columns may have missing values represented as NaN.
- Duplicates: Duplicate records may exist in the dataset, representing identical student entries.
- Inconsistencies: The dataset may contain inconsistent or erroneous values in certain columns.
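The three issue types above can each be measured with a few lines of pandas. This is a minimal sketch on a small hand-made frame, not on the dataset itself. The rows are hypothetical, and the value `"maybe"` is an invented example of an inconsistency, since the description does not say which erroneous values actually occur.

```python
import numpy as np
import pandas as pd

# Hypothetical rows for illustration only; not taken from the dataset.
df = pd.DataFrame({
    "StudentID": [1, 2, 2, 3, 4],
    "Gender": ["Male", "Female", "Female", None, "Male"],
    "Age": [24.0, 26.0, 26.0, np.nan, 25.0],
    "Clinical Experience": ["Yes", "No", "No", "Yes", "maybe"],
})

# Missing values: count NaN/None entries per column.
missing_per_column = df.isna().sum()

# Duplicates: rows identical to an earlier row in every column.
duplicate_rows = int(df.duplicated().sum())

# Inconsistencies: values outside the documented Yes/No coding (missing ignored).
flag = df["Clinical Experience"]
unexpected_flags = flag[flag.notna() & ~flag.isin(["Yes", "No"])]

print(missing_per_column)
print("duplicate rows:", duplicate_rows)
print("unexpected flag values:", unexpected_flags.tolist())
```

The same pattern extends to other columns, for example checking that numeric fields fall in a plausible range, once the valid range for each field is known.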
Data Usage
This dataset can be used for data cleaning and preprocessing exercises, for exploring data analysis techniques, and for evaluating machine learning algorithms. It offers practice with the kinds of data problems that arise in medical education and research.
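A cleaning exercise on this data might start with something like the sketch below, which drops exact duplicate rows and fills gaps in selected numeric columns with the column median. Median imputation is only one of several reasonable choices and is an assumption here, not a method prescribed by the dataset. The `basic_clean` helper and the `raw` rows are hypothetical.

```python
import numpy as np
import pandas as pd


def basic_clean(df: pd.DataFrame, numeric_cols: list) -> pd.DataFrame:
    """Drop exact duplicate rows, then fill gaps in numeric_cols with each column's median."""
    cleaned = df.drop_duplicates().reset_index(drop=True)
    # Medians are computed after de-duplication so repeated rows do not skew them.
    medians = cleaned[numeric_cols].median()
    cleaned[numeric_cols] = cleaned[numeric_cols].fillna(medians)
    return cleaned


# Hypothetical rows for illustration only; not taken from the dataset.
raw = pd.DataFrame({
    "StudentID": [1, 2, 2, 3],
    "GPA": [3.2, 3.8, 3.8, np.nan],
    "Exam Score": [70.0, np.nan, np.nan, 90.0],
})

# StudentID is left out of numeric_cols: imputing an identifier makes no sense.
clean = basic_clean(raw, numeric_cols=["GPA", "Exam Score"])
print(clean)
```

Comparing model results with and without such a step is one way to use the dataset for evaluating how preprocessing choices affect a machine learning algorithm.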