Epigenetic Biomarkers for Early Detection and Risk Assessment of Retinal Disease
Dataset Description
About This Dataset
This dataset is designed for researchers, bioinformaticians, and machine learning practitioners working on:
- Epigenetic biomarker discovery
- DNA methylation analysis
- Retinal disease prediction
- Binary classification tasks in genomics
- Feature-based machine learning in bioinformatics
The dataset focuses on CpG (Cytosine-phosphate-Guanine) sites and their methylation patterns, which play crucial roles in gene regulation and disease development. Retinitis Pigmentosa is a group of rare genetic disorders that involve progressive vision loss, and epigenetic factors like DNA methylation may serve as early detection biomarkers.
Dataset Statistics
- Total Records: 1,000 samples
- Total Columns: 5 (4 predictor features plus the target)
- Target Variable: methylation_status (Binary: 0 = Unmethylated, 1 = Methylated)
- Class Distribution:
- Methylated (1): 554 samples (55.4%)
- Unmethylated (0): 446 samples (44.6%)
- Missing Values: None (Complete dataset)
- Data Quality: Complete and normalized (all feature values lie between 0 and 1); ready for ML/DL applications without further cleaning
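The statistics above can be checked directly once the file is loaded. The following is a minimal sketch, assuming pandas is available; the helper name `dataset_summary` and the four-row demo frame are illustrative assumptions and are not rows from the real dataset.

```python
import pandas as pd

TARGET = "methylation_status"  # target column name from the description above


def dataset_summary(df: pd.DataFrame) -> dict:
    """Return row/column counts, total missing values, and class balance."""
    counts = df[TARGET].value_counts().sort_index()
    n_rows = len(df)
    return {
        "n_rows": n_rows,
        "n_columns": df.shape[1],
        "n_missing": int(df.isna().sum().sum()),
        "class_counts": {int(k): int(v) for k, v in counts.items()},
        "class_share": {int(k): round(float(v) / n_rows, 3) for k, v in counts.items()},
    }


# Tiny synthetic frame for illustration only (hypothetical values).
demo = pd.DataFrame({
    "cpg_density": [0.9, 0.4, 0.7, 0.8],
    "genomic_location": [0.1, 0.5, 0.6, 0.9],
    "regulatory_score": [0.2, 0.8, 0.5, 0.3],
    "conservation_score": [0.7, 0.6, 0.9, 0.4],
    TARGET: [1, 0, 1, 1],
})

summary = dataset_summary(demo)
print(summary)
```

Run against the full file, the same helper should report 1,000 rows, 5 columns, no missing values, and class counts of 446 (label 0) and 554 (label 1) if the figures listed above hold.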
Column Descriptions
1. cpg_density (float64)
- Description: Density of CpG sites in the genomic region
- Range: 0.052 to 1.000
- Mean: 0.698
- Standard Deviation: 0.183
- Interpretation: Higher values indicate CpG-rich regions (CpG islands), which are often associated with gene promoters
2. genomic_location (float64)
- Description: Normalized genomic position within the chromosome or region
- Range: 0.000 to 1.000
- Mean: 0.612
- Standard Deviation: 0.234
- Interpretation: Represents the relative position of the CpG site; 0 = start, 1 = end of region
3. regulatory_score (float64)
- Description: Score indicating the regulatory potential of the genomic region
- Range: 0.000 to 1.000
- Mean: 0.460
- Standard Deviation: 0.291
- Interpretation: Higher scores suggest regions with higher regulatory activity (e.g., promoters, enhancers)
4. conservation_score (float64)
- Description: Evolutionary conservation score across species
- Range: 0.064 to 1.000
- Mean: 0.643
- Standard Deviation: 0.199
- Interpretation: Higher scores indicate regions conserved across evolution, suggesting functional importance
5. methylation_status (int64) - TARGET VARIABLE
- Description: Binary label indicating DNA methylation status
- Values:
- 0: Unmethylated (44.6% of samples)
- 1: Methylated (55.4% of samples)
- Interpretation: Indicates whether the CpG site is methylated, which can affect gene expression
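The column descriptions above amount to a small schema: four float64 features bounded by 0 and 1, and an int64 binary target. A sketch of a check against that schema might look like the following; `validate_schema` is a hypothetical helper, and the demo frames are made-up values used only to exercise it.

```python
import pandas as pd

# Expected dtypes, taken from the column descriptions above.
EXPECTED_DTYPES = {
    "cpg_density": "float64",
    "genomic_location": "float64",
    "regulatory_score": "float64",
    "conservation_score": "float64",
    "methylation_status": "int64",
}
TARGET = "methylation_status"


def validate_schema(df: pd.DataFrame) -> list:
    """Return a list of human-readable problems; an empty list means the frame passes."""
    problems = []
    for col, dtype in EXPECTED_DTYPES.items():
        if col not in df.columns:
            problems.append(f"missing column: {col}")
            continue
        if str(df[col].dtype) != dtype:
            problems.append(f"{col}: dtype {df[col].dtype}, expected {dtype}")
            continue
        if col == TARGET:
            # The target must contain only the two documented labels.
            if not df[col].isin([0, 1]).all():
                problems.append(f"{col}: values other than 0/1")
        elif not df[col].between(0.0, 1.0).all():
            # Features are described as lying within the 0 to 1 range.
            problems.append(f"{col}: values outside [0, 1]")
    return problems


# Hypothetical frames for illustration only.
good = pd.DataFrame({
    "cpg_density": [0.70, 0.35],
    "genomic_location": [0.61, 0.12],
    "regulatory_score": [0.46, 0.90],
    "conservation_score": [0.64, 0.20],
    TARGET: pd.Series([1, 0], dtype="int64"),
})
bad = good.assign(cpg_density=[1.40, 0.35]).drop(columns=["regulatory_score"])

good_problems = validate_schema(good)
bad_problems = validate_schema(bad)
print(good_problems)
print(bad_problems)
```

Returning a list of problems rather than raising on the first one makes it easier to see every mismatch in a single pass.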
Use Cases and Applications
1. Binary Classification
- Predict methylation status based on genomic features
- Suitable for: Logistic Regression, SVM, Random Forest, XGBoost, Neural Networks
2. Feature Importance Analysis
- Identify which genomic features most strongly predict methylation status
- Tools: SHAP, LIME, feature permutation importance
3. Biomarker Discovery
- Identify epigenetic patterns associated with Retinitis Pigmentosa
- Potential for early disease detection and risk assessment
4. Model Benchmarking
- Compare performance of different ML algorithms on genomic data
- Baseline accuracy: ~97% (as demonstrated in the accompanying project)
5. Educational Purposes
- Teach bioinformatics and machine learning concepts
- Demonstrate real-world genomic data analysis workflows
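As an illustration of the classification and feature-importance use cases, here is a minimal baseline sketch using scikit-learn's logistic regression and permutation importance. It is not the accompanying project's method and makes no attempt to reproduce the ~97% figure. The helper `fit_baseline`, the 80/20 split, and the synthetic stand-in data (whose labels follow an arbitrary rule on `cpg_density`) are all assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

FEATURES = ["cpg_density", "genomic_location", "regulatory_score", "conservation_score"]
TARGET = "methylation_status"


def fit_baseline(df: pd.DataFrame, seed: int = 0):
    """Fit a logistic-regression baseline; return model, test accuracy, importances."""
    X_train, X_test, y_train, y_test = train_test_split(
        df[FEATURES], df[TARGET],
        test_size=0.2, stratify=df[TARGET], random_state=seed,
    )
    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    accuracy = model.score(X_test, y_test)
    # Permutation importance: mean drop in accuracy when one feature is shuffled.
    result = permutation_importance(
        model, X_test, y_test, n_repeats=10, random_state=seed
    )
    importances = dict(zip(FEATURES, result.importances_mean))
    return model, accuracy, importances


# Synthetic stand-in data (NOT the real dataset): labels depend only on cpg_density.
rng = np.random.default_rng(0)
demo = pd.DataFrame(rng.random((200, 4)), columns=FEATURES)
demo[TARGET] = (demo["cpg_density"] > 0.5).astype("int64")

model, accuracy, importances = fit_baseline(demo)
print(f"held-out accuracy: {accuracy:.3f}")
for name, value in sorted(importances.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {value:.3f}")
```

On the real data, replace `demo` with the loaded DataFrame. Stratifying the split keeps the roughly 55/45 class balance in both partitions, and the same function can be reused with other estimators (for example a random forest) for benchmarking.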
Data Format
- File Format: CSV (Comma-Separated Values)
- File Size: ~80 KB
- Encoding: UTF-8
- Delimiter: Comma (,)
- Header: Yes (first row contains column names)
- Ready for: Python (pandas), R, MATLAB, Excel
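Given the format listed above, loading the file with pandas could look like the sketch below. The dataset's actual filename is not stated here, so the example reads from an in-memory sample whose two data rows are invented for illustration; `load_methylation_csv` is a hypothetical helper name.

```python
import io

import pandas as pd


def load_methylation_csv(source) -> pd.DataFrame:
    """Read a comma-delimited, UTF-8 CSV whose first row holds the column names.

    `source` may be a file path or a file-like object.
    """
    return pd.read_csv(source, sep=",", header=0, encoding="utf-8")


# In-memory stand-in for the CSV file (hypothetical values, same column layout).
sample_csv = io.StringIO(
    "cpg_density,genomic_location,regulatory_score,conservation_score,methylation_status\n"
    "0.812,0.430,0.275,0.690,1\n"
    "0.356,0.915,0.640,0.502,0\n"
)

df = load_methylation_csv(sample_csv)
print(df.dtypes)
print(df.head())
```

To read the real file, pass its path in place of `sample_csv`. The explicit `sep`, `header`, and `encoding` arguments match pandas' defaults and are spelled out only to mirror the format description.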