Breast Cancer Gene Expression Profiles (METABRIC)
Clinical attributes, mRNA-level z-scores, and gene mutations for 1,904 patients
@kaggle.raghadalharbi_breast_cancer_gene_expression_profiles_metabric
Most of us know someone who has struggled with breast cancer, or have at least heard about the struggles facing patients who are fighting it. Breast cancer is the most frequent cancer among women, impacting 2.1 million women each year, and it causes the greatest number of cancer-related deaths among women. In 2018 alone, an estimated 627,000 women died from breast cancer.
In general, the most important part of clinical decision-making for cancer patients is the accurate estimation of prognosis and survival duration. Breast cancer patients with the same stage of disease and the same clinical characteristics can have different treatment responses and overall survival, but why?
Cancers are associated with genetic abnormalities. Gene expression measures the level of gene activity in a tissue and gives information about its complex activities. Comparing the genes expressed in normal and diseased tissue can bring better insights into cancer prognosis and outcomes. Applying machine learning techniques to genetic data has the potential to give more accurate estimates of survival time and to help prevent unnecessary surgical and treatment procedures.
The Molecular Taxonomy of Breast Cancer International Consortium (METABRIC) database is a Canada-UK project that contains targeted sequencing data of 1,980 primary breast cancer samples. The clinical and genomic data were downloaded from cBioPortal.
The dataset was collected by Professor Carlos Caldas from Cambridge Research Institute and Professor Sam Aparicio from the British Columbia Cancer Centre in Canada and published in Nature Communications (Pereira et al., 2016). It was also featured in multiple papers, including in Nature and other journals.
Name | Type | Description |
---|---|---|
patient_id | object | Patient ID |
age_at_diagnosis | float | Age of the patient at diagnosis time |
type_of_breast_surgery | object | Breast cancer surgery type: 1- MASTECTOMY, which refers to a surgery to remove all breast tissue from a breast as a way to treat or prevent breast cancer. 2- BREAST CONSERVING, which refers to a surgery where only the part of the breast that has cancer is removed |
cancer_type | object | Breast cancer types: 1- Breast Cancer or 2- Breast Sarcoma |
cancer_type_detailed | object | Detailed Breast cancer types: 1- Breast Invasive Ductal Carcinoma 2- Breast Mixed Ductal and Lobular Carcinoma 3- Breast Invasive Lobular Carcinoma 4- Breast Invasive Mixed Mucinous Carcinoma 5- Metaplastic Breast Cancer |
cellularity | object | Cancer cellularity post chemotherapy, which refers to the amount of tumor cells in the specimen and their arrangement into clusters |
chemotherapy | int | Whether or not the patient had chemotherapy as a treatment (yes/no) |
pam50_+_claudin-low_subtype | object | PAM50 is a tumor profiling test that helps show whether some estrogen receptor-positive (ER-positive), HER2-negative breast cancers are likely to metastasize (when breast cancer spreads to other organs). The claudin-low breast cancer subtype is defined by gene expression characteristics, most prominently low expression of cell–cell adhesion genes, high expression of epithelial–mesenchymal transition (EMT) genes, and stem cell-like/less differentiated gene expression patterns |
cohort | float | Cohort is a group of subjects who share a defining characteristic (It takes a value from 1 to 5) |
er_status_measured_by_ihc | float | Assesses whether estrogen receptors are expressed on cancer cells using immunohistochemistry (a staining technique used in pathology that targets a specific antigen: if the antigen is present, the stain produces a color; if it is absent, it does not) (positive/negative) |
er_status | object | Cancer cells are positive or negative for estrogen receptors |
neoplasm_histologic_grade | int | Determined by pathology by looking at the nature of the cells and whether they look aggressive or not (It takes a value from 1 to 3) |
her2_status_measured_by_snp6 | object | Assesses whether the cancer is positive for HER2 using advanced molecular techniques (a type of next-generation sequencing) |
her2_status | object | Whether the cancer is positive or negative for HER2 |
tumor_other_histologic_subtype | object | Type of the cancer based on microscopic examination of the cancer tissue (It takes a value of 'Ductal/NST', 'Mixed', 'Lobular', 'Tubular/ cribriform', 'Mucinous', 'Medullary', 'Other', 'Metaplastic' ) |
hormone_therapy | int | Whether or not the patient had hormone therapy as a treatment (yes/no) |
inferred_menopausal_state | object | Whether the patient is post-menopausal or not (post/pre) |
integrative_cluster | object | Molecular subtype of the cancer based on some gene expression (It takes a value from '4ER+', '3', '9', '7', '4ER-', '5', '8', '10', '1', '2', '6') |
primary_tumor_laterality | object | Whether it is involving the right breast or the left breast |
lymph_nodes_examined_positive | float | Number of lymph nodes, sampled during surgery, that were found to be involved by the cancer |
mutation_count | float | Number of genes that have relevant mutations |
nottingham_prognostic_index | float | It is used to determine prognosis following surgery for breast cancer. Its value is calculated using three pathological criteria: the size of the tumour; the number of involved lymph nodes; and the grade of the tumour. |
oncotree_code | object | The OncoTree is an open-source ontology that was developed at Memorial Sloan Kettering Cancer Center (MSK) for standardizing cancer type diagnosis from a clinical perspective by assigning each diagnosis a unique OncoTree code. |
overall_survival_months | float | Duration from the time of the intervention to death |
overall_survival | object | Target variable: whether the patient is alive or dead |
pr_status | object | Cancer cells are positive or negative for progesterone receptors |
radio_therapy | int | Whether or not the patient had radiotherapy as a treatment (yes/no) |
3-gene_classifier_subtype | object | Three-gene classifier subtype (It takes a value from 'ER-/HER2-', 'ER+/HER2- High Prolif', nan, 'ER+/HER2- Low Prolif', 'HER2+') |
tumor_size | float | Tumor size measured by imaging techniques |
tumor_stage | float | Stage of the cancer based on the involvement of surrounding structures, lymph nodes and distant spread |
death_from_cancer | int | Whether the patient's death was due to cancer or not (yes/no) |
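The nottingham_prognostic_index entry in the table names three criteria but not how they combine. The sketch below uses the commonly cited form of the index, NPI = 0.2 × tumor size in cm + lymph node score + grade, with a node score of 1 for no positive nodes, 2 for one to three, and 3 for four or more. Treat all of this as an assumption rather than the way the column was derived: the function names are hypothetical, tumor_size is assumed to be in millimetres, and the result has not been checked against the values stored in the dataset.

```python
def lymph_node_score(positive_nodes: int) -> int:
    """Map a count of positive lymph nodes to a 1-3 score (assumed cut-offs)."""
    if positive_nodes < 0:
        raise ValueError("positive_nodes must be non-negative")
    if positive_nodes == 0:
        return 1
    if positive_nodes <= 3:
        return 2
    return 3


def nottingham_prognostic_index(tumor_size_mm: float,
                                positive_nodes: int,
                                grade: int) -> float:
    """Assumed formula: 0.2 * size in cm + node score + histologic grade."""
    if grade not in (1, 2, 3):
        raise ValueError("grade must be 1, 2, or 3")
    size_cm = tumor_size_mm / 10.0  # assumes the size is recorded in mm
    return 0.2 * size_cm + lymph_node_score(positive_nodes) + grade


# Made-up patient: 20 mm tumor, 2 positive nodes, grade 2.
npi = nottingham_prognostic_index(20.0, 2, 2)
print(f"{npi:.2f}")
```

Before relying on a recomputed index, compare it with the stored nottingham_prognostic_index for a few patients, since the dataset may use a different size scaling.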
The genetics part of the dataset contains mRNA-level z-scores for 331 genes and mutations for 175 genes.
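When working with the file, it can help to separate the clinical attributes from the two genetic blocks. The following is a minimal sketch, assuming mutation columns carry a `_mut` suffix and every remaining non-clinical column is an mRNA z-score. That naming convention, the helper name `group_columns`, and the example gene columns are illustrative assumptions, not something this description states.

```python
def group_columns(columns, clinical_names):
    """Split column names into clinical, mutation, and mRNA groups.

    Assumes mutation columns end in '_mut' (hypothetical convention) and
    that anything else outside the clinical set is an mRNA z-score column.
    """
    groups = {"clinical": [], "mutation": [], "mrna": []}
    for name in columns:
        if name in clinical_names:
            groups["clinical"].append(name)
        elif name.endswith("_mut"):
            groups["mutation"].append(name)
        else:
            groups["mrna"].append(name)
    return groups


# A few clinical names from the table above; extend with the full list.
clinical_names = {"patient_id", "age_at_diagnosis", "overall_survival"}

# Illustrative header row; the gene columns here are examples only.
header = ["patient_id", "age_at_diagnosis", "overall_survival",
          "brca1", "tp53", "tp53_mut", "pik3ca_mut"]

groups = group_columns(header, clinical_names)
for kind, names in groups.items():
    print(kind, len(names), names)
```

Checking the group sizes against the counts quoted above is a quick way to confirm whether the suffix assumption holds for the actual file.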
From cBioPortal:
What is mRNA?
In a microarray, the DNA molecules attached to each slide act as probes to detect gene expression, which is also known as the transcriptome or the set of messenger RNA (mRNA) transcripts expressed by a group of genes. To perform a microarray analysis, mRNA molecules are typically collected from both an experimental sample and a reference sample.
What are mRNA Z-Scores?
For mRNA expression data, the relative expression of an individual gene in a tumor is calculated against the gene's expression distribution in a reference population. That reference population is all samples in the study. The returned value indicates the number of standard deviations away from the mean of expression in the reference population (z-score). This measure is useful for determining whether a gene is up- or down-regulated relative to the normal samples or all other tumor samples.
The formula is:
z = (expression in tumor sample - mean expression in reference sample) / standard deviation of expression in reference sample
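The formula can be tried on a handful of numbers. This is a minimal sketch that treats all samples of one gene as the reference population, as described above. The expression values are made up, the function name is hypothetical, and the use of the sample standard deviation (`statistics.stdev`) rather than the population one is an assumption.

```python
from statistics import mean, stdev


def expression_z_scores(values):
    """Z-score each value against the mean and spread of all values."""
    mu = mean(values)
    sigma = stdev(values)  # sample standard deviation (assumed choice)
    if sigma == 0:
        raise ValueError("all values are identical; z-score is undefined")
    return [(v - mu) / sigma for v in values]


# Made-up expression levels of one gene across five tumor samples.
gene_expression = [5.1, 6.3, 5.8, 7.9, 4.9]

z = expression_z_scores(gene_expression)
for raw, score in zip(gene_expression, z):
    print(f"{raw:.1f} -> {score:+.2f}")
```

Positive scores mark samples where the gene is expressed above the reference mean, and negative scores mark samples below it.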