Prediction of the pathologically complete response after neoadjuvant chemoradiotherapy for oesophageal cancer (Toxopeus, E. L. A., Nieboer, D., Shapiro, J., Biermann, K., van der Gaast, A., van Rij, C. M., ... & Wijnhoven, B. P. L. (2015). Nomogram for predicting pathologically complete response after neoadjuvant chemoradiotherapy for oesophageal cancer. Radiotherapy and Oncology, 115(3), 392-398.)
Columns:
- Patient ID - as UUID,
- Patient age - integer, based on mean and range from the corresponding paper,
- Patient sex - binary, based on the data distribution in the corresponding paper,
- Tumor type - categorical, based on the data distribution in the corresponding paper,
- Differentiation grade - ordinal, based on the data distribution in the corresponding paper,
- T-stage - ordinal, based on the data distribution in the corresponding paper,
- N-stage - ordinal, based on the data distribution in the corresponding paper,
- M-stage - ordinal, based on the data distribution in the corresponding paper,
- Overall stage - ordinal, based on the lookup table https://www.researchgate.net/figure/Esophageal-cancer-staging-The-TNM-tumor-node-and-metastasis-staging-system-takes_fig1_274257853,
- Survival time - float, based on the Kaplan-Meier curves for this diagnosis https://www.researchgate.net/figure/Kaplan-Meier-curves-for-overall-survival-in-patients-with-esophageal-cancer-treated_fig2_282245546.
Outcome:
- Treatment response, based on the nomogram from the corresponding paper, which was turned into a computational predictive model in Halilaj I, Oberije C, Chatterjee A, van Wijk Y, Rad NM, Galganebanduge P, Lavrova E, Primakov S, Widaatalla Y, Wind A, et al. Open Source Repository and Online Calculator of Prediction Models for Diagnosis and Prognosis in Oncology. Biomedicines. 2022; 10(11):2679. https://doi.org/10.3390/biomedicines10112679.
For information
Cancer stages:
- T: main tumor: 0 - not found, 1-4 - size and extent, X - cannot be measured,
- N: nearby lymph nodes affected: 0 - not found, 1-3 - number of lymph nodes, X - cannot be measured,
- M: metastasis: 0 - did not spread, 1 - has spread, X - cannot be measured.
Differentiation grades: 1-4 from well differentiated to undifferentiated, X - cannot be measured.
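The code lists above can be turned into a simple validity check. The following is a minimal sketch, assuming each record is a dictionary; the field names "T", "N", "M" and "grade" are hypothetical and may differ from the actual column names in the dataset.

```python
# Allowed codes, taken from the stage and grade definitions above.
# The field names used as keys are hypothetical, not the dataset's column names.
ALLOWED = {
    "T": {"0", "1", "2", "3", "4", "X"},
    "N": {"0", "1", "2", "3", "X"},
    "M": {"0", "1", "X"},
    "grade": {"1", "2", "3", "4", "X"},
}


def invalid_fields(record):
    """Return the names of the fields whose value is not an allowed code."""
    bad = []
    for field, allowed in ALLOWED.items():
        # Normalise to an upper-case string so that 3, "3" and "x" are handled alike.
        value = str(record.get(field, "")).strip().upper()
        if value not in allowed:
            bad.append(field)
    return bad
```

A record that fails this check is not necessarily wrong beyond repair; the returned field names only indicate where a closer look is needed.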
We assume that the input parameters are uncorrelated, which is not biologically realistic.
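To illustrate what this independence assumption means in practice, here is a minimal sketch in which every column is drawn separately from its own marginal distribution. It is not the generator used for this dataset: the column names are hypothetical and the uniform weights are placeholders, not the distributions reported in the paper.

```python
import random
import uuid

# PLACEHOLDER marginals: uniform weights for illustration only,
# NOT the distributions from the paper. Column names are hypothetical.
MARGINALS = {
    "sex": (["male", "female"], [0.5, 0.5]),
    "t_stage": (["1", "2", "3", "4"], [0.25, 0.25, 0.25, 0.25]),
    "n_stage": (["0", "1", "2", "3"], [0.25, 0.25, 0.25, 0.25]),
}


def sample_patient(rng):
    """Draw each column independently, ignoring any relation between columns."""
    patient = {"patient_id": str(uuid.uuid4())}
    for column, (values, weights) in MARGINALS.items():
        patient[column] = rng.choices(values, weights=weights, k=1)[0]
    return patient


rng = random.Random(0)  # seeded so the categorical draws are reproducible
patients = [sample_patient(rng) for _ in range(5)]
```

Because each draw ignores the others, combinations that would be rare in real patients (for example, a small tumor with extensive nodal involvement) appear as often as the marginals allow.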
Several distortions have been introduced into the data:
- duplicated records,
- duplicated IDs,
- missing values,
- non-realistic values (negative age),
- unexpected values,
- typos.
It is important to handle them carefully because, in healthcare, every data point is precious.
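One cautious way to start is to flag problems instead of deleting rows. The sketch below assumes the rows are dictionaries with hypothetical keys "patient_id" and "age"; it reports duplicated records, reused IDs, missing values and negative ages, and leaves the data itself untouched. Unexpected values and typos need column-specific rules and are not covered here.

```python
def flag_distortions(rows):
    """Return (row_index, issue) pairs; no row is dropped or modified.

    The keys "patient_id" and "age" are hypothetical column names.
    """
    issues = []
    seen_rows, seen_ids = set(), set()
    for i, row in enumerate(rows):
        # A hashable fingerprint of the whole row, used to spot exact duplicates.
        key = tuple(sorted((k, str(v)) for k, v in row.items()))
        pid = row.get("patient_id")
        if key in seen_rows:
            issues.append((i, "duplicate record"))
        elif pid is not None and pid in seen_ids:
            # Same ID as an earlier row, but different content.
            issues.append((i, "duplicate ID"))
        seen_rows.add(key)
        seen_ids.add(pid)

        for column, value in row.items():
            if value is None or str(value).strip() == "":
                issues.append((i, f"missing value in {column}"))

        age = row.get("age")
        if isinstance(age, (int, float)) and age < 0:
            issues.append((i, "negative age"))
    return issues
```

Keeping the flags separate from the data makes it possible to decide per case whether a value can be repaired, imputed or must be excluded.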