Infectious Disease Prediction

@kaggle.haithemhermessi_infectious_disease_prediction

About this Dataset

Context

These data contain counts and rates for Centers for Infectious Diseases-related disease cases among California residents by county, disease, sex, and year, spanning 2001-2014 (as of September 2015). Data were extracted for communicable disease cases with an estimated onset or diagnosis date from 2001 through 2014, taken from California Confidential Morbidity Reports and/or Laboratory Reports that were submitted to CDPH by September 2015 and that met the surveillance case definition for that disease. Cleansing and exploration steps were then performed to generate the train and test datasets.

Content

The train dataset contains 75,614 rows and the test dataset contains 18,904 rows.
**Features:**
**Disease**: Plain text: The name of the disease reported for the patient.
**County**: Plain text: The county in which the case resided when they were diagnosed and/or where they are currently receiving care; in most cases this will be the county that reported the case.
**Year**: Number: Year is derived from the estimated illness onset date. We defined the estimated illness onset date for each case as the date closest to the time when symptoms first appeared. Because the date of illness onset may not be recorded, the estimated date of illness onset can range from the first appearance of symptoms to the date the report was made to CDPH. For diseases with insidious illness onset (for instance, coccidioidomycosis), estimated illness onset was more frequently drawn from the diagnosis date.
Values include: years spanning 2001-2014, unless otherwise indicated below.
**Sex**: Plain text: Values include: Male, Female.
**Count**: Number: The number of occurrences of each disease that meet the surveillance definition and/or inclusion criteria specific to that disease for that County, Year, Sex stratum. National surveillance case definitions for these conditions can be found at http://wwwn.cdc.gov/nndss/case-definitions.html.
**Population**: Number: The estimated population size (rounded to the nearest integer) for each County, Year, Sex stratum. California Department of Finance (DOF) Population Projection data (P-3 data table) were used to determine the population proportion of a particular demographic subgroup relative to the total State/County population for a given year. These proportions were then applied to the DOF Estimate totals (E-2 data table) for the given State/County and year to obtain the estimates used. These data are available at http://www.dof.ca.gov/research/demographic/reports/view.php.
Value: a number (a positive integer).
**Rate**: Number: The rate of disease per 100,000 population for the corresponding County, Year, Sex stratum, using the standard calculation (Count * 100,000 / Population).
Value: a number (a positive real number xxx.xxx).
**CI.lower**: Number: The lower bound of the 95% confidence interval for the calculated rate. The confidence interval was calculated with the R software package (R Core Team (2015). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. URL http://www.R-project.org/.) using the "Exact Pearson-Klopper" method as implemented in the "binom" package (Sundar Dorai-Raj (2014). binom: Binomial Confidence Intervals For Several Parameterizations. R package version 1.1-1. http://CRAN.R-project.org/package=binom).
Value: a number (a positive real number xxx.xxx).
**CI.upper**: Number: The upper bound of the 95% confidence interval for the calculated rate, calculated as above.
Value: a number (a positive real number xxx.xxx).
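
The Rate and confidence-interval columns described above can be illustrated with a short sketch. This is not the original pipeline (which used the R "binom" package); it is a standard-library Python approximation of the same exact binomial interval, usually called Clopper-Pearson, found by bisection on the binomial CDF. The function names `rate_per_100k` and `exact_ci_per_100k` are hypothetical, and scaling the interval bounds to "per 100,000" is an assumption made so that they share units with Rate.

```python
import math


def rate_per_100k(count, population):
    """Rate per 100,000 population: Count * 100,000 / Population."""
    return count * 100_000 / population


def _binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p), summed in log space for large n."""
    if k < 0:
        return 0.0
    if k >= n or p <= 0.0:
        return 1.0
    if p >= 1.0:
        return 0.0  # k < n here, so all mass sits above k
    log_p, log_q = math.log(p), math.log1p(-p)
    total = 0.0
    for i in range(k + 1):
        log_pmf = (math.lgamma(n + 1) - math.lgamma(i + 1)
                   - math.lgamma(n - i + 1) + i * log_p + (n - i) * log_q)
        total += math.exp(log_pmf)
    return min(total, 1.0)


def _solve_decreasing(f, target, iters=100):
    """Bisection on [0, 1] for f(p) == target, where f decreases as p grows."""
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        mid = (lo + hi) / 2
        if f(mid) > target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def exact_ci_per_100k(count, population, level=0.95):
    """Exact (Clopper-Pearson) interval for count/population, per 100,000."""
    alpha = 1.0 - level
    # Lower bound p solves P(X >= count) = alpha/2,
    # i.e. P(X <= count - 1) = 1 - alpha/2.
    if count == 0:
        lower = 0.0
    else:
        lower = _solve_decreasing(
            lambda p: _binom_cdf(count - 1, population, p), 1 - alpha / 2)
    # Upper bound p solves P(X <= count) = alpha/2.
    if count == population:
        upper = 1.0
    else:
        upper = _solve_decreasing(
            lambda p: _binom_cdf(count, population, p), alpha / 2)
    return lower * 100_000, upper * 100_000


# Hypothetical stratum: 12 cases in a population of 250,000.
rate = rate_per_100k(12, 250_000)
ci_lower, ci_upper = exact_ci_per_100k(12, 250_000)
print(f"Rate {rate:.3f}, 95% CI ({ci_lower:.3f}, {ci_upper:.3f})")
```

The CDF sum grows linearly with Count, so for large counts a beta-quantile routine from a statistics library would be the more practical choice; the bisection form is used here only to keep the sketch dependency-free.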

Acknowledgements
