CDC Text Corpora For Learners: HTML Mirrors Of MMWR, EID, And PCD
@cdc.cdc_ut5n_bmc3
The attached ZIP archives are part of the CDC Text Corpora for Learners program. This version, comprising 33,567 articles, was constructed on 2024-03-01 using source content retrieved on 2024-01-09.
The three attached ZIP archives contain the 33,567 articles in 33,576 compiled HTML mirrors of three publications: MMWR (Morbidity and Mortality Weekly Report), including its series Weekly Reports, Recommendations and Reports, Surveillance Summaries, Supplements, and Notifiable Diseases (a subset of Weekly Reports, constructed ad hoc); EID (Emerging Infectious Diseases); and PCD (Preventing Chronic Disease). There is one archive per series. The archive attachments are located in the About this Dataset section of this landing page. In that section, click Show More; the archives are listed under Attachments.
During retrieval and organization, the files were changed as little as possible from the raw sources, to support as many downstream uses as possible.
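Because the mirrors are kept close to the raw HTML, a downstream user will typically need to extract text themselves. The following is a minimal sketch, using only the Python standard library, of one way to walk an archive and pull visible text from each HTML file. The archive path, the function names (`iter_html_members`, `html_to_text`), the assumption that mirror files end in `.html` or `.htm`, and the choice of UTF-8 decoding are illustrative assumptions, not details documented for this dataset.

```python
import zipfile
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> contents."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style blocks.
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def html_to_text(markup):
    """Return the visible text of an HTML string, joined by single spaces."""
    parser = _TextExtractor()
    parser.feed(markup)
    parser.close()
    return " ".join(parser.chunks)


def iter_html_members(archive_path):
    """Yield (member_name, raw_bytes) for each HTML file in a ZIP archive.

    Assumption: mirror files carry an .html or .htm extension.
    """
    with zipfile.ZipFile(archive_path) as zf:
        for name in zf.namelist():
            if name.lower().endswith((".html", ".htm")):
                yield name, zf.read(name)
```

A caller might loop over `iter_html_members("path/to/archive.zip")` (a placeholder path) and pass `raw.decode("utf-8", errors="replace")` to `html_to_text`. The actual encoding of the mirrored files is not stated here, so it is worth checking a few files before relying on UTF-8.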
Tags: harvest-cdc-journals, text analysis, corpus, corpora, ncstltphiw, phic, informatics, data science, nlp, mmwr, pcd, eid, ai, ml, machine learning, llm, language, semantic, linguistic, morphology
Last updated: 2025-07-16 14:00:48+00:00