Dataset Summary
The Fake News Classification Dataset is an English-language dataset of over 40,000 unique news articles. Each article is labeled as true (1) or fake (0), making it a useful resource for researchers and practitioners working on fake news detection with Transformer models. This is the first version of the dataset aimed at studying fake news detection.
Supported Tasks and Leaderboards
This dataset supports the following tasks:
- Text classification
- Fact-checking
- Intent classification
Languages
The dataset is primarily in English as generally spoken in the United States (en-US).
Dataset Structure
The dataset comprises 40,587 rows, each describing a news article with three key fields:
- Title: The title of the news article.
- Text: The content of the news article.
- Label: A binary classification indicating whether the news is fake (0) or true (1).
Data Instances
Each instance contains:
- An integer ID
- A string for the title
- A string for the article text
- A label (0 or 1)
Example Instance:
{
  "id": 1,
  "title": "Palestinians switch off Christmas lights in Bethlehem in anti-Trump protest",
  "text": "RAMALLAH, West Bank (Reuters) - Palestinians switched off Christmas lights at Jesus' traditional birthplace in Bethlehem on Wednesday night in protest at U.S. President Donald Trump's decision to recognize Jerusalem as Israel's capital...",
  "label": 1
}
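As a minimal sketch, a record like the one above could be loaded and inspected with the `datasets` library. The Hub identifier `username/fake-news-classification` below is a placeholder, not the dataset's actual path.

```python
from datasets import load_dataset

# Placeholder Hub identifier -- replace with the actual dataset path.
dataset = load_dataset("username/fake-news-classification")

# Inspect the first training instance: id, label, title, and a preview of the text.
example = dataset["train"][0]
print(example["id"], example["label"])
print(example["title"])
print(example["text"][:200])
```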
Data Fields
- id: Integer index of the row in the dataset.
- title: String summarizing the article.
- text: String containing the article content.
- label: Integer (0 or 1) indicating whether the article is fake (0) or true (1).
Data Splits
The dataset is divided into three splits:
- Train: 24,353 instances
- Validation: 8,117 instances
- Test: 8,117 instances
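The card does not document exactly how this split was produced; the snippet below is only a hedged illustration of a 60/20/20 stratified split with pandas and scikit-learn, which yields sizes of this order. The file name and random seed are assumptions.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# "fake_news.csv" is an assumed file name for the combined dataset.
df = pd.read_csv("fake_news.csv")

# Hold out 40% of the rows, stratified on the label.
train_df, holdout_df = train_test_split(
    df, test_size=0.4, stratify=df["label"], random_state=42
)
# Split the held-out 40% evenly into validation and test (20% each overall).
val_df, test_df = train_test_split(
    holdout_df, test_size=0.5, stratify=holdout_df["label"], random_state=42
)

print(len(train_df), len(val_df), len(test_df))  # roughly 24k / 8k / 8k
```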
Dataset Creation
This dataset was created in Python, with the pandas library as the main processing tool. It combines several existing fake news datasets into a single corpus suitable for training models. All processes and code used for dataset creation are available in the repository: Fake News Detection Repository.
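The exact processing code lives in the repository linked above; the sketch below only illustrates the kind of pandas workflow described. The source file names ("True.csv", "Fake.csv") and their column layout are assumptions.

```python
import pandas as pd

# Assumed file names for the upstream source datasets.
true_df = pd.read_csv("True.csv")
fake_df = pd.read_csv("Fake.csv")

# Label the sources: 1 = true news, 0 = fake news.
true_df["label"] = 1
fake_df["label"] = 0

# Combine the sources, keep the card's three key fields, and drop duplicate articles.
combined = pd.concat([true_df, fake_df], ignore_index=True)
combined = combined[["title", "text", "label"]].drop_duplicates(subset=["title", "text"])

# Shuffle and add the integer id column described under Data Fields.
combined = combined.sample(frac=1, random_state=42).reset_index(drop=True)
combined.insert(0, "id", combined.index)

combined.to_csv("fake_news.csv", index=False)
```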
Source Data
The source data is a combination of multiple fake news datasets sourced from Kaggle, a platform for data science competitions and public datasets.
Initial Data Collection and Normalization
Version 1.0.0 targets supervised deep learning, in particular Transformer models for Natural Language Processing (NLP), using news articles from the United States.
Considerations for Using the Data
The three splits correspond to the usual phases of model development (a fine-tuning sketch follows this list):
- Training phase: train your NLP model on the train split.
- Validation phase: check the effectiveness of training and watch for overfitting.
- Test phase: evaluate the final model's performance and identify mistakes in fine-tuning.
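As a minimal sketch of that workflow, the splits could be used to fine-tune and evaluate a Transformer classifier with the `transformers` Trainer API. The placeholder Hub identifier, the base checkpoint, and the hyperparameters below are assumptions, not the card's prescribed setup.

```python
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# Placeholder Hub identifier and an assumed base checkpoint.
dataset = load_dataset("username/fake-news-classification")
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2
)

def tokenize(batch):
    # Truncate long articles to the model's maximum input length.
    return tokenizer(batch["text"], truncation=True, padding="max_length")

tokenized = dataset.map(tokenize, batched=True)

args = TrainingArguments(
    output_dir="fake-news-model",
    num_train_epochs=1,
    per_device_train_batch_size=16,
)

# Training phase uses the train split; validation phase uses the validation split.
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["validation"],
)
trainer.train()

# Test phase: final evaluation on the held-out test split.
print(trainer.evaluate(tokenized["test"]))
```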