Fake News Classification

A Comprehensive Dataset for Fake News Detection

@kaggle.aadyasingh55_fake_news_classification

About this Dataset

Dataset Summary

The Fake News Classification Dataset is an English-language dataset containing just over 45,000 unique news articles, each labeled as true (1) or false (0). It is intended for researchers and practitioners working on fake news detection with Transformer models. This is the first version of the dataset.

Supported Tasks and Leaderboards

This dataset supports the following tasks:

  1. Text classification
  2. Fact-checking
  3. Intent classification

Languages

The dataset is primarily in English as generally spoken in the United States (en-US).

Dataset Structure

The dataset comprises 40,587 records of news articles, each with three key fields:

  • Title: The title of the news article.
  • Text: The content of the news article.
  • Label: A binary classification indicating whether the news is fake (0) or true (1).

Data Instances

Each instance contains:

  • An integer ID
  • A string for the title
  • A string for the article text
  • A label (0 or 1)

Example Instance:


{
  "id": "1",
  "title": "Palestinians switch off Christmas lights in Bethlehem in anti-Trump protest",
  "text": "RAMALLAH, West Bank (Reuters) - Palestinians switched off Christmas lights at Jesus' traditional birthplace in Bethlehem on Wednesday night in protest at U.S. President Donald Trump's decision to recognize Jerusalem as Israel's capital...",
  "label": "1"
}

Data Fields

  1. id: Integer value counting the rows in the dataset.
  2. title: String summarizing the article.
  3. text: String containing the article content.
  4. label: Binary value indicating if the article is true (1) or false (0).
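
The example instance shows id and label as quoted strings, while the field descriptions call for an integer and a binary value. The following is a minimal sketch of one way to normalize a raw record into typed fields. The Article class and the parse_article helper are hypothetical names introduced for illustration, and the sample record is a made-up placeholder, not a row from the dataset.

```python
import json
from dataclasses import dataclass


@dataclass
class Article:
    """Typed view of one record (hypothetical helper, not part of the dataset)."""
    id: int
    title: str
    text: str
    label: int  # 1 = true, 0 = fake


def parse_article(raw: dict) -> Article:
    """Cast id and label to int and reject labels other than 0 or 1."""
    label = int(raw["label"])
    if label not in (0, 1):
        raise ValueError(f"unexpected label: {label!r}")
    return Article(
        id=int(raw["id"]),
        title=str(raw["title"]),
        text=str(raw["text"]),
        label=label,
    )


# Placeholder record with string-valued id and label, as in the example instance.
raw = json.loads(
    '{"id": "1", "title": "Example headline", "text": "Example body.", "label": "1"}'
)
article = parse_article(raw)
print(article)
```

Validating the label at load time keeps malformed rows from reaching a model silently.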

Data Splits

The dataset is divided into three splits:

  • Train: 24,353 instances
  • Validation: 8,117 instances
  • Test: 8,117 instances
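
The three counts sum to 40,587 and are consistent with a roughly 60/20/20 split. That ratio is inferred from the numbers, not stated by the dataset authors. The sketch below checks the arithmetic and shows one way such a split could be drawn; the function names, the shuffling procedure, and the seed are assumptions.

```python
import random

TOTAL = 40_587  # 24,353 + 8,117 + 8,117


def split_sizes(n: int) -> tuple:
    """Give validation and test one fifth each (rounded down), the rest to train."""
    holdout = n // 5
    return n - 2 * holdout, holdout, holdout


def split_indices(n: int, seed: int = 42) -> tuple:
    """Shuffle row indices and cut them into train, validation, and test lists."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    train_n, val_n, _ = split_sizes(n)
    return (
        indices[:train_n],
        indices[train_n:train_n + val_n],
        indices[train_n + val_n:],
    )


train_n, val_n, test_n = split_sizes(TOTAL)
print(train_n, val_n, test_n)

train_idx, val_idx, test_idx = split_indices(TOTAL)
```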

Dataset Creation

This dataset was created using Python with the pandas library as the main processing tool. It incorporates a mix of existing fake news datasets, ensuring a comprehensive dataset for training models. All processes and code used for dataset creation are available in the repository: Fake News Detection Repository.
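
As a rough illustration of combining several source datasets with pandas, the sketch below concatenates two toy frames, drops duplicate articles, and assigns a row-counting id. The column names follow the fields described above; the combine_sources function, the toy rows, the deduplication rule, and the zero-based id are assumptions, and the authors' actual steps are in their repository.

```python
import pandas as pd

# Toy stand-ins for two source datasets (made-up rows, not real data).
source_a = pd.DataFrame(
    {"title": ["A", "B"], "text": ["text a", "text b"], "label": [1, 0]}
)
source_b = pd.DataFrame(
    {"title": ["B", "C"], "text": ["text b", "text c"], "label": [0, 1]}
)


def combine_sources(frames):
    """Stack the frames, drop repeated (title, text) pairs, and add an id column."""
    combined = pd.concat(frames, ignore_index=True)
    combined = combined.drop_duplicates(subset=["title", "text"])
    combined = combined.reset_index(drop=True)
    combined.insert(0, "id", list(range(len(combined))))
    return combined


dataset = combine_sources([source_a, source_b])
print(dataset)
```

Deduplicating on title and text together is one possible rule; a real pipeline might also normalize whitespace or casing first.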

Source Data

The source data is a combination of multiple fake news datasets sourced from Kaggle, a platform for learning and honing skills in Artificial Intelligence.

Initial Data Collection and Normalization

Version 1.0.0 targets supervised deep learning, in particular recent Transformer models for Natural Language Processing (NLP), using news articles from the United States.

Considerations for Using the Data

The dataset is meant to be used in three phases, one per split:

  • Training Phase: For training your NLP model.
  • Validation Phase: To validate the effectiveness of the training and check for overfitting.
  • Test Phase: To evaluate the model's performance and identify mistakes in fine-tuning.
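
The three phases can be sketched with a deliberately simple stand-in model. The code below is not a Transformer and not the authors' method: it is a toy keyword counter over made-up titles, used only to show where each split enters the workflow. All function names and rows are hypothetical.

```python
from collections import Counter


def train_keyword_model(rows):
    """Training phase: count tokens seen in true (1) and fake (0) titles."""
    counts = {0: Counter(), 1: Counter()}
    for row in rows:
        counts[row["label"]].update(row["title"].lower().split())
    return counts


def predict(model, title):
    """Predict 1 if the title's tokens were seen at least as often in true titles."""
    tokens = title.lower().split()
    score_true = sum(model[1][t] for t in tokens)
    score_fake = sum(model[0][t] for t in tokens)
    return 1 if score_true >= score_fake else 0


def accuracy(model, rows):
    """Fraction of rows whose predicted label matches the stored label."""
    hits = sum(predict(model, r["title"]) == r["label"] for r in rows)
    return hits / len(rows)


# Made-up rows standing in for the three splits.
train_rows = [
    {"title": "Reuters reports budget vote", "label": 1},
    {"title": "Senate passes budget bill", "label": 1},
    {"title": "SHOCKING secret cure revealed", "label": 0},
    {"title": "You won't believe this secret", "label": 0},
]
validation_rows = [
    {"title": "Senate vote on bill", "label": 1},
    {"title": "Secret cure SHOCKING doctors", "label": 0},
]
test_rows = [
    {"title": "Budget bill reports", "label": 1},
    {"title": "Believe this shocking cure", "label": 0},
]

model = train_keyword_model(train_rows)

# Validation phase: a large gap between these two numbers suggests overfitting.
train_acc = accuracy(model, train_rows)
val_acc = accuracy(model, validation_rows)
print("train/validation gap:", train_acc - val_acc)

# Test phase: report once, after all tuning decisions are made.
test_acc = accuracy(model, test_rows)
print("test accuracy:", test_acc)
```

Keeping the test split untouched until the end is what makes the final number an honest estimate of performance on unseen articles.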
