Context
The issue of “fake news” has arisen recently as a potential threat to high-quality journalism
and well-informed public discourse. The Fake News Challenge was organized in early
2017 to encourage development of machine learning-based classification systems that
perform “stance detection” -- i.e. identifying whether a particular news headline “agrees”
with, “disagrees” with, “discusses,” or is unrelated to a particular news article -- in order to
allow journalists and others to more easily find and investigate possible instances of “fake
news.”
Content
The data is provided as (headline, body, stance) instances, where stance is one of {unrelated, discuss, agree, disagree}. The dataset comprises two CSVs:

train_bodies.csv: contains the body text of articles (the articleBody column) with corresponding IDs (the Body ID column).

train_stances.csv: contains the labeled stances (the Stance column) for pairs of article headlines (Headline) and article bodies (Body ID, referring to entries in train_bodies.csv).
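To work with the data, each stance row must be joined to the body text it references via Body ID. A minimal sketch using only the standard library; the column names match the files described above, but the sample rows here are invented stand-ins for the real CSVs:

```python
import csv
import io

# Invented sample rows standing in for train_bodies.csv and
# train_stances.csv (the column names match the dataset).
bodies_csv = """Body ID,articleBody
0,"Police say the report was unfounded."
1,"Scientists discussed the claim at a press conference."
"""
stances_csv = """Headline,Body ID,Stance
Report was false,0,agree
Scientists weigh in on claim,1,discuss
"""

# Index bodies by ID so each stance row can look up its article text.
bodies = {row["Body ID"]: row["articleBody"]
          for row in csv.DictReader(io.StringIO(bodies_csv))}

# Join: each (headline, body, stance) instance pairs a stance row
# with the body text referenced by its Body ID.
instances = [(row["Headline"], bodies[row["Body ID"]], row["Stance"])
             for row in csv.DictReader(io.StringIO(stances_csv))]

for headline, body, stance in instances:
    print(stance, "|", headline)
```

With the real files, the two string buffers would simply be replaced by `open("train_bodies.csv")` and `open("train_stances.csv")`.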
Distribution of the data
The distribution of Stance classes in train_stances.csv is as follows:

| rows  | unrelated | discuss | agree     | disagree  |
|-------|-----------|---------|-----------|-----------|
| 49972 | 0.73131   | 0.17828 | 0.0736012 | 0.0168094 |
There are four possible classifications:
- The article text agrees with the headline.
- The article text disagrees with the headline.
- The article text is a discussion of the headline, without taking a position on it.
- The article text is unrelated to the headline (i.e. it doesn’t address the same topic).
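The class proportions in the table above can be recomputed directly from the Stance column. A sketch with a made-up label sample (the real file has 49,972 rows):

```python
from collections import Counter

# Invented sample of stance labels, standing in for the Stance
# column of train_stances.csv.
stances = ["unrelated"] * 7 + ["discuss"] * 2 + ["agree"] * 1

counts = Counter(stances)
total = sum(counts.values())

# Proportion of each class; Counter returns 0 for absent labels,
# so "disagree" is handled even though it never occurs here.
proportions = {label: counts[label] / total
               for label in ("unrelated", "discuss", "agree", "disagree")}
print(proportions)
```

Run against the real train_stances.csv, this reproduces the skewed distribution shown above, which is worth accounting for when training a classifier.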
Acknowledgements
For details of the task, see FakeNewsChallenge.org