Japanese FakeNews Dataset
This is a copy of https://github.com/tanreinama/japanese-fakenews-dataset.
This dataset consists of news articles in Japanese and deep fake articles generated by the GPT-2 Japanese model.
It is a mixed corpus: the original articles come from the Japanese version of Wikinews, which is released under a Creative Commons license that permits modification, and the fake portions were generated by the GPT-2 Japanese model.
Each article is tagged as either original or fake and falls into one of the following categories:
- Original article (written by humans)
- Partially fake (the second half of the article was generated by the GPT-2 model)
- Completely fake (the entire article was generated by the GPT-2 model)
The columns in the CSV file are as follows.
| Column name | Meaning |
| --- | --- |
| id | Unique ID |
| context | Text of the article (UTF-8 encoded) |
| isfake | Whether the article is fake: 0 = original article, 1 = partially fake, 2 = completely fake |
| nchar_real | Number of characters in the human-authored part of the article |
| nchar_fake | Number of characters in the model-generated part of the article |
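For reference, the CSV could be loaded and checked roughly as follows (a minimal sketch using pandas; the file name and the expectations in the comments are assumptions based on the schema above, not guarantees):

```python
import pandas as pd

# The file name "fakenews.csv" is a placeholder; use the CSV shipped
# with this repository.
df = pd.read_csv("fakenews.csv")

# Map the isfake codes to readable labels.
labels = {0: "original", 1: "partially fake", 2: "completely fake"}
df["label"] = df["isfake"].map(labels)
print(df["label"].value_counts())

# Sanity checks implied by the schema (column names as in the table above):
# original articles should contain no generated characters, and completely
# fake articles should contain no human-authored characters.
print(df.loc[df["isfake"] == 0, "nchar_fake"].max())   # expected: 0
print(df.loc[df["isfake"] == 2, "nchar_real"].max())   # expected: 0
```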
Since the Japanese version of Wikinews is published under the "Creative Commons Attribution 2.5 Generic (CC BY 2.5)" license (articles posted before September 24, 2005 are published under "Creative Commons Attribution 2.1 Japan (CC BY 2.1 JP)"), it is permissible to modify an article and build a dataset in which the text is original up to a point and AI-generated from there on.
The "Partially fake" article contains original articles on Japanese version of Wikinews until halfway through, when it is replaced by an AI-generated article.
This data set was created for the development of an AI to detect fake news.
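As an illustration of that use case, a simple baseline detector could be sketched as follows (this is not part of the dataset or the original project; it assumes scikit-learn and pandas, treats any non-zero isfake value as fake, and uses character n-gram TF-IDF so that no Japanese tokenizer is required):

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

df = pd.read_csv("fakenews.csv")              # placeholder file name
X = df["context"]                             # article text
y = (df["isfake"] > 0).astype(int)            # 0 = human, 1 = any fake

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y)

# Character n-grams sidestep Japanese word segmentation entirely.
model = make_pipeline(
    TfidfVectorizer(analyzer="char", ngram_range=(2, 4), max_features=50000),
    LogisticRegression(max_iter=1000),
)
model.fit(X_train, y_train)
print(classification_report(y_test, model.predict(X_test)))
```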
The GPT-2 model used is the medium model published by the GPT-2 Japanese project, with no special fine-tuning.
The prefix "新聞によると~" ("according to the newspaper ..."), which is characteristic of the Japanese version of Wikinews, was added independently of the model.
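To picture how a "partially fake" article could be constructed, here is a minimal sketch using the Hugging Face transformers library; the checkpoint name is a placeholder and this is not the exact pipeline used to build this dataset (which used the GPT-2 Japanese project's medium model):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder checkpoint: any Japanese causal LM works for this sketch;
# the dataset itself was generated with the GPT-2 Japanese medium model.
MODEL_NAME = "rinna/japanese-gpt2-medium"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)

def make_partially_fake(article: str) -> str:
    """Keep the first half of a human-written article and let the model
    write a continuation, mirroring the "partially fake" category.
    How the Wikinews-style prefix mentioned above was applied is not
    specified, so it is left out of this sketch."""
    human_half = article[: len(article) // 2]
    inputs = tokenizer(human_half, return_tensors="pt")
    output_ids = model.generate(
        **inputs,
        max_new_tokens=200,
        do_sample=True,
        top_p=0.95,
        pad_token_id=tokenizer.eos_token_id,
    )
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```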