Japanese FakeNews Dataset
This dataset consists of news articles and deep fake articles in Japanese.
@kaggle.tanreinama_japanese_fakenews_dataset
This is a copy of https://github.com/tanreinama/japanese-fakenews-dataset.
This dataset consists of news articles in Japanese and deep fake articles generated by the GPT-2 Japanese model.
This is a mixed corpus: the original articles come from the Japanese version of Wikinews, which is released under a Creative Commons license that allows modification, and the fake articles are generated by the GPT-2 Japanese model.
Every article is tagged as either original or fake and falls into one of three categories: original, partially fake, or completely fake.
The columns in the CSV file are as follows.
| Column name | Meaning |
|---|---|
| id | unique ID |
| context | text of the article (UTF-8 encoded) |
| isfake | Whether the article is fake: 0 = original article, 1 = partially fake, 2 = completely fake |
| nchar_real | Number of characters in the human-authored part of the article. |
| nchar_fake | Number of characters in the model-generated part of the article. |
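As a minimal sketch of how these columns might be read, the following Python snippet counts rows per `isfake` label using the standard `csv` module. The helper name `count_by_label` and the sample rows are made up for illustration and are not taken from the dataset.

```python
import csv
import io
from collections import Counter

# Labels for the isfake column, as listed in the column table.
ISFAKE_LABELS = {0: "original", 1: "partially fake", 2: "completely fake"}


def count_by_label(csv_file):
    """Count rows per isfake label in a file-like CSV object (hypothetical helper)."""
    reader = csv.DictReader(csv_file)
    counts = Counter()
    for row in reader:
        counts[ISFAKE_LABELS[int(row["isfake"])]] += 1
    return counts


# Made-up placeholder rows, NOT taken from the dataset.
SAMPLE_CSV = (
    "id,context,isfake,nchar_real,nchar_fake\n"
    "a1,placeholder text,0,16,0\n"
    "a2,placeholder text,1,10,6\n"
    "a3,placeholder text,2,0,16\n"
)

counts = count_by_label(io.StringIO(SAMPLE_CSV))
```

For a real file, the same function could be given a handle opened with `open(path, encoding="utf-8", newline="")`, where `path` is wherever the CSV was saved locally.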
The Japanese version of Wikinews is published under "Creative Commons Attribution 2.5 Generic (CC BY 2.5)"; articles posted before September 24, 2005 are published under "Creative Commons Attribution 2.1 Japan (CC BY 2.1 JP)". Because these licenses allow modification, an article can be revised to create data that is original up to the middle, with the rest AI-generated.
A "Partially fake" article follows the original article from the Japanese version of Wikinews up to the halfway point, after which the text is replaced by AI-generated content.
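The sketch below shows one way a partially fake article might be separated into its two parts. It assumes that the human-authored text occupies the first `nchar_real` characters of `context` and that everything after it is model-generated; the source does not state this layout explicitly, so it should be checked against the data. The function name `split_article` and the example string are hypothetical.

```python
def split_article(context, nchar_real):
    """Split context into (human_part, generated_part).

    Assumption: the first nchar_real characters are human-authored and the
    remainder is model-generated. Verify this against the actual data.
    """
    if nchar_real < 0 or nchar_real > len(context):
        raise ValueError("nchar_real is outside the article text")
    return context[:nchar_real], context[nchar_real:]


# Made-up example string, not taken from the dataset.
human, generated = split_article("ABCDEFGHIJ", 6)
```

Python string slicing counts Unicode code points rather than bytes, which is the more natural unit for Japanese text once the CSV has been decoded from UTF-8.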
This data set was created for the development of an AI to detect fake news.
The GPT-2 model used is the medium model published by the GPT-2 Japanese project, with no special fine-tuning.
The prefix "新聞によると~" ("According to the newspaper ..."), which is characteristic of the Japanese version of Wikinews, was added independently of the model.
CREATE TABLE fakenews (
"id" VARCHAR,
"context" VARCHAR,
"isfake" BIGINT,
"nchar_real" BIGINT,
"nchar_fake" BIGINT
);
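As a rough illustration of querying this table, the sketch below recreates the schema in an in-memory SQLite database through Python's `sqlite3` module and computes the share of model-generated characters in partially fake rows. The choice of SQLite is an assumption (the source does not name a database engine), and the inserted rows are made-up placeholders.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Same columns as the table definition; SQLite accepts these type names.
conn.execute(
    'CREATE TABLE fakenews ('
    '"id" VARCHAR, "context" VARCHAR, "isfake" BIGINT, '
    '"nchar_real" BIGINT, "nchar_fake" BIGINT)'
)

# Made-up placeholder rows, NOT taken from the dataset.
rows = [
    ("a1", "placeholder text", 0, 16, 0),
    ("a2", "placeholder text", 1, 10, 6),
    ("a3", "placeholder text", 2, 0, 16),
]
conn.executemany("INSERT INTO fakenews VALUES (?, ?, ?, ?, ?)", rows)

# Fraction of generated characters for partially fake articles (isfake = 1).
result = conn.execute(
    "SELECT id, 1.0 * nchar_fake / (nchar_real + nchar_fake) AS fake_ratio "
    "FROM fakenews WHERE isfake = 1 ORDER BY id"
).fetchall()
conn.close()
```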