Baselight

DAIGT Proper Train Dataset

A dataset you can actually train on for the LLM - Detect AI Generated Text competition.

@kaggle.thedrcat_daigt_proper_train_dataset

About this Dataset

Version 2 updated on 11/2/2023:

Since there is no proper training dataset for the LLM - Detect AI Generated Text competition, I decided to create one.

Ingredients (please upvote the included datasets!):

New version includes:

  • EssayID if available
  • Generation prompt if available
  • Random 10-fold split, stratified by source dataset
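
As a rough illustration of what a random 10-fold split stratified by source can look like, here is a minimal sketch in plain Python. It is not necessarily how the folds in this dataset were produced: the function name `assign_stratified_folds`, the seed, the 0-based fold numbering, and the source names are all assumptions made for the example.

```python
import random
from collections import defaultdict


def assign_stratified_folds(sources, n_folds=10, seed=42):
    """Return one fold index (0..n_folds-1) per row, balanced within each source."""
    rng = random.Random(seed)

    # Group row positions by the source dataset they came from.
    by_source = defaultdict(list)
    for idx, src in enumerate(sources):
        by_source[src].append(idx)

    folds = [None] * len(sources)
    for src in sorted(by_source):
        indices = by_source[src]
        rng.shuffle(indices)
        # Deal the shuffled rows round-robin, starting at a random fold,
        # so each fold receives a near-equal share of this source.
        offset = rng.randrange(n_folds)
        for pos, idx in enumerate(indices):
            folds[idx] = (pos + offset) % n_folds
    return folds


# Hypothetical source names, for illustration only.
sources = ["source_a"] * 50 + ["source_b"] * 30
folds = assign_stratified_folds(sources)
```

Stratifying by source keeps each fold's mix of source datasets close to the overall mix, so a validation fold is not dominated by a single source.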

Version 3 updated on 11/3/2023:

  • 2,400+ additional AI examples generated with Mistral 7B Instruct and a new prompt (let's see how it works!)

Version 4 updated on 11/5/2023:

Tables

Train Drcat 01

@kaggle.thedrcat_daigt_proper_train_dataset.train_drcat_01
  • 37.41 MB
  • 33259 rows
  • 4 columns

CREATE TABLE train_drcat_01 (
  "text" VARCHAR,
  "label" BIGINT,
  "source" VARCHAR,
  "fold" BIGINT
);

Train Drcat 02

@kaggle.thedrcat_daigt_proper_train_dataset.train_drcat_02
  • 46.44 MB
  • 39785 rows
  • 6 columns

CREATE TABLE train_drcat_02 (
  "essay_id" VARCHAR,
  "text" VARCHAR,
  "label" BIGINT,
  "source" VARCHAR,
  "prompt" VARCHAR,
  "fold" BIGINT
);

Train Drcat 03

@kaggle.thedrcat_daigt_proper_train_dataset.train_drcat_03
  • 49.31 MB
  • 42206 rows
  • 6 columns

CREATE TABLE train_drcat_03 (
  "essay_id" VARCHAR,
  "text" VARCHAR,
  "label" BIGINT,
  "source" VARCHAR,
  "prompt" VARCHAR,
  "fold" BIGINT
);

Train Drcat 04

@kaggle.thedrcat_daigt_proper_train_dataset.train_drcat_04
  • 51.73 MB
  • 44206 rows
  • 6 columns

CREATE TABLE train_drcat_04 (
  "essay_id" VARCHAR,
  "text" VARCHAR,
  "label" BIGINT,
  "source" VARCHAR,
  "prompt" VARCHAR,
  "fold" BIGINT
);
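
The "fold" column makes it straightforward to hold out one fold for validation and train on the rest. The sketch below shows one possible way to do that against the train_drcat_04 schema. It uses Python's built-in sqlite3 module purely as a stand-in for whatever engine hosts the table, and the inserted rows are made-up placeholders, not real dataset content. The helper name `split_by_fold` and the 0-based fold values are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE train_drcat_04 (
  "essay_id" VARCHAR,
  "text" VARCHAR,
  "label" BIGINT,
  "source" VARCHAR,
  "prompt" VARCHAR,
  "fold" BIGINT
)
""")

# Made-up placeholder rows so the example runs; NOT real dataset content.
rows = [
    (f"id_{i}", f"placeholder essay {i}", i % 2, "source_a", None, i % 10)
    for i in range(20)
]
conn.executemany("INSERT INTO train_drcat_04 VALUES (?, ?, ?, ?, ?, ?)", rows)


def split_by_fold(conn, valid_fold):
    """Return (train_rows, valid_rows), holding out `valid_fold` for validation."""
    query = 'SELECT "text", "label" FROM train_drcat_04 WHERE "fold" {} ?'
    train = conn.execute(query.format("<>"), (valid_fold,)).fetchall()
    valid = conn.execute(query.format("="), (valid_fold,)).fetchall()
    return train, valid


train_rows, valid_rows = split_by_fold(conn, valid_fold=0)
```

Looping `valid_fold` over every distinct fold value gives a full cross-validation run in which each row is used for validation exactly once.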
