Baselight

CoEdIT Text Editing

A curated dataset for training text editing models

@kaggle.thedevastator_coedit_text_editing_dataset


About this Dataset

By grammarly (from Hugging Face)


The dataset includes two main files: train.csv and validation.csv. Each file pairs source texts with their edited versions, produced by applying a specific text editing task to the source.

Each example in the dataset has three main columns: task, src (the original source text), and tgt (the edited version of the source text). The task column specifies the type of text editing task that was performed on the source text. Because it is categorical, researchers and developers can use it to group examples by edit type.
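As a rough illustration of how the task column can be used to group examples, the sketch below buckets rows by task. The rows and the task labels ("gec", "paraphrase") are invented for illustration and are not taken from the dataset; `group_by_task` is a hypothetical helper name.

```python
from collections import defaultdict


def group_by_task(rows):
    """Map each task label to the list of (src, tgt) pairs that carry it."""
    groups = defaultdict(list)
    for row in rows:
        groups[row["task"]].append((row["src"], row["tgt"]))
    return dict(groups)


# Invented rows that mimic the task/src/tgt layout described above.
rows = [
    {"task": "gec", "src": "She go home.", "tgt": "She goes home."},
    {"task": "paraphrase", "src": "It is cold.", "tgt": "The weather is chilly."},
    {"task": "paraphrase", "src": "He left early.", "tgt": "He departed ahead of time."},
]

groups = group_by_task(rows)
for task, pairs in groups.items():
    print(task, len(pairs))
```

Grouping first makes it straightforward to train or inspect one edit type at a time.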

With this dataset, researchers can train their own models or evaluate existing ones by comparing model outputs against the provided target texts. Because every example carries a task label, results can also be broken down by edit type.
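One simple way to compare model outputs with the target texts is exact match per task. The sketch below assumes predictions arrive as a list of strings aligned with the examples; `exact_match_by_task` is a hypothetical helper, the sample data is invented, and exact match is only one possible metric (it is strict and gives no credit for near misses).

```python
from collections import defaultdict


def exact_match_by_task(examples, predictions):
    """Return, per task, the fraction of predictions identical to tgt."""
    if len(examples) != len(predictions):
        raise ValueError("need exactly one prediction per example")
    hits = defaultdict(int)
    totals = defaultdict(int)
    for example, prediction in zip(examples, predictions):
        totals[example["task"]] += 1
        # Ignore leading/trailing whitespace when comparing.
        if prediction.strip() == example["tgt"].strip():
            hits[example["task"]] += 1
    return {task: hits[task] / totals[task] for task in totals}


# Invented examples and model outputs, for illustration only.
examples = [
    {"task": "gec", "src": "She go home.", "tgt": "She goes home."},
    {"task": "gec", "src": "They was late.", "tgt": "They were late."},
    {"task": "paraphrase", "src": "It is cold.", "tgt": "The weather is chilly."},
]
predictions = ["She goes home.", "They was late.", "The weather is chilly."]

scores = exact_match_by_task(examples, predictions)
print(scores)
```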

Note that the dataset does not include dates or timeframes for its entries. It provides only labeled source and target examples for training purposes.

How to use the dataset

Dataset Overview

The dataset consists of two main files: train.csv and validation.csv. These files are in CSV format, making it easy to load and process the data using various programming languages such as Python.

train.csv

The train.csv file contains the training data that can be used to train your text editing models. Each row in this file represents an example of a text editing task performed on a source text. The columns present in this file are as follows:

  • task: This column describes the type of text editing task that was performed on the source text. It is a categorical variable that can take various values, indicating different types of edits.
  • src: This column represents the original source text before any editing took place.
  • tgt: This column contains the edited version of the source text after performing the specified task.

validation.csv

The validation.csv file is used to check a trained model's performance on unseen data during development or evaluation. It has the same structure as train.csv, with the task, src, and tgt columns describing each edit example.
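A minimal sketch of reading either file with Python's standard csv module follows. It assumes the files are UTF-8 CSVs with a header row containing task, src, and tgt; `read_examples` is a hypothetical helper, and the in-memory sample stands in for the real file so the snippet runs without a download.

```python
import csv
import io

EXPECTED_COLUMNS = ("task", "src", "tgt")


def read_examples(handle):
    """Read rows from an open CSV handle, keeping the three documented columns."""
    reader = csv.DictReader(handle)
    missing = [c for c in EXPECTED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    return [{c: row[c] for c in EXPECTED_COLUMNS} for row in reader]


# Tiny in-memory stand-in for train.csv; these rows are invented.
sample = io.StringIO(
    "task,src,tgt\n"
    'gec,"Fix grammar: She go home.","She goes home."\n'
    'paraphrase,"Paraphrase: It is cold.","The weather is chilly."\n'
)
examples = read_examples(sample)
print(len(examples), "examples loaded")

# With the real file (assumed to be in the working directory):
# with open("train.csv", newline="", encoding="utf-8") as f:
#     examples = read_examples(f)
```

The same function works for validation.csv, since both files share the same columns.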

Research Ideas

  • Analyzing common text editing patterns: By studying the edited versions (tgt) of the source texts (src), this dataset can be used to analyze common patterns in text editing. Researchers can gain insights into typical changes made during specific text editing tasks, such as proofreading, paraphrasing, or summarizing.
  • Language generation research: This dataset can also be utilized for language generation research by providing a large-scale collection of paired source-target texts. Researchers can train language generation models using this data to generate accurate and contextually appropriate edits for different types of text editing tasks.
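For the first idea, one possible starting point is a word-level diff between src and tgt. The sketch below uses difflib from the Python standard library; `word_edits` is a hypothetical helper, whitespace tokenization is a simplifying assumption, and the sentence pair is invented.

```python
import difflib


def word_edits(src, tgt):
    """List the non-matching word spans between src and tgt as (tag, old, new)."""
    a, b = src.split(), tgt.split()
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    return [
        (tag, " ".join(a[i1:i2]), " ".join(b[j1:j2]))
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


edits = word_edits("She go home .", "She goes home .")
print(edits)
```

Counting these tuples across all examples of one task gives a rough picture of which substitutions, insertions, and deletions are most common for that task.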

Acknowledgements

If you use this dataset in your research, please credit the original authors, grammarly (from Hugging Face).

License

License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication
No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.

Columns

File: validation.csv

  • task (Categorical): This column represents the type of text editing task performed on the source text. It is a categorical variable that describes what kind of changes were made to the original text.
  • src (Text): This column contains the original source text before any editing was done. It serves as a reference point for understanding how different tasks have modified it.
  • tgt (Text): This column represents the edited version of the source text after applying a specific task. It provides examples of how different types of edits can transform a given piece of text.

File: train.csv

  • task (Categorical): This column represents the type of text editing task performed on the source text. It is a categorical variable that describes what kind of changes were made to the original text.
  • src (Text): This column contains the original source text before any editing was done. It serves as a reference point for understanding how different tasks have modified it.
  • tgt (Text): This column represents the edited version of the source text after applying a specific task. It provides examples of how different types of edits can transform a given piece of text.

Tables

Train

@kaggle.thedevastator_coedit_text_editing_dataset.train
  • 9.57 MB
  • 69071 rows
  • 4 columns

CREATE TABLE train (
  "n__id" BIGINT,
  "task" VARCHAR,
  "src" VARCHAR,
  "tgt" VARCHAR
);

Validation

@kaggle.thedevastator_coedit_text_editing_dataset.validation
  • 391.55 KB
  • 1712 rows
  • 4 columns

CREATE TABLE validation (
  "n__id" BIGINT,
  "task" VARCHAR,
  "src" VARCHAR,
  "tgt" VARCHAR
);
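To show how the schema above might be queried, the sketch below recreates the train table in an in-memory SQLite database and counts rows per task. SQLite is a stand-in chosen only because it ships with Python; it is not necessarily the engine behind these tables, and the inserted rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Same column names and declared types as the train schema above.
conn.execute(
    'CREATE TABLE train ("n__id" BIGINT, "task" VARCHAR, "src" VARCHAR, "tgt" VARCHAR)'
)

# Invented rows, for illustration only.
rows = [
    (0, "gec", "She go home.", "She goes home."),
    (1, "paraphrase", "It is cold.", "The weather is chilly."),
    (2, "paraphrase", "He left early.", "He departed ahead of time."),
]
conn.executemany("INSERT INTO train VALUES (?, ?, ?, ?)", rows)

# Count how many examples each task contributes.
per_task = conn.execute(
    'SELECT "task", COUNT(*) FROM train GROUP BY "task" ORDER BY "task"'
).fetchall()
print(per_task)
conn.close()
```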
