Name: CoEdIT Text Editing
Creator: Kaggle
Published: 2025-02-13T08:24:59.071Z
License: https://creativecommons.org/publicdomain/zero/1.0/

A curated dataset for training text editing models

By Grammarly (from Hugging Face)

About this dataset

The dataset includes two main files: validation.csv and train.csv. These files contain examples of source texts, along with their corresponding edited versions, after specific text editing tasks have been performed.

Each example in the dataset has three columns: task, src (the original source text), and tgt (the edited version of the source text). The task column specifies the type of text editing task that was performed on the source text, which lets researchers and developers group examples by the kind of edit applied.

By utilizing this dataset, researchers can train their own models or evaluate existing ones by comparing their edited outputs with the provided target texts. This facilitates a more comprehensive analysis of model performance.
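As a minimal sketch of the comparison described above, the snippet below scores model outputs against the provided tgt texts using exact match after whitespace normalization. The metric choice, the function name exact_match_rate, and the sample rows and predictions are all assumptions made for illustration; none of them come from the dataset or its authors.

```python
def normalize(text):
    """Collapse runs of whitespace so spacing differences do not count as errors."""
    return " ".join(text.split())


def exact_match_rate(predictions, targets):
    """Return the fraction of predictions equal to their target text."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    if not targets:
        return 0.0
    hits = sum(normalize(p) == normalize(t) for p, t in zip(predictions, targets))
    return hits / len(targets)


# Hypothetical rows shaped like the dataset (task, src, tgt).
rows = [
    {"task": "task_a", "src": "She go to school.", "tgt": "She goes to school."},
    {"task": "task_a", "src": "He have a dog.", "tgt": "He has a dog."},
]

# Hypothetical model outputs, one per row.
predictions = ["She goes  to school.", "He have a dog."]

score = exact_match_rate(predictions, [row["tgt"] for row in rows])
print(f"exact match: {score:.2f}")
```

Exact match is strict: a valid edit that differs from tgt by one word scores zero, so it works best as a first check alongside softer overlap-based metrics.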

Note that the dataset does not include dates or timeframes for individual entries. It focuses solely on providing accurately labeled examples for training purposes.

How to use the dataset

Dataset Overview

The dataset consists of two main files: train.csv and validation.csv. Both are in CSV format, so they are easy to load and process in programming languages such as Python.
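One way to load either file is with Python's built-in csv module, sketched below. The helper name read_examples and the in-memory sample are hypothetical; the sample stands in for train.csv so the snippet runs without the download. UTF-8 encoding for the real files is an assumption.

```python
import csv
import io

EXPECTED_COLUMNS = ("task", "src", "tgt")


def read_examples(handle):
    """Read rows from an open CSV file object into a list of dicts."""
    reader = csv.DictReader(handle)
    header = reader.fieldnames or []
    missing = [name for name in EXPECTED_COLUMNS if name not in header]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    # Keep only the three documented columns, ignoring any extras.
    return [{name: row[name] for name in EXPECTED_COLUMNS} for row in reader]


# Stand-in for train.csv (hypothetical content).
sample = io.StringIO(
    "task,src,tgt\n"
    'example_task,"Fix this sentense, please.","Fix this sentence, please."\n'
)
examples = read_examples(sample)
print(examples[0]["src"], "->", examples[0]["tgt"])

# With the real file (encoding is an assumption):
# with open("train.csv", newline="", encoding="utf-8") as f:
#     examples = read_examples(f)
```

Passing newline="" to open() is what the csv module documentation recommends, so that quoted fields containing line breaks are parsed correctly.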

train.csv

The train.csv file contains the training data that can be used to train your text editing models. Each row in this file represents an example of a text editing task performed on a source text. The columns present in this file are as follows:

task: This column describes the type of text editing task that was performed on the source text. It is a categorical variable that can take various values, indicating different types of edits.

src: This column represents the original source text before any editing took place.

tgt: This column contains the edited version of the source text after performing the specified task.
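Because task is categorical, a quick first look at the data is to count how many examples each task has. The sketch below does this with collections.Counter; the rows and the task labels task_a and task_b are placeholders, not actual values from the dataset.

```python
from collections import Counter


def task_counts(rows):
    """Count how many examples belong to each value of the task column."""
    return Counter(row["task"] for row in rows)


# Hypothetical rows with placeholder task labels.
rows = [
    {"task": "task_a", "src": "He have a dog.", "tgt": "He has a dog."},
    {"task": "task_b", "src": "The cat sat.", "tgt": "The cat was sitting."},
    {"task": "task_a", "src": "She go home.", "tgt": "She goes home."},
]

counts = task_counts(rows)
for task, n in counts.most_common():
    print(f"{task}: {n}")
```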

validation.csv

The validation.csv file is used to check how your trained models perform on unseen data during development or evaluation. It has the same columns as train.csv: task, src, and tgt.

Research Ideas

Analyzing common text editing patterns: By studying the edited versions (tgt) of the source texts (src), this dataset can be used to analyze common patterns in text editing. Researchers can gain insights into typical changes made during specific text editing tasks, such as proofreading, paraphrasing, or summarizing.
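A rough sketch of this kind of pattern analysis, assuming word-level diffs are a reasonable unit: difflib.SequenceMatcher aligns the words of src and tgt, and the non-matching spans are reported as replace, insert, or delete operations. The function name word_edits, the whitespace tokenization, and the example rows are illustrative assumptions.

```python
import difflib
from collections import Counter


def word_edits(src, tgt):
    """Return (operation, removed_words, added_words) for each changed span."""
    a, b = src.split(), tgt.split()
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    return [
        (op, " ".join(a[i1:i2]), " ".join(b[j1:j2]))
        for op, i1, i2, j1, j2 in matcher.get_opcodes()
        if op != "equal"
    ]


# Hypothetical rows with a placeholder task label.
rows = [
    {"task": "task_a", "src": "She go to school", "tgt": "She goes to school"},
    {"task": "task_a", "src": "He have a dog", "tgt": "He has a dog"},
]

# Tally operation types per task to see which kinds of change dominate.
op_counts = Counter()
for row in rows:
    for op, removed, added in word_edits(row["src"], row["tgt"]):
        op_counts[(row["task"], op)] += 1

print(word_edits(rows[0]["src"], rows[0]["tgt"]))
print(op_counts)
```

Splitting on whitespace keeps punctuation attached to words, so a real analysis would likely want a proper tokenizer before diffing.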

Language generation research: This dataset can also be utilized for language generation research by providing a large-scale collection of paired source-target texts. Researchers can train language generation models using this data to generate accurate and contextually appropriate edits for different types of text editing tasks.

Acknowledgements

If you use this dataset in your research, please credit the original authors, Grammarly (from Hugging Face).

License

License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication
No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission.

Columns

File: validation.csv

Column name	Description
task	This column represents the type of text editing task performed on the source text. It is a categorical variable that describes what kind of changes were made to the original text. (Categorical)
src	This column contains the original source text before any editing was done. It serves as a reference point for understanding how different tasks have modified it. (Text)
tgt	This column represents the edited version of the source text after applying a specific task. It provides examples of how different types of edits can transform a given piece of text. (Text)

File: train.csv

Column name	Description
task	This column represents the type of text editing task performed on the source text. It is a categorical variable that describes what kind of changes were made to the original text. (Categorical)
src	This column contains the original source text before any editing was done. It serves as a reference point for understanding how different tasks have modified it. (Text)
tgt	This column represents the edited version of the source text after applying a specific task. It provides examples of how different types of edits can transform a given piece of text. (Text)
