Name: OpenAI Summarization Corpus
Creator: Kaggle
License: https://creativecommons.org/publicdomain/zero/1.0/

Training and Validation Data from TL;DR, CNN, and Daily Mail

By Huggingface Hub [source]

About this dataset

This dataset provides a corpus for natural language processing tasks, specifically text summarization, and is intended for validating reward models from OpenAI. It contains summaries of text from the TL;DR, CNN, and Daily Mail datasets, along with the choices made by workers, batch information that differentiates the summaries created by workers, and the dataset split. With this data, users can train summarization systems on real-world text to produce concise summaries of long-form text, and can compare model output directly against human-generated results.

How to use the dataset

This dataset provides a corpus of human-generated summaries for text from the TL;DR, CNN, and Daily Mail datasets, which can be used to train and evaluate summarization models. It includes both training and validation data.

To use this dataset for summarization tasks:

Look at the info column in the two comparisons .csv files (train and validation) to find the text you would like to summarize.

Check the choice column of either .csv file to see which summary was chosen for that text.

Review the corresponding summaries column for alternative summaries that cover similar content with different wording or style, in case you prefer one of them over the chosen summary.

Consult the split, worker, and batch information for more context on each choice before selecting a summary to use, judging it by its accuracy and clarity.
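The steps above can be sketched with Python's standard csv module. This is a minimal sketch, not an official loader: the sample rows are made up for illustration, and it assumes the CSV header matches the column names documented for the comparisons files. With the real data, you would pass an open handle to comparisons_train.csv or comparisons_validation.csv instead of the in-memory sample.

```python
import csv
import io

# Made-up sample standing in for comparisons_train.csv (illustration only).
SAMPLE_CSV = (
    "info,summaries,choice,batch,split,extra\n"
    "Long post about a road trip,Summary A | Summary B,Summary A,1,train,\n"
    "News article about a storm,Summary C | Summary D,Summary D,2,train,\n"
)


def load_rows(handle):
    """Read every row of a comparisons CSV into a list of dicts keyed by column name."""
    return list(csv.DictReader(handle))


def preview(row, width=40):
    """Return a short view of one row: source text, candidate summaries, chosen summary."""
    return {
        "info": row["info"][:width],
        "summaries": row["summaries"][:width],
        "choice": row["choice"],
    }


rows = load_rows(io.StringIO(SAMPLE_CSV))
for row in rows:
    print(preview(row))
```

For a file on disk, open it with `open(path, newline="", encoding="utf-8")` and pass the handle to `load_rows`. How the summaries and choice values are actually encoded in the files should be checked on a few rows before relying on them.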

Research Ideas

Training a natural language processing model to automatically generate summaries of text, using summary and choice data from this dataset.

Evaluating OpenAI's reward model for natural language processing on the validation data in order to improve accuracy and performance.

Analyzing the worker and batch information to identify trends among workers or batches that could indicate bias or other issues affecting summarization accuracy.
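The second idea, checking a reward model against human choices, could look roughly like the sketch below. Everything in it is an assumption made for illustration: `toy_reward` is a hypothetical stand-in for a real reward model, the examples are invented, and each comparison is assumed to have been parsed already into a list of candidate summaries plus the text of the human-chosen one.

```python
def toy_reward(text, summary):
    """Hypothetical scorer: fraction of summary words that also occur in the source text."""
    source_words = set(text.lower().split())
    summary_words = summary.lower().split()
    if not summary_words:
        return 0.0
    return sum(w in source_words for w in summary_words) / len(summary_words)


def agreement_rate(examples, reward_fn):
    """Share of comparisons where the top-scoring candidate is the human-chosen summary."""
    if not examples:
        return 0.0
    hits = 0
    for ex in examples:
        best = max(ex["summaries"], key=lambda s: reward_fn(ex["info"], s))
        hits += best == ex["choice"]
    return hits / len(examples)


# Invented examples; the second one is built so the toy scorer disagrees with the human.
EXAMPLES = [
    {
        "info": "the cat sat on the mat all day",
        "summaries": ["cat sat on mat", "dog ran away"],
        "choice": "cat sat on mat",
    },
    {
        "info": "rain fell over the city for hours",
        "summaries": ["sunny skies ahead", "rain fell over city"],
        "choice": "sunny skies ahead",
    },
]

rate = agreement_rate(EXAMPLES, toy_reward)
print(f"agreement with human choice: {rate:.2f}")
```

Agreement with the human choice is only one possible metric. A real evaluation would replace `toy_reward` with the model under test and run over the validation file.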

License

License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication
No Copyright: You can copy, modify, distribute, and perform the work, even for commercial purposes, all without asking permission.

Columns

File: comparisons_validation.csv

Column name	Description
info	Text to be summarized. (String)
summaries	Summaries generated by workers. (String)
choice	The chosen summary. (String)
batch	Batch in which the summary was created. (Integer)
split	Dataset split the row belongs to (training or validation). (String)
extra	Additional available information about the source material. (String)

File: comparisons_train.csv

Column name	Description
info	Text to be summarized. (String)
summaries	Summaries generated by workers. (String)
choice	The chosen summary. (String)
batch	Batch in which the summary was created. (Integer)
split	Dataset split the row belongs to (training or validation). (String)
extra	Additional available information about the source material. (String)
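Before working with the comparisons files, it can help to confirm that their headers match the columns documented above. The following is a minimal sketch under that assumption. `missing_columns` is a hypothetical helper name, and the in-memory header is invented to show a failing case.

```python
import csv
import io

# Column names as documented above for the two comparisons files.
COMPARISON_COLUMNS = ["info", "summaries", "choice", "batch", "split", "extra"]
EXPECTED_COLUMNS = {
    "comparisons_train.csv": COMPARISON_COLUMNS,
    "comparisons_validation.csv": COMPARISON_COLUMNS,
}


def missing_columns(handle, expected):
    """Return the documented columns that are absent from the CSV header row."""
    header = next(csv.reader(handle), [])
    return [name for name in expected if name not in header]


# Invented header that lacks the "extra" column, to show what a mismatch looks like.
sample = io.StringIO("info,summaries,choice,batch,split\n")
missing = missing_columns(sample, EXPECTED_COLUMNS["comparisons_train.csv"])
print("missing columns:", missing)
```

An empty result means every documented column is present. Extra columns in the file are not reported by this check.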

File: axis_validation.csv

Column name	Description
info	Text to be summarized. (String)
summaries	Summaries generated by workers. (String)
batch	Batch in which the summary was created. (Integer)
split	Dataset split the row belongs to (training or validation). (String)
worker	Workers who participated in generating the summary. (String)
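Because axis_validation.csv carries both worker and batch columns, it is a natural starting point for looking at how rows are distributed across workers or batches. The sketch below only counts rows per value and is not a full bias analysis. The sample rows and worker names are made up, and the header is assumed to match the table above.

```python
import csv
import io
from collections import Counter

# Made-up sample standing in for axis_validation.csv (illustration only).
SAMPLE_CSV = (
    "info,summaries,batch,split,worker\n"
    "Post one,Summary one,1,validation,worker_a\n"
    "Post two,Summary two,1,validation,worker_b\n"
    "Post three,Summary three,2,validation,worker_a\n"
)


def count_by(rows, column):
    """Count how many rows share each value of the given column."""
    return Counter(row[column] for row in rows)


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
per_worker = count_by(rows, "worker")
per_batch = count_by(rows, "batch")

print("rows per worker:", per_worker.most_common())
print("rows per batch:", per_batch.most_common())
```

A heavily skewed count for one worker or batch would be a reason to examine that subset more closely before drawing conclusions from it.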

Acknowledgements

If you use this dataset in your research, please credit the original authors and Huggingface Hub.
