OpenAI Summarization Corpus
Training and Validation Data from TL;DR, CNN, and Daily Mail
By Huggingface Hub
About this dataset
This dataset is a corpus for text summarization research, intended in particular for validating reward models from OpenAI. It contains summaries of text from the TL;DR, CNN, and Daily Mail datasets, along with the choices made by workers, batch information that distinguishes the summaries created by workers, and dataset split attributes. With this data, users can train summarization systems on real-world text to produce concise summaries of long-form text, and compare the output of those systems against human-generated results.
How to use the dataset
This dataset provides a corpus of human-generated summaries for text from the TL;DR, CNN, and Daily Mail datasets, which can be used to train and evaluate summarization models. It contains both training and validation data.
To use this dataset for summarization tasks:
- Find the text you want to summarize in the info column of the train or validation .csv file.
- Choose a summary from the choice column of either .csv file, using the worker or batch information to narrow down the rows you care about.
- Review the summaries column of the same row for alternatives with similar content but different wording or style, in case you prefer one of them over the chosen entry.
- Check the split, worker, and batch columns for context on each choice before settling on a summary, judging it by the accuracy and clarity of its content.
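The steps above can be sketched in Python using only the standard library. This is a minimal sketch, not an official loader: the helper names `load_rows` and `filter_rows` are hypothetical, the sample values are made up, and it assumes each file is a regular comma-separated file whose header row matches the column tables on this page. How the info, summaries, and choice cells are encoded is not described here, so the sketch leaves them as raw strings.

```python
import csv
import io


def load_rows(fileobj):
    """Read every row of an open CSV file into a list of dicts keyed by column name."""
    return list(csv.DictReader(fileobj))


def filter_rows(rows, split=None, batch=None):
    """Keep rows whose split and/or batch match; None means no filter on that column."""
    kept = []
    for row in rows:
        if split is not None and row.get("split") != split:
            continue
        # CSV cells are read as strings, so compare batch as a string.
        if batch is not None and row.get("batch") != str(batch):
            continue
        kept.append(row)
    return kept


# Tiny in-memory stand-in for comparisons_train.csv (made-up values).
sample = io.StringIO(
    "info,summaries,choice,batch,split,extra\n"
    "post A,summaries for A,choice A,1,train,\n"
    "post B,summaries for B,choice B,2,train,\n"
)
rows = load_rows(sample)
for row in filter_rows(rows, batch=2):
    print(row["info"], "->", row["choice"])
```

With the real files, the `io.StringIO` object would be replaced by something like `open("comparisons_train.csv", newline="", encoding="utf-8")`; the UTF-8 encoding is an assumption.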
Research Ideas
- Training a natural language processing model to automatically generate summaries of text, using summary and choice data from this dataset.
- Evaluating OpenAI's reward model for natural language processing on the validation data in order to improve accuracy and performance.
- Analyzing the worker and batch information to assess trends among workers or batches that could indicate bias or other issues affecting summarization accuracy.
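As a starting point for the worker and batch analysis, a simple first step is to count how many rows each worker and each batch contributed. The sketch below uses only the standard library; `count_by` is a hypothetical helper, the rows are made-up values shaped like axis_validation.csv, and a heavily skewed count is only a hint that a worker or batch deserves a closer look, not evidence of bias by itself.

```python
import csv
import io
from collections import Counter


def count_by(rows, column):
    """Count how many rows carry each distinct value of the given column."""
    return Counter(row[column] for row in rows)


# Made-up rows using the column names listed for axis_validation.csv.
sample = io.StringIO(
    "info,summaries,batch,split,worker\n"
    "post A,summary 1,1,validation,w1\n"
    "post B,summary 2,1,validation,w2\n"
    "post C,summary 3,2,validation,w1\n"
)
rows = list(csv.DictReader(sample))

per_worker = count_by(rows, "worker")
per_batch = count_by(rows, "batch")

# most_common() lists values from most to least frequent.
print(per_worker.most_common())
print(per_batch.most_common())
```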
Acknowledgements
If you use this dataset in your research, please credit the original authors.
License
License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication
No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.
Columns
File: comparisons_validation.csv
Column name | Description
info | Text to be summarized. (String)
summaries | Summaries generated by workers. (String)
choice | The chosen summary. (String)
batch | Batch for which it was created. (Integer)
split | Split of the dataset between training and validation sets. (String)
extra | Additional information about the given source material available. (String)
File: comparisons_train.csv
Column name | Description
info | Text to be summarized. (String)
summaries | Summaries generated by workers. (String)
choice | The chosen summary. (String)
batch | Batch for which it was created. (Integer)
split | Split of the dataset between training and validation sets. (String)
extra | Additional information about the given source material available. (String)
File: axis_validation.csv
Column name | Description
info | Text to be summarized. (String)
summaries | Summaries generated by workers. (String)
batch | Batch for which it was created. (Integer)
split | Split of the dataset between training and validation sets. (String)
worker | Workers who participated in generating the summary. (String)
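Before working with a downloaded file, it can help to confirm that its header matches the columns listed above. The following is a minimal sketch under that assumption: `EXPECTED_COLUMNS` simply copies the column names from the tables on this page, and `missing_columns` is a hypothetical helper, not part of any library.

```python
import csv
import io

# Column names copied from the tables on this page.
EXPECTED_COLUMNS = {
    "comparisons_validation.csv": ["info", "summaries", "choice", "batch", "split", "extra"],
    "comparisons_train.csv": ["info", "summaries", "choice", "batch", "split", "extra"],
    "axis_validation.csv": ["info", "summaries", "batch", "split", "worker"],
}


def missing_columns(filename, fileobj):
    """Return the documented columns that are absent from the file's header row."""
    header = next(csv.reader(fileobj), [])  # empty file -> empty header
    return [name for name in EXPECTED_COLUMNS[filename] if name not in header]


# Made-up header that lacks the worker column.
sample = io.StringIO("info,summaries,batch,split\n")
problems = missing_columns("axis_validation.csv", sample)
print(problems)
```

An empty list means every documented column was found; extra columns in the file are ignored by this check.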
Acknowledgements
If you use this dataset in your research, please credit Huggingface Hub.