OpenAI Summarization Corpus

Training and Validation Data from TL;DR, CNN, and Daily Mail

@kaggle.thedevastator_openai_summarization_corpus

By Huggingface Hub


About this dataset

This dataset is a corpus for text summarization, intended in particular for validating reward models from OpenAI. It contains summaries of text from the TL;DR, CNN, and Daily Mail datasets, together with the choices workers made between summaries, the batch in which each entry was created, and the dataset split each entry belongs to. With this data, users can train summarization systems on real-world long-form text and compare model output directly against human-generated results.

How to use the dataset

This dataset provides summaries of text from the TL;DR, CNN, and Daily Mail datasets, along with the human choices made between them, for training and evaluating summarization models. It is divided into training and validation data.

To use this dataset for summarization tasks:

  • Read the text to be summarized in the info column of the comparisons .csv files (train and validation).
  • Use the choice column of either .csv file to identify which of the candidate summaries was chosen for that text.
  • Review the other entries in the corresponding summaries column for alternatives with similar content but different wording or style.
  • Consult the split, worker, and batch columns for context on each choice before settling on a summary, judging it by the accuracy and clarity of its content.
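
The steps above can be sketched in Python. This is a minimal sketch rather than a definitive loader: it assumes that each summaries cell holds a Python-literal list of dictionaries with a text key, and that choice is a zero-based index into that list (the table schema stores it as BIGINT). The sample rows and the helper names chosen_summary and load_chosen are hypothetical, so inspect a few real rows before relying on this format.

```python
import ast
import csv
import io

# Hypothetical sample in the shape of comparisons_train.csv (subset of columns).
# Assumption: `summaries` is serialized as a Python-literal list of dicts.
SAMPLE_CSV = """info,summaries,choice,worker,batch,split
"{'post': 'Long post A'}","[{'text': 'Short A1'}, {'text': 'Short A2'}]",1,w1,batch1,train
"{'post': 'Long post B'}","[{'text': 'Short B1'}, {'text': 'Short B2'}]",0,w2,batch1,train
"""


def chosen_summary(row):
    """Return the text of the summary selected by the `choice` column."""
    candidates = ast.literal_eval(row["summaries"])  # parse the serialized list
    return candidates[int(row["choice"])]["text"]


def load_chosen(csv_text):
    """Pair each row's worker and batch with the summary that was chosen."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [(row["worker"], row["batch"], chosen_summary(row)) for row in reader]


if __name__ == "__main__":
    for worker, batch, text in load_chosen(SAMPLE_CSV):
        print(worker, batch, text)
```

To run this against the real files, replace io.StringIO(csv_text) with an open file handle; the parsing step may need adjusting if the cells turn out to be JSON rather than Python literals.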

Research Ideas

  • Training a natural language processing model to generate summaries of text automatically, using the summaries and choice data from this dataset.
  • Evaluating OpenAI's reward model on the validation data in order to measure and improve its accuracy.
  • Analyzing the worker and batch information to identify trends among workers or batches that could indicate bias or other issues affecting summarization accuracy.
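
The second idea, checking a reward model against human preferences, comes down to measuring how often the model gives its highest score to the summary the worker chose. The following is a minimal sketch under stated assumptions: toy_reward is a hypothetical stand-in that simply prefers shorter summaries and is not OpenAI's reward model, the example rows are made up, and choice is assumed to be a zero-based index.

```python
from typing import Callable, List, Tuple

# Each comparison: (text to summarize, candidate summaries, index chosen by the worker).
Comparison = Tuple[str, List[str], int]


def toy_reward(post: str, summary: str) -> float:
    """Hypothetical placeholder scorer: shorter summaries get higher scores."""
    return -float(len(summary))


def preference_accuracy(
    comparisons: List[Comparison],
    reward: Callable[[str, str], float],
) -> float:
    """Fraction of comparisons where the top-scored summary matches the human choice."""
    if not comparisons:
        raise ValueError("need at least one comparison")
    hits = 0
    for post, summaries, choice in comparisons:
        scores = [reward(post, s) for s in summaries]
        predicted = scores.index(max(scores))  # first index wins on ties
        hits += int(predicted == choice)
    return hits / len(comparisons)


# Made-up rows for illustration only.
EXAMPLES: List[Comparison] = [
    ("post one", ["a long rambling summary", "brief"], 1),
    ("post two", ["tiny", "a much longer candidate"], 1),
]

if __name__ == "__main__":
    print(preference_accuracy(EXAMPLES, toy_reward))
```

Any real scorer with the same (post, summary) signature can be passed in place of toy_reward.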

Acknowledgements

If you use this dataset in your research, please credit the original authors and Huggingface Hub.

License

License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication
No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.

Columns

File: comparisons_validation.csv

Column name   Description
info          Text to be summarized. (String)
summaries     Candidate summaries of the text. (String)
choice        Index of the chosen summary. (Integer)
worker        Worker who made the choice. (String)
batch         Batch for which it was created. (String)
split         Split of the dataset between training and validation sets. (String)
extra         Additional information available about the given source material. (String)

File: comparisons_train.csv

Column name   Description
info          Text to be summarized. (String)
summaries     Candidate summaries of the text. (String)
choice        Index of the chosen summary. (Integer)
worker        Worker who made the choice. (String)
batch         Batch for which it was created. (String)
split         Split of the dataset between training and validation sets. (String)
extra         Additional information available about the given source material. (String)

File: axis_validation.csv

Column name   Description
info          Text to be summarized. (String)
summary       Summary of the text. (String)
worker        Worker associated with the entry. (String)
batch         Batch for which it was created. (String)
split         Split of the dataset between training and validation sets. (String)

Tables

Axis Test

@kaggle.thedevastator_openai_summarization_corpus.axis_test
  • 3.47 MB
  • 6312 rows
  • 5 columns

CREATE TABLE axis_test (
  "info" VARCHAR,
  "summary" VARCHAR,
  "worker" VARCHAR,
  "batch" VARCHAR,
  "split" VARCHAR
);

Axis Validation

@kaggle.thedevastator_openai_summarization_corpus.axis_validation
  • 2.26 MB
  • 8585 rows
  • 5 columns

CREATE TABLE axis_validation (
  "info" VARCHAR,
  "summary" VARCHAR,
  "worker" VARCHAR,
  "batch" VARCHAR,
  "split" VARCHAR
);

Comparisons Train

@kaggle.thedevastator_openai_summarization_corpus.comparisons_train
  • 29.23 MB
  • 92858 rows
  • 7 columns

CREATE TABLE comparisons_train (
  "info" VARCHAR,
  "summaries" VARCHAR,
  "choice" BIGINT,
  "worker" VARCHAR,
  "batch" VARCHAR,
  "split" VARCHAR,
  "extra" VARCHAR
);

Comparisons Validation

@kaggle.thedevastator_openai_summarization_corpus.comparisons_validation
  • 30.78 MB
  • 86086 rows
  • 7 columns

CREATE TABLE comparisons_validation (
  "info" VARCHAR,
  "summaries" VARCHAR,
  "choice" BIGINT,
  "worker" VARCHAR,
  "batch" VARCHAR,
  "split" VARCHAR,
  "extra" VARCHAR
);
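
The schema above can be exercised locally. The sketch below recreates comparisons_validation in an in-memory SQLite database and counts, per worker, how many comparisons were recorded and how often the second summary (choice = 1) was picked. SQLite is only a stand-in here and is not claimed to be the engine behind this page; the inserted rows are invented for illustration, and the SUM(choice) shortcut assumes choice only takes the values 0 and 1.

```python
import sqlite3

# DDL copied from the table definition above.
DDL = """
CREATE TABLE comparisons_validation (
  "info" VARCHAR,
  "summaries" VARCHAR,
  "choice" BIGINT,
  "worker" VARCHAR,
  "batch" VARCHAR,
  "split" VARCHAR,
  "extra" VARCHAR
);
"""

# Invented rows for illustration; real cells hold serialized structures.
ROWS = [
    ("post 1", "summaries 1", 0, "w1", "batch1", "valid", "{}"),
    ("post 2", "summaries 2", 1, "w1", "batch1", "valid", "{}"),
    ("post 3", "summaries 3", 1, "w1", "batch2", "valid", "{}"),
    ("post 4", "summaries 4", 0, "w2", "batch2", "valid", "{}"),
]

# Assumption: choice is 0 or 1, so SUM(choice) counts picks of the second summary.
QUERY = """
SELECT worker,
       COUNT(*)    AS n_comparisons,
       SUM(choice) AS n_second_chosen
FROM comparisons_validation
GROUP BY worker
ORDER BY worker;
"""


def per_worker_counts(rows):
    """Load rows into an in-memory table and return per-worker choice counts."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(DDL)
        conn.executemany(
            "INSERT INTO comparisons_validation VALUES (?, ?, ?, ?, ?, ?, ?)", rows
        )
        return conn.execute(QUERY).fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    for worker, n, n_second in per_worker_counts(ROWS):
        print(worker, n, n_second)
```

A worker whose share of choice = 1 sits far from the rest may be worth a closer look, which ties in with the worker and batch analysis suggested under Research Ideas.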
