Baselight

Wikipedia Biographies Text Generation Dataset

Wikipedia Biographies: Infobox and First Paragraphs Texts

@kaggle.thedevastator_wikipedia_biographies_text_generation_dataset


About this Dataset


By wiki_bio (From Huggingface)



The dataset contains two columns: input_text and target_text. The input_text column holds the infobox and first paragraph of a Wikipedia biography, which give essential information about the individual's background, accomplishments, and notable features. The target_text column holds the complete biography text extracted from the corresponding Wikipedia page.

To support model training and evaluation, the dataset is divided into three files: train.csv, val.csv, and test.csv. The train.csv file contains the pairs of input text and target text used for training. It supplies the examples from which a language generation model learns to produce coherent biographical text.

The val.csv file provides validation data: additional Wikipedia biographies in the same format. It lets researchers evaluate a trained model on unseen examples during development or fine-tuning.

Finally, the test.csv file offers a separate set of input texts paired with their target texts. Its purpose is to benchmark a finished system, whether a pre-trained model or a newly developed algorithm, on unseen data in order to assess how well it generalizes.
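As a minimal sketch of reading one of these files, the following uses Python's standard csv module. It assumes each file has exactly the header input_text,target_text; an exported copy may carry an extra index column, in which case the header check would need adjusting. The helper name read_pairs and the sample row are made up for illustration and are not taken from the dataset.

```python
import csv
import io

EXPECTED_COLUMNS = ["input_text", "target_text"]


def read_pairs(handle):
    """Yield (input_text, target_text) tuples from an open CSV file object."""
    # Long biography fields can exceed the csv module's default field size limit.
    csv.field_size_limit(10**7)
    reader = csv.DictReader(handle)
    if reader.fieldnames != EXPECTED_COLUMNS:
        raise ValueError(f"unexpected columns: {reader.fieldnames}")
    for row in reader:
        yield row["input_text"], row["target_text"]


# Tiny in-memory stand-in for train.csv (made-up content, for illustration only).
sample = io.StringIO(
    "input_text,target_text\n"
    '"name: Jane Doe, born: 1900","Jane Doe was a painter."\n'
)
pairs = list(read_pairs(sample))
```

For a real file, pass an open handle instead, for example open("train.csv", newline="", encoding="utf-8"), assuming the file sits in the working directory. Because read_pairs is a generator, the large training file is streamed row by row rather than loaded into memory at once.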

This description gives an overview of the dataset's structure and its intended uses in natural language processing research, such as text generation and summarization. Researchers can use the collection to build automatic biography-writing systems, or other content generation systems that must produce coherent text from partial information such as an infobox or opening paragraph taken from an online encyclopedia like Wikipedia.

How to use the dataset

  • Overview:

    • This dataset consists of biographical information from Wikipedia pages, specifically the infobox and the first paragraph of each biography.
    • The dataset is provided in three separate files: train.csv, val.csv, and test.csv.
    • Each file contains pairs of input text and target text.
  • File Descriptions:

    • train.csv: This file is used for training purposes. It includes pairs of input text (infobox and first paragraph) and target text (complete biography).
    • val.csv: This file is used for validation. It contains a collection of biographies with infobox and first paragraph texts.
    • test.csv: This file can be used to generate complete biographies based on the given input texts.
  • Column Information:

    a) For train.csv:

    • input_text: Input text column containing the infobox and first paragraph of a Wikipedia biography.
    • target_text: Target text column containing the complete biography text for each entry.

    b) For val.csv:

    • input_text: Infobox and first paragraph texts are included in this column.
    • target_text: Complete biography texts are present in this column.

    c) For test.csv:
    The columns follow the same pattern: input_text followed by target_text.

  • Usage Guidelines:

    • Training Model or Algorithm Development:
      If you are training a model or developing an algorithm to generate complete biographies from the given inputs, use train.csv as your primary dataset.

    • Model Validation or Evaluation:
      To validate or evaluate your trained model, use val.csv as an independent dataset. It contains biographies that have been withheld from the training data.

    • Generating Biographies with Trained Models:
      To generate complete biographies with your trained model, use test.csv. It provides input texts for which you generate the corresponding target texts.

  • Additional Information and Tips:

    • The input text in this dataset includes both an infobox (a structured section containing key-value pairs) and the first paragraph of a Wikipedia biography.

    • The target text is the complete biography for each entry.

    • While working with this dataset, make sure to preprocess and
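The validation and generation steps in the usage guidelines can be sketched as a small scoring loop. This is only an illustration under stated assumptions: the dataset page does not prescribe a metric, so the whitespace-token F1 used here is a stand-in chosen for simplicity, and generate is a hypothetical callable representing whatever trained model you have. The names unigram_f1, evaluate, and echo_baseline are invented for this example.

```python
from collections import Counter


def unigram_f1(prediction, reference):
    """Token-level F1 between two strings, using lowercase whitespace tokens."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return 0.0
    # Multiset intersection counts each shared token at most as often as it
    # appears in both strings.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def evaluate(generate, pairs):
    """Average unigram F1 of generate(input_text) against target_text."""
    scores = [unigram_f1(generate(inp), tgt) for inp, tgt in pairs]
    return sum(scores) / len(scores) if scores else 0.0


def echo_baseline(input_text):
    """Hypothetical stand-in for a trained model: returns its input unchanged."""
    return input_text


# Made-up pairs standing in for rows of val.csv or test.csv.
demo_pairs = [
    ("jane doe was a painter", "jane doe was a painter"),
    ("a b", "c d"),
]
demo_score = evaluate(echo_baseline, demo_pairs)
```

Run the loop on val.csv while developing and reserve test.csv for the final benchmark, so that the reported score reflects data the model was never tuned against.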

Research Ideas

  • Text Generation: The dataset can be used to train language models to generate complete Wikipedia biographies given only the infobox and first paragraph as input. This can be useful for automating the generation of biographies or expanding existing ones.
  • Information Extraction: The dataset can also be used for information extraction tasks, where models can be trained to extract specific details or facts from the infobox and first paragraph of a biography, such as birth date, occupation, or notable achievements.
  • Language Understanding: Models can be trained on this dataset to better understand biographical text. This understanding can extend to related tasks such as question answering, summarization, or sentiment analysis for Wikipedia biographies.
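To illustrate the information extraction idea, here is a rough sketch that pulls key-value pairs out of an infobox. It rests on a loud assumption: that the infobox has already been rendered as one "key: value" pair per line. The page does not document how input_text is actually serialized, so inspect a few rows and adapt the parsing before relying on this. The function name parse_infobox and the sample text are hypothetical.

```python
def parse_infobox(text):
    """Parse 'key: value' lines into a dict; lines without a colon are skipped.

    ASSUMPTION: one pair per line. The real input_text layout may differ.
    """
    fields = {}
    for line in text.splitlines():
        # partition splits on the first colon only, so values may contain colons.
        key, sep, value = line.partition(":")
        if sep and key.strip():
            fields[key.strip().lower()] = value.strip()
    return fields


# Made-up infobox text, for illustration only.
sample_infobox = "name: Jane Doe\nbirth_date: 1 January 1900\noccupation: painter"
fields = parse_infobox(sample_infobox)
```

Fields recovered this way, such as a birth date or occupation, could serve as labels for an extraction model or as a check that generated biographies stay consistent with the infobox.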

Acknowledgements

If you use this dataset in your research, please credit the original authors.

License

License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication
No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.

Columns

File: train.csv

Column name    Description
input_text     This column contains the input text, which includes the infobox and first paragraph of a Wikipedia biography. (Text)
target_text    This column contains the target text, which is the complete biography text of a Wikipedia page. (Text)

File: val.csv

Column name    Description
input_text     This column contains the input text, which includes the infobox and first paragraph of a Wikipedia biography. (Text)
target_text    This column contains the target text, which is the complete biography text of a Wikipedia page. (Text)

File: test.csv

Column name    Description
input_text     This column contains the input text, which includes the infobox and first paragraph of a Wikipedia biography. (Text)
target_text    This column contains the target text, which is the complete biography text of a Wikipedia page. (Text)

Acknowledgements

If you use this dataset in your research, please credit wiki_bio (From Huggingface).

Tables

Test

@kaggle.thedevastator_wikipedia_biographies_text_generation_dataset.test
  • 41.91 MB
  • 72831 rows
  • 2 columns

CREATE TABLE test (
  "input_text" VARCHAR,
  "target_text" VARCHAR
);

Train

@kaggle.thedevastator_wikipedia_biographies_text_generation_dataset.train
  • 335.9 MB
  • 582659 rows
  • 2 columns

CREATE TABLE train (
  "input_text" VARCHAR,
  "target_text" VARCHAR
);

Val

@kaggle.thedevastator_wikipedia_biographies_text_generation_dataset.val
  • 41.98 MB
  • 72831 rows
  • 2 columns

CREATE TABLE val (
  "input_text" VARCHAR,
  "target_text" VARCHAR
);
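The two-column schema shared by the three tables can be tried out locally. The sketch below recreates the val table in an in-memory SQLite database through Python's standard sqlite3 module and runs two simple queries. SQLite is an assumption made for convenience here, not the engine that hosts these tables, and the inserted rows are made up.

```python
import sqlite3

# In-memory database, so nothing is written to disk.
conn = sqlite3.connect(":memory:")

# Same column names and declared types as the schema shown for the tables.
conn.execute('CREATE TABLE val ("input_text" VARCHAR, "target_text" VARCHAR)')

# Made-up rows, for illustration only.
conn.executemany(
    "INSERT INTO val VALUES (?, ?)",
    [
        ("name: Jane Doe", "Jane Doe was a painter."),
        ("name: John Roe", "John Roe was a composer."),
    ],
)

# Row count and average target length in characters.
row_count = conn.execute("SELECT COUNT(*) FROM val").fetchone()[0]
avg_target_len = conn.execute(
    'SELECT AVG(LENGTH("target_text")) FROM val'
).fetchone()[0]
conn.close()
```

Summary queries of this kind, run against the full tables, are a quick way to confirm the row counts listed for each split and to spot unusually short or long biographies before training.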
