WikiSplit
One million English sentences, each split into two sentences that preserve the original meaning
By Huggingface Hub [source]
About this dataset
The WikiSplit dataset contains over one million English sentences, each split into two shorter sentences that preserve the original meaning. All sentences are sourced from the publicly available Wikipedia revision history. The dataset has three columns: complex_sentence, simple_sentence_1, and simple_sentence_2, which together show how each sentence splits into its constituent parts. With this data it is possible to study how complex sentences decompose into simpler ones, and to identify patterns in language that are useful for a range of natural language processing applications.
How to use the dataset
In this dataset, complex_sentence is the original sentence from Wikipedia, while simple_sentence_1 and simple_sentence_2 are its respective split sentences. By exploring the relationships between these new splits and the original sentence, users can uncover intricate patterns within text that were previously unidentified.
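The files described later in this card are plain CSVs with these three columns, so they can be inspected with pandas. The following is a minimal sketch using a tiny in-memory sample in the same schema; the sentence pair is invented for illustration and does not come from the dataset itself:

```python
import io

import pandas as pd

# Hypothetical sample row in the card's schema. Real rows come from
# validation.csv, train.csv, or test.csv.
sample_csv = io.StringIO(
    "complex_sentence,simple_sentence_1,simple_sentence_2\n"
    '"The bridge was built in 1932 and it spans the river.",'
    '"The bridge was built in 1932.","It spans the river."\n'
)

df = pd.read_csv(sample_csv)

# Each row pairs one complex sentence with its two simpler splits.
print(df.columns.tolist())
print(df.loc[0, "simple_sentence_1"])
```

To work with the real data, replace the in-memory buffer with a path to one of the CSV files.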
Using This Dataset
This dataset can be used in a variety of ways to study complex text relations associated with each individual sentence. It is particularly useful for studying natural language processing techniques such as sentiment analysis, topic modeling, and question-answering systems. Some potential use cases include:
- Developing new algorithms to better understand how the split sentences relate to each other in terms of sentiment or topic;
- Extracting keywords from both sentences using information retrieval algorithms as a means of analyzing their relationship;
- Building a classification system designed to classify sentence pairs according to different qualities;
- Creating an automated system capable of merging the split fragments back together in order to recover the original information contained within them.
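The last use case above, checking whether the two splits jointly recover the original sentence, can be approximated with a simple token-overlap score. This is a minimal sketch; the `token_overlap` helper and the example pair are invented for illustration:

```python
import re


def tokens(text):
    """Lowercase word tokens, ignoring punctuation."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def token_overlap(complex_sentence, split_1, split_2):
    """Jaccard overlap between the complex sentence's tokens and the
    tokens of the two splits joined together, as a rough check that
    the splits preserve the original wording."""
    original = tokens(complex_sentence)
    rejoined = tokens(split_1 + " " + split_2)
    return len(original & rejoined) / len(original | rejoined)


# Invented example pair in the dataset's format.
score = token_overlap(
    "The bridge was built in 1932 and it spans the river.",
    "The bridge was built in 1932.",
    "It spans the river.",
)
print(score)
```

A score near 1.0 suggests the splits preserve the original surface wording; checking that the *meaning* is preserved would require a semantic measure such as sentence embeddings.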
Research Ideas
- Training natural language processing (NLP) models to recognize language patterns and relationships between the two split sentences, allowing for more sophisticated text understanding.
- Developing machine learning models which are able to transfer the meaning of input sentences with greater accuracy than previous methods.
- Building summarization models that can generate condensed summaries from complicated sentence structures, retaining important information while removing or rephrasing unimportant words.
Acknowledgements
If you use this dataset in your research, please credit the original authors and Huggingface Hub.
Data Source
License
License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication
No Copyright: you can copy, modify, distribute, and perform the work, even for commercial purposes, all without asking permission.
Columns
File: validation.csv
| Column name | Description |
| --- | --- |
| complex_sentence | The original sentence before it was split into two parts. (String) |
| simple_sentence_1 | The first part of the sentence after it was split. (String) |
| simple_sentence_2 | The second part of the sentence after it was split. (String) |
File: train.csv
| Column name | Description |
| --- | --- |
| complex_sentence | The original sentence before it was split into two parts. (String) |
| simple_sentence_1 | The first part of the sentence after it was split. (String) |
| simple_sentence_2 | The second part of the sentence after it was split. (String) |
File: test.csv
| Column name | Description |
| --- | --- |
| complex_sentence | The original sentence before it was split into two parts. (String) |
| simple_sentence_1 | The first part of the sentence after it was split. (String) |
| simple_sentence_2 | The second part of the sentence after it was split. (String) |