WikiSplit
One million English sentences, each split into two sentences that preserve the original meaning
By Huggingface Hub [source]
About this dataset
The WikiSplit dataset contains over one million English sentences, each split into two shorter sentences that preserve the original meaning. All sentences are sourced from the publicly available Wikipedia revision history. The dataset has three columns: complex_sentence, simple_sentence_1, and simple_sentence_2, which together show how each sentence splits into its constituent parts. With this data it is possible to study how complex sentences decompose into simpler ones, and to identify patterns in language that are useful for a range of natural language processing applications.
How to use the dataset
In this dataset, complex_sentence is the original sentence from Wikipedia, while simple_sentence_1 and simple_sentence_2 are its respective split sentences. By exploring the relationships between these new splits and the original sentence, users can uncover intricate patterns within text that were previously unidentified.
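The files described later in this card are plain CSVs with these three columns, so they can be inspected with pandas. The following is a minimal sketch using a tiny in-memory sample in the same schema; the sentence pair is invented for illustration and does not come from the dataset itself:

```python
import io

import pandas as pd

# Hypothetical sample row in the card's schema. Real rows come from
# validation.csv, train.csv, or test.csv.
sample_csv = io.StringIO(
    "complex_sentence,simple_sentence_1,simple_sentence_2\n"
    '"The bridge was built in 1932 and it spans the river.",'
    '"The bridge was built in 1932.","It spans the river."\n'
)

df = pd.read_csv(sample_csv)

# Each row pairs one complex sentence with its two simpler splits.
print(df.columns.tolist())
print(df.loc[0, "simple_sentence_1"])
```

To work with the real data, replace the in-memory buffer with a path to one of the CSV files.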
Using This Dataset
This dataset can be used in a variety of ways to study complex text relations associated with each individual sentence. It is particularly useful for studying natural language processing techniques such as sentiment analysis, topic modeling, and question-answering systems. Some potential use cases include:
- Developing new algorithms to better understand how the split sentences relate to each other in terms of sentiment or topic;
- Extracting keywords from both sentences using information retrieval algorithms as a means of analyzing their relationship;
- Building a classification system designed to classify sentence pairs according to different qualities;
- Creating an automated system capable of merging the split fragments back together in order to recover the original information contained within them.
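The last use case above, checking whether the two splits jointly recover the original sentence, can be approximated with a simple token-overlap score. This is a minimal sketch; the `token_overlap` helper and the example pair are invented for illustration:

```python
import re


def tokens(text):
    """Lowercase word tokens, ignoring punctuation."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def token_overlap(complex_sentence, split_1, split_2):
    """Jaccard overlap between the complex sentence's tokens and the
    tokens of the two splits joined together, as a rough check that
    the splits preserve the original wording."""
    original = tokens(complex_sentence)
    rejoined = tokens(split_1 + " " + split_2)
    return len(original & rejoined) / len(original | rejoined)


# Invented example pair in the dataset's format.
score = token_overlap(
    "The bridge was built in 1932 and it spans the river.",
    "The bridge was built in 1932.",
    "It spans the river.",
)
print(score)
```

A score near 1.0 suggests the splits preserve the original surface wording; checking that the *meaning* is preserved would require a semantic measure such as sentence embeddings.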
Research Ideas
- Training natural language processing (NLP) models to recognize language patterns and relationships between the two split sentences, allowing for more sophisticated text understanding.
- Developing machine learning models which are able to transfer the meaning of input sentences with greater accuracy than previous methods.
- Building summarization models that can generate condensed summaries from complicated sentence structures, retaining important information while removing or rephrasing unimportant words.
Acknowledgements
If you use this dataset in your research, please credit the original authors and Huggingface Hub.
Data Source
License
License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication
No Copyright: you can copy, modify, distribute, and perform the work, even for commercial purposes, all without asking permission.
Columns
File: validation.csv
| Column name | Description |
| --- | --- |
| complex_sentence | The original sentence before it was split into two parts. (String) |
| simple_sentence_1 | The first part of the sentence after it was split. (String) |
| simple_sentence_2 | The second part of the sentence after it was split. (String) |
File: train.csv
| Column name | Description |
| --- | --- |
| complex_sentence | The original sentence before it was split into two parts. (String) |
| simple_sentence_1 | The first part of the sentence after it was split. (String) |
| simple_sentence_2 | The second part of the sentence after it was split. (String) |
File: test.csv
| Column name | Description |
| --- | --- |
| complex_sentence | The original sentence before it was split into two parts. (String) |
| simple_sentence_1 | The first part of the sentence after it was split. (String) |
| simple_sentence_2 | The second part of the sentence after it was split. (String) |