MultiNLI Textual Entailment Corpus
Evaluating Cross-Genre Generalization Performance
By Huggingface Hub [source]
About this dataset
The MultiNLI corpus is a crowd-sourced collection of 433K sentence pairs developed for research on general-purpose textual reasoning. It covers a range of spoken and written genres, which lets researchers study how language use differs by genre and evaluate textual reasoning through cross-genre generalization tests.
Each file has columns for premise, premise_binary_parse, premise_parse, hypothesis, hypothesis_binary_parse, hypothesis_parse, genre and label. Because the sentence pairs come from a variety of sources, the corpus can be used to look for linguistic similarities between domains that differ in purpose or delivery, and to develop systems whose textual entailment performance does not depend on the source material they saw during training.
How to use the dataset
The corpus consists of sentence pairs, each labeled with one of three textual entailment relations (entailment, contradiction, neutral), so researchers can train models that infer the relationship between two sentences. It also covers a range of spoken and written genres, which makes it possible to study cross-genre generalization performance.
To use this dataset, first familiarize yourself with its format. Each entry contains eight fields: the premise and the hypothesis, a binary parse and a full parse of each, the genre, and the label. These fields support tasks such as natural language inference (also known as recognizing textual entailment). The parse fields are stored as bracketed strings, so a program can recover the tokens or the tree structure from them without running a parser of its own.
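As a rough illustration of the format described above, the sketch below reads rows with Python's standard csv module, checks that the eight documented columns are present, and recovers the tokens from a binary parse by dropping its brackets. The helper names (read_pairs, tokens_from_binary_parse) and the one-row sample are made up for this example, and the sample's label value is a placeholder, so check how labels are encoded in your own copy of the files.

```python
import csv
import io

# The eight columns listed in the dataset description.
EXPECTED_COLUMNS = [
    "premise", "premise_binary_parse", "premise_parse",
    "hypothesis", "hypothesis_binary_parse", "hypothesis_parse",
    "genre", "label",
]


def read_pairs(handle):
    """Yield one dict per sentence pair from an open CSV file handle."""
    reader = csv.DictReader(handle)
    missing = set(EXPECTED_COLUMNS) - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")
    yield from reader


def tokens_from_binary_parse(parse):
    """Recover the token sequence by dropping the bracket symbols.

    Assumes brackets and tokens are separated by spaces.
    """
    return [tok for tok in parse.split() if tok not in ("(", ")")]


# Invented one-row sample in the documented layout (not real corpus data).
SAMPLE_CSV = (
    "premise,premise_binary_parse,premise_parse,"
    "hypothesis,hypothesis_binary_parse,hypothesis_parse,genre,label\n"
    "The cat sat .,( ( The cat ) ( sat . ) ),"
    "(ROOT (S (NP (DT The) (NN cat)) (VP (VBD sat)) (. .))),"
    "A cat sat .,( ( A cat ) ( sat . ) ),"
    "(ROOT (S (NP (DT A) (NN cat)) (VP (VBD sat)) (. .))),"
    "fiction,entailment\n"
)

rows = list(read_pairs(io.StringIO(SAMPLE_CSV)))
premise_tokens = tokens_from_binary_parse(rows[0]["premise_binary_parse"])
print(premise_tokens)
```

For the real files, the same function can be given a handle opened with open("train.csv", newline="", encoding="utf-8").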
Beyond research, systems trained on this dataset could be adopted by companies that build NLP (Natural Language Processing) products. For example, customer service operators or banks could automate certain processes that take natural language input from users or clients, combining inference over sentence pairs with techniques such as sentiment analysis or summarization.
Research Ideas
- Training textual entailment models to evaluate cross-genre generalization ability.
- Exploring the frequency of entailment relationships across different genres of text (spoken vs. written).
- Using the label data to build a classifier that predicts the relationship between sentence pairs from different genres, for use in real-world applications such as sentiment analysis and question answering systems.
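The first two ideas can be sketched with a pair of small helpers: one that counts labels within each genre, and one that breaks a model's accuracy down by genre. This is a minimal sketch, assuming rows are dicts with "genre" and "label" keys (as csv.DictReader would produce). The function names, the toy rows and the toy predictions are all invented for illustration.

```python
from collections import Counter, defaultdict


def label_counts_by_genre(rows):
    """Count how often each label occurs within each genre."""
    counts = defaultdict(Counter)
    for row in rows:
        counts[row["genre"]][row["label"]] += 1
    return counts


def accuracy_by_genre(rows, predictions):
    """Fraction of correct predictions per genre.

    predictions must be in the same order as rows.
    """
    correct = Counter()
    total = Counter()
    for row, pred in zip(rows, predictions):
        total[row["genre"]] += 1
        correct[row["genre"]] += int(pred == row["label"])
    return {genre: correct[genre] / total[genre] for genre in total}


# Toy data for illustration only (not taken from the corpus).
toy_rows = [
    {"genre": "telephone", "label": "neutral"},
    {"genre": "telephone", "label": "entailment"},
    {"genre": "fiction", "label": "contradiction"},
    {"genre": "fiction", "label": "entailment"},
]
toy_predictions = ["neutral", "neutral", "contradiction", "entailment"]

counts = label_counts_by_genre(toy_rows)
scores = accuracy_by_genre(toy_rows, toy_predictions)
for genre, score in scores.items():
    print(genre, dict(counts[genre]), score)
```

Reporting accuracy per genre rather than as a single number makes it easier to see whether a model generalizes across genres or only does well on some of them.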
Acknowledgements
If you use this dataset in your research, please credit the original authors.
Data Source
License
License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication
No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.
Columns
File: train.csv
Column name | Description |
premise | The premise of the sentence pair. (String) |
premise_binary_parse | The binary parse of the premise sentence. (String) |
premise_parse | The parse of the premise sentence. (String) |
hypothesis | The hypothesis of the sentence pair. (String) |
hypothesis_binary_parse | The binary parse of the hypothesis sentence. (String) |
hypothesis_parse | The parse of the hypothesis sentence. (String) |
genre | The genre of the sentence pair. (String) |
label | The label of the sentence pair. (String) |
File: validation_matched.csv
Column name | Description |
premise | The premise of the sentence pair. (String) |
premise_binary_parse | The binary parse of the premise sentence. (String) |
premise_parse | The parse of the premise sentence. (String) |
hypothesis | The hypothesis of the sentence pair. (String) |
hypothesis_binary_parse | The binary parse of the hypothesis sentence. (String) |
hypothesis_parse | The parse of the hypothesis sentence. (String) |
genre | The genre of the sentence pair. (String) |
label | The label of the sentence pair. (String) |
File: validation_mismatched.csv
Column name | Description |
premise | The premise of the sentence pair. (String) |
premise_binary_parse | The binary parse of the premise sentence. (String) |
premise_parse | The parse of the premise sentence. (String) |
hypothesis | The hypothesis of the sentence pair. (String) |
hypothesis_binary_parse | The binary parse of the hypothesis sentence. (String) |
hypothesis_parse | The parse of the hypothesis sentence. (String) |
genre | The genre of the sentence pair. (String) |
label | The label of the sentence pair. (String) |
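The three files share the same columns and differ in which genres they draw from. In the original MultiNLI release, the matched validation set comes from the same genres as the training data and the mismatched set from genres not seen in training. One way to confirm this for your own copy is to compare the genre values directly. The sketch below assumes rows are dicts with a "genre" key. The function name and the toy rows are invented for illustration.

```python
def genre_overlap(train_rows, eval_rows):
    """Split the evaluation genres into those seen and unseen in training."""
    train_genres = {row["genre"] for row in train_rows}
    eval_genres = {row["genre"] for row in eval_rows}
    return {
        "shared": sorted(train_genres & eval_genres),
        "unseen": sorted(eval_genres - train_genres),
    }


# Toy rows for illustration only (not taken from the corpus).
toy_train = [{"genre": "fiction"}, {"genre": "telephone"}]
toy_matched = [{"genre": "fiction"}]
toy_mismatched = [{"genre": "letters"}]

matched_report = genre_overlap(toy_train, toy_matched)
mismatched_report = genre_overlap(toy_train, toy_mismatched)
print(matched_report)
print(mismatched_report)
```

If the files follow the original release, the matched report should have an empty "unseen" list and the mismatched report an empty "shared" list.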
Acknowledgements
If you use this dataset in your research, please credit Huggingface Hub.