MultiNLI Textual Entailment Corpus by Kaggle | Other

About this Dataset

MultiNLI Textual Entailment Corpus

Evaluating Cross-Genre Generalization Performance

By Huggingface Hub

The MultiNLI (Multi-Genre Natural Language Inference) corpus is a crowd-sourced collection of 433K sentence pairs developed for research on general-purpose textual reasoning. It draws on a range of spoken and written genres, which lets researchers study how language use differs by genre and evaluate textual reasoning through cross-genre generalization tests.
The files contain columns for premise, premise_binary_parse, premise_parse, hypothesis, hypothesis_binary_parse, hypothesis_parse, genre and label. Because the sentence pairs come from a variety of sources, the corpus can be used to look for linguistic similarities between domains that are normally considered distinct in purpose or delivery. It also supports the development of textual entailment systems that perform well on material unlike the sources they were trained on.

How to use the dataset

The corpus consists of sentence pairs, each labeled with one of three textual entailment relations (entailment, contradiction, neutral), so researchers can train models to infer the relationship between two sentences. It also spans a range of spoken and written genres, which makes it possible to study cross-genre generalization performance.
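As a rough illustration of the label and genre structure described above, the sketch below tallies labels per genre using only the Python standard library. The sample rows are made up for illustration. The integer-to-name mapping (0 = entailment, 1 = neutral, 2 = contradiction) follows the convention of the Hugging Face `multi_nli` dataset; that this export uses the same encoding is an assumption, so check a few rows before relying on it.

```python
import csv
import io
from collections import Counter

# Assumed mapping, borrowed from the Hugging Face multi_nli convention.
LABEL_NAMES = {0: "entailment", 1: "neutral", 2: "contradiction"}

# Hypothetical rows standing in for train.csv (only the columns used here).
SAMPLE_CSV = """premise,hypothesis,genre,label
A man is playing a guitar.,A person is making music.,fiction,0
She bought two tickets.,She is going to a concert.,telephone,1
The shop is closed today.,The shop is open today.,travel,2
A dog runs in the park.,An animal is outside.,fiction,0
"""


def count_labels_by_genre(handle):
    """Return a Counter keyed by (genre, label_name)."""
    counts = Counter()
    for row in csv.DictReader(handle):
        # Unknown integer codes are kept visible rather than dropped.
        label_name = LABEL_NAMES.get(int(row["label"]), "unknown")
        counts[(row["genre"], label_name)] += 1
    return counts


counts = count_labels_by_genre(io.StringIO(SAMPLE_CSV))
for (genre, label_name), n in sorted(counts.items()):
    print(f"{genre:10s} {label_name:14s} {n}")
```

To run this on the real file, replace the `io.StringIO(...)` argument with something like `open("train.csv", newline="", encoding="utf-8")`.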

To use this dataset, first familiarize yourself with its format. Each entry contains a premise, a hypothesis, a genre and a label, plus a parse and a binary parse for both the premise and the hypothesis. The tables listed further down also include promptid and pairid identifier columns. These fields support tasks such as natural language inference (NLI), also known as recognizing textual entailment (RTE). The parse fields store syntactic trees as bracketed strings, which makes them straightforward to process programmatically when implementing NLI algorithms or models.
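To make the parse fields more concrete, here is a minimal sketch of reading a binary parse string into a nested structure and recovering its tokens. It assumes the strings are whitespace-separated and bracketed, like `( ( The cat ) ( sat . ) )`, and that literal parentheses inside the sentence are escaped rather than written as bare `(` or `)`. The example string is hypothetical and the helper names `parse_binary_tree` and `leaves` are illustrative, not part of the dataset.

```python
def parse_binary_tree(text):
    """Parse a bracketed parse string into nested lists; leaves are tokens."""
    stack = [[]]
    for tok in text.split():
        if tok == "(":
            stack.append([])          # open a new subtree
        elif tok == ")":
            if len(stack) < 2:
                raise ValueError("unbalanced ')' in parse string")
            node = stack.pop()        # close the subtree and attach it
            stack[-1].append(node)
        else:
            stack[-1].append(tok)     # ordinary token
    if len(stack) != 1:
        raise ValueError("unbalanced '(' in parse string")
    root = stack[0]
    return root[0] if len(root) == 1 else root


def leaves(tree):
    """Return the tokens of a parsed tree in left-to-right order."""
    if isinstance(tree, str):
        return [tree]
    tokens = []
    for child in tree:
        tokens.extend(leaves(child))
    return tokens


example = "( ( The cat ) ( sat . ) )"  # hypothetical parse string
tree = parse_binary_tree(example)
print(tree)          # [['The', 'cat'], ['sat', '.']]
print(leaves(tree))  # ['The', 'cat', 'sat', '.']
```

The nested-list form can feed tree-structured models, while `leaves` gives a plain token sequence for models that ignore syntax.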

Beyond research, systems developed with this dataset could be adopted by companies that build NLP (Natural Language Processing) products. For example, businesses that handle customer support or banking requests could use such models to automate parts of their processing of natural language input from users and clients, alongside techniques such as sentiment analysis or sentence summarization.

Research Ideas

Training textual entailment models to evaluate cross-genre generalization ability.

Exploring the frequency of entailment relationships across different genres of text (spoken vs. written).

Using the label data to train a classifier that identifies the relationship between sentence pairs from different genres, for use as a component in real-world applications such as sentiment analysis and question answering systems.
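The cross-genre evaluation idea above can be sketched as a per-genre accuracy breakdown. This is a minimal illustration, not a full evaluation pipeline: the gold rows and predictions are made up, and `accuracy_by_genre` is a hypothetical helper name. In practice the predictions would come from a model trained on train.csv and scored on validation_matched.csv and validation_mismatched.csv.

```python
from collections import defaultdict


def accuracy_by_genre(examples, predictions):
    """Compute accuracy per genre.

    examples: sequence of dicts with 'genre' and 'label' keys.
    predictions: predicted labels in the same order as examples.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for ex, pred in zip(examples, predictions):
        total[ex["genre"]] += 1
        correct[ex["genre"]] += int(ex["label"] == pred)
    return {genre: correct[genre] / total[genre] for genre in total}


# Hypothetical gold rows and model predictions.
gold = [
    {"genre": "fiction", "label": 0},
    {"genre": "fiction", "label": 2},
    {"genre": "letters", "label": 1},
    {"genre": "letters", "label": 1},
]
preds = [0, 1, 1, 1]

scores = accuracy_by_genre(gold, preds)
print(scores)  # {'fiction': 0.5, 'letters': 1.0}
```

Comparing the scores for genres seen during training with those for unseen genres gives a simple view of how well a model generalizes across genres.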

Acknowledgements

If you use this dataset in your research, please credit the original authors.

License

License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication
No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.

Columns

File: train.csv

Column name	Description
premise	The premise of the sentence pair. (String)
premise_binary_parse	The binary parse of the premise sentence. (String)
premise_parse	The parse of the premise sentence. (String)
hypothesis	The hypothesis of the sentence pair. (String)
hypothesis_binary_parse	The binary parse of the hypothesis sentence. (String)
hypothesis_parse	The parse of the hypothesis sentence. (String)
genre	The genre of the sentence pair. (String)
label	The entailment label of the sentence pair. (Integer)

File: validation_matched.csv

Column name	Description
premise	The premise of the sentence pair. (String)
premise_binary_parse	The binary parse of the premise sentence. (String)
premise_parse	The parse of the premise sentence. (String)
hypothesis	The hypothesis of the sentence pair. (String)
hypothesis_binary_parse	The binary parse of the hypothesis sentence. (String)
hypothesis_parse	The parse of the hypothesis sentence. (String)
genre	The genre of the sentence pair. (String)
label	The entailment label of the sentence pair. (Integer)

File: validation_mismatched.csv

Column name	Description
premise	The premise of the sentence pair. (String)
premise_binary_parse	The binary parse of the premise sentence. (String)
premise_parse	The parse of the premise sentence. (String)
hypothesis	The hypothesis of the sentence pair. (String)
hypothesis_binary_parse	The binary parse of the hypothesis sentence. (String)
hypothesis_parse	The parse of the hypothesis sentence. (String)
genre	The genre of the sentence pair. (String)
label	The entailment label of the sentence pair. (Integer)
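Since all three files are described with the same columns, a quick header check before loading can catch a truncated or renamed export. The sketch below is one way to do that with the standard library. The expected column list is taken from the table definitions later on this page (which also include promptid and pairid), and the one-line sample header is hypothetical.

```python
import csv
import io

# Column names as listed in the table definitions on this page.
EXPECTED_COLUMNS = [
    "promptid", "pairid",
    "premise", "premise_binary_parse", "premise_parse",
    "hypothesis", "hypothesis_binary_parse", "hypothesis_parse",
    "genre", "label",
]


def missing_columns(handle, expected=EXPECTED_COLUMNS):
    """Return the expected column names that are absent from the CSV header."""
    header = next(csv.reader(handle), [])
    return [name for name in expected if name not in header]


# Hypothetical header that lacks the parse columns.
sample = io.StringIO("promptid,pairid,premise,hypothesis,genre,label\n")
missing = missing_columns(sample)
print(missing)
```

For the real files, pass `open("train.csv", newline="", encoding="utf-8")` (and likewise for the two validation files); an empty list means every expected column is present.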

Acknowledgements

If you use this dataset in your research, please credit the original authors and Huggingface Hub.

Tables

Train

@kaggle.thedevastator_multinli_textual_entailment_corpus.train

198.35 MB
392702 rows
10 columns


CREATE TABLE train (
  "promptid" BIGINT,
  "pairid" VARCHAR,
  "premise" VARCHAR,
  "premise_binary_parse" VARCHAR,
  "premise_parse" VARCHAR,
  "hypothesis" VARCHAR,
  "hypothesis_binary_parse" VARCHAR,
  "hypothesis_parse" VARCHAR,
  "genre" VARCHAR,
  "label" BIGINT
);

Validation Matched

@kaggle.thedevastator_multinli_textual_entailment_corpus.validation_matched

3.34 MB
9815 rows
10 columns


CREATE TABLE validation_matched (
  "promptid" BIGINT,
  "pairid" VARCHAR,
  "premise" VARCHAR,
  "premise_binary_parse" VARCHAR,
  "premise_parse" VARCHAR,
  "hypothesis" VARCHAR,
  "hypothesis_binary_parse" VARCHAR,
  "hypothesis_parse" VARCHAR,
  "genre" VARCHAR,
  "label" BIGINT
);

Validation Mismatched

@kaggle.thedevastator_multinli_textual_entailment_corpus.validation_mismatched

3.49 MB
9832 rows
10 columns


CREATE TABLE validation_mismatched (
  "promptid" BIGINT,
  "pairid" VARCHAR,
  "premise" VARCHAR,
  "premise_binary_parse" VARCHAR,
  "premise_parse" VARCHAR,
  "hypothesis" VARCHAR,
  "hypothesis_binary_parse" VARCHAR,
  "hypothesis_parse" VARCHAR,
  "genre" VARCHAR,
  "label" BIGINT
);
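The table definitions above can be tried out locally. The following sketch loads the validation_matched schema into an in-memory SQLite database through Python's `sqlite3` module and counts rows per genre. The inserted rows are made up for illustration, and SQLite is only a stand-in here: the platform that hosts these tables may use a different SQL engine.

```python
import sqlite3

# Schema copied from the validation_matched definition above.
DDL = """
CREATE TABLE validation_matched (
  "promptid" BIGINT,
  "pairid" VARCHAR,
  "premise" VARCHAR,
  "premise_binary_parse" VARCHAR,
  "premise_parse" VARCHAR,
  "hypothesis" VARCHAR,
  "hypothesis_binary_parse" VARCHAR,
  "hypothesis_parse" VARCHAR,
  "genre" VARCHAR,
  "label" BIGINT
);
"""

# Hypothetical rows; parse columns are left empty for brevity.
rows = [
    (1, "1a", "A dog runs.", "", "", "An animal moves.", "", "", "fiction", 0),
    (2, "2a", "It rained all day.", "", "", "It was sunny.", "", "", "fiction", 2),
    (3, "3a", "The bill passed.", "", "", "A vote took place.", "", "", "government", 0),
]

conn = sqlite3.connect(":memory:")
conn.executescript(DDL)
conn.executemany(
    "INSERT INTO validation_matched VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?)", rows
)
genre_counts = conn.execute(
    "SELECT genre, COUNT(*) FROM validation_matched GROUP BY genre ORDER BY genre"
).fetchall()
conn.close()

print(genre_counts)  # [('fiction', 2), ('government', 1)]
```

The same query shape works for the train and validation_mismatched tables, since all three share one schema.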