SciTail (Multiple-choice science exams)
27,026 Multiple-choice science exams and web sentences
By Huggingface Hub
About this dataset
The SciTail dataset is a Natural Language Inference (NLI) dataset built from multiple-choice science exams and web sentences. Each of its 27,026 examples pairs a premise with a hypothesis and carries a label that says whether the premise entails the hypothesis. The data is provided in four formats (DGEM, predictor, SNLI, and TSV), stored as CSV files for training, validation, and testing. The formats carry the same examples in different structures, so you can pick whichever one matches your tooling.
How to use the dataset
This guide explains how to use the SciTail dataset for Natural Language Inference (NLI). NLI is a machine learning task in which a model reads a premise and a hypothesis and predicts the relation between them, such as entailment, contradiction, or neutral. SciTail is an entailment dataset with two labels: a pair is marked as entailment when the premise supports the hypothesis and as neutral otherwise. The hypotheses are derived from multiple-choice science exam questions and their answers, and the premises are sentences retrieved from the web.
The SciTail files come in four formats: DGEM format, predictor format, SNLI format, and TSV format, each split into files for training, validation, or testing. The formats describe the same premise-hypothesis pairs with different fields. The TSV files hold only the premise, hypothesis, and label; the DGEM files add a graph structure for the hypothesis; the predictor files add the answer to the source question; and the SNLI files add sentence parses and the labels assigned by annotators.
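Because the formats differ mainly in their columns, a file's format can be guessed from its header. The sketch below is one possible way to do that, using only the column names listed in the Columns section; the function name `detect_format` and the returned strings are illustrative choices, not part of the dataset.

```python
def detect_format(fieldnames):
    """Guess which SciTail format a CSV header belongs to.

    The checks rely only on columns listed in the Columns section.
    The returned names mirror the file name prefixes (dgem, snli,
    predictor, tsv) and are an illustrative convention.
    """
    cols = set(fieldnames)
    if "hypothesis_graph_structure" in cols:
        return "dgem"
    if "sentence1_binary_parse" in cols:
        return "snli"
    if "answer" in cols:
        return "predictor"
    if {"premise", "hypothesis", "label"} <= cols:
        return "tsv"
    return "unknown"


print(detect_format(["premise", "hypothesis", "label"]))
print(detect_format(["answer", "sentence2_structure", "sentence1", "gold_label"]))
```

With a real file, the header can be taken from `csv.DictReader(handle).fieldnames` and passed to the function.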
To get started, download the files in whichever format you prefer from Kaggle. All files are stored as CSV files. Each row is a single data point: a premise-hypothesis pair with a label that indicates whether the premise entails the hypothesis.
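As a minimal sketch of reading one of these files, the example below parses rows with Python's standard `csv` module and counts the labels. It assumes the `premise`, `hypothesis`, and `label` columns listed for the TSV-format files. The two sample rows and the label strings `entails` and `neutral` are made up for illustration, so check the values in the downloaded files before relying on them.

```python
import csv
import io
from collections import Counter


def read_pairs(handle):
    """Read premise/hypothesis/label rows from an open CSV file handle."""
    reader = csv.DictReader(handle)
    return [
        {
            "premise": row["premise"],
            "hypothesis": row["hypothesis"],
            "label": row["label"],
        }
        for row in reader
    ]


def label_counts(pairs):
    """Count how many pairs carry each label."""
    return Counter(pair["label"] for pair in pairs)


# Made-up rows standing in for a downloaded file (not real dataset content).
SAMPLE_CSV = (
    "premise,hypothesis,label\n"
    '"Plants use sunlight to make food.","Plants make food using sunlight.",entails\n'
    '"The moon orbits the Earth.","Plants make food using sunlight.",neutral\n'
)

pairs = read_pairs(io.StringIO(SAMPLE_CSV))
print(label_counts(pairs))
```

To read a real file, replace the `io.StringIO` object with a handle such as `open("tsv_format_test.csv", newline="", encoding="utf-8")`.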
Once you have downloaded the files, prepare them for training or evaluation by making sure they are in the shape your algorithm expects. Use the provided training and validation files, or split a file into training and validation sets yourself. Either way, check that each set is representative: it should contain pairs where the premise entails the hypothesis as well as neutral pairs, where the premise does not give enough evidence to support the hypothesis.
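One way to keep both labels represented in each set is a stratified split, which divides every label group separately. The sketch below uses only the standard library; the function name `stratified_split`, the 20% validation fraction, and the toy pairs are illustrative assumptions rather than anything prescribed by the dataset.

```python
import random
from collections import defaultdict


def stratified_split(pairs, valid_fraction=0.2, seed=0):
    """Split pairs into train/validation lists, label by label.

    Each label group is shuffled with a fixed seed and then cut at
    valid_fraction, so both sets keep roughly the original label balance.
    """
    by_label = defaultdict(list)
    for pair in pairs:
        by_label[pair["label"]].append(pair)

    rng = random.Random(seed)
    train, valid = [], []
    for label in sorted(by_label):
        group = list(by_label[label])
        rng.shuffle(group)
        n_valid = round(len(group) * valid_fraction)
        valid.extend(group[:n_valid])
        train.extend(group[n_valid:])
    return train, valid


# Toy pairs (placeholders, not dataset content): 10 per label.
toy_pairs = [
    {"premise": f"p{i}", "hypothesis": f"h{i}", "label": label}
    for label in ("entails", "neutral")
    for i in range(10)
]

train_set, valid_set = stratified_split(toy_pairs, valid_fraction=0.2, seed=0)
print(len(train_set), len(valid_set))
```

The fixed seed makes the split reproducible, which helps when comparing models trained on the same data.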
Research Ideas
- Develop and fine-tune NLI algorithms on science text of varying language complexity.
- Use the annotator labels to develop an automated human-in-the-loop approach to NLI algorithms.
- Incorporate the hypothesis graph structure into existing models to improve accuracy when judging whether a premise entails a hypothesis in science text.
Acknowledgements
If you use this dataset in your research, please credit the original authors.
License
License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication
No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.
Columns
File: dgem_format_test.csv
Column name | Description
premise | The premise of the statement (String).
hypothesis | The hypothesis of the statement (String).
label | The label of the statement: whether the premise entails the hypothesis or is neutral toward it (String).
hypothesis_graph_structure | A graph structure of the hypothesis (Graph).
File: predictor_format_validation.csv
Column name | Description
answer | The answer to the question (String).
sentence2_structure | A graph structure of the second sentence (Graph).
sentence1 | The first sentence of the statement (String).
gold_label | The label of the statement: whether the first sentence entails the second or is neutral toward it (String).
File: tsv_format_test.csv
Column name | Description
premise | The premise of the statement (String).
hypothesis | The hypothesis of the statement (String).
label | The label of the statement: whether the premise entails the hypothesis or is neutral toward it (String).
File: snli_format_validation.csv
Column name | Description
sentence1 | The first sentence of the statement (String).
sentence2_structure | A graph structure of the second sentence (Graph).
gold_label | The label of the statement: whether the first sentence entails the second or is neutral toward it (String).
sentence1_binary_parse | Binary parse of the first sentence (String).
sentence1_parse | Parse of the first sentence (String).
sentence2_parse | Parse of the second sentence (String).
annotator_labels | Labels assigned by annotators (String).
File: dgem_format_train.csv
Column name | Description
premise | The premise of the statement (String).
hypothesis | The hypothesis of the statement (String).
label | The label of the statement: whether the premise entails the hypothesis or is neutral toward it (String).
hypothesis_graph_structure | A graph structure of the hypothesis (Graph).
File: snli_format_train.csv
Column name | Description
sentence1_binary_parse | Binary parse of the first sentence (String).
sentence1_parse | Parse of the first sentence (String).
sentence1 | The first sentence of the statement (String).
sentence2_parse | Parse of the second sentence (String).
sentence2_structure | A graph structure of the second sentence (Graph).
annotator_labels | Labels assigned by annotators (String).
gold_label | The label of the statement: whether the first sentence entails the second or is neutral toward it (String).
File: predictor_format_train.csv
Column name | Description
answer | The answer to the question (String).
sentence2_structure | A graph structure of the second sentence (Graph).
sentence1 | The first sentence of the statement (String).
gold_label | The label of the statement: whether the first sentence entails the second or is neutral toward it (String).
File: snli_format_test.csv
Column name | Description
sentence1_binary_parse | Binary parse of the first sentence (String).
sentence1_parse | Parse of the first sentence (String).
sentence1 | The first sentence of the statement (String).
sentence2_parse | Parse of the second sentence (String).
sentence2_structure | A graph structure of the second sentence (Graph).
annotator_labels | Labels assigned by annotators (String).
gold_label | The label of the statement: whether the first sentence entails the second or is neutral toward it (String).
File: tsv_format_validation.csv
Column name | Description
premise | The premise of the statement (String).
hypothesis | The hypothesis of the statement (String).
label | The label of the statement: whether the premise entails the hypothesis or is neutral toward it (String).
Acknowledgements
If you use this dataset in your research, please credit Huggingface Hub.