SciTail (Multiple-choice Science Exams) by Kaggle | Other

About this Dataset

27,026 Multiple-choice science exams and web sentences

By Huggingface Hub

SciTail is a Natural Language Inference (NLI) dataset built from multiple-choice science exams and web sentences. Each hypothesis is formed from an exam question and its correct answer, and each premise is a web sentence retrieved as potentially relevant evidence; annotators then label whether the premise entails the hypothesis or is neutral toward it. The 27,026 examples are provided in four formats (DGEM, predictor, SNLI, and TSV), each with train, validation, and test files that present the premise-hypothesis pairs and their labels in different structures.

How to use the dataset

This guide explains how to use the SciTail dataset for Natural Language Inference (NLI). NLI is the task of deciding how a premise and a hypothesis relate. In SciTail the label is either entailment (the premise supports the hypothesis) or neutral (it does not); there is no contradiction class. The premises are web sentences and the hypotheses are derived from multiple-choice science exam questions, so the dataset can be used to train and evaluate NLI models on science text.

The SciTail dataset comes in four formats: DGEM format, predictor format, SNLI format, and TSV format, each with train, validation, and test files. The formats share the same core fields (a premise, a hypothesis, and a label) under different column names, and some add extras such as a hypothesis graph structure, parse trees, or annotator labels.
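Because the formats name the core fields differently, it can help to map every row onto one common schema before training. The sketch below is one way to do that, using the column names from the table schemas listed on this page. It assumes that sentence1 plays the role of the premise and sentence2 the role of the hypothesis (the usual SNLI convention); the helper name `to_common_schema` and the example row are invented for illustration.

```python
from typing import Dict

# Column names per format, taken from the table schemas listed on this page.
# Each entry is (premise column, hypothesis column, label column).
# Assumption: sentence1 is the premise and sentence2 the hypothesis.
FORMAT_COLUMNS = {
    "tsv": ("premise", "hypothesis", "label"),
    "dgem": ("premise", "hypothesis", "label"),
    "snli": ("sentence1", "sentence2", "gold_label"),
    "predictor": ("sentence1", "sentence2", "gold_label"),
}


def to_common_schema(row: Dict[str, str], fmt: str) -> Dict[str, str]:
    """Return a dict with premise/hypothesis/label keys for a row of the given format."""
    premise_col, hypothesis_col, label_col = FORMAT_COLUMNS[fmt]
    return {
        "premise": row[premise_col],
        "hypothesis": row[hypothesis_col],
        "label": row[label_col],
    }


# Invented example row, for illustration only (not taken from the dataset).
snli_row = {
    "sentence1": "Plants use sunlight to make food.",
    "sentence2": "Photosynthesis requires light.",
    "gold_label": "neutral",
}
print(to_common_schema(snli_row, "snli"))
```

Extra columns such as parses or graph structures are simply ignored here; keep them if your model uses them.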

To get started, download the files in whichever format you prefer from Kaggle. All files are stored as CSVs; each row is a single premise-hypothesis pair with a label, assigned by annotators, that indicates whether the premise entails the hypothesis.
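As a minimal sketch of reading one of these files, the snippet below parses premise/hypothesis/label rows with Python's standard `csv` module and counts the labels. It assumes a comma-delimited file with a header row named as in the TSV-format schema on this page; the in-memory sample stands in for a real file such as tsv_format_test.csv, and its rows are invented.

```python
import csv
import io
from collections import Counter


def load_pairs(fileobj):
    """Read (premise, hypothesis, label) tuples from an open CSV file object."""
    reader = csv.DictReader(fileobj)
    return [
        (row["premise"], row["hypothesis"], row["label"])
        for row in reader
    ]


# Invented stand-in for a downloaded file; the rows are illustrative only
# and are not taken from the dataset.
sample_csv = io.StringIO(
    "premise,hypothesis,label\n"
    '"Water boils at 100 degrees Celsius at sea level.",Water can boil.,entailment\n'
    '"The moon orbits the Earth.",Water can boil.,neutral\n'
)

pairs = load_pairs(sample_csv)
label_counts = Counter(label for _, _, label in pairs)
print(len(pairs), dict(label_counts))
```

For a real file, pass `open(path, newline="", encoding="utf-8")` to `load_pairs` in place of the sample; checking the label counts first is a quick way to see which label strings the file actually uses.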

Once you have downloaded your preferred files, prepare them for training or evaluation. The dataset already provides separate train, validation, and test files. If you create your own split instead, make sure each set is representative: it should contain examples where the premise entails the hypothesis as well as neutral examples, in which the premise gives too little evidence to support the hypothesis.
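If you do build your own split, one simple way to keep each set representative is to split within each label. The sketch below is an illustrative stratified split written with the standard library only; the function name, the 20% validation fraction, and the toy data are assumptions, not part of the dataset.

```python
import random
from collections import defaultdict


def stratified_split(examples, validation_fraction=0.2, seed=0):
    """Split (premise, hypothesis, label) tuples so each label keeps its share."""
    by_label = defaultdict(list)
    for example in examples:
        by_label[example[2]].append(example)

    rng = random.Random(seed)  # fixed seed so the split is reproducible
    train, validation = [], []
    for label in sorted(by_label):
        group = by_label[label][:]
        rng.shuffle(group)
        n_val = round(len(group) * validation_fraction)
        validation.extend(group[:n_val])
        train.extend(group[n_val:])
    return train, validation


# Invented toy data: 10 entailment pairs and 20 neutral pairs.
examples = [(f"premise {i}", f"hypothesis {i}", "entailment") for i in range(10)]
examples += [(f"premise {i}", f"hypothesis {i}", "neutral") for i in range(10, 30)]

train, validation = stratified_split(examples, validation_fraction=0.2)
print(len(train), len(validation))
```

Splitting per label keeps the entailment-to-neutral ratio roughly the same in both sets, which matters when one label is more frequent than the other.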

Research Ideas

Develop and fine-tune NLI algorithms on science text of varying language complexity.

Use the annotator labels to develop a human-in-the-loop approach to training NLI algorithms.

Incorporate the hypothesis graph structure into existing models to improve accuracy in relating premises to hypotheses in science texts.

Acknowledgements

If you use this dataset in your research, please credit the original authors.

License

License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication
No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.

Columns

File: dgem_format_test.csv

Column name	Description
premise	The premise of the statement (String).
hypothesis	The hypothesis of the statement (String).
label	The label of the statement – either entailment or neutral (String).
hypothesis_graph_structure	A graph structure of the hypothesis (Graph)

File: predictor_format_validation.csv

Column name	Description
answer	The answer to the question. (String)
sentence2_structure	A graph structure of the second sentence. (Graph)
sentence1	The first sentence of the statement. (String)
sentence2	The second sentence of the statement. (String)
gold_label	The label of the statement – either entailment or neutral. (String)
question	The question the answer belongs to. (String)

File: tsv_format_test.csv

Column name	Description
premise	The premise of the statement (String).
hypothesis	The hypothesis of the statement (String).
label	The label of the statement – either entailment or neutral (String).

File: snli_format_validation.csv

Column name	Description
sentence1	The first sentence of the statement. (String)
sentence2	The second sentence of the statement. (String)
gold_label	The label of the statement – either entailment or neutral. (String)
sentence1_binary_parse	Binary parse of first sentence. (String)
sentence1_parse	Parse of first sentence. (String)
sentence2_parse	Parse of second sentence. (String)
annotator_labels	Labels assigned by annotators. (String)

File: dgem_format_train.csv

Column name	Description
premise	The premise of the statement (String).
hypothesis	The hypothesis of the statement (String).
label	The label of the statement – either entailment or neutral (String).
hypothesis_graph_structure	A graph structure of the hypothesis (Graph)

File: snli_format_train.csv

Column name	Description
sentence1_binary_parse	Binary parse of first sentence. (String)
sentence1_parse	Parse of first sentence. (String)
sentence1	The first sentence of the statement. (String)
sentence2_parse	Parse of second sentence. (String)
sentence2	The second sentence of the statement. (String)
annotator_labels	Labels assigned by annotators. (String)
gold_label	The label of the statement – either entailment or neutral. (String)

File: predictor_format_train.csv

Column name	Description
answer	The answer to the question. (String)
sentence2_structure	A graph structure of the second sentence. (Graph)
sentence1	The first sentence of the statement. (String)
sentence2	The second sentence of the statement. (String)
gold_label	The label of the statement – either entailment or neutral. (String)
question	The question the answer belongs to. (String)

File: snli_format_test.csv

Column name	Description
sentence1_binary_parse	Binary parse of first sentence. (String)
sentence1_parse	Parse of first sentence. (String)
sentence1	The first sentence of the statement. (String)
sentence2_parse	Parse of second sentence. (String)
sentence2	The second sentence of the statement. (String)
annotator_labels	Labels assigned by annotators. (String)
gold_label	The label of the statement – either entailment or neutral. (String)

File: tsv_format_validation.csv

Column name	Description
premise	The premise of the statement (String).
hypothesis	The hypothesis of the statement (String).
label	The label of the statement – either entailment or neutral (String).

Acknowledgements

If you use this dataset in your research, please credit Huggingface Hub.

Tables

Dgem Format Test

@kaggle.thedevastator_futuristic_natural_language_inference_with_the_s.dgem_format_test

161.77 KB
2126 rows
4 columns


CREATE TABLE dgem_format_test (
  "premise" VARCHAR,
  "hypothesis" VARCHAR,
  "label" VARCHAR,
  "hypothesis_graph_structure" VARCHAR
);

Dgem Format Train

@kaggle.thedevastator_futuristic_natural_language_inference_with_the_s.dgem_format_train

1.53 MB
23088 rows
4 columns


CREATE TABLE dgem_format_train (
  "premise" VARCHAR,
  "hypothesis" VARCHAR,
  "label" VARCHAR,
  "hypothesis_graph_structure" VARCHAR
);

Dgem Format Validation

@kaggle.thedevastator_futuristic_natural_language_inference_with_the_s.dgem_format_validation

106.32 KB
1304 rows
4 columns


CREATE TABLE dgem_format_validation (
  "premise" VARCHAR,
  "hypothesis" VARCHAR,
  "label" VARCHAR,
  "hypothesis_graph_structure" VARCHAR
);

Predictor Format Test

@kaggle.thedevastator_futuristic_natural_language_inference_with_the_s.predictor_format_test

177.93 KB
2126 rows
6 columns


CREATE TABLE predictor_format_test (
  "answer" VARCHAR,
  "sentence2_structure" VARCHAR,
  "sentence1" VARCHAR,
  "sentence2" VARCHAR,
  "gold_label" VARCHAR,
  "question" VARCHAR
);

Predictor Format Train

@kaggle.thedevastator_futuristic_natural_language_inference_with_the_s.predictor_format_train

1.6 MB
23587 rows
6 columns


CREATE TABLE predictor_format_train (
  "answer" VARCHAR,
  "sentence2_structure" VARCHAR,
  "sentence1" VARCHAR,
  "sentence2" VARCHAR,
  "gold_label" VARCHAR,
  "question" VARCHAR
);

Predictor Format Validation

@kaggle.thedevastator_futuristic_natural_language_inference_with_the_s.predictor_format_validation

117.41 KB
1304 rows
6 columns


CREATE TABLE predictor_format_validation (
  "answer" VARCHAR,
  "sentence2_structure" VARCHAR,
  "sentence1" VARCHAR,
  "sentence2" VARCHAR,
  "gold_label" VARCHAR,
  "question" VARCHAR
);

Snli Format Test

@kaggle.thedevastator_futuristic_natural_language_inference_with_the_s.snli_format_test

602.59 KB
2126 rows
7 columns


CREATE TABLE snli_format_test (
  "sentence1_binary_parse" VARCHAR,
  "sentence1_parse" VARCHAR,
  "sentence1" VARCHAR,
  "sentence2_parse" VARCHAR,
  "sentence2" VARCHAR,
  "annotator_labels" VARCHAR,
  "gold_label" VARCHAR
);

Snli Format Train

@kaggle.thedevastator_futuristic_natural_language_inference_with_the_s.snli_format_train

5.86 MB
23596 rows
7 columns


CREATE TABLE snli_format_train (
  "sentence1_binary_parse" VARCHAR,
  "sentence1_parse" VARCHAR,
  "sentence1" VARCHAR,
  "sentence2_parse" VARCHAR,
  "sentence2" VARCHAR,
  "annotator_labels" VARCHAR,
  "gold_label" VARCHAR
);

Snli Format Validation

@kaggle.thedevastator_futuristic_natural_language_inference_with_the_s.snli_format_validation

380.63 KB
1304 rows
7 columns


CREATE TABLE snli_format_validation (
  "sentence1_binary_parse" VARCHAR,
  "sentence1_parse" VARCHAR,
  "sentence1" VARCHAR,
  "sentence2_parse" VARCHAR,
  "sentence2" VARCHAR,
  "annotator_labels" VARCHAR,
  "gold_label" VARCHAR
);

Tsv Format Test

@kaggle.thedevastator_futuristic_natural_language_inference_with_the_s.tsv_format_test

148.08 KB
2126 rows
3 columns


CREATE TABLE tsv_format_test (
  "premise" VARCHAR,
  "hypothesis" VARCHAR,
  "label" VARCHAR
);

Tsv Format Train

@kaggle.thedevastator_futuristic_natural_language_inference_with_the_s.tsv_format_train

1.43 MB
23097 rows
3 columns


CREATE TABLE tsv_format_train (
  "premise" VARCHAR,
  "hypothesis" VARCHAR,
  "label" VARCHAR
);

Tsv Format Validation

@kaggle.thedevastator_futuristic_natural_language_inference_with_the_s.tsv_format_validation

95.96 KB
1304 rows
3 columns


CREATE TABLE tsv_format_validation (
  "premise" VARCHAR,
  "hypothesis" VARCHAR,
  "label" VARCHAR
);
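
The table schemas above can also be tried out locally. The following sketch creates the tsv_format_validation table in an in-memory SQLite database with Python's standard `sqlite3` module and runs a label-count query. SQLite is an assumption here (the page does not say which SQL engine it targets), and the inserted rows are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Schema copied from the tsv_format_validation listing on this page.
conn.execute(
    """
    CREATE TABLE tsv_format_validation (
      "premise" VARCHAR,
      "hypothesis" VARCHAR,
      "label" VARCHAR
    )
    """
)

# Invented rows for illustration only; real rows come from the CSV file.
rows = [
    ("Ice melts when heated.", "Heat can change ice to water.", "entailment"),
    ("Birds have feathers.", "Heat can change ice to water.", "neutral"),
    ("Fish live in water.", "Birds can fly.", "neutral"),
]
conn.executemany("INSERT INTO tsv_format_validation VALUES (?, ?, ?)", rows)

# Count how many pairs carry each label.
label_counts = conn.execute(
    "SELECT label, COUNT(*) FROM tsv_format_validation GROUP BY label ORDER BY label"
).fetchall()
print(label_counts)
conn.close()
```

The same GROUP BY query works against any of the twelve tables if you swap in the table name and use gold_label for the predictor and SNLI formats.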