TREC (Question Classification) by Kaggle | Other

About this Dataset

TREC (Question Classification)

5,500 labeled questions in the training set and another 500 in the test set

Source

Hugging Face Hub

The Text REtrieval Conference (TREC) Question Classification dataset contains about 5,500 labeled questions in the training set (the train table listed under Tables has 5,452 rows) and another 500 in the test set.
The dataset has 6 coarse class labels and 50 fine class labels. The average sentence length is 10, and the vocabulary size is 8,700.
The data were collected from four sources: 4,500 English questions published by USC (Hovy et al., 2001), about 500 manually constructed questions for a few rare classes, 894 TREC 8 and TREC 9 questions, and 500 questions from TREC 10, which serve as the test set. These questions were manually labeled.
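The average length and vocabulary size quoted above can be recomputed from the text column. The following is a minimal sketch that assumes whitespace tokenization and case-insensitive vocabulary counting; the page does not say how its figures were derived, so the results may not match them exactly. The function name `corpus_stats` and the sample sentences are hypothetical.

```python
def corpus_stats(texts):
    """Return (average tokens per sentence, vocabulary size).

    Assumption: tokens are whitespace-separated and the vocabulary
    is counted case-insensitively.
    """
    lengths = []
    vocab = set()
    for text in texts:
        tokens = text.split()
        lengths.append(len(tokens))
        vocab.update(token.lower() for token in tokens)
    average = sum(lengths) / len(lengths) if lengths else 0.0
    return average, len(vocab)


# Illustrative sentences only, not taken from the dataset.
sample = ["What is an example ?", "Who wrote the example question ?"]
avg_len, vocab_size = corpus_stats(sample)
print(avg_len, vocab_size)
```

Passing the full text column of train.csv in place of `sample` would give the corresponding figures for the training set.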

How to use the dataset

Research Ideas

This dataset can be used to develop and test new question classification models.

This dataset can be used to investigate the differences between human and machine question classification ability.

This dataset can be used to study the evolution of question classification over time, for example changes in label usage or sentence length.
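As a starting point for the first idea above, a trivial baseline gives new models something to beat. The sketch below predicts the most frequent label seen for a question's first word and falls back to the overall majority label. It is an illustrative baseline, not a method described by the dataset authors; the function names, the toy questions, and the integer labels are all hypothetical placeholders.

```python
from collections import Counter, defaultdict


def first_word(text):
    """Lowercased first whitespace token, or '' for empty text."""
    tokens = text.lower().split()
    return tokens[0] if tokens else ""


def train_first_word_baseline(examples):
    """examples: iterable of (text, label) pairs."""
    by_word = defaultdict(Counter)
    overall = Counter()
    for text, label in examples:
        by_word[first_word(text)][label] += 1
        overall[label] += 1
    # Most frequent label for each first word.
    table = {w: c.most_common(1)[0][0] for w, c in by_word.items()}
    # Overall majority label, used for unseen first words.
    fallback = overall.most_common(1)[0][0]
    return table, fallback


def predict(model, text):
    table, fallback = model
    return table.get(first_word(text), fallback)


def accuracy(model, examples):
    examples = list(examples)
    correct = sum(predict(model, t) == y for t, y in examples)
    return correct / len(examples)


# Toy data with placeholder labels, not real dataset rows.
train = [
    ("Who wrote the example ?", 3),
    ("Who painted the example ?", 3),
    ("Where is the example located ?", 4),
    ("How many examples are there ?", 5),
]
test = [
    ("Who built it ?", 3),
    ("Where was it built ?", 4),
    ("Why is it here ?", 0),
]
model = train_first_word_baseline(train)
score = accuracy(model, test)
print(score)
```

With the real data, the (text, label) pairs would come from train.csv for fitting and test.csv for scoring, using either the coarse or the fine label column.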

License

> License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication
> No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission.

Columns

File: train.csv

Column name	Description
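label-coarse	The coarse-grained label for the question. (Described as String; the table schema under Tables names it label_coarse and types it BIGINT.)
label-fine	The fine-grained label for the question. (Described as String; the table schema under Tables names it label_fine and types it BIGINT.)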
text	The text of the question. (String)

File: test.csv

Column name	Description
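label-coarse	The coarse-grained label for the question. (Described as String; the table schema under Tables names it label_coarse and types it BIGINT.)
label-fine	The fine-grained label for the question. (Described as String; the table schema under Tables names it label_fine and types it BIGINT.)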
text	The text of the question. (String)
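The column listing uses hyphenated names (label-coarse) while the table schema on this page uses underscores (label_coarse), so a loader that accepts either spelling is safer. The sketch below uses only the standard library and assumes the labels are integer-encoded, as the BIGINT schema suggests; if the files actually store label strings, the `int` casts would need to be dropped. The helper names and the sample rows are hypothetical.

```python
import csv
import io


def normalize_row(row):
    """Unify hyphenated/underscored column names and cast labels to int."""
    clean = {key.replace("-", "_"): value for key, value in row.items()}
    return {
        "label_coarse": int(clean["label_coarse"]),  # assumes integer labels
        "label_fine": int(clean["label_fine"]),      # assumes integer labels
        "text": clean["text"],
    }


def read_questions(handle):
    """Read rows from an open CSV file-like object with a header line."""
    return [normalize_row(row) for row in csv.DictReader(handle)]


# Illustrative rows only, not copied from the dataset.
sample = io.StringIO(
    "label-coarse,label-fine,text\n"
    "0,0,What is an example question ?\n"
    "1,5,Who wrote this example ?\n"
)
rows = read_questions(sample)
print(rows[0])
```

For the real files, `read_questions(open("train.csv", newline="", encoding="utf-8"))` would be the equivalent call, assuming the file sits in the working directory.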

Tables

Test

@kaggle.thedevastator_the_trec_question_classification_dataset_a_longi.test

15.99 KB
500 rows
3 columns


CREATE TABLE test (
  "label_coarse" BIGINT,
  "label_fine" BIGINT,
  "text" VARCHAR
);

Train

@kaggle.thedevastator_the_trec_question_classification_dataset_a_longi.train

199.75 KB
5452 rows
3 columns


CREATE TABLE train (
  "label_coarse" BIGINT,
  "label_fine" BIGINT,
  "text" VARCHAR
);
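The schema above can be tried out locally. The following sketch creates the train table in an in-memory SQLite database using the same column definitions and counts rows per coarse label. SQLite is an assumption made for illustration (the page does not say which engine hosts the tables), and the inserted rows are made-up placeholders.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Same column definitions as the train table shown above.
conn.execute(
    """
    CREATE TABLE train (
      "label_coarse" BIGINT,
      "label_fine" BIGINT,
      "text" VARCHAR
    )
    """
)

# Placeholder rows, not real dataset content.
sample_rows = [
    (0, 0, "What is an example question ?"),
    (1, 5, "Who wrote this example ?"),
    (1, 6, "Who painted that example ?"),
]
conn.executemany("INSERT INTO train VALUES (?, ?, ?)", sample_rows)

# Distribution of coarse labels.
counts = conn.execute(
    "SELECT label_coarse, COUNT(*) FROM train "
    "GROUP BY label_coarse ORDER BY label_coarse"
).fetchall()
print(counts)
conn.close()
```

The same GROUP BY query on label_fine would show how the 50 fine classes are distributed.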