Hierarchical Text Classification by Kaggle | Technology and IT

About this Dataset

Context

It's interesting to explore various approaches to hierarchical text classification.

Content

Let's start with a dataset of Amazon product reviews whose classes are organized in three levels: 6 "level 1" classes, 64 "level 2" classes, and 510 "level 3" classes.
I share 3 files:

train_40k.csv - 40k Amazon product reviews for training
val_10k.csv - 10k reviews held out for validation
unlabeled_150k.csv - 150k raw Amazon product reviews without labels; these can be used for language model fine-tuning

Level 1 classes are: health personal care, toys games, beauty, pet supplies, baby products, and grocery gourmet food.
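
To sanity-check the class counts quoted above, one could count the distinct labels at each level and record which children appear under each parent. The following is a minimal sketch using only the Python standard library. It assumes the label columns are named cat1, cat2 and cat3 as in the table schemas below; the helper names summarize_hierarchy and summarize_csv are made up for illustration.

```python
import csv
from collections import defaultdict


def summarize_hierarchy(rows):
    """Count distinct classes per level and collect parent -> children links.

    `rows` is any iterable of dicts with "cat1", "cat2" and "cat3" keys,
    for example the rows produced by csv.DictReader.
    """
    level_classes = {"cat1": set(), "cat2": set(), "cat3": set()}
    children = defaultdict(set)  # parent label (or path) -> set of child labels
    for row in rows:
        c1, c2, c3 = row["cat1"], row["cat2"], row["cat3"]
        level_classes["cat1"].add(c1)
        level_classes["cat2"].add(c2)
        level_classes["cat3"].add(c3)
        children[c1].add(c2)
        # Key level 2 parents by their full path so that identical level 2
        # names under different level 1 classes are not merged.
        children[(c1, c2)].add(c3)
    counts = {level: len(names) for level, names in level_classes.items()}
    return counts, dict(children)


def summarize_csv(path):
    """Apply summarize_hierarchy to a CSV file (assumes lowercase cat1..cat3 headers)."""
    with open(path, newline="", encoding="utf-8") as f:
        return summarize_hierarchy(csv.DictReader(f))
```

Counts computed on a single file may come out lower than the totals quoted above if some classes occur only in another split.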

Inspiration

Ideas to explore:

a "flat" approach – concatenate class names like "level1/level2/level3", then train a basic multi-class model
a simple hierarchical approach: first, a level 1 model classifies reviews into the 6 level 1 classes, then one of 6 level 2 models is picked, and so on
fancy approaches like seq2seq with reviews as input and "level1 level2 level3" strings as outputs
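
The three ideas above differ mainly in how the targets are encoded and how predictions are routed. Here is a rough sketch, assuming each row is a dict with cat1, cat2 and cat3 keys and that every model is simply a callable mapping a review text to a label. The function names and the dictionary-of-models layout are illustrative assumptions, not something the dataset prescribes.

```python
def flat_label(row, sep="/"):
    """Flat approach: one multi-class target such as "level1/level2/level3"."""
    return sep.join((row["cat1"], row["cat2"], row["cat3"]))


def seq2seq_target(row):
    """Seq2seq approach: the output string "level1 level2 level3".

    Class names may themselves contain spaces, so splitting this string
    back into three labels is not guaranteed to be unambiguous.
    """
    return " ".join((row["cat1"], row["cat2"], row["cat3"]))


def predict_top_down(text, level1_model, level2_models, level3_models):
    """Simple hierarchical approach: route the review down the class tree.

    level1_model:  callable(text) -> level 1 label
    level2_models: dict mapping a level 1 label to callable(text) -> level 2 label
    level3_models: dict mapping (level 1, level 2) to callable(text) -> level 3 label
    """
    cat1 = level1_model(text)
    cat2 = level2_models[cat1](text)
    cat3 = level3_models[(cat1, cat2)](text)
    return cat1, cat2, cat3
```

In the top-down scheme a mistake at level 1 cannot be corrected further down, which is one reason to compare it against the flat baseline.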

Tables

Train 40k

@kaggle.kashnitsky_hierarchical_text_classification.train_40k

12.33 MB
40000 rows
10 columns


CREATE TABLE train_40k (
  "productid" VARCHAR,
  "title" VARCHAR,
  "userid" VARCHAR,
  "helpfulness" VARCHAR,
  "score" DOUBLE,
  "time" BIGINT,
  "text" VARCHAR,
  "cat1" VARCHAR,
  "cat2" VARCHAR,
  "cat3" VARCHAR
);
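
When the CSV file is read outside a SQL engine, every value arrives as a string, so the numeric columns need casting to match the schema above. This is a minimal sketch with the standard library. It assumes the CSV header uses the same column names as the table, possibly with different capitalization (hence the lowercasing); read_reviews is a hypothetical helper name and the sample row is made up.

```python
import csv
import io

# Column types taken from the CREATE TABLE statement; all other columns stay text.
NUMERIC_COLUMNS = {"score": float, "time": int}


def read_reviews(file_obj):
    """Yield one dict per review with lowercased keys and typed numeric fields."""
    for raw in csv.DictReader(file_obj):
        row = {key.lower(): value for key, value in raw.items()}
        for column, cast in NUMERIC_COLUMNS.items():
            row[column] = cast(row[column])
        yield row


# Usage with an in-memory sample; the values below are invented for illustration.
sample = io.StringIO(
    "productId,Title,userId,Helpfulness,Score,Time,Text,Cat1,Cat2,Cat3\n"
    "P1,Some title,U1,0/0,5.0,1300000000,Great product,beauty,x,y\n"
)
reviews = list(read_reviews(sample))
```

For a real file, the same function can be given a handle opened with open(path, newline="", encoding="utf-8").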

Unlabeled 150k

@kaggle.kashnitsky_hierarchical_text_classification.unlabeled_150k

41.91 MB
150000 rows
2 columns


CREATE TABLE unlabeled_150k (
  "title" VARCHAR,
  "text" VARCHAR
);

Val 10k

@kaggle.kashnitsky_hierarchical_text_classification.val_10k

2.42 MB
10000 rows
10 columns


CREATE TABLE val_10k (
  "productid" VARCHAR,
  "title" VARCHAR,
  "userid" VARCHAR,
  "helpfulness" VARCHAR,
  "score" DOUBLE,
  "time" BIGINT,
  "text" VARCHAR,
  "cat1" VARCHAR,
  "cat2" VARCHAR,
  "cat3" VARCHAR
);
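
Since the training and validation tables share the same ten columns, a quick check before modelling is whether every label path in the validation split also occurs in training; a flat classifier cannot predict a path it has never seen. The sketch below assumes rows are dicts with cat1, cat2 and cat3 keys, and the helper names are illustrative.

```python
def label_path(row):
    """Return the full (level 1, level 2, level 3) path of a review."""
    return (row["cat1"], row["cat2"], row["cat3"])


def unseen_label_paths(train_rows, val_rows):
    """Return label paths that appear in validation but never in training."""
    seen = {label_path(row) for row in train_rows}
    return {label_path(row) for row in val_rows} - seen
```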