Hierarchical Text Classification

Exploring approaches to text classification with structured classes

@kaggle.kashnitsky_hierarchical_text_classification

About this Dataset

Context

It's interesting to explore various approaches to hierarchical text classification.

Content

Let's start with a dataset of Amazon product reviews whose classes are structured in three levels: 6 "level 1" classes, 64 "level 2" classes, and 510 "level 3" classes.
I share 3 files:

  • train_40k.csv - 40k Amazon product reviews for training
  • val_10k.csv - 10k reviews held out for validation
  • unlabeled_150k.csv - 150k raw Amazon product reviews, which can be used for language model fine-tuning

Level 1 classes are: health personal care, toys games, beauty, pet supplies, baby products, and grocery gourmet food.

Inspiration

Ideas to explore:

  • a "flat" approach: concatenate class names like "level1/level2/level3", then train a basic multi-class model
  • a simple hierarchical approach: first, a level 1 model classifies reviews into the 6 level 1 classes, then one of 6 level 2 models is picked according to that prediction, and so on
  • fancy approaches like seq2seq with reviews as input and "level1 level2 level3" strings as outputs
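
To make the first two ideas concrete, here is a minimal sketch in plain Python. Everything in it is illustrative: the rows are made up (only the level 1 names come from the list above), `TokenVoteClassifier` is a hypothetical stand-in for whatever real text model you would train, and the hierarchy is cut at level 2 for brevity. Level 3 would follow the same routing pattern.

```python
from collections import Counter, defaultdict

# Toy rows shaped like train_40k (text, cat1, cat2, cat3).
# The texts and the level 2 / level 3 names are made up for illustration.
rows = [
    {"text": "chew toy for my dog", "cat1": "pet supplies", "cat2": "dogs", "cat3": "toys"},
    {"text": "scratching post for the cat", "cat1": "pet supplies", "cat2": "cats", "cat3": "furniture"},
    {"text": "board game for family night", "cat1": "toys games", "cat2": "games", "cat3": "board games"},
    {"text": "puzzle with large pieces", "cat1": "toys games", "cat2": "puzzles", "cat3": "jigsaw puzzles"},
]


def flat_label(row):
    """Join the three levels into a single class name for the "flat" approach."""
    return "/".join(row[col] for col in ("cat1", "cat2", "cat3"))


class TokenVoteClassifier:
    """Hypothetical stand-in for a real model: each token votes for the labels it was seen with."""

    def fit(self, texts, labels):
        self.votes = defaultdict(Counter)
        self.default = Counter(labels).most_common(1)[0][0]  # fallback for unseen tokens
        for text, label in zip(texts, labels):
            for token in set(text.lower().split()):
                self.votes[token][label] += 1
        return self

    def predict(self, text):
        tally = Counter()
        for token in set(text.lower().split()):
            tally.update(self.votes.get(token, {}))
        return tally.most_common(1)[0][0] if tally else self.default


def fit_hierarchy(rows):
    """Train one level 1 model, plus one level 2 model per level 1 class."""
    level1 = TokenVoteClassifier().fit([r["text"] for r in rows], [r["cat1"] for r in rows])
    by_parent = defaultdict(list)
    for r in rows:
        by_parent[r["cat1"]].append(r)
    level2 = {
        parent: TokenVoteClassifier().fit([r["text"] for r in subset], [r["cat2"] for r in subset])
        for parent, subset in by_parent.items()
    }
    return level1, level2


def predict_hierarchy(level1, level2, text):
    """Predict level 1 first, then route the text to the matching level 2 model."""
    cat1 = level1.predict(text)
    return cat1, level2[cat1].predict(text)


# Flat approach: a single model over the concatenated labels.
flat_model = TokenVoteClassifier().fit([r["text"] for r in rows], [flat_label(r) for r in rows])
print(flat_model.predict("family board game"))  # toys games/games/board games

# Hierarchical approach: level 1 prediction selects the level 2 model.
level1, level2 = fit_hierarchy(rows)
print(predict_hierarchy(level1, level2, "my dog loves this chew"))  # ('pet supplies', 'dogs')
```

One trade-off worth keeping in mind: the flat model can only predict label paths that occur in the training data, while the hierarchical one passes any level 1 mistake down to the lower levels.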

Tables

Train 40k

@kaggle.kashnitsky_hierarchical_text_classification.train_40k
  • 12.33 MB
  • 40000 rows
  • 10 columns

CREATE TABLE train_40k (
  "productid" VARCHAR,
  "title" VARCHAR,
  "userid" VARCHAR,
  "helpfulness" VARCHAR,
  "score" DOUBLE,
  "time" BIGINT,
  "text" VARCHAR,
  "cat1" VARCHAR,
  "cat2" VARCHAR,
  "cat3" VARCHAR
);
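
Before modelling, it can help to check the class structure directly. The sketch below counts distinct classes per level and flags any class that appears under more than one parent. It assumes the CSV header uses the lowercase column names from the schema above (the original file may capitalize them differently), and the sample rows, including the helpfulness and time values, are made up.

```python
import csv
import io
from collections import defaultdict

# Made-up sample with the same ten columns as the train_40k schema.
SAMPLE_CSV = """productid,title,userid,helpfulness,score,time,text,cat1,cat2,cat3
P1,Chew toy,U1,0/0,5.0,1300000000,My dog loves it,pet supplies,dogs,toys
P2,Cat post,U2,1/1,4.0,1300000100,Sturdy enough,pet supplies,cats,furniture
P3,Board game,U3,2/3,3.0,1300000200,Fun for family,toys games,games,board games
"""


def summarize_levels(csv_file):
    """Return distinct class counts per level and classes seen under several parents."""
    classes = {level: set() for level in ("cat1", "cat2", "cat3")}
    parents = defaultdict(set)  # (level, class name) -> parent class names seen
    for row in csv.DictReader(csv_file):
        for level in classes:
            classes[level].add(row[level])
        parents[("cat2", row["cat2"])].add(row["cat1"])
        parents[("cat3", row["cat3"])].add(row["cat2"])
    counts = {level: len(names) for level, names in classes.items()}
    ambiguous = sorted(child for child, seen in parents.items() if len(seen) > 1)
    return counts, ambiguous


counts, ambiguous = summarize_levels(io.StringIO(SAMPLE_CSV))
print(counts)     # {'cat1': 2, 'cat2': 3, 'cat3': 3}
print(ambiguous)  # []
```

To run this on the real data, pass an open file handle for train_40k.csv in place of the `io.StringIO` sample. A non-empty `ambiguous` list would mean a class name alone does not identify its position in the hierarchy, which matters for the per-parent models in the hierarchical approach.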

Unlabeled 150k

@kaggle.kashnitsky_hierarchical_text_classification.unlabeled_150k
  • 41.91 MB
  • 150000 rows
  • 2 columns

CREATE TABLE unlabeled_150k (
  "title" VARCHAR,
  "text" VARCHAR
);

Val 10k

@kaggle.kashnitsky_hierarchical_text_classification.val_10k
  • 2.42 MB
  • 10000 rows
  • 10 columns

CREATE TABLE val_10k (
  "productid" VARCHAR,
  "title" VARCHAR,
  "userid" VARCHAR,
  "helpfulness" VARCHAR,
  "score" DOUBLE,
  "time" BIGINT,
  "text" VARCHAR,
  "cat1" VARCHAR,
  "cat2" VARCHAR,
  "cat3" VARCHAR
);
