Baselight

Cleaned Toxic Comments

Preprocessed data for Toxic Comments Classification Challenge

@kaggle.fizzbuzz_cleaned_toxic_comments

About this Dataset

Preprocessed Toxic Comments Classification Dataset

The main obstacle I faced in the Toxic Comments Classification Challenge was preprocessing. Getting the preprocessing right is an easy way to improve leaderboard (LB) performance.

This is the preprocessed version of the Toxic Comments Classification Challenge dataset. The preprocessing code is available at https://www.kaggle.com/fizzbuzz/toxic-data-preprocessing

Tables

Test Preprocessed

@kaggle.fizzbuzz_cleaned_toxic_comments.test_preprocessed
  • 32.17 MB
  • 153164 rows
  • 10 columns

CREATE TABLE test_preprocessed (
  "comment_text" VARCHAR,
  "id" VARCHAR,
  "identity_hate" VARCHAR,
  "insult" VARCHAR,
  "obscene" VARCHAR,
  "set" VARCHAR,
  "severe_toxic" VARCHAR,
  "threat" VARCHAR,
  "toxic" VARCHAR,
  "toxicity" VARCHAR
);

Train Preprocessed

@kaggle.fizzbuzz_cleaned_toxic_comments.train_preprocessed
  • 37 MB
  • 159571 rows
  • 10 columns

CREATE TABLE train_preprocessed (
  "comment_text" VARCHAR,
  "id" VARCHAR,
  "identity_hate" DOUBLE,
  "insult" DOUBLE,
  "obscene" DOUBLE,
  "set" VARCHAR,
  "severe_toxic" DOUBLE,
  "threat" DOUBLE,
  "toxic" DOUBLE,
  "toxicity" DOUBLE
);
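Note that the two schemas differ: the label columns (identity_hate, insult, obscene, severe_toxic, threat, toxic, toxicity) are DOUBLE in train_preprocessed but VARCHAR in test_preprocessed. The following is a minimal Python sketch of one way to bring the test labels to the same numeric type after exporting the table to CSV. It assumes that blank or non-numeric label text should be treated as missing; the page does not say what the VARCHAR values actually contain, and the sample rows and helper names (to_float_or_none, coerce_labels) are made up for illustration.

```python
import csv
import io

# Label columns shared by both tables, taken from the schemas above.
LABEL_COLUMNS = [
    "identity_hate", "insult", "obscene",
    "severe_toxic", "threat", "toxic", "toxicity",
]


def to_float_or_none(value):
    """Parse a label value as float; return None for blank or non-numeric text."""
    if value is None:
        return None
    try:
        return float(value.strip())
    except ValueError:
        # Assumption: anything that is not a number counts as a missing label.
        return None


def coerce_labels(rows):
    """Yield a copy of each row with the label columns converted to float or None."""
    for row in rows:
        fixed = dict(row)
        for col in LABEL_COLUMNS:
            fixed[col] = to_float_or_none(row.get(col))
        yield fixed


# Tiny in-memory stand-in for an exported test_preprocessed CSV (made-up rows).
sample_csv = io.StringIO(
    "comment_text,id,identity_hate,insult,obscene,set,severe_toxic,threat,toxic,toxicity\n"
    "hello world,abc123,,,,test,,,,\n"
    "another comment,def456,0.0,1.0,0.0,test,0.0,0.0,1.0,2.0\n"
)
rows = list(coerce_labels(csv.DictReader(sample_csv)))
```

With a real export, the in-memory sample would be replaced by an open file handle; the text columns (comment_text, id, set) are passed through unchanged.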
