PAWS (Paraphrase Adversaries from Word Scrambling)

A dataset for modeling structure, context, and word order information

@kaggle.thedevastator_the_paws_dataset_for_paraphrase_identification

Source

Huggingface Hub: link

About this dataset

PAWS: Paraphrase Adversaries from Word Scrambling
This dataset contains 108,463 human-labeled and 656k noisily labeled sentence pairs that highlight the importance of modeling structure, context, and word order for paraphrase identification. It has two subsets: one based on Wikipedia and one based on the Quora Question Pairs (QQP) dataset.
For further details, see the accompanying paper: PAWS: Paraphrase Adversaries from Word Scrambling (https://arxiv.org/abs/1904.01130)
PAWS-QQP is not distributed directly because of the QQP license. It must be reconstructed by downloading the original QQP data and then running the PAWS authors' scripts to generate the pairs and attach the labels.
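
To make the point about word order concrete, here is a minimal sketch showing that a bag-of-words comparison cannot separate two sentences that reuse the same words with a different meaning. The sentence pair is an illustrative example written for this sketch, not a row read from the CSV files, and the helper names `bag_of_words` and `jaccard` are hypothetical.

```python
from collections import Counter


def bag_of_words(sentence: str) -> Counter:
    """Lowercased, whitespace-tokenized word counts (a deliberately crude tokenizer)."""
    return Counter(sentence.lower().split())


def jaccard(a: str, b: str) -> float:
    """Jaccard overlap between the word sets of two sentences."""
    set_a, set_b = set(bag_of_words(a)), set(bag_of_words(b))
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 1.0


# Illustrative pair (assumption, not taken from the files): same words, different meaning.
s1 = "Flights from New York to Florida"
s2 = "Flights from Florida to New York"

same_bag = bag_of_words(s1) == bag_of_words(s2)
overlap = jaccard(s1, s2)
print(same_bag, overlap)  # True 1.0
```

Any model that only looks at word overlap would score this pair as a perfect match, which is the failure mode PAWS is built to expose.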

How to use the dataset

https://www.kaggle.com/google-research-datasets/paws

To use this dataset for paraphrase identification, you do not need to split the data yourself: the labeled pairs already come as labeled_final_train.csv, labeled_final_validation.csv, and labeled_final_test.csv. Train a paraphrase identification model on the training file, use the validation file for model selection, and then measure the model's performance on the test file.
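
The evaluation step described above can be sketched with the Python standard library alone. This is a rough outline under stated assumptions: the CSV files have a header row with the documented sentence1, sentence2, and label columns; the helper names `load_pairs`, `accuracy`, and `always_paraphrase` are hypothetical; and the two rows below are made up so the snippet runs without the real files.

```python
import csv
import io
from typing import Callable, Dict, List, TextIO

Pair = Dict[str, object]


def load_pairs(stream: TextIO) -> List[Pair]:
    """Read sentence1, sentence2 and an integer label from a CSV stream with a header."""
    rows = []
    for record in csv.DictReader(stream):
        rows.append({
            "sentence1": record["sentence1"],
            "sentence2": record["sentence2"],
            "label": int(record["label"]),
        })
    return rows


def accuracy(predict: Callable[[str, str], int], rows: List[Pair]) -> float:
    """Fraction of pairs for which predict(sentence1, sentence2) equals the label."""
    if not rows:
        return 0.0
    hits = sum(predict(r["sentence1"], r["sentence2"]) == r["label"] for r in rows)
    return hits / len(rows)


def always_paraphrase(sentence1: str, sentence2: str) -> int:
    """Placeholder baseline that predicts label 1 for every pair."""
    return 1


# Tiny in-memory stand-in for labeled_final_test.csv (made-up rows).
sample_csv = io.StringIO(
    "id,sentence1,sentence2,label\n"
    "1,He ran home.,He ran to his home.,1\n"
    "2,Ann paid Bob.,Bob paid Ann.,0\n"
)
test_rows = load_pairs(sample_csv)
print(accuracy(always_paraphrase, test_rows))  # 0.5 on the two made-up rows
```

With the real data, the stream would come from something like `open("labeled_final_test.csv", newline="", encoding="utf-8")`, and `always_paraphrase` would be replaced by a trained model's prediction function.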

Research Ideas

  • The PAWS dataset can be used to train a machine learning model to identify paraphrases.
  • The PAWS dataset can be used to improve the accuracy of a machine translation system.
  • The PAWS dataset can be used to develop a system that can detect plagiarism.

License

> License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication
> No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.

Columns

File: labeled_final_train.csv

Column name  Description
sentence1    The first sentence in the pair. (string)
sentence2    The second sentence in the pair. (string)
label        A label indicating whether the two sentences are paraphrases of each other (1) or not (0). (integer)

File: unlabeled_final_validation.csv

Column name  Description
sentence1    The first sentence in the pair. (string)
sentence2    The second sentence in the pair. (string)
label        A noisy label, assigned without human review, indicating whether the two sentences are paraphrases of each other (1) or not (0). (integer)

File: labeled_final_test.csv

Column name  Description
sentence1    The first sentence in the pair. (string)
sentence2    The second sentence in the pair. (string)
label        A label indicating whether the two sentences are paraphrases of each other (1) or not (0). (integer)

File: labeled_final_validation.csv

Column name  Description
sentence1    The first sentence in the pair. (string)
sentence2    The second sentence in the pair. (string)
label        A label indicating whether the two sentences are paraphrases of each other (1) or not (0). (integer)

File: unlabeled_final_train.csv

Column name  Description
sentence1    The first sentence in the pair. (string)
sentence2    The second sentence in the pair. (string)
label        A noisy label, assigned without human review, indicating whether the two sentences are paraphrases of each other (1) or not (0). (integer)

File: labeled_swap_train.csv

Column name  Description
sentence1    The first sentence in the pair. (string)
sentence2    The second sentence in the pair. (string)
label        A label indicating whether the two sentences are paraphrases of each other (1) or not (0). (integer)

Tables

Labeled Final Test

@kaggle.thedevastator_the_paws_dataset_for_paraphrase_identification.labeled_final_test
  • 812.05 KB
  • 8000 rows
  • 4 columns

CREATE TABLE labeled_final_test (
  "id" BIGINT,
  "sentence1" VARCHAR,
  "sentence2" VARCHAR,
  "label" BIGINT
);

Labeled Final Train

@kaggle.thedevastator_the_paws_dataset_for_paraphrase_identification.labeled_final_train
  • 7.76 MB
  • 49401 rows
  • 4 columns

CREATE TABLE labeled_final_train (
  "id" BIGINT,
  "sentence1" VARCHAR,
  "sentence2" VARCHAR,
  "label" BIGINT
);

Labeled Final Validation

@kaggle.thedevastator_the_paws_dataset_for_paraphrase_identification.labeled_final_validation
  • 812.2 KB
  • 8000 rows
  • 4 columns

CREATE TABLE labeled_final_validation (
  "id" BIGINT,
  "sentence1" VARCHAR,
  "sentence2" VARCHAR,
  "label" BIGINT
);

Labeled Swap Train

@kaggle.thedevastator_the_paws_dataset_for_paraphrase_identification.labeled_swap_train
  • 5.38 MB
  • 30397 rows
  • 4 columns

CREATE TABLE labeled_swap_train (
  "id" BIGINT,
  "sentence1" VARCHAR,
  "sentence2" VARCHAR,
  "label" BIGINT
);

Unlabeled Final Train

@kaggle.thedevastator_the_paws_dataset_for_paraphrase_identification.unlabeled_final_train
  • 102.44 MB
  • 645652 rows
  • 4 columns

CREATE TABLE unlabeled_final_train (
  "id" BIGINT,
  "sentence1" VARCHAR,
  "sentence2" VARCHAR,
  "label" BIGINT
);

Unlabeled Final Validation

@kaggle.thedevastator_the_paws_dataset_for_paraphrase_identification.unlabeled_final_validation
  • 1.27 MB
  • 10000 rows
  • 4 columns

CREATE TABLE unlabeled_final_validation (
  "id" BIGINT,
  "sentence1" VARCHAR,
  "sentence2" VARCHAR,
  "label" BIGINT
);
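
As a rough illustration of how these tables can be queried, the sketch below recreates the labeled_final_test schema in an in-memory SQLite database and counts pairs per label. SQLite is an assumption made for the example (the hosting platform's own SQL engine is not specified here), and the three inserted rows are made up; the real table has 8000 rows.

```python
import sqlite3

# Schema copied from the table definition above; SQLite accepts these type names.
SCHEMA = """
CREATE TABLE labeled_final_test (
  "id" BIGINT,
  "sentence1" VARCHAR,
  "sentence2" VARCHAR,
  "label" BIGINT
);
"""

conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)

# Made-up rows for illustration only.
conn.executemany(
    "INSERT INTO labeled_final_test VALUES (?, ?, ?, ?)",
    [
        (1, "He ran home.", "He ran to his home.", 1),
        (2, "Ann paid Bob.", "Bob paid Ann.", 0),
        (3, "Bob paid Ann.", "Ann was paid by Bob.", 1),
    ],
)

# Count how many pairs carry each label (1 = paraphrase, 0 = not a paraphrase).
label_counts = dict(
    conn.execute(
        "SELECT label, COUNT(*) FROM labeled_final_test GROUP BY label"
    ).fetchall()
)
print(label_counts)
conn.close()
```

The same GROUP BY query applies unchanged to the other five tables, since they share this four-column layout.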
