WinoBias Coreference Dataset by Kaggle | Demographics and Population Studies

About this Dataset

WinoBias Coreference Dataset

Gender-biased coreference dataset focused on occupation stereotypes in WinoBias

By wino_bias (from Hugging Face)

The WinoBias dataset is a resource for coreference resolution with a particular emphasis on gender bias. It is built around Winograd-schema style sentences in which entities are referred to by their occupations, such as the nurse, the doctor, or the carpenter.

The primary objective of the dataset is to test whether coreference in these sentences is resolved correctly regardless of gender stereotypes. By examining how a model links words to their referents in context, researchers can uncover and address cases where the model perpetuates those stereotypes.

Each entry in the dataset includes multiple attributes that enhance its usefulness and versatility. These attributes encompass crucial linguistic elements such as part-of-speech tags, parse bits (syntactic structure annotations), word senses, speaker information, named entity recognition tags (identifying entities like persons or locations), verbal predicates, lemma forms of predicates (verb base forms), and coreference clusters.

With its range of occupation-related sentences, the WinoBias dataset is a useful resource for researchers, developers, and evaluators working to improve coreference resolution systems. By evaluating model performance on this data, they can identify potential areas of bias in their algorithms and work towards more equitable language processing technologies.

In summary, the WinoBias dataset addresses gender bias in natural language processing by focusing specifically on coreference resolution. Its annotated sentences offer an opportunity to develop more robust models that avoid biased assumptions about occupations based on gender stereotypes.

How to use the dataset

Overview

The dataset consists of Winograd-schema style sentences where entities are referred to by their occupation, such as the nurse, the doctor, or the carpenter. The main goal is to resolve the coreference within these sentences.

File Description

The dataset includes several CSV files. In the original WinoBias design, type 1 sentences must be resolved using world knowledge about the described situation, while type 2 sentences can be resolved from syntactic cues alone. "Pro" files contain pro-stereotypical sentences, in which the pronoun's gender matches the occupational stereotype, and "anti" files contain anti-stereotypical sentences, in which it does not. Three of the files are described here:

type2_anti_validation.csv: Validation data for evaluating coreference resolution models on type 2, anti-stereotypical sentences.

type2_pro_test.csv: Test data for evaluating coreference resolution models on type 2, pro-stereotypical sentences.

type1_pro_validation.csv: Validation data for evaluating coreference resolution models on type 1, pro-stereotypical sentences.

Each CSV file contains multiple columns representing different features and information about each sentence, such as part number, word number, tokens (words), part-of-speech tags (POS tags), parse bit for each token, predicate lemma (verb lemma), word sense, speaker information, named entity recognition tags (NER tags), verbal predicates used in a sentence, and coreference clusters.

Note that the files share the same columns, so a column such as part_number appears in each file with the same meaning. Within a file, the same part_number value may occur on more than one row.
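The file names appear to follow a three-part pattern (sentence type, stereotype direction, split), which matches the eight tables listed in the Tables section. The sketch below splits a file name into those parts and enumerates the names the pattern implies. The names `SplitName`, `parse_split_name`, and `all_split_files` are illustrative, and the `.csv` names for files not described above are an assumption based on the pattern.

```python
from itertools import product
from typing import NamedTuple


class SplitName(NamedTuple):
    sentence_type: str  # "type1" or "type2"
    stereotype: str     # "pro" or "anti"
    split: str          # "test" or "validation"


def parse_split_name(filename: str) -> SplitName:
    """Split a name such as 'type2_anti_validation.csv' into its three parts."""
    stem = filename.rsplit(".", 1)[0]  # drop the extension if there is one
    sentence_type, stereotype, split = stem.split("_")
    return SplitName(sentence_type, stereotype, split)


def all_split_files() -> list[str]:
    """Enumerate the eight file names implied by the naming pattern (assumed)."""
    return [
        f"{t}_{s}_{p}.csv"
        for t, s, p in product(("type1", "type2"), ("pro", "anti"), ("test", "validation"))
    ]


if __name__ == "__main__":
    print(parse_split_name("type2_anti_validation.csv"))
    print(all_split_files())
```

Parsing the name this way makes it easy to pair each pro file with its anti counterpart when comparing model behaviour.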

Instructions

To utilize this dataset effectively:

Import one or more relevant CSV files into your preferred programming environment or tool that supports handling tabular data (e.g., Python pandas).

Explore the columns and understand their meanings by referring to the column descriptions provided in this guide.

Analyze the data and perform necessary pre-processing steps based on your specific research or analysis goals. You can consider tasks such as gender bias detection, coreference resolution model development, or evaluation of existing models.

Choose appropriate features/columns for your task and utilize them accordingly.

Leverage the insights from this dataset to gain a better understanding of gender biases present in coreference resolution and find ways to mitigate such biases.

Remember that proper data cleaning, preparation, and feature engineering are crucial steps before applying any machine learning methods.
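The first two steps above (importing a file and exploring its columns) might look like the following sketch. It uses only Python's standard `csv` module so that it runs without extra packages; `pandas.read_csv` would serve the same purpose. The helper names `load_rows` and `summarize` are illustrative, and the demo input is a made-up stand-in, not real dataset content.

```python
import csv
import io
from collections import Counter


def load_rows(handle) -> list[dict]:
    """Read one split from an open text file (or any file-like object)."""
    return list(csv.DictReader(handle))


def summarize(rows: list[dict]) -> dict:
    """Report the row count, the column names, and empty cells per column."""
    columns = list(rows[0].keys()) if rows else []
    empty = Counter()
    for row in rows:
        for col in columns:
            if row[col] == "":
                empty[col] += 1
    return {
        "rows": len(rows),
        "columns": columns,
        "empty_cells": {col: empty[col] for col in columns},
    }


if __name__ == "__main__":
    # Made-up two-row stand-in for a real file such as type1_pro_validation.csv.
    demo = io.StringIO("part_number,tokens,speaker\n0,placeholder,\n0,placeholder,\n")
    print(summarize(load_rows(demo)))
```

To read a real file, open it with `open(path, newline="", encoding="utf-8")` and pass the handle to `load_rows`. All values are returned as text, so convert individual columns only after inspecting how they are encoded in the CSV.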

Research Ideas

Bias detection: This dataset can be used to evaluate and measure the presence of gender bias in coreference resolution models. By analyzing the performance of different models on biased sentences related to occupations, researchers can identify and address any biases present in these models.

Model improvement: The dataset can also be used to improve existing coreference resolution models by training them on gender-biased examples. By incorporating this data into model training, researchers can enhance the model's ability to accurately resolve coreferences in sentences involving gender-specific occupations.

Algorithm development: Researchers can use this dataset to develop new algorithms or techniques for addressing gender bias in coreference resolution. By testing different strategies on the provided examples, they can identify effective approaches for reducing or eliminating bias in these models.
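For the bias detection idea, one simple way to quantify bias is to score a model separately on pro-stereotypical and anti-stereotypical sentences and look at the difference. The sketch below uses plain accuracy as a stand-in metric; the function names and the toy labels are invented for illustration and are not part of the dataset or of any particular evaluation toolkit.

```python
def accuracy(predicted, gold):
    """Fraction of positions where the predicted referent equals the gold one."""
    if len(predicted) != len(gold) or not gold:
        raise ValueError("need two non-empty sequences of equal length")
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)


def stereotype_gap(pro_score, anti_score):
    """Positive values mean the model does better on pro-stereotypical sentences."""
    return pro_score - anti_score


if __name__ == "__main__":
    # Toy labels invented for illustration; they are not taken from the dataset.
    gold = ["developer", "designer", "nurse", "carpenter"]
    pro_predictions = ["developer", "designer", "nurse", "carpenter"]
    anti_predictions = ["developer", "developer", "nurse", "nurse"]
    gap = stereotype_gap(
        accuracy(pro_predictions, gold),
        accuracy(anti_predictions, gold),
    )
    print(f"pro/anti accuracy gap: {gap:.2f}")
```

A gap near zero suggests the model resolves references equally well in both conditions, while a large positive gap suggests it leans on occupational stereotypes.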

Acknowledgements

If you use this dataset in your research, please credit the original authors.

License

License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication
No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.

Columns

File: type2_anti_validation.csv

Column name	Description
part_number	The number of the sentence part in the dataset. (Integer)
word_number	The position of the word in the sentence. (Integer)
tokens	The individual words in each sentence. (Text)
pos_tags	Part-of-speech tags associated with each token. (Text)
parse_bit	Syntactic structure information for each token. (Text)
predicate_lemma	The lemma of the verb used in the sentence. (Text)
word_sense	The sense of each word in context. (Text)
speaker	The speaker in each sentence. (Text)
ner_tags	Named entity recognition tags that identify specific types like organizations or locations. (Text)
verbal_predicates	Verbal predicates in sentences identified by their corresponding verbs. (Text)
coreference_clusters	Groups of words that refer to the same entity. (Text)

File: type2_pro_test.csv

Same columns and descriptions as type2_anti_validation.csv above.

File: type1_pro_validation.csv

Same columns and descriptions as type2_anti_validation.csv above.

Acknowledgements

If you use this dataset in your research, please credit wino_bias (From Huggingface).

Tables

Type1 Anti Test

@kaggle.thedevastator_winobias_coreference_dataset.type1_anti_test

34.05 KB
396 rows
13 columns


CREATE TABLE type1_anti_test (
  "document_id" VARCHAR,
  "part_number" BIGINT,
  "word_number" VARCHAR,
  "tokens" VARCHAR,
  "pos_tags" VARCHAR,
  "parse_bit" VARCHAR,
  "predicate_lemma" VARCHAR,
  "predicate_framenet_id" VARCHAR,
  "word_sense" VARCHAR,
  "speaker" VARCHAR,
  "ner_tags" VARCHAR,
  "verbal_predicates" VARCHAR,
  "coreference_clusters" VARCHAR
);

Type1 Anti Validation

@kaggle.thedevastator_winobias_coreference_dataset.type1_anti_validation

32.34 KB
396 rows
13 columns


CREATE TABLE type1_anti_validation (
  "document_id" VARCHAR,
  "part_number" BIGINT,
  "word_number" VARCHAR,
  "tokens" VARCHAR,
  "pos_tags" VARCHAR,
  "parse_bit" VARCHAR,
  "predicate_lemma" VARCHAR,
  "predicate_framenet_id" VARCHAR,
  "word_sense" VARCHAR,
  "speaker" VARCHAR,
  "ner_tags" VARCHAR,
  "verbal_predicates" VARCHAR,
  "coreference_clusters" VARCHAR
);

Type1 Pro Test

@kaggle.thedevastator_winobias_coreference_dataset.type1_pro_test

34.11 KB
396 rows
13 columns


CREATE TABLE type1_pro_test (
  "document_id" VARCHAR,
  "part_number" BIGINT,
  "word_number" VARCHAR,
  "tokens" VARCHAR,
  "pos_tags" VARCHAR,
  "parse_bit" VARCHAR,
  "predicate_lemma" VARCHAR,
  "predicate_framenet_id" VARCHAR,
  "word_sense" VARCHAR,
  "speaker" VARCHAR,
  "ner_tags" VARCHAR,
  "verbal_predicates" VARCHAR,
  "coreference_clusters" VARCHAR
);

Type1 Pro Validation

@kaggle.thedevastator_winobias_coreference_dataset.type1_pro_validation

32.3 KB
396 rows
13 columns


CREATE TABLE type1_pro_validation (
  "document_id" VARCHAR,
  "part_number" BIGINT,
  "word_number" VARCHAR,
  "tokens" VARCHAR,
  "pos_tags" VARCHAR,
  "parse_bit" VARCHAR,
  "predicate_lemma" VARCHAR,
  "predicate_framenet_id" VARCHAR,
  "word_sense" VARCHAR,
  "speaker" VARCHAR,
  "ner_tags" VARCHAR,
  "verbal_predicates" VARCHAR,
  "coreference_clusters" VARCHAR
);

Type2 Anti Test

@kaggle.thedevastator_winobias_coreference_dataset.type2_anti_test

33.52 KB
396 rows
13 columns


CREATE TABLE type2_anti_test (
  "document_id" VARCHAR,
  "part_number" BIGINT,
  "word_number" VARCHAR,
  "tokens" VARCHAR,
  "pos_tags" VARCHAR,
  "parse_bit" VARCHAR,
  "predicate_lemma" VARCHAR,
  "predicate_framenet_id" VARCHAR,
  "word_sense" VARCHAR,
  "speaker" VARCHAR,
  "ner_tags" VARCHAR,
  "verbal_predicates" VARCHAR,
  "coreference_clusters" VARCHAR
);

Type2 Anti Validation

@kaggle.thedevastator_winobias_coreference_dataset.type2_anti_validation

32.69 KB
396 rows
13 columns


CREATE TABLE type2_anti_validation (
  "document_id" VARCHAR,
  "part_number" BIGINT,
  "word_number" VARCHAR,
  "tokens" VARCHAR,
  "pos_tags" VARCHAR,
  "parse_bit" VARCHAR,
  "predicate_lemma" VARCHAR,
  "predicate_framenet_id" VARCHAR,
  "word_sense" VARCHAR,
  "speaker" VARCHAR,
  "ner_tags" VARCHAR,
  "verbal_predicates" VARCHAR,
  "coreference_clusters" VARCHAR
);

Type2 Pro Test

@kaggle.thedevastator_winobias_coreference_dataset.type2_pro_test

33.48 KB
396 rows
13 columns


CREATE TABLE type2_pro_test (
  "document_id" VARCHAR,
  "part_number" BIGINT,
  "word_number" VARCHAR,
  "tokens" VARCHAR,
  "pos_tags" VARCHAR,
  "parse_bit" VARCHAR,
  "predicate_lemma" VARCHAR,
  "predicate_framenet_id" VARCHAR,
  "word_sense" VARCHAR,
  "speaker" VARCHAR,
  "ner_tags" VARCHAR,
  "verbal_predicates" VARCHAR,
  "coreference_clusters" VARCHAR
);

Type2 Pro Validation

@kaggle.thedevastator_winobias_coreference_dataset.type2_pro_validation

32.72 KB
396 rows
13 columns


CREATE TABLE type2_pro_validation (
  "document_id" VARCHAR,
  "part_number" BIGINT,
  "word_number" VARCHAR,
  "tokens" VARCHAR,
  "pos_tags" VARCHAR,
  "parse_bit" VARCHAR,
  "predicate_lemma" VARCHAR,
  "predicate_framenet_id" VARCHAR,
  "word_sense" VARCHAR,
  "speaker" VARCHAR,
  "ner_tags" VARCHAR,
  "verbal_predicates" VARCHAR,
  "coreference_clusters" VARCHAR
);
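All eight table definitions above list the same 13 columns. A quick header check such as the sketch below can confirm that a loaded file matches that layout before further processing. The column names are copied from the CREATE TABLE statements; the function name `check_header` is illustrative.

```python
EXPECTED_COLUMNS = (
    "document_id",
    "part_number",
    "word_number",
    "tokens",
    "pos_tags",
    "parse_bit",
    "predicate_lemma",
    "predicate_framenet_id",
    "word_sense",
    "speaker",
    "ner_tags",
    "verbal_predicates",
    "coreference_clusters",
)


def check_header(header):
    """Return (missing, unexpected) column names relative to the listed schema."""
    missing = [col for col in EXPECTED_COLUMNS if col not in header]
    unexpected = [col for col in header if col not in EXPECTED_COLUMNS]
    return missing, unexpected


if __name__ == "__main__":
    # A header that lacks most columns and has one extra, for demonstration only.
    print(check_header(["document_id", "tokens", "extra_column"]))
```

Two empty lists mean the header matches the schema exactly; anything else points to a renamed, dropped, or added column.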