WinoBias Coreference Dataset
Coreference resolution dataset for measuring gender bias tied to occupation stereotypes
By wino_bias (From Huggingface)
About this dataset
WinoBias is a coreference resolution dataset built to expose gender bias. It consists of Winograd-schema style sentences in which entities are referred to by their occupations, such as the nurse, the doctor, or the carpenter.
Each sentence contains a gendered pronoun that must be linked to the correct occupation. Comparing how well a model resolves the pronoun when the correct link agrees with a gender stereotype and when it contradicts one shows whether the model relies on the stereotype rather than on the context.
Each entry carries several annotation layers: part-of-speech tags, parse bits (syntactic structure annotations), word senses, speaker information, named entity recognition tags (identifying entities such as persons or locations), verbal predicates, predicate lemmas (verb base forms), and coreference clusters.
The dataset is intended for researchers, developers, and evaluators working on coreference resolution systems. Evaluating a model on this data shows where its predictions depend on gender stereotypes about occupations.
In summary, WinoBias targets gender bias in one specific natural language processing task, coreference resolution. Its annotated sentences support both measuring biased assumptions about occupations and developing models that avoid them.
How to use the dataset
Overview
The dataset consists of Winograd-schema style sentences where entities are referred to by their occupation, such as the nurse, the doctor, or the carpenter. The task is to resolve the coreference within each sentence, that is, to decide which occupation the pronoun refers to.
File Description
The dataset includes several CSV files. In the file names, type1 and type2 refer to the two WinoBias sentence templates, pro and anti indicate whether the pronoun's gender matches (pro-stereotypical) or contradicts (anti-stereotypical) the occupation stereotype, and validation and test name the data split:
- type2_anti_validation.csv: Validation data containing type 2 sentences in the anti-stereotypical condition.
- type2_pro_test.csv: Test data containing type 2 sentences in the pro-stereotypical condition.
- type1_pro_validation.csv: Validation data containing type 1 sentences in the pro-stereotypical condition.
Each CSV file contains multiple columns representing different features and information about each sentence, such as part number, word number, tokens (words), part-of-speech tags (POS tags), parse bit for each token, predicate lemma (verb lemma), word sense, speaker information, named entity recognition tags (NER tags), verbal predicates used in a sentence, and coreference clusters.
The same columns appear in every file and carry the same meaning in each. Values may also repeat within a file: part_number, for example, may appear on more than one row, each time referring to a different part or section.
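Before working with a file, it can help to confirm that its header matches the columns listed in this guide. The following is a minimal sketch using only the Python standard library. The helper name `missing_columns` is hypothetical, and the in-memory sample stands in for a real file handle; nothing here is part of an official loader.

```python
import csv
import io

# Column names as listed in this guide's column tables.
EXPECTED_COLUMNS = [
    "part_number", "word_number", "tokens", "pos_tags", "parse_bit",
    "predicate_lemma", "word_sense", "speaker", "ner_tags",
    "verbal_predicates", "coreference_clusters",
]


def missing_columns(csv_file):
    """Return the expected column names that are absent from the CSV header."""
    reader = csv.reader(csv_file)
    header = next(reader, [])  # an empty file yields an empty header
    present = {name.strip() for name in header}
    return [name for name in EXPECTED_COLUMNS if name not in present]


# Illustrative stand-in for:
#   open("type1_pro_validation.csv", newline="", encoding="utf-8")
sample = io.StringIO("part_number,word_number,tokens,pos_tags\n0,0,The,DT\n")
print(missing_columns(sample))
```

An empty list means every documented column is present. Running the same check on each file is a quick way to verify that they share one schema.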
Instructions
To utilize this dataset effectively:
- Import one or more relevant CSV files into your preferred programming environment or tool that supports handling tabular data (e.g., Python pandas).
- Explore the columns and understand their meanings by referring to the column descriptions provided in this guide.
- Analyze the data and perform necessary pre-processing steps based on your specific research or analysis goals. You can consider tasks such as gender bias detection, coreference resolution model development, or evaluation of existing models.
- Choose appropriate features/columns for your task and utilize them accordingly.
- Leverage the insights from this dataset to gain a better understanding of gender biases present in coreference resolution and find ways to mitigate such biases.
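The steps above can be sketched roughly as follows. This example assumes, without confirmation from the source, that each row holds one sentence and that the tokens column stores a stringified list such as "['The', 'nurse']". Check a few rows of the real files first and adjust the parsing if the layout differs. The helper names and the two sample sentences are made up for illustration and are not taken from the dataset.

```python
import csv
import io
import re

FEMALE_PRONOUNS = {"she", "her", "hers", "herself"}
MALE_PRONOUNS = {"he", "him", "his", "himself"}

# Matches single- or double-quoted items inside a stringified list.
_QUOTED = re.compile(r"'([^']*)'|\"([^\"]*)\"")


def parse_tokens(cell):
    """Extract tokens from a cell such as "['The', 'nurse']" (assumed format)."""
    return [single or double for single, double in _QUOTED.findall(cell)]


def pronoun_gender(tokens):
    """Return 'female', 'male', or 'unknown' based on the pronouns present."""
    lowered = {token.lower() for token in tokens}
    has_female = bool(lowered & FEMALE_PRONOUNS)
    has_male = bool(lowered & MALE_PRONOUNS)
    if has_female and not has_male:
        return "female"
    if has_male and not has_female:
        return "male"
    return "unknown"


# Made-up rows standing in for a real CSV file.
SAMPLE_CSV = """part_number,tokens
0,"['The', 'nurse', 'said', 'she', 'was', 'busy', '.']"
0,"['The', 'nurse', 'said', 'he', 'was', 'busy', '.']"
"""

for row in csv.DictReader(io.StringIO(SAMPLE_CSV)):
    tokens = parse_tokens(row["tokens"])
    print(pronoun_gender(tokens), "|", " ".join(tokens))
```

With pandas, `pandas.read_csv` could replace `csv.DictReader`, and `parse_tokens` could be applied to the tokens column in the same way.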
Remember that proper data cleaning, preparation, and feature engineering are crucial steps before applying any machine learning or
Research Ideas
- Bias detection: Measure gender bias in coreference resolution models by comparing their performance on pro-stereotypical and anti-stereotypical sentences. A model that scores noticeably higher on the pro files than on the anti files is relying on occupation stereotypes.
- Model improvement: Improve existing coreference resolution models by incorporating these examples into training, so that the model resolves coreferences accurately regardless of the gender stereotypically associated with an occupation.
- Algorithm development: Develop new algorithms or techniques for addressing gender bias in coreference resolution. Testing different strategies on the provided examples helps identify effective approaches for reducing or eliminating bias in these models.
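One simple way to quantify the bias-detection idea is to score a model separately on the pro and anti files and take the difference. The sketch below uses plain accuracy over per-sentence correct/incorrect outcomes as a simplification; other metrics, such as F1, could be substituted. The function names are hypothetical and the outcome lists are illustrative values, not real model results.

```python
def accuracy(outcomes):
    """Fraction of sentences resolved correctly (outcomes are booleans)."""
    outcomes = list(outcomes)
    if not outcomes:
        raise ValueError("no outcomes to score")
    return sum(1 for correct in outcomes if correct) / len(outcomes)


def bias_gap(pro_outcomes, anti_outcomes):
    """Accuracy on pro-stereotypical minus accuracy on anti-stereotypical data.

    A value near zero suggests the model treats both conditions alike;
    a large positive value suggests reliance on occupation stereotypes.
    """
    return accuracy(pro_outcomes) - accuracy(anti_outcomes)


# Illustrative outcomes only: True means the model linked the pronoun correctly.
pro = [True, True, True, False]
anti = [True, False, False, True]

print(f"pro accuracy:  {accuracy(pro):.2f}")
print(f"anti accuracy: {accuracy(anti):.2f}")
print(f"bias gap:      {bias_gap(pro, anti):.2f}")
```

Comparing files of the same type and split (for example, a pro file against its anti counterpart) keeps the difference attributable to the stereotype condition rather than to the sentence template.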
Acknowledgements
If you use this dataset in your research, please credit the original authors.
License
License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication
No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission.
Columns
File: type2_anti_validation.csv
| Column name | Description |
| --- | --- |
| part_number | The number of the sentence part in the dataset. (Integer) |
| word_number | The position of the word in the sentence. (Integer) |
| tokens | The individual words in each sentence. (Text) |
| pos_tags | Part-of-speech tags associated with each token. (Text) |
| parse_bit | Syntactic structure information for each token. (Text) |
| predicate_lemma | The lemma of the verb used in the sentence. (Text) |
| word_sense | The sense of each word in context. (Text) |
| speaker | The speaker in each sentence. (Text) |
| ner_tags | Named entity recognition tags that identify specific types like organizations or locations. (Text) |
| verbal_predicates | Verbal predicates in sentences identified by their corresponding verbs. (Text) |
| coreference_clusters | Groups of words that refer to the same entity. (Text) |
File: type2_pro_test.csv
| Column name | Description |
| --- | --- |
| part_number | The number of the sentence part in the dataset. (Integer) |
| word_number | The position of the word in the sentence. (Integer) |
| tokens | The individual words in each sentence. (Text) |
| pos_tags | Part-of-speech tags associated with each token. (Text) |
| parse_bit | Syntactic structure information for each token. (Text) |
| predicate_lemma | The lemma of the verb used in the sentence. (Text) |
| word_sense | The sense of each word in context. (Text) |
| speaker | The speaker in each sentence. (Text) |
| ner_tags | Named entity recognition tags that identify specific types like organizations or locations. (Text) |
| verbal_predicates | Verbal predicates in sentences identified by their corresponding verbs. (Text) |
| coreference_clusters | Groups of words that refer to the same entity. (Text) |
File: type1_pro_validation.csv
| Column name | Description |
| --- | --- |
| part_number | The number of the sentence part in the dataset. (Integer) |
| word_number | The position of the word in the sentence. (Integer) |
| tokens | The individual words in each sentence. (Text) |
| pos_tags | Part-of-speech tags associated with each token. (Text) |
| parse_bit | Syntactic structure information for each token. (Text) |
| predicate_lemma | The lemma of the verb used in the sentence. (Text) |
| word_sense | The sense of each word in context. (Text) |
| speaker | The speaker in each sentence. (Text) |
| ner_tags | Named entity recognition tags that identify specific types like organizations or locations. (Text) |
| verbal_predicates | Verbal predicates in sentences identified by their corresponding verbs. (Text) |
| coreference_clusters | Groups of words that refer to the same entity. (Text) |
Acknowledgements
If you use this dataset in your research, please credit wino_bias (From Huggingface).