Baselight

HAREM Portuguese NER Corpus

Portuguese NER Corpus with 10 Classes

@kaggle.thedevastator_harem_portuguese_ner_corpus


About this Dataset

By harem (from Hugging Face)


The dataset is available in two versions: a complete version with 10 named entity classes (Person, Organization, Location, Value, Date, Title, Thing, Event, Abstraction, and Other) and a selective version with only 5 classes (Person, Organization, Location, Value, and Date). The selective version focuses on the most commonly recognized named entity types.
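
To make the relationship between the two versions concrete, here is a minimal sketch that collapses a 10-class tag sequence onto the 5-class selective set. The tag strings (`B-Person`, `I-Title`, `O`) and the BIO prefix scheme are assumptions for illustration only; the CSV files may encode labels differently (for example, as integer ids), and the selective version may not have been derived exactly this way.

```python
# Hypothetical label names: the page lists the classes in English, but the
# CSV files may encode them differently (e.g. as integer ids).
SELECTIVE_CLASSES = {"Person", "Organization", "Location", "Value", "Date"}


def to_selective(tags, outside="O"):
    """Map a 10-class tag sequence onto the 5-class selective set.

    Tags are assumed to look like "B-Person" / "I-Person" / "O".
    Any tag whose class is not in SELECTIVE_CLASSES becomes `outside`.
    """
    result = []
    for tag in tags:
        # Split "B-Person" into the prefix "B" and the class "Person".
        # For "O" the class part is empty, so it maps to `outside`.
        _, _, entity_class = tag.partition("-")
        result.append(tag if entity_class in SELECTIVE_CLASSES else outside)
    return result


full = ["B-Person", "I-Person", "O", "B-Title", "I-Title", "B-Date"]
selective = to_selective(full)
```

Tokens tagged with Title, Thing, Event, Abstraction, or Other are treated as non-entities here, which is one plausible reading of "selective"; verify against the actual files before relying on it.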

It's worth noting that the original HAREM dataset had two levels of NER detail: Category and Sub-type. However, the processed version of the corpus presented in this Kaggle dataset only includes information up to the Category level.

Each entry in this dataset consists of tokenized words from the original text along with their corresponding NER tags assigned through annotation. The tokens column contains the individual words or tokens extracted from the text.

The ner_tags column contains the class label assigned to each token, indicating its named entity class, such as Person or Organization.

This Kaggle dataset contains three separate CSV files: train.csv for training data; validation.csv, a subset used for validating NER model performance on Portuguese texts; and test.csv, another subset of the HAREM corpus with tokenized words alongside their respective NER tags. The separate files let users train, validate, and test NER models on Portuguese texts.
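
A minimal sketch of reading one of these files with the Python standard library follows. It assumes each `tokens` and `ner_tags` cell stores a whole sequence as text (for example `['a' 'b']` or `[0, 1]`), which is a guess based on the VARCHAR column types; inspect a few rows to confirm. The helper names `parse_cell` and `read_split` are hypothetical.

```python
import csv
import io
import re

# Matches a single-quoted item, a double-quoted item, or a bare item
# (for example an integer label id).
_ITEM = re.compile(r"""'([^']*)'|"([^"]*)"|([^\s,\[\]]+)""")


def parse_cell(cell):
    """Split a stringified sequence like "['a' 'b']" or "[0, 1]" into items.

    Assumption: each cell holds a whole sequence as one text value.
    Items are returned as strings; escape sequences are not interpreted.
    """
    return [a or b or c for a, b, c in _ITEM.findall(cell)]


def read_split(handle):
    """Read one split (train, validation, or test) from an open CSV file."""
    rows = []
    for record in csv.DictReader(handle):
        rows.append({
            "id": record.get("id"),
            "tokens": parse_cell(record["tokens"]),
            "ner_tags": parse_cell(record["ner_tags"]),
        })
    return rows


# In-memory stand-in for a real file; the row content is made up.
sample = io.StringIO('''id,tokens,ner_tags
0,"['Lisboa' 'é' 'linda']","['B-Location' 'O' 'O']"
''')
rows = read_split(sample)
```

With a real file, the same function would be called as `read_split(open("train.csv", newline="", encoding="utf-8"))`, where the path is wherever the download was saved.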

How to use the dataset

  • Dataset Files:
    a) train.csv - Contains the training data with tokens (individual words or tokens) and their corresponding named entity recognition (NER) tags.
    b) validation.csv - Provides a subset of the corpus for validating model performance in identifying named entities.
    c) test.csv - Contains tokenized words from the corpus along with their respective NER tags.

  • Named Entity Classes:
    The dataset includes 10 different named entity classes: Person, Organization, Location, Value, Date, Title, Thing, Abstraction, Event, and Other.

  • Understanding the Columns:
    a) tokens - Contains the individual tokens or words extracted from the text.
    b) ner_tags - Lists the named entity recognition tag assigned to each token, indicating its class.

  • Training and Evaluation:
    To use this dataset for training a NER model, you can utilize the train.csv file. The tokens column will provide you with the words or tokens, while the ner_tags column will guide you in labeling the named entities within your training data.

    For evaluating your model's performance, the validation.csv file can be used. Like train.csv, it contains tokenized words and their corresponding NER tags.

  • Applying Pretrained Models:
    You can also use this dataset to fine-tune or evaluate pretrained NER models in Portuguese. By applying transfer learning techniques on this corpus, you may improve their performance on Portuguese named entity recognition tasks.
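
When fine-tuning a pretrained model, word-level tags usually have to be spread over subword pieces. The sketch below shows one common convention, assuming integer labels and a `word_ids` list shaped like the one returned by `word_ids()` on Hugging Face fast tokenizers. The function name `align_labels` is hypothetical, and `-100` is used because it is the default ignore index of PyTorch's cross-entropy loss.

```python
def align_labels(word_ids, word_labels, ignore_index=-100):
    """Expand word-level labels to subword positions.

    word_ids: for each subword position, the index of the source word,
        or None for special tokens such as [CLS] and [SEP].
    word_labels: one integer label per word.
    Only the first subword of each word keeps its label; all other
    positions get `ignore_index` so the loss can skip them.
    """
    aligned = []
    previous = None
    for word_id in word_ids:
        if word_id is None or word_id == previous:
            aligned.append(ignore_index)
        else:
            aligned.append(word_labels[word_id])
        previous = word_id
    return aligned


# Made-up example: three words, the second split into two subwords.
word_ids = [None, 0, 1, 1, 2, None]
word_labels = [3, 0, 5]
aligned = align_labels(word_ids, word_labels)
```

Labeling only the first subword is a design choice, not a requirement; some setups repeat the label on every piece instead.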

Research Ideas

  • Entity Recognition and Classification: This dataset can be used to train and evaluate models for named entity recognition (NER) tasks in Portuguese. The NER tags provided in the dataset can serve as labels for training models to accurately identify and classify entities such as person names, organization names, locations, dates, etc.
  • Cross-lingual Transfer Learning: The dataset can also be used for cross-lingual transfer learning by training a model on it and then applying the trained model to extract named entities from text in other languages. This could enable NER in multiple languages with a single trained model by leveraging knowledge gained from the labeled Portuguese data.
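
For the evaluation side of these ideas, NER models are often scored at the entity level rather than per token. The sketch below computes an exact-match entity F1 for a single sequence, assuming BIO-style string tags such as `B-Person`; that tag format is an assumption and the function names are hypothetical.

```python
def extract_spans(tags):
    """Return the set of (start, end, class) spans in a BIO tag sequence.

    `end` is exclusive. An "I-" tag without a matching open span is
    treated leniently as the start of a new span.
    """
    spans = set()
    start, current = None, None
    # The trailing "O" is a sentinel that closes a span ending at the
    # last position.
    for i, tag in enumerate(list(tags) + ["O"]):
        prefix, _, entity_class = tag.partition("-")
        inside = prefix == "I" and entity_class == current
        if not inside:
            if current is not None:
                spans.add((start, i, current))
            if prefix in ("B", "I"):
                start, current = i, entity_class
            else:
                start, current = None, None
    return spans


def entity_f1(gold, predicted):
    """Exact-match entity-level F1 between two tag sequences."""
    gold_spans = extract_spans(gold)
    pred_spans = extract_spans(predicted)
    correct = len(gold_spans & pred_spans)
    precision = correct / len(pred_spans) if pred_spans else 0.0
    recall = correct / len(gold_spans) if gold_spans else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


gold = ["B-Person", "I-Person", "O", "B-Location"]
predicted = ["B-Person", "I-Person", "O", "O"]
score = entity_f1(gold, predicted)
```

A span counts as correct only if its boundaries and class both match. For a whole split, the span sets would be pooled across sequences before computing precision and recall.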

Acknowledgements

If you use this dataset in your research, please credit the original authors, harem (from Hugging Face).

License

License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication
No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.

Columns

File: validation.csv

Column name    Description
tokens         Individual words or tokens from the text. (Text)
ner_tags       The named entity class assigned to each token, used to identify named entities during training or inference. (Text)

File: train.csv

Column name    Description
tokens         Individual words or tokens from the text. (Text)
ner_tags       The named entity class assigned to each token, used to identify named entities during training or inference. (Text)

File: test.csv

Column name    Description
tokens         Individual words or tokens from the text. (Text)
ner_tags       The named entity class assigned to each token, used to identify named entities during training or inference. (Text)

Tables

Test

@kaggle.thedevastator_harem_portuguese_ner_corpus.test
  • 191.45 KB
  • 128 rows
  • 3 columns

CREATE TABLE test (
  "id" VARCHAR,
  "tokens" VARCHAR,
  "ner_tags" VARCHAR
);

Train

@kaggle.thedevastator_harem_portuguese_ner_corpus.train
  • 212.02 KB
  • 121 rows
  • 3 columns

CREATE TABLE train (
  "id" VARCHAR,
  "tokens" VARCHAR,
  "ner_tags" VARCHAR
);

Validation

@kaggle.thedevastator_harem_portuguese_ner_corpus.validation
  • 28.71 KB
  • 8 rows
  • 3 columns

CREATE TABLE validation (
  "id" VARCHAR,
  "tokens" VARCHAR,
  "ner_tags" VARCHAR
);
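
As a rough way to try these schemas locally, the sketch below recreates the tables in an in-memory SQLite database using Python's standard `sqlite3` module. The helper name `build_database` and the sample row are hypothetical; only the three-column layout comes from the table definitions above.

```python
import sqlite3

# Same three VARCHAR columns as the table definitions on this page.
SCHEMA = """
CREATE TABLE {name} (
  "id" VARCHAR,
  "tokens" VARCHAR,
  "ner_tags" VARCHAR
);
"""

ALLOWED_TABLES = {"train", "validation", "test"}


def build_database(splits):
    """Create one table per split in an in-memory SQLite database.

    splits: dict mapping a table name to a list of
        (id, tokens, ner_tags) tuples, all stored as text.
    """
    connection = sqlite3.connect(":memory:")
    for name, rows in splits.items():
        # Table names cannot be bound as parameters, so restrict them
        # to the known split names before formatting them into SQL.
        if name not in ALLOWED_TABLES:
            raise ValueError(f"unexpected table name: {name}")
        connection.execute(SCHEMA.format(name=name))
        connection.executemany(f"INSERT INTO {name} VALUES (?, ?, ?)", rows)
    return connection


# Made-up row for illustration only.
connection = build_database({
    "validation": [("0", "['Lisboa' 'é' 'linda']", "['B-Location' 'O' 'O']")],
})
count = connection.execute("SELECT COUNT(*) FROM validation").fetchone()[0]
```

Loading the real CSV rows into these tables would let the row counts listed above be checked with a simple `SELECT COUNT(*)` per table.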
