The dataset is available in two versions: a complete version with 10 named entity classes (Person, Organization, Location, Value, Date, Title, Thing, Event, Abstraction, and Other) and a selective version with only 5 classes (Person, Organization, Location, Value, and Date). The selective version focuses on the most commonly recognized named entity types.
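To make the relationship between the two versions concrete, here is a minimal sketch that reduces complete-version labels to the selective set. The class names come from the description above; the plain-string tag format, the "O" label for non-entity tokens, and the helper `to_selective` are assumptions made for illustration, and this is not necessarily how the selective version was actually produced.

```python
# Class names are taken from the dataset description.
COMPLETE_CLASSES = [
    "Person", "Organization", "Location", "Value", "Date",
    "Title", "Thing", "Event", "Abstraction", "Other",
]
SELECTIVE_CLASSES = ["Person", "Organization", "Location", "Value", "Date"]

# Assumption: tokens outside any entity carry a hypothetical "O" label.
OUTSIDE = "O"


def to_selective(tags):
    """Keep tags from the selective set and map every other tag to OUTSIDE."""
    keep = set(SELECTIVE_CLASSES)
    return [tag if tag in keep else OUTSIDE for tag in tags]


# Made-up tag sequence using complete-version labels.
example_tags = ["Person", "O", "Title", "Location", "Event"]
selective_tags = to_selective(example_tags)
print(selective_tags)  # ['Person', 'O', 'O', 'Location', 'O']
```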
Note that the original HAREM dataset annotates entities at two levels of detail: Category and Sub-type. The processed version of the corpus in this Kaggle dataset includes information only up to the Category level.
Each entry in this dataset consists of tokenized words from the original text along with the NER tags assigned to them through annotation. The tokens column contains the individual words or tokens extracted from the text, while a second column, listed as **tokens, duplicates it for consistency.
The ner_tags column contains the class label assigned to each token, indicating its named entity class, such as Person or Organization. The **ner_tags column is an identical copy of ner_tags, kept for consistency in datasets where both columns co-occur.
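Because tokens and ner_tags are parallel sequences, a common first step is to pair them position by position. The sketch below assumes each row holds two equal-length lists and that tags are plain category strings with "O" for non-entity tokens; the example sentence, the "O" label, and the helper `pair_tokens` are hypothetical and not taken from the dataset.

```python
from collections import Counter

# Hypothetical row: the sentence and its labels are made up for illustration.
row = {
    "tokens": ["Maria", "Silva", "visitou", "Lisboa", "em", "2005", "."],
    "ner_tags": ["Person", "Person", "O", "Location", "O", "Date", "O"],
}


def pair_tokens(row):
    """Return (token, tag) pairs, checking that the two columns align."""
    tokens, tags = row["tokens"], row["ner_tags"]
    if len(tokens) != len(tags):
        raise ValueError("tokens and ner_tags must have the same length")
    return list(zip(tokens, tags))


pairs = pair_tokens(row)

# Count tokens per entity class, ignoring the assumed "O" label.
entity_counts = Counter(tag for _, tag in pairs if tag != "O")
print(pairs[:2])
print(dict(entity_counts))
```

The length check is worth keeping when working with real rows, since a mismatch between the two columns would silently misalign every label after it.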
This Kaggle dataset also contains three separate CSV files: train.csv for training data, validation.csv for validating NER model performance on Portuguese texts, and test.csv, another subset of the HAREM corpus with tokenized words alongside their respective NER tags. The separate files let users train, validate, and test NER models on Portuguese texts using reliable sources.
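A minimal sketch of reading one of these files with Python's standard library follows. The file names and the tokens and ner_tags column names come from the description; how the lists are serialized inside each CSV cell is not stated, so the code assumes Python-style list literals, and the in-memory sample stands in for a real file. The helper `read_split` is hypothetical.

```python
import ast
import csv
import io


def read_split(handle):
    """Read one split and return rows with list-valued tokens and ner_tags."""
    rows = []
    for record in csv.DictReader(handle):
        rows.append({
            # Assumption: each cell is a Python-style list literal.
            "tokens": ast.literal_eval(record["tokens"]),
            "ner_tags": ast.literal_eval(record["ner_tags"]),
        })
    return rows


# Made-up stand-in for the contents of train.csv.
SAMPLE = """tokens,ner_tags
"['Lisboa', 'fica', 'em', 'Portugal']","['Location', 'O', 'O', 'Location']"
"""

train_rows = read_split(io.StringIO(SAMPLE))
print(train_rows[0]["tokens"])
print(train_rows[0]["ner_tags"])
```

With the downloaded files, the same function could be called once per split, for example inside `with open("train.csv", newline="", encoding="utf-8") as f:`, and likewise for validation.csv and test.csv. The UTF-8 encoding is also an assumption, and if the cells turn out to use a different serialization, only the two `ast.literal_eval` calls need to change.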