WikiANN
Multilingual named entity recognition for LLM training
@kaggle.thedevastator_lombard_language_training_dataset
By wikiann (From Huggingface)
Overview
WikiANN (sometimes called PAN-X) is a multilingual named entity recognition dataset consisting of Wikipedia articles annotated with LOC (location), PER (person), and ORG (organisation) tags in the IOB2 format. This version corresponds to the balanced train, dev, and test splits of Rahimi et al. (2019), which supports 176 of the 282 languages from the original WikiANN corpus.
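To make the IOB2 scheme concrete, here is a minimal sketch that groups a tagged sentence into entities. It assumes the tags are available as strings such as B-PER and I-PER; the function name and the example sentence are illustrative and not taken from the dataset.

```python
from typing import List, Tuple

Entity = Tuple[str, int, int, str]  # (type, start, end_exclusive, text)


def iob2_to_entities(tokens: List[str], tags: List[str]) -> List[Entity]:
    """Group IOB2 tags into (type, start, end_exclusive, text) tuples."""
    if len(tokens) != len(tags):
        raise ValueError("tokens and tags must have the same length")
    entities: List[Entity] = []
    start, ent_type = None, None
    # A trailing "O" sentinel flushes an entity that ends the sentence.
    for i, tag in enumerate(tags + ["O"]):
        continues = tag.startswith("I-") and ent_type == tag[2:]
        if start is not None and not continues:
            entities.append((ent_type, start, i, " ".join(tokens[start:i])))
            start, ent_type = None, None
        if tag.startswith("B-"):
            start, ent_type = i, tag[2:]
        # An "I-" tag with no matching open entity is ignored here.
    return entities


# Invented example sentence, not a row from the dataset.
example_tokens = ["Angela", "Merkel", "visited", "Milan", "."]
example_tags = ["B-PER", "I-PER", "O", "B-LOC", "O"]
print(iob2_to_entities(example_tokens, example_tags))
# [('PER', 0, 2, 'Angela Merkel'), ('LOC', 3, 4, 'Milan')]
```

Joining tokens with a space is a simplification; for languages written without spaces the entity text would need a different join.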
File: pdc_train.csv
| Column name | Description |
|---|---|
| tokens | The individual words or tokens of the text, in the language of the file (the file-name prefix, here pdc, is the language code). (Text) |
| ner_tags | The named entity recognition (NER) tag associated with each token. NER tags identify and classify named entities such as people, locations, and organizations. (Text) |
| langs | The language of each token; within a single file this is the file's own language. (Text) |
| spans | This column provides information about the position or span of each token within the text. (Text) |
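The four columns are all typed as text, so each cell presumably holds a serialized sequence. The sketch below shows one way to read such a file with the standard library, under the loud assumption that every cell is a Python-style list literal. The card does not document the cell encoding, and the sample row is invented, so check a real file before relying on this.

```python
import ast
import csv
import io

COLUMNS = ("tokens", "ner_tags", "langs", "spans")


def parse_list_cell(cell: str) -> list:
    """Parse a cell ASSUMED to hold a Python-style list literal."""
    value = ast.literal_eval(cell)
    if not isinstance(value, list):
        raise ValueError(f"expected a list literal, got {type(value).__name__}")
    return value


def read_rows(handle):
    """Yield one dict per CSV row with the four columns parsed into lists."""
    for row in csv.DictReader(handle):
        parsed = {name: parse_list_cell(row[name]) for name in COLUMNS}
        if len(parsed["tokens"]) != len(parsed["ner_tags"]):
            raise ValueError("tokens and ner_tags differ in length")
        yield parsed


# Hypothetical sample standing in for a real file; all values are invented.
sample = io.StringIO('''tokens,ner_tags,langs,spans
"['Milan', 'Cathedral']","[5, 6]","['lmo', 'lmo']","['LOC: Milan Cathedral']"
''')
rows = list(read_rows(sample))
print(rows[0]["tokens"])  # ['Milan', 'Cathedral']
```

For a real file, replace the `io.StringIO` object with `open(path, newline="", encoding="utf-8")`.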
File: sr_validation.csv
| Column name | Description |
|---|---|
| tokens | The individual words or tokens of the text, in the language of the file (the file-name prefix, here sr, is the language code). (Text) |
| ner_tags | The named entity recognition (NER) tag associated with each token. NER tags identify and classify named entities such as people, locations, and organizations. (Text) |
| langs | The language of each token; within a single file this is the file's own language. (Text) |
| spans | This column provides information about the position or span of each token within the text. (Text) |
File: uz_train.csv
| Column name | Description |
|---|---|
| tokens | The individual words or tokens of the text, in the language of the file (the file-name prefix, here uz, is the language code). (Text) |
| ner_tags | The named entity recognition (NER) tag associated with each token. NER tags identify and classify named entities such as people, locations, and organizations. (Text) |
| langs | The language of each token; within a single file this is the file's own language. (Text) |
| spans | This column provides information about the position or span of each token within the text. (Text) |
- Named Entity Recognition (NER) Training: The dataset can be used to train NER models for Lombard and for the other languages it includes. Using the ner_tags column, developers can build models that identify and classify named entities in text: people (PER), locations (LOC), and organisations (ORG).
- Language Classification: The langs column gives the language of each token, which is constant within a file (always Lombard in the Lombard files). Combined with the files for other languages, it can provide labels for training models that detect whether a given piece of text is in Lombard or another language.
- Span Identification: The spans column provides information about the position or span of each token within the text. This information can be used to develop algorithms or applications that analyze specific spans within sentences or paragraphs, for example to identify important phrases or extract certain sections of text from a larger document.
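As a sketch of the first use case, the snippet below turns string tags into integer labels, the form most sequence-labelling trainers expect. The label ids are derived from the data at hand and are an assumption of this example; they are not claimed to match any official WikiANN label order. The function names and tag sequences are illustrative.

```python
from collections import Counter
from typing import Dict, List


def build_label_map(tag_sequences: List[List[str]]) -> Dict[str, int]:
    """Assign ids with "O" first, then tags grouped by type, B- before I-."""
    labels = sorted(
        {tag for seq in tag_sequences for tag in seq},
        key=lambda t: (t != "O", t[2:], t[0]),
    )
    return {label: i for i, label in enumerate(labels)}


def encode(tag_sequences: List[List[str]], label_map: Dict[str, int]) -> List[List[int]]:
    """Replace each string tag with its integer id."""
    return [[label_map[tag] for tag in seq] for seq in tag_sequences]


# Invented tag sequences, not rows from the dataset.
sequences = [["B-PER", "I-PER", "O"], ["O", "B-LOC", "I-LOC"]]
label_map = build_label_map(sequences)
encoded = encode(sequences, label_map)
tag_counts = Counter(tag for seq in sequences for tag in seq)

print(label_map)   # {'O': 0, 'B-LOC': 1, 'I-LOC': 2, 'B-PER': 3, 'I-PER': 4}
print(encoded)     # [[3, 4, 0], [0, 1, 2]]
print(tag_counts["O"])  # 2
```

Checking `tag_counts` before training is a cheap way to spot class imbalance, since the O tag typically dominates NER data.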
If you use this dataset in your research, please credit the original authors.
Data Source
License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication
No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.
If you use this dataset in your research, please credit wikiann (From Huggingface).
CREATE TABLE lmo_validation (
"tokens" VARCHAR,
"ner_tags" VARCHAR,
"langs" VARCHAR,
"spans" VARCHAR
);

CREATE TABLE ln_test (
"tokens" VARCHAR,
"ner_tags" VARCHAR,
"langs" VARCHAR,
"spans" VARCHAR
);

CREATE TABLE ln_train (
"tokens" VARCHAR,
"ner_tags" VARCHAR,
"langs" VARCHAR,
"spans" VARCHAR
);

CREATE TABLE ln_validation (
"tokens" VARCHAR,
"ner_tags" VARCHAR,
"langs" VARCHAR,
"spans" VARCHAR
);

CREATE TABLE lt_test (
"tokens" VARCHAR,
"ner_tags" VARCHAR,
"langs" VARCHAR,
"spans" VARCHAR
);

CREATE TABLE lt_train (
"tokens" VARCHAR,
"ner_tags" VARCHAR,
"langs" VARCHAR,
"spans" VARCHAR
);

CREATE TABLE lt_validation (
"tokens" VARCHAR,
"ner_tags" VARCHAR,
"langs" VARCHAR,
"spans" VARCHAR
);

CREATE TABLE lv_test (
"tokens" VARCHAR,
"ner_tags" VARCHAR,
"langs" VARCHAR,
"spans" VARCHAR
);

CREATE TABLE lv_train (
"tokens" VARCHAR,
"ner_tags" VARCHAR,
"langs" VARCHAR,
"spans" VARCHAR
);

CREATE TABLE lv_validation (
"tokens" VARCHAR,
"ner_tags" VARCHAR,
"langs" VARCHAR,
"spans" VARCHAR
);

CREATE TABLE map_bms_test (
"tokens" VARCHAR,
"ner_tags" VARCHAR,
"langs" VARCHAR,
"spans" VARCHAR
);

CREATE TABLE map_bms_train (
"tokens" VARCHAR,
"ner_tags" VARCHAR,
"langs" VARCHAR,
"spans" VARCHAR
);

CREATE TABLE map_bms_validation (
"tokens" VARCHAR,
"ner_tags" VARCHAR,
"langs" VARCHAR,
"spans" VARCHAR
);

CREATE TABLE mg_test (
"tokens" VARCHAR,
"ner_tags" VARCHAR,
"langs" VARCHAR,
"spans" VARCHAR
);

CREATE TABLE mg_train (
"tokens" VARCHAR,
"ner_tags" VARCHAR,
"langs" VARCHAR,
"spans" VARCHAR
);

CREATE TABLE mg_validation (
"tokens" VARCHAR,
"ner_tags" VARCHAR,
"langs" VARCHAR,
"spans" VARCHAR
);

CREATE TABLE mhr_test (
"tokens" VARCHAR,
"ner_tags" VARCHAR,
"langs" VARCHAR,
"spans" VARCHAR
);

CREATE TABLE mhr_train (
"tokens" VARCHAR,
"ner_tags" VARCHAR,
"langs" VARCHAR,
"spans" VARCHAR
);

CREATE TABLE mhr_validation (
"tokens" VARCHAR,
"ner_tags" VARCHAR,
"langs" VARCHAR,
"spans" VARCHAR
);

CREATE TABLE min_test (
"tokens" VARCHAR,
"ner_tags" VARCHAR,
"langs" VARCHAR,
"spans" VARCHAR
);

CREATE TABLE min_train (
"tokens" VARCHAR,
"ner_tags" VARCHAR,
"langs" VARCHAR,
"spans" VARCHAR
);

CREATE TABLE min_validation (
"tokens" VARCHAR,
"ner_tags" VARCHAR,
"langs" VARCHAR,
"spans" VARCHAR
);

CREATE TABLE mi_test (
"tokens" VARCHAR,
"ner_tags" VARCHAR,
"langs" VARCHAR,
"spans" VARCHAR
);

CREATE TABLE mi_train (
"tokens" VARCHAR,
"ner_tags" VARCHAR,
"langs" VARCHAR,
"spans" VARCHAR
);

CREATE TABLE mi_validation (
"tokens" VARCHAR,
"ner_tags" VARCHAR,
"langs" VARCHAR,
"spans" VARCHAR
);
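The tables above follow a `<language code>_<split>` naming pattern with an identical four-column schema. The sketch below recreates a few of them in an in-memory SQLite database and groups the table names by language. SQLite is used only because it ships with Python; the card does not say which database engine the schema targets. The `split_table_name` helper and the inserted row are hypothetical.

```python
import sqlite3

SPLITS = ("train", "validation", "test")


def split_table_name(name: str):
    """Split '<lang>_<split>' on the last underscore, e.g. 'map_bms_test'."""
    lang, _, split = name.rpartition("_")
    if not lang or split not in SPLITS:
        raise ValueError(f"unexpected table name: {name!r}")
    return lang, split


conn = sqlite3.connect(":memory:")
for table in ("lv_train", "lv_validation", "map_bms_test"):
    # Same four VARCHAR columns as in the schema listing.
    conn.execute(
        f'CREATE TABLE "{table}" ('
        '"tokens" VARCHAR, "ner_tags" VARCHAR, "langs" VARCHAR, "spans" VARCHAR)'
    )

# Invented row; the real cell encoding is not documented in this card.
conn.execute(
    'INSERT INTO "lv_train" VALUES (?, ?, ?, ?)',
    ("['Riga']", "['B-LOC']", "['lv']", "['LOC: Riga']"),
)

tables = [row[0] for row in conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]
by_lang = {}
for table in tables:
    lang, split = split_table_name(table)
    by_lang.setdefault(lang, []).append(split)

row_count = conn.execute('SELECT COUNT(*) FROM "lv_train"').fetchone()[0]
print(by_lang)    # {'lv': ['train', 'validation'], 'map_bms': ['test']}
print(row_count)  # 1
```

Splitting on the last underscore matters because some language codes, such as map_bms, contain an underscore themselves.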