Baselight

WikiANN

Multilingual named entity recognition for LLM training

@kaggle.thedevastator_lombard_language_training_dataset

About this Dataset

Lombard Language Training Dataset

By wikiann (from Hugging Face)


Overview

WikiANN (sometimes called PAN-X) is a multilingual named entity recognition dataset consisting of Wikipedia articles annotated with LOC (location), PER (person), and ORG (organisation) tags in the IOB2 format. This version corresponds to the balanced train, dev, and test splits of Rahimi et al. (2019), which cover 176 of the 282 languages in the original WikiANN corpus.
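To make the IOB2 format concrete, here is a minimal sketch of grouping a tagged token sequence into entity spans. It assumes the tags are already string labels such as `B-PER`, `I-PER`, and `O`. The function name `iob2_to_spans` and the sample sentence are illustrative and are not taken from the dataset.

```python
from typing import List, Tuple

Span = Tuple[str, int, int, str]  # (entity_type, start, end_exclusive, text)


def iob2_to_spans(tokens: List[str], tags: List[str]) -> List[Span]:
    """Group IOB2 tags into entity spans.

    B-X opens an entity of type X, I-X continues an open entity of the
    same type, and O (or any other tag) closes whatever is open.
    """
    if len(tokens) != len(tags):
        raise ValueError("tokens and tags must have the same length")

    spans: List[Span] = []
    current = None  # type of the entity currently open, if any
    start = None    # index of its first token

    for i, tag in enumerate(tags):
        # The open entity continues only on an I- tag of the same type.
        continues = tag.startswith("I-") and current == tag[2:]
        if current is not None and not continues:
            spans.append((current, start, i, " ".join(tokens[start:i])))
            current, start = None, None
        if tag.startswith("B-"):
            current, start = tag[2:], i
        # An I- tag with no matching open entity is ignored in this sketch.

    if current is not None:
        spans.append((current, start, len(tokens), " ".join(tokens[start:])))
    return spans


# Made-up example sentence, for illustration only.
example_tokens = ["Ada", "Lovelace", "visited", "Milan", "."]
example_tags = ["B-PER", "I-PER", "O", "B-LOC", "O"]
example_spans = iob2_to_spans(example_tokens, example_tags)
```

Ignoring stray `I-` tags is one possible policy. A stricter variant could raise an error instead, which may be preferable when validating annotations.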

Columns

Files: pdc_train.csv, sr_validation.csv, uz_train.csv

All three files share the same four columns. The prefix of each file name (pdc, sr, uz) is the WikiANN language code of its contents, so the text is in that language and is not necessarily Lombard, despite the dataset title.

Column name  Description
tokens       The individual words or tokens of each example, in the language given by the file's prefix. (Text)
ner_tags     The named entity recognition (NER) tag for each token, marking it as part of a person, location, or organisation name or as outside any entity. (Text)
langs        The language of each token. (Text)
spans        The named-entity span information for each example, as provided by the upstream dataset. (Text)
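Because every column is stored as text, the tags usually need to be parsed before use. The sketch below rests on two assumptions that should be checked against the actual files: that a `ner_tags` cell is a bracketed sequence of integer IDs (with or without commas), and that the IDs follow the label order of the upstream Hugging Face WikiANN release. `LABELS` and `parse_tag_cell` are hypothetical names introduced here.

```python
import re
from typing import List

# ASSUMPTION: label order of the upstream Hugging Face WikiANN release.
# Verify against your copy of the data before relying on it.
LABELS = ["O", "B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC"]


def parse_tag_cell(cell: str) -> List[str]:
    """Turn a text cell such as '[1 2 0 5]' or '[1, 2, 0, 5]' into labels.

    ASSUMPTION: the cell holds non-negative integer tag IDs; any other
    characters (brackets, commas, whitespace) are treated as separators.
    An ID outside the label list raises IndexError.
    """
    ids = [int(match) for match in re.findall(r"\d+", cell)]
    return [LABELS[i] for i in ids]


# Made-up cell value, for illustration only.
example_labels = parse_tag_cell("[1 2 0 5]")
```

If the cells turn out to hold string labels rather than IDs, this parser does not apply and a different approach is needed.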

Research Ideas

  • Named Entity Recognition (NER) Training: The dataset can be used to train NER models for the languages it covers. Using the tokens and ner_tags columns, developers can build models that identify and classify named entities such as people, locations, and organisations.
  • Language Classification: The langs column records the language of each token, so rows drawn from several language tables can serve as labelled data for models that detect which language a piece of text is written in.
  • Span Identification: The spans column provides span information for each example. It can support algorithms or applications that analyse specific spans within sentences, for example to extract entity phrases from a larger document.
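As a rough illustration of the language classification idea, the sketch below turns token lists from several language tables into (text, language) pairs that a classifier could train on. It assumes the rows have already been parsed into lists of tokens and keyed by language code. The function name and the sample rows are made up for this example.

```python
from typing import Dict, List, Tuple


def build_langid_examples(
    tables: Dict[str, List[List[str]]],
) -> List[Tuple[str, str]]:
    """Flatten {language_code: [token_list, ...]} into (text, language) pairs."""
    examples: List[Tuple[str, str]] = []
    # Iterate in sorted order so the output is deterministic.
    for lang, rows in sorted(tables.items()):
        for tokens in rows:
            examples.append((" ".join(tokens), lang))
    return examples


# Made-up rows, not taken from the dataset.
sample_tables = {
    "simple": [["Paris", "is", "in", "France"]],
    "sk": [["Bratislava", "je", "mesto"]],
}
langid_examples = build_langid_examples(sample_tables)
```

Joining tokens with a single space is a simplification. It does not restore the original spacing around punctuation, which may matter for character-level models.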

License

License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication
No Copyright: you can copy, modify, distribute, and perform the work, even for commercial purposes, all without asking permission.

Acknowledgements

If you use this dataset in your research, please credit the original authors and wikiann (from Hugging Face).

Tables

Scn Train

@kaggle.thedevastator_lombard_language_training_dataset.scn_train
  • 11.36 KB
  • 100 rows
  • 4 columns

CREATE TABLE scn_train (
  "tokens" VARCHAR,
  "ner_tags" VARCHAR,
  "langs" VARCHAR,
  "spans" VARCHAR
);
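Every table on this page uses the same four-column schema, so one small example covers them all. The sketch below recreates `scn_train` locally with Python's built-in `sqlite3` module, used here only as a stand-in for whatever engine actually hosts the tables. The inserted row and its cell formats are hypothetical placeholders, and `make_demo_db` and `column_names` are names invented for this example.

```python
import sqlite3
from typing import List

DDL = """
CREATE TABLE scn_train (
  "tokens" VARCHAR,
  "ner_tags" VARCHAR,
  "langs" VARCHAR,
  "spans" VARCHAR
);
"""


def make_demo_db() -> sqlite3.Connection:
    """Create an in-memory database holding one placeholder row."""
    conn = sqlite3.connect(":memory:")
    conn.execute(DDL)
    # Hypothetical placeholder row; the real cell formats may differ.
    conn.execute(
        "INSERT INTO scn_train VALUES (?, ?, ?, ?)",
        ("['Palermu']", "[5]", "['scn']", "['LOC: Palermu']"),
    )
    conn.commit()
    return conn


def column_names(conn: sqlite3.Connection, table: str) -> List[str]:
    """List a table's column names in declaration order."""
    # PRAGMA table_info yields one row per column; index 1 is the name.
    return [row[1] for row in conn.execute(f"PRAGMA table_info({table})")]


demo_conn = make_demo_db()
demo_columns = column_names(demo_conn, "scn_train")
demo_row_count = demo_conn.execute("SELECT COUNT(*) FROM scn_train").fetchone()[0]
```

Since all four columns are plain text, any list-like structure inside a cell has to be parsed in application code after the query.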

Scn Validation

@kaggle.thedevastator_lombard_language_training_dataset.scn_validation
  • 10.65 KB
  • 100 rows
  • 4 columns

CREATE TABLE scn_validation (
  "tokens" VARCHAR,
  "ner_tags" VARCHAR,
  "langs" VARCHAR,
  "spans" VARCHAR
);

Sco Test

@kaggle.thedevastator_lombard_language_training_dataset.sco_test
  • 11.4 KB
  • 100 rows
  • 4 columns

CREATE TABLE sco_test (
  "tokens" VARCHAR,
  "ner_tags" VARCHAR,
  "langs" VARCHAR,
  "spans" VARCHAR
);

Sco Train

@kaggle.thedevastator_lombard_language_training_dataset.sco_train
  • 10.8 KB
  • 100 rows
  • 4 columns

CREATE TABLE sco_train (
  "tokens" VARCHAR,
  "ner_tags" VARCHAR,
  "langs" VARCHAR,
  "spans" VARCHAR
);

Sco Validation

@kaggle.thedevastator_lombard_language_training_dataset.sco_validation
  • 12.28 KB
  • 100 rows
  • 4 columns

CREATE TABLE sco_validation (
  "tokens" VARCHAR,
  "ner_tags" VARCHAR,
  "langs" VARCHAR,
  "spans" VARCHAR
);

Sd Test

@kaggle.thedevastator_lombard_language_training_dataset.sd_test
  • 16.34 KB
  • 100 rows
  • 4 columns

CREATE TABLE sd_test (
  "tokens" VARCHAR,
  "ner_tags" VARCHAR,
  "langs" VARCHAR,
  "spans" VARCHAR
);

Sd Train

@kaggle.thedevastator_lombard_language_training_dataset.sd_train
  • 21.89 KB
  • 100 rows
  • 4 columns

CREATE TABLE sd_train (
  "tokens" VARCHAR,
  "ner_tags" VARCHAR,
  "langs" VARCHAR,
  "spans" VARCHAR
);

Sd Validation

@kaggle.thedevastator_lombard_language_training_dataset.sd_validation
  • 22.28 KB
  • 100 rows
  • 4 columns

CREATE TABLE sd_validation (
  "tokens" VARCHAR,
  "ner_tags" VARCHAR,
  "langs" VARCHAR,
  "spans" VARCHAR
);

Sh Test

@kaggle.thedevastator_lombard_language_training_dataset.sh_test
  • 421.57 KB
  • 10000 rows
  • 4 columns

CREATE TABLE sh_test (
  "tokens" VARCHAR,
  "ner_tags" VARCHAR,
  "langs" VARCHAR,
  "spans" VARCHAR
);

Sh Train

@kaggle.thedevastator_lombard_language_training_dataset.sh_train
  • 799.88 KB
  • 20000 rows
  • 4 columns

CREATE TABLE sh_train (
  "tokens" VARCHAR,
  "ner_tags" VARCHAR,
  "langs" VARCHAR,
  "spans" VARCHAR
);

Sh Validation

@kaggle.thedevastator_lombard_language_training_dataset.sh_validation
  • 423.91 KB
  • 10000 rows
  • 4 columns

CREATE TABLE sh_validation (
  "tokens" VARCHAR,
  "ner_tags" VARCHAR,
  "langs" VARCHAR,
  "spans" VARCHAR
);

Simple Test

@kaggle.thedevastator_lombard_language_training_dataset.simple_test
  • 63.4 KB
  • 1000 rows
  • 4 columns

CREATE TABLE simple_test (
  "tokens" VARCHAR,
  "ner_tags" VARCHAR,
  "langs" VARCHAR,
  "spans" VARCHAR
);

Simple Train

@kaggle.thedevastator_lombard_language_training_dataset.simple_train
  • 981.23 KB
  • 20000 rows
  • 4 columns

CREATE TABLE simple_train (
  "tokens" VARCHAR,
  "ner_tags" VARCHAR,
  "langs" VARCHAR,
  "spans" VARCHAR
);

Simple Validation

@kaggle.thedevastator_lombard_language_training_dataset.simple_validation
  • 62.56 KB
  • 1000 rows
  • 4 columns

CREATE TABLE simple_validation (
  "tokens" VARCHAR,
  "ner_tags" VARCHAR,
  "langs" VARCHAR,
  "spans" VARCHAR
);

Si Test

@kaggle.thedevastator_lombard_language_training_dataset.si_test
  • 12.28 KB
  • 100 rows
  • 4 columns

CREATE TABLE si_test (
  "tokens" VARCHAR,
  "ner_tags" VARCHAR,
  "langs" VARCHAR,
  "spans" VARCHAR
);

Si Train

@kaggle.thedevastator_lombard_language_training_dataset.si_train
  • 12.9 KB
  • 100 rows
  • 4 columns

CREATE TABLE si_train (
  "tokens" VARCHAR,
  "ner_tags" VARCHAR,
  "langs" VARCHAR,
  "spans" VARCHAR
);

Si Validation

@kaggle.thedevastator_lombard_language_training_dataset.si_validation
  • 12.74 KB
  • 100 rows
  • 4 columns

CREATE TABLE si_validation (
  "tokens" VARCHAR,
  "ner_tags" VARCHAR,
  "langs" VARCHAR,
  "spans" VARCHAR
);

Sk Test

@kaggle.thedevastator_lombard_language_training_dataset.sk_test
  • 634.99 KB
  • 10000 rows
  • 4 columns

CREATE TABLE sk_test (
  "tokens" VARCHAR,
  "ner_tags" VARCHAR,
  "langs" VARCHAR,
  "spans" VARCHAR
);

Sk Train

@kaggle.thedevastator_lombard_language_training_dataset.sk_train
  • 1.19 MB
  • 20000 rows
  • 4 columns

CREATE TABLE sk_train (
  "tokens" VARCHAR,
  "ner_tags" VARCHAR,
  "langs" VARCHAR,
  "spans" VARCHAR
);

Sk Validation

@kaggle.thedevastator_lombard_language_training_dataset.sk_validation
  • 638.93 KB
  • 10000 rows
  • 4 columns

CREATE TABLE sk_validation (
  "tokens" VARCHAR,
  "ner_tags" VARCHAR,
  "langs" VARCHAR,
  "spans" VARCHAR
);

Sl Test

@kaggle.thedevastator_lombard_language_training_dataset.sl_test
  • 496.72 KB
  • 10000 rows
  • 4 columns

CREATE TABLE sl_test (
  "tokens" VARCHAR,
  "ner_tags" VARCHAR,
  "langs" VARCHAR,
  "spans" VARCHAR
);

Sl Train

@kaggle.thedevastator_lombard_language_training_dataset.sl_train
  • 721.09 KB
  • 15000 rows
  • 4 columns

CREATE TABLE sl_train (
  "tokens" VARCHAR,
  "ner_tags" VARCHAR,
  "langs" VARCHAR,
  "spans" VARCHAR
);

Sl Validation

@kaggle.thedevastator_lombard_language_training_dataset.sl_validation
  • 485.62 KB
  • 10000 rows
  • 4 columns

CREATE TABLE sl_validation (
  "tokens" VARCHAR,
  "ner_tags" VARCHAR,
  "langs" VARCHAR,
  "spans" VARCHAR
);

So Test

@kaggle.thedevastator_lombard_language_training_dataset.so_test
  • 11.84 KB
  • 100 rows
  • 4 columns

CREATE TABLE so_test (
  "tokens" VARCHAR,
  "ner_tags" VARCHAR,
  "langs" VARCHAR,
  "spans" VARCHAR
);

So Train

@kaggle.thedevastator_lombard_language_training_dataset.so_train
  • 12.8 KB
  • 100 rows
  • 4 columns

CREATE TABLE so_train (
  "tokens" VARCHAR,
  "ner_tags" VARCHAR,
  "langs" VARCHAR,
  "spans" VARCHAR
);
