Baselight

Multilingual NER Dataset

Multilingual NER Dataset for Named Entity Recognition

@kaggle.thedevastator_multilingual_ner_dataset

By Babelscape (from Hugging Face)


About this dataset

The Babelscape/wikineural NER Dataset is a multilingual text collection built for Named Entity Recognition (NER). It provides labeled sentences in nine languages: French, German, Portuguese, Spanish, Polish, Dutch, Russian, English, and Italian.

Each sentence in the dataset contains tokens (words or characters) that have been labeled with named entity recognition tags. These tags provide valuable information about the type of named entity each token represents. The dataset also includes a language column to indicate the language in which each sentence is written.

This dataset serves as an invaluable resource for developing and evaluating NER models across multiple languages. It encompasses various domains and contexts to ensure diversity and representativeness. Researchers and practitioners can utilize this dataset to train and test their NER models in real-world scenarios.

Working with this dataset shows how named entities are recognized across different languages. It also enables performance comparisons between NER models developed for a single language and models trained on several languages at once.

Whether you are an experienced researcher or a beginner exploring multilingual NER tasks, the Babelscape/wikineural NER Dataset is a versatile resource for natural language processing and information extraction work across languages.

How to use the dataset

  • Understand the Data Structure:

    • The dataset consists of labeled sentences in nine different languages: French (fr), German (de), Portuguese (pt), Spanish (es), Polish (pl), Dutch (nl), Russian (ru), English (en), and Italian (it).
    • Each sentence is represented by three columns: tokens, ner_tags, and lang.
    • The tokens column contains the individual words or characters in each labeled sentence.
    • The ner_tags column provides named entity recognition tags for each token, indicating their entity types.
    • The lang column specifies the language of each sentence.
  • Explore Different Languages:

    • Since this dataset covers multiple languages, you can choose to focus on a specific language or perform cross-lingual analysis.
    • Analyzing multiple languages can help uncover patterns and differences in named entities across various linguistic contexts.
  • Preprocessing and Cleaning:

    • Before training NER models or applying other NLP techniques to this dataset, preprocess and clean the data.
    • Consider removing unnecessary punctuation marks or special characters unless they carry meaning in a given language. Whenever you drop a token, drop its entry in ner_tags as well so that the two columns stay aligned.
  • Training Named Entity Recognition Models:

    • Data Splitting: The dataset already ships with separate train, val, and test files for each language (for example train_de, val_de, and test_de), so you can use those splits directly; re-split only if your requirements call for different ratios.
    • Feature Extraction: Prepare input features from the tokenized text, such as word embeddings or character-level representations, depending on your model choice.
    • Model Training: Train an NER model (e.g., LSTM-CRF or a Transformer-based model) on the tokens and ner_tags columns.
    • Evaluation: Evaluate your trained model on the validation or test file for each language.

  • Applying Pretrained Models:

    • Instead of training a model from scratch, you can start from pretrained models such as BERT or GPT-2, or use spaCy's named entity recognition pipelines.
    • Fine-tune these pre-trained models on your specific NER task using the labeled
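
The preprocessing advice above can be sketched in Python as follows. This is a minimal illustration, not the dataset author's method: it assumes the tokens and tags of a sentence are already parsed into two lists of equal length and that the tags are BIO-style strings such as "B-ORG" and "O". The helper names `is_punctuation` and `drop_punctuation` are hypothetical. Punctuation that sits inside an entity is kept, since it may be part of the name.

```python
import unicodedata


def is_punctuation(token):
    """True if the token is non-empty and consists only of Unicode punctuation."""
    return bool(token) and all(
        unicodedata.category(ch).startswith("P") for ch in token
    )


def drop_punctuation(tokens, tags):
    """Remove punctuation-only tokens tagged "O", keeping tokens and tags aligned."""
    if len(tokens) != len(tags):
        raise ValueError("tokens and tags must have the same length")
    kept = [
        (tok, tag)
        for tok, tag in zip(tokens, tags)
        if not (is_punctuation(tok) and tag == "O")
    ]
    return [tok for tok, _ in kept], [tag for _, tag in kept]


# Made-up example sentence; the "!" belongs to the organization name and stays.
tokens = ["Yahoo", "!", "is", "based", "in", "Sunnyvale", "."]
tags = ["B-ORG", "I-ORG", "O", "O", "O", "B-LOC", "O"]

clean_tokens, clean_tags = drop_punctuation(tokens, tags)
print(clean_tokens)
print(clean_tags)
```

Filtering token and tag together in one pass is what keeps the two columns aligned; filtering the token list alone would silently shift every later tag.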

Research Ideas

  • Training NER models: This dataset can be used to train NER models in multiple languages. By providing labeled sentences and their corresponding named entity recognition tags, the dataset can help train models to accurately identify and classify named entities in different languages.
  • Evaluating NER performance: The dataset can be used as a benchmark to evaluate the performance of pre-trained or custom-built NER models. By using the labeled sentences as test data, developers and researchers can measure the accuracy, precision, recall, and F1-score of their models across multiple languages.
  • Cross-lingual analysis: With labeled sentences available in nine different languages, researchers can perform cross-lingual analysis on named entities across different language datasets. This can provide insights into how certain types of entities are referred to or categorized across various cultures and linguistic contexts.
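
As a rough sketch of the evaluation idea above, the following Python computes entity-level precision, recall, and F1 by exact span match. It assumes BIO-style string tags and counts a prediction as correct only when start, end, and entity type all agree; this is one common convention, not something the dataset page prescribes. The function names `extract_spans` and `precision_recall_f1` are hypothetical.

```python
def extract_spans(tags):
    """Return a set of (start, end_exclusive, entity_type) spans from BIO tags."""
    spans = set()
    start, etype = None, None
    # A trailing "O" sentinel closes a span that runs to the end of the sentence.
    for i, tag in enumerate(list(tags) + ["O"]):
        continues = tag.startswith("I-") and etype == tag[2:]
        if start is not None and not continues:
            spans.add((start, i, etype))
            start, etype = None, None
        if tag.startswith("B-"):
            start, etype = i, tag[2:]
    return spans


def precision_recall_f1(gold_sentences, pred_sentences):
    """Entity-level scores over parallel lists of tag sequences."""
    tp = fp = fn = 0
    for gold, pred in zip(gold_sentences, pred_sentences):
        gold_spans, pred_spans = extract_spans(gold), extract_spans(pred)
        tp += len(gold_spans & pred_spans)
        fp += len(pred_spans - gold_spans)
        fn += len(gold_spans - pred_spans)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Made-up tags: the model finds the person but mislabels the location.
gold = [["B-PER", "I-PER", "O", "B-LOC"]]
pred = [["B-PER", "I-PER", "O", "B-ORG"]]

precision, recall, f1 = precision_recall_f1(gold, pred)
print(precision, recall, f1)
```

Running the same function separately on each language's test file gives per-language scores that can be compared directly.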

Acknowledgements

If you use this dataset in your research, please credit the original authors, Babelscape (from Hugging Face).

License

License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication
No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.

Columns

Files: test_fr.csv, val_de.csv, test_de.csv (each file has the same three columns)

Column name | Type | Description
tokens | Text | The individual words or characters present in each sentence.
ner_tags | Text | The named entity recognition tag for each token, indicating whether the token is part of a person, organization, location, or other (miscellaneous) entity name.
lang | Text | The language of the sentence.
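
To make the column layout concrete, here is a minimal Python sketch of reading one row. It rests on two assumptions that you should check against a few rows of your own copy: that each tokens and ner_tags cell holds a Python-style list literal, and that the tags are integer ids in the label order given on the upstream Babelscape/wikineural dataset card. The names `ID_TO_LABEL` and `parse_row` are hypothetical, and the sample sentence is made up.

```python
import ast
import csv
import io

# Assumed label order (from the upstream dataset card; verify locally).
ID_TO_LABEL = ["O", "B-PER", "I-PER", "B-ORG", "I-ORG",
               "B-LOC", "I-LOC", "B-MISC", "I-MISC"]


def parse_row(row):
    """Turn one CSV row (a dict of strings) into aligned (token, label) pairs."""
    tokens = ast.literal_eval(row["tokens"])
    tag_ids = ast.literal_eval(row["ner_tags"])
    if len(tokens) != len(tag_ids):
        raise ValueError("tokens and ner_tags must have the same length")
    return [(tok, ID_TO_LABEL[i]) for tok, i in zip(tokens, tag_ids)]


# Tiny in-memory stand-in for a file such as test_de.csv.
SAMPLE_CSV = '''tokens,ner_tags,lang
"['Angela', 'Merkel', 'besuchte', 'Paris', '.']","[1, 2, 0, 5, 0]",de
'''

rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
pairs = parse_row(rows[0])
print(rows[0]["lang"], pairs)
```

If your copy stores the lists in a different textual form, only the two `ast.literal_eval` lines need to change; the length check and the label lookup stay the same.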

Tables

Test De

@kaggle.thedevastator_multilingual_ner_dataset.test_de
  • 1.09 MB
  • 12372 rows
  • 3 columns

CREATE TABLE test_de (
  "tokens" VARCHAR,
  "ner_tags" VARCHAR,
  "lang" VARCHAR
);

Test En

@kaggle.thedevastator_multilingual_ner_dataset.test_en
  • 1.18 MB
  • 11597 rows
  • 3 columns

CREATE TABLE test_en (
  "tokens" VARCHAR,
  "ner_tags" VARCHAR,
  "lang" VARCHAR
);

Test Es

@kaggle.thedevastator_multilingual_ner_dataset.test_es
  • 1020.16 KB
  • 9618 rows
  • 3 columns

CREATE TABLE test_es (
  "tokens" VARCHAR,
  "ner_tags" VARCHAR,
  "lang" VARCHAR
);

Test Fr

@kaggle.thedevastator_multilingual_ner_dataset.test_fr
  • 1.34 MB
  • 12678 rows
  • 3 columns

CREATE TABLE test_fr (
  "tokens" VARCHAR,
  "ner_tags" VARCHAR,
  "lang" VARCHAR
);

Test It

@kaggle.thedevastator_multilingual_ner_dataset.test_it
  • 1.44 MB
  • 11069 rows
  • 3 columns

CREATE TABLE test_it (
  "tokens" VARCHAR,
  "ner_tags" VARCHAR,
  "lang" VARCHAR
);

Test Nl

@kaggle.thedevastator_multilingual_ner_dataset.test_nl
  • 927.75 KB
  • 10547 rows
  • 3 columns

CREATE TABLE test_nl (
  "tokens" VARCHAR,
  "ner_tags" VARCHAR,
  "lang" VARCHAR
);

Test Pl

@kaggle.thedevastator_multilingual_ner_dataset.test_pl
  • 1.12 MB
  • 13585 rows
  • 3 columns

CREATE TABLE test_pl (
  "tokens" VARCHAR,
  "ner_tags" VARCHAR,
  "lang" VARCHAR
);

Test Pt

@kaggle.thedevastator_multilingual_ner_dataset.test_pt
  • 1.18 MB
  • 10160 rows
  • 3 columns

CREATE TABLE test_pt (
  "tokens" VARCHAR,
  "ner_tags" VARCHAR,
  "lang" VARCHAR
);

Test Ru

@kaggle.thedevastator_multilingual_ner_dataset.test_ru
  • 1.43 MB
  • 11580 rows
  • 3 columns

CREATE TABLE test_ru (
  "tokens" VARCHAR,
  "ner_tags" VARCHAR,
  "lang" VARCHAR
);

Train De

@kaggle.thedevastator_multilingual_ner_dataset.train_de
  • 9.03 MB
  • 98640 rows
  • 3 columns

CREATE TABLE train_de (
  "tokens" VARCHAR,
  "ner_tags" VARCHAR,
  "lang" VARCHAR
);

Train En

@kaggle.thedevastator_multilingual_ner_dataset.train_en
  • 9.76 MB
  • 92720 rows
  • 3 columns

CREATE TABLE train_en (
  "tokens" VARCHAR,
  "ner_tags" VARCHAR,
  "lang" VARCHAR
);

Train Es

@kaggle.thedevastator_multilingual_ner_dataset.train_es
  • 8.31 MB
  • 76320 rows
  • 3 columns

CREATE TABLE train_es (
  "tokens" VARCHAR,
  "ner_tags" VARCHAR,
  "lang" VARCHAR
);

Train Fr

@kaggle.thedevastator_multilingual_ner_dataset.train_fr
  • 11.54 MB
  • 100800 rows
  • 3 columns

CREATE TABLE train_fr (
  "tokens" VARCHAR,
  "ner_tags" VARCHAR,
  "lang" VARCHAR
);

Train It

@kaggle.thedevastator_multilingual_ner_dataset.train_it
  • 10.93 MB
  • 88400 rows
  • 3 columns

CREATE TABLE train_it (
  "tokens" VARCHAR,
  "ner_tags" VARCHAR,
  "lang" VARCHAR
);

Train Nl

@kaggle.thedevastator_multilingual_ner_dataset.train_nl
  • 6.8 MB
  • 83680 rows
  • 3 columns

CREATE TABLE train_nl (
  "tokens" VARCHAR,
  "ner_tags" VARCHAR,
  "lang" VARCHAR
);

Train Pl

@kaggle.thedevastator_multilingual_ner_dataset.train_pl
  • 9.12 MB
  • 108160 rows
  • 3 columns

CREATE TABLE train_pl (
  "tokens" VARCHAR,
  "ner_tags" VARCHAR,
  "lang" VARCHAR
);

Train Pt

@kaggle.thedevastator_multilingual_ner_dataset.train_pt
  • 8.87 MB
  • 80560 rows
  • 3 columns

CREATE TABLE train_pt (
  "tokens" VARCHAR,
  "ner_tags" VARCHAR,
  "lang" VARCHAR
);

Train Ru

@kaggle.thedevastator_multilingual_ner_dataset.train_ru
  • 12.01 MB
  • 92320 rows
  • 3 columns

CREATE TABLE train_ru (
  "tokens" VARCHAR,
  "ner_tags" VARCHAR,
  "lang" VARCHAR
);

Val De

@kaggle.thedevastator_multilingual_ner_dataset.val_de
  • 1.1 MB
  • 12330 rows
  • 3 columns

CREATE TABLE val_de (
  "tokens" VARCHAR,
  "ner_tags" VARCHAR,
  "lang" VARCHAR
);

Val En

@kaggle.thedevastator_multilingual_ner_dataset.val_en
  • 1.2 MB
  • 11590 rows
  • 3 columns

CREATE TABLE val_en (
  "tokens" VARCHAR,
  "ner_tags" VARCHAR,
  "lang" VARCHAR
);

Val Es

@kaggle.thedevastator_multilingual_ner_dataset.val_es
  • 1003.5 KB
  • 9540 rows
  • 3 columns

CREATE TABLE val_es (
  "tokens" VARCHAR,
  "ner_tags" VARCHAR,
  "lang" VARCHAR
);

Val Fr

@kaggle.thedevastator_multilingual_ner_dataset.val_fr
  • 1.4 MB
  • 12600 rows
  • 3 columns

CREATE TABLE val_fr (
  "tokens" VARCHAR,
  "ner_tags" VARCHAR,
  "lang" VARCHAR
);

Val It

@kaggle.thedevastator_multilingual_ner_dataset.val_it
  • 1.42 MB
  • 11050 rows
  • 3 columns

CREATE TABLE val_it (
  "tokens" VARCHAR,
  "ner_tags" VARCHAR,
  "lang" VARCHAR
);

Val Nl

@kaggle.thedevastator_multilingual_ner_dataset.val_nl
  • 927.89 KB
  • 10460 rows
  • 3 columns

CREATE TABLE val_nl (
  "tokens" VARCHAR,
  "ner_tags" VARCHAR,
  "lang" VARCHAR
);

Val Pl

@kaggle.thedevastator_multilingual_ner_dataset.val_pl
  • 1.16 MB
  • 13520 rows
  • 3 columns

CREATE TABLE val_pl (
  "tokens" VARCHAR,
  "ner_tags" VARCHAR,
  "lang" VARCHAR
);

Val Pt

@kaggle.thedevastator_multilingual_ner_dataset.val_pt
  • 1.17 MB
  • 10070 rows
  • 3 columns

CREATE TABLE val_pt (
  "tokens" VARCHAR,
  "ner_tags" VARCHAR,
  "lang" VARCHAR
);

Val Ru

@kaggle.thedevastator_multilingual_ner_dataset.val_ru
  • 1.44 MB
  • 11540 rows
  • 3 columns

CREATE TABLE val_ru (
  "tokens" VARCHAR,
  "ner_tags" VARCHAR,
  "lang" VARCHAR
);
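
Because every table shares the same three-column schema, tables for different languages can be combined with a plain UNION ALL. The Python sketch below illustrates this with the standard-library sqlite3 module and an in-memory database; it is only a local stand-in, not the hosting platform's query interface, and the inserted rows are made up for the example.

```python
import sqlite3

# Same column layout as the CREATE TABLE statements listed above.
DDL = '''
CREATE TABLE {name} (
  "tokens" VARCHAR,
  "ner_tags" VARCHAR,
  "lang" VARCHAR
)
'''

conn = sqlite3.connect(":memory:")
for name in ("test_de", "test_en"):
    conn.execute(DDL.format(name=name))

# Made-up rows standing in for the real table contents.
conn.execute(
    "INSERT INTO test_de VALUES (?, ?, ?)",
    ("['Berlin']", "[5]", "de"),
)
conn.executemany(
    "INSERT INTO test_en VALUES (?, ?, ?)",
    [("['London']", "[5]", "en"), ("['Ada', 'Lovelace']", "[1, 2]", "en")],
)

# Stack the per-language tables and count sentences (rows) per language.
QUERY = '''
SELECT lang, COUNT(*) AS n_sentences
FROM (
  SELECT lang FROM test_de
  UNION ALL
  SELECT lang FROM test_en
)
GROUP BY lang
ORDER BY lang
'''
counts = conn.execute(QUERY).fetchall()
print(counts)
```

The lang column is what makes this work: once the tables are stacked, it is the only record of which language each row came from.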
