LeNER-Br: Portuguese Legal NER by Kaggle | Other

Labeled Portuguese Legal NER

By lener_br (From Huggingface)

About this dataset

LeNER-Br is a comprehensive dataset specifically created for named entity recognition (NER) in the Portuguese language, particularly within the domain of legal documents. This dataset consists of manually annotated texts extracted from legislation and legal cases. Each text has undergone meticulous tagging to identify various types of named entities, including persons, locations, time entities, organizations, legislation references, and legal case references.

To build this dataset, 66 legal documents were collected from Brazilian courts at both superior and state levels, including the Supremo Tribunal Federal, Superior Tribunal de Justiça, Tribunal de Justiça de Minas Gerais, and Tribunal de Contas da União. Four legislation documents, such as the Lei Maria da Penha, were also included. In total, the dataset contains 70 unique documents.

The primary purpose of LeNER-Br is to support the development and evaluation of NER models tailored to Portuguese legal text. The labeled data lets researchers and data scientists train NER models on the varied legal contexts found in Brazil's judicial system.

Each instance of annotated text has two main columns. The tokens column holds the individual words or tokens from the original text. The ner_tags column holds the NER tag assigned to each token, indicating its entity type, such as a person, an organization, or another of the categories listed above.

Researchers may use LeNER-Br as a benchmark against which to evaluate their own NER models for Portuguese legal documents. Each file pairs the tokens column with the ner_tags column, which holds the NER tag assigned to each token.

In conclusion, the LeNER-Br dataset is a useful resource for advancing NER techniques in Portuguese, particularly in the legal domain. It provides a manually annotated collection of legal texts chosen to represent Brazil's legislative landscape and the entities involved. It serves as a foundation for training and evaluating NER models and supports information extraction from Portuguese legal documents.

How to use the dataset

The LeNER-Br dataset is a valuable resource for researchers and practitioners working on named entity recognition (NER) in the context of Portuguese legal documents. This guide will provide you with an overview of the dataset and how to effectively utilize it for your NER tasks.

Dataset Overview

LeNER-Br is composed of 70 manually annotated legal documents written in Portuguese. These documents were collected from various Brazilian Courts, including superior and state levels such as the Supremo Tribunal Federal, Superior Tribunal de Justiça, Tribunal de Justiça de Minas Gerais, and Tribunal de Contas da União. The dataset also includes four legislation documents, such as Lei Maria da Penha.

The dataset provides tags for the types of named entities commonly found in legal texts: persons, locations, time entities, organizations, legislation, and legal cases. There are two main columns in the dataset that you should pay attention to:

tokens: This column contains individual words or tokens present in the text of the legal documents.

ner_tags: This column contains named entity recognition (NER) tags assigned to each token in the text. These tags indicate the type of named entity that each token represents.

Utilizing the Dataset

Here are some steps you can follow to make effective use of this dataset:

Data Exploration: Start by loading and exploring the data using your preferred programming language or data analysis tools like Python's pandas library.

Load the train.csv file to train your NER models on the manually annotated texts.

Use the test.csv file as a test set for evaluating model performance.

Use the validation.csv file for additional validation during model development.
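The loading step above can be sketched as follows. This is a minimal illustration, not the dataset author's code: the file names come from the description, while the directory argument and the helper names load_splits and summarize are assumptions. It uses only the standard library csv module so it runs without extra dependencies; pandas.read_csv would serve the same purpose.

```python
import csv
from pathlib import Path

# File names are taken from the dataset description.
# Where the files live on disk is an assumption left to the caller.
SPLIT_FILES = {
    "train": "train.csv",
    "validation": "validation.csv",
    "test": "test.csv",
}


def load_splits(data_dir):
    """Read each split into a list of row dicts keyed by column name."""
    splits = {}
    for name, filename in SPLIT_FILES.items():
        path = Path(data_dir) / filename
        with open(path, newline="", encoding="utf-8") as handle:
            splits[name] = list(csv.DictReader(handle))
    return splits


def summarize(splits):
    """Return the number of rows per split as a quick sanity check."""
    return {name: len(rows) for name, rows in splits.items()}
```

Comparing the result of summarize with the row counts listed for each table is a quick way to confirm that the files were read completely.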

Preprocessing:

Perform necessary preprocessing steps such as removing unwanted characters, normalizing the text, or handling missing values.

Split the dataset into features (tokens) and labels (ner_tags).
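A possible sketch of the split into features and labels is shown below. The schema stores both columns as text, but the page does not say how the lists are serialized, so this example assumes Python-style list literals with commas (for example "['a', 'b']" and "[0, 1]"). That format, and the helper names parse_row and split_features_labels, are assumptions; inspect a few raw rows and adapt the parser if the files differ.

```python
import ast


def parse_row(tokens_field, tags_field):
    """Turn the two serialized columns of one row into aligned lists.

    Assumption: both fields are Python-style list literals with commas.
    """
    tokens = ast.literal_eval(tokens_field)
    tags = ast.literal_eval(tags_field)
    if not isinstance(tokens, list) or not isinstance(tags, list):
        raise ValueError("expected list literals in both columns")
    # A length mismatch usually means the serialization assumption is wrong,
    # e.g. space-separated strings silently merged into a single token.
    if len(tokens) != len(tags):
        raise ValueError(f"{len(tokens)} tokens but {len(tags)} tags")
    return [str(token) for token in tokens], list(tags)


def split_features_labels(rows):
    """Return (features, labels), skipping rows with missing values."""
    features, labels = [], []
    for row in rows:
        if not row.get("tokens") or not row.get("ner_tags"):
            continue  # missing value in one of the two columns
        tokens, tags = parse_row(row["tokens"], row["ner_tags"])
        features.append(tokens)
        labels.append(tags)
    return features, labels
```

The length check matters because every token needs exactly one tag; failing loudly is safer than training on misaligned sequences.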

Feature Engineering:

Depending on your NER model's requirements, you may need to convert tokens into numerical representations using techniques like word embeddings or one-hot encoding.
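As a rough illustration of the numerical representations mentioned above, the sketch below maps tokens to integer ids and builds one-hot vectors. The helper names and the "<unk>" placeholder for unseen tokens are assumptions made for this example; models based on word embeddings would typically consume the integer ids directly.

```python
def build_vocab(sentences, unk_token="<unk>"):
    """Assign an integer id to every distinct token; id 0 is reserved."""
    vocab = {unk_token: 0}
    for sentence in sentences:
        for token in sentence:
            vocab.setdefault(token, len(vocab))
    return vocab


def encode(sentence, vocab, unk_token="<unk>"):
    """Replace each token by its id, falling back to the unknown id."""
    unk_id = vocab[unk_token]
    return [vocab.get(token, unk_id) for token in sentence]


def one_hot(index, size):
    """Return a list of zeros of length `size` with a 1 at `index`."""
    vector = [0] * size
    vector[index] = 1
    return vector
```

The vocabulary should be built from the training split only, so that tokens first seen in the validation or test split are mapped to the unknown id.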

Modeling:

Train your NER models using the prepared training set.

Evaluate model performance on the test set using accuracy and other relevant metrics such as precision, recall, and F1.
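The evaluation step could be sketched as below, assuming gold and predicted tags are available as parallel lists of sequences. This computes token-level accuracy and per-tag F1 only; NER results are often reported at the entity level instead, which this simplified example does not attempt. The function names are illustrative.

```python
from collections import Counter


def token_accuracy(gold, predicted):
    """Fraction of tokens whose predicted tag equals the gold tag."""
    pairs = [(g, p) for gs, ps in zip(gold, predicted) for g, p in zip(gs, ps)]
    if not pairs:
        return 0.0
    return sum(g == p for g, p in pairs) / len(pairs)


def per_tag_f1(gold, predicted):
    """Token-level F1 for every tag seen in the gold or predicted data."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for gs, ps in zip(gold, predicted):
        for g, p in zip(gs, ps):
            if g == p:
                tp[g] += 1
            else:
                fp[p] += 1  # predicted tag was wrong
                fn[g] += 1  # gold tag was missed
    scores = {}
    for tag in set(tp) | set(fp) | set(fn):
        predicted_total = tp[tag] + fp[tag]
        gold_total = tp[tag] + fn[tag]
        precision = tp[tag] / predicted_total if predicted_total else 0.0
        recall = tp[tag] / gold_total if gold_total else 0.0
        denominator = precision + recall
        scores[tag] = 2 * precision * recall / denominator if denominator else 0.0
    return scores
```

Because most tokens in running text usually fall outside any entity, accuracy alone can look high even for a weak model, so the per-tag scores are generally the more informative numbers.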

Fine-tuning and Improvement:

Use the validation set to tune hyperparameters and compare model variants before the final evaluation on the test set.
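One way to use the validation set for tuning is a simple search over candidate settings, sketched below. The names select_best, train_fn, and score_fn are hypothetical: train_fn stands for whatever trains a model from a parameter dict, and score_fn for whatever scores that model on the validation split.

```python
def select_best(candidates, train_fn, score_fn):
    """Return the (params, score) pair with the highest validation score.

    candidates: iterable of parameter dicts to try.
    train_fn:   callable that trains a model from one parameter dict.
    score_fn:   callable that scores a trained model on validation data.
    """
    best_params, best_score = None, float("-inf")
    for params in candidates:
        model = train_fn(params)
        score = score_fn(model)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score
```

Keeping the test set out of this loop preserves it as an unbiased estimate of final performance.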

Research Ideas

Training Named Entity Recognition (NER) models: The dataset can be used to train NER models specifically for Portuguese legal documents. By using the manually annotated texts, models can learn to recognize and classify different types of named entities accurately in the context of legal documents.

Evaluating NER model performance: The dataset includes a separate test set (test.csv) that can be used to evaluate the performance of pre-trained NER models on Portuguese legal documents. This allows researchers and developers to assess how well their models generalize to new legal texts.

Developing language technologies for the legal domain: With its focus on legal texts, LeNER-Br can be used to develop language technologies for the legal domain in Portuguese-speaking countries. This could include applications such as automated summarization of case law, extraction of information from legislation, or recommendation systems for lawyers based on past cases.

Acknowledgements

If you use this dataset in your research, please credit the original authors.

License

License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication
No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.

Columns

File: validation.csv

Column name	Description
tokens	This column contains individual words or tokens extracted from the text of each document. (Text)
ner_tags	This column provides labeled annotations indicating the named entity recognition (NER) tag for each token. The NER tags classify the tokens into different entity types such as person names, locations, time entities, organizations, legislation names, or legal case references. (Text)

File: train.csv

Column name	Description
tokens	This column contains individual words or tokens extracted from the text of each document. (Text)
ner_tags	This column provides labeled annotations indicating the named entity recognition (NER) tag for each token. The NER tags classify the tokens into different entity types such as person names, locations, time entities, organizations, legislation names, or legal case references. (Text)

File: test.csv

Column name	Description
tokens	This column contains individual words or tokens extracted from the text of each document. (Text)
ner_tags	This column provides labeled annotations indicating the named entity recognition (NER) tag for each token. The NER tags classify the tokens into different entity types such as person names, locations, time entities, organizations, legislation names, or legal case references. (Text)

Acknowledgements

If you use this dataset in your research, please credit lener_br (From Huggingface).

Tables

Test

@kaggle.thedevastator_lener_br_portuguese_legal_ner_dataset.test

189.61 KB
1390 rows
3 columns


CREATE TABLE test (
  "id" BIGINT,
  "tokens" VARCHAR,
  "ner_tags" VARCHAR
);

Train

@kaggle.thedevastator_lener_br_portuguese_legal_ner_dataset.train

891.44 KB
7828 rows
3 columns


CREATE TABLE train (
  "id" BIGINT,
  "tokens" VARCHAR,
  "ner_tags" VARCHAR
);

Validation

@kaggle.thedevastator_lener_br_portuguese_legal_ner_dataset.validation

148.69 KB
1177 rows
3 columns


CREATE TABLE validation (
  "id" BIGINT,
  "tokens" VARCHAR,
  "ner_tags" VARCHAR
);
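For local experimentation, the three schemas above can be recreated in an in-memory SQLite database, as in the sketch below. The column definitions and row counts are copied from this page; the use of sqlite3 and the helper names are assumptions made for illustration, not part of the dataset.

```python
import sqlite3

TABLES = ("train", "validation", "test")

# Row counts as listed for each table on this page.
EXPECTED_ROWS = {"train": 7828, "validation": 1177, "test": 1390}


def create_tables(conn):
    """Create the three tables with the column layout shown above."""
    for table in TABLES:
        conn.execute(
            f'CREATE TABLE {table} '
            '("id" BIGINT, "tokens" VARCHAR, "ner_tags" VARCHAR)'
        )


def insert_rows(conn, table, rows):
    """Insert (id, tokens, ner_tags) tuples into one of the known tables."""
    if table not in TABLES:
        raise ValueError(f"unknown table: {table}")
    conn.executemany(f"INSERT INTO {table} VALUES (?, ?, ?)", rows)


def row_counts(conn):
    """Return the current number of rows per table."""
    return {
        table: conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
        for table in TABLES
    }
```

After loading the real files, row_counts(conn) == EXPECTED_ROWS is a simple completeness check.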