Baselight

Belgian Statutory Article Retrieval Dataset

Legal Q&A Dataset for Law Information Retrieval

@kaggle.thedevastator_belgian_statutory_article_retrieval_dataset_bsar


About this Dataset

Belgian Statutory Article Retrieval Dataset (BSARD)

By maastrichtlawtech (from Hugging Face)


The train.csv file contains 886 legal questions, each paired with the IDs of the statutory articles that answer or give guidance on it. Every row also carries a category, a subcategory, and an extra description that add context to the question.

The test.csv file has the same structure: 222 legal questions with their statutory article IDs, category and subcategory labels, and extra descriptions that capture nuances or background for each question.

For those interested in synthetic data for law information retrieval, the synthetic.csv file contains 113,165 synthesized legal questions paired with statutory article IDs.

Note that the dataset does not include dates for its entries. It focuses solely on legal questions and their corresponding statutory articles, to support research and development in law information retrieval.

With entries spanning many categories and subcategories of Belgium's legal framework, BSARD is a useful resource for researchers working on natural language processing (NLP) and machine learning algorithms.

How to use the dataset

Dataset Overview:
The BSARD dataset consists of three main files: train.csv, test.csv, and synthetic.csv. Each file contains legal questions along with additional information such as statutory article IDs, categories, subcategories, and extra descriptions.
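As a minimal sketch of working with these files, the snippet below reads rows with Python's standard csv module. The column names follow the table schemas listed on this page. The `read_rows` helper and the sample rows are hypothetical, made up for illustration, and are not taken from the dataset.

```python
import csv
import io

# Column names as listed in the table schemas on this page.
EXPECTED_COLUMNS = ["id", "question", "article_ids",
                    "category", "subcategory", "extra_description"]

def read_rows(fileobj):
    """Read one BSARD split from an open CSV file into a list of dicts."""
    reader = csv.DictReader(fileobj)
    missing = [c for c in EXPECTED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    return list(reader)

# Hypothetical sample standing in for train.csv (not real dataset rows).
SAMPLE_CSV = """id,question,article_ids,category,subcategory,extra_description
1,"Sample question A?","10,11",Housing,Rent,Sample context
2,"Sample question B?",42,Family,Divorce,
"""

rows = read_rows(io.StringIO(SAMPLE_CSV))
print(len(rows), rows[0]["question"])
```

For the real files, opening a local copy with `open("train.csv", newline="", encoding="utf-8")` and passing the file object to `read_rows` should work, assuming the files are UTF-8 encoded and saved under that name.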

File Descriptions:

  • train.csv: Contains 886 legal questions from real-life scenarios, intended for training.

  • test.csv: Contains 222 unseen legal questions with their corresponding statutory article IDs, categories, subcategories, and extra descriptions. It serves as a benchmark for evaluating model performance.

  • synthetic.csv: Contains 113,165 synthetic legal questions that can help diversify the training data when necessary.

Understanding the Columns:

Each file has the same six columns, as listed in the table schemas:

  • id: A numeric identifier for each entry.

  • question: The text of the legal question.

  • article_ids: The IDs of the statutory articles relevant to the question, stored as text.

  • category: The broad category to which the legal question belongs.

  • subcategory: The specific subcategory under which the question falls.

  • extra_description (optional): Further context or additional information related to the question.
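The table schemas on this page list an article_ids column stored as text. The sketch below turns such a value into a list of integers. It assumes the IDs are integers separated by commas. That separator is an assumption and should be checked against the actual files. The `parse_article_ids` helper is hypothetical.

```python
def parse_article_ids(raw, sep=","):
    """Split an article_ids string into a list of integer IDs.

    Assumes IDs are integers separated by `sep`; blank pieces are skipped.
    """
    return [int(piece) for piece in raw.split(sep) if piece.strip()]

# Made-up values, for illustration only.
print(parse_article_ids("10, 11,12"))
print(parse_article_ids(""))
```

Keeping the separator as a parameter makes it easy to adjust if the files turn out to use a different delimiter.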

Using the Dataset Effectively:

  • Preprocessing:
    • Remove unnecessary characters from the text.
    • Consider removing stop words or applying stemming/lemmatization if appropriate for your task.
    • Normalize letter case based on your requirements.
  • Training Phase (using train.csv):
    • Analyze the distribution of categories and subcategories to understand how questions are spread across them.
    • Choose algorithms suited to the task, such as classification models or other natural language processing techniques.
    • Use the extra_description column as an additional source of features.
  • Evaluation Phase (using test.csv):
    • Develop a model on the training set and apply it to the unseen legal questions in the test set.
    • Report performance metrics such as accuracy, precision, recall, or F1-score, depending on your evaluation goals.
  • Synthetic Data (synthetic.csv):
    • Use the synthetic data to augment your training set and increase its diversity when necessary.
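The steps above can be sketched with the Python standard library alone. The example below shows one possible text normalization, a category count, and set-based precision, recall, and F1 for the article IDs predicted for a single question. All helper names, rows, and predictions are hypothetical. The metric definition is one reasonable choice for comparing predicted and relevant article IDs, not a method prescribed by the dataset authors.

```python
import re
from collections import Counter

def normalize(text):
    """Lowercase, replace punctuation with spaces, collapse whitespace."""
    text = re.sub(r"[^\w\s]", " ", text.lower())
    return " ".join(text.split())

def category_counts(rows):
    """Count how many questions fall under each category."""
    return Counter(row["category"] for row in rows)

def precision_recall_f1(predicted_ids, relevant_ids):
    """Set-based precision, recall, and F1 for one question."""
    predicted, relevant = set(predicted_ids), set(relevant_ids)
    hits = len(predicted & relevant)
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if hits else 0.0
    return precision, recall, f1

# Hypothetical rows and predictions, for illustration only.
rows = [
    {"question": "Sample Question A?", "category": "Housing"},
    {"question": "Sample question B!", "category": "Housing"},
    {"question": "Sample question C?", "category": "Family"},
]
print(normalize(rows[0]["question"]))
print(category_counts(rows).most_common())
print(precision_recall_f1(predicted_ids=[10, 11, 99], relevant_ids=[10, 11]))
```

Because `\w` is Unicode-aware in Python 3, accented letters survive the normalization step. Averaging the per-question scores over test.csv would give a single figure for a model.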

Research Ideas

  • Legal research: Researchers can analyze the legal questions and their linked statutory articles to gain insights into specific areas of law or identify common legal issues.
  • Information retrieval system development: The dataset can be used to develop and train information retrieval systems that retrieve relevant statutory articles for a given legal question.
  • Natural language processing (NLP) applications: The dataset can support NLP models and algorithms that process legal documents, for example by identifying key terms, extracting relevant information, or summarizing legal texts.

Acknowledgements

If you use this dataset in your research, please credit the original authors, maastrichtlawtech (from Hugging Face).

License

License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication
No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.

Columns

File: train.csv

Column name          Description
question             The text of the legal question. (Text)
category             The broad classification or category assigned to each question. (Text)
subcategory          A more specific classification within each category. (Text)
extra_description    Additional context or information related to the respective question. (Text)

File: test.csv

Column name          Description
question             The text of the legal question. (Text)
category             The broad classification or category assigned to each question. (Text)
subcategory          A more specific classification within each category. (Text)
extra_description    Additional context or information related to the respective question. (Text)

File: synthetic.csv

Column name          Description
question             The text of the legal question. (Text)
category             The broad classification or category assigned to each question. (Text)
subcategory          A more specific classification within each category. (Text)
extra_description    Additional context or information related to the respective question. (Text)

Tables

Synthetic

@kaggle.thedevastator_belgian_statutory_article_retrieval_dataset_bsar.synthetic
  • 3.74 MB
  • 113165 rows
  • 6 columns

CREATE TABLE synthetic (
  "id" BIGINT,
  "question" VARCHAR,
  "article_ids" VARCHAR,
  "category" VARCHAR,
  "subcategory" VARCHAR,
  "extra_description" VARCHAR
);

Test

@kaggle.thedevastator_belgian_statutory_article_retrieval_dataset_bsar.test
  • 24.89 KB
  • 222 rows
  • 6 columns

CREATE TABLE test (
  "id" BIGINT,
  "question" VARCHAR,
  "article_ids" VARCHAR,
  "category" VARCHAR,
  "subcategory" VARCHAR,
  "extra_description" VARCHAR
);

Train

@kaggle.thedevastator_belgian_statutory_article_retrieval_dataset_bsar.train
  • 63.06 KB
  • 886 rows
  • 6 columns

CREATE TABLE train (
  "id" BIGINT,
  "question" VARCHAR,
  "article_ids" VARCHAR,
  "category" VARCHAR,
  "subcategory" VARCHAR,
  "extra_description" VARCHAR
);
