Portuguese Instruction

Enhancing Non-English Language Models with Portuguese Instruction

@kaggle.thedevastator_portuguese_instruction_dataset

About this Dataset

By Rishiraj Acharya (from Hugging Face)


This dataset, titled PortugueseChat - Building Instruction Datasets for Non-English Languages, was created to address the challenges that large language models (LLMs) face in languages other than English. English-first LLMs struggle with output quality, latency, and speed when processing other languages, so improving their non-English capabilities has become crucial.

By providing instructional prompts and messages in Portuguese, this dataset aims to improve the performance of non-English LLMs. It serves as a valuable resource for training and fine-tuning these models to better understand and respond effectively to instructions in Portuguese.

The dataset includes the columns 'prompt', 'messages', 'category', and 'text'; the table schema also lists a 'prompt_id' column. The 'prompt' column contains the prompts given to the language model for generating responses. The 'messages' column holds the conversation messages exchanged between the user and the language model. The 'category' column gives the topic or category of each conversation. The 'text' column contains the responses generated by the language model.

Researchers can utilize this resource for training more robust conversational AI systems that excel at interpreting instructional prompts in Portuguese while generating coherent responses. Additionally, developers can test their existing non-English LLMs against this benchmark dataset as a means to evaluate their performance.

Furthermore, this dataset is a step towards bridging the gap between English-first language models and other languages by promoting research on the creation of non-English instruction datasets. Its content is tailored to Portuguese speakers who interact with AI systems through instructions or conversations, and it can support multilingual NLP applications such as chatbots and virtual assistants.

The dataset contains a train.csv file, comprising instructional prompts and messages in Portuguese for training purposes. Additionally, there is a test.csv file available for evaluating the performance of non-English LLMs on instructional tasks in Portuguese. Both these files contribute to enhancing the overall quality of instruction data and fine-tuning models to deliver reliable results while interacting with Portuguese speakers.

Overall, the PortugueseChat - Building Instruction Datasets for Non-English Languages dataset provides an essential resource for researchers and practitioners aiming to advance non-English NLP technologies. Its comprehensive nature, combined with its dedicated focus on instruction

How to use the dataset

Dataset Structure

The dataset consists of two files: train.csv and test.csv.

train.csv:

  • Contains training data for building and improving non-English language models.
  • Includes columns such as prompt, messages, category, and text.
    • The prompt column provides the prompts given to the language model for generating responses.
    • The messages column contains conversation messages exchanged between users and the language model.
    • The category column represents the category or topic of each conversation.
    • The generated responses by the language model are stored in the text column.

test.csv:

  • Contains test data for evaluating non-English language models' performance on instructional tasks in Portuguese.
  • Similar in structure to train.csv. It is described as omitting generated responses, although the column listing for test.csv still includes a text column.

How to Explore this Dataset

To get started with exploring this dataset, you can follow these steps:

  • Load or import train.csv into your preferred programming environment or tool for data analysis. You can use popular libraries such as pandas in Python.

  • Take a look at the different columns: prompt, messages, category, text. Understand their meanings and purposes within each conversation context.

  • Analyze the categories or topics present in the dataset by grouping conversations by category, using the grouping functions of your chosen tool or library.

  • Explore conversations within each category using their prompts, messages exchanged between users and model, and the generated responses. This step will help you understand the structure and context of each conversation.

  • Utilize this dataset for training or fine-tuning non-English language models to improve their performance on instructional tasks in Portuguese.
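
The exploration steps above can be sketched in a few lines of Python. This is a minimal illustration, not the dataset author's workflow: it uses only the standard library (pandas' read_csv and groupby would serve equally well), the helper names load_rows, count_by_category, and group_by_category are hypothetical, and the three sample rows and the category labels cat_a and cat_b are invented so the sketch runs without the real file.

```python
import csv
import io
from collections import Counter, defaultdict


def load_rows(source):
    """Read CSV rows from an open text file into a list of dicts."""
    return list(csv.DictReader(source))


def count_by_category(rows):
    """Count how many conversations fall under each category."""
    return Counter(row["category"] for row in rows)


def group_by_category(rows):
    """Map each category to the list of rows that belong to it."""
    groups = defaultdict(list)
    for row in rows:
        groups[row["category"]].append(row)
    return dict(groups)


# Invented stand-in for train.csv; rows and category labels are placeholders.
SAMPLE_CSV = """prompt,prompt_id,messages,category,text
Qual é a capital do Brasil?,id-1,(mensagens),cat_a,A capital do Brasil é Brasília.
Escreva um poema curto.,id-2,(mensagens),cat_b,(resposta)
Quanto é 2 + 2?,id-3,(mensagens),cat_a,2 + 2 = 4.
"""

rows = load_rows(io.StringIO(SAMPLE_CSV))
counts = count_by_category(rows)
groups = group_by_category(rows)

# Step 2: inspect the available columns.
print(sorted(rows[0].keys()))

# Step 3: see how conversations are distributed across categories.
for category, n in counts.most_common():
    print(category, n)

# Step 4: read through the conversations of one category.
for row in groups["cat_a"]:
    print(row["prompt"], "->", row["text"])
```

To run the same helpers on the real data, replace the in-memory sample with something like `with open("train.csv", newline="", encoding="utf-8") as f: rows = load_rows(f)`, assuming the file sits in the working directory.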

Considerations

While using this dataset, please keep in mind the following considerations:

  • Ensure proper data cleaning and preprocessing before using the data for training or evaluating language models.
  • Be aware of any biases present in the dataset and handle them accordingly during model training or evaluation.
  • When using this dataset for research or further development, it is recommended to cite its source appropriately.
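
As one possible reading of "data cleaning and preprocessing", the sketch below trims whitespace, drops rows whose required fields are empty, and removes repeated prompts. The function name clean_rows, the choice of required columns, and the sample rows are all assumptions made for illustration; the source does not say which cleaning steps the data actually needs.

```python
def clean_rows(rows, required=("prompt", "messages")):
    """Trim whitespace, drop rows missing required fields, drop repeated prompts."""
    seen_prompts = set()
    cleaned = []
    for row in rows:
        # Treat missing values as empty strings and strip surrounding whitespace.
        normalized = {key: (value or "").strip() for key, value in row.items()}
        # Skip rows where any required column is empty.
        if any(not normalized.get(column) for column in required):
            continue
        # Keep only the first occurrence of each prompt.
        prompt = normalized.get("prompt", "")
        if prompt in seen_prompts:
            continue
        seen_prompts.add(prompt)
        cleaned.append(normalized)
    return cleaned


# Invented rows shaped like the dataset's columns (values are placeholders).
sample = [
    {"prompt": "Traduza 'bom dia'.", "messages": "(mensagens)",
     "category": "cat_a", "text": "Good morning."},
    {"prompt": "  Traduza 'bom dia'.  ", "messages": "(mensagens)",
     "category": "cat_a", "text": "Good morning."},  # duplicate once trimmed
    {"prompt": "Resuma o texto.", "messages": "",
     "category": "cat_b", "text": "(resposta)"},  # empty required field
    {"prompt": "Liste três frutas.", "messages": "(mensagens)",
     "category": None, "text": "(resposta)"},  # missing category is kept as ""
]

cleaned = clean_rows(sample)
print(len(cleaned), "of", len(sample), "rows kept")
```

Counting rows per category before and after cleaning is a simple first check for the kind of imbalance the bias consideration refers to.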

Research Ideas

  • Chatbot Training: This dataset can be used for training chatbots in non-English languages. By providing a variety of instructional prompts and messages, the dataset can help improve the performance of chatbots in understanding and responding to user queries in languages other than English.
  • Language Model Development: The dataset can be used to build language models specifically designed for non-English languages. By training on the provided prompts and messages, these language models can better understand and generate text in Portuguese, leading to more accurate translations, text generation, and other natural language processing tasks.
  • Language Learning Assistance: The dataset can also be used as a resource for language learning platforms or applications that focus on teaching or practicing Portuguese. By providing instructional prompts and model-generated responses, these tools can give learners instant feedback and support in their target language, enhancing their learning experience.

Acknowledgements

If you use this dataset in your research, please credit the original authors.

License

License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication
No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.

Columns

File: train.csv

Column name | Description
----------- | -----------
prompt      | The prompts provided to the language model for generating responses. (Text)
messages    | The conversation messages exchanged between users and the language model. (Text)
category    | The category or topic of each conversation. (Text)
text        | The generated responses by the language model. (Text)

File: test.csv

Column name | Description
----------- | -----------
prompt      | The prompts provided to the language model for generating responses. (Text)
messages    | The conversation messages exchanged between users and the language model. (Text)
category    | The category or topic of each conversation. (Text)
text        | The generated responses by the language model. (Text)

Acknowledgements

If you use this dataset in your research, please credit the original author, Rishiraj Acharya (from Hugging Face).

Tables

Test

@kaggle.thedevastator_portuguese_instruction_dataset.test
  • 1010.59 KB
  • 500 rows
  • 5 columns

CREATE TABLE test (
  "prompt" VARCHAR,
  "prompt_id" VARCHAR,
  "messages" VARCHAR,
  "category" VARCHAR,
  "text" VARCHAR
);

Train

@kaggle.thedevastator_portuguese_instruction_dataset.train
  • 18.02 MB
  • 9500 rows
  • 5 columns

CREATE TABLE train (
  "prompt" VARCHAR,
  "prompt_id" VARCHAR,
  "messages" VARCHAR,
  "category" VARCHAR,
  "text" VARCHAR
);
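
To show how the five-column schema above can be queried, here is a small sketch that recreates the train table in an in-memory SQLite database and counts rows per category. SQLite is only a stand-in chosen because it ships with Python; the engine that actually hosts these tables may differ. The helper names create_table and category_counts and the inserted rows are hypothetical.

```python
import sqlite3

# Column names copied from the CREATE TABLE statements above.
COLUMNS = ("prompt", "prompt_id", "messages", "category", "text")
TABLES = ("train", "test")


def create_table(conn, table):
    """Create an empty table with the dataset's five VARCHAR columns."""
    if table not in TABLES:
        raise ValueError(f"unexpected table name: {table}")
    column_sql = ", ".join(f'"{name}" VARCHAR' for name in COLUMNS)
    conn.execute(f"CREATE TABLE {table} ({column_sql})")


def category_counts(conn, table):
    """Return (category, row count) pairs sorted by category."""
    if table not in TABLES:
        raise ValueError(f"unexpected table name: {table}")
    query = (
        f'SELECT "category", COUNT(*) FROM {table} '
        'GROUP BY "category" ORDER BY "category"'
    )
    return conn.execute(query).fetchall()


conn = sqlite3.connect(":memory:")
create_table(conn, "train")

# Invented rows; the real table holds 9500 rows.
conn.executemany(
    "INSERT INTO train VALUES (?, ?, ?, ?, ?)",
    [
        ("Qual é a capital do Brasil?", "id-1", "(mensagens)", "cat_a", "Brasília."),
        ("Escreva um poema curto.", "id-2", "(mensagens)", "cat_b", "(resposta)"),
        ("Quanto é 2 + 2?", "id-3", "(mensagens)", "cat_a", "4."),
    ],
)

result = category_counts(conn, "train")
print(result)
```

The table name is checked against a fixed allow-list because it is interpolated into the SQL string, while row values go through parameter placeholders.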