Baselight

Open Assistant

Over 10,000 Annotated Trees in 35 Languages

@kaggle.thedevastator_multilingual_conversation_dataset

About this Dataset

By Huggingface Hub [source]


OpenAssistant Conversations (OASST1) is a remarkable conversation corpus created with the help of over 13,500 volunteers and containing over 10,000 fully annotated conversation trees across 35 languages. It contains 161,443 messages that have all been human-annotated with 461,292 quality ratings for quality assurance purposes.

This dataset is a resource for researchers and developers who want to explore conversational AI technology. Because OASST1 spans 35 languages, it can support projects that engage users from all over the world in natural language conversations. Every message has also undergone human annotation and review, so the quality ratings can be used to filter or weight messages when building your own bots or related applications.

Whatever your goals in Natural Language Processing (NLP), OASST1 offers a versatile starting point.

How to use the dataset

Guide to Using the Multilingual Conversation (OASST1) Dataset

Introduction

This guide helps you understand and use the OpenAssistant Conversations (OASST1) dataset. It covers important terms and topics related to the dataset, gives an overview of how it is structured, and outlines a step-by-step approach to using its features. The conversational data in this dataset can be used to train assistants across multiple languages and to evaluate language recognition systems.

What Is Included in the Dataset?

The OASST1 dataset includes 161,443 messages spread across 35 different languages. These messages carry 461,292 quality ratings contributed by over 13,500 volunteers and form over 10,000 fully annotated conversation trees. Each message consists of text content plus associated labels for context: role (user or assistant), language used, a synthetic flag (whether the message was machine generated), review results (positive/negative/neutral), and a detoxification flag where appropriate. Emojis attached to a message may also be recorded.

Structure of Data

The structure of data within each conversation tree is organized by a combination of fields listed in validation and training datasets:

  • Role - Speaker role identified as user or assistant
  • Text - Text content provided by speaker
  • Language - Language spoken or written
  • Review Count - Number of reviews indicated by human raters
  • Constructed Message Sets
  • Deleted Flag - Whether message was deleted or not
  • Savior Model Name - Name indicating playful transformer model used for synthetic message generation
  • Synthetic Indicator - Boolean indicating synthetic vs real messages
  • Review Results - Positive/Negative/Neutral designation given by humans based on current assertions
  • Detoxification Flag - Boolean flag; indicates detoxification has been applied
  • Tree State - Depicts internal conversation tree progression
  • Rank - Rating given from 1-5 assigned by human raters
  • Labels - Contextual tags attached to identify other topic areas besides default language provided
  • Emojis - Nonverbal comments which can be recognized visually
  • Created Date - Date upon which message was created
  • Model Name - Type name referencing particular machine learning model utilized when generating synthetically derived conversations
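As a rough illustration of how these fields might be used together, the sketch below filters messages by language and role while skipping deleted and synthetic ones. The inline sample rows are invented, and the role strings ("user"/"assistant") and the "True"/"False" text encoding of booleans are assumptions; check the actual values in train.csv before relying on them.

```python
import csv
import io

# Tiny inline sample standing in for train.csv. The rows are invented for
# illustration; role strings and boolean spellings are assumptions.
SAMPLE_CSV = """message_id,parent_id,text,role,lang,deleted,synthetic
m1,,What is a conversation tree?,user,en,False,False
m2,m1,A tree of replies rooted at one prompt.,assistant,en,False,False
m3,m1,Removed reply.,assistant,en,True,False
m4,,Que es OASST1?,user,es,False,False
"""


def parse_bool(value):
    """Interpret the CSV text 'True'/'False' (any case) as a Python bool."""
    return value.strip().lower() == "true"


def select_messages(rows, lang, role):
    """Keep non-deleted, non-synthetic messages matching a language and role."""
    return [
        row
        for row in rows
        if row["lang"] == lang
        and row["role"] == role
        and not parse_bool(row["deleted"])
        and not parse_bool(row["synthetic"])
    ]


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
english_replies = select_messages(rows, lang="en", role="assistant")
print([row["message_id"] for row in english_replies])
```

To run this against the real file, replace the `io.StringIO(SAMPLE_CSV)` argument with an opened `train.csv` file handle.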

How Can This Data Be Used?

This conversational data can be used in numerous ways both practically and academically depending on your project goals. It supports evaluation

Research Ideas

  • Natural language understanding machine learning tasks such as intent classification or sentiment analysis
  • Training chatbot models with state-of-the-art performance in multiple languages
  • Language usage studies and AI research with a corpus of human conversations in 35 languages

Acknowledgements

If you use this dataset in your research, please credit the original authors.
Data Source

License

License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication
No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.

Columns

File: validation.csv

Column name     Description
created_date    Date the message was created. (Date)
text            Text of the message. (String)
role            Role of the speaker in the conversation (User/Bot). (String)
lang            Language of the message. (String)
review_count    Number of reviews for the message. (Integer)
review_result   Result of the reviews for the message. (String)
deleted         Whether the message was deleted or not. (Boolean)
rank            Rank of the message. (Integer)
synthetic       Whether the message is synthetic or not. (Boolean)
model_name      Name of the model used to generate the message. (String)
detoxify        Whether the message was detoxified or not. (Boolean)
tree_state      State of the conversation tree. (String)
emojis          Emojis attached to the message. (String)
labels          Labels associated with the message. (String)

File: train.csv

Column name     Description
created_date    Date the message was created. (Date)
text            Text of the message. (String)
role            Role of the speaker in the conversation (User/Bot). (String)
lang            Language of the message. (String)
review_count    Number of reviews for the message. (Integer)
review_result   Result of the reviews for the message. (String)
deleted         Whether the message was deleted or not. (Boolean)
rank            Rank of the message. (Integer)
synthetic       Whether the message is synthetic or not. (Boolean)
model_name      Name of the model used to generate the message. (String)
detoxify        Whether the message was detoxified or not. (Boolean)
tree_state      State of the conversation tree. (String)
emojis          Emojis attached to the message. (String)
labels          Labels associated with the message. (String)

Acknowledgements

If you use this dataset in your research, please credit Huggingface Hub.

Tables

Train

@kaggle.thedevastator_multilingual_conversation_dataset.train
  • 44.13 MB
  • 84437 rows
  • 18 columns

CREATE TABLE train (
  "message_id" VARCHAR,
  "parent_id" VARCHAR,
  "user_id" VARCHAR,
  "created_date" VARCHAR,
  "text" VARCHAR,
  "role" VARCHAR,
  "lang" VARCHAR,
  "review_count" BIGINT,
  "review_result" VARCHAR,
  "deleted" BOOLEAN,
  "rank" DOUBLE,
  "synthetic" BOOLEAN,
  "model_name" VARCHAR,
  "detoxify" VARCHAR,
  "message_tree_id" VARCHAR,
  "tree_state" VARCHAR,
  "emojis" VARCHAR,
  "labels" VARCHAR
);

Validation

@kaggle.thedevastator_multilingual_conversation_dataset.validation
  • 2.38 MB
  • 4401 rows
  • 18 columns

CREATE TABLE validation (
  "message_id" VARCHAR,
  "parent_id" VARCHAR,
  "user_id" VARCHAR,
  "created_date" VARCHAR,
  "text" VARCHAR,
  "role" VARCHAR,
  "lang" VARCHAR,
  "review_count" BIGINT,
  "review_result" VARCHAR,
  "deleted" BOOLEAN,
  "rank" DOUBLE,
  "synthetic" BOOLEAN,
  "model_name" VARCHAR,
  "detoxify" VARCHAR,
  "message_tree_id" VARCHAR,
  "tree_state" VARCHAR,
  "emojis" VARCHAR,
  "labels" VARCHAR
);
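
The message_id, parent_id, and message_tree_id columns in both tables suggest that each row is one node of a conversation tree. The sketch below shows one possible way to rebuild root-to-leaf threads from such rows. The sample rows are invented, and it assumes that a root message has an empty parent_id; verify that convention against the actual data.

```python
from collections import defaultdict

# Invented sample rows using the id columns from the schema above.
# Assumption: a root message has an empty (None or "") parent_id.
messages = [
    {"message_id": "a", "parent_id": None, "message_tree_id": "a", "text": "Hi"},
    {"message_id": "b", "parent_id": "a", "message_tree_id": "a", "text": "Hello!"},
    {"message_id": "c", "parent_id": "a", "message_tree_id": "a", "text": "Hey."},
    {"message_id": "d", "parent_id": "b", "message_tree_id": "a", "text": "Thanks"},
]


def build_children_index(rows):
    """Map each parent id (None for roots) to the list of its child rows."""
    children = defaultdict(list)
    for row in rows:
        # Treat an empty string from a CSV file the same as a missing parent.
        children[row["parent_id"] or None].append(row)
    return children


def threads(rows):
    """Yield every root-to-leaf path as a list of message ids."""
    children = build_children_index(rows)

    def walk(node, path):
        path = path + [node["message_id"]]
        kids = children.get(node["message_id"], [])
        if not kids:
            yield path
        for kid in kids:
            yield from walk(kid, path)

    for root in children.get(None, []):
        yield from walk(root, [])


all_threads = list(threads(messages))
for thread in all_threads:
    print(" -> ".join(thread))
```

Each yielded path is one linear conversation, which is the shape most chat fine-tuning pipelines expect; the recursion is fine for shallow trees but could be replaced with an explicit stack for very deep ones.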
