Open Assistant
Over 10,000 Annotated Trees in 35 Languages
By Huggingface Hub [source]
About this dataset
OpenAssistant Conversations (OASST1) is a remarkable conversation corpus created with the help of over 13,500 volunteers and containing over 10,000 fully annotated conversation trees across 35 languages. It contains 161,443 messages that have all been human-annotated with 461,292 quality ratings for quality assurance purposes.
This dataset offers an incredible resource to researchers and developers alike who want to explore conversational AI technology. With the immense breadth of language options supported by OASST1, projects can be built that engage users from all over the world in natural language conversations. Additionally, since every message has undergone extensive human annotation and review, its accuracy can be trusted implicitly when building your own bots or related applications.
Whatever your goals may be in working with Natural Language Processing (NLP) technologies, OASST1 promises you a versatile platform filled with esteemed possibilities!
More Datasets
For more datasets, click here.
Featured Notebooks
- 🚨 Your notebook can be here! 🚨!
How to use the dataset
Guide to Using the Multilingual Conversation (OASST1) Dataset
Introduction
This guide is to help you understand and use the OpenAssistant Conversations (OASST1) dataset. It covers important terms and topics related to the dataset, provides an overview of how it is structured, and outlines a step-by-step approach for utilizing its features. The conversational data included in this dataset can be used to train cognitive assistants across multiple languages, as well as evaluating language recognition systems.
What Is Included in the Dataset?
The OASST1 dataset includes 161,443 messages spread across 35 different languages. These messages are annotated with 461,292 quality ratings from human annotators across 13,500 volunteers covering 10 thousand fully annotated conversation trees. The messages are made up of text content along with associated labels for context such as role (user or assistant), language used, synthetic data identification information (machine generated), review results (positive/negative/neutral), and detoxification flag if appropriate. Emojis may also be noted within message text where appropriate.
Structure of Data
The structure of data within each conversation tree is organized by a combination of fields listed in validation and training datasets:
- Role - Speaker role identified as user or assistant
- Text - Text content provided by speaker
- Language – Language spoken or written
- Review Count – Number of reviews indicated by human raters Constructed Message Sets - Deleted Flag – Whether message was deleted or not Savior Model Name – Name indicating playful transformer model used for synthetic message generation Synthetic Indicator– Boolean indicating synthetic vs real messages Review Results – Positive/Negative/Neutral designation given by humans based on current assertions Detoxification Flag- Boolean flag; indicates detoxification has been applied Tree State– Depicts internal conversation tree progression Rank– Rating given from 1-5 assigned by human raters Labels— Contextual tags attached to identify other topic areas besides default language provided Emojis- Nonverbal comments which can be recognized visually Created Date— Date upon which message was created Model Name– Type name referencing particular machine learning model utilized when generating synthetically derived conversations
How Can This Data Be Used?
This conversational data can be used in numerous ways both practically and academically depending on your project goals. It supports evaluation
Research Ideas
- Natural language understanding machine learning tasks such as intent classification or sentiment analysis
- Training chatbot models with state-of-the-art performance in multiple languages
- Language usage studies and AI research with a corpus of human conversations in 35 languages
Acknowledgements
If you use this dataset in your research, please credit the original authors.
Data Source
License
License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication
No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.
Columns
File: validation.csv
Column name |
Description |
created_date |
Date the conversation was created. (Date) |
text |
Text of the conversation. (String) |
role |
Role of the speaker in the conversation (User/Bot). (String) |
lang |
Language of the conversation. (String) |
review_count |
Number of reviews for the conversation. (Integer) |
review_result |
Result of the reviews for the conversation. (String) |
deleted |
Whether the conversation was deleted or not. (Boolean) |
rank |
Rank of the conversation. (Integer) |
synthetic |
Whether the conversation is synthetic or not. (Boolean) |
model_name |
Name of the model used for the conversation. (String) |
detoxify |
Whether the conversation was detoxified or not. (Boolean) |
tree_state |
State of the conversation tree. (String) |
emojis |
Emojis used in the conversation. (String) |
labels |
Labels associated with the conversation. (String) |
File: train.csv
Column name |
Description |
created_date |
Date the conversation was created. (Date) |
text |
Text of the conversation. (String) |
role |
Role of the speaker in the conversation (User/Bot). (String) |
lang |
Language of the conversation. (String) |
review_count |
Number of reviews for the conversation. (Integer) |
review_result |
Result of the reviews for the conversation. (String) |
deleted |
Whether the conversation was deleted or not. (Boolean) |
rank |
Rank of the conversation. (Integer) |
synthetic |
Whether the conversation is synthetic or not. (Boolean) |
model_name |
Name of the model used for the conversation. (String) |
detoxify |
Whether the conversation was detoxified or not. (Boolean) |
tree_state |
State of the conversation tree. (String) |
emojis |
Emojis used in the conversation. (String) |
labels |
Labels associated with the conversation. (String) |
Acknowledgements
If you use this dataset in your research, please credit the original authors.
If you use this dataset in your research, please credit Huggingface Hub.