Databricks Dolly (15K) by Kaggle | Other

About this Dataset

Databricks Dolly (15K)

Over 15,000 Language Models and Dialogues for Interactive Chat Applications

By Huggingface Hub [source]

About this dataset

This exceptional dataset, created by Databricks employees, provides 15,000+ language models and dialogues to power dynamic ChatGPT applications. By generating prompt-response pairs from 8 different instruction categories, our goal is to facilitate the use of large language models for interactive dialogue interactions—all while avoiding information taken from any web sources except Wikipedia for particular instruction sets. Use this open-source dataset to explore the boundaries of text-based conversations and uncover new insights about natural language processing!

More Datasets

For more datasets, click here.

Featured Notebooks

🚨 Your notebook can be here! 🚨!

How to use the dataset

First, let's take a look at the columns in this dataset: Instruction (string), Context (string), Response (string), Category (string). Each record represents a prompt-response pair or conversation between two people. The Instruction and Context fields contain what is said by one individual and the Response holds what is said back by another, culminating in a conversation. These paired entries are then classified into one of 8 different categories based on their content. Knowing this information can help you best utilize the corpus to your desired purposes.

For example: if you are training a dialogue system you could develop multiple funneling pipelines using this dataset to enrich your model with real-world conversations or create intelligent chatbot interactions. If you want to generate natural language answers as part of Q&A systems then you could utilize excerpts from Wikipedia for particular subsets of instruction categories as well drawing upon prompt-response pairs within those given instructions all from within the Databricks set. Furthermore, since each record is independently labeled into one of 8 defined categories - such as make reservations or compare products - there are many possibilities for leveraging these classification labels with supervised learning techniques such as multi-class classification neural networks or logistic regression classifiers.

In short, this substantial resource offers an array of creative ways to explore different types of dialogue related applications without being limited by needing data from external web sources – all that’s needed from here is your own imagination!

Research Ideas

Generating deep learning models to detect and respond to conversational intent.

Training language models to use natural language processing (NLP) for customer service queries.

Creating custom dialogue agents that are better able to handle more complex conversational interactions, such as those powered by machine learning techniques like supervised or unsupervised learning methods

Acknowledgements

If you use this dataset in your research, please credit the original authors.
Data Source

License

License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication
No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.

Columns

File: train.csv

Column name	Description
Instruction	Text prompt that should generate an appropriate response from a machine learning model/chatbot using natural language processing techniques. (Text)
Context	Provides context to improve accuracy by giving the model more information about what’s happening in a conversation or request execution. (Text)

Acknowledgements

If you use this dataset in your research, please credit the original authors.
If you use this dataset in your research, please credit Huggingface Hub.

Tables

Train

@kaggle.thedevastator_databricks_chatgpt_dataset.train

7.32 MB
15011 rows
4 columns


CREATE TABLE train (
  "instruction" VARCHAR,
  "context" VARCHAR,
  "response" VARCHAR,
  "category" VARCHAR
);