Baselight

Self-instruct Starcoder

Instruct dataset generated from starcoder

@kaggle.thedevastator_exploring_starcoder_instructions


About this Dataset

By Huggingface Hub [source]


This dataset collects instructions and outputs generated with a self-instruct approach, for use in natural language processing applications. The instructions were produced by prompting StarCoder, a code language model from the BigCode project, using a self-instruct pipeline of the kind popularized by Stanford Alpaca. To reduce redundancy, the generation pipeline was modified, and the results are published as three sets: curated, raw, and unique.

The curated set consists of non-redundant instructions selected with an instruction similarity threshold of 0.5, while keeping the original character of the language generated by StarCoder. The raw set contains all original instructions, untouched by these modifications, while the unique set includes newly created instructions based on existing ones within the curated set. Each entry contains an instruction and its output string, together with the most similar instruction and the corresponding average similarity score. The dataset is intended for studying self-instruct data generation in natural language processing applications.
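
The 0.5 similarity threshold described above can be pictured as a greedy filter. The following is a minimal sketch, not the pipeline the dataset authors actually ran: the Jaccard token-overlap metric, the function names `jaccard` and `filter_redundant`, and the sample instructions are all assumptions made for illustration.

```python
def jaccard(a: str, b: str) -> float:
    """Token-overlap similarity in [0, 1] (an assumed stand-in metric)."""
    tokens_a = set(a.lower().split())
    tokens_b = set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 1.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def filter_redundant(instructions, threshold=0.5):
    """Keep an instruction only if its similarity to every
    already-kept instruction is below `threshold`."""
    kept = []
    for candidate in instructions:
        if all(jaccard(candidate, existing) < threshold for existing in kept):
            kept.append(candidate)
    return kept


# Hypothetical sample instructions, not taken from the dataset.
sample = [
    "Write a function that reverses a string",
    "Write a function that reverses a list",
    "Explain what a binary search tree is",
]
curated_like = filter_redundant(sample, threshold=0.5)
print(curated_like)
```

In this sketch the second instruction shares most of its tokens with the first, so it is dropped, while the third is kept. A different similarity metric would change which instructions survive.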

How to use the dataset

This dataset contains instruction data generated by StarCoder, a code language model from the BigCode project, using a self-instruct pipeline of the kind popularized by Stanford Alpaca. Modifications were applied to the generation pipeline to make the instructions more diverse and to improve their accuracy.

The instructions are divided into three distinct sets: curated, raw, and unique. The curated set consists of non-redundant instructions selected with an instruction similarity threshold of 0.5; it is intended for models that require greater accuracy, or for experiments with context homogeneity and semantic flexibility. The raw set comprises all the original instructions generated with StarCoder; it preserves as much information as possible, making it useful for tasks such as hierarchical structure extraction or analyzing different types of variable-precision outputs. Finally, the unique set is made up of distinct instructions generated from the curated set; it focuses on reliable outputs while maintaining significant semantic diversity, for language generation tasks such as dialogue systems or open-ended dialogue agents.

Each row pairs an input instruction with its output string and its most similar instruction, for a total of 4 columns in each file (curated.csv, raw.csv, unique.csv): instruction (the input string), output (the output string associated with the instruction), most_similar (the most similar instruction, described here as based on TF-IDF vectors), and avg_similarity_score (an average score between 0 and 1 indicating the similarity between the two instructions). The column names indicate the role of each field within the pair, e.g. instruction vs. output.
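
A minimal sketch of reading the columns instruction, output, most_similar, and avg_similarity_score with Python's standard `csv` module is shown below. The two sample rows and the helper name `load_rows` are hypothetical and stand in for the real file contents, which will differ.

```python
import csv
import io

# Hypothetical two-row stand-in for curated.csv; real rows will differ.
SAMPLE_CSV = """instruction,output,most_similar,avg_similarity_score
"Write a function that adds two numbers","def add(a, b): return a + b","Write a function that sums a list",0.21
"Reverse a string","def rev(s): return s[::-1]","Reverse a list",0.34
"""


def load_rows(handle):
    """Read rows as dicts and convert the score column to float."""
    rows = []
    for row in csv.DictReader(handle):
        row["avg_similarity_score"] = float(row["avg_similarity_score"])
        rows.append(row)
    return rows


rows = load_rows(io.StringIO(SAMPLE_CSV))
for row in rows:
    print(row["instruction"], "->", row["avg_similarity_score"])
```

To read a downloaded file instead, the same helper can be given a file handle, for example one opened with `open("curated.csv", newline="", encoding="utf-8")`, assuming the file sits in the working directory under that name.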

Having all three sets (curated, raw, and unique) lets you pick the one that best matches your application's requirements. For instance, if your goal is to build a model that automatically interprets commands, you might favor homogeneity, for example by limiting training outputs to commands below a certain level, or by sampling words containing at least 8 characters, and apply these settings to data from either the curated set, which offers contextually precise answers, or the unique set, which emphasizes contextual variation while still containing reliable outputs. Conversely, if you need broader semantic coverage while still requiring high precision, you can mix examples from both the curated and unique sets.
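
The idea of mixing curated and unique examples could look roughly like the sketch below. The function name `combine`, the optional `max_score` cutoff, and the sample rows are hypothetical; the dataset description does not prescribe a particular mixing procedure.

```python
def combine(curated_rows, unique_rows, max_score=None):
    """Merge two lists of row dicts, dropping repeated instructions.

    If `max_score` is given, rows whose avg_similarity_score exceeds it
    are skipped (an assumed way to favor less redundant examples).
    """
    seen = set()
    merged = []
    for row in list(curated_rows) + list(unique_rows):
        key = row["instruction"].strip().lower()
        if key in seen:
            continue
        if max_score is not None and row["avg_similarity_score"] > max_score:
            continue
        seen.add(key)
        merged.append(row)
    return merged


# Hypothetical rows, shown with only the two fields the function uses.
curated_rows = [
    {"instruction": "Reverse a string", "avg_similarity_score": 0.34},
    {"instruction": "Sort a list", "avg_similarity_score": 0.62},
]
unique_rows = [
    {"instruction": "reverse a string ", "avg_similarity_score": 0.34},
    {"instruction": "Parse a JSON file", "avg_similarity_score": 0.18},
]
training_rows = combine(curated_rows, unique_rows, max_score=0.5)
print([row["instruction"] for row in training_rows])
```

Duplicates are detected here by a case-insensitive exact match on the instruction text, which is a deliberately simple choice; a similarity-based check would catch near-duplicates as well.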

Research Ideas

  • Natural language understanding applications such as conversation assistants, which can understand human instructions and execute them effectively.
  • Computer vision tasks where AI algorithms need to be trained to correctly understand different kinds of commands and instructions within images or videos.
  • Machine-learning models that can learn from natural language instructions and generate accurate predictions about future tasks or actions.

Acknowledgements

If you use this dataset in your research, please credit the original authors.

License

License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication
No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.

Columns

File: curated.csv

Column name           Description
instruction           The instruction generated by StarCoder. (String)
output                The output string associated with the instruction. (String)
most_similar          The most similar instruction to the one generated by StarCoder. (String)
avg_similarity_score  The average similarity score between the instruction and the most similar instruction. (Float)

File: raw.csv

Column name           Description
instruction           The instruction generated by StarCoder. (String)
output                The output string associated with the instruction. (String)
most_similar          The most similar instruction to the one generated by StarCoder. (String)
avg_similarity_score  The average similarity score between the instruction and the most similar instruction. (Float)

File: unique.csv

Column name           Description
instruction           The instruction generated by StarCoder. (String)
output                The output string associated with the instruction. (String)
most_similar          The most similar instruction to the one generated by StarCoder. (String)
avg_similarity_score  The average similarity score between the instruction and the most similar instruction. (Float)

Acknowledgements

If you use this dataset in your research, please credit Huggingface Hub.

Tables

Compile

@kaggle.thedevastator_exploring_starcoder_instructions.compile
  • 3.74 MB
  • 3549 rows
  • 4 columns

CREATE TABLE compile (
  "instruction" VARCHAR,
  "output" VARCHAR,
  "most_similar" VARCHAR,
  "avg_similarity_score" DOUBLE
);

Curated

@kaggle.thedevastator_exploring_starcoder_instructions.curated
  • 841.77 KB
  • 771 rows
  • 4 columns

CREATE TABLE curated (
  "instruction" VARCHAR,
  "output" VARCHAR,
  "most_similar" VARCHAR,
  "avg_similarity_score" DOUBLE
);

Raw

@kaggle.thedevastator_exploring_starcoder_instructions.raw
  • 5.41 MB
  • 5003 rows
  • 4 columns

CREATE TABLE raw (
  "instruction" VARCHAR,
  "output" VARCHAR,
  "most_similar" VARCHAR,
  "avg_similarity_score" DOUBLE
);

Unique

@kaggle.thedevastator_exploring_starcoder_instructions.unique
  • 359.9 KB
  • 308 rows
  • 4 columns

CREATE TABLE unique (
  "instruction" VARCHAR,
  "output" VARCHAR,
  "most_similar" VARCHAR,
  "avg_similarity_score" DOUBLE
);
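
The table schemas above can be tried out locally. The sketch below recreates the curated table in an in-memory SQLite database through Python's standard `sqlite3` module and filters it by similarity score. SQLite is only an assumed stand-in for whatever SQL engine hosts these tables, and the inserted rows and the 0.3 cutoff are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Same column layout as the curated table shown above.
conn.execute(
    """
    CREATE TABLE curated (
      "instruction" VARCHAR,
      "output" VARCHAR,
      "most_similar" VARCHAR,
      "avg_similarity_score" DOUBLE
    )
    """
)

# Hypothetical rows for illustration; not taken from the dataset.
conn.executemany(
    "INSERT INTO curated VALUES (?, ?, ?, ?)",
    [
        ("Reverse a string", "def rev(s): return s[::-1]", "Reverse a list", 0.34),
        ("Parse a JSON file", "import json", "Read a CSV file", 0.18),
        ("Sort a list", "sorted(xs)", "Sort a dictionary", 0.25),
    ],
)

# Select instructions whose average similarity is below an assumed cutoff.
low_overlap = conn.execute(
    'SELECT "instruction", "avg_similarity_score" FROM curated '
    'WHERE "avg_similarity_score" < ? '
    'ORDER BY "avg_similarity_score"',
    (0.3,),
).fetchall()
conn.close()

print(low_overlap)
```

Column names are double-quoted in the query to match the quoting used in the schema.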
