Sciphi Textbooks Are All You Need
650,000 Unique Samples from K-12 to Grad School
By Huggingface Hub [source]
About this dataset
This dataset is your one-stop comprehensive resource for educational research. Featuring 650,000 unique textbook samples on a wide range of courses from the earliest days of K-12 to the most advanced graduate programs, dive deep into the educational ecosystem with an expansive library built for exploration and discovery.
Analyze course materials with confidence, examining their nuances through different perspectives and learning styles by leveraging prompted samples, completed versions, and even notes left by fellow researchers. Each sample also records the model and temperature setting used to generate it, so you can filter or compare results by generation settings to suit your work.
Whether you are a trainer seeking fresh curriculum ideas or a student looking for primary source materials in history or literature classes, our open-source collection handles it all, one million pages strong!
How to use the dataset
This comprehensive open-source textbook library for educational research is an invaluable and expansive resource for researchers, educators, and students alike. With 650,000 unique samples from K-12 to graduate school academic levels across a variety of courses, this dataset provides critical insights into the vast array of educational material available.
To use this dataset, there are several key columns to consider: formatted_prompt, completion, first_task, second_task, last_task, notes, title, model, and temperature. Each column contains valuable information that can help you better understand the sample textbooks included in the dataset. For example:
- Formatted Prompt: The original prompt used to generate a given sample of textbook text.
- Completion: The text generated from a given prompt by the model used. A higher temperature during generation produces more varied sentences.
- Tasks: Each task corresponds to a separate portion of the generation process (e.g., first_task may have generated an introduction paragraph, while last_task may have summarized key points identified in earlier tasks).
- Notes & Title: These two columns provide descriptive metadata about each sample, including expert notes on further improvements or additions that could be made, as well as titles assigned by subject matter experts.
With access to these data points, users can reproduce results or start their own exploration using one cohesive dataset for their drafting and programming needs!
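As a minimal sketch of working with these columns, the snippet below reads rows with Python's standard `csv` module, counts samples per (model, temperature) pair, and lists the task descriptions of a row in order. The helper names `read_samples`, `count_by_settings`, and `task_outline` are hypothetical and chosen for illustration; the sketch assumes the file has the column names listed above and that every `temperature` value parses as a number.

```python
import csv
from collections import Counter

# Task columns as named in this dataset description.
TASK_COLUMNS = ("first_task", "second_task", "last_task")


def read_samples(csv_lines):
    """Yield one dict per row from an open file object or any iterable of CSV lines."""
    yield from csv.DictReader(csv_lines)


def count_by_settings(rows):
    """Count samples per (model, temperature) pair."""
    counts = Counter()
    for row in rows:
        counts[(row["model"], float(row["temperature"]))] += 1
    return counts


def task_outline(row):
    """Return the non-empty task descriptions of a row, in first/second/last order."""
    return [row[col] for col in TASK_COLUMNS if row.get(col)]


# Example usage (assumes train.csv is in the working directory):
# with open("train.csv", newline="", encoding="utf-8") as f:
#     rows = list(read_samples(f))
# print(count_by_settings(rows).most_common(5))
# print(task_outline(rows[0]))
```

Grouping by generation settings first is one way to check how many samples each model and temperature contributed before comparing their completions.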
Research Ideas
- Text classification for automatically assigning courses and topics to a given body of text.
- Generating natural language summaries of textbooks or educational material, such as short document descriptors for search engine optimization (SEO) purposes.
- Devising new tasks for which to train machine learning models, such as predicting the completed form of incomplete sentences in order to facilitate more accurate auto-fill capabilities when composing documents.
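The sentence-completion idea above could start from something like the following sketch, which cuts a text at a word boundary into a prefix (the model input) and a continuation (the target). The function name `make_completion_pair` and its `prefix_fraction` and `min_words` parameters are assumptions introduced for illustration; they are not part of the dataset, and a real pipeline would likely split on sentence or token boundaries instead.

```python
def make_completion_pair(text, prefix_fraction=0.5, min_words=4):
    """Split text into a (prefix, continuation) pair at a word boundary.

    Returns None when the text has fewer than min_words words.
    """
    words = text.split()
    if len(words) < min_words:
        return None
    # Clamp the cut so both the prefix and the continuation are non-empty.
    cut = max(1, min(len(words) - 1, int(len(words) * prefix_fraction)))
    return " ".join(words[:cut]), " ".join(words[cut:])


# Example usage on the text of a completion column value:
# pair = make_completion_pair("Photosynthesis converts light energy into chemical energy")
```

Applying this to each value in the completion column yields training pairs for an auto-fill style task; very short texts are skipped rather than split.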
Acknowledgements
If you use this dataset in your research, please credit the original authors.
Data Source
License
License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication
No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.
Columns
File: train.csv
Column name | Description
formatted_prompt | A prompt that has been formatted for use in the dataset. (String)
completion | The completion of the prompt. (String)
first_task | The first task associated with the prompt. (String)
second_task | The second task associated with the prompt. (String)
last_task | The last task associated with the prompt. (String)
notes | Any additional notes associated with the prompt. (String)
title | The title of the prompt. (String)
model | The model used to generate the prompt. (String)
temperature | The temperature used to generate the prompt. (Float)
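The column types listed above can be captured in a small typed record. The sketch below is one possible way to do that, assuming each row arrives as a dict of strings (for example from `csv.DictReader`); the `TextbookSample` class and `parse_sample` function are hypothetical names used for illustration.

```python
from dataclasses import dataclass

# Column names as listed for train.csv in this description.
EXPECTED_COLUMNS = (
    "formatted_prompt", "completion", "first_task", "second_task",
    "last_task", "notes", "title", "model", "temperature",
)


@dataclass(frozen=True)
class TextbookSample:
    formatted_prompt: str
    completion: str
    first_task: str
    second_task: str
    last_task: str
    notes: str
    title: str
    model: str
    temperature: float  # the only non-string column


def parse_sample(row):
    """Convert a dict of strings into a TextbookSample, checking for missing columns."""
    missing = [col for col in EXPECTED_COLUMNS if col not in row]
    if missing:
        raise KeyError(f"missing columns: {missing}")
    values = {col: row[col] for col in EXPECTED_COLUMNS}
    values["temperature"] = float(values["temperature"])
    return TextbookSample(**values)
```

Failing early on a missing column makes schema drift visible at load time rather than later in an analysis.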
Acknowledgements
If you use this dataset in your research, please credit Huggingface Hub.