This dataset was compiled from two sources. The first comprises 74 datasets obtained from https://github.com/jorgegus/autotext_data. For further details, please refer to the following paper:
Madrid, J. G., Escalante, H. J., & Morales, E. "Meta-learning of textual representations." arXiv preprint arXiv:1906.08934 (2019).
The second source consists of 43 datasets obtained via the Hugging Face datasets API, documented at https://huggingface.co/docs/datasets/index.
Each dataset was split 50/50 into training and test sets. In addition, a mini train set was created for each dataset by sampling 100 training rows while guaranteeing at least 2 examples per label; for datasets with more than 50 labels, this per-label minimum makes the mini train set larger than 100 rows. A sketch of the sampling logic follows.
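The exact sampling procedure is not published with the dataset; the following Python sketch illustrates one way to satisfy the described constraints. The use of pandas, the column name "label", and the top-up strategy for reaching 100 rows are assumptions, not details from the original pipeline.

```python
import pandas as pd

def make_mini_train(train_df: pd.DataFrame, target_rows: int = 100,
                    min_per_label: int = 2, seed: int = 0) -> pd.DataFrame:
    # Take up to `min_per_label` rows from every label first.
    base = (train_df.groupby("label", group_keys=False)
                    .apply(lambda g: g.sample(min(len(g), min_per_label),
                                              random_state=seed)))
    # With more than 50 labels, 2 rows per label already exceeds 100 rows,
    # which is why those mini train sets end up larger.
    if len(base) >= target_rows:
        return base.reset_index(drop=True)
    # Otherwise, top up with other training rows until 100 rows are reached.
    remaining = train_df.drop(index=base.index)
    extra = remaining.sample(min(target_rows - len(base), len(remaining)),
                             random_state=seed)
    return pd.concat([base, extra]).reset_index(drop=True)
```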
The synthetic datasets were generated by feeding the documents from the mini train sets to the GPT-3.5 API and asking it to rephrase them. Each synthetic dataset contains 1000 rows: 10 paraphrases for each of the 100 original rows. A minimal sketch of this generation step is shown below.
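The prompt and client code used for the rephrasing are not part of this description; the sketch below assumes the official openai Python client and the gpt-3.5-turbo chat endpoint, with an illustrative prompt rather than the one actually used.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def paraphrase(document: str, n_variants: int = 10) -> list[str]:
    # Request n independent rephrasings of one mini-train-set document;
    # each returned choice is one synthetic row.
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        n=n_variants,
        messages=[
            {"role": "system",
             "content": "Rephrase the user's text, preserving its meaning."},
            {"role": "user", "content": document},
        ],
    )
    return [choice.message.content for choice in response.choices]
```

Applying this to each of the 100 original rows, and copying each source row's label onto its paraphrases, yields the 1000 synthetic rows described above.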
This dataset was developed specifically to facilitate research on automated machine learning (AutoML) for text classification. Please cite the respective sources appropriately when using it.