ARCD (Arabic Language Comprehension)
1,395 Questions Posed by Crowdworkers
By Huggingface Hub
About this dataset
This dataset contains 1,395 unique questions posed by crowdworkers on articles compiled from Arabic Wikipedia. Each question comes with the title of the corresponding article, the context in which the question was asked, and the answers to the question. Because every question is paired with its surrounding context, researchers and developers can test natural language processing (NLP) systems in Arabic on longer passages rather than isolated sentences. The collection is a useful starting point for studying Arabic language comprehension at scale and for building further resources that help future researchers create more robust NLP algorithms.
How to use the dataset
How to use the ARCD dataset for Arabic language comprehension
To use this dataset for research purposes, download it from Kaggle into a folder on your computer designated for data files. Once it is saved in an easily accessible location, read through the rows: each one gives you a question's context and its corresponding answers (the answers field). Familiarize yourself with the title and context fields so that it is easy to follow what each crowdworker question is asking. If you notice discrepancies between what is asked and what is answered in either file, cross-check against the validation CSV file.
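As a minimal sketch of the reading step above, the snippet below loads one file with Python's standard csv module. The file names train.csv and validation.csv and the title and answers columns come from the Columns section; the helper names load_split and summarize, the data/arcd folder, and the UTF-8 encoding are assumptions made for illustration.

```python
import csv
from pathlib import Path


def load_split(path):
    """Read one ARCD CSV file into a list of dicts, one per row."""
    # UTF-8 is assumed here; Arabic text will look garbled if the
    # file is decoded with the wrong encoding.
    with Path(path).open(newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))


def summarize(rows):
    """Return the row count and the set of column names seen."""
    columns = set()
    for row in rows:
        # DictReader stores overflow cells under the key None; skip it.
        columns.update(key for key in row if key is not None)
    return len(rows), columns


if __name__ == "__main__":
    # Hypothetical location; point this at wherever you saved the download.
    data_dir = Path("data/arcd")
    for name in ("train.csv", "validation.csv"):
        file_path = data_dir / name
        if file_path.exists():
            count, columns = summarize(load_split(file_path))
            print(name, count, sorted(columns))
```

Printing the column names first is a cheap way to confirm which fields your copy of the files actually contains before writing any analysis code.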
The next step is to analyze these text strings and decide how best to interpret them using NLP techniques such as word embeddings or Named Entity Recognition (NER). You can also convert the text into numeric vectors (for example bag-of-words or TF-IDF features) and feed those vectors to machine learning classifiers such as logistic regression or support vector machines (SVM), which makes the text easier to analyze. This lets researchers measure more precisely whether the answer given to a particular question was correct, which can in turn support broader conclusions about language understanding.
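The vectorization step can be sketched in plain Python as below. This is one simple TF-IDF variant under stated assumptions, not the method used by the dataset authors: tokens are split on whitespace (a real pipeline would likely use an Arabic-aware tokenizer), the inverse document frequency is log(N / df), and the function names tokenize, tfidf_vectors, and cosine are hypothetical.

```python
import math
from collections import Counter


def tokenize(text):
    """Split on whitespace; a placeholder for an Arabic-aware tokenizer."""
    return text.split()


def tfidf_vectors(documents):
    """Return one {token: weight} dict per document using plain TF-IDF."""
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each token appears.
    doc_freq = Counter(token for tokens in tokenized for token in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1  # avoid dividing by zero on empty text
        vectors.append({
            token: (count / total) * math.log(n_docs / doc_freq[token])
            for token, count in counts.items()
        })
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(token, 0.0) for token, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)
```

With this weighting, a token that occurs in every document gets weight zero, so only distinguishing tokens contribute to similarity. The resulting vectors are the kind of input a logistic regression or SVM classifier would consume.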
Keep in mind that running NLP algorithms on large datasets can be computationally intensive. Depending on your toolkit there are ways around this, such as using cloud services like Google Cloud Platform, as long as you keep costs in view. Finally, once the analysis is complete, present your findings as visualizations accompanied by concise summaries and conclusions, which will help future research on Arabic language understanding.
Research Ideas
- Improving Question Answering Systems: The ARCD dataset can be used to train and evaluate question-answering AI models.
- Arabic Text Classification: The data contained in the ARCD dataset could be leveraged to build automatic text classification systems for different topics of Arabic text.
- Reinforcement Learning: The questions contained in the dataset can be used with reinforcement learning techniques, allowing agents to play a game of guessing which answer is correct given a context and a list of options.
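For the evaluation part of the first idea, one common choice for question answering is to score a predicted answer string against the reference answers with exact match and token-level F1. The sketch below assumes whitespace tokenization and plain string answers; the source does not prescribe a metric, and the function names exact_match, token_f1, and best_score are hypothetical.

```python
from collections import Counter


def exact_match(prediction, reference):
    """1.0 if the strings match after whitespace normalization, else 0.0."""
    return float(" ".join(prediction.split()) == " ".join(reference.split()))


def token_f1(prediction, reference):
    """Harmonic mean of token precision and recall between two strings."""
    pred_tokens = prediction.split()
    ref_tokens = reference.split()
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a match; only one empty counts as a miss.
        return float(pred_tokens == ref_tokens)
    # Multiset intersection counts each shared token at most as often
    # as it appears on both sides.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def best_score(metric, prediction, references):
    """Score a prediction against several references and keep the best."""
    return max(metric(prediction, ref) for ref in references)
```

Taking the best score over all references avoids penalizing a model when a question has more than one acceptable answer. Note that best_score raises ValueError if the reference list is empty.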
Acknowledgements
If you use this dataset in your research, please credit the original authors.
License
License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication
No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.
Columns
File: validation.csv
Column name | Description |
title | The title of the context from which the question was posed. (String) |
answers | The answers to the question posed by the crowdworker. (String) |
File: train.csv
Column name | Description |
title | The title of the context from which the question was posed. (String) |
answers | The answers to the question posed by the crowdworker. (String) |
Acknowledgements
If you use this dataset in your research, please credit Huggingface Hub.