Name: TyDi QA (Questions & Answers In 11 Languages)
Creator: Kaggle
Published: 2025-02-13T08:25:17.267Z
License: https://creativecommons.org/publicdomain/zero/1.0/

Answerable TyDi QA is an extension of the GoldP subtask of the original TyDi QA

By Huggingface Hub

About this dataset

Answerable-TyDiQA is an open-source collection of question-answer pairs that extends the GoldP subtask of the original TyDi QA. Each record has six columns: question_text, document_title, language, annotations, document_plaintext, and document_url. The dataset is aimed at researchers, developers, and data scientists working on multilingual question answering and natural language understanding.

How to use the dataset

The Answerable-TyDiQA dataset is an open-source collection of question-answer pairs that extends the GoldP subtask of the original TyDi QA.

AI researchers, language engineers, and NLP enthusiasts can use this dataset to explore real-world scenarios in Natural Language Processing (NLP) tasks such as question answering, information extraction, and text summarization.

In this guide you will learn how to get started with using the Answerable-TyDiQA Dataset.
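
A first step is usually to read one of the CSV files and see how the questions are spread across languages. The sketch below uses only the Python standard library. The function names `load_rows` and `count_by_language` are illustrative, and the path passed in is assumed to point at a local copy of train.csv or validation.csv.

```python
import csv
from collections import Counter


def load_rows(path):
    """Read one dataset CSV file into a list of dicts keyed by column name."""
    # document_plaintext can be long, so raise the csv module's field size limit.
    csv.field_size_limit(10**9)
    with open(path, newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))


def count_by_language(rows):
    """Count how many questions each value of the language column contributes."""
    return Counter(row["language"] for row in rows)
```

For example, `count_by_language(load_rows("train.csv")).most_common()` lists the languages from most to least frequent, assuming train.csv is in the working directory.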

Research Ideas

AI-based question answering systems: Using the question-answer pairs in the Answerable-TyDiQA dataset, AI-based Q&A models can be trained and tested to better understand how questions are typically formatted, how language is used, and what potential answers to look for when trying to answer a user's query.

Natural language processing research: With its comprehensive data from real-world scenarios, the Answerable TyDiQA dataset can also be leveraged by NLP researchers to identify trends in language usage and extract valuable insights from large text corpora for developing advanced applications such as sentiment analysis or machine translation solutions.

Search engine optimization (SEO): Businesses that want to improve their ranking in search engine results pages (SERPs) could use the questions and answers in this dataset to shape their content around commonly asked questions.
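
For the question answering idea above, testing a model needs some way to score a predicted answer against a reference answer. The sketch below shows exact match and token-level F1, two measures commonly used for extractive question answering. They are an assumption here, not an evaluation protocol prescribed by this dataset. Splitting on whitespace is also a simplification that will not suit languages written without spaces between words.

```python
from collections import Counter


def normalize(text):
    """Lowercase and collapse whitespace so trivial differences are ignored."""
    return " ".join(text.lower().split())


def exact_match(prediction, gold):
    """Return True when the normalized prediction equals the normalized gold answer."""
    return normalize(prediction) == normalize(gold)


def token_f1(prediction, gold):
    """Harmonic mean of token precision and recall between prediction and gold."""
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    if not pred_tokens or not gold_tokens:
        # Both empty counts as a match; exactly one empty counts as a miss.
        return float(pred_tokens == gold_tokens)
    # Multiset intersection counts each shared token at most as often as it appears in both.
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```

Averaging either score over the rows of validation.csv gives a single number for comparing models, provided the reference answers have first been extracted from the annotations column.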

License

License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication
No copyright: you can copy, modify, distribute, and perform the work, even for commercial purposes, all without asking permission.

Columns

File: validation.csv

Column name	Description
question_text	This column contains the text of the questions asked. (String)
document_title	This column contains the title of the document associated with the question. (String)
language	This column contains the language of the question. (String)
annotations	This column contains annotations associated with the question. (String)
document_plaintext	This column contains the plain text content of the document associated with the question. (String)
document_url	This column contains the URL of the document associated with the question. (String)

File: train.csv

Column name	Description
question_text	This column contains the text of the questions asked. (String)
document_title	This column contains the title of the document associated with the question. (String)
language	This column contains the language of the question. (String)
annotations	This column contains annotations associated with the question. (String)
document_plaintext	This column contains the plain text content of the document associated with the question. (String)
document_url	This column contains the URL of the document associated with the question. (String)
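
Since both files are described with the same six columns, a quick header check can catch a truncated or mislabeled download before any analysis starts. This is a minimal sketch using the standard library. `EXPECTED_COLUMNS` and `missing_columns` are illustrative names, and the check deliberately tolerates extra columns (such as an index column) that a given export might add.

```python
import csv

# Column names as listed in the tables above.
EXPECTED_COLUMNS = [
    "question_text",
    "document_title",
    "language",
    "annotations",
    "document_plaintext",
    "document_url",
]


def missing_columns(path):
    """Return the expected column names that are absent from the CSV header."""
    with open(path, newline="", encoding="utf-8") as handle:
        # An empty file yields an empty header, so every column is reported missing.
        header = next(csv.reader(handle), [])
    return [name for name in EXPECTED_COLUMNS if name not in header]
```

An empty list from `missing_columns("validation.csv")` means every documented column is present.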

Acknowledgements

If you use this dataset in your research, please credit the original authors and Huggingface Hub.
