Cmrc2018 - Chinese Machine Reading Comprehension Dataset

Chinese MRC Dataset with Language Diversities

By cmrc2018 (From Huggingface) [source]

About this dataset

The cmrc2018 dataset is a collection of Chinese machine reading comprehension data built to bring language diversity into the field. It covers a wide range of topics. The dataset is described as consisting of approximately 20,000 real questions annotated by human experts, although the three tables listed at the end of this page hold 14,363 rows in total (10,142 train, 3,219 validation, 1,002 test). The questions are based on paragraphs extracted from Wikipedia.

The dataset description also mentions a challenge set of harder questions that require in-depth understanding and multi-sentence inference over the given context. It is meant for checking whether a model can go beyond surface matching. Only train, validation, and test tables are listed on this page; no separate challenge file appears among them.

Given its size and the variety of its question phrasing, the dataset is a useful resource for training, testing, and evaluating Chinese machine reading comprehension models. Because the questions were annotated by humans, it lets researchers measure and improve model robustness on harder comprehension tasks.

In summary, cmrc2018 supports Chinese machine reading comprehension research both through its volume of annotated data and through harder questions that call for deeper analysis.

How to use the dataset

The cmrc2018 dataset is a valuable resource for researchers and practitioners working on machine reading comprehension in the Chinese language. It contains a large collection of questions annotated on Wikipedia paragraphs by human experts. This guide outlines how to get started with it.

Familiarize Yourself with the Dataset:

Take some time to explore the dataset and understand its structure. It consists of three files: train.csv, validation.csv, and test.csv. Each file contains four columns: id, context, question, and answers.

The context column provides paragraphs from Wikipedia that have been annotated.

The question column contains the questions asked about these context paragraphs.

The answers column contains the correct answers to these questions.
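
As a starting point, the files can be inspected with Python's standard library alone. The sketch below assumes the CSVs are UTF-8 encoded with a header row naming the four columns from the table definitions on this page (id, context, question, answers). `load_split` and `summarize` are hypothetical helper names, not part of any dataset tooling.

```python
import csv
from pathlib import Path


def load_split(path):
    """Read one split (e.g. train.csv) into a list of dicts keyed by column name."""
    with Path(path).open(newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))


def summarize(rows):
    """Return the column names and the row count of a loaded split."""
    columns = list(rows[0].keys()) if rows else []
    return {"columns": columns, "rows": len(rows)}
```

Calling `summarize(load_split("train.csv"))` should report the four column names and, if the file matches the table listing on this page, 10142 rows. How the answers column is serialized is not documented here, so inspect a few raw values before trying to parse them.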

Understand Language Diversities:

One unique aspect of this dataset is its focus on language diversities in Chinese machine reading comprehension. It aims to challenge models with questions requiring comprehensive understanding and multi-sentence inference throughout the context.

Pay attention to nuances, idioms, or other linguistic elements that may pose challenges for machine comprehension systems.

Training Machine Reading Comprehension Models:

Use the provided training data (train.csv) to train your own machine reading comprehension models in Chinese.

Leverage existing models or develop novel architectures using state-of-the-art techniques like transformer-based models (e.g., BERT) specifically designed for natural language understanding tasks.
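
Extractive models of the BERT family are usually trained to predict where the answer starts and ends in the context rather than to generate text. A minimal sketch of deriving those positions follows. It assumes the answer string appears verbatim in the context and that a plain answer string has already been extracted from the answers column; `find_answer_span` is a hypothetical helper.

```python
def find_answer_span(context, answer_text):
    """Return (start, end) character offsets of answer_text in context, or None.

    The end offset is exclusive, so context[start:end] == answer_text.
    """
    if not answer_text:
        return None
    start = context.find(answer_text)  # first occurrence only
    if start == -1:
        return None
    return start, start + len(answer_text)
```

Only the first occurrence is returned, so an answer that appears several times in a paragraph may need a stored start offset, if the data provides one, to disambiguate. Character offsets still have to be mapped to token positions by whatever tokenizer the model uses.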

Evaluating Model Performance:

Use the validation data (validation.csv) and the test data (test.csv) to evaluate your models on this dataset.

Report metrics suited to question answering, such as exact match and F1 score (which combines precision and recall), to measure the effectiveness of your models.
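
Answer-extraction tasks are commonly scored with exact match (EM) and an overlap-based F1. The sketch below is a simplified character-level version, assuming that whitespace removal and lowercasing are adequate normalization. It is not the official CMRC 2018 evaluation script, and all function names are illustrative.

```python
from collections import Counter


def normalize(text):
    """Drop whitespace and lowercase; a simplified stand-in for real normalization."""
    return "".join(text.split()).lower()


def exact_match(prediction, reference):
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(reference))


def char_f1(prediction, reference):
    """Harmonic mean of character-level precision and recall (bag of characters)."""
    pred_chars = list(normalize(prediction))
    ref_chars = list(normalize(reference))
    if not pred_chars or not ref_chars:
        return float(pred_chars == ref_chars)
    overlap = sum((Counter(pred_chars) & Counter(ref_chars)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_chars)
    recall = overlap / len(ref_chars)
    return 2 * precision * recall / (precision + recall)


def best_score(metric, prediction, references):
    """Score a prediction against several reference answers and keep the best."""
    return max(metric(prediction, ref) for ref in references)
```

Averaging `best_score(exact_match, ...)` and `best_score(char_f1, ...)` over all questions gives dataset-level EM and F1. Treating characters as tokens avoids the need for a Chinese word segmenter, at the cost of ignoring character order.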

Experiment and Iterate:

Use the provided dataset not only for training models but also for conducting experiments and improving their performance.

Implement techniques like data augmentation, transfer learning, or ensemble methods to enhance model accuracy and robustness.
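
One simple form of ensembling is to let several models vote on the answer string for each question. The sketch below assumes each model has already produced one answer string per question; `majority_vote` is a hypothetical helper, and ties are broken in favor of the model listed first.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the most frequent answer string; ties go to the earliest prediction."""
    if not predictions:
        raise ValueError("need at least one prediction")
    counts = Counter(predictions)
    best = max(counts.values())
    for answer in predictions:
        if counts[answer] == best:
            return answer
```

Voting on exact strings is crude, since two models may pick overlapping spans that differ by a character. Averaging start and end scores across models is a common alternative when those scores are available.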

Stay Connected:

Engage with the Kaggle community and participate in discussions or competitions related to Chinese machine reading comprehension.

Share your insights, code implementations, and findings with others working in this field.

Remember that machine reading comprehension is an evolving field, so it's crucial to stay updated with the latest research papers and techniques.

Research Ideas

Training Machine Reading Comprehension Models: This dataset can be used to train machine reading comprehension models for the Chinese language. The annotated questions and answers serve as training data for models that answer questions from a given context paragraph.

Evaluating Model Performance: The dataset provides a validation set with known correct answers for each question. This can be used to compare the performance of different machine reading comprehension models on Chinese text.

Testing Generalization Ability: The challenge set described for the dataset contains questions that require comprehensive understanding and multi-sentence inference across the context. Such questions are suited to testing the generalization ability of machine reading comprehension models, which need to go beyond simple word matching to answer them accurately.

Acknowledgements

If you use this dataset in your research, please credit the original authors and cmrc2018 (From Huggingface).

License

License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication
No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission.

Columns

Files: train.csv, validation.csv, test.csv (all three share the same columns)

Column name	Description
id	Identifier of the row. (Text)
context	Paragraph from Wikipedia that has been annotated for machine reading comprehension. (Text)
question	Question about the context paragraph. (Text)
answers	Correct answer(s) to the question, based on the information in the context paragraph. (Text)

Tables

Test

@kaggle.thedevastator_cmrc2018_chinese_machine_reading_comprehension_d.test

379.99 KB
1002 rows
4 columns


CREATE TABLE test (
  "id" VARCHAR,
  "context" VARCHAR,
  "question" VARCHAR,
  "answers" VARCHAR
);

Train

@kaggle.thedevastator_cmrc2018_chinese_machine_reading_comprehension_d.train

3.75 MB
10142 rows
4 columns


CREATE TABLE train (
  "id" VARCHAR,
  "context" VARCHAR,
  "question" VARCHAR,
  "answers" VARCHAR
);

Validation

@kaggle.thedevastator_cmrc2018_chinese_machine_reading_comprehension_d.validation

1.11 MB
3219 rows
4 columns


CREATE TABLE validation (
  "id" VARCHAR,
  "context" VARCHAR,
  "question" VARCHAR,
  "answers" VARCHAR
);
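
To query the data locally with the same schema, the table definitions above can be recreated in SQLite. This is a sketch under stated assumptions: the CSV files carry a header row with the four column names, and `load_csv_into_table` is a hypothetical helper rather than anything provided with the dataset.

```python
import csv
import sqlite3


def load_csv_into_table(conn, table, csv_path):
    """Create a table with the schema listed above and fill it from a CSV file."""
    # Table names cannot be bound as SQL parameters, so restrict them explicitly.
    if table not in {"train", "validation", "test"}:
        raise ValueError(f"unexpected table name: {table}")
    conn.execute(
        f"CREATE TABLE IF NOT EXISTS {table} "
        '("id" VARCHAR, "context" VARCHAR, "question" VARCHAR, "answers" VARCHAR)'
    )
    with open(csv_path, newline="", encoding="utf-8") as f:
        rows = [
            (r["id"], r["context"], r["question"], r["answers"])
            for r in csv.DictReader(f)
        ]
    conn.executemany(f"INSERT INTO {table} VALUES (?, ?, ?, ?)", rows)
    conn.commit()
    return len(rows)
```

After loading, `conn.execute("SELECT COUNT(*) FROM train").fetchone()[0]` can be compared with the row counts listed above (10142, 3219, and 1002) as a quick integrity check.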