EduQG Dataset - LLM Science Exam Format (~3.4K)

External training data for the 'Kaggle - LLM Science Exam' competition

@kaggle.nlztrk_eduqg_dataset_llm_science_exam_format_34k

About this Dataset

This is the competition-formatted version of the dataset from the paper "EduQG: A Multi-format Multiple Choice Dataset for the Educational Domain". The data includes a wide range of questions from the educational domain.

Original Data: https://github.com/hadifar/question-generation/tree/main/raw_data

This dataset includes both the original and the formatted versions of the data. The original dataset contains 5-choice and 4-choice questions. Each 4-choice question is padded to five options by duplicating one of its wrong choices, picked at random, as option E. Questions with any other number of choices are skipped.

The formatting was done with the following snippet I wrote:

import pandas as pd
import numpy as np
import json
from tqdm.auto import tqdm

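# Load the raw EduQG train and validation splits; "with" closes each file handle.
with open("eduqg_train.json") as f:
    eduqg_json = json.load(f)
with open("eduqg_val.json") as f:
    eduqg2_json = json.load(f)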

questions = []

for json_corpus in [eduqg_json, eduqg2_json]:
    for chapter in tqdm(json_corpus):
        for question in chapter["questions"]:

            question_text = question["question"]["normal_format"]
            question_choices = question["question"]["question_choices"]
            question_answer_id = question["answer"]["ans_choice"]

            if len(question_choices) == 4:
                # Pad to five options: pick a random wrong choice to repeat as "E".
                false_answer_ids = np.delete(np.arange(4), question_answer_id)
                duplicate_false_id = np.random.choice(false_answer_ids)

                question_row = {"prompt": question_text,
                 "A": question_choices[0],
                 "B": question_choices[1],
                 "C": question_choices[2],
                 "D": question_choices[3],
                 "E": question_choices[duplicate_false_id],
                 "answer": ["A","B","C","D","E"][question_answer_id]
                }
            elif len(question_choices) == 5:
                question_row = {"prompt": question_text,
                 "A": question_choices[0],
                 "B": question_choices[1],
                 "C": question_choices[2],
                 "D": question_choices[3],
                 "E": question_choices[4],
                 "answer": ["A","B","C","D","E"][question_answer_id]
                }
            else:
                # Skip questions that have neither 4 nor 5 choices.
                continue
            questions.append(question_row)

out_df = pd.DataFrame(questions).reset_index().rename(columns={"index":"id"})
out_df.to_csv("eduqg_llm_formatted.csv", index=False)
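
Because the padded questions repeat one wrong choice as option E, it can be useful to tell them apart from the native 5-choice questions after loading the CSV. The following is only a minimal sketch, not part of the original formatting script: the helper name `flag_padded_rows` and the toy rows are made up for illustration, and the check is a heuristic, since a native 5-choice question that happens to contain a repeated option would be flagged as well.

```python
import pandas as pd


def flag_padded_rows(df: pd.DataFrame) -> pd.Series:
    """Return a boolean Series that is True where option E repeats one of A-D.

    Hypothetical helper: it assumes the column layout written by the
    formatting snippet (id, prompt, A, B, C, D, E, answer).
    """
    # Compare column E against each of the first four options, row by row.
    matches = df[["A", "B", "C", "D"]].eq(df["E"], axis=0)
    return matches.any(axis=1)


# Toy rows standing in for the real file (illustrative values only).
demo = pd.DataFrame(
    [
        {"id": 0, "prompt": "q1", "A": "w", "B": "x", "C": "y", "D": "z",
         "E": "x", "answer": "A"},
        {"id": 1, "prompt": "q2", "A": "a", "B": "b", "C": "c", "D": "d",
         "E": "e", "answer": "E"},
    ]
)
demo["padded"] = flag_padded_rows(demo)
print(demo[["id", "padded"]])
```

With the real data, the same helper could be applied to `pd.read_csv("eduqg_llm_formatted.csv")`, for example to drop or down-weight the padded rows during training.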
