This is the competition-formatted version of the dataset from the paper "EduQG: A Multi-format Multiple Choice Dataset for the Educational Domain". The data includes a wide range of questions from the educational domain.
Original Data: https://github.com/hadifar/question-generation/tree/main/raw_data
This dataset includes both the original and the formatted versions of the data. The original dataset contains 5-choice and 4-choice questions. Each 4-choice question is padded to five choices by duplicating one of its wrong choices, picked at random, as choice E. Questions with any other number of choices are skipped.
The formatting is executed with the following snippet I wrote:
import json

import numpy as np
import pandas as pd
from tqdm.auto import tqdm

# Load the train and validation splits of the raw EduQG data.
with open("eduqg_train.json") as f:
    eduqg_json = json.load(f)
with open("eduqg_val.json") as f:
    eduqg2_json = json.load(f)

questions = []
for json_corpus in [eduqg_json, eduqg2_json]:
    for chapter in tqdm(json_corpus):
        for question in chapter["questions"]:
            question_text = question["question"]["normal_format"]
            question_choices = question["question"]["question_choices"]
            question_answer_id = question["answer"]["ans_choice"]
            if len(question_choices) == 4:
                # Pad to five choices: pick a random wrong choice and repeat it as "E".
                false_answer_ids = np.delete(np.arange(4), question_answer_id)
                duplicate_false_id = np.random.choice(false_answer_ids)
                question_row = {"prompt": question_text,
                                "A": question_choices[0],
                                "B": question_choices[1],
                                "C": question_choices[2],
                                "D": question_choices[3],
                                "E": question_choices[duplicate_false_id],
                                "answer": ["A","B","C","D","E"][question_answer_id]
                                }
            elif len(question_choices) == 5:
                question_row = {"prompt": question_text,
                                "A": question_choices[0],
                                "B": question_choices[1],
                                "C": question_choices[2],
                                "D": question_choices[3],
                                "E": question_choices[4],
                                "answer": ["A","B","C","D","E"][question_answer_id]
                                }
            else:
                # Skip questions that have neither four nor five choices.
                continue
            questions.append(question_row)

# Use the row position as the "id" column and write the formatted CSV.
out_df = pd.DataFrame(questions).reset_index().rename(columns={"index":"id"})
out_df.to_csv("eduqg_llm_formatted.csv", index=False)
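
Since the padded rows are not marked in the output, it may be useful to tell them apart from the questions that had five choices originally. The sketch below is one possible way to do that, assuming the CSV has the columns produced above (id, prompt, A to E, answer). The helper name flag_padded_rows and the two sample rows are hypothetical and only serve as illustration; they are not taken from EduQG. Note that this check would also flag a genuine 5-choice question whose option E happens to repeat another option.

```python
import pandas as pd


def flag_padded_rows(df: pd.DataFrame) -> pd.Series:
    """Return a boolean Series that is True where option E repeats one of A-D."""
    originals = df[["A", "B", "C", "D"]]
    # Compare every original option column against column E, row by row.
    return originals.eq(df["E"], axis=0).any(axis=1)


# Hypothetical rows for illustration only; not taken from the dataset.
sample = pd.DataFrame([
    {"id": 0, "prompt": "Which planet is closest to the Sun?",
     "A": "Venus", "B": "Mercury", "C": "Mars", "D": "Earth", "E": "Mars",
     "answer": "B"},
    {"id": 1, "prompt": "Which gas do plants absorb during photosynthesis?",
     "A": "Oxygen", "B": "Nitrogen", "C": "Helium", "D": "Argon",
     "E": "Carbon dioxide", "answer": "E"},
])

padded = flag_padded_rows(sample)
print(sample.loc[padded, ["id", "E", "answer"]])
```

To run the same check on the real file, replace the sample frame with pd.read_csv("eduqg_llm_formatted.csv"). Because the duplicated choice is always a wrong one, the answer of a padded row should never be "E".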