EduQG Dataset - LLM Science Exam Format (~3.4K)
External training data for the 'Kaggle - LLM Science Exam' competition
@kaggle.nlztrk_eduqg_dataset_llm_science_exam_format_34k
This is the competition-formatted version of the dataset from the paper "EduQG: A Multi-format Multiple Choice Dataset for the Educational Domain". The data includes a wide range of questions from the educational domain.
Original Data: https://github.com/hadifar/question-generation/tree/main/raw_data
This dataset includes both the original and the formatted versions of the data. The original dataset contains 5-choice and 4-choice questions. Each 4-choice question is padded to five options by duplicating one of its wrong choices, picked at random, as option E.
The formatted version was produced with the following snippet I wrote:
import json

import numpy as np
import pandas as pd
from tqdm.auto import tqdm

# Load the train and validation splits of the original EduQG data.
with open("eduqg_train.json") as f:
    eduqg_json = json.load(f)
with open("eduqg_val.json") as f:
    eduqg2_json = json.load(f)

questions = []
for json_corpus in [eduqg_json, eduqg2_json]:
    for chapter in tqdm(json_corpus):
        for question in chapter["questions"]:
            question_text = question["question"]["normal_format"]
            question_choices = question["question"]["question_choices"]
            question_answer_id = question["answer"]["ans_choice"]

            if len(question_choices) == 4:
                # Pad to five options: reuse a random wrong choice as "E".
                false_answer_ids = np.delete(np.arange(4), question_answer_id)
                duplicate_false_id = np.random.choice(false_answer_ids)
                question_row = {"prompt": question_text,
                                "A": question_choices[0],
                                "B": question_choices[1],
                                "C": question_choices[2],
                                "D": question_choices[3],
                                "E": question_choices[duplicate_false_id],
                                "answer": ["A", "B", "C", "D", "E"][question_answer_id]
                                }
            elif len(question_choices) == 5:
                question_row = {"prompt": question_text,
                                "A": question_choices[0],
                                "B": question_choices[1],
                                "C": question_choices[2],
                                "D": question_choices[3],
                                "E": question_choices[4],
                                "answer": ["A", "B", "C", "D", "E"][question_answer_id]
                                }
            else:
                # Skip questions with any other number of choices.
                continue

            questions.append(question_row)

# Use the row position as the question id.
out_df = pd.DataFrame(questions).reset_index().rename(columns={"index": "id"})
out_df.to_csv("eduqg_llm_formatted.csv", index=False)
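The snippet above draws the duplicated wrong choice from NumPy's unseeded global random state, so rerunning it will generally produce different "E" options for the 4-choice questions. As a minimal sketch of a reproducible variant, the per-question logic could be pulled into a helper that takes a seeded generator. The helper name `to_exam_row`, the seed value, and the sample question are hypothetical and not part of the original snippet.

```python
import numpy as np

LETTERS = ["A", "B", "C", "D", "E"]


def to_exam_row(prompt, choices, answer_id, rng):
    """Build one competition-style row; return None for unsupported choice counts."""
    choices = list(choices)
    if len(choices) == 4:
        # Pad to five options by repeating one randomly chosen wrong choice.
        wrong_ids = [i for i in range(4) if i != answer_id]
        choices.append(choices[int(rng.choice(wrong_ids))])
    elif len(choices) != 5:
        return None
    row = {"prompt": prompt}
    row.update(zip(LETTERS, choices))
    row["answer"] = LETTERS[answer_id]
    return row


# Made-up example question, for illustration only.
rng = np.random.default_rng(42)  # assumed seed; any fixed integer works
row = to_exam_row("Which gas do plants absorb?", ["CO2", "O2", "N2", "He"], 0, rng)
print(row)
```

Passing the generator in explicitly keeps the padding deterministic for a given seed and question order, without touching NumPy's global state.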
CREATE TABLE eduqg_llm_formatted (
"id" BIGINT,
"prompt" VARCHAR,
"a" VARCHAR,
"b" VARCHAR,
"c" VARCHAR,
"d" VARCHAR,
"e" VARCHAR,
"answer" VARCHAR
);
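Because padded questions repeat one wrong option, it can be useful to tell them apart from the original 5-choice questions. The sketch below flags rows whose "E" text equals one of the options A to D. It assumes the uppercase column names written by the formatting snippet (the table schema above lists them in lowercase), and the helper name `flag_padded_rows` and the sample rows are hypothetical. The check is only a heuristic: an original 5-choice question that happens to contain a repeated option would be flagged too.

```python
import pandas as pd

CHOICE_COLUMNS = ["A", "B", "C", "D"]


def flag_padded_rows(df: pd.DataFrame) -> pd.Series:
    """Return True for rows whose "E" text repeats one of options A-D."""
    # Compare each of A-D against E row by row, then reduce across columns.
    return df[CHOICE_COLUMNS].eq(df["E"], axis=0).any(axis=1)


# Made-up rows in the formatted layout, for illustration only.
sample = pd.DataFrame([
    {"id": 0, "prompt": "Q1", "A": "x", "B": "y", "C": "z", "D": "w", "E": "y", "answer": "A"},
    {"id": 1, "prompt": "Q2", "A": "p", "B": "q", "C": "r", "D": "s", "E": "t", "answer": "E"},
])
padded = flag_padded_rows(sample)
print(sample[padded])
```

On the real file, the same helper could be applied to the result of `pd.read_csv("eduqg_llm_formatted.csv")`.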