Abstract


Generating high-quality multiple-choice questions (MCQs) is a time-consuming activity that has led practitioners and researchers to develop community question banks and reuse the same questions from semester to semester. This results in generic MCQs that are not relevant to every course. Template-based methods for generating MCQs require less effort but are similarly limited. At the same time, advances in natural language processing have resulted in large language models (LLMs) that are capable of doing tasks previously reserved for people, such as generating code, code explanations, and programming assignments. In this paper, we investigate whether these generative capabilities of LLMs can be used to craft high-quality MCQs more efficiently, thereby enabling instructors to focus on personalizing MCQs to each course and the associated learning goals. We used two LLMs, GPT-3 and GPT-4, to generate isomorphic MCQs based on MCQs from the Canterbury Question Bank and an introductory low-level C programming course. We evaluated the resulting MCQs and assessed the models' ability to generate correct answers based only on the question stem, a task that was not previously possible. We then investigated whether there is a correlation between model performance and the discrimination score of the associated MCQ, to understand whether low-discrimination questions required the model to do more inference and therefore led to poorer performance. GPT-4 correctly generated the answer for 78.5% of MCQs based only on the question stem. This suggests that instructors could use these models to quickly draft quizzes, such as during a live class, to identify misconceptions in real time. We also replicate previous findings that GPT-3 performs poorly at answering MCQs, or, in our case, at generating correct answers to them. We also present cases we observed where LLMs struggled to produce correct answers. Finally, we discuss implications for computing education. © 2023 IEEE.
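To make the kind of evaluation described above concrete, the sketch below shows one way to check whether a model reproduces the correct answer from a question stem alone and to correlate per-question correctness with each question's discrimination score. The data structure, prompt wording, model name, and use of the OpenAI Python client and SciPy are illustrative assumptions; the paper does not specify the authors' actual pipeline.

```python
# Illustrative sketch only: not the authors' evaluation code.
from dataclasses import dataclass

from openai import OpenAI            # assumes the openai>=1.0 client is installed
from scipy.stats import pointbiserialr

client = OpenAI()                    # reads OPENAI_API_KEY from the environment


@dataclass
class MCQ:
    stem: str                        # question text shown to the model
    correct_answer: str              # reference answer from the question bank
    discrimination: float            # item discrimination score from exam data


def model_answer(stem: str, model: str = "gpt-4") -> str:
    """Ask the model to answer an MCQ given only its stem (prompt is hypothetical)."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user",
                   "content": f"Answer this multiple-choice question:\n{stem}"}],
    )
    return response.choices[0].message.content.strip()


def evaluate(questions: list[MCQ]) -> None:
    # Score 1 if the model's answer matches the reference answer, else 0.
    # (Real grading would need more robust matching than exact string equality.)
    correct = [int(model_answer(q.stem) == q.correct_answer) for q in questions]
    accuracy = sum(correct) / len(correct)

    # Point-biserial correlation between binary correctness and the
    # continuous discrimination score of each question.
    r, p = pointbiserialr(correct, [q.discrimination for q in questions])
    print(f"accuracy={accuracy:.1%}, correlation r={r:.2f} (p={p:.3f})")
```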
