Abstract

Recent advances in artificial intelligence have led to the development of large language models (LLMs), which are able to generate text, images, and source code from prompts provided by humans. In this paper, we explore the capability of an LLM, OpenAI’s GPT-3 model, to provide feedback on student-written code. Specifically, we examine the feasibility of using GPT-3 to check, critique, and suggest changes to code written by learners in an online programming exam of an undergraduate Python programming course. We collected 1211 student code submissions across 7 questions from a programming exam, and provided the GPT-3 model with separate prompts to check, critique, and suggest changes to these submissions. We found high variability in the accuracy of the model’s feedback on student submissions. Across questions, accuracy ranged from 57% to 79% for checking the correctness of the code, from 41% to 77% for critiquing the code, and from 32% to 93% for suggesting appropriate changes to the code. We also found instances where the model generated incorrect and inconsistent feedback. These findings suggest that models like GPT-3 cannot currently be used ‘directly’ to provide feedback to students on programming assessments.