Evaluating the Quality of LLM-Generated Explanations for Logical Errors in CS1 Student Programs
Abstract
When students in CS1 (Introductory Programming) write erroneous code, course staff can use automated tools to provide various types of helpful feedback. In this paper, we focus on syntactically correct student code containing logical errors. Tools that explain logical errors typically require course staff to invest more effort than tools that merely detect such errors. To reduce this effort, prior work has investigated the use of Large Language Models (LLMs) such as GPT-3 to generate explanations. Unfortunately, these explanations can be incomplete or incorrect, and therefore unhelpful if presented to students directly. Nevertheless, LLM-generated explanations may be of adequate quality to serve as a starting point from which Teaching Assistants (TAs) can efficiently craft helpful explanations. We evaluate the quality of explanations generated by an LLM (GPT-3.5-turbo) for 30 buggy student solutions across 6 code-writing problems, in two ways. First, in a study with 5 undergraduate TAs, we compared TA perceptions of the quality of LLM-generated and peer-generated explanations. The TAs, who were unaware which explanations were LLM-generated, found them to be comparable in quality to peer-generated explanations. Second, we performed a detailed manual analysis of the LLM-generated explanations for all 30 buggy solutions. We found at least one incorrect statement in 15/30 explanations (50%); however, in 28/30 cases (93%), the LLM-generated explanation correctly identified at least one logical error. Our results suggest that in large CS1 courses, TAs with adequate training in detecting erroneous statements may be able to extract value from such explanations.