Evaluating the Quality of LLM-Generated Explanations for Logical Errors in CS1 Student Programs
Abstract
When students in CS1 (Introductory Programming) write erroneous code, course staff can use automated tools to provide various types of helpful feedback. In this paper, we focus on syntactically correct student code containing logical errors. Tools that explain logical errors typically require course staff to invest more effort than tools that merely detect such errors. To reduce this effort, prior work has investigated the use of Large Language Models (LLMs) such as GPT-3 to generate explanations. Unfortunately, these explanations can be incomplete or incorrect, and therefore unhelpful if presented to students directly. Nevertheless, LLM-generated explanations may be of adequate quality to serve as a starting point from which Teaching Assistants (TAs) can efficiently craft helpful explanations. We evaluate the quality of explanations generated by an LLM (GPT-3.5-turbo) for 30 buggy student solutions across 6 code-writing problems, in two ways. First, in a study with 5 undergraduate TAs, we compared TA perceptions of the quality of LLM-generated and peer-generated explanations. The TAs, who were unaware which explanations were LLM-generated, found them to be comparable in quality to peer-generated explanations. Second, we performed a detailed manual analysis of the LLM-generated explanations for all 30 buggy solutions. We found at least one incorrect statement in 15/30 explanations (50%); however, in 28/30 cases (93%), the LLM-generated explanation correctly identified at least one logical error. Our results suggest that in large CS1 courses, TAs with adequate training in detecting erroneous statements may be able to extract value from such explanations.