Evaluating LLM-generated Worked Examples in an Introductory Programming Course
Abstract
Worked examples, which illustrate the process for solving a problem step by step, are a well-established pedagogical technique that has been widely studied in computing classrooms. However, creating high-quality worked examples is time-intensive for educators, so learners typically do not have access to a broad range of such examples. The recent emergence of powerful large language models (LLMs), which appear capable of generating high-quality, human-like content, may offer a solution. Separate strands of recent work have shown that LLMs can accurately generate code suitable for a novice audience, and that they can generate high-quality explanations of code. LLMs may therefore be well suited to creating a broad range of worked examples, overcoming the bottleneck of manual effort currently required. In this work, we present a novel tool, ‘WorkedGen’, which uses an LLM to generate interactive worked examples. We evaluate the tool with both an expert assessment of the generated content and a user study involving students in a first-year Python programming course (n ≈ 400). We find that prompt chaining and one-shot learning are useful strategies for optimising the output of an LLM when producing worked examples. Our expert analysis suggests that LLMs generate clear explanations, and our classroom deployment shows that students find the LLM-generated worked examples useful for their learning. We propose several avenues for future work, including investigating WorkedGen’s value across a range of programming languages and with more complex questions suitable for more advanced courses.
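For readers unfamiliar with the prompting strategies named above, the following is a minimal illustrative sketch of prompt chaining combined with a one-shot example. It is not WorkedGen’s actual implementation: the call_llm function, the prompt wording, and the one-shot example are all placeholders introduced here for illustration only, and would need to be wired to a real chat-completion API.

```python
# Illustrative sketch (not WorkedGen's code): chain two LLM calls to build a
# worked example, guiding the second call with a one-shot format example.

ONE_SHOT_EXAMPLE = """\
Problem: Write a function that returns the sum of a list of numbers.
Worked example:
1. Define the function signature: def sum_list(numbers):
2. Initialise an accumulator: total = 0
3. Loop over the list, adding each element: for n in numbers: total += n
4. Return the accumulated result: return total
"""


def call_llm(prompt: str) -> str:
    """Hypothetical stand-in for a call to any LLM completion endpoint."""
    raise NotImplementedError("Connect this to your preferred LLM API.")


def generate_worked_example(problem: str) -> str:
    # Step 1 of the chain: produce a novice-friendly solution to the problem.
    solution = call_llm(
        f"Write a short, beginner-friendly Python solution to:\n{problem}"
    )
    # Step 2 of the chain: explain that solution step by step, using the
    # one-shot example to pin down the desired worked-example format.
    return call_llm(
        "Here is an example of the worked-example format we want:\n"
        f"{ONE_SHOT_EXAMPLE}\n"
        "Now produce a worked example, in the same format, for this problem "
        f"and solution.\nProblem: {problem}\nSolution:\n{solution}"
    )
```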