Abstract

Recent studies have showcased the exceptional performance of LLMs (Large Language Models) on assessment questions across a variety of disciplines. This capability can support the learning process, for example by enabling students to quickly generate and contrast alternative solution approaches. However, concerns about student over-reliance on and inappropriate use of LLMs in education are common. Understanding the capabilities of LLMs is therefore essential for instructors to make informed decisions about which questions to use for learning and assessment tasks. In CS (Computer Science), previous evaluations of LLMs have focused on CS1 and CS2 questions, and little is known about how well LLMs perform on assessment questions in upper-level CS courses such as CG (Computer Graphics), which covers a wide variety of concepts and question types. To address this gap, we compiled a dataset of past assessment questions from a final-year undergraduate course on introductory CG and evaluated the performance of GPT-4 on this dataset. We also classified the assessment questions and evaluated the performance of GPT-4 on each question type. We found that performance tended to be best on simple mathematical questions, and worst on questions requiring creative thinking and on those with complex descriptions and/or images. We share our benchmark dataset with the community and provide new insights into the capabilities of GPT-4 in the context of CG courses. We highlight opportunities for teaching staff to improve student learning by guiding the use of LLMs for CG questions, and to inform decisions about question choices for assessment tasks.