Abstract

Large Language Model (LLM) artificial intelligence tools present a unique challenge for educators who teach programming languages. While LLMs like ChatGPT have been well documented for their ability to complete exams and create prose, there is a noticeable lack of research into their ability to solve problems using high-level programming languages. Like many other university educators, those teaching programming courses would like to detect if students submit assignments generated by an LLM. To investigate grade performance and the likelihood of instructors identifying code generated by artificial intelligence (AI) tools, we compare code generated by students and ChatGPT for introductory Python homework assignments. Our research reveals mixed results on both counts, with ChatGPT performing like a mid-range student on assignments and seasoned instructors struggling to detect AI-generated code. This indicates that although AI-generated results may not always be identifiable, they do not currently yield results approaching those of diligent students. We describe our methodology for selecting and evaluating the code examples, the results of our comparison, and the implications for future classes. We conclude with recommendations for how instructors of programming courses can mitigate student use of LLM tools as well as articulate the inherent value of preserving students’ individual creativity in producing programming languages.

Updated: