An Empirical Study on How Large Language Models Impact Software Testing Learning
Abstract
Software testing is a challenging topic in software engineering education and requires creative approaches to engage learners. For example, the Code Defenders game has students compete over a Java class under test by writing effective tests and mutants. While such gamified approaches address problems of motivation and engagement, students may nevertheless require help to put testing concepts into practice. The recent widespread diffusion of Generative AI and Large Language Models raises the question of whether and how these disruptive technologies could address this problem, for example, by providing explanations of unclear topics and guidance for writing tests. However, such technologies might also be misused or produce inaccurate answers, which would negatively impact learning. To shed more light on this situation, we conducted the first empirical study investigating how students learn and practice new software testing concepts in the context of the Code Defenders testing game, supported by a smart assistant based on a widely known, commercial Large Language Model. Our study shows that students had unrealistic expectations of the smart assistant: they “blindly” trusted any output it generated and often tried to use it to obtain solutions to testing exercises directly. Consequently, students who resorted to the smart assistant more often were less effective and efficient than those who did not. For instance, they wrote 8.6% fewer tests, and their tests were not useful in 78.0% of the cases. We conclude that giving unrestricted and unguided access to Large Language Models might generally impair learning. Thus, we believe our study helps to raise awareness about the implications of using Generative AI and Large Language Models in Computer Science Education and provides guidance towards developing better and smarter learning tools.