Computers Grade Essays As Well As Humans: Study

Multiple-choice tests are a pretty terrible way of measuring students’ understanding of complex concepts, but overextended teachers rely on fill-in-the-bubble exams because they’re easy to mark. There may be a better alternative: Scientists have created software that can grade short-answer essays in five seconds—and a new paper in the Journal of Science Education and Technology suggests it may be a more accurate measure of students’ understanding than multiple-choice tests.

A group of researchers, led by Elizabeth Beggrow at the Ohio State University, assessed science students’ understanding of key ideas about evolution using four methods: multiple-choice tests, human-scored written explanations, computer-scored written explanations, and clinical oral interviews. Clinical interviews—which allow professors to ask follow-up questions and engage students in dialogue—are considered ideal, but would be an impractical drain on teachers’ time; in this study, the clinical interviews lasted 14 minutes on average, and some took nearly half an hour. Machines, on the other hand, could generate a score in less than five seconds, though they took a few minutes to set up. The researchers "taught" the software to mark essays by feeding it examples of human-scored essays until it learned to recognize patterns in what the human scorers were looking for.
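To make the "teaching" step concrete, here is a minimal sketch of learning from human-scored examples. It is not the software used in the study: the word-count (naive Bayes) approach, the function names, the labels, and the four training essays are all made up for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    cleaned = "".join(c if c.isalpha() else " " for c in text.lower())
    return cleaned.split()


def train(examples):
    """Count word frequencies per human-assigned label."""
    word_counts = {}          # label -> Counter of words seen under that label
    label_counts = Counter()  # label -> number of training essays
    for text, label in examples:
        label_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(tokenize(text))
    vocab = {w for counts in word_counts.values() for w in counts}
    return word_counts, label_counts, vocab


def predict(model, text):
    """Return the label with the highest smoothed log-probability."""
    word_counts, label_counts, vocab = model
    total = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, n in label_counts.items():
        score = math.log(n / total)
        denom = sum(word_counts[label].values()) + len(vocab)
        for word in tokenize(text):
            if word in vocab:  # words never seen in training are ignored
                score += math.log((word_counts[label][word] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical essays with hypothetical human labels (not data from the study).
TRAINING = [
    ("Individuals with helpful traits survive competition for limited "
     "resources and reproduce more", "normative"),
    ("Variation exists in the population and limited resources mean some "
     "variants leave more offspring", "normative"),
    ("The animals needed longer necks so they grew them and passed them on",
     "naive"),
    ("The species wanted to adapt so it changed in order to reach its goal",
     "naive"),
]

model = train(TRAINING)
label = predict(
    model, "Competition for limited resources favors variants that reproduce more"
)
print(label)
```

A real scoring system would need far more training essays and a check against held-out human scores; the sketch only shows the basic idea of learning patterns from graded examples.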

Beggrow and her team recruited 104 undergraduates enrolled in a biology class and offered them $20 to have their understanding of evolution assessed three times over: in a clinical interview with a professor, in a multiple-choice test, and in a short-answer essay test that would be graded by both the machine and a professor. Students’ level of understanding was quantified based on how often they answered questions using appropriate "normative" scientific ideas (for example, that species evolve in response to competition or limited resources) as opposed to "naïve" or non-normative ideas, such as the belief that acquired traits are heritable or that evolution is a goal-directed process. When Beggrow and her team analyzed the data, they found that professors’ and computers’ scores of students’ short essays were almost identical, with correlations ranging from 0.96 to 1, and that the correlation between interviews and short-essay scores (0.56) was stronger than the one between interviews and multiple-choice answers (0.34). The researchers argue that written responses are better measures of students’ understanding because they ask students to recall, rather than simply recognize, information. Beggrow et al. write:

Asking students to construct explanations enables researchers and teachers to glean important insights about the structure and composition of student thought, that is, the open-ended format permits many permutations of ‘right’ and ‘wrong’ knowledge elements, rather than either ‘right’ or ‘wrong’ multiple-choice formats…. A major challenge facing the field of science education is building assessment tools and systems that are capable of validly and efficiently evaluating authentic scientific practices… Multiple choice assessments simply cannot assess students’ communication and explanation abilities.
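For readers unfamiliar with the correlation figures quoted above (0.96, 0.56, 0.34), the sketch below shows how such a coefficient can be computed for two sets of scores. It assumes a Pearson correlation, which the article does not specify, and the score lists are invented purely to illustrate the arithmetic.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one list is constant")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical scores for five students (not data from the study).
interview_scores = [1, 2, 3, 4, 5]
essay_scores = [1, 2, 3, 5, 4]

r = pearson(interview_scores, essay_scores)
print(round(r, 2))
```

A value near 1 means the two scoring methods rank students almost identically, while a value near 0 means one method's scores say little about the other's.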

Another danger of multiple-choice tests is that, because they contain “enticing distractors,” they may actually cause students to develop false ideas.

This is not the first hint we’ve had that machines may be capable of grading long-answer tests; last year, a study found a high degree of similarity between human and computer-generated scores of over 20,000 middle- and high-school essays on a variety of topics. Skeptics object that machines can be fooled and that robo-graders threaten “the idea that reading and writing are uniquely human.” While it’s obviously impossible for a robot to judge a paper’s creative or literary merit, being able to assign students essays rather than multiple-choice tests could revolutionize science teaching.