EDUCATION SEPTEMBER 24, 2013
In high school, we used to moan about Mr. Koonz’s chemistry class. Every Friday, Mr. Koonz required his students to turn in a worksheet and take a test. Every single Friday. We begged for a break from the constant assessments. But nothing swayed him.
It turns out Mr. Koonz was on to something really important. In progressive circles, testing now inspires a lot of skepticism, if not outright hostility. With so much riding on test scores for both teachers and students, the standardized exams required by No Child Left Behind seem to encourage more cheating than learning. At best, they foster memorization, but at the expense of originality and critical thinking. In the modern era, when information can be more easily—and accurately—Googled than mentally recalled, old-fashioned testing strikes its critics as obsolete. (That, at least, is what a group of students caught cheating at New York’s elite Stuyvesant High School argued.) But it turns out that the right kinds of assessments—frequent, short tests—can actually yield big educational benefits. It’s called the “testing effect,” and policymakers are missing an opportunity by not doing more to take advantage of it.
The problem with the standardized tests mandated by No Child Left Behind—as well as with the SAT, A.P., GMAT, MCAT, bar exam, medical boards, and the rest of the standardized tests undergirding the U.S. credentialing system—is that they’re built on what researchers call the “dipstick” view of assessment. They assume that there’s a fixed amount of knowledge and ability in a student’s head, which the test merely measures. But that’s not what science has shown. Done properly, testing is not inert. Rather, it can be much more like the physical phenomena underlying the Heisenberg Uncertainty Principle. In the act of measuring students, you can actually affect how much knowledge they absorb and how well they retain it.
Though it doesn’t get a lot of mainstream attention, the research documenting the testing effect goes back nearly 100 years. In one experiment, three groups of high school students were given reading passages to study. The first group did nothing other than go over the material once. The second group studied it two times. The third group was given an initial test on what they’d read. Two weeks later, the students in all three groups were brought back and given an identical quiz. While the group that studied the passages a second time scored better than the group that just studied them once, the students who were initially tested performed best. The results held up when the students sat for follow-ups five months later. The testing had enhanced learning and retention more than just studying.
A young neuroscientist named Andrew Butler has gone further, showing that testing can actually facilitate creative problem solving. In Butler’s research, undergraduates were given six prose passages of about 1,000 words each filled with facts and concepts. (Fact: There are approximately 1,000 species of bats. Concept: how bats’ echolocation works.) Some of the passages the students simply studied; on others, he tested them repeatedly. Not only did his subjects demonstrate a better grasp of the tested material, but they also fared better when asked to take the concepts about which they’d been quizzed and apply them in completely new contexts—for example, by using what they’d learned about bat and bird wings to answer questions about airplane wings. When students had been tested on the passages, rather than just reading them, they got about 50 percent more of the answers correct. They were better at drawing inferences, thanks to the testing effect.
A key to triggering the testing effect is timing: The sooner students are tested after encountering new material, the more it sinks in, while waiting even seven days to test students can substantially reduce performance. On the other hand, the more testing a student gets on a given set of information, the greater the benefits. With the first few tests, students show dramatic gains. With further testing, the positive effects on retention taper off. But surprisingly, they never reach a plateau: even after 20 or 30 tests, students’ performance continues to improve with each additional assessment.
As for why all of this happens: No one is entirely sure. The most plausible explanation is that connections between neural cells are the subject of a brutal natural-selection process. When you fail to engage them, they seem to wither away; brain power is a classic case of “use it or lose it.” Because the recall process involved in test-taking requires real mental effort, it bulks up the brain’s neural connections and may force the brain to create multiple, alternative retrieval routes for accessing the same piece of information. Frequent mental struggle strengthens intellectual wiring. This may be why, for all the SAT’s drawbacks, SAT prep courses featuring lots of practice exams can boost vocabulary and math skills—by forcing students to retrieve the information on all those flash cards, they provide helpful mental workouts.
So why isn’t there much more testing in U.S. schools? Teachers’ schedules are one major obstacle. Developing good quiz questions—not to mention grading them—is labor intensive. For the classes I teach at the University of Pennsylvania, it takes me and my co-instructor, aided by our two teaching fellows, about an hour a week to develop just five multiple-choice questions to give our students. Mr. Koonz’s tests were on yellowing sheets of paper that he reused year to year. That time-saving approach now has a significant drawback: In the Internet age, students would just post the questions on the Web, which would tip off other students to exactly what’s being tested, eliminating the organic recall that is key to the testing effect. Today’s teachers would need to develop new test questions each week for each class, but few teachers have that kind of free time.
Here is where the U.S. Department of Education can step in. The department could sponsor an initiative, or even create an institute, with the aim of hiring subject and test-writing experts to develop 10,000 short-answer and multiple-choice questions in each academic area—reading comprehension, mathematics, science, and history—for each grade level from first to twelfth. (That comes to about half a million questions total.) The questions, with answers, would be put up on the Web for teachers to deploy in their classes. It would be fine for students to have access to the queries in advance, since the sheer size of the database would make it impossible to prep for the resultant tests through brute memorization. The program would be ongoing, with the Department of Education’s experts continually enriching and adding to the pool of questions.
Of course, teachers would still have to endure the grumbling to which my classmates and I subjected Mr. Koonz, and make their students “eat their broccoli,” as it were. But the moaning and groaning notwithstanding, the regimen might not be such a tough sell. Students seem intuitively aware of the testing effect. In a recent paper, researchers found that “students who were tested frequently rated their classes more favorably (in course ratings at the end of the semester) than the students who were tested less frequently.”
That leaves a final consideration: the bill. The College Board does not release how much it spends developing each question for the PSAT, SAT, or A.P. exams. But assume it is even $100 per question. Extrapolating, that means half a million questions would cost Arne Duncan $50 million, out of a budget of $70 billion. We’ve spent much more to barely nudge standardized scores, whereas, if the research is right, this effort would actually deepen students’ knowledge. It seems like a smart deal.
Ezekiel J. Emanuel is a vice provost and professor at the University of Pennsylvania.