The two big criteria for judging tests are reliability and validity. Reliability refers to the consistency of a test's results each time it is given. For example, psychometric tests attempt to gain a picture of a person's more permanent personality traits. However, if a test question asked me whether I'd rather be in a crowd or alone, my answer would vary with my mood. This is why psychometric tests tend to ask the same questions, worded slightly differently, over and over and over again. For my part, I get a little peeved at this repetition and have a perverse desire to answer the copycat questions in diametrically opposite directions.
The achievement tests that have become so much a part of education base their questions on knowledge and skills. These tests tend to pose less of a challenge to reliability as long as they are composed of that old standby, multiple choice. But questions of validity still arise.
Testing anything other than knowledge of memorized facts and word meanings seems difficult. One tactic is to have students read a selection and base their answers on it. However, actually knowing about the subject of a history or science selection sometimes causes students to err, because they answer from what they know rather than from what the selection reports. Worse, reading selections and questions do not even have to be factual. In this age of constant revision, it appears that we can count on only what we have just read to be true. This is not a good belief to reinforce in students through tests.
The next tactic is to test higher-level thinking skills by adding writing to achievement tests. But that means human scorers are needed to rate these compositions, and inter-rater reliability comes into play. It's not easy to set up scoring criteria so that multiple raters will rate the same compositions in the same manner; in fact, it is impossible. Every composition has to be read and rated by multiple people. So we're back to the same problem of inconsistent results for the same tests.
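Inter-rater agreement can at least be measured. One standard statistic is Cohen's kappa, which compares how often two raters actually agree against how often they would agree by chance alone. A minimal sketch in Python (the rating scale and the two raters' scores below are made-up illustrations, not data from any real test):

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa: agreement between two raters, corrected for chance."""
    assert len(rater_a) == len(rater_b) and rater_a
    n = len(rater_a)
    # Observed agreement: fraction of items both raters scored identically.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected chance agreement, from each rater's marginal score frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(counts_a[score] * counts_b[score] for score in counts_a) / (n * n)
    return (p_o - p_e) / (1 - p_e)

# Hypothetical example: two raters score six essays on a 1-4 scale.
rater_1 = [3, 4, 2, 3, 1, 4]
rater_2 = [3, 3, 2, 3, 1, 4]
print(round(cohens_kappa(rater_1, rater_2), 2))  # prints 0.77
```

A kappa of 1.0 means perfect agreement and 0 means no better than chance; values in between quantify exactly the inconsistency the essay describes, without eliminating it.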
Artwork by S.L. Listman