Classical test theory
Classical test theory is a body of related psychometric theory that predicts outcomes of psychological testing such as the difficulty of items or the ability of test-takers. Generally speaking, the aim of classical test theory is to understand and improve the reliability of psychological tests.
Classical test theory may be regarded as roughly synonymous with true score theory. The term "classical" not only refers to the chronology of these models but also contrasts them with the more recent psychometric theories, generally referred to collectively as item response theory, which sometimes bear the appellation "modern", as in "modern latent trait theory".
Classical test theory is based on the decomposition of observed scores into true and error scores. The theory views the observed score <math>x</math> of person <math>i</math>, denoted as <math>x_i</math>, as a realization of a random variable <math>X</math>. The person is characterized by a probability distribution over the possible realizations of this random variable. This distribution is called a "propensity distribution". Person <math>i</math>'s true score, <math>t_i</math>, is axiomatically defined as the expectation of this propensity distribution. This definition is formally stated as
(Eq. 1) <math>{\varepsilon}(X_i)=t_i.</math>
Secondly, the so-called error score for person <math>i</math>, <math>E_i</math>, is defined as the difference between <math>i</math>'s observed score and his true score:
(Eq. 2) <math>E_i=X_i - t_i.</math>
Note that <math>X_i</math> and <math>E_i</math> are random variables, but <math>t_i</math> is a constant. Also note that it directly follows from these definitions that the error score has expectation zero:
(Eq. 3) <math>{\varepsilon}(E_i)={\varepsilon}(X_i - t_i)={\varepsilon}(X_i)-{\varepsilon}(t_i)=t_i - t_i = 0.</math>
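The definitions in Eqs. 1 to 3 can be checked on a small worked example. The following is a minimal sketch, assuming a made-up discrete propensity distribution for a single person; the scores and probabilities are illustrative and do not come from any real test.

```python
from fractions import Fraction

# Hypothetical propensity distribution for one person i:
# possible observed scores mapped to their probabilities (illustrative values).
propensity = {
    8: Fraction(1, 5),
    10: Fraction(1, 2),
    12: Fraction(3, 10),
}
assert sum(propensity.values()) == 1  # must be a proper distribution

# Eq. 1: the true score is the expectation of the propensity distribution.
true_score = sum(x * p for x, p in propensity.items())

# Eq. 2: each possible error score is the observed score minus the true score.
error_distribution = {x - true_score: p for x, p in propensity.items()}

# Eq. 3: the expected error score is zero.
expected_error = sum(e * p for e, p in error_distribution.items())

print(true_score)      # 51/5
print(expected_error)  # 0
```

Exact fractions are used so that the zero expectation of the error score holds exactly rather than up to rounding.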
The above equations represent the assumptions that classical test theory makes at the level of the individual person. However, the theory is never used to analyze individual test scores; rather, the focus of the theory is on properties of test scores relative to populations of persons. Hence, the next step is to introduce a population-sampling scheme into the structure of classical test theory. When we assume that people are randomly sampled from a population, the true score becomes a random variable too, so that we get the (in)famous equation
(Eq. 4) <math>X = T + E</math>
which can be found in every introductory textbook on test theory (too often without justification).
Classical test theory is concerned with the relations between the three variables <math>X</math>, <math>T</math>, and <math>E</math> in the population. These relations are used to say something about the quality of test scores. In this regard, the most important concept is that of reliability. The reliability of the observed test scores <math>X</math>, which is denoted as <math>{\rho^2_{XT}}</math>, is defined as the ratio of true score variance <math>{\sigma^2_T}</math> to the observed score variance <math>{\sigma^2_X}</math>:
(Eq. 5) <math>{\rho^2_{XT}} = \frac{{\sigma^2_T}}{{\sigma^2_X}}.</math>
Because true scores and error scores are uncorrelated in the population, the variance of the observed scores equals the sum of the variance of true scores and the variance of error scores, so this is equivalent to
(Eq. 6) <math>{\rho^2_{XT}} = \frac{{\sigma^2_T}}{{\sigma^2_X}} = \frac{{\sigma^2_T}}{{\sigma^2_T}+{\sigma^2_E}}.</math>
This equation, which expresses the proportion of signal (true score variance) in the total variance and is thus closely related to a signal-to-noise ratio, has intuitive appeal: the reliability of test scores becomes higher as the proportion of error variance in the test scores becomes lower, and vice versa. The reliability is equal to the proportion of the variance in the test scores that we could explain if we knew the true scores. The square root of the reliability is the correlation between true and observed scores.
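The variance decomposition and the two readings of reliability can be illustrated with a small simulation. This is only a sketch under stated assumptions: true scores and errors are drawn independently from normal distributions with made-up parameters (mean 100, standard deviations 15 and 5), so the population reliability is 225 / 250 = 0.9. In real data the true scores are of course not observable.

```python
import random
from statistics import fmean


def variance(values):
    """Population variance of a sequence of numbers."""
    m = fmean(values)
    return fmean([(v - m) ** 2 for v in values])


def correlation(a, b):
    """Pearson correlation of two equally long sequences."""
    ma, mb = fmean(a), fmean(b)
    cov = fmean([(x - ma) * (y - mb) for x, y in zip(a, b)])
    return cov / (variance(a) * variance(b)) ** 0.5


rng = random.Random(0)  # fixed seed so the run is repeatable
N = 20_000              # simulated persons (assumed sample size)

true_scores = [rng.gauss(100, 15) for _ in range(N)]  # T
errors = [rng.gauss(0, 5) for _ in range(N)]          # E, independent of T
observed = [t + e for t, e in zip(true_scores, errors)]  # X = T + E (Eq. 4)

var_t, var_e, var_x = variance(true_scores), variance(errors), variance(observed)

reliability = var_t / var_x                  # Eq. 5
rho_xt = correlation(observed, true_scores)  # correlation of X with T

print(f"var_X = {var_x:.1f}, var_T + var_E = {var_t + var_e:.1f}")
print(f"reliability = {reliability:.3f}, rho_XT squared = {rho_xt ** 2:.3f}")
```

Up to sampling error, the observed score variance matches the sum of the two components, and the squared correlation between observed and true scores matches the variance ratio.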
Note that reliability is not, as is often suggested in textbooks, a fixed property of tests, but a property of test scores that is relative to a particular population. This is because test scores will not be equally reliable in every population. For instance, as is the case for any correlation, the reliability of test scores will be lowered by restriction of range. Thus, IQ-test scores that are highly reliable in the general population will be less reliable in a population of college students. Also note that test scores are perfectly unreliable for any given individual <math>i</math>, because, as has been noted above, the true score is a constant at the level of the individual, which implies it has zero variance, so that the ratio of true score variance to observed score variance, and hence reliability, is zero. The reason for this is that, in the classical test theory model, all observed variability in <math>i</math>'s scores is random error by definition (see Eq. 2). Classical test theory is relevant only at the level of populations, not at the level of individuals.
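The effect of restriction of range can be sketched with a simulation. The setup is an assumption made for illustration only: a "general population" with normally distributed true scores (mean 100, standard deviation 15) and errors (standard deviation 5), and a restricted subpopulation obtained by the hypothetical rule of keeping only persons whose true score exceeds 115. Selecting on the true score is a simplification that keeps true and error scores uncorrelated within the subpopulation.

```python
import random
from statistics import fmean


def variance(values):
    """Population variance of a sequence of numbers."""
    m = fmean(values)
    return fmean([(v - m) ** 2 for v in values])


def reliability(persons):
    """Ratio of true score variance to observed score variance (Eq. 5).

    `persons` is a list of (true_score, error_score) pairs.
    """
    true_scores = [t for t, _ in persons]
    observed = [t + e for t, e in persons]
    return variance(true_scores) / variance(observed)


rng = random.Random(1)  # fixed seed so the run is repeatable
population = [(rng.gauss(100, 15), rng.gauss(0, 5)) for _ in range(50_000)]

# Hypothetical selection rule: keep only persons with a true score above 115.
restricted = [(t, e) for t, e in population if t > 115]

rel_full = reliability(population)
rel_restricted = reliability(restricted)

print(f"general population:    {rel_full:.2f}")
print(f"restricted population: {rel_restricted:.2f}")
```

The error variance is the same in both groups, but the true score variance is much smaller in the restricted group, so the same test yields less reliable scores there.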
Reliability cannot be estimated directly, since that would require one to observe the true scores, which according to classical test theory is impossible. However, estimates of reliability can be obtained by various means. One way of estimating reliability is by constructing a so-called parallel test. A parallel test is a test that, for every individual, yields the same true score and the same observed score variance as the original test. If we have parallel tests <math>X</math> and <math>X'</math>, this means that
(Eq. 7) <math>{\varepsilon}(X_i)={\varepsilon}(X'_i)</math>
and
(Eq. 8) <math>{\sigma}^2_{E_i}={\sigma}^2_{E'_i}</math>.
Under these assumptions, it follows that the correlation between parallel test scores equals reliability (see Lord & Novick, 1968, Ch. 2, for a proof).
(Eq. 9) <math>{\rho}_{XX'}=\frac{{\sigma}_{XX'}}{{\sigma}_X{\sigma}_{X'}}=\frac{{\sigma}_T^2}{{\sigma}_X^2}={\rho}_{XT}^2.</math>
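The parallel-test result can be checked numerically. The sketch below assumes two simulated tests that share each person's true score and have independent errors with the same variance, which satisfies Eqs. 7 and 8; all distribution parameters are illustrative. The correlation between the two observed scores, which uses no true scores, is compared with the variance ratio that defines reliability.

```python
import random
from statistics import fmean


def variance(values):
    """Population variance of a sequence of numbers."""
    m = fmean(values)
    return fmean([(v - m) ** 2 for v in values])


def correlation(a, b):
    """Pearson correlation of two equally long sequences."""
    ma, mb = fmean(a), fmean(b)
    cov = fmean([(x - ma) * (y - mb) for x, y in zip(a, b)])
    return cov / (variance(a) * variance(b)) ** 0.5


rng = random.Random(2)  # fixed seed so the run is repeatable
N = 20_000              # simulated persons (assumed sample size)

true_scores = [rng.gauss(100, 15) for _ in range(N)]

# Two parallel tests: same true score, independent errors, equal error variance.
x = [t + rng.gauss(0, 5) for t in true_scores]
x_prime = [t + rng.gauss(0, 5) for t in true_scores]

rho_parallel = correlation(x, x_prime)            # observable in practice
reliability = variance(true_scores) / variance(x)  # needs the true scores

print(f"correlation between parallel tests: {rho_parallel:.3f}")
print(f"var_T / var_X:                      {reliability:.3f}")
```

Up to sampling error the two numbers agree, which is why the parallel-test correlation can stand in for the unobservable variance ratio.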
The estimation of reliability by the use of parallel tests is cumbersome, because parallel tests are very hard to come by. In practice the method is rarely used. Instead, researchers use a measure of internal consistency known as Cronbach's <math>{\alpha}</math>. Consider a test consisting of <math>k</math> items <math>u_{j}</math>, <math>j=1,\ldots,k</math>. The total test score is defined as the sum of the individual item scores, so that for individual <math>i</math>
(Eq. 10) <math>X_{i}=\sum_{j=1}^{k}U_{ij}</math>.
Then Cronbach's <math>{\alpha}</math> equals
(Eq. 11) <math>\alpha =\frac{k}{k-1}\left(1-\frac{\sum_{j=1}^{k}\sigma^{2}_{U_{j}}}{\sigma^2_{X}}\right)</math>.
Cronbach's <math>{\alpha}</math> can be shown to provide a lower bound for reliability under rather mild assumptions. Thus, the reliability of test scores in a population is at least as high as the value of Cronbach's <math>{\alpha}</math> in that population. Because <math>{\alpha}</math> can be computed from a single administration of a test, this method is empirically feasible and, as a result, very popular among researchers.
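As an illustration, Cronbach's alpha can be computed from a persons-by-items score matrix using the standard form alpha = k/(k-1) * (1 - sum of item variances / variance of total scores). The following is a minimal sketch; the function name `cronbach_alpha` and the five-person, three-item data set are made up for the example.

```python
from statistics import pvariance


def cronbach_alpha(item_scores):
    """Cronbach's alpha for a list of rows, one row of k item scores per person."""
    k = len(item_scores[0])
    if k < 2:
        raise ValueError("alpha requires at least two items")
    columns = list(zip(*item_scores))          # one tuple of scores per item
    totals = [sum(row) for row in item_scores]  # total test score per person
    item_variance_sum = sum(pvariance(col) for col in columns)
    total_variance = pvariance(totals)
    return k / (k - 1) * (1 - item_variance_sum / total_variance)


# Hypothetical scores of five persons on three items (illustrative data only).
scores = [
    [2, 3, 3],
    [4, 4, 5],
    [1, 2, 2],
    [5, 4, 5],
    [3, 2, 4],
]

alpha = cronbach_alpha(scores)
print(round(alpha, 3))  # 0.931
```

Population variances are used for both the items and the total score; using sample variances throughout would give the same value, because the correction factor cancels in the ratio.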
As has been noted above, the entire exercise of classical test theory is done to arrive at a suitable definition of reliability. Reliability is supposed to say something about the general quality of the test scores in question. The general idea is that the higher the reliability, the better. Classical test theory does not say how high reliability is supposed to be. In the literature a value over .80 appears to be deemed 'acceptable'; a value over .90 is 'good'. Values between .70 and .80 are seen as mediocre but still defensible; values below .70 are bad. It must be noted that these 'criteria' are not based on reasoned arguments but are the result of convention. Whether they make any sense is unclear.
Classical test theory is by far the most influential theory of test scores in the social sciences. In psychometrics, the theory has been superseded by the more sophisticated models of item response theory (IRT). IRT models, however, are catching on very slowly in mainstream research. One of the main reasons is the lack of widely available, user-friendly software; in addition, IRT is not included in standard statistical packages like SPSS, whereas these packages routinely provide estimates of Cronbach's <math>{\alpha}</math>. As long as this problem is not solved, classical test theory will probably remain the theory of choice for many researchers, even though it is, by psychometric standards, outdated.