How do you measure something you cannot see? Intelligence, personality, attitudes, and knowledge are not like height or weight. They leave no direct trace on a ruler or a scale. Psychometrics is the field that tries to solve this problem: it builds and tests models that turn observable responses—answers to test questions, ratings on a questionnaire—into quantitative estimates of unobservable mental attributes. The history of psychometrics is a story of increasingly sophisticated answers to the question of what a measurement model should look like, and each new framework has changed the relationship between theory, data, and inference.
The earliest systematic attempt to model mental attributes statistically emerged in the early twentieth century. Researchers such as Charles Spearman and later Louis Thurstone noticed that scores on different mental tests tended to correlate. Spearman’s solution was the first formal factor model: a single general intelligence factor, g, explained the common variance among tests, while each test also had its own unique variance. Thurstone later argued for multiple primary mental abilities rather than one general factor. What united these early factor analysts was a shared method: they began with a matrix of observed correlations and used mathematical procedures to extract a smaller number of latent variables—factors—that could reproduce those correlations.
This Factor Analytic Tradition was fundamentally exploratory. The researcher did not start with a strong theory about how many factors existed or what they meant; the data themselves suggested the structure. The goal was to discover the underlying dimensions of mental ability, not to test a pre-specified model. The tradition remains alive today as a data-reduction tool, but its limitations soon became clear. Factor analysis could tell you that a set of items correlated, but it offered no formal account of how measurement error worked or how the properties of individual items affected the scores people received.
Classical Test Theory (CTT), developed roughly from the 1930s through the 1970s, addressed the error problem directly. Its core equation is deceptively simple: an observed score equals a true score plus error. The true score is defined as the expected value of a person’s observed score over infinitely many independent repeated administrations of the same test, a hypothetical quantity that CTT never claimed to observe directly. What CTT gave the field was a practical vocabulary for talking about reliability (the proportion of observed-score variance that is true-score variance) and a set of formulas for estimating it.
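The core equation and the definition of reliability can be written compactly. The symbols below ($X$ for the observed score, $T$ for the true score, $E$ for error, and $\rho_{XX'}$ for reliability) follow common textbook notation and are an assumption here, since the text itself uses no symbols.

```latex
\begin{aligned}
X &= T + E, \qquad T = \mathbb{E}[X], \qquad \mathbb{E}[E] = 0, \\
\rho_{XX'} &= \frac{\sigma_T^2}{\sigma_X^2}
            = \frac{\sigma_T^2}{\sigma_T^2 + \sigma_E^2}.
\end{aligned}
```

The second equality relies on true scores and errors being uncorrelated across persons, which follows from defining the true score as an expected value.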
CTT’s great strength was its simplicity. It required only basic statistics and worked well for whole-test scores. A test developer could compute Cronbach’s alpha, estimate the standard error of measurement, and report how consistently the test ranked people. But CTT had a serious limitation: its statistics were both test-dependent and sample-dependent. An item’s difficulty could only be defined relative to the particular sample that answered it, and a person’s ability could only be defined relative to the particular test they took. If you gave a harder test to the same group, each person’s estimated ability would change, not because they had changed but because the metric had shifted. This dependence on the specific test and sample made it difficult to compare scores across different tests or to build large item banks that could be used adaptively.
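As a minimal sketch of the whole-test statistics mentioned above, the following computes Cronbach’s alpha and the standard error of measurement using only the Python standard library. The function names and the small response matrix are hypothetical and chosen purely for illustration; the formulas are the usual textbook ones, with sample variances.

```python
from math import sqrt
from statistics import variance


def cronbach_alpha(scores):
    """Cronbach's alpha for a persons-by-items table (list of rows)."""
    k = len(scores[0])  # number of items
    # Sample variance of each item (column) across persons.
    item_vars = [variance([row[j] for row in scores]) for j in range(k)]
    # Sample variance of the total (sum) score across persons.
    total_var = variance([sum(row) for row in scores])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)


def standard_error_of_measurement(scores, reliability):
    """SEM = SD of total scores * sqrt(1 - reliability)."""
    total_sd = sqrt(variance([sum(row) for row in scores]))
    return total_sd * sqrt(1 - reliability)


# Hypothetical ratings: 5 persons (rows) answering 4 items (columns).
responses = [
    [2, 3, 3, 2],
    [4, 4, 5, 4],
    [1, 2, 1, 2],
    [3, 3, 4, 3],
    [5, 4, 5, 5],
]

alpha = cronbach_alpha(responses)
sem = standard_error_of_measurement(responses, alpha)
print(f"alpha = {alpha:.3f}, SEM = {sem:.3f}")
```

Both numbers describe this test in this sample; computing them on a different group or a different set of items would generally give different values, which is the dependence discussed above.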
CTT never disappeared. It is still the default framework for many classroom tests, survey scales, and applied settings where the assumptions are good enough and the simplicity is an advantage. But its limitations created pressure for a more powerful approach.
Item Response Theory (IRT), which began to take shape in the 1950s and became dominant from the 1970s onward, solved the sample-dependence problem by modeling the relationship between a person’s latent trait and their probability of answering a given item correctly. Instead of summing items into a total score and calling that the measurement, IRT estimates a separate mathematical function, an item characteristic curve, for each item. The most common models are the Rasch model, which gives each item a single difficulty parameter, and the two-parameter logistic model, which adds a discrimination parameter. A person’s ability is estimated independently of which particular items they happened to take, as long as the items are calibrated on the same scale.
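An item characteristic curve under the two-parameter logistic model can be sketched in a few lines. The function name, parameter names, and example items below are illustrative assumptions. Fixing the discrimination at 1 for every item gives the Rasch form.

```python
from math import exp


def icc_2pl(theta, difficulty, discrimination=1.0):
    """Probability of a correct response under the two-parameter logistic model.

    theta          -- the person's latent trait level
    difficulty     -- the trait level at which the probability is 0.5
    discrimination -- how steeply the curve rises around the difficulty
    """
    return 1.0 / (1.0 + exp(-discrimination * (theta - difficulty)))


# Hypothetical items, for illustration only.
easy_item = {"difficulty": -1.0, "discrimination": 1.0}
sharp_item = {"difficulty": 0.5, "discrimination": 2.0}

for theta in (-2.0, 0.0, 2.0):
    p_easy = icc_2pl(theta, **easy_item)
    p_sharp = icc_2pl(theta, **sharp_item)
    print(f"theta={theta:+.1f}  P(easy)={p_easy:.3f}  P(sharp)={p_sharp:.3f}")
```

Because person and item parameters sit on the same scale, the probability depends only on the distance between ability and difficulty, scaled by discrimination.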
This shift from test-level to item-level modeling was transformative. It made possible computer-adaptive testing, where each subsequent item is chosen based on the person’s current ability estimate, dramatically shortening tests without losing precision. It also enabled large-scale item banking and equating across test forms. But IRT came with stronger assumptions than CTT. Most IRT models assume unidimensionality (all items measure a single latent trait) and local independence (responses to different items are statistically independent of one another once the trait is held fixed). When these assumptions are violated, the model can produce misleading results. IRT also requires larger sample sizes for stable item parameter estimation than CTT does.
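The item-selection step in adaptive testing can be sketched as follows. The text does not say which selection rule is used, so this example assumes one common choice: pick the unused item with the highest Fisher information at the current ability estimate, under a two-parameter logistic model. The item bank and all names are hypothetical.

```python
from math import exp


def prob_correct(theta, a, b):
    """2PL probability of a correct response (a: discrimination, b: difficulty)."""
    return 1.0 / (1.0 + exp(-a * (theta - b)))


def item_information(theta, a, b):
    """Fisher information of a 2PL item at trait level theta: a^2 * P * (1 - P)."""
    p = prob_correct(theta, a, b)
    return a * a * p * (1.0 - p)


def select_next_item(theta, bank, administered):
    """Return the id of the not-yet-administered item most informative at theta."""
    candidates = [item_id for item_id in bank if item_id not in administered]
    return max(candidates, key=lambda item_id: item_information(theta, *bank[item_id]))


# Hypothetical calibrated bank: item id -> (discrimination, difficulty).
bank = {
    "easy": (1.0, -1.5),
    "medium": (1.0, 0.0),
    "hard": (1.0, 1.5),
}

first = select_next_item(0.0, bank, administered=set())
second = select_next_item(1.4, bank, administered={first})
print(first, second)
```

A full adaptive test would also re-estimate ability after each response and apply a stopping rule; those steps are omitted here to keep the sketch focused on selection.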
IRT and CTT now coexist in a practical division of labor. IRT is the standard for high-stakes testing, licensure exams, and large-scale assessments like the SAT or PISA. CTT remains common in smaller-scale research and survey development where sample sizes are modest and the need for item-level precision is lower. The two frameworks share the same basic goal—quantifying latent attributes—but they differ in how they handle the relationship between items, persons, and error.
Structural Equation Modeling (SEM), which emerged in the 1970s and expanded rapidly in the 1980s and 1990s, grew out of the same factor-analytic roots as IRT but took a different direction. Where IRT focused on precise measurement of a single latent trait, SEM integrated measurement with causal hypothesis testing. An SEM model typically has two parts: a measurement model, which specifies how observed variables relate to latent factors (confirmatory factor analysis, or CFA), and a structural model, which specifies causal or correlational paths among the latent factors themselves.
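The two parts of an SEM model are often written in matrix form. The notation below follows a widely used LISREL-style convention and is an assumption here, since the text introduces no symbols: $x$ and $y$ are observed indicators, $\xi$ and $\eta$ are exogenous and endogenous latent factors, $\Lambda_x$ and $\Lambda_y$ hold factor loadings, $B$ and $\Gamma$ hold structural path coefficients, and $\delta$, $\varepsilon$, $\zeta$ are error terms.

```latex
\begin{aligned}
\text{Measurement model:} \quad
  & x = \Lambda_x \xi + \delta, \qquad y = \Lambda_y \eta + \varepsilon, \\
\text{Structural model:} \quad
  & \eta = B\eta + \Gamma \xi + \zeta.
\end{aligned}
```

Setting aside the structural equation and keeping only the measurement equations gives a confirmatory factor analysis.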
This integration was a major departure from earlier frameworks. The Factor Analytic Tradition had been exploratory; SEM was confirmatory. A researcher had to specify the entire model—which items loaded on which factors, which factors predicted which other factors—before seeing the data, and then test how well that model fit. CTT and IRT, by contrast, focused on measurement alone and left the relationships among latent variables to a separate analysis step. SEM brought measurement and theory testing into a single statistical framework, allowing researchers to ask whether their theoretical model was consistent with the observed covariance matrix.
SEM did not replace IRT; the two frameworks serve different primary purposes. IRT is optimized for item calibration, test construction, and person scoring. SEM is optimized for testing theories about the structure of latent variables and their causal connections. In practice, researchers often use both: IRT to build reliable scales and SEM to test hypotheses about how those scales relate to each other. The tension between them is not a rivalry but a difference in what question each framework is designed to answer.
The most recent framework, Computational Psychometrics, began to take shape around the turn of the twenty-first century and has accelerated with the availability of large-scale digital data. Traditional psychometric models were designed for relatively small numbers of items administered under controlled conditions. Computational Psychometrics extends measurement to complex, unstructured data: keystroke logs from online learning platforms, speech patterns from interviews, eye-tracking data, or sequences of actions in educational games.
What makes Computational Psychometrics distinctive is not a single model but a methodological stance. It integrates traditional psychometric models—IRT and SEM are often used as components—with machine learning techniques such as neural networks, Bayesian nonparametrics, and natural language processing. The goal is to extract latent variables from data that do not come in the form of clean item responses. For example, a researcher might use a recurrent neural network to model a student’s learning trajectory from clickstream data and then map that trajectory onto a latent ability scale using an IRT-like model.
This framework represents a synthesis rather than a replacement. It absorbs the insights of earlier frameworks—the need for latent variables, the importance of modeling error, the value of item-level information—while adding tools for handling high-dimensional, sequential, or multimodal data. The tension within Computational Psychometrics is between theory-driven and data-driven approaches. Some researchers argue that machine learning models should be constrained by psychometric theory to ensure interpretability and fairness; others argue that the complexity of real-world data requires flexible models that may not fit neatly into traditional latent-variable frameworks.
Today, IRT, SEM, and Computational Psychometrics are all active research traditions, and they are not always in agreement. The main point of consensus is that measurement requires a model—a formal link between observed responses and latent attributes—and that the quality of that model must be evaluated empirically. No serious psychometrician today would rely on raw sum scores without considering reliability, dimensionality, or item properties.
The main disagreement concerns the role of a priori theory. IRT and SEM are theory-driven: the researcher specifies the model before seeing the data, and the data are used to test or calibrate that model. Computational Psychometrics, especially in its more machine-learning-oriented variants, is often data-driven: the model is discovered or learned from the data, and theory enters afterward as interpretation. This is a living disagreement, not a settled one. Proponents of theory-driven approaches worry that data-driven models can overfit, produce uninterpretable latent spaces, or encode biases present in the training data. Proponents of data-driven approaches argue that traditional models are too restrictive for the kinds of rich, dynamic data that modern technology produces.
In practice, the frameworks coexist. A large-scale assessment program might use IRT for item calibration, SEM for construct validation, and Computational Psychometrics for analyzing process data from computer-based tasks. The field has not converged on a single framework, and it probably never will. Different measurement problems require different tools, and the history of psychometrics suggests that each new framework expands the toolkit rather than discarding the old ones.