Educational assessment has always been pulled between two competing pressures: the need to compare students against each other or against fixed standards, and the desire to use assessment to improve learning. This tension between measurement and improvement, between accountability and growth, has driven the development of the field's major frameworks. Each framework emerged not in isolation but as a response to the limitations of earlier approaches, and today several of them coexist, each serving a different purpose in the educational system.
For most of the twentieth century, the dominant framework for thinking about tests was Classical Test Theory (CTT). CTT rests on a simple equation: a student's observed score on a test equals their true score plus random error. The framework was applied chiefly to norm-referenced interpretation, that is, comparing a student's performance to that of a larger group. The reliability of a test, in CTT, depends heavily on the sample of students who take it and the length of the test. A test that works well for one group may not work well for another, and item-level analysis is limited because item statistics such as difficulty (the proportion of students answering correctly) and discrimination are tied to the particular sample that produced them. CTT provided the statistical language for large-scale standardized testing, but its sample-dependency meant that comparisons across different populations were always uncertain.
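The CTT equation and the link between reliability and test length can be made concrete with a short sketch. The decomposition of observed-score variance into true-score and error variance, and the Spearman-Brown formula for a lengthened test, are standard CTT results; the function names and the variance figures below are illustrative assumptions, not values from any real test.

```python
def reliability_from_variances(true_var: float, error_var: float) -> float:
    """CTT: observed score X = true score T + error E.
    Reliability is the share of observed-score variance due to true scores."""
    if true_var < 0 or error_var < 0 or true_var + error_var == 0:
        raise ValueError("variances must be non-negative and not both zero")
    return true_var / (true_var + error_var)


def spearman_brown(reliability: float, length_factor: float) -> float:
    """Predicted reliability when the test is lengthened by `length_factor`
    using comparable (parallel) items (Spearman-Brown formula)."""
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability must lie in [0, 1]")
    if length_factor <= 0:
        raise ValueError("length_factor must be positive")
    return (length_factor * reliability) / (
        1.0 + (length_factor - 1.0) * reliability
    )


# Hypothetical variances for one sample of students (assumed numbers).
base = reliability_from_variances(true_var=9.0, error_var=6.0)
doubled = spearman_brown(base, length_factor=2.0)
print(f"original test: {base:.2f}, doubled in length: {doubled:.2f}")
```

Because the true-score variance is a property of the group tested, the same test given to a more homogeneous sample would yield a lower reliability, which is the sample-dependency described above.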
By the 1960s, educators began to ask a different question: instead of asking how a student compares to others, why not ask whether a student has mastered a specific set of knowledge or skills? Criterion-Referenced Assessment emerged from this pressure. Unlike norm-referenced tests, which rank students relative to one another, criterion-referenced tests measure performance against a fixed standard or learning objective. A student's score is interpreted not in relation to peers but in relation to a predefined mastery threshold. This shift was closely tied to the mastery learning movement, which argued that most students could learn if given enough time and appropriate instruction. Criterion-referenced logic did not replace CTT; both continued to be used for different purposes. But it introduced a fundamentally different way of thinking about what test scores mean.
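The difference between the two interpretations can be illustrated by reading one score in both ways. This is a minimal sketch: the comparison group, the cut score of 80, and the function names are hypothetical and chosen only for illustration.

```python
def percentile_rank(score: float, group_scores: list[float]) -> float:
    """Norm-referenced reading: percentage of the comparison group
    scoring strictly below this student."""
    if not group_scores:
        raise ValueError("comparison group must not be empty")
    below = sum(1 for s in group_scores if s < score)
    return 100.0 * below / len(group_scores)


def has_mastered(score: float, cut_score: float) -> bool:
    """Criterion-referenced reading: has the student reached the
    predefined mastery threshold, regardless of how peers performed?"""
    return score >= cut_score


# Hypothetical data (assumed): ten peers and a mastery threshold of 80.
group = [52, 58, 61, 64, 70, 73, 77, 81, 85, 90]
student_score = 73

print(percentile_rank(student_score, group))       # position relative to peers
print(has_mastered(student_score, cut_score=80))   # position relative to the standard
```

The same score yields two different statements: where the student stands in the group, and whether the learning objective has been met. Changing the comparison group alters only the first.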
While criterion-referenced assessment changed the purpose of testing, Item Response Theory (IRT), which emerged around 1960 and remains dominant today, transformed the technical machinery. IRT models the probability of a correct response as a function of both the student's ability and the characteristics of the item: difficulty, discrimination, and the chance of guessing. Unlike CTT, IRT produces item-level statistics that are, provided the model fits the data, invariant across samples, meaning that items can be calibrated once and then used with different populations. This sample-independence made IRT the preferred framework for large-scale testing programs, adaptive testing, and test equating. IRT did not reject CTT outright; rather, it absorbed the goal of precise measurement and achieved it with more powerful statistical tools. Today, IRT provides the psychometric infrastructure for most high-stakes standardized assessments worldwide.
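One common way to write this relationship is the three-parameter logistic (3PL) model, sketched below. The parameter names (theta for ability, a for discrimination, b for difficulty, c for the guessing floor) follow widespread convention, but the item values are invented for illustration, and operational programs may use other IRT models.

```python
import math


def p_correct(theta: float, a: float, b: float, c: float = 0.0) -> float:
    """3PL item response function.

    theta: student ability
    a:     item discrimination (steepness of the curve)
    b:     item difficulty (ability at which the curve is steepest)
    c:     lower asymptote, the chance of guessing correctly
    """
    if not 0.0 <= c < 1.0:
        raise ValueError("guessing parameter c must lie in [0, 1)")
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))


# Hypothetical four-option multiple-choice item (assumed parameters).
item = {"a": 1.2, "b": 0.5, "c": 0.25}

for theta in (-2.0, 0.0, 0.5, 2.0):
    print(f"ability {theta:+.1f}: P(correct) = {p_correct(theta, **item):.3f}")
```

The item parameters describe the item rather than any particular group of test-takers, which is why, when the model fits, a calibrated item can be reused with a different population. When ability equals the difficulty parameter, the probability sits halfway between the guessing floor and 1.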
In 1967, Michael Scriven introduced a distinction that would reshape the field: the difference between formative and summative evaluation. Summative Assessment judges the final outcome of learning, such as a grade, a certification, or a pass-or-fail decision. Formative Assessment, by contrast, is designed to provide feedback during the learning process, helping both teachers and students adjust their efforts. This was not a technical refinement like IRT but a conceptual reorientation. Formative assessment challenged the assumption that assessment's primary role was measurement. Instead, it argued that assessment could be a tool for improvement. The two frameworks did not compete directly; they served different moments in the instructional cycle. But the tension between them, improvement versus accountability, became a central debate in the field. Formative assessment has since been widely adopted in classrooms, though its implementation often remains uneven.
By the 1990s, critics of traditional testing argued that multiple-choice questions and decontextualized items failed to capture meaningful learning. Authentic Assessment emerged from this critique, insisting that assessment tasks should mirror the real-world challenges students will face outside school—writing a persuasive essay, conducting a science experiment, solving a complex problem. Closely related, Performance Assessment focuses on observable demonstrations of skill, such as a presentation, a portfolio, or a lab practical. The two frameworks overlap heavily: both reject the artificiality of conventional tests and emphasize direct evidence of competence. Where they differ is in emphasis—authentic assessment stresses the real-world relevance of the task, while performance assessment stresses the observable nature of the demonstration. Neither framework replaced earlier approaches; instead, they carved out a space for assessment that prioritizes depth and application over efficiency and standardization.
Also emerging in the 1990s, Standards-Based Assessment took the logic of criterion-referenced assessment and scaled it to the level of entire educational systems. Instead of individual teachers setting their own mastery criteria, standards-based systems define what all students should know and be able to do at each grade level, then assess whether they have met those standards. This framework provided the backbone for national and state accountability systems, such as the No Child Left Behind Act in the United States. Standards-Based Assessment drew heavily on IRT for its technical infrastructure—IRT made it possible to compare results across years, schools, and populations. It also coexists with summative assessment, since standards-based tests are typically administered at the end of a learning period for accountability purposes. The framework remains dominant in policy contexts, though it has been criticized for narrowing the curriculum and encouraging teaching to the test.
Around 2000, a further evolution of formative assessment crystallized under the label Assessment for Learning (AfL). While formative assessment was often teacher-directed—the teacher gathers information and adjusts instruction—AfL places the student at the center. It emphasizes self-assessment, peer feedback, and the use of clear learning targets so that students can monitor their own progress. AfL extends formative assessment by arguing that the ultimate purpose of assessment is to develop students' capacity to learn independently. It also challenges the teacher-dominated model of formative assessment, insisting that students must be active participants, not just recipients of feedback. AfL does not reject summative or standards-based assessment; instead, it argues that classroom assessment should be primarily for learning, not merely of learning. This framework remains influential in research and professional development, though it has been harder to implement at scale than its proponents hoped.
Today, no single framework dominates educational assessment. Instead, the field operates as a pluralistic ecosystem. Item Response Theory remains the technical standard for large-scale testing, providing the statistical tools that make standards-based accountability systems possible. Standards-Based Assessment continues to drive policy and school improvement efforts in many countries. Assessment for Learning has become the most influential framework for thinking about classroom practice, shaping how teachers approach feedback, student agency, and the purpose of daily assessment. Authentic and Performance Assessment remain influential in specific contexts, such as portfolio assessments in the arts, project-based learning in STEM, and competency-based credentialing.
What the leading frameworks agree on is that assessment should be purposeful: it should serve clear goals, whether those are accountability, certification, or learning improvement. They also agree that technical quality matters—reliability, validity, and fairness are not optional. Where they disagree is on the primary audience and purpose. IRT and standards-based assessment prioritize system-level comparability and accountability; their audience is policymakers, administrators, and the public. Assessment for Learning prioritizes the individual learner and the classroom; its audience is students and teachers. This division of labor is not always comfortable—accountability pressures can undermine classroom formative practices, and a focus on student agency can seem at odds with standardized testing. But the coexistence of these frameworks reflects the reality that assessment serves multiple, sometimes conflicting, purposes. The history of educational assessment is not a story of one framework triumphing over others, but of an expanding toolkit, each tool suited to a different job.
The frameworks described here continue to evolve. IRT is being extended by more complex models that measure growth over time or assess multidimensional skills. Standards-based systems are grappling with how to include performance tasks and authentic assessments without sacrificing comparability. Assessment for Learning is being integrated with digital platforms that provide real-time feedback. The central tension between measurement and improvement remains unresolved, and that is likely to keep the field dynamic. Students entering educational assessment today will find a field that is technically sophisticated, philosophically contested, and deeply consequential for how societies define and measure learning.