Corpus linguistics is not simply the practice of using collections of texts for linguistic research. From its modern emergence in the mid-twentieth century, the subfield has been defined by a persistent, productive disagreement: what should be the relationship between corpus data and linguistic theory? Should the data merely illustrate or test pre-existing theoretical categories, or should the patterns found in corpora drive the formation of new theoretical concepts? This question has generated three distinct methodological frameworks—Corpus-Driven Linguistics, Corpus-Based Linguistics, and Neo-Firthian Corpus Linguistics—each offering a different answer. Their coexistence, competition, and mutual influence form the central intellectual history of the field.
Before the frameworks themselves crystallized, the groundwork was laid by the creation of the first machine-readable corpora. The Brown Corpus, compiled at Brown University by Henry Kučera and W. Nelson Francis and first released in 1964, was a landmark: a structured, balanced collection of one million words, designed to be representative of edited American English prose printed in 1961. The Lancaster-Oslo-Bergen (LOB) Corpus followed in the 1970s as a British English counterpart. These early resources were used primarily for descriptive and comparative studies: frequency counts, grammatical patterns, and lexicography. They did not yet challenge the dominant theoretical frameworks of the day, such as Generative Linguistics, which relied on introspection and grammaticality judgments. The corpora were seen as useful tools for certain empirical questions, but not as a basis for rethinking linguistic theory itself. This tool-oriented stance would soon be challenged.
The most radical break came in the 1980s, spearheaded by John Sinclair and the COBUILD project at the University of Birmingham. Sinclair argued that linguists had been asking the wrong question. Instead of using corpora to confirm or illustrate existing theoretical categories, researchers should let the corpus speak for itself. Patterns of word use, collocation, and phraseology should be allowed to generate new linguistic categories, not be forced into pre-existing ones. This was the core of the Corpus-Driven Linguistics framework.
Sinclair's approach was deliberately inductive. He advocated minimal annotation of corpus data, fearing that theoretical assumptions would be baked into the tagging process and thus predetermine the results. The COBUILD dictionary, first published in 1987 and based on analysis of the Birmingham corpus that was the forerunner of the Bank of English, was a practical demonstration of this philosophy: its definitions were grounded in the most frequent and typical patterns of actual usage, not in the intuitions of lexicographers. The framework responded to growing dissatisfaction with introspection-based linguistics, which often missed the systematic, phraseological nature of real language. Corpus-Driven Linguistics did not merely add data to existing methods; it proposed a new epistemology for the field, in which theory emerges from systematic observation.
The Corpus-Driven position was powerful but also demanding. Many researchers found its strict inductivism impractical or even undesirable. They wanted to use corpus data to test, refine, and extend existing theories—whether from syntax, semantics, discourse analysis, or language acquisition—without abandoning those theories entirely. This pragmatic stance crystallized in the 1990s as Corpus-Based Linguistics.
Unlike the Corpus-Driven approach, Corpus-Based Linguistics treats corpora as a resource for empirical validation and discovery within an existing theoretical framework. A researcher might use a corpus to test a hypothesis about verb argument structure derived from Generative Linguistics, or to investigate patterns of modality predicted by Functional Linguistics. The corpus is a powerful tool, but it does not dictate the theoretical vocabulary. This framework absorbed the Corpus-Driven emphasis on empirical evidence while rejecting its claim that theory must emerge solely from data. It became the most widely practiced approach because of its flexibility: it could be adapted to almost any subfield of linguistics, from historical linguistics to sociolinguistics to language teaching. Its dominance reflects a practical compromise: the data are respected, but the researcher's theoretical commitments remain in the driver's seat.
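The corpus-based workflow described above can be sketched in a few lines of Python. This is a minimal illustration, not a research-grade method: the category (a short list of modal verbs) is fixed in advance by the analyst, and the corpus only supplies the counts. The modal list, the function names, and the two sample texts are all invented for this example and are not drawn from any real corpus.

```python
import re

# Category defined BEFORE looking at the data (the corpus-based step).
# Hypothetical, deliberately short list of English modal verbs.
MODALS = {"can", "could", "may", "might", "must",
          "shall", "should", "will", "would"}


def tokenize(text):
    """Lowercase the text and split it into word tokens (crude tokenizer)."""
    return re.findall(r"[a-z']+", text.lower())


def rate_per_thousand(tokens, category):
    """Return how often members of `category` occur per 1,000 tokens."""
    if not tokens:
        return 0.0
    hits = sum(1 for tok in tokens if tok in category)
    return 1000 * hits / len(tokens)


# Invented sample texts, for illustration only; not real corpus data.
sample_a = "You must go now. You should call her."
sample_b = "She went home and slept. He may come later."

rate_a = rate_per_thousand(tokenize(sample_a), MODALS)
rate_b = rate_per_thousand(tokenize(sample_b), MODALS)

print(f"sample A: {rate_a:.1f} modals per 1,000 tokens")
print(f"sample B: {rate_b:.1f} modals per 1,000 tokens")
```

Normalizing to a rate per thousand tokens lets samples of different sizes be compared. A real study would use far larger samples and a significance test, but the division of labor is the same: the theory names the category and the corpus measures it.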
At the same time that Corpus-Based Linguistics was becoming the mainstream, a third framework was emerging from within the Corpus-Driven tradition itself. Neo-Firthian Corpus Linguistics, associated with scholars like Sinclair, Michael Stubbs, and Susan Hunston, took the Corpus-Driven commitment to data-led discovery and gave it a specific theoretical direction. It drew on the work of J.R. Firth, who had argued that meaning is created through context and collocation—that "you shall know a word by the company it keeps."
The Neo-Firthian framework transformed the Corpus-Driven method into a positive theory of phraseological meaning. Its central concept is the "unit of meaning": a multi-word item (like "the fact that" or "a matter of") whose meaning and function cannot be reduced to its individual words. These units are identified through the analysis of four kinds of patterning: collocation (the words a unit habitually co-occurs with), colligation (the grammatical patterns it favors), semantic preference (the semantic field from which its typical collocates are drawn), and semantic prosody (the evaluative or attitudinal meaning that spreads across the unit). This framework did not reject the Corpus-Driven approach; it deepened it by providing a coherent theoretical account of what corpus patterns actually reveal about language. It coexists with Corpus-Based Linguistics by occupying a different niche: it is less interested in testing existing theories and more interested in building a new, usage-based theory of meaning from the ground up.
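The first of these steps, collocation analysis, can be illustrated with a small Python sketch. It counts the words that fall within a fixed window around a node word, which is one common and simple way of operationalizing collocation; it is not presented here as the specific procedure used by Sinclair or any other scholar named above. The function names, the window size, and the toy sentences are assumptions made for the example.

```python
from collections import Counter
import re


def tokenize(text):
    """Lowercase the text and split it into word tokens (crude tokenizer)."""
    return re.findall(r"[a-z']+", text.lower())


def collocates(tokens, node, span=2):
    """Count words occurring within `span` tokens either side of `node`.

    Simplification: the window ignores sentence boundaries, so with a
    large span it can reach into a neighboring sentence.
    """
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == node:
            left = tokens[max(0, i - span):i]
            right = tokens[i + 1:i + 1 + span]
            counts.update(left + right)
    return counts


# Invented toy sentences, for illustration only; not real corpus data.
toy_text = (
    "It is only a matter of time. "
    "This is a matter of principle. "
    "It was simply a matter of time."
)

profile = collocates(tokenize(toy_text), "matter", span=2)

# List collocates from most to least frequent.
for word, freq in profile.most_common():
    print(word, freq)
```

Even in this tiny sample, "a" and "of" turn up next to every occurrence of "matter", which is the kind of recurrent patterning that leads the analyst to treat "a matter of" as a single unit. On a real corpus, raw counts would normally be supplemented by an association measure that corrects for how frequent each word is overall.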
Today, all three frameworks remain active, and their coexistence is not a sign of confusion but of a productive division of labor. Corpus-Based Linguistics is the default approach for most empirical work in linguistics. It is the method of choice for researchers who want to use corpus data to inform questions from other theoretical traditions—for example, testing claims about grammatical change in Historical-Comparative Linguistics, or investigating discourse strategies in Critical Discourse Analysis. Its strength is its flexibility and compatibility.
Corpus-Driven Linguistics continues as a minority but influential stance, particularly in lexicography, phraseology, and language teaching. Its insistence on minimal annotation and data-led discovery remains a critical check against the danger of theory-driven confirmation bias. It is the framework most likely to produce genuinely unexpected findings about language use.
Neo-Firthian Corpus Linguistics has become the most theoretically ambitious of the three. It is the leading framework for research on phraseology, semantic prosody, and the extended units of meaning that make up natural discourse. It has also found a natural ally in Usage-Based Linguistics and Cognitive Linguistics, which share its commitment to deriving linguistic categories from actual usage events rather than from innate grammatical structures.
All three frameworks agree on a foundational point: linguistic analysis must be grounded in empirical evidence from actual language use, not solely in introspection or idealized data. They share a commitment to the corpus as a central resource. Their disagreements center on the role of pre-existing theory. Corpus-Based Linguistics sees theory and data as partners, with theory guiding the inquiry. Corpus-Driven Linguistics insists that theory should be a product of the data, not a precondition. Neo-Firthian Corpus Linguistics agrees with the data-first principle but goes further by developing a specific theoretical apparatus—the unit of meaning—that it claims is uniquely discoverable through corpus methods. The debate is not about whether data matters, but about how much authority the data should have in shaping the concepts we use to describe language.
The internal debates of corpus linguistics are not isolated from the rest of the discipline. The rise of usage-based approaches in linguistics—from Cognitive Linguistics to Construction Grammar—has created a broader intellectual environment in which the Corpus-Driven and Neo-Firthian insistence on data-led theory formation resonates strongly. At the same time, the practical dominance of Corpus-Based methods has made corpus data a standard resource across almost every subfield of linguistics, from phonetics to pragmatics. The frameworks of corpus linguistics thus represent not just a set of technical choices, but a continuing argument about the nature of linguistic evidence and the proper relationship between observation and theory—an argument that lies at the heart of the discipline itself.