For most of the twentieth century, a radiologist’s report was a story. The story described shapes, densities, and shadows, but it was told in words—and two radiologists looking at the same film often told different stories. This tension between subjective visual judgment and the desire for objective, reproducible measurement defines the history of quantitative imaging. Over the past century, five distinct frameworks have emerged, each pushing the field further from the gestalt of the human eye toward numerical biomarkers that can be compared across patients, scanners, and institutions.
From the earliest X-ray films through the rise of computed tomography (CT) and magnetic resonance imaging (MRI), the dominant mode of image interpretation was qualitative. Radiologists trained to recognize patterns: a nodule’s shape, the texture of a liver, the symmetry of a brain. They relied on perceptual heuristics and clinical experience, not on pixel values. The film-based workflow made numerical extraction cumbersome; even after CT in the 1970s and MRI in the 1980s began producing digital pixel matrices, the clinical workflow remained anchored to visual inspection.
Qualitative reading was not without strengths. It was fast, flexible, and could integrate contextual knowledge—a patient’s age, symptoms, prior exams—into a single impression. But its Achilles’ heel was variability. Studies repeatedly showed that inter-reader agreement for tasks such as detecting pulmonary nodules or grading tumor response was modest at best. This variability became a practical pressure: how could imaging serve as a reliable endpoint in clinical trials or as a consistent guide to therapy if the same image could yield different conclusions?
The first systematic answer to that question was region-of-interest (ROI) measurement. As CT and MRI provided digital pixel values (Hounsfield units with a calibrated physical meaning for CT, relative signal intensities for MRI), clinicians began drawing circles or polygons on a single image slice and computing summary statistics: mean attenuation, standard deviation, total area. These numbers could be compared across time points to track tumor shrinkage or disease progression.
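As a rough illustration of what such an ROI measurement computes, the sketch below draws a circular ROI on a synthetic slice and reports the three summary statistics named above. The slice, the lesion, the pixel spacing, and the function names `circular_roi_mask` and `roi_statistics` are all assumptions made for this example; they are not taken from any scanner console or library.

```python
import numpy as np

def circular_roi_mask(shape, center, radius):
    """Boolean mask selecting pixels within `radius` of `center` (row, col)."""
    rows, cols = np.ogrid[:shape[0], :shape[1]]
    return (rows - center[0]) ** 2 + (cols - center[1]) ** 2 <= radius ** 2

def roi_statistics(image_hu, mask, pixel_spacing_mm=(1.0, 1.0)):
    """Mean, sample standard deviation, and area of the pixels inside the mask."""
    values = image_hu[mask]
    pixel_area = pixel_spacing_mm[0] * pixel_spacing_mm[1]
    return {
        "mean_hu": float(values.mean()),
        "std_hu": float(values.std(ddof=1)),
        "area_mm2": float(mask.sum() * pixel_area),
    }

# Synthetic slice (assumed values): background near 40 HU, lesion near 80 HU.
rng = np.random.default_rng(0)
slice_hu = rng.normal(40.0, 5.0, size=(128, 128))
lesion = circular_roi_mask(slice_hu.shape, center=(64, 64), radius=10)
slice_hu[lesion] += 40.0

# A reader places a slightly smaller ROI inside the lesion.
roi = circular_roi_mask(slice_hu.shape, center=(64, 64), radius=8)
stats = roi_statistics(slice_hu, roi, pixel_spacing_mm=(0.7, 0.7))
print(stats)
```

Moving the ROI center or radius by a few pixels changes which pixels are averaged, and therefore the reported values.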
ROI measurement coexisted with qualitative reading rather than replacing it. Radiologists still interpreted the overall image, but they supplemented their report with a few numerical values. The approach was simple and required no specialized software beyond the scanner console. Yet it had serious limitations. Manual ROI placement was operator-dependent; different readers might include different portions of a lesion or exclude necrotic regions. Moreover, a single mean value discarded all spatial information within the ROI—two lesions with identical average density but very different internal textures would yield the same number.
Texture analysis emerged to capture the spatial variation that ROI measurement ignored. Instead of collapsing a region into a single number, texture methods compute statistical properties of pixel intensity patterns. The gray-level co-occurrence matrix (GLCM), introduced in the 1970s and applied to medical images in the 1980s, quantifies how often pairs of pixels with specific intensities appear at a given distance and orientation. From the GLCM, features such as contrast, correlation, energy, and homogeneity can be derived. Histogram-based features—skewness, kurtosis, entropy—add further descriptors of the intensity distribution.
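The following is a minimal NumPy sketch of a co-occurrence matrix for a single pixel offset, with three features derived from it. It is an illustration under stated assumptions, not a reference implementation: the function names `glcm` and `glcm_features` are hypothetical, the image is assumed to be already quantized to integer gray levels, and naming conventions differ between libraries (some report the square root of the angular second moment as "energy"). The two test patches are built to have the same mean intensity but different spatial arrangements.

```python
import numpy as np

def glcm(image, levels, offset=(0, 1), symmetric=True):
    """Normalized co-occurrence matrix for one (row, col) pixel offset.

    `image` must contain integer gray levels in the range [0, levels).
    """
    d_row, d_col = offset
    n_rows, n_cols = image.shape
    # Crop so that every reference pixel has a neighbor inside the image.
    r0, r1 = max(0, -d_row), min(n_rows, n_rows - d_row)
    c0, c1 = max(0, -d_col), min(n_cols, n_cols - d_col)
    ref = image[r0:r1, c0:c1]
    nbr = image[r0 + d_row:r1 + d_row, c0 + d_col:c1 + d_col]
    counts = np.zeros((levels, levels), dtype=np.float64)
    # Count each (reference level, neighbor level) pair.
    np.add.at(counts, (ref.ravel(), nbr.ravel()), 1.0)
    if symmetric:
        counts = counts + counts.T
    return counts / counts.sum()

def glcm_features(p):
    """Contrast, angular second moment, and homogeneity of a normalized GLCM."""
    i, j = np.indices(p.shape)
    return {
        "contrast": float(np.sum(p * (i - j) ** 2)),
        "angular_second_moment": float(np.sum(p ** 2)),
        "homogeneity": float(np.sum(p / (1.0 + (i - j) ** 2))),
    }

# Two 8x8 patches with the same mean (1.5) but different textures.
checker = (np.indices((8, 8)).sum(axis=0) % 2) * 3   # alternating 0 and 3
split = np.zeros((8, 8), dtype=int)
split[:, 4:] = 3                                      # left half 0, right half 3

f_checker = glcm_features(glcm(checker, levels=4))
f_split = glcm_features(glcm(split, levels=4))
print(checker.mean(), split.mean())
print(f_checker)
print(f_split)
```

For the horizontal offset used here, the checkerboard has a contrast of 9 and the split patch a contrast of 9/7 (about 1.29), even though a mean-only ROI measurement would report the same value for both.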
Texture analysis did not reject ROI measurement; it extended it. The ROI remained the spatial unit, but the feature set expanded from one or two numbers to dozens. This paradigm proved especially useful in oncology, where tumor heterogeneity—visible as texture on CT or MRI—correlated with aggressiveness and treatment response. The approach remains active today, often serving as the substrate for later frameworks. Its main limitation was that feature extraction and selection were manual and hypothesis-driven; researchers chose a handful of features based on prior knowledge, leaving potentially informative patterns undiscovered.
Radiomics took the logic of texture analysis and scaled it to an industrial level. Instead of a dozen handpicked features, radiomics pipelines extract hundreds—sometimes thousands—of quantitative descriptors from segmented volumes: shape, intensity, texture, and wavelet-transformed features. These features are then fed into machine learning classifiers to predict outcomes such as survival, mutation status, or treatment response.
The shift from texture analysis to radiomics was one of absorption and expansion. Radiomics absorbed the entire texture-analysis toolkit and added new families of features (e.g., Laplacian-of-Gaussian filters, fractal dimensions). It also introduced a new pressure: reproducibility. Because feature values can be sensitive to image acquisition parameters, reconstruction algorithms, and segmentation methods, the Image Biomarker Standardisation Initiative (IBSI) was launched in 2016 to establish reference standards. Despite these efforts, radiomics has faced criticism for overfitting. With hundreds of features and often small patient cohorts, many published radiomics signatures have failed to replicate in independent datasets. The framework’s strength—its ability to mine high-dimensional data—is also its vulnerability.
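The overfitting risk can be illustrated with a small simulation. The sketch below uses pure random noise in place of real radiomics features, so by construction no feature carries any information about the outcome. The cohort sizes, the feature count, and the helper names `abs_correlations` and `noise_cohort` are assumptions chosen for illustration only.

```python
import numpy as np

def abs_correlations(features, labels):
    """Absolute Pearson correlation between each feature column and the labels."""
    f = (features - features.mean(axis=0)) / features.std(axis=0)
    y = (labels - labels.mean()) / labels.std()
    return np.abs(f.T @ y) / len(y)

rng = np.random.default_rng(42)
n_features = 1000                       # "radiomics" features, all pure noise
n_discovery, n_validation = 40, 400     # small discovery cohort, larger validation

def noise_cohort(n_patients):
    """Random features with balanced binary outcomes that they cannot predict."""
    features = rng.normal(size=(n_patients, n_features))
    labels = (np.arange(n_patients) % 2).astype(float)
    return features, labels

x_disc, y_disc = noise_cohort(n_discovery)
x_val, y_val = noise_cohort(n_validation)

# Pick the feature that looks most predictive in the discovery cohort...
r_disc = abs_correlations(x_disc, y_disc)
best = int(np.argmax(r_disc))

# ...and check the same feature in an independent cohort.
r_val = abs_correlations(x_val[:, [best]], y_val)[0]
print(f"best feature {best}: |r| discovery = {r_disc[best]:.2f}, validation = {r_val:.2f}")
```

Because the best of 1000 candidates is selected on only 40 patients, its discovery-cohort correlation looks substantial by chance alone, while the same feature in the independent cohort shows little or no association. Real radiomics features are correlated with one another and sometimes with the outcome, so this is a caricature of the problem rather than a model of any published study.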
Deep learning entered quantitative imaging with a fundamentally different philosophy. Instead of handcrafting features, convolutional neural networks (CNNs) learn representations directly from pixel data. For segmentation tasks, architectures like U-Net can delineate tumors with accuracy rivaling human experts. For outcome prediction, transformers and other attention-based models can process entire 3D volumes and learn which spatial patterns matter.
This framework transformed the relationship between feature engineering and prediction. Radiomics required explicit feature definitions; deep learning treats feature discovery as part of the optimization. The result has been state-of-the-art performance on many benchmarks, particularly when large annotated datasets are available. But large annotated datasets are scarce in medicine. Annotating a single tumor segmentation can take an expert radiologist hours, and privacy regulations limit data sharing. Deep learning models also suffer from an interpretability gap: a radiomics feature like “entropy” has an explicit mathematical definition, whereas a deep network’s internal representations are opaque. This opacity creates regulatory hurdles: how do you validate a model whose reasoning cannot be fully explained?
Today, radiomics and deep learning coexist as the leading frameworks, with a distinct division of labor between them. Radiomics is preferred when datasets are small, interpretability is critical (e.g., for regulatory approval or clinical trial endpoints), and features need to be harmonized across scanners. Deep learning excels when large, well-curated datasets are available and raw predictive accuracy is the primary goal.
Both frameworks agree on a core principle: medical images contain quantitative information that can be extracted and used for decision-making. Both agree that reproducibility—across scanners, protocols, and institutions—is essential. Where they disagree is on the optimal route to that information. Radiomics advocates argue that handcrafted features grounded in physics and biology are more robust and interpretable. Deep learning proponents counter that learned representations can capture complex, nonlinear patterns that handcrafted features miss, and that interpretability can be addressed post hoc through saliency maps or concept attribution.
A growing hybrid approach bridges the two camps. Researchers feed radiomics features—not raw pixels—into deep learning classifiers, combining the interpretability and standardization of handcrafted features with the representational power of neural networks. Other work uses deep learning to segment tumors automatically and then extracts radiomics features from the resulting masks. This convergence suggests that the future of quantitative imaging may not be a winner-take-all contest but a layered pipeline in which each framework contributes what it does best.
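The second hybrid pattern, automatic segmentation followed by handcrafted feature extraction, can be sketched in a few lines. In this sketch the segmentation step is a deliberate placeholder: `segment_stub` is a plain intensity threshold standing in for a trained network such as a U-Net, and the synthetic volume, the threshold, and the bin count are assumptions chosen so the example runs without any model weights. The function `first_order_features` is likewise a hypothetical name.

```python
import numpy as np

def segment_stub(volume, threshold):
    """Placeholder for a trained segmentation network: a simple threshold."""
    return volume > threshold

def first_order_features(volume, mask, n_bins=32):
    """Histogram-based descriptors of the voxels inside the mask."""
    values = volume[mask]
    mean, std = values.mean(), values.std()
    z = (values - mean) / std
    counts, _ = np.histogram(values, bins=n_bins)
    p = counts[counts > 0] / counts.sum()   # drop empty bins before taking logs
    return {
        "volume_voxels": int(mask.sum()),
        "mean": float(mean),
        "skewness": float(np.mean(z ** 3)),
        "kurtosis": float(np.mean(z ** 4)),
        "entropy_bits": float(-np.sum(p * np.log2(p))),
    }

# Synthetic 3D volume (assumed values): unit-variance noise with a bright block.
rng = np.random.default_rng(1)
volume = rng.normal(0.0, 1.0, size=(16, 32, 32))
volume[4:12, 8:24, 8:24] += 20.0            # synthetic "tumor" of 8*16*16 voxels

mask = segment_stub(volume, threshold=10.0)  # stage 1: segmentation
features = first_order_features(volume, mask)  # stage 2: handcrafted features
print(features)
```

In a real pipeline the resulting feature dictionary would be passed to a downstream classifier, and swapping the placeholder for a learned model would leave the feature-extraction stage unchanged.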