cs.CV | Jun 8, 2026 | classified

End-to-End Optimization of Incoherent Imaging for Classification Under Detector-Limited Readout

Archer Wang, Joshua Chen, Sachin Vaidya, Marin Soljačić

Paper Guide Brief

Reading Brief

This paper investigates when end-to-end co-optimization of an incoherent imaging phase mask and a neural network classifier improves object classification over a conventional lens. It proves that under full detector readout, no passive phase mask can exceed the mutual information of the ideal channel, but under constrained readout (coarse sampling, limited measurements, or noise), optimized optics can substantially improve class separability, especially when discriminative content is at lower spatial frequencies than within-class variation.

Central Claim

Provides a theoretical framework and information-theoretic proof characterizing when end-to-end optical-digital co-design outperforms conventional lens-based imaging for classification, specifically under detector-limited readout conditions, and validates pre...

Contribution

Why It Matters

This matters because it provides the first rigorous theoretical characterization of when and why optical-digital co-design for classification is beneficial, pinpointing detector-limited readout and spectral structure as key factors, which...

Prerequisites

end-to-end optimization, phase mask optimization, incoherent imaging model, mutual information analysis, Mahalanobis distance separability
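
As a quick refresher on the last prerequisite, the sketch below computes a Mahalanobis distance between two class means and the matching Bayes error. It assumes two equiprobable Gaussian classes that share a diagonal covariance, which is a simplification chosen here for illustration and not necessarily the paper's setup. The function names `mahalanobis_separability` and `bayes_error` are hypothetical.

```python
import math

def mahalanobis_separability(mu0, mu1, var):
    """Mahalanobis distance between two class means under a shared
    diagonal covariance (var holds the per-feature noise variances)."""
    return math.sqrt(sum((a - b) ** 2 / v for a, b, v in zip(mu0, mu1, var)))

def bayes_error(d):
    """Bayes error for two equiprobable Gaussian classes with shared
    covariance and Mahalanobis distance d: Phi(-d / 2)."""
    return 0.5 * math.erfc(d / (2.0 * math.sqrt(2.0)))

# Same class means, two noise levels: halving the noise standard
# deviation doubles the distance and lowers the Bayes error.
d_noisy = mahalanobis_separability([0.0, 0.0], [1.0, 0.0], [1.0, 1.0])
d_clean = mahalanobis_separability([0.0, 0.0], [1.0, 0.0], [0.25, 0.25])
print(d_noisy, bayes_error(d_noisy))
print(d_clean, bayes_error(d_clean))
```

In this toy setting a larger distance always means a lower error, which is why separability in the detector measurements can stand in for classification accuracy.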

Atlas Placement

Computational Imaging (subfield)

Read If

You care about end-to-end optimization, phase mask optimization, or incoherent imaging models.

Skip If

You only care about MNIST or FashionMNIST.

Methods

end-to-end optimization, phase mask optimization, incoherent imaging model, mutual information analysis, Mahalanobis distance separability, angular spectrum propagation

Tasks

object classification, image classification

Datasets

synthetic Gaussian data, MNIST, FashionMNIST, SVHN

Benchmarks

MNIST, FashionMNIST, SVHN

Noosaga Placements

Computational Imaging (subfield, 95%)
The paper focuses on joint optimization of optical front-ends (e.g., metasurfaces, phase masks) and digital processing for imaging tasks, directly addressing computational imaging principles. The core model and experiments involve incoherent imaging, detector readout, and optical transfer functions.
Evidence: "End-to-end co-optimization of optical front-ends (e.g. metasurfaces) and neural network back-ends has been widely applied to imaging tasks" / "We use a standard scalar incoherent imaging model for a monochromatic, shift-invariant imaging system." / "The detector readout operator DW introduced in Section II-B maps the fine-grid intensity to a coarse-grid measurement array"
Coded Aperture Imaging (framework, 85%)
The paper compares end-to-end optimized phase masks against a conventional focusing lens, which is a form of coded aperture or phase mask design. The framework of Coded Aperture Imaging is directly relevant as the paper analyzes and compares different phase mask designs (conventional lens vs. learned phase masks).
Evidence: "we compare a fixed conventional lens against end-to-end joint optimization of the optical phase together with the BatchNorm-plus-logistic-regression classifier." / "The conventional lens’ MTF is higher across most spatial frequencies, while the optimized phase mask’s MTF suppresses higher frequencies." / "we prove that under full detector readout, no passive incoherent phase mask exceeds the ideal-channel mutual information"
Computer Vision (subfield, 90%)
The paper is explicitly about object classification, a core computer vision task. It uses standard vision benchmarks (MNIST, FashionMNIST, SVHN) and analyzes class separability in the context of image classification. The primary arXiv category is cs.CV.
Evidence: "This paper focuses on object classification, a central imaging task" / "test its predictions on synthetic data and standard benchmarks (MNIST, FashionMNIST, SVHN)" / "arxiv_categories: [cs.CV]"
Data-Driven Computational Imaging (framework, 80%)
The paper contrasts its end-to-end optimized approach with conventional, lens-based pipelines. This situates the work within the Data-Driven Computational Imaging framework, which uses learned, data-driven methods to design imaging systems. The paper specifically analyzes when this data-driven approach offers a benefit over a conventional lens.
Evidence: "End-to-end co-optimization of optical front-ends (e.g. metasurfaces) and neural network back-ends has been widely applied to imaging tasks" / "the jointly optimized system maintains high separability and accuracy even at small s" / "jointly optimize the phase mask and a logistic regression classifier"
Computational Photography (framework, 75%)
The paper's goal of jointly designing optics and processing for a specific task (classification) is a central theme in Computational Photography, particularly for applications like high-speed, low-power, or embedded vision. The paper's focus on detector-limited readout, such as coarse spatial sampling and limited measurements, is typical of computational photography systems.
Evidence: "Much prior work on end-to-end optical design has focused on computational cameras" / "In high-speed perception settings, these costs directly limit response time; in embedded and always-on platforms, they can dominate the energy budget" / "For example, a concrete application is high-throughput industrial inspection and sorting"
Machine Learning (subfield, 60%)
The paper uses a theoretical framework involving mutual information and Mahalanobis distance for classification, and employs neural networks (including ResNet-18 and logistic regression) as classifiers. It relates to learning theory and statistical classification models.
Evidence: "we assume a Bayes-optimal classifier on Y and use the mutual information I(C; Y) as our analysis proxy" / "use deep neural networks as the classifier to accommodate highly nonlinear decision boundaries" / "The Bayes-optimal classifier in this setting is linear, as in linear discriminant analysis (LDA)."
Deep Learning (subfield, 50%)
The paper uses neural networks (linear classifiers, ResNet-18, small CNN) as part of the end-to-end pipeline, but the main contribution is not a new deep learning architecture or method. Deep learning is a tool used to validate the theoretical framework.
Evidence: "using deep neural networks as the classifier" / "MNIST and FashionMNIST use a grayscale ResNet-18 backbone" / "For SVHN, we use a CNN with approximately 5×10^5 parameters"

Abstract

End-to-end co-optimization of optical front-ends (e.g. metasurfaces) and neural network back-ends has been widely applied to imaging tasks, yet a formalism characterizing when and why such systems outperform conventional lens-based imaging is largely lacking. This paper focuses on object classification, a central imaging task, and asks when end-to-end optimization of a phase mask for incoherent imaging improves performance over a conventional focusing lens. We find that these gains arise primarily under constrained detector readout and are limited under full detector readout. In the latter setting, we prove that no incoherent phase mask exceeds the ideal-channel mutual information between detector measurements and class labels; a conventional focusing lens approaches this ceiling, and joint optimization yields no empirical gain. When detector readout is constrained (by coarse spatial sampling or a limited number of measurements), optimized optics can substantially improve classification by increasing class separability in the detector measurements. These gains are largest under low detector noise and shrink as noise grows, because the optics shape the signal before it reaches the detector but cannot remove noise added afterward. The advantage also depends on the spectral structure of the task: co-design helps most when class-discriminative content is concentrated at lower spatial frequencies than within-class variation. We develop a theoretical framework formalizing these distinctions and test its predictions on synthetic data and standard benchmarks (MNIST, FashionMNIST, SVHN).
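
The pipeline the abstract describes (phase mask, incoherent image formation, coarse detector readout, then detector noise) can be sketched in a one-dimensional toy. This is an illustrative sketch under stated assumptions and not the paper's implementation: it assumes a unit-amplitude pupil spanning the whole grid, circular convolution, bin-summing as the readout, and additive Gaussian noise. All function names, the grid size, and the quadratic phase profile are hypothetical choices.

```python
import cmath
import math
import random

def dft(x):
    """Naive O(n^2) discrete Fourier transform (fine for tiny toy grids)."""
    n = len(x)
    return [sum(x[k] * cmath.exp(-2j * math.pi * j * k / n) for k in range(n))
            for j in range(n)]

def incoherent_psf(phase):
    """PSF of a unit-amplitude pupil with the given phase profile: squared
    magnitude of the pupil's Fourier transform, normalized to unit sum."""
    intensity = [abs(f) ** 2 for f in dft([cmath.exp(1j * p) for p in phase])]
    total = sum(intensity)
    return [v / total for v in intensity]

def mtf(psf):
    """Modulation transfer function: magnitude of the PSF's Fourier transform."""
    return [abs(v) for v in dft(psf)]

def image(scene, psf):
    """Incoherent image: circular convolution of scene intensity with the PSF."""
    n = len(scene)
    return [sum(scene[k] * psf[(j - k) % n] for k in range(n)) for j in range(n)]

def readout(intensity, bin_size):
    """Coarse detector readout: sum the fine-grid intensity over adjacent bins."""
    return [sum(intensity[i:i + bin_size]) for i in range(0, len(intensity), bin_size)]

def add_noise(measurement, sigma, rng):
    """Detector noise is added after the optics, so no mask can undo it."""
    return [v + rng.gauss(0.0, sigma) for v in measurement]

n = 16
scene = [1.0 if k % 4 == 0 else 0.0 for k in range(n)]
lens_psf = incoherent_psf([0.0] * n)                                # flat phase: a delta PSF in this toy
mask_psf = incoherent_psf([0.3 * (k - n / 2) ** 2 for k in range(n)])  # arbitrary phase mask

rng = random.Random(0)
for name, psf in (("lens", lens_psf), ("mask", mask_psf)):
    y = add_noise(readout(image(scene, psf), 4), 0.01, rng)
    print(name, "max MTF:", max(mtf(psf)), "measurement:", y)
```

Because the PSF is nonnegative and sums to one, the MTF never exceeds its zero-frequency value of 1 for any choice of phase. A passive mask can only redistribute light before the readout, and the noise term enters after that redistribution, which is the qualitative picture behind the noise dependence described in the abstract.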

Paper Context

Source Context: Whole paper

Budget: 100,000 tokens

Coverage: 74,240 chars

Classified from the full extracted paper text (74,240 characters). The Paper Guide brief above is the user-facing synthesis; raw context is kept out of the page.