Processing massive data streams with memory far smaller than the input is the central challenge of streaming algorithms. Unlike standard algorithms that store all data, a streaming algorithm sees each element once (or a few times) and must compute useful statistics using sublinear space—often logarithmic or polylogarithmic in the stream length. This constraint forces approximate, probabilistic answers, trading exactness for feasibility. The field has evolved through four major frameworks, each addressing different query types and responding to the limitations of earlier approaches.
Streaming sampling was the first systematic framework. The classic reservoir sampling algorithm (1987) solves a fundamental problem: maintain a uniform random sample of fixed size $k$ from a stream of unknown length $n$. The algorithm keeps the first $k$ items, then for each subsequent item at position $i$ (with $i > k$), it replaces a randomly chosen sampled item with probability $k/i$. This elegant mechanism guarantees that every element has probability $k/n$ of being in the final sample, without ever storing the total count.
Later work extended reservoir sampling to weighted streams (priority sampling) and to biased sampling for specific statistics. Sampling remains a versatile primitive for estimating quantities like stream length, averages, and subset sums. However, its scope is limited: a sample cannot reliably answer per-item queries such as “how many times did a particular element appear?” unless that element is extremely frequent. This gap motivated dedicated frequency estimation.
Frequency estimation emerged in 1996 to answer point queries—the frequency of a specific item—with sublinear space. The canonical data structure is the Count–Min Sketch (CM sketch). It uses $d$ hash functions $h1,\dots,hd$ mapping items to buckets in $d$ arrays of counters. On each arrival of an item $x$, increment $hj(x)$ for each $j$. To query the frequency of $x$, return $\minj\{\text{counter}[j][h_j(x)]\}$. The sketch guarantees that the estimate never underestimates and with high probability overestimates by at most $\varepsilon N$ using $O(\frac{1}{\varepsilon}\log\frac{1}{\delta})$ space, where $N$ is the stream length.
Unlike sampling, frequency estimation provides strong guarantees for every individual item, making it invaluable in network traffic monitoring and natural language processing for detecting heavy hitters. The framework coexists with earlier sampling: sampling is simpler for aggregate statistics, while frequency estimation gives per-element fidelity.
Simultaneously in 1996, a different framework—sketch-based approximation—appeared, targeting aggregate queries rather than per-item frequencies. The core idea is to represent the stream as a high-dimensional vector (each dimension counts one item) and then apply a random linear projection to a much lower-dimensional space. The Johnson–Lindenstrauss lemma guarantees that pairwise Euclidean distances are approximately preserved with high probability, enabling norm and inner product estimation.
The AMS sketch (Alon–Matias–Szegedy) for $L_2$ norm uses random sign vectors and tracks the inner product of the frequency vector with random vectors, yielding an $(\varepsilon,\delta)$-approximation. Similarly, Count Sketch (also 1996) uses random signs and hashing for heavy hitters. The key advantage of linear sketches is composability: sketches from distributed streams can be added linearly, merging seamlessly. This framework unifies many previously separate problems (distinct elements, moments, entropy) under a common algebraic lens.
Sketch-based approximation and frequency estimation both use hash functions but differ in queries: frequency estimation answers point queries; sketch-based approximation answers aggregate queries like norms and inner products. The two frameworks coexist and sometimes overlap—for instance, Count Sketch can serve both purposes.
Graph streaming algorithms extend streaming methods to combinatorial structures. The input is a stream of edges, and memory is constrained to $O(n\,\text{polylog}\,n)$ (the semi-streaming model) to allow storing information about vertices but not all edges. A representative problem is connectivity: maintain a spanning forest using random sampling of edges. The classic approach processes each edge; if it connects two different connected components, add it to the forest; otherwise discard. This uses $O(n\log n)$ space and exactly tracks connectivity after one pass. For approximate matching, algorithms use sketches to estimate degrees and sample edges greedily.
Graph streaming borrows heavily from earlier frameworks: sampling for sparsification and linear sketches for cut estimation and spectral properties. However, it introduces new constraints—the combinatorial structure of graphs means that simple vector representations lose edge relationships. Lower bounds show that many graph problems (e.g., maximum matching, minimum cut) require multiple passes or distortion. This framework remains highly active, with ongoing work on dynamic graph streams and trade-offs between passes and space.
All four frameworks remain active today. Streaming sampling is a go-to for simple applications. Frequency estimation dominates in high-speed network hardware where point queries matter. Sketch-based approximation provides the sharpest theoretical tools for norms, moments, and distributed streams. Graph streaming is a burgeoning area with open problems in both algorithms and lower bounds. There is broad agreement that sublinear space forces approximation and that randomization is essential for most problems. Disagreements arise over whether deterministic sketches can match randomized ones, how to best model multiple passes, and whether graph streaming should prioritize worst-case or average-case analysis. The field continues to draw from online algorithms (irrevocable decisions) and approximation algorithms (error guarantees), but its defining commitment—sublinear memory—makes it a distinct and vital part of algorithm design.