Designing algorithms for parallel computers forces a fundamental choice: should the model of computation be simple enough to reason about abstractly, or detailed enough to predict real performance on actual hardware? The history of parallel algorithms is shaped by this tension. Early models offered clean abstractions that made algorithm design and complexity analysis tractable, but they ignored communication costs, memory hierarchies, and synchronization overheads. Later models introduced more realistic cost parameters, often at the expense of portability and simplicity. No single model has won; instead, different frameworks now serve distinct roles in theory, high-performance computing, and GPU programming.
The Parallel Random Access Machine (PRAM) model, introduced in the late 1970s, was the first widely adopted abstraction for parallel computation. It assumes a collection of processors, each with its own local memory, all connected to a shared global memory. Any processor can read from or write to any global memory location in a single time step. This radical simplification made it possible to design parallel algorithms and analyze their complexity without worrying about communication topology or memory contention. PRAM algorithms are classified by how they handle simultaneous memory accesses: exclusive-read exclusive-write (EREW), concurrent-read exclusive-write (CREW), and concurrent-read concurrent-write (CRCW). The model's clean semantics enabled a rich body of theoretical results, including parallel versions of sorting, graph algorithms, and matrix operations. However, PRAM's assumption of unit-cost global memory access is deeply unrealistic. On real machines, communication between processors is orders of magnitude slower than local computation, and contention for shared memory can stall progress. Despite this, PRAM remains a cornerstone of parallel complexity theory, where it serves as a reference point for classifying problems into complexity classes such as NC (Nick's Class). Its role has narrowed: it is now primarily a theoretical tool, not a guide for practical algorithm engineering.
While PRAM dominated theoretical work, two other frameworks emerged in the late 1970s that took hardware constraints seriously. Systolic array algorithms, proposed by H. T. Kung and Charles Leiserson, model computation as a regular grid of processing elements that each perform simple operations and pass data only to their immediate neighbors. The systolic paradigm was designed for VLSI implementation: data flows rhythmically through the array, much like blood through a circulatory system. This model excels at matrix multiplication, convolution, and other signal-processing tasks where data reuse is high and communication is local. In contrast to PRAM's global memory, systolic arrays assume a fixed, regular interconnection network, making them less flexible but far more realistic for chip design.
At roughly the same time, the Boolean circuit model treated parallel computation as a family of logic gates arranged in a directed acyclic graph. Circuit depth measures parallel time, and circuit size measures total work. This model connected parallel algorithms directly to circuit complexity theory, allowing researchers to prove lower bounds and classify problems by their parallel feasibility. Both systolic arrays and Boolean circuits coexisted with PRAM in the early era, but they addressed different audiences: systolic arrays targeted hardware architects, Boolean circuits served complexity theorists, and PRAM aimed at algorithm designers who wanted architecture-independent reasoning.
By the mid-1980s, the data-parallel paradigm crystallized around the Single Instruction, Multiple Data (SIMD) model. In this view, a single control unit broadcasts an instruction to many processing elements, each operating on its own data in lockstep. The key insight is that many algorithms—especially those on arrays, images, and dense matrices—can be expressed as operations on entire data structures rather than on individual elements. The data-parallel model absorbed ideas from both PRAM (shared-memory abstraction) and systolic arrays (regular, predictable communication patterns), but it added a crucial discipline: all processors execute the same instruction at the same time, eliminating the need for complex synchronization. This made SIMD machines like the Connection Machine commercially viable for scientific computing. The data-parallel framework also introduced algorithmic building blocks such as prefix sums, parallel reductions, and segmented scans, which Guy Blelloch later systematized in his work on prefix sums and their applications. These building blocks became the foundation for a portable, architecture-independent style of parallel programming.
The early 1990s saw a decisive shift toward realistic cost models. Two frameworks emerged as competing responses to PRAM's shortcomings: the Bulk Synchronous Parallel (BSP) model and the LogP model. BSP, proposed by Leslie Valiant, divides computation into a sequence of supersteps. In each superstep, processors perform local work, then exchange data via a global communication network, and finally synchronize with a barrier. The model captures communication and synchronization costs with two parameters: the gap between successive communications and the time for a global barrier. BSP's strength is its simplicity: algorithms can be analyzed by counting supersteps and balancing local work against communication volume.
LogP, introduced by David Culler and colleagues, took a more fine-grained view. It models the network with four parameters: latency (L), overhead (o), gap (g), and the number of processors (P). Unlike BSP, LogP does not assume global synchronization; instead, it tracks point-to-point message costs and allows asynchronous communication. This made LogP more accurate for machines with high latency and low bandwidth, but also more complex to reason about. The disagreement between BSP and LogP was not merely technical—it reflected a deeper debate about whether global synchronization is a useful abstraction or an unrealistic simplification. BSP argued that barriers simplify algorithm design and can be implemented efficiently on many machines; LogP countered that barriers are expensive and that fine-grained, asynchronous models better reflect actual hardware. Both frameworks remain active: BSP is used in large-scale scientific computing (e.g., the Pregel graph processing system), while LogP has influenced performance modeling in clusters and supercomputers.
Concurrent with the realism turn, a different approach emerged that sidestepped the debate over cost models. Work-depth (or fork-join) parallelism, developed by Guy Blelloch and others, separates algorithm design from machine-specific cost accounting. An algorithm is expressed as a directed acyclic graph of tasks, where nodes represent computation and edges represent dependencies. The framework measures two quantities: work (total number of operations) and depth (length of the longest chain of dependencies). The critical insight is that any algorithm with work W and depth D can be scheduled on P processors in time roughly W/P + D, assuming an ideal scheduler. This analysis is portable across machines because it does not assume a particular communication topology or synchronization mechanism. Work-depth draws on PRAM's abstraction of unlimited parallelism but replaces PRAM's unit-cost memory with a dependency graph that exposes parallelism without hiding communication entirely. It also complements BSP: work-depth provides a machine-independent upper bound, while BSP refines that bound with communication costs. Today, work-depth analysis is the standard tool for reasoning about parallel algorithms in textbooks and in practice, especially for multicore programming with frameworks like Cilk and OpenMP.
The rise of graphics processing units (GPUs) in the late 2000s introduced a new hardware model that revived and transformed the data-parallel paradigm. NVIDIA's CUDA programming model popularized the Single Instruction, Multiple Thread (SIMT) abstraction. In SIMT, a group of threads (a warp) executes the same instruction on different data, but threads can diverge when they encounter conditional branches. This is a crucial evolution from strict SIMD: it allows fine-grained control flow while still achieving high throughput through lockstep execution within a warp. GPU parallel algorithms must contend with warp divergence, memory coalescing, and a deep memory hierarchy (global, shared, local, and register memory). The SIMT framework absorbed the data-parallel building blocks of prefix sums and reductions, but added hardware-specific constraints that make algorithm design more challenging. For example, a parallel reduction on a GPU must avoid bank conflicts in shared memory and ensure that memory accesses are coalesced into contiguous transactions. SIMT algorithms coexist with work-depth analysis: the work-depth model provides an upper bound on parallelism, but GPU-specific tuning requires a LogP-like understanding of latency and bandwidth. Today, GPU parallel algorithms dominate high-performance computing for dense linear algebra, deep learning, and scientific simulation.
No single framework has unified the field. Instead, each serves a distinct domain. PRAM remains the standard for theoretical complexity classification, especially for proving that a problem is in NC or P-complete. Systolic arrays are used in specialized hardware design, such as Google's Tensor Processing Units. The Boolean circuit model underpins lower bounds and circuit complexity theory. Data-parallel / SIMD algorithms live on in vectorized CPU instructions (e.g., AVX) and in the design of GPU kernels. BSP is the model of choice for large-scale graph processing and bulk-synchronous distributed computing. LogP guides performance modeling for clusters and message-passing systems. Work-depth / fork-join parallelism is the default framework for multicore algorithm analysis and for teaching parallel algorithms. SIMT / GPU algorithms are the practical workhorse for throughput-oriented computing.
What the leading frameworks agree on is that communication cost matters: any realistic model must account for the time to move data between processors or between levels of memory. They also agree that parallelism is not free—synchronization, contention, and load imbalance can erode speedup. Where they disagree is on the level of abstraction. Work-depth argues that machine-independent analysis is sufficient for algorithm design; BSP and LogP argue that cost parameters are essential for performance prediction. SIMT adds hardware-specific constraints that neither work-depth nor BSP fully capture. This pluralism is not a failure but a reflection of the diversity of parallel hardware. A student learning parallel algorithms today must be comfortable moving between these frameworks, using work-depth for initial design, BSP or LogP for communication analysis, and SIMT for GPU optimization.
The central tension between abstraction and realism remains unresolved. Each new generation of hardware—multicore CPUs, manycore GPUs, reconfigurable arrays, neuromorphic chips—revives the debate. The frameworks that survive are those that adapt: PRAM persists in theory, work-depth has become the lingua franca of parallel algorithm analysis, and SIMT has absorbed and extended the data-parallel tradition. The history of parallel algorithms is not a story of replacement but of coexistence, narrowing, and transformation, with each framework finding its niche in the ongoing effort to turn parallel hardware into usable speed.