Parallel Computing: Competing Paths to Faster Computation

The central challenge of parallel computing is deceptively simple: how to make many processing elements work together on a single problem faster than one could alone. But the history of the field reveals no single answer. Instead, architects have pursued fundamentally different bets—on data parallelism, on memory organization, on thread-level concurrency, and on heterogeneity—each responding to the limitations of earlier designs and to shifting physical constraints such as wire delay, power density, and transistor budgets.

Early Data-Level Parallelism: SIMD, Vector, and Systolic

The earliest systematic attempts at parallelism focused on applying the same operation to many data elements simultaneously. SIMD Architectures (1972–Present) emerged from the observation that many scientific and graphics workloads repeatedly perform the same arithmetic on arrays of numbers. In a SIMD machine, a single instruction controls multiple processing units, each operating on its own data. The Illiac IV (1972) was the landmark example, though its complexity and programming difficulty limited immediate adoption. SIMD survives today in the vector units of modern CPUs and in GPU shader cores, where it provides a simple, energy-efficient way to exploit data-level parallelism.
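The lockstep idea can be sketched in a few lines. This is only a model: the function name `simd_execute` and the four-lane width are assumptions chosen for illustration, and the list comprehension stands in for hardware lanes that would all act in the same step.

```python
import operator

def simd_execute(instruction, lanes_a, lanes_b):
    """Issue one instruction; every lane applies it to its own pair of operands."""
    if len(lanes_a) != len(lanes_b):
        raise ValueError("both operand vectors need one element per lane")
    # Conceptually these all happen in the same step; the comprehension
    # only stands in for hardware lanes working in lockstep.
    return [instruction(a, b) for a, b in zip(lanes_a, lanes_b)]

# One "add" instruction and one "multiply" instruction, four lanes each.
sums = simd_execute(operator.add, [1, 2, 3, 4], [10, 20, 30, 40])
products = simd_execute(operator.mul, [1, 2, 3, 4], [10, 20, 30, 40])
print(sums)      # [11, 22, 33, 44]
print(products)  # [10, 40, 90, 160]
```

The point of the sketch is that the control cost (decoding one instruction) is paid once, while the arithmetic is done once per lane.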

Vector Processing (1977–Present) took a different approach: instead of many simple processing units, a vector processor uses deep pipelines and vector registers to operate on entire vectors of data with a single instruction. The Cray-1 (1976) demonstrated that vector processing could deliver enormous throughput on scientific code without the programming headaches of SIMD arrays. Vector processors and SIMD machines coexisted for decades, with vector machines dominating supercomputing while SIMD found a home in graphics and later in CPU multimedia extensions (e.g., SSE, AVX). The key difference: SIMD spreads work across many small units, while vector processing concentrates it in a few deep pipelines that can be chained, so that each result of one vector operation feeds directly into the next operation without waiting for the whole vector to finish.
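A rough cycle count shows why chaining pipelines matters. The sketch below uses an idealized model in which a pipeline of depth d delivers one result per cycle once it is full, so n elements take d + n - 1 cycles. The pipeline depths (7 and 6) are invented for illustration and are not Cray-1 figures; the vector length of 64 matches the Cray-1's vector registers.

```python
def pipeline_cycles(n_elements, depth):
    """Cycles for one pipelined unit to process n elements, one result per cycle once full."""
    return depth + n_elements - 1

def unchained_cycles(n_elements, depth_first, depth_second):
    # The second operation starts only after the first has produced its whole vector.
    return pipeline_cycles(n_elements, depth_first) + pipeline_cycles(n_elements, depth_second)

def chained_cycles(n_elements, depth_first, depth_second):
    # Each result of the first unit is forwarded straight into the second unit,
    # so the pair behaves like one pipeline of combined depth.
    return depth_first + depth_second + n_elements - 1

# Hypothetical multiply (depth 7) followed by add (depth 6) on a 64-element vector.
print(unchained_cycles(64, 7, 6))  # 139
print(chained_cycles(64, 7, 6))    # 76
```

Under these assumptions chaining nearly halves the time for a dependent pair of vector operations, because the second pipeline no longer waits for the first to drain.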

Systolic Arrays (1978–Present), proposed by H. T. Kung and Charles E. Leiserson, offered a third data-level approach. A systolic array is a network of simple processing cells that rhythmically compute and pass data to their neighbors, like blood pumping through a heart. This design was ideal for regular, compute-intensive tasks such as matrix multiplication and signal processing. Systolic arrays never became general-purpose processors, but they influenced later specialized accelerators (e.g., Google's TPU) and remain a key concept for domain-specific hardware. Unlike SIMD or vector processing, systolic arrays require the algorithm to be mapped onto a fixed topology, trading flexibility for extreme efficiency.
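To make the rhythm concrete, here is a small software simulation of one common arrangement, an output-stationary array for matrix multiplication. It is a sketch under stated assumptions rather than a description of any particular chip: each cell (i, j) keeps one accumulator, values of A move one cell to the right per step, values of B move one cell down per step, and row i of A and column j of B are fed in with a delay of i and j steps so that matching operands meet in the right cell. The function name `systolic_matmul` is hypothetical.

```python
def systolic_matmul(A, B):
    """Multiply A (n x k) by B (k x m) on a simulated n x m grid of cells."""
    n, k, m = len(A), len(B), len(B[0])
    if any(len(row) != k for row in A):
        raise ValueError("inner dimensions do not match")
    acc = [[0] * m for _ in range(n)]    # one accumulator per cell
    a_reg = [[0] * m for _ in range(n)]  # A value currently held by each cell
    b_reg = [[0] * m for _ in range(n)]  # B value currently held by each cell

    for t in range(n + m + k - 2):       # enough steps for the last operands to arrive
        # Shift A values one cell to the right and feed the left edge.
        for i in range(n):
            for j in range(m - 1, 0, -1):
                a_reg[i][j] = a_reg[i][j - 1]
            idx = t - i                  # row i is skewed by i steps
            a_reg[i][0] = A[i][idx] if 0 <= idx < k else 0
        # Shift B values one cell down and feed the top edge.
        for j in range(m):
            for i in range(n - 1, 0, -1):
                b_reg[i][j] = b_reg[i - 1][j]
            idx = t - j                  # column j is skewed by j steps
            b_reg[0][j] = B[idx][j] if 0 <= idx < k else 0
        # Every cell performs one multiply-accumulate on what it currently holds.
        for i in range(n):
            for j in range(m):
                acc[i][j] += a_reg[i][j] * b_reg[i][j]
    return acc

print(systolic_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]))  # [[19, 22], [43, 50]]
```

Note that no cell ever addresses memory: each one only talks to its left and upper neighbors, which is where both the efficiency and the rigidity of the design come from.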

Dataflow Computing: A Radical Alternative That Faded

Dataflow Computing (1975–1990) rejected the control-driven model entirely. Instead of a program counter sequencing instructions, a dataflow machine executes an instruction as soon as all its input operands are available. This promised to expose massive fine-grained parallelism automatically, without the programmer or compiler having to manage it. Early projects like the MIT Tagged-Token Dataflow Architecture and the Manchester Dataflow Machine showed feasibility, but the overhead of matching tokens to waiting instructions and the difficulty of handling irregular control flow were never overcome at a competitive cost. Dataflow computing never reached commercial viability and was largely abandoned by 1990. However, its ideas about data-driven execution influenced later Multithreaded Architectures, which use a similar principle of hiding latency by switching between threads when data is unavailable.
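The firing rule is easy to state in code. The following is a minimal sketch, not a model of any historical machine: the graph format (a dictionary mapping a node name to a function and its input names) and the name `run_dataflow` are assumptions made for the example. Nodes that become ready in the same wave have no dependencies on each other, so real hardware could fire them in parallel.

```python
import operator

def run_dataflow(nodes, initial_tokens):
    """nodes: name -> (function, [input names]). Fire every node whose inputs are all present."""
    values = dict(initial_tokens)
    pending = dict(nodes)
    waves = []
    while pending:
        ready = [name for name, (_, inputs) in pending.items()
                 if all(i in values for i in inputs)]
        if not ready:
            raise ValueError("graph is stuck: some operands never arrive")
        # Nodes in one wave are independent; they run sequentially here
        # only because this is a single-threaded simulation.
        for name in ready:
            fn, inputs = pending.pop(name)
            values[name] = fn(*(values[i] for i in inputs))
        waves.append(sorted(ready))
    return values, waves

# (a + b) * (c - d): the add and the subtract can fire together, then the multiply.
graph = {
    "sum": (operator.add, ["a", "b"]),
    "diff": (operator.sub, ["c", "d"]),
    "prod": (operator.mul, ["sum", "diff"]),
}
values, waves = run_dataflow(graph, {"a": 2, "b": 3, "c": 10, "d": 4})
print(values["prod"])  # 30
print(waves)           # [['diff', 'sum'], ['prod']]
```

There is no program counter anywhere in the sketch; the order of execution falls out of operand availability alone. The scan for ready nodes on every pass is a crude stand-in for the token matching that proved expensive in hardware.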

The Memory Model Divide: Distributed vs. Shared

By the early 1980s, architects building multiprocessors faced a fundamental choice: how should processors share data? Distributed-Memory Multicomputers (1980–Present) gave each processor its own private memory and required explicit message passing to exchange data. The Cosmic Cube at Caltech, a 64-node machine described by Charles Seitz in 1985, demonstrated that a hypercube network of simple nodes was practical, and its commercial descendants scaled the design to hundreds and then thousands of processors. Programming such machines, however, required the programmer to partition data and manage communication manually. This model became the backbone of many supercomputers (e.g., IBM Blue Gene, Cray XT) and remains dominant in high-performance computing today, where scalability trumps ease of programming.
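The programming burden described above can be illustrated with a small sum reduction. This sketch only imitates a multicomputer: Python threads stand in for nodes and `queue.Queue` objects stand in for network links, so the "private memory" is a convention of the example (each node touches only its own chunk) rather than something the runtime enforces. The names `node` and `distributed_sum` are hypothetical.

```python
import queue
import threading

def node(rank, local_data, inboxes, results):
    """One node: compute on private data, then exchange results only via messages."""
    partial = sum(local_data)              # touches only this node's own chunk
    if rank == 0:
        total = partial
        for _ in range(len(inboxes) - 1):
            total += inboxes[0].get()      # blocking receive from any other node
        results["total"] = total
    else:
        inboxes[0].put(partial)            # explicit send to node 0

def distributed_sum(data, n_nodes):
    # The programmer, not the hardware, decides how the data is partitioned.
    chunks = [data[r::n_nodes] for r in range(n_nodes)]
    inboxes = [queue.Queue() for _ in range(n_nodes)]
    results = {}
    workers = [threading.Thread(target=node, args=(r, chunks[r], inboxes, results))
               for r in range(n_nodes)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    return results["total"]

print(distributed_sum(list(range(1, 101)), 4))  # 5050
```

Both the partitioning (`data[r::n_nodes]`) and the communication pattern (everyone sends to node 0) are written out by hand, which is exactly the cost the message-passing model asks programmers to accept in return for scalability.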

Shared-Memory Multiprocessing (1980–Present) took the opposite approach: all processors share a single address space, and communication happens implicitly through loads and stores. The Stanford DASH multiprocessor (1990) showed that directory-based cache coherence could keep caches consistent across physically distributed memory banks, making shared memory scalable beyond a single shared bus. Shared-memory machines are easier to program because the programmer does not need to manage data placement explicitly, but they face scalability limits due to coherence traffic and memory contention. The tension between these two models has never been resolved; modern systems often combine them, using shared memory within a node and message passing between nodes.
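A toy directory for a single memory block shows where coherence traffic comes from. This is a deliberately simplified sketch and not the DASH protocol: it assumes write-through to memory, only two cache states (valid or invalid), and a directory that knows exactly which caches hold a copy. The class name `Directory` and its fields are hypothetical.

```python
class Directory:
    """Tracks which caches hold a copy of one memory block and invalidates them on writes."""

    def __init__(self, n_caches, value=0):
        self.memory = value
        self.caches = [None] * n_caches   # None means "no valid copy"
        self.sharers = set()              # ids of caches with a valid copy
        self.invalidations = 0            # messages sent to keep copies consistent

    def read(self, cache_id):
        if self.caches[cache_id] is None:         # miss: fetch the block from memory
            self.caches[cache_id] = self.memory
            self.sharers.add(cache_id)
        return self.caches[cache_id]

    def write(self, cache_id, value):
        # Before the write takes effect, every other copy must be invalidated.
        for other in self.sharers - {cache_id}:
            self.caches[other] = None
            self.invalidations += 1
        self.sharers = {cache_id}
        self.caches[cache_id] = value
        self.memory = value                       # write-through, for simplicity

d = Directory(n_caches=4)
for c in range(4):
    d.read(c)              # all four caches now share the block
d.write(0, 42)             # three other copies must be invalidated
print(d.invalidations)     # 3
print(d.read(2))           # 42 (cache 2 misses and refetches the new value)
```

The programmer sees only loads and stores, but each write to widely shared data costs one message per sharer, which is the kind of traffic that limits how far shared memory scales.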

Hiding Latency: Multithreaded Architectures

As processors became faster than memory, the gap between compute speed and memory latency became a dominant bottleneck. Multithreaded Architectures (1990–Present) attacked this problem by allowing a processor to switch to another thread while waiting for a memory access, thus keeping the execution units busy. Early examples included the Tera MTA (1995) and the Sun Niagara (2005). Simultaneous multithreading (SMT), introduced commercially in the Intel Pentium 4 (as Hyper-Threading) and the IBM Power5, allowed multiple threads to issue instructions in the same cycle, sharing the processor's functional units. Multithreading coexists with other forms of parallelism: it does not increase peak throughput but improves utilization, especially on workloads with high memory latency or irregular control flow.
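The utilization argument can be checked with a small cycle-level simulation. The model is an assumption made for illustration: every thread alternates between a fixed burst of compute cycles and a fixed memory stall, the core issues from at most one ready thread per cycle, and switching threads is free. The function name `utilization` and the 10-cycle/90-cycle figures are hypothetical.

```python
def utilization(n_threads, compute_cycles, memory_latency, total_cycles=10_000):
    """Fraction of cycles in which the core finds a ready thread to run."""
    ready_at = [0] * n_threads                 # first cycle each thread can run again
    remaining = [compute_cycles] * n_threads   # compute cycles left in the current burst
    busy = 0
    for cycle in range(total_cycles):
        for t in range(n_threads):
            if ready_at[t] <= cycle:           # pick the first thread that is not stalled
                busy += 1
                remaining[t] -= 1
                if remaining[t] == 0:          # burst done: issue a memory access and stall
                    ready_at[t] = cycle + 1 + memory_latency
                    remaining[t] = compute_cycles
                break
    return busy / total_cycles

# 10 cycles of work per 90-cycle memory access.
for n in (1, 4, 10):
    print(n, utilization(n, compute_cycles=10, memory_latency=90))
# 1 0.1
# 4 0.4
# 10 1.0
```

In this model utilization grows linearly with the number of threads until the stalls are fully covered, after which extra threads add nothing: the core never does more than one instruction per cycle, it just wastes fewer of them.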

The Multicore Shift: Chip Multiprocessing

By the early 2000s, power and heat constraints made it impractical to keep increasing single-core clock speeds. Chip Multiprocessing (2001–Present) placed multiple complete processor cores on a single die, each capable of running independent threads. IBM's Power4 (2001) was the first commercial server chip to do so, followed by AMD's dual-core Opteron (2005) and Intel's Core 2 Duo (2006). This shift was a direct response to the end of Dennard scaling: instead of building a faster single core, architects replicated simpler cores to maintain throughput growth. Chip multiprocessing absorbed earlier shared-memory ideas, placing a small number of cores with a coherent cache hierarchy on one chip. It also created new challenges in programming, as software had to be explicitly parallel to benefit from multiple cores.

General-Purpose GPU Computing

Graphics processing units (GPUs) had long used SIMD-like architectures to render pixels in parallel. GPGPU Computing (2004–Present) made that power available for non-graphics workloads. The Brook language (2004) at Stanford showed that stream programming could map scientific computations onto GPUs. NVIDIA's CUDA (2007) and the OpenCL standard (ratified in late 2008, with implementations following in 2009) provided general-purpose programming models, turning GPUs into massively parallel accelerators. GPGPU computing revived and transformed the SIMD idea: modern GPU cores execute in a SIMT (single-instruction, multiple-thread) fashion, grouping threads into warps (32 threads on NVIDIA hardware) that execute the same instruction on different data. When threads in a warp take different branches, the warp runs each path in turn with the non-participating threads masked off. This approach therefore excels at data-parallel tasks with regular control flow, such as deep learning and molecular dynamics, but struggles with irregular or highly divergent workloads.
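The cost of divergence is easiest to see by simulating a single warp. This is a simplified sketch, not vendor behavior in detail: the warp is 8 threads wide instead of 32 to keep the output short, and the function name `run_warp` is hypothetical. The warp makes one pass per branch path that at least one thread takes, and threads on the other path are masked off during that pass.

```python
def run_warp(values, predicate, then_op, else_op):
    """Execute `if predicate(v): then_op(v) else: else_op(v)` for one warp of threads."""
    mask = [predicate(v) for v in values]      # which threads take the "then" path
    results = list(values)
    passes = 0
    if any(mask):                              # pass 1: only "then" threads are active
        passes += 1
        results = [then_op(v) if m else r for v, m, r in zip(values, mask, results)]
    if not all(mask):                          # pass 2: only "else" threads are active
        passes += 1
        results = [r if m else else_op(v) for v, m, r in zip(values, mask, results)]
    return results, passes

def is_even(v):
    return v % 2 == 0

# Divergent warp: even and odd threads take different paths, so both passes run.
print(run_warp(list(range(8)), is_even, lambda v: v * 2, lambda v: v + 100))
# ([0, 101, 4, 103, 8, 105, 12, 107], 2)

# Uniform warp: every thread takes the same path, so only one pass runs.
print(run_warp([0, 2, 4, 6], is_even, lambda v: v * 2, lambda v: v + 100))
# ([0, 4, 8, 12], 1)
```

In the divergent case the warp spends two passes to do one pass worth of useful work per thread, which is why branch-heavy code maps poorly onto this execution model.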

Heterogeneous Computing: Combining Specialists

As chip multiprocessing and GPGPU matured, architects realized that no single type of processor is optimal for all workloads. Heterogeneous Computing (2008–Present) integrates different kinds of processing units—CPU cores, GPU cores, fixed-function accelerators—on the same chip or in the same system, each optimized for a specific class of tasks. AMD's Fusion (2011) and Apple's M-series chips (2020) are prominent examples. Heterogeneous computing is the culmination of earlier trends: it absorbs chip multiprocessing (multiple CPU cores), GPGPU (GPU cores), and even systolic-array-like accelerators (neural processing units). The key challenge is programming and scheduling: the system must decide which unit to use for each task, often with hardware-managed coherence and shared virtual memory to simplify data movement.
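The scheduling decision can be sketched as a lookup over estimated costs. Everything specific here is invented for illustration: the unit names, the task kinds, and the cost numbers are hypothetical, and a real scheduler would also have to account for data movement, contention, and power. The function name `place_tasks` is likewise an assumption.

```python
def place_tasks(tasks, estimated_cost):
    """Assign each (name, kind) task to the unit with the lowest estimated cost for that kind."""
    placement = {}
    for name, kind in tasks:
        placement[name] = min(
            estimated_cost,
            # A unit that cannot run this kind of task at all gets infinite cost.
            key=lambda unit: estimated_cost[unit].get(kind, float("inf")),
        )
    return placement

# Hypothetical relative costs; lower is better.
estimated_cost = {
    "cpu": {"branchy": 1.0, "data_parallel": 8.0, "matmul": 20.0},
    "gpu": {"branchy": 6.0, "data_parallel": 1.0, "matmul": 3.0},
    "npu": {"matmul": 1.0},
}
tasks = [
    ("parse_input", "branchy"),
    ("filter_image", "data_parallel"),
    ("run_model", "matmul"),
]
placement = place_tasks(tasks, estimated_cost)
print(placement)
# {'parse_input': 'cpu', 'filter_image': 'gpu', 'run_model': 'npu'}
```

Even this toy version shows the shape of the problem: the answer depends on a cost table that someone has to build and keep accurate for every kind of unit in the system.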

Current Landscape: Agreements and Disagreements

Today, the leading frameworks (SIMD, vector processing, systolic arrays, distributed-memory multicomputers, shared-memory multiprocessing, multithreading, chip multiprocessing, GPGPU, and heterogeneous computing) are not competing for dominance but occupy distinct niches. There is broad agreement that future performance gains will come from specialization and parallelism, not from faster single cores. Architects also agree that memory bandwidth and energy efficiency are the primary constraints, driving interest in near-memory computing and 3D-stacked memory.

However, deep disagreements remain. The most persistent is the shared-memory vs. distributed-memory divide: shared memory simplifies programming but struggles to scale beyond a few hundred cores, while distributed memory scales to millions of cores but burdens the programmer. Another disagreement concerns the role of the GPU: some argue that GPUs will become the primary compute engine, with CPUs relegated to orchestration, while others believe that heterogeneous integration will blur the distinction. Finally, the revival of systolic arrays in deep learning accelerators (e.g., Google TPU) shows that old ideas can find new life when the workload is right. The field remains a vibrant ecosystem of competing and complementary approaches, each shaped by the physical realities of silicon and the ever-growing demand for speed.
