Every computer architect faces a fundamental trade-off: a general-purpose processor can run any program, but it wastes energy and time on overhead—instruction fetch, decode, speculative execution—that a specialized design could avoid. Hardware accelerators are the result of betting on specialization. The history of the field is not a linear march toward faster chips but a series of competing architectural bets on where to draw the line between flexibility and efficiency, each shaped by the physical constraints of its era: wire delay, transistor budgets, and the growing gap between processor speed and memory access time.
The earliest accelerators did not replace the CPU but worked alongside it. Coprocessor Architecture (1950–1990) assigned specific, compute-intensive tasks—floating-point arithmetic, graphics rendering, I/O handling—to a separate unit that the CPU could invoke. The Intel 8087 math coprocessor, for example, handled floating-point operations that the 8086 CPU could only emulate slowly. Coprocessors preserved the CPU's role as the general-purpose orchestrator while offloading narrow, well-defined functions. Their limitation was granularity: each coprocessor was designed for one task, and the CPU still managed data movement and control flow.
Vector Processing (1960–2000) took a different approach. Instead of offloading a function, it offloaded a pattern of computation: applying the same operation to an entire array of data. The Cray-1 (1976) showed that vector pipelines could deliver enormous throughput for scientific workloads by exploiting data-level parallelism. Vector processors did not replace coprocessors; they coexisted, serving different niches. Where a coprocessor specialized in a function (e.g., floating-point arithmetic), a vector processor specialized in a control pattern: one instruction applied to many data elements. The key difference was that vector processing rewarded regular data layout: it ran fastest when operands sat in memory contiguously or at a constant stride, while irregular access patterns needed gather/scatter support and ran more slowly, a sensitivity that function-oriented coprocessors did not share. Over time, vector concepts were absorbed into CPU instruction-set extensions (e.g., AVX) and, later, into the SIMD lanes of GPUs, but the dedicated vector processor as a standalone machine faded as general-purpose CPUs incorporated its core insight.
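The "same operation over a whole array" pattern can be sketched in a few lines of plain Python. This is only an illustration, not a model of any real instruction set: `vector_add` stands in for a single vector instruction, `strip_mined_add` is a hypothetical helper that splits a long array into register-sized strips, and the length of 64 mirrors the Cray-1's 64-element vector registers.

```python
VECTOR_LENGTH = 64  # the Cray-1's vector registers held 64 elements each


def vector_add(a, b):
    """Stand-in for one vector instruction: one add applied to a whole strip."""
    if len(a) != len(b) or len(a) > VECTOR_LENGTH:
        raise ValueError("operands must match and fit in one vector register")
    return [x + y for x, y in zip(a, b)]


def strip_mined_add(a, b):
    """Add two equal-length arrays by issuing one vector op per strip.

    Returns the result and the number of vector instructions issued.
    """
    if len(a) != len(b):
        raise ValueError("arrays must have the same length")
    result = []
    instructions_issued = 0
    for start in range(0, len(a), VECTOR_LENGTH):
        stop = start + VECTOR_LENGTH
        result.extend(vector_add(a[start:stop], b[start:stop]))
        instructions_issued += 1
    return result, instructions_issued
```

A scalar loop over 150 elements would issue 150 add instructions; here the same work takes three vector instructions (strips of 64, 64, and 22 elements), which is where the savings in instruction fetch and decode come from.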
By the 1980s, architects began questioning whether the von Neumann model—fetching instructions and data from memory—was the right foundation for acceleration. Systolic Array Architecture (1980–Present) proposed a radically different organization: a grid of processing elements, each connected only to its neighbors, through which data flowed rhythmically like blood through a heart. Each element performed a simple operation (e.g., multiply-accumulate) on data as it passed through. The result was a design with no instruction fetch, no control logic beyond a global clock, and peak efficiency for regular, data-parallel workloads like matrix multiplication and convolution. Systolic arrays were inflexible—a fixed topology could only accelerate a narrow class of algorithms—but within that class, they achieved throughput and energy efficiency that general-purpose processors could not match. For years, systolic arrays remained a niche research idea, but they were revived in the 2010s when Google's Tensor Processing Unit (TPU) deployed a systolic array for neural network inference, proving that the old idea could dominate a new domain.
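The neighbor-to-neighbor dataflow described above can be made concrete with a small cycle-by-cycle simulation. This is a minimal sketch under stated assumptions: it models an output-stationary array in which each processing element (PE) keeps its own accumulator, rows of `A` enter from the left edge and columns of `B` from the top edge, each skewed by one cycle per row or column. The function name and this particular dataflow are illustrative choices, not a description of any specific chip.

```python
def systolic_matmul(A, B):
    """Simulate C = A x B on an n x m grid of multiply-accumulate PEs.

    A is n x k and B is k x m (lists of lists). Each PE only reads the
    registers of its left and upper neighbours from the previous cycle.
    """
    n, k, m = len(A), len(B), len(B[0])
    acc = [[0] * m for _ in range(n)]    # one stationary accumulator per PE
    a_reg = [[0] * m for _ in range(n)]  # values travelling left to right
    b_reg = [[0] * m for _ in range(n)]  # values travelling top to bottom

    # The last operand pair reaches the bottom-right PE at cycle k + n + m - 3.
    for t in range(k + n + m - 2):
        # Visit PEs from the bottom-right corner so that each one still sees
        # its neighbours' values from the previous cycle.
        for i in reversed(range(n)):
            for j in reversed(range(m)):
                if j == 0:
                    s = t - i  # row i is delayed by i cycles at the left edge
                    a_in = A[i][s] if 0 <= s < k else 0
                else:
                    a_in = a_reg[i][j - 1]
                if i == 0:
                    s = t - j  # column j is delayed by j cycles at the top edge
                    b_in = B[s][j] if 0 <= s < k else 0
                else:
                    b_in = b_reg[i - 1][j]
                a_reg[i][j] = a_in
                b_reg[i][j] = b_in
                acc[i][j] += a_in * b_in  # the only operation a PE performs
    return acc
```

Note that no PE ever indexes into `A` or `B` except at the array edges; everything else is a register handed over from a neighbor, which is what removes instruction fetch and most memory traffic from the inner loop.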
Field-Programmable Gate Array (FPGA) and Reconfigurable Computing (1985–Present) offered a different kind of flexibility: post-fabrication reconfigurability. Unlike a systolic array, whose datapath is fixed at design time, an FPGA allows the user to wire logic blocks and routing resources after manufacturing, effectively creating a custom circuit for each application. This reconfigurability sits between the programmability of a CPU and the fixed efficiency of an ASIC. FPGAs trade some peak efficiency for the ability to adapt to new algorithms without fabricating new chips. They have found a lasting role in prototyping, low-volume applications, and domains where standards change faster than silicon cycles (e.g., network packet processing). Compared to GPUs, FPGAs offer finer-grained control over datapath width and memory layout, but at the cost of a steeper programming model and lower peak floating-point throughput.
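One way to picture post-fabrication reconfigurability is the lookup table (LUT), the basic logic element in most FPGA fabrics. The sketch below is a toy model, not vendor tooling: `LookupTable` is a hypothetical class whose "configuration bits" are simply a truth table, so the same structure becomes a different gate depending on what is loaded into it.

```python
class LookupTable:
    """Toy k-input LUT: the loaded truth table decides which function it is."""

    def __init__(self, num_inputs):
        self.num_inputs = num_inputs
        self.config = [0] * (2 ** num_inputs)  # unprogrammed: always outputs 0

    def program(self, truth_table):
        """Load configuration bits, one output bit per input combination."""
        if len(truth_table) != 2 ** self.num_inputs:
            raise ValueError("truth table must have 2**num_inputs entries")
        self.config = [1 if bit else 0 for bit in truth_table]

    def evaluate(self, *inputs):
        """Use the input bits as an address into the configuration memory."""
        if len(inputs) != self.num_inputs:
            raise ValueError("wrong number of inputs")
        index = 0
        for bit in inputs:
            index = (index << 1) | (1 if bit else 0)
        return self.config[index]
```

Programming a 2-input LUT with `[0, 1, 1, 0]` yields XOR, and reprogramming the same object with `[0, 0, 0, 1]` yields AND. A real device adds programmable routing between thousands of such elements, which this sketch leaves out.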
The Graphics Processing Unit (GPU) and General-Purpose GPU (GPGPU) (1990–Present) began as a fixed-function pipeline for rendering triangles and textures. Over the 1990s and early 2000s, GPU architects gradually made the pipeline programmable, first in vertex and fragment shading, then in unified shader cores. The breakthrough was recognizing that the same hardware that could shade pixels could also run arbitrary data-parallel computations. NVIDIA's CUDA (2007) and the vendor-neutral OpenCL standard from the Khronos Group turned the GPU into a general-purpose accelerator for scientific computing, machine learning, and simulation. The GPU's organizing principle is massive multithreading: thousands of lightweight threads, grouped into warps or wavefronts, hide memory latency because the scheduler switches to another ready warp whenever one stalls. This design differs fundamentally from a systolic array's lockstep dataflow: GPUs tolerate irregular memory access patterns and divergent control flow, at some cost in throughput, whereas systolic arrays require regular, predictable data movement. The GPU's programmability made it the dominant accelerator for deep learning in the 2010s, though it pays an energy overhead for that flexibility compared to a fixed systolic array.
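The latency-hiding argument can be checked with a toy simulation. This is a deliberately simplified sketch: one issue slot per cycle, every warp alternating between a fixed burst of compute and a fixed memory wait, and a fixed-priority scheduler that always picks the lowest-numbered ready warp. The function name and all parameters are assumptions made for illustration; real GPU schedulers are considerably more elaborate.

```python
def simulate_utilization(num_warps, compute_cycles, memory_latency,
                         total_cycles=10_000):
    """Return the fraction of cycles in which some warp issued an instruction."""
    ready_at = [0] * num_warps               # first cycle each warp may issue
    remaining = [compute_cycles] * num_warps  # compute left before next stall
    busy = 0
    for cycle in range(total_cycles):
        for w in range(num_warps):
            if ready_at[w] <= cycle:
                busy += 1
                remaining[w] -= 1
                if remaining[w] == 0:
                    # The warp issues a memory request and stalls; the
                    # scheduler will pick a different ready warp next cycle.
                    ready_at[w] = cycle + 1 + memory_latency
                    remaining[w] = compute_cycles
                break  # at most one warp issues per cycle
    return busy / total_cycles
```

With 4 compute cycles per 96-cycle memory wait, a single warp keeps the issue slot busy about 4% of the time, 10 warps about 40%, and 25 warps are enough to keep it fully occupied in this model.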
Hardware/Software Co-Design (1990–Present) is not a hardware architecture but a methodological school that transformed how accelerators are built. Before co-design, hardware architects typically designed a chip, then handed it to software engineers who wrote drivers and compilers to use it. Co-design insisted that the two activities must happen together: the hardware interface, the software stack, and the application algorithms should be co-optimized from the start. This approach became essential as accelerators grew more complex. A GPU's performance, for example, depends as much on the compiler's ability to map loops to threads as on the number of ALUs. Co-design introduced systematic exploration of the hardware-software trade space, using simulation, high-level synthesis, and iterative refinement. Its lasting impact is visible in every modern accelerator project: the TPU, for instance, was co-designed with TensorFlow, and FPGA accelerators are often sold with a software framework that hides the hardware complexity.
Neuromorphic Architecture (1990–Present) takes inspiration from biological nervous systems, replacing the clocked, synchronous digital model with event-driven, spike-based computation. Neurons communicate only when their membrane potential crosses a threshold, producing an asynchronous stream of spikes. This design eliminates the energy cost of a global clock and of moving data that has not changed. IBM's TrueNorth (2014) and Intel's Loihi (2018) demonstrated that neuromorphic chips could run spiking neural networks at a fraction of the power of conventional digital accelerators. Neuromorphic architectures differ from Domain-Specific Architectures (DSAs) in a fundamental way: DSAs accelerate a specific application domain (e.g., matrix operations for ML) using conventional digital logic, whereas neuromorphic chips adopt a completely different computational model (spikes, synapses, plasticity) that is not easily mapped to standard arithmetic. The two frameworks are not in direct competition; neuromorphic designs target ultra-low-power edge sensing and real-time control, while DSAs dominate cloud inference and training.
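The threshold-and-spike behavior described above is often introduced through the leaky integrate-and-fire neuron. The sketch below is a discrete-time software approximation with made-up default parameters; it does not describe the neuron circuits in TrueNorth or Loihi, and it steps through time in a loop, whereas event-driven hardware would do work only when spikes arrive.

```python
def lif_neuron(input_current, threshold=1.0, leak=0.5, reset=0.0):
    """Leaky integrate-and-fire neuron; returns the time steps at which it spikes.

    The membrane potential decays by `leak` each step, integrates the input,
    and is reset after crossing `threshold`.
    """
    potential = reset
    spike_times = []
    for t, current in enumerate(input_current):
        potential = potential * leak + current
        if potential >= threshold:
            spike_times.append(t)  # the only output is the event itself
            potential = reset
    return spike_times
```

The output is a sparse list of events rather than a dense value per time step: a silent input produces no events at all, and downstream neurons would have nothing to process.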
Heterogeneous Computing (2000–Present) addresses a system-level problem: how to orchestrate multiple kinds of processing units (CPUs, GPUs, FPGAs, DSPs, and custom ASICs) within a single platform. Rather than proposing a new accelerator design, heterogeneous computing provides the architectural glue: shared virtual memory, coherent interconnects, and runtime schedulers that dispatch tasks to the most suitable unit. The key insight is that no single accelerator is optimal for all workloads; a heterogeneous system can approach the efficiency of specialized hardware while retaining the flexibility to handle diverse applications. Heterogeneous computing depends on the earlier accelerator frameworks without replacing them; it is the infrastructure that lets them coexist. AMD's APU (Accelerated Processing Unit) and NVIDIA's Grace Hopper superchip are examples of this integration trend.
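The dispatch role of a runtime scheduler can be sketched as a lookup over estimated costs. Everything here is hypothetical: the task kinds, the unit names, and especially the numbers in `COST_TABLE`, which are invented relative costs for illustration and not measurements of any real system.

```python
# Hypothetical relative costs (lower is better). Illustrative values only,
# not benchmark results.
COST_TABLE = {
    "dense_matmul":    {"cpu": 100, "gpu": 5,  "fpga": 20},
    "packet_filter":   {"cpu": 30,  "gpu": 40, "fpga": 3},
    "branchy_control": {"cpu": 10,  "gpu": 50, "fpga": 60},
}


def dispatch(task_kind, available_units):
    """Pick the available unit with the lowest estimated cost for this task."""
    costs = COST_TABLE[task_kind]
    candidates = [unit for unit in available_units if unit in costs]
    if not candidates:
        raise ValueError(f"no available unit can run {task_kind!r}")
    return min(candidates, key=lambda unit: costs[unit])
```

With all three units present, the dense matrix multiply goes to the GPU; if the GPU is unavailable, the same call falls back to the next-best unit. A production runtime would also have to account for data-transfer cost and current load, which this sketch ignores.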
Domain-Specific Architecture (DSA) (2010–Present) narrows and intensifies the specialization bet that coprocessors and systolic arrays made decades earlier. A DSA is a processor designed from the ground up for a particular application domain, such as neural network inference, graph analytics, or database query processing, rather than for a single function. DSAs exploit domain-specific knowledge to eliminate unnecessary generality: they use custom data types (e.g., 8-bit integers or reduced-precision floating point), specialized memory hierarchies (e.g., software-managed scratchpads), and fixed-function datapaths for common operations. The Google TPU (systolic array for ML), the Microsoft Catapult (FPGA for Bing search ranking), and the Cerebras Wafer-Scale Engine (a wafer-sized 2D mesh of cores built for deep learning, with hardware support for sparsity) are all DSAs. Compared to GPUs, DSAs trade away programmability for higher efficiency within their target domain. Compared to coprocessors, DSAs are more ambitious: they accelerate not just one function but a whole class of algorithms, and they often include their own instruction set and compiler.
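The payoff of a custom narrow data type can be illustrated with a quantized dot product. This sketch uses 8-bit integers, one common reduced-precision choice, with symmetric scaling; the function names and the scale values in the usage note are assumptions for illustration and do not reproduce any particular accelerator's number format.

```python
def quantize(values, scale):
    """Map floats to signed 8-bit integers, clamped to the range [-127, 127]."""
    return [max(-127, min(127, round(v / scale))) for v in values]


def int8_dot(x_q, w_q, x_scale, w_scale):
    """Dot product of two quantized vectors.

    The multiply-accumulate runs entirely on integers (Python ints here,
    standing in for a wide hardware accumulator); a single floating-point
    rescale at the end converts the sum back to real units.
    """
    acc = sum(a * b for a, b in zip(x_q, w_q))
    return acc * x_scale * w_scale
```

For example, with `x = [0.5, -1.0, 0.25]`, `w = [1.0, 0.5, -2.0]`, and the hypothetical scales `1/64` and `1/32`, the quantized vectors are `[32, -64, 16]` and `[32, 16, -64]`, and `int8_dot` recovers the exact answer of -0.5. Values that do not fall on the quantization grid would pick up a small rounding error, which is the accuracy a DSA trades for narrower multipliers and less memory traffic.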
Near-Memory Computing (2010–Present) challenges an assumption shared by almost all earlier accelerator frameworks: that data must be moved to the processor. As the gap between processor speed and memory speed (the "memory wall") widened, moving data became the dominant cost in both energy and latency. Near-memory computing places computation physically close to memory (on the same die, in the same package, or inside the memory array itself) so that data moves only a short distance. This is not a new idea; it revives the logic-in-memory concepts of the 1960s, but modern 3D-stacked memory (e.g., HBM, Hybrid Memory Cube) makes it practical. Samsung's HBM-PIM and UPMEM's processing-in-memory DIMMs are commercial examples. Near-memory computing coexists with DSAs and GPUs: it does not replace them but adds a new layer in the memory hierarchy where simple operations (e.g., vector addition, search, filter) can be performed without moving data to the main processor. Its main limitation is that the compute logic must be simple and area-constrained, since it competes with memory cells for silicon area.
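The data-movement argument can be made concrete by counting bytes that cross the memory bus. The sketch below is a toy accounting model with a hypothetical `MemoryBank` class and an assumed record size of 8 bytes; it says nothing about the actual interfaces of HBM-PIM or UPMEM devices.

```python
class MemoryBank:
    """Toy memory bank that counts how many bytes cross the bus to the host."""

    def __init__(self, records, record_bytes=8):
        self.records = list(records)
        self.record_bytes = record_bytes
        self.bytes_over_bus = 0

    def read_all(self):
        """Conventional path: ship every record to the host processor."""
        self.bytes_over_bus += len(self.records) * self.record_bytes
        return list(self.records)

    def filter_near_memory(self, predicate):
        """Near-memory path: filter next to the data, ship only the matches."""
        matches = [r for r in self.records if predicate(r)]
        self.bytes_over_bus += len(matches) * self.record_bytes
        return matches
```

Filtering 1,000 records for the 10 that are multiples of 100 moves 8,000 bytes when the host does the filtering but only 80 bytes when the filter runs beside the data. The saving shrinks as the predicate becomes less selective, which is why the approach suits search and filter workloads.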
Today, no single accelerator framework dominates all domains. GPUs remain the workhorses for deep learning training and high-performance computing, thanks to their programmability and mature software ecosystems. DSAs, especially those built around systolic arrays, have captured a growing share of cloud inference, where the workload is stable and efficiency is paramount. FPGAs occupy niches in low-latency networking, financial trading, and prototyping. Neuromorphic chips are beginning to appear in battery-powered edge devices, though largely at the research and pilot stage. Near-memory computing is being explored for data-intensive workloads like database analytics and sparse linear algebra. Heterogeneous computing provides the system-level integration that lets these accelerators work together.
The leading frameworks agree on one thing: the era of scaling general-purpose processors by raising clock speeds or simply adding more cores is over. Under fixed power budgets, they hold, specialization is the only path to continued performance gains. They disagree on how much specialization is enough. GPU advocates argue that a programmable, massively parallel design can cover a broad enough range of workloads to justify its generality. DSA proponents counter that the efficiency gains from domain-specific customization are so large that they outweigh the cost of designing multiple chips. Neuromorphic architects argue that the entire digital paradigm is wasteful and that event-driven, analog-inspired computation is the only way to approach biological efficiency. These disagreements are not being resolved; they are driving a productive pluralism in which each framework finds its own domain of advantage.
Near-memory computing and DSAs are converging in some designs: placing a small DSA near memory (e.g., a systolic array on a logic layer stacked with DRAM) combines the data-movement savings of near-memory computing with the efficiency of domain-specific logic. Similarly, GPUs are absorbing DSA-like features, such as tensor cores: small matrix multiply-accumulate units embedded in the GPU datapath that play much the same role as a systolic array. The boundaries between frameworks are blurring, but the underlying tension between specialization and programmability remains the engine of innovation.