How can an agent learn to make good decisions when the consequences of its actions are delayed and the environment is unknown? This is the credit assignment problem that has driven reinforcement learning (RL) since its earliest days. An agent must figure out which past actions deserve credit or blame for a reward received much later, and it must do so while exploring an unfamiliar world. Over six decades, RL researchers have produced a family of frameworks that answer this question in different ways: some rely on a known model of the world, others learn entirely from experience; some estimate the value of states or actions, others directly optimize a policy; some work with a single agent, others with many; some operate on flat time steps, others on hierarchical abstractions. The frameworks have not replaced one another in a clean succession. Instead, they have coexisted, absorbed each other's insights, and recombined into hybrid algorithms that dominate modern practice.
In 1957, Richard Bellman introduced dynamic programming (DP) as a method for optimal control when the environment's dynamics are fully known. DP relies on the Bellman equation, which expresses the value of a state as the maximum, over actions, of the expected immediate reward plus the discounted value of the successor state. If the transition probabilities and reward function are given, DP can compute an optimal policy by iteratively updating value estimates. The framework established the mathematical foundation for all later RL: the recursive structure of value functions and the idea of bootstrapping, that is, updating estimates based on other estimates. But DP assumed a perfect model, and its computational cost grew explosively because the number of states grows exponentially with the number of state variables, a problem Bellman called the curse of dimensionality. For any realistic problem, the model would be unknown or the state space too large. DP therefore served as an ideal benchmark rather than a practical learning algorithm. The frameworks that followed each relaxed one or more of its assumptions.
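The iterative update described above can be sketched as value iteration. This is a minimal illustration rather than a reference implementation: the three-state MDP, its action names, its probabilities, and the discount factor are all invented for the example.

```python
# Hypothetical 3-state MDP, invented for illustration only.
# P[s][a] is a list of (probability, next_state, reward) triples.
P = {
    0: {"stay": [(1.0, 0, 0.0)], "go": [(0.8, 1, 0.0), (0.2, 0, 0.0)]},
    1: {"stay": [(1.0, 1, 0.0)], "go": [(0.8, 2, 1.0), (0.2, 1, 0.0)]},
    2: {"stay": [(1.0, 2, 0.0)], "go": [(1.0, 2, 0.0)]},  # absorbing goal
}
GAMMA = 0.9  # assumed discount factor


def backup(V, s, a):
    """Expected one-step return of taking action a in state s, given V."""
    return sum(p * (r + GAMMA * V[s2]) for p, s2, r in P[s][a])


def value_iteration(tol=1e-10):
    """Apply the Bellman optimality backup until the values stop changing."""
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            best = max(backup(V, s, a) for a in P[s])
            delta = max(delta, abs(best - V[s]))
            V[s] = best  # bootstrapping: new estimate built from old estimates
        if delta < tol:
            break
    # Read off a greedy policy from the converged values.
    policy = {s: max(P[s], key=lambda a: backup(V, s, a)) for s in P}
    return V, policy


V, policy = value_iteration()
```

On this toy problem the values settle near 0.857 for state 0 and 0.976 for state 1, and the greedy policy chooses "go" in both. Note that the loop needs the full table `P`, which is exactly the known-model assumption that later frameworks relax.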
In 1983, Barto, Sutton, and Anderson proposed a different architecture that kept DP's interplay between evaluating a policy and improving it but dropped the need for a known model. Their actor-critic method maintained two separate components: an actor that selects actions and a critic that evaluates the states the actor reaches. The critic learns a value function from experience, and its evaluation signal, the temporal-difference error, drives improvements to the actor. This separation was a decisive break from DP: the agent no longer needed a model of the world, and learning could proceed incrementally from real experience. Early actor-critic methods suffered from high variance in their policy updates, but the architecture itself proved remarkably durable. It introduced the idea of using a learned baseline (the critic) to reduce variance, a technique that later became central to policy gradient methods. Actor-critic methods did not disappear; they persisted as a flexible hybrid that could absorb insights from both value-based and policy-based approaches.
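The division of labor between actor and critic can be sketched with a tabular one-step actor-critic. This is a modern textbook-style formulation, not the 1983 architecture itself. The five-state chain environment, the softmax actor, the step sizes, and the step cap are all assumptions made for the example.

```python
import math
import random

N = 5                      # hypothetical chain: states 0..4, state 4 is the goal
ACTIONS = (-1, +1)         # move left, move right
GAMMA = 0.9
ALPHA_V, ALPHA_PI = 0.1, 0.1


def step(s, a):
    """Deterministic chain dynamics: reward 1 only on reaching the goal."""
    s2 = min(max(s + a, 0), N - 1)
    done = s2 == N - 1
    return s2, (1.0 if done else 0.0), done


def softmax(prefs):
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    z = sum(exps)
    return [e / z for e in exps]


def actor_critic(episodes=3000, max_steps=200, seed=0):
    rng = random.Random(seed)
    V = [0.0] * N                          # critic: state-value estimates
    H = [[0.0, 0.0] for _ in range(N)]     # actor: action preferences per state
    for _episode in range(episodes):
        s = 0
        for _t in range(max_steps):
            pi = softmax(H[s])
            i = 0 if rng.random() < pi[0] else 1
            s2, r, done = step(s, ACTIONS[i])
            # The critic's TD error is the single learning signal for both parts.
            td_error = r + (0.0 if done else GAMMA * V[s2]) - V[s]
            V[s] += ALPHA_V * td_error
            for j in range(2):
                grad_log = (1.0 if j == i else 0.0) - pi[j]
                H[s][j] += ALPHA_PI * td_error * grad_log
            s = s2
            if done:
                break
    return V, [softmax(h) for h in H]


V_ac, probs = actor_critic()
```

No transition table appears anywhere in the learner: `step` is only called to generate experience, and the actor is adjusted solely through the critic's `td_error`.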
In 1988, Sutton formalized temporal-difference (TD) learning, which directly addressed the credit assignment problem by updating value estimates based on the difference between successive predictions. Unlike DP, TD learning required no model; unlike Monte Carlo methods, it did not wait until the end of an episode to make an update. TD learning bootstraps: it updates the value of a state using the value of the next state, which is itself an estimate. This makes learning faster and more data-efficient, but it also introduces bias. Sutton's TD(λ) algorithm unified the spectrum from pure bootstrapping (λ=0) to Monte Carlo (λ=1), giving practitioners a tunable trade-off between bias and variance. TD learning became the backbone of nearly all subsequent value-based RL. Its key limitation was that it learned a value function for the policy being followed—on-policy learning—which constrained exploration.
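The core TD update can be illustrated with TD(0) prediction, the λ=0 end of the spectrum. The sketch below assumes a small symmetric random walk as the environment; the number of states, the step size, and the seed are illustrative choices, not values from any particular paper.

```python
import random

N_STATES = 7   # hypothetical random walk: states 0..6, with 0 and 6 terminal
START = 3
ALPHA = 0.01   # assumed step size
GAMMA = 1.0    # undiscounted episodic task


def td0_random_walk(episodes=5000, seed=0):
    """Estimate state values under a uniformly random left/right policy."""
    rng = random.Random(seed)
    V = [0.0] * N_STATES               # terminal entries stay at 0
    for _ in range(episodes):
        s = START
        while s not in (0, N_STATES - 1):
            s2 = s + rng.choice((-1, 1))
            r = 1.0 if s2 == N_STATES - 1 else 0.0
            # Difference between successive predictions, available immediately.
            td_error = r + GAMMA * V[s2] - V[s]
            V[s] += ALPHA * td_error   # update now, not at the end of the episode
            s = s2
    return V


V_td = td0_random_walk()
```

For this walk the true value of state s is s/6, so the estimates should approach 1/6, 2/6, and so on for the interior states. Each update uses `V[s2]`, itself an estimate, which is the source of both the speed and the bias mentioned above.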
In 1989, Watkins introduced Q-learning, which extended TD learning to learn the value of state-action pairs (Q-values) rather than just state values. The critical innovation was off-policy learning: Q-learning could learn the optimal policy's value function while following a different, exploratory policy. This decoupling of behavior and learning allowed agents to reuse past experience and explore more freely. Q-learning converged to the optimal Q-values under mild conditions, at least in tabular settings where each state-action pair had its own entry. The framework's simplicity and theoretical guarantees made it the most widely used value-based method for years. But tabular Q-learning could not scale to large or continuous state spaces; every new state required a new entry. The need for function approximation—using a learned function to generalize across states—became the next frontier.
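A minimal tabular sketch makes the off-policy idea concrete. The chain environment, step size, and discount factor below are assumptions for illustration. The behavior policy is deliberately uniformly random, while the update target uses the greedy maximum over next actions.

```python
import random

N = 5                 # hypothetical chain: states 0..4, state 4 is the goal
ACTIONS = (-1, +1)    # move left, move right
GAMMA, ALPHA = 0.9, 0.5


def step(s, a):
    """Deterministic chain dynamics: reward 1 only on reaching the goal."""
    s2 = min(max(s + a, 0), N - 1)
    done = s2 == N - 1
    return s2, (1.0 if done else 0.0), done


def q_learning(episodes=500, seed=0):
    rng = random.Random(seed)
    Q = {(s, a): 0.0 for s in range(N) for a in ACTIONS}  # one entry per pair
    for _ in range(episodes):
        s, done = 0, False
        while not done:
            a = rng.choice(ACTIONS)            # behavior policy: purely random
            s2, r, done = step(s, a)
            # Target policy: greedy, regardless of what the agent does next.
            best_next = 0.0 if done else max(Q[(s2, b)] for b in ACTIONS)
            Q[(s, a)] += ALPHA * (r + GAMMA * best_next - Q[(s, a)])
            s = s2
    return Q


Q = q_learning()
greedy = [max(ACTIONS, key=lambda a: Q[(s, a)]) for s in range(N - 1)]
```

Although the agent never acts greedily while learning, the greedy policy read from `Q` moves right in every state, and `Q[(0, 1)]` approaches 0.9 cubed (0.729). The dictionary with one entry per state-action pair is also what fails to scale: an unseen state has no entry to generalize from.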
While value-based methods learned directly from experience, a separate line of work asked whether agents should first learn a model of the environment and then plan within it. In 1991, Sutton's Dyna architecture integrated learning, planning, and reacting: the agent learned a model from real experience, used that model to simulate additional experience (planning), and updated its value function from both real and simulated data. Model-based RL offered a potential advantage in sample efficiency—by planning, the agent could learn from fewer real interactions. But learned models are imperfect; errors in the model compound during planning, leading to policies that exploit model inaccuracies. The tension between model-based and model-free RL became one of the subfield's enduring methodological debates. Model-based methods excel when simulation is cheap and the model is accurate; model-free methods are more robust when the model is hard to learn. Modern RL often combines both, using a learned model for planning while maintaining a model-free value function as a fallback.
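The learn, plan, and act loop can be sketched in the style of tabular Dyna-Q. This is a simplified illustration under stated assumptions: the chain environment is hypothetical, the learned model is a plain lookup table that is only adequate because the environment is deterministic, and the hyperparameters are arbitrary.

```python
import random

N = 6                 # hypothetical chain: states 0..5, state 5 is the goal
ACTIONS = (-1, +1)
GAMMA, ALPHA, EPSILON = 0.9, 0.5, 0.1


def step(s, a):
    """The real environment: deterministic chain, reward 1 at the goal."""
    s2 = min(max(s + a, 0), N - 1)
    done = s2 == N - 1
    return s2, (1.0 if done else 0.0), done


def dyna_q(episodes=30, planning_steps=20, seed=0):
    rng = random.Random(seed)
    Q = {(s, a): 0.0 for s in range(N) for a in ACTIONS}
    model = {}        # (s, a) -> (r, s2, done), filled in from real transitions

    def update(s, a, r, s2, done):
        best_next = 0.0 if done else max(Q[(s2, b)] for b in ACTIONS)
        Q[(s, a)] += ALPHA * (r + GAMMA * best_next - Q[(s, a)])

    real_steps = 0
    for _ in range(episodes):
        s, done = 0, False
        while not done:
            if rng.random() < EPSILON:
                a = rng.choice(ACTIONS)
            else:  # greedy with random tie-breaking
                best = max(Q[(s, b)] for b in ACTIONS)
                a = rng.choice([b for b in ACTIONS if Q[(s, b)] == best])
            s2, r, done = step(s, a)
            real_steps += 1
            update(s, a, r, s2, done)          # learn from real experience
            model[(s, a)] = (r, s2, done)      # learn the model
            for _ in range(planning_steps):    # plan with simulated experience
                ps, pa = rng.choice(list(model))
                pr, ps2, pdone = model[(ps, pa)]
                update(ps, pa, pr, ps2, pdone)
            s = s2
    return Q, real_steps


Q_dyna, real_steps = dyna_q()
greedy_dyna = [max(ACTIONS, key=lambda a: Q_dyna[(s, a)]) for s in range(N - 1)]
```

Setting `planning_steps=0` reduces the sketch to ordinary Q-learning, which gives a simple way to compare how many real interactions each variant needs. In a stochastic or partially observed environment the table `model` would be wrong in places, and the planning loop would propagate those errors into `Q`.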
In 1992, Williams introduced REINFORCE, a policy gradient method that directly parameterized the policy and updated its parameters by following an estimate of the gradient of expected return. Unlike value-based methods, which first learn a value function and then derive a policy, policy gradient methods optimized the policy itself. This made them natural for continuous action spaces, where argmax over Q-values is impractical. The trade-off was high variance: REINFORCE's gradient estimate, based on complete episodes, fluctuated wildly. Policy gradient methods coexisted with value-based methods because each had complementary strengths. Value-based methods were sample-efficient for discrete actions; policy gradient methods handled continuous actions and stochastic policies more naturally. The two lines eventually merged in modern actor-critic algorithms (A3C, PPO, SAC), which use a learned value function as a baseline to reduce variance while still optimizing a parameterized policy. This synthesis absorbed both traditions rather than letting one dominate.
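The REINFORCE update can be sketched on the smallest possible episodic task. The example assumes a two-action problem with one-step episodes and deterministic payoffs, a softmax policy over two preference parameters, and no baseline. All of these are simplifications chosen to keep the gradient estimator visible.

```python
import math
import random

REWARDS = (0.0, 1.0)   # hypothetical payoff of action 0 and action 1
ALPHA = 0.1            # assumed step size


def softmax(prefs):
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    z = sum(exps)
    return [e / z for e in exps]


def reinforce(episodes=2000, seed=0):
    rng = random.Random(seed)
    theta = [0.0, 0.0]                 # policy parameters (action preferences)
    for _ in range(episodes):
        pi = softmax(theta)
        a = 0 if rng.random() < pi[0] else 1
        G = REWARDS[a]                 # return of this one-step episode
        for b in range(len(theta)):
            # Gradient of log pi(a) with respect to theta[b] for a softmax policy.
            grad_log = (1.0 if b == a else 0.0) - pi[b]
            theta[b] += ALPHA * G * grad_log
    return softmax(theta)


pi_final = reinforce()
```

No value function is learned and no argmax is taken: the parameters move directly along the sampled gradient, and the policy ends up choosing action 1 almost always. With noisy, multi-step returns the same estimator fluctuates far more, which is why a learned baseline is usually subtracted from `G` in practice.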
In 1994, Littman extended RL to settings with multiple interacting agents by framing the problem as a Markov game. In multi-agent RL (MARL), each agent's reward depends on the joint actions of all agents, and the environment becomes non-stationary from any single agent's perspective, because the other agents are learning and changing their behavior. This broke the stationarity assumption built into the Markov decision process, which had underpinned all earlier frameworks. MARL introduced concepts from game theory, such as Nash equilibria and correlated equilibria, as solution concepts. The framework did not replace single-agent RL; it coexisted as a specialization for problems like robotics coordination, traffic control, and economic simulations. MARL remains an active area because the non-stationarity challenge resists simple solutions; agents must balance learning, exploration, and strategic reasoning about others.
Parr and Russell in 1997, and independently Sutton, Precup, and Singh in 1999, introduced hierarchical RL (HRL) to handle tasks with long time horizons. HRL extended the RL framework with temporal abstraction: instead of choosing primitive actions at every time step, an agent could choose higher-level actions (options) that themselves invoke lower-level policies until a termination condition is met. The options framework formalized this as a semi-Markov decision process, where actions take variable amounts of time. HRL shared with model-based RL the idea of imposing structure on the learning problem, in this case hierarchical structure rather than a learned world model. The challenge was that the hierarchy itself had to be learned or hand-designed; discovering useful subgoals and abstractions automatically remained an open problem. HRL did not replace flat RL; it added a layer of abstraction that could be combined with any underlying learning algorithm.
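The idea of an option as a temporally extended action can be sketched as a small data structure plus an execution loop. The `Option` class, the `run_option` helper, the chain environment, and the two hand-designed options are all hypothetical names and choices made for this example, not an API from any library.

```python
from dataclasses import dataclass
from typing import Callable

N = 9        # hypothetical chain: states 0..8, state 8 is the goal
GAMMA = 0.9


def step(s, a):
    """Primitive dynamics: reward 1 only on reaching the goal."""
    s2 = min(max(s + a, 0), N - 1)
    done = s2 == N - 1
    return s2, (1.0 if done else 0.0), done


@dataclass
class Option:
    name: str
    policy: Callable[[int], int]        # state -> primitive action
    terminates: Callable[[int], bool]   # state -> True when the option ends


def run_option(s, option, max_steps=100):
    """Run an option until it terminates.

    Returns (next_state, discounted_reward, duration, episode_done), the
    quantities a semi-Markov update would need.
    """
    total, discount, k, done = 0.0, 1.0, 0, False
    while k < max_steps:
        s, r, done = step(s, option.policy(s))
        total += discount * r
        discount *= GAMMA
        k += 1
        if done or option.terminates(s):
            break
    return s, total, k, done


# Two hand-designed options: walk to the midpoint, then walk to the goal.
to_mid = Option("to_mid", policy=lambda s: +1, terminates=lambda s: s == 4)
to_goal = Option("to_goal", policy=lambda s: +1, terminates=lambda s: s == N - 1)

mid_state, mid_reward, mid_steps, _ = run_option(0, to_mid)
end_state, end_reward, end_steps, finished = run_option(mid_state, to_goal)
```

From the high-level agent's point of view, each call to `run_option` is a single decision that happens to last `k` primitive steps, so a learning update over options would discount the next value by `GAMMA ** k` rather than by `GAMMA`. Here the subgoal at state 4 is hand-designed, which is precisely the part that remains hard to discover automatically.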
In 2013, Mnih et al. combined Q-learning with deep neural networks to create the Deep Q-Network (DQN), which learned to play Atari games directly from raw pixels. This was not a new algorithmic framework in the sense of a new Bellman equation or gradient estimator; it was a demonstration that deep neural networks could serve as function approximators for RL, scaling to high-dimensional inputs that had been intractable for tabular or linear methods. Deep RL introduced new challenges: instability when combining nonlinear function approximation with bootstrapping, catastrophic forgetting, and sample inefficiency. Techniques like experience replay and target networks addressed some of these issues. Deep RL transformed the field because it could be combined with virtually every earlier framework: deep value-based methods (DQN), deep policy gradient and actor-critic methods (A3C, PPO, SAC), deep model-based methods (Dreamer), deep hierarchical methods, and deep multi-agent methods. It did not replace earlier frameworks; it became the infrastructure through which they were applied to real-world problems.
Today, no single framework dominates. Value-based methods remain the standard for discrete-action problems with large state spaces, especially when combined with deep networks. Policy gradient methods are preferred for continuous control and robotics. Actor-critic methods have become the default architecture for most deep RL applications, absorbing both value-based and policy-gradient insights. Model-based methods are experiencing a revival in domains where sample efficiency matters, such as robotics and scientific discovery. Hierarchical RL is used in long-horizon planning tasks, though automatic hierarchy discovery remains an active research frontier. Multi-agent RL is essential for any application involving multiple learners. The frameworks agree on the core mathematical language—Markov decision processes, value functions, and policy optimization—but disagree on which inductive biases are most important: model accuracy versus model-free robustness, temporal abstraction versus flat reactivity, value estimation versus direct policy search. The deepest disagreement is probably between model-based and model-free approaches: should an agent invest effort in learning a world model, or should it learn directly from experience? The answer depends on the problem, and the field has learned to treat the frameworks as a toolkit rather than a competition.