Reinforcement learning emerged from mid-20th-century optimal control and dynamic programming, which supplied its mathematical foundation: the Markov decision process. The Dynamic Programming paradigm, rooted in the Bellman equations, established the principle of solving complex sequential decision problems through value functions and iterative methods such as value iteration and policy iteration. This framework assumed a complete model of the environment and dominated early theoretical work, setting the stage for later sampling-based approaches that could operate without such models.
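The Bellman optimality backup described above can be sketched as value iteration on a toy Markov decision process. The MDP here (its transition probabilities and rewards) is entirely illustrative, generated at random for the sake of a runnable example:

```python
import numpy as np

# Hypothetical toy MDP: 3 states, 2 actions, with known transition
# probabilities P[s, a, s'] and rewards R[s, a] (illustrative values only).
n_states, n_actions, gamma = 3, 2, 0.9
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))
R = rng.standard_normal((n_states, n_actions))

# Value iteration: repeatedly apply the Bellman optimality backup
#   V(s) <- max_a [ R(s, a) + gamma * sum_s' P(s'|s, a) V(s') ]
# until the value function stops changing.
V = np.zeros(n_states)
for _ in range(1000):
    Q = R + gamma * P @ V          # Q[s, a]: expected return of taking a in s
    V_new = Q.max(axis=1)
    if np.max(np.abs(V_new - V)) < 1e-8:
        break
    V = V_new

policy = Q.argmax(axis=1)          # greedy policy w.r.t. the converged values
```

Note that the loop requires the full model `P` and `R` up front, which is exactly the assumption the later sampling-based methods relax.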
The 1980s and 1990s saw the development of Monte Carlo Methods, which learned directly from episodic experience without requiring a model of the environment's dynamics. This was complemented by the breakthrough of Temporal Difference Learning, which blended ideas from dynamic programming and Monte Carlo methods to enable incremental learning from incomplete sequences. Temporal difference methods, exemplified by algorithms such as Q-learning, became a core paradigm for model-free prediction and control, emphasizing bootstrapping and online updates.
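The bootstrapped, online character of temporal difference control can be seen in a minimal tabular Q-learning sketch. The chain environment below (five states, move left or right, reward on reaching the last state) is a hypothetical example, not a standard benchmark:

```python
import random

# Tabular Q-learning on an illustrative chain environment:
# states 0..4, action 0 moves left, action 1 moves right,
# reward 1.0 on reaching the terminal state 4.
n_states, n_actions = 5, 2
alpha, gamma, epsilon = 0.1, 0.95, 0.1
Q = [[0.0] * n_actions for _ in range(n_states)]

def step(s, a):
    """Toy dynamics for the chain (assumed for this sketch)."""
    s_next = max(0, s - 1) if a == 0 else min(n_states - 1, s + 1)
    reward = 1.0 if s_next == n_states - 1 else 0.0
    done = s_next == n_states - 1
    return s_next, reward, done

random.seed(0)
for _ in range(500):                      # episodes
    s, done = 0, False
    while not done:
        # epsilon-greedy action selection
        if random.random() < epsilon:
            a = random.randrange(n_actions)
        else:
            a = max(range(n_actions), key=lambda i: Q[s][i])
        s_next, r, done = step(s, a)
        # TD update: bootstrap from the current estimate at the next state,
        # updating online after every single transition.
        target = r + gamma * max(Q[s_next]) * (not done)
        Q[s][a] += alpha * (target - Q[s][a])
        s = s_next
```

Unlike a Monte Carlo method, the update does not wait for the episode's full return; each transition immediately adjusts the estimate toward a bootstrapped target.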
To handle large or continuous state spaces, the Function Approximation paradigm integrated reinforcement learning with supervised learning architectures, using linear approximators or neural networks to generalize across states. This period also solidified the durable rivalry between Model-Based RL, which learns an explicit environment model for planning, and Model-Free RL, which directly learns value functions or policies from interaction. These agendas represented distinct schools of thought on how to balance sample efficiency, computational cost, and robustness.
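A minimal sketch of the function-approximation idea is semi-gradient TD(0) with a linear value estimate, V(s) ≈ w · x(s). The feature map and the random-jump dynamics below are purely illustrative assumptions made so the example runs end to end:

```python
import numpy as np

# Semi-gradient TD(0) with linear function approximation.
# The features and environment dynamics here are illustrative only.
n_states, n_features, gamma, alpha = 10, 4, 0.9, 0.05
rng = np.random.default_rng(1)
features = rng.standard_normal((n_states, n_features))  # x(s) for each state

w = np.zeros(n_features)   # one weight per feature, shared across all states
s = 0
for _ in range(5000):
    s_next = int(rng.integers(n_states))   # toy dynamics: jump anywhere
    r = 1.0 if s_next == n_states - 1 else 0.0
    # TD error: bootstrapped target w . x(s') minus current prediction w . x(s)
    td_error = r + gamma * features[s_next] @ w - features[s] @ w
    w += alpha * td_error * features[s]     # semi-gradient update
    s = s_next
```

The key contrast with the tabular setting is generalization: updating `w` after visiting one state changes the value estimate of every state that shares features with it.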
The late 1990s and 2000s witnessed the rise of Policy Gradient Methods, which framed reinforcement learning as stochastic optimization over parameterized policies, enabling direct policy search in high-dimensional action spaces. This approach contrasted with value-based methods and expanded the algorithmic toolkit. The 2010s ushered in the Deep Reinforcement Learning paradigm, where deep neural networks served as powerful function approximators, unifying representation learning with decision-making and achieving breakthroughs in complex domains like game playing and robotics.
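The policy-gradient idea of stochastic optimization over a parameterized policy can be sketched with REINFORCE on a hypothetical two-armed bandit; the softmax parameterization and reward means below are assumptions for illustration:

```python
import math
import random

# REINFORCE sketch on an illustrative two-armed bandit: a softmax policy
# over two actions, parameterized directly by logits theta.
random.seed(0)
theta = [0.0, 0.0]     # one logit per action
alpha = 0.1            # step size

def softmax(logits):
    m = max(logits)                      # subtract max for numerical stability
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

for _ in range(2000):
    probs = softmax(theta)
    a = 0 if random.random() < probs[0] else 1
    # Illustrative reward model: arm 1 pays more on average.
    reward = random.gauss(0.2 if a == 0 else 1.0, 0.1)
    # Policy gradient step: for a softmax policy,
    # grad_theta log pi(a) = one_hot(a) - probs.
    for i in range(2):
        grad_log = (1.0 if i == a else 0.0) - probs[i]
        theta[i] += alpha * reward * grad_log
```

No value function is maintained at all; the sampled return directly reweights the score-function gradient, which is what lets the same recipe scale to high-dimensional or continuous action spaces.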
Today, the field continues to evolve within these established frameworks, with ongoing synthesis between model-based and model-free approaches, advances in off-policy and multi-agent learning, and sustained efforts to meet the challenges of scale. The historical spine remains anchored by Dynamic Programming, Monte Carlo Methods, Temporal Difference Learning, Function Approximation, the model-based versus model-free divide, Policy Gradient Methods, and Deep Reinforcement Learning as the principal durable agendas.