A robot exploring an unknown environment faces a chicken-and-egg problem: to navigate accurately it needs a map, but to build a map it needs to know where it is. This circular dependency, compounded by sensor noise and dynamic surroundings, defines the simultaneous localization and mapping (SLAM) problem. Over four decades, SLAM frameworks have competed on how to represent the world, manage uncertainty, and scale to large environments. Seven major frameworks mark the field’s evolution from recursive filters to modern learning-based systems.
The first unified probabilistic treatment of SLAM emerged in the mid-1980s. Smith, Self, and Cheeseman (1986) formalized the idea that a robot’s pose and the positions of landmarks should be tracked jointly via an extended Kalman filter (EKF). The key insight was that errors in the robot’s pose and landmark estimates are correlated, and ignoring those correlations leads to inconsistent maps. The EKF maintained a full covariance matrix across all landmarks and the robot state, providing a principled way to fuse noisy odometry and observations. However, the covariance matrix grows quadratically with the number of landmarks, and every measurement update touches all of it, which limited this approach to small-scale environments. Filter-based SLAM addressed the fundamental need for uncertainty management, but its computational bottleneck prompted alternative representations.
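The role of the correlations can be illustrated with a toy joint filter. The sketch below is a minimal example under strong assumptions, not the formulation of Smith, Self, and Cheeseman: the world is one-dimensional, the state holds one robot position and one landmark position, and the measurement model is linear, so the EKF reduces to an ordinary Kalman filter. The function names and all noise values are hypothetical.

```python
def ekf_predict(x, P, u, q):
    """Move the robot by odometry u; motion noise q inflates only the robot variance."""
    x = [x[0] + u, x[1]]
    P = [[P[0][0] + q, P[0][1]],
         [P[1][0], P[1][1]]]
    return x, P


def ekf_update(x, P, z, r):
    """Fuse a relative measurement z = landmark - robot + noise (variance r)."""
    H = [-1.0, 1.0]  # Jacobian of z with respect to [robot, landmark]
    PHt = [P[0][0] * H[0] + P[0][1] * H[1],
           P[1][0] * H[0] + P[1][1] * H[1]]
    S = H[0] * PHt[0] + H[1] * PHt[1] + r  # innovation variance
    K = [PHt[0] / S, PHt[1] / S]           # Kalman gain
    innovation = z - (x[1] - x[0])
    x = [x[0] + K[0] * innovation, x[1] + K[1] * innovation]
    # P <- P - K S K^T, which equals (I - K H) P for the optimal gain.
    P = [[P[i][j] - K[i] * S * K[j] for j in range(2)] for i in range(2)]
    return x, P


# Robot position well known, landmark position essentially unknown (assumed values).
x = [0.0, 0.0]
P = [[0.01, 0.0],
     [0.0, 100.0]]

x, P = ekf_update(x, P, z=5.0, r=0.1)   # first sighting of the landmark
cross_cov_after_first = P[0][1]          # robot-landmark correlation appears here

x, P = ekf_predict(x, P, u=1.0, q=0.5)  # noisy motion
x, P = ekf_update(x, P, z=3.8, r=0.1)   # re-observation corrects both estimates
```

After the first update the off-diagonal entry of `P` is no longer zero: the landmark estimate was derived from the robot estimate, so their errors move together. With n landmarks the same matrix has on the order of n² entries, which is the scaling problem described above.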
While filter-based SLAM focused on sparse landmark maps, a parallel line of work targeted dense spatial representation for navigation. Moravec and Elfes (1985) introduced occupancy grid mapping, which discretizes the environment into cells, each storing a probability of being occupied. This representation is sensor-agnostic (it can integrate sonar, laser, or camera data) and directly supports path planning by labeling free space. Occupancy grids do not solve the full SLAM problem on their own; they typically assume a known pose or rely on separate localization. In practice, they complement estimation frameworks: filter-based or later graph-based SLAM can provide pose estimates while occupancy grids serve as the map representation, especially for indoor mobile robots.
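A common way to maintain the per-cell probabilities is to accumulate evidence in log-odds form. The following is a minimal sketch under stated assumptions: the grid is one-dimensional, the sensor sits at cell 0 with a known pose, and the inverse sensor model values (0.7 for a hit, 0.3 for a pass-through) are made up for illustration. The names `update_ray`, `L_OCC`, and `L_FREE` are hypothetical.

```python
import math


def logit(p):
    """Convert a probability to log-odds."""
    return math.log(p / (1.0 - p))


def prob(log_odds):
    """Convert log-odds back to a probability."""
    return 1.0 - 1.0 / (1.0 + math.exp(log_odds))


# Hypothetical inverse sensor model (assumed values, not taken from any paper).
L_OCC = logit(0.7)   # evidence added to the cell where the beam ends
L_FREE = logit(0.3)  # evidence added to cells the beam passes through


def update_ray(grid, hit_index):
    """Update a 1D log-odds grid for one range reading taken from cell 0.

    Cells before hit_index are observed free, the cell at hit_index is
    observed occupied, and cells beyond it are left untouched.
    """
    for i in range(hit_index):
        grid[i] += L_FREE
    if hit_index < len(grid):
        grid[hit_index] += L_OCC
    return grid


grid = [0.0] * 6      # log-odds 0 corresponds to p = 0.5 (unknown)
for _ in range(3):    # three identical readings that end in cell 4
    update_ray(grid, 4)

probs = [round(prob(l), 3) for l in grid]
```

Repeated consistent readings push traversed cells toward free and the end cell toward occupied, while the cell behind the obstacle stays at 0.5. Note that the update takes the sensor pose as given, which is why the grid needs an external localization source.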
The late 1990s brought a paradigm shift from recursive filtering to optimization over a graph of poses and constraints. Lu and Milios (1997) demonstrated globally consistent alignment of laser scans, and subsequent work formalized SLAM as a sparse graph: nodes represent robot poses (and sometimes landmarks), and edges represent spatial constraints from odometry or loop closures. The new architecture split the problem into a front-end (data association, constraint extraction) and a back-end (graph optimization). This separation allowed scalable solutions: constraints are added incrementally, and the back-end solves a nonlinear least-squares problem that exploits sparsity for efficiency. Graph-based SLAM replaced filtering for most large-scale applications and became the backbone upon which later visual and semantic systems would be built.
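The back-end step can be sketched on a deliberately tiny problem. The example below assumes one-dimensional poses, so each constraint is linear and a single least-squares solve suffices; real systems work on SE(2) or SE(3), iterate Gauss-Newton or Levenberg-Marquardt steps, and use sparse solvers. The edge format `(i, j, z, w)` and the function names are hypothetical.

```python
def solve_linear(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(M[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (M[r][n] - s) / M[r][r]
    return x


def optimize_pose_graph(num_poses, edges):
    """Least-squares estimate of 1D poses from relative constraints.

    Each edge (i, j, z, w) states that pose j minus pose i was measured
    as z, with weight w (inverse variance). Pose 0 is pinned at 0 to
    remove the gauge freedom, so poses 1..num_poses-1 are the unknowns.
    """
    n = num_poses - 1
    H = [[0.0] * n for _ in range(n)]  # information matrix (dense here, sparse in practice)
    g = [0.0] * n
    for i, j, z, w in edges:
        # Residual is x_j - x_i - z; its Jacobian is -1 for x_i and +1 for x_j.
        terms = ((i, -1.0), (j, 1.0))
        for a, ja in terms:
            if a == 0:
                continue
            g[a - 1] += w * ja * z
            for c, jc in terms:
                if c == 0:
                    continue
                H[a - 1][c - 1] += w * ja * jc
    return [0.0] + solve_linear(H, g)


# The robot drives out and back; odometry overestimates the outbound leg.
odometry = [(0, 1, 1.1, 1.0), (1, 2, 1.1, 1.0), (2, 3, -1.0, 1.0), (3, 4, -1.0, 1.0)]
loop_closure = [(0, 4, 0.0, 1.0)]  # front-end recognizes the start location

poses = optimize_pose_graph(5, odometry + loop_closure)
```

Dead reckoning alone would place the final pose at 0.2; with the loop closure added, the optimizer spreads that discrepancy across all five equally weighted constraints. Each edge touches only two poses, which is what makes the information matrix sparse as the trajectory grows.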
As cameras became ubiquitous, SLAM researchers turned to vision as the primary sensor. Feature-based visual SLAM extracts sparse, repeatable keypoints (e.g., corners) from images, matches them across frames by their descriptors, and uses them as landmarks. PTAM (Klein and Murray 2007) introduced an influential architecture: one thread tracks the camera against a map of features, while another thread builds and refines the map, hence the name parallel tracking and mapping. This family of methods adopted optimization-based back-ends (bundle adjustment over keyframes) in the spirit of graph-based SLAM but narrowed the sensor to monocular or stereo cameras. Feature-based methods excel in textured environments but struggle in low-texture or dynamic scenes. They were the dominant visual SLAM approach for many years, exemplified by the widely adopted ORB-SLAM (Mur-Artal et al. 2015).
An alternative philosophy argues that extracting sparse features discards useful information. Direct SLAM operates on raw pixel intensities, aligning images photometrically without explicit feature detection. DTAM (Newcombe et al. 2011) reconstructed dense depth maps at keyframes, while LSD-SLAM (Engel et al. 2014) achieved semi-dense mapping in real time. By using all pixels with sufficient gradient, direct methods can handle low-texture scenes better than feature-based ones. However, they are sensitive to lighting changes and work best when motion between consecutive frames is small, which in practice means high frame rates. The direct versus feature-based debate is an ongoing disagreement about representation: sparse geometric stability versus dense photometric richness. Both approaches coexist today, often hybridized in modern systems.
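Photometric alignment can be illustrated with a toy example. The sketch below makes several simplifying assumptions: the images are one-dimensional intensity rows, the unknown motion is a pure integer pixel shift, and the minimum is found by brute-force search. Actual direct methods warp pixels through a depth map and a full camera pose and minimize the error with iterative optimization. The function names and the gradient threshold are hypothetical.

```python
def photometric_error(ref, cur, shift, min_gradient=0.0):
    """Mean squared intensity difference between ref[i] and cur[i + shift].

    Only pixels whose reference gradient magnitude is at least
    min_gradient contribute, mimicking a semi-dense pixel selection.
    """
    total, count = 0.0, 0
    for i in range(1, len(ref) - 1):
        j = i + shift
        if not 0 <= j < len(cur):
            continue
        gradient = abs(ref[i + 1] - ref[i - 1]) / 2.0
        if gradient < min_gradient:
            continue
        total += (ref[i] - cur[j]) ** 2
        count += 1
    return total / count if count else float("inf")


def align(ref, cur, max_shift, min_gradient=0.0):
    """Brute-force search for the integer shift with the lowest photometric error."""
    shifts = range(-max_shift, max_shift + 1)
    return min(shifts, key=lambda s: photometric_error(ref, cur, s, min_gradient))


ref = [10, 10, 10, 40, 90, 40, 10, 10, 10, 10, 10, 10]
cur = [10, 10, 10, 10, 10, 40, 90, 40, 10, 10, 10, 10]  # same scene moved by 2 pixels

best_shift = align(ref, cur, max_shift=3, min_gradient=5.0)

# A uniform brightness change leaves a large residual even at the correct shift.
brighter = [v + 30 for v in cur]
error_after_lighting_change = photometric_error(ref, brighter, best_shift)
```

No keypoints or descriptors are involved: the alignment uses intensities directly, restricted here to pixels with enough gradient. The last two lines show the lighting sensitivity mentioned above, since a global brightness offset produces a nonzero error at the true alignment.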
Geometric maps are sufficient for localization but not for higher-level reasoning. Semantic SLAM enriches the map with object-level recognition. SLAM++ (Salas-Moreno et al. 2013) detected and tracked known object models as landmarks, enabling the robot to understand that a chair is a chair, not just a cluster of points. This framework layers semantic labels onto a geometric (often graph-based) back-end. Semantic information helps with data association (objects are persistent and recognizable) and enables goal-directed behavior. Semantic SLAM does not replace earlier frameworks; it extends them, and many recent systems combine object detection with feature-based or direct SLAM.
Deep learning has permeated every component of the SLAM pipeline. Learned depth estimation from single images can replace stereo or RGB-D sensors; learned feature extractors (e.g., SuperPoint, D2-Net) improve robustness; end-to-end pose regression networks attempt to bypass explicit geometry. CNN-SLAM (Tateno et al. 2017) demonstrated a monocular system that fused CNN-predicted depth maps with depth measurements from direct monocular SLAM. Learning-based SLAM often retains a classical back-end for optimization, but fully differentiable systems (e.g., DROID-SLAM) aim to train the entire SLAM pipeline from data. The core question is whether learned components supplant or merely augment classical modules. Currently, learned front-ends with geometric back-ends dominate, and the field debates the importance of explicit geometric priors.
Today, the graph-based back-end is near-universal, serving as the optimization core for feature-based, direct, semantic, and learning-based systems. Filter-based SLAM persists mainly on small-scale or resource-constrained platforms. The major active debates cluster around representation: sparse versus dense (features versus raw pixel data), geometric versus semantic, and handcrafted versus learned. Data association, the problem of matching observations to map elements, remains a unifying challenge that each framework addresses differently. Most leading systems combine multiple ideas: for instance, ORB-SLAM3 uses a graph back-end with a feature-based front-end, while DROID-SLAM replaces feature extraction with learned depth and optical flow and retains a differentiable optimization layer. The field agrees on the necessity of uncertainty handling and loop closure, but disagrees on how much of the pipeline should be learned and whether explicit geometric models are still needed. The competition among these frameworks has driven steady progress toward robust, real-time perception for robots operating in the wild.