cs.LG · Jun 8, 2026 · classified

Rethinking the Divergence Regularization in LLM RL

Jiarui Yao, Xiangxin Zhou, Penghui Qi, Wee Sun Lee, Liefeng Bo, Tianyu Pang

Paper Guide Brief

Reading Brief

This paper proposes Divergence Regularized Policy Optimization (DRPO), a new reinforcement learning algorithm for fine-tuning large language models. DRPO replaces the hard binary mask used in DPPO with a smooth advantage-weighted quadratic regularizer on the absolute probability shift of sampled tokens, thereby providing continuous gradient weights that attenuate diverging updates and offer corrective signals beyond the trust-region boundary. Experiments across multiple model scales, architectures, and precision settings demonstrate improved training stability and efficiency over PPO, GRPO, SPO, and DPPO.
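To make the contrast concrete, here is a minimal sketch of the two per-token gradient weights described above. It is not the paper's exact objective. It assumes an illustrative surrogate of the form A·d − |A|·d²/(2δ), where d is the sampled token's probability shift and δ is a hypothetical trust-region radius; differentiating in d gives a weight of 1 − sign(A)·d/δ on the advantage. The function names and the value δ = 0.2 are assumptions made for illustration.

```python
def signed_shift(p_new: float, p_old: float, advantage: float) -> float:
    """Shift of the sampled token's probability, signed so that a positive
    value means the policy has already moved the way the advantage pushes."""
    direction = 1.0 if advantage >= 0 else -1.0
    return direction * (p_new - p_old)


def hard_mask_weight(p_new: float, p_old: float, advantage: float, delta: float) -> float:
    """Hard mask (assumed form): the token's gradient is dropped once it has
    moved past the boundary in the direction the advantage pushes."""
    return 0.0 if signed_shift(p_new, p_old, advantage) > delta else 1.0


def smooth_weight(p_new: float, p_old: float, advantage: float, delta: float) -> float:
    """Weight induced by the illustrative quadratic penalty (assumed form):
    d/dd [A*d - |A|*d**2 / (2*delta)] = A * (1 - sign(A)*d/delta)."""
    return 1.0 - signed_shift(p_new, p_old, advantage) / delta


delta = 0.2      # hypothetical trust-region radius
p_old = 0.5      # probability of the sampled token under the behavior policy
advantage = 1.0  # positive advantage: the update pushes the probability up

for p_new in (0.4, 0.5, 0.6, 0.7, 0.8):
    hard = hard_mask_weight(p_new, p_old, advantage, delta)
    smooth = smooth_weight(p_new, p_old, advantage, delta)
    print(f"p_new={p_new:.1f}  hard={hard:.1f}  smooth={smooth:+.2f}")
```

Under these assumptions the hard mask jumps from 1 to 0 at the boundary, while the smooth weight falls linearly, reaches zero at a shift of δ, and turns negative beyond it, which is the corrective signal. Because the shift is a difference of probabilities, the smooth weight stays within 1 ± 1/δ.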

Central Claim

A novel policy optimization algorithm (DRPO) that replaces the hard divergence-based mask of DPPO with a smooth, advantage-weighted quadratic regularizer, yielding bounded and continuous gradient weights that preserve the Binary-TV trust-region geometry while...

Contribution

Why It Matters

DRPO is the first method to combine a divergence-based (Binary-TV) trust-region geometry with a smooth quadratic regularizer, achieving bounded gradient weights and corrective signals beyond the trust-region boundary, which improves stability and efficiency in off-policy LLM RL.

Prerequisites

Divergence Regularized Policy Optimization, advantage-weighted quadratic regularizer, Binary-TV trust region, smooth gradient weight, off-policy RL

Atlas Placement

Reinforcement Learning (subfield)

Read If

You care about Divergence Regularized Policy Optimization, the advantage-weighted quadratic regularizer, or Binary-TV trust regions.

Skip If

You only care about the AIME 2024 and AIME 2025 benchmarks.

Methods

Divergence Regularized Policy Optimization, advantage-weighted quadratic regularizer, Binary-TV trust region, smooth gradient weight, off-policy RL, trust-region control

Tasks

LLM post-training, mathematical reasoning, RL fine-tuning

Datasets

DAPO dataset, math problems

Benchmarks

AIME 2024, AIME 2025

Noosaga Placements

Policy Gradient Methods (framework, 95%)
DRPO is a policy gradient method that extends the trust-region approach of DPPO by replacing the hard mask with a smooth regularizer, and it is compared against PPO and GRPO which are also policy gradient methods.
We propose Divergence Regularized Policy Optimization (DRPO), which replaces the hard mask with a smooth advantage-weighted quadratic regularizer on policy shift. DRPO preserves the same trust-region geometry as DPPO while inducing bounded, continuous gradient weights that attenuate diverging updates and provide corrective signals beyond the boundary. Experiments across model scales, architectures, and precision settings show that DRPO improves the stability and efficiency of LLM RL training.
Reinforcement Learning (subfield, 95%)
The paper proposes a new policy optimization algorithm (DRPO) for reinforcement learning, directly addressing trust-region control, policy gradients, and off-policy optimization in LLM RL.
Reinforcement learning (RL) has become a key component of post-training large language models (LLMs). We propose Divergence Regularized Policy Optimization (DRPO), which replaces the hard mask with a smooth advantage-weighted quadratic regularizer on policy shift. DRPO preserves the same trust-region geometry as DPPO while inducing bounded, continuous gradient weights that attenuate diverging updates and provide corrective signals beyond the boundary.
Deep Reinforcement Learning (framework, 85%)
DRPO is a deep RL method applied to LLMs, extending the deep RL paradigm by introducing a new regularizer for trust-region control in off-policy settings.
Reinforcement learning (RL) has become a key component of post-training large language models (LLMs). We propose Divergence Regularized Policy Optimization (DRPO), which replaces the hard mask with a smooth advantage-weighted quadratic regularizer on policy shift. Experiments across model scales, architectures, and precision settings show that DRPO improves the stability and efficiency of LLM RL training.
Deep Learning (subfield, 80%)
The method is applied to large language models (LLMs), which are deep learning models, and the paper discusses token-level policies, vocabularies, and transformer architectures.
Reinforcement learning (RL) has become a key component of post-training large language models (LLMs). During training, an LLM is optimized as an autoregressive token-level policy that generates a response and receives a scalar reward. Experiments across model scales, architectures, and precision settings show that DRPO improves the stability and efficiency of LLM RL training.
Actor-Critic Methods (framework, 80%)
The paper compares DRPO against PPO and GRPO, which are actor-critic methods (though GRPO uses group-relative advantages instead of a learned critic).
Mainstream methods such as PPO and GRPO approximate this control with a ratio-clipping mechanism. Across all six settings, our DRPO consistently enables stable and efficient training, matching or exceeding the best evaluation accuracy achieved by the baselines.
Natural Language Processing (subfield, 70%)
The paper focuses on RL for LLMs, which are central to NLP, and discusses token-level policies, vocabularies, and language model fine-tuning.
Reinforcement learning (RL) has become a key component of post-training large language models (LLMs). In practice, LLM RL is often off-policy because of training-inference mismatch and policy staleness, making trust-region control essential for stable optimization. The importance ratio can be a poor proxy for distributional shift in long-tailed vocabularies.
Large Language Models (framework, 70%)
The paper is situated within the context of large language models (LLMs), which are a key framework in NLP, and the method is designed for LLM RL fine-tuning.
Reinforcement learning (RL) has become a key component of post-training large language models (LLMs). In practice, LLM RL is often off-policy because of training-inference mismatch and policy staleness, making trust-region control essential for stable optimization.
Machine Learning (subfield, 60%)
The paper deals with general machine learning concepts such as regularization, optimization, and trust regions, but the primary focus is on RL for LLMs.
We propose Divergence Regularized Policy Optimization (DRPO), which replaces the hard mask with a smooth advantage-weighted quadratic regularizer on policy shift. DRPO preserves the same trust-region geometry as DPPO while inducing bounded, continuous gradient weights that attenuate diverging updates and provide corrective signals beyond the boundary.

Abstract

Reinforcement learning (RL) has become a key component of post-training large language models (LLMs). In practice, LLM RL is often off-policy because of training-inference mismatch and policy staleness, making trust-region control essential for stable optimization. Mainstream methods such as PPO and GRPO approximate this control with a ratio-clipping mechanism, but the importance ratio can be a poor proxy for distributional shift in long-tailed vocabularies. Recent work such as DPPO addresses this mismatch by replacing ratio-based clipping with a divergence-based mask, yielding a trust region defined by the sampled token's absolute probability shift. However, DPPO still relies on a hard mask: once a token crosses the trust-region boundary in a harmful direction, its gradient is discarded rather than corrected. To address this, we propose Divergence Regularized Policy Optimization (DRPO), which replaces the hard mask with a smooth advantage-weighted quadratic regularizer on policy shift. DRPO preserves the same trust-region geometry as DPPO while inducing bounded, continuous gradient weights that attenuate diverging updates and provide corrective signals beyond the boundary. Experiments across model scales, architectures, and precision settings show that DRPO improves the stability and efficiency of LLM RL training.
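The gap between the importance ratio and the actual distributional shift is easy to see with two hypothetical tokens. The sketch below compares a PPO-style ratio test against an absolute-shift test on the sampled token. For a two-outcome split (the sampled token versus everything else), the total variation distance equals |p_new − p_old|, which is presumably what the Binary-TV name refers to. The probabilities and the thresholds `eps` and `delta` are made-up values for illustration, not settings from the paper.

```python
def importance_ratio(p_new: float, p_old: float) -> float:
    """Ratio of the sampled token's probability under the new and old policy."""
    return p_new / p_old


def binary_tv(p_new: float, p_old: float) -> float:
    """Total variation between the two-outcome distributions
    (token, not-token); algebraically this equals |p_new - p_old|."""
    return 0.5 * (abs(p_new - p_old) + abs((1.0 - p_new) - (1.0 - p_old)))


def outside_ratio_clip(p_new: float, p_old: float, eps: float) -> bool:
    """True when the ratio leaves the interval [1 - eps, 1 + eps]."""
    return abs(importance_ratio(p_new, p_old) - 1.0) > eps


def outside_shift_region(p_new: float, p_old: float, delta: float) -> bool:
    """True when the sampled token's probability moved by more than delta."""
    return binary_tv(p_new, p_old) > delta


eps, delta = 0.2, 0.05  # hypothetical thresholds

# (p_old, p_new) for a rare tail token and a common head token
tokens = {"rare": (1e-4, 1e-3), "common": (0.50, 0.59)}

for name, (p_old, p_new) in tokens.items():
    print(
        f"{name:>6}: ratio={importance_ratio(p_new, p_old):.2f} "
        f"shift={binary_tv(p_new, p_old):.4f} "
        f"ratio_flag={outside_ratio_clip(p_new, p_old, eps)} "
        f"shift_flag={outside_shift_region(p_new, p_old, delta)}"
    )
```

With these numbers the two tests disagree on both tokens. The rare token has a large ratio but moves almost no probability mass, so only the ratio test flags it. The common token stays inside the ratio clip but shifts far more mass, so only the shift test flags it.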

Paper Context

Source Context: Whole paper

Budget: 100,000 tokens

Coverage: 64,412 chars

Classified from the full extracted paper text (64,412 characters). The Paper Guide brief above is the user-facing synthesis; raw context is kept out of the page.
