FinePO Algorithm: Fine-Grained Policy Optimization
- FinePO is a reinforcement learning algorithm that decomposes responses into distinct intent-plus-action steps to enable fine-grained policy optimization.
- It assigns rewards at the step level using a learned FinePRM, reducing gradient variance by redistributing the global trajectory advantage based on individual step quality.
- Empirical results demonstrate significant performance gains in tasks like chart understanding, with FinePO outperforming traditional trajectory-level RL methods.
FinePO Algorithm
FinePO ("Fine-grained Policy Optimization") is a reinforcement learning algorithm designed to address the limitations of trajectory-level credit assignment in LLMs and multimodal LLMs (MLLMs). Instead of uniformly propagating a scalar reward across all actions within a generated response, FinePO performs fine-grained, step-level credit assignment by leveraging a learned reward model. This approach improves both the alignment and efficiency of policy optimization in complex multi-step reasoning tasks, particularly those requiring explicit, visible intermediate actions, such as chart understanding and visual reasoning (Huang et al., 9 Jan 2026).
1. Motivation and High-Level Objectives
FinePO's central aim is to resolve the coarse granularity of reward-signal assignment inherent to standard RL for LLMs and MLLMs. In conventional approaches, a single scalar reward is computed for a complete trajectory $y$, and this reward is typically broadcast identically to each token or decision in the sequence. This uniform assignment prevents the model from distinguishing skillful sub-decisions from erroneous ones within a single response.
FinePO introduces step-level differentiation by:
- Decomposing each output trajectory into explicit “intent + action” steps,
- Scoring every individual step using a learned Fine-grained Process Reward Model (FinePRM),
- Redistributing global trajectory-level advantage among steps in proportion to their assessed quality.
This approach enables the model to reinforce correct sub-decisions and penalize flawed ones within the same response, which empirically reduces gradient variance and leads to more robust policy improvements (Huang et al., 9 Jan 2026).
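The contrast with uniform broadcasting can be made concrete with a small sketch (the step scores below are hypothetical stand-ins for FinePRM outputs, and the mean-deviation modulation is an illustrative simplification of FinePO's full scheme):

```python
import numpy as np

# Hypothetical per-step quality scores for one three-step response
# (in FinePO these would come from the learned FinePRM).
step_scores = np.array([0.9, 0.2, 0.7])
trajectory_advantage = 1.0  # scalar advantage for the whole response

# Trajectory-level RL: every step receives the same signal.
uniform = np.full_like(step_scores, trajectory_advantage)

# FinePO-style: modulate the shared advantage by each step's
# deviation from the trajectory's mean score.
deviation = step_scores - step_scores.mean()
fine_grained = trajectory_advantage * (1.0 + deviation)
```

Because the deviations are mean-zero, the redistributed advantages sum to the same total as the uniform broadcast; credit is moved between steps rather than added overall.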
2. Fine-grained Process Reward Model (FinePRM)
FinePRM ($\mathcal{P}$) is a parameterized evaluator that provides the scalar step-level scores essential to FinePO's per-step advantage computation. Its core characteristics are:
- Inputs: For each step $s_t$, the FinePRM receives the step's textual intent, the drawing-action parameters, and the before/after visual context (i.e., the state of the image before and after the action).
- Architecture: Built upon an MLLM backbone, FinePRM aggregates these inputs via a prompt template and classifies each step into one of four quality categories (Excellent, Acceptable, Poor, Unacceptable), each mapped to a numeric score.
- Dataset construction: The FinePRM is trained on a dataset of 473K labeled examples sourced through visual-to-text annotation and text-to-image distillation. Annotations are balanced across quality labels (2:4:3:1 ratio).
- Score aggregation: For each trajectory, FinePO computes a length-weighted intra-trajectory mean score and calculates deviations from the mean for per-step refinement.
This step-level reward modeling is a prerequisite for effective fine-grained policy optimization in tasks where the intermediate reasoning process is explicitly observable (Huang et al., 9 Jan 2026).
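A minimal sketch of this scoring-and-aggregation scheme follows; the numeric values assigned to the four categories and the helper names are illustrative assumptions, as the paper's exact mapping is not reproduced here:

```python
# Illustrative mapping from FinePRM quality categories to scalar scores
# (an assumption; the paper's exact numeric values are not given here).
CATEGORY_SCORES = {"Excellent": 1.0, "Acceptable": 0.5,
                   "Poor": -0.5, "Unacceptable": -1.0}

def length_weighted_mean(scores, token_lengths):
    """Length-weighted intra-trajectory mean of step scores."""
    total = sum(l * s for s, l in zip(scores, token_lengths))
    return total / sum(token_lengths)

labels = ["Excellent", "Poor", "Acceptable"]   # per-step FinePRM verdicts
lengths = [10, 5, 5]                           # step token lengths
scores = [CATEGORY_SCORES[c] for c in labels]

mean = length_weighted_mean(scores, lengths)   # longer steps weigh more
deviations = [s - mean for s in scores]        # per-step refinement signal
```

Weighting by token length keeps a trajectory's mean score from being dominated by many short, trivial steps.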
3. Mathematical Foundation
The FinePO algorithm formalizes fine-grained credit assignment as follows:
- Trajectory decomposition: Each sampled response $y_i$ (from a batch of $G$ candidates for a given prompt) is decomposed into reasoning steps $s_{i,1}, \dots, s_{i,T_i}$.
- Cross-trajectory advantage:
$$A_i = R_i - \frac{1}{G}\sum_{j=1}^{G} R_j$$
measures whether $y_i$ is above or below the cohort mean in final correctness, where $R_i$ is the terminal correctness reward of response $y_i$.
- Step-level process scoring:
$$p_{i,t} = \mathcal{P}(s_{i,t})$$
quantifies the quality of each step, with $\mathcal{P}$ the FinePRM.
- KL-based action regularization: To discourage domination by "easy" actions, a regularizing offset
$$\delta_{i,t} = -\lambda \log \frac{\hat{\pi}(a_{i,t})}{q(a_{i,t})}$$
is added, where $q$ is the prior distribution over action types, $\hat{\pi}$ the current empirical policy distribution, $a_{i,t}$ the action taken at step $t$, and $\lambda$ a weighting hyperparameter.
- Intra-trajectory normalization and deviation:
$$\bar{p}_i = \frac{\sum_{t} \ell_{i,t}\, \tilde{p}_{i,t}}{\sum_{t} \ell_{i,t}}, \qquad d_{i,t} = \tilde{p}_{i,t} - \bar{p}_i,$$
with $\tilde{p}_{i,t} = p_{i,t} + \delta_{i,t}$ the regularized step score and $\ell_{i,t}$ the step's token length.
- Fine-grained redistribution of advantage:
$$A_{i,t} = A_i \left(1 + \alpha\, d_{i,t}\right).$$
The advantage is clipped as
$$A_{i,t} = A_i \left(1 + \operatorname{clip}(\alpha\, d_{i,t},\, -\epsilon,\, \epsilon)\right),$$
with $\alpha$ scaling and $\epsilon$ bounding the per-step correction.
- Policy gradient update:
$$\nabla_\theta J(\theta) = \mathbb{E}\left[\sum_{t} A_{i,t}\, \nabla_\theta \log \pi_\theta(s_{i,t} \mid x,\, s_{i,<t})\right].$$
Model parameters $\theta$ are updated accordingly.
This formulation allows precise redistribution of cross-trajectory advantage in accordance with local step quality, yielding a lower-variance (higher signal-to-noise) RL training signal (Huang et al., 9 Jan 2026).
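The credit-assignment arithmetic described in this section can be sketched end to end as follows; the `alpha` and `eps` values are illustrative placeholders, and `prm_scores` stands in for outputs of the learned FinePRM:

```python
import numpy as np

def finepo_advantages(rewards, prm_scores, step_lengths, alpha=0.5, eps=0.2):
    """Sketch of FinePO-style fine-grained advantage redistribution.

    rewards:      length-G terminal correctness rewards, one per response
    prm_scores:   list of G arrays of per-step (FinePRM-style) scores
    step_lengths: list of G arrays of per-step token lengths
    alpha, eps:   illustrative scaling / clipping hyperparameters
    """
    # Cross-trajectory advantage: deviation from the cohort mean reward.
    rewards = np.asarray(rewards, dtype=float)
    A = rewards - rewards.mean()

    per_step = []
    for A_i, p, l in zip(A, prm_scores, step_lengths):
        p, l = np.asarray(p, float), np.asarray(l, float)
        # Length-weighted intra-trajectory mean and per-step deviation.
        p_bar = (p * l).sum() / l.sum()
        d = p - p_bar
        # Redistribute the shared advantage, with a clipped correction.
        per_step.append(A_i * (1.0 + np.clip(alpha * d, -eps, eps)))
    return per_step

adv = finepo_advantages(
    rewards=[1.0, 0.0],
    prm_scores=[[0.9, 0.1], [0.5, 0.5]],
    step_lengths=[[8, 4], [6, 6]],
)
```

Steps above a trajectory's mean score receive an amplified share of its advantage, steps below it a damped share, and the clip keeps any single step's correction bounded.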
4. Algorithmic Workflow and Pseudocode
The FinePO training loop for a batch of prompts consists of the following steps:
- Candidate sampling: For each prompt, sample a group of $G$ candidate responses under the current policy.
- Terminal reward evaluation: Compute the final correctness reward $R_i$ for each response, then the group-wise advantage $A_i$.
- Step-level analysis:
  - Decompose each response $y_i$ into its explicit steps $s_{i,t}$.
  - For each step $s_{i,t}$:
    - Score it with FinePRM: $p_{i,t} = \mathcal{P}(s_{i,t})$,
    - Compute the KL regularization offset $\delta_{i,t}$,
    - Aggregate: $\tilde{p}_{i,t} = p_{i,t} + \delta_{i,t}$.
  - Calculate the length-weighted mean $\bar{p}_i$, deviations $d_{i,t}$, and per-step advantages $A_{i,t}$ as in Section 3.
- Loss computation: Form the policy-gradient loss from the per-step advantages.
- Parameter update: Update model weights via stochastic gradient descent.
- Statistics update: Update the action history used to maintain the empirical action distribution.
Key hyperparameters include the number of samples per prompt, the advantage-scaling and clipping coefficients, the KL regularization weight, the learning rate, and a sampling temperature of $1.0$ (Huang et al., 9 Jan 2026).
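The KL regularization offset and the action-history statistics it depends on can be sketched as follows; the prior distribution, the `weight` value, and the pointwise log-ratio form are illustrative assumptions:

```python
import math
from collections import Counter

# Illustrative uniform prior over drawing-action types (an assumption,
# not the paper's actual distribution) and a running action history.
prior = {"line": 0.25, "box": 0.25, "point": 0.25, "crop": 0.25}
history = Counter({"line": 70, "box": 10, "point": 10, "crop": 10})

def kl_offset(action, history, prior, weight=0.1):
    """Per-step offset discouraging over-used ("easy") actions.

    Uses the pointwise log-ratio between the empirical action frequency
    and the prior; averaged over steps it tracks weight * KL(pi_hat || q).
    """
    total = sum(history.values())
    pi_hat = history[action] / total
    return -weight * math.log(pi_hat / prior[action])

# An over-represented action is penalized; a rare one gets a bonus.
line_offset = kl_offset("line", history, prior)   # negative offset
crop_offset = kl_offset("crop", history, prior)   # positive offset
```

After each batch, the sampled actions would be added to `history`, so the penalty adapts as the policy's action mix drifts.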
5. Empirical Findings, Ablations, and Theoretical Insights
- Variance reduction and stability: By attributing rewards at the step level, FinePO empirically achieves lower policy gradient variance and smoother convergence dynamics compared to trajectory-level RL approaches.
- Ablation results:
- Removal of FinePO (cold start only) results in a significant drop (5–12 points) in benchmark accuracy.
- Substituting FinePO with naive group-level RL yields 2-point lower ChartQA performance (75.12 vs 77.20).
- Using random (non-learned) step-level rewards in FinePRM degrades performance by more than one point.
- Excluding KL regularization induces action imbalance, though its impact on raw scores is modest.
- Disabling the explicit sketching (externalized steps) leads to catastrophic failures (≈30% decrease in performance).
- Comparative performance:
- SketchVL-7B trained with FinePO outperforms the base Qwen2.5VL-7B model across chart-based reasoning benchmarks, with overall multi-dataset average gain ≈7.23%.
- On ChartQA: +1.96 (83.96 vs 82.00), on ChartBench: +0.33, on EvoChart-QA: +3.84.
- Gains on general-purpose tasks (MathVista, MMStar) and for smaller models (SketchVL-3B vs Qwen2.5VL-3B: +15.32 on ChartQA).
These results indicate that fine-grained credit assignment yields consistent improvements over both pure supervised imitation learning and traditional RL approaches (Huang et al., 9 Jan 2026).
6. Integration into Model Training and Distinctive Features
FinePO is integrated into the broader SketchVL training regime as follows:
- Cold-start phase: The model first undergoes supervised "Sketch-CoT" training on 50K reasoning samples constructed to teach multi-step visual reasoning.
- FinePO reinforcement phase: Fine-tuning is performed with the FinePO loop on 9K diverse prompts, with decoder LoRA adapters frozen and KL-based action regularization applied to maintain action diversity.
- Distinctive properties:
- Supports explicit, visible step decomposition for per-step reward assignment—a prerequisite in domains like chart understanding where intermediate reasoning can be externally annotated and validated.
- Policy regularization via KL-divergence to match a prior action distribution, promoting action diversity and discouraging overuse of trivial operations.
- Generates on-the-fly, token-level advantage tensors rather than broadcasting a scalar across the entire output.
Practical implementation uses distributed action history tracking and integration with the ms-swift framework for scalable RL fine-tuning (Huang et al., 9 Jan 2026).
7. Context, Practical Guidance, and Implications
FinePO's demonstration, particularly within the SketchVL chart reasoning framework, establishes its practical utility in settings requiring nuanced, step-level credit assignment. The empirical results suggest that:
- FinePO is most beneficial when step-level annotation or evaluation is feasible.
- The learned FinePRM is critical; using random or constant process reward quickly degrades performance.
- KL regularization chiefly counteracts action-distribution collapse, which becomes more pronounced as training progresses.
- FinePO’s benefits are consistent across both large (7B) and small (3B) model variants, with larger proportional gains for smaller models.
A plausible implication is that the general principle of explicit intra-trajectory reward redistribution is likely transferable to other structured reasoning and policy optimization domains, whenever stepwise decomposition and step-level reward models are viable (Huang et al., 9 Jan 2026).