STITCH-OPE: Trajectory Stitching with Guided Diffusion for Off-Policy Evaluation

NeurIPS 2025 ✨ Spotlight ✨

1 Department of Computer Science, University of Toronto  •  2 University of Toronto Robotics Institute, Toronto, Canada
3 Toyota Research Institute, Los Altos, California  •  4 Vector Institute, Toronto, Canada

STITCH-OPE leverages denoising diffusion models to generate synthetic trajectories for off-policy evaluation, using guided diffusion that combines positive and negative guidance with the stitching of shorter sub-trajectories to improve compositionality.

Abstract

Off-policy evaluation (OPE) estimates the performance of a target policy using offline data collected from a behavior policy, and is crucial in domains such as robotics or healthcare where direct interaction with the environment is costly or unsafe. Existing OPE methods are ineffective for high-dimensional, long-horizon problems, due to exponential blow-ups in variance from importance weighting or compounding errors from learned dynamics models. To address these challenges, we propose STITCH-OPE, a model-based generative framework that leverages denoising diffusion for long-horizon OPE in high-dimensional state and action spaces. Starting with a diffusion model pre-trained on the behavior data, STITCH-OPE generates synthetic trajectories from the target policy by guiding the denoising process using the score function of the target policy. STITCH-OPE introduces two technical innovations that make it advantageous for OPE: (1) it prevents over-regularization by subtracting the score of the behavior policy during guidance, and (2) it generates long-horizon trajectories by stitching partial trajectories together end-to-end. We provide a theoretical guarantee that under mild assumptions, these modifications result in an exponential reduction in variance versus long-horizon trajectory diffusion. Experiments on the D4RL and OpenAI Gym benchmarks show substantial improvement in mean squared error, correlation, and regret metrics compared to state-of-the-art OPE methods.

STITCH-OPE Workflow

The STITCH-OPE framework operates through a systematic workflow that combines conditional diffusion modeling with guided trajectory generation:

STITCH-OPE Workflow Diagram

Workflow Explanation

  1. Data Preprocessing: Behavior data is segmented into partial trajectories of length w, enabling more flexible trajectory composition during guided diffusion.
  2. Conditional Diffusion Training: A diffusion model is trained on sub-trajectories conditioned on initial states, learning to generate dynamically feasible behavior patterns while maintaining broader coverage of the behavior dataset.
  3. Guided Trajectory Generation: During inference, the diffusion model generates target policy trajectories using dual guidance:
    • Positive guidance from the target policy score function
    • Negative guidance from the behavior policy score function to prevent over-regularization
    Illustration of positive and negative guidance in the diffusion process

  4. Trajectory Stitching: Generated sub-trajectories are stitched together end-to-end to form complete long-horizon trajectories, limiting compounding error while preserving compositionality (a code sketch of steps 3-5 follows this list).
  5. Off-Policy Evaluation: The stitched trajectories are evaluated using an empirical reward function to estimate the target policy's expected return, providing robust OPE estimates even with significant distribution shift.
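The sketch below illustrates steps 3-5 under simplified assumptions: the behavior diffusion model is exposed as a score function, the policies expose gradients of their trajectory log-likelihood, and a plain annealed Langevin loop stands in for the full reverse diffusion process. All names (generate_subtraj, stitched_return, data_score, alpha, lam, ...) are illustrative placeholders, not the released implementation.

```python
# Minimal sketch of STITCH-OPE-style guided generation and stitching (assumed interfaces).
import numpy as np


def generate_subtraj(data_score, target_score, behavior_score, init_state,
                     window, dim, alpha, lam, n_steps=50, rng=None):
    """Sample one sub-trajectory of shape (window, dim), conditioned on its first state,
    using the guided score: data_score + alpha * target_score - lam * behavior_score."""
    rng = np.random.default_rng() if rng is None else rng
    traj = rng.standard_normal((window, dim))
    traj[0] = init_state                              # condition on the initial state
    for noise in np.linspace(1.0, 0.05, n_steps):     # annealed noise schedule
        step = 0.5 * noise ** 2
        guided = (data_score(traj, noise)
                  + alpha * target_score(traj)        # positive guidance (target policy)
                  - lam * behavior_score(traj))       # negative guidance (behavior policy)
        traj = traj + step * guided + np.sqrt(2.0 * step) * rng.standard_normal(traj.shape)
        traj[0] = init_state                          # keep the conditioning fixed
    return traj


def stitched_return(horizon, reward_fn, gamma, window, dim, **score_kwargs):
    """Stitch sub-trajectories end-to-end (each new window starts from the last
    generated state) and accumulate the discounted return."""
    state, total, disc, t = np.zeros(dim), 0.0, 1.0, 0
    first_window = True
    while t < horizon:
        sub = generate_subtraj(init_state=state, window=window, dim=dim, **score_kwargs)
        steps = sub if first_window else sub[1:]      # avoid double-counting the seam state
        first_window = False
        for step_vec in steps:
            total += disc * reward_fn(step_vec)
            disc *= gamma
            t += 1
            if t >= horizon:
                break
        state = sub[-1]
    return total


# Toy usage with stand-in score and reward functions (purely for illustration):
scores = dict(
    data_score=lambda x, noise: -x,                   # stand-in for the learned diffusion score
    target_score=lambda x: -(x - 1.0),                # stand-in for grad log pi_target
    behavior_score=lambda x: -x,                      # stand-in for grad log pi_behavior
    alpha=1.0, lam=0.3,
)
v_hat = stitched_return(horizon=64, reward_fn=lambda s: float(s.sum()),
                        gamma=0.99, window=8, dim=3, **scores)
print(f"estimated return: {v_hat:.2f}")
```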

Behavior Comparison

Comparison between expected behavior and STITCH-OPE generated trajectories across different policies.

Expected Behavior vs. STITCH-OPE, shown side by side for three settings:
  • Unguided
  • Guided (Suboptimal Target Policy)
  • Guided (Optimal Target Policy)

Visual comparison demonstrating STITCH-OPE's ability to generate trajectories that closely match expected behavior across different guidance scenarios, from unguided generation to optimal policy guidance.

Results

We evaluate STITCH-OPE on D4RL and OpenAI Gym benchmarks using three standard OPE metrics: mean squared error of the value estimate, Spearman rank correlation between estimated and true returns, and regret of the selected policy. STITCH-OPE attains high rank correlation, enabling reliable identification of the best-performing policies, and low regret with small standard error, consistently selecting near-optimal policies. It also achieves substantially lower mean squared error than state-of-the-art baselines, consistent with our theoretical analysis of exponential variance reduction through trajectory stitching and negative behavior guidance.
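For concreteness, these three metrics can be computed from estimated versus ground-truth policy values as in the sketch below (a standard OPE evaluation protocol; the function and variable names are ours, not from the released code):

```python
# Standard OPE evaluation metrics over a set of candidate policies.
# v_true: ground-truth returns; v_hat: returns estimated by an OPE method.
import numpy as np
from scipy.stats import spearmanr

def ope_metrics(v_true, v_hat):
    v_true, v_hat = np.asarray(v_true, float), np.asarray(v_hat, float)
    mse = float(np.mean((v_hat - v_true) ** 2))       # value-estimation error
    rank_corr, _ = spearmanr(v_true, v_hat)           # policy-ranking quality
    best_by_estimate = int(np.argmax(v_hat))          # policy the estimator would deploy
    regret = float(v_true.max() - v_true[best_by_estimate])
    return {"mse": mse, "spearman": float(rank_corr), "regret": regret}

# Example: the estimator ranks the best policy first, so regret is zero.
print(ope_metrics(v_true=[10.0, 35.0, 22.0, 28.0], v_hat=[12.0, 30.0, 25.0, 27.0]))
```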

STITCH-OPE Experimental Results

Experimental results showing STITCH-OPE's superior performance across D4RL and OpenAI Gym benchmarks compared to state-of-the-art OPE methods.

Guidance Coefficient Sensitivity Analysis

The effectiveness of STITCH-OPE critically depends on the balance between the positive guidance coefficient α (target policy score) and the negative guidance coefficient λ (behavior policy score). Our analysis reveals that negative guidance is essential for preventing over-regularization when addressing distribution shift between behavior and target policies. The optimal performance occurs when we use negative guidance with 0 < λ < α, confirming that subtracting the behavior policy score makes the optimization problem easier and leads to better OPE estimates.
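Concretely, at each denoising step the guided score takes the schematic form below, where π_target and π_behavior denote the target and behavior policies and p_θ the diffusion model trained on behavior data (notation ours; see the paper for the exact derivation):

guided score ≈ ∇ log p_θ(τ) + α · ∇ log π_target(τ) − λ · ∇ log π_behavior(τ),  with 0 < λ < α.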

Guidance Scale Sensitivity Hopper

Hopper environment: Performance landscape showing optimal regions with α ∈ [0.1, 0.5] and λ ≤ 0.5α, confirming that negative guidance prevents over-regularization.

Guidance Scale Sensitivity Walker

Walker2D environment: Consistent results showing that moderate negative guidance (λ ≈ 0.25α to 0.75α) optimizes both correlation and error metrics.

These sensitivity analyses demonstrate that the theoretical foundation of our negative behavior guidance is empirically validated. The negative term prevents the guided diffusion process from collapsing to high-density regions under the behavior policy where the target policy likelihood might be small, thus enabling effective generalization to out-of-distribution scenarios while maintaining trajectory feasibility.
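One informal way to read this effect, using the standard interpretation of score guidance (our reading, not a quote from the paper): adding α · ∇ log π_target − λ · ∇ log π_behavior to the diffusion score approximately samples trajectories from

p_guided(τ) ∝ p_θ(τ) · π_target(τ)^α / π_behavior(τ)^λ,

so the behavior policy's density, already baked into p_θ through the offline data, is partially discounted rather than counted twice, which is what keeps samples from collapsing onto behavior-dense regions.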

BibTeX

@InProceedings{stitch_ope,
  title={STITCH-OPE: Trajectory Stitching with Guided Diffusion for Off-Policy Evaluation},
  author={Hossein Goli and Michael Gimelfarb and Nathan Samuel de Lara and Haruki Nishimura and Masha Itkina and Florian Shkurti},
  year={2025},
  booktitle={Neural Information Processing Systems (NeurIPS)}}