A closed-loop framework with two diffusion-based policies: an evaluator that predicts human intent, and a copilot that provides safe trajectories and ensures smooth control transitions in safety-critical situations.
Shared autonomy in driving requires anticipating human behavior, flagging risk before it becomes unavoidable, and transferring control safely and smoothly.
We propose Diffusion-SAFE, a closed-loop framework built on two diffusion models: an evaluator that predicts multimodal human-intent action sequences for probabilistic risk detection, and a safety-guided copilot that steers its denoising process toward safe regions using the gradient of a map-based safety certificate. When risk is detected, control is transferred through partial diffusion: the human plan is forward-noised to an intermediate level and denoised by the safety-guided copilot. The forward-diffusion ratio \( \rho \) acts as a continuous takeover knob—small \( \rho \) keeps the output close to human intent, while increasing \( \rho \) shifts authority toward the copilot, avoiding the mixed-unsafe pitfall of action-level blending.
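The partial-diffusion handover can be sketched as follows. This is a minimal NumPy sketch assuming a DDPM-style noise schedule; `denoise_fn` is a hypothetical stand-in for the safety-guided copilot's reverse process, not the paper's exact interface:

```python
import numpy as np

def partial_diffusion_handover(a_human, denoise_fn, alphas_cumprod, rho, rng=None):
    """Hand over control by forward-noising the human action plan to an
    intermediate diffusion level k = round(rho * (K - 1)), then denoising
    it with the (safety-guided) copilot. rho = 0 returns the human plan
    unchanged; rho = 1 replaces it with a full copilot sample.

    `denoise_fn(a_k, k)` is assumed to run the copilot's reverse process
    from step k back to step 0 (hypothetical interface).
    """
    rng = rng or np.random.default_rng()
    K = len(alphas_cumprod)
    k = int(round(rho * (K - 1)))
    if k == 0:
        return a_human.copy()
    # Forward (noising) process: a_k ~ N(sqrt(abar_k) a_0, (1 - abar_k) I)
    abar_k = alphas_cumprod[k]
    noise = rng.standard_normal(a_human.shape)
    a_k = np.sqrt(abar_k) * a_human + np.sqrt(1.0 - abar_k) * noise
    # Reverse (denoising) process from step k, steered by the copilot
    return denoise_fn(a_k, k)
```

Because \( \rho \) selects how much of the reverse process the copilot runs, authority transfers continuously rather than through a hard switch.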
Unlike methods relying on hand-crafted score functions, our diffusion formulation supports both safety evaluation and plan generation directly from demonstrations. We evaluate Diffusion-SAFE in simulation and on a real ROS-based race car, achieving a 98.5% handover success rate while maintaining smooth transitions.
Diffusion-SAFE architecture: The evaluator model processes observations and action sequences, sampling future action sequences aligned with human intent in a simulated environment. The copilot model generates and executes expert action sequences when the human performance score falls below a predefined threshold \( \tau_{\text{NLL}} \).
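One simple way to realize such a score is to compare the observed human action sequence against samples drawn from the evaluator's multimodal predictive distribution. The kernel-density approximation below is an assumption for illustration; the paper's exact likelihood computation may differ:

```python
import numpy as np

def human_performance_score(a_human, evaluator_samples, sigma=0.1):
    """Approximate log-likelihood of the human action sequence under the
    evaluator, via a Gaussian-kernel mixture over sampled sequences
    (an illustrative choice, not necessarily the paper's scoring rule).
    Higher means the human behaves as the evaluator expects.

    a_human:           (H, A) action sequence over horizon H
    evaluator_samples: (N, H, A) sequences sampled from the evaluator
    """
    # Squared distance from the human plan to each sampled plan
    d2 = ((evaluator_samples - a_human) ** 2).sum(axis=(1, 2))
    # Log of a Gaussian-kernel mixture (up to an additive constant)
    return np.log(np.exp(-d2 / (2.0 * sigma**2)).mean() + 1e-12)

def should_hand_over(score, tau):
    # Trigger the copilot when the score drops below the threshold
    return score < tau
```

When the human plan lies far from every evaluator mode, the score collapses and the copilot takes over.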
Noise Estimator Architecture: U-Net design with residual connections, positional embedding of step \( t \), and conditioning vector \( \mathbf{C}_{0:t_{\text{obs}}} \). Double convolution block (DC in the figure).
Algorithm 1: Safety-Guided Copilot Reverse Process
The core of Diffusion-SAFE is a safety-guided copilot: a conditional diffusion model trained on expert driving demonstrations whose reverse (denoising) process is steered at every step by a map-based safety certificate.
The safety certificate is built from a Signed Distance Field (SDF) computed on the track map. The top row below shows overhead camera images of the real-world tracks; the bottom row shows the corresponding SDF grids, where the green boundary marks the track edge and the field value at any point gives the signed distance to the nearest boundary.
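An SDF of this kind can be built from a boolean occupancy grid of the track. The sketch below uses a brute-force distance computation for clarity on small grids; a real pipeline would use a fast distance transform such as `scipy.ndimage.distance_transform_edt`:

```python
import numpy as np

def signed_distance_field(track_mask, resolution=1.0):
    """Signed distance field from a boolean occupancy grid of the track
    (True = drivable). Positive inside the track, negative outside; the
    zero level set marks the track boundary. Brute-force O(cells * boundary)
    version, intended only as an illustration.
    """
    h, w = track_mask.shape
    ys, xs = np.mgrid[0:h, 0:w]
    # Boundary cells: drivable cells with at least one non-drivable 4-neighbour
    padded = np.pad(track_mask, 1, constant_values=False)
    nb_free = (padded[:-2, 1:-1] & padded[2:, 1:-1]
               & padded[1:-1, :-2] & padded[1:-1, 2:])
    boundary = track_mask & ~nb_free
    by, bx = np.nonzero(boundary)
    # Distance from every cell to the nearest boundary cell
    d = np.sqrt((ys[..., None] - by) ** 2 + (xs[..., None] - bx) ** 2).min(-1)
    return np.where(track_mask, d, -d) * resolution
```

The gradient of this field points away from the nearest boundary, which is what the copilot's guidance term follows.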
At each denoising step \( k \), the copilot refines its noisy action sequence under the learned expert prior and simultaneously shifts the estimate along the gradient of the SDF-based safety certificate, so denoising and safety correction happen jointly rather than sequentially.
Because guidance is applied inside the reverse process rather than as a post-hoc filter, the copilot produces trajectories that are jointly high-likelihood under the expert distribution and safe with respect to the map.
Safety map construction: Real-world track images (top) and their SDF representations (bottom)
Effect of safety guidance: Unsafe rate (%) vs. forward diffusion ratio \( \rho \). The shaded region highlights the reduction achieved by safety guidance.
Here we showcase four different simulated scenes randomly generated in Gym CarRacing-v2.
By varying \( \rho \), we adjust the balance between preserving human input and following the copilot's safe behavior: when \( \rho \) is small, human intent is largely preserved, at the cost of weaker alignment with \( P_{\text{copilot}} \); larger values of \( \rho \) lead the system to prioritize the copilot policy over human input.
Comparison of Our Partial Diffusion Method and Simple Blending in the Handover Process:
Our approach (via the forward diffusion ratio \( \rho \))
Simple blending: \( \mathbf{a}_{\text{blend}} = k\, \mathbf{a}_{H} + (1 - k)\, \mathbf{a}_{\text{copilot}} \)
(a) The human plan (\( \rho = 0 \)) collides with an obstacle; increasing \( \rho \) steers the trajectory toward the safe copilot plan (\( \rho = 1 \)).
(b) Average displacement error (ADE) to the human plan (blue) increases with \( \rho \), while ADE to the copilot plan (red) decreases, confirming smooth interpolation between the two endpoints.
The figures above show that partial diffusion smoothly interpolates between the human plan and the copilot plan in trajectory space. However, smoothness alone does not guarantee safety. Below we compare our method against simple action-level blending: because the safe-action set is often nonconvex, linearly mixing two individually safe actions can produce an unsafe outcome, whereas our partial diffusion approach remains safe throughout the handover.
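The nonconvexity pitfall is easy to see in a toy one-dimensional example. The numbers and dynamics below are purely illustrative, not the paper's model:

```python
def is_safe(steer, obstacle_halfwidth=0.2):
    """Toy safety check: an obstacle sits dead ahead, spanning
    |lateral offset| < obstacle_halfwidth. A steering action is safe
    if the resulting lateral offset clears it (illustrative numbers,
    not the paper's vehicle dynamics).
    """
    lateral = steer  # one-step toy model: offset proportional to steering
    return abs(lateral) >= obstacle_halfwidth

a_left, a_right = -0.5, 0.5               # two individually safe evasions
a_blend = 0.5 * a_left + 0.5 * a_right    # action-level blend -> 0.0
```

Both endpoints clear the obstacle (`is_safe(a_left)` and `is_safe(a_right)` hold), yet their average drives straight into it (`is_safe(a_blend)` fails); the safe set \( \{|a| \ge 0.2\} \) is nonconvex, so no fixed blending weight is guaranteed safe.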
Comparison of simple blending and our partial diffusion method. Simple blending produces unsafe trajectories, while our method remains safe throughout the handover.
Action-level smoothness during handover. Our diffusion-native handover (blue) yields smooth transitions, while simple action blending (orange) exhibits erratic oscillations.
Ablation studies are conducted for both the evaluator and the copilot. Horizons are measured in steps, where each step corresponds to 0.1 s. \( \textbf{Bold} \) indicates the best result and \( \textit{Italic} \) the second-best.
Ablation table for the evaluator model
Ablation table for the copilot model
Table III: Handover comparison in simulation and real world
In this work, we exploit the diffusion policy's inherent ability to express multimodal distributions. We compare our method against the following multimodal baselines: LSTM-GMM and Behavior Transformer (BeT). The results are summarized in the tables below. Horizons are measured in steps, where each step corresponds to 0.1 s. \( \textbf{Bold} \) indicates the best result and \( \textit{Italic} \) the second-best.
Baseline Comparison table for the evaluator model
Baseline Comparison table for the copilot model
We evaluate our framework on a ROS-controlled race car with onboard compute (Jetson Orin Nano), tracked by a 13-camera OptiTrack system (see figure below). At each time step \( t \), the motion-capture pipeline provides the vehicle pose on the ground plane, \( \mathbf{p}^{\mathrm{world}}_t=(x_t,y_t,\theta_t) \), streamed to ROS via VRPN. A cropper node projects \( \mathbf{p}^{\mathrm{world}}_t \) to pixel coordinates and extracts an ego-centric image observation \( I_t \) from the overhead camera, while a vel node computes finite-difference velocities \( \mathbf{v}_t \). A sync node time-aligns all streams to produce the training tuple \( (I_t,\,\mathbf{a}_t,\,\mathbf{v}_t,\,\mathbf{p}_t) \).
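The finite-difference step in the vel node can be sketched as a pure function over consecutive poses. This is a minimal sketch of the computation only; the actual node subscribes to the VRPN pose topic in ROS, and the body-frame convention is an assumption:

```python
import math

def finite_difference_velocity(p_prev, p_curr, dt):
    """Finite-difference velocities from two consecutive ground-plane
    poses (x, y, theta). Returns (longitudinal, lateral, yaw-rate) in the
    body frame at the current heading (frame convention assumed here).
    """
    x0, y0, th0 = p_prev
    x1, y1, th1 = p_curr
    vx_w, vy_w = (x1 - x0) / dt, (y1 - y0) / dt
    # Rotate world-frame velocity into the body frame at the current heading
    v_lon = math.cos(th1) * vx_w + math.sin(th1) * vy_w
    v_lat = -math.sin(th1) * vx_w + math.cos(th1) * vy_w
    # Wrap heading difference to (-pi, pi] before differentiating
    dth = (th1 - th0 + math.pi) % (2.0 * math.pi) - math.pi
    return v_lon, v_lat, dth / dt
```

Wrapping the heading difference matters: naively differentiating \( \theta \) across the \( \pm\pi \) seam would produce a spurious yaw-rate spike in the training tuples.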
Real-World Experiment Results: Columns represent unseen maps. Rows represent different initial conditions ('start' points), human-driver temporal strategies, and correlated handovers.
Here are four real-world demos showing smooth and successful handovers on various unseen tracks.