Asynchronous execution of action-chunking robot policies via initial noise selection — not trajectory steering.
We introduce PAINT — Prefix-Anchored INiTial noise — a training-free inference method. Instead of correcting a policy mid-generation, PAINT chooses the right starting noise so the unmodified flow naturally produces a chunk that continues the motion already underway.
Training-free · no gradients · no policy modification
The one-paragraph version
Robot policies predict action chunks — short bursts of future moves. Generating one takes time, so the robot keeps acting while the next chunk is computed. By the time it arrives, the robot has already moved on, and the new chunk must pick up exactly where the motion is — or it jerks at the seam.
Prior methods fix this by steering the generation toward the executed actions. PAINT asks a different question: what if the trajectory never needed correcting? We invert the flow ODE to find an initial noise whose unmodified rollout already satisfies the constraint — then run the policy exactly as trained.
The problem
Under asynchronous inference the robot can’t wait. It keeps executing the current chunk while the next one is generated — and advances d steps in the meantime.
So the first d actions of the new chunk must match the last d actions already executed. Violate it and the robot lurches: jerky, unsafe motion at every chunk boundary. The gap only widens as larger VLAs push inference latency past the controller’s clock.
$A_{t-1}[s + i] = A_t[i] \quad \text{for } i = 0, \dots, d-1$
Figure: a chunk sampled independently jumps at the boundary; a prefix-consistent chunk continues the motion.
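The prefix constraint can be checked numerically. Below is a minimal sketch in plain Python, assuming chunks are lists of per-step action vectors and that `s` is the index in the previous chunk at which generation of the new chunk began (the text does not define `s`, so this reading is an assumption). The function name `prefix_mismatch` and the toy values are hypothetical.

```python
def prefix_mismatch(prev_chunk, new_chunk, s, d):
    """Largest absolute gap between prev_chunk[s + i] and new_chunk[i], i < d.

    Chunks are lists of action vectors (lists of floats). A value of 0.0
    means the new chunk continues the executed motion exactly.
    """
    if s + d > len(prev_chunk) or d > len(new_chunk):
        raise ValueError("prefix of length d does not fit inside both chunks")
    worst = 0.0
    for i in range(d):
        for a_old, a_new in zip(prev_chunk[s + i], new_chunk[i]):
            worst = max(worst, abs(a_old - a_new))
    return worst


# Toy 1-D example: the previous chunk ramps upward by 0.1 per step.
prev_chunk = [[k / 10] for k in range(8)]
s, d = 4, 3  # hypothetical: generation began at step 4 and took 3 steps

consistent = [[0.4], [0.5], [0.6], [0.7], [0.8]]   # continues the ramp
independent = [[0.0], [0.1], [0.2], [0.3], [0.4]]  # restarts from zero

print("consistent: ", prefix_mismatch(prev_chunk, consistent, s, d))
print("independent:", prefix_mismatch(prev_chunk, independent, s, d))
```

The consistent chunk reports no mismatch, while the independent one is off by roughly 0.4 at the seam, which is the jump the robot would feel.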
The key idea
Steering and PAINT both want the same prefix. They differ in where they intervene: during generation, or before it.
Trajectory steering (prior methods): push the velocity field toward the prefix at each denoising step, using backprop, retraining, or extra compute. Effective, but the orthogonal part of the correction can drag the trajectory off the policy’s learned flow.
Initial noise selection (PAINT): pick the starting noise x₀* so the standard ODE lands on the constraint on its own. The velocity field is never touched, so the chunk stays faithful to the policy’s own distribution.
This works because optimal-transport flow matching is approximately local: the noise at each position mostly governs the action at the same position. So inverting from a desired prefix recovers a noise that reproduces it — under the unmodified forward pass.
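The inversion idea can be illustrated on a toy problem. The sketch below is not the PAINT algorithm itself. It assumes a hand-written velocity field that is exactly local (each position depends only on itself), uses explicit Euler steps in reverse time as the inverse, and splices the recovered prefix noise into an otherwise unchanged noise vector. The names `velocity`, `generate`, and `invert`, and all numeric values, are hypothetical.

```python
def velocity(x, t, mu):
    """Toy stand-in for the policy's velocity field (an assumption, not a VLA).

    It is exactly local: position i depends only on x[i]. The time argument
    is unused here but kept to mirror a flow-matching model's signature.
    """
    return [m - xi for xi, m in zip(x, mu)]


def generate(x0, mu, steps):
    """Unmodified forward pass: Euler integration from t=0 (noise) to t=1."""
    h, x = 1.0 / steps, list(x0)
    for k in range(steps):
        v = velocity(x, k * h, mu)
        x = [xi + h * vi for xi, vi in zip(x, v)]
    return x


def invert(x1, mu, steps):
    """Approximate inverse: Euler integration in reverse time, t=1 to t=0."""
    h, x = 1.0 / steps, list(x1)
    for k in range(steps, 0, -1):
        v = velocity(x, k * h, mu)
        x = [xi - h * vi for xi, vi in zip(x, v)]
    return x


mu = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]  # toy "policy mean" for a 6-step chunk
prefix = [0.9, 1.0, 1.1]             # actions already executed (d = 3)
d, steps = len(prefix), 100

# Independent sample: arbitrary fixed noise, no knowledge of the prefix.
noise = [0.3, -0.5, 0.1, 0.7, -0.2, 0.4]
independent = generate(noise, mu, steps)

# Noise selection: invert a target whose first d entries are the prefix,
# keep the recovered noise there, and leave the remaining noise untouched.
target = prefix + independent[d:]
anchored_noise = invert(target, mu, steps)[:d] + noise[d:]
anchored = generate(anchored_noise, mu, steps)

gap_independent = max(abs(a - p) for a, p in zip(independent[:d], prefix))
gap_anchored = max(abs(a - p) for a, p in zip(anchored[:d], prefix))
print(f"independent prefix gap: {gap_independent:.3f}")
print(f"anchored prefix gap:    {gap_anchored:.3f}")
```

Both samples come from the same `generate` call with the same field; only the starting noise differs. In this toy the residual gap of the anchored sample comes purely from Euler discretization and shrinks as `steps` grows. With a real policy, locality is only approximate, so the noise at later positions would also nudge the prefix slightly.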
How PAINT works
The predicted chunk goes from a jump at the boundary to a clean continuation, just by changing where generation starts.
…
Every line is a forward model call. No vector-Jacobian products at deployment — so PAINT slots cleanly into graph-compiled runtimes like TensorRT.
Results
Across simulation and hardware, PAINT meets or improves on Real-Time Chunking (RTC) on both task success and prefix consistency — while never touching the policy.
| Method | SR ↑ | ATR ↓ | CON ↓ |
|---|---|---|---|
| Block Stacking · single-arm | | | |
| TE | 0.55 | 28.27 | – |
| RTC | **0.75** | 16.04 | 0.030 |
| PAINT | **0.75** | **15.32** | **0.023** |
| Toy in Drawer · single-arm | | | |
| TE | 0.60 | 23.19 | – |
| RTC | 0.75 | 17.85 | 0.025 |
| PAINT | **0.85** | **17.08** | **0.023** |
| Banana in Pot · single-arm | | | |
| TE | 0.60 | **29.57** | – |
| RTC | **0.70** | 30.15 | 0.031 |
| PAINT | **0.70** | 29.79 | **0.026** |
| Towel Flinging · bimanual | | | |
| TE | 0.51 | 17.13 | – |
| RTC | 0.76 | **6.98** | 0.028 |
| PAINT | **0.79** | 7.44 | **0.023** |
| Shorts Folding · bimanual | | | |
| TE | 0.90 | 33.82 | – |
| RTC | 0.90 | **18.79** | 0.027 |
| PAINT | **0.95** | 19.77 | **0.025** |
| Part Placing · humanoid | | | |
| TE | 0.50 | 68.51 | – |
| RTC | **0.70** | 18.28 | 0.030 |
| PAINT | **0.70** | **17.32** | **0.021** |
PAINT is our method. Best value per task and metric in bold. 20 trials per method–task pair. TE is synchronous, so it has no CON score (–).
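A quick tally makes the PAINT-versus-RTC comparison in the table explicit. The snippet below only re-reads the numbers reported above; the helper name `count_wins` is hypothetical, and the direction of each metric follows the arrows in the table header.

```python
# (SR, ATR, CON) per task, copied from the table above.
results = {
    "Block Stacking": {"RTC": (0.75, 16.04, 0.030), "PAINT": (0.75, 15.32, 0.023)},
    "Toy in Drawer":  {"RTC": (0.75, 17.85, 0.025), "PAINT": (0.85, 17.08, 0.023)},
    "Banana in Pot":  {"RTC": (0.70, 30.15, 0.031), "PAINT": (0.70, 29.79, 0.026)},
    "Towel Flinging": {"RTC": (0.76, 6.98, 0.028),  "PAINT": (0.79, 7.44, 0.023)},
    "Shorts Folding": {"RTC": (0.90, 18.79, 0.027), "PAINT": (0.95, 19.77, 0.025)},
    "Part Placing":   {"RTC": (0.70, 18.28, 0.030), "PAINT": (0.70, 17.32, 0.021)},
}


def count_wins(metric_index, higher_is_better):
    """Return (wins, ties, losses) for PAINT against RTC on one metric."""
    wins = ties = losses = 0
    for row in results.values():
        paint, rtc = row["PAINT"][metric_index], row["RTC"][metric_index]
        if paint == rtc:
            ties += 1
        elif (paint > rtc) == higher_is_better:
            wins += 1
        else:
            losses += 1
    return wins, ties, losses


for name, index, higher in [("SR", 0, True), ("ATR", 1, False), ("CON", 2, False)]:
    print(name, count_wins(index, higher))
```

On these numbers PAINT is never below RTC on SR (three wins, three ties), has the lower CON in all six tasks, and has the lower ATR in four of six.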
Across 12 force-control environments and rising inference delay, naive async degrades sharply while temporal ensembling and B-spline smoothing barely help — neither conditions on the executed prefix.
PAINT-Euler shows the strongest robustness to delay and the lowest prefix mismatch, consistently outperforming RTC, without any gradient computation. Each point aggregates 2,048 trials.
Ablation: among inversion methods, backward Euler best balances quality and cost, on par with the 2× costlier DPM-Solver.
86 ± 2 ms (PAINT) vs. 113 ± 3 ms (RTC): faster than gradient-based steering on smaller VLAs.
311 ± 7 ms (PAINT) vs. 213 ± 4 ms (RTC): the trade-off is that extra forward passes replace backward ones.
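A rough calculation connects these latencies to the prefix length d from the problem statement. The sketch assumes d is the latency divided by the control period, rounded up, and uses a hypothetical 50 Hz controller. Neither the formula nor the control rate is stated in the text, so the resulting values are illustrative only.

```python
import math


def steps_elapsed(latency_ms, control_hz):
    """Control steps that pass while one chunk is being generated."""
    period_ms = 1000.0 / control_hz
    return math.ceil(latency_ms / period_ms)


CONTROL_HZ = 50  # hypothetical control rate, not taken from the source

# Mean latencies reported above, in milliseconds.
latencies = [
    ("PAINT, first comparison", 86),
    ("RTC, first comparison", 113),
    ("PAINT, second comparison", 311),
    ("RTC, second comparison", 213),
]
for label, latency_ms in latencies:
    print(f"{label}: d = {steps_elapsed(latency_ms, CONTROL_HZ)}")
```

Under these assumptions the faster method in each comparison also has the shorter prefix to match, since fewer control steps elapse during generation.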
Real-world tasks
Single-arm, bimanual (ALOHA) and humanoid — on two VLA architectures, under natural network inference delay.
Block Stacking: precision pick-and-place; grasp and stack a block.
Toy in Drawer: multi-stage contact; open, place, close.
Banana in Pot: grasp a deformable object and drop it in a pot.
Towel Flinging: dynamic deformable manipulation, scored by flatness.
Shorts Folding: long-horizon bimanual cloth folding.
Part Placing: precision alignment with a dexterous hand.