Under Review

Start right, arrive right.

Asynchronous execution of action-chunking robot policies via initial noise selection — not trajectory steering.

We introduce PAINT — Prefix-Anchored INiTial noise — a training-free inference method. Instead of correcting a policy mid-generation, PAINT chooses the right starting noise so the unmodified flow naturally produces a chunk that continues the motion already underway.

Anonymous Under Review

Paper (PDF) arXiv Code (Soon) Video (Soon)

Training-free · no gradients · no policy modification

The one-paragraph version

A noise problem, not a steering problem.

Robot policies predict action chunks — short bursts of future moves. Generating one takes time, so the robot keeps acting while the next chunk is computed. By the time it arrives, the robot has already moved on, and the new chunk must pick up exactly where the motion is — or it jerks at the seam.

Prior methods fix this by steering the generation toward the executed actions. PAINT asks a different question: what if the trajectory never needed correcting? We invert the flow ODE to find an initial noise whose unmodified rollout already satisfies the constraint — then run the policy exactly as trained.

12 simulated benchmark environments (Kinetix)

6 real-world manipulation tasks

3 robot embodiments: single-arm, bimanual, humanoid

0 gradients, retraining or policy edits

The problem

The prefix constraint

Under asynchronous inference the robot can’t wait. It keeps executing the current chunk while the next one is generated — and advances d steps in the meantime.

So the first d actions of the new chunk must match the last d actions already executed. Violate it and the robot lurches: jerky, unsafe motion at every chunk boundary. The gap only widens as larger VLAs push inference latency past the controller’s clock.

Aₜ₋₁[s + i] = Aₜ[i]   for i = 0, …, d − 1

A chunk sampled independently jumps at the boundary; a prefix-consistent chunk continues the motion.
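The constraint above can be checked numerically. The sketch below is illustrative only: the helper name prefix_mismatch, the array shapes, and the reading of s as the index in the previous chunk where the d in-flight actions begin are assumptions, not taken from the paper.

```python
import numpy as np

def prefix_mismatch(prev_chunk, new_chunk, s, d):
    """Largest absolute gap between A_{t-1}[s+i] and A_t[i] for i = 0..d-1."""
    executed = prev_chunk[s:s + d]   # actions run while the new chunk was computed
    proposed = new_chunk[:d]         # what the new chunk claims for those same steps
    return float(np.max(np.abs(executed - proposed)))

# Hypothetical sizes: chunk of 8 actions, 2-D actions, offset s = 3, delay d = 2.
H, action_dim, s, d = 8, 2, 3, 2
prev_chunk = np.arange(H * action_dim, dtype=float).reshape(H, action_dim)

# A chunk produced with no knowledge of the executed actions.
independent = np.zeros((H, action_dim))

# The same chunk with its first d actions set to the executed ones.
consistent = independent.copy()
consistent[:d] = prev_chunk[s:s + d]

jump = prefix_mismatch(prev_chunk, independent, s, d)
seamless = prefix_mismatch(prev_chunk, consistent, s, d)
print(jump, seamless)
```

A nonzero value for the independent chunk corresponds to the jump at the chunk boundary; zero means the new chunk continues the executed motion exactly.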

The key idea

Two ways to satisfy the constraint

Both want the same prefix. They differ in where they intervene — during generation, or before it.

Prior work — e.g. RTC

Velocity steering

Push the velocity field toward the prefix at each denoising step — using backprop, retraining, or extra compute. Effective, but the orthogonal part of the correction can drag the trajectory off the policy’s learned flow.

PAINT — ours

Noise selection

Pick the starting noise x₀* so the standard ODE lands on the constraint on its own. The velocity field is never touched, so the chunk stays faithful to the policy’s own distribution.

This works because optimal-transport flow matching is approximately local: the noise at each position mostly governs the action at the same position. So inverting from a desired prefix recovers a noise that reproduces it — under the unmodified forward pass.
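To make the locality argument concrete, the sketch below measures how strongly each output position depends on each noise position for a toy flow. Everything here is an assumption for illustration: the linear velocity field, its coupling constant, and the sizes are hypothetical stand-ins, and the locality is built in by construction, so this shows the reasoning rather than evidence about a trained policy.

```python
import numpy as np

H, N = 6, 20  # chunk length and Euler steps (illustrative values)

# Hypothetical stand-in for a learned velocity field: each position decays on
# its own (-0.5 * x) and is weakly coupled to its neighbours (0.05 * K @ x).
K = np.eye(H, k=1) + np.eye(H, k=-1)
A = -0.5 * np.eye(H) + 0.05 * K

def velocity(x, tau):
    return A @ x  # tau is unused in this toy field

def rollout(x0):
    """Forward Euler integration from tau = 0 to tau = 1."""
    x = x0.copy()
    for k in range(N):
        x = x + velocity(x, k / N) / N
    return x

# Finite-difference Jacobian d(action) / d(noise): column j perturbs noise j.
x0 = np.random.default_rng(0).standard_normal(H)
eps = 1e-3
base = rollout(x0)
J = np.stack(
    [(rollout(x0 + eps * np.eye(H)[j]) - base) / eps for j in range(H)],
    axis=1,
)

diag = np.abs(np.diag(J))               # same-position sensitivity
off = np.abs(J - np.diag(np.diag(J)))   # cross-position sensitivity
print("smallest same-position sensitivity:", diag.min())
print("largest cross-position sensitivity:", off.max())
```

When the same-position entries dominate, changing the noise at the prefix positions mostly changes the prefix actions, which is the property that lets an inverted prefix noise be combined with fresh noise elsewhere.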

How PAINT works

Six steps, all forward-only

The algorithm takes the predicted chunk from a jump at the boundary to a clean continuation, just by changing where generation starts.

Sample fresh noise

…

Algorithm 1: PAINT (Prefix-Anchored INiTial noise)
Require: observation oₜ, executed prefix, delay d, ODE steps N
    x₀ᶠʳᵉᵉ ~ 𝒩(0, I)
    x₁ⁿᵃⁱᵛᵉ ← πθ(x₀ᶠʳᵉᵉ, oₜ)    # naive forward pass
    x₁ᵗᵃʳᵍᵉᵗ ← [ executed prefix | x₁ⁿᵃⁱᵛᵉ[d:] ]    # build target
    xτ ← x₁ᵗᵃʳᵍᵉᵗ
    for τ = 1, 1−1/N, … , 1/N:
        xτ ← xτ − (1/N)·vπ(xτ, oₜ, τ)    # backward Euler
    x₀ⁱⁿᵛ ← xτ
    x₀* ← [ x₀ⁱⁿᵛ[:d] | x₀ᶠʳᵉᵉ[d:] ]    # Mao re-painting rule
    Aₜ ← πθ(x₀*, oₜ)    # final forward pass
    return Aₜ

Every model evaluation in the algorithm is a forward call. No vector-Jacobian products are needed at deployment, so PAINT slots cleanly into graph-compiled runtimes like TensorRT.
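The steps of Algorithm 1 can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the velocity function is a hypothetical toy linear field standing in for the policy's velocity network, actions are scalars per position, and the function names forward, invert and paint are invented for this example.

```python
import numpy as np

H = 6  # chunk length (illustrative)
K = np.eye(H, k=1) + np.eye(H, k=-1)
A = -0.5 * np.eye(H) + 0.05 * K

def velocity(x, obs, tau):
    """Hypothetical toy field standing in for the policy's velocity network."""
    return A @ x + obs  # tau is unused in this toy field

def forward(x0, obs, N):
    """Unmodified forward pass: Euler integration from tau = 0 to tau = 1."""
    x = x0.copy()
    for k in range(N):
        x = x + velocity(x, obs, k / N) / N
    return x

def invert(x1, obs, N):
    """Inversion loop of Algorithm 1: step from tau = 1 back down to tau = 1/N."""
    x = x1.copy()
    for k in range(N, 0, -1):
        x = x - velocity(x, obs, k / N) / N
    return x

def paint(obs, prefix, N, rng):
    d = len(prefix)
    x0_free = rng.standard_normal(H)                      # sample fresh noise
    x1_naive = forward(x0_free, obs, N)                   # naive forward pass
    x1_target = np.concatenate([prefix, x1_naive[d:]])    # build target
    x0_inv = invert(x1_target, obs, N)                    # invert the flow
    x0_star = np.concatenate([x0_inv[:d], x0_free[d:]])   # re-paint the noise
    return forward(x0_star, obs, N), x1_naive             # final forward pass

obs = np.full(H, 0.3)            # placeholder observation embedding
prefix = np.array([4.0, 4.5])    # placeholder executed actions (d = 2)
chunk, naive = paint(obs, prefix, N=20, rng=np.random.default_rng(0))

paint_err = np.max(np.abs(chunk[:2] - prefix))
naive_err = np.max(np.abs(naive[:2] - prefix))
print("naive prefix error:", naive_err)
print("PAINT prefix error:", paint_err)
```

In this sketch each call to paint makes 3N velocity evaluations (N for the naive pass, N for the inversion, N for the final pass) and no gradient computation. The residual prefix error comes from the Euler inversion being only an approximate inverse of the forward rollout and from the weak coupling between positions.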

Results

Matches or beats gradient-based steering

Across simulation and hardware, PAINT meets or improves on Real-Time Chunking (RTC) on both task success and prefix consistency — while never touching the policy.

SR ↑: success rate
ATR ↓: avg. time of successful rollouts (s)
CON ↓: prefix consistency error

Method	SR ↑	ATR ↓	CON ↓
Block Stacking · single-arm
TE	0.55	28.27	–
RTC	0.75	16.04	0.030
PAINT	0.75	15.32	0.023
Toy in Drawer · single-arm
TE	0.60	23.19	–
RTC	0.75	17.85	0.025
PAINT	0.85	17.08	0.023
Banana in Pot · single-arm
TE	0.60	29.57	–
RTC	0.70	30.15	0.031
PAINT	0.70	29.79	0.026
Towel Flinging · bimanual
TE	0.51	17.13	–
RTC	0.76	6.98	0.028
PAINT	0.79	7.44	0.023
Shorts Folding · bimanual
TE	0.90	33.82	–
RTC	0.90	18.79	0.027
PAINT	0.95	19.77	0.025
Part Placing · humanoid
TE	0.50	68.51	–
RTC	0.70	18.28	0.030
PAINT	0.70	17.32	0.021

Method	SR ↑	ATR ↓	CON ↓
Block Stacking · single-arm
TE	0.20	58.35	–
RTC	0.50	15.30	0.036
PAINT	0.50	14.90	0.034
Toy in Drawer · single-arm
TE	0.15	53.27	–
RTC	0.65	36.35	0.032
PAINT	0.65	30.39	0.028
Banana in Pot · single-arm
TE	0.10	63.20	–
RTC	0.55	37.41	0.028
PAINT	0.55	39.64	0.026
Towel Flinging · bimanual
TE	0.30	25.07	–
RTC	0.54	17.04	0.039
PAINT	0.80	15.16	0.027
Shorts Folding · bimanual
TE	0.40	52.32	–
RTC	0.65	29.64	0.028
PAINT	0.70	30.54	0.027
Part Placing · humanoid
N/A	checkpoint targets manipulators, not the humanoid platform

20 trials per method-task pair. TE is synchronous, so it has no CON score (shown as –).

Simulated benchmark · Kinetix

Across 12 force-control environments and rising inference delay, naive async degrades sharply while temporal ensembling and B-spline smoothing barely help — neither conditions on the executed prefix.

PAINT-Euler holds the strongest delay robustness and the lowest prefix mismatch, consistently above RTC — without any gradient computation. Each point aggregates 2,048 trials.

Ablation: among inversion methods, backward Euler best balances quality and cost, on par with the 2× costlier DPM-Solver.

Figure 3 — success rate and consistency vs. inference delay on the Kinetix benchmark

Latency · GR00T-N1.5

86 ±2 ms PAINT vs 113 ±3 ms RTC

Faster than gradient-based steering on smaller VLAs.

Latency · π₀

311 ±7 ms PAINT vs 213 ±4 ms RTC

The trade-off: extra forward passes replace backward ones, and on π₀ they cost more.

Real-world tasks

Six tasks, three embodiments

Single-arm, bimanual (ALOHA) and humanoid — on two VLA architectures, under natural network inference delay.

Block Stacking

single-arm · ALOHA

Precision pick-and-place — grasp and stack a block.

Toy in Drawer

single-arm · ALOHA

Multi-stage contact: open, place, close.

Banana in Pot

single-arm · ALOHA

Grasp a deformable object and drop it in a pot.

Towel Flinging

bimanual · ALOHA

Dynamic deformable manipulation, scored by flatness.

Shorts Folding

bimanual · ALOHA

Long-horizon bimanual cloth folding.

Part Placing

humanoid

Precision alignment with a dexterous hand.
