
Sim-to-real transfer: why physics accuracy matters more than visual fidelity

The sim-to-real gap is overwhelmingly a physics problem, not a rendering problem. Here's the research, the failure modes, the calibration ranges that actually matter, and why investing in physics accuracy beats investing in photorealism for most robotics policies.

Rigyd Team

The “sim-to-real gap” is one of the most-discussed phenomena in robot learning. Here’s a sharper way to frame it: the sim-to-real gap is almost entirely a physics gap, not a visual gap. Photorealistic rendering helps perception models. It does not help manipulation policies whose training data has wrong mass, wrong friction, or simplified collision meshes.

This post walks through the research, the specific failure modes, the calibration accuracy that actually transfers, and why investing in physics accuracy beats investing in photorealism for most robotics policies.

The four ways sim-to-real fails

Robotics teams that deploy simulation-trained policies typically observe failures in one of four categories, listed here in order of frequency:

1. Mass mismatch — failed manipulation

A policy trained on a 100g object that’s actually 500g grasps it but drops it on lift, or applies excessive torque and damages the gripper. A policy trained on a 2kg object that’s actually 800g grips with far more force than needed, crushing the object or squeezing it out of the gripper.

The fix is rarely “measure every real object’s mass” — that doesn’t scale. The fix is calibrated mass estimates during simulation, randomized over realistic variance bands during training. Policies that train on 100g–500g objects with proper randomization handle 200g real objects far better than policies trained on a single fixed mass.
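
As a concrete sketch (function name, mug estimate, and band width are all illustrative, not a prescription): sample each episode's mass from a band around the calibrated estimate instead of fixing it.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

def sample_mass(calibrated_kg: float, rel_band: float = 0.2) -> float:
    """Draw a per-episode mass from a band around a calibrated estimate."""
    lo = calibrated_kg * (1.0 - rel_band)
    hi = calibrated_kg * (1.0 + rel_band)
    return float(rng.uniform(lo, hi))

# At each episode reset, give the object a fresh mass inside the
# realistic band rather than reusing one fixed value.
episode_mass = sample_mass(0.2)  # roughly 160-240g around a 200g estimate
```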

2. Friction mismatch — failed stacking, sliding, walking

Stack a tower of blocks in simulation with friction = 0.3, then try it in reality with friction = 0.8: the real blocks hold, but the simulation-trained policy expects them to slip and applies corrective forces that knock the tower over. The same problem runs in reverse for slippery floors during locomotion.

Friction is harder to calibrate than mass because it depends on the contact pair — the same materials with a different surface finish have different friction. Material identification helps but isn’t perfect. Domain randomization within material-class variance is the standard mitigation.
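
That mitigation pattern can look roughly like the sketch below: look up a material-class friction band and sample within it per episode. The band values here are illustrative placeholders, not calibrated data.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Illustrative static-friction bands per material class; real bands
# should come from calibration, since surface finish shifts them a lot.
FRICTION_BANDS = {
    "wood": (0.30, 0.60),
    "rubber": (0.70, 1.10),
    "smooth_plastic": (0.20, 0.45),
}

def sample_friction(material_class: str) -> float:
    """Sample a static friction coefficient within the class band."""
    lo, hi = FRICTION_BANDS[material_class]
    return float(rng.uniform(lo, hi))

mu_s = sample_friction("wood")  # re-drawn per training episode
```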

3. Collision mesh artifacts — phantom geometry

If the simulated collision mesh is the convex hull of an object with deep concavities (a teapot, a horseshoe, a chair), the hull fills each concavity with solid volume that doesn’t exist in reality. The policy learns to exploit this phantom geometry — manipulating in ways that work in simulation but are physically impossible on the real object.

This failure mode is insidious because the policy looks fine in simulation. It only fails on real hardware, often catastrophically. Always validate that collision meshes capture functional concavities for any object the policy will manipulate.
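
A cheap sanity check, sketched here with the trimesh library: compare the mesh's volume to the volume of its convex hull. The 0.3 threshold and file name are assumptions for illustration, not a standard.

```python
import trimesh

def concavity_ratio(mesh_path: str) -> float:
    """Fraction of the convex hull's volume the actual mesh does not fill.

    A convex-hull collider fills that missing volume with phantom solid,
    so a high ratio flags cavities (mug interior, space under a chair
    seat) that the simulated object would wrongly treat as solid.
    """
    mesh = trimesh.load(mesh_path, force="mesh")
    if not mesh.is_watertight:
        raise ValueError("volume comparison needs a watertight mesh")
    return 1.0 - mesh.volume / mesh.convex_hull.volume

# Threshold chosen arbitrarily for illustration: flag anything where
# the hull adds more than 30% phantom volume.
if concavity_ratio("teapot.obj") > 0.3:
    print("convex hull unsafe as a collider; use convex decomposition")
```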

4. Mass distribution mismatch — bad rotation dynamics

Asymmetric objects (tools, bottles, electronics) have off-center centers of mass and non-diagonal inertia tensors. Simulators that default to uniform-density assumptions get rotation dynamics meaningfully wrong. The policy learns to flip a tool that “always lands handle-down” because of how the simulated tool rotates — a behavior that fails on real tools whose mass distribution differs.

Most off-the-shelf simulators handle this correctly if you populate PhysicsMassAPI with computed COM and inertia tensors. They handle it badly with default-zero values.
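
Here is roughly what populating those values looks like with the OpenUSD Python bindings, using trimesh to compute the COM and inertia from geometry. The file paths, prim path, and density are hypothetical, and keeping only the diagonal of the inertia tensor is a simplification; a full treatment diagonalizes the tensor and also sets physics:principalAxes.

```python
import trimesh
from pxr import Gf, Usd, UsdPhysics

# Compute mass properties from geometry plus an assumed density,
# instead of accepting uniform-density or default-zero values.
mesh = trimesh.load("tool.obj", force="mesh")
mesh.density = 1200.0                  # kg/m^3, e.g. an estimated plastic
com = mesh.center_mass                 # off-center for asymmetric tools
diag = mesh.moment_inertia.diagonal()  # simplification: diagonal terms only

stage = Usd.Stage.Open("tool.usda")
prim = stage.GetPrimAtPath("/World/tool")

mass_api = UsdPhysics.MassAPI.Apply(prim)
mass_api.CreateMassAttr(float(mesh.mass))
mass_api.CreateCenterOfMassAttr(Gf.Vec3f(*map(float, com)))
mass_api.CreateDiagonalInertiaAttr(Gf.Vec3f(*map(float, diag)))
stage.Save()
```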

The research evidence

The empirical case for physics accuracy over visual fidelity is well-established:

  • NVIDIA GR00T N1 Benchmark (2025): training that combined synthetic and real data with physics-accurate assets produced 40% better real-world performance than visual-only synthetic training at the same data scale.
  • NeRF2Physics (CVPR 2024) by Princeton and CMU: physical-property estimation accuracy of ~15% on mass and ~0.1 on friction is sufficient for transfer of manipulation policies; outside that range, performance collapses.
  • Domain randomization meta-analysis (OpenAI, ETH Zurich, multiple groups): DR consistently outperforms photorealistic rendering for manipulation transfer, especially when DR is centered on calibrated rather than uniform-random parameter distributions.
  • Sim-to-real benchmarks for dexterous manipulation: friction-aware physics during training improves real-world success rates by 2–4× compared to friction-uniform training, even with identical visual rendering.

The pattern across this literature is consistent: physics accuracy and physics randomization are first-order. Photorealism is second-order — useful for perception, marginal for control.

What “good enough” actually means

The reflexive answer is “as accurate as possible.” The pragmatic answer is more interesting.

For most manipulation and locomotion policies, the calibration target is:

| Property | Sufficient accuracy | Marginal beyond |
| --- | --- | --- |
| Mass | ±15–20% of measured | Diminishing returns; DR variance dominates |
| Static friction | ±0.1 coefficient | Within material-class variance |
| Dynamic friction | ±0.1 coefficient | Same |
| Restitution | ±0.15 | Most contact is plastic; restitution rarely dominant |
| Center of mass | Within 1–2 cm for normal-sized objects | Tight for small precision parts |
| Inertia tensor | Computed from geometry + mass distribution (not uniform) | Measured for hero assets |

These aren’t aspirational targets — they’re the bands inside which good domain randomization makes the policy robust. Tighter accuracy doesn’t help; looser accuracy hurts. The objective is “calibrated within DR variance,” not “exact.”

The exceptions are precision tasks: peg-in-hole assembly with sub-millimeter clearances, surgical manipulation, dexterous in-hand reorientation. For these, the calibration target is tight (±5% mass, ±0.05 friction), and you should measure ground truth on hero assets rather than relying on AI estimation alone.

Why uniform-random domain randomization fails

A common mistake: “physics doesn’t matter; we’ll randomize over a wide range.” This produces worse policies, not better ones.

If you randomize an object’s mass uniformly between 0.01kg and 100kg, you’re training the policy on 10g feathers and 100kg anvils interchangeably. The policy learns conservative behaviors that fail at all mass scales because none of the training distribution matches reality.

The right approach is calibrated baselines + variance bands:

  1. Estimate mass with reasonable accuracy (say, 200g for a coffee mug).
  2. Randomize within a realistic band (180–220g, or wider for material variation: 150–250g).
  3. Train; the policy learns to handle objects in the band, not a uniform distribution over implausible values.
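
Pulling those three steps into one sketch, with a calibrated band per property resampled each episode. The mug numbers reuse the example above; the class and field names are invented for illustration.

```python
from dataclasses import dataclass

import numpy as np

rng = np.random.default_rng(seed=0)

@dataclass
class CalibratedBand:
    center: float    # step 1: the calibrated estimate
    rel_band: float  # step 2: realistic relative variance

    def sample(self) -> float:
        lo = self.center * (1.0 - self.rel_band)
        hi = self.center * (1.0 + self.rel_band)
        return float(rng.uniform(lo, hi))

# Coffee-mug example from above: mass centered on the 200g estimate,
# widened to 150-250g for material variation; the friction band is
# an invented placeholder.
mug = {
    "mass_kg": CalibratedBand(center=0.200, rel_band=0.25),
    "friction": CalibratedBand(center=0.45, rel_band=0.20),
}

# Step 3: resample every episode so training covers the whole band,
# not a uniform spread over implausible values.
episode_params = {name: band.sample() for name, band in mug.items()}
```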

This is why AI-driven physics estimation (Rigyd and similar tools) helps even — especially — when you intend to use domain randomization downstream. The estimates provide the calibrated center; DR provides the variance.

Visual fidelity has its place — but it’s not the bottleneck

Photorealistic rendering matters for perception:

  • Vision models trained on photorealistic synthetic data transfer better to real RGB input.
  • Lighting variation in synthetic data improves robustness to real lighting.
  • Texture diversity helps generalization to new objects.

But the gap from “decent rendering” to “photorealistic rendering” is small for most policies, while the gap from “wrong physics” to “calibrated physics” is large. If you have one engineer-week of budget, spending it on physics accuracy compounds far more than spending it on photorealism for any task involving manipulation or contact.

There’s a related observation: most teams over-invest in visual fidelity because it’s visible. Wrong physics shows up as policy failure on real hardware, weeks after the work was done. Wrong rendering is obvious in the first frame. Optimizing for what’s visible isn’t always the right priority.

A practical sim-to-real workflow

If your goal is policies that transfer cleanly:

  1. Estimate physics with reasonable accuracy — ±15-20% mass, ±0.1 friction. AI-automated tools (Rigyd) or manual measurement on hero assets, depending on scale.
  2. Compute inertia from geometry + estimated mass distribution. Never use the simulator’s default identity inertia.
  3. Generate proper collision meshes. Convex decomposition for graspables; primitives for static obstacles. Validate that functional concavities are captured (see the sketch after this list).
  4. Train with domain randomization centered on calibrated baselines. Variance bands of ±20% on mass, ±0.1 on friction, ±0.05 on COM offset for normal objects.
  5. Validate on real hardware early and often. The first real-robot test reveals what your simulator got wrong; iterate on physics calibration based on failures.
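
For step 3, a minimal sketch with trimesh: convex decomposition for the graspable, a primitive collider for the static obstacle. Note that convex_decomposition() relies on an optional VHACD backend whose behavior varies by version, and all file names here are placeholders.

```python
import trimesh

# Graspable object: decompose into convex pieces so concavities
# (a mug interior, the gap under a handle) stay open to contact.
mesh = trimesh.load("mug.obj", force="mesh")

# convex_decomposition() calls an external VHACD backend (an optional
# trimesh dependency); some versions return a single mesh, so normalize.
pieces = mesh.convex_decomposition()
if isinstance(pieces, trimesh.Trimesh):
    pieces = [pieces]
for i, piece in enumerate(pieces):
    piece.export(f"mug_collision_{i:02d}.obj")

# Static obstacle: a primitive collider is cheaper and usually enough.
table_collider = trimesh.primitives.Box(extents=[1.2, 0.8, 0.05])
```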

For research-grade policies, add:

  1. Lab-measure hero assets — the 5–10 objects most central to your benchmark.
  2. System identification. Run real-robot experiments to identify residual sim-to-real gaps and correct simulator parameters accordingly.
  3. Real-data fine-tuning. Even a small amount of real-world data fine-tuning closes most of the remaining gap.

What this means for asset preparation

The implication for how you prepare simulation assets:

  • Calibrated physics on every asset beats high-fidelity physics on a few.
  • Diversity at scale (1,000+ objects) outperforms quality at small scale (50 hero objects) for policy generalization.
  • Automated SimReady generation is the right default for any project beyond a handful of hero assets — the time saved on manual physics annotation buys you more diversity, more scenarios, and more validation cycles.
  • Domain randomization is a multiplier on calibration, not a substitute for it. Both matter; neither alone is enough.

This is why the asset preparation bottleneck — the four-hours-per-asset traditional workflow — limits robotics deployment more than algorithmic innovation does for most teams. Algorithms are well-understood; the data infrastructure is the bottleneck.

If you’re standing up a new simulation environment in 2026, the practical priority order is:

  1. Get physics calibrated correctly on every asset (manual for hero, automated for the long tail).
  2. Build domain randomization on top of those calibrated baselines.
  3. Then invest in rendering quality, if perception requires it.
  4. Validate on real hardware and iterate.

Most sim-to-real failures I see in practice are explained by skipping step 1 — assuming visual fidelity or default physics is enough. It usually isn’t. The teams shipping policies that actually generalize are the ones treating physics calibration as the foundation everything else builds on.

Frequently asked questions

Why do policies trained in simulation fail on real robots?

Almost always because of physics mismatch, not visual mismatch. Wrong mass causes failed grasps; wrong friction causes failed stacking; wrong collision geometry creates phantom contact surfaces the policy learned to exploit. Visual differences (lighting, textures, exposure) are a smaller and more easily addressed problem than physical differences (mass, friction, contact dynamics, mass distribution).

How accurate do physics values need to be for sim-to-real transfer?

Mass within 15-20% of measured values and friction coefficients within ±0.1 are sufficient for most robotic manipulation and locomotion policies — both inside typical domain randomization variance bands. Tighter accuracy is rarely worth the engineering cost; randomizing within those bands is what produces robust policies. Critical exception: precision tasks (peg-in-hole, surgical manipulation, dexterous grasping) where you should measure ground truth on hero assets.

Does domain randomization eliminate the need for accurate physics?

No — it amplifies it. Domain randomization needs a realistic baseline to randomize around. Randomizing mass uniformly between 0.1kg and 10kg for every object produces nonsensical training data that hurts policies more than it helps. The right approach is calibrated estimates within physically plausible ranges (15-20% mass accuracy, ±0.1 friction), then DR within typical material-class variance bands centered on those estimates.

Skip the manual physics work

Upload any 3D model and get a SimReady OpenUSD asset in minutes. Mass, friction, collision meshes — all calibrated automatically.