How much real-world data can synthetic data replace?

For most robotics perception tasks, 60-80% of training data can be synthetic, given proper domain randomization, with the remaining 20-40% real data used for fine-tuning. NVIDIA's GR00T N1 benchmark showed 40% better real-world performance from combined synthetic + real training than from real-only training at the same data scale. Pure-synthetic training with no real-world fine-tuning works for some tasks (object detection in known environments) but rarely transfers cleanly to manipulation policies without at least some real data.

Why doesn't photorealistic rendering alone close the sim-to-real gap?

Because perception failures from visual mismatch are usually smaller than physics failures from mass and friction mismatch. A model trained on photorealistic synthetic data still fails at manipulation tasks if the underlying physics is wrong: the policy learns to grasp objects with the wrong mass, leading to drops on real hardware regardless of how good the visual transfer was. The robust pattern is photorealistic rendering AND physically accurate simulation; both are needed, and neither alone is enough.

What asset metadata does synthetic data generation actually require?

At minimum: per-prim semantic class labels (via SemanticsAPI in USD), instance IDs for multi-instance scenes, and physics properties (mass, friction, collision) for any object that's manipulated. Specific tasks also need keypoints (graspable points, attachment points), 6-DOF poses, depth maps (auto-generated by Replicator), and segmentation masks (also auto-generated). The asset preparation work is the same as for policy training: physics and semantics matter for both.

Synthetic data generation for computer vision in robotics

Real-world labeled data for robotics is expensive. Capturing 100,000 images of a robot manipulating warehouse SKUs, with bounding boxes and grasp poses, takes weeks of operator time and produces a dataset that’s obsolete the moment a new SKU enters the catalog. Synthetic data, generated from simulation with automatic labels, promises unlimited scale and instant relabeling.

The catch: synthetic data is only as good as the simulation that produces it. Get the physics, semantics, or asset diversity wrong, and the resulting model fails on real hardware in ways that look mysterious until you trace them back to the synthetic training set.

This guide covers the complete pipeline: what assets and metadata you need, how to use NVIDIA Replicator and similar tools, the role of domain randomization, and how to validate that your synthetic data actually transfers.

What “synthetic data” means in robotics

A typical synthetic robotics dataset includes:

RGB images: rendered from simulator cameras, with photorealistic lighting and materials
Depth maps: pixel-perfect ground truth, free from real-sensor noise
Segmentation masks: instance and semantic, automatically generated from object metadata
Bounding boxes: 2D and 3D, with object class labels
Keypoints: graspable points, joint positions, attachment landmarks
6-DOF poses: object positions and orientations in camera/world frames
Physics annotations: mass, friction, contact forces (for tasks where physics matters)

The structural advantage over real data: every label is automatic and exact. No human-in-the-loop annotation, no boundary errors on segmentation masks, no measurement uncertainty on poses.
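One way to picture a single dataset entry is as a small record type. The sketch below is illustrative only: the class names, field names, and file paths are assumptions made for this example, not the schema of Replicator or any other tool.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class ObjectLabel:
    """Per-object ground truth for one rendered frame (hypothetical schema)."""
    instance_id: int
    class_name: str
    bbox_2d: Tuple[int, int, int, int]            # (x_min, y_min, x_max, y_max), pixels
    position_m: Tuple[float, float, float]        # translation in the camera frame
    orientation_xyzw: Tuple[float, float, float, float]  # unit quaternion
    keypoints_px: Dict[str, Tuple[float, float]] = field(default_factory=dict)
    mass_kg: Optional[float] = None               # only needed when physics matters


@dataclass
class SyntheticSample:
    """One frame: image files plus the labels that were generated with it."""
    rgb_path: str
    depth_path: str
    instance_mask_path: str
    semantic_mask_path: str
    objects: List[ObjectLabel] = field(default_factory=list)

    def class_counts(self) -> Dict[str, int]:
        """Count labeled objects per class, useful for later balancing."""
        counts: Dict[str, int] = {}
        for obj in self.objects:
            counts[obj.class_name] = counts.get(obj.class_name, 0) + 1
        return counts


sample = SyntheticSample(
    rgb_path="frame_000001_rgb.png",
    depth_path="frame_000001_depth.npy",
    instance_mask_path="frame_000001_inst.png",
    semantic_mask_path="frame_000001_sem.png",
    objects=[
        ObjectLabel(1, "box", (40, 60, 120, 150), (0.1, 0.0, 0.8), (0.0, 0.0, 0.0, 1.0)),
        ObjectLabel(2, "box", (200, 80, 260, 140), (0.3, 0.1, 0.9), (0.0, 0.0, 0.0, 1.0)),
        ObjectLabel(3, "bottle", (300, 50, 330, 160), (0.5, 0.0, 1.0), (0.0, 0.0, 0.0, 1.0),
                    keypoints_px={"grasp": (315.0, 100.0)}, mass_kg=0.2),
    ],
)
```

Keeping every label for a frame in one record makes it straightforward to filter or rebalance samples later without re-rendering.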

What physics has to do with computer vision

A common assumption: “synthetic data for vision is purely about rendering.” Half-true. For pure object-detection tasks (find the box in the image), rendering quality dominates. For manipulation tasks (grasp the box, place it on the shelf), the underlying physics determines whether the data is useful even when the rendering is identical.

Why: the value of a synthetic dataset for manipulation isn’t just the labels; it’s the distribution of object configurations it captures. A simulation with wrong physics produces unrealistic configurations (objects float, stacks balance impossibly, robots interact with objects that move incorrectly). Vision models trained on those configurations learn incorrect priors about how the world works.

Concrete example: a vision model trained on synthetic stacking data where the physics produces too-stable stacks (over-friction) learns to expect rigid contacts. Deployed in reality, it under-predicts stack instability and the controller it informs makes overconfident grasp decisions, producing drops.

The robust pattern: photorealistic rendering AND physically accurate simulation. Both matter; neither alone is enough.

Asset preparation requirements

For synthetic data generation, every asset needs everything required for policy training plus a few extras:

Visual fidelity: PBR materials, high-resolution textures, normal maps
Physics: mass, friction, collision meshes (same as for training)
Semantic labels: SemanticsAPI with class label and optionally instance ID
Keypoints (task-specific): annotated graspable points, attachment points
Pose canonical frame: well-defined “up” and “front” for orientation labels

The good news: 90% of this is the same SimReady asset preparation that policy training requires. A SimReady asset is also a synthetic-data-ready asset.
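A lightweight pre-flight check can catch assets that are missing any of the requirements above before they reach rendering. The sketch below assumes each asset's metadata has already been exported to a plain dictionary. The field names are hypothetical, and a real pipeline would more likely read these values from USD attributes or a manifest file.

```python
from typing import Any, Dict, List

# Hypothetical field names chosen for this example.
REQUIRED_TEXT = ("semantic_class", "collision_mesh", "up_axis", "front_axis")
REQUIRED_POSITIVE = ("mass_kg",)
REQUIRED_NON_NEGATIVE = ("static_friction", "dynamic_friction")


def metadata_problems(asset: Dict[str, Any]) -> List[str]:
    """Return a list of human-readable problems; empty means the asset passes."""
    problems: List[str] = []
    for name in REQUIRED_TEXT:
        value = asset.get(name)
        if not isinstance(value, str) or not value:
            problems.append(f"missing or empty: {name}")
    for name in REQUIRED_POSITIVE:
        value = asset.get(name)
        if not isinstance(value, (int, float)) or value <= 0:
            problems.append(f"must be a positive number: {name}")
    for name in REQUIRED_NON_NEGATIVE:
        value = asset.get(name)
        if not isinstance(value, (int, float)) or value < 0:
            problems.append(f"must be a non-negative number: {name}")
    return problems


good_asset = {
    "semantic_class": "cereal_box",
    "collision_mesh": "cereal_box_collision.usd",
    "up_axis": "Z",
    "front_axis": "X",
    "mass_kg": 0.45,
    "static_friction": 0.6,
    "dynamic_friction": 0.5,
}

# Same asset with the semantic label dropped and a default (zero) mass.
bad_asset = {k: v for k, v in good_asset.items() if k != "semantic_class"}
bad_asset["mass_kg"] = 0.0
```

Running the check over the whole library and refusing to render until the problem list is empty is one way to avoid the "default physics" failure mode discussed later.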

The synthetic data pipeline

The end-to-end pipeline runs through five stages:

Stage 1: Scene composition

Generate diverse scenes by procedurally combining assets. Variables:

Object selection: random subset from your asset library
Object placement: random positions within physically plausible regions
Object orientation: random poses, with physics simulation to settle into stable configurations
Background: random environments, lighting setups, ground textures

The goal is diversity, not realism per se. A vision model trained on 100 distinct scenes overfits to those specific configurations; a model trained on 100,000 procedurally varied scenes generalizes.
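The scene variables above can be sketched as a simulator-agnostic sampling function. Everything here is an assumption for illustration: the function name, the dictionary layout, and the numeric ranges are made up, and the physics settling step is left to whichever simulator consumes the scene description.

```python
import random
from typing import Any, Dict, List


def compose_scene(
    asset_ids: List[str],
    backgrounds: List[str],
    rng: random.Random,
    max_objects: int = 5,
    half_extent_m: float = 0.4,
) -> Dict[str, Any]:
    """Sample one scene description: which objects, where, and on what background."""
    count = rng.randint(1, min(max_objects, len(asset_ids)))
    chosen = rng.sample(asset_ids, count)  # random subset, no repeats
    objects = []
    for asset_id in chosen:
        objects.append({
            "asset_id": asset_id,
            # Initial placement inside a square workspace centered on the origin.
            "xy_m": (rng.uniform(-half_extent_m, half_extent_m),
                     rng.uniform(-half_extent_m, half_extent_m)),
            # Objects start slightly above the surface; the simulator would then
            # step physics until they settle into a stable pose.
            "drop_height_m": rng.uniform(0.05, 0.20),
            "yaw_deg": rng.uniform(0.0, 360.0),
        })
    return {"background": rng.choice(backgrounds), "objects": objects}


library = [f"sku_{i:04d}" for i in range(100)]
envs = ["warehouse_a", "warehouse_b", "lab_bench"]
scene = compose_scene(library, envs, random.Random(0))
```

Passing an explicit seeded `random.Random` keeps each scene reproducible, which helps when a specific failure needs to be traced back to the scene that produced it.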

Stage 2: Domain randomization

Randomize visual parameters per scene:

Lighting: direction, intensity, color temperature, number of lights
Camera: position, FOV, exposure, focal length, motion blur
Material parameters: roughness, metallic, base color (within plausible ranges)
Backgrounds: random environments, HDRIs, textures
Object jitter: small position/orientation perturbations from rest poses

The breadth of randomization matters. Narrow randomization produces models that fail on small distribution shifts; wide randomization produces robust models, but only if the underlying physics keeps configurations realistic.
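Per-scene visual randomization can be expressed as a table of named ranges that is sampled once per scene. The sketch below is tool-independent. The parameter names and numeric bounds are placeholders chosen for illustration and would need to be set from the target deployment conditions.

```python
import random
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class Range:
    """Closed interval sampled uniformly."""
    low: float
    high: float

    def sample(self, rng: random.Random) -> float:
        return rng.uniform(self.low, self.high)


# Placeholder bounds (assumptions): widen or narrow to match real conditions.
VISUAL_RANGES: Dict[str, Range] = {
    "light_intensity_lux": Range(200.0, 2000.0),
    "light_color_temp_k": Range(2700.0, 6500.0),
    "camera_fov_deg": Range(50.0, 90.0),
    "camera_exposure_ev": Range(-1.0, 1.0),
    "material_roughness": Range(0.1, 0.9),
    "object_jitter_m": Range(0.0, 0.01),
}


def sample_visual_params(rng: random.Random,
                         ranges: Dict[str, Range] = VISUAL_RANGES) -> Dict[str, float]:
    """Draw one value per parameter for a single scene."""
    return {name: r.sample(rng) for name, r in ranges.items()}


params = sample_visual_params(random.Random(42))
```

Holding the ranges in data rather than code makes it easier to run sweeps over randomization width later without touching the sampling logic.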

Stage 3: Rendering

Generate the actual RGB images plus all auxiliary outputs (depth, masks, etc.).

Major tools:

NVIDIA Isaac Sim Replicator: Omniverse-native, USD-driven, parallel rendering on RTX GPUs. The reference implementation for robotics synthetic data.
NVIDIA Omniverse Replicator: same engine, accessible from Omniverse runtime
Unity Perception: Unity-based, popular for non-NVIDIA workflows
Custom rendering pipelines: for teams with very specific requirements

Rendering throughput on a single RTX 6000-class GPU: ~10-50 images/sec at 1080p with full label generation. A 1M-image dataset takes 6-30 hours depending on scene complexity and label types.
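The render-time estimate follows directly from dataset size divided by throughput. A small helper makes it easy to redo the arithmetic for other dataset sizes. It assumes a constant rate on a single GPU, which real workloads will only approximate.

```python
def render_hours(num_images: int, images_per_sec: float) -> float:
    """Wall-clock hours to render num_images at a constant rate."""
    return num_images / images_per_sec / 3600.0


best_case = render_hours(1_000_000, 50)   # about 5.6 hours at 50 images/sec
worst_case = render_hours(1_000_000, 10)  # about 27.8 hours at 10 images/sec
```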

Stage 4: Label generation

Auxiliary outputs are generated alongside the RGB. Replicator handles this automatically given the asset metadata:

Instance segmentation: pixel-perfect, derived from per-prim instance IDs
Semantic segmentation: derived from SemanticsAPI labels
Depth: exact, from the rendering Z-buffer
2D bounding boxes: derived from instance masks
3D bounding boxes: from object USD prim transforms
6-DOF poses: same source

The labels are exact by construction. This is the structural advantage of synthetic data: no human annotators introducing errors, no edge cases mishandled.
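To illustrate how one label is derived from another, the sketch below computes tight 2D bounding boxes from an instance mask. It is a plain-Python illustration of the idea, not how Replicator implements it. The mask is assumed to be a row-major grid of integer instance IDs with 0 as background.

```python
from typing import Dict, List, Tuple

BBox = Tuple[int, int, int, int]  # (x_min, y_min, x_max, y_max), inclusive pixels


def bboxes_from_instance_mask(mask: List[List[int]],
                              background_id: int = 0) -> Dict[int, BBox]:
    """Return the tight bounding box of every instance ID present in the mask."""
    boxes: Dict[int, List[int]] = {}
    for y, row in enumerate(mask):
        for x, instance_id in enumerate(row):
            if instance_id == background_id:
                continue
            box = boxes.get(instance_id)
            if box is None:
                boxes[instance_id] = [x, y, x, y]
            else:
                box[0] = min(box[0], x)
                box[1] = min(box[1], y)
                box[2] = max(box[2], x)
                box[3] = max(box[3], y)
    return {instance_id: (b[0], b[1], b[2], b[3]) for instance_id, b in boxes.items()}


mask = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 0, 0],
    [0, 1, 1, 0, 2],
    [0, 0, 0, 0, 2],
]
boxes = bboxes_from_instance_mask(mask)
# boxes == {1: (1, 1, 2, 2), 2: (4, 2, 4, 3)}
```

Because the box is computed from the mask rather than drawn by a person, it is exact for the visible pixels of each instance.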

Stage 5: Dataset curation

Raw output from rendering needs filtering:

Reject failed configurations: physics drops where objects ended up out-of-frame or in unrealistic poses
Balance class distributions: ensure rare classes aren’t underrepresented
Sanity-check physics outcomes: exclude scenes with obvious physics artifacts
Verify label correctness: automated checks that masks align with rendered objects

Curated datasets are what feed model training. Skipping curation produces models that learn the simulator’s bugs.
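Two of the curation steps above, rejecting failed configurations and checking class balance, can be sketched as simple filters over per-sample metadata. The dictionary layout, workspace bounds, and the 10% share threshold are assumptions for this example.

```python
from collections import Counter
from typing import Any, Dict, List, Tuple


def settled_in_workspace(obj: Dict[str, Any],
                         half_extent_m: float = 0.5,
                         min_z_m: float = -0.01) -> bool:
    """True if the object's settled position is on the work surface."""
    x, y, z = obj["position_m"]
    return abs(x) <= half_extent_m and abs(y) <= half_extent_m and z >= min_z_m


def curate(samples: List[Dict[str, Any]]) -> Tuple[List[Dict[str, Any]], Counter]:
    """Drop empty scenes and scenes where any object left the workspace."""
    kept = [
        s for s in samples
        if s["objects"] and all(settled_in_workspace(o) for o in s["objects"])
    ]
    counts = Counter(o["class_name"] for s in kept for o in s["objects"])
    return kept, counts


def underrepresented(counts: Counter, min_share: float = 0.10) -> List[str]:
    """Classes whose share of all labeled objects falls below min_share."""
    total = sum(counts.values())
    if total == 0:
        return []
    return sorted(c for c, n in counts.items() if n / total < min_share)


raw = [
    {"id": "a", "objects": [{"class_name": "box", "position_m": (0.1, 0.0, 0.02)}]},
    {"id": "b", "objects": [{"class_name": "bottle", "position_m": (0.9, 0.0, 0.02)}]},  # slid off
    {"id": "c", "objects": [{"class_name": "box", "position_m": (0.0, 0.2, -0.6)}]},     # fell
    {"id": "d", "objects": []},                                                          # empty
]
kept, counts = curate(raw)
```

Classes flagged by `underrepresented` can be fed back into scene composition as sampling weights for the next rendering batch.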

Domain randomization done right

The literature on domain randomization is rich, but two principles dominate:

1. Randomize broadly within physically plausible bounds. Wide-uniform random over impossible values (e.g., gravity = 100 m/s²) produces worse models. Wide random within plausible bounds (e.g., gravity = 9.5-10.0 m/s²) produces better generalization than fixed values.

2. Center on calibrated values, not random ones. If your asset’s mass should be 200g, randomize within 180-220g. Don’t randomize uniformly between 0 and 1000g; that’s training on impossible objects.

The data structure that makes this work: per-asset calibrated physics + per-scene variance bands. AI-driven asset preparation gets you the calibrated values; the rendering pipeline handles the variance.
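A minimal sketch of that data structure: a calibrated value per asset plus a relative variance band applied per scene. The class and function names are hypothetical. The 10% mass band reproduces the 180-220g example above for a 200g asset, while the 15% friction band is an arbitrary placeholder.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class PhysicsParams:
    mass_kg: float
    friction: float


def sample_physics(calibrated: PhysicsParams,
                   rng: random.Random,
                   mass_band: float = 0.10,
                   friction_band: float = 0.15) -> PhysicsParams:
    """Perturb calibrated values by a uniform relative band, e.g. +/-10% mass."""
    mass = calibrated.mass_kg * rng.uniform(1.0 - mass_band, 1.0 + mass_band)
    friction = calibrated.friction * rng.uniform(1.0 - friction_band, 1.0 + friction_band)
    return PhysicsParams(mass_kg=mass, friction=friction)


calibrated_box = PhysicsParams(mass_kg=0.200, friction=0.6)
rng = random.Random(7)
draws = [sample_physics(calibrated_box, rng) for _ in range(1000)]
# Every sampled mass stays within 0.18-0.22 kg, centered on the calibrated 0.20 kg.
```

Expressing the band as a fraction of the calibrated value means the same band settings stay physically plausible across light and heavy assets alike.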

Common pitfalls in synthetic-data programs

Patterns that consistently produce failed transfer:

Photorealism over physics. Spending months perfecting rendering while leaving physics at default values. Real-world failures show up as manipulation drops, traceable to the synthetic training distribution being physically unrealistic.
Narrow object diversity. A library of 200 objects with 1,000 variations each generalizes worse than 10,000 distinct objects with no variations. Diversity dominates.
Skipping semantic labels. Adding semantic data later is harder than annotating during asset preparation. Capture labels as part of SimReady generation.
Default physics + photorealistic rendering. The worst combination: looks great, fails in transfer because the physics is silently wrong.
Real-data ablation skipped. Synthetic-only training works for some tasks but rarely for manipulation. Always plan to fine-tune with 20-40% real-world data.
Rendering for one camera setup. Real robots have varying camera configurations. Randomize FOV, position, focal length, and exposure across the dataset.

Validation: how to know your synthetic data works

Standard validation patterns:

Holdout-real test set. Train on synthetic, evaluate on a small real-world labeled set. Track gap over training duration.
Per-class performance. Synthetic biases often show up as class-imbalanced failures. Watch class-by-class metrics, not just aggregates.
Sim-to-real ablation study. Train models with varying synthetic/real ratios. Identify the inflection point where a higher synthetic ratio stops adding value (typically 60-80% synthetic).
Domain randomization sweep. Train models with progressively wider DR ranges. Performance peaks at “as wide as physically plausible,” collapses outside that.
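The first two patterns can be combined into one report: per-class accuracy on a synthetic validation set and on the real holdout, and the difference between them. The sketch below uses classification accuracy for simplicity. Detection or segmentation tasks would substitute their own metric, and the labels shown are toy data.

```python
from collections import defaultdict
from typing import Dict, List


def per_class_accuracy(y_true: List[str], y_pred: List[str]) -> Dict[str, float]:
    """Fraction of correct predictions for each ground-truth class."""
    correct: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    for truth, pred in zip(y_true, y_pred):
        total[truth] += 1
        if truth == pred:
            correct[truth] += 1
    return {cls: correct[cls] / total[cls] for cls in total}


def sim_to_real_gap(synthetic_acc: Dict[str, float],
                    real_acc: Dict[str, float]) -> Dict[str, float]:
    """Positive values mean the model does better on synthetic than on real data."""
    return {cls: synthetic_acc[cls] - real_acc[cls]
            for cls in synthetic_acc if cls in real_acc}


# Toy example: the model is perfect on synthetic validation data...
syn_acc = per_class_accuracy(["box", "box", "bottle", "bottle"],
                             ["box", "box", "bottle", "bottle"])
# ...but confuses one real bottle for a box.
real_acc = per_class_accuracy(["box", "box", "bottle", "bottle"],
                              ["box", "box", "box", "bottle"])
gap = sim_to_real_gap(syn_acc, real_acc)
# gap == {"box": 0.0, "bottle": 0.5}
```

In this toy case the aggregate real accuracy is 75%, which hides the fact that the entire gap sits in one class.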

How automated asset prep helps

The asset-preparation bottleneck dominates the time-to-first-synthetic-dataset for most teams. Manual SimReady authoring at 4 hours per asset means a 10,000-asset library takes 40,000 engineer-hours. Synthetic data programs that wait for that work to finish start 12-18 months after kickoff.

AI-automated SimReady generation (covered in How to create SimReady assets without manual modeling) reduces this to ~5 minutes per asset. A 10,000-asset library is then ~830 hours of automated processing: about 35 days run serially, or ~3 days when spread across roughly a dozen parallel workers. The downstream synthetic-data program starts on week one rather than month twelve.
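The figures above can be checked with a few lines of arithmetic. Note that at ~5 minutes per asset, reaching a ~3-day turnaround for 10,000 assets implies processing roughly a dozen assets in parallel. That worker count is an inference from the quoted numbers, not a stated specification.

```python
def library_hours(num_assets: int, minutes_per_asset: float) -> float:
    """Total processing hours for a library at a fixed per-asset cost."""
    return num_assets * minutes_per_asset / 60.0


manual_hours = library_hours(10_000, 4 * 60)   # 40,000 engineer-hours
automated_hours = library_hours(10_000, 5)     # about 833 compute-hours

serial_days = automated_hours / 24             # about 34.7 days on one worker
workers_for_3_days = automated_hours / (3 * 24)  # about 11.6 parallel workers
```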

The math: if synthetic data unlocks $1M of robotics-pipeline value over a year, the cost difference between waiting 12 months vs starting in week one is most of that $1M. Asset preparation speed is often the rate-limiting step in robotics ROI.

Practical recommendations

If you’re standing up a synthetic-data program for robotics computer vision:

Author assets with both physics and semantics from day one. Photorealism alone isn’t enough; add labels and physics or rebuild later.
Use AI-automated SimReady generation for any asset library beyond ~50 hero objects. Manual doesn’t scale to the diversity vision models need.
Domain-randomize broadly within plausible bounds. Centered on calibrated baselines, not uniform-random over impossible ranges.
Plan for mixed synthetic + real training. 60-80% synthetic, 20-40% real-world fine-tuning is a robust default.
Validate on real holdouts continuously. Synthetic-only metrics are unreliable. Track real-world performance from week one.
Build the pipeline as a continuous service, not a one-time data dump. New SKUs, new equipment, new scenarios should flow into synthetic data automatically.

The difference between a synthetic-data program that ships robots and one that produces beautiful renders no robot ever uses comes down to whether physics, semantics, and diversity were treated as first-class concerns alongside rendering. The teams getting that right are running real-time policy training on synthetic data today; the teams that didn’t are still trying to figure out why their photorealistic dataset doesn’t generalize.

The asset preparation work that powers synthetic data is the same work that powers policy training. Get it right once, and both downstream pipelines benefit.