Synthetic data generation for computer vision in robotics
Real-world labeled robotics data is expensive and slow to capture. Synthetic data is fast and unlimited — but only useful if the underlying simulation has correct physics, semantic labels, and domain randomization. Here's the complete pipeline.
Real-world labeled data for robotics is expensive. Capturing 100,000 images of a robot manipulating warehouse SKUs, with bounding boxes and grasp poses, takes weeks of operator time and produces a dataset that’s obsolete the moment a new SKU enters the catalog. Synthetic data — generated from simulation with automatic labels — promises unlimited scale and instant relabeling.
The catch: synthetic data is only as good as the simulation that produces it. Wrong physics, wrong semantics, or wrong asset diversity, and the resulting model fails on real hardware in ways that look mysterious until you trace them back to the synthetic training set.
This guide covers the complete pipeline: what assets and metadata you need, how to use NVIDIA Replicator and similar tools, the role of domain randomization, and how to validate that your synthetic data actually transfers.
What “synthetic data” means in robotics
A typical synthetic robotics dataset includes:
- RGB images — rendered from simulator cameras, with photorealistic lighting and materials
- Depth maps — pixel-perfect ground truth, free from real-sensor noise
- Segmentation masks — instance and semantic, automatically generated from object metadata
- Bounding boxes — 2D and 3D, with object class labels
- Keypoints — graspable points, joint positions, attachment landmarks
- 6-DOF poses — object positions and orientations in camera/world frames
- Physics annotations — mass, friction, contact forces (for tasks where physics matters)
The structural advantage over real data: every label is automatic and exact. No human-in-the-loop annotation, no boundary errors on segmentation masks, no measurement uncertainty on poses.
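To make the list above concrete, here is a minimal sketch of how one labeled frame might be represented in code. The schema, field names, and file names are hypothetical, chosen only for illustration; they are not the output format of Replicator or any other tool.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class ObjectAnnotation:
    """Labels for one object instance in one frame (hypothetical schema)."""
    instance_id: int
    class_name: str
    bbox_2d: Tuple[int, int, int, int]  # x_min, y_min, x_max, y_max in pixels
    position_m: Tuple[float, float, float]  # translation in the camera frame
    orientation_wxyz: Tuple[float, float, float, float]  # unit quaternion
    keypoints_px: Dict[str, Tuple[float, float]] = field(default_factory=dict)


@dataclass
class FrameSample:
    """One rendered frame: paths to image-like outputs plus per-object labels."""
    rgb_path: str
    depth_path: str
    instance_mask_path: str
    objects: List[ObjectAnnotation] = field(default_factory=list)

    def class_counts(self) -> Dict[str, int]:
        """Count labeled instances per class, useful for later balancing."""
        counts: Dict[str, int] = {}
        for obj in self.objects:
            counts[obj.class_name] = counts.get(obj.class_name, 0) + 1
        return counts


identity = (1.0, 0.0, 0.0, 0.0)  # no rotation
sample = FrameSample(
    rgb_path="frame_000001_rgb.png",
    depth_path="frame_000001_depth.npy",
    instance_mask_path="frame_000001_instance.png",
    objects=[
        ObjectAnnotation(1, "carton", (120, 80, 340, 260), (0.10, 0.02, 0.85), identity),
        ObjectAnnotation(2, "carton", (400, 90, 610, 270), (0.35, 0.02, 0.90), identity),
        ObjectAnnotation(3, "tote", (50, 300, 500, 700), (0.05, 0.30, 1.10), identity,
                         keypoints_px={"handle_left": (70.0, 320.0)}),
    ],
)
print(sample.class_counts())
```

Keeping every label for a frame in one record makes later curation steps, such as class balancing, a matter of iterating over samples.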
What physics has to do with computer vision
A common assumption: “synthetic data for vision is purely about rendering.” Half-true. For pure object-detection tasks (find the box in the image), rendering quality dominates. For manipulation tasks (grasp the box, place it on the shelf), the underlying physics determines whether the data is useful even when the rendering is identical.
Why: the value of a synthetic dataset for manipulation isn’t just the labels; it’s the distribution of object configurations it captures. A simulation with wrong physics produces unrealistic configurations: objects float, stacks balance impossibly, and objects respond to robot contact in ways they never would on real hardware. Vision models trained on those configurations learn incorrect priors about how the world works.
Concrete example: a vision model trained on synthetic stacking data where the physics produces too-stable stacks (over-friction) learns to expect rigid contacts. Deployed in reality, it under-predicts stack instability and the controller it informs makes overconfident grasp decisions, producing drops.
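A toy calculation can illustrate how a friction error shifts the distribution of configurations. The sketch below uses the textbook Coulomb condition for a block resting on a tilted surface (it stays put when tan(tilt) <= mu) and is not a stacking simulator. The friction values 0.4 and 1.0 and the 0-45 degree tilt range are assumptions picked for illustration.

```python
import math
import random


def fraction_resting(mu: float, tilts_deg: list) -> float:
    """Share of tilt angles at which a block stays put under Coulomb friction."""
    resting = sum(math.tan(math.radians(t)) <= mu for t in tilts_deg)
    return resting / len(tilts_deg)


rng = random.Random(0)
tilts = [rng.uniform(0.0, 45.0) for _ in range(10_000)]

calibrated = fraction_resting(0.4, tilts)     # assumed realistic friction
over_friction = fraction_resting(1.0, tilts)  # assumed simulator value, too high

print(f"resting with mu=0.4: {calibrated:.2f}")
print(f"resting with mu=1.0: {over_friction:.2f}")
```

With the higher coefficient every sampled tilt counts as stable, so a dataset generated this way would contain resting poses that the real objects would have slid out of.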
The robust pattern: photorealistic rendering AND physics-accurate simulation. Both matter; neither alone is enough.
Asset preparation requirements
For synthetic data generation, every asset needs everything required for policy training plus a few extras:
- Visual fidelity — PBR materials, high-resolution textures, normal maps
- Physics — mass, friction, collision meshes (same as for training)
- Semantic labels — SemanticsAPI with class label and optionally instance ID
- Keypoints (task-specific) — annotated graspable points, attachment points
- Pose canonical frame — well-defined “up” and “front” for orientation labels
The good news: 90% of this is the same SimReady asset preparation that policy training requires. A SimReady asset is also a synthetic-data-ready asset.
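One way to enforce the checklist above is a small gate that runs before an asset enters the library. This is a sketch under assumptions: the dictionary keys are hypothetical names for the requirements listed, and a real pipeline would read these values from the USD asset rather than from a plain dict.

```python
# Hypothetical metadata keys, one per requirement in the checklist.
REQUIRED_KEYS = (
    "pbr_material",
    "collision_mesh",
    "mass_kg",
    "friction",
    "semantic_class",
    "canonical_frame",
)


def missing_metadata(asset: dict) -> list:
    """Return the requirements an asset fails; an empty list means it is ready."""
    problems = [key for key in REQUIRED_KEYS if asset.get(key) in (None, "")]
    # Treat non-positive physics values as unset defaults (an assumption).
    for key in ("mass_kg", "friction"):
        value = asset.get(key)
        if isinstance(value, (int, float)) and value <= 0:
            problems.append(f"{key} (non-positive)")
    return problems


carton = {
    "pbr_material": "cardboard.mdl",
    "collision_mesh": "carton_hull.usd",
    "mass_kg": 0.2,
    "friction": 0.0,  # left at a default value
    "semantic_class": "carton",
}
print(missing_metadata(carton))
```

Rejecting assets at this point is cheaper than discovering missing labels or default physics after a dataset has been rendered.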
The synthetic data pipeline
The end-to-end pipeline runs through five stages:
Stage 1: Scene composition
Generate diverse scenes by procedurally combining assets. Variables:
- Object selection — random subset from your asset library
- Object placement — random positions within physically plausible regions
- Object orientation — random poses, with physics simulation to settle into stable configurations
- Background — random environments, lighting setups, ground textures
The goal is diversity, not realism per se. A vision model trained on 100 distinct scenes overfits to those specific configurations; a model trained on 100,000 procedurally varied scenes generalizes.
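The selection and placement steps can be sketched with simple rejection sampling. This only proposes initial poses; settling objects into stable configurations would still be done by the physics simulator. The bin size, minimum gap, and asset names are assumed values for illustration.

```python
import math
import random


def compose_scene(asset_names, rng, n_objects=5, bin_size_m=(0.6, 0.4),
                  min_gap_m=0.08, max_tries=200):
    """Pick a random subset of assets and place them without overlapping centers."""
    chosen = rng.sample(asset_names, k=min(n_objects, len(asset_names)))
    placed = []
    for name in chosen:
        for _ in range(max_tries):
            x = rng.uniform(0.0, bin_size_m[0])
            y = rng.uniform(0.0, bin_size_m[1])
            # Reject candidates that land too close to an already placed object.
            if all(math.hypot(x - p["x"], y - p["y"]) >= min_gap_m for p in placed):
                placed.append({"asset": name, "x": x, "y": y,
                               "yaw_deg": rng.uniform(0.0, 360.0)})
                break
    return placed


rng = random.Random(42)
library = [f"sku_{i:03d}" for i in range(50)]  # hypothetical asset names
scene = compose_scene(library, rng)
for obj in scene:
    print(obj)
```

Calling `compose_scene` with a fresh seed per scene is what turns a fixed asset library into many distinct layouts.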
Stage 2: Domain randomization
Randomize visual parameters per scene:
- Lighting — direction, intensity, color temperature, number of lights
- Camera — position, FOV, exposure, focal length, motion blur
- Material parameters — roughness, metallic, base color (within plausible ranges)
- Backgrounds — random environments, HDRIs, textures
- Object jitter — small position/orientation perturbations from rest poses
The breadth of randomization matters. Narrow randomization produces models that fail on small distribution shifts; wide randomization produces robust models — but only if the underlying physics keeps configurations realistic.
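As a rough sketch, per-scene visual randomization amounts to drawing each parameter from a bounded range. The parameter names and ranges below are assumptions for illustration, not recommended values; in practice the sampled values would be handed to the renderer's own randomization API.

```python
import random

# Assumed ranges; tune to what is plausible for the target deployment.
VISUAL_RANGES = {
    "light_intensity": (500.0, 3000.0),
    "light_color_temp_k": (3000.0, 6500.0),
    "camera_fov_deg": (50.0, 90.0),
    "camera_height_m": (0.8, 1.6),
    "material_roughness": (0.2, 0.9),
}


def sample_visual_params(rng: random.Random) -> dict:
    """Draw one set of lighting, camera, and material parameters for a scene."""
    params = {name: rng.uniform(lo, hi) for name, (lo, hi) in VISUAL_RANGES.items()}
    params["num_lights"] = rng.randint(1, 4)
    return params


rng = random.Random(7)
scene_params = [sample_visual_params(rng) for _ in range(3)]
for p in scene_params:
    print({k: round(v, 2) for k, v in p.items()})
```

Keeping the ranges in one table makes it easy to widen or narrow them later and to record which ranges produced a given dataset.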
Stage 3: Rendering
Generate the actual RGB images plus all auxiliary outputs (depth, masks, etc.).
Major tools:
- NVIDIA Isaac Sim Replicator — Omniverse-native, USD-driven, parallel rendering on RTX GPUs. The reference implementation for robotics synthetic data.
- NVIDIA Omniverse Replicator — same engine, accessible from Omniverse runtime
- Unity Perception — Unity-based, popular for non-NVIDIA workflows
- Custom rendering pipelines — for teams with very specific requirements
Rendering throughput on a single RTX 6000-class GPU: ~10-50 images/sec at 1080p with full label generation. A 1M-image dataset takes 6-30 hours depending on scene complexity and label types.
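The dataset-time estimate follows directly from the throughput range. A quick check of the arithmetic, using only the figures quoted above:

```python
def render_hours(n_images: int, images_per_sec: float) -> float:
    """Wall-clock hours to render n_images at a steady throughput."""
    return n_images / images_per_sec / 3600.0


fast = render_hours(1_000_000, 50)  # upper end of the quoted throughput
slow = render_hours(1_000_000, 10)  # lower end of the quoted throughput
print(f"{fast:.1f} to {slow:.1f} hours")  # about 5.6 to 27.8 hours
```

This assumes a single GPU rendering continuously; splitting the job across several GPUs divides the wall-clock time accordingly.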
Stage 4: Label generation
Auxiliary outputs are generated alongside the RGB. Replicator handles this automatically given the asset metadata:
- Instance segmentation — pixel-perfect, derived from per-prim instance IDs
- Semantic segmentation — derived from SemanticsAPI labels
- Depth — exact, from the rendering Z-buffer
- 2D bounding boxes — derived from instance masks
- 3D bounding boxes — from object USD prim transforms
- 6-DOF poses — same source
The labels are exact by construction. This is the structural advantage of synthetic data: no human annotators introducing errors, no edge cases mishandled.
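To show what "derived from instance masks" means for 2D boxes, here is a minimal pure-Python sketch. It assumes the mask is a 2D grid of integer instance IDs with 0 as background; the renderer does this step for you, so this is only an illustration of the idea.

```python
def bboxes_from_instance_mask(mask):
    """Return {instance_id: (x_min, y_min, x_max, y_max)} in pixel coordinates."""
    boxes = {}
    for y, row in enumerate(mask):
        for x, inst in enumerate(row):
            if inst == 0:  # background
                continue
            if inst not in boxes:
                boxes[inst] = [x, y, x, y]
            else:
                box = boxes[inst]
                box[0] = min(box[0], x)
                box[1] = min(box[1], y)
                box[2] = max(box[2], x)
                box[3] = max(box[3], y)
    return {inst: tuple(box) for inst, box in boxes.items()}


mask = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 0, 0],
    [0, 1, 1, 0, 2],
    [0, 0, 0, 0, 2],
]
print(bboxes_from_instance_mask(mask))  # {1: (1, 1, 2, 2), 2: (4, 2, 4, 3)}
```

Because the box is computed from the same pixels the renderer assigned to the object, it cannot drift from the mask the way a hand-drawn box can.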
Stage 5: Dataset curation
Raw output from rendering needs filtering:
- Reject failed configurations — physics drops where objects ended up out-of-frame or in unrealistic poses
- Balance class distributions — ensure rare classes aren’t underrepresented
- Sanity-check physics outcomes — exclude scenes with obvious physics artifacts
- Verify label correctness — automated checks that masks align with rendered objects
Curated datasets are what feed model training. Skipping curation produces models that learn the simulator’s bugs.
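A minimal sketch of the first two curation checks, assuming each sample is a dict with a class name, a 2D box, and a flag for whether the physics settled. The keys, image size, and area threshold are hypothetical. Capping only trims over-represented classes; under-represented classes need more generation, not filtering.

```python
from collections import Counter


def is_valid(sample, width=1920, height=1080, min_area_px=400):
    """Reject unsettled scenes, out-of-frame boxes, and barely visible objects."""
    x0, y0, x1, y1 = sample["bbox"]
    in_frame = 0 <= x0 < x1 <= width and 0 <= y0 < y1 <= height
    big_enough = (x1 - x0) * (y1 - y0) >= min_area_px
    return sample["settled"] and in_frame and big_enough


def curate(samples, cap_per_class):
    """Keep valid samples, with at most cap_per_class examples of each class."""
    kept, counts = [], Counter()
    for sample in samples:
        if is_valid(sample) and counts[sample["class"]] < cap_per_class:
            kept.append(sample)
            counts[sample["class"]] += 1
    return kept


samples = [
    {"class": "carton", "bbox": (100, 100, 300, 300), "settled": True},
    {"class": "carton", "bbox": (200, 150, 420, 380), "settled": True},
    {"class": "carton", "bbox": (500, 500, 700, 650), "settled": True},
    {"class": "tote", "bbox": (1800, 900, 2100, 1200), "settled": True},  # out of frame
    {"class": "tote", "bbox": (50, 50, 60, 60), "settled": True},         # too small
    {"class": "tote", "bbox": (600, 200, 900, 500), "settled": False},    # never settled
    {"class": "tote", "bbox": (600, 200, 900, 500), "settled": True},
]
curated = curate(samples, cap_per_class=2)
print(Counter(s["class"] for s in curated))
```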
Domain randomization done right
The literature on domain randomization is rich, but two principles dominate:
1. Randomize broadly within physically plausible bounds. Sampling uniformly over impossible values (e.g., gravity = 100 m/s²) produces worse models. Sampling broadly within plausible bounds (e.g., gravity = 9.5-10.0 m/s²) generalizes better than training on fixed values.
2. Center on calibrated values, not random ones. If your asset’s mass should be 200g, randomize within 180-220g. Don’t randomize uniformly between 0 and 1000g — that’s training on impossible objects.
The data structure that makes this work: per-asset calibrated physics + per-scene variance bands. AI-driven asset preparation gets you the calibrated values; the rendering pipeline handles the variance.
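The "calibrated value plus variance band" structure can be sketched as two small tables and one sampling function. The mass figure matches the 200g example above (a 10% band gives 180-220g); the friction value and its band are assumptions added for illustration.

```python
import random


def sample_physics(calibrated: dict, rel_band: dict, rng: random.Random) -> dict:
    """Sample each property uniformly within +/- rel_band of its calibrated value."""
    return {
        name: rng.uniform(value * (1.0 - rel_band[name]), value * (1.0 + rel_band[name]))
        for name, value in calibrated.items()
    }


calibrated = {"mass_kg": 0.200, "friction": 0.5}  # per-asset calibrated values
bands = {"mass_kg": 0.10, "friction": 0.20}       # per-scene relative variance bands

rng = random.Random(3)
draws = [sample_physics(calibrated, bands, rng) for _ in range(1000)]
print(min(d["mass_kg"] for d in draws), max(d["mass_kg"] for d in draws))
```

Separating the two tables keeps calibration (a property of the asset) apart from randomization width (a property of the experiment), so either can change without touching the other.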
Common pitfalls in synthetic-data programs
Patterns that consistently produce failed transfer:
- Photorealism over physics. Spending months perfecting rendering while leaving physics at default values. Real-world failures show up as manipulation drops, traceable to the synthetic training distribution being physically unrealistic.
- Narrow object diversity. 200 objects with 1,000 variations each fails to generalize as well as 10,000 distinct objects with no variations. Diversity dominates.
- Skipping semantic labels. Adding semantic data later is harder than annotating during asset preparation. Capture labels as part of SimReady generation.
- Default physics + photorealistic rendering. The worst combination: looks great, fails in transfer because the physics is silently wrong.
- Real-data ablation skipped. Synthetic-only training works for some tasks but rarely manipulation. Always plan for 20-40% real-world data fine-tuning.
- Rendering for one camera setup. Real robots have varying camera configurations. Randomize FOV, position, focal length, and exposure across the dataset.
Validation: how to know your synthetic data works
Standard validation patterns:
- Holdout-real test set. Train on synthetic, evaluate on a small real-world labeled set. Track gap over training duration.
- Per-class performance. Synthetic biases often show up as class-imbalanced failures. Watch class-by-class metrics, not just aggregates.
- Sim-to-real ablation study. Train models with varying synthetic/real ratios. Identify the inflection point where synthetic ratio stops adding value (typically 60-80% synthetic).
- Domain randomization sweep. Train models with progressively wider DR ranges. Performance peaks at “as wide as physically plausible,” collapses outside that.
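The per-class check can be as simple as subtracting real-holdout accuracy from synthetic-validation accuracy for each class. The accuracy numbers below are made-up placeholders that only show the shape of the computation, not measured results, and the 0.10 threshold is an arbitrary assumption.

```python
def per_class_gap(synthetic_acc: dict, real_acc: dict) -> dict:
    """Sim-to-real gap per class: synthetic-validation minus real-holdout accuracy."""
    return {c: synthetic_acc[c] - real_acc[c] for c in synthetic_acc if c in real_acc}


def worst_classes(gaps: dict, threshold: float = 0.10) -> list:
    """Classes whose gap exceeds the threshold, largest gap first."""
    flagged = [c for c, gap in gaps.items() if gap > threshold]
    return sorted(flagged, key=lambda c: -gaps[c])


# Placeholder numbers for illustration only.
synthetic_acc = {"carton": 0.95, "tote": 0.93, "polybag": 0.90}
real_acc = {"carton": 0.91, "tote": 0.88, "polybag": 0.62}

gaps = per_class_gap(synthetic_acc, real_acc)
print(worst_classes(gaps))
```

A class that looks fine in aggregate metrics but shows a large gap here is a hint that its synthetic assets, materials, or physics need attention.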
How automated asset prep helps
The asset-preparation bottleneck dominates the time-to-first-synthetic-dataset for most teams. Manual SimReady authoring at 4 hours per asset means a 10,000-asset library takes 40,000 engineer-hours. A synthetic data program that waits for that work does not produce its first dataset until 12-18 months after kickoff.
AI-automated SimReady generation (covered in How to create SimReady assets without manual modeling) reduces this to ~5 minutes per asset. A 10,000-asset library is then about 830 hours of processing, or ~3 days when roughly a dozen jobs run in parallel. The downstream synthetic-data program starts on week one rather than month twelve.
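The comparison is easy to check from the per-asset figures quoted above (4 hours manual, ~5 minutes automated). The worker count of 12 is an assumption; it is roughly what a turnaround of about 3 days implies.

```python
def library_hours(n_assets: int, minutes_per_asset: float) -> float:
    """Total processing hours for a library at a fixed per-asset time."""
    return n_assets * minutes_per_asset / 60.0


manual_hours = library_hours(10_000, 4 * 60)  # manual authoring, 4 h per asset
auto_hours = library_hours(10_000, 5)         # automated, ~5 min per asset

serial_days = auto_hours / 24.0
parallel_days = serial_days / 12              # assumed 12 parallel workers

print(f"manual: {manual_hours:,.0f} engineer-hours")
print(f"automated: {auto_hours:,.0f} h, {serial_days:.1f} days serial, "
      f"{parallel_days:.1f} days on 12 workers")
```

Run serially, the automated path takes about 35 days, so the turnaround depends on how many assets are processed in parallel.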
The math: if synthetic data unlocks $1M of robotics-pipeline value over a year, the cost difference between waiting 12 months vs starting in week one is most of that $1M. Asset preparation speed is often the rate-limiting step in robotics ROI.
Practical recommendations
If you’re standing up a synthetic-data program for robotics computer vision:
- Author assets with both physics and semantics from day one. Photorealism alone isn’t enough; add labels and physics or rebuild later.
- Use AI-automated SimReady generation for any asset library beyond ~50 hero objects. Manual doesn’t scale to the diversity vision models need.
- Domain-randomize broadly within plausible bounds. Centered on calibrated baselines, not uniform-random over impossible ranges.
- Plan for mixed synthetic + real training. 60-80% synthetic, 20-40% real-world fine-tuning is a robust default.
- Validate on real holdouts continuously. Synthetic-only metrics are unreliable. Track real-world performance from week one.
- Build the pipeline as a continuous service, not a one-time data dump. New SKUs, new equipment, new scenarios should flow into synthetic data automatically.
The difference between a synthetic-data program that ships robots and one that produces beautiful renders no robot ever uses comes down to whether physics, semantics, and diversity were treated as first-class concerns alongside rendering. The teams getting that right are running real-time policy training on synthetic data today; the teams that didn’t are still trying to figure out why their photorealistic dataset doesn’t generalize.
The asset preparation work that powers synthetic data is the same work that powers policy training. Get it right once, and both downstream pipelines benefit.
Frequently asked questions
How much real-world data can synthetic data replace?
For most robotics perception tasks, 60-80% of training data can come from synthetic with proper domain randomization, with 20-40% real data for fine-tuning. NVIDIA's GR00T N1 benchmark showed 40% better real-world performance from combined synthetic + real training compared to real-only training of the same data scale. Pure-synthetic with no real-world fine-tuning works for some tasks (object detection in known environments) but rarely transfers cleanly to manipulation policies without at least some real data.
Why doesn't photorealistic rendering alone close the sim-to-real gap?
Because perception failures from visual mismatch are usually smaller than physics failures from mass and friction mismatch. A model trained on photorealistic synthetic data still fails at manipulation tasks if the underlying physics is wrong — the policy learns to grasp objects with the wrong mass, leading to drops on real hardware regardless of how good the visual transfer was. The robust pattern is photorealistic rendering AND physics-accurate simulation; both are needed, neither alone is enough.
What asset metadata does synthetic data generation actually require?
At minimum: per-prim semantic class labels (via SemanticsAPI in USD), instance IDs for multi-instance scenes, and physics properties (mass, friction, collision) for any object that's manipulated. For specific tasks, add keypoints (graspable points, attachment points) and a well-defined canonical frame for orientation labels. Depth maps, segmentation masks, and 6-DOF pose labels are not authored on the asset; Replicator generates them at render time from this metadata. The asset preparation work is the same as for policy training, because physics and semantics matter for both.