The Supercombo
Architecture
A thorough technical breakdown of Comma AI's end-to-end driving model — its design decisions, trade-offs, and what each component is really doing.
00 Pipeline at a Glance
Supercombo is a single unified neural network: it ingests raw camera frames and a handful of scalars, and outputs a complete scene representation in a single forward pass, running at 20 Hz on a Snapdragon 845.
01 Inputs
Primary — Stacked Frame Tensor
Camera frames are converted to YUV420 (planar) and two consecutive frames are stacked channel-wise, giving the CNN implicit access to optical flow without a separate flow estimator:
Why YUV? Hardware ISPs output YUV natively. Skipping the conversion saves compute on-device. YUV also separates luminance (geometry, edges) from chrominance, which can encourage cleaner feature hierarchies. Why two frames? The difference between frames encodes velocity and moving objects implicitly — no dedicated flow estimator needed.
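A minimal sketch of the frame stacking described above. The exact packing is an assumption modelled on openpilot's convention: the full-resolution Y plane is split into its four even/odd pixel subplanes so everything lands at half resolution, giving six channels per frame and twelve for the stacked pair.

```python
import numpy as np

def yuv420_to_6ch(y, u, v):
    """Pack one planar YUV420 frame into 6 half-resolution channels:
    the full-res Y plane split into its 4 even/odd subplanes, plus the
    natively half-res U and V planes."""
    return np.stack([
        y[0::2, 0::2], y[0::2, 1::2],  # Y sub-planes
        y[1::2, 0::2], y[1::2, 1::2],
        u, v,                          # chroma, already half-res
    ])

H, W = 128, 256  # model input resolution
prev = [np.zeros((H, W), np.uint8),
        np.zeros((H // 2, W // 2), np.uint8),
        np.zeros((H // 2, W // 2), np.uint8)]
curr = [np.ones((H, W), np.uint8),
        np.ones((H // 2, W // 2), np.uint8),
        np.ones((H // 2, W // 2), np.uint8)]

# Two consecutive frames stacked channel-wise -> a 12-channel tensor.
x = np.concatenate([yuv420_to_6ch(*prev), yuv420_to_6ch(*curr)])
```

The frame difference that encodes motion is implicit: channels 0–5 and 6–11 hold the same scene a twentieth of a second apart, and the first conv layer can learn to subtract them.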
Side Inputs (post-CNN, injected into GRU)
02 CNN Backbone
A custom ResNet variant designed for 128×256 YUV input. It global-average-pools its final feature volume to produce a 512-d scene summary for the GRU — no spatial feature map preserved.
| Choice | What Comma Did | Why |
|---|---|---|
| Resolution | 128×256 | ~4× smaller than detection models. Enough for road geometry; critical for 20 Hz on mobile SoC. |
| Output | Global avg pool → 512-d | Forces entire scene into a fixed token. Efficient but discards spatial localisation. |
| Architecture | Custom ResNet | Residual connections for gradient flow through ~13M params. Proven for structured prediction. |
| Input channels | 12-ch YUV pair | Atypical — first conv layer randomly initialised, no ImageNet transfer possible. |
The Calibration Transform
Before the CNN, frames are reprojected into a calibrated coordinate frame — a virtual camera pointing straight ahead, level with the road. This removes pitch/roll variation, so road geometry always appears at a near-constant location in the image. A manual inductive bias that dramatically simplifies what the CNN must learn.
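The reprojection above can be written as a planar homography H = K·R·K⁻¹, where R undoes the measured mounting pitch/roll. The intrinsics below are hypothetical placeholder values, and this is a sketch of the geometry only, not openpilot's actual transform code.

```python
import numpy as np

# Hypothetical pinhole intrinsics for the (virtual) camera.
K = np.array([[910.0,   0.0, 128.0],
              [  0.0, 910.0,  64.0],
              [  0.0,   0.0,   1.0]])

def calib_homography(pitch, roll, yaw=0.0):
    """Homography mapping device-camera pixels into a virtual camera
    pointing straight ahead and level with the road: H = K @ R @ K^-1."""
    cp, sp = np.cos(pitch), np.sin(pitch)
    cr, sr = np.cos(roll), np.sin(roll)
    cy, sy = np.cos(yaw), np.sin(yaw)
    Rx = np.array([[1, 0, 0], [0, cp, -sp], [0, sp, cp]])   # pitch
    Rz = np.array([[cr, -sr, 0], [sr, cr, 0], [0, 0, 1]])   # roll
    Ry = np.array([[cy, 0, sy], [0, 1, 0], [-sy, 0, cy]])   # yaw
    R = Rz @ Rx @ Ry
    return K @ R @ np.linalg.inv(K)

# Warp matrix for a camera mounted 0.02 rad nose-down, 0.01 rad rolled.
H_calib = calib_homography(pitch=0.02, roll=-0.01)
```

Because the warp is applied before the CNN, the network never has to learn invariance to mounting angle; a perfectly mounted camera yields the identity homography.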
03 GRU Temporal Module
A single GRU cell with 512 hidden units — the model's entire working memory. CNN embedding plus side inputs are concatenated as the GRU input vector each timestep.
Why GRU over LSTM? GRU merges cell and hidden state — ~25% fewer parameters than LSTM of equal width. For sequences measured in seconds on a latency-constrained device, GRU's effective context window is acceptable. The extra LSTM gate complexity isn't justified here.
State training protocol: First batch of each segment runs forward but doesn't update weights (warmup). State carries over between batches within the same segment (persistence). State zeroed at segment boundaries, matching deployment behaviour (reset).
04 Output Heads
The GRU output fans out into independent FC branches, one per task, whose outputs are concatenated into a single (N, 6472) vector. Multi-task learning: gradients from lane-line detection improve the shared representation used for path planning, and vice versa.
Probabilistic Outputs
Almost every output is a distribution, not a point estimate. Waypoints as (μ, σ) Laplacian pairs. Wide σ → conservative control; narrow σ → confident control. The uncertainty is live information for the downstream planner.
05 Design Rationale
Why End-to-End?
Traditional ADAS stacks separate perception, tracking, prediction, and planning — each handoff introduces error and latency. Supercombo collapses all of these, allowing gradients to flow end-to-end and the model to develop joint representations serving all tasks simultaneously.
Multi-Hypothesis Paths + Quadratic Time Spacing
The road ahead is genuinely multimodal — at a highway exit you either stay on or exit. A single path averages between modes, producing an impossible trajectory. 4 candidate paths with probabilities lets the model represent ambiguity. The 33 waypoints are spaced quadratically — denser near-term where errors matter most, sparser at 10s where uncertainty is inherently high.
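The quadratic spacing is easy to state concretely. The exact index-to-time mapping below (t = T·(i/(N−1))²) is an assumption consistent with the description; openpilot's actual constants may differ slightly.

```python
import numpy as np

N_WAYPOINTS, HORIZON_S = 33, 10.0

# Quadratic spacing: waypoint i sits at T * (i / (N-1))^2 seconds, so
# early waypoints are fractions of a second apart and late ones seconds
# apart.
i = np.arange(N_WAYPOINTS)
t = HORIZON_S * (i / (N_WAYPOINTS - 1)) ** 2

gaps = np.diff(t)  # strictly growing: dense near-term, sparse at 10 s
```

The first gap is about 10 ms and the last about 0.6 s, concentrating modelling capacity where control errors matter most.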
Calibrated Frame
Reprojecting into a calibrated frame removes camera-to-world projection from the learning problem entirely. All predictions are expressed in a coordinate system aligned with the flat road plane — a manually engineered inductive bias that significantly simplifies the function the network must learn.
06 Loss Functions
KL Divergence (distillation baseline)
Penalise the student for diverging from the teacher's distribution. Stable but flawed — it averages across hypotheses, washing out multimodality. Result in experiments: ~20 hours to converge.
Winner-Takes-All Laplacian NLL ✓
At each step, find the winning hypothesis (the one closest to ground truth) and apply the loss only there; all other hypotheses receive zero gradient. This forces diverse, committed hypotheses instead of collapse to the mean. This is the approach Comma AI shared — result: ~1 hour to converge (~20× faster).
Why Laplacian over Gaussian? Driving errors have heavy tails — routine frames dominate, but rare frames fail badly. Laplacian NLL is robust to outliers (L1 in the mean) where Gaussian NLL (L2) would over-penalise rare events.
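A runnable sketch of the WTA Laplacian NLL described above, in numpy. The head layout (K hypotheses of μ, scale b, plus hypothesis probabilities, and a cross-entropy term on the winner's probability) is an assumption consistent with the article, not a verbatim copy of Comma's loss code.

```python
import numpy as np

def laplace_nll(mu, b, target):
    """Per-coordinate Laplace negative log-likelihood, summed over the
    trajectory: |x - mu| / b + log(2b). L1 in the mean, so heavy-tail
    robust compared with a Gaussian's L2."""
    return np.sum(np.abs(target - mu) / b + np.log(2 * b), axis=-1)

def wta_laplace_nll(mus, bs, probs, target):
    """Winner-takes-all: only the hypothesis closest to ground truth
    gets the regression loss; a cross-entropy term still trains the
    hypothesis probabilities."""
    nll = laplace_nll(mus, bs, target)      # (K,), one value per hypothesis
    winner = np.argmin(np.sum(np.abs(mus - target), axis=-1))
    return nll[winner] - np.log(probs[winner])

K, N = 4, 33                                # 4 hypotheses, 33 waypoints
rng = np.random.default_rng(0)
mus = rng.normal(0, 1, (K, N))
bs = np.full((K, N), 0.5)
probs = np.full(K, 0.25)
target = mus[2] + 0.01                      # hypothesis 2 is the winner
loss = wta_laplace_nll(mus, bs, probs, target)
```

Because the losing hypotheses see zero gradient, nothing pulls them toward the mean; each is free to specialise on a different mode of the future.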
07 Training Loop
Batch Construction
Each batch contains sequences from N different one-minute segments (N = batch size). Adjacent frames from the same drive are near-identical and cause overfitting. Empirically: batch size 8 overfit badly; batch size 28 converged well. Cost: ~56 CPU cores to load 28 parallel video streams.
Custom Data Loader
PyTorch's DataLoader doesn't natively support parallel sequence loading across different files. A custom loader was built: each worker owns one segment, fills a shared-memory queue; a background collation process assembles batches. Result: ~150ms latency per batch, ~175ms GPU transfer.
Optimizer
Adam + weight decay (L2=1e-4), LR=1e-3, ReduceLROnPlateau scheduler (factor 0.75, patience 3). Optional gradient clipping. Conservative and well-tested.
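The recipe above translates almost directly into a PyTorch configuration fragment. The `Linear` module is a stand-in for the real network; the hyperparameters are the ones stated in the text.

```python
import torch

model = torch.nn.Linear(512, 6472)  # stand-in for the real network

# Adam with L2 weight decay 1e-4, LR 1e-3, and ReduceLROnPlateau
# backing the LR off by 0.75 after 3 epochs without improvement.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, factor=0.75, patience=3)

# Per step: loss.backward(); optional clipping; optimizer.step()
#   torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
# Per epoch: scheduler.step(val_loss)
```

Nothing exotic: the interesting engineering in this training loop is the data pipeline, not the optimizer.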
The 0.9 Series
Evolution
How the supercombo architecture was systematically rebuilt across the 0.9 release series — backbone swap, end-to-end planning, new inputs, and training overhaul.
00 v0.9 at a Glance
The 0.9 series spanned three years and introduced the most fundamental changes since the original supercombo. The overall shape (backbone → temporal → heads) remains, but each component was substantially redesigned:
01 Version Timeline
Nov 2022
Architecture Redesign — 10× Feature Richness
Internal feature space information content increased tenfold to ~700 bits. Less reliance on previous frames (more reactive). Trained in 36 hours from scratch vs. the previous one-week timeline. Introduced Experimental Mode with E2E longitudinal: model can stop for traffic lights and slow for turns without hand-coded logic.
Jul 2023
Navigate on openpilot — Map Image Input
When navigation is active, a compressed map image of the route ahead is fed into the model. Encoded via a learned neural compressor (VAE-style), the map provides context for upcoming forks, exits, and turns — context that pure vision can't see around corners.
Nov 2023
FastViT + E2E Lateral + Navigation Instructions
Three simultaneous changes: (1) EfficientNet → FastViT (Hybrid Vision Transformer — biggest architecture change since 0.9.0), (2) lateral MPC moved inside the model (direct trajectory output, replacing classical MPC), (3) navigation instruction vector (ternary left/straight/right at 20m resolution, ±500m ahead) added as new input.
Feb 2024
Direct Curvature Output (Los Angeles Model)
Model now directly outputs a desired curvature value for lateral control — a single scalar mapping directly to steering. Collapses perception→plan→MPC→curvature to perception→curvature. Simpler interface, prepares for RL fine-tuning.
Feb 2025
ISP Pipeline + More GPU Headroom
Image processing pipeline moved to the ISP (dedicated hardware), freeing significant GPU time for the driving model. Power draw reduced by 0.5 W. Sets up headroom for larger model variants in the 0.10 series.
02 The FastViT Backbone
FastViT (Apple Research, 2023) is a Hybrid Vision Transformer — convolutional stages in early layers handle local feature extraction efficiently; transformer-style attention in later stages enables long-range spatial reasoning. Best of both worlds for mobile deployment.
| Property | Custom ResNet (v0.8) | FastViT (v0.9.5+) |
|---|---|---|
| Architecture | Pure CNN, residual blocks | Hybrid: Conv early + ViT later |
| Spatial reasoning | Implicit in conv kernels only | Explicit attention over spatial locations |
| Long-range context | Limited to receptive field | Global attention in later stages |
| Feature information | ~70 bits (est.) | ~700 bits (10× improvement) |
| Mobile efficiency | Snapdragon 845 optimised | Reparameterized depthwise convs for efficiency |
The Feature Information Content Jump
The 0.9.0 release reports the internal feature space going from ~70 to ~700 bits of information content. This means the representation at the GRU input encodes 10× more semantically distinct states — the difference between a network that barely knows "road or not" and one that simultaneously understands lane structure, occlusion relationships, and scene geometry.
03 EfficientNet → FastViT: Why Switch?
Between v0.8.11 and v0.9.5, Comma used EfficientNet before switching to FastViT. EfficientNet is strong — compound scaling optimises width, depth, and resolution simultaneously. Why not keep it?
For driving, the task is fundamentally spatial and relational: where is the lane relative to the car, where is the lead relative to the lane, what is the curve geometry ahead. EfficientNet's purely local inductive bias is weaker at these relational queries than attention-based models. FastViT's hybrid approach retains CNN efficiency while allowing global relationships to emerge in the later stages.
04 End-to-End Lateral Planning
This is the conceptually largest change in the 0.9.x series — moving from a model that predicts a path (which a classical MPC converts to steering) to a model that directly outputs executable control.
The Three Stages of Evolution
Predict → MPC → Curvature
Model outputs path waypoints. Classical MPC converts path to trajectory. Kinematic approximations convert trajectory to curvature. Three handoffs, three sources of error and latency.
Predict → Learned MPC → Curvature
The MPC is absorbed into the model (New Lemon Pie). Outputs a smooth, executable lateral trajectory directly. MPC is now learned and differentiable. One fewer external handoff.
Direct Curvature Output
Los Angeles Model: single desired curvature value output directly. The entire pipeline collapses to neural network → one control scalar. Maximally end-to-end for lateral control.
Direct Longitudinal
Same evolution planned for longitudinal control. The 0.10 series begins this transition using the new world model (Tomb Raider) for training supervision.
05 New Inputs in v0.9
Map Image (v0.9.4)
When Navigate on openpilot is active, a compressed route map is fed into the model. Encoded by a learned neural compressor (similar to a VAE encoder), then concatenated as an additional side input. This allows the model to "see around corners" — if navigation shows a left turn 400m ahead, the model can begin adjusting lane position before the turn is visually apparent.
Navigation Instructions Vector (v0.9.5)
The map image alone can be ambiguous about timing. A ternary instruction vector is added: for each 20 m segment from 500 m behind to 500 m ahead, the value is -1 (left), 0 (straight), or 1 (right) — a precise 50-element vector.
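A sketch of how such a vector might be constructed. The function name, the maneuver-list input format, and the sign convention are illustrative assumptions; only the shape (50 ternary bins at 20 m resolution over ±500 m) follows the article.

```python
import numpy as np

STEP_M, RANGE_M = 20, 500        # 20 m bins covering ±500 m
N_BINS = 2 * RANGE_M // STEP_M   # -> 50 elements

def instruction_vector(maneuvers):
    """maneuvers: list of (distance_m, direction) pairs, direction in
    {-1: left, 0: straight, +1: right}. Distance is signed along the
    route, negative behind the car."""
    vec = np.zeros(N_BINS, dtype=np.int8)        # default: straight
    for dist, direction in maneuvers:
        idx = int((dist + RANGE_M) // STEP_M)    # bin the signed distance
        if 0 <= idx < N_BINS:
            vec[idx] = direction
    return vec

# A left turn 400 m ahead, a right turn just completed 60 m back.
v = instruction_vector([(400, -1), (-60, +1)])
```

Keeping completed maneuvers in the rear half of the vector tells the model a turn is done, resolving exactly the timing ambiguity the map image leaves open.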
06 Training Changes
Reprojective Simulator (v0.9.0)
Training now uses a reprojective simulator — a differentiable renderer synthesizing what the camera would see from a different position/orientation. This expands the training distribution by augmenting real data with counterfactual views, and enables training on lateral behaviour hard to collect from real drives.
Lateral + Longitudinal Simulation
From v0.9.0, training simulates both lateral and longitudinal behaviour simultaneously, allowing the model to learn to slow for curves and stop for traffic lights in Experimental Mode. Previously, pure imitation learning on human drives couldn't teach these behaviours because humans rarely drive at the limit.
Desire Ground-Truthing Stack
A new desire GT pipeline accurately labels when lane changes happen, what type they are, and when they complete. Previously the desire input during training was noisy — the new stack enables dramatically better lane change behaviour.
Anti-Cheating Regularisation
Early simulator training found the model "hugging" lane edges in laneless mode — a degenerate shortcut that doesn't generalise. Anti-cheating regularisation was added to the simulator training to prevent these solutions.
07 v0.8 vs v0.9 Comparison
| Dimension | v0.8.11 | v0.9.x (latest) |
|---|---|---|
| Backbone | Custom ResNet, global pool | FastViT (Hybrid ViT) |
| Feature richness | ~70 bits | ~700 bits (10×) |
| Lateral planning | Waypoints → external MPC | Direct curvature output |
| Navigation input | None | Map image + instruction vector |
| Training simulator | Limited | Reprojective + lateral + longitudinal |
| Experimental mode | No | Yes (stops for lights, slows for turns) |
| Train time (baseline) | ~1 week | ~36 hours |
| RL-readiness | Non-differentiable end-to-end | E2E differentiable → RL possible |
The Improved
Architecture
A ground-up redesign synthesizing best ideas from modern autonomous driving research — what supercombo would look like built with current knowledge and no legacy constraints.
00 Vision & Goals
The v0.8 and v0.9 architectures are constrained by the need to run at 20 Hz on a Snapdragon 845. The proposed architecture asks: what is the right design given modern hardware and the full weight of 2024–2026 research?
| Problem | Current (v0.9) | Fix |
|---|---|---|
| Spatial compression | Global pool → flat 512-d token | Spatial BEV token grid |
| Temporal memory | GRU: 512-float state | Causal transformer over N past BEV frames |
| Multi-modal futures | 4 fixed hypotheses + WTA | Flow matching over full trajectory distribution |
| Sensor fusion | Vision + map only | Camera + map + speed + IMU + CAN |
| Scene geometry | Calibrated front-view frame | Lifted perspective → BEV (LSS) |
| Training signal | Imitation + simulator | World model + RL fine-tuning |
01 Architecture Overview
Five stages: multi-input encoding, BEV lifting, causal temporal attention, query-based decoding, and trajectory generation.
02 Backbone
Keep FastViT (it's a good choice) but remove the global average pool. Output the full spatial token grid at 1/8 input resolution — for 128×256 input, this is 16×32 = 512 tokens, each a 256-d vector.
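The shape arithmetic above, made concrete. The 256-d token width is the proposal's own figure; the reshape below just shows how a stride-8 feature map becomes a token sequence instead of a single pooled vector.

```python
import numpy as np

H, W, C, STRIDE = 128, 256, 256, 8   # input size, token dim, 1/8 stride

# A stride-8 backbone turns the 128x256 image into a 16x32 feature grid.
grid_h, grid_w = H // STRIDE, W // STRIDE
feature_map = np.zeros((C, grid_h, grid_w))          # (256, 16, 32)

# Sequence form for attention / query decoding: 512 tokens of 256-d,
# instead of one global-average-pooled 512-d scene vector.
tokens = feature_map.reshape(C, grid_h * grid_w).T   # (512, 256)
```

Every downstream head can now attend to the specific tokens covering its region of interest rather than querying a scene-level average.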
Lane Line Precision
With spatial tokens, the lane line head can attend to specific pixel regions rather than reasoning from a scene-level average. Sub-pixel accuracy becomes achievable.
Lead Detection at Range
Lead vehicle detection benefits enormously from spatial locality. Spatial tokens let detection heads directly attend to where the vehicle is in the image — no need for the GRU to "remember" object locations.
03 BEV Representation
Instead of a calibrated front-view frame, explicitly lift image features into a Bird's Eye View grid using Lift-Splat-Shoot (LSS). Geometric relationships become explicit in the representation rather than implied.
| Property | Calibrated Front-View (current) | BEV (improved) |
|---|---|---|
| Geometric accuracy | Projective — distances distorted | Metric — distances accurate |
| Multi-camera fusion | Complex per-camera handling | Trivial — sum BEV grids |
| Object geometry | Requires projection | Direct in BEV space |
| Occlusion handling | Camera-frame dependent | Occluded cells naturally empty |
| Compute overhead | None (manual transform) | DepthNet + voxel pooling (~15-20%) |
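The lift-splat step can be sketched in a few lines under simplifying assumptions: toy sizes, random stand-in features, and precomputed pixel-to-cell geometry instead of a real camera model. The structure — outer product of features with a per-pixel depth distribution, then sum-pooling into BEV cells — is the core of LSS.

```python
import numpy as np

D, HW, C = 8, 6, 4           # depth bins, image tokens, feature channels
BEV = np.zeros((10, 10, C))  # toy BEV grid

rng = np.random.default_rng(0)
feats = rng.normal(size=(HW, C))                  # per-pixel features
depth_logits = rng.normal(size=(HW, D))
depth_prob = np.exp(depth_logits)
depth_prob /= depth_prob.sum(-1, keepdims=True)   # DepthNet: softmax over bins

# Precomputed geometry: which BEV cell each (pixel, depth-bin) lands in.
cell_x = rng.integers(0, 10, (HW, D))
cell_y = rng.integers(0, 10, (HW, D))

# Lift: outer product of features and the depth distribution.
lifted = depth_prob[..., None] * feats[:, None, :]   # (HW, D, C)

# Splat: sum-pool every lifted feature into its BEV cell.
for p in range(HW):
    for d in range(D):
        BEV[cell_y[p, d], cell_x[p, d]] += lifted[p, d]
```

Because the depth distribution sums to one per pixel, lifting is mass-preserving: a confidently estimated depth concentrates a pixel's feature in one cell, while an uncertain one smears it along the ray.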
04 Temporal Module
Replace the GRU with a causal transformer over a sliding window of N=16 past BEV frames (at 5Hz = 3.2 seconds of history). KV-cache makes streaming inference O(1) amortised per new frame.
| Property | GRU (current) | Causal Transformer (improved) |
|---|---|---|
| Memory capacity | 512 floats | 16 × S × C tokens (~100× more) |
| Recall mechanism | Lossy compression into state | Direct attention to any past frame |
| Near-miss memory | May vanish from GRU state | Explicitly accessible via attention |
| Interpretability | Opaque state vector | Attention weights = what was recalled |
| Inference cost | O(1) per step | O(1) amortised with KV-cache |
| Training parallelism | Sequential (BPTT) | Fully parallel over time axis |
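A minimal single-head sketch of the sliding-window KV-cache, with one token per frame for brevity (the real design would cache the full S×C token grid per frame). Sizes and names are illustrative.

```python
import numpy as np
from collections import deque

D = 16                           # toy token dimension
WINDOW = 16                      # frames of history (3.2 s at 5 Hz)
kv_cache = deque(maxlen=WINDOW)  # old frames fall off automatically

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attend(query, new_k, new_v):
    """Attention of the current frame's query over cached keys/values of
    the last WINDOW frames. Only the new frame's K/V are computed each
    step, so streaming inference is O(1) amortised per frame."""
    kv_cache.append((new_k, new_v))
    K = np.stack([k for k, _ in kv_cache])
    V = np.stack([v for _, v in kv_cache])
    w = softmax(K @ query / np.sqrt(D))   # scaled dot-product weights
    return w @ V

rng = np.random.default_rng(0)
for _ in range(40):  # stream 40 frames; the cache never exceeds 16
    out = attend(rng.normal(size=D), rng.normal(size=D), rng.normal(size=D))
```

The attention weights `w` are also the interpretability hook from the table: they say exactly which past frame the model recalled.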
05 Output Heads
Query-Based Decoder (DETR-style)
Replace fixed-index FC concatenation with learnable query vectors per task. Each query cross-attends to the BEV representation and decodes its output. More flexible, better spatial localisation, naturally handles variable-cardinality sets (arbitrary number of lane lines or lead vehicles).
Flow Matching Trajectory Head
Replace the 4-hypothesis WTA path head with a conditional flow matching model. Given BEV context and temporal representation, it learns a mapping from noise → future trajectory. At inference, sample K trajectories to get a full distribution over futures.
Continuous Multimodality
WTA with 4 hypotheses discretises the future into 4 modes. Flow matching produces a continuous distribution — all futures reachable, weighted by probability. Handles rare scenarios naturally.
Temporal Correlation
WTA generates waypoints semi-independently. Diffusion/FM over the full trajectory respects temporal correlations — if turning left at 2s, still turning at 3s. Produces physically plausible trajectories.
Inference Cost
Flow matching requires multiple integration steps at inference. With DDIM-style samplers or few-step FM, 1–4 steps are sufficient for a distilled model. Roughly 3–5× more compute than an FC head — manageable on modern hardware.
WTA Laplacian NLL Still Useful
For non-path outputs (lane lines, leads, pose), the WTA Laplacian NLL approach from v0.8 is still excellent and should be retained. Only replace the path head with flow matching.
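A runnable sketch of few-step flow-matching sampling. The velocity network here is faked with a closed-form rectified-flow field toward a fixed "target" trajectory so the code runs; in the real head it would be a learned network conditioned on the BEV and temporal representations.

```python
import numpy as np

N_WAYPOINTS, N_STEPS, K = 33, 4, 8   # few-step sampler, K sampled futures

def velocity_field(x, t, context):
    """Stand-in for the learned conditional velocity network. For a
    rectified flow on straight noise->data paths the ideal field is
    v = (target - x) / (1 - t); here we pretend the context IS the
    target trajectory so the sketch is self-contained."""
    return (context - x) / max(1.0 - t, 1e-3)

def sample_trajectory(context, rng):
    """Euler-integrate the flow from Gaussian noise (t=0) to a
    trajectory (t=1) in N_STEPS steps."""
    x = rng.normal(size=N_WAYPOINTS)
    for s in range(N_STEPS):
        t = s / N_STEPS
        x = x + velocity_field(x, t, context) / N_STEPS
    return x

rng = np.random.default_rng(0)
context = np.linspace(0, 50, N_WAYPOINTS)   # hypothetical conditioning
futures = np.stack([sample_trajectory(context, rng) for _ in range(K)])
```

With this single-mode fake field every sample lands on the same trajectory; a learned field conditioned on real scene context is exactly what produces the diverse, probability-weighted distribution over futures described above.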
06 Training Strategy
Stage 1: Pretraining on Large-Scale Video
Pretrain the BEV backbone and temporal transformer on large-scale unlabelled driving video using a self-supervised objective — predict the next BEV frame or reconstruct masked tokens. Builds general road scene understanding without requiring GT labels.
Stage 2: Multi-Task Imitation Learning
Fine-tune on labelled data with the full multi-task loss: flow matching for paths, WTA Laplacian NLL for lanes/leads/pose. This is the current approach, now applied to richer representations.
Stage 3: World Model Fine-Tuning
Train a world model (video prediction conditioned on ego actions) alongside the driving model. Use it to score trajectories: trajectories that predict future frames matching real-world outcomes are rewarded. Enables learning from near-misses and unusual events without manual labelling. This is the direction of openpilot 0.10's Tomb Raider world model.
Stage 4: RL from Simulator
Fine-tune with RL in a high-fidelity simulator (MetaDrive or CARLA), using rewards based on comfort, progress, collision avoidance, and traffic rule compliance. The fully E2E differentiable architecture enables direct policy gradient through the trajectory head — this is only possible because planning is inside the model.
07 All Three Models Compared
| Component | v0.8.11 | v0.9.x | Improved |
|---|---|---|---|
| Backbone | Custom ResNet | FastViT (Hybrid) | EfficientViT / FastViT-L (no global pool) |
| Spatial representation | Global avg pool → 512-d | Partial spatial tokens | Full BEV lifting (LSS) |
| Temporal module | GRU, 512-d state | GRU, 512-d (richer input) | Causal transformer + KV-cache |
| Sensor fusion | Vision only | Vision + map | Vision + map + speed + IMU + CAN |
| Path output | 4 hypotheses + WTA | 4 hypotheses + direct curvature | Flow matching (full distribution) |
| Output head API | FC branches, flat concat | FC branches, flat concat | DETR-style query decoder |
| Lateral control | Waypoints → MPC | Direct curvature ✓ | Direct curvature (retained) |
| Training | Distillation only | Imitation + simulator | Pretrain → imitation → world model → RL |
| Multi-camera | No | No | Native (BEV fusion) |
| RL-ready | No | Partial (lateral only) | Yes — fully E2E differentiable |
| Interpretability | Low | Low (+ attention in backbone) | Medium (BEV grid + attention maps) |