openpilot v0.8.11 · supercombo
// baseline · ~13M params · ResNet + GRU · 6472-d output

The Supercombo
Architecture

A thorough technical breakdown of Comma AI's end-to-end driving model — its design decisions, trade-offs, and what each component is really doing.

00 Pipeline at a Glance

Supercombo is a single unified neural network: it ingests raw camera frames and a handful of scalars, and outputs a complete scene representation in one 20 Hz forward pass on a Snapdragon 845.

Input Stage
YUV Frame Pair
(N,12,128,256)
Two consecutive frames YUV420, reprojected to calibrated frame, channel-packed.
Backbone
ResNet CNN
→ (N,512)
Custom ResNet variant, globally pooled to a flat vector. Shared across all tasks.
Temporal
GRU (512)
+ desire, traffic
Merges visual features with recurrent state and side inputs. Working memory.
Decoding
FC Heads
→ (N,6472)
Independent dense branches per task, concatenated into one flat output.

01 Inputs

Primary — Stacked Frame Tensor

Camera frames are converted to YUV420 (planar) and two consecutive frames are stacked channel-wise, giving the CNN implicit access to optical flow without a separate flow estimator:

# Frame encoding (per frame) → two frames stacked
Y plane: 128×256   # Full luma — high spatial resolution
U plane: 64×128    # Chroma blue — half resolution
V plane: 64×128    # Chroma red — half resolution
# → pre-processed to (N, 12, 128, 256)

Why YUV? Hardware ISPs output YUV natively. Skipping the conversion saves compute on-device. YUV also separates luminance (geometry, edges) from chrominance, which can encourage cleaner feature hierarchies. Why two frames? The difference between frames encodes velocity and moving objects implicitly — no dedicated flow estimator needed.
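The channel-stacking idea can be sketched in a few lines (a minimal illustration; the exact plane-to-channel packing here is assumed, not openpilot's actual layout):

```python
import numpy as np

def pack_frame_pair(prev_yuv, curr_yuv):
    """Stack two consecutive 6-channel frame encodings channel-wise.
    The plane-to-channel packing is assumed for illustration."""
    return np.concatenate([prev_yuv, curr_yuv], axis=0)  # (12, H, W)

prev = np.zeros((6, 128, 256), dtype=np.float32)
curr = np.ones((6, 128, 256), dtype=np.float32)
pair = pack_frame_pair(prev, curr)
assert pair.shape == (12, 128, 256)

# Frame-to-frame differences carry the implicit motion signal the CNN can use:
motion = pair[6:] - pair[:6]
```

Because both frames live in the same tensor, the very first convolution can compute temporal differences, which is where the implicit optical flow comes from.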

Side Inputs (post-CNN, injected into GRU)

desire
One-hot, 8 classes: straight/lane change L-R/turn L-R/keep L-R. Intent conditioning.
(N,8)
traffic_convention
LHT vs RHT one-hot. Single bit that flips all lane geometry assumptions.
(N,2)
recurrent_state
GRU hidden state fed back from previous timestep. Zero-initialized at segment start.
(N,512)
Notable Absences
Ego speed / CAN
Not provided. Must be inferred from optical flow between the two frames.
IMU / GPS
Used for GT label creation only. Not a live model input.
Map / HD map
Pure vision-only. Generalizable but requires stronger feature learning.
Design Note
Injecting desire/traffic after the CNN is deliberate — routing intent is a planning concern, not a low-level visual feature. The CNN handles pure perception; the GRU merges it with intent.
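A minimal sketch of how the two side inputs might be encoded before being concatenated with the CNN embedding (the class names, ordering, and padding are illustrative, not openpilot's actual enum):

```python
import numpy as np

# Illustrative desire classes; the real enum ordering is an assumption here.
DESIRES = ["none", "turn_left", "turn_right", "lane_change_left",
           "lane_change_right", "keep_left", "keep_right", "unused"]

def one_hot(index, size):
    v = np.zeros(size, dtype=np.float32)
    v[index] = 1.0
    return v

def build_side_input(desire, right_hand_traffic=True):
    """Concatenate desire (8) and traffic_convention (2) into the 10-d
    side-input vector injected into the GRU alongside the CNN features."""
    d = one_hot(DESIRES.index(desire), 8)
    t = one_hot(0 if right_hand_traffic else 1, 2)  # [RHT, LHT]
    return np.concatenate([d, t])

side = build_side_input("lane_change_left")
assert side.shape == (10,)
```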

02 CNN Backbone

A custom ResNet variant designed for 128×256 YUV input. It global-average-pools its final feature volume to produce a 512-d scene summary for the GRU — no spatial feature map preserved.

Choice | What Comma Did | Why
Resolution | 128×256 | ~4× smaller than detection models. Enough for road geometry; critical for 20 Hz on a mobile SoC.
Output | Global avg pool → 512-d | Forces the entire scene into a fixed token. Efficient but discards spatial localisation.
Architecture | Custom ResNet | Residual connections for gradient flow through ~13M params. Proven for structured prediction.
Input channels | 12-ch YUV pair | Atypical — first conv layer randomly initialised, no ImageNet transfer possible.

The Calibration Transform

Before the CNN, frames are reprojected into a calibrated coordinate frame — a virtual camera pointing straight ahead, level with the road. This removes pitch/roll variation, so road geometry always appears at a near-constant location in the image. A manual inductive bias that dramatically simplifies what the CNN must learn.
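Under a pure-rotation camera model, this reprojection is a homography H = K·R·K⁻¹. A sketch with illustrative intrinsics and angle conventions (`rotation_homography` and the value of `K` are assumptions, not openpilot's calibration code):

```python
import numpy as np

def rotation_homography(K, pitch, roll):
    """Homography K @ R @ K^-1 mapping device-camera pixels into a virtual
    camera rotated to be level with the road (pure rotation, no translation)."""
    cp, sp = np.cos(pitch), np.sin(pitch)
    cr, sr = np.cos(roll), np.sin(roll)
    Rx = np.array([[1, 0, 0], [0, cp, -sp], [0, sp, cp]])  # pitch correction
    Rz = np.array([[cr, -sr, 0], [sr, cr, 0], [0, 0, 1]])  # roll correction
    return K @ (Rz @ Rx) @ np.linalg.inv(K)

K = np.array([[910.0,   0.0, 128.0],
              [  0.0, 910.0,  64.0],
              [  0.0,   0.0,   1.0]])  # illustrative intrinsics

H = rotation_homography(K, pitch=np.radians(1.5), roll=np.radians(0.5))
# With zero calibration error the transform is the identity:
assert np.allclose(rotation_homography(K, 0.0, 0.0), np.eye(3))
```

Warping each frame through H before the CNN is what keeps road geometry pinned to a near-constant image location.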

Design Insight
Global pooling is a deliberate trade — it makes the model fast and memory-efficient on a phone-class chip, but at the cost of spatial precision. Transformer-based designs recover this by operating on spatial token grids instead.

03 GRU Temporal Module

A single GRU cell with 512 hidden units — the model's entire working memory. CNN embedding plus side inputs are concatenated as the GRU input vector each timestep.

# Conceptual forward pass at time t
cnn_feat = backbone(frame_pair_t)    # (N, 512)
side     = concat(desire, traffic)   # (N, 10)
gru_in   = concat(cnn_feat, side)    # (N, 522)
h_t, out = GRU(gru_in, h_prev)       # h_t: (N, 512) → fed to output heads

Why GRU over LSTM? GRU merges cell and hidden state — ~25% fewer parameters than LSTM of equal width. For sequences measured in seconds on a latency-constrained device, GRU's effective context window is acceptable. The extra LSTM gate complexity isn't justified here.
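The ~25% figure follows directly from the gate counts. A back-of-envelope parameter count for the sizes used here (bias conventions vary by framework; this uses one bias per gate):

```python
def rnn_params(input_size, hidden, gates):
    # Each gate: input weights + recurrent weights + bias
    return gates * (input_size * hidden + hidden * hidden + hidden)

i, h = 522, 512                     # CNN features + side inputs → 512 hidden units
gru  = rnn_params(i, h, gates=3)    # reset, update, candidate
lstm = rnn_params(i, h, gates=4)    # input, forget, cell, output

assert gru / lstm == 0.75           # exactly 25% fewer parameters per cell
```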

State training protocol: First batch of each segment runs forward but doesn't update weights (warmup). State carries over between batches within the same segment (persistence). State zeroed at segment boundaries, matching deployment behaviour (reset).
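The three rules can be sketched as a segment-level training loop (illustrative control flow only; `step_fn` stands in for the real forward/backward pass):

```python
import numpy as np

def train_on_segment(batches, step_fn, hidden_size=512):
    """Illustrative state protocol: zero-init at segment start (reset),
    carry state across batches (persistence), and skip the weight update
    on the first batch (warmup). step_fn(batch, h, update) returns new h."""
    h = np.zeros(hidden_size)                     # reset at segment boundary
    for idx, batch in enumerate(batches):
        warmup = (idx == 0)
        h = step_fn(batch, h, update=not warmup)  # state carries forward
    return h

# Tiny demo: record whether each batch was allowed to update weights.
updates = []
def demo_step(batch, h, update):
    updates.append(update)
    return h + 1

final_h = train_on_segment(["b0", "b1", "b2"], demo_step)
assert updates == [False, True, True]
```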

Key Limitation
All driving history must be compressed into 512 floats. Anything the GRU can't fit is permanently forgotten. This is the single biggest architectural bottleneck of the v0.8 model.

04 Output Heads

The GRU output fans into independent FC branches, one per task, concatenated to (N, 6472). Multi-task learning: gradients from lane-line detection improve the shared representation used for path planning and vice versa.

Output | Dims | Description
path_plan | 4×33×2 | 4 candidate paths × 33 waypoints × (μ, σ). Quadratically spaced to 10s / 192m.
path_prob | 4 | Softmax probability per path hypothesis. Drives winner-takes-all selection.
lane_lines | 4×33×2 | Left/right boundaries + outer edges as (μ, σ) Laplacians at each waypoint.
lane_line_probs | 8 | Presence probability per lane line at t=0s and t=2s.
road_edges | 2×33×2 | Left and right road edges, same format as lane lines.
lead | 2×51 | Lead vehicle MDN (position, velocity, accel) at t=0s and t=2s.
lead_prob | 3 | Lead detection probability at current, +2s, +4s.
pose | 12 | 6-DoF ego-motion for current and next frame. Acts as self-supervision signal.
long_v / long_a | 200 ea. | Longitudinal velocity and acceleration profiles for ACC/AEB planning.
desire_state / meta | 8 + 4 | Predicted maneuver class and engagement metadata.

Probabilistic Outputs

Almost every output is a distribution, not a point estimate. Waypoints as (μ, σ) Laplacian pairs. Wide σ → conservative control; narrow σ → confident control. The uncertainty is live information for the downstream planner.

05 Design Rationale

Why End-to-End?

Traditional ADAS stacks separate perception, tracking, prediction, and planning — each handoff introduces error and latency. Supercombo collapses all of these, allowing gradients to flow end-to-end and the model to develop joint representations serving all tasks simultaneously.

Multi-Hypothesis Paths + Quadratic Time Spacing

The road ahead is genuinely multimodal — at a highway exit you either stay on or exit. A single path averages between modes, producing an impossible trajectory. 4 candidate paths with probabilities lets the model represent ambiguity. The 33 waypoints are spaced quadratically — denser near-term where errors matter most, sparser at 10s where uncertainty is inherently high.
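Assuming the horizon constants stated above (10 s / 192 m, 33 waypoints), the quadratic spacing can be written as:

```python
import numpy as np

# Quadratic spacing: the i-th of 33 waypoints sits at t = 10s * (i/32)^2.
# The 192 m distance scaling mirrors the stated horizon; exact constants assumed.
idx = np.arange(33)
T_IDXS = 10.0  * (idx / 32) ** 2   # seconds
X_IDXS = 192.0 * (idx / 32) ** 2   # metres

assert T_IDXS[0] == 0.0 and T_IDXS[-1] == 10.0
# Near-term resolution is much finer than far-term:
assert (T_IDXS[1] - T_IDXS[0]) < (T_IDXS[-1] - T_IDXS[-2])
```

The first gap is ~10 ms while the last is ~0.6 s, concentrating model capacity where control errors matter most.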

Calibrated Frame

Reprojecting into a calibrated frame removes camera-to-world projection from the learning problem entirely. All predictions are expressed in a coordinate system aligned with the flat road plane — a manually engineered inductive bias that significantly simplifies the function the network must learn.

06 Loss Functions

KL Divergence (distillation baseline)

Penalise the student for diverging from the teacher's distribution. Stable but flawed — it averages across hypotheses, washing out multimodality. Result in experiments: ~20 hours to converge.

Winner-Takes-All Laplacian NLL ✓

At each step, find the winning hypothesis (closest to ground truth) and only apply loss there. Zero gradient for all others. Forces diverse, committed hypotheses instead of collapsing to the mean. Shared by Comma AI — result: ~1 hour to converge (20× faster).

# WTA Laplacian NLL (winner-takes-all)
for each step:
    winner = argmin L1(hypotheses, gt)
    loss = LaplacianNLL(hypotheses[winner])  # zero grad for losers

Why Laplacian over Gaussian? Driving errors have heavy tails — routine frames dominate, but rare frames fail badly. Laplacian NLL is robust to outliers (L1 in the mean) where Gaussian NLL (L2) would over-penalise rare events.
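A numpy sketch of the combined WTA + Laplacian NLL loss (shapes and helper names are illustrative; a real implementation would compute this with autograd so the losing hypotheses receive zero gradient):

```python
import numpy as np

def laplace_nll(mu, log_b, target):
    """Per-waypoint Laplacian NLL: |x - mu| / b + log(2b).
    The L1 term in the mean is what gives robustness to heavy tails."""
    b = np.exp(log_b)
    return np.abs(target - mu) / b + np.log(2 * b)

def wta_loss(mu, log_b, target):
    """mu, log_b: (K, T) hypotheses; target: (T,).
    Winner = closest hypothesis in L1; only the winner is penalised."""
    winner = np.argmin(np.abs(mu - target).sum(axis=1))
    return laplace_nll(mu[winner], log_b[winner], target).mean(), winner

mu = np.array([[0.0, 0.0],   # hypothesis 0: go straight
               [1.0, 2.0]])  # hypothesis 1: drift left
log_b = np.zeros((2, 2))
loss, winner = wta_loss(mu, log_b, np.array([1.1, 1.9]))
assert winner == 1  # second hypothesis is closest to ground truth
```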

KL Divergence
~20h
WTA Laplacian NLL
~1h

07 Training Loop

Batch Construction

Each batch contains sequences from N different one-minute segments (N = batch size). Adjacent frames from the same drive are near-identical and cause overfitting. Empirically: batch size 8 overfit badly; batch size 28 converged well. Cost: ~56 CPU cores to load 28 parallel video streams.

Custom Data Loader

PyTorch's DataLoader doesn't natively support parallel sequence loading across different files. A custom loader was built: each worker owns one segment, fills a shared-memory queue; a background collation process assembles batches. Result: ~150ms latency per batch, ~175ms GPU transfer.

Optimizer

Adam + weight decay (L2=1e-4), LR=1e-3, ReduceLROnPlateau scheduler (factor 0.75, patience 3). Optional gradient clipping. Conservative and well-tested.
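A minimal pure-Python sketch of the plateau-scheduler behaviour described above (simplified; not PyTorch's exact `ReduceLROnPlateau` semantics, which also include a comparison threshold and cooldown):

```python
class ReduceLROnPlateau:
    """Multiply LR by `factor` after `patience` epochs without improvement."""
    def __init__(self, lr=1e-3, factor=0.75, patience=3):
        self.lr, self.factor, self.patience = lr, factor, patience
        self.best, self.bad_epochs = float("inf"), 0

    def step(self, val_loss):
        if val_loss < self.best:
            self.best, self.bad_epochs = val_loss, 0
        else:
            self.bad_epochs += 1
            if self.bad_epochs > self.patience:
                self.lr *= self.factor   # plateau: decay the learning rate
                self.bad_epochs = 0
        return self.lr

sched = ReduceLROnPlateau()
sched.step(1.0)                 # first epoch sets the best loss
for _ in range(4):
    lr = sched.step(1.0)        # four epochs with no improvement
assert abs(lr - 7.5e-4) < 1e-15  # LR decayed once by factor 0.75
```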

openpilot v0.9.0–0.9.8 · 2022–2025
// EfficientNet→FastViT · E2E lateral · 700-bit features · nav inputs

The 0.9 Series
Evolution

How the supercombo architecture was systematically rebuilt across the 0.9 release series — backbone swap, end-to-end planning, new inputs, and training overhaul.

00 v0.9 at a Glance

The 0.9 series spanned three years and introduced the most fundamental changes since the original supercombo. The overall shape (backbone → temporal → heads) remains, but each component was substantially redesigned:

Backbone (new)
FastViT
Hybrid ViT
EfficientNet replaced by a Hybrid Vision Transformer. Spatial reasoning via attention.
Temporal
GRU
~700-bit features
Same GRU structure but fed richer representations. 10× more information content.
New Inputs
Map + NavVec
image + (50,)
Compressed map image and navigation instruction vector as additional side inputs.
Outputs (new)
Direct Curvature
→ single value
Lateral planning moved inside the model. Outputs executable control directly.
The Big Picture
The 0.9 series marks a shift from a perception-and-predict model toward a true end-to-end planner. By 0.9.6, the model directly outputs control actions rather than intermediate representations for a classical MPC to consume.

01 Version Timeline

v0.9.0
Nov 2022
BackboneTraining

Architecture Redesign — 10× Feature Richness

Internal feature space information content increased tenfold to ~700 bits. Less reliance on previous frames (more reactive). Trained in 36 hours from scratch vs. the previous one-week timeline. Introduced Experimental Mode with E2E longitudinal: model can stop for traffic lights and slow for turns without hand-coded logic.

v0.9.4
Jul 2023
New Input

Navigate on openpilot — Map Image Input

When navigation is active, a compressed map image of the route ahead is fed into the model. Encoded via a learned neural compressor (VAE-style), the map provides context for upcoming forks, exits, and turns — context that pure vision can't see around corners.

v0.9.5
Nov 2023
BackbonePlanningNew Input

FastViT + E2E Lateral + Navigation Instructions

Three simultaneous changes: (1) EfficientNet → FastViT (Hybrid Vision Transformer — biggest architecture change since 0.9.0), (2) lateral MPC moved inside the model (direct trajectory output, replacing classical MPC), (3) navigation instruction vector (ternary left/straight/right at 20m resolution, ±500m ahead) added as new input.

v0.9.6
Feb 2024
PlanningVision

Direct Curvature Output (Los Angeles Model)

Model now directly outputs a desired curvature value for lateral control — a single scalar mapping directly to steering. Collapses perception→plan→MPC→curvature to perception→curvature. Simpler interface, prepares for RL fine-tuning.

v0.9.8
Feb 2025
Infra

ISP Pipeline + More GPU Headroom

Image processing pipeline moved to the ISP (dedicated hardware), freeing significant GPU time for the driving model. Power draw reduced by ~0.5 W. Sets up headroom for larger model variants in the 0.10 series.

02 The FastViT Backbone

FastViT (Apple Research, 2023) is a Hybrid Vision Transformer — convolutional stages in early layers handle local feature extraction efficiently; transformer-style attention in later stages enables long-range spatial reasoning. Best of both worlds for mobile deployment.

Property | Custom ResNet (v0.8) | FastViT (v0.9.5+)
Architecture | Pure CNN, residual blocks | Hybrid: conv early + ViT later
Spatial reasoning | Implicit in conv kernels only | Explicit attention over spatial locations
Long-range context | Limited to receptive field | Global attention in later stages
Feature information | ~70 bits (est.) | ~700 bits (10× improvement)
Mobile efficiency | Snapdragon 845 optimised | RepMixer depthwise-conv token mixing for efficiency

The Feature Information Content Jump

The 0.9.0 release reports the internal feature space going from ~70 to ~700 bits of information content. This means the representation at the GRU input encodes 10× more semantically distinct states — the difference between a network that barely knows "road or not" and one that simultaneously understands lane structure, occlusion relationships, and scene geometry.

FastViT's Key Innovation
FastViT uses "RepMixer" — spatial tokens are mixed with reparameterisable depthwise convolutions in the early stages, reserving attention for the later stages and dramatically reducing the cost of the transformer portion. This makes it viable on mobile hardware without sacrificing the spatial reasoning advantages of attention.

03 EfficientNet → FastViT: Why Switch?

Between v0.8.11 and v0.9.5, Comma used EfficientNet before switching to FastViT. EfficientNet is strong — compound scaling optimises width, depth, and resolution simultaneously. Why not keep it?

# EfficientNet: compound scaling, purely convolutional
EfficientNet-B0: 5.3M params  # excellent classification accuracy/cost trade-off
# Inductive bias: local operations only — limited relational reasoning

# FastViT: hybrid convolutional + attention
FastViT-T8: 4.0M params  # similar or smaller, better spatial reasoning
# RepMixer token mixing enables global context at low cost

For driving, the task is fundamentally spatial and relational: where is the lane relative to the car, where is the lead relative to the lane, what is the curve geometry ahead. EfficientNet's purely local inductive bias is weaker at these relational queries than attention-based models. FastViT's hybrid approach retains CNN efficiency while allowing global relationships to emerge in the later stages.

Broader Trend
The switch reflects the field's convergence on CNNs for local feature extraction + attention for global spatial reasoning. The hybrid is the pragmatic compromise for mobile deployment.

04 End-to-End Lateral Planning

This is the conceptually largest change in the 0.9.x series — moving from a model that predicts a path (which a classical MPC converts to steering) to a model that directly outputs executable control.

The Three Stages of Evolution

// Stage 1 · v0.8 era

Predict → MPC → Curvature

Model outputs path waypoints. Classical MPC converts path to trajectory. Kinematic approximations convert trajectory to curvature. Three handoffs, three sources of error and latency.

Latency: high · RL: impossible (non-differentiable)
// Stage 2 · v0.9.5

Predict → Learned MPC → Curvature

The MPC is absorbed into the model (New Lemon Pie). Outputs a smooth, executable lateral trajectory directly. MPC is now learned and differentiable. One fewer external handoff.

Latency: medium · RL: partially possible
// Stage 3 · v0.9.6

Direct Curvature Output

Los Angeles Model: single desired curvature value output directly. The entire pipeline collapses to neural network → one control scalar. Maximally end-to-end for lateral control.

Latency: minimal · RL: fully possible ✓
// Stage 4 · 0.10+

Direct Longitudinal

Same evolution planned for longitudinal control. The 0.10 series begins this transition using the new world model (Tomb Raider) for training supervision.

Status: in progress
Why This Matters
Moving planning inside the model means gradients from actual driving outcomes can flow back through the planner during training. Classical MPC is non-differentiable — you can only train perception. With E2E planning, you can train from a driving reward, which is the prerequisite for RL.

05 New Inputs in v0.9

Map Image (v0.9.4)

When Navigate on openpilot is active, a compressed route map is fed into the model. Encoded by a learned neural compressor (similar to a VAE encoder), then concatenated as an additional side input. This allows the model to "see around corners" — if navigation shows a left turn 400m ahead, the model can begin adjusting lane position before the turn is visually apparent.

Navigation Instructions Vector (v0.9.5)

The map image alone can be ambiguous about timing. A ternary instruction vector is added: for each 20m segment from -500m to +500m ahead, the value is -1 (left), 0 (straight), or 1 (right) — a precise 50-element vector.

# Navigation instruction encoding
nav_instructions: shape (50,)  # -500m to +500m at 20m resolution
# Values: -1=turn left, 0=straight, 1=turn right
# Combined with map image → significant NoO performance improvement
Why Both?
Map image answers "what does the road look like?" (lane count, curve shape, intersection geometry). Instruction vector answers "exactly when and which way?" They're complementary — image for context, vector for precision timing.

06 Training Changes

Reprojective Simulator (v0.9.0)

Training now uses a reprojective simulator — a differentiable renderer synthesizing what the camera would see from a different position/orientation. This expands the training distribution by augmenting real data with counterfactual views, and enables training on lateral behaviour hard to collect from real drives.

Lateral + Longitudinal Simulation

From v0.9.0, training simulates both lateral and longitudinal behaviour simultaneously, allowing the model to learn to slow for curves and stop for traffic lights in Experimental Mode. Previously, pure imitation learning on human drives couldn't teach these behaviours because humans rarely drive at the limit.

Desire Ground-Truthing Stack

A new desire GT pipeline accurately labels when lane changes happen, what type they are, and when they complete. Previously the desire input during training was noisy — the new stack enables dramatically better lane change behaviour.

Anti-Cheating Regularisation

Early simulator training found the model "hugging" lane edges in laneless mode — a degenerate shortcut that doesn't generalise. Anti-cheating regularisation was added to the simulator training to prevent these solutions.

07 v0.8 vs v0.9 Comparison

Dimension | v0.8.11 | v0.9.x (latest)
Backbone | Custom ResNet, global pool | FastViT (Hybrid ViT)
Feature richness | ~70 bits | ~700 bits (10×)
Lateral planning | Waypoints → external MPC | Direct curvature output
Navigation input | None | Map image + instruction vector
Training simulator | Limited | Reprojective + lateral + longitudinal
Experimental mode | No | Yes (stops for lights, slows for turns)
Train time (baseline) | ~1 week | ~36 hours
RL-readiness | Non-differentiable end-to-end | E2E differentiable → RL possible
What Stayed the Same
GRU as temporal module. Probabilistic outputs (Laplacian μ/σ). Calibrated frame reprojection. Multi-task learning across all heads. The core mathematical skeleton of supercombo is preserved — what changed is the quality of representations flowing through it.
Proposed Architecture · Research Synthesis 2024–2026
// BEV · Causal Transformer · Diffusion Planning · World Model Training

The Improved
Architecture

A ground-up redesign synthesizing best ideas from modern autonomous driving research — what supercombo would look like built with current knowledge and no legacy constraints.

00 Vision & Goals

The v0.8 and v0.9 architectures are constrained by 20Hz on a Snapdragon 845. The proposed architecture asks: what is the right architecture given modern hardware and the full weight of 2024–2026 research?

Design Principles
(1) Preserve spatial precision — don't discard location information. (2) Rich selective temporal memory — recall specific events, not a blurry state average. (3) Full joint distribution over futures — not fixed hypotheses. (4) Everything differentiable end-to-end for RL.
Problem | Current (v0.9) | Fix
Spatial compression | Global pool → flat 512-d token | Spatial BEV token grid
Temporal memory | GRU: 512-float state | Causal transformer over N past BEV frames
Multi-modal futures | 4 fixed hypotheses + WTA | Flow matching over full trajectory distribution
Sensor fusion | Vision + map only | Camera + map + speed + IMU + CAN
Scene geometry | Calibrated front-view frame | Lifted perspective → BEV (LSS)
Training signal | Imitation + simulator | World model + RL fine-tuning

01 Architecture Overview

Five stages: multi-input encoding, BEV lifting, causal temporal attention, query-based decoding, and trajectory generation.

Input · Keep
Camera Frames
YUV420, 2 frames, (N,12,H,W).
Input · New
Ego Speed + IMU
Velocity, yaw rate, 3-axis accel → sensor tokens.
Input · New
CAN State
Steering angle, brake, throttle.
Backbone · Improved
FastViT-L — Spatial Tokens (no global pool)
Outputs 2D feature map (H/8 × W/8 × C). Each token = spatial image region. Spatial precision preserved.
Sensor Encoder · New
MLP Encoder
Speed + IMU + CAN → conditioning tokens appended to image tokens.
Representation · New
Perspective-to-BEV Lifting (LSS / BEVDet)
Predict depth distribution per pixel. Splat image features onto a BEV grid via outer product of depth distribution × image features. Produces explicit top-down scene representation (e.g. 200×200 @ 0.5m = 100×100m).
Temporal Module · New
Causal Transformer (N=16 past BEV frames, KV-cache)
Attention over sliding window of 16 past BEV snapshots. Each head can attend to any past frame. RoPE positional encoding. KV-cache for O(1) amortised inference per step.
Decoding · Improved
Query-Based Decoder (DETR-style)
Learnable queries per task. Each query cross-attends to BEV tokens. No fixed-index concatenation.
Path Output · New
Flow Matching Head
Conditional flow matching → sample full trajectory distribution at inference.

02 Backbone

Keep FastViT (it's a good choice) but remove the global average pool. Output the full spatial token grid at 1/8 input resolution — for 128×256 input, this is 16×32 = 512 tokens, each a 256-d vector.

# v0.9 (current): global pool discards spatial info
spatial_map = FastViT(frames)             # (N, 512, 8, 16)
feat_vector = GlobalAvgPool(spatial_map)  # (N, 512) — all spatial info gone

# Improved: preserve spatial tokens
spatial_tokens = FastViT(frames)  # (N, 256, 16, 32) — 512 spatial tokens
# Each token represents ~7.5° × 6° patch. Lanes still localisable.
// Benefit 1

Lane Line Precision

With spatial tokens, the lane line head can attend to specific pixel regions rather than reasoning from a scene-level average. Sub-pixel accuracy becomes achievable.

↑ Lane accuracy · ↑ Curve entry prediction
// Benefit 2

Lead Detection at Range

Lead vehicle detection benefits enormously from spatial locality. Spatial tokens let detection heads directly attend to where the vehicle is in the image — no need for the GRU to "remember" object locations.

↑ Detection range · ↑ Cut-in scenarios

03 BEV Representation

Instead of a calibrated front-view frame, explicitly lift image features into a Bird's Eye View grid using Lift-Splat-Shoot (LSS). Geometric relationships become explicit in the representation rather than implied.

# Lift-Splat-Shoot (simplified)
for each image pixel (u, v):
    depth_dist = DepthNet(features[u, v])          # predict depth distribution
    feat_3d = outer_product(depth_dist, img_feat)  # (D_bins, C)
    bev_grid += voxel_pool(feat_3d)                # project to 3D → flatten to BEV
# Result: (200, 200, C) BEV map at 0.5m/cell = 100m × 100m
Property | Calibrated Front-View (current) | BEV (improved)
Geometric accuracy | Projective — distances distorted | Metric — distances accurate
Multi-camera fusion | Complex per-camera handling | Trivial — sum BEV grids
Object geometry | Requires projection | Direct in BEV space
Occlusion handling | Camera-frame dependent | Occluded cells naturally empty
Compute overhead | None (manual transform) | DepthNet + voxel pooling (~15–20%)
Multi-Camera Upside
Once in BEV, adding a second camera (wide-angle, rear) is trivial — lift it to the same BEV grid and sum the features. Generalises naturally to comma 3X's multiple cameras and future hardware without any architectural changes.

04 Temporal Module

Replace the GRU with a causal transformer over a sliding window of N=16 past BEV frames (at 5Hz = 3.2 seconds of history). KV-cache makes streaming inference O(1) amortised per new frame.

# Causal transformer temporal module
bev_history = [bev_t-15, ..., bev_t-1, bev_t]  # 16 BEV snapshots
tokens = flatten(bev_history)                  # (N, 16*S, C), S = spatial tokens/frame
out = CausalTransformer(tokens)                # attend past → present
# KV-cache: new step only computes Q for current frame; K, V cached for past
# → amortised O(1) per step, matching GRU's inference cost
Property | GRU (current) | Causal Transformer (improved)
Memory capacity | 512 floats | 16 × S × C tokens (~100× more)
Recall mechanism | Lossy compression into state | Direct attention to any past frame
Near-miss memory | May vanish from GRU state | Explicitly accessible via attention
Interpretability | Opaque state vector | Attention weights = what was recalled
Inference cost | O(1) per step | O(1) amortised with KV-cache
Training parallelism | Sequential (BPTT through time) | Fully parallel over time axis
Engineering Note
KV-cache is critical. Without it, attending over 16 BEV frames is expensive per step. With KV-cache, each new step only computes Q for the current frame; K and V for all past frames are cached and reused. This keeps latency bounded regardless of window length.
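The caching pattern can be shown concretely with a toy single-head attention (numpy, illustrative sizes; a real implementation would also evict entries beyond the 16-frame window):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

class KVCache:
    """Streaming causal attention: each step computes Q/K/V only for the
    new frame's tokens; K and V of past frames are reused from the cache."""
    def __init__(self):
        self.K, self.V = None, None

    def step(self, q, k, v):
        self.K = k if self.K is None else np.vstack([self.K, k])
        self.V = v if self.V is None else np.vstack([self.V, v])
        d = q.shape[-1]
        attn = softmax(q @ self.K.T / np.sqrt(d))  # new queries attend to all cached keys
        return attn @ self.V

cache = KVCache()
rng = np.random.default_rng(0)
for _ in range(16):               # 16 streaming BEV frames
    x = rng.normal(size=(4, 32))  # 4 tokens/frame, 32-d (toy sizes)
    out = cache.step(x, x, x)     # self-attention with q = k = v for brevity
assert cache.K.shape == (64, 32)  # 16 frames × 4 tokens cached
```

Per step, only the 4 new tokens are projected; the 60 cached key/value rows are reused, which is the amortised O(1) claim above.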

05 Output Heads

Query-Based Decoder (DETR-style)

Replace fixed-index FC concatenation with learnable query vectors per task. Each query cross-attends to the BEV representation and decodes its output. More flexible, better spatial localisation, naturally handles variable-cardinality sets (arbitrary number of lane lines or lead vehicles).

# Query-based decoding
path_queries = learnable_embedding(4)   # 4 path hypotheses
lane_queries = learnable_embedding(8)   # flexible lane count
lead_queries = learnable_embedding(16)  # multiple leads simultaneously
for each query set:
    output = CrossAttention(query, bev_tokens)  # learn where to attend in BEV

Flow Matching Trajectory Head

Replace the 4-hypothesis WTA path head with a conditional flow matching model. Given BEV context and temporal representation, it learns a mapping from noise → future trajectory. At inference, sample K trajectories to get a full distribution over futures.
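A sketch of few-step sampling from a conditional flow matching head (the velocity field here is a toy stand-in for the learned network; all names and shapes are assumptions):

```python
import numpy as np

def sample_trajectories(velocity_field, context, n_samples=8, n_points=33,
                        n_steps=4, rng=None):
    """Few-step Euler integration of a learned velocity field,
    transporting Gaussian noise to trajectory samples: dx/dt = v(x, t | context)."""
    rng = rng or np.random.default_rng()
    x = rng.normal(size=(n_samples, n_points, 2))  # noise → (x, y) waypoints
    dt = 1.0 / n_steps
    for i in range(n_steps):
        x = x + dt * velocity_field(x, i * dt, context)
    return x

# Toy stand-in for the learned field: pull every sample toward one target path.
def toy_field(x, t, context):
    return context[None, :, :] - x

target = np.stack([np.linspace(0, 192, 33), np.zeros(33)], axis=1)
trajs = sample_trajectories(toy_field, target, rng=np.random.default_rng(0))
assert trajs.shape == (8, 33, 2)  # 8 sampled futures, each a full trajectory
```

With a real learned field, repeated sampling yields a set of trajectories whose spread reflects the model's genuine uncertainty over futures.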

// Advantage 1

Continuous Multimodality

WTA with 4 hypotheses discretises the future into 4 modes. Flow matching produces a continuous distribution — all futures reachable, weighted by probability. Handles rare scenarios naturally.

↑ Rare scenario coverage · ↑ Calibrated uncertainty
// Advantage 2

Temporal Correlation

WTA generates waypoints semi-independently. Diffusion/FM over the full trajectory respects temporal correlations — if turning left at 2s, still turning at 3s. Produces physically plausible trajectories.

↑ Trajectory smoothness · ↑ Physical plausibility
// Trade-off

Inference Cost

Flow matching requires multiple denoising steps at inference. With DDIM-style sampling or few-step flow matching, 1–4 steps are sufficient for a distilled model. ~3–5× more compute than an FC head — manageable on modern hardware.

Distill to few-step model for on-device use
// Retain

WTA Laplacian NLL Still Useful

For non-path outputs (lane lines, leads, pose), the WTA Laplacian NLL approach from v0.8 is still excellent and should be retained. Only replace the path head with flow matching.

Keep what works ✓

06 Training Strategy

Stage 1: Pretraining on Large-Scale Video

Pretrain the BEV backbone and temporal transformer on large-scale unlabelled driving video using a self-supervised objective — predict the next BEV frame or reconstruct masked tokens. Builds general road scene understanding without requiring GT labels.

Stage 2: Multi-Task Imitation Learning

Fine-tune on labelled data with the full multi-task loss: flow matching for paths, WTA Laplacian NLL for lanes/leads/pose. This is the current approach, now applied to richer representations.

Stage 3: World Model Fine-Tuning

Train a world model (video prediction conditioned on ego actions) alongside the driving model. Use it to score trajectories: trajectories that predict future frames matching real-world outcomes are rewarded. Enables learning from near-misses and unusual events without manual labelling. This is the direction of openpilot 0.10's Tomb Raider world model.

Stage 4: RL from Simulator

Fine-tune with RL in a high-fidelity simulator (MetaDrive or CARLA), using rewards based on comfort, progress, collision avoidance, and traffic rule compliance. The fully E2E differentiable architecture enables direct policy gradient through the trajectory head — this is only possible because planning is inside the model.

The Key Enabler
All four stages are only possible because the architecture is fully end-to-end differentiable — from pixel to control action. Every gradient signal can flow through every component. This is the fundamental prerequisite the entire training pyramid depends on.

07 All Three Models Compared

Component | v0.8.11 | v0.9.x | Improved
Backbone | Custom ResNet | FastViT (Hybrid) | EfficientViT / FastViT-L (no global pool)
Spatial representation | Global avg pool → 512-d | Partial spatial tokens | Full BEV lifting (LSS)
Temporal module | GRU, 512-d state | GRU, 512-d (richer input) | Causal transformer + KV-cache
Sensor fusion | Vision only | Vision + map | Vision + map + speed + IMU + CAN
Path output | 4 hypotheses + WTA | 4 hypotheses + direct curvature | Flow matching (full distribution)
Output head API | FC branches, flat concat | FC branches, flat concat | DETR-style query decoder
Lateral control | Waypoints → MPC | Direct curvature ✓ | Direct curvature (retained)
Training | Distillation only | Imitation + simulator | Pretrain → imitation → world model → RL
Multi-camera | No | No | Native (BEV fusion)
RL-ready | No | Partial (lateral only) | Yes — fully E2E differentiable
Interpretability | Low | Low (+ attention in backbone) | Medium (BEV grid + attention maps)
What Stays the Same Across All Three
Probabilistic outputs with Laplacian (μ, σ). Multi-task learning across all heads. WTA loss principle for non-path outputs. The calibrated frame intuition lives on in BEV. The mathematical skeleton of supercombo is preserved — what changes is the fidelity of geometric representation and the richness of temporal memory flowing through it.