The Supercombo
Architecture
A thorough technical breakdown of Comma AI's end-to-end driving model — its design decisions, trade-offs, and what each component is really doing.
00 Pipeline at a Glance
Supercombo is a single unified neural network: it ingests raw camera frames and a handful of scalars, and outputs a complete scene representation in a single forward pass, running at 20 Hz on a Snapdragon 845.
01 Inputs
Primary — Stacked Frame Tensor
Camera frames are converted to YUV420 (planar) and two consecutive frames are stacked channel-wise, giving the CNN implicit access to optical flow without a separate flow estimator:
Why YUV? Hardware ISPs output YUV natively. Skipping the conversion saves compute on-device. YUV also separates luminance (geometry, edges) from chrominance, which can encourage cleaner feature hierarchies. Why two frames? The difference between frames encodes velocity and moving objects implicitly — no dedicated flow estimator needed.
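A minimal sketch of the frame stacking described above. The exact packing is an assumption modelled on openpilot's convention: the full-resolution Y plane is split into its four even/odd pixel subplanes so everything lands at half resolution, giving six channels per frame and twelve for the stacked pair.

```python
import numpy as np

def yuv420_to_6ch(y, u, v):
    """Pack one planar YUV420 frame into 6 half-resolution channels:
    the full-res Y plane split into its 4 even/odd subplanes, plus the
    natively half-res U and V planes."""
    return np.stack([
        y[0::2, 0::2], y[0::2, 1::2],  # Y sub-planes
        y[1::2, 0::2], y[1::2, 1::2],
        u, v,                          # chroma, already half-res
    ])

H, W = 128, 256  # model input resolution
prev = [np.zeros((H, W), np.uint8),
        np.zeros((H // 2, W // 2), np.uint8),
        np.zeros((H // 2, W // 2), np.uint8)]
curr = [np.ones((H, W), np.uint8),
        np.ones((H // 2, W // 2), np.uint8),
        np.ones((H // 2, W // 2), np.uint8)]

# Two consecutive frames stacked channel-wise -> a 12-channel tensor.
x = np.concatenate([yuv420_to_6ch(*prev), yuv420_to_6ch(*curr)])
```

The frame difference that encodes motion is implicit: channels 0–5 and 6–11 hold the same scene a twentieth of a second apart, and the first conv layer can learn to subtract them.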
Side Inputs (post-CNN, injected into GRU)
02 CNN Backbone
A custom ResNet variant designed for 128×256 YUV input. It global-average-pools its final feature volume to produce a 512-d scene summary for the GRU — no spatial feature map preserved.
| Choice | What Comma Did | Why |
|---|---|---|
| Resolution | 128×256 | ~4× smaller than detection models. Enough for road geometry; critical for 20 Hz on mobile SoC. |
| Output | Global avg pool → 512-d | Forces entire scene into a fixed token. Efficient but discards spatial localisation. |
| Architecture | Custom ResNet | Residual connections for gradient flow through ~13M params. Proven for structured prediction. |
| Input channels | 12-ch YUV pair | Atypical — first conv layer randomly initialised, no ImageNet transfer possible. |
The Calibration Transform
Before the CNN, frames are reprojected into a calibrated coordinate frame — a virtual camera pointing straight ahead, level with the road. This removes pitch/roll variation, so road geometry always appears at a near-constant location in the image. A manual inductive bias that dramatically simplifies what the CNN must learn.
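The reprojection above can be written as a planar homography H = K·R·K⁻¹, where R undoes the measured mounting pitch/roll. The intrinsics below are hypothetical placeholder values, and this is a sketch of the geometry only, not openpilot's actual transform code.

```python
import numpy as np

# Hypothetical pinhole intrinsics for the (virtual) camera.
K = np.array([[910.0,   0.0, 128.0],
              [  0.0, 910.0,  64.0],
              [  0.0,   0.0,   1.0]])

def calib_homography(pitch, roll, yaw=0.0):
    """Homography mapping device-camera pixels into a virtual camera
    pointing straight ahead and level with the road: H = K @ R @ K^-1."""
    cp, sp = np.cos(pitch), np.sin(pitch)
    cr, sr = np.cos(roll), np.sin(roll)
    cy, sy = np.cos(yaw), np.sin(yaw)
    Rx = np.array([[1, 0, 0], [0, cp, -sp], [0, sp, cp]])   # pitch
    Rz = np.array([[cr, -sr, 0], [sr, cr, 0], [0, 0, 1]])   # roll
    Ry = np.array([[cy, 0, sy], [0, 1, 0], [-sy, 0, cy]])   # yaw
    R = Rz @ Rx @ Ry
    return K @ R @ np.linalg.inv(K)

# Warp matrix for a camera mounted 0.02 rad nose-down, 0.01 rad rolled.
H_calib = calib_homography(pitch=0.02, roll=-0.01)
```

Because the warp is applied before the CNN, the network never has to learn invariance to mounting angle; a perfectly mounted camera yields the identity homography.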
03 GRU Temporal Module
A single GRU cell with 512 hidden units — the model's entire working memory. CNN embedding plus side inputs are concatenated as the GRU input vector each timestep.
Why GRU over LSTM? GRU merges cell and hidden state — ~25% fewer parameters than LSTM of equal width. For sequences measured in seconds on a latency-constrained device, GRU's effective context window is acceptable. The extra LSTM gate complexity isn't justified here.
State training protocol: First batch of each segment runs forward but doesn't update weights (warmup). State carries over between batches within the same segment (persistence). State zeroed at segment boundaries, matching deployment behaviour (reset).
04 Output Heads
The GRU output fans out into independent FC branches, one per task, whose outputs are concatenated into a single (N, 6472) vector. Multi-task learning: gradients from lane-line detection improve the shared representation used for path planning, and vice versa.
Probabilistic Outputs
Almost every output is a distribution, not a point estimate. Waypoints as (μ, σ) Laplacian pairs. Wide σ → conservative control; narrow σ → confident control. The uncertainty is live information for the downstream planner.
05 Design Rationale
Why End-to-End?
Traditional ADAS stacks separate perception, tracking, prediction, and planning — each handoff introduces error and latency. Supercombo collapses all of these, allowing gradients to flow end-to-end and the model to develop joint representations serving all tasks simultaneously.
Multi-Hypothesis Paths + Quadratic Time Spacing
The road ahead is genuinely multimodal — at a highway exit you either stay on or exit. A single path averages between modes, producing an impossible trajectory. 4 candidate paths with probabilities lets the model represent ambiguity. The 33 waypoints are spaced quadratically — denser near-term where errors matter most, sparser at 10s where uncertainty is inherently high.
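The quadratic spacing is easy to state concretely. The exact index-to-time mapping below (t = T·(i/(N−1))²) is an assumption consistent with the description; openpilot's actual constants may differ slightly.

```python
import numpy as np

N_WAYPOINTS, HORIZON_S = 33, 10.0

# Quadratic spacing: waypoint i sits at T * (i / (N-1))^2 seconds, so
# early waypoints are fractions of a second apart and late ones seconds
# apart.
i = np.arange(N_WAYPOINTS)
t = HORIZON_S * (i / (N_WAYPOINTS - 1)) ** 2

gaps = np.diff(t)  # strictly growing: dense near-term, sparse at 10 s
```

The first gap is about 10 ms and the last about 0.6 s, concentrating modelling capacity where control errors matter most.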
Calibrated Frame
Reprojecting into a calibrated frame removes camera-to-world projection from the learning problem entirely. All predictions are expressed in a coordinate system aligned with the flat road plane — a manually engineered inductive bias that significantly simplifies the function the network must learn.
06 Loss Functions
KL Divergence (distillation baseline)
Penalise the student for diverging from the teacher's distribution. Stable but flawed — it averages across hypotheses, washing out multimodality. Result in experiments: ~20 hours to converge.
Winner-Takes-All Laplacian NLL ✓
At each step, find the winning hypothesis (the one closest to ground truth) and apply the loss only there; all other hypotheses receive zero gradient. This forces diverse, committed hypotheses instead of collapse to the mean. This is the approach Comma AI shared — result: ~1 hour to converge (~20× faster).
Why Laplacian over Gaussian? Driving errors have heavy tails — routine frames dominate, but rare frames fail badly. Laplacian NLL is robust to outliers (L1 in the mean) where Gaussian NLL (L2) would over-penalise rare events.
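A runnable sketch of the WTA Laplacian NLL described above, in numpy. The head layout (K hypotheses of μ, scale b, plus hypothesis probabilities, and a cross-entropy term on the winner's probability) is an assumption consistent with the article, not a verbatim copy of Comma's loss code.

```python
import numpy as np

def laplace_nll(mu, b, target):
    """Per-coordinate Laplace negative log-likelihood, summed over the
    trajectory: |x - mu| / b + log(2b). L1 in the mean, so heavy-tail
    robust compared with a Gaussian's L2."""
    return np.sum(np.abs(target - mu) / b + np.log(2 * b), axis=-1)

def wta_laplace_nll(mus, bs, probs, target):
    """Winner-takes-all: only the hypothesis closest to ground truth
    gets the regression loss; a cross-entropy term still trains the
    hypothesis probabilities."""
    nll = laplace_nll(mus, bs, target)      # (K,), one value per hypothesis
    winner = np.argmin(np.sum(np.abs(mus - target), axis=-1))
    return nll[winner] - np.log(probs[winner])

K, N = 4, 33                                # 4 hypotheses, 33 waypoints
rng = np.random.default_rng(0)
mus = rng.normal(0, 1, (K, N))
bs = np.full((K, N), 0.5)
probs = np.full(K, 0.25)
target = mus[2] + 0.01                      # hypothesis 2 is the winner
loss = wta_laplace_nll(mus, bs, probs, target)
```

Because the losing hypotheses see zero gradient, nothing pulls them toward the mean; each is free to specialise on a different mode of the future.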
07 Training Loop
Batch Construction
Each batch contains sequences from N different one-minute segments (N = batch size). Adjacent frames from the same drive are near-identical and cause overfitting. Empirically: batch size 8 overfit badly; batch size 28 converged well. Cost: ~56 CPU cores to load 28 parallel video streams.
Custom Data Loader
PyTorch's DataLoader doesn't natively support parallel sequence loading across different files. A custom loader was built: each worker owns one segment, fills a shared-memory queue; a background collation process assembles batches. Result: ~150ms latency per batch, ~175ms GPU transfer.
Optimizer
Adam + weight decay (L2=1e-4), LR=1e-3, ReduceLROnPlateau scheduler (factor 0.75, patience 3). Optional gradient clipping. Conservative and well-tested.
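The recipe above translates almost directly into a PyTorch configuration fragment. The `Linear` module is a stand-in for the real network; the hyperparameters are the ones stated in the text.

```python
import torch

model = torch.nn.Linear(512, 6472)  # stand-in for the real network

# Adam with L2 weight decay 1e-4, LR 1e-3, and ReduceLROnPlateau
# backing the LR off by 0.75 after 3 epochs without improvement.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, factor=0.75, patience=3)

# Per step: loss.backward(); optional clipping; optimizer.step()
#   torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
# Per epoch: scheduler.step(val_loss)
```

Nothing exotic: the interesting engineering in this training loop is the data pipeline, not the optimizer.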
The 0.9 Series
Evolution
How the supercombo architecture was systematically rebuilt across the 0.9 release series — backbone swap, end-to-end planning, new inputs, and training overhaul.
00 v0.9 at a Glance
The 0.9 series spanned three years and introduced the most fundamental changes since the original supercombo. The overall shape (backbone → temporal → heads) remains, but each component was substantially redesigned:
01 Version Timeline
Nov 2022
Architecture Redesign — 10× Feature Richness
Internal feature space information content increased tenfold to ~700 bits. Less reliance on previous frames (more reactive). Trained in 36 hours from scratch vs. the previous one-week timeline. Introduced Experimental Mode with E2E longitudinal: model can stop for traffic lights and slow for turns without hand-coded logic.
Jul 2023
Navigate on openpilot — Map Image Input
When navigation is active, a compressed map image of the route ahead is fed into the model. Encoded via a learned neural compressor (VAE-style), the map provides context for upcoming forks, exits, and turns — context that pure vision can't see around corners.
Nov 2023
FastViT + E2E Lateral + Navigation Instructions
Three simultaneous changes: (1) EfficientNet → FastViT (Hybrid Vision Transformer — biggest architecture change since 0.9.0), (2) lateral MPC moved inside the model (direct trajectory output, replacing classical MPC), (3) navigation instruction vector (ternary left/straight/right at 20m resolution, ±500m ahead) added as new input.
Feb 2024
Direct Curvature Output (Los Angeles Model)
Model now directly outputs a desired curvature value for lateral control — a single scalar mapping directly to steering. Collapses perception→plan→MPC→curvature to perception→curvature. Simpler interface, prepares for RL fine-tuning.
Feb 2025
ISP Pipeline + More GPU Headroom
Image processing pipeline moved to the ISP (dedicated hardware), freeing significant GPU time for the driving model. Power draw reduced by 0.5 W. Sets up headroom for larger model variants in the 0.10 series.
02 The FastViT Backbone
FastViT (Apple Research, 2023) is a Hybrid Vision Transformer — convolutional stages in early layers handle local feature extraction efficiently; transformer-style attention in later stages enables long-range spatial reasoning. Best of both worlds for mobile deployment.
| Property | Custom ResNet (v0.8) | FastViT (v0.9.5+) |
|---|---|---|
| Architecture | Pure CNN, residual blocks | Hybrid: Conv early + ViT later |
| Spatial reasoning | Implicit in conv kernels only | Explicit attention over spatial locations |
| Long-range context | Limited to receptive field | Global attention in later stages |
| Feature information | ~70 bits (est.) | ~700 bits (10× improvement) |
| Mobile efficiency | Snapdragon 845 optimised | Reparameterized depthwise convs for efficiency |
The Feature Information Content Jump
The 0.9.0 release reports the internal feature space going from ~70 to ~700 bits of information content. This means the representation at the GRU input encodes 10× more semantically distinct states — the difference between a network that barely knows "road or not" and one that simultaneously understands lane structure, occlusion relationships, and scene geometry.
03 EfficientNet → FastViT: Why Switch?
Between v0.8.11 and v0.9.5, Comma used EfficientNet before switching to FastViT. EfficientNet is strong — compound scaling optimises width, depth, and resolution simultaneously. Why not keep it?
For driving, the task is fundamentally spatial and relational: where is the lane relative to the car, where is the lead relative to the lane, what is the curve geometry ahead. EfficientNet's purely local inductive bias is weaker at these relational queries than attention-based models. FastViT's hybrid approach retains CNN efficiency while allowing global relationships to emerge in the later stages.
04 End-to-End Lateral Planning
This is the conceptually largest change in the 0.9.x series — moving from a model that predicts a path (which a classical MPC converts to steering) to a model that directly outputs executable control.
The Three Stages of Evolution
Predict → MPC → Curvature
Model outputs path waypoints. Classical MPC converts path to trajectory. Kinematic approximations convert trajectory to curvature. Three handoffs, three sources of error and latency.
Predict → Learned MPC → Curvature
The MPC is absorbed into the model (New Lemon Pie). Outputs a smooth, executable lateral trajectory directly. MPC is now learned and differentiable. One fewer external handoff.
Direct Curvature Output
Los Angeles Model: single desired curvature value output directly. The entire pipeline collapses to neural network → one control scalar. Maximally end-to-end for lateral control.
Direct Longitudinal
Same evolution planned for longitudinal control. The 0.10 series begins this transition using the new world model (Tomb Raider) for training supervision.
05 New Inputs in v0.9
Map Image (v0.9.4)
When Navigate on openpilot is active, a compressed route map is fed into the model. Encoded by a learned neural compressor (similar to a VAE encoder), then concatenated as an additional side input. This allows the model to "see around corners" — if navigation shows a left turn 400m ahead, the model can begin adjusting lane position before the turn is visually apparent.
Navigation Instructions Vector (v0.9.5)
The map image alone can be ambiguous about timing. A ternary instruction vector is added: for each 20 m segment from 500 m behind to 500 m ahead, the value is -1 (left), 0 (straight), or 1 (right) — a precise 50-element vector.
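A sketch of how such a vector might be constructed. The function name, the maneuver-list input format, and the sign convention are illustrative assumptions; only the shape (50 ternary bins at 20 m resolution over ±500 m) follows the article.

```python
import numpy as np

STEP_M, RANGE_M = 20, 500        # 20 m bins covering ±500 m
N_BINS = 2 * RANGE_M // STEP_M   # -> 50 elements

def instruction_vector(maneuvers):
    """maneuvers: list of (distance_m, direction) pairs, direction in
    {-1: left, 0: straight, +1: right}. Distance is signed along the
    route, negative behind the car."""
    vec = np.zeros(N_BINS, dtype=np.int8)        # default: straight
    for dist, direction in maneuvers:
        idx = int((dist + RANGE_M) // STEP_M)    # bin the signed distance
        if 0 <= idx < N_BINS:
            vec[idx] = direction
    return vec

# A left turn 400 m ahead, a right turn just completed 60 m back.
v = instruction_vector([(400, -1), (-60, +1)])
```

Keeping completed maneuvers in the rear half of the vector tells the model a turn is done, resolving exactly the timing ambiguity the map image leaves open.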
06 Training Changes
Reprojective Simulator (v0.9.0)
Training now uses a reprojective simulator — a differentiable renderer synthesizing what the camera would see from a different position/orientation. This expands the training distribution by augmenting real data with counterfactual views, and enables training on lateral behaviour hard to collect from real drives.
Lateral + Longitudinal Simulation
From v0.9.0, training simulates both lateral and longitudinal behaviour simultaneously, allowing the model to learn to slow for curves and stop for traffic lights in Experimental Mode. Previously, pure imitation learning on human drives couldn't teach these behaviours because humans rarely drive at the limit.
Desire Ground-Truthing Stack
A new desire GT pipeline accurately labels when lane changes happen, what type they are, and when they complete. Previously the desire input during training was noisy — the new stack enables dramatically better lane change behaviour.
Anti-Cheating Regularisation
Early simulator training found the model "hugging" lane edges in laneless mode — a degenerate shortcut that doesn't generalise. Anti-cheating regularisation was added to the simulator training to prevent these solutions.
07 v0.8 vs v0.9 Comparison
| Dimension | v0.8.11 | v0.9.x (latest) |
|---|---|---|
| Backbone | Custom ResNet, global pool | FastViT (Hybrid ViT) |
| Feature richness | ~70 bits | ~700 bits (10×) |
| Lateral planning | Waypoints → external MPC | Direct curvature output |
| Navigation input | None | Map image + instruction vector |
| Training simulator | Limited | Reprojective + lateral + longitudinal |
| Experimental mode | No | Yes (stops for lights, slows for turns) |
| Train time (baseline) | ~1 week | ~36 hours |
| RL-readiness | Non-differentiable end-to-end | E2E differentiable → RL possible |
The Improved
Architecture
A ground-up redesign synthesizing best ideas from modern autonomous driving research — what supercombo would look like built with current knowledge and no legacy constraints.
00 Vision & Goals
The v0.8 and v0.9 architectures are constrained by the need to run at 20 Hz on a Snapdragon 845. The proposed architecture asks: what is the right design given modern hardware and the full weight of 2024–2026 research?
| Problem | Current (v0.9) | Fix |
|---|---|---|
| Spatial compression | Global pool → flat 512-d token | Spatial BEV token grid |
| Temporal memory | GRU: 512-float state | Causal transformer over N past BEV frames |
| Multi-modal futures | 4 fixed hypotheses + WTA | Flow matching over full trajectory distribution |
| Sensor fusion | Vision + map only | Camera + map + speed + IMU + CAN |
| Scene geometry | Calibrated front-view frame | Lifted perspective → BEV (LSS) |
| Training signal | Imitation + simulator | World model + RL fine-tuning |
01 Architecture Overview
Five stages: multi-input encoding, BEV lifting, causal temporal attention, query-based decoding, and trajectory generation.
02 Backbone
Keep FastViT (it's a good choice) but remove the global average pool. Output the full spatial token grid at 1/8 input resolution — for 128×256 input, this is 16×32 = 512 tokens, each a 256-d vector.
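The shape arithmetic above, made concrete. The 256-d token width is the proposal's own figure; the reshape below just shows how a stride-8 feature map becomes a token sequence instead of a single pooled vector.

```python
import numpy as np

H, W, C, STRIDE = 128, 256, 256, 8   # input size, token dim, 1/8 stride

# A stride-8 backbone turns the 128x256 image into a 16x32 feature grid.
grid_h, grid_w = H // STRIDE, W // STRIDE
feature_map = np.zeros((C, grid_h, grid_w))          # (256, 16, 32)

# Sequence form for attention / query decoding: 512 tokens of 256-d,
# instead of one global-average-pooled 512-d scene vector.
tokens = feature_map.reshape(C, grid_h * grid_w).T   # (512, 256)
```

Every downstream head can now attend to the specific tokens covering its region of interest rather than querying a scene-level average.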
Lane Line Precision
With spatial tokens, the lane line head can attend to specific pixel regions rather than reasoning from a scene-level average. Sub-pixel accuracy becomes achievable.
Lead Detection at Range
Lead vehicle detection benefits enormously from spatial locality. Spatial tokens let detection heads directly attend to where the vehicle is in the image — no need for the GRU to "remember" object locations.
03 BEV Representation
Instead of a calibrated front-view frame, explicitly lift image features into a Bird's Eye View grid using Lift-Splat-Shoot (LSS). Geometric relationships become explicit in the representation rather than implied.
| Property | Calibrated Front-View (current) | BEV (improved) |
|---|---|---|
| Geometric accuracy | Projective — distances distorted | Metric — distances accurate |
| Multi-camera fusion | Complex per-camera handling | Trivial — sum BEV grids |
| Object geometry | Requires projection | Direct in BEV space |
| Occlusion handling | Camera-frame dependent | Occluded cells naturally empty |
| Compute overhead | None (manual transform) | DepthNet + voxel pooling (~15-20%) |
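The lift-splat step can be sketched in a few lines under simplifying assumptions: toy sizes, random stand-in features, and precomputed pixel-to-cell geometry instead of a real camera model. The structure — outer product of features with a per-pixel depth distribution, then sum-pooling into BEV cells — is the core of LSS.

```python
import numpy as np

D, HW, C = 8, 6, 4           # depth bins, image tokens, feature channels
BEV = np.zeros((10, 10, C))  # toy BEV grid

rng = np.random.default_rng(0)
feats = rng.normal(size=(HW, C))                  # per-pixel features
depth_logits = rng.normal(size=(HW, D))
depth_prob = np.exp(depth_logits)
depth_prob /= depth_prob.sum(-1, keepdims=True)   # DepthNet: softmax over bins

# Precomputed geometry: which BEV cell each (pixel, depth-bin) lands in.
cell_x = rng.integers(0, 10, (HW, D))
cell_y = rng.integers(0, 10, (HW, D))

# Lift: outer product of features and the depth distribution.
lifted = depth_prob[..., None] * feats[:, None, :]   # (HW, D, C)

# Splat: sum-pool every lifted feature into its BEV cell.
for p in range(HW):
    for d in range(D):
        BEV[cell_y[p, d], cell_x[p, d]] += lifted[p, d]
```

Because the depth distribution sums to one per pixel, lifting is mass-preserving: a confidently estimated depth concentrates a pixel's feature in one cell, while an uncertain one smears it along the ray.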
04 Temporal Module
Replace the GRU with a causal transformer over a sliding window of N=16 past BEV frames (at 5Hz = 3.2 seconds of history). KV-cache makes streaming inference O(1) amortised per new frame.
| Property | GRU (current) | Causal Transformer (improved) |
|---|---|---|
| Memory capacity | 512 floats | 16 × S × C tokens (~100× more) |
| Recall mechanism | Lossy compression into state | Direct attention to any past frame |
| Near-miss memory | May vanish from GRU state | Explicitly accessible via attention |
| Interpretability | Opaque state vector | Attention weights = what was recalled |
| Inference cost | O(1) per step | O(1) amortised with KV-cache |
| Training parallelism | Sequential (BPTT) | Fully parallel over time axis |
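A minimal single-head sketch of the sliding-window KV-cache, with one token per frame for brevity (the real design would cache the full S×C token grid per frame). Sizes and names are illustrative.

```python
import numpy as np
from collections import deque

D = 16                           # toy token dimension
WINDOW = 16                      # frames of history (3.2 s at 5 Hz)
kv_cache = deque(maxlen=WINDOW)  # old frames fall off automatically

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attend(query, new_k, new_v):
    """Attention of the current frame's query over cached keys/values of
    the last WINDOW frames. Only the new frame's K/V are computed each
    step, so streaming inference is O(1) amortised per frame."""
    kv_cache.append((new_k, new_v))
    K = np.stack([k for k, _ in kv_cache])
    V = np.stack([v for _, v in kv_cache])
    w = softmax(K @ query / np.sqrt(D))   # scaled dot-product weights
    return w @ V

rng = np.random.default_rng(0)
for _ in range(40):  # stream 40 frames; the cache never exceeds 16
    out = attend(rng.normal(size=D), rng.normal(size=D), rng.normal(size=D))
```

The attention weights `w` are also the interpretability hook from the table: they say exactly which past frame the model recalled.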
05 Output Heads
Query-Based Decoder (DETR-style)
Replace fixed-index FC concatenation with learnable query vectors per task. Each query cross-attends to the BEV representation and decodes its output. More flexible, better spatial localisation, naturally handles variable-cardinality sets (arbitrary number of lane lines or lead vehicles).
Flow Matching Trajectory Head
Replace the 4-hypothesis WTA path head with a conditional flow matching model. Given BEV context and temporal representation, it learns a mapping from noise → future trajectory. At inference, sample K trajectories to get a full distribution over futures.
Continuous Multimodality
WTA with 4 hypotheses discretises the future into 4 modes. Flow matching produces a continuous distribution — all futures reachable, weighted by probability. Handles rare scenarios naturally.
Temporal Correlation
WTA generates waypoints semi-independently. Diffusion/FM over the full trajectory respects temporal correlations — if turning left at 2s, still turning at 3s. Produces physically plausible trajectories.
Inference Cost
Flow matching requires multiple integration steps at inference. With DDIM-style samplers or few-step FM, 1–4 steps are sufficient for a distilled model. Roughly 3–5× more compute than an FC head — manageable on modern hardware.
WTA Laplacian NLL Still Useful
For non-path outputs (lane lines, leads, pose), the WTA Laplacian NLL approach from v0.8 is still excellent and should be retained. Only replace the path head with flow matching.
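A runnable sketch of few-step flow-matching sampling. The velocity network here is faked with a closed-form rectified-flow field toward a fixed "target" trajectory so the code runs; in the real head it would be a learned network conditioned on the BEV and temporal representations.

```python
import numpy as np

N_WAYPOINTS, N_STEPS, K = 33, 4, 8   # few-step sampler, K sampled futures

def velocity_field(x, t, context):
    """Stand-in for the learned conditional velocity network. For a
    rectified flow on straight noise->data paths the ideal field is
    v = (target - x) / (1 - t); here we pretend the context IS the
    target trajectory so the sketch is self-contained."""
    return (context - x) / max(1.0 - t, 1e-3)

def sample_trajectory(context, rng):
    """Euler-integrate the flow from Gaussian noise (t=0) to a
    trajectory (t=1) in N_STEPS steps."""
    x = rng.normal(size=N_WAYPOINTS)
    for s in range(N_STEPS):
        t = s / N_STEPS
        x = x + velocity_field(x, t, context) / N_STEPS
    return x

rng = np.random.default_rng(0)
context = np.linspace(0, 50, N_WAYPOINTS)   # hypothetical conditioning
futures = np.stack([sample_trajectory(context, rng) for _ in range(K)])
```

With this single-mode fake field every sample lands on the same trajectory; a learned field conditioned on real scene context is exactly what produces the diverse, probability-weighted distribution over futures described above.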
06 Training Strategy
Stage 1: Pretraining on Large-Scale Video
Pretrain the BEV backbone and temporal transformer on large-scale unlabelled driving video using a self-supervised objective — predict the next BEV frame or reconstruct masked tokens. Builds general road scene understanding without requiring GT labels.
Stage 2: Multi-Task Imitation Learning
Fine-tune on labelled data with the full multi-task loss: flow matching for paths, WTA Laplacian NLL for lanes/leads/pose. This is the current approach, now applied to richer representations.
Stage 3: World Model Fine-Tuning
Train a world model (video prediction conditioned on ego actions) alongside the driving model. Use it to score trajectories: trajectories that predict future frames matching real-world outcomes are rewarded. Enables learning from near-misses and unusual events without manual labelling. This is the direction of openpilot 0.10's Tomb Raider world model.
Stage 4: RL from Simulator
Fine-tune with RL in a high-fidelity simulator (MetaDrive or CARLA), using rewards based on comfort, progress, collision avoidance, and traffic rule compliance. The fully E2E differentiable architecture enables direct policy gradient through the trajectory head — this is only possible because planning is inside the model.
07 All Three Models Compared
| Component | v0.8.11 | v0.9.x | Improved |
|---|---|---|---|
| Backbone | Custom ResNet | FastViT (Hybrid) | EfficientViT / FastViT-L (no global pool) |
| Spatial representation | Global avg pool → 512-d | Partial spatial tokens | Full BEV lifting (LSS) |
| Temporal module | GRU, 512-d state | GRU, 512-d (richer input) | Causal transformer + KV-cache |
| Sensor fusion | Vision only | Vision + map | Vision + map + speed + IMU + CAN |
| Path output | 4 hypotheses + WTA | 4 hypotheses + direct curvature | Flow matching (full distribution) |
| Output head API | FC branches, flat concat | FC branches, flat concat | DETR-style query decoder |
| Lateral control | Waypoints → MPC | Direct curvature ✓ | Direct curvature (retained) |
| Training | Distillation only | Imitation + simulator | Pretrain → imitation → world model → RL |
| Multi-camera | No | No | Native (BEV fusion) |
| RL-ready | No | Partial (lateral only) | Yes — fully E2E differentiable |
| Interpretability | Low | Low (+ attention in backbone) | Medium (BEV grid + attention maps) |