The market is saturated with quantum hype and quantum threat. Atlas separates signal from noise on the one question that precedes the spend: does this circuit actually need a quantum computer, or does a classical method reproduce it? Every headline number below is traceable to a file, a script, or a hardware job-id, and the corpus route distribution shows the answer is "classical" far more often than the hype implies. The evidence comes in three layers: a 2,517-circuit oracle-certified corpus (with its confusion matrix and a held-out Wilson bound), real-QPU validation on public Heron-r2 hardware, and an honest table of what Atlas does that the surveyed tools do not.
Atlas's routing is checked against a classical oracle (Stim for Clifford, exact non-truncated MPS, and statevector) that can only certify the classically-tractable regime — the only regime where a classical ground truth exists. This is the honest scope of every number in this section.
Oracle route distribution: cpu 2,431 · tensor 61 · hpc_first 25 · escalate 0. The escalate (genuinely quantum-hard) class has 0 certified circuits by construction: there is no classical ground truth there (the conjectured BQP ≠ BPP separation), so such circuits are declined as out-of-distribution, never certified. The corpus is evaluation-only: Atlas is not tuned to it.
What the corpus is made of — stated up front, so the "are these synthetic?" objection is answered with the composition, not a defense. Yes, the circuits are generated, not harvested from production — and they are fully specified, hash-pinned, and regenerable, which is the point: the oracle can only certify what it can exactly simulate. The families are structural-topology generators spanning the regimes where the classical/quantum frontier actually lives:
| Family | Topology / why it's here | Circuits (base + moat m + ext e) |
|---|---|---|
| line | 1D nearest-neighbour chain — the area-law base case (cheap unless deep) | 128 + 285m + 30e |
| ladder | two coupled chains — width-2 entanglement growth | 120 + 253m |
| ring / cycle | periodic boundary — breaks the open-chain treewidth shortcut | 285m + 269m + 30e |
| grid | 2D lattice — where MPS bond starts to blow up (volume-law onset) | 104 + 250m |
| heavy_hex | the real superconducting hardware topology (Heron-class) | 128 |
| star | one hub, many spokes — high-degree node, treewidth stress | 120 + 285m + 30e |
| dense_core / all_to_all_sparse | dense subgraphs — the hard, near-frontier end | 120 + 80 |
Each family is crossed with n = 8 … 44, multiple depths, 4 T-gate densities (stabilizer → magic-heavy), and 8 seeds. The point of the spread is to walk a circuit across the simulability frontier — low T-count + area-law stays CPU, magic + 2D-dense pushes toward the wall — so the route distribution above is earned by structure, not by cherry-picking easy cases. (m = moat slice, e = ext slice.)
Source: three CSVs under benchmarks/results_scaled/ (scaled_results.csv 800 · _ext 90 · _moat 1627 = 2,517; family column in the 800-slice, id-prefix in ext/moat), benchmark_manifest.json (sha256 66f9d6…, split=evaluation-only), reproduced by oracle.py. CLAIMS C6/C13; SELF_ASSESSMENT #1. Conceded honestly: these are 2,517 variants over ~7 structural families, not 2,517 independent families — variants within a family are correlated, and exchangeability across families is assumed, not proven. We do not claim the corpus spans every circuit a user might submit; that is exactly why OOD decline exists.
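The per-family counts in the table can be cross-checked against the three slice sizes quoted in the source line. The sketch below is only a transcription of the table, assuming that unsuffixed counts belong to the 800-circuit base slice and that the two "m" entries in the ring / cycle row are both moat counts; the dictionary layout and function name are illustrative, not part of oracle.py.

```python
# Per-family counts transcribed from the table above, as
# (base 800-slice, moat slice "m", ext slice "e").
# Assumption: unsuffixed numbers are base-slice counts.
FAMILY_COUNTS = {
    "line": (128, 285, 30),
    "ladder": (120, 253, 0),
    "ring / cycle": (0, 285 + 269, 30),
    "grid": (104, 250, 0),
    "heavy_hex": (128, 0, 0),
    "star": (120, 285, 30),
    "dense_core / all_to_all_sparse": (120 + 80, 0, 0),
}


def slice_totals(counts):
    """Sum each slice column (base, moat, ext) across all families."""
    base = sum(c[0] for c in counts.values())
    moat = sum(c[1] for c in counts.values())
    ext = sum(c[2] for c in counts.values())
    return base, moat, ext


base, moat, ext = slice_totals(FAMILY_COUNTS)
print(f"base={base} moat={moat} ext={ext} total={base + moat + ext}")
```

Under that reading the columns sum to 800, 1,627 and 90, which matches the three CSV sizes and the 2,517 total.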
We do not present a perfect score. There are 11 measured disagreements with the oracle; here is where they fall. Rows = oracle route, columns = Atlas route, on the tractable corpus.
The 11 disagreements: 10 cpu→tensor under-routes (Atlas said CPU, oracle said tensor; this is the safe direction, where the laptop is over-trusted on a circuit a workstation handles) and 1 tensor→hpc, the single false-safety. Every off-diagonal cell is below the diagonal: Atlas never over-routes (0 false alarms) and never lands two tiers off.
The single false-safety is moat_ladder_n28_t8_s3: Atlas routes it TENSOR via a treewidth cost of 2²⁸ (≈4 GB, genuinely workstation-feasible); the oracle routes it HPC because its statevector cutoff is n=27 < 28. Both methods agree the cost is ~2²⁸ ≈ 4 GB. The disagreement is a route-class boundary artifact between two oracle threshold tables on different scales (qubit-count vs log₂-cost), not a "said classical when you truly need a QPU" error. We report it rather than retune the thresholds to force the count to zero.
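The "2²⁸ ≈ 4 GB" figure is easy to reproduce if one assumes a dense statevector stored as complex128 (16 bytes per amplitude). The precision is an assumption here, since the text does not state it, and `statevector_bytes` is a hypothetical helper rather than an Atlas function.

```python
def statevector_bytes(n_qubits: int, bytes_per_amplitude: int = 16) -> int:
    """Memory for a dense statevector: 2**n amplitudes.

    The default of 16 bytes assumes complex128 amplitudes.
    """
    return (2 ** n_qubits) * bytes_per_amplitude


# Compare the oracle's statevector cutoff (n=27) with the disputed circuit (n=28).
for n in (27, 28):
    gib = statevector_bytes(n) / 2**30
    print(f"n={n}: 2^{n} amplitudes -> {gib:.0f} GiB")
```

One qubit past the cutoff doubles the footprint from 2 GiB to 4 GiB, which is why a cutoff expressed in qubit count and a threshold expressed in log₂-cost can land the same circuit in different route classes.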
Reporting false-safety as "1 / 2517" is statistically misleading: 2,431 of 2,517 (97%) are trivial cpu circuits where false-safety is impossible by definition and only inflate the denominator. The honest figure is over cases where false-safety is possible (true route hard): 1 / 25 = 4%. And the deepest limit stands: in the genuinely quantum-hard regime there is no ground truth, so measurable false-safety is reduced there, not proven absent.
Source: the three CSVs under benchmarks/results_scaled/ (direct count of atlas_route_class vs oracle_route); SELF_ASSESSMENT #1. Every cell here is the exact per-cell count: diagonal 2,431 / 51 / 24, off-diagonal 10 cpu→tensor + 1 tensor→hpc (the false-safety). 0 above the diagonal = 0 false-alarm.
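The headline figures follow directly from the five non-zero cells quoted above. The sketch below keys each cell by the (Atlas route, oracle route) pair so that no matrix orientation has to be assumed; the tier ordering cpu < tensor < hpc and the short label `hpc` for the oracle's `hpc_first` class are conventions of this example only.

```python
# Tier ordering used to decide over- vs under-routing (example convention).
TIER = {"cpu": 0, "tensor": 1, "hpc": 2}

# (atlas_route, oracle_route) -> count, from the per-cell counts quoted above.
CELLS = {
    ("cpu", "cpu"): 2431,
    ("tensor", "tensor"): 51,
    ("hpc", "hpc"): 24,
    ("cpu", "tensor"): 10,   # under-route
    ("tensor", "hpc"): 1,    # the single false-safety
}

total = sum(CELLS.values())
agree = sum(c for (a, o), c in CELLS.items() if a == o)
over = sum(c for (a, o), c in CELLS.items() if TIER[a] > TIER[o])
under = sum(c for (a, o), c in CELLS.items() if TIER[a] < TIER[o])
max_gap = max(abs(TIER[a] - TIER[o]) for (a, o) in CELLS)

# Column totals per oracle class, to compare with the route distribution.
oracle_totals = {}
for (a, o), c in CELLS.items():
    oracle_totals[o] = oracle_totals.get(o, 0) + c

# False-safety rate restricted to circuits whose oracle route is the hard class.
false_safety_rate = CELLS[("tensor", "hpc")] / oracle_totals["hpc"]

print(f"agreement {agree}/{total} = {agree / total:.4f}")
print(f"over-routes {over}, under-routes {under}, max tier gap {max_gap}")
print(f"oracle totals {oracle_totals}, false-safety rate {false_safety_rate:.0%}")
```

The column totals recover the oracle route distribution (2,431 / 61 / 25), and the restricted denominator gives the 1 / 25 figure rather than 1 / 2,517.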
The advertised number is computed on data the acceptance threshold never saw. atlas_conformal.py splits the certified set 50/50, picks the threshold τ on the selection half only, and reports a one-sided Wilson upper bound on the error of the held-out validation half.
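For readers unfamiliar with the bound, here is a minimal sketch of a one-sided Wilson upper limit on an error rate. It is the textbook formula, not the code in atlas_conformal.py; the function name and the example counts (5 errors in 1,000 held-out circuits) are hypothetical and chosen only to show the call.

```python
from math import sqrt
from statistics import NormalDist


def wilson_upper(errors: int, n: int, alpha: float = 0.05) -> float:
    """One-sided (1 - alpha) Wilson upper bound on a binomial error rate."""
    if n <= 0 or not 0 <= errors <= n:
        raise ValueError("need n > 0 and 0 <= errors <= n")
    z = NormalDist().inv_cdf(1 - alpha)  # one-sided normal quantile
    p_hat = errors / n
    centre = p_hat + z * z / (2 * n)
    half_width = z * sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n))
    return (centre + half_width) / (1 + z * z / n)


# Hypothetical held-out half: 5 errors among 1,000 circuits.
upper_error = wilson_upper(errors=5, n=1000, alpha=0.05)
print(f"error rate <= {upper_error:.4f}, so correctness >= {1 - upper_error:.4f}")
```

The bound stays strictly above the observed error rate even when zero errors are seen, which is why a held-out Wilson figure never reads as "100%".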
Calibration quality after isotonic recalibration (held-out): Brier 0.103 → 0.0076, ECE 0.255 → 0.007. We never show "100%" — the ceiling is the Wilson bound, and all 11 errors sit at confidence ≤ 34, i.e. the calibrated confidence behaves as an error detector. The reliability diagram is public.
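The two calibration metrics can be illustrated with a short sketch. It uses the standard definitions (mean squared error for Brier, equal-width bins for ECE), with confidences on a 0 to 1 scale and five made-up predictions; the bin count and the data are assumptions of this example and do not come from calibration_report.json.

```python
def brier_score(probs, outcomes):
    """Mean squared gap between predicted probability and the 0/1 outcome."""
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)


def expected_calibration_error(probs, outcomes, n_bins=10):
    """Equal-width-bin ECE: weighted mean |confidence - accuracy| per bin."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, outcomes):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[idx].append((p, y))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(p for p, _ in members) / len(members)
        accuracy = sum(y for _, y in members) / len(members)
        ece += len(members) / len(probs) * abs(mean_conf - accuracy)
    return ece


# Toy data: predicted probability that the route is correct, and whether it was.
toy_probs = [0.9, 0.9, 0.9, 0.9, 0.3]
toy_correct = [1, 1, 1, 1, 0]
toy_brier = brier_score(toy_probs, toy_correct)
toy_ece = expected_calibration_error(toy_probs, toy_correct)
print(f"Brier = {toy_brier:.3f}, ECE = {toy_ece:.3f}")
```

In the toy data the one wrong prediction also carries the lowest confidence, which is the pattern the text describes when it says the calibrated confidence behaves as an error detector.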
Source: HANDOFF_5ideas/atlas_conformal.py + atlas_recalibrate.py + calibration_report.json + reliability_diagram.svg. CLAIMS C13/C14; SELF_ASSESSMENT #1. Caveat (conceded): the corpus is template-family-bounded — 2,517 variants, not 2,517 independent structural families; exchangeability across families is assumed, not proven.
Real QPU. Run on public-access superconducting hardware, Heron r2 architecture (backend ibm_kingston, open plan), 2026-06-22/24. We report only what we measured; the genuinely quantum-hard regime (n>24, no classical oracle) remains unmeasurable by construction and we do not over-promise it.
4a · Depth-resolved 3-regime sweep: 19 (n,depth) points. TVD(ideal, QPU) rises with depth because the device decoheres (real hardware noise grows with depth), not because Atlas is wrong: at n≤12 the ideal distribution is classically tractable (exact statevector) and serves as the ground truth. Mirror-RB fidelity (independent, readout-corrected, K=6 error bars) falls correspondingly as depth grows.
| Regime | Circuit | n | depth | Atlas route | TVD(ideal,QPU) | mirror-RB F (±SEM) |
|---|---|---|---|---|---|---|
| Easy | ghz6 GOOD | 6 | Clifford | CPU | 0.081 | — |
| Easy | pt_n12_d2 | 12 | 2 | CPU | 0.135 | 0.947 ± 0.021 |
| Frontier | pt_n12_d4 | 12 | 4 | CPU | 0.358 | 0.781 ± 0.041 |
| Frontier | pt_n12_d6 | 12 | 6 | CPU | 0.533 | 0.678 ± 0.044 |
| Frontier | pt_n12_d8 | 12 | 8 | CPU | 0.551 | 0.451 ± 0.055 |
| Frontier | pt_n12_d10 | 12 | 10 | CPU | 0.532 | — |
| Hard/deep | pt_n12_d12 | 12 | 12 | CPU | 0.583 | — |
| Hard/deep | pt_n12_d16 | 12 | 16 | CPU | 0.602 | — |
| Hard/deep | pt_n12_d18 | 12 | 18 | CPU | 0.570 | — |
| Hard/deep | ramp_n12_d32 | 12 | 32 | CPU | 0.779 | — |
| Hard/deep | ghz6 TLS (bad layout) | 6 | Clifford | CPU | 0.782 | — |
| Beyond oracle | disc_n20_d2 | 20 | 2 | CPU | 0.999 | — |
Mirror-RB exponential fits: r_per_layer = 6.7% (n=8, R²=0.996), 11.2% (n=12, R²=0.944); additional n=8 points at d = 8/16/32 give F = 0.655±0.010 / 0.415±0.039 / 0.126±0.031. Every value is traceable by job_id in qpu_jobs.json; tables are in qpu_regime_results.json + qpu_mirror_results.json. 1024 shots (TVD) / 500 shots (mirror-RB); 0 new QPU time (retrieved from completed jobs). The "—" gaps are honest (not interpolated). CLAIMS C1/C2.
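The per-layer rate can be sketched as an exponential fit F(d) = A·(1 − r)^d on the four n=12 mirror-RB points in the table. The text does not state the fitting procedure, so the unweighted least-squares line through (d, ln F) used below is an assumption, and `fit_decay` is a hypothetical helper; the error bars are ignored.

```python
from math import exp, log


def fit_decay(depths, fidelities):
    """Unweighted least-squares line through (depth, ln F).

    Returns (r_per_layer, amplitude, r_squared) for F(d) = A * (1 - r)**d.
    """
    ys = [log(f) for f in fidelities]
    n = len(depths)
    mean_x = sum(depths) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in depths)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(depths, ys))
    syy = sum((y - mean_y) ** 2 for y in ys)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_per_layer = 1 - exp(slope)         # since (1 - r) = exp(slope)
    r_squared = sxy * sxy / (sxx * syy)  # R^2 of the log-linear fit
    return r_per_layer, exp(intercept), r_squared


# n=12 mirror-RB points from the table above: (depth, F).
depths_n12 = [2, 4, 6, 8]
fidelities_n12 = [0.947, 0.781, 0.678, 0.451]
r_layer, amplitude, r2 = fit_decay(depths_n12, fidelities_n12)
print(f"r_per_layer = {r_layer:.1%}, R^2 = {r2:.3f}")
```

On these four points the unweighted log-linear fit gives r_per_layer ≈ 11.2% and R² ≈ 0.944, consistent with the quoted n=12 values. A weighted fit using the ±SEM columns would shift the estimate slightly.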
4b · Embedding A/B — hardness depends on the physical layout. Same logical GHZ, two physical mappings. A deliberately bad layout (real TLS cluster) collapses the output — measured contrast 9.7× at n=6, robust and growing with n.
| Layout | circuit | TVD(ideal,QPU) | reading |
|---|---|---|---|
| GOOD | ghz6 | 0.081 | GHZ correct, structured |
| TLS (bad) | ghz6 | 0.782 | collapsed to near-uniform |
Contrast 9.7× (ghz6) vs 7.3× (ghz4). The bad layout is the documented real TLS cluster, not an adversarial worst case. This validates that Atlas's hardware-aware lens (recommend_embedding_offline / exclude_qubits) is actionable on metal. Source: QPU_RESULTS.md §2/§4.
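For reference, total variation distance between an ideal distribution and measured counts is half the L1 distance between the two probability vectors. The sketch below applies it to a 6-qubit GHZ state; the counts dictionary is invented for illustration (it is not a retrieved job result), and the helper names are hypothetical.

```python
def counts_to_probs(counts):
    """Normalise a bitstring -> shots dictionary into probabilities."""
    shots = sum(counts.values())
    return {bits: c / shots for bits, c in counts.items()}


def tvd(p, q):
    """Total variation distance: 0.5 * sum over outcomes of |p - q|."""
    outcomes = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in outcomes)


n = 6
ideal_ghz = {"0" * n: 0.5, "1" * n: 0.5}

# Reference point: a fully uniform output over all 2**n bitstrings.
uniform = {format(i, f"0{n}b"): 1 / 2**n for i in range(2**n)}
tvd_uniform = tvd(ideal_ghz, uniform)

# Hypothetical 1024-shot result from a well-behaved layout.
made_up_counts = {"000000": 480, "111111": 470, "000001": 40, "111110": 34}
tvd_good = tvd(ideal_ghz, counts_to_probs(made_up_counts))

# Contrast between the two measured layouts reported in the table.
contrast = 0.782 / 0.081
print(f"uniform limit {tvd_uniform:.3f}, example {tvd_good:.3f}, contrast {contrast:.1f}x")
```

The fully uniform limit for n=6 is 31/32 ≈ 0.969, so the measured 0.782 for the TLS layout sits most of the way from the good layout toward complete collapse.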
Our own Porter-Thomas conformal calibration on real metal flips the sign of the first-order correction. The honest direction is conservative: first-order estimates overestimate reachable depth.
From 7 PT-valid points (n=12, 2D-random), median per-layer ratio κ̂ = 2.62 (>1): on real ibm_kingston, fidelity falls ≈2.6× faster than a first-order inference predicts (the simulator underestimates correlated/non-Markovian noise — crosstalk, TLS, leakage). Impact: Atlas's realistic depth ceiling becomes ≈11 layers (vs an optimistic first-order 29–49). The correction tightens the ceiling (safety), it does not extend it. Held-out (LOO conformal) MAE 29% → 12%, 80% band ±0.21.
Source: qpu_pt_calib.json + atlas_conformal_hardware + qualify_offline; CLAIMS C2. Honest scope: 7 PT-valid points rather than 9, which supports ~80% conformal coverage (90% needs ≥9 points); the worst-case 2D-random family only (κ̂ applies to high-magic ESCALATE candidates, which is the relevant family); each circuit had its own transpiled layout, so embedding variance is inside the band.
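The "90% needs ≥9 points" remark matches the usual counting argument for conformal intervals: with n calibration scores, the quantile index ⌈(n+1)(1−α)⌉ must not exceed n, so the highest coverage with a finite band is n/(n+1). The sketch below uses that split-conformal rule as an assumption; the source describes a leave-one-out variant whose exact procedure is not shown, and both function names are hypothetical.

```python
from fractions import Fraction
from math import ceil


def max_finite_coverage(n_cal: int) -> Fraction:
    """Largest coverage 1 - alpha with ceil((n + 1) * (1 - alpha)) <= n."""
    return Fraction(n_cal, n_cal + 1)


def min_points_for(coverage: Fraction) -> int:
    """Smallest n with n / (n + 1) >= coverage, i.e. n >= coverage / (1 - coverage)."""
    return ceil(coverage / (1 - coverage))


n_points = 7  # PT-valid calibration points reported above
print(f"n={n_points}: finite band up to {float(max_finite_coverage(n_points)):.1%} coverage")
print(f"80% needs >= {min_points_for(Fraction(4, 5))} points")
print(f"90% needs >= {min_points_for(Fraction(9, 10))} points")
```

Exact fractions are used so that the boundary case n = 9 at 90% is not disturbed by floating-point rounding. With 7 points the ceiling is 7/8 = 87.5%, so an 80% band is attainable and a 90% band is not.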
A fair question from anyone who reads carefully: there appear to be two noise models. There are — and conflating them would be dishonest, so here is the precise state of each.
| Layer | What it is | Status | Where it's used |
|---|---|---|---|
| Measured local model | Per-edge CZ error, per-qubit readout, T1, SX — pulled from the real device's calibration snapshot (156 qubits, 176 edges) | Measured on ibm_kingston, 2026-06-22 | QPU validation above; mirror-RB / κ̂ ceiling |
| Toy global envelope | A single global 2-qubit depolarizing channel (one knob) | Approximation — explicitly labeled "uncalibrated noise estimate" | the interactive UI slider only |
The honest split: the measured model carries real per-edge heterogeneity — median CZ error ≈2.0×10⁻³ (best edge ≈8.2×10⁻⁴, with a long tail up to dead edges), median readout ≈7.9×10⁻³ (best ≈2.4×10⁻³), median T1 ≈241 µs. That is the model behind every hardware number on this page. The toy global depolarizing channel is a single-knob teaching envelope wired only to the UI slider; it does not model per-qubit T1/T2, readout, crosstalk, or topology, and is labeled as such. The measured per-edge model is not yet wired into the interactive panel — that connection is roadmap, and we say so rather than implying the slider is calibrated.
Source: benchmarks/kingston_calibration.json (per-edge CZ / readout / T1 / SX for 156 qubits / 176 edges, measured 2026-06-22) + noise_local_validation.csv (the measured local model reproduces the device: TVD(ideal vs noisy-local) ≈ 0.03–0.10 across the validation circuits); noise.py = the toy global channel. CLAIMS C7; SELF_ASSESSMENT #5 (toy noise conceded as a real gap for the slider; the measured model is what backs every hardware claim). The two layers are kept separate and labeled — never presented as one calibrated model.
Bounded novelty, stated honestly: the claim is relative to the tools we surveyed, not "nobody has X" in the absolute. Qiskit Aer's method='automatic' is prior art for multi-method routing and is cited here as the baseline. What Aer does not do is the rest of the row.
| Capability | Qiskit Aer (baseline) | Stim · quimb · cotengra | Atlas |
|---|---|---|---|
| Multi-method routing | Yes — method='automatic' picks a sim method (prior art) | single-paradigm each | Yes — min over magic / MPS-bond / treewidth / spread |
| Uses measured hardware calibration | No | No | Yes — real Heron-r2 mirror-RB + κ̂ |
| Emits a verdict (CPU / TENSOR / HPC / ESCALATE) | No — runs to find out | No | Yes |
| Emits a signed certificate + evidence ledger | No | No | Yes — SHA-256, per-fold signers |
| Calibrated confidence + honest deferral (OOD → decline) | No | No | Yes — held-out conformal bound |
| Pauli-path / SPD, free-fermion / matchgate, stabilizer-rank + magic-budget axes | No | partial, per engine | Covered as adjudicated axes |
Aer's method='automatic' already routes among simulation methods — we credit it as the baseline and do not claim to have invented multi-method selection. Atlas's contribution is the layer on top: measured-hardware calibration, an emitted verdict + signed certificate, a calibrated confidence with honest deferral, and the broader axis set (Pauli-path/SPD, free-fermion/matchgate, stabilizer-rank+magic-budget). None of these is claimed as "first in the world" — only "not present in the tools we surveyed."
Source: COMPETITIVE.md §1–§2 (full per-tool, no-strawman comparison incl. AWS Braket cost estimator, Mitiq, TKET). Each competitor is excellent at its purpose; several (Stim, quimb, cotengra) are dependencies of Atlas. CLAIMS C11 — language: "productized first," never "category without competition."

    # corpus route-correctness + held-out conformal bound
    PYTHONPATH=HANDOFF_5ideas NUMBA_DISABLE_JIT=1 python3 HANDOFF_5ideas/atlas_conformal.py
    # -> held-out route-correctness ≥ 0.987 (α=0.05); 2506/2517 = 0.9956

    # real-QPU depth-resolved validation (re-derives from completed jobs)
    python3 benchmarks/qpu_validation.py   # end-to-end TVD
    # tables: qpu_regime_results.json · qpu_mirror_results.json · qpu_jobs.json

Every figure on this page is re-derivable from the named scripts and data manifests without handing over the calibration corpus. The code is open (Apache 2.0); the measurements are the moat. Hardware language follows policy: public-access superconducting hardware, Heron r2 architecture, June 2026 — numbers and job-ids remain verifiable.