Model calibration

How accurate is GSM?

We compared the model's predicted G-score against published real-world performance for 24 well-documented machines. In a perfect model the two would line up: a higher G would always mean higher real-world performance. Reality is messier.

Within-regime accuracy

Each score measures how well predicted rank tracks published rank inside a regime. Closer to 1.0 is better.
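
As a rough illustration of a within-regime score, the sketch below computes a Spearman rank correlation for each regime. The choice of Spearman, the function names, the regime labels, and the sample figures are all assumptions made for illustration: the page does not say which rank statistic it uses, and the rows are made up rather than taken from the 24 benchmark machines.

```python
from collections import defaultdict


def ranks(values):
    """Return 1-based ranks, averaging the rank across tied values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result


def pearson(xs, ys):
    """Pearson correlation; NaN when either input has zero variance."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return float("nan")
    return cov / (vx * vy) ** 0.5


def spearman(predicted, published):
    """Spearman correlation = Pearson correlation of the ranks."""
    return pearson(ranks(predicted), ranks(published))


def within_regime_accuracy(rows):
    """rows: iterable of (regime, predicted_g, published_performance)."""
    by_regime = defaultdict(list)
    for regime, g, perf in rows:
        by_regime[regime].append((g, perf))
    return {
        regime: spearman([g for g, _ in pairs], [p for _, p in pairs])
        for regime, pairs in by_regime.items()
    }


# Hypothetical sample rows, not the page's benchmark data.
SAMPLE = [
    ("road", 3.1, 120.0),
    ("road", 4.0, 180.0),
    ("road", 5.2, 260.0),
    ("air", 6.0, 900.0),
    ("air", 6.5, 700.0),
    ("air", 7.1, 1200.0),
]

scores = within_regime_accuracy(SAMPLE)
```

In this made-up sample the "road" ordering is fully preserved, while two of the three "air" machines swap places, so the "air" score comes out lower.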

Known limits. The G-score is an order-of-magnitude similarity index, not a precision performance predictor. Cross-regime ranking is the weakest dimension (a Bugatti and a Prius compete on different physics). Within a regime, relative ordering tends to track real performance to within ~1 G unit. Use it for triage, not procurement.

How to read this page. If the correlation r is above 0.7 across regimes, the model is broadly calibrated. If a specific regime falls below 0.3, the Π-group set for that regime needs work: open the Research page and re-source those entries.
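
The reading rule above can be sketched as a small triage helper. Only the 0.7 and 0.3 thresholds come from the text. The function name, the input shape, the example values, and the reading of "across regimes" as a single overall r are assumptions.

```python
BROADLY_CALIBRATED = 0.7  # threshold quoted in the text
NEEDS_WORK = 0.3          # threshold quoted in the text


def read_calibration(overall_r, regime_scores):
    """Apply the two reading rules.

    overall_r: one correlation across all regimes (assumed interpretation).
    regime_scores: dict of regime name -> within-regime score (hypothetical).
    """
    return {
        "broadly_calibrated": overall_r > BROADLY_CALIBRATED,
        # Regimes whose Π-group set would need re-sourcing.
        "regimes_to_resource": sorted(
            name for name, r in regime_scores.items() if r < NEEDS_WORK
        ),
    }


# Made-up values for illustration only.
report = read_calibration(0.74, {"road": 0.82, "air": 0.21, "sea": 0.55})
```

Scores between 0.3 and 0.7 are deliberately left unflagged here, since the text gives no rule for that range.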