How much CO₂ does your LLM inference cost?
We autonomously tested inference configs on NVIDIA DGX Spark, scored with the ISO SCI standard. Each model is a line. Each config is a station.
Per-Model Analysis: Throughput vs Carbon
Each chart shows one model family. X = tok/s, Y = SCI. Point size reflects batch size. Log-log scale.
SCI Scaling Law: Predicting Frontier Model Carbon
Power-law regression fit on measured Qwen data, extrapolated to frontier models. SCI = 0.000207 × params^0.374 (R² = 0.979). Frontier parameter counts are best public estimates.
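The fitted curve can be reproduced with an ordinary linear regression in log-log space. A minimal sketch below: the points are synthetic samples placed exactly on the published curve, not the measured Qwen data, so the fit recovers the stated coefficients.

```python
import numpy as np

# Synthetic points on the published curve SCI = 0.000207 * params^0.374
# (parameter counts are placeholders, not the real Qwen measurements).
params = np.array([0.5, 1.5, 3.0, 7.0, 14.0, 32.0])
sci = 0.000207 * params**0.374

# A power law y = a * x^b is a straight line in log-log space:
# log y = log a + b * log x, so fit with a degree-1 polynomial.
b, log_a = np.polyfit(np.log(params), np.log(sci), 1)
a = np.exp(log_a)
print(f"SCI ≈ {a:.6f} × params^{b:.3f}")
```

The same fit on real data would carry residual scatter; R² = 0.979 in the dashboard indicates how tightly the measured points hug the line.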
3D Interactive Analysis
Drag to rotate, scroll to zoom, hover for details. Three dimensions reveal tradeoffs invisible in 2D.
Top 5 Greenest Configs
| # | Model | Quant | Batch | SCI (µgCO₂/tok) | BPB | tok/s | Pareto |
|---|-------|-------|-------|------------------|-----|-------|--------|
Leaderboards
Carbon Calculator
Same model, same hardware — different carbon cost based on where you run it.
Select Experiment
Grid Region
Recalculated SCI
SCI by Region
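The recalculation behind the calculator can be sketched as follows, assuming SCI splits into an operational part (energy × grid intensity) that scales with the region and a fixed embodied part that does not. The energy, embodied, and intensity values below are illustrative placeholders, not measured data.

```python
# Region recalculation sketch: only the operational term depends on the grid.
def recalc_sci(energy_kwh_per_tok, embodied_ug_per_tok, intensity_g_per_kwh):
    """Return SCI in µgCO₂ per token for a given grid carbon intensity."""
    operational_ug = energy_kwh_per_tok * intensity_g_per_kwh * 1e6  # g → µg
    return operational_ug + embodied_ug_per_tok

e = 2.0e-9   # kWh per token (hypothetical)
m = 0.05     # embodied µgCO₂ per token (hypothetical)
for region, intensity in [("ISO-NE", 210), ("Low-carbon grid", 55), ("Coal-heavy grid", 650)]:
    print(region, round(recalc_sci(e, m, intensity), 3), "µgCO₂/tok")
```

This is why the same experiment can differ several-fold in SCI across regions while its tok/s stays identical.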
Methodology
How we measure environmental impact. Transparent, reproducible, ISO-standard.
What is Autoresearch?
Autoresearch is an open-source framework by Andrej Karpathy that lets AI agents run research autonomously. Give an agent a small but real ML setup, and it experiments on its own — modifying code, training, checking results, keeping or discarding — while you sleep.
The agent edits train.py, trains for a fixed 5-minute budget, checks whether val_bpb improved, and repeats: roughly 12 experiments per hour, about 100 overnight. You wake up to a log of experiments and a better model.
We adapted the autoresearch pattern for inference benchmarking on DGX Spark. Instead of optimizing training loss, our agent explores inference configurations — batch sizes, quantization, sequence lengths — and measures carbon cost via the SCI standard.
You don't touch Python. You write a program.md, a Markdown file that instructs the agent. The agent reads it, runs experiments autonomously, and logs every result. The "code" is the program; the program is the prompt.
Single file to modify. Fixed time budget. Self-contained. One GPU, one file, one metric. No complex configs, no distributed training — just an agent doing science.
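The adapted loop can be sketched in a few lines. Everything below is a hypothetical stand-in for the agent's real harness: `run_experiment` would actually launch inference and log power, and the scores here are random placeholders rather than measurements.

```python
import itertools, random

def run_experiment(config):
    # Placeholder: the real harness runs inference with `config`,
    # samples power via nvidia-smi, and computes SCI per token.
    return {"sci": random.random(), "tok_s": random.random() * 100}

def search(budget=12):  # ~12 experiments/hour in the autoresearch pattern
    grid = itertools.product([1, 4, 8], ["fp16", "int8", "int4"], [512, 2048])
    best = None
    for batch, quant, seqlen in itertools.islice(grid, budget):
        result = run_experiment({"batch": batch, "quant": quant, "seq": seqlen})
        if best is None or result["sci"] < best["sci"]:
            best = {**result, "batch": batch, "quant": quant, "seq": seqlen}
    return best  # keep the greenest config seen so far

best = search()
print(best)
```

The shape mirrors the training version: one metric (SCI instead of val_bpb), a fixed budget, keep-or-discard after each run.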
The SCI Formula
SCI = ((E × I) + M) per R
E: Energy per token in kWh. nvidia-smi at 1 Hz, trapezoidal integration.
I: Grid carbon intensity in gCO₂/kWh. Default: Connecticut / ISO-NE (~210).
M: Embodied emissions. DGX Spark manufacturing (~200 kg CO₂e), amortized over 5 years.
R: Functional unit. Per token generated. Comparable across batch sizes.
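A worked instance of the formula, using illustrative numbers rather than measured results. The lifetime-token estimate used to amortize the embodied emissions is a loud assumption (a sustained 1,000 tok/s over the 5-year amortization window), not a figure from the methodology.

```python
# SCI = ((E × I) + M) per R, with R = one generated token.
E = 2.0e-9          # kWh per token (illustrative, from trapezoidal integration)
I = 210             # gCO₂/kWh, the ISO-NE default above
M_total = 200_000   # g CO₂e embodied (~200 kg DGX Spark manufacturing)

# Assumption: 1,000 tok/s sustained for 5 years to amortize M per token.
lifetime_tokens = 5 * 365 * 24 * 3600 * 1000
m_per_tok = M_total / lifetime_tokens   # embodied gCO₂ per token

sci_g = E * I + m_per_tok               # gCO₂ per token
print(f"SCI ≈ {sci_g * 1e6:.3f} µgCO₂/token")
```

Note how the embodied term can rival the operational term at low utilization, which is why the amortization assumption matters.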
Hardware
NVIDIA DGX Spark — Blackwell GB10 · ARM big.LITTLE · 128 GB LPDDR5X · 1 TB NVMe
Power Measurement
nvidia-smi at 1 Hz. Trapezoidal integration. GPU temp, clocks, fan RPM logged.
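A minimal sketch of the energy accounting: power samples taken at 1 Hz are integrated with the trapezoidal rule and converted to kWh. The sample values are illustrative, not logged data.

```python
def energy_kwh(power_watts, dt_s=1.0):
    """Trapezoidal integral of evenly spaced power samples → energy in kWh."""
    joules = sum((a + b) / 2 * dt_s for a, b in zip(power_watts, power_watts[1:]))
    return joules / 3.6e6  # 1 kWh = 3.6e6 J

# Five one-second samples around a ~100 W draw (illustrative)
samples = [95.0, 105.0, 100.0, 98.0, 102.0]
print(f"{energy_kwh(samples):.9f} kWh")
```

Dividing the resulting energy by tokens generated gives E in the SCI formula.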
Quality Metric
Validation BPB on held-out text. Lower = better compression quality.
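BPB can be computed from the model's total cross-entropy on the held-out text. A sketch, assuming loss is accumulated in nats and the text length is counted in bytes (which is what makes BPB comparable across tokenizers):

```python
import math

def bits_per_byte(nll_nats_total, n_bytes):
    """Convert total negative log-likelihood (nats) to bits per byte."""
    return nll_nats_total / math.log(2) / n_bytes

# Illustrative: 1000 nats of loss over a 1500-byte validation text
bpb = bits_per_byte(1000.0, 1500)
print(round(bpb, 4))
```

Because the denominator is bytes rather than tokens, quantization configs with different tokenizations remain directly comparable.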
Search Strategy
Grid → Random → Bayesian (Optuna). Auto-switches on convergence.
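The staged strategy can be sketched as a controller that exhausts a grid until improvement stalls, then falls back to random sampling. This is a simplified stand-in: the real system's final stage uses Optuna's Bayesian sampler, and the convergence check here is a plain patience counter.

```python
import itertools, random

def staged_search(space, objective, patience=5):
    history, stale = [], 0
    # Stage 1: grid search until `patience` configs in a row fail to improve
    for cfg in itertools.product(*space.values()):
        score = objective(dict(zip(space, cfg)))
        improved = not history or score < min(history)
        history.append(score)
        stale = 0 if improved else stale + 1
        if stale >= patience:
            break  # converged → switch stages
    # Stage 2: random sampling (Optuna's sampler would take over after this)
    for _ in range(10):
        cfg = {k: random.choice(v) for k, v in space.items()}
        history.append(objective(cfg))
    return min(history)

space = {"batch": [1, 4, 8], "quant": ["fp16", "int8", "int4"]}
best = staged_search(space, lambda c: c["batch"] * 0.01 + len(c["quant"]) * 0.1)
print(best)
```

Switching on convergence keeps the cheap exhaustive stage from burning the experiment budget once the grid stops paying off.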
Thermal Safety
85°C abort. Cooldown gating between experiments. Throttling flagged.
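The gating logic can be sketched as below. `read_gpu_temp` is a hypothetical stand-in for an nvidia-smi temperature query, and the 60 °C resume threshold is an assumed value, not one stated in the methodology.

```python
import time

ABORT_C, RESUME_C = 85.0, 60.0  # abort limit from above; resume level assumed

def run_gated(experiment, read_gpu_temp, poll_s=5):
    if read_gpu_temp() >= ABORT_C:
        return None                       # abort: over the thermal limit
    result = experiment()
    while read_gpu_temp() > RESUME_C:     # cooldown gating between experiments
        time.sleep(poll_s)
    return result

temps = iter([70.0, 55.0])                # illustrative readings: safe, then cool
out = run_gated(lambda: "ok", lambda: next(temps))
print(out)
```

Gating on cooldown keeps each run starting from a comparable thermal state, so throttling in one experiment does not bias the power readings of the next.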
Limitations
GPU power only. Embodied emissions estimated. Grid intensity is static.
SCI (ISO/IEC 21031): the ISO standard for measuring software carbon emissions, scored against per-request functional units.