How much CO₂ does your LLM inference cost?
We autonomously tested inference configs on NVIDIA DGX Spark, scored with the ISO SCI standard. Each model is a line. Each config is a station.
Per-Model Analysis: Throughput vs Carbon
Each chart shows one model family. X = tok/s, Y = SCI. Point size reflects batch size. Log-log scale.
SCI Scaling Law: Predicting Frontier Model Carbon
Power-law regression fit on measured Qwen data, extrapolated to frontier models. SCI = 0.000207 × params^0.374 (R² = 0.979). Frontier parameter counts are best public estimates.
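The fitted curve can be reproduced with an ordinary linear regression in log-log space. A minimal sketch below: the points are synthetic samples placed exactly on the published curve, not the measured Qwen data, so the fit recovers the stated coefficients.

```python
import numpy as np

# Synthetic points on the published curve SCI = 0.000207 * params^0.374
# (parameter counts are placeholders, not the real Qwen measurements).
params = np.array([0.5, 1.5, 3.0, 7.0, 14.0, 32.0])
sci = 0.000207 * params**0.374

# A power law y = a * x^b is a straight line in log-log space:
# log y = log a + b * log x, so fit with a degree-1 polynomial.
b, log_a = np.polyfit(np.log(params), np.log(sci), 1)
a = np.exp(log_a)
print(f"SCI ≈ {a:.6f} × params^{b:.3f}")
```

The same fit on real data would carry residual scatter; R² = 0.979 in the dashboard indicates how tightly the measured points hug the line.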
3D Interactive Analysis
Drag to rotate, scroll to zoom, hover for details. Three dimensions reveal tradeoffs invisible in 2D.
Top 5 Greenest Configs
| # | Model | Quant | Batch | SCI (µgCO₂/tok) | BPB | tok/s | Pareto |
|---|-------|-------|-------|------------------|-----|-------|--------|
Leaderboards
Carbon Calculator
Same model, same hardware — different carbon cost based on where you run it.
Select Experiment
Grid Region
Recalculated SCI
SCI by Region
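The recalculation behind the calculator can be sketched as follows, assuming SCI splits into an operational part (energy × grid intensity) that scales with the region and a fixed embodied part that does not. The energy, embodied, and intensity values below are illustrative placeholders, not measured data.

```python
# Region recalculation sketch: only the operational term depends on the grid.
def recalc_sci(energy_kwh_per_tok, embodied_ug_per_tok, intensity_g_per_kwh):
    """Return SCI in µgCO₂ per token for a given grid carbon intensity."""
    operational_ug = energy_kwh_per_tok * intensity_g_per_kwh * 1e6  # g → µg
    return operational_ug + embodied_ug_per_tok

e = 2.0e-9   # kWh per token (hypothetical)
m = 0.05     # embodied µgCO₂ per token (hypothetical)
for region, intensity in [("ISO-NE", 210), ("Low-carbon grid", 55), ("Coal-heavy grid", 650)]:
    print(region, round(recalc_sci(e, m, intensity), 3), "µgCO₂/tok")
```

This is why the same experiment can differ several-fold in SCI across regions while its tok/s stays identical.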
Methodology
How we measure environmental impact. Transparent, reproducible, ISO-standard.
What is Autoresearch?
Autoresearch is an open-source framework by Andrej Karpathy that lets AI agents run research autonomously. Give an agent a small but real ML setup, and it experiments on its own — modifying code, training, checking results, keeping or discarding — while you sleep.
The agent edits train.py, trains for a fixed 5-minute budget, checks whether val_bpb improved, and repeats: roughly 12 experiments per hour, about 100 overnight. You wake up to a log of experiments and a better model.
We adapted the autoresearch pattern for inference benchmarking on DGX Spark. Instead of optimizing training loss, our agent explores inference configurations — batch sizes, quantization, sequence lengths — and measures carbon cost via the SCI standard.
You don't touch Python. You write a program.md, a Markdown file that instructs the agent. The agent reads it, runs experiments autonomously, and logs every result. The "code" is the program; the program is the prompt.
Single file to modify. Fixed time budget. Self-contained. One GPU, one file, one metric. No complex configs, no distributed training — just an agent doing science.
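The adapted loop can be sketched in a few lines. Everything below is a hypothetical stand-in for the agent's real harness: `run_experiment` would actually launch inference and log power, and the scores here are random placeholders rather than measurements.

```python
import itertools, random

def run_experiment(config):
    # Placeholder: the real harness runs inference with `config`,
    # samples power via nvidia-smi, and computes SCI per token.
    return {"sci": random.random(), "tok_s": random.random() * 100}

def search(budget=12):  # ~12 experiments/hour in the autoresearch pattern
    grid = itertools.product([1, 4, 8], ["fp16", "int8", "int4"], [512, 2048])
    best = None
    for batch, quant, seqlen in itertools.islice(grid, budget):
        result = run_experiment({"batch": batch, "quant": quant, "seq": seqlen})
        if best is None or result["sci"] < best["sci"]:
            best = {**result, "batch": batch, "quant": quant, "seq": seqlen}
    return best  # keep the greenest config seen so far

best = search()
print(best)
```

The shape mirrors the training version: one metric (SCI instead of val_bpb), a fixed budget, keep-or-discard after each run.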
The SCI Formula
SCI = ((E × I) + M) per R
E: Energy per token in kWh. nvidia-smi at 1 Hz, trapezoidal integration.
I: Grid carbon intensity in gCO₂/kWh. Default: Connecticut / ISO-NE (~210).
M: Embodied emissions. DGX Spark manufacturing (~200 kg CO₂e), amortized over 5 years.
R: Functional unit. Per token generated. Comparable across batch sizes.
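A worked instance of the formula, using illustrative numbers rather than measured results. The lifetime-token estimate used to amortize the embodied emissions is a loud assumption (a sustained 1,000 tok/s over the 5-year amortization window), not a figure from the methodology.

```python
# SCI = ((E × I) + M) per R, with R = one generated token.
E = 2.0e-9          # kWh per token (illustrative, from trapezoidal integration)
I = 210             # gCO₂/kWh, the ISO-NE default above
M_total = 200_000   # g CO₂e embodied (~200 kg DGX Spark manufacturing)

# Assumption: 1,000 tok/s sustained for 5 years to amortize M per token.
lifetime_tokens = 5 * 365 * 24 * 3600 * 1000
m_per_tok = M_total / lifetime_tokens   # embodied gCO₂ per token

sci_g = E * I + m_per_tok               # gCO₂ per token
print(f"SCI ≈ {sci_g * 1e6:.3f} µgCO₂/token")
```

Note how the embodied term can rival the operational term at low utilization, which is why the amortization assumption matters.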
Hardware
NVIDIA DGX Spark — Blackwell GB10 · ARM big.LITTLE · 128 GB LPDDR5X · 1 TB NVMe
Power Measurement
nvidia-smi at 1 Hz. Trapezoidal integration. GPU temp, clocks, fan RPM logged.
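A minimal sketch of the energy accounting: power samples taken at 1 Hz are integrated with the trapezoidal rule and converted to kWh. The sample values are illustrative, not logged data.

```python
def energy_kwh(power_watts, dt_s=1.0):
    """Trapezoidal integral of evenly spaced power samples → energy in kWh."""
    joules = sum((a + b) / 2 * dt_s for a, b in zip(power_watts, power_watts[1:]))
    return joules / 3.6e6  # 1 kWh = 3.6e6 J

# Five one-second samples around a ~100 W draw (illustrative)
samples = [95.0, 105.0, 100.0, 98.0, 102.0]
print(f"{energy_kwh(samples):.9f} kWh")
```

Dividing the resulting energy by tokens generated gives E in the SCI formula.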
Quality Metric
Validation BPB on held-out text. Lower = better compression quality.
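BPB can be computed from the model's total cross-entropy on the held-out text. A sketch, assuming loss is accumulated in nats and the text length is counted in bytes (which is what makes BPB comparable across tokenizers):

```python
import math

def bits_per_byte(nll_nats_total, n_bytes):
    """Convert total negative log-likelihood (nats) to bits per byte."""
    return nll_nats_total / math.log(2) / n_bytes

# Illustrative: 1000 nats of loss over a 1500-byte validation text
bpb = bits_per_byte(1000.0, 1500)
print(round(bpb, 4))
```

Because the denominator is bytes rather than tokens, quantization configs with different tokenizations remain directly comparable.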
Search Strategy
Grid → Random → Bayesian (Optuna). Auto-switches on convergence.
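The staged strategy can be sketched as a controller that exhausts a grid until improvement stalls, then falls back to random sampling. This is a simplified stand-in: the real system's final stage uses Optuna's Bayesian sampler, and the convergence check here is a plain patience counter.

```python
import itertools, random

def staged_search(space, objective, patience=5):
    history, stale = [], 0
    # Stage 1: grid search until `patience` configs in a row fail to improve
    for cfg in itertools.product(*space.values()):
        score = objective(dict(zip(space, cfg)))
        improved = not history or score < min(history)
        history.append(score)
        stale = 0 if improved else stale + 1
        if stale >= patience:
            break  # converged → switch stages
    # Stage 2: random sampling (Optuna's sampler would take over after this)
    for _ in range(10):
        cfg = {k: random.choice(v) for k, v in space.items()}
        history.append(objective(cfg))
    return min(history)

space = {"batch": [1, 4, 8], "quant": ["fp16", "int8", "int4"]}
best = staged_search(space, lambda c: c["batch"] * 0.01 + len(c["quant"]) * 0.1)
print(best)
```

Switching on convergence keeps the cheap exhaustive stage from burning the experiment budget once the grid stops paying off.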
Thermal Safety
85°C abort. Cooldown gating between experiments. Throttling flagged.
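The gating logic can be sketched as below. `read_gpu_temp` is a hypothetical stand-in for an nvidia-smi temperature query, and the 60 °C resume threshold is an assumed value, not one stated in the methodology.

```python
import time

ABORT_C, RESUME_C = 85.0, 60.0  # abort limit from above; resume level assumed

def run_gated(experiment, read_gpu_temp, poll_s=5):
    if read_gpu_temp() >= ABORT_C:
        return None                       # abort: over the thermal limit
    result = experiment()
    while read_gpu_temp() > RESUME_C:     # cooldown gating between experiments
        time.sleep(poll_s)
    return result

temps = iter([70.0, 55.0])                # illustrative readings: safe, then cool
out = run_gated(lambda: "ok", lambda: next(temps))
print(out)
```

Gating on cooldown keeps each run starting from a comparable thermal state, so throttling in one experiment does not bias the power readings of the next.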
Limitations
GPU power only. Embodied emissions estimated. Grid intensity is static.
SCI (ISO/IEC 21031): the ISO standard for measuring software carbon emissions, scored against per-request functional units.