How Far Can We Push AI Just by Making It Bigger?
A Tour of Scaling Laws, Data Limits, and Diminishing Returns
I have been on a long break from Substack, since writing posts felt like a lot of effort for very little reward. Recently, though, while doing a bit of AI-aided research for unrelated work, I ended up with material that reads like a blog post. Since publishing it took only a little extra effort, I have decided to post this heavily AI-assisted work here.
1. Why “scaling laws” matter
Over the last few years, something slightly eerie has shown up across dozens of AI experiments.
Take a model. Make it bigger. Feed it more data. Spend more compute.
Plot how the error drops as you scale things up — not in linear space, but on a log–log plot — and you keep seeing almost straight lines.
Those straight lines are called scaling laws. They say, in effect:
“If you increase model size / data / compute by a certain factor, the error will shrink in a predictable way.”
This isn’t just about language models. Similar patterns show up in:
machine translation
image classification
speech recognition
diffusion models for image generation
multimodal models (text + images + audio)
The report behind this post is essentially a large meta-analysis of those experiments. It pulls together exponents from many papers and asks:
How fast does performance improve when we scale model size?
How fast does it improve when we scale data?
How fast does it improve when we scale compute?
And, crucially: When do we start to run out of useful data?
This post is aimed at a general but curious audience: people who want to understand where brute-force AI scaling is likely to run out of steam.
2. Three knobs: model size, data, and compute
Most scaling-law papers look at some version of this setup:
P = model size (number of parameters)
D = dataset size (number of unique training examples / tokens)
C = compute (roughly FLOPs used in training)
L = some notion of “loss” or “error” (lower is better)
Empirically, lots of results can be summarized as:
Gap to perfect performance ≈ A·P^(-α_P) + B·D^(-α_D) + …
where:
α_P is the model-size exponent
α_D is the data-size exponent
Similarly, along a “reasonable” training recipe, we can talk about:
Gap ≈ K·C^(-α_C)
where α_C is the compute exponent.
The whole meta-analysis is basically:
“What are α_P, α_D, and α_C, across many domains and papers, once you ignore obviously under-trained or weird regimes?”
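To make that bookkeeping concrete, here is a minimal sketch of the loss decomposition in code. The floor E and the coefficients A and B (and the exact exponent values) are illustrative placeholders I chose for the example, not fitted values from the report:

```python
def toy_loss(P, D, E=1.7, A=400.0, B=410.0, alpha_P=0.30, alpha_D=0.29):
    """Toy scaling-law loss: an irreducible floor E plus power-law terms in
    model size P (parameters) and unique data D (tokens). All constants are
    illustrative placeholders, not fitted values from any particular paper."""
    return E + A * P ** (-alpha_P) + B * D ** (-alpha_D)

# With ample data, each 10x in parameters roughly halves the P-limited term.
for P in (1e9, 1e10, 1e11):
    print(f"P = {P:.0e}, D = 1e13  ->  loss ~ {toy_loss(P, 1e13):.3f}")
```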
3. What the papers broadly agree on
After going through a long list of studies (see the appendices), a picture emerges:
3.1 Model-size exponent: α_P
For modern, reasonably well-trained models (think Chinchilla-style language models and similar work), a good planning value is:
α_P ≈ 0.30
(with plausible range ~0.24–0.36)
What does that mean in plain language?
If you increase model size by 10× while keeping everything else “compute-optimal”, the improvable part of the loss scales roughly like:
10^-0.3 ≈ 0.5 → about a 50% reduction in the remaining gap to perfection.
Earlier language-model work (like the original GPT-3 era) often reported much smaller exponents, like α_P ≈ 0.07–0.1. That turns out to be mostly an artifact of under-training (too few tokens per parameter). Once you train models closer to optimally, the exponent steepens to the ~0.3 regime.
Broadly:
Old under-trained LM regime: α_P ≈ 0.07–0.1
Modern, data-adequate regime: α_P ≈ 0.25–0.35
Other domains (translation, vision, diffusion, multimodal) tend to land in that same 0.2–0.4 band for α_P, once they’re trained well.
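To see why that steepening matters, here is a quick back-of-the-envelope comparison using the exponent ranges above (plain arithmetic, nothing more):

```python
# Fraction of the remaining (improvable) gap removed by a 10x increase in
# model size, for the old under-trained exponents vs the modern regime.
for label, alpha_P in [("old under-trained LM regime", 0.08),
                       ("modern, data-adequate regime", 0.30)]:
    reduction = 1 - 10 ** (-alpha_P)
    print(f"{label}: alpha_P = {alpha_P:.2f} -> ~{reduction:.0%} of the gap per 10x parameters")
```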
3.2 Data-size exponent: α_D
Here the picture is surprisingly universal.
Across:
language modeling
machine translation
multilingual models
vision
diffusion image generation
even synthetic-data experiments
you keep seeing:
α_D ≈ 0.28–0.30
(with a reasonable band ~0.24–0.34)
So:
10× more unique data (at sufficiently large model size)
→ roughly halves the data-limited part of the error
(10^-0.29 ≈ 0.51)
Earlier small exponents (around 0.1) show up mainly in early language-model work that wasn’t trained on enough tokens.
Once you look at modern experiments, α_D clusters tightly around ~0.3.
3.3 Compute exponent: α_C
Compute is a bit trickier, because different papers define different scenarios.
Two main regimes:
Global compute frontier for language models
You scale P and D together sensibly as you increase total FLOPs.
In that regime, papers like Kaplan and Henighan find:
α_C ≈ 0.05–0.06
Meaning: 10× more compute along a good training recipe
→ only about a 10–13% reduction in the pretraining loss above the floor (since 10^-0.05 ≈ 0.89 and 10^-0.06 ≈ 0.87).
Fixed-data or special frontiers (e.g. ViTs)
When you hold data fixed and just scale compute/epochs, or look at particular frontiers (like some ViT studies), you see much larger exponents, around 0.3–0.4.
Useful for understanding those setups, but not directly the “big language model” frontier.
For “how far can brute-force training take the kind of LMs people care about”, the relevant number is:
α_C ≈ 0.05–0.06
which is soberingly small.
4. How data and parameters should scale together (β)
There’s a classical derivation: if
loss ≈ A·P^(-α_P) + B·D^(-α_D), and
compute C is roughly proportional to P·D,
then the compute-optimal relationship between tokens and parameters is:
D ∝ P^β, where β = α_P / α_D.
Using the modern LM-ish values:
α_P ≈ 0.30
α_D ≈ 0.28–0.30
you get:
β ≈ 1.0–1.3, with a central value around 1.1.
Interpretation:
To stay close to compute-optimal training, dataset size should grow at least linearly with model size, and probably a bit faster.
Example:
Say you have a 10¹¹-parameter model trained on D₀ tokens.
You go to 10¹² parameters (10× bigger).
With β ≈ 1.1,
D₁ / D₀ = 10^β ≈ 10^1.1 ≈ 12.6.
So:
10× more parameters → ~12–13× more tokens
if you want to train in an approximately compute-optimal way.
If D₀ was 2 trillion tokens, D₁ should be ~25 trillion tokens.
This is where data starts to become the real bottleneck.
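As a sanity check on that derivation, here is a small sketch that grid-searches the compute-optimal split for the toy loss above. The coefficients A and B are the same illustrative placeholders as before, and C ≈ k·P·D uses k = 6, a common rule of thumb for transformer training FLOPs (nothing below depends on the exact k):

```python
import numpy as np

alpha_P, alpha_D = 0.30, 0.28      # central-ish exponents from Section 3
A, B, k = 400.0, 410.0, 6.0        # illustrative placeholder coefficients; C ~ k*P*D

def compute_optimal(C):
    """Grid-search the P that minimizes A*P^-alpha_P + B*D^-alpha_D
    subject to the compute constraint D = C / (k * P)."""
    P = np.logspace(8, 13, 4000)
    D = C / (k * P)
    gap = A * P ** (-alpha_P) + B * D ** (-alpha_D)
    i = int(np.argmin(gap))
    return P[i], D[i]

(P1, D1), (P2, D2) = compute_optimal(1e22), compute_optimal(1e25)
empirical_beta = np.log(D2 / D1) / np.log(P2 / P1)
print(f"empirical beta ~ {empirical_beta:.2f}, analytic alpha_P/alpha_D = {alpha_P/alpha_D:.2f}")
print(f"10x more parameters -> ~{10 ** (alpha_P / alpha_D):.1f}x more tokens")
```

The grid search recovers the same slope as the closed-form β = α_P / α_D, which is the whole point of the derivation: the exponents alone pin down how fast token demand must grow.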
5. The looming “data wall”
Scaling laws work best when you count unique, high-quality tokens, not just “total tokens including repeats”.
Several papers look at what happens when you:
fix a dataset of size D_unique,
train for 1×, 2×, 4×, 8×, … many epochs over it.
The pattern:
At first, more epochs behave like extra data. You stay roughly on the same scaling curve.
After a while, you hit a repeated-data regime:
loss vs compute curves flatten,
the effective compute exponent becomes much smaller,
additional epochs give very little gain and can even hurt generalization.
Larger models can tolerate a bit more repetition before things really flatten, but they don’t make the problem go away.
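One way to picture that flattening is a toy model in which each additional epoch over the same unique tokens is worth geometrically less "effective data" than the last. The decay rate here is made up purely for illustration; it is not a fit from any of the repetition studies:

```python
def effective_tokens(unique_tokens, epochs, decay=0.6):
    """Toy model: epoch k over the same data is worth decay**(k-1) of a
    fresh pass. 'decay' is a made-up illustrative constant."""
    return unique_tokens * sum(decay ** k for k in range(epochs))

U = 2e12  # 2T unique tokens
for E in (1, 2, 4, 8, 16):
    print(f"{E:>2} epochs: ~{effective_tokens(U, E):.2e} effective tokens "
          f"(vs {U * E:.2e} raw tokens seen)")
```

The first couple of epochs look almost like fresh data; after that the effective total saturates, which is exactly the flattening the repetition papers report.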
Overlay that with a rough reality check:
The world only produces so much high-quality, human-generated text.
Estimates (from other work) tend to put the total stock of usable public text somewhere in the 10¹⁴ token ballpark. Current top-end runs are already using 10¹³ tokens per training run.
If β ≈ 1.1 and you keep scaling P up by orders of magnitude, the number of unique tokens you’d like to have for compute-optimal training quickly pushes into 10¹⁴–10¹⁵ territory.
So even without perfect numbers, you can say:
We’re probably one to two orders of magnitude in model size away from wanting more unique text than the obvious public web can provide.
At that point, you either:
accept heavy repetition and degraded returns,
get serious about licensed / proprietary data,
or lean more on synthetic data (with its own limitations).
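A back-of-the-envelope version of that squeeze, using only the rough figures from this section (about 10¹³ tokens per current frontier run, roughly 10¹⁴ tokens of usable public text, β ≈ 1.1), looks like this:

```python
beta = 1.1              # tokens should grow roughly like P^beta (Section 4)
D_now = 1e13            # rough tokens used by current top-end runs
data_stock = 1e14       # rough stock of usable public text (order of magnitude)

for scale_up in (10, 100, 1000):          # factor increase in parameters
    D_needed = D_now * scale_up ** beta   # compute-optimal token demand
    print(f"{scale_up:>5}x parameters -> ~{D_needed:.1e} tokens "
          f"({D_needed / data_stock:.1f}x the estimated public-text stock)")
```

Even a single 10x in parameters already wants more unique text than the estimated stock, and two orders of magnitude blows well past it.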
6. Does architecture change the story? What about MoE?
The meta-analysis also looks at whether exponents depend strongly on:
Domain (text vs images vs translation vs diffusion)
Architecture (CNNs, Transformers, Vision Transformers, diffusion U-Nets, DiTs)
Dense vs MoE (Mixture-of-Experts)
The short version:
Domain effects
Once you focus on modern, well-trained experiments, α_D ≈ 0.25–0.35 shows up everywhere: language, translation, vision, diffusion, synthetic vs real, etc.
α_P varies more by regime and metric, but again sits mostly in the 0.2–0.4 band when training is decent.
Diffusion and some image tasks sometimes show slightly larger α_P than language models, but the evidence is still thin.
Architecture family
Within Transformers, quite different variants (dense, Funnel, MoE-ish, etc.) usually share similar exponents; what changes is the intercept — how good they are at a given size.
Vision architectures (CNN vs ViT vs DiT) also yield exponents in similar ranges.
Dense vs MoE
MoE papers explicitly construct an “effective parameter count” N_eff and show that dense and MoE models lie on the same scaling curve in terms of N_eff.
MoE gives constant-factor gains:
better loss at fixed compute,
or same loss for less compute.
But the exponent α_P doesn’t really change.
So if you were hoping MoE or some clever architecture would magically turn α_P from ~0.3 into ~0.8, that’s not what the data shows so far. It’s about efficiency, not a new law of physics.
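To make the "intercept vs exponent" distinction concrete, here is a toy comparison of two model families that share α_P ≈ 0.3 but differ by a constant efficiency factor. The prefactors are made up for illustration:

```python
alpha_P = 0.30

def gap(P, A):
    """Improvable part of the loss for a family with prefactor A."""
    return A * P ** (-alpha_P)

A_dense, A_moe = 400.0, 300.0   # made-up prefactors: the MoE-style family is
                                # "cheaper" at every size, with the same exponent
for P in (1e9, 1e10, 1e11):
    print(f"P = {P:.0e}: dense gap {gap(P, A_dense):.3f}, "
          f"MoE-style gap {gap(P, A_moe):.3f}, ratio {gap(P, A_moe) / gap(P, A_dense):.2f}")
# The ratio stays constant (~0.75): a fixed-factor win, not a steeper curve.
```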
7. Diminishing returns in plain numbers
Putting the exponents together:
Model size (α_P ≈ 0.30)
10× parameters → ~50% reduction in the remaining gap to the floor (if you also scale data properly).
Data (α_D ≈ 0.29)
10× more unique tokens → ~50% reduction in the data-limited gap.
Compute (α_C ≈ 0.05–0.06)
10× compute on the global LM frontier → only a ~10–13% reduction in the gap.
That last one is the kicker. It means:
You can absolutely get better models by pouring in more compute and money.
But each extra order of magnitude buys you a smaller and smaller improvement in the pretraining loss.
The obvious questions then become:
How many extra “10–13%” steps are left before we hit domains where additional pretraining barely moves real-world performance?
And is that worth another 10×, then another 10×, then another 10× in GPUs, energy, and engineering?
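The percentages in this section are nothing more than 1 − 10^(−α) applied to the exponents above. For the record:

```python
exponents = {
    "model size (alpha_P)": 0.30,
    "unique data (alpha_D)": 0.29,
    "compute, LM frontier (alpha_C, low end)": 0.05,
    "compute, LM frontier (alpha_C, high end)": 0.06,
}
for name, alpha in exponents.items():
    reduction = 1 - 10 ** (-alpha)
    print(f"{name:<42} 10x -> ~{reduction:.0%} of the remaining gap removed")
```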
8. What this likely means for the next few years of AI
If you’re watching AI from the outside, the scaling-law picture suggests a few broad expectations:
Bigger models will keep helping — but not forever.
There is still real headroom in scaling P and D by another 1–2 orders of magnitude.
But you don’t get “AGI tomorrow” just by adding more zeros; you get predictably sublinear gains.
Data strategy will matter as much as compute strategy.
It’s not enough to hoard GPUs; you need massive, high-quality, diverse, deduplicated data pipelines.
Licensing, private corpora, multimodal data (images, audio, video), and good data cleaning all matter.
Repetition will be a central headache.
As you bump into the “data wall”, you’ll increasingly be forced into regimes where you’re repeating the same text many times.
That’s where scaling laws begin to break down and returns fall off sharply.
Architectural tricks are constant-factor wins.
MoE and other smart architectures can make models cheaper to run or train for a given quality.
But they don’t seem to offer a free exponent upgrade.
Algorithms and systems are where a lot of leverage is.
Better optimizers, schedulers, retrieval, compression, distillation, and data selection can shift the intercepts of these curves substantially.
Since the exponents themselves are stubbornly modest, those intercept shifts can easily matter more than another 2× in raw compute.
We should expect more careful experiments, not just bigger ones.
Some of the most informative work now will be deliberately designed scaling grids:
varying P and D across a range,
measuring where repeated-data effects kick in,
carefully comparing dense vs MoE at matched effective size.
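A deliberately designed grid can be as simple as a small table of (model size, unique data, repetition) settings spanning a few decades. A minimal sketch of what such a grid might look like, with sizes chosen arbitrarily for the example:

```python
import itertools

params = [1e8, 3e8, 1e9, 3e9]          # model sizes to sweep (example values)
unique_tokens = [1e10, 3e10, 1e11]     # unique-data budgets (example values)
epochs = [1, 4, 16]                    # repetition levels to probe the data wall

grid = [
    {"P": P, "D_unique": D, "epochs": E, "tokens_seen": D * E}
    for P, D, E in itertools.product(params, unique_tokens, epochs)
]
print(f"{len(grid)} runs, e.g. {grid[0]}")
```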
In other words: we’re moving from the phase of “throw more GPUs at it and see what happens” into a phase where data engineering, experimental design, and algorithmic finesse matter just as much as raw scale.
9. For the nerds: how solid is all of this?
The original report doesn’t just eyeball a few plots. It:
collates exponents from a large set of studies (language, MT, vision, diffusion, multimodal, transfer, synthetic data);
distinguishes between under-trained and well-trained regimes;
conceptually applies random-effects meta-analysis ideas (to handle variation between studies);
checks that differences like “0.08 vs 0.3” are too large to be explained by noise, and really do reflect different regimes.
The punchline of that more technical work:
α_P in the modern regime really does sit much higher than the old GPT-3 style exponents.
α_D is remarkably consistent across many domains.
α_C ≈ 0.05–0.06 for language models seems robust.
Nothing in the data suggests some easy way to “escape” diminishing returns through mainstream architecture tweaks.
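For readers who want the flavor of the random-effects step: given per-study exponent estimates and standard errors, a DerSimonian–Laird-style pooling looks roughly like the sketch below. The numbers in `studies` are invented placeholders purely to show the mechanics, not the report's actual inputs.

```python
import numpy as np

# (estimate, standard error) per study -- invented placeholders, NOT real data
studies = [(0.31, 0.03), (0.27, 0.04), (0.33, 0.05), (0.29, 0.02)]
y = np.array([est for est, _ in studies])
v = np.array([se ** 2 for _, se in studies])

# Fixed-effect (inverse-variance) pooled estimate and heterogeneity statistic Q
w = 1.0 / v
mu_fe = np.sum(w * y) / np.sum(w)
Q = np.sum(w * (y - mu_fe) ** 2)

# DerSimonian-Laird estimate of the between-study variance tau^2
k = len(y)
tau2 = max(0.0, (Q - (k - 1)) / (np.sum(w) - np.sum(w ** 2) / np.sum(w)))

# Random-effects pooling: weights flatten out as between-study variance grows
w_re = 1.0 / (v + tau2)
mu_re = np.sum(w_re * y) / np.sum(w_re)
se_re = np.sqrt(1.0 / np.sum(w_re))
print(f"tau^2 = {tau2:.4f}, pooled exponent = {mu_re:.3f} +/- {se_re:.3f}")
```

The same machinery, applied per regime, is what lets you say that the gap between "0.08-style" and "0.3-style" exponents is far larger than between-study noise.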
If you want to see the raw tables and all the nitty-gritty, they’re below.
Appendix A – Study-level summary table
The full tables behind this post are too wide and detailed to display comfortably in Substack, so I’ve linked public Gists instead.
(domains, architectures, metrics, notes): [link to GitHub]
Appendix B – Exponent-level data
(α_P, α_D, α_C entries and caveats): [link to GitHub]


