

Summary
- Inference-Time Scaling Is the New Frontier: AI progress is shifting to focus more on inference-time scaling techniques that refine responses after initial training.
- Deeper Reasoning, Longer Waits: Enabling models to break complex tasks into smaller steps during inference results in longer waiting times but better responses.
- Better Cost-Outcome Alignment: This approach better aligns Hyperscalers’ costs with performance and usage. It should also ensure more focus on real-world outcomes rather than potentially obscure benchmarks.
AI Fails Differently
You typically think of technology as improving tasks by being faster, cheaper, and more accurate than humans.
But AI is not only different from previous technologies; in some areas it is worse. Take hallucinations (incorrect answers), for example. Where technology was previously used to correct human error, AI adds system error. In return, you get a flexible technology that is more creative and ‘human sounding’ than before.
Recent AI developments have added another drawback – longer waiting times.
Scaling Capability
Driving AI progress is the application of scaling laws. These increase model intelligence by expanding three main components: data, parameters, and compute resources used during training.
Among these components, compute plays the most crucial role. In simple terms, dedicating ten times more compute to a model generally results in a considerable increase in its performance and intelligence.
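As a rough illustration of why more compute helps, but with diminishing returns, scaling laws are often written as a power law of training compute. The sketch below uses made-up constants (the scale, exponent, and irreducible-loss floor are illustrative assumptions, not fitted values from any published study):

```python
# Illustrative only: pre-training loss modelled as a power law in training compute.
# All constants below are assumptions chosen for readability, not fitted values.
def pretraining_loss(compute_flops: float,
                     scale: float = 11.0,      # assumed scale constant
                     exponent: float = 0.05,   # assumed compute exponent
                     floor: float = 1.7) -> float:  # assumed irreducible loss
    return floor + scale * compute_flops ** -exponent

for flops in (1e21, 1e22, 1e23):  # each step is 10x more training compute
    print(f"{flops:.0e} FLOPs -> loss {pretraining_loss(flops):.2f}")
# Each 10x step still lowers loss, but by a smaller absolute amount than the last.
```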
Most of this scaling is applied during pre-training. This is the initial stage in building large AI models, where the model learns general patterns from a wide range of data so it can perform tasks and respond to questions. Scaling pre-training has been a key driver of increased spend on leading semiconductors, datacentre networking equipment, and securing power supply.
Last year, fears arose that pre-training scaling was peaking due to AI labs running out of novel training data. Grok-3 has now disproved this, but the episode highlighted the need for high-quality synthetic data to perpetuate model improvements.
Then came the ‘DeepSeek event’. By popularising two new scaling laws, the innovations in the R1 and V3 models called into question the billions spent on leading AI labs. These were:
1) Post-training scaling:
After the initial training phase where AI learns general knowledge (like going to school), post-training techniques help customise these models for specific tasks or industries. This makes them more efficient, accurate, or specialised for particular fields like healthcare or law without requiring the enormous investment of building a model from scratch.
Examples of such techniques include:
- Fine-tuning: Adjusting a pre-trained model using additional data to make it more specialised for a specific task or industry (e.g., legal or medical language).
- Distillation: Training a smaller, more efficient model (student) to mimic a larger, more powerful model (teacher) while retaining most of its capabilities (a minimal sketch follows below).
- Reinforcement Learning (RL): Improving a model by rewarding good responses and discouraging bad ones, often based on human or AI feedback (e.g., RLHF, RLAIF).
One challenge with applying RL to LLMs is that providing objective feedback on text answers is difficult: examples are limited and subjectivity plays a much larger role than in, say, training a model to play Chess or Go.
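To make one of these techniques concrete, here is a minimal PyTorch sketch of a distillation loss: the student is trained to match the teacher’s softened output distribution alongside the usual cross-entropy on the true labels. The temperature and weighting values are illustrative assumptions, not any particular lab’s recipe.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      labels: torch.Tensor,
                      temperature: float = 2.0,   # assumed softening temperature
                      alpha: float = 0.5) -> torch.Tensor:  # assumed soft/hard weighting
    # Soft targets: match the teacher's probability distribution. The T^2 factor
    # keeps gradient magnitudes comparable across temperatures.
    soft = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * (temperature ** 2)
    # Hard targets: standard cross-entropy against the ground-truth labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard
```

For language models, the same loss is typically applied per generated token rather than per example.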
2) Inference-time scaling:
LLMs, like other technologies, have been designed to generate responses quickly. While this speed works for simple questions, it becomes problematic when questions require detailed reasoning. In LLMs, ‘reasoning’ is the ability to generate logically coherent responses based on statistical patterns in data rather than explicit logical inference.
Inference-time scaling allows a model to use extra processing time to break a complex question into smaller steps and evaluate multiple answers. This additional work results in more accurate and reliable responses. Three examples of inference-time scaling techniques are:
- Chain-of-thought (CoT) prompting: Solves problems step-by-step rather than immediately giving a final answer. A benefit of this process is that verified CoT data can be used as part of post-training and help alleviate data shortages.
- Sampling with majority vote: Generates several responses to a question and then chooses the most recurring response (a minimal sketch follows after this list).
- Dynamic resource allocation: Allocating more compute for harder problems vs easier ones.
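Below is a minimal sketch of how the first two techniques combine, often called ‘self-consistency’: sample several chain-of-thought answers and keep the most common one. The `generate` function is a hypothetical placeholder for whichever model or API is being called, and the prompt wording, dummy answers, and sample count are illustrative assumptions.

```python
import random
from collections import Counter

def generate(prompt: str) -> str:
    """Hypothetical placeholder for one sampled LLM completion.
    Replace with a call to whichever model or API you use."""
    # Dummy behaviour so the sketch runs end-to-end without a model.
    return random.choice(["Some reasoning...\nAnswer: 42",
                          "Other reasoning...\nAnswer: 42",
                          "Different reasoning...\nAnswer: 41"])

def answer_with_majority_vote(question: str, n_samples: int = 8) -> str:
    # Chain-of-thought prompt: ask the model to reason step by step and then
    # state its final answer after an 'Answer:' marker.
    prompt = f"{question}\nThink step by step, then give the final answer after 'Answer:'."
    answers = []
    for _ in range(n_samples):  # more samples = more inference-time compute
        completion = generate(prompt)
        answers.append(completion.split("Answer:")[-1].strip())
    # Majority vote: return the most recurring final answer.
    return Counter(answers).most_common(1)[0][0]

print(answer_with_majority_vote("What is 6 x 7?"))
```

Dynamic resource allocation then amounts to choosing the sample count (and how long each chain is allowed to run) based on how hard the question appears to be.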
Studies have shown that a smaller model using inference-time compute can outperform a model even 14x its size on intermediate math problems. From a technology perspective, this increases the need for both compute and memory.
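A rough back-of-envelope comparison shows why this can make economic sense. It assumes the common approximation that a transformer forward pass costs about two FLOPs per parameter per generated token; the model sizes, token counts, and sample count below are illustrative assumptions rather than figures from any specific study.

```python
# Rough approximation: forward-pass cost ~ 2 * parameters FLOPs per generated token.
def inference_flops(params: float, tokens: int, samples: int = 1) -> float:
    return 2.0 * params * tokens * samples

small = inference_flops(params=3e9,  tokens=1_000, samples=16)  # small model, 16 sampled answers
large = inference_flops(params=42e9, tokens=1_000, samples=1)   # 14x larger model, one answer
print(f"small model + sampling: {small:.1e} FLOPs vs large model one-shot: {large:.1e} FLOPs")
# The totals are comparable, which is how a smaller model given extra
# inference-time compute can compete with a much larger one.
```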
Inference-time scaling makes time an additional vector for improving AI, in contrast to speed, the more typical driver of technology-led efficiency gains.
A Better Alignment of Investment and Outcomes
Early on, the market questioned whether AI models would see widespread demand.
During Nvidia’s May 2024 earnings call, CEO Jensen Huang estimated that only 40% of AI chip usage was dedicated to inference, with training still dominating revenues. However, as AI adoption has grown, these concerns have faded. It has also become clear that reasoning through complex tasks can require up to 100x more compute than simple, one-shot responses.
As a result, industry focus has shifted from scaling AI training to optimising inference, with several key implications:
- Inference quality should keep improving: Yes, pre-training scaling remains relevant, but gains from inference-time scaling are still in the early stages. This means improvements should persist, which is crucial for agentic AI’s success.
- More focus on real-world needs, not benchmarks: Optimising for inference pushes models to generate responses that align with real-world needs, rather than just scoring well on sometimes arbitrary benchmarks.
- Better alignment of costs with outcomes for Hyperscalers: Scaling pre-training by adding more GPUs has uncertain returns, whereas a focus on delivering inference compute ensures investment is closely aligned with the business value generated from AI. This helps to explain why Microsoft passed on Stargate, instead focusing on optimising infrastructure for inference.
- Higher marginal costs for better performance: Unlike traditional software, AI systems require significantly more compute as reasoning depth increases. For a model provider, this means variable costs (inference) are rising relative to fixed costs (training).
- Strategic use of inference-time scaling: Not every response will require extra compute, so both users and business leaders must decide when deeper reasoning is worth the cost. In many cases a faster response will be preferable to a longer one; in others, the longer wait and extra compute will still beat the human alternative. UI developments should also help users understand the trade-off between speed and accuracy.