

Summary
- Despite significant strides in speed, cost, intelligence, and multi-modal capabilities, AI has yet to move from testing to widespread adoption.
- Key barriers to adoption include hallucinations, limited memory, reliance on good prompts, and poor real-world adaptability.
- For AI to deliver real business value, there must be a shift in focus from just improving models to integrating AI effectively within existing workflows, setting realistic expectations, and investing in proper training and safeguards.
AI’s Tech Lifecycle
In a past life, I used to help large firms implement new technology platforms. Once you have done a few, you get a sense of the project lifecycle. It typically goes something like this:
With AI, I think we are somewhere between the second and third phases. However, further progress seems difficult to come by.
For most of 2023 and 2024, the Nvidia GPUs required to run LLMs were in short supply. This triggered one of the most intense technological arms races since the dot-com era, as rapidly scaling AI infrastructure became an existential priority for major cloud providers. The situation has improved as hyperscalers significantly expand AI capacity for clients, though shortages persist.
So what’s driving the hold-up?
To understand this, we must look at how the technology itself has developed since the release of GPT-3.5 at the end of 2022. Models are now:
- Faster thanks to more compute, software optimisations, and architectural shifts (such as Deepseek v3’s mixture-of-experts technique). OpenAI’s o1 model, for instance, can produce a 200-word reply in up to five seconds (excluding reasoning time), roughly seven times faster than the original.
- Cheaper as falling cost per token reduces cost per query. For instance, OpenAI’s GPT-3.5 model cost $20 per million tokens, while today, the cheapest Llama model (3.2 3B) costs just $0.05 (Chart 1). That is a 99.75% reduction in just two years! More recent models, such as Deepseek’s v3 and the Gemini Flash 2.0 model, have reduced costs further.
- ‘Smarter’ thanks to scaling across pre-training, post-training (e.g., RLHF), and inference-time compute (Chart 2). Models can now better:
- Follow multi-step instructions without getting confused.
- Solve math and coding problems with fewer errors.
- Maintain context throughout longer conversations.
- Understand nuanced questions and provide more relevant answers.
- Identify logical flaws in arguments more accurately.
- Synthesise information from multiple sources to form a coherent summary.
- Multi-modal, unlike text-only early LLMs such as GPT-3 and the original Claude models. Newer models can now:
- Process images alongside text: More recent models can ‘read’ images and discuss their content, recognise objects, read text within images, analyse charts/graphs, and understand visual context.
- Understand audio input: Some models can process spoken language, transcribe audio, and respond to voice commands directly.
- Generate and edit images: Models like DALL-E and Stable Diffusion can create images from text descriptions, while newer systems can modify existing images based on instructions.
- Process video: Understand video content, including analysing movement, temporal sequences, and dynamic visual information.
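The cost point above reduces to simple arithmetic. A minimal sketch, using the per-million-token prices quoted earlier (the token counts are illustrative assumptions):

```python
def query_cost(prompt_tokens, completion_tokens, price_per_million):
    """Cost of one query given a flat price per million tokens."""
    return (prompt_tokens + completion_tokens) * price_per_million / 1_000_000

# Illustrative query: a 500-token prompt with a 300-token reply.
old = query_cost(500, 300, 20.00)  # GPT-3.5-era pricing: $20 per million tokens
new = query_cost(500, 300, 0.05)   # Llama 3.2 3B pricing: $0.05 per million tokens

print(f"old: ${old:.4f}, new: ${new:.6f}")  # old: $0.0160, new: $0.000040
```

At these prices, a query that once cost over a cent now costs a few thousandths of a cent, which is why per-query cost has largely stopped being the binding constraint.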
Optimists point to the ever-increasing capability of models as the driver of success. ‘After all, who wouldn’t want a 130 IQ helper with you 24/7?’
Issues With LLMs
Despite recent improvements, AI adoption faces challenges. Lower costs alone have not solved the issue, as the ROI for AI implementation remains questionable, and successful use cases are scarce.
We highlight seven reasons why:
Hallucinations/inaccuracies. LLMs still fabricate answers, preventing rollout in many industries. This remains a key pain point. The legal industry is a great case study. AI should, in theory, be perfect for finding precedent, automating the drafting of contracts, and even helping to pass the bar exam! Yet a single error can cost millions of dollars in fees or added costs, damage trust, and add verification overheads.
Reliance on good prompting. Users often struggle to derive full benefits due to insufficient training or awareness of effective prompting techniques. Prompts are often too short or unclear, hurting response quality. This is compounded by unknowns: why should being polite, or asking for the result in a specific format, improve output quality?
We provide an example of two prompts here. In both cases, ChatGPT is asked to describe what drives FX markets, with the second, more detailed and specific, prompt leading to a better answer.
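The contrast between a vague and a well-specified prompt can be made concrete. A minimal sketch, where both prompt strings are illustrative rather than the exact prompts used in our example:

```python
# A vague prompt: short, no role, no scope, no output format.
vague = "What drives FX markets?"

# A detailed prompt: role, scope, and output format are all specified.
detailed = (
    "You are a macro strategist writing for a professional audience. "
    "Explain the main drivers of FX markets, covering interest-rate "
    "differentials, growth and inflation surprises, terms of trade, and "
    "risk sentiment. Answer in four short bullet points, each with one "
    "recent example."
)

# The detailed prompt constrains the model far more tightly, which is
# what tends to lift response quality.
print(len(vague.split()), "words vs", len(detailed.split()), "words")
```

The extra words are not padding: each clause removes a degree of freedom the model would otherwise fill with a generic answer.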
Models are too generalised and not specialised. LLMs are yet to become a real product. Many developments enhance general features while leaving implementation to the business. This contrasts with specialist software solutions, such as QuickBooks or SAP, that are meticulously designed for specific tasks like accounting, inventory management, or supply chain optimisation. Their strength lies in a deep understanding of niche requirements, regulatory compliance, and the ability to automate well-defined workflows.
Poor real-world adaptability: More recent iterations of LLMs can reason, breaking down complex queries into manageable steps to achieve more accurate and logical conclusions. Despite this, LLMs are still unable to perform simple tasks outside of their training set. They struggle even with simple games like Pokémon, designed with young children in mind, due to issues with real-time decision-making, context-length limitations, and poorly defined reward structures (unlike a game such as Go, where the objective is explicit). It is hard to see AI performing in a more agentic capacity without progress in these areas, leaving it little better than scripts or RPA software.
Memory issues. AI lacks persistent memory beyond saving past conversations. Unlike a human assistant who builds a deeper understanding of you over time, AI doesn’t connect the dots between your preferences, habits, and needs. Persistent memory would help offer a seamless, tailored, and personal experience that feels less like a chatbot and more like a real assistant.
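The persistent memory described above can be sketched as a simple preference store that survives across sessions. This is a toy illustration of the idea, not how any production assistant is built; the file name and keys are assumptions:

```python
import json
from pathlib import Path

class UserMemory:
    """Toy persistent memory: facts about a user saved between sessions."""

    def __init__(self, path="user_memory.json"):
        self.path = Path(path)
        # Reload whatever was remembered in earlier sessions, if anything.
        self.facts = json.loads(self.path.read_text()) if self.path.exists() else {}

    def remember(self, key, value):
        self.facts[key] = value
        self.path.write_text(json.dumps(self.facts))

    def context_for_prompt(self):
        # Prepend stored facts to a prompt so the model can personalise answers.
        return "\n".join(f"- {k}: {v}" for k, v in self.facts.items())

mem = UserMemory()
mem.remember("preferred_format", "bullet points")
mem.remember("asset_focus", "FX and rates")
print(mem.context_for_prompt())
```

The hard part is not storage but deciding what to remember and how to surface it; current chat products mostly stop at saving transcripts.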
Getting AI right is resource intensive. We would know! In our efforts to build out our Macro Hive LLM, we have found it takes time and effort to onboard the right talent, sufficiently cleanse and correctly label large data sets, and integrate with legacy systems. AI might be a productivity-boosting tool in time, but the initial stages are labour intensive.
Reluctance to change workflows. Lastly, as with all new technology implementations, there are people challenges. This can include resistance to changing established processes, technological knowledge gaps, and comfort with the status quo, creating significant friction. We expect these barriers to fade.
Overall, these issues have led pessimists to highlight a persistent expectations gap between what AI optimists claim AI can do and its actual capability.
When Will AI Work?
As of late 2024, nearly 40% of the U.S. population aged 18-64 said they used generative AI, while search interest in ChatGPT continues to make new highs more than two years post-release. AI has also already upended multi-billion dollar industries such as ed-tech and resulted in widespread concerns over Google’s dominance in search.
Also, companies have started to report use cases where AI has delivered significant cost savings.
For example, Walmart has used AI to automatically improve and update its product catalogue, power smart shopping assistants, and provide quick, helpful answers to common questions. This task would ordinarily have taken Walmart much longer to complete, at significant cost.
Another example is Axon Enterprise, which provides smart monitoring equipment and tasers for police officers. Its Draft One product transcribes audio from body-worn cameras to create initial drafts of reports, reducing manual paperwork. Beyond the time saved, officers can focus more on public safety, while human review ensures accuracy.
Yet, these use cases are few and far between. So what do we need to get AI to work?
Optimists suggest AI tools are still improving and that use cases will emerge naturally. But what does it mean for a model to improve? Even if a model’s accuracy rises from 90% to 95%, the risks remain too high in many fields – and we are still nowhere near even that. At this stage, it seems reasonable to assume that hallucinations are an inevitable feature of these systems, particularly when responding to questions with a clear ‘right’ answer. Yes, reasoning helps, making an AI fact-check itself helps, and so do techniques such as retrieval-augmented generation (RAG). However, AI will never be perfect.
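Of the mitigations mentioned, RAG is the most mechanical: retrieve trusted documents first, then instruct the model to answer only from them. A minimal keyword-overlap sketch of the idea (the document set and scoring are purely illustrative; real systems use embedding search):

```python
DOCS = [
    "Interest-rate differentials are a key driver of currency pairs.",
    "Central bank balance sheets expanded sharply after 2008.",
    "Carry trades fund positions in low-yield currencies.",
]

def retrieve(question, docs, k=2):
    """Rank documents by naive keyword overlap with the question."""
    q_words = set(question.lower().split())
    scored = sorted(docs, key=lambda d: -len(q_words & set(d.lower().split())))
    return scored[:k]

def grounded_prompt(question):
    # Constraining the model to retrieved context is what curbs hallucination.
    context = "\n".join(retrieve(question, DOCS))
    return f"Answer ONLY from the context below.\n\nContext:\n{context}\n\nQ: {question}"

print(grounded_prompt("What drives currency pairs?"))
```

Even this crude version shows the trade-off: answers become more verifiable, but only as good as the retrieved documents.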
So, if LLMs are worth adopting, business leaders must accept that the cost-benefit analysis must include the estimated cost of a wrong answer and the human overhead required for quality assurance. And why should AI be held to a higher standard than humans? Human error exists, so AI should, where possible, be held to the same standard, not a higher one.
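That cost-benefit framing can be written down directly. A sketch with illustrative figures (every number below is an assumption for the worked example, not data from this note):

```python
def net_value_per_task(time_saved_value, error_rate, cost_per_error, review_cost):
    """Expected net value of delegating one task to an LLM.

    time_saved_value: value of the labour saved versus doing it manually
    error_rate:       share of outputs containing a costly mistake
    cost_per_error:   expected loss when a mistake slips through
    review_cost:      human QA overhead per task
    """
    return time_saved_value - error_rate * cost_per_error - review_cost

# Illustrative: $50 of labour saved, 2% error rate, $1,000 per error, $10 of review.
print(net_value_per_task(50, 0.02, 1000, 10))  # 20.0
```

The same arithmetic explains why high-stakes fields lag: push `cost_per_error` into the millions, as in the legal example above, and even a tiny error rate swamps the labour saved.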
We also know AI, in its current guise, is unlikely to advance to human-level reasoning. AI insights will always be borrowed from human insights. Take Deep Research, for example. It is very good at telling you ‘what’ but not very good at the ‘so what’. So it can brief you on a topic (assuming the answer is correct) but struggles to provide salient implications.
Bridging the AI Gap
To bridge the gap, we provide three requirements:
A clear understanding of LLMs’ strengths and limitations is essential for businesses seeking broader AI integration. Without recognising potential pitfalls, successful implementation remains elusive.
The integration of LLMs within existing software platforms should be prioritised with appropriate restrictions and guardrails. Rather than adding more features, the focus should be on productising AI’s multi-modal capabilities to enhance user experience while minimising risks. Examples include education and training, advertising, and customer service/CRM.
Suitable training is crucial for AI success. Since output quality varies significantly based on input quality, organisations must invest in developing effective prompting techniques to maximise AI’s potential.