The Memory Wall and the Next Wave of AI Infrastructure Expansion

April 22, 2026

As token volumes explode at hyperscale, memory bandwidth — not raw compute — is becoming the binding constraint in AI inference. Karmel explores why specialized chip architectures may define the next infrastructure wave.

Antolin Garza

Partner, Karmel Capital

Tokens sit at the heart of AI inference. They are the fundamental units, such as words, punctuation, and subword pieces, that a model processes on every query. The token count directly determines compute usage, inference latency, and cost. The number of tokens Google processes each month has accelerated meaningfully in recent months, reflecting explosive real-world demand at hyperscale.

The true constraint in AI is shifting from raw compute to memory. The global memory market is projected to reach ~$300B by 2027, nearly double the 2024 level, with DRAM making up two-thirds of the market. Global DRAM capex is expected to more than double from 2024 to 2026, yet supply is still expected to barely keep pace with demand.
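The projections above can be turned into implied annual growth rates with a short calculation. This is a minimal sketch that uses only the multiples and years quoted in the text; the helper name `implied_annual_growth` is illustrative, and treating growth as a constant yearly rate is a simplifying assumption.

```python
def implied_annual_growth(multiple: float, years: float) -> float:
    """Constant annual growth rate that turns 1.0 into `multiple` over `years`."""
    return multiple ** (1.0 / years) - 1.0


# Memory market: "nearly double" from 2024 to 2027 (3 years),
# so an exact doubling gives an upper bound on the implied rate.
market_cagr_ceiling = implied_annual_growth(2.0, 3)  # about 0.26

# DRAM capex: "more than double" from 2024 to 2026 (2 years),
# so an exact doubling gives a lower bound on the implied rate.
capex_cagr_floor = implied_annual_growth(2.0, 2)  # about 0.41

print(f"memory market: just under {market_cagr_ceiling:.0%} per year")
print(f"DRAM capex: more than {capex_cagr_floor:.0%} per year")
```

Read this way, capex is projected to grow faster than the market it serves, which fits the claim that supply is struggling to catch up rather than running ahead.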

The memory wall explains the issue: processor compute power has doubled roughly every two years, while memory bandwidth has grown only elevenfold and capacity about eighteenfold. Meanwhile, AI models have scaled ~1,200x, from 1.5 billion parameters in 2019 to a reported ~1.8 trillion in GPT-4. This mismatch creates severe bottlenecks: even powerful GPUs sit idle upward of 60% of the time on typical workloads.
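A back-of-envelope estimate for single-stream text generation shows why accelerators end up waiting on memory. The sketch below assumes roughly 2 floating-point operations per parameter per generated token, and that every weight is read from memory once per decoding step. It ignores the KV cache, activations, and interconnect. The `Accelerator` figures and the 70-billion-parameter model are hypothetical round numbers chosen for illustration, not the specifications of any real chip or model.

```python
from dataclasses import dataclass


@dataclass
class Accelerator:
    peak_flops: float      # peak arithmetic throughput, FLOP/s (hypothetical)
    mem_bandwidth: float   # memory bandwidth, bytes/s (hypothetical)


def decode_step_times(params, bytes_per_param, hw, batch_size=1):
    """Return (compute_seconds, memory_seconds) for one decoding step.

    Assumes ~2 FLOPs per parameter per token, and that the weights are
    streamed from memory once per step and shared by the whole batch.
    """
    compute_s = 2.0 * params * batch_size / hw.peak_flops
    memory_s = params * bytes_per_param / hw.mem_bandwidth
    return compute_s, memory_s


def compute_utilization(params, bytes_per_param, hw, batch_size=1):
    """Fraction of the step during which the arithmetic units are busy."""
    compute_s, memory_s = decode_step_times(params, bytes_per_param, hw, batch_size)
    return compute_s / max(compute_s, memory_s)


# Parameter growth quoted in the text: 1.5 billion to ~1.8 trillion.
param_ratio = 1.8e12 / 1.5e9  # about 1,200x

# Hypothetical accelerator: 1 PFLOP/s peak, 3 TB/s memory bandwidth.
hw = Accelerator(peak_flops=1e15, mem_bandwidth=3e12)

# Hypothetical 70B-parameter model stored at 2 bytes per parameter.
for batch in (1, 16, 256):
    util = compute_utilization(70e9, 2.0, hw, batch_size=batch)
    print(f"batch={batch:>3}  compute utilization ~ {util:.1%}")
```

Under these assumptions the arithmetic units are busy well under 1% of the time at batch size 1, because moving the weights takes far longer than the math performed on them. Larger batches raise utilization since one weight read serves more tokens, although in practice the per-request KV cache, which this sketch omits, adds its own memory traffic.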

New base-layer chip technologies offer a solution by attacking the memory wall directly. Specialized memory architectures can decouple capacity and bandwidth from traditional GPU designs, delivering major gains in real-world performance. With hyperscalers committing $600B–$700B to AI infrastructure in 2026 alone, the opportunity remains firmly in its early innings.