2 Comments
Rainbow Roxy

This article comes at the perfect time, as I've been thinking deeply about the practical challenges of agentic AI deployments beyond the hype. Your hypothesis about control and memory being the real bottleneck, rather than raw model capability, is truly insightful; I wonder what specific architectural approaches or memory mechanisms are proving most effective in bridging that significant gap between benchmark scores and real-world enterprise task completion.

Sam Keen

Great question. We do see a lot of bench-maxing, and on many benchmarks 3-5 models may be within 2-10% of each other, so once you move away from the benchmark into real-world (highly variable) workflows, who knows which of those models will perform better. I think it really comes down to good evals, something I need to dig into more.