Deep Engineering #40: Karun Thankachan on Small Language Models in production
How to train SLMs to reason well, route across specialized models, and govern correctness at scale
✍️ From the editor’s desk,
Welcome to the 40th issue of Deep Engineering!
The conversation around AI infrastructure is heating up. NVIDIA this week pointed to more than $1 trillion in AI chip revenue opportunity by 2027, explicitly tying that outlook to growing demand for inference. When inference at scale becomes a boardroom number, model efficiency stops being a niche engineering concern and starts shaping cost control, product design, and build-versus-buy decisions. That is the context in which small language models are becoming harder to ignore.
The case is not that general-purpose LLMs are going away. It is that a large share of production workloads (classification, structured extraction, and domain-specific reasoning) do not need a model of that size. Serving a 7-billion-parameter SLM costs 10 to 30 times less than serving a 70-billion-parameter LLM, and that gap compounds across millions of requests. For engineering leaders, the question is no longer whether SLMs belong in the stack. It is how to train them to reason well, how to route across multiple specialized models at runtime, and how to govern correctness when AI is accelerating the pace of change.
This week’s feature is based on a conversation with Karun Thankachan, Senior Data Scientist at Walmart. We dig deep into how to treat reasoning as a budgeted resource, when routing across specialized models beats relying on a single large one, and why context engineering is winning over fine-tuning for many teams right now.
We also have a thought leadership piece by Saqib Jan with Lee Peterson, VP of Secure WAN Product Management at Cisco, on why the network layer underneath agentic systems matters just as much as the models running on top of it.
Let’s get started.
The Case for Small Language Models in Real Systems
by Deepayan Bhattacharjee with Karun Thankachan.
Karun Thankachan’s conversation with us keeps circling back to one production truth:
The “best” model is the one that meets your accuracy target inside your cost and latency budget.
In early phases of a product, teams often use large general-purpose LLMs to explore what’s possible. But once the user experience stabilizes, the engineering focus shifts – toward predictable latency, controllable spend, and repeatable quality. That shift is where small language models (SLMs) start to matter more than giant LLMs, because many real systems don’t need universal reasoning – they need reliable, narrow reasoning at scale.
Recent research and releases support this trajectory. For example, Zihao An et al. (Jan 20, 2026) show that a 0.6B-parameter reasoning model (ReasonLite) can reach 75.2% accuracy on AIME 2024, and they attribute it to distillation and training recipe design rather than sheer parameter count. This is exactly the kind of result that makes SLMs a serious production option once you know what task you need to solve.
Cost-effective reasoning is a budget problem
Thankachan’s core argument is not that LLMs are bad. It’s that general-purpose reasoning is expensive overkill for many high-volume workflows (customer support classification, catalog normalization, routing, anomaly triage, internal copilots for one domain). When a feature goes from thousands of calls to millions, the economics change.
A key point from the research community is that reasoning often becomes expensive because teams implicitly pay for long reasoning traces even when a request is simple. Kun Liang et al. (Jan 13, 2026) describe this problem directly: long-form chain-of-thought reasoning can improve accuracy, but applying “overlong reasoning” uniformly at inference time creates “substantial and often unnecessary computational cost.” They propose ORBIT, which aims to make reasoning behavior controllable across multiple “budgets.”
That maps cleanly to production reality: you don’t want a single mode of intelligence. You want a system that can do “fast and cheap” when it can, and “slow and deep” only when it must.
Teaching SLMs to reason without bloating latency
The approach Thankachan describes in the interview treats SLM training like an engineering workflow: combine distillation techniques, program-aided verification, and evaluation probes to close the gap between a teacher model and a smaller student.
What’s notable is that the same idea is showing up in recent public work: reasoning performance in small models is increasingly a function of distillation data quality, training curriculum, and trace control, not only scale.
In the AMD ReasonLite write-up, the authors describe a two-stage distillation curriculum: first fine-tune on short chain-of-thought for efficiency, then fine-tune on long chain-of-thought for peak accuracy. They also report that the “Turbo” variant tends to generate shorter outputs and offers a better efficiency/accuracy balance, while the long-trace variant reaches higher peak performance. This lines up with the production emphasis Thankachan describes: if your model is trained to be verbose, it will be slow and costly; if you want predictable serving, you must treat thinking length as a first-class knob.
A practical pattern for senior engineers is to separate training-time compute from serving-time compute:
Use heavier techniques (like generating multiple traces, voting, or deeper teacher reasoning) during dataset creation and training, where batching and offline pipelines reduce the pain.
Enforce strict budgets at inference (max reasoning tokens, max latency, max retries) and treat budget violations as signals to improve training data or shorten traces.
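As an illustration of the training-time side of that split, here is a minimal sketch of multi-trace voting during dataset creation. It assumes a hypothetical `sample_trace` callable that asks a teacher model for one (reasoning, answer) pair; the function names, the trace count, and the agreement threshold are illustrative choices, not details from the interview or the cited papers.

```python
from collections import Counter
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical teacher interface: returns one (reasoning, answer) sample.
SampleTrace = Callable[[str], Tuple[str, str]]


def _normalize(answer: str) -> str:
    return answer.strip().lower()


def majority_vote(answers: List[str]) -> Tuple[Optional[str], float]:
    """Return the most common answer and the fraction of samples that agree."""
    if not answers:
        return None, 0.0
    counts = Counter(_normalize(a) for a in answers)
    answer, votes = counts.most_common(1)[0]
    return answer, votes / len(answers)


def build_distillation_example(
    question: str,
    sample_trace: SampleTrace,
    n_traces: int = 8,          # assumed value, tune per task
    min_agreement: float = 0.6,  # assumed value, tune per task
) -> Optional[Dict[str, str]]:
    """Sample several teacher traces offline and keep one training example.

    Questions where the teacher disagrees with itself are dropped, and the
    shortest trace that reaches the voted answer is kept so the student is
    not trained to be verbose.
    """
    traces = [sample_trace(question) for _ in range(n_traces)]
    answer, agreement = majority_vote([a for _, a in traces])
    if answer is None or agreement < min_agreement:
        return None
    winning = [r for r, a in traces if _normalize(a) == answer]
    return {"question": question, "reasoning": min(winning, key=len), "answer": answer}
```

Because this runs offline, the extra teacher calls cost batch compute rather than user-facing latency.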
The leadership takeaway: budgeting belongs in the API contract, not just in infrastructure dashboards after costs explode.
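One way to picture a budget that lives in the contract is a small typed object passed with every request. The sketch below is an assumption-laden illustration: `ReasoningBudget`, `BudgetedResult`, `run_with_budget`, and the shape of the `generate` callable are all hypothetical names, and latency is checked after each call returns rather than by interrupting generation.

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional, Tuple

# Hypothetical model interface: (prompt, max_tokens) -> (text, tokens_used, finished)
Generate = Callable[[str, int], Tuple[str, int, bool]]


@dataclass(frozen=True)
class ReasoningBudget:
    max_reasoning_tokens: int
    max_latency_s: float
    max_retries: int


@dataclass
class BudgetedResult:
    text: Optional[str]
    attempts: int
    violation: Optional[str]  # None, "tokens", or "latency"


def run_with_budget(
    generate: Generate,
    prompt: str,
    budget: ReasoningBudget,
    clock: Callable[[], float] = time.monotonic,
) -> BudgetedResult:
    """Call the model under a budget and report which limit was hit, if any."""
    start = clock()
    attempts = 0
    # One initial attempt plus up to max_retries further attempts.
    for attempts in range(1, budget.max_retries + 2):
        text, tokens_used, finished = generate(prompt, budget.max_reasoning_tokens)
        if clock() - start > budget.max_latency_s:
            return BudgetedResult(None, attempts, "latency")
        if finished and tokens_used <= budget.max_reasoning_tokens:
            return BudgetedResult(text, attempts, None)
        # Otherwise the trace was cut off at the token cap; try again.
    return BudgetedResult(None, attempts, "tokens")
```

Logging the `violation` field per request gives a direct signal for the loop described above: frequent "tokens" violations point at traces that are too long for the task.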
Orchestrating many small models beats one giant model
The SLM-Fusion framing Thankachan describes – routing across specialized models, merging where it helps, and wrapping the whole thing behind an OpenAI-compatible gateway – reflects a broader trend:
Modern systems are becoming mixtures, not monoliths.
Recent work in retrieval-heavy QA shows why. Yasaman Zarrinkia, Venkatesh Srinivasan, and Alex Thomo (Mar 18, 2026; v2) report a striking result when evaluating a Graph-RAG system: 77%–91% of questions contain the gold answer in the retrieved context, yet end-to-end accuracy is only 35%–78%, and 73%–84% of errors are reasoning failures.
That finding matters for architecture decisions. It implies that “better retrieval” alone isn’t enough; you need better reasoning structure and better context shaping. The same paper proposes two augmentations – structured prompting plus graph-walk compression – and then shows something that should make engineering leaders pay attention: with question-type routing, an open-weight Llama-8B configuration can match or exceed an unaugmented Llama-70B baseline across multiple benchmarks at about 12× lower cost.
This is the pragmatic case for SLM-centric production:
Route by intent/domain and cost budget, not just by “best model available.”
Specialize models for stable tasks (where fine-tuning or distillation pays off).
Escalate to a larger model only when uncertainty is high or stakes demand it.
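The three rules above can be sketched as a small routing function. This is a minimal illustration, not the SLM-Fusion implementation: it assumes the intent label is already known, that each specialist returns a confidence score in [0, 1], and that a larger fallback model exists. `Route`, `route_request`, and the threshold values are hypothetical, and the cost-budget dimension is left out for brevity.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Tuple

# Hypothetical model interface: text -> (answer, confidence in [0, 1])
ModelFn = Callable[[str], Tuple[str, float]]


@dataclass
class Route:
    model: ModelFn         # specialized small model for one intent
    min_confidence: float  # below this, escalate to the larger model


def route_request(
    text: str,
    intent: str,
    routes: Dict[str, Route],
    fallback: ModelFn,
    high_stakes: bool = False,
) -> Tuple[str, str]:
    """Return (answer, served_by) for one request."""
    route = routes.get(intent)
    # Unknown intents and high-stakes requests go straight to the large model.
    if route is None or high_stakes:
        answer, _ = fallback(text)
        return answer, "fallback"
    answer, confidence = route.model(text)
    # Escalate only when the specialist is unsure.
    if confidence < route.min_confidence:
        answer, _ = fallback(text)
        return answer, "escalated"
    return answer, intent
```

Tracking how often requests end up "escalated" per intent shows which specialists are candidates for more fine-tuning or distillation.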
Why context engineering is beating fine-tuning right now
Thankachan argues that industry attention is shifting toward RAG and context engineering because they can be cheaper and easier to operationalize than constant fine-tuning, especially when domain knowledge changes frequently.
The Graph-RAG results above reinforce this. Even when the answer is present in retrieved context, the failure mode is often reasoning over context, not retrieval itself. That pushes teams toward two concrete engineering priorities:
Structure the prompt so the model reasons in a way that matches the data representation (for Graph-RAG, the paper's authors use SPARQL-style decomposition).
Compress context so the model sees less noise and wastes fewer tokens, while still preserving the minimal information needed to answer.
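To make the two priorities concrete, here is a toy sketch. It does not reproduce the paper's graph-walk compression or its SPARQL-style prompts. It swaps in a much simpler stand-in: rank retrieved facts by word overlap with the question, keep what fits in a token budget, and ask the model to decompose the question. All function names are hypothetical, and word counts stand in for real tokenizer counts.

```python
import re
from typing import List


def _tokens(text: str) -> List[str]:
    return re.findall(r"[a-z0-9]+", text.lower())


def compress_context(question: str, facts: List[str], max_tokens: int) -> List[str]:
    """Keep the facts that overlap most with the question, within a budget.

    Simple lexical heuristic used only for illustration; facts with no
    overlap are dropped, and kept facts stay in their original order.
    """
    q = set(_tokens(question))
    overlap = [len(q & set(_tokens(f))) for f in facts]
    ranked = sorted(range(len(facts)), key=lambda i: (-overlap[i], i))
    kept, used = [], 0
    for i in ranked:
        cost = len(_tokens(facts[i]))
        if overlap[i] > 0 and used + cost <= max_tokens:
            kept.append(i)
            used += cost
    return [facts[i] for i in sorted(kept)]


def build_prompt(question: str, facts: List[str]) -> str:
    """Structured prompt: facts first, then an explicit decomposition request."""
    lines = ["Facts:"]
    lines += [f"- {fact}" for fact in facts]
    lines += [
        "",
        "Break the question into sub-questions, answer each one using only "
        "the facts above, then state the final answer.",
        f"Question: {question}",
    ]
    return "\n".join(lines)
```

For example, `build_prompt(q, compress_context(q, retrieved_facts, max_tokens=200))` produces a prompt whose context is bounded regardless of how much the retriever returned.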
For architects, the key decision lens is: fine-tuning changes the model; context engineering changes the problem presented to the model. When requirements or source documents change weekly, the second approach often has a faster operational loop.
What’s next – budget-aware reasoning and diffusion models
Thankachan flags diffusion models as a trend to watch. The key reason is not hype – it’s that diffusion-based language models may offer new efficiency controls that differ from token-by-token autoregressive decoding.
A concrete example: Vittorio Rossi et al. (Mar 6, 2026) argue that diffusion language models (DLMs) can waste compute because they run denoising steps over a fixed maximum length even when the desired response is short. They propose a zero-shot mechanism to estimate required output length and crop the context window before generation, reporting large FLOP reductions without statistically significant performance loss (and improvements in 2 of 4 benchmarks).
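A back-of-the-envelope sketch shows why cropping helps. It is not the authors' method: it assumes the length estimate is already available, pads it with a hypothetical safety margin, rounds up to a hypothetical block size, and uses a deliberately crude cost model in which per-step compute scales linearly with the number of token positions (real attention cost grows faster than that).

```python
import math


def cropped_length(estimated_len: int, max_len: int,
                   margin: float = 1.25, block: int = 32) -> int:
    """Pick a generation length from an estimate (margin and block are assumed).

    Pads the estimate, rounds up to a multiple of `block`, and never
    exceeds the model's fixed maximum length.
    """
    padded = math.ceil(estimated_len * margin)
    rounded = math.ceil(padded / block) * block
    return max(block, min(max_len, rounded))


def relative_step_cost(prompt_len: int, gen_len: int, max_len: int) -> float:
    """Token positions processed per denoising step, relative to the full canvas.

    Linear-in-length assumption for illustration only.
    """
    return (prompt_len + gen_len) / (prompt_len + max_len)
```

With a 100-token prompt, a 1024-token maximum, and an estimated 40-token answer, `cropped_length` picks 64 tokens and the toy model processes 164 positions per step instead of 1,124.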
This matters because “right-sized reasoning” is becoming an explicit research goal across paradigms – whether you do it with controllable budgets in autoregressive models (as in ORBIT) or with length control in diffusion models.
Finally, distillation itself is getting more modular. Shaoxiong Yang et al. (Feb 1, 2026) propose FutureMind, which combines adaptive distillation with a multi-stage reasoning pipeline and retrieval guidance – explicitly acknowledging that SLMs are attractive for low-latency settings but often struggle on knowledge-intensive tasks without structured reasoning and retrieval help.
Key Takeaways
Treat reasoning as a budgeted resource. Put token limits and latency targets into your API contract, not just your monitoring dashboards, and design “fast vs deep” modes intentionally.
Distillation makes small models competitive when the recipe is right. Evidence from sub-billion-parameter models shows that curriculum design and trace supervision can close surprising amounts of the gap.
RAG failures are often reasoning failures, not retrieval failures. Even when the answer is in-context, models may fail to use it – so prompt structure and context compression matter.
Routing is a cost lever. With the right augmentations and routing, smaller open-weight models can rival much larger baselines at large cost reductions.
Watch diffusion LMs for new efficiency primitives. Length-aware diffusion decoding suggests a different path to controlling wasted compute, which may matter for production workloads dominated by short answers.
🔍 In case you missed it…
Small Language Models and the Future of Production AI with Karun Thankachan
This conversation with Karun Thankachan is a practical tour through small language models in production, starting from the limitations of general-purpose LLMs and repeatedly returning to a single constraint. Cost-effective reasoning for specific tasks is a different engineering problem than general-purpose reasoning, and good engineers choose their tool…
💡 Industry Perspective
Agentic AI Is Redefining Edge Infrastructure
Thought leadership piece by Saqib Jan with Lee Peterson, VP of Secure WAN Product Management at Cisco.
As agentic systems move closer to the edge, the network underneath them matters just as much as the models running on top. Peterson makes the case that organizations still designing around centralized control will hit a wall at exactly the wrong moment, and that getting edge compute and networking right is not an upgrade but a rethink from the ground up.
🛠️ Tool of the Week
DistillKit: open-source knowledge distillation toolkit for language models
Highlights:
Two distillation methods in one toolkit: Supports both logit-based distillation and hidden-state-based distillation, which aligns intermediate layer representations and allows distillation across different model architectures.
Offline distillation at scale: An advanced logit compression system using polynomial approximation, error-diffusion quantization, and bit-packing makes it practical to train from pre-captured teacher outputs without running the teacher model live during every training step.
Composable loss functions: Mix and match KL divergence, JSD, TVD, ranking losses, and hidden state alignment to control exactly what the student model learns and how aggressively it is pushed toward the teacher’s behavior.
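For readers unfamiliar with the three distribution-level losses named above, here is a plain-Python sketch of what they compute on a single token position. This is not DistillKit's API: the function names, the weighting scheme, and the temperature handling are illustrative assumptions, and a real training loop would use a tensor library.

```python
import math
from typing import Dict, List, Sequence


def softmax(logits: Sequence[float], temperature: float = 1.0) -> List[float]:
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p: Sequence[float], q: Sequence[float], eps: float = 1e-12) -> float:
    """KL(p || q); with p as the teacher this is the usual forward KL."""
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """Jensen-Shannon divergence: symmetric and bounded by ln 2."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


def tv_distance(p: Sequence[float], q: Sequence[float]) -> float:
    """Total variation distance: half the L1 distance."""
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))


def combined_loss(teacher_logits: Sequence[float], student_logits: Sequence[float],
                  weights: Dict[str, float], temperature: float = 1.0) -> float:
    """Weighted mix of the three losses; weight keys are 'kl', 'jsd', 'tvd'."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    terms = {"kl": kl_divergence(p, q), "jsd": js_divergence(p, q), "tvd": tv_distance(p, q)}
    return sum(weights.get(name, 0.0) * value for name, value in terms.items())
```

Forward KL penalizes the student heavily for missing tokens the teacher considers likely, while JSD and TVD are bounded, so changing the weights changes how hard the student is pushed toward the teacher.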
📎 Tech Briefs
Hugging Face Spring 2026 open source report released: The platform now has 13 million users and over 2 million models, with Chinese models accounting for 41% of downloads and robotics datasets growing 23x to become the largest category.
vLLM 0.18.0 released with 30.8% throughput improvement and new Realtime API: The release ships async scheduling with Pipeline Parallelism delivering a 30.8% end-to-end throughput gain, a WebSocket-based Realtime API for streaming audio, and expanded hardware support across NVIDIA, AMD, Intel, and TPU.
OpenAI launches GPT-5.4 mini and GPT-5.4 nano: Two smaller models optimized for speed and efficiency in coding, subagents, and high-volume tool use, continuing the industry shift toward task-specific models over general-purpose frontier inference.
NVIDIA’s AI Grid reference design cuts edge inference costs by 76% in early benchmarks: The architecture distributes AI compute across telco network nodes, with Comcast benchmarks showing sub-500ms latency at P99 burst traffic and 80.9% throughput gains over a centralized deployment.
Anthropic announces the Anthropic Institute: A new research body focused on the economic, societal, and security impacts of advanced AI, signaling that frontier labs are formalizing impact analysis as a structured part of model development and evaluation.
That’s all for today. Thank you for reading this issue of Deep Engineering.
We’ll be back next week with more expert-led content.
Stay awesome,
Saqib Jan
Editor-in-Chief, Deep Engineering