Packt Deep Engineering: Interviews

Sovereign AI and Agentic Infrastructure with Rick Spencer

Saqib Jan — Wed, 24 Jun 2026 18:06:00 GMT

Most engineering organizations adopting AI do it without compliance regimes scrutinizing every decision. SUSE works under exactly that scrutiny, and the way it approached AI adoption under strict data sovereignty requirements is instructive for any team that cares about where its data goes and what its AI actually costs.

Rick Spencer is the General Manager for Technology and Product at SUSE, where he leads the engineering teams behind the company’s full product portfolio, from SUSE Linux Enterprise and Multi-Linux Manager to the cloud native stack of Rancher, RKE2, K3s, and SUSE AI.

SUSE has one of the longest and deepest open source infrastructure histories in the industry, and its enterprise customers are operating under strict compliance regimes. Rick joined Deep Engineering Live to talk about how SUSE adopted AI agents without breaking its promises on data sovereignty, the framework his teams use to decide which AI tools fit which work, why he rejects output-based developer metrics, and the role MCP now plays in managing enterprise infrastructure.

This session was recorded offline as part of the Deep Engineering Interview Series. The transcript below has been lightly edited for clarity and readability.

Q. Tell us a little about yourself and what you do at SUSE, and the kinds of engineering challenges your teams are working through right now.

Rick Spencer: I’m the General Manager for Technology and Product at SUSE. That means I lead the engineering teams for all the products that we offer to customers. That includes Linux, like SUSE Linux Enterprise, SUSE Linux Enterprise Server for SAP, and Multi-Linux Manager. We also have a suite of cloud native products like RKE2, K3s, and Rancher of course. We have a lot of things built on top of both of those, like the application collection, which are certified Kubernetes applications that you can run. We have other products composed of those building blocks like SUSE AI that you can use to run your own sovereign AI stack. There is SUSE Edge, and SUSE Edge’s cousins like SUSE Telco and SUSE Industrial Solutions.

Our customers tend to be enterprises with pretty serious enterprise requirements. They work under compliance regimes, they typically face a lot of scrutiny, they need important things like L3 support, reliable lifecycle models, a lot of predictability, high quality, and low CVE counts. So we take the open source software in the world and we create packages of it that are usable by enterprise companies.

Q. SUSE has a longer and deeper open source infrastructure history than most companies in the space. When AI agents started becoming the real workflow tool for engineers, how did that land internally, and what did adoption actually look like on the ground?

Rick Spencer: There is a lot to unpack here, so let me try to go at least somewhat systematically. All the software that we write is open source, so we are not worried about leaking the code. We publish the code. That was not the concern. But there are things like, let’s say you are debugging a customer environment. You do not want to let your engineers just take those logs and send them to random AI bots. We promise that we won’t do that. They trust us to not do those kinds of things. So there was a phase that we went through trying to figure out how to use AI in an effective way that maintained our promises to the customers.

Part of that solution was realizing that engineers are going to use AI to go faster wherever they can. It is not like, oh, don’t use AI. That would just not be workable. So it was really about setting up our engineering management team to coach those engineers effectively. Beyond keeping our data sovereignty promises, there is cost, which can really run out of control. We see a lot of people run into that. For us, we never really had the problem of engineers deleting things in production. Our engineers tend to be very cautious. But it is easy to rack up pretty big bills on Anthropic and Copilot and so on.

A big part of the solution was that we have our own sovereign AI. We make this thing called SUSE AI, which is a stack you can use to manage AI workloads on your own infrastructure. We use that pretty heavily internally and we run Llama on it. Wherever we do that kind of work, and we do it in a few different places, it all stays within our private infrastructure, so there is no chance that any data can escape. We only use models that can be vetted effectively.

Then there was more to it on the coaching and oversight side. What we ended up doing is getting pretty precise about how engineers use AI. We broke it down into three categories. The first is using it for your daily work, which is your statement completion and your debugging and that kind of thing. The second is using it for agentics, which is relieving your toil, letting agents take care of some things that used to create a lot of work or interruptions. The last part I call curve jumping. That is when you are going from zero to infinity, doing things with AI that you would not have tried before, like solving really deep problems in one big step.

We created a framework around those three different kinds of uses, and then we help engineering managers help their engineers pattern match. Okay, if I just need statement completion and debugging help, these are the tools I can use for that. If there is a sovereignty aspect to it, these are the tools for that work. These are good tools for this kind of agent. And then these are the frontier models that we provide to you for those curve jumping capabilities. It sounds very organized now, but there was a lot of experimentation and a lot of rapid innovation from the engineers. Some of the early adopter engineers led, and then we went back and tried to create some order out of that so we could spread what we learned around.
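
The pattern matching described above can be pictured as a small lookup. This is only a sketch under stated assumptions, not SUSE's actual framework: the three categories come from the interview, but the tool labels, the `Task` type, the `pick_tool` function, and the rule that a sovereignty concern overrides the category are all hypothetical.

```python
from dataclasses import dataclass

# Hypothetical tool tiers. The interview names the categories, not these labels.
TOOLS_BY_CATEGORY = {
    "daily_work": "lightweight-code-assistant",
    "agentic": "agent-runtime",
    "curve_jumping": "frontier-model",
}
SOVEREIGN_TOOL = "self-hosted-model"


@dataclass(frozen=True)
class Task:
    category: str                        # one of the three categories above
    touches_customer_data: bool = False  # the "sovereignty aspect"


def pick_tool(task: Task) -> str:
    """Return the tool tier a manager might recommend for a task."""
    if task.category not in TOOLS_BY_CATEGORY:
        raise ValueError(f"unknown category: {task.category}")
    # Assumption: sovereignty wins, so such work stays on private infrastructure.
    if task.touches_customer_data:
        return SOVEREIGN_TOOL
    return TOOLS_BY_CATEGORY[task.category]
```

A call such as `pick_tool(Task("agentic", touches_customer_data=True))` would route the work to the self-hosted tier regardless of category.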

Q. Digital sovereignty is central to how SUSE thinks about its stack. How does that principle shape where AI can and cannot go inside your engineering workflows?

Rick Spencer: I don’t think it’s can or cannot. It’s more so how. Let me give you an example. For digital sovereignty, a lot of the things we build, we actually build in something called the internal build service, which is an instance of something called Open Build Service, which is a service we provide to everybody. Kubernetes is built on it. There are thousands and thousands of things built on it. The interesting thing in terms of digital sovereignty is that all the builds are offline. They literally are not connected to the internet. This is super important because you need to be able to prove that nothing happened during the build process. That is a lot easier to do if there was no internet connection.

So we have to do things the hard way. We have to make sure all the sources are there. You cannot pull anything live in the heat of the moment during build time. You cannot run post-install scripts. Now if you want to apply AI in that environment, let’s say you want to backport a patch to previous stable releases, can you do that in a sovereign way? Yes, you can, as long as we are running AI in a way that is able to run disconnected from the internet and we have complete visibility into everything it is doing. These things are not easy, but we have decades of experience with it that we can apply. In some cases we even train our own models to accomplish these things, and that way we know the model does not have some naughty time bombs built into it.

Q. When your engineers started integrating AI agents, where did it deliver productivity gains, and where did it create new problems?

Rick Spencer: Let me give you some examples. My favorite example isn’t really about code. We are an open source company, and we have all of this code within our dominion of control, in GitHub and here and there and everywhere. I don’t know if you remember the Trivy attacks last month, and just a spate of tool chain attacks. We had a response process for that, but it was predicated on those things occurring occasionally, not twice a week. So our security team wrote an agent that scans certain sites every hour. It says, okay, there is a reported compromised package, typically in npm, sometimes in PyPI, different places. It finds those, and then it scans all of our open source code to see if we are using it anywhere. If we are, it writes a report and notifies us on Slack. So we know right away where to pay attention.

Fortunately, so far we have not really been impacted because we have really good hygiene around our tool chains. But bad guys are really smart and work really hard, so we still want to stay super vigilant. That has just been such a huge relief, because now if you see a report of a tool chain attack, our agent was on it before we even knew about it. It saves so much toil, because we don’t have to send people out to check if we are using this package or do a search in this area of GitHub.
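
The core of such a scan is a join between advisories and dependency lists. The following is a minimal sketch of that matching step, not SUSE's agent: the `Advisory` type, the function names, and the input shapes are assumptions, and the hourly fetching of advisories, the lockfile parsing, and the Slack delivery are deliberately left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Advisory:
    ecosystem: str            # e.g. "npm" or "pypi"
    package: str
    bad_versions: frozenset   # versions reported as compromised


def find_exposures(advisories, repo_dependencies):
    """Return (repo, ecosystem, package, version) for each compromised dependency in use.

    repo_dependencies maps repo name -> list of (ecosystem, package, version)
    tuples, which a real agent would extract from lockfiles.
    """
    compromised = {(a.ecosystem, a.package): a.bad_versions for a in advisories}
    hits = []
    for repo, deps in sorted(repo_dependencies.items()):
        for ecosystem, package, version in deps:
            if version in compromised.get((ecosystem, package), frozenset()):
                hits.append((repo, ecosystem, package, version))
    return hits


def format_report(hits):
    """Render the hits as a plain-text message suitable for a chat notification."""
    if not hits:
        return "No compromised packages found in scanned repositories."
    lines = [f"{repo}: {eco}/{pkg}@{ver}" for repo, eco, pkg, ver in hits]
    return "Compromised packages in use:\n" + "\n".join(lines)
```

Matching on exact versions keeps the sketch simple. A real scanner would more likely need version ranges and transitive dependencies.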

There are other areas like CVE mitigation. A CVE comes in, and an agent examines it. Is it even applicable? A lot of times the CVE comes in and the package is in the repo, but it is not being exposed in any way that it would matter. There is this thing called VEX (Vulnerability Exploitability eXchange), which is basically a file you provide along with the CVE database to explain whether the vulnerability is impacting you or not. That is really hard to do at the scale that CVEs are coming in, but the agents can do that for us pretty easily. That means we are focusing our attention not on keeping up with the crush of CVE reports, but on the actual vulnerabilities. Our attention is reserved for actually keeping our customers safe.
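
The applicability question can be sketched as a small decision function. This is an illustration only, not how SUSE's agents work: the `triage` function and its inputs are hypothetical, and the returned dictionary is a simplified record rather than a complete VEX document. The status and justification strings follow the labels used by OpenVEX.

```python
def triage(cve_id, package, installed_packages, reachable_packages):
    """Return a simplified VEX-style statement for one CVE.

    installed_packages: package names present in the product.
    reachable_packages: names whose code is actually exercised or exposed.
    Determining reachability is the hard part and is assumed as input here.
    """
    statement = {"vulnerability": cve_id, "component": package}
    if package not in installed_packages:
        statement["status"] = "not_affected"
        statement["justification"] = "component_not_present"
    elif package not in reachable_packages:
        # In the repo, but not exposed in any way that matters.
        statement["status"] = "not_affected"
        statement["justification"] = "vulnerable_code_not_in_execute_path"
    else:
        # Present and reachable: this one deserves human attention.
        statement["status"] = "under_investigation"
    return statement
```

Only the last branch lands on an engineer's desk, which is the attention-saving effect described above.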

Q. How do you think about measuring the impact of AI on your engineering teams?

Rick Spencer: We might have a different view on that than some people. We are really moving away from metrics that measure output and utilization, and we are trying to focus on impact. What that means is we don’t have leaderboards that show every developer and how many lines of code their agents submitted. I consider that garbage vanity metrics. Not helpful.

One of the main things we want to do is measure the impact of our use of AI without it being an extra burden on the development team. A lot of the tools out there assume you are a proprietary software company where everyone is working on a single code base, which is just not how an open source enterprise works. We are working on hundreds, if not thousands, of repositories all the time, and the work to maintain them is very different. So those utilization numbers and those developer-to-developer comparisons just don’t have value. I’d rather have engineers working than reporting.

So we are setting up metrics for different things, like how fast CVEs are being addressed, how fast patches are being backported, how fast our L3 responses are getting closed while maintaining the same NPS. These are things where we have applied AI to different areas, so let’s focus on the business impact, not on the utilization.

That said, we are working on a set of dashboards right now so an engineering manager can look at the cost and utilization on their team to help with coaching. Let’s say an engineering manager has an eight person team. Hey, I noticed we are burning a lot of tokens. What are we actually doing that is burning that many tokens? I’m not sure we are getting value out of that. Or, hey, we have these seats for this LLM or code assistant we bought, but we are not using them. Are there areas where we could be? So we are definitely measuring the value from a business perspective, but we are really trying to decentralize and allow engineering managers to guide their teams on getting the most value out of AI, without it becoming a leaderboard game where developers feel exposed in some game that is not about providing value to customers.

Q. Whenever we talk about AI agents, we cannot avoid MCP. Why does it matter to so many companies today, and what does it unlock for engineering teams in your kind of environment?

Rick Spencer: MCP is critical, actually. If anyone is listening and does not know what I mean, a Model Context Protocol server is a little bit of code that runs and offers to an LLM, hey, here is some context you can use, and here are some tools, some actual things you can do. That turns a chatbot into an agent, because an agent can actually do things.

MCP does a few things that are really important. The first is just ease of use. A good MCP server provides structure to the LLM so that it is way easier to write a prompt to get the results you want. The LLM sees the MCP server is for this purpose, and these are the kinds of information the humans want out of it. You don’t have to include all that in the prompt. And it is actually really easy to write an MCP server with an LLM. If you have a decent model, it does not even need to be top tier. You say, hey, we have this bit of software we want to control with agents, this is our use case, and it will write an MCP server for you pretty easily. Then you can have a human go in and edit it.

But there is another part to it. We ship MCP servers with all of our products, and we think this is really important. In our view, the world is moving to a new paradigm. Before, as an administrator, you would think about all the applications you use to monitor and control your servers, your Kubernetes clusters, your workloads. Now we are moving into a mode where you don’t think that way. You think about writing agents, or chatting with the infrastructure to get the information you need, and then it is able to take action on your behalf without you having to worry about the specific syntax.

For that to work, those MCP servers have to have really good human knowledge encoded into them. If you think about our MCP servers for SLES, for Rancher, for Multi-Linux Manager, the key is that the experts in using that tool have crafted those MCP servers. It would be like, instead of you sitting down in front of a chatbot saying I need to figure out how to use Rancher, you are sitting down with the whole Rancher development team telling you how to prompt the chatbot. That encoding of knowledge makes the agentics way more powerful, because it is not guessing. Otherwise it has to look at the raw APIs and make a bunch of guesses, and there is no way as a human you will know if that is the right thing to do.

All that said, there is another really important thing MCP servers do, which is provide a place where you can, as an enterprise, bring some sanity and control to the usage. If you have MCP servers running, they are just servers. That means you can apply ACLs to them. You can say the MCP server for this user is allowed to use these tools and not these tools. You can log the use of the MCP servers. We have our own gateway, but we also partnered with a company called Stacklok that we talked a lot about at the last SUSECON. There are different gateways you can put into place as an enterprise to keep the MCP servers under control. You don’t give the LLMs access directly to tools, only the MCP servers, and then you can have that oversight and meet your compliance needs.
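
The access control and logging idea can be sketched in a few lines. This does not use any MCP SDK and is not SUSE's gateway or Stacklok's product. The user names, tool names, `ALLOWED_TOOLS` policy, and `call_tool` function are all hypothetical, chosen only to show where the check and the log line sit.

```python
import logging

logger = logging.getLogger("mcp_gateway_sketch")

# Hypothetical policy: which tools each user may invoke through the gateway.
ALLOWED_TOOLS = {
    "alice": {"list_clusters", "get_node_status"},
    "ops-agent": {"list_clusters", "get_node_status", "restart_workload"},
}


class ToolNotAllowed(Exception):
    """Raised when a caller requests a tool outside its allowlist."""


def call_tool(user, tool, arguments, registry):
    """Check the allowlist, log the attempt, then dispatch to the tool.

    registry maps tool name -> callable. A tool that is never registered,
    for example a destructive one, simply cannot be invoked.
    """
    allowed = tool in ALLOWED_TOOLS.get(user, set())
    logger.info("user=%s tool=%s allowed=%s", user, tool, allowed)
    if not allowed or tool not in registry:
        raise ToolNotAllowed(f"{user} may not call {tool}")
    return registry[tool](**arguments)
```

Every attempt is logged before the decision is enforced, so denied calls leave an audit trail as well as successful ones.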

Even at a low level, you can put the MCP server in what I call jail. You can say, on the server, here is the user for this MCP server, here is a systemd process that only presents the actual compute resources it needs. Because you have to be thinking, for every MCP server you are running, there is an LLM out there trying to use it, and who knows what kind of prompt injections people are running. MCP servers also guard against things like the AI hallucinating something and deleting your production server, because you simply don’t provide that tool to it. This to me is one of the main roles SUSE has to play as part of this disruption, because we are bringing this agentic notion of how to manage all of your infrastructure.

Q. Cost is a real constraint when you are running AI at any scale. What does a practical cost mitigation approach look like for an engineering organization working the way SUSE does?

Rick Spencer: I can speak from our own experience. The fact of the matter is, if you are using a self-hosted AI, sure, you spend a lot on the big iron, and you are probably paying a company like SUSE for support. But nonetheless, there is a maximum cost there. Then the real question is, do you have the observability in place to make sure it is being utilized fully? That is a very different conversation. Are we getting full utilization out of our fixed costs, where we never have to worry about overrunning?

There is digital sovereignty, and sometimes they call this cost sovereignty, because no one can come back later and say, oh, by the way, we are changing our model. We have some suppliers where a lot of our developers were using seat-based pricing, and then over time they let us know, in plenty of time, that they are moving to usage-based pricing. That is a big change. We did not have sovereignty over the way they price it, whereas if you are hosting your own, you have that sovereignty over the pricing. So it is something to think about.

Another thing we use a lot is circuit breakers. Hey, we just noticed our Claude usage in the last minute was way too high, or Gemini, whatever you are using. That keeps runaway agents in check. It can be very frustrating for developers if they are trying to get work done and every single minute they are getting rate limited, but we are talking about cost controls, so you need to do the thing.
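
A per-minute usage check of this kind can be sketched as a sliding-window budget. This is an assumption about one possible shape, not SUSE's implementation: the `TokenCircuitBreaker` class, its parameters, and the choice to count tokens rather than dollars are all hypothetical. It is also simpler than a classic circuit breaker, since it has no cool-down state and reopens as soon as old usage ages out of the window.

```python
from collections import deque


class TokenCircuitBreaker:
    """Trips when token usage inside a sliding time window exceeds a budget."""

    def __init__(self, max_tokens, window_seconds=60.0):
        self.max_tokens = max_tokens
        self.window_seconds = window_seconds
        self._events = deque()  # (timestamp, tokens) pairs, oldest first
        self._total = 0

    def _expire(self, now):
        # Drop usage records that have fallen out of the window.
        while self._events and now - self._events[0][0] >= self.window_seconds:
            _, tokens = self._events.popleft()
            self._total -= tokens

    def allow(self, tokens, now):
        """Record the request and return True if it fits the budget, else False."""
        self._expire(now)
        if self._total + tokens > self.max_tokens:
            return False  # breaker is open: the caller should pause or alert
        self._events.append((now, tokens))
        self._total += tokens
        return True
```

The timestamp is passed in explicitly so the behavior is easy to test. In real use the caller would supply something like `time.monotonic()`, and a rejected request would trigger an alert instead of silently retrying.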

The other thing to say is that we are big believers in frontier models. We are not saying don’t use frontier models, but it is important to use them for the right things. You do not need a frontier model to understand your Python module and give you code completion. You just don’t need it for that. The frontier models are really for when you are in that curve jumping, super strategic mode. We have projects where we spend tens of thousands of dollars on frontier models, but they generated, who knows, a million or two million dollars in value, so the cost benefit was definitely there.

One thing we do with frontier models is, let’s say we need an agent for something. We use the frontier model to create an agent that can then be run on a much lower cost model. It will say, sure, I’ll write the Python scripts it needs to use, so it doesn’t have to try to do that inference every time. I’ll write the context file that works for that model. So you can start with the frontier model and then tell it to do things with your less expensive models, or even your own models that are in your own infrastructure.

When you see that needle move, you see people start adopting it, and you’ll see step functions in utilization of tokens. For a certain engineer, the penny drops: they suddenly realize they are in a new paradigm where, as a developer, they are empowered to be ten times, a hundred times more effective using these tools. You can see day to day these little jumps. Oh, somebody figured it out. Someone figured it out. So then you need to go back, because you don’t want to stop them from getting that 100X improvement. You need to give them the right tools for the job.

Try Rust With Your Own Hands and Eyes with Francesco Ciulla

Saqib Jan — Thu, 11 Jun 2026 10:33:12 GMT

Francesco Ciulla has been building with Rust since 2022, working across web development, developer tooling, and content creation for a large technical audience online. He has been a Docker Captain since 2021, is a former full-stack developer at the European Space Agency on the Copernicus project, and is currently head of developer relations at Zerops.

He is the author of The Rust Programming Handbook, published by Packt in December 2025. Francesco joined Deep Engineering Live to talk about Rust adoption strategy, organizational challenges, concurrency, deployment workflows, and where the language is headed in 2026.

This session was recorded live as part of the Deep Engineering Live Interview Series. The transcript below has been lightly edited for clarity and readability. Audience members joined the conversation and asked questions directly during the session.

Q. What does Rust adoption actually look like at the organizational level, and what does success look like for engineering teams introducing it?

Francesco Ciulla: Rust has been growing a lot in the past few years and I am glad I started learning it a bit earlier than most people. I started creating content in 2022 and 2023 and then began working on the Rust Programming Handbook around April 2023, which took about two years to publish.

On adoption at scale, there is the famous meme about rewriting everything in Rust, and like every good meme there is a bit of truth in it. But I think the best approach, from a practical perspective, is not to rewrite everything in Rust at the beginning. The best way to introduce Rust in a big project is to find the hard part that is slowing things down, the bottleneck of your services, and try to write one single service in Rust. That is the best way to approach it. And then you will probably see Rust slowly take over more of your codebase, but I mean that in a good sense.

Q. Amazon and other large organizations have noted the high cost and risk of adopting Rust without internal expertise, and the talent pool is also quite thin. What advice do you give engineering leaders planning to introduce Rust and acquire the right people?

Francesco Ciulla: As with every new technology, the problem is not the technology itself. It is how well the technology is understood by the people in the organization. I remember when I was working at the European Space Agency and Docker adoption was slow, not because of anything wrong with Docker, but because a new technology that is not well known internally creates friction. That is the bottleneck.

The best approach is to have a shepherd, someone who can bring real knowledge into the organization. Basically a senior Rust developer who already knows all the flows and who people can refer to when they get stuck. This is especially true in the AI era where everyone is writing code with AI assistance, but you still need validation. Who decides whether the AI-generated Rust service is safe to put in production? You need the validation of an expert. That said, this is not just a Rust rule. It is the golden rule of adopting any new technology.

Q. The Rust learning curve has a reputation for being steep. The borrow checker in particular causes a period of deep soul searching for newcomers. What is the best way for an experienced developer to learn to think in Rust, especially in an AI-accelerated world?

Francesco Ciulla: I actually gave a talk at Rust Nation UK recently with the deliberately provocative title “Rust Is Hard to Learn,” so feel free to fight me on this, because I think the idea that Rust has a steep learning curve is more of a myth than a reality, and I believe it can be addressed quite quickly.

The biggest challenge is not the concepts themselves but the mindset. Rust has a unique way of handling memory, and even if you are a senior developer with 20 years of experience, if you try to learn Rust by comparing it directly to other programming languages, you will struggle. The more experience you have, the more you think you already know how things work. Fighting the borrow checker, understanding lifetimes and ownership, can feel overwhelming if you bring that baggage with you. But if you are open-minded and approach it as something genuinely new rather than as a variation on what you already know, you will get the full power of the language and understand why so many people are enthusiastic about it.

Rust has been voted the most loved programming language year after year, and loved means the people who have used it still want to use it. That is a meaningful metric. I would rather trust the judgment of people who have actually worked with Rust than form an opinion based on what someone said on Twitter. As an engineer, I prefer to try things with my own hands and my own eyes.

Q. When is Rust not the right choice for a team, despite all its advantages?

Francesco Ciulla: I should probably say never, because I am supposed to be biased for Rust. But I prefer to be honest. There are genuinely cases where Rust is not the ideal tool.

First, when you need something simple and it needs to be done immediately. If you are a junior developer who needs to deliver a working API today and you do not know Rust, this is not the moment to start learning it under deadline pressure. You can always refactor something later. When you need something that just works, go with the technology you already know well.

Second, the ecosystem argument is real. Python has better libraries for data science. JavaScript has a larger package ecosystem for certain kinds of web work. Rust integrates well with other languages, but if you need something that is native to another ecosystem, that is a real constraint rather than just a preference. Good engineers use the right tool for the problem, and the case for Rust is strongest when the problem involves performance, memory efficiency, or concurrency at a level where other languages start showing their limits.

Q. Google’s Android security team reported that memory safety vulnerabilities fell below 20 percent of total vulnerabilities after the team prioritized memory-safe languages. Are those kinds of productivity and code quality benefits common in practice when teams use Rust?

Francesco Ciulla: Yes they are, but I think productivity in Rust is not primarily about typing speed. Rust can feel verbose and you might write more slowly in some ways than in other languages. The biggest benefit is how little deep debugging you have to do. You spend more time thinking carefully upfront, but you spend almost zero time chasing segfaults or memory leaks in production. And we always underestimate that part.

We talk a lot about how efficiently we can write code, but if you need less time to debug your code you are effectively writing more logic per unit of time. That is the part people consistently underestimate. We only tend to count the time it takes to write the thing, not the time it takes to find and fix what breaks later.

I also think Rust is one of the best programming languages for working with AI-generated code specifically because of what it requires from you as the reviewer. After the AI generates code you still need to touch it, understand it, and validate it. Otherwise there is no difference from copying boilerplate off Stack Overflow. You still have to understand what the code does. If you have no control, either you are useless or you cause a problem. In both cases, that is not a good place to be.

Q. What makes Rust’s approach to concurrency different, and how does it actually help teams building multi-threaded systems?

Francesco Ciulla: The first time I learned concurrency was at university in Java, and it was treated as the final, advanced session of the course, something dangerous that required extra care and specialized knowledge. I think many engineers carry that experience as a kind of trauma around concurrency.

When I went to teach concurrency in Rust for a YouTube video, I expected it to be challenging. I started the example and in two minutes I was done. I had the opposite problem. It was too easy.

This is structural rather than accidental. Rust was created when multi-core processors were already standard. Concurrency was not retrofitted onto a model designed for single-threaded execution. It was built in from the start. And the ownership system that prevents data races at compile time is the same ownership system that governs memory safety everywhere else in the language. There is no separate concurrency model to learn. The properties that make Rust memory safe are the same properties that make concurrent code safe.

That said, I personally prefer to add concurrency once an application is already working and you want to use memory more efficiently. If doing that takes a day of extra effort and it reduces your server costs from a hundred dollars a month to twenty, it was worth it. If it takes three weeks of fighting with the runtime, you would probably rather just spend more on the server. We are talking about production-grade applications here, not weekend side projects.

Q. How does Rust’s concurrency model hold up when you are building low-level networking components like proxies, packet processors, or kernel modules, territory that has traditionally belonged to C?

Francesco Ciulla: There are more repositories in C for that kind of work, and that is just a historical fact. But I am already seeing people build protocols and low-level networking components in Rust, and I think with AI assistance this is becoming more practical and more doable than it was even a year ago.

Just the fact that we are seriously considering writing these things in Rust is a win for the language. Nobody ever suggested doing this kind of work in JavaScript, because at that level you need pure efficiency and developer experience is not on the table. Only languages with the right performance profile can even enter this conversation. And Rust’s readability is a genuine advantage in that domain specifically. Low-level code becomes very complex very fast, and the fact that you can read your own code tomorrow is a significant practical benefit when you are dealing with networking protocols. I am not saying Rust will replace C or C++. Languages rarely disappear. But we now have an option, and having that option is already a meaningful shift.

Q. What are the most common pitfalls you see developers run into with Rust?

Francesco Ciulla: I wrote an eighteen-page chapter on pitfalls in the book, so I can probably remember about half of them now. The biggest one is trying to write Rust as though it were another language. If you approach it with the patterns and assumptions you bring from other languages, you will have problems. The second is not trusting the compiler. Especially at the beginning, the instinct is to try to fix things your own way without reading what the compiler is actually telling you. The compiler is giving you very specific information and the errors are basically tutorials. They are helping you write better code. Learning to read them carefully rather than working around them is probably the single most important habit to develop early.

Q. Rust’s trait system and rich type semantics let developers encode invariants directly in the type system, but this power can also lead to very complex code. How should an experienced architect balance using the advanced type system versus keeping things simple?

Francesco Ciulla: If you want to build rockets, you need to use more complex tools. That is just the nature of it. I think the best approach is to build a solid foundation in the basics first. You should not be using traits without understanding what they are. In terms of code organization, Rust is the best language I have worked in for how it structures modules, files, and folders. At some point you will stop writing everything in a single file, and having a clear module structure helps you manage complexity as it grows.

You can also write straightforward Rust code up to a certain point. If you want to unlock the full potential, including unsafe Rust and raw performance, Rust allows you to remove the seat belts. But of course the code becomes more complex at that level. This is not unique to Rust though. The complexity of any project is always exponential. It starts simple, simple, simple, and then suddenly nobody knows what is going on anymore. Having solid fundamentals gives you the tools to manage that curve.

Q. Rust versus Go for microservices is a question many backend teams face right now. Is Rust ready to challenge Go in the web backend space?

Francesco Ciulla: Let me be clear first that Go is not a bad language. Docker is written in Go and I would never say that makes Go bad. I have a Docker bottle on my desk and I have been a Docker Captain since 2021. I also currently work at Zerops where we have a CLI written in Go and I read a lot of Go code myself. So I am not dismissing it.

On pure performance, the comparison between Rust and Go is not really a contest. Check the benchmarks yourself and send me the link where Go beats Rust. Sometimes they are at the same level. In terms of pure performance there is no story.

Where Go genuinely wins is on CLIs, cloud tooling, and developer ecosystem. Docker is written in Go, Kubernetes is built on Go. If you want to be in the DevOps space and write tools in that ecosystem, Go is the natural choice. It also has more engineers available in the job market right now, which matters if you are hiring.

From a developer perspective though, I would argue the opposite is worth considering. If there are fewer Rust engineers than Go engineers, being expert in Rust means less competition. I would rather be expert in a language where there is less competition than be one of millions of Go developers. But I understand that a company hiring today will probably find a good Go developer faster than a good Rust developer. Both arguments are valid depending on which side of that conversation you are sitting on.

Q. How does Rust affect build and deployment workflows in practice?

Francesco Ciulla: This is one of my favorite questions because it combines two of my favorite things. When you build a Rust application, you get architecture-specific executable binaries. If you run cargo build on your machine, you get an exe on Windows, an executable on Mac, and an executable on Linux. You can also cross-compile, changing the target architecture through the compiler, to produce a Linux binary on a Windows machine.

My preferred workflow for production is to build the Rust binary directly when building the Docker image. That way I have a Linux executable compiled inside the container, ready to run everywhere Docker is installed. The dream for operations teams is having a single lightweight binary inside a Docker container. You get the portability and scalability of containers with the minimal footprint of a Rust binary.

Q. Are there real operational benefits to shipping Rust services as smaller container images with lower runtime overhead?

Francesco Ciulla: Absolutely. The binary is the dream for operations teams because it is the most lightweight option. I mentioned earlier that my Rust web server uses four megabytes of RAM in development and five in production. On a one-gigabyte droplet you could theoretically run more than 200 such services in idle. That kind of resource profile changes what is economically viable to deploy.
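The arithmetic behind that estimate can be checked in a few lines. This is a back-of-the-envelope sketch, assuming a gigabyte means 1,024 MB and that no memory is set aside for the operating system or a container runtime, so a real ceiling would be lower. The function name is illustrative only.

```python
# Rough capacity estimate for idle services on a small VM.
# Assumptions (not from the interview): 1 GB = 1024 MB, and nothing is
# reserved for the OS or a container runtime.
DROPLET_RAM_MB = 1024
SERVICE_RAM_MB = 5  # production figure quoted above


def max_idle_services(total_mb: int, per_service_mb: int) -> int:
    """Return how many services fit if each uses per_service_mb at idle."""
    return total_mb // per_service_mb


print(max_idle_services(DROPLET_RAM_MB, SERVICE_RAM_MB))  # 204
```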

You could in theory run the Rust executable directly without Docker, and for a single service that works fine. But if you need to orchestrate ten services on the same machine, containers are still the right answer for production-grade applications that need to scale. The slight overhead Docker adds is worth it many times over in terms of scalability, replaceability, and operational consistency. We are not in the 90s anymore.

Q. What use cases in web development do you see Rust excelling at?

Francesco Ciulla: When performance is really important, Rust is where it shines. If performance is not your main concern, you can go with more conventional choices. I am still a fan of Node.js for certain kinds of work.

One of the most underappreciated arguments for Rust in web services is flat latency. Languages with garbage collectors, including Go, Java, and Node.js, introduce periodic pauses when the collector runs. Those pauses can last hundreds of milliseconds. An HTTP request that arrives during a GC cycle gets a worse experience than one that does not. By not having a garbage collector on the backend side, you have flat latency. You do not rely on luck or on the user not being the unlucky one. That problem is simply removed.

The other scenario where Rust makes a clear case is when you have one service in your system that is significantly slower than everything else. There is always one in any sufficiently complex system. Sometimes it is not the service itself but the upstream dependencies it is calling. But when the service itself is the bottleneck, spending time to optimize it in Rust makes sense. Writing the whole application in Rust from scratch is only necessary if you are starting a new project or you genuinely want to have fun with the language. For most teams, the bottleneck service is where to start.

Q. What are the main challenges when integrating Rust into CI/CD pipelines and monitoring?

Francesco Ciulla: The honest answer is that since Rust has been in production for less time than other languages, there are fewer examples in the documentation for some specific integrations. This gap is closing quickly and will probably disappear in a couple of years, but if you are using a specific technology and looking for a Rust integration example, you may occasionally find the documentation lacking compared to what exists for JavaScript or Python.

The Rust toolchain itself is actually one of the language’s strongest points once you get used to it. Running tests is cargo test. It is integrated natively into the language and there is no equivalent of the npm versus yarn versus pnpm decision that JavaScript teams have to navigate before writing a single line of code. The ecosystem and toolchain are famous within the Rust community for being one of the things people love most about working in the language.

Q. The Linux kernel maintainers have declared Rust permanent and are planning components that require it. What does that endorsement signal about where Rust is headed?

Francesco Ciulla: It is great news and very bad news for Rust skeptics. The fact is that Rust is slowly getting adopted at bigger and bigger levels. You can see this on the government side, in military applications, and in security-critical domains where safety requirements are the highest. Just the fact that Rust was considered a viable option for kernel-level work was already a meaningful milestone, even before it succeeded. The language was competing in a domain that had been exclusively C and C++ territory for decades and it earned a permanent place there.

I will be genuinely happy when we stop having the conversation about whether Rust should be used, and we just start using it. Python does not get these conversations. Nobody asks whether you should use Python. I will be happy when Rust reaches that level of acceptance as a normal production choice. We are moving in that direction. Every day I see another positive signal.

Q. Where do you see the next phase of Rust growth happening?

Francesco Ciulla: I think the biggest shift is coming in web development backends, and I know this is an unpopular opinion in the Rust community where the language is traditionally associated with systems programming. But I am seeing companies with hundreds of developers reach out to tell me they are rewriting their backend services in Rust. These are not random side projects. These are companies making deliberate production decisions.

Two years ago I would not have committed to building a paid SaaS product in Rust. In 2024, probably not. In 2025, maybe. In 2026, yes, I would use it. The Axum framework in particular has matured to the point where I am confident recommending it for production. That was not true a year ago.

In the embedded space, Rust is already winning and I have largely stopped advocating for it there because the argument is settled. I am also seeing companies that manufacture embedded devices ship them with Rust as the default, which is a different story from developers experimenting with Rust on embedded hardware. When the producers of these devices choose Rust before selling them, that is a commercial signal.

A member of the audience asked: With the state of the industry right now, is it challenging for a junior developer to start their journey with Rust and find their first job?

Francesco Ciulla: With the state of the industry right now, I think it is challenging for a junior developer to find a job regardless of which language they know. So let us remove Rust from that equation first.

You have two approaches here. One is to go as mainstream as possible, learn the most used framework, and compete for the highest volume of jobs. The problem with that approach is that you are also competing with the largest number of other candidates. I personally prefer the opposite approach. Since there are fewer Rust engineers than engineers in most other languages, being expert in Rust gives you a real differentiator.

If a job description lists JavaScript, React, SQL, Docker, Kubernetes, and then also mentions Rust, and there are two candidates and one of them knows Rust, that extra knowledge might be the thing that gets you the role. That is my honest view. The era of becoming strong in exactly one technology and finding a job with that alone is probably over. We need to be flexible. But dedicating some time to understanding the basics of Rust might make you shine in an interview in a way that knowing only mainstream technologies will not.

Francesco Ciulla is the author of The Rust Programming Handbook, published by Packt, and head of developer relations at Zerops.

Hands-On Software Engineering with Python with Brian Allbee

Saqib Jan — Wed, 03 Jun 2026 12:30:00 GMT

Brian Allbee has been writing Python almost exclusively since 2012, working across cloud-based application development, machine learning integration at Dice.com, and backend systems in AWS using Step Functions and Python Lambdas.

Allbee, Staff Software Engineer at Cleerly and author of Hands-On Software Engineering with Python, now in its second edition published by Packt, joined Deep Engineering Live to talk about what separates engineering from programming, how to scale and refactor Python systems responsibly, and what it actually takes to grow into senior and staff-level roles.

Watch the full conversation below.

Q. Tell us about your background and the kinds of systems you have worked on.

Brian Allbee: I have been programming almost exclusively in Python since early 2012. Prior to that I worked in C# and .NET, Flex markup language, and PHP for application development. I landed on Python at a job I started early in 2012 at an ad agency where they needed somebody to come in and build an internal application that was more performant than their off-the-shelf solution for asset management. I fell in love with the language a little before that position started, but I was very happy that a language audit I did reinforced that Python was still the way to go because it had everything they needed.

Since then I have done client-facing cloud management application work, a handful of customer-facing applications I cannot get into too much detail on because they are still covered under NDAs, and the last six years I spent doing machine learning implementation and integration for Dice.com on the team that eventually became their applied data science and AI team. Currently I am doing backend system development in an AWS cloud context with Step Functions and Python Lambdas to deal with health insurance processing.

Q. What distinguishes a true software engineering approach from just programming, particularly for Python developers working on real-world systems?

Brian Allbee: I think learning to think in terms of systems, not just implementations, is probably the main thing. I feel that holds true whether the backing language is Python or not, and it does not stop with just the systems that an engineer is writing. On the technical side it extends out to the entire toolchain, anything that shapes the code itself or determines how the code is managed or handled. But it also extends to what I would call nontechnical systems in the sense of a set of principles or procedures that define how something is done.

I basically feel that programming is really focused on the correctness of the code itself, whereas software engineering expands out into more of a focus on sustainability as change occurs.

Q. For Python developers aiming to move into senior or staff engineering roles, given how much AI is now part of development workflows, what skills or mindset shifts do they need beyond raw coding proficiency?

Brian Allbee: I think that same systems-oriented thinking is still the big dividing line, and I believe that will hold true even if LLM-based code generation turns out to be the next big thing that all of its proponents argue it will be. Even in those scenarios, manual interaction with code at the level of the syntax of the code itself might dwindle over time, but there are still going to be sensitive domains where some of that remains necessary. More importantly, engineers need to understand how that code fits together even if they did not write it, and why it fits together the way it does.

Hand in hand with that is a broader understanding of the problems being solved. Software engineering, like every other engineering discipline I am aware of, is concerned with solving problems usually while operating within some set of constraints. Software engineering focuses on solving those problems by creating systems, and that goes back to the whole systems-oriented thinking. But solving the problem requires understanding that problem first. Even if code generation becomes largely or completely automated, someone still has to own that system and understand its constraints and its potentials for failure and how it is expected to evolve over time.

Q. What are some of the best practices for updating, refactoring, and scaling an existing Python codebase as it evolves?

Brian Allbee: I think most of the paths to success in that context, at least the ones I can think of that I have seen, do not really start with the architecture but with discipline behind the process. The approaches that have worked for me or that I have seen work well for others include keeping things as simple as possible, wrapping processes that get used over and over again into functions or methods or whatever context works best whenever possible, and not being afraid to use structured data.

If you are working in a team, make sure the team has some agreement about how much in-code documentation, and by extension comments, is expected and what it should provide. The same goes for team-level agreement about what code standards you want to apply. Stick to those until something comes up that is not covered, and revise them as needed. It is a growing process.

Writing code with an eye towards making it testable is key even if there is not an immediate need for testing. Future you, if you are writing good tests, will come back and thank you if they could.

Q. Can you share an example from your experience of tackling technical debt or redesigning a Python system to improve its maintainability and performance?

Brian Allbee: Honestly, I really cannot. I do not have any dramatic war stories here because I have worked with generally exceptionally healthy teams that treated technical debt as a first-class concern, not as an emergency. Technical debt is one of those product-level priorities. Whoever is making the prioritization decisions is going to be in control of when those get tackled if they get tackled.

If there is significant technical debt, being able to communicate its impact effectively to your product-level people, or whoever is making those decisions, is going to be a key thing. That means being able to sit down and say: I understand that you do not want us to deal with this bug, or whatever it happens to be. But if we do not deal with it, it is going to lead to this, then that, and the longer we put it off, the more likely it is to lead to a really significant problem and the longer it is going to take for us to get past it.

Q. Modern teams often grapple with how much upfront system design to do versus diving straight into coding. How do you find the right balance between careful architecture and rapid agile execution?

Brian Allbee: I think understanding the full final scope of a project, even if it is just at a very high level, is critical to one side of that balance. The other side is knowing, again even if just at a high level, what constraints and non-project expectations are in play. You mentioned agile. Even if some form of agile is not part of a team’s day-to-day processes, there are some takeaways from agile that I think can still be beneficial. The entire idea of delivering work and software frequently on some sort of cadence is one of those. Iterating against the smallest deliverable units that can be identified would be another major factor.

I do not think the real risk is too much design or too little. It is designing without understanding the constraints that are involved. Iterating on the smallest meaningful units of work is going to be the most practical way to find that right balance between design and execution.

Q. Python has introduced features like data classes, type hints, and static typing tools in recent versions. How have these modern language features shaped your approach to designing Python software, and how would you recommend engineers fully embrace type hints in large projects?

Brian Allbee: Not as much as it might sound like, actually. Before I turned to Python I was working with C# and .NET, which is a statically typed, compiled environment. I came to really like the idea of static typing from a programming perspective even in dynamically typed languages like Python and JavaScript. If you dig back far enough into some of my very old and now unsupported blogs, you will find I even wrote some blog posts about implementing that sort of a structure in Python back as far as version 2.7.

I definitely recommend using the typing system that is in the language right now. At worst, it is additional documentation that modern IDEs can pick up on to help an engineer working with the code. With the inclusion of just one third-party package (I like typeguard myself, but there are others), it is possible to get runtime type checking that behaves much like static typing in Python code. And pre-deployment tools like mypy are going to pick up on your type hints to give you some extra quality control going into that process. I think whether you enforce types at runtime or not, the design clarity is worth it.
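To make the idea of runtime enforcement concrete, here is a minimal standard-library sketch. It is a hand-rolled stand-in, not how typeguard or any other package works internally, and it only handles plain classes such as int or float, not generics. The decorator name enforce_types and the function line_total are hypothetical.

```python
import functools
import inspect
from typing import get_type_hints


def enforce_types(func):
    """Check simple (non-generic) parameter annotations at call time.

    Illustration only: annotations such as list[int] are skipped.
    """
    hints = get_type_hints(func)
    signature = inspect.signature(func)

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        bound = signature.bind(*args, **kwargs)
        for name, value in bound.arguments.items():
            expected = hints.get(name)
            # Only plain classes can be checked with isinstance.
            if isinstance(expected, type) and not isinstance(value, expected):
                raise TypeError(
                    f"{name} must be {expected.__name__}, "
                    f"got {type(value).__name__}"
                )
        return func(*args, **kwargs)

    return wrapper


@enforce_types
def line_total(unit_price: float, quantity: int) -> float:
    """Return the price of one order line."""
    return unit_price * quantity


print(line_total(2.5, 4))  # 10.0
```

A static checker would flag a call like line_total(2.5, "4") before deployment. With the decorator, the same mistake raises TypeError when the call happens.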

Q. In building robust Python applications, how do you approach data modeling and validation? When does Pydantic make sense and when are simpler options sufficient?

Brian Allbee: I think it depends on the scope and intentions of the project. Pydantic is great for projects where there are complex requirements that can be derived from something like a JSON or OAS schema. It is also good for projects that are responsible for generating those JSON or OAS schemas. The downside is it is a larger package, coming in at around two and a half megabytes or so for the module itself and its primary dependency. So if package size is a concern, that might not be the best choice.

There are other options. The fastjsonschema package combined with regular Python dictionaries and lists is a solid alternative and it is much smaller. If there is no need for any kind of schema documentation but there is still a desire for type checking, type-annotated Python classes will probably get you 80-plus percent of the way there in my experience. If you need mutable data structures and that is all you need, data classes are a good option. If you need an immutable data structure and type checking is not a concern at the level of the code structure, I usually start with something like a named tuple.
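The difference between the last two options is easy to see side by side. The sketch below uses only the standard library. The Artisan and Price types and their fields are invented for illustration.

```python
from dataclasses import dataclass
from typing import NamedTuple


@dataclass
class Artisan:
    """Mutable record: fields can be reassigned after creation."""
    name: str
    email: str
    active: bool = True


class Price(NamedTuple):
    """Immutable record: reassigning a field raises AttributeError."""
    amount_cents: int
    currency: str


artisan = Artisan(name="Ada", email="ada@example.com")
artisan.active = False  # allowed, data classes are mutable by default

price = Price(amount_cents=1999, currency="USD")
try:
    price.amount_cents = 2499  # not allowed on a named tuple
except AttributeError:
    print("Price is immutable")
```

Neither type validates its field values at runtime. The annotations are hints for readers, IDEs, and static checkers.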

I think the right data modeling tool depends less on popularity and more on the system’s constraints and the scale, scope, and longevity of the project.

Q. What are your go-to best practices for testing Python systems at scale? How do you balance unit, integration, and end-to-end tests, and how do you ensure the test suite stays reliable and useful over time?

Brian Allbee: For my own projects, I tend to like a testing approach that exercises valid and invalid inputs for all of the parameters of every callable in the project. You combine that with judicious monitoring of missing lines in a code coverage report, and that has served really well for me in making sure that the targets of those tests are being both thoroughly and realistically exercised.
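One way that approach might look with the standard unittest module is sketched below. The function under test, apply_discount, and its validation rules are hypothetical. The point is the shape of the suite: one group of valid cases, then a group of invalid values for each parameter.

```python
import unittest


def apply_discount(price, percent):
    """Return price reduced by percent (hypothetical function under test)."""
    for label, value in (("price", price), ("percent", percent)):
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            raise TypeError(f"{label} must be a number")
    if price < 0:
        raise ValueError("price must not be negative")
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    return price * (1 - percent / 100)


class ApplyDiscountTests(unittest.TestCase):
    def test_valid_inputs(self):
        cases = [(100, 0, 100), (100, 25, 75), (80.0, 50, 40.0), (0, 100, 0)]
        for price, percent, expected in cases:
            with self.subTest(price=price, percent=percent):
                self.assertAlmostEqual(apply_discount(price, percent), expected)

    def test_invalid_price(self):
        for bad in (-1, "100", None, True):
            with self.subTest(price=bad):
                with self.assertRaises((TypeError, ValueError)):
                    apply_discount(bad, 10)

    def test_invalid_percent(self):
        for bad in (-5, 101, "10", None):
            with self.subTest(percent=bad):
                with self.assertRaises((TypeError, ValueError)):
                    apply_discount(100, bad)
```

Saved in a module, the suite can be run with python -m unittest. A coverage report on apply_discount would then show whether any branch was never exercised.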

In a working environment, I like for the team to come to some sort of consensus about how the tests are organized, what tools are involved, and so on. I have seen what happens when different engineers who are not communicating with each other each go their own way. The tests that result, even if they are rigorous and well thought out, are oftentimes difficult to follow across different test modules because different people write tests in different ways.

When integration or end-to-end testing is not feasible, I try to push unit tests closer to behavioral testing even if that increases mocking complexity. In those kinds of scenarios, unit testing can still go a long way. Ultimately, though, tests are most valuable when they reflect how the system is actually used, whether that is managed at a unit testing, integration testing, or system testing level, not just how that testing code or the code itself is structured.

Q. AI is now being used in areas like test generation, debugging, and even test maintenance. How should engineers think about AI in testing without compromising reliability and confidence in their systems?

Brian Allbee: I think the fundamental important thing there is going to be getting some sort of a consensus from anybody who is involved, all of the stakeholders, and anybody who is going to be held accountable for failures in the code, as to what the guardrails need to be. I like AI from the standpoint of code generation for things that are not sensitive. If it affects somebody’s lives or livelihoods, I do not want randomly generated code out there without really good guardrails.

The approach that I have seen and tried myself that has the most promise is to take a test-driven development approach and define a test suite, and only allow humans to modify that test suite. Make sure it is really good, really solid, really rigorous, and covers all of the business needs, everything that you can come up with. And then you can let an AI process go to town on the code as long as there is a clear boundary there. You can tell it, you can write all the code you want, it must pass this test suite, you do not get to modify that test suite. Let it go to town. At that point I think you are probably about as well covered as you can be.

Q. How should Python teams set up CI/CD pipelines to improve code quality and deployment reliability? What best practices help the most and what pitfalls should be avoided?

Brian Allbee: The goals are all the same for CI/CD regardless of the language involved. You have to fetch the code, you have to test it, you have to build it and package it, and you have to deploy it. The basic sequence is common across the board. There may be additional tasks like checking that the deployment process is well-formed, or aspects that are not tied directly to the code itself.

The main value that CI/CD adds is not necessarily the automation. It is the fast, automated, trustworthy feedback that you get from one of those processes. I would say look for places where you can generate that feedback, find the break points that are going to happen, and make sure that anything that fails gets surfaced in a meaningful, timely, and useful fashion.
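As a toy illustration of that sequence and of surfacing failures promptly, the sketch below runs placeholder stages in order and stops at the first one that fails, reporting what went wrong. It is not modeled on any particular CI product. The stage functions are stand-ins, and one of them fails on purpose.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class StageResult:
    name: str
    ok: bool
    detail: str = ""


def run_pipeline(stages: list[tuple[str, Callable[[], None]]]) -> list[StageResult]:
    """Run stages in order and stop at the first failure."""
    results = []
    for name, step in stages:
        try:
            step()
        except Exception as exc:  # surface the failure instead of hiding it
            results.append(StageResult(name, False, f"{type(exc).__name__}: {exc}"))
            break
        results.append(StageResult(name, True))
    return results


def fetch_code():
    pass  # placeholder: check out the repository


def run_tests():
    raise AssertionError("2 tests failed")  # simulated failure


def build_package():
    pass  # placeholder: build and package


def deploy():
    pass  # placeholder: deploy the artifact


stages = [
    ("fetch", fetch_code),
    ("test", run_tests),
    ("build", build_package),
    ("deploy", deploy),
]
for result in run_pipeline(stages):
    status = "ok" if result.ok else "FAILED"
    print(f"{result.name}: {status} {result.detail}".rstrip())
```

Because the test stage fails, the build and deploy stages never run, and the report names the stage and the reason.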

Q. What makes a Python application cloud-ready, and what are the most important design principles to bear in mind?

Brian Allbee: Cloud-ready can mean different things to different organizations, different teams, or even individual people. A containerized application is cloud-ready provided that it can be deployed appropriately, but so too are function-as-a-service constructs like AWS Lambda functions and their equivalents in other provider spaces. It all depends ultimately on the final deployment expectations.

Some key things to bear in mind include leveraging environment variables to help control behavior in different cloud accounts or environments within those accounts. You will find that they can carry over from local development to deployment processes and build pipelines all the way out to your final deployed product. They can always be replicated and manipulated locally, and that makes things easier and faster to change in a deployed application without having to redeploy an entire stack.
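A minimal sketch of that pattern follows, assuming three hypothetical variable names (APP_ENV, LOG_LEVEL, REQUEST_TIMEOUT_SECONDS) with defaults suited to local development. The same code then behaves differently in each account or environment without being changed.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    environment: str
    log_level: str
    request_timeout_seconds: float


def load_settings(env=None) -> Settings:
    """Build settings from environment variables, with local-friendly defaults."""
    env = os.environ if env is None else env
    return Settings(
        environment=env.get("APP_ENV", "local"),
        log_level=env.get("LOG_LEVEL", "INFO").upper(),
        request_timeout_seconds=float(env.get("REQUEST_TIMEOUT_SECONDS", "5")),
    )


# Passing a plain dict stands in for a differently configured environment.
print(load_settings({"APP_ENV": "staging", "LOG_LEVEL": "debug"}))
```

Accepting the mapping as a parameter keeps the function easy to test, since a test can pass a dictionary instead of modifying the real process environment.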

Be aware of and actively seek out systems that cloud providers offer for things that you need to deal with. Secret storage is a great example. Pull a secret in one time when a container initializes or a Lambda starts up, and then do not touch it after that. Know the best practices and constraints for your final deployed code. A great example is AWS Lambda functions. You cannot run a Lambda function for more than fifteen minutes, so once you have a good idea of how long a process can take, set that timeout accordingly and test against it.
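The pull-it-once pattern for secrets can be sketched with a cached lookup. In the example below, fetch_secret_from_provider is a hypothetical stand-in for a call to a cloud secret store, and the counter exists only to make the caching visible. The handler follows the general shape of a Lambda-style entry point.

```python
import functools

FETCH_COUNT = 0  # only here to make the caching visible in the example


def fetch_secret_from_provider(name: str) -> str:
    """Hypothetical stand-in for a call to a cloud secret store."""
    global FETCH_COUNT
    FETCH_COUNT += 1
    return f"value-of-{name}"


@functools.lru_cache(maxsize=None)
def get_secret(name: str) -> str:
    """Fetch a secret the first time it is asked for, then reuse it."""
    return fetch_secret_from_provider(name)


def handler(event, context=None):
    """Lambda-style entry point; repeated invocations reuse the cached secret."""
    api_key = get_secret("payments-api-key")
    return {"authorized": bool(api_key), "order_id": event["order_id"]}


handler({"order_id": 1})
handler({"order_id": 2})
print(FETCH_COUNT)  # 1
```

The cache lives as long as the process does, so a fresh container or a cold start fetches the secret again, which is the intended behavior.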

I think cloud-ready is less about where code runs and more about designing for volatility and external constraints.

Q. How do statelessness and containerization fit into building scalable cloud systems, and why do they matter?

Brian Allbee: If you think about it at a basic structural level, the key concept that ties almost every cloud-resident system together, whether that is containerization, stateless design, or any of those, is that they are inherently disposable. A container can be killed at any time. A Lambda invocation could be terminated before it reaches a successful completion. Kubernetes pod restarts are probably routine events. Even in a serverless context, a virtual machine can be stopped and restarted without warning. Recognizing that means designing your processes around the expectation that your hardware can disappear at any point in time.

Statelessness in that context is in a very real way about making failure of your hardware cheap. There is no state to manage, there is no need to write code to reacquire that state. A process ends and is restarted. Planning for failures and designing around the idea that recovery from a failure is just starting a new instance is probably near the top of the list of factors shaping design decisions.

In a container-based context, the container is at that point the smallest unit to replace. The key factor to keep in mind there is making sure that the startup behavior is consistent and predictable. Ensure that the environments are repeatable and allow a failed container to be replaced automatically and seamlessly rather than relying on any kind of manual troubleshooting process.

Statelessness and containerization matter, more than for any other reason, because they make failure cheap and recovery routine. That is what it comes right down to.

Q. A member of the audience asked: How critical is containerization to scaling systems?

Brian Allbee: Containerization is one of the more popular mechanisms but it is not the only mechanism out there. Most of my experience with containerization has been in a cloud-oriented context, and the alternatives in an AWS context at least include things like Lambda functions, which technically are their own containers but you do not have to worry about containerization as one of the factors in your code that you are concerned with. You are literally just writing code to fit inside the context of that Lambda container and letting it go. It is a good skill to have, most definitely, something that is going to be of use and interest in a lot of jobs these days, but I do not know that it is a critical skill for all cases.

Q. A member of the audience asked: Does Flask scale well?

Brian Allbee: The scaling question is so context-dependent that it is really hard to say definitively. In a containerized structure where your data store is completely separated from your Flask environment and application, and you can spin up and drop new instances of containers, I think it scales as well as anything else out there.

You will probably find that FastAPI is going to be more performant, but there is also a lot more work that has to happen in a FastAPI context. Flask is probably about in the middle. It is a good balance between a lot of stuff already supported versus speed of operation. And then at the other end you have something like Django where it does everything for everyone but it is not going to be as performant at an instance-by-instance level. After that, it is really going to end up depending on how well you can spread that load out through load balancing across containers running your application, regardless of whether it is Flask or FastAPI or Django or something completely homebrewed. That is probably where you are going to see the most scalability capability out of all of those options.

Q. What is one trade-off you see Python developers consistently get wrong when building systems?

Brian Allbee: The one I would say is most consistently seen in my experience goes back to the idea of overengineering. It is the mindset of, I want to write this as an object-oriented system because object-oriented is the way to go. And the same could be said for functional programming. Understand your problem space and design the solution around that problem space, because that is what you are trying to do: provide a solution for that specific problem space.

The best example in my personal experience was a system that was written in Python by somebody who came from a C# background. The project was ridiculously huge. Every function had its own module. Every class had its own module. When you put everything together, the functional layers of the system were seven or eight deep depending on the context.

Way more complicated than it needed to be, and it was hard to manage and hard to maintain. If I could have gone back and talked to, let us call him Steve, I probably would have said, Steve, there is this really good book out there called Code That Fits In Your Head. Read that. It is all about keeping things at a manageable level because psychologically humans can only keep five to seven bits of information in the front of their memory at a given point in time. You are talking about seven layers worth of depth in a project structure. That is already saturating things. Keep it simple. Collapse things down to the point where you do not have to have nineteen different classes and fifteen different instances of all these other classes to deal with something that really should be capable of being managed as a single function.

Q. How can you tell when an engineer is ready for senior-level work?

Brian Allbee: It goes back to what we started with. It is when they start demonstrating that they are concerned with more than just whether the code is doing what it is supposed to do, whether this function works. Is there a certain amount of curiosity about why we are doing it this way? What is the advantage of taking a functional programming approach versus a procedural versus an object-oriented one? What are the trade-offs, and do they recognise the trade-offs? Those are the things that start really indicating somebody is actually ready to go beyond just, I have written this function, it is done, it is tested, it works, I am finished.

The senior engineers that I have tried to emulate and that I have seen do their best work really are not defined by the code that they write but by the systems that they shape and the teams that they are enabling. That involves some gatekeeping, asking why we are going down this particular design path, or why are we not using this brand new library that has just shown up in the last three months. There is a lot of broader scope in asking those questions of less senior engineers and guiding them to learn how to ask those same questions on their own. Why do we go down this road? What is the benefit? What are the trade-offs? Because there are always trade-offs. Always.

Q. What motivated the second edition of Hands-On Software Engineering with Python, and what changed?

Brian Allbee: A good part of it was just the time lag. It was seven years between the first edition and the second. But Python itself has changed significantly, not so much in the language core but in the maturity of its tooling and the breadth of problems that it is now used to solve. The ecosystem around testing, packaging, automation, and deployment has grown in ways that significantly change how Python is used in real-world systems. Its adoption has also expanded dramatically, particularly in cloud and large-scale environments and also in AI, where it is very much a go-to language right now.

Today, Python engineers are frequently expected to think about architecture, performance, testing, and operational concerns in ways that just were not as common when the first edition was written. All of those growth areas and the dramatically increased surface area of use of the language kind of begged for further discussion.

The first edition tells the story of a fictitious company called Handmade Stuff that is just starting to develop an application structure to deal with what they are trying to accomplish as a company. The second edition takes that story forward and says, okay, we have this application and it is functional but less than optimal, and there is a significant impetus from the organization to move into the cloud. So what would that look like? A lot of the principles are still absolutely the same. You still need that system-level thinking, you still need to understand the problem space, you still have to work out how you are going to deploy this. What it looks like is going to be extremely different.

Q. What is the one piece of advice you would give to Python developers trying to grow into stronger engineers in an AI-accelerated world?

Brian Allbee: If you come away thinking differently about why you write the code that you are writing, not just how, then you are moving in the right direction. Engineers are on the hook to develop and ship working code. But I will go back to my basic principles more than anything else. Think in systems. Design for change. If you are working in a team, optimise for your team.

Ask how easily the code could fit into a larger system, how it will change over time, and how your design choices will affect the product later for the people who have to work with it after you are done with it. That mindset shift is a key thing in enabling an engineer to grow into more senior roles and to build software that holds up under the real-world pressures that you are going to run into.

Brian Allbee is a Staff Software Engineer at Cleerly and the author of Hands-On Software Engineering with Python, published by Packt.

Building Agent-Ready APIs in Production with Erik Wilde

Saqib Jan — Wed, 20 May 2026 20:06:57 GMT

Erik Wilde has spent more than 12 years working on APIs in every form, from communication protocols to enterprise API platforms, governance frameworks, and now the question of what it takes for APIs to actually work for AI agents. He holds degrees in computer science from TU Berlin and a PhD from ETH Zurich, has contributed to multiple open standards, and is an OpenAPI Ambassador at the OpenAPI Initiative. He currently works at Jentic, where he focuses on making API landscapes usable for the next generation of agentic consumers.

Erik joined a Deep Engineering Live interview session to talk about OpenAPI 3.2, what agent-ready APIs actually look like, and why he is more skeptical about MCP than most people expect.

Watch the full conversation below.

A note on format: this session was recorded live as part of the Deep Engineering Live Interview Series. The transcript below has been lightly edited for clarity and readability. Audience members joined the conversation and asked questions directly during the session.

Q. Tell us about your background and how you ended up working on APIs.

I have been working on APIs, in some shape or form, all of my life. I started with communication systems and protocols and then moved into the API space proper about 12 years ago. I have mostly worked for companies that sell enterprise software in that space, so typically API gateways and API platforms, the kinds of things where large companies have a lot of digital capabilities and a lot of those have APIs. More and more, companies have realized that the better you maintain, manage, extend, and govern that real estate, the easier it becomes to develop new applications and to realize potential that is within the company but needs a little bit of digging to get to.

Then about a year ago I met the two founders of Jentic, and they described to me what they were building. Very briefly, what they want to do is build a platform where agents can use APIs, because oftentimes the APIs that exist might not be the ideal ones for agents, and you also might want to control those agents a little more because you might not be confident they always do the right thing. We all know that AI has a tendency to sometimes have surprising ideas. I really liked that idea, so I decided to join. I have been at Jentic now for just over half a year and it has been a great experience. I still talk about APIs because in the end, without APIs, there is simply no AI.

Q. OpenAPI 3.2 shipped last September. What changes have the highest operational impact for engineering teams, and which are mostly nice to have?

3.2 is a maintenance release. It is backwards compatible and does not change things dramatically. What it is not, and I want to start there, is AI focused. That is what we are planning for the next version, 3.3, where we really want to think more aggressively about what it would take to make OpenAPI specifically more AI friendly.

That said, even in 3.2, some of the improvements are more meaningful than they might first appear. The tag system has been extended so that tags, which you use to group and annotate operations, are now a hierarchical space rather than a flat one. You can have tags and subtags and so forth. That is something people always wanted to do. The reason it matters for AI is that anything that makes an API description semantically richer, anything that allows descriptions to carry more meaning, is valuable for agents. So thinking about how you describe your APIs not just as technical endpoints but as semantic services, with rich schemas, descriptions at every level, and well-defined error messages, that is where I think the real operational value lies right now.
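As a rough illustration of the hierarchical tags described above, the sketch below models a tag list as plain Python data and groups each tag under its parent. It assumes the `parent` field of the OpenAPI 3.2 Tag Object; the tag names and the `tag_tree` and `render` helpers are hypothetical and not taken from the interview.

```python
from collections import defaultdict

# Hypothetical tag list, shaped like the `tags` array of an OpenAPI 3.2 description.
# `parent` names the enclosing tag; a tag without `parent` sits at the top level.
TAGS = [
    {"name": "customers", "description": "Everything about customer records."},
    {"name": "customer-billing", "parent": "customers",
     "description": "Invoices and payment methods."},
    {"name": "customer-support", "parent": "customers",
     "description": "Tickets and escalations."},
    {"name": "products", "description": "Catalog and inventory operations."},
]


def tag_tree(tags):
    """Map each parent tag name (None for the top level) to its child tag names."""
    children = defaultdict(list)
    for tag in tags:
        children[tag.get("parent")].append(tag["name"])
    return dict(children)


def render(tree, parent=None, depth=0):
    """Return the hierarchy as indented lines, depth-first."""
    lines = []
    for name in tree.get(parent, []):
        lines.append("  " * depth + name)
        lines.extend(render(tree, name, depth + 1))
    return lines


print("\n".join(render(tag_tree(TAGS))))
```

The descriptions on every tag matter as much as the nesting: the hierarchy tells an agent how operations are grouped, while the descriptions tell it what each group is for.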

At Jentic we have released a scoring mechanism for APIs so you can find out whether your API is AI friendly or not. A lot of what that scoring looks at is the kinds of things that have always been good API design practice: put in more descriptions, include examples, make your error messages clear and actionable. The difference now is that where a human developer might look at a poorly described API and figure it out from experience and context, an agent that cannot figure out how an API works will simply move on to the next one. It has less context and less tolerance for ambiguity. So the APIs you design now will probably be around for a couple of years, and starting to think about this new class of consumers is worth doing today.

Q. Streaming is also now explicitly supported in 3.2. When teams document streaming, what details separate readable from implementable and testable?

Streaming always was something people were doing. I think it has just become so much more visible because that is how all the AI APIs work. When you use a chatbot and you watch the response appear word by word, that is streaming in action. And what 3.2 does is give you a slightly more explicit way to document that in OpenAPI. That is actually a very common pattern with OpenAPI improvements over the years. It is not that something entirely new is added. It is more that people can now formally document something they have been doing all along, but that was not well covered by the specification.

WebHooks are another good example of this. WebHooks have been popular for a long time. I was surprised when somebody gave me a statistic saying that around 60 percent of the 100 most popular APIs use WebHooks. That is a remarkably high number, but it makes sense because WebHooks are a convenient pattern. You do something with an API, and at some point the API can call you back and say this process is finished, go and fetch your results. People had been doing that for a long time, but it was never explicitly supported in OpenAPI. And then at some point the specification simply gets extended to cover what practitioners are already doing. That is what makes it more complete over time.

Q. The 3.2 tag structure now supports nesting. How do you use tags as information architecture for large API catalogs, and how do you govern that taxonomy across teams?

That is a good and very demanding question, because it goes well beyond OpenAPI and into whether you have a data dictionary or some general framework for how things get named in your organization. Organizations always have a hard time doing that because it is hard to agree on terms, and it is hard to make sure that everyone understands which terms exist, what they mean, and when to use them. Tags are no different. They give you a way to assign meaning to things in your OpenAPI description, but what that meaning is is entirely up to you.

Until now tags were relatively minor things. The typical pattern was to say here are all the operations about customers, here are all the operations about products, and so on, and documentation tools would then group things by tag. With the hierarchical tag structure in 3.2, you could go much further. You could have a hierarchy of unlimited depth if you want, where each thing in your API is linked to some kind of data dictionary or ontology. I have not seen people doing that yet, but I am pretty sure they will start.

That said, my recommendation would be not to go crazy building a complex standalone tag taxonomy inside OpenAPI. If you start introducing complex terminology with different hierarchies and groupings, you probably also need to align that with every other place in your organization where things get tagged, whether that is databases, document stores, or wherever you manage information. So check what your general information architecture looks like. What dictionaries or terminologies are already established? Then think about how you map those into the OpenAPI tag model rather than inventing a whole new taxonomy that lives only in your API descriptions.

Q. On linting as a quality gate: how do you design a rule set taxonomy that maps cleanly to real ownership, the way platform teams and product teams each have different responsibilities?

What linting is being used for right now is governance and a level of automation. The goal is that when people start designing or changing APIs they get quick feedback on whether they are following guidelines or not. A good number of organizations publish their rule sets openly on GitHub. I have a collection of around 30 or 40 publicly accessible ones. The Zalando ones are popular because they have been around for a while. Adidas has some solid ones. There are also some published by government and e-government initiatives. So there are plenty of references.

Linting is useful but it has real limitations. The popular tools, whether that is Spectral, Vacuum, or Redocly, all work in a similar way. You have rules that apply to certain parts of your OpenAPI description and they check for structural conditions. Something like, this operation must have a description and the description must be at least 20 characters. It is really a structural check. And that is useful. I would absolutely recommend doing it.
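The structural rule mentioned above (every operation has a description of at least 20 characters) can be sketched in a few lines of plain Python. This is not how Spectral, Vacuum, or Redocly express rules; the `check_operation_descriptions` function and the sample `SPEC` are hypothetical and only show what a structural check looks at.

```python
MIN_DESCRIPTION_LENGTH = 20  # threshold taken from the example in the text
HTTP_METHODS = {"get", "put", "post", "delete", "patch", "options", "head", "trace"}


def check_operation_descriptions(spec):
    """Return one finding per operation whose description is missing or too short."""
    findings = []
    for path, item in spec.get("paths", {}).items():
        for method, operation in item.items():
            if method not in HTTP_METHODS:
                continue  # skip non-operation keys such as "parameters"
            description = (operation.get("description") or "").strip()
            if len(description) < MIN_DESCRIPTION_LENGTH:
                findings.append(
                    f"{method.upper()} {path}: description shorter than "
                    f"{MIN_DESCRIPTION_LENGTH} characters"
                )
    return findings


# Hypothetical, already-parsed fragment of an OpenAPI description.
SPEC = {
    "paths": {
        "/customers": {
            "get": {"description": "List customers, newest first, with cursor-based paging."},
            "post": {"description": "Create."},
        },
        "/customers/{id}": {
            "parameters": [{"name": "id", "in": "path"}],
            "delete": {},
        },
    }
}

for finding in check_operation_descriptions(SPEC):
    print(finding)
```

The check only confirms that a description exists and has a certain length. Whether the text is actually useful to a reader or an agent is outside what a rule of this kind can see.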

What I am not a big fan of is just reusing existing rule sets wholesale. I would always say start owning this, build up your own in a collaborative fashion. Have a GitHub repository somewhere where developers can propose and discuss new rules, argue for whether a guideline is worth following, and then get it merged into your shared rule set once there is enough agreement. You might also have different rules for different stages of the API lifecycle. Some rules are so important that every code check-in has to follow them. Others might only apply to APIs you expose to external partners, where you want higher quality standards. So you end up with rule sets that are tuned by the consumer type or the lifecycle stage, or both.

But as I said, linting has limits. At Jentic we use Spectral and Redocly as part of our API scoring checks, but we also have a good number of LLM-based checks, because if you are scoring APIs for AI readiness, what matters is not just whether a description field exists but whether it is written in a way that is actually useful for an agent. Those are the kinds of checks that typical linting tools cannot do because they operate at the structural level. So linting is a solid and by now fairly standard first line of defense, but also look a little beyond it.

Q. How do you set severity levels like error, warning, and informational, and what is an exception policy that avoids lint fatigue without lowering the floor?

Severity levels really should be what you would expect. If something is non-negotiable and needs to be fixed before anything moves forward, that is an error. There is no discussion. Then you have warnings, where the message is that this is not great but it is acceptable, though you should consider fixing it. It gives the developer a signal without blocking them. And then informational messages, which honestly I am not sure are that interesting for developers to act on directly. What I have seen done a couple of times is that informational-level messages are not really meant for developers to read at all. They are intended for downstream tooling. The linter surfaces an observation that is then picked up by some other tool in the pipeline. So the informational channel becomes a way for the linter to communicate with tooling downstream rather than with the developer.
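One way to picture that split is a small triage step after the linter runs. The sketch below is a minimal illustration under assumptions: the finding format, the rule names, and the `triage` function are hypothetical and do not reflect the output format of any particular linter.

```python
import json


def triage(findings):
    """Split findings by severity and decide whether the change is blocked."""
    errors = [f for f in findings if f["severity"] == "error"]
    warnings = [f for f in findings if f["severity"] == "warning"]
    info = [f for f in findings if f["severity"] == "info"]
    return {
        "blocked": bool(errors),              # only errors stop the pipeline
        "show_developer": errors + warnings,  # warnings are visible but do not block
        "for_tooling": json.dumps(info),      # machine-readable channel for downstream tools
    }


# Hypothetical findings, as a linter wrapper might collect them.
FINDINGS = [
    {"severity": "error", "rule": "operation-description", "path": "/customers"},
    {"severity": "warning", "rule": "example-missing", "path": "/customers"},
    {"severity": "info", "rule": "uses-pagination", "path": "/customers"},
]

result = triage(FINDINGS)
print(result["blocked"], len(result["show_developer"]))
```

Keeping informational findings out of the developer-facing list is one way to limit lint fatigue: the developer sees only what needs a decision, and the rest flows to whatever tool consumes it.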

Q. On large specs with tens of thousands of lines, linting performance and PR feedback loops become real constraints. What repository or spec structuring patterns reduce friction without fragmenting the contract?

What you probably want is to avoid always linting the whole thing. Large specifications are never in one file. They are assembled from a whole bunch of sources, schemas, references, and components from various places. So it makes much more sense to have your checks in place at those individual source locations rather than only at the assembled specification level. Instead of linting the full spec at the end of every pipeline run, start linting when you make changes to the schemas and the smaller pieces that feed into the overall description.

If you do that with a reasonable level of discipline, you avoid the compounding effect where you finally lint the big spec and get hit with hundreds of errors you have been quietly accumulating. Do not treat linting as the last step. Do it as early as possible, as close to where the change is actually happening as you can. That is the pattern that keeps the feedback loops short and the debt manageable.
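The idea of linting close to the change can be sketched as a function that only looks at the fragments touched by a change set. Everything here is an assumption for illustration: the file paths, the already-parsed `FRAGMENTS` mapping, and the single rule in `lint_schema` are hypothetical, and a real pipeline would obtain the changed files from its version control system.

```python
def lint_schema(name, schema):
    """Flag properties without a description in one schema fragment."""
    return [
        f"{name}: property '{prop}' is missing a description"
        for prop, body in schema.get("properties", {}).items()
        if not body.get("description")
    ]


def lint_changed(fragments, changed_files):
    """Lint only the fragments whose source file appears in the change set."""
    findings = []
    for path in sorted(changed_files):
        if path in fragments:  # ignore changes to files that are not spec fragments
            findings.extend(lint_schema(path, fragments[path]))
    return findings


# Hypothetical source fragments that are later assembled into one large description.
FRAGMENTS = {
    "schemas/customer.yaml": {
        "properties": {
            "id": {"type": "string", "description": "Stable customer identifier."},
            "tier": {"type": "string"},
        }
    },
    "schemas/product.yaml": {
        "properties": {"sku": {"type": "string"}},
    },
}

print(lint_changed(FRAGMENTS, {"schemas/customer.yaml", "README.md"}))
```

Only the customer schema is checked in this run, so the feedback arrives with the change that caused it. A full lint of the assembled description can still run less frequently as a backstop.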

Q. There is a proposal for OpenAPI 3.3. What are you personally most interested in seeing there?

For me, because of where I work right now, the big issue is how we could improve OpenAPI specifically with a focus on AI. We have not done that so far in any serious way. There are a whole bunch of discussions within the OpenAPI Initiative around how that could be done.

Some of it is about semantics. Some of it is about making clearer when and how long an API is actually going to be around, which is something agents care about in ways that human developers traditionally have not. Agents always use an API at runtime. They discover it, decide it looks like a good API to use, and then need to figure out what it does, what it does not do, what its side effects and constraints are. All of that could be surfaced in a much more accessible way through the API description itself rather than sitting only in human-facing documentation.

One idea I find genuinely interesting is the relationship between OpenAPI and Arazzo. Arazzo is a workflow language, published by the OpenAPI Initiative, that lets you orchestrate sequences of OpenAPI interactions. You can say: to accomplish this goal, call this endpoint, then that one, then that one. It is a simple orchestration language layered on top of OpenAPI. What would be really cool is if an OpenAPI description could link to an Arazzo workflow and say, if you use this operation, it actually makes the most sense as part of this workflow you can find over there. Figuring out multi-step workflows is one of the hardest things for agents to do right now, and Arazzo is genuinely good at describing those. We just need to make it discoverable. So that is one of the directions I would love to see 3.3 move in.

And as a reminder, the OpenAPI Initiative is open source and open to everyone. You do not need to be a member, you do not need to pay anything. The discussions happen primarily on Slack. If you have ideas or questions, just come and join. It is a very active and welcoming community. Check out openapis.org, and note that the S matters.

Q. With MCP consolidating under the Linux Foundation’s AI foundation, what is the minimum governance surface an enterprise needs before agents can use tools broadly?

I am still a little skeptical about MCP, honestly. I may very well be wrong, but what I would really encourage everyone to do is first think about your API estate and really invest in your APIs, rather than obsessing too much over MCP specifically. Whatever you invest in better APIs becomes useful for everyone. Developers can use it, agents can use it, partners can use it. If you invest specifically in MCP, that investment is effectively scoped to LLM consumers. And that may sometimes make sense, but it is important to keep in mind that the API landscape is the foundational layer you will be working with long term, and MCP may or may not stick around.

At Jentic we do support MCP because at this point you have to, but we are not deeply invested in MCP itself. If MCP went away and something else came along, that would not be a significant problem for us. We think of what we do as delivering capabilities to agents, and MCP is the current delivery mechanism. You need a delivery mechanism, but I would not build too many things that are MCP-specific. That would be my personal view.

Q. From an audience member: what makes an API truly agent-ready in production compared to a standard REST API?

One of the things I like to use as an illustration is the GitHub API. The current GitHub API version three has around 1,100 operations. GitHub is a complex product and there is a lot you can do with it, so 1,100 operations is not unreasonable. But for an agent to work directly with that API is quite complex, because a large number of those operations need to be combined in a certain way to produce the workflows that you actually want to accomplish on GitHub.

Now compare that to the GitHub MCP server, which has around 70 tools. Way fewer, and they are much higher level. They represent entire workflows, entire things you might want to do on GitHub, rather than the more atomic operations you find in the native API. What I would argue is that if you had a genuinely agent-friendly GitHub API, it might also just have around 70 operations. Not 1,100. Right now those 70 are available through MCP because that is what GitHub decided to build, and that is fine, but the point is that if you have an agent that wants to get things done, it will be significantly happier with 70 well-described higher-level operations than with 1,100 lower-level ones.

The properties that make an API agent-ready follow from that. It should not be too fine-grained. The descriptions should be written at a level that is meaningful for an LLM, which means intent-based and human-readable, not just technical. It should have examples, and ideally multiple examples rather than just one. Error messages should be meaningful and actionable, giving the agent enough information to understand what happened and what it might do next. And if you make those improvements, you almost certainly also improve the developer experience as a side effect, so it is not a speculative investment.

Q. On API deprecation and sunsetting: how should agents handle the lifecycle signal that an API they depend on is entering a sunset cycle?

Deprecation and sunsetting are genuinely important to me. I have written some small standards for how an API can actually surface that information at runtime. And I think we will see more and more of these runtime mechanisms being built out, because agents consume APIs at runtime by design. They discover an API, start using it, and then ideally they should also be able to discover that the API is only going to be available for another two weeks. At that point, a well-designed agent might alert someone, or start looking for a replacement, or whatever the right behavior is for that situation. What exactly to do about it is a separate design question. But as a consumer of an API, this is information that is relevant, and if we can surface it at runtime, consumers can react at runtime. That feels like an obviously good thing to pursue.
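A minimal sketch of a consumer reacting to such a runtime signal follows. It assumes the API announces its end date through the `Sunset` HTTP response header defined in RFC 8594, whose value is an HTTP date; the interview does not name a specific mechanism, and the `days_until_sunset` helper and the sample header value are hypothetical.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def days_until_sunset(headers, now=None):
    """Return whole days until the Sunset date, or None if no Sunset header is present."""
    value = headers.get("Sunset")  # assumes header names are already normalised to this casing
    if value is None:
        return None
    sunset = parsedate_to_datetime(value)  # parses an HTTP date such as the one below
    now = now or datetime.now(timezone.utc)
    return (sunset - now).days


# Hypothetical response headers captured by an agent while calling an API.
response_headers = {"Sunset": "Fri, 31 Dec 2027 23:59:59 GMT"}

# A fixed "now" keeps the example deterministic: exactly two weeks before the sunset date.
checked_at = datetime(2027, 12, 17, 23, 59, 59, tzinfo=timezone.utc)
remaining = days_until_sunset(response_headers, now=checked_at)
if remaining is not None and remaining <= 14:
    print(f"API sunsets in {remaining} days: alert an owner or look for a replacement")
```

What the agent does once the threshold is crossed is, as noted above, a separate design question; the sketch only shows that the information can be read and acted on at call time.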

Q. On request and response schema design: how do you design schemas so that an LLM can reliably choose the correct operation, handle partial failures, and avoid duplicating side effects?

Schema design becomes part of the general question of how you design OpenAPI for AI consumption. You want descriptions in your schemas, not just in your operations, so that an LLM can understand what individual fields actually mean rather than just their names and types. Names that carry meaning help too. Parameters named X, Y, and Z are much harder for an agent to reason about than parameters with names that reflect their actual intent.

Beyond that, I think we are going to see interesting evolution in how APIs handle the granularity of what they return. Right now the standard REST model is relatively static: here is a request schema, here is a response schema. But if you are working with agents that are trying to minimise token usage and context pollution, there is a real case for APIs that can return only the fields that were actually asked for. GraphQL has a nice built-in capability for this, which is one of the things that makes it interesting for agentic use cases. REST does not have that natively, but you could layer something on top. We will see how that evolves, but it is one of the more interesting design questions in this space right now.
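Layering field selection on top of a REST-style response can be as small as the sketch below. It assumes a hypothetical `fields` query parameter holding a comma-separated list of top-level field names; neither the parameter name nor the `select_fields` helper comes from the interview or from any standard.

```python
def select_fields(resource, fields_param):
    """Return only the requested top-level fields; no parameter means the full resource."""
    if not fields_param:
        return dict(resource)
    wanted = [name.strip() for name in fields_param.split(",") if name.strip()]
    # Unknown field names are ignored rather than treated as an error.
    return {name: resource[name] for name in wanted if name in resource}


# Hypothetical full representation of a resource.
CUSTOMER = {
    "id": "c-1042",
    "name": "Handmade Stuff",
    "tier": "gold",
    "billing_address": {"city": "Denver", "country": "US"},
    "notes": "Long free-text history that an agent rarely needs.",
}

# Equivalent to a request such as GET /customers/c-1042?fields=id,tier
print(select_fields(CUSTOMER, "id, tier"))
```

For an agent, the trimmed response means fewer tokens spent on fields it never asked about, which is the context-pollution concern raised above.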

Q. What workflow patterns show up repeatedly when enterprises actually start working with Jentic, and what makes them stable as APIs underneath them change?

One example we were not expecting, which is always a good sign when you start talking to real enterprises, is the partner integration scenario. If you have a relatively complex API that you expose to partners, that is a large engineering effort for each of those partners. They have to understand the whole API even if they only need a small part of it.

What we now actively pursue, because it keeps proving useful, is creating specific workflows for specific partners. You say, this partner only wants to do these particular things, so they get a set of workflows built on top of the API that match their actual use cases. They do not need to understand the full API surface. They just need to understand the workflows that were created specifically for them.

And the stability point is interesting. As long as you develop your APIs in a backwards compatible way, those workflows remain stable even as the underlying APIs change. As a workflow user you do not even need to know that the APIs underneath now do additional things. You just keep invoking the same workflows and they continue to work. The moment you break backwards compatibility in an API is the moment you also break the workflows depending on it. So the discipline of backwards compatibility pays off at every layer.

Q. Looking ahead six months, what should a senior engineer or platform engineer watch closely in standards, tooling, or governance for agent-facing APIs?

What I would recommend, starting from tomorrow morning, is to begin thinking about agents in your planning even if you do not have them yet. And I acknowledge that the term agent has become fairly meaningless at this point. Everything seems to be called an agent now. But what I do see when talking with organisations is that certain types of agents are already getting real use, customer support agents and some HR agents being the most common. These are agents that are useful across industries, and you can mostly buy them, hook them up to your documentation, and they work.

What you see much less of right now, despite all the talk, is what I would call real business agents in production, where a piece of software can sense things, take action, and make decisions. Agents that actually have agency. And I believe we will see more and more of these, not necessarily all at once, but incrementally. You trust them with a little more next year, and a little more the year after.

Because of that, I would highly recommend making the AI readiness of your APIs part of your standard practice now. API landscapes evolve slowly. Whatever you design or change today will probably be around for a year or two or three before you touch it again. So ask yourself whether your linting and your design practices are optimising only for developer experience, or whether they are also starting to account for agent experience. The good news is that optimising for agent experience tends to improve developer experience as a side effect. You are not making a speculative bet. You are making something better for everyone while also preparing for what is coming. If you work on API platforms or in platform engineering, start thinking now about how your API landscape will need to evolve as you have more and more agentic consumers. Because it is going to arrive. That is at least my personal view.

Erik Wilde is Head of Enterprise Strategy at Jentic and an OpenAPI Ambassador at the OpenAPI Initiative. He is the creator of the Getting APIs to Work channel on YouTube. This interview was conducted by Saqib Jan, Editor-in-Chief of Deep Engineering.

Design Patterns, Ownership Models, and Building Resilient Systems in Rust with Evan Williams

Saqib Jan — Wed, 13 May 2026 18:07:20 GMT

Evan Williams has been writing software for more than 40 years, across every layer of the stack and more programming languages than most engineers will encounter in a career. His book, Design Patterns and Best Practices in Rust, published by Packt, is not a pattern catalogue. It is an argument for a different way of thinking about code entirely, aimed squarely at experienced developers who arrive in Rust carrying instincts that the language will refuse to accommodate.

He recently sat down with Deep Engineering to talk about what that shift requires, which traditional patterns break in Rust, the typestate pattern he finds almost impossible to stop talking about, and why he discovered it is harder to write bad Rust than good Rust.

Watch the full conversation below.

A note on format: the transcript below has been lightly edited for clarity and readability.

Q. You have been in software for a long time. Tell us about your background and how your journey started.

Evan Williams: I have been in the software business for horrifyingly more than 40 years. Surprisingly, given that, my journey started when I was 14 years old, in the 1970s. My father was a very skilled electrical engineer, and we built a computer in the basement together, which had a 6502 processor and 1K of RAM. He hoped that I would become an electrical engineer because of my interest in the electronics. I became a programmer because I thought the programming was the most fun part. I have since then grown up with the industry, not from the very beginning of it, but certainly from fairly early on, and in particular grown up with professional software development from the beginning. I’ve touched every part of the stack, many, many programming languages, many systems, and I am intensely interested in this. If they didn’t pay me to do it, I’d do it anyway.

Q. You have worked with languages like C and Python. What initially drew you toward Rust?

Evan Williams: I’m interested in programming languages in general, and I had heard about Rust almost at the very start. I looked at it and I said, this looks kind of mildly interesting, I’ll just remember that. And then I forgot about it. But years later, Rust had just barely reached 1.0, and I was working on a hardware project that had a few interesting characteristics. It needed to be in a location that we couldn’t access easily. It wasn’t going to be on the internet. It needed to be rock solid because people depended on it. And it was remarkably complicated software. The team talked about it and we made the decision to go with Rust, which none of us knew at the time. We all learned it together. And that was where I caught the Rust bug and have not lost it since.

Q. What motivated you to write a book on design patterns in Rust at this stage of your career?

Evan Williams: There are sort of two questions tied up in there. One is, why would I be writing a book at all at this stage of my career? And the answer to that is, I have been remarkably fortunate to have people help me throughout my career and help me grow. And it makes me incredibly happy to be able to pay that forward and help the people who are learning at this point to grow themselves. This book is a vehicle for me to help people. Why Rust design patterns? Because I think the interesting thing about dealing with design patterns in Rust is very often they’re not what you need, and very often the traditional ones are not exactly what you need, and very often they can cause you problems that you didn’t have to have. So I wanted to save people that frustration and help them move along the path in a way that is a lot less painful, by not making the same mistakes that I made.

Q. Rust has been around for some time but is now seeing increased adoption. Why do you think this is the moment?

Evan Williams: For a long time there were a lot of people like me who were excited about Rust, and there were people who were early adopters who were pushing for it. But right now what we’re seeing is a world where the language has matured, the ecosystem has matured, it’s reached a kind of critical mass. Now when someone is thinking about doing a Rust project, they’re not pioneers who are wandering out into lands unknown. There’s a huge community, a huge set of packages and software, great learning resources, and there are people who have had success that you can see who have done amazing things with Rust. People feel more confident and feel safer venturing into using Rust for things, whereas before it was a little bit more of a risk in their mind.

Q. What gaps does Rust fill compared to more established languages like C, Java, or Python?

Evan Williams: I could just go over sort of the normal thing and say high performance and memory safety, and all those things are true. But Rust is also a powerful modern language that has a good feature set. And I feel like there’s a design gap of sorts. One of the things that’s great about Rust, and I’ll probably keep returning to this topic, is if you need something to be correct, if you need to be able to count on the code that you write, Rust helps you. Rust is your partner in doing that. You can still write code with bugs in it, but Rust makes it harder to do that and easier to write code that’s going to be solid. I think that’s a huge gap that it fills.

Q. What kinds of systems or use cases benefit most from adopting Rust today?

Evan Williams: Systems that are mission critical in some way or other are really key Rust use cases. Yes, Rust has high efficiency and the memory safety is very important. But all of these features combine into a whole that makes Rust a really powerful language for doing things that have to work: things where failure is a terrible problem, whether in monetary or in human cost. That’s one of the powers of the language and I think it’s great for those situations.

Q. One of the recurring themes in your book is that developers need to think differently in Rust. What does that mindset shift actually involve?

Evan Williams: There are a few things associated with that. One is that Rust is not an object-oriented language. It kind of looks like an object-oriented language in some ways, but it’s not. And if you carry with you an object-oriented language mindset, then you’re going to have nothing but trouble. It’s also a language that requires you to think carefully about the design of your code before you start writing it. It’s very easy to get yourself into trouble in Rust if you don’t plan what you’re doing. You have to start thinking about data and how it’s handled in a different way. You have to think ahead of time about where the data is flowing through your program, what it is, where it is going, how long is it going to live, who is responsible for it. These are things that you don’t really have to think about when you’re writing a Python program or a Java program. Those are good design principles in all of those languages, but Rust requires it of you.

Q. Many developers struggle with the borrow checker early on. How should they reframe it as a design tool rather than a limitation?

Evan Williams: The thing about the borrow checker is it’s there to help you, and it is very easy to get into a mode where you’re fighting with it and you feel like it’s your enemy. But in fact, what it is doing is encouraging you to build your code in a solid manner. It’s encouraging you to think about not just what data you have, but how it’s going to be used. In something like Java or Python, with some amount of plumbing you can get anything from anywhere. You don’t really have to design your program in a highly organized way where you’ve thought about the data flows. But you’re much better off if you do. I think the principles that the borrow checker forces you to adhere to in Rust are the exact principles that you should be using in every programming language. But you don’t have to. So it’s very easy to not think about those things. The borrow checker is your friend because it prevents you from making a messy design. It prevents you from making a broken design. It prevents you from writing whole classes of bugs that you will then spend many hours trying to find. I have found it to be an incredible partner in writing code that allows me to sleep at night.

Q. What are the most common mistakes developers make when they try to apply patterns from other languages directly in Rust?

Evan Williams: The principal thing is, number one, trying to use Rust as if it’s an object-oriented language. It’s not going to work. Viewing the compiler errors as things that you need to figure out how to work around is something that virtually everybody who starts with Rust does. And I can’t emphasize this enough: one of the things that’s sort of interesting about Rust is the more experienced you are, the more years you have doing something in some other language, the more trouble you’re likely to have, because you have patterns of thought that come from those languages that you don’t even realize are there. That’s one of the things that can really get you into trouble. Treating the compiler errors and the problems with the borrow checker as things to work around as opposed to signals that your program needs some redesign, and thinking about things from an object-oriented perspective, those are the main traps.

Q. Traditional design patterns were created with object-oriented languages in mind. How do they evolve when applied to Rust?

Evan Williams: In a number of different ways. First, some of them evolve into entirely new forms or almost out of existence, because being a modern language, Rust has things like enums and all sorts of very advanced use of generics. These are things that make a lot of these design patterns either less necessary or unnecessary. Another way that they evolve is that since there is no inheritance in the language, you have to rethink a lot of the design patterns where, say, you would have an abstract base class. You can’t have that because there’s no such thing. So you would lean into traits. And every single design pattern that you use is affected by the borrow checker and by memory discipline in a way that is not normal in something like Java or Python.
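
As a rough illustration of the two shifts described here, the sketch below uses a trait with a default method where an object-oriented design might reach for an abstract base class, and an enum with `match` where a small class hierarchy might otherwise appear. The names `Shape`, `Circle`, `Rect`, `Command`, and `apply` are hypothetical and are not taken from the book.

```rust
// Hypothetical example: all type and function names are illustrative only.

/// A trait declares required behaviour and can supply shared default
/// methods, which covers much of what an abstract base class is used for.
trait Shape {
    fn area(&self) -> f64;

    // Default method: shared behaviour without inheritance.
    fn describe(&self) -> String {
        format!("shape with area {:.1}", self.area())
    }
}

struct Circle {
    radius: f64,
}

struct Rect {
    width: f64,
    height: f64,
}

impl Shape for Circle {
    fn area(&self) -> f64 {
        std::f64::consts::PI * self.radius * self.radius
    }
}

impl Shape for Rect {
    fn area(&self) -> f64 {
        self.width * self.height
    }
}

/// A closed set of variants can often be an enum plus `match`
/// instead of one class per variant.
enum Command {
    Push(i64),
    Add,
    Clear,
}

fn apply(stack: &mut Vec<i64>, cmd: Command) {
    match cmd {
        Command::Push(v) => stack.push(v),
        Command::Add => {
            // Pop two operands if both are present; otherwise leave the stack unchanged.
            if stack.len() >= 2 {
                let b = stack.pop().unwrap();
                let a = stack.pop().unwrap();
                stack.push(a + b);
            }
        }
        Command::Clear => stack.clear(),
    }
}
```

Because `match` must cover every variant, adding a new `Command` later makes the compiler point at each place that needs updating.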

Q. Are there any patterns that become unnecessary or even counterproductive in Rust?

Evan Williams: The one that immediately springs to mind is Singleton, which is so useful that I was using it before it had a name. But it’s useful in Python or Java. In Rust, it tends to be either completely unnecessary because other features of the language make it unneeded, or it tends to encourage designs that are really not necessary and where a much better approach could be used. There are a few occasions where the singleton pattern, as it stands, is actually useful, but more often than not, it’s getting you into trouble.
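
One possible reading of this point, sketched under assumptions: in many cases the shared value can simply be owned by the caller and lent to whoever needs it, and for the rarer case where a process-wide value really is wanted, the standard library's `OnceLock` gives a lazily initialised static. The `Config` type and both functions are hypothetical.

```rust
use std::sync::OnceLock;

// Hypothetical configuration type, for illustration only.
#[derive(Debug)]
struct Config {
    max_connections: u32,
}

// Option 1: no global at all. The caller owns the value and lends it out,
// so the dependency is visible in the function signature.
fn connections_allowed(cfg: &Config, requested: u32) -> u32 {
    requested.min(cfg.max_connections)
}

// Option 2: a process-wide value, initialised at most once on first use.
static GLOBAL_CONFIG: OnceLock<Config> = OnceLock::new();

fn global_config() -> &'static Config {
    // The value 8 is an arbitrary placeholder.
    GLOBAL_CONFIG.get_or_init(|| Config { max_connections: 8 })
}
```

The first form keeps ownership explicit, which tends to sit more comfortably with the borrow checker than a mutable global does.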

Q. What Rust-specific patterns do you think are the most powerful and still underutilized?

Evan Williams: The one that I get so excited about that I have to limit myself so that I don’t spend the rest of this conversation talking about it is the typestate pattern. This is something that was not invented for Rust, but you would think that it had been because it works for Rust so perfectly. It’s a way of developing state machines and systems that have state that evolves where invalid state transitions aren’t just errors, they’re impossible to write. The compiler won’t compile them. It represents a huge advance in the way that such systems are written because now instead of runtime errors, you have a state machine that is guaranteed to work because every transition either is a valid transition or it won’t even compile. That’s an amazing thing and I love that feature. Not invented for Rust, but it fits Rust so perfectly, it’s hard to believe it.
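
A minimal sketch of the typestate idea, assuming a hypothetical two-state connection: the state is carried in a type parameter, each transition consumes the old value and returns a new one, and methods exist only on the states where they make sense. None of these names come from the book.

```rust
use std::marker::PhantomData;

// Hypothetical states for an illustrative connection type.
struct Disconnected;
struct Connected;

// The state lives in the type parameter, not in a runtime field.
struct Connection<State> {
    address: String,
    _state: PhantomData<State>,
}

impl Connection<Disconnected> {
    fn new(address: &str) -> Self {
        Connection { address: address.to_string(), _state: PhantomData }
    }

    // Consumes the disconnected value and returns a connected one.
    fn connect(self) -> Connection<Connected> {
        Connection { address: self.address, _state: PhantomData }
    }
}

impl Connection<Connected> {
    // `send` is defined only for the Connected state.
    fn send(&self, payload: &str) -> String {
        format!("sent {} bytes to {}", payload.len(), self.address)
    }

    // Consumes the connected value and returns a disconnected one.
    fn disconnect(self) -> Connection<Disconnected> {
        Connection { address: self.address, _state: PhantomData }
    }
}

// Usage, with the invalid transition left as a comment:
//
//   let c = Connection::<Disconnected>::new("10.0.0.1:9000");
//   c.send("hi");           // rejected: no method `send` on Connection<Disconnected>
//   let c = c.connect();
//   c.send("hi");           // accepted
```

Because `connect` and `disconnect` take `self` by value, the old state cannot be used after a transition, so a stale handle is also a compile error rather than a runtime check.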

Q. Your book emphasizes clear data flow and system architecture. Why is unidirectional data flow so important in Rust systems?

Evan Williams: Thinking about the way data flows through your system in Rust is crucial because Rust makes it much more difficult to thread things back. As a general rule, because you can’t have many different clients holding on to different things in different places, especially not being able to write to different things in different places, having a clear direction of data flow makes it much clearer and much easier to create a consistent system that’s going to compile and work. By saying, I have a chain of ownership that moves down but never moves back up, you are now much more likely to have a system that is going to work in an environment where the number of references that can be held is very limited and you have to be very careful about memory safety. Data flowing down is something that feels natural and smooth and just works. Data trying to fight the stream back up is going to end up giving you problems because the borrow checker is not going to like you.
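
To make the idea of a chain of ownership that only moves down more concrete, here is a small sketch of a three-stage pipeline in which each stage takes its input by value and hands its output to the next stage. The stage names and the `key = value` input format are assumptions made for the example.

```rust
// Hypothetical three-stage pipeline. Each stage owns its input and hands
// ownership of its output to the next stage; nothing flows back up.

struct RawLine(String);

struct Parsed {
    key: String,
    value: i64,
}

struct Report {
    total: i64,
    keys: Vec<String>,
}

fn read(input: &str) -> Vec<RawLine> {
    input.lines().map(|l| RawLine(l.to_string())).collect()
}

// Takes the raw lines by value: after this call the caller no longer has them.
fn parse(lines: Vec<RawLine>) -> Vec<Parsed> {
    lines
        .into_iter()
        .filter_map(|RawLine(text)| {
            // Lines without '=' or with a non-numeric value are skipped.
            let (key, value) = text.split_once('=')?;
            let value = value.trim().parse::<i64>().ok()?;
            Some(Parsed { key: key.trim().to_string(), value })
        })
        .collect()
}

fn summarise(items: Vec<Parsed>) -> Report {
    let mut report = Report { total: 0, keys: Vec::new() };
    for item in items {
        report.total += item.value;
        report.keys.push(item.key); // the key moves into the report
    }
    report
}
```

A call such as `summarise(parse(read(text)))` reads top to bottom, and at each step exactly one place owns the data.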

Q. How does Rust’s ownership model influence architectural decisions at the system level?

Evan Williams: I think one of the crucial things about it is that because you have to be so thoughtful about how your data moves and who owns it, you have to also think about what that means for your system. You need to think about who controls what, how it is controlled, and you need to start from the very beginning thinking about the boundaries of your program and the system architecture, dividing things up into areas of responsibility. Because unlike Python or Java, you can’t have links going all over the place. The borrow checker is never going to accept that. Rust gives you lots of great tools for creating boundaries of abstraction that make it a lot simpler to write code that doesn’t have hidden or difficult-to-find connections between things. You know who owns things, you know who is able to write things. In order to do that, you have to be able to think about things ahead of time and think about what parts of your program own what, and create clear boundaries and areas of responsibility for each piece of the system that you build.

Q. Can you share an example of how Rust leads to better system design compared to other languages?

Evan Williams: One of the examples in the book, one of the projects that we build in the book, is a miniature publish-and-subscribe system similar to Kafka, but very, very much smaller. It is amazing how easy it becomes to make something that is solid and clean in that circumstance. Because Rust has move semantics, you know that if something leaves here and goes here, it’s now not here anymore. It’s there. There’s no question about things like having references dangling or anything like that. The clarity of things moving through the system, the clarity of being able to have immutable data in a lot of places and knowing who can and can’t modify any piece of data, it just makes the design of the system so clear and it makes it so much harder to make a system that doesn’t work. By doing things that way, you have something where you can have a potentially very complicated system and yet have complete confidence that every piece of it individually is going to work right and that they’re going to work together as a system.
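
The following is not the book's project, only a small sketch of the move-semantics point using the standard library's `mpsc` channel: once a message is sent, the publisher no longer has it. The `Message` type and `publish_all` function are hypothetical.

```rust
use std::sync::mpsc;
use std::thread;

// Hypothetical message type; not taken from the book's project.
#[derive(Debug, PartialEq)]
struct Message {
    topic: String,
    body: String,
}

// Publishes each message by moving it into the channel and returns
// whatever the subscriber thread collected.
fn publish_all(messages: Vec<Message>) -> Vec<Message> {
    let (tx, rx) = mpsc::channel::<Message>();

    // The subscriber thread takes ownership of the receiving end.
    let subscriber = thread::spawn(move || rx.into_iter().collect::<Vec<_>>());

    for msg in messages {
        tx.send(msg).expect("subscriber hung up");
        // `msg` has been moved into the channel; using it here would not compile.
    }
    drop(tx); // closing the sender ends the subscriber's iteration

    subscriber.join().expect("subscriber panicked")
}
```

No locking is needed around an individual message here, because at any moment it has exactly one owner: the publisher, the channel, or the subscriber.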

Q. What are the real-world challenges that engineering teams face when adopting Rust in production?

Evan Williams: One thing that often happens is that when a team adopts Rust, because of the challenges of learning to work in the language, velocity can go down at first. The team can find itself actually moving slower. Once the team becomes very well acquainted with Rust, velocity can increase dramatically, but there is a period of time where it seems like things have gotten worse. Another problem that often happens is that, as I said earlier, more experienced developers often have a harder time adapting to it. And the last thing I’ll mention, although it’s a lot better than it used to be, is Rust is still not 100% in terms of the kind of rich libraries that you’d find in, say, Java or C, which are just older languages. There’s more out there that supports those languages, although Rust is certainly catching up and it’s remarkable how far it’s come.

Q. When would you advise against using Rust for a project?

Evan Williams: There are a few things. If you’re doing some kind of prototyping, Rust is harder to prototype in and you really want something more like Python. If you’re working in certain niche environments where the tooling is not there, that’s a place where you don’t want to try to fight with the tooling to try to get Rust to work. It’s good to just work with the native things that are there. There are places where a dynamic language like Python is just a much easier thing to use and will more perfectly fit what you’re trying to do. And I also think there are a few areas where Rust has the potential to do a lot but is still catching up. User interfaces, for example. There are certainly user interface frameworks and libraries, but doing a website in Rust is still kind of a feat. And it’s an awful lot easier to use the tools that everybody else is using to accomplish that goal.

Q. What is the best way to introduce Rust to a team without overwhelming your developers?

Evan Williams: The thing that you need to do to start is find a piece of work that you can work on that has a limited scope and which is not in the critical path, because there are going to be roadblocks and bumps as people learn. What you don’t want to do is jump into saying, we’re just going to rewrite our project in Rust now. Pick a small piece, focus on that, gain confidence and mastery of the language, and then use that to build upon it and start bringing in more things.

Q. How should developers balance performance, safety, and complexity when designing systems in Rust?

Evan Williams: One of the nice things about it is that all of those things sort of come with the language. The thing that developers need to do is focus on letting the language help you. Rust will give you all of those things if you focus on thinking in the language and using patterns and features that are natural to the language, as opposed to trying to retool something from another language that doesn’t really fit. There are so many things that people try to do to get around problems that they have where if they just use the features of the language as they exist and did things in a way that’s more natural to the language, all of those things just fall out.

Q. How do you see Rust evolving over the next couple of years in terms of adoption and ecosystem?

Evan Williams: There are a couple of different ways you can answer that. The ecosystem is going to get richer and people are going to be branching out in the set of use cases, hitting areas that right now Rust has relatively weak support for, but where I think people are building all the time. As larger and larger projects are built, there is going to be more refinement of the language itself, but more importantly, more refinement of the use of the language. And I think this is important. This book that I wrote is not just about the language itself, but the way to use the language. And I think the Rust community and the people who are building in Rust are going to be defining and creating new ways to use it that are unique and leverage its power to do things that right now we haven’t even thought of.

Q. Did any of your own assumptions about Rust change while you were writing the book?

Evan Williams: It’s interesting because one of the things that changed is I came to recognize, through working on the early chapters about what not to do, that in Rust it’s actually more work to do things wrong. And I think that’s one of the things that surprised me. I knew that you had to think in a new way and do things differently, but when I went back and tried to write bad code in Rust, it was much harder than writing the good code. That’s an interesting perspective that just didn’t even occur to me.

Q. For developers picking up your book, what key takeaway do you hope they walk away with?

Evan Williams: The key takeaway is that the thing you want is not to learn a particular set of patterns. My book is full of what the title says, how to deal with design patterns in Rust. But the much more important thing is changing your mindset. If I can help people to recognize that there is a new mindset that they need, that’s the key thing. And I see so many people who become frustrated with Rust because Rust has such an unusual learning curve. In most languages, it’s sort of a steady progress. Maybe you plateau a little bit, but you’re always going up. With Rust, very often it seems like you’re learning and learning and learning and getting better and better. And when you reach a certain level of complexity in your programs, it feels like things are getting worse. And that’s because of the mindset. Helping people understand that will save them so much pain. That’s what I want people to take away from the book.

Q. Is there something most developers underestimate about Rust?

Evan Williams: I think the thing that people perhaps underestimate about Rust is it’s not just about memory safety and all these other things. It’s a really powerful modern programming language. It brings so much to the table that has nothing to do with memory safety or thread safety or high performance or any of those other things. It’s just a very clean, beautiful language to write in, because it brings so many modern innovations, where other languages are sort of stuck dragging along historic pieces of syntax. Rust is a pleasure to write in. It’s just sometimes the borrow checker can be a little annoying.

Evan Williams is the author of Design Patterns and Best Practices in Rust, published by Packt. This interview was conducted by Saqib Jan, Editor-in-Chief of Deep Engineering.

Computer Architecture in an AI-accelerated World with Jim Ledin

Saqib Jan — Wed, 06 May 2026 18:15:00 GMT

Jim Ledin has been thinking about what happens between the instruction and the silicon for over thirty years. He is the CEO of Ledin Engineering, an expert in embedded software and hardware design, and the author of Modern Computer Architecture and Organization, now in its third edition, published by Packt. His career spans embedded systems development, battery management software for electric vehicles, and cybersecurity assessment and penetration testing for safety-critical systems including self-driving vehicles.

The third edition comes out at a moment when the architecture conversation in software engineering has narrowed almost entirely to one question: what hardware should run AI workloads. Ledin’s answer is more nuanced than the GPU consensus suggests, and it is grounded in the kind of bottom-up reasoning that most application developers have never had to apply.

This conversation covers where that consensus is incomplete, what engineers building AI systems are getting wrong about memory and parallelism, why abstraction layers become dangerous when they hide hardware costs, and what the architecture of a self-driving vehicle teaches you that distributed backend experience does not.

You can watch the full conversation below or read on for the complete Q&A.

Q. You have been working with embedded systems and hardware design for over thirty years. What first pulled you toward understanding what was happening at the hardware level rather than just writing code?

Jim Ledin: My first real exposure to computer architecture was in the 1980s when I had a Commodore 64 with its 6502 CPU. I wrote a simple BASIC program to do some screen drawing, basically moving a dot around the screen with the joystick and pushing the button to draw the lines. And it was slow. It was so slow you could watch it moving one pixel at a time. That was painful to try to do anything with.

As time went on I learned a little bit about 6502 assembly language. I found out there were ways you could implement that through the BASIC interpreter. What you had to do was write out your assembly code by hand, convert it to the opcodes and data bytes, and then poke those bytes into memory. POKE is the BASIC command. Then you could transfer control and execute them. After I took the inner loops of the drawing program and implemented them in that way, the speedup was amazing. It shot the line across the screen faster than you could see. That episode really cemented for me how important it is to understand what is going on in the hardware of a system, and not just write what you want to do in your favourite language.

My current work is focused on embedded systems development and testing, as well as implementing cybersecurity for those systems and doing cyber testing on them. I have done quite a bit of work with electric vehicles, battery-powered systems, the battery management software, as well as the powertrain control systems. Also implementing cyber testing to evaluate what kind of vulnerabilities may be present in systems and trying to exploit those to demonstrate whether or not they actually exist. I have been doing that for over thirty years, embedded initially and then later adding in the cybersecurity aspect.

The architecture of computer systems is at the boundary where performance, system security, and behaviour in real-world situations all meet. You really need to understand, across all those domains, that everything works as expected and intended. You need an understanding top to bottom, not just what your high-level software does, but also at the hardware level. Not necessarily saying you need to understand what your compiler is doing, but know how the hardware operates, what kinds of things cause you to run into limits, and what you can do differently to improve performance, reliability, and security.

Q. Your book Modern Computer Architecture and Organization is now in its third edition. What had changed enough in the field to make a new edition necessary?

Jim Ledin: The book is intended to start at the beginning. I do not assume that readers have any background or experience with computer architecture, assembly language instructions, memory cache, pipelines, or anything like that. We start with history, where did the first computing devices come from, how were they developed. It even starts back in the 1800s when Charles Babbage designed a mechanical computer intended to be a general purpose digital computing system. It never actually got built, but many of the principles he developed, including pipelining and distributed processing, were implemented in that design. I thought it was remarkable that those concepts were being worked out that far back in history.

Then the book goes through the vacuum tube era in the 1930s and 1940s, the Intel 4004, which was really the first microprocessor, and then on to the 8086, 8088, the PC, the 386, which is basically the same base architecture that modern Intel and AMD processors in your PC and server-based systems use today. The code running on these modern systems is highly compatible with those systems from decades ago. It has gone from 16 bits to 32 bits to 64 bits, adding capabilities without removing previous ones.

The book walks through that history and then goes into detail on how processors work, starting with the 6502. That processor is simple enough that you can understand what is going on with its registers. It only has three. Nothing about it is overwhelming. Once you understand it, you can build upon it to get to the modern processors, which are far more complex.

What changed substantially since the last version of the book was the rise of AI workloads, particularly the shift from the fastest CPU available to very highly parallelised systems optimised to perform matrix computations. The new version, which came out in March, has a chapter that goes into detail on how GPUs operate, from the top-level modular structure down to the granular details of processor cores. There is another new chapter on transformer-based models, looking at them not as someone who designs them but more like a mechanic who wants to take them apart. We work through what calculations actually occur in GPT-2, which was one of the earliest large language models to break through as something genuinely new and important. The current frontier models have obviously evolved quite a bit since then, but they share many of the same fundamental characteristics. If you can go through GPT-2 and understand how it works, you are a very long way toward understanding the latest models.

We are also seeing real diversification of architecture. There were many years where computers for most applications were based on the same Intel-type architecture, but now across different application areas you are seeing GPUs, TPUs, domain-specific accelerators for things like Bitcoin mining, local AI in cell phones and cars, and the open source RISC-V architecture, which is available to everybody. You can design your own chip based on it, implement it in an FPGA, do whatever you want. It is a rapidly growing line of processor development and the book covers all of it.

Q. The argument that GPUs are the right architecture for AI and LLM workloads is often treated as settled. Where is that consensus incomplete?

Jim Ledin: GPUs are probably the ideal architecture for people and small companies that want to run language models locally. I have recently gotten the Gemma 27 billion parameter model running on an Nvidia RTX 4090, which is about the top end of consumer GPUs available today. For local and personal use, GPUs are the way to go.

But for larger scale deployments running much larger models, the trend is toward dedicated TPUs. A tensor is basically a multi-dimensional array. A matrix is a two-dimensional array, and tensors have more dimensions. Tensors are used widely across AI models, and the work going on inside the processing of those models is largely matrix multiplications operating on broken-down portions of higher-dimensional tensors. A TPU is a processor similar in concept to a GPU, but very specifically focused on the work of large language model tensor processing. GPUs, as the first letter implies, also have silicon dedicated for generating real-time video and handling things like gaming and video creation. TPUs do not use silicon for that purpose. They focus everything on the tensor work.

That is why systems like the Nvidia Blackwell architecture, designed for large-scale data centre applications, are built to have many components interconnected with extremely high-speed data links, working together as a supercomputer. For larger models, consumer GPUs are not really used. It is more the dedicated hardware that focuses on that work.

Another factor is that AI workloads are becoming increasingly memory bandwidth limited. That means it is taking more time to bring data into the GPU or TPU memory than it is taking for the computation itself to complete. These very high-end systems are implemented using what is called high bandwidth memory, or HBM. An HBM module is basically a cube made of a stack of RAM chips, so they hold a lot of memory and have very high bandwidth. On a TPU card you typically have several of these HBM modules, and they have a far higher data rate for transferring data in and out of the processing components than on a typical consumer GPU. This is also part of why it is becoming hard to find DDR5 RAM chips. A lot of production capacity for memory is going into high bandwidth memory modules, which cost more for the purchaser and make more money for the vendor.

Q. Software engineers working in the cloud often treat hardware as someone else’s problem. What does your book argue they are getting wrong, and what does that cost them?

Jim Ledin: If you write software and just ignore the hardware limits, that can lead to a lot of hidden costs. If your code is accessing memory in inefficient patterns, not using the cache memory within the processor effectively, and moving data around more than necessary, that can have significant performance impacts.

If developers understand how the memory access and caching processes work at the hardware level, they can often tailor code to work more effectively within those constraints and minimise latency. When the CPU requests data from memory and it is not available in its cache, it has to wait. You are giving the processor downtime when you want it to be processing data. A lot of that is unavoidable, but the amount of latency can be minimised by different approaches to optimising algorithms.

As an example, in a modern PC, each time you read something from DRAM, even if it is just a single byte, 64 bytes are transferred into the CPU cache. That is what is available at that point for the processor to work with. For best efficiency, assuming you have options, you would want your code to be working with data from that block before it moves on to something else, rather than bouncing around to other memory locations. If you access several other locations that cause them to be loaded into cache, and then that first block gets evicted, and then you go back and read it again, now you have to reread it. That is inefficient. When possible, you want to work through memory in a linear way.
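
As a rough sketch of the access-pattern point, the two functions below compute the same sum over a matrix stored contiguously in row-major order. The first walks memory in address order, while the second jumps a full row between accesses. The size `N` and all names are assumptions for the example, and the 64-byte figure is the one quoted above; actual line sizes and timings depend on the CPU, so any speed difference should be measured rather than assumed.

```rust
// Illustrative only: a square matrix stored in one contiguous, row-major Vec.
const N: usize = 1024;

fn make_matrix() -> Vec<u32> {
    (0..N * N).map(|i| (i % 7) as u32).collect()
}

// Walks memory in address order: with 4-byte elements and a 64-byte line,
// sixteen consecutive elements share one cache line.
fn sum_row_major(m: &[u32]) -> u64 {
    let mut total = 0u64;
    for row in 0..N {
        for col in 0..N {
            total += m[row * N + col] as u64;
        }
    }
    total
}

// Same arithmetic, but each step jumps N * 4 = 4096 bytes, so consecutive
// accesses land on different cache lines.
fn sum_col_major(m: &[u32]) -> u64 {
    let mut total = 0u64;
    for col in 0..N {
        for row in 0..N {
            total += m[row * N + col] as u64;
        }
    }
    total
}
```

Both functions return the same value; only the order in which memory is touched differs.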

And if you are working in a cloud environment, this not only has those performance issues but also results in higher costs, because you are paying for the usage of the system whether the CPU is actually crunching instructions or sitting idle waiting for a data item to come in from memory.

Q. If you are building AI systems today, what are the hardware concepts that would most change how you designed them, and what do most engineers not understand well enough?

Jim Ledin: Data movement can often be more expensive than the actual computation steps. The latency of moving large data structures across different levels of the memory hierarchy can dominate and leave a lot of compute bandwidth idle. This is a concern even with the very highest performance AI-focused systems. Getting the memory access right relative to the processing is a genuine challenge. You definitely do not want to be iterating across large data structures multiple times in an algorithm if there is a way to avoid it. Going through data linearly is probably going to give the best performance.
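
One way to picture the advice about not iterating over large data structures multiple times is to fuse several reductions into a single linear pass. This is only a sketch with hypothetical function names; whether it helps in a given program depends on data size and hardware, and is worth benchmarking.

```rust
// Three separate passes: a large slice is streamed from memory three times.
fn stats_three_passes(data: &[f32]) -> (f32, f32, f32) {
    let sum: f32 = data.iter().sum();
    let min = data.iter().copied().fold(f32::INFINITY, f32::min);
    let max = data.iter().copied().fold(f32::NEG_INFINITY, f32::max);
    (sum, min, max)
}

// One fused pass: each element is loaded once and used for all three results.
fn stats_one_pass(data: &[f32]) -> (f32, f32, f32) {
    let mut sum = 0.0f32;
    let mut min = f32::INFINITY;
    let mut max = f32::NEG_INFINITY;
    for &x in data {
        sum += x;
        min = min.min(x);
        max = max.max(x);
    }
    (sum, min, max)
}
```

For data that fits comfortably in cache the two versions may perform similarly; the difference is expected to show up once the slice is much larger than the cache.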

As you increase parallelisation of algorithms across cores and processors and across GPUs and other devices, other constraints appear. Synchronisation, where tasks on different processors need to sync up, is a real constraint. The communication bandwidth between processors, whether they are inside the same device or communicating board to board or rack to rack, all of these affect the efficiency and speed of processing, not just the number of cores you can throw at a parallel algorithm. It is important to understand the cost associated with all of these interactions among parallel activities and optimise around them to get the best overall performance.

And then optimising compilers do a great job of scheduling instruction execution and keeping pipelines full, but there are things you can do in code that make it harder for them to do that, and things you can do that make it easier. In performance-critical inner loops, minimising branching can help avoid pipeline stalls. Part of what goes on in a modern processor is trying to predict what will happen at a branch in your code, an if-else type block. The processor may guess right, which means it is very efficient, or it may guess wrong and have to back up and start down the other path. If you can minimise or eliminate branching within the most performance-critical loops, that makes it easier for the optimiser and the rest of the system to run as efficiently as possible.
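
A small sketch of what minimising branching in an inner loop can look like, with hypothetical function names. The second version replaces the data-dependent `if` with arithmetic on a 0-or-1 value. An optimising compiler may already turn the first version into branch-free machine code, so this is an illustration of the idea rather than a guaranteed speedup.

```rust
// Branchy: the `if` depends on the data, so the branch predictor can guess wrong.
fn sum_above_branchy(data: &[i32], threshold: i32) -> i64 {
    let mut total = 0i64;
    for &x in data {
        if x > threshold {
            total += x as i64;
        }
    }
    total
}

// Branch-free formulation: the comparison becomes a 0/1 multiplier.
fn sum_above_branchless(data: &[i32], threshold: i32) -> i64 {
    let mut total = 0i64;
    for &x in data {
        let keep = (x > threshold) as i64; // 1 if true, 0 if false
        total += keep * (x as i64);
    }
    total
}
```

With unpredictable input the second form gives the processor nothing to mispredict; with sorted or mostly uniform input the branch is easy to predict and the difference may disappear.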

Q. What is actually happening under the hood in a GPU that makes it effective for AI workloads, at a level that goes beyond the standard explanation about parallelism?

Jim Ledin: Most of the processing in a transformer-based AI model, at least 80% of the execution time, is these tensor operations, which are implemented in hardware as matrix multiplications. GPUs and TPUs have very specialised multiply-and-accumulate hardware specifically designed to perform these operations.

The current generation of Nvidia GPUs implements what is called single instruction multiple thread, or SIMT, execution. A group of 32 threads runs in lockstep, meaning they are all executing the same instruction but on different data streams. SIMT also supports branching, so you can have if-else logic in the code. But this has a performance cost. If you are executing through a stream of data on SIMT code and you come to a conditional instruction where some threads take the if part and some threads take the else part, the hardware executes one side, the if part, only on the threads where that condition applies, then goes back and executes the else part for the other threads. At the end of the block, they sync up and resume in lockstep. Your code can have conditional logic in these lowest-level operational sequences, but there is the drawback that you effectively have a pipeline stall where it has to go back and execute a different thread. You have the flexibility, but there is a cost.
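
The cost of divergence described above can be pictured with a toy model. This is ordinary CPU code, not GPU code: it simulates one group of 32 lanes reaching an if-else, runs each side only for the lanes that take it, and counts the instruction passes the group pays for. The function name, the branch bodies, and the cost parameters are all hypothetical.

```rust
// Toy model of one 32-thread group executing an if/else in lockstep.
const WARP_SIZE: usize = 32;

/// Returns (per-lane results, total passes). `if_cost` and `else_cost` are
/// made-up instruction counts for each side of the branch.
fn run_warp(
    inputs: &[i32; WARP_SIZE],
    if_cost: u32,
    else_cost: u32,
) -> ([i32; WARP_SIZE], u32) {
    let mut out = [0i32; WARP_SIZE];

    // One bit per lane: set when that lane takes the `if` side.
    let mut mask: u32 = 0;
    for (lane, &x) in inputs.iter().enumerate() {
        if x >= 0 {
            mask |= 1 << lane;
        }
    }

    let mut passes = 0;
    if mask != 0 {
        // `if` side: executed once, with only the matching lanes active.
        for lane in 0..WARP_SIZE {
            if (mask & (1 << lane)) != 0 {
                out[lane] = inputs[lane] * 2;
            }
        }
        passes += if_cost;
    }
    if mask != u32::MAX {
        // `else` side: executed once for the remaining lanes.
        for lane in 0..WARP_SIZE {
            if (mask & (1 << lane)) == 0 {
                out[lane] = -inputs[lane];
            }
        }
        passes += else_cost;
    }
    (out, passes)
}
```

When every lane agrees, the group pays for one side only; as soon as a single lane disagrees, it pays for both sides, which is the cost of the flexibility described above.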

GPU and TPU performance comes as much from high memory bandwidth, getting data in and out as fast as possible, and latency minimisation, as it does from effective thread scheduling across the many thousands of cores within a GPU. All of these things, memory bandwidth, minimising latency, thread scheduling, and using SIMT effectively, all affect GPU performance in addition to the raw ability to parallelise across cores. You really need to manage all of these aspects to get the best performance, not just maximise core count.

Q. The memory hierarchy from cache to RAM to storage is often discussed in theory but rarely in practice. Can you give a concrete example of where a misunderstanding of memory hierarchy caused a real performance problem, and what the fix actually looked like?

Jim Ledin: There was a web server in some Linux distributions in the early 2000s called Tux, which ran in kernel space. It avoided a lot of the transfers from user space to kernel space that a web server normally has to perform. It only served static pages; because it ran in the kernel, its designers did not give it dynamic page generation capability.

One issue with this server was poor cache locality. The amount of data it kept active on each request seemed to be excessive. Under high load, with lots of users hitting it at once, the state information grew to exceed the size of the level 2 cache in the CPU. Performance dropped off sharply.

Engineers compared the cache size against how the code was structured and determined that they could reorganise it so that the amount of data per request was smaller and therefore stayed within the cache up to a much higher level of usage. Instructions also have their own cache in the CPU, and by reorganising the processing and batching some things, they increased the degree to which instructions stayed in the instruction cache during web server processing. The fixes they implemented increased application performance by about 40%.

This was basically examining the behaviour of the application in the context of the limitations of the processor hardware and coming up with solutions that respected those limits. For other applications, similar fixes might involve restructuring data. A large array of structures might be more efficiently processed as a structure of arrays in a way that better aligns with cache limitations. But in these cases, while the design approach is to look at the limits of the system and try to work within them, to really understand what is going to have a big impact you need to implement it and benchmark it in a realistic environment.
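As an illustration of the array-of-structures versus structure-of-arrays idea, here is a minimal sketch. The `RequestAoS` and `RequestsSoA` types and their fields are hypothetical and are not taken from Tux or any real server; the point is only the memory layout.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Array of structures: hot and cold fields are interleaved, so a scan over
// bytes_sent also pulls each request's url bytes through the cache.
struct RequestAoS {
    int bytes_sent;
    std::array<char, 120> url;  // cold data, not needed by the scan below
};

// Structure of arrays: each field is stored contiguously, so a scan over
// bytes_sent touches only the cache lines that hold bytes_sent.
struct RequestsSoA {
    std::vector<int> bytes_sent;
    std::vector<std::array<char, 120>> url;
};

long long total_bytes(const std::vector<RequestAoS>& requests) {
    long long total = 0;
    for (const RequestAoS& r : requests) total += r.bytes_sent;
    return total;
}

long long total_bytes(const RequestsSoA& requests) {
    // Both columns must describe the same number of requests.
    assert(requests.bytes_sent.size() == requests.url.size());
    long long total = 0;
    for (int b : requests.bytes_sent) total += b;
    return total;
}
```

Both overloads compute the same sum. Whether the second layout is actually faster depends on the data sizes and the hardware, so, as the answer says, it has to be benchmarked in a realistic environment.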

Q. There is a growing tension between the abstraction layers that frameworks provide and the hardware cost those abstractions hide. At what point does that become a serious engineering problem?

Jim Ledin: Early on in the development cycle, abstractions are great. They can greatly accelerate development and limit mistakes. Where it becomes dangerous is when abstraction obscures what is happening with the data layout in memory and the execution patterns, basically how the processor is interacting with data as the algorithm proceeds. This is especially critical in large-scale real-time systems with demanding performance requirements.

In addition to using abstraction where it makes sense, engineers need to understand what is happening underneath the abstraction in performance-critical applications. I am not suggesting abandoning abstractions. They are entirely appropriate at the level where they preserve meaning and understanding across a team. But they begin to create a problem where they obscure the costs.

The most effective approach is a two-layer design. Use the most expressive code at the edges of the system, and in the core, use more performance-aware code. It is not always obvious where to place the boundary between performance-aware code and more expressive code. It may take some benchmarking, trials, and iterations to identify the best location for that boundary. But knowing you need to draw it is the starting point.

Q. You work on architectures for systems like self-driving vehicles. What makes those architectures fundamentally different from a standard distributed backend system, and what should engineers working in conventional contexts take from that?

Jim Ledin: A self-driving vehicle is both real-time and safety-critical. The software must meet all of its deadlines, its time limits for producing a response. A missed deadline is not just a glitch to blow past; it is a system failure, and that cannot be tolerated. There must be fail-safe responses when unexpected situations occur. Only in the most extreme circumstances, like an unrecoverable hardware failure, would the system be allowed to stop processing.

A self-driving vehicle tightly couples sensing, computing, and actuating, seeing what is around it, deciding what to do, and steering and controlling vehicle speed. That is pretty different from loosely coupled distributed systems. A distributed system might typically implement retry mechanisms if something fails, and if a system goes down there are online and offline redundant capabilities that can be brought up, basically switching to a backup. Rather than using that approach, safety-critical vehicles provide a level of redundancy where dual processors operate in lockstep. If one experiences a failure, the system continues on the one good one until a repair can be made.

This can be extended further. The American space shuttles had three computers operating in parallel. One advantage of three over two is that if you have two computers and they give different answers, you have to decide which one is good and which is bad. If you have three and two give one answer and one gives another, you probably know the odd one out is bad.
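The three-versus-two argument can be sketched as a two-out-of-three majority voter. This is a simplified illustration with made-up names (`Vote`, `vote3`); it compares plain integers and is not how any particular flight or vehicle computer is implemented.

```cpp
#include <cassert>

struct Vote {
    bool agreed;  // true when at least two channels match
    int value;    // the majority value, meaningful only when agreed is true
    int suspect;  // index (0..2) of the disagreeing channel, or -1 if none
};

// Compares the outputs of three redundant channels and picks the majority.
Vote vote3(int a, int b, int c) {
    if (a == b && b == c) return {true, a, -1};
    if (a == b) return {true, a, 2};
    if (a == c) return {true, a, 1};
    if (b == c) return {true, b, 0};
    return {false, 0, -1};  // no majority: the caller must take a fail-safe action
}

// Usage: channel 2 disagrees, so the system keeps running on the other two.
inline void example() {
    const Vote v = vote3(7, 7, 9);
    assert(v.agreed && v.value == 7 && v.suspect == 2);
}
```

With only two channels, a mismatch can be detected but not attributed, which is the limitation described above.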

The way engineers working with conventional distributed systems can apply these principles is in situations where the design needs to be fault tolerant while minimising or eliminating processing interruptions. Rather than waiting for a failure to occur, you have enough running capability in operation simultaneously that you can detect a failure and keep things going the whole time while bringing up redundant capability. A lot of large systems already operate this way, but systems that do not could potentially deliver a higher availability level using these techniques.

Q. For engineers who want to build real working knowledge of systems and hardware, what is the most direct path in?

Jim Ledin: Start by understanding how processors operate at the simplest level of a single instruction. There are four steps in an instruction: fetch, decode, execute, and write back. Fetch is when the processor retrieves the opcode bytes from memory. Decode is when the processor assigns work to units within it, like an ALU for an addition instruction. Execute is when it actually does the computation. And write back is where it stores the results in registers, in memory, and in status bits within the processor. Essentially all processors operate at that very low level.
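The four steps can be sketched as a tiny interpreter. The machine below is hypothetical: the opcodes, the `Cpu` struct, and the `step` and `run` functions are invented for illustration and are not 6502 encodings or taken from the book.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Invented opcodes for a toy 8-bit accumulator machine.
enum Opcode : std::uint8_t { LOAD_IMM = 0x01, ADD_IMM = 0x02, HALT = 0xFF };

struct Cpu {
    std::uint8_t acc = 0;  // accumulator register
    std::size_t pc = 0;    // program counter
    bool zero = false;     // status bit: last result was zero
    bool halted = false;
};

// Executes exactly one instruction from mem.
void step(Cpu& cpu, const std::vector<std::uint8_t>& mem) {
    assert(cpu.pc < mem.size());
    // Fetch: read the opcode byte at the program counter.
    const std::uint8_t opcode = mem[cpu.pc++];

    // Decode: work out what the opcode needs, here an immediate operand.
    std::uint8_t operand = 0;
    if (opcode == LOAD_IMM || opcode == ADD_IMM) {
        assert(cpu.pc < mem.size());
        operand = mem[cpu.pc++];
    }

    // Execute: compute the result (the work an ALU would do).
    std::uint8_t result = cpu.acc;
    switch (opcode) {
        case LOAD_IMM: result = operand; break;
        case ADD_IMM:  result = static_cast<std::uint8_t>(cpu.acc + operand); break;
        default:       cpu.halted = true; return;  // HALT or unknown opcode
    }

    // Write back: store the result in the register and update the status bit.
    cpu.acc = result;
    cpu.zero = (result == 0);
}

// Runs a program until it halts and returns the accumulator.
std::uint8_t run(const std::vector<std::uint8_t>& program) {
    Cpu cpu;
    while (!cpu.halted && cpu.pc < program.size()) step(cpu, program);
    return cpu.acc;
}
```

Running the byte sequence `{0x01, 40, 0x02, 2, 0xFF}` loads 40, adds 2, and halts with 42 in the accumulator.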

That mental model then scales upward to more complex processors and their capabilities. That is the reason the book starts with the 6502 processor. It is pretty simple, only three registers, 8-bit, nothing about it is overwhelming. But once you understand it, you can build upon that knowledge to get to the modern processors, which have hundreds, if not more, instructions available and many divergent capabilities. It all builds upon those very simple foundations.

Q. Looking ahead five years, what skills will matter most for engineers working at the intersection of software, hardware, and AI?

Jim Ledin: The most important thing is to stay up to date and remain aware of changes as technology advances. Four years ago when the previous version of the book came out, it was not at all clear to me, or I think a lot of people, what was going to happen with AI in the coming years. Pay attention to what is going on around you. Pay less attention to announcements driven by financial considerations or hype from companies focused on their performance in the stock market. Pay more attention to what is actually having an impact in the real world, and learn more about those things.

The sources matter. There are trustworthy websites with genuinely good information about current ongoing activities in CPU development and other computer-related areas, as well as more in-depth sources like scientific papers if you are willing to dig in at that level. Formal education is another option, and that does not necessarily mean going back to college; it could mean taking online courses to develop depth in areas where you might be behind. Certificate programs can be a real path to updating your skills.

Today, the thing is AI. Developers do not just learn programming languages anymore. You need to be learning how to interact with AI and use it effectively to develop better software. Really understanding these systems requires the ability to reason across all of the abstraction layers, from the software framework at the top level all the way down to the hardware that runs the code. You do not need to break out the assembly code generated by your build tools, though that is sometimes valuable, either for learning purposes or if you are in a hot inner loop that needs maximum optimisation. More often it is about understanding the constraints, how the processor works best with pipelines and caches, and orienting your code to work within those environments.

It is also becoming increasingly critical to understand heterogeneous computing environments. It is not just writing code that runs on a CPU. You might have code that interacts with a GPU for a parallelised algorithm, whether it is a language model or something else. And there are specialised accelerators that may be implemented within large-scale systems that speed up specific parts of the operation. There is a lot to learn, and it takes curiosity and sustained attention to stay current.

Q. How would you explain the CPU versus GPU distinction to a senior software engineer who has never had to care about it before?

Jim Ledin: A CPU is optimised for low-latency execution of complex branching code. Branches do have an impact on performance, but CPUs are designed to handle that and minimise it. GPUs work best with highly parallelised, high-throughput execution of linear code, operating on massively parallel workloads. GPU cores work best when they are going through parallel streams with minimal branching.

If you are developing an algorithm and you are not sure whether it should run on a CPU or be split between a CPU and a GPU, the GPU only really becomes attractive when you have enough parallelisable work to amortise the costs of moving data onto the GPU, launching the kernels that execute the code, and managing the transfer of data to and from the GPU.

The GPU is really not a general purpose computer. It is more of a specialised device that needs to be managed by something else. You cannot write a program that just runs on a GPU. It needs to be started and managed from a CPU, and you need to get enough benefit from the work you are doing to make all of that worthwhile. If you cannot keep the GPU busy with this kind of work, the CPU implementation may actually win, because it avoids the data transfer and scheduling overhead entirely.
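The amortisation argument can be made concrete with a deliberately crude cost model. Everything here is an assumption for illustration: the `CostModel` fields, the linear scaling, and any numbers plugged in are hypothetical, and real transfer and launch costs have to be measured on the target hardware.

```cpp
#include <cassert>

// All times are in microseconds and are placeholders, not measurements.
struct CostModel {
    double cpu_per_item;       // CPU time to process one item
    double gpu_per_item;       // GPU compute time per item
    double transfer_per_item;  // time to move one item to and from the GPU
    double launch_overhead;    // fixed cost of kernel launch and management
};

double cpu_time(const CostModel& m, double items) {
    assert(items >= 0.0);
    return items * m.cpu_per_item;
}

double gpu_time(const CostModel& m, double items) {
    assert(items >= 0.0);
    return m.launch_overhead + items * (m.gpu_per_item + m.transfer_per_item);
}

// Item count above which the GPU path is faster under this model,
// or -1 if the per-item transfer and compute cost never beats the CPU.
double break_even_items(const CostModel& m) {
    const double saved_per_item =
        m.cpu_per_item - (m.gpu_per_item + m.transfer_per_item);
    if (saved_per_item <= 0.0) return -1.0;
    return m.launch_overhead / saved_per_item;
}
```

When transfer cost per item exceeds what the GPU saves per item, `break_even_items` returns -1, which corresponds to the case where the CPU implementation wins regardless of workload size.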

Clean C++ Code, and the Hidden Cost of Complexity with Sándor Dargó

Saqib Jan — Wed, 22 Apr 2026 11:30:00 GMT

Sándor Dargó has spent years making large C++ systems easier to maintain, safer to change, and cheaper to run. At Spotify he works on codebases where performance, binary size, and clarity have to coexist, and where the cost of getting any of those trade-offs wrong shows up in production. He writes daily about C++ on his blog, speaks at conferences, and sat down with Deep Engineering Live to talk about C++26, what it takes to write code that survives real-world conditions, and what the shift to AI-assisted development is doing to the way engineers work.

Watch the full conversation below.

A note on format: the transcript below has been lightly edited for clarity and readability. During the live session on Deep Engineering Live, questions were displayed on screen for Sándor to read, and we also brought audience members on stage to ask their questions directly.

Q. There is a thread running through your talks on clean code, binary size, undefined behavior, and now C++26. What problem are you actually trying to solve?

I try to reduce complexity in real-world systems. After all, I think that’s our main job as software engineers: to turn the complex into the simple. If you think about clean code, it clearly reduces the cognitive load. If you think about binary size, reducing it might lower operational cost, depending on your situation. It might even lead to more users, though I’m not sure I should delve into that one. Avoiding undefined behavior clearly reduces hidden risk. New standards like C++23 and C++26 reduce boilerplate and enable safer and more readable abstractions.

I think all of these topics connect. They make large C++ systems more maintainable and more evolvable. And most of my talks start from problems I actually encountered. I try to solve my own problems, but they are not unique. I just try to share what I learn on the go.

Q. From the vantage point of a staff engineer responsible for a large codebase, which two or three features in C++26 do you expect to most change everyday design decisions?

Everyone is talking about contracts and reflection. That’s going to change everything. I’m not sure about the time scale though. If you look at C++23 support right now, even that is not complete yet, especially if you look at the differences across compilers. You go on cppreference, check what’s implemented on which compiler, and we are simply not there yet.

Given that time scale, I’m not sure about the answer. But contracts and reflection are the big ones. I don’t think I’ll be able to use those in a production environment in the next one or two years. I hope I’ll be wrong.

Q. If you were reviewing an architectural proposal that leaned heavily on these features, what are the first red flag questions you would ask?

It depends on the environment. If we are in a widely-used production environment and these are very new features, I’d probably ask if those approaches are actually proven to work, and how maintainable they are. For the time being, we simply lack the experience with these new features. We are still trying to discover how to use them properly.

Being among the first adopters is sometimes good. Sometimes it’s better to be in the second line. It really depends on the environment. But I would look for already-proven, maintainable usage. This is pretty much what happens with almost all new major standard versions. There are people coming and saying, can we use modules? And you often end up saying, certainly not yet. There’s no cross-platform support. I can imagine that will be the case for reflection and contracts in the first few years.

Q. What does a responsible adoption plan look like for a big feature like contracts or reflection?

It really depends on your environment. If you target one platform and you know which compiler you’re using, and the feature was shipped as ready, then you can go ahead and try. But if you have to support different platforms or compile with different compilers in the same pipeline, you first have to check if all of them are supporting that feature properly.

In some of my earlier environments, I simply couldn’t start using even C++20 for a long time because not every feature we needed was supported on all the different compilers. In other teams, we said, okay, we use this compiler, it’s shipped, let’s go for it.

What you have to make sure is that even if for some reason you have to fall back to the previous compiler version, you don’t have to change your code. It would be quite a pity to move to a new version, start using concepts from C++20, and then in two weeks they say there’s a problem, we must go back. And then you realize it’s not just updating the compiler version, you actually have to change the code. So check that you have a safe fallback plan.
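One possible way to keep a fallback open, offered here as a sketch rather than as the approach used in any of the teams mentioned, is to guard a new language feature behind its feature-test macro so the same source still compiles on the older toolchain. `__cpp_concepts` is the standard feature-test macro for concepts; `Addable` and `add_values` are hypothetical names.

```cpp
#include <cassert>

#if defined(__cpp_concepts) && __cpp_concepts >= 201907L
// Newer toolchain: constrain the template with a C++20 concept.
template <typename T>
concept Addable = requires(T a, T b) { a + b; };

template <Addable T>
T add_values(T a, T b) { return a + b; }
#else
// Older toolchain: same name and signature, just unconstrained.
template <typename T>
T add_values(T a, T b) { return a + b; }
#endif

// Usage: callers are identical on both toolchains.
inline void example() {
    assert(add_values(2, 3) == 5);
}
```

The trade-off is that both branches have to be kept working and tested, so this only makes sense for a transition period.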

Q. Your talk “Clean Code, Horrible Performance” is a deliberately provocative title. What is the actual answer?

The title was a question, a provocative one. Someone very active in the community told me I shouldn’t have said it because it was misleading. Maybe it was. The whole point was to frame it as a question. My answer is no. Clean code does not imply horrible performance.

We must admit that in some constrained environments and on hot paths, you must optimize for performance and forget about readability. Otherwise your software just won’t meet its nonfunctional requirements. The most well-known example is probably the fast inverse square root in Quake III. But in my experience, even in environments with very high throughput, readability and maintainability gained through clean code were always more important than optimized performance. Wherever I’ve worked, network latency and database read and write times dominated.

At the same time, I’ve seen people optimizing for heap allocations, saying we shouldn’t allocate for a string there, while they were making network requests in a loop. That just doesn’t make much sense. Amdahl’s Law says the overall performance improvement gained by optimizing a single part of a system is limited by the fraction of time that part is actually used. Slowness is also relative. If your code takes a long time to execute due to network latency, then relatively speaking, the heap allocation is not so slow anymore. I’m not saying you should put everything on the heap. I’m saying don’t worry about things that don’t really matter in your environment.
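Amdahl’s Law is short enough to write down directly: if a fraction p of the run time is made s times faster, the overall speedup is 1 / ((1 - p) + p / s). The function below is a minimal sketch; the name `amdahl_speedup` and the percentages in the note that follows are illustrative assumptions, not measurements from any system discussed here.

```cpp
#include <cassert>

// Overall speedup when a fraction p of the run time (0..1)
// is made s times faster and the rest is left unchanged.
double amdahl_speedup(double p, double s) {
    assert(p >= 0.0 && p <= 1.0);
    assert(s > 0.0);
    return 1.0 / ((1.0 - p) + p / s);
}
```

For example, if heap allocation were 2% of a request and became ten times faster, `amdahl_speedup(0.02, 10.0)` gives roughly 1.02, whereas halving a network cost that makes up 90% of the request, `amdahl_speedup(0.9, 2.0)`, gives roughly 1.82.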

Q. If you had to write a one-page policy for a large C++ codebase covering the trade-offs between readability, performance, and binary size, what would be on that page?

If it’s a one-pager and you don’t know the exact environment or the nonfunctional requirements, it would probably be language-agnostic. The number one point would be: default to readability. You still read code more often than you write it. Not to mention agents, which also prefer simplicity.

Second: if you have to optimize, measure first. Don’t start optimizing for performance before you prove it is actually a problem. You don’t optimize just because you can. You optimize if you need to. Otherwise, you might just waste time, or worse, you think you’re optimizing the necessary while you don’t touch the real problem. Measure first, optimize after.

Third: optimize only the hot path. You’ll find that with the measurements. Keep the hot path isolated and well documented. That will help you later.

And last but definitely not least: make the trade-offs very explicit. In code reviews, but also in the code itself. Leave comments. If you sacrifice clarity, document why. Otherwise someone later will come in and think, this doesn’t make any sense, let me make it cleaner. They are unaware of why certain choices were made. I’ve been there. I came in thinking something didn’t make sense, and by the time I realized it slightly changed the binary size in a way that mattered, some pull requests were already merged. Trade-offs there will be. But make them conscious and share the knowledge.

And this is probably even more important in the new world of agentic coding. Agents will not know the context that teams share with each other. You have to have things written down, and preferably in the codebase, because that’s what they can read.

Q. What are the silent killers of binary size that creep into C++ systems over months or years?

That’s a really broad topic. I wrote a series of articles on this and had a workshop at CppCon on the effects of programming styles on binary size. I made it very explicit that those articles and the workshop were not for embedded engineers, because they operate on a different scale and care about different orders of magnitude.

There are environments where every single byte matters. I never worked in such an environment. But there’s the other end of the spectrum where you might have to think about, bear with me, hundreds of megabytes. I know many of you might laugh, but I’m not kidding. When I first heard about binary size as a problem, it was due to a common library that many services shared. We hosted maybe two dozen services on that server. A few changed almost every week, others barely changed in a year. We had to keep ten or twelve different versions of the same library, and that library was over 100 megabytes. Just by making sure all services got the new version every few weeks, and we didn’t have to store more than two or three versions of that big library, we solved the binary size issue. That was seven or eight years ago.

In terms of programming patterns, the most overlooked area is unoptimized compiler and linker settings. You might gain the most from there. We tried many different code-level changes and it was satisfying, shaving off a few kilobytes here and there. Then we changed some settings and half a megabyte was gone. That’s often completely overlooked.

Template overuse is another one. We had a framework where adding a new object with no logic whatsoever, just the boilerplate, already added around 20 kilobytes because of the heavy templating. We moved away from that. Unnecessary use of std::function can also be problematic. We maintain our own backport of std::move_only_function from C++23 specifically for that reason. And exceptions, while not always a silent killer, can be significant, though there’s interesting work being done to reduce their footprint considerably.

Q. How do you move code review conversations from taste to shared criteria?

Arguing over taste is never a good investment of time. Not just in engineering. You need to move from taste to agreements. And to do that, you have to communicate, discuss, and eventually decide. And it might not be very popular what I’m going to say, but your workplace is not a democracy. Certain people have more to say based on their experience and their responsibility.

But what’s important is that you introduce some shared decision framework. Track binary size in the CI pipeline so you see the effects at the end of every build. Track performance metrics. Once you have numbers, the conversation is not about taste anymore.

For the things you cannot quantify, like coding style, coding dojos are genuinely useful. You practice together, explore different approaches without delivery pressure, and over time you move from phrases like “I like this more” to “that’s actually the style we agreed on.” Discuss, educate together, share what you learn, and measure what matters to you.

Q. What are the most common mistakes when working with time and clocks in C++?

The most common mistake is choosing the wrong clock. Maybe you don’t fully understand the different guarantees each clock offers. For example, instead of using steady_clock to measure a short interval for retry logic, you use system_clock. And then later, due to some bug, you figure out that system_clock is not a monotonic clock. It can jump backwards due to NTP adjustments or manual clock changes.
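A minimal sketch of retry logic timed with `std::chrono::steady_clock` follows. The helper name `retry_for` and its busy-wait loop are assumptions made for brevity; real code would sleep or back off between attempts.

```cpp
#include <cassert>
#include <chrono>

// steady_clock is monotonic, so the deadline cannot be reached early or
// pushed back by NTP adjustments or manual changes to the wall clock.
using Clock = std::chrono::steady_clock;
static_assert(Clock::is_steady, "interval timing needs a monotonic clock");

// Calls op until it returns true or the time budget runs out.
// op is always attempted at least once.
template <typename Op>
bool retry_for(Clock::duration budget, Op op) {
    assert(budget >= Clock::duration::zero());
    const Clock::time_point deadline = Clock::now() + budget;
    do {
        if (op()) return true;
        // A real caller would sleep or back off here instead of spinning.
    } while (Clock::now() < deadline);
    return false;
}
```

Swapping `Clock` for `std::chrono::system_clock` would still compile, which is exactly why the mistake is easy to make: the deadline would then move whenever the wall clock is adjusted.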

Another problem is unsafe conversions and cross-system time. Time is relatively easy when it’s in one system. But when you have different systems and different platforms, you can end up with clocks using different epochs, different precisions, or different time sources. When you try to compare or convert times from different systems, be very cautious. Test with all the different platforms. Debug and see what’s going on.

If you still use C-style APIs, things go wrong easily because they don’t give you the type safety that chrono durations give you. You might have to use C-style APIs, but try to isolate those parts and do the conversions at the boundaries. Within the code you control, rely on modern C++ time representations. Use chrono wherever you can.
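Converting at the boundary might look like the sketch below, assuming a legacy interface that hands over `std::time_t` values. The function names `from_c_time` and `elapsed_ms` are hypothetical; `std::chrono::system_clock::from_time_t` is the standard conversion.

```cpp
#include <cassert>
#include <chrono>
#include <ctime>

using SysTime = std::chrono::system_clock::time_point;

// Boundary: convert the C-style value once, on the way in.
SysTime from_c_time(std::time_t t) {
    return std::chrono::system_clock::from_time_t(t);
}

// Inside the boundary, units live in the type, so seconds and
// milliseconds cannot be mixed up silently.
std::chrono::milliseconds elapsed_ms(SysTime start, SysTime end) {
    assert(end >= start);
    return std::chrono::duration_cast<std::chrono::milliseconds>(end - start);
}
```

A three-second gap between two `time_t` values comes back as `std::chrono::milliseconds(3000)`, and adding `std::chrono::seconds(2)` to `std::chrono::milliseconds(500)` yields 2500 ms without any manual scaling.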

For APIs specifically: keep your APIs abstract enough so that they are testable. Don’t rely on the system clock directly. Inject a time provider so you can test different assumptions about your code.
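Injecting a time provider can be as small as passing in a callable that returns the current time. The `Session` class below is a hypothetical example, and taking the provider as a template parameter is just one option; an interface or a function object would work as well.

```cpp
#include <cassert>
#include <chrono>
#include <utility>

using TimePoint = std::chrono::steady_clock::time_point;

// NowFn is any callable returning a TimePoint. Production code passes one
// that reads the real clock; tests pass one that returns a controlled value.
template <typename NowFn>
class Session {
public:
    Session(NowFn now, std::chrono::seconds ttl)
        : now_(std::move(now)), ttl_(ttl), created_(now_()) {
        assert(ttl >= std::chrono::seconds::zero());
    }

    // True once at least ttl has passed since construction.
    bool expired() const { return now_() - created_ >= ttl_; }

private:
    NowFn now_;                 // declared first: created_ depends on it
    std::chrono::seconds ttl_;
    TimePoint created_;
};
```

In a test, the provider can be a lambda that returns a local `TimePoint` variable, so the test advances time by assignment instead of sleeping. In production, the provider would be something like `[] { return std::chrono::steady_clock::now(); }`.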

Q. You run a daily C++ quiz and have been blogging for years. What gaps have you noticed consistently, even in experienced C++ developers?

There are two main ones. The first is what I’d call a depth gap. C++ is a massive language. The standard is about 2,000 pages. Even if you are an expert in one area, it doesn’t mean you master all the others. You might be a master of template metaprogramming but know nothing about multithreaded programming. Best practices can be quite different across industries and across different kinds of C++ environments.

We should be humble enough to acknowledge our boundaries and say “I don’t know.” In the beginning of your career, it’s natural to do that. And with decades of experience, you’re confident enough to say it again. But in between, it’s more difficult. The sooner you can make that shift, the better. I once said in an interview that I didn’t know anything about a particular topic and didn’t want to guess. They said, well, we don’t really use that either, let’s skip it. I got hired in the end.

The second gap is fundamentals. I’ve seen many senior-level engineers who are really good at architectural questions, articulate and thoughtful. But they had difficulty writing some very simple algorithms under pressure. Not hard LeetCode problems. Simple ones. I’m not a fan of LeetCode-style interviews, but you do have to be able to solve problems live with someone watching. That’s something you won’t learn on the job. You have to practice on your own.

Q. What has the shift to AI-assisted development changed for senior and staff engineers?

There are two parallel shifts. One is the language itself evolving. With C++23 and C++26 we need a bit less template metaprogramming wizardry now that we have concepts. And safety has become a central topic in a way it wasn’t before. You can see that in the kinds of proposals the committee is now accepting.

The other shift is about how we work. As a developer, you’re expected to be professional in agentic coding. To be an AI-first developer, some would say. It’s as if you’ve become a team lead of agents. You keep giving tasks to them, reviewing the code, tuning your instructions.

Q. Viktor Nikolov joined us from the audience to ask a follow-up question on this. He wanted to hear more about the AI shift and what it means for engineers day to day.

I think as a developer in this new world, you have to learn to like your job again. Or still.

Before, you’d get your tasks at the beginning of a sprint or a week, and then you’d go back and start to explore the requirements, explore the code. It took some time. You slowly built up the models in your head and thought about the different kinds of solutions. You might even enter the so-called flow state, which requires focus and a bit longer time. And I think we’ve kind of lost this over the last few months.

We became, often, just prompters. Many of us complained even before that we are living in a world of constant context switching. But it just became even worse. Most probably, you will try to prompt different agents with different problems at the same time, and you keep jumping from one window to another. Maybe from one meeting to another, because others are also moving faster. At least they think they move faster.

Basically, you’ve lost everything, or almost everything, that you liked about your job. But we have to adapt somehow to this new situation. And mentally, it’s very difficult.

I read something very interesting recently on The Pragmatic Engineer, which is a great Substack if you haven’t come across it. They quoted research saying that in the beginning you ship more code, because it became so much easier. But you don’t just ship more code. You ship worse code. And that gain in speed vanishes after a few months because you start accumulating technical debt at the same time. What first seemed faster stops being faster, but the debt stays.

I also try different ways of working with agents that keep me happy but also try to speed me up, approaches that don’t remove what I like in this job but actually help. It’s difficult. And I’m happy to continue this conversation with anyone who wants to reach out.

Q. What would you tell engineers starting to build with C++ today, whether in a new codebase or an existing one?

The most common thing I see is defaulting to shared pointer when unique pointer is the right choice. People complain that smart pointers are slow, but they are defaulting to shared pointer instead of unique pointer, which is fast and cheap. Often you don’t really have to draw a line, you just have to know what to pick and not default to the easy option.
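The default suggested above can be sketched as follows: start with `std::unique_ptr` and convert to `std::shared_ptr` only at the point where shared ownership is actually needed. `Widget`, `make_widget`, and `share` are hypothetical names for this illustration.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>

struct Widget {
    std::string name;
};

// Sole ownership: no reference count to maintain, and with the default
// deleter a unique_ptr is typically no larger than a raw pointer.
std::unique_ptr<Widget> make_widget(std::string name) {
    return std::make_unique<Widget>(Widget{std::move(name)});
}

// Opt in to shared ownership only where several owners must keep the
// object alive. The conversion is one-way and takes over the unique_ptr.
std::shared_ptr<Widget> share(std::unique_ptr<Widget> widget) {
    assert(widget != nullptr);
    return std::shared_ptr<Widget>(std::move(widget));
}
```

Returning `std::unique_ptr` from factories keeps the choice open, because a caller can always move the result into a `std::shared_ptr` later, while the reverse conversion is not available.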

More broadly: performance is not for the sake of performance. You don’t write faster code because you can. You write faster code because you need to. If you don’t need it, default to readability and default to safety. And if you work in an environment where network latency or database latency dominates, you will not care so much about the cost of a heap allocation. Optimize for your actual environment, not your assumptions about it.

And document why you made the choices you made. Not what the code does, but why it is structured the way it is. That’s what makes a codebase survivable over time. Especially now that agents are reading it too.

Q. C++ versus Rust. Some engineers in the audience asked about this. Are C++ jobs being taken over by Rust?

I’m not sure how many jobs are actually being taken over. I had C++ colleagues who fell in love with Rust and moved to other companies just to use it. I don’t necessarily see that as a huge threat. C++ is not going away anytime soon, simply because it’s an old and evolving language and we have plenty of systems out there that you just won’t replace. Even if C++ is not strictly needed for a domain, the cost and risk of replacing it is too high. There will be COBOL jobs for decades for the same reason.

What I do think is that the overall pie of engineering jobs is growing. Rust is taking a bigger slice, but the pie itself is bigger. And moving between languages is becoming easier because agents can help you understand an unfamiliar codebase quickly. That lowers the switching cost over time.

C++ is already evolving. We are talking about C++32. The language is not standing still.

Sándor Dargó is a senior software engineer at Spotify, the author of a daily C++ quiz and blog at sandordargo.com, and a regular speaker at C++ conferences. This conversation was recorded live on Deep Engineering.

Knowledge Graphs, GraphRAG, and Real-Time AI in Production with David Knickerbocker

Saqib Jan — Wed, 15 Apr 2026 12:30:00 GMT

This conversation with David Knickerbocker keeps returning to a single conviction: the best engineering starts with intentional problem definition, and most AI failures happen when teams rush to use a tool before understanding what they are actually trying to build.

Knickerbocker has spent his career across cybersecurity, data operations at Intel, McAfee’s AI research team, and healthcare IT, before founding Bert Intelligence and Grooveseeker. He is the author of Network Science with Python, published by Packt, which argued years before GraphRAG became mainstream that graphs and natural language processing belong together as a single discipline. He has been writing code since he was six years old and spent twenty-eight years living in Okinawa, Japan before returning to the United States.

The conversation covers what it actually takes to build a knowledge graph system with data fresh up to a minute old, why his Verdant Eye system treats knowledge as claims rather than facts, how graph anchoring reduces hallucination space in ways that similarity-based retrieval cannot, and why deliberately forgetting old data is not a failure mode but a design principle. He also walks through his purpose-built testing philosophy, his three production GraphRAG systems, and what working with open source intelligence in adversarial environments teaches you about AI that clean-dataset engineers never have to confront.

You can watch the full conversation below or read on for the complete Q&A transcript.

1. Most AI systems treat knowledge as a static snapshot. You have built your Verdant Eye system around the idea that knowledge should update continuously. What does it actually take to engineer a knowledge graph that stays fresh, and where does the real difficulty lie?

David Knickerbocker: For me it is not so much about what breaks. It is about how do I actually do this, and how do I engineer it. Everything in data science and engineering really starts with problem definition. You start with what you are trying to do. If you want to build a world AI and be able to answer questions about things that happened a minute ago, then that is your problem statement. And so then you think about how to get that data into the database so that it is there and it is fresh. But then you also have to get AI to be able to use that data, so there are kind of two sides to this coin.

It really comes back to intentional engineering. The AI industry feels very shiny and very new, but there is a lot of old school discipline that is still extremely useful to me. I am a very intentional designer, developer, and engineer. You start with the idea, you go through the ideation, from ideation you create your spec, from the spec you do your project management, you assign tasks and do the work. It feels like vanilla old school engineering to me.

The approaches I use are KISS, keep it simple, and YAGNI, you are not gonna need it. When you are a minimalist engineer and you think in MVPs, you are building the minimally viable product you are aiming for. When you build the minimal thing it is much easier to test and validate that it works than if you throw a whole bunch of spaghetti at the wall and see what happens. Nothing really breaks on my side because I am an old school engineer and I am intentional with everything.

2. Freshness and accuracy often pull in opposite directions. Something that just arrived may not yet be trustworthy, while something stale may still be reliable. How do you design a system that balances recency and trust, and what signals do you use to make that call?

David Knickerbocker: In the world of open source intelligence, it has less to do with right and wrong. It has less to do with facts. What I am looking for with open source intelligence is really claims of what is going on in the world. You can have two different groups that are in opposition from each other. One group will say this is the truth, another group will say this is the truth, and they will be in direct conflict with each other. I do not make that decision, and I do not allow my AI to make the decision about what is true or false either. I am more interested in what people are claiming is going on in the world.

Because if you take what is claimed and you cluster it, you can see that this thing is happening over here and this bad thing has happened over there. I think in terms of ribbons. I come from natural language processing, so I think about clusters not as baskets or clumps of stuff but more like ribbons. You have a whole bunch of information and this top ribbon might be this bad thing happened. The next ribbon might be this event is happening at the library. The next ribbon might be a punk rock show is happening at this nightclub.

The trueness and the falseness is a much later thing than the awareness of what is being said. That is how I think about it.

3. What does real awareness mean in practice at the data ingestion layer? And how is your system different from just running an agent with a search tool?

David Knickerbocker: If I use my GraphRAG and I say what has happened in Portland in the last hour, or the last five minutes, or the last minute, it will be able to answer that question. And if nothing has been reported in the last minute then there is just nothing to report. An empty dataset is better than a hallucination.

My systems are constantly getting data. When I was building my GraphRAG system, one of the questions I use for calibration is just what is the latest information, because I just want to see that the latest information is coming through. That prompt is very reliable. The answer that comes back is anywhere from a few seconds old to maybe a minute and a half old. The Internet moves at the speed the Internet moves.
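The calibration query described above can be sketched in a few lines of Python. This is only an illustration: the `Claim` record, its fields, and the `recent_claims` helper are assumptions made for the example, not the schema or API of the systems discussed. The point it shows is that a time-window query with no matches returns an empty list rather than a guess.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class Claim:
    # Hypothetical record shape; the interview does not describe the real schema.
    text: str
    place: str
    seen_at: datetime


def recent_claims(claims, place, window, now):
    """Return claims about `place` seen within `window` of `now`, newest first."""
    cutoff = now - window
    hits = [c for c in claims if c.place == place and cutoff <= c.seen_at <= now]
    return sorted(hits, key=lambda c: c.seen_at, reverse=True)


now = datetime(2026, 3, 10, 12, 0, 0, tzinfo=timezone.utc)
store = [
    Claim("Road closure reported downtown", "Portland", now - timedelta(seconds=40)),
    Claim("Jazz night announced", "Portland", now - timedelta(minutes=30)),
    Claim("Power outage reported", "Detroit", now - timedelta(seconds=10)),
]

last_minute = recent_claims(store, "Portland", timedelta(minutes=1), now)
# Nothing was seen in the last five seconds, so the honest answer is an empty list.
last_five_seconds = recent_claims(store, "Portland", timedelta(seconds=5), now)
```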

I liken it to the difference between a snapshot and a movie. If you use a tool to do a search and find out something, you are getting a snapshot of time. My systems capture the heartbeat of the Internet themselves and they are always listening. It is much more like a movie compared to a photograph. When you are talking to companies that need urgent information and you can run a query and it comes back thirty seconds old with something that was just seen on the Internet, that looks really different from spinning up agents and using tools to hit a search engine. A search engine will give you a few answers. My systems are always listening and always capturing. I can rewind the Internet itself and play it back forward again.

4. Where do you see most engineering teams underestimate the cost involved in building graph systems? And what is the failure mode you keep seeing repeated?

David Knickerbocker: I remember research I did back in 2012 and there was a famous finding that most tech problems are actually people problems. They are not tech problems. That comes down to communication, interpersonal skills, things like that. But getting to the technical side of things, one thing that used to drive me nuts was the rush to use graph databases before they were even understood.

This bothered me so much in 2020 and 2021 that I actually wrote a book called Network Science with Python. I wrote it because I was annoyed watching teams spend months building graph databases and then not really getting further than populating the graph database. Things are supposed to start when you populate the graph database. That is not the end.

At that time I was using graphs at Intel for data flow mapping, source code analysis of legacy code, mapping how legacy code would create outputs across thousands of scripts and hundreds of servers. I got well known for this at Intel and McAfee. But I was never invited to the cool kid graph database parties. I was always just doing stuff with graphs and using it to map out data flows and using them to fix production outages. Dead serious stuff. And it was really frustrating watching teams get stuck because the graph skill was not there.

I think the failure is probably a common one with what is going on today too. There is this rush to use agents before even understanding AI. And if the understanding is not there, then it is just wishful thinking. You are saying please work, please work, please work. And if you do not know how it works, you cannot tell whether it ran correctly or just ran. There is a huge difference between it ran and it ran correctly.

5. You have argued for years that graph and NLP belong together as a single discipline. GraphRAG is now proving that in mainstream AI. What did teams building with NLP alone consistently get wrong that a graph layer would have fixed?

David Knickerbocker: Language and graphs go together. Similarity in language is not equal to same. I will say that one more time. Similarity is not equal to same. Similar sounding things can be very, very different from each other. A graph kind of anchors things into a piece of context.

This was really clear to me even when I worked in data operations, because there is a lot of language that goes on in servers. It is not just look at the file, look at the blah blah blah. There is a lot that goes into those log files. If you have a hundred servers then multiple people created the different log files. There is quite a lot of natural language in log files and source code and all kinds of production things. Even working in data operations at Intel, not even as a data scientist, I was seeing language everywhere and already mapping out how production systems were working.

Graphs show you where things go. But all of the context about what that node even is is often carried by language itself. It was just crystal clear to me a long, long time ago that graphs and language go together. When I was writing this book I even felt afraid that people were going to hate it. You know, it is three years later and it is 4.9 out of five or whatever. But it was so unusual when I was writing it because nobody was really talking about how graphs and language go together the way I was. At the time I was doing a series called a hundred days of NLP, natural language processing. Even back then, using Twitter data, I was realizing that you cannot do natural language processing and leave off graph. It is ridiculous to even do that. If you are working with social media data, you see person A talk to person B about this thing happening. What do you have if you throw away the language? You do not have anything. All you have got is a graph. All of the context is gone. It was crystal clear to me in 2017, and it frustrated me for several years.

6. Your first NLP and graph experiment was eight years ago. How has entity extraction and relationship linking changed since then, and what has stayed the same?

David Knickerbocker: The very first one I actually used was the book of Genesis from the Bible. I am not religious, but it is ancient text. It blew my mind that I could pull families out of ancient text and actually map it as a graph. I did this in 2018 and it is still on my GitHub. I can actually go back to my first code and see what I did.

I am sure it was part of speech tagging because that was before my book and I had no idea what I was doing. I just kind of made it work. Builders build. You just figure out how to do it the first time and then figure out how to do it better after that. There is my small little screen window, just adding color and trying to add size to nodes. Very manual. But then you scroll down to cell 25 and you get to page rank, where I am mapping out who the main entities are. That is where the notebook gets important. Network science is more important to me than visualization, because when you are doing network science you get to do things programmatically. If I want to know whether the punk rock scene in Portland is growing or shrinking, I do not want to visualize that. I want to do that programmatically, turn it into a graph, do time series analysis, and know if the graph is actually increasing or decreasing in density.
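As a rough sketch of the programmatic approach described above, the snippet below computes graph density per month from a list of observed connections and checks whether it is rising. The venue and band names are invented placeholders, and plain Python is used instead of a graph library to keep the example self-contained. Density here is the standard undirected measure, 2m / (n(n - 1)) for m edges and n nodes.

```python
from collections import defaultdict


def density(edges):
    """Density of an undirected simple graph: 2m / (n * (n - 1))."""
    unique_edges = {frozenset(edge) for edge in edges if edge[0] != edge[1]}
    nodes = {node for edge in unique_edges for node in edge}
    n = len(nodes)
    if n < 2:
        return 0.0
    return 2 * len(unique_edges) / (n * (n - 1))


def density_by_period(observations):
    """observations: iterable of (period, node_a, node_b). Returns {period: density}."""
    by_period = defaultdict(list)
    for period, a, b in observations:
        by_period[period].append((a, b))
    return {period: density(by_period[period]) for period in sorted(by_period)}


# Invented sample data: which band was seen at which venue in each month.
observations = [
    ("2026-01", "venue_a", "band_x"), ("2026-01", "venue_b", "band_y"),
    ("2026-02", "venue_a", "band_x"), ("2026-02", "venue_a", "band_y"),
    ("2026-02", "venue_b", "band_y"),
    ("2026-03", "venue_a", "band_x"), ("2026-03", "venue_a", "band_y"),
    ("2026-03", "venue_b", "band_y"), ("2026-03", "venue_b", "band_x"),
]

series = density_by_period(observations)
values = list(series.values())
is_growing = all(later > earlier for earlier, later in zip(values, values[1:]))
```

Density is only one possible signal. Node and edge counts per period would be tracked alongside it in practice, since density can fall while a scene gains members.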

What has changed is really how you create the graph and how you visualize it. Back then it was part of speech tagging with a ton of cleanup. That evolved to using spaCy models. And then LLMs have changed the game because it is painful to download twelve different spaCy models when you can just use an LLM these days. Entity extraction has improved a lot since 2015. I mostly have to throw away less. Less cleaning to do.

But there is a dangerous side to this. With older NLP, people were critical because there was something messy in there. When you are using LLMs, everything just looks perfect. And that is kind of a dangerous downside. People are a little too trusting of LLMs compared to how they treated older NLP. The cleanliness is real but it creates false confidence.

What has stayed the same is the network science and the mathematics. Page rank is still very important. Betweenness centrality is still very important. Community detection is still very important. My book is not going to go out of date because of that. The things that change are really how you create the graph and how you visualize it.
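For readers who have not seen it written out, here is a minimal power-iteration sketch of PageRank in plain Python. It is an illustration of the general algorithm, not the code from the notebook mentioned earlier, and the node names are placeholders. In practice a graph library would normally be used instead.

```python
def pagerank(graph, damping=0.85, iterations=50):
    """graph: dict mapping node -> list of nodes it links to. Returns node -> score."""
    nodes = set(graph) | {t for targets in graph.values() for t in targets}
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by nodes with no outgoing links is spread evenly over all nodes.
        dangling = sum(rank[node] for node in nodes if not graph.get(node))
        new_rank = {node: (1.0 - damping) / n + damping * dangling / n for node in nodes}
        for source, targets in graph.items():
            if targets:
                share = damping * rank[source] / len(targets)
                for target in targets:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Placeholder graph: three nodes point at "hub", and "hub" points back at one of them.
links = {
    "person_a": ["hub"],
    "person_b": ["hub"],
    "person_c": ["hub"],
    "hub": ["person_a"],
}
scores = pagerank(links)
main_entity = max(scores, key=scores.get)
```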

7. GraphRAG is often sold on the promise of reducing hallucinations. What does it actually take to get from fewer hallucinations to genuinely accountable output where you can trace a claim back to a source?

David Knickerbocker: My system is about claims. The node is attached to the claim that it makes, so there is no hallucination there. The hallucination space is smaller with nodes because you are starting with a node and you are traversing it. You are starting with your anchor space and going from there.

If you are wondering what jazz events happened in Portland, Oregon, you are connected to the Oregon node, connected to the Portland node, connected to the jazz node. There is very little chance for hallucination. But if you are just using a RAG system, it is just going to look for similarity. And in a GraphRAG system, if there is no match then the output is that there is no match. There is no hallucination opportunity. Whereas with a similarity-based system, there could be similarity even if it is only a single word in a paragraph. That is not a zero type thing. That is a really frustrating thing to me as an NLP person.
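The contrast between anchored lookup and similarity search can be made concrete with a small sketch. Everything here is an assumption for illustration: the node names, the event identifiers, and the use of word overlap as a crude stand-in for embedding similarity. It is not how any of the systems discussed are implemented.

```python
def anchored_lookup(edges, anchors):
    """Return items connected to every anchor node; an empty set when nothing matches."""
    results = None
    for anchor in anchors:
        neighbours = edges.get(anchor, set())
        results = set(neighbours) if results is None else results & neighbours
    return results or set()


def word_overlap(query, text):
    """Jaccard overlap of lowercase word sets; a crude stand-in for embedding similarity."""
    q, t = set(query.lower().split()), set(text.lower().split())
    return len(q & t) / len(q | t)


# Hypothetical graph: each anchor node maps to the event ids attached to it.
edges = {
    "Oregon": {"e1", "e2", "e3"},
    "Portland": {"e1", "e2"},
    "jazz": {"e1"},
    "punk": {"e2"},
}

jazz_hits = anchored_lookup(edges, ["Oregon", "Portland", "jazz"])
polka_hits = anchored_lookup(edges, ["Oregon", "Portland", "polka"])  # no such node

# A similarity score is rarely exactly zero, so a top-k retriever still returns something.
spurious = word_overlap("polka events in Portland", "punk rock show in Portland tonight")
```

The anchored query for polka comes back empty, while the overlap score for the unrelated punk listing is above zero, which is the "not a zero type thing" problem described above.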

I like to have the discipline of a graph. It is the same discipline I felt from data operations, because you cannot mess up when you work in data operations. When the database is down, you have to fix it. If you come up with some similarity-based bull for your manager, he is going to be mad at you. You fix the problem when the database is down. That discipline of a graph is what I feel GraphRAG gives AI, rooting its answers in physical spaces, and that really reduces the opportunity for hallucination. There is less for it to bulk up around.

8. Temporal drift is a real problem in knowledge graphs. Facts become outdated, relationships change, and the graph can silently become wrong. How do you detect and handle contradiction and drift at scale without requiring an engineer to review everything?

David Knickerbocker: My system does not judge, and my system is about awareness. I think about a living system. You are a living system. I am a living system. And you do not remember everything you have ever been told. I cannot remember what I had for breakfast. Our brains are naturally throwing away old information and naturally learning new information, making room for that new information. When I build systems, I like to think about how life does it, and then I try to build that kind of thing into it.

My system is called the Verdant Eye. Not the Verdant Brain. The Verdant Eye sees, and it does not contain eternal memory, because that is not what an eye does. An eye sees. When the scene changes, the scene changes. What is in front of your eyeballs changes all the time. Your eyes do not need to be recalibrated. The thing has just changed.

Operationally, if you give a system infinite memory, your database bills are going to skyrocket for the rest of your life. It is never going to be possible. Think about data operations, think about transactional databases. These living systems have been with us for a long time. Anybody who has worked in data operations knows how living systems work because they have worked on living systems, they just do not call them that. In a transactional database you operate off of what you need, and data that is not needed eventually gets archived. In a human body, memories eventually fade away. If I stop thinking about a thing, it will eventually go away.
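One simple way to express deliberate forgetting in code is a time-to-live window. The sketch below is an assumption about how such a policy could look, not a description of the Verdant Eye implementation. The class name and the 24-hour window are made up for the example.

```python
from datetime import datetime, timedelta, timezone


class ForgettingStore:
    """Keeps only items seen within `ttl`; older items are dropped on every write and read."""

    def __init__(self, ttl):
        self.ttl = ttl
        self._items = []  # list of (seen_at, payload) in arrival order

    def _prune(self, now):
        cutoff = now - self.ttl
        self._items = [(ts, payload) for ts, payload in self._items if ts >= cutoff]

    def add(self, payload, seen_at):
        self._items.append((seen_at, payload))
        self._prune(seen_at)

    def current(self, now):
        self._prune(now)
        return [payload for _, payload in self._items]


t0 = datetime(2026, 3, 10, 12, 0, tzinfo=timezone.utc)
store = ForgettingStore(ttl=timedelta(hours=24))
store.add("claim A", t0)
store.add("claim B", t0 + timedelta(hours=20))

# Twenty-five hours after t0, claim A has aged out and only claim B is still in view.
visible = store.current(t0 + timedelta(hours=25))
```

A production system would more likely archive expired items than discard them outright, in the same way the transactional databases mentioned above archive data that is no longer needed.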

When I am building artificial intelligence I am never tempted to build something with infinite memory forever, like the machine from the Hitchhiker’s Guide to the Galaxy. I do not want to build a super AI. I want to build AI that actually serves us human beings. I want to build AI that does not boil the ocean, that can be bootstrapped by individuals, that does not cost a trillion dollars.

9. You have built your own testing frameworks for GraphRAG rather than relying on standard benchmarks. What outcomes are you testing for, and how do you know when a system is actually working?

David Knickerbocker: Everything I do is intentional. There is a really cool intelligence report I read a couple of years ago that said even datasets need to be designed for the way they are going to be used. Down to the dataset, you should be able to visualize how somebody is actually going to use that data. There is no testing framework anybody else can give me that is going to be fit for purpose for what I am trying to build, because I am not trying to build general intelligence. I am trying to build intelligence that serves a specific purpose.

There is a scene in Rick and Morty that is one of my favourite scenes. Rick makes a little robot and this robot wakes up and asks what is my purpose. Rick says you pass butter. The robot asks again thirty seconds later. Rick says you pass butter. And the robot says oh my god. But that is the entire purpose of that robot. Its whole purpose in life is to pass butter.

I have three GraphRAG systems right now and each one is independent. The Verdant Intelligence system is for high level situational awareness, looking down on the world, what is going on in Michigan, what is going on in Oregon, what is going on in California. My second system is called Grooveseeker, and that is street level intelligence. Not what is going on in Oregon but where is the punk rock event happening tonight on what street in Oregon. That graph system has a very different set of rules than the Verdant Intelligence one. My third system has thirty years of artificial intelligence research. When I am building these systems and I want to understand what people did twenty years ago I can just talk to that graph and find out. Each one of these goes through its own testing.

For the Grooveseeker system, I set up a couple hundred questions and go through multiple rounds of the same question to make sure queries are coming up correct and reliable. If it is not hallucinating, it is doing good. If it is getting me to the right location, it is good. If it is getting me there at the right date and time, it is good. The final test of my world AI was I stopped proving it in articles and just used it to go to a punk rock show. I downloaded my data, asked what is going on in Portland from March 10 to March 13, figured out five events I wanted to go to, narrowed it down to one, bought the ticket online, went to the show, saw all the bands, and hung out with one of them. My AI did not take me to a nonexistent venue. It made a real memory for my family. That is how I know it works.
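The repeated-question testing described above could be sketched as a small harness like the one below. The `ask` callables, the question text, and the venue names are stand-ins invented for the example. A real harness would call the actual GraphRAG endpoint and compare location, date, and time separately.

```python
from collections import Counter
from itertools import cycle


def check_question(ask, question, expected, rounds=5):
    """Ask the same question `rounds` times; pass only if every answer matches `expected`."""
    answers = [ask(question) for _ in range(rounds)]
    counts = Counter(answers)
    correct = counts.get(expected, 0)
    return {
        "question": question,
        "correct": correct,
        "rounds": rounds,
        "passed": correct == rounds,
        "distinct_answers": len(counts),
    }


QUESTION = "Where is the punk show on March 12?"


def stable_system(question):
    # Stand-in for a reliable system: same question, same answer.
    return {QUESTION: "Example Hall"}.get(question, "no match")


_flaky_answers = cycle(["Example Hall", "Other Venue"])


def flaky_system(question):
    # Stand-in for an unreliable system that alternates between two answers.
    return next(_flaky_answers)


stable_report = check_question(stable_system, QUESTION, "Example Hall", rounds=4)
flaky_report = check_question(flaky_system, QUESTION, "Example Hall", rounds=4)
```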

10. You are working with open source intelligence, which means dealing with adversarial sources, deception, and deliberately misleading data at scale. What does designing for that environment teach you about AI that engineers working with clean datasets never have to confront?

David Knickerbocker: I really encourage AI people to learn a little bit about open source intelligence. If you are going to build artificial intelligence to understand the world, the open source intelligence community has been using natural language processing and graphs to understand the world for quite a while. There is a lot to learn from them.

The real world is a messy space. It is not just that websites can disagree with each other. Websites also have malware. If you point your servers at websites and just download everything on them, you need to be prepared for the consequences of downloading malware. There are all kinds of things when you are dealing with the Internet.

My systems do not care who is right or wrong. They are observers. My systems will see three sides to the same story. There will be the left side and there will be the right side, and then sometimes there is something really extreme. And it does not mean that any one of them is wrong. I wrote an article about open source intelligence recently and I mentioned that bigger clusters are not more important than smaller clusters. In open source intelligence, everything matters top to bottom. If you are using an agent to do an Internet search you are going to get back what the search engine gave you, maybe ten things. If I use my API and say what happened in Oregon in February 2026, I am going to come back with ten thousand things. My APIs do not return ten. They return full context. That is a difference in completeness. It gets back to the snapshot versus movie idea. I can rewind the Internet itself and play it back forward.

The judgment part needs to be downstream. My system is a fast layer to AI, and it does not do the judgment thing. But there are certain things that are just still good to be a human being about. If something from an extreme source sounded like something dangerous was heading in the direction of your community, that would be an actionable insight, and you would go to the mayor or the police or someone like that. I am not going to build that kind of automation into the system. There are certain parts of being a human being that I like keeping.

11. What advice would you give to engineers who are starting to build knowledge graph and GraphRAG systems today? And what should they not do?

David Knickerbocker: First of all, ask why you are doing anything before you do anything. Do not follow crowds just to follow crowds. You should have a good reason. I do believe that GraphRAG is something you should probably just start with because it is more reliable in my opinion than vanilla RAG. But that is my opinion.

I am not a crowd follower type. I am a bit of a rebellious type. But I think there is a lot of creativity in being like that. The AI space is a very creative space. If you follow the crowd you are going to do what everybody else is doing. If you sit outside and you look at plants and you think about nature, you can hit insights you will never get from following the crowd. You will not get them if you are just looking at LinkedIn, seeing what everybody else is doing, and reading the same books as everybody else. It is very important to actually be grounded in the world and to think about life itself. If you are going to do anything with intelligence you might as well think about real intelligence. These language models are nothing compared to what is in my backyard. The plants are not passing tokens. They are not complaining about maxing out their tokens. They are trying to collapse everything to the bare minimum.

My own philosophy of AI I call absolute zero. Collapse everything to zero. My world AI has zero storage, zero AWS cost, and my AI bills are extremely low, because I just collapse everything down to the bare minimum. And that is also the reason why I have real-time AI, because I was able to collapse everything down to the minimum.

I encourage people to read the old stuff too. Some of the best insights come from old papers. My graph partner and I were talking about how he got an insight from something thirty years old. A couple of years ago I created something off of Claude Shannon’s information theory, and it was a different implementation than anything else. You can do these kinds of things if you are an original thinker. If you only follow crowds you are not going to do anything except follow the crowd. If you try to create a product and you are no different from any of your competitors, then what are you doing? I just encourage independence. Get back to the science. AI has to be rooted in science and engineering. When it gets really loud, that is not always the best time to pay attention to the loudness.

Small Language Models and the Future of Production AI with Karun Thankachan

Divya Anne Selvaraj — Thu, 26 Mar 2026 07:55:56 GMT

This conversation with Karun Thankachan is a practical tour through small language models in production, starting from the limitations of general-purpose LLMs and repeatedly returning to a single constraint: cost-effective reasoning for specific tasks is a different engineering problem from general-purpose reasoning, and good engineers choose their tools accordingly.

Thankachan is a Senior Scientist at Walmart, where he works on language model systems for retail AI applications. He has a background in machine learning research from Carnegie Mellon University and spent time at Amazon before his current role, building production AI systems at scale.

In our conversation, we also talked about ReasonLite, an open-source library that brings chain-of-thought distillation, program-aided reasoning, self-consistency, and trace-budget control under one unified interface, making SLM training feel more like hyperparameter tuning than a collection of disconnected scripts.

He also covers SLM-Fusion, a multi-model orchestration framework that handles routing, merging, and serving across multiple specialized SLMs, including an OpenAI-compatible FastAPI gateway that abstracts the entire reasoning layer as a microservice. Finally, the conversation turns to where the industry is headed and why RAG and context engineering are winning over fine-tuning right now, and what to watch as diffusion models become more mainstream.

You can watch the full conversation below or read on for the complete Q&A transcript.

1. Your career spans both academia and industry, from advanced research at CMU to applying AI at Walmart. Can you share with us how this journey led you to eventually focus on SLMs?

Karun Thankachan: Yeah. So I started my career as a software development engineer back in India. There, I had the opportunity to work under a director who was starting the data science and machine learning team there.

I had the opportunity to work on a bit of data analytics, big data science, and eventually machine learning. That sort of sparked my curiosity in machine learning, leading me to do my master’s in machine learning from Carnegie Mellon.

And that’s where I got a bit more interested in the research side of things. I had the opportunity to work under a few professors there. With Professor John Stanford in particular, I was able to publish a pretty good paper at AAAI, and got into the weeds of deep learning. Eventually, that led me to land a role at Amazon, and I am now a senior scientist at Walmart.

It was, however, at Walmart, with the ChatGPT/LLM wave, that I got involved a little bit more in the field of language modeling and NLP.

Our current director had a vision for agents that could solve specific business problems, and we started trying to develop toward that mission. During that time, I realized that a lot of building agents is a little bit more software engineering than machine learning. It was a lot more about designing evals that could give you feedback on how your LLMs are performing, building guardrails that could make sure that your LLMs are behaving the way that you want them to behave, and optimizing for cost and latency. A lot more engineering focus than, let’s say, machine learning focus.

And I missed a little bit of that machine learning flair. And that’s when I started investigating on my own a little bit more how I could be involved in the machine learning side of things within the AI wave.

And I stumbled upon small language models, language models that you could actually fine-tune and optimize for the specific task that you want. It felt a bit more in the domain of machine learning. It felt more like you were understanding how a model was working and helping the model learn patterns in your data. That’s what got me interested in SLMs.

And it’s sort of my opinion that right now, we are in a race to see who can define this new customer experience that would be based on LLMs. And that’s why we are relying a lot more on foundational models, and we’re hitting them via APIs, plugging and playing them into our experience to figure out what kind of new, reliable experience we can provide to the consumer. And whoever provides that new, better experience to the consumer will take over a huge amount of the market.

But afterwards, once this new experience is well-defined, then we will go back to that age of cost optimization. And that’s where SLMs are going to come back into the picture, because they are able to reason more cost-effectively on specific tasks, as opposed to LLMs, which are more general-purpose reasoning. So that’s sort of why I still keep very invested in this domain of SLMs, and I’m hoping that converts in the next five or six years. Yeah.

2. Let’s talk about ReasonLite, which was introduced as a way to tackle the fragmented, unclear evaluation practices and high token costs that hinder reasoning in small models. What gaps did you see in existing SLM distillation toolchains that made you create ReasonLite?

Karun Thankachan: Sure. So maybe to take a step back and explain what ReasonLite is: LLMs are models that have a tremendous general-purpose reasoning capacity. And since they are models with billions of parameters, they’re also able to, in very layman’s terms, remember a lot more details to be able to reason out a solution for a question that you ask.

For instance, if you ask an LLM how do you bake a cake, an LLM might be able to remember all the steps that are required to bake a cake—preheat the oven, mix flour and sugar, add X, bake for X amount of minutes. So it’s able to remember a great amount of detail.

An SLM, in comparison, doesn’t have as many parameters, so it won’t be able to remember as much. It’ll be able to maybe remember things like flour, eggs, cake.

And how these language models work—they’re essentially what we call autoregressive models. So what it has generated thus far will influence what it will generate next. So if you can’t remember a lot of the prior steps to baking a cake, like preheat the oven, flour, egg, cake mix, stuff like that, you might not continue generating the correct answer. An SLM might say flour, eggs, brownie instead of flour, eggs, cake, because it didn’t have the correct prior before it.

But even though LLMs have billions of parameters and a lot of memory, they may be overkill when you want reasoning for a specific task that you have in mind. They are really great for general-purpose reasoning. But for specific reasoning on a particular task, an SLM might be able to get you there. The only thing that you have to do is update its parameters, which currently give you subpar general-purpose reasoning, so that it becomes good at specific-task reasoning.

So how do you teach these SLMs how to build out that reasoning? There are a lot of techniques in the market, like CoT distillation. To provide an example, here you ask the LLM, “Hey, I’m going to ask you a question. Let’s say a person went to a store where they’re selling apples for $3. He bought four apples. He gave the shopkeeper $20. How much does he get back in return? Tell me the answer, but show me the steps also.”

So the LLM will write out: the cost for an apple is $3, four into $3 is $12, he gave 20, so 20 minus 12, eight back. Your answer is 8. It will show you each of the steps and then give you the answer, the 8. So like we mentioned earlier, these are autoregressive models. So what it generated earlier will help it generate more accurate answers in the future. So if I’m able to get the SLM to output responses in this similar manner, or at least think in this similar manner, then it’s more likely to get at the final answer.

So what we do is we ask the LLM to generate its entire chain of thought, and we feed this chain of thought into SLMs and ask it to generate a similar chain of thought. We don’t do this for general purpose. Rather, we do it for our specific tasks. If you try to do it for general purpose, again, the parameters will get updated in every which way, and it won’t be able to generate good answers again. But if you do it for a specific task, the parameters will get updated to reflect that specific task. It will start to be able to solve that specific task.
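The data-preparation half of this idea can be sketched as follows. The function name and the prompt and target layout are assumptions for illustration, not ReasonLite's format, and the teacher steps are hand-written stand-ins for what an LLM would generate. The arithmetic in the trace is the apple example from above: 4 * 3 = 12, then 20 - 12 = 8.

```python
def build_distillation_example(question, teacher_steps, final_answer):
    """Format one supervised example: the student learns to emit the steps, then the answer."""
    steps = "\n".join(f"Step {i}: {step}" for i, step in enumerate(teacher_steps, start=1))
    return {
        "prompt": f"Question: {question}\nShow your steps, then give the answer.",
        "target": f"{steps}\nAnswer: {final_answer}",
    }


question = (
    "Apples cost $3 each. A person buys four apples and pays with $20. "
    "How much change do they get back?"
)
# Hand-written stand-in for a chain of thought produced by the teacher LLM.
teacher_steps = [
    "One apple costs $3.",
    "4 * 3 = 12, so four apples cost $12.",
    "20 - 12 = 8.",
]
example = build_distillation_example(question, teacher_steps, "8")
```

A collection of such examples, all drawn from the one target task, is what the student SLM would then be fine-tuned on.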

Similar to CoT distillation, there are other techniques, like contrastive rationale training, where you essentially tell it, “This is the answer that I want, like four into three is equal to $12. That’s what I want. Four into three is equal to $11. That’s not what I want.” So you push it toward what you want and push it away from what you don’t want. So there are a lot of these techniques out there for helping train SLMs to perform or provide reasoning on a specific task.
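One common way to express "push toward this, push away from that" is a margin loss over the model's log-probabilities for the wanted and unwanted rationale. The sketch below shows that general idea only. Whether ReasonLite or any specific technique named here uses this exact loss is an assumption, and the log-probability values are invented.

```python
def contrastive_margin_loss(logp_preferred, logp_rejected, margin=1.0):
    """Zero once the preferred rationale beats the rejected one by at least `margin`."""
    return max(0.0, margin - (logp_preferred - logp_rejected))


# Invented log-probabilities a student model might assign to each rationale.
# Preferred: "4 * 3 = 12". Rejected: "4 * 3 = 11".
already_separated = contrastive_margin_loss(logp_preferred=-1.0, logp_rejected=-4.0)
needs_training = contrastive_margin_loss(logp_preferred=-3.0, logp_rejected=-2.0)
```

When the loss is zero there is nothing left to correct for that pair. A positive loss produces a gradient that raises the preferred rationale's probability and lowers the rejected one's.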

But when I was building out SLMs to reason on specific tasks, I realized a lot of these techniques were written in notebooks. They were written in scripts. And what I wanted was something similar to what ML practitioners are familiar with today, which is a kind of hyperparameter tuning, where you have all these knobs and you can turn them on and off. You can adjust the parameters, and you can figure out what set or what combination of techniques helps the model learn the pattern the best so that it can generalize in the future.

So I wanted all these techniques under one roof: CoT distillation, self-consistency, program-aided distillation, contrastive rationale training, curriculum scheduling. I wanted them in one package, which I could control similarly to hyperparameter tuning. And that’s why I developed ReasonLite. Everything was split across separate files, and I wanted to bring it into one package. With this, hopefully, practitioners can call the package and tune it just like they would with hyperparameter tuning, which I feel solves a pain point in current SLM training.

3. ReasonLite integrates program-aided distillation, using external symbolic tools or code to verify intermediate reasoning steps. How does this approach work in a real-world training pipeline? Can you give an example of using a tool, say a calculator or knowledge base, during distillation, and how it improves a student model’s reasoning accuracy without overly complicating the production workflow?

Karun Thankachan: So, to explain what program-aided distillation is, take our previous example of a person going to a store and buying four apples, which are worth $3 each. They pay the shopkeeper $20. How much do they get back in change?

If you give that question to an LLM and ask it to give you an answer with a sort of chain of thought, then what happens is it does the calculation. It says that, hey, four into three is $12, and 20 minus 12 is $8.

So here, the LLM doesn’t actually have a calculator doing the calculation. What it’s doing is it’s looking at four into three, and it’s saying that 12 is probably the most likely answer. But at times, an LLM could generate a chain of thought that says that four into three is $11, just because it’s not actually doing the calculation. It’s just predicting what’s the most probable answer.

Same thing with 20 minus eight. It might not give you $12. It might give you 20 minus eight as $9, because it’s just predicting what’s the most likely number. It’s not doing the actual calculation.

So this is a little bit harmful when you are trying to do chain-of-thought distillation at scale. Just to refresh everyone’s memory, what is chain-of-thought distillation? You ask the LLM to show the steps that got it to the final answer, like four into three is 12, 20 minus 12 is 8. Those steps—show it. So that’s the chain of thought.

Along with the answer, that’s the entire chain of thought. That chain of thought, you feed it to the SLM. And then the SLM tries to generate that same chain of thought using its much smaller parameters. But it’s only being trained on chain of thought for a specific task, so it will be able to capture that limited amount of chain of thought.

Now the problem is, if these chains of thought that we are generating from the LLM itself are incorrect—like four into three is 11, or 20 minus eight is 9—if it’s from the LLM itself and it’s incorrect, then the SLM obviously can’t be expected to learn the correct answer. And you can’t sit and verify all your chains of thought, especially when you’re training at scale.

So how do you make sure that the intermediate, especially math-oriented, steps are correct? You ask the LLM to write them as code, in something like Python. The code would read: c is equal to four into three, and answer is equal to 20 minus c. So your answer is 20 minus four into three, which is eight.

So instead of the LLM actually doing the calculation, it just writes the code with the inputs that you provided. The code is taken and run in something like Python or a calculator. The answers are then attached back to the chain of thought, and then you feed it into the SLM.

So this way, with this kind of external program that’s embedded into your distillation—i.e., program-aided distillation—you can make your chain of thought a little bit more accurate, and you can get your SLM to learn only on the correct answers instead of any incorrect answers.
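A minimal sketch of that loop, assuming the teacher returns its reasoning as simple Python assignments: the program is parsed, each arithmetic step is computed by the interpreter rather than predicted, and the computed values are attached back to the trace. The `run_program` helper and the restricted evaluator are illustrative assumptions, not ReasonLite’s implementation.

```python
import ast
import operator

_OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
        ast.Mult: operator.mul, ast.Div: operator.truediv}


def _eval(node, env):
    """Evaluate a tiny arithmetic subset: numbers, names, and + - * /."""
    if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
        return node.value
    if isinstance(node, ast.Name):
        return env[node.id]
    if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
        return _OPS[type(node.op)](_eval(node.left, env), _eval(node.right, env))
    raise ValueError("unsupported expression")


def run_program(program):
    """Run teacher-written assignments; return the verified trace and the variables."""
    env, trace = {}, []
    for stmt in ast.parse(program).body:
        if not (isinstance(stmt, ast.Assign) and len(stmt.targets) == 1
                and isinstance(stmt.targets[0], ast.Name)):
            raise ValueError("only simple assignments are allowed")
        name = stmt.targets[0].id
        env[name] = _eval(stmt.value, env)
        # Attach the computed value to the step instead of trusting the LLM's arithmetic.
        trace.append(f"{name} = {ast.unparse(stmt.value)} = {env[name]}")
    return trace, env


teacher_program = "c = 4 * 3\nanswer = 20 - c"   # hypothetical teacher output
trace, env = run_program(teacher_program)
print(trace)
print(env["answer"])
```

Restricting evaluation to assignments and basic arithmetic avoids executing arbitrary model-generated code; a fuller pipeline would need a proper sandbox.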

4. One feature of ReasonLite is a trace-budget controller to constrain the token usage of chain-of-thought traces during training and inference. In a production deployment, why is controlling the length of reasoning traces important for cost and latency?

Karun Thankachan: When you’re actually serving answers to users, you run into actual engineering concerns. One is obviously latency. When a user types in a question, you want to give them an answer in a fairly short amount of time so that the user doesn’t drop off on the site. And you maybe don’t want to provide a very verbose answer unless the user is explicitly asking for it. If they’re asking for something simple, you want to give them an accurate answer, a comprehensive answer, but it doesn’t need to necessarily be verbose.

So during that time, if your model is trained to think in this chain-of-thought manner, where it’s trying to explain breakdown steps and then get the answer—which is generally good practice—but if it’s trained on fairly long chains of thought, that might kick up your latency and increase your cost, because each token costs a little something to produce, even from the SLM.

So you might want to have some kind of guardrails around it, so that the latency doesn’t increase to an unbearable amount, and the cost doesn’t become very cumbersome or essentially very expensive. The compute doesn’t run up, essentially.

So it’s to prevent that that you have these trace budget controllers. And how it sort of works is, you can enforce it in different manners. You can enforce it by saying that, hey, for any particular inference call, you shouldn’t take more than X amount of tokens. If you’re starting to hit your X amount of tokens, cut your chain of thought short with whatever you have and provide an answer. It might not be an accurate answer, but it helps you make sure that your token cost won’t grow beyond a particular point, and your latency also won’t exceed a particular value.
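The hard-cap variant described here can be sketched as a wrapper around a token stream. The `ANSWER:` marker, the `generate_with_budget` name, and the `last_token` fallback are assumptions for illustration; a real serving stack would hook into its own decoding loop.

```python
def last_token(trace):
    """Hypothetical fallback: answer with whatever the trace ended on."""
    return trace[-1] if trace else ""


def generate_with_budget(token_stream, max_trace_tokens, fallback_answer):
    """Consume reasoning tokens until an answer appears or the budget runs out."""
    trace = []
    for token in token_stream:
        if token.startswith("ANSWER:"):
            return {"trace": trace, "answer": token[len("ANSWER:"):], "forced": False}
        if len(trace) >= max_trace_tokens:
            break  # budget exhausted: stop reasoning and answer with what we have
        trace.append(token)
    return {"trace": trace, "answer": fallback_answer(trace), "forced": True}


# Simulated model output for the apples example.
stream = ["4", "*", "3", "=", "12", "20", "-", "12", "=", "8", "ANSWER:8"]

full = generate_with_budget(iter(stream), max_trace_tokens=20, fallback_answer=last_token)
short = generate_with_budget(iter(stream), max_trace_tokens=5, fallback_answer=last_token)
print(full["answer"], full["forced"])
print(short["answer"], short["forced"])
```

With the tight budget the trace is cut after the first step, so the forced answer is the intermediate value rather than the final one. That is the accuracy cost the cap trades for bounded latency and spend.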

Now, an obvious question that people might have is, hey, if I limit my token usage and if my chain of thought isn’t allowed to grow, won’t I get bad answers? Which is a very reasonable question. So typically, yes. If your model is trained to produce very verbose chains of thought, you will run into that token-limit issue again and again with the token budget controller. You’ll run into the issue again and again, and your model won’t be allowed to express all its thoughts and therefore give a good answer.

So typically what we do is we have eval metrics that track things like whether the model is being useful to the user or not, like the thumbs-up, thumbs-down feature in ChatGPT. So if you get a lot of thumbs-downs, and you’re seeing that for all those requests your token budget controller was cutting off your chain of thought, then you can understand from your evals and your logs that your model is actually being too verbose.

We need to retrain the model so that the chain of thought is shorter, it’s less verbose, and it’s able to get to the answer quicker. So that’s why this kind of trace-budget controlling is important in production settings, and how it helps you limit token usage.

5. Techniques like chain-of-thought prompting and self-consistent decoding—generating multiple reasoning paths and aggregating answers—can significantly improve reasoning accuracy. However, they also increase compute cost and latency by running the model multiple times. How do you balance these trade-offs for production systems?

Karun Thankachan: So, maybe I can take a step back. What do we mean by self-consistency? Essentially, like we mentioned, LLMs are not actually doing the specific calculations or specific reasoning. They’re not actually understanding, or they’re not able to derive meaning. They are autoregressive models. So based on whatever they’ve seen, whatever they’re generating right now, they’re going to generate the most likely answer next. So sometimes it may be generating incorrect things.

But if you ask the model to generate an answer to a specific question 10 times, then the majority of the time, it might actually give you the right answer. So what you can do, or what is a decent practice, is: with your LLM, you generate maybe not just one chain of thought. You ask it to generate 20 chains of thought. Then you check the final answer for these chains of thought, and if the final answer across those chains of thought is similar in a majority of cases, that is your final answer.

For instance, let’s go back to our apples example. Four into three, 12. Twenty minus 12, eight. Let’s say 12 chains of thought generate eight, the other five generate nine, and the remaining three generate something like 10. So the majority vote is eight. Now you know that this is probably the right answer. Let’s pick up all these chains of thought and use that to train our SLM.

So self-consistency is a fancy way of saying majority voting. You’re essentially asking the same model to generate the response to a question again and again and again until, in the majority of cases, it starts giving you a specific answer instead of different answers all the time.
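The majority vote from the apples example can be sketched in a few lines. The sampled chains are hardcoded stand-ins for 20 teacher generations, and `majority_vote` is a hypothetical helper name.

```python
from collections import Counter


def majority_vote(chains):
    """Return the most common final answer, its vote count, and the agreeing chains."""
    counts = Counter(answer for _, answer in chains)
    winner, votes = counts.most_common(1)[0]
    agreeing = [steps for steps, answer in chains if answer == winner]
    return winner, votes, agreeing


# 20 sampled chains mirroring the split in the example: 12 say 8, 5 say 9, 3 say 10.
chains = (
    [("4 * 3 = 12; 20 - 12 = 8", "8")] * 12
    + [("4 * 3 = 11; 20 - 11 = 9", "9")] * 5
    + [("4 * 3 = 10; 20 - 10 = 10", "10")] * 3
)

winner, votes, agreeing = majority_vote(chains)
print(winner, votes, len(agreeing))
```

Only the chains in `agreeing` would be passed on as training data for the SLM.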

Now, if you try to use self-consistency during inference, you’re essentially asking the SLM to answer the same question multiple times—let’s say 10 times. You’re picking the majority answer, and then you’re giving that majority answer to the user. The problem is, if you do it at inference time, instead of answering one question once, you’re answering it 10 times. So the cost becomes 10x. The latency becomes 10x as well during inference.

So typically, you don’t want to use self-consistency at inference time, so that you can control the latency. You only typically use self-consistency during your training loops, so that you can figure out what chains of thought to actually feed into your SLM. So the simple rule of thumb is: self-consistency is better for training time, where latency isn’t the concern—and cost also, to a certain extent, because you’re doing a lot of these things in batch, and you can append other techniques on top. You can actually manage the cost and manage latency. So use it at training time. It’s not something you need to use at inference time.

6. In ReasonLite, you emphasize not just final answers but also intermediate reasoning quality—providing targeted reasoning probes and symbolic verifiers to assess a model’s thought process. In a practical setting, how do you evaluate whether a distilled small model is truly reasoning well versus just guessing the right answer?

Karun Thankachan: Got it. So essentially, when you are trying to train an SLM, and like we discussed thus far, what it generates is based on what it has learned so far—what it is currently generating in its chains of thought, or in its thinking, essentially. So when you are evaluating an SLM, it might not be enough to evaluate it on whether it’s generating the final answer correctly or not.

Again, going back to our apples example: four into three, 12; 20 minus 12, eight. Eight is the final answer. Does that mean you evaluate it only on eight? You could. But let’s say it did four into three, 11; 20 minus 11, eight. It still got to the final answer, but it’s because it’s not really doing the correct things. It’s just, again, making guesses about what’s the most probable answer. And it somehow stumbled on the correct answer.

So your model might actually deviate from the behavior that you desire, but your answer is still correct. We don’t want those kinds of things to spread into production. So we want to not just evaluate the final answer. We want to also make sure that we evaluate the steps in between.

So how do we actually do that? We can check what we call stepwise behavior. There are a few things that you can inject into the model—symbolic verifiers or reasoning probes. These are, I think, two things that we have implemented in ReasonLite.

So maybe to give an example of how these things function: a reasoning probe is trying to figure out if the SLM is able to do a specific substep well. For instance, we have mathematical reasoning probes that help you test if your SLM is learning math very well. Take 17 plus 8 is equal to 25. When you do that addition, there is this behavior called carry behavior, where seven plus eight is 15, so you need to put the five and carry over the one. Then one plus one is two—25. That’s how you do the math. So this carry behavior is something you want to see specifically if your SLM is learning. Within ReasonLite, there are functions that help you test specifically for carry behavior within your SLM. So that’s a way of evaluating if your SLM is behaving well on substeps.
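A carry probe along these lines can be sketched by scoring a model separately on additions that need a carry and those that do not. The function names are assumptions, and `forgetful_model` is a stand-in for an SLM that has not learned carrying; none of this is ReasonLite’s actual probe API.

```python
def needs_carry(a, b):
    """True if adding a and b column by column produces at least one carry."""
    while a > 0 and b > 0:
        if a % 10 + b % 10 >= 10:
            return True
        a, b = a // 10, b // 10
    return False


def carry_probe_accuracy(model_add, pairs):
    """Score a model separately on carry and no-carry additions."""
    buckets = {"carry": [0, 0], "no_carry": [0, 0]}  # [correct, total]
    for a, b in pairs:
        key = "carry" if needs_carry(a, b) else "no_carry"
        buckets[key][1] += 1
        buckets[key][0] += int(model_add(a, b) == a + b)
    return {k: (c / n if n else None) for k, (c, n) in buckets.items()}


def forgetful_model(a, b):
    """Stand-in for an SLM that adds digit by digit but drops every carry."""
    result, place = 0, 1
    while a > 0 or b > 0:
        result += ((a % 10 + b % 10) % 10) * place
        a, b, place = a // 10, b // 10, place * 10
    return result


pairs = [(17, 8), (12, 5), (29, 43), (30, 41)]
scores = carry_probe_accuracy(forgetful_model, pairs)
print(scores)
```

A model that looks fine on aggregate accuracy can still score zero on the carry bucket, which is the kind of substep failure a probe is meant to expose.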

Similarly, step verifiers are another way of evaluating if your SLM is behaving the way you want it to. For instance, the apples example again: four into three is equal to 12. That is a substep. You want to verify if that substep is accurate or not. So you take the output of the substep. The step verifier takes whatever it is doing, runs that code, generates an actual answer, and matches it up. So it’s able to see, at the substep level, if it’s giving you an answer correctly.
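As a sketch of a step verifier for arithmetic traces, assuming each step is written as `a op b = c`: every step is recomputed and compared with the claimed result. The step format and the `verify_steps` name are illustrative assumptions.

```python
import operator
import re

_OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}
_STEP = re.compile(r"^\s*(-?\d+)\s*([+\-*])\s*(-?\d+)\s*=\s*(-?\d+)\s*$")


def verify_steps(steps):
    """Recompute each 'a op b = c' step and report whether the claimed result holds."""
    report = []
    for step in steps:
        match = _STEP.match(step)
        if match is None:
            report.append((step, None))  # not a checkable arithmetic step
            continue
        a, op, b, claimed = match.groups()
        report.append((step, _OPS[op](int(a), int(b)) == int(claimed)))
    return report


# The "lucky" trace from the interview: wrong steps, right final answer.
lucky_trace = ["4 * 3 = 11", "20 - 11 = 8"]
print(verify_steps(lucky_trace))
```

The final answer of 8 would pass an answer-only evaluation, but both intermediate steps fail verification.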

So these kinds of reasoning probes and step verifiers are things that you can maybe add on to the final-answer evaluation, and they’ll give you a little bit more information about how your model is actually behaving.

7. Let’s shift into talking about SLM-Fusion and multi-model orchestration. Modern AI deployments often involve multiple models, from small domain-specific SLMs to large general LLMs. But traditional serving frameworks usually assume a single static model, which leads to inefficiencies under dynamic workloads. SLM-Fusion, from what I understand, is an open-source library proposed to bridge this very gap by unifying model merging, routing, and serving in one system. Can you explain the impetus behind SLM-Fusion and also talk to us about how it works in a real scenario?

Karun Thankachan: Got it. So SLM-Fusion—just to give a little bit of context—this paper was written somewhat a while ago, before multi-agents became a little bit more popular and that kind of architecture became a little bit more popular. I would say maybe it’s a little bit outdated at this point.

But SLM-Fusion, the idea essentially is: typically, with LLMs, with their general-purpose reasoning capability, you can, with enough RAG and context engineering, get them to answer questions within multiple domains, even with a single LLM, as long as you have your RAG engine built well. You have a retrieval layer that gets you specific context related to this new domain, and you do context engineering well enough that only the relevant info stays inside the context of the LLM.

But if you’re working with SLMs, since you are fine-tuning your SLMs on a specific task, that fine-tuned SLM is only able to reason very well for that specific task at hand. You can’t just use one SLM and be hopeful that it will be able to pick up or be able to reason in a new domain as well, because you are training it—its parameters are limited. You’re only able to reason for one specific task.

So in this case, what you typically do is train multiple SLMs, and then you figure out a way, based on the questions that are coming in, how to route the question to the appropriate SLM. So that’s essentially the idea behind SLM-Fusion. The key thing is: how do you route it to the correct SLM? How do you evaluate when the routing was inappropriate? And how do you not base the routing on just hard-coded rules, but learn the routing from user behaviors?

Sort of like: hey, it routed to SLM A, the user didn’t particularly like that response, but based on whatever rules we have, that was the SLM to route it to. So now, how do you reconcile the fact that that was the SLM to route it to versus the user behavior? Was it one question and then a follow-up question that switched it to another domain?

So those kinds of things (how do you actually evaluate that, how do you learn it from telemetry, and how do you update this routing over time) were the core of SLM-Fusion. Now with multi-agent architectures, it’s becoming a little bit easier, but if you’re working with SLMs, some kind of routing is good. There have been multiple papers across ICLR, ICML, and AAAI built around this routing concept as well, and that work is a lot more up to date at this point.

8. In production, when would we prefer merging models over just choosing one? Can you perhaps discuss an example use case where merging two specialized models could yield better results than using a single model alone?

Karun Thankachan: Sure. So again, it comes to how different the reasoning is that these two different SLM models would have to learn.

For instance, let’s say within a retail scenario, you have a reasoning model that—let’s say it’s an anomaly detection model—that sort of needs to decide, looking at sales of an item, why the sales dropped anomalously. So if sales of an item drop anomalously, there could be multiple reasons that drove it. So if I were building an SLM model, if I saw sales dropping, then the next thing I would have the SLM model do is be able to generate multiple hypotheses and then figure out what is the appropriate one to chase down and try to answer.

Within that same context, if, let’s say, I wanted to fix the anomaly, I would want an SLM model where I would give it some context on, “I want you to go and hit this system, change the value to something like this, and hit the system, change the value to something like this.” Here, the SLM model doesn’t need to have that kind of broad thinking in terms of hypothesis generation. It needs a little bit more specific tool understanding, a little bit more integrated with API calls.

So the reasoning between both these models would be very different. One would be a bit more broad—generate hypotheses, figure out which one is the right one, and then tackle it, I mean run those hypotheses, figure out what is an appropriate answer. And this one is a little bit more tool-oriented, a bit more in-depth, a bit more specific. It can’t afford inaccuracies because it’s interacting with the tool.

So the reasoning would be very different. In these cases, rather than trying to build one SLM that could maybe do both, it might be a better idea to separate it out, and it might be a good idea to bring these two into a routed format where you generate the hypothesis, you tell the user that, “Hey, I evaluated X hypotheses. These two seem to be the most likely root reasons.” Then the user sort of tries to understand, okay, maybe I also think that, okay, out of all the ones I’ve evaluated, this is probably the reason. Let’s try and fix this. Let’s adjust these metrics or adjust these settings here, and then you route it to the second SLM.

And that SLM sort of makes all the necessary tool calls, all the necessary adjustments, and it has more in-depth, specific reasoning built into it. So that might be a good scenario for routing.

9. One of the core features of SLM-Fusion is an adaptive routing layer that can be rule-based, learned, domain-specific, or cost-aware in deciding which model or ensemble handles a request. How do these routing policies work under the hood? For instance, what would a cost-aware router consider—latency SLA, API throughput costs, query complexity, etc.?

Karun Thankachan: Sure thing. So within the router, we have a few ways you can decide what SLM to route it to. The simplest way, and the easiest way to get started, is just rule-based routing. You see certain domain keywords, and you can route it to a specific domain SLM. The slightly more advanced manner is getting an embedding out of the user query and figuring out which sort of base embedding it matches the most. So each SLM would have a domain-encapsulated embedding associated with it. So it’s everything related to that domain in an embedding. So if the user query matches this domain-specific embedding, route it to this SLM.
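The two routing styles can be sketched together: keyword rules first, then a fallback to the nearest domain embedding by cosine similarity. The SLM names, keywords, and three-dimensional vectors are toy assumptions; a real router would get its vectors from an embedding model.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


# Hypothetical rule table and per-domain embeddings.
KEYWORD_RULES = {"refund": "billing_slm", "invoice": "billing_slm", "shipment": "logistics_slm"}
DOMAIN_EMBEDDINGS = {
    "billing_slm": [0.9, 0.1, 0.0],
    "logistics_slm": [0.1, 0.9, 0.2],
}


def route(query, query_embedding):
    """Try keyword rules first, then fall back to the closest domain embedding."""
    lowered = query.lower()
    for keyword, slm in KEYWORD_RULES.items():
        if keyword in lowered:
            return slm, "rule"
    best = max(DOMAIN_EMBEDDINGS,
               key=lambda name: cosine(query_embedding, DOMAIN_EMBEDDINGS[name]))
    return best, "embedding"


print(route("Where is my refund?", [0.0, 1.0, 0.0]))
print(route("My package is late", [0.2, 0.8, 0.1]))
```

Returning the reason alongside the chosen SLM makes it possible to log why each request went where it did.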

Now, the advantage of this sort of embedding-based matching shows up when the user asks a question that is maybe multi-domain and you routed it to the wrong SLM, or when you need to split that question: route the first portion of the question to this SLM, get a response, then route the second portion to the next SLM and get a response.

So instead of this embedding being static, what SLM-Fusion does is provide you the opportunity to adjust those embeddings based on how well you have done on the questions users have asked in the past. So using your logs, you can pull in your logs. The ones that you didn’t do well on, those ones you can narrow down on. You can figure out how to update your embeddings for those specific ones that you didn’t do well on.
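The interview does not specify the update rule. One simple possibility is to nudge a domain’s embedding toward the queries that should have been routed to it. The log fields, the `correct_slm` label, and the learning rate below are assumptions for illustration.

```python
def update_domain_embeddings(domain_embeddings, logs, learning_rate=0.1):
    """Nudge each domain embedding toward queries that should have been routed to it."""
    updated = {name: list(vec) for name, vec in domain_embeddings.items()}
    for entry in logs:
        if entry["thumbs_up"] or entry["correct_slm"] is None:
            continue  # learn only from poorly served requests with a known better target
        target = updated[entry["correct_slm"]]
        for i, value in enumerate(entry["query_embedding"]):
            target[i] += learning_rate * (value - target[i])
    return updated


domain_embeddings = {"billing_slm": [1.0, 0.0], "logistics_slm": [0.0, 1.0]}

# Hypothetical routing logs; `correct_slm` would come from review or later user behavior.
logs = [
    {"query_embedding": [0.5, 0.5], "routed_to": "billing_slm",
     "correct_slm": "logistics_slm", "thumbs_up": False},
    {"query_embedding": [1.0, 0.0], "routed_to": "billing_slm",
     "correct_slm": None, "thumbs_up": True},
]

new_embeddings = update_domain_embeddings(domain_embeddings, logs)
print(new_embeddings["logistics_slm"])
```

The function returns a copy, so the old and new routing tables can be compared on held-out logs before the update is rolled out.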

And for a particular question, if you feel like it’s a multi-domain question, within the router itself you have a tinier SLM that can split multi-domain questions into separate questions. So with these sorts of knobs, you are not just hard-coding how to route it, but you are able to learn over time how the routing should evolve. And you are also able to address multiple-domain questions by using the routing module to split them into different questions and orchestrate it in a manner where you can still use your SLMs, and you don’t need to try to condense everything or try to get SLMs to interact across domains. So that would be the core way to use this sort of routing more.

10. Let’s talk a little bit about telemetry-driven feedback loops. What signals, according to you, are most valuable for such a loop in a production setting? And how do you feed this feedback back into the system?

Karun Thankachan: Got it. So it really depends, but the most critical one, I would say, is how the user is responding to the queries. So just like ChatGPT’s thumbs-up, thumbs-down—some kind of user satisfaction score. That would be the best way to assess any sort of generative system, because the responses being generated are evaluated by the user. And if it’s not a helpful one, there’s no point in any of these generative systems.

So being able to track user satisfaction scores and attach them to your logs—your chain of thought, your final answer, your user satisfaction scores—that sort of logging system is what we call telemetry. And once your logs are all stored and generated, being able to search through your logs and figure out which ones you didn’t do well on, and having enough logging to figure out which SLM it routed to, why it routed to that SLM, why it tried to split the question into separate portions—having all of those logs in one place is what is going to help you build that feedback loop and improve your routing over time.

Apart from user satisfaction, you could also use things like token usage. Is the compute cost actually building up? Is maybe a question that was designed for one domain being unnecessarily split into multiple questions and maybe just sent to the same SLM again and again? I’ve seen that happen also. So checking if your token cost for any of the responses you are giving is spiking. Similarly, if your latency is spiking.

So these three, I think, would be the top metrics to attach to your telemetry, or have tracked along with your telemetry, with timestamps and request IDs, so that you can map it properly. And then you can improve your routing layer over time.
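A telemetry record carrying those three signals, keyed by request ID and timestamp, might look like the sketch below. The field names, thresholds, and `flag_for_review` helper are assumptions for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass
class RoutingEvent:
    request_id: str
    timestamp: datetime
    routed_to: str
    route_reason: str           # e.g. "rule", "embedding", "split"
    total_tokens: int
    latency_ms: float
    thumbs_up: Optional[bool]   # None when the user gave no feedback


def flag_for_review(events, max_tokens, max_latency_ms):
    """Map request IDs to the reasons they need a closer look."""
    flagged = {}
    for e in events:
        reasons = []
        if e.thumbs_up is False:
            reasons.append("thumbs_down")
        if e.total_tokens > max_tokens:
            reasons.append("token_spike")
        if e.latency_ms > max_latency_ms:
            reasons.append("latency_spike")
        if reasons:
            flagged[e.request_id] = reasons
    return flagged


now = datetime.now(timezone.utc)
events = [
    RoutingEvent("req-1", now, "billing_slm", "rule", 180, 220.0, True),
    RoutingEvent("req-2", now, "billing_slm", "split", 2400, 310.0, False),
    RoutingEvent("req-3", now, "logistics_slm", "embedding", 200, 1900.0, None),
]

flagged = flag_for_review(events, max_tokens=1000, max_latency_ms=1000.0)
print(flagged)
```

Because each event records the chosen SLM and the routing reason, a flagged request can be traced back to the routing decision that produced it.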

11. Thank you for that. So now, quantization is a common way to reduce inference cost, but mixing models of different precision—or even merging quantized weights—can be tricky. So what did you build in SLM-Fusion, again, to use it as a case study, to handle quantized models effectively?

Karun Thankachan: Got it. So, I guess, just to explain why quantization is tricky: quantization is nothing but using lower-precision number formats. With quantization, you can represent values in 32 bits, 16 bits, 8 bits, or 4 bits. The lower you go, the smaller your models become, the faster the multiplications become, and therefore the faster your SLMs become. So you can make your models smaller and faster the more quantized they become. But again, as you make them more quantized, you lose a little bit of information, so they won’t be as accurate.
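The size-versus-accuracy trade-off can be seen in a small sketch of symmetric quantization in plain Python. The weights are made up and the scheme is one simple choice among many; production libraries use more elaborate per-tensor or per-channel schemes.

```python
def quantize(values, bits):
    """Symmetric quantization: map floats to signed integers of the given bit width."""
    qmax = 2 ** (bits - 1) - 1                       # 127 for 8 bits, 7 for 4 bits
    scale = (max(abs(v) for v in values) / qmax) or 1.0
    return [round(v / scale) for v in values], scale


def dequantize(q_values, scale):
    """Map the integers back to approximate float values."""
    return [q * scale for q in q_values]


weights = [0.12, -0.5, 0.33, 0.9]   # hypothetical model weights

for bits in (8, 4):
    q, scale = quantize(weights, bits)
    restored = dequantize(q, scale)
    error = max(abs(w - r) for w, r in zip(weights, restored))
    print(bits, q, round(error, 4))
```

Fewer bits means fewer representable levels, so the round trip through 4 bits loses more of each weight than the round trip through 8 bits.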

So how does it help you use different SLMs that might be in different quantization modes? It gets a little bit tricky here, but we have these things called tensors. When these calculations take place, we do them in large multi-dimensional arrays called tensors. And the way the calculation works across SLMs is that you align these tensors, or align these channels, pad them as necessary, get them to the same quantized format, and then carry the calculation forward.

So, a little bit more on the math side, but the key thing is aligning the tensors so that you’re not assuming that all the models are at the same level of quantization. You try to identify whatever quantization it is at, then sort of work through packages that we already have. It’s not something new that SLM-Fusion is providing, but most of the popular deep learning packages already provide this. But aligning per tensor, per channel, so that the calculations actually flow through.

And in terms of, apart from just the quantization, building adapters into your models is another way to perhaps mitigate this. Adapters are still, I would say, a little bit unproven in terms of the value they add for the number of parameters they introduce. But in some very few scenarios, where the domains are similar enough, but you need a slight change in parameters so that it adapts—not to a completely new domain, but maybe a complementary domain—in those cases, I think adapters work. But for quantized models, if they’re in different quantized states, having adapters can help you maybe bridge that gap as well.

So, a little bit on the math and technical side: alignment of tensors. A little bit less mathy, but more on the modular side: adapters to help bridge the gap. So those are the two things that I think SLM-Fusion had that help you work with different quantized SLMs.

12. All right. Now, SLM-Fusion also introduced a FastAPI-based Fusion Gateway that is even OpenAI-compatible for inference requests. So how do you see a system like this being deployed in a production microservice architecture? Could it sit alongside existing serving frameworks, perhaps?

Karun Thankachan: Yep. Yeah, definitely. So the FastAPI backend is essentially there to support that same thing. The idea being that, within microservice architectures—again, maybe taking a step back—the core idea is that anything that has to do with one specific function is split out, modularized, and kept separate. So your reasoning engine, if it is like this multi-SLM model, you can keep it separate from everything else. You can update it as required without impacting any of the other microservices in that environment.

And with the FastAPI backend, the key idea is that you can hit it just like you would any other kind of service that you can abstract away. So what we typically call, I guess, reasoning as a service—RaaS, if you want to call it a new domain. So whenever you need a little bit of, “Hey, I think I need a little bit of human reasoning at this particular stage to make a decision on what to do next,” then just hit the API endpoint like you would in any kind of microservice architecture.

It abstracts away all the reasoning. It will do the routing within, it will pick the SLM, it’ll generate an answer, and it will send you back a specific API response that follows the contract. And that response isn’t just something that’s generated by the SLM. It’s filled in so that the contract is always maintained with whatever service is calling the reasoning-as-a-service microservice.
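The idea of filling in the contract can be sketched without any web framework: whatever text the SLM produces, the gateway wraps it in a fixed response shape. The `ReasoningResponse` fields and the `to_contract` helper are assumptions for illustration; in a FastAPI service the same shape would typically be declared as a Pydantic response model.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class ReasoningResponse:
    request_id: str
    model: str
    answer: str
    truncated: bool = False


def to_contract(request_id, model, raw_slm_output):
    """Build a response that always matches the contract, whatever the SLM produced."""
    try:
        answer = str(json.loads(raw_slm_output)["answer"])
    except (json.JSONDecodeError, KeyError, TypeError):
        answer = raw_slm_output.strip()  # fall back to the raw text as the answer
    return asdict(ReasoningResponse(request_id=request_id, model=model, answer=answer))


print(to_contract("r1", "billing_slm", '{"answer": 8}'))
print(to_contract("r2", "billing_slm", "  eight dollars "))
```

Callers can rely on the same keys being present in every response, regardless of which SLM answered or how well-formed its output was.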

So yeah, that way, you can just abstract the whole thing away, and you can put it in any kind of production environment, with the typical guardrails that you have—like trace-budget controllers, latency holders, and everything. It will actually stick to the SLAs that you typically expect in a multiple-microservice architecture system.

13. Finally, Karun: any emerging trends, perhaps in governance or tool integration, that you believe will significantly impact how we deploy language models in production?

Karun Thankachan: I think, right now, diffusion models are becoming a little bit more commonplace, and that might be a trend worth checking out.

Apart from that, I guess the main thing to focus on is that, within LLMs, maybe six months ago, there was a split: is investing in parameter-efficient fine-tuning (LoRA, QLoRA), along with alignment techniques like DPO and PPO, a good investment of time versus just focusing on RAG and prompt engineering? It looks like the industry is shifting a lot toward RAG and context engineering. One reason is that it’s maybe cheaper. For the other things, you need specific hardware, and you need to hire people who know how to do it. But it also seems like you can actually get fairly accurate answers and fairly good reasoning from your LLMs if you set up a good RAG pipeline and bolster it with good retrieval: a way to rerank the retrieved documents and select the best ones on top of it.

So don’t just have a simple RAG pipeline. Fit a model on top, maybe improve the accuracy of your retrieved documents with a reranking model, and also focus a lot more on context engineering. Don’t bloat your context with a lot of information. Look into context compression. Look into eliminating things from your context if they are irrelevant. Just having irrelevant things increases hallucinations. So a lot of investment in good engineering, I would say, combined with good retrieval, seems to be giving a lot more accurate answers, a lot less hallucination, and a lot better reasoning as well.

So that seems to be where the industry is focused right now. It would be interesting to see if it switches back to fine-tuning, depending on how this diffusion trend plays out and how the cost-versus-LLM trend plays out. I think those are some trends to keep an eye on to see where we need to switch next.

Trade-offs in Modern System Design: A Conversation with Archit Agarwal

Divya Anne Selvaraj — Thu, 26 Feb 2026 05:13:34 GMT

This conversation with Archit Agarwal is a practical tour through modern system design—starting from first principles and repeatedly returning to a single constraint: real systems live under trade-offs, and good engineers choose those trade-offs deliberately. Agarwal is a Principal Member of Technical Staff at Oracle, where he works on ultra-low-latency authorization services in Go. He has 11+ years in backend engineering across .NET and Go, and he writes The Weekly Golang Journal, focused on turning system design into usable, operational guidance—especially around performance and efficiency.

He lays out the inflection points that justify splitting—deployment friction, widening blast radius, and the need for truly independent scaling—while emphasizing that flexibility comes with a real operational tax. On cost and resilience, Agarwal makes the same argument from a different angle: engineering decisions should be evaluated as performance per dollar, not performance in isolation. He describes building cost awareness into the design process via observability, explicit cost discussions, and being disciplined about scaling only when needed.

Finally, the conversation shifts from production architecture to interview performance. Agarwal recommends that candidates stand out by aligning on requirements first, surfacing trade-offs explicitly, and communicating clearly enough that the interviewer can follow the “commit history” of their reasoning. He also explains how he expects candidates to handle changing constraints midstream—by absorbing the change, restating it, and selectively updating only the affected parts of the design—while building breadth through fundamentals, real-world problem practice, and a few deep specialties.

You can watch the full conversation below or read on for the complete Q&A transcript.

Emerging Trends and Challenges in System Design

1. We’re seeing this pendulum swing in architecture, with many teams rethinking a pure microservices approach and embracing modular monoliths to reduce complexity and cost. How do you decide when a microservices architecture is truly warranted versus keeping a system design simpler?

Archit Agarwal: To be very honest, this is the first question that I ask myself when I start designing a new system or a new module. And the rule that I follow is very simple: If the problem isn’t complex yet, don’t overengineer it. Start with just a monolith.

New engineers that come into the industry come up with a lot of these buzzwords—event-driven architecture, microservices, serverless. They’re great, but you cannot apply everything in just one go until your application really needs it, right? So that is a key difference between any interview-ready engineer and a genuinely good engineer: a genuinely good engineer would not want to implement everything up front. He would engineer things around the problems that we are facing.

Early in any project, the requirements are always changing. You have very little understanding of the domain, right? And the scope is very small. So you should not go into implementing every new buzzword that you see in the industry. You start small, start with a monolith, and design in a way that, in the future, if you want to break that down, you can easily do that, right?

And if your application requires low latency, for example if you’re working on a financial kind of system, you cannot live with only microservices. You will have to evaluate if microservices are good for you. If you use microservices, there are always going to be additional network hops, and they will slow down the system, right? So I would always say that microservices aren’t the magical fix that fixes bad architecture, right? They just distribute that over the network.

So when you start writing your application, start with a monolith and then start understanding if you have the pains where the pain of having the monolith is greater than the pain of splitting it. Ideally, we would have a lot of signals when we can identify whether we should move out of a monolith or not. A few of those signals are: your deployments are getting bigger and slower, you have a larger blast radius on the bugs that you will see, or you need a lot of independent scaling.

For example, if you have a sale for an e-commerce platform, if there is a sale coming up, you would always want your payment-related system to scale larger than your login system, right? So if those are the requirements, you definitely start moving out of a monolith and move into microservices.

And there are a lot of other things. For example, if you need different tech for different problems. If you want to have analytics, you would want to use different technology for that, right? So in a monolith, you cannot have your project written in multiple languages.

So microservices definitely give you flexibility. They also give you headaches, so you should always choose wisely.

2. With cloud spending at an all-time high, there’s sustained CFO scrutiny on engineering decisions. How do you incorporate cost considerations into system design?

Archit Agarwal: Ideally, I would say this is a point where every engineer becomes a philosopher. I remember one quote from—I don’t know where I read it, but it stuck to my mind—and it said that a good engineer would design for performance, but a great engineer would design for performance per dollar.

So any engineer who is thinking about the cost with respect to the performance gain is a great engineer. I didn’t truly understand this quote until one of my family members started one of his startups and I was involved with him in all the tech-related discussions. That was the first time when I realized, OK, when I’m fighting with my manager or my senior manager over using a particular tech, why do they always say no if they don’t need it? And I’m always saying that it will help us scale, right?

That was the first time when I started realizing the importance of why I was denied a lot of requests—because those were not the real pain that I was solving for, right? Trust me, every system will definitely cost something, and you need to understand that no business can keep spending money on something that is not needed at that particular moment.

And to be honest, we had one client—I’ll give you one more instance—where, as a team, we saw great advantage. There was one client who was pushing to reduce the infrastructure cost, and we as engineers, again, we were not doing that. So what he did is he introduced a dashboard where we were seeing per-engineer cost of the infrastructure for the development process. And those numbers were huge per month. And to be very honest, seeing those numbers listed against each person’s name, everyone started evaluating whether to use a particular tech or not.

Like, whether it is really needed, or when you log off from the system, should you shut down your EC2 instances or not, right? That is a huge difference, and in six months, we saw a 20% month-on-month decrease in the infrastructure cost.

So I would say I follow a few principles with that. I don’t prematurely optimize, but I stay observant on the infrastructure. I keep my observability to the extreme so that I can have a dashboard and see where my system is lacking, what part to scale, where I should have improvement. So observability is very important in this perspective.

Then I always design my system for horizontal scaling, but I don’t horizontal scale unless it is needed. Because if you have infrastructure which is of no use, there’s no point spending that money. But you should have that in your infrastructure requirements and your lifecycle.

For example, if you are using an S3 bucket and now you have 100 GB of data there which is ideally not being used for months—or will never be used—why do you want to spend money on live data there? You should push it out to cold storage and spend less on that data which is practically not being used.

Then, into the technical conversation: for every story that we start designing, we have design discussions. In the design discussion, we would try and include the costing. At times we see that engineers come up and say that they’ll reduce the latency by 10%, but to reduce the latency, they’re increasing the cost of the infrastructure by two times or three times.

So then the question is again on the engineer: Do we really want to improve the latency at that high cost? If it is really needed, we are OK to spend, right? But if that 10% decrease in response time is not really giving any advantage to the user, spending that much more on infrastructure is hard to justify. This makes the team aware that performance without cost awareness is just expensive engineering. So you should not just keep adding to infrastructure cost every now and then.
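The trade-off described above, a 10% latency gain for two to three times the infrastructure cost, can be put into numbers during a design discussion. The following Go sketch is one possible way to do that. The `Option` type, the `costPerMsSaved` helper, and every figure used with it are hypothetical and are not taken from the interview.

```go
package main

// Option describes one design alternative under discussion.
// The type and its fields are hypothetical, chosen for illustration only.
type Option struct {
	Name           string
	LatencyMs      float64 // expected response time in milliseconds
	MonthlyCostUSD float64 // expected infrastructure cost per month
}

// costPerMsSaved returns how many extra dollars per month the candidate
// costs for each millisecond of latency it removes compared to base.
// ok is false when the candidate is not actually faster, in which case
// the ratio is meaningless.
func costPerMsSaved(base, candidate Option) (usdPerMs float64, ok bool) {
	saved := base.LatencyMs - candidate.LatencyMs
	if saved <= 0 {
		return 0, false
	}
	extra := candidate.MonthlyCostUSD - base.MonthlyCostUSD
	return extra / saved, true
}
```

With a hypothetical baseline of 200 ms at $10,000 per month and a candidate of 180 ms at $20,000 per month, the helper reports $500 per month for each millisecond saved, which gives the team a concrete figure to weigh against the benefit to the user.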

3. Modern systems are facing record-breaking DDoS attacks and increasingly complex supply-chain threats. For instance, 2025 saw hypervolumetric DDoS attacks peaking at multi-terabit levels and a 188% year-over-year spike in malicious packages in open-source registries. How do you design systems to be resilient against such attacks and vulnerabilities that are increasing exponentially?

Archit Agarwal: In today’s world, I don’t design things thinking that I’ll not get attacked. I always design thinking that I’ll always be attacked, and how would I react when I’m attacked?

Modern systems are operating in very hostile environments. So you should always assume two things: the system will fail—that is for sure, that is inevitable—and then you’ll definitely get attacked now or in the near future. So if you plan your infrastructure and your architecture based on these two assumptions, you’re making good decisions to protect your system against these two things.

Once you accept these things, you can reduce the blast radius of these two things because now you are aware. So how do you do that? There are a couple of things that we start with.

The first part is always a layered defense, starting with your network layer. The network layer is your first line of defense against any attack. Here you can use services offered by your cloud provider. For example, AWS has a service called AWS Shield Advanced. You can use that. Azure has a service. Google Cloud has a service. Every major cloud provider will have some service to protect the network layer, so you start by using that.

Then in your application layer, you start adding code for limiting the request. For example, you start implementing rate limiters based on the geolocation, or IP, or user. Maybe you say that if a user is making more than 100 requests per minute, he’s probably trying to attack my system, because that’s not an ideal flow of a user to call my system 100 times in a minute. So we’ll block that user.

And maybe some bot-type of protection. For example, Google has a bot which crawls every web page and collects data for optimizing the search results. But Google’s bot makes sure that it is not overloading the server with a lot of requests while it crawls. Then there are bots that people write to overcrowd your server and keep collecting data so that they can gain some advantage from what they collect from you. So you should write your application layer to protect yourself against such bots.

Then your architecture has to have an upper limit on your auto scaling. So you cannot keep auto scaling to 100 servers for one service, right? Because if you’re scaling to that extent, that means there is some malicious activity going on your server suddenly. So you should always have an upper limit. Auto scaling is great until you realize that you’re auto scaling your DoS attacks.
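The upper limit on auto scaling can be expressed as a simple clamp. The Go sketch below is an assumption-laden illustration: `desiredReplicas` and its parameters are hypothetical, and real autoscalers (for example, a cloud provider's scaling group) expose this ceiling as configuration rather than code.

```go
package main

// desiredReplicas returns how many instances to run for the given load,
// never going below minReplicas or above maxReplicas.
// capped is true when demand exceeded the ceiling, which is a signal to
// alert a human and investigate rather than to keep adding servers.
// The function and its parameters are hypothetical.
func desiredReplicas(requestsPerSec, perInstanceCapacity, minReplicas, maxReplicas int) (replicas int, capped bool) {
	if perInstanceCapacity <= 0 {
		return minReplicas, false
	}
	// Ceiling division: the number of instances needed to serve the load.
	needed := (requestsPerSec + perInstanceCapacity - 1) / perInstanceCapacity
	if needed < minReplicas {
		return minReplicas, false
	}
	if needed > maxReplicas {
		return maxReplicas, true
	}
	return needed, false
}
```

Returning the `capped` flag alongside the replica count keeps the ceiling from failing silently: hitting it is treated as an event worth looking at, in line with the point that unbounded scaling can end up scaling the attack.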

Then the second thing on your defense would be having resiliency principles. For example, if you have a bigger application, you would always deploy it into multiple availability zones. Why? So that if one data center is under attack, you can completely shut down your service deployed in that data center, but still have your application up and running for users because your services are also running in a different data center. Or maybe go multi-region.

Or these days, you can even go multi-cloud, but multi-cloud is not easy. You will have to consider a lot of things around multi-cloud.

Then comes your supply-chain security. These days, modern applications depend on a lot of external services, so you need to make sure that whatever service version you are using has already been validated for security risks, and that you are not auto-upgrading until you validate the new version, because those dependencies are the actual surface area that you are exposing to the attacker. That is the surface area an attacker can start attacking. So that is the next thing that you look at.

Then you apply security by authorization, and by authorization you would always do a deny-by-default. You don’t say that I will allow everyone unless he has this role. No—you say that everyone is denied unless they have this particular access. So then you protect yourself.
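Deny-by-default can be sketched in Go as a policy lookup whose only path to "allowed" is an explicit grant. The `Policy` type, the role names, and the action names below are hypothetical and serve only to illustrate the idea.

```go
package main

// Policy maps a role to the set of actions that role is explicitly granted.
// The type is a hypothetical illustration of a deny-by-default check.
type Policy map[string]map[string]bool

// IsAllowed returns true only if at least one of the caller's roles has an
// explicit grant for the action. Unknown roles, unknown actions, and callers
// with no roles all fall through to the final "return false".
func (p Policy) IsAllowed(roles []string, action string) bool {
	for _, role := range roles {
		if p[role][action] {
			return true
		}
	}
	return false // deny by default
}
```

The design choice is that there is no "allow unless" branch anywhere: forgetting to configure a role results in a denial, never in accidental access.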

Then your token should be short-lived. You don’t ideally create tokens that are living forever, right? So that even if the attacker has access to the token, he is only having access for a particular duration. He loses the access after the token has expired.

Then observability is the key. You should always have observability on your systems. You should never skip observability and logging, because without them you have no visibility into what is happening.

4. Today’s architectures often depend on numerous third-party services and cloud providers, even if not by explicit choice. How do you design a system that remains portable and robust when you’re relying on external SaaS APIs, cloud services, or even multiple cloud environments?

Archit Agarwal: I was expecting that question with all the recent AWS, Azure, Cloudflare outages that have been going on in recent months. And to be honest, every system depends on a lot of different external services: your database, your messaging queue, your SaaS APIs. All of these are external dependencies. And you cannot create an application these days without having dependencies on at least one of them.

So I would say multi-cloud is not always feasible because it has its own challenges. There are business challenges, there would be some data-related privacy challenges, and you have cost challenges definitely—because if you have multi-cloud, you will have a lot of huge costs that you will have to invest.

So ideally, we don’t design to avoid dependency. We design so that if one dependency creates a failure, the whole system is not down. That is the core intention of designing things. There are a few principles that we usually follow, and I think most engineers would agree.

We have an abstraction layer for each external service. For example, if you are talking to a storage service, we have an interface through which our application will talk. The implementation behind this interface can be swapped any day: today I’m talking to AWS, tomorrow I’ll go ahead and talk to Azure. So it is easier for us to keep switching the external dependencies without impacting our actual application. So this is decoupling the application from the external dependency.
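The abstraction layer for a storage service can be sketched in Go as an interface that application code depends on, with provider-specific implementations behind it. Everything named here (`BlobStore`, `MemoryStore`, `SaveReport`) is hypothetical. An in-memory implementation stands in for a real AWS or Azure client so the sketch stays self-contained.

```go
package main

import (
	"errors"
	"sync"
)

// ErrNotFound is returned when a key does not exist in the store.
var ErrNotFound = errors.New("object not found")

// BlobStore is the only storage API the application sees.
// A cloud-specific implementation would satisfy the same interface.
type BlobStore interface {
	Put(key string, data []byte) error
	Get(key string) ([]byte, error)
}

// MemoryStore is a stand-in implementation used here instead of a real
// cloud client, which keeps the example free of external dependencies.
type MemoryStore struct {
	mu      sync.RWMutex
	objects map[string][]byte
}

func NewMemoryStore() *MemoryStore {
	return &MemoryStore{objects: make(map[string][]byte)}
}

func (m *MemoryStore) Put(key string, data []byte) error {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.objects[key] = append([]byte(nil), data...) // store a copy
	return nil
}

func (m *MemoryStore) Get(key string) ([]byte, error) {
	m.mu.RLock()
	defer m.mu.RUnlock()
	data, ok := m.objects[key]
	if !ok {
		return nil, ErrNotFound
	}
	return append([]byte(nil), data...), nil // return a copy
}

// SaveReport is application code. It knows nothing about which provider
// is behind the interface, so switching providers does not touch it.
func SaveReport(store BlobStore, id string, body []byte) error {
	return store.Put("reports/"+id, body)
}
```

Because `SaveReport` accepts the interface, moving from one provider to another means writing a new type with the same two methods and changing the single place where the store is constructed.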

Then we can use open standards and some cloud-neutral tools. Standard as in containerization, Kubernetes, telemetry; use some databases that are open-ended—for example, Postgres, MongoDB. And for cloud-neutral tools, you can go ahead with using Terraform, where you can deploy to different cloud providers any day—you can choose between any.

A single region is a single point of failure, and a single cloud can be one too, but you will have to be cost-smart about using multi-cloud. You need to make sure that your disaster recovery model is in place. You don’t replicate all the services to a different cloud. Only replicate the mission-critical services to different clouds so that your users see no impact on their most important daily tasks, while some other tasks can be offline for some time and that is still OK for them.

You’ll have to plan that, and then have unified observability. You cannot have observability divided across different clouds or different regions. You should have one single place to look at logs, traces, and everything so that you don’t do the guesswork. You have a curated list of everything in one place.

Practical Architecture Insights from Experience

5. You personally have experience building ultra-low-latency services, such as global authentication systems. What design principles and techniques are crucial for achieving sub-millisecond latency at scale?

Archit Agarwal: Ultra-low-latency systems look very simple from outside, but they’re a totally different type of structure that we are building. So I treat latency as the monthly budget that you have. Now, every network hop or any memory allocation that you do will take something out from that budget, so you will have to be very smart in choosing where to spend.

So you don’t ideally optimize for speed—you eliminate whatever is slow. Start eliminating whatever is slow. So there are a few key principles that I usually follow, and I try pushing my team to follow those.

One is: move the computation closer to the user. So your computation layer should be closer, or deployed into the edge location where the user is trying to access from. So let’s say I’m living in Bangalore and I’m trying to connect to a server sitting in the USA—I will have a lot of latency, right? So do that: fix the compute layer closer to the user.

Then avoid network hops completely in those hot parts where you want ultra-low latency. You cannot have network hops to different microservices. You always use in-memory everything. You don’t go to a distributed cache, you don’t rely on some other network server—because, again, you’re reducing the network hops.

Then you keep your service lean. You don’t use a lot of wrappers. For example, if you are using wrappers, those wrappers—finally—convert that into the native code only, right? So I would always recommend: remove those wrappers and directly communicate in the native language to the machine. That will improve the performance and reduce the latency time on your server.

Then improve your network layer. For example, reusing HTTP connections will help, so you don’t initialize HTTP connections again and again on your system. Then use the right protocols. If for your service-to-service communication you’re using plain HTTP with JSON, that may not be good enough. You can use gRPC. gRPC, which runs over HTTP/2 with a binary encoding, is typically much faster in service-to-service communication, so you choose that.

And then the last part is always the right hardware and the runtime that you’re running on. If your hardware is too old, too laggy, there is nothing that can solve the problem. You will have to fix the hardware also.

6. If I asked you to summarize briefly, how do you ensure that pushing for extreme performance doesn’t compromise reliability or maintainability?

Archit Agarwal: What I’ve observed in my experience till now is that, in an application, not more than 5% of the application actually requires that ultra-low latency. The other 95% of the application is still OK with having a little more latency.

So you should only optimize that 5% which actually requires ultra-low latency. You cannot develop an application where everything is designed for ultra-low latency. So for that 95%, I would always say: design it for readability and maintainability. But for the 5% which requires low latency, we can compromise on readability and improve the latency there.

Cracking the System Design Interview

7. System design questions are broad and open-ended, and probably that’s why they’re challenging. Do you recommend using any kind of structured approach or framework to tackle these interviews?

Archit Agarwal: System design interviews are not about memorizing a particular framework. It’s about thinking in a framework. Having a framework will never have a bad impact—it will only help you because now you are more calm, and you’re approaching the problem in a structured way without using buzzwords very initially in the conversation.

I’ve seen a lot of engineers come in to a system design interview and, as soon as I give a problem—let’s say, “design this system”—they start with, “let’s use microservices,” and start using distributed cache. But they didn’t understand what scale I want the system to be in. And when I asked, “How many users are you planning on this system?” they would ideally say 1,000 users or 10,000 users in a minute. But is that really needed? Is that really what I wanted? That’s not in alignment.

So I would always say: start with one to two minutes of quick alignment with the interviewer. Try and gather the functional requirement, where you basically get answers to two main questions: What are we actually building, and what does the user actually need? By this, you will understand what the database model is—whether the system is read-heavy, write-heavy, what type of system it is. Then you go into nonfunctional requirements. Now, nonfunctional requirements are the ones that actually drive the architecture.

So in nonfunctional requirements, you ideally collect data around the number of requests that you are planning on, the scale at which you are operating, the consistency that you are looking for, or is there any latency requirement there. Nonfunctional requirements are the ones that decide the architecture—not the other way around.
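The numbers gathered from nonfunctional requirements usually feed a quick back-of-envelope estimate before any architecture is drawn. The Go sketch below shows one way to do that arithmetic. The `Estimate` type, the `estimate` function, and all inputs are hypothetical, and the peak factor and per-server capacity are assumptions a candidate would state out loud.

```go
package main

// Estimate holds back-of-envelope numbers derived from nonfunctional
// requirements. The type is hypothetical and for illustration only.
type Estimate struct {
	AvgRPS          int     // average requests per second
	PeakRPS         int     // average multiplied by the assumed peak factor
	Servers         int     // servers needed to absorb the peak
	StorageGBPerDay float64 // new data written per day, in gigabytes
}

// estimate converts daily volume into per-second load, server count, and
// daily storage growth. It returns a zero Estimate for non-positive inputs.
func estimate(dailyRequests, peakFactor, perServerRPS, bytesPerRequest int) Estimate {
	if dailyRequests <= 0 || peakFactor <= 0 || perServerRPS <= 0 || bytesPerRequest < 0 {
		return Estimate{}
	}
	const secondsPerDay = 86400
	avg := (dailyRequests + secondsPerDay - 1) / secondsPerDay // round up
	peak := avg * peakFactor
	servers := (peak + perServerRPS - 1) / perServerRPS // round up
	storageGB := float64(dailyRequests) * float64(bytesPerRequest) / 1e9
	return Estimate{AvgRPS: avg, PeakRPS: peak, Servers: servers, StorageGBPerDay: storageGB}
}
```

With hypothetical inputs of 86.4 million requests per day, a peak factor of 3, 500 requests per second per server, and 1,000 bytes stored per request, this works out to 1,000 average and 3,000 peak requests per second, 6 servers, and about 86.4 GB of new data per day.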

So yeah, I would say: consider the system design interview as two engineers discussing a problem. It should not be like you are getting interrogated by the other person. If you are asking the right questions in the initial one to two minutes, you have already impressed the interviewer. He’s all ears now: he’s listening to the conversation, and he’s also interested in giving his thoughts on it. After doing all this, you can move to high-level design and get into the different parts of it.

8. So according to you, how should candidates break down a complex design problem during an interview to ensure they cover all important aspects? I know part of it is asking those questions, but what else?

Archit Agarwal: Basically, when it comes to a system design, you should try and break that complex system into smaller pieces and then go to the high-level design.

So once you have got those questions answered, basically the functional and nonfunctional requirements, you start by introducing a very high-level design diagram, and then you start zooming into one piece at a time. For example, you have given the high-level architecture where you say that there is a user who is making a request to the API server. That request goes to the service, and then the service makes the call to the database or maybe the caching layer, and the response is sent back. That’s a very high-level architecture that you have.

Now you start zooming in: What type of API gateway? Do you need a load balancer? Do you need multi-region deployment? And all these are answers that you have already collected from the nonfunctional and functional requirements—and this is how you start introducing your thought process.

And in this process, when you are trying to zoom into each piece, what you do is, ideally, you start discussing the trade-offs. For example, when you talk about database, you say, “I’m using a relational database.” Why are you using a relational database? Why not NoSQL? That is a trade-off that you should introduce in your conversation. Then why are you using EC2, not a Lambda service, right? So all these trade-offs are something that you start discussing, because system design, ideally, is about discussing the trade-offs.

So if you know the trade-offs—why you’re using a particular thing over the other—you have already made progress where the interviewer knows that this person knows things well. He knows his choices. He understands why to and when to make a choice.

So by this time, he will be very confident that this guy will be able to design an application which is operating at a Google scale. Maybe the application is as simple as a to-do application, but he will be able to take it to the scale level that we want.

9. And if you turn the lens inwards a bit from your perspective as a system design interviewer, what is your process for evaluating a candidate’s depth versus breadth?

Archit Agarwal: So honestly, a system design interview is not about the diagram and memorized architecture. It’s about building a thinking muscle more, right? Most people try to study system design like a subject, but I would say: think of system design as a skill that you are adding to your bucket, right? It’s a skill you need to improve with structured and deliberate practice. Start with strong fundamentals—that’s what we just discussed, right? You should have strong fundamentals.

Then start practicing mock interviews. Take help of some person—maybe a mentor or a friend—who can sit down with you. You start designing one system design problem. For example, start with a URL shortener. Start discussing it with your friend or a mentor. And try to form a complete framework where you say that first, in any system design, I’ll get these things answered; then I’ll go to this part; and then I’ll go to this part. Try and do your system design practice in that particular framework so that you are very comfortable.

Be comfortable with the framework itself. You should not memorize the questions that you have to put in, because the questions will keep changing based on the system. But the framework should be good enough so that you have easy traversal through the problem, and it is easy for you to travel there.

Then work backward from a real-world system. So what I usually do is, I question myself on a few systems. For example, take WhatsApp, which almost everyone uses, right? I would think about how WhatsApp is able to scale its messaging servers. And then I will start exploring articles and engineering blogs around it, and start understanding how we can do that, right? Or maybe how Netflix is able to scale streaming globally. That’s a completely different engineering challenge. How is Netflix able to do it? So work backward: think about the system, and then start researching it.

Then start building things. So then you start building things—and maybe you don’t do it at a global scale, but at least when you start building, you will understand the challenges around latency, or maybe race conditions, or all those constraints that you think about, right? You start feeling that, and you start solving that.

And then the last part is definitely: learn to communicate. Because if you don’t learn to communicate in system design interviews, you’ll not be able to excel there.

10. But do you recommend any specific resources, books, or specific real-world exercises for mastering system design concepts and being interview-ready—especially for senior engineers aiming to showcase their expertise?

Archit Agarwal: See, for someone who is aiming for a senior role, I would definitely suggest a mix of a few things—starting from a book, real-world blogs, and then real-world exercises.

So for books, I would recommend you definitely read Designing Data-Intensive Applications by Martin Kleppmann. That is a must-read book for any senior engineer who is aiming to excel in system design. Then there are books like System Design Interview, Volume 1 and Volume 2, by Alex Xu, right? Those two are very good books. Then Building Microservices by Sam Newman.

So those are a few very good books that have been written. And if you read those books, you’ll get a lot of understanding on system design. Then you can refer to some engineering blogs by big tech giants. For example, Netflix has an engineering blog. Uber has an engineering blog—and all those big tech giants who are into technical space, and they have a big tech infrastructure that they maintain, they always have engineering blogs. Go refer and read those blogs. Go to high-quality YouTube channels where they’re not just discussing the diagram—they’re discussing the concept, more depth into the concept. So refer those channels, in case you want.

And then finally, there is actually designing a system yourself, one that is time-tested and scale-ready. So system design interviews aren’t cracked by memorizing some answers. They’re cracked by building a strong foundation, practicing real problems, and then thinking like an engineer, not an exam candidate.

11. Even experienced engineers can stumble in design interviews. What are some of the most common mistakes or pitfalls you see candidates make—especially when they’re quite experienced and perhaps more confident than some others—and how can engineers avoid these mistakes?

Archit Agarwal: System design interviews are funny because people don’t fail because they don’t know what Kafka is, or maybe DynamoDB. They fail because of the way they communicate with the interviewer.

So I would say that if you’re having good communication—and you’re establishing that communication and having a two-way communication with the interviewer—that’s half of the job that is already done. I’ve seen engineers who jump directly into solutions as soon as they listen to a problem where—let’s say I say, “design this system”—and they would start saying, “I’ll use Redis, I’ll use Kafka.” I would say, slow down. First, understand the scale constraints. For example, how many requests per second are we operating at, or how much data are we expecting per day flowing in the system? Or is there a security requirement?

For example, if you’re operating in a European country, you have different compliance requirements for personally identifiable information than in other countries, right? So you should start asking about those constraints first and then start coming to a conclusion and architecting things, right?

And you probably don’t need to design at Google scale everything. It doesn’t have to scale to Google, right? There are things that are defined for small scale only. For example, let’s say there is an application that I want to design that is only to be used by my company’s engineers—it doesn’t have to go outside that. So why do I need multi-region deployment? I can do a local area network deployment and live with it, right? I don’t even need cloud there.

So those problems you need to understand. Then if you understand how many requests, how many servers would you need, or how big a database do you need, right? So if you start addressing those basic questions, I think you are already sorted and you are on the right track on that.

12. Have you ever seen a case where the interviewee has asked too many questions? Has that ever happened?

Archit Agarwal: Yeah, I have once seen one interviewee who was asking too many questions, and that particularly gave me an idea that the question that I have probably asked him is something that he’s not aware of.

For example, I gave him a system. He didn’t have any idea about the system. He’s never thought about that. He might be using that every now and then, but he has not given it a thought. But it is OK. Let’s say if I’m interviewing a very junior engineer, he might not have thought about a lot of things by then, and if he’s asking too many questions, it is still OK.

But if he’s asking questions that are very small, and I think those are very basic for that particular level of engineer, then it raises a red flag. But asking clarification questions is perfectly OK.

13. Now, as you’ve also said, a system design interview isn’t just about the final answer, right? It’s about how you communicate, how you adapt to the constraints you’ve sort of discovered during the conversation. Interviewers often value a candidate’s ability to clearly explain their thinking and reasoning—and the ability to adjust to constraints that are put in front of them mid-discussion, even. So in this context, how important are communication skills in these interviews, and what does good communication look like for a system design question?

Archit Agarwal: OK, so, honestly, communication is half of your system design interview, or maybe more. Say I am capable of designing a beautiful architecture in my head but I’m not able to communicate or explain it to the other person. For the interviewer, that architecture doesn’t even exist, right? Because I was not able to explain it to them.

I have seen candidates who designed very solid architectures, but they were either too quiet, used too much jargon, or were too scattered in explaining the information. A system design interview is about how you communicate and explain the architecture you are thinking about, because that gives insight into whether you will be able to work with a team of architects, product managers, and junior engineers, and whether you’ll be able to explain what you’re thinking. The system design interview is also intended to assess your communication skills.

On the technical side, there are a few things that I always suggest to everyone. Think out loud. You should not be silent for, let’s say, five minutes while you just think about the system. Start saying whatever you are thinking. People need to know your brain’s commit history, basically.

So maybe you say, “I’m choosing this approach because of this thing,” or “Given that this is the scale at which we are operating, this option makes more sense.” Start communicating your ideas. Maybe what you are saying is not the right thing for the system, but once you say your idea out loud, it will make more sense to you and you’ll correct yourself, and it is perfectly OK to correct yourself.

The interview should not feel like a monologue where you’re just speaking and the other person is listening. Trust me, if that is happening, you should take it as an indication that you have already lost the session. To avoid that, you have to structure your answers. What you say is important, but how you say it is even more important, right? A good candidate breaks the answer into multiple steps, summarizes things, and occasionally signals transitions, like “Now I would start discussing the data flow” or “Let’s start discussing the caching strategies.”

Check if the interviewer is aligned to your communication or the approach that you are trying to follow, and make that interviewer feel that they are sitting with another engineer who is trying to collaborate and bring up a good system. That’s the intent that they want to see.

What you say should not be meant to impress them. You are not there to impress them with a large amount of jargon or big words. You should be clear and concise, and make sure that your communication is so clear that even if the other person is very junior to you, they can still understand. That’s the core of communication, right? Your communication should not only travel up the ladder; it should also travel down the ladder.

Listening is another advantage that you’ll have. If you’re not listening to the interviewer, you’ll not be able to respond to the feedback they give or adopt what they suggest. So you should always try to listen closely to the interviewer’s feedback.

14. Some really excellent tips there, Archit. But what happens if an interviewer throws a curveball—say, suddenly the constraints change? You’ve sort of thought it through really well. You’re in the flow, you know you’re doing really well, the goal is almost in sight—but this new constraint or change in scope is just thrown at you. So what’s the best way to handle this kind of situation?

Archit Agarwal: To be honest, I love when an interviewer throws these curveballs. Now, why? They’re definitely not easy. When you are halfway through the system design, or almost there, and something changes, it’s really frustrating.

But, to be honest, that’s the real-world scenario, right? You’re always designing things, and things will always change suddenly. The real world works the same way. If you are not able to adapt, then there’s no point designing architecture, right? So if an interviewer gives you a curveball, think of it as a chance to showcase your adaptability to changing scenarios.

So here is how I would ideally approach it. I would not panic, and I would not start defending my original diagram, right? I would first absorb what they’ve mentioned and then say, “OK, this changes these things. Now let me think about how we can adapt to this.” This gives the other person a hint that Archit is flexible and not egotistical about his design approach, which is one good sign.

Then I would restate whatever they have mentioned to make sure that we are aligned on the same requirement change that we have seen. I’ll always reiterate in my own words, right?

Then the third thing that I’ll do is start highlighting which parts of the system will have to change and which parts will remain intact. This also shows clearly whether I’m able to structure the redesign: to understand which parts of the system can stay the same and which cannot.

The curveballs that the interviewer gives you will never force you to scrap the complete diagram, the complete architecture, unless you were already off track, right? They want to understand how you decide which parts of the system can remain as they are and which parts have to change, and how flexible your system is to changes.

And if there is something that is complex, be honest. No one expects you to know everything. Treat it as a two-way conversation with another engineer. You can say that this part is a bit complex and that these are the trade-offs we’ll have to make, and try to include the interviewer in that discussion.

So this is how you will succeed. System design interviews are not about being right all the time. They’re about how clearly you can think, how well you can explain, and how gracefully you can handle the changes.

15. Candidates are expected to know advanced concepts that used to be considered niche, and this continues on very well from what you were just saying just now. So, for example, in a scenario like designing a location-based service, it may be assumed that you have knowledge of geohashing or spatial indexes. So how should candidates prepare for this breadth-of-knowledge challenge that has sort of become more and more expected?

Archit Agarwal: To be honest, the bar has definitely been raised. Things that were once termed “nice to know” are now considered things you should know at the same experience level. I would not deny that, but here is the thing: I don’t think a candidate needs to be an encyclopedia. If they are an intentional learner, that’s good enough, because no one can learn everything. The tech space is too big for that right now.

Having said that, if you get a question in an interview that is out of your league, you will definitely panic. So the way I approach my learning nowadays is to have four layers in my preparation.

First: build extremely strong fundamentals. Your fundamentals are extremely important because any advanced topic you can name started from a basic system. There was a basic system which had some issues, and that’s why the advanced system was invented, right? So know the basics well: how a database works, how indexing in the database works, how a distributed system can fail, what the different consistency models are. If you know these basics, that is more than enough to start building your knowledge of the advanced topics. So make sure that your fundamentals are very clear.

Then learn the advanced topics through real problems. I would not just keep reading articles or books about those advanced topics. Say I want to understand geohashing: I would not just read about it; I would design a food delivery app to understand geohashing. If you want to understand Kafka semantics, don’t just read about them. Design a real-time analytics system that includes the topic, and that’s how you will deepen your knowledge in these areas.
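
To make the geohashing example concrete, here is a minimal Java sketch of the standard geohash encoding: longitude and latitude bits are interleaved by repeatedly halving each range, and every five bits become one base-32 character. The class name `Geohash` and the sample coordinates are chosen for illustration; this is a learning sketch, not a production library.

```java
// Minimal geohash encoder: interleaves longitude and latitude bits,
// starting with longitude, and emits one base-32 character per 5 bits.
final class Geohash {
    private static final String BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz";

    static String encode(double lat, double lon, int precision) {
        double latMin = -90, latMax = 90;
        double lonMin = -180, lonMax = 180;
        StringBuilder hash = new StringBuilder(precision);
        boolean lonTurn = true; // even-numbered bits refine longitude
        int bits = 0;
        int value = 0;
        while (hash.length() < precision) {
            if (lonTurn) {
                double mid = (lonMin + lonMax) / 2;
                if (lon >= mid) { value = (value << 1) | 1; lonMin = mid; }
                else            { value <<= 1;              lonMax = mid; }
            } else {
                double mid = (latMin + latMax) / 2;
                if (lat >= mid) { value = (value << 1) | 1; latMin = mid; }
                else            { value <<= 1;              latMax = mid; }
            }
            lonTurn = !lonTurn;
            if (++bits == 5) { // five bits collected: emit one character
                hash.append(BASE32.charAt(value));
                bits = 0;
                value = 0;
            }
        }
        return hash.toString();
    }

    public static void main(String[] args) {
        // Two nearby points fall in the same 5-character cell.
        System.out.println(encode(57.64911, 10.40744, 5));
        System.out.println(encode(57.6495, 10.4080, 5));
    }
}
```

Nearby points usually share a prefix, which is what lets a delivery app narrow a search to one cell. Points on either side of a cell boundary can have different prefixes, so lookups generally check the neighboring cells as well.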

After all this, pick two to three areas where you will go deep. Personally, I believe you should have deep knowledge in at least one or two areas, because when you go into an interview, the depth of your knowledge shows directly: that is the topic you will speak about most, right? And trust me, if you go deep into one particular topic, any engineer who is interviewing you understands that this is an area you are more interested in. And if you’ve gone to that depth, that means you are already an engineer who understands the gravity of things. So think about areas you can go deep into: for example, distributed systems, storage systems, authentication systems, or performance engineering.

Then practice is important. Practice articulating how you discuss trade-offs. Maybe ask a friend to sit with you and talk through the trade-offs with them. Once you start communicating and your friend gives you feedback, you will improve how you discuss those trade-offs. That is the fourth thing, and it is very important.

16. If you turn the lens inwards a bit from your perspective as a system design interviewer, what is your process for evaluating a candidate’s depth versus breadth?

Archit Agarwal: Honestly, a system design interview is not about the diagram and memorized architecture. It’s about building a thinking muscle more, right? Most people try to study system design like a subject, but I would say: think of system design as a skill that you are adding to your bucket. It’s a skill you need to improve with structured and deliberate practice. Start with strong fundamentals—that’s what we just discussed. You should have strong fundamentals.

Then work backward from a real-world system. What I usually do is question myself about a few systems. For example, almost everyone uses WhatsApp, so I would think about how WhatsApp is able to scale its messaging servers. Then I will start exploring articles and engineering blogs around it to understand how that can be done. Or maybe how Netflix is able to scale streaming globally, which is a completely different engineering challenge. So work backward: think about the system, and then start researching it.

Then start building things. Maybe you don’t do it at a global scale, but at least when you start building, you will understand the challenges around latency, or maybe race conditions, or all those constraints that you think about. You start feeling that, and you start solving that.

And then the last part is definitely: learn to communicate. Because if you don’t learn to communicate in system design interviews, you’ll not be able to excel there.

17. Do you recommend any specific resources, books, or specific real-world exercises for mastering system design concepts and being interview-ready—especially for senior engineers aiming to showcase their expertise?

Archit Agarwal: See, for someone who is aiming for a senior role, I would definitely suggest a mix of a few things: books, real-world engineering blogs, and real-world exercises. For books, you should definitely read Designing Data-Intensive Applications by Martin Kleppmann. That is a must-read for any senior engineer who is aiming to excel in system design. Then there are System Design Interview, Volume 1 and Volume 2, by Alex Xu. Those two are very good books. Then Building Microservices by Sam Newman.

So those are a few very good books that have been written, and if you read those books, you’ll get a lot of understanding on system design. Then you can refer to some engineering blogs by big tech giants. For example, Netflix has an engineering blog. Uber has an engineering blog—and all those big tech giants who are in the technical space and have big tech infrastructure that they maintain, they always have engineering blogs. Go refer and read those blogs. Go to high-quality YouTube channels where they’re not just discussing the diagram—they’re discussing the concept, more depth into the concept. So refer to those channels, in case you want.

And then finally, design a system that is time-tested and scale-ready, one that you have actually built yourself. System design interviews aren’t cracked by memorizing answers. They’re cracked by building strong foundations, really practicing problems, and thinking like an engineer, not an exam candidate.

Coroutines vs Virtual Threads and the Kotlin Java Decision in Practice: A Conversation with José Dimas Luján Castillo and Ron Veen

Divya Anne Selvaraj — Thu, 12 Feb 2026 07:58:23 GMT

Kotlin has moved from “Android-first” to a practical option for Java teams that want safer, more concise JVM code without abandoning their existing Java investments.

In this conversation we speak with José Dimas Luján Castillo and Ron Veen, co-authors of Kotlin for Java Developers (Packt).

José is a mobile-focused technologist with 15 years of experience building Android, iOS, and Flutter applications and leading teams globally; he has worked on 500+ mobile apps, written 7+ development books, and taught at 25+ universities across Latin America.

Ron is a seasoned JVM engineer with 20+ years in the Java ecosystem, spanning mainframes to microservices; he’s an Oracle Certified Java Programmer and Sun Business Component Developer, serves as a special agent and lead developer at Team Rockstars IT, speaks at international conferences, and has authored books on Jakarta EE cloud-native migrations and modern concurrency (including virtual threads and structured concurrency).

Together, they discuss what it takes to make Kotlin a first-class citizen alongside Java in production: writing idiomatic Kotlin, choosing between coroutines and virtual threads, modernizing enterprise systems with Jakarta EE, navigating microservices versus modular monoliths, and adopting modern Android and cross-platform approaches such as Jetpack Compose and Kotlin Multiplatform.

You can watch the full conversation below or read on for the complete Q&A transcript.

Kotlin and Language Evolution

1. From your perspective, what are the biggest benefits Kotlin offers to today’s senior developers compared to Java? And in what areas do you still find Java holding its ground?

José Dimas Luján Castillo: When we started using Kotlin on Android, the difference was very obvious because Java was too verbose, and Android was doing mobile development with Java’s characteristics. So it was very easy to see the real benefits. One example is null safety: it’s automatic, so that benefit came very, very fast.

Another is interoperability with Java. If you have legacy code in another language, that is where trying a different language just because it’s modern gets risky, because you will have a lot of problems when you change to a new technology, framework, or language. At the beginning, obviously, we didn’t believe the interoperability would be good. But when we tried it, we saw: OK, maybe I can do the next steps in my applications with Kotlin, and I don’t need to fight with the legacy code even if it’s in another language. So I think interoperability was a good starting point for Kotlin in mobile development.

Obviously, there are other things, such as programming styles that are very easy to adopt, or asynchrony, but I think those two points are a good place to start for any mobile developer trying Kotlin. In the end, we need to remember that the first step for Kotlin was mobile development. If Kotlin didn’t have null safety and interoperability, adoption would probably have been slower or more complex for people. Because, as I mentioned, the main problems are still there even if you change the language. So I think that’s my point.

2. How should engineers decide when to use Kotlin versus Java when it comes to new projects?

Ron Veen: Yeah, my experience comes more from enterprise than from mobile development. What I’ve seen there is a natural eagerness among developers to learn new things. And like José said, when you switch from Java to Kotlin, it really makes programming a bit more fun again. That’s something I really found: I could see developers getting enthusiastic. I think one of the benefits is that you have more concise code, so you write less code. And, as mentioned, the code tends to be less error-prone. Null safety is really a big thing. You can see that a lot of frameworks are working towards it now, as are new Java versions, but nullability will always be a thing in the JVM, and Kotlin takes care of that for us.

Now, why would you adopt this technology? I think there are a number of attractive points. Like I said, there will be less code, and less code is good: less code to maintain. There are fewer errors in the code. It might make your development team more attractive to new hires because you’re using a very modern language. And we shouldn’t forget that Java has really evolved. Java had Project Amber, which added a lot to the core language, like records, sealed classes, and pattern matching: things that were already in Kotlin. Java is getting them, slowly, but I think Kotlin is just always quicker at adding the new features that developers really crave.
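
The Project Amber features Ron mentions can be sketched in a few lines of Java. This is a minimal example assuming Java 17 or later; the `PaymentResult` hierarchy and its names are hypothetical and exist only to show records, sealed types, and pattern matching for `instanceof` working together.

```java
// Requires Java 17+. The domain (payment results) is a made-up example.

// A sealed interface lists every permitted implementation.
sealed interface PaymentResult permits Approved, Declined {}

// Records are concise, immutable data carriers with generated
// accessors, equals, hashCode, and toString.
record Approved(String transactionId) implements PaymentResult {}

record Declined(String reason) implements PaymentResult {}

final class PaymentMessages {
    // Pattern matching for instanceof tests the type and binds a
    // typed variable in one step, with no explicit cast.
    static String describe(PaymentResult result) {
        if (result instanceof Approved approved) {
            return "approved: " + approved.transactionId();
        }
        if (result instanceof Declined declined) {
            return "declined: " + declined.reason();
        }
        // Not reachable for the two permitted types above.
        throw new IllegalStateException("unexpected result type");
    }

    public static void main(String[] args) {
        System.out.println(describe(new Approved("tx-1")));
        System.out.println(describe(new Declined("insufficient funds")));
    }
}
```

On Java 21 and later, the same dispatch can be written as a `switch` over the sealed type, which is close in spirit to a Kotlin `when` over a sealed class.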

3. Many Java veterans fall into the trap of writing Kotlin in a Java style. What are some common mistakes or mindset shifts that Java developers need to overcome to write idiomatic Kotlin?

Ron Veen: I think that’s the thing we all get to at some point. We start with Kotlin and then we write it in this Java style, right? And this is not what we should do. It is a natural reflex—let’s be honest—because first: OK, I want to do a new language, but I also have a project to finish, or a task to finish, or a sprint to complete. But also, we’re still trying to write code in the way we know it and just use the little things. So that can kind of give you some problems there.

But sometimes, as a developer, you have to be willing to relearn the language. You know Java, and now you’re going to do Kotlin. You can do everything in Java style in Kotlin, but you really should try to relearn the language. It’s not a drop-in replacement. You really have to be willing to learn.

I can remember I once pitched it on a project, and the developers’ reaction was, “OK, this looks good, this looks fine.” And then, just to force myself to really do it the Kotlin way, I tried to shrink the number of lines in the code. The original went from 250 lines down to under 100 or so. That was not technically needed, but the question was: “How can I leverage Kotlin’s native way of doing things and make things faster?” So sometimes you just have to pick up a piece of existing code, decide to rewrite it completely, and forget about everything you know about Java.

4. Could you share best practices to avoid “Java-esque” Kotlin and fully leverage Kotlin’s language features?

José Dimas Luján Castillo: As Ron mentioned, that’s the thing. When we start with Kotlin, I think 99% of developers start by just translating the code. That’s the problem, but it’s part of the beginning—because we have one line of code in Java and we want to see how we need to write this line of code in Kotlin.

But later you hear about the benefits of Kotlin, and if you are translating each line, you will see you don’t have less code; you have the same code in a different form. Maybe it’s easier to read, but it’s the same code. When you start asking those kinds of questions, you will see the benefits, because you will start looking into how Kotlin itself does things. And you will see that five or ten lines in Java may be two or three lines in Kotlin. So you will see the real Kotlin perspective, and this is the first step, I think, toward good practices.

You will see not only fewer lines, but also what you achieve in these cases. Take immutability first. You will see, “OK, I need to think in immutability first,” because it’s a different way to approach the problem. So you are automatically using good general programming practices without doing anything extra, even when you don’t know about them or don’t have them clear. In Java we create code, but we are not thinking in immutability first. After a couple of projects, or some examples you try out, you will start seeing that you are using good practices because things like immutability come naturally in Kotlin.

Now, you can do the same thing in other languages, even Java, OK? But you need to do something extra to have this in your code. In Kotlin, you often get good practices by default; you don’t have to change the code to have them. And you will see a lot of those cases, not just immutability: for example, collections and streams, and lambdas. It’s the same with lambdas. When you are learning other languages, lambdas are not the first subject. In Kotlin, when you are starting to write code, you will sometimes see lambdas from the beginning. Even if you don’t know it’s a lambda, you know it’s code, and later you will see it’s a good starting point. So I think that’s a good way to start using good practices, and sometimes you don’t even know you are using them.
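
José’s point that Java needs “something extra” for immutability can be illustrated with a small Java sketch, assuming Java 16 or later for records. The `Order` type is hypothetical. The explicit defensive copy in the constructor is the extra step; Kotlin’s `val` and read-only collection interfaces steer code in this direction with less ceremony.

```java
import java.util.ArrayList;
import java.util.List;

// A record's fields are final, but a List handed to it could still be
// mutated by the caller unless the constructor takes an unmodifiable copy.
record Order(String id, List<String> items) {
    Order {
        items = List.copyOf(items); // the explicit extra step
    }

    // "Changing" an order returns a new Order and leaves this one untouched.
    Order withItem(String item) {
        List<String> extended = new ArrayList<>(items);
        extended.add(item);
        return new Order(id, extended);
    }
}

final class OrderDemo {
    public static void main(String[] args) {
        List<String> source = new ArrayList<>(List.of("pizza"));
        Order order = new Order("o-1", source);
        source.add("cola"); // does not affect the order's own copy
        Order bigger = order.withItem("salad");
        System.out.println(order.items());
        System.out.println(bigger.items());
    }
}
```

Without the `List.copyOf` line the record would still compile, which is exactly why this kind of immutability is easy to forget in Java.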

5. I’d like to get your perspective on concurrency. Java’s recent releases have introduced virtual threads and structured concurrency to simplify multithreading. According to you, how should engineers approach concurrency now? For instance, when building a high-throughput service, how do you choose between Kotlin’s coroutine model and Java’s Loom-based virtual threads?

José Dimas Luján Castillo: Well, maybe it’s not the easiest way to understand it, but I think if you are coming from Java, you can see concurrency as a model: a definition of how we do certain things. Coroutines are a model, a structure, for thinking about it.

What are the main parts of this? You have cancellation, you have propagation, you have control of the whole lifecycle. That’s the thing. You don’t need to think too much about all the scenarios because you have the structure. I think that’s the main point when we are trying to understand it, because sometimes you need to think through all these scenarios by yourself, but not in this case. Coroutines are very clear: you have the ingredients to start doing anything.

With virtual threads, on the other hand, you have a very simple, very traditional model. It is very clear to us, because the thread is a familiar way to do things. Both have excellent qualities, and in the end they can work together, because you can use one for some specific problems and the other for other specific things.

It’s not about which one is better. A lot of people just want to know the faster way and go with that, and that’s not the point here. If you have some specific problems and you don’t have a model for them, well, think about coroutines. But if you have a very complex situation and you need to put less complexity into your system, maybe you can use threads.

That’s the thing: it’s not too complicated to use Java virtual threads, and if you have a very complex structure with layers, that is probably the advantage. But if you don’t have that level of complexity, you can use coroutines for other things.

And at the beginning, obviously, the main complexity when you used the JVM for that kind of task was that you had a monolithic focus or scope. Now you don’t need to think only in the monolithic way to have an answer. I think that’s my comment for this.

6. How do you see these two concurrency paradigms (virtual threads and coroutines) complementing each other in practice?

Ron Veen: Yeah, first, we should say that both virtual threads and coroutines basically try to solve the same problem: executing async code while making it appear sequential, so the developer can really reason about, “OK, what is the flow of my application?” That way there’s no callback hell, where callbacks are scattered in a million places all through the code.

Both virtual threads and coroutines end up running on native operating system threads, but they’re managed by the runtime itself, which switches what those threads execute. So you can see that throughput is a lot higher with both of them. Virtual threads are really good when there is blocking in between. If you have a very memory- or CPU-intensive action, I don’t think virtual threads are, in general, a good solution, because such a task won’t give away control.

But in general, like José said, I don’t think it’s a race between “Oh, you should use this” and “You should use that.” I think you should look at what is best for your specific situation. Of course, virtual threads also have structured concurrency, which builds upon them and gives you a very nice framework, so that could be a good reason to choose them.

Again, even for the really difficult use cases, I think both virtual threads and coroutines are a paradigm shift in developers’ mindset because, as Java developers, we were always told that threads are really expensive, so we should be careful with them. We should pool the threads and so on.

What we see now, with both of these, is that starting a coroutine or a virtual thread is about as cheap as calling another function in your code. So it’s also a matter of mindset about where you want to use this in your application.
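
The “cheap threads” mindset Ron describes can be sketched with Java’s virtual-thread executor. This is a minimal example assuming Java 21 or later; the class name `VirtualThreadDemo`, the task count, and the 100 ms sleep (standing in for a blocking I/O call) are all illustrative choices.

```java
import java.time.Duration;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class VirtualThreadDemo {
    // Starts one virtual thread per task, with no pooling, and returns
    // how many tasks completed.
    static int runBlockingTasks(int taskCount) {
        // try-with-resources waits for submitted tasks before closing.
        try (ExecutorService executor = Executors.newVirtualThreadPerTaskExecutor()) {
            List<Future<Integer>> futures = new ArrayList<>();
            for (int i = 0; i < taskCount; i++) {
                futures.add(executor.submit(() -> {
                    // Blocking call: the virtual thread is parked and its
                    // carrier thread is free to run other virtual threads.
                    Thread.sleep(Duration.ofMillis(100));
                    return 1;
                }));
            }
            int completed = 0;
            for (Future<Integer> future : futures) {
                completed += future.get();
            }
            return completed;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Ten thousand concurrent blocking tasks, written as plain
        // sequential code inside each task.
        System.out.println(runBlockingTasks(10_000));
    }
}
```

The same workload on platform threads would normally call for a bounded pool. As Ron notes, this style suits blocking work; CPU-bound tasks gain little from it.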

Enterprise Architecture and Team Practices

7. Enterprise Java is undergoing a renaissance with the release of Jakarta EE 11, which brings a modernized, cloud-optimized platform—introducing features like the Jakarta Data API and aligning with Java 21’s virtual threads for scalable concurrency. Ron, given your experience with cloud-native migrations, what do these developments mean for teams looking to modernize legacy Java or Jakarta EE systems?

Ron Veen: Yeah, I think with Jakarta EE 11, the whole ecosystem made an enormous step because now, for the first time, Java 17 is the baseline language level. So we’re really optimizing here. That means suddenly we can use things like records, switch expressions, and sealed classes. All these things, which Java added via Project Amber, are now available on the enterprise side as well. So I think that’s already quite interesting.

Like you said, there’s this new API, Jakarta Data, which has a repository-based approach. For people familiar with, for instance, the Spring Framework, it’s very close (not identical, but very close) to the repository pattern that Spring uses. So I think that is really good.

They also came—well, they already came earlier—with the Core Profile. Jakarta EE has multiple profiles, and the Core Profile is very interesting because it’s a very slim runtime that you’re getting then, which means it’s ideal for microservices situations.

But yeah, I really think it offers great chances. It promises Java 17 as a minimum standard, but there’s also the Technology Compatibility Kit (TCK), which goes up to Java 21. And then you’re right: suddenly virtual threads also come within reach. To use virtual threads you have to run Java 21, of course, because that was the first release where they were final. But there’s this thing called ManagedExecutorDefinition, which has a property named “virtual,” and if you just set that, you can use virtual threads in your Java applications, or rather your Jakarta EE applications. So I think they’re making a really big step.

And about the migration part—just to get back to the question about the migration part—I think there’s many steps that you can actually take. But if it is a simple upgrade, you should first see: Am I already on Jakarta EE at all, or am I still on Java EE?

Now, if you’re still on Java EE, then there are multiple ways to migrate your sources, right? There’s this book—you’re right—from Packt Publishing, where we actually outline a number of these things. There are tools that you can use. There are even dedicated plugins for IntelliJ, for instance, that can help you a lot. It’s a lot about namespace conversion, but there are a few other tricks as well.

Now, if you are already on Jakarta EE, then I think upgrading is quite simple: basically upgrading your Java version. And this Jakarta Data project is quite interesting. I would advise architects and developers to say, “OK, use it if you’re building a new feature,” because it’s completely backwards compatible with JPA, so there’s nothing existing that you need to change. If you write something new with new database tables, just try Jakarta Data for those specific situations and services.

8. How can architects best leverage these new tools—like repository-style data access and virtual threads—when migrating monolithic applications to cloud-native microservices, and what common pitfalls should they watch out for during such transitions?

José Dimas Luján Castillo: Yeah, about the first part, the modernization: it’s clear that the enterprise is not dying. Everything is evolving. Obviously, a lot of people love to put phrases like that in a blog, but I think it’s very clear that the enterprise is evolving.

And the idea is, obviously, we want to create good tools for everyone, and we want to create good code for legacy, because at the end you will keep using it.

And about microservices and monoliths—obviously, you will find a lot of discussions on the internet and Reddit and blogs and YouTube channels, and everything about what is better for your project or your company. The thing is: the problem is when you are just reading without the context, and you need to understand each context is different. I think that is the first part.

The second is: microservices, in my experience, is not the goal, OK? It’s the consequence for that. Because when you always see microservices as the goal for each software developer or architect, probably that’s the issue—that’s the problem. Because if you don’t have developers with experience, and you don’t have a good architecture definition, you will have microservices with the same problems as monoliths. That’s the thing.

The other part is: the monolith—in these cases, people are looking back sometimes, in some cases, at monoliths as old code. And obviously, that’s a lie, because you can use monoliths without problems for huge systems, and you can save the complexity of microservices. You can use monoliths to scale. Actually, you don’t have to always use microservices to scale. It’s easier to understand and to maintain this because you have less complexity in your code.

So the question to use, “Hey, do I need to use microservices, yes or no?”: for me, I always try to answer a question before, or maybe two questions. Are we prepared to pay the real cost of using microservices, not just the money? And are we prepared to use this architecture or not? Because if you are doing this before you have the knowledge for that, probably you will have a huge problem. Because it’s not too easy to change your architecture, change your definitions, your business rules, and everything, and then need to roll back to a monolith because you were not prepared.

That’s my concern, obviously, when I work in teams and they are just focusing on changing to microservices because they read it in a blog. OK—that’s my recommendation, always.

9. What criteria or signals help you decide if a modular monolith might be a better choice for long-term maintainability, and how do you manage the trade-offs for scalability and team ownership?

Ron Veen: Well, this is really kind of a subject very close to my heart because I guess we’ve all been through this wave—there was this FAANG group: Facebook, Apple, Amazon, Netflix, Google—and they all came with microservices, right? So we thought, “This is the way we need to go. This is the wagon we need to jump on.” And I think a lot of us did. And like José said, I think we also found maybe it’s not the right choice, because you should ask yourself: if you have 200 employees, are microservices—for your business—the right solution? Can you actually afford it?

Because microservices, on the DevOps side, require a lot more overhead. There’s a lot more monitoring involved. You need to watch these services, you need to see what is coming out, so there are a lot of metrics needed to keep them running in production. You need to make sure they’re still running in production, because before you had just one monolithic application.

And another thing is that monoliths have become like a negative name, right? Like dinosaurs almost, which they aren’t. I mean, it’s still brilliantly functioning code running on application servers that have been there for a very long time. So there’s nothing bad about them. They’re just being portrayed as “old” compared to microservices.

So you have one running, and that’s easy to monitor, it’s easy to do logging, it’s easy to do debugging as a developer—you know, you just do it remotely. But when you’re switching to a microservices platform, you’ve got dozens or hundreds of services running. You need to make sure each of them is still running. And if you trace a problem, you need to somehow combine all the logs so that you can actually go through the logging and try to find out what happened. So there’s a lot of work there trying to debug a problem with microservices—they’ll keep you busy for a bit. So there are a lot of things there that are happening, I think, where you have to be really careful.
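To make the point about combining logs concrete, here is a minimal Java sketch of the idea, not a description of any particular tracing product. It assumes every service writes a shared trace ID on each log line; the `LogEntry` record, the `timelineFor` helper, and the service names are all hypothetical, and records require Java 16 or later.

```java
import java.time.Instant;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

// Entry point listed first so the file can be launched directly (Java 16+ for records).
final class TraceTimelineSketch {

    // Merge the logs of all services and keep only the entries of one request, oldest first.
    static List<LogEntry> timelineFor(String traceId, List<List<LogEntry>> logsPerService) {
        return logsPerService.stream()
                .flatMap(List::stream)
                .filter(entry -> entry.traceId().equals(traceId))
                .sorted(Comparator.comparing(LogEntry::timestamp))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<LogEntry> orderLogs = List.of(
                new LogEntry(Instant.parse("2025-01-01T10:00:00Z"), "orders", "t-1", "order received"),
                new LogEntry(Instant.parse("2025-01-01T10:00:03Z"), "orders", "t-1", "order failed"),
                new LogEntry(Instant.parse("2025-01-01T10:00:01Z"), "orders", "t-2", "order received"));
        List<LogEntry> paymentLogs = List.of(
                new LogEntry(Instant.parse("2025-01-01T10:00:02Z"), "payments", "t-1", "card declined"));

        // Print the cross-service timeline for the single request "t-1".
        for (LogEntry entry : timelineFor("t-1", List.of(orderLogs, paymentLogs))) {
            System.out.println(entry.timestamp() + " [" + entry.service() + "] " + entry.message());
        }
    }
}

// Hypothetical log record; the trace ID is assumed to be attached to every request upstream.
record LogEntry(Instant timestamp, String service, String traceId, String message) {}
```

In a monolith the same information already sits in one log file in order; with separate services, this merge-and-sort step (or a tool that does it for you) becomes part of every debugging session.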

And yes, like José said, it should never be a technology choice, but it should really be a choice based on what we need. From my experience, what I have learned is: I would start with what I would call modular monoliths. The classical monolith is where all the modules are intertwined without very clear boundaries. If you have a modular monolith, basically what it would mean is you still have your modules, but they can’t interact directly with one another. There’s a predefined API somewhere, and one module is only allowed to access another module via this predefined API, which really makes the code less cluttered and far less risky to change—so things will break less.

Because if I go through an API, then either the API will change and I will see it, or the API won’t change and I can change internal methods without something breaking there. So my suggestion would always be: start with this modular monolith, be very clear on your boundaries, enforce that we go through APIs, and then just monitor the system and watch it.
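The module boundary described here can be sketched in a few lines of Java. This is only an illustration under stated assumptions: the names `CustomerApi`, `InMemoryCustomerModule`, and `OrderModule` are hypothetical, and a real codebase would typically enforce the boundary with packages, build modules, or the Java module system rather than with convention alone.

```java
import java.util.Map;
import java.util.Optional;

// Entry point listed first so the file can be launched directly.
final class ModularMonolithSketch {
    public static void main(String[] args) {
        CustomerApi customers = new InMemoryCustomerModule(Map.of("c-1", "Ada"));
        OrderModule orders = new OrderModule(customers);
        System.out.println(orders.describeOrder("o-42", "c-1"));
        System.out.println(orders.describeOrder("o-43", "c-9"));
    }
}

// The predefined API: the only part of the customer module other modules may use.
interface CustomerApi {
    Optional<String> findCustomerName(String customerId);
}

// Customer module internals; free to change as long as CustomerApi stays stable.
final class InMemoryCustomerModule implements CustomerApi {
    private final Map<String, String> namesById;

    InMemoryCustomerModule(Map<String, String> namesById) {
        this.namesById = Map.copyOf(namesById);
    }

    @Override
    public Optional<String> findCustomerName(String customerId) {
        return Optional.ofNullable(namesById.get(customerId));
    }
}

// The order module depends only on the interface, never on the implementation class.
final class OrderModule {
    private final CustomerApi customers;

    OrderModule(CustomerApi customers) {
        this.customers = customers;
    }

    String describeOrder(String orderId, String customerId) {
        String name = customers.findCustomerName(customerId).orElse("unknown customer");
        return "Order " + orderId + " for " + name;
    }
}
```

Because `OrderModule` only sees `CustomerApi`, the customer module could later be replaced by a client that calls a separate service without touching the order code, which is what makes factoring a module out an incremental step.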

And if there are typical situations—like if you find there’s just one module, let’s say a customer module, that requires a lot of changes, which means a lot of redeployment—then that might be a case to say, “Well, maybe I should factor this out, make it a microservice,” and combine the two. There’s nothing against combining microservices with a modular monolith. Or maybe there’s different resource utilization, different scaling requirements—well, that could be another case where you say, “Okay, if this is really the case, maybe I should factor it out.”

But even then, I would say: look at how much the costs are, and only when the cost of keeping it in starts to outweigh the cost of switching to microservices—that would be the moment to make this choice. But again, microservices require a different mindset. You’re also getting the distributed tax that you’re paying, and that can be really, really expensive, and you should be really careful. But I guess then we’ll get into the realm of domain-driven design and bounded context, which might be a bit too much for now.

10. It has been said that introducing Kotlin into an established Java codebase isn’t just a technical change but a cultural one. Key advice from Kotlin adoption experts is to “win the hearts and minds” of skeptical Java developers—not only through hard facts (letting the improved code quality speak for itself) but also via soft factors like easier onboarding and community support. Showing Kotlin’s concrete benefits—for example, its focus on safety and conciseness that addresses fundamental Java shortcomings while staying 100% interoperable—can help gain buy-in. Drawing from your teaching and leadership experience, what strategies do you recommend to gradually introduce Kotlin in a Java-centric organization?

Ron Veen: Yeah, I think what you always find if you introduce something like Kotlin into an organization, you should be very aware that there will always be some gurus in the company who are very focused on Java, who know the inner details, and who might actually see bringing in a new language as a threat to their supremacy. So I think the most important thing is to get them on board, because in general I would expect that 60–70% of developers would be really interested and say, “Oh yeah, we’re going along with this.” So maybe it is not direct.

What we sometimes did is we had a team with some enthusiasts and some critical people, and we let them develop something new. It could either be—if we’re doing microservices—I mean, microservices are brilliant for this, right? Because you can choose whatever technology you want for your new service, so Kotlin would be a great choice there. So that would be a good thing: have this team of skeptics and enthusiasts and then see how they work together on the problem. Because if you just do the one or the other, we’re not really getting everyone on board.

That would be a really good approach, I think. But I sometimes also have done—you have this thing called Advent of Code, right? This is, right now at this time, the period leading up to Christmas, and there are new coding challenges. You could say with your team, “You know what? We’re going to do Advent of Code, and this year we’re going to do it in Kotlin,” and try to see how we would work it out there. So that could be a thing.

Of course, you can do things with hackathons in your own team and say, “We’re doing these coding sessions, and let’s try to explore how we could use Kotlin there.” And finally, I think if you would do code reviews with the whole team, you could also go through Kotlin and gradually explain: “OK, what are we doing here, and why are we doing it this way in Kotlin?” That makes people really see, “OK, I’ve done it the Java way,” and I think that really plays, for instance, with collection classes, because Kotlin has such a great collection of, well, collection functions, and then people in the code review with you, the skeptics, can see, “Oh wow, this is actually quite nice,” and how very concise the code we’re writing is.

Enforcing it upon someone, I don’t think it’s ever going to work. At least they’ll probably do it, but they won’t do it by heart. So I think, in general, with these steps you can actually win their hearts.

11. How can engineering leads support their teams’ upskilling and address any resistance, ensuring a smooth transition without disrupting productivity?

José Dimas Luján Castillo: I think maybe the good news or the bad news—it depends—but it’s more about a leadership issue in this case. Because I think we have a beginning that’s very clear: we need people saying the next steps for that. Because if you don’t have this follow-up about leadership, probably you will see a mess in some cases.

That’s good news because if you have the person in your team, obviously it’s an easy way to do that. But if you don’t have it, you need to create it or you need to hire these people, because that’s the thing. Maybe you have people with good leadership, but without the knowledge for the adoption.

Because, for example, the most common error when you are trying to do that: you need to start with small limits. If you’re trying to migrate everything, or the most complex part of your legacy code, probably it’s not a good idea. So my recommendation, for example, in some companies when they ask for something similar is: start with the small modules. Because you need to see: you think they are good at writing code, and that’s correct because they are working with you, but do they really understand the business? Because maybe you will see they know the new business, not the old business, for example, because you are using old code sometimes.

The other part is: it’s a new language, it’s a new paradigm, it’s new code. At the end you need to see if you have a problem with the adoption because sometimes people understand the features, and it is not the real benefit. They understand the modern things, or they understand the faster way to do the things, but it’s not the same as the best way to do that. So don’t be closed for these situations, because obviously the most important is: OK, we will do the migration. I am not saying this is the best way to do the things because probably the team can find a best way to do the things, because they really put the hands on the code. But if you are very closed with that and you say to the team, “Just follow what I am doing,” probably you will miss a good part of the code from the team.

So that’s the other part: just go talk. OK, it sounds weird, but when you are using the team for the migration, probably they are the best way to understand the situation. Because you can’t imagine how the thing is working, but when you are rewriting the code you will see the real things. Maybe you have good code and you don’t need to rewrite this part. Maybe you have to rewrite another module, not this one, because you will see the benefit, and maybe the benefit is not good enough to change it. Maybe the benefit is in another module. Sometimes that happens.

Because the problem is: if you think the leadership—or just one person—knows everything about the project, probably you are not with the correct answer. Because people need to put their hands on the code for that.

And just to close that: you need to think in your team. In these cases, you need to trust in that team to take some decisions. Obviously they need to explain to you or explain to the company, but probably they are the best way to understand the problems in those cases.

And you need to give space to the teams to play with new toys because probably they have the answers, but they don’t know yet. So if you let them play with some new features or some new code, probably you will find good things in that.

And obviously don’t try to read productivity in lines of code. If you try to read it as “more lines of code or fewer lines of code,” probably you will miss something. So in this case my recommendation is: we need to avoid productivity as “more lines of code or fewer.” Just keep trusting in the team and decide together if it’s a good part for the business, if we need to change something or not. But if you don’t open the code and read it and try to play with that, you don’t know if the adoption is a good idea or not.

Mobile Development and Cross-Platform

12. The Android development landscape is in constant flux. As of 2025, we’ve seen the rise of Jetpack Compose for declarative UI, more sophisticated modularization of apps, and renewed emphasis on clean architecture principles for maintainability. Based on your experience building hundreds of mobile apps, what do you consider the three most important architectural practices or patterns for modern Android development?

José Dimas Luján Castillo: I think the last two years we really see why they think to start using Kotlin. That’s the thing, because for Jetpack Compose, I think that’s the main concern.

And if I need to say three keys for that: I think we need to separate the responsibilities. Even if you are not using it for Android, I think if you can separate your responsibilities in the code, it is always a benefit for everyone—for testers, for developers, for product owners, for everyone—because it’s very easy to understand each part for everyone.

Trying to use the definition of modules—not because it’s a new way to do that. Maybe that sounds like, “We need to separate everything,” but no. The thing is: if you modularize everything in your project, it’s very easy to add these features in this module as a feature, or delete, or change this module.

For emergencies, for example: maybe you have a critical bug in some part of the code, but if you have modules you can just put it in another part and you can continue with the regular functionality.

So I think that’s the second. And the third one: I think the architecture is changing. You need to prepare your architecture ready for change, because when we create code, we think the code will always be the same code—that’s the problem.

It is very complex to create architectures ready for new changes, because we don’t have the time, we don’t have the money, we don’t have the team for that, but I think great developers and great leadership are always preparing the project for changes, not just for the new framework.

Because if you prepare your project just for the new framework, when we change the paradigm (in this case, for Jetpack Compose, because it’s a really different way to do mobile applications), probably all your code is not ready for these changes. That’s the deal, and that’s the problem. So now you will see a lot of mobile applications where the new features are working with Jetpack Compose, but the old ones probably will never migrate because they are not ready. They don’t need to keep waiting for a miracle to change the modules just for the new Jetpack Compose implementation, for example.

So that’s my three keys. And in my context, in my current projects or last projects, I always ask, for example, when I will implement a new technology: what kind of problems will we resolve with that? The cost of the adoption, and the impact for the long term.

That’s my three main variables I have in mind, and I always try to put a number: how many problems I will solve, the cost of the adoption—how many people and how many resources we need—and the impact we will have: long term, mid term, or short term. I think with these three variables I create a table and share it with the team, and after that I have a very clear situation. We decide if we need to follow these new technologies. That’s what I did in the past.

13. José, with your background in Android, iOS, and cross-platform frameworks like Flutter, how do you see Kotlin Multiplatform fitting into real-world projects today?

José Dimas Luján Castillo: I think we are in a very interesting situation in mobile development right now because we have a lot of actors—maybe serious actors—because in the past we had a lot of actors. But the problem is, I am not saying I am not a fan of JavaScript, but the problem for development in the beginning for multiplatform is they had a lot of JavaScript frameworks, and at the end it’s not the same case because it’s not a real multiplatform situation.

But now we have a lot of multiplatform situations. We have Flutter, we have SwiftUI, we have Android with Kotlin Multiplatform. So you are really creating a codebase for using in the other option of mobile development without having to add some patches for the situations. So it’s a real competition for who is the best option.

Now, we don’t have a clear winner for that because that depends for each team. For example, if you have an iOS team and you have five developers using iOS and Swift, why do they need to learn Kotlin? Maybe it’s not the best case for that. Maybe you have the Swift situation. That’s the real thing because you will need to pay—you will need to use money—for this migration at the end.

But now we are in the most easy way to create multiplatform applications, even for mobile developers, even for Android developers or iOS developers. So we have a real situation where we can separate the logic from the UI without problems. With Kotlin Multiplatform, it’s very clear to do that. You don’t need to be expert on Android or Java or Kotlin. If you read the code, you will understand in a very easy way—even because, when we write the code now, you will see a lot of web programming paradigms.

For example, the reactive way to do websites with Vue or React: you will see, when you read the code in mobile development, it’s very, very, very easy to understand. You will see Flutter is OK, React Native is OK, but at the end Kotlin is part of the body of mobile development, and safe in this case too.

So we need just to wait, maybe a couple of months, or maybe a couple of years, to see the real impact of Kotlin Multiplatform. The way to understand Kotlin Multiplatform is very easy if you understand Kotlin. You don’t need to understand too much. Probably you will need to take two or three features because maybe these features are not too famous in backend development, for example, or in the Java world. But it’s two or three, not too much, because at the end you are using your regular Kotlin from the backend, and you can just use it in mobile development. The problem is you need to understand how the operating system is working. That’s the only thing.

The tooling is very advanced. That’s the other part: you will see Kotlin Multiplatform is moving very fast with these changes. They are creating and sharing tools every three or four months for the people, the ecosystems, and Kotlin too. So it’s very, very good news for the developers because you will see two or three really mature tools every four months.

So I think that’s the advantage of using Kotlin Multiplatform for mobile development versus Flutter or React, even if they have more years, because now they are in another stage. They are trying to increase the developers in the ecosystem of Flutter or React, but Kotlin doesn’t need to do that because it has passed this stage. Because if you are using Java, you can use Kotlin. If you are a Kotlin developer, you can use Kotlin Multiplatform. I don’t need to convince you to use it because I don’t have the problem with the knowledge. You have these tools, so that is not the problem.

The question is: Why are you not creating mobile applications with Kotlin Multiplatform? That’s the real thing because you have the knowledge, you have Kotlin, you have Java knowledge. Maybe you just need to take a look at the operating system. I think that’s the real situation for Kotlin Multiplatform. They are just waiting for more people in the ecosystem because if you know Kotlin, you can do that.

14. We’ve been talking about Jetpack Compose, we’ve been talking about Kotlin Multiplatform, which are, you could say, new technologies or new approaches, and we talked about them in the context of adapting them into large-scale apps. But when it comes to a large-scale enterprise setting, what challenges should teams be aware of when adopting a new technology or approach while they aim to access benefits such as maximizing code reuse without compromising native user experience, and so on? What are some best practices or difficulties that teams should be aware of when trying to bring something into the stack, so to say?

Ron Veen: Yeah, that is an interesting question because I guess one of the core reasons also for Kotlin Multiplatform to be there is code reuse—specifically, if you have business logic, that’s really the code you don’t want to have distributed over different platforms, but you would like to have centralized.

So I think everything goes down to that, and that you should really focus on which parts you really think you should share. Because if, again, you look at code reuse, and especially in enterprise environments—reusing parts—it depends on your architecture, doesn’t it?

Because if you have this classical architecture where you have one large single deployable unit, then code reuse could be quite easy. If you get to the microservices situation, well then code reuse becomes a little more tricky, I think. And then, you know, the whole DRY thing—“don’t repeat yourself”—suddenly becomes a bit more fluid, that you could say, well, sometimes we just don’t want to reuse in order to maintain the independence that microservices should have.

So, again, here I think it all depends upon what is the architecture you’re trying to support. So I think there’s no one-size-fits-all solution here.

About the Book: Kotlin for Java Developers

15. Your book, “Kotlin for Java Developers” is aimed at software developers proficient in Java who want to learn Kotlin for professional development – it’s especially relevant to Android engineers, JVM backend developers, and full-stack Java programmers maintaining legacy systems. What inspired you to write this book together?

Ron Veen: Well, the good thing is that the bulk of the work was already there, right? José had actually done it. So I came later to the project, and the majority of this book, and the credit for the book, should really go to him, because he had already written a large part. I think I only added a few chapters.

But I think I came to the project for a simple reason: anyone who writes code—if you have, like, 10 developers and you say, “Write this piece of code for me,” then you’ll get 10 different solutions in the end. It could be typical small things or something, but everyone has their own style. You can always see it with code reviews: “Shouldn’t we do it like this? Why not like that?” And I think it was the same here with the book.

I think José had already written the majority of the book, and, just like with code, it’s always good to have a second opinion about it, and that’s basically where I came in. We started chatting with one another and talking about, “OK, should we rewrite it like this? Is this better?” Just fine-tuning a lot of things and having two perspectives on the book.

So that’s what I actually did on the book. How I got to it was easy: I was approached by Packt. They said, “We have this book about Kotlin.” I thought, “Yeah, Kotlin, that’s great.” That really touches me. I think it’s a great technology and should be used more often. So yeah, if there’s a way that we can spread the good news, I would really love to do that.

Then I got to see all the work that José had already done, and I thought, “Well, this is just brilliant,” and yes, I’d love to be a part of this. I just hope that Java developers really see it.

What I really liked about the approach to this book was it’s not like I’m telling you Kotlin from A–Z. I really love that we have this approach where we say, “Wait a minute—you’re already a Java developer. I don’t need to teach you what loops are, or iterations, or, in general, what functions are, or lambdas. You already know that. I’m just going to teach you how you can make the transition to Kotlin.” That’s why I think there’s a lot of information in there for Java developers in a very concise way, so it’ll save them time if they use the book to switch.

16. José, what was your inspiration? What specific gaps or common struggles did you observe among Java developers that made you think, “I need to write this book, and I need to help them”?

José Dimas Luján Castillo: Well, I need to say something at the beginning, because Ron said I started the book early—but yeah, that’s true. But actually, Ron’s part was very important for the project. I was stuck on a couple things, and he read it and he made the right suggestions. I think that was the point to try to move forward, because obviously we have a lot of complex situations to understand.

I wrote a lot of books in the past, but this is the second in English, because it’s not my main language. Obviously, that helps a lot, I think—maybe not for him because he’s very good with English, but for me, because I think that helped a lot for that.

And about the coding: when I start thinking about how I need to solve—or I want to explain—the things I know… I am a teacher too. I was a teacher for 18 years, maybe. And I note, for example, when I try to tell people—like at the beginning, 18 years ago—it’s very easy. It’s not the same case when you have developers with previous experience.

So I take that in my mind, because I know a lot of people want to use Kotlin because they are coming from Java—that’s the part. It’s a different way when you are starting from zero or scratch, but when you have developers with experience, it’s a very different way to do that. So I take that in mind while I was writing the book.

I was thinking a lot about the other part: I need to take time to explain the syntax. But at the end, the problem is the way we try to focus the situation. For example, am I trying to explain how we can write this line of code in Kotlin, or maybe we need to… think about whether we really need this line of code, because maybe in Kotlin we don’t need it.

That’s what I always try to put in the reader’s mind: if we need it, OK—this is the way to do that. But the question is: do we really need this line? Because maybe in Kotlin we have another way to do that, and maybe we don’t need to use it. That’s my second question always when I try to explain something, because that’s the real way to create the bridge for Java developers to Kotlin.

OK, you will need—maybe you will need that knowledge more; maybe it’s better if you don’t need it. But you know why you have the answer. Or maybe you need to find the answer when you are writing the code. That’s the other part.

The other question—or comment—I always have in my mind is: we need to be respectful with Java. Since the beginning with Kotlin, when we try to sell the idea to use Kotlin, I don’t like too much, actually. I have a very weird experience in the past when I was a Google Developer Expert and I was the third Kotlin Developer Expert, and the problem is: they removed me because I’m not always saying the best option is Kotlin for the developer.

And I really think that, in some cases, maybe you have situations when you don’t see a very specific answer or a good answer for that—but it is the real part. We need to be respectful with Java because, at the beginning, without the experience of Java, we can’t create Kotlin. Not me—the Kotlin team—because obviously they use a lot of experience in Java, all of them. They are very experienced people with Java, and they try to see good parts of Java and take it into Kotlin, and they are increasing good parts.

So if we sell Kotlin as a killer, I think that’s disrespectful for the technology, because it’s not. The way to do that sometimes is actually a complement for the technologies, the architecture, the projects. But the other part is: more than sometimes it’s just your preference, how you prefer to write the code in this way or this one, because maybe you don’t have very clear architecture or this paradigm to use it.

So that’s the other part. I always, in my way to express it, when you need to compare Java with Kotlin, I always try to be very respectful with that because a lot of the good things in Kotlin—maybe 90% of them—they are coming from Java, because Java is the first part of your experience. So I always try to keep that in mind when I write the book and put the examples. Obviously, we have a lot of parts better in this case, or very fast implementation—I need to say it, obviously—but with the correct words and the correct approach for the people. For me, that’s my line to follow to write the book.

And obviously, I want a couple parts that are really practical. In these cases, I use very simple scenarios and add complexity in the code. I start, maybe, for example, with just one definition, but at the end of the chapter you will see it takes more functions, because that’s the way to follow that. I don’t prefer to do anything by default, because when I read books in the past, you sometimes see magic code just appear on the next page.

So I don’t like too much to say, “Hey, we have new code, just read it and move forward,” because I don’t think that’s the way. Obviously, in other kinds of books, on other kinds of technologies, maybe you need to do that because it’s a framework and needs to be, and you can copy the template because it’s too… but even in the necessary parts, I try to do that for the book. Those are my, maybe, five or three points to follow when I write the book.

17. What do you hope experienced Java professionals will do differently after reading Kotlin for Java Developers, and how do you think this will help them tackle real-world challenges more effectively?

Ron Veen: Well, I think, again, like we said, this is really trying to explain to Java developers that Kotlin isn’t that different, because, again, 80% of your logic and things will still be the same. Yes, you will use a bit different syntax, but it’s still a syntax that’s quite familiar to you. You might use different collection functions, because there are different ones.

But I think the really important part here is that you’re actually starting to see the value, but you’re also starting to recognize when it doesn’t really matter if we do Java or if we do Kotlin—they’re both running on this brilliant thing, which is the JVM, right? The Java Virtual Machine that, in the end, runs the code.

So it is not as big a step as you might think it is when you move from Java to Kotlin—as opposed to moving from Java to Go, or from Java to Rust. The steps are really small, and I think our book just helps you write idiomatic Kotlin, where you can actually see, “Oh, right—I’m actually seeing what I should do.” And like José said, it’s not like translating it one to one.

Actually, I think JetBrains, in the IDE, they have the function where you can select a Java class and say, “Convert to Kotlin.” Well, it would technically work, but it still wouldn’t be idiomatic Kotlin, right? So I really hope that, with the book in hand, you would actually write Kotlin as Kotlin is meant to be—so I really hope that’s where we’re getting.

If you want to dig deeper into the mechanics of moving from Java to Kotlin—writing idiomatic Kotlin, handling null safety, using coroutines for concurrency, and taking advantage of features like extension functions and DSLs—check out Kotlin for Java Developers by José Dimas Luján Castillo and Ron Veen (Packt, Oct 2025). Written for experienced Java developers, it teaches Kotlin by mapping concepts directly to familiar Java constructs and then goes further into interoperability, generics, data and sealed classes, coroutines and flows, and DSL design—across backend, Android, and cross-platform development.

The C++ Programmer’s Mindset on Abstraction Costs, “Future You,” and Thinking with the Machine: A Conversation with Sam Morley

Divya Anne Selvaraj — Thu, 22 Jan 2026 05:13:13 GMT

C++ rewards engineers who treat problem-solving as a deliberate process rather than an improvisation. In this conversation, Sam Morley returns repeatedly to that theme: decompose the work until it becomes a set of solvable, “atomic” parts, then choose abstractions that fit the real constraints of the system. He argues that abstractions are never free, even when runtime overhead is low, and that good design means balancing competing costs: build time, cognitive load, flexibility, and performance. That same pragmatism shows up in his emphasis on leaning on the standard library, iterating from “working” to “fast” based on measurement, and understanding when low-level details like cache behavior and memory access patterns should influence how you structure code.

Morley, author of The C++ Programmer’s Mindset and a research engineer with a background in mathematics who maintains a high-performance C++/Python library for data science, also frames maintainability as a problem of empathy for “future you.” He discusses writing code that can be understood months later, structuring systems with clear separation of concerns, and treating concurrency and memory safety as design problems rather than afterthoughts. Along the way, he outlines practical guidance on thread-safe architectures, where synchronization mechanisms go wrong, and how ideas from Rust’s ownership model can sharpen a C++ engineer’s instincts about lifetimes, pointer safety, and undefined behavior.

You can watch the full conversation below or read on for the complete Q&A transcript.

1: For an experienced engineer, what does adopting the C++ programmer’s mindset look like in practice? How does it change the way you approach complex software challenges?

Sam Morley: Experienced engineers, and probably some less experienced engineers as well, are likely using this framework of computational thinking already. The framework itself, as I came to discover when I was putting this together, is really a set of common things you find yourself doing when you solve problems. What makes this an interesting discussion topic is less the actual components of the framework and more how one connects them with the different mechanisms, features, and facilities within and around the C++ language.

So, one might be quite experienced at solving software problems, but what we’re doing here is more about connecting those with a broader thinking about the system and the language—and all of the facilities around those—which is hopefully the additional knowledge that I’m imbuing. As I said, most people are already kind of familiar with this sort of framework, even if they’re not conscious of that fact.

One of the things I want people to take away is that it’s sometimes very helpful to really think about your process. So when you do solve a problem, look back and think: How did I do this? How did I break this problem up? What abstractions did I find? Where did I find them? Where were the common elements—things that I’d seen before? What were the things that I’ve never seen before? Try and make a mental note of those facts, because these are the things that will come up again. From a longevity point of view, it’s important to not only remember your solution, as it were, but also your process.

Because if you fall down the same hole every time, then it becomes quite easy to fall down it again if you don’t remember that there was a hole there. So if you document your process and think about your process—at least maybe not physically document, but mentally document the process that you go through to solve a problem—then you can remember these facts much better.

Moreover, if you’re experienced, then you should also be mentoring your more junior people, and this is also helpful for them. So it’s important, for lots of reasons, to think about what you’re doing and how you’re doing it.

And that’s one aspect of the book. The other is connecting it with the C++ language, the broader system in which it operates, and how you marry those two things together to make an overall, hopefully better, more efficient, and faster solution.

2: In your book you talk about breaking down challenges, and choosing the right abstractions to build the most efficient solutions. Can you walk us through a concrete example where this approach made a big difference?

Sam Morley: I want to start by challenging this a little bit. The notion that one can solve a problem without breaking it down into smaller parts is kind of folly. I don’t think it’s possible, really. You might not be conscious of the fact that you’ve broken it down into smaller parts, but you’re almost surely doing it. Even something as simple as doing some arithmetic—you might think that you’re just adding N numbers together—but really what you’re doing is you’re adding two numbers together, and then adding the result of that to the next one, and then adding the result of that to the next one. It’s an expanding-brackets problem. Whether you are conscious of this fact or not, the point is that you’re solving several smaller problems that look like the same big problem.
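
As an illustration of the arithmetic example (a minimal sketch, not code from the interview), summing N numbers can be written as a chain of two-number additions. The function names `sum_by_hand` and `sum_with_accumulate` are hypothetical; `std::accumulate` is the standard library's version of the same left fold.

```cpp
#include <numeric>
#include <vector>

// Adds N numbers by repeatedly solving the two-number problem:
// (((0 + a) + b) + c) + ...  This is a left fold.
inline long long sum_by_hand(const std::vector<int>& values) {
    long long total = 0;
    for (int v : values) {
        total = total + v;  // one small "add two numbers" step
    }
    return total;
}

// The same decomposition, expressed with the standard library.
inline long long sum_with_accumulate(const std::vector<int>& values) {
    return std::accumulate(values.begin(), values.end(), 0LL);
}
```

Both functions perform the same sequence of pairwise additions; the second simply names the pattern.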

However, the way that you break down problems obviously matters, and some ways are more efficient than others. For that reason, it is a good thing to get into.

I want to talk about something that I did a few years ago now, which involved taking frames out of a very large number of video files, sending them to one of the ML services on Azure—so this was over a REST interface to Azure—and then we got the results back, and we had to store those on disk in files. This is a very meaty topic, meaty problem. There’s lots of different elements here: there’s loading all the files and then decomposing them into individual frames; then there’s sending all of those frames off to the Azure service; there’s getting the results back; and then there’s writing them to disk. So immediately there are four components to think about.

In trying to process this—once you’d expanded this to 100,000 videos or something, each with a few hundred frames—the numbers here are pretty enormous. In order to get them to and from the Azure service in a meaningful amount of time, we had to multiplex this. So we had multiple threads all sending frames to multiple Azure endpoints, because each of the endpoints is rate-limited, so you can only send, I don’t know, 10 requests a second or something to each of the things.

But actually, sending requests was not the bottleneck. Getting the results back was part of the bottleneck. The biggest problem that we actually encountered was writing the results back to the disk, because this was hundreds of gigabytes of results at the end of the day. What we ended up with was that we were getting results back from the service so fast that we had to build in some back pressure into this system to slow down when we had a backlog of things to write to the relatively slow spinning-rust disks that we had.

So there we have a very interesting structure. We start off with these four big components. Within the first component, we have reading video files from the disk where they were resident, decomposing them into a number of frames that was then passed on to another subsystem, which was responsible for sending these frame things up to the Azure service and waiting for the results. This was quite carefully orchestrated so that we didn’t hit the rate limit—so figuring out how to do that was an interesting problem.

Then there was another component which was taking the results that were being returned from the Azure service and collecting them into a buffer. This was the sub-problem of figuring out that we were thrashing the disks trying to write all these results out. So we installed a buffer in between. We wrote into a buffer and then had another worker process that would take the things from the buffer in big chunks and write them out to the disk.
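
The buffering and back-pressure idea described above can be sketched as a bounded queue. This is a hedged illustration under stated assumptions, not the system from the interview: the class name `ResultBuffer`, the use of `std::string` for a result, and the choice of threads rather than a separate worker process are all hypothetical.

```cpp
#include <condition_variable>
#include <cstddef>
#include <deque>
#include <mutex>
#include <string>
#include <utility>
#include <vector>

// Hypothetical bounded buffer. Producers wait when it is full, which is
// the back pressure; the disk writer drains it in large chunks.
class ResultBuffer {
public:
    explicit ResultBuffer(std::size_t capacity) : capacity_(capacity) {}

    // Blocks while the buffer is full, slowing the producers down.
    void push(std::string result) {
        std::unique_lock<std::mutex> lock(mutex_);
        not_full_.wait(lock, [this] { return items_.size() < capacity_; });
        items_.push_back(std::move(result));
        not_empty_.notify_one();
    }

    // Non-blocking variant: reports a full buffer instead of waiting.
    bool try_push(std::string result) {
        std::lock_guard<std::mutex> lock(mutex_);
        if (items_.size() >= capacity_) return false;
        items_.push_back(std::move(result));
        not_empty_.notify_one();
        return true;
    }

    // Waits for at least one result, then removes up to max_chunk of them
    // so the writer can issue one large write instead of many small ones.
    std::vector<std::string> pop_chunk(std::size_t max_chunk) {
        std::unique_lock<std::mutex> lock(mutex_);
        not_empty_.wait(lock, [this] { return !items_.empty(); });
        std::vector<std::string> chunk;
        while (!items_.empty() && chunk.size() < max_chunk) {
            chunk.push_back(std::move(items_.front()));
            items_.pop_front();
        }
        not_full_.notify_all();  // space was freed; wake waiting producers
        return chunk;
    }

private:
    std::size_t capacity_;
    std::deque<std::string> items_;
    std::mutex mutex_;
    std::condition_variable not_full_;
    std::condition_variable not_empty_;
};
```

In a setup like the one described, the threads receiving responses would call `push`, and a single writer thread would loop on `pop_chunk` and write each chunk to disk. The capacity bounds how far the network side can run ahead of the disk.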

So there was this sort of filtering down of problems. You start off with big, challenging, meaty problems at the top, and then each one of those gets decomposed into smaller bits, and then smaller bits still, until you reach a level where you either have an existing algorithm to do it, or it’s some functionality handled by a library, or it’s some other kind of interaction with the world. Talking to Azure, talking to a disk—these are sort of base-level problems that you can solve quickly.

It’s all about bringing down the level of the larger problem to these small, atomic things which you can actually solve using facilities that you have. That’s the real challenge. But I don’t think that you could just write a singular piece of software that would do all of these things together without breaking it down into these components. I don’t think that’s possible.

3: Abstraction in Detail (Chapter 2 of your book) covers when to use different language features—simple functions, classes, templates, etc.—for a given task. How do you determine the appropriate level of abstraction in modern C++?

Sam Morley: Abstraction is tricky. It really depends on the purpose of the code. What am I trying to achieve with my code? I want to upfront say there are no zero-cost abstractions. People will claim up and down that things are zero-cost abstractions. They’re really not; every abstraction has a cost. Now, this might be a runtime cost, which is what people usually refer to, but that’s not the only type of cost.

Templates, for example, have very little runtime cost, but they do have a significant build-time cost. Including lots of templated code might make your runtime faster, but it will surely expand your build time. They also carry a pretty heavy cognitive load: the effort it takes a programmer to reason about heavily templated programs is significantly higher than for plain C++ code with no templates.

So getting that balance right—what am I trying to achieve with this code? Is this supposed to be a set of components of high-performance systems that really need to have the best possible runtime performance, and I don’t care about build time? Or is this a general-purpose thing that needs to be extremely flexible, and I do care about the cognitive load of people who are going to work with this task? It’s all about balancing the different competing costs and also competing utilities. How flexible is my system? How fast is my system?

Now, it’s not necessarily true that abstractions are always bad. Sometimes you can use an abstraction and it adds very, very little to any of the loads. For example, introducing a very small templated helper function is very useful and it adds basically no overhead, and if that’s used correctly it can be a big help to the program.

But sometimes—and I’m especially guilty of this—you can over-abstract. I’m a mathematician; we like our abstractions. You can over-abstract and make the thing more complicated than it needs to be, and at this point you start to lose something. It might be runtime performance if this is a virtual class hierarchy, or it might be build-time performance if you have heavy template code. Or it could be that you no longer can reason about your software because it’s now so complicated and filled with all sorts of clever bit-hacking tools and abstraction mechanisms that you no longer understand. It’s finding a balance.

Now that I’m conscious of this fact, I try to keep my abstraction as minimal as possible. I look for the minimal amount of abstraction I need in order to solve the problem without going too far and over-generalizing it. This has come back to bite me recently: I over-specified an interface to the point where it only satisfied the conditions in one very specific instance, and I had to rework the entire interface to make it fit the actual thing that I should have programmed against in the first place. It can come back to bite you. Hopefully that doesn’t happen very often, but when it does, it is always painful.

This is why thinking about the abstraction up front is important, and it goes hand in hand with the way that you decompose your problems: think about what the abstractions might be if you decomposed the problem in one particular way versus a different way. If you pick one and it turns out not to be the right choice, then at least you now understand that the abstraction you chose may have been the problem. The more problems you solve, the more you get used to this idea.

4: What guidelines help decide when a straightforward function is enough versus when you should introduce a class or a template to solve a problem?

Sam Morley: Yeah—if you start with a simple function and you make it a class template or you make it a function template, like I said, that can afford you a lot of flexibility at very low cost. I do this sometimes internally. For instance, I have a function which does something to a pair of integers, and I don’t know exactly what type of integer I want to use later on, so I just template it. Because the cost of doing this is basically nil, it means I don’t have to go back and refactor my code later when I change my mind about what integers I use. That kind of thing can be very low cost and high maintainability—high friendliness when it comes to programming.
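
A minimal sketch of that kind of low-cost templating, assuming a hypothetical helper named `abs_difference` that operates on a pair of integers (the function is illustrative, not from the interview):

```cpp
#include <cstdint>
#include <type_traits>

// Hypothetical helper: distance between two integers of any integer type.
// Subtracting the smaller from the larger keeps unsigned types from
// wrapping around. Extreme signed inputs can still overflow.
template <typename Int>
constexpr Int abs_difference(Int a, Int b) {
    static_assert(std::is_integral<Int>::value, "Int must be an integer type");
    return static_cast<Int>(a > b ? a - b : b - a);
}
```

Switching the calling code from `int` to `std::uint64_t` later requires no change to the helper, which is the refactoring saving described above.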

The cost of moving from a simple function to a class is higher, especially if that class has virtual functions, that is, if it’s abstract in the other sense of the word. If that’s the case, then you’re now incurring a runtime performance penalty, which may be warranted. Runtime performance penalties are not always a bad thing. As long as they’re away from hot code, the bits of code that need to run at maximum speed, you can get away with an awful lot of slop when it comes to runtime cost, especially in instances where the bandwidth and latency are limited by some other factor, like a network connection or a disk.

But really there are three reasons—or at least three reasons—why you might want to use a class instead.

The first is that you have some kind of internal state that needs to be managed carefully. For example, a std::vector or std::map manages its internal storage, and if you were to code this by hand, inline in a function, you would almost surely get something wrong. These classes manage that state very carefully, so you don’t need to worry about those details. Your code is much more readable if you’re using a std::vector than if you have a bunch of goto statements for resizing a buffer when it overflows. That is not very nice code to read.

The second reason to use a class is if you have some kind of behavior that needs to be flexibly abstract. What I mean by that is: you have an interface which reads and writes data from some source, but the source of the data is unknown. You might be reading from a disk or reading from a network socket, and this is a really great place to encapsulate the reading and writing process because it’s the outward interface. The bit that you’re really programming against is the same in both instances. You have a read function; you have a write function. It doesn’t really matter how that is implemented behind the scenes, as long as those two things work. Wrapping this in a nice class doesn’t have to be a virtual class; it can be a template, or something like that—some combination of both perhaps. It is a very convenient way of packaging the behaviors that are specific to one mechanism for doing that thing.
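
A hedged sketch of this second reason, using a template rather than a virtual base class. `MemorySource`, `RepeatSource`, and `read_all` are hypothetical names invented for illustration; a file or socket source would expose the same `read` member with a different implementation.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstring>
#include <string>
#include <utility>

// Hypothetical source backed by an in-memory string.
class MemorySource {
public:
    explicit MemorySource(std::string data) : data_(std::move(data)) {}

    // Copies up to count bytes into out; returns the number copied (0 at end).
    std::size_t read(char* out, std::size_t count) {
        const std::size_t n = std::min(count, data_.size() - pos_);
        std::memcpy(out, data_.data() + pos_, n);
        pos_ += n;
        return n;
    }

private:
    std::string data_;
    std::size_t pos_ = 0;
};

// A second hypothetical source with the same interface: it produces one
// character a fixed number of times.
class RepeatSource {
public:
    RepeatSource(char value, std::size_t total) : value_(value), remaining_(total) {}

    std::size_t read(char* out, std::size_t count) {
        const std::size_t n = std::min(count, remaining_);
        std::fill_n(out, n, value_);
        remaining_ -= n;
        return n;
    }

private:
    char value_;
    std::size_t remaining_;
};

// Programs against the read() interface only; it does not know or care
// where the bytes come from.
template <typename Source>
std::string read_all(Source& source) {
    std::string result;
    char chunk[4];  // deliberately tiny so that several reads are needed
    for (;;) {
        const std::size_t n = source.read(chunk, sizeof chunk);
        if (n == 0) break;
        result.append(chunk, n);
    }
    return result;
}
```

The calling code in `read_all` is identical for both sources, which is the "outward interface" point made above. A virtual base class would give the same separation with the choice of source deferred to run time.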

The third reason is that you, or some other developer later on, need a point of customization. This is a slightly nuanced point. C++ templates are very powerful, but function templates are sometimes a little trickier to use than class templates. The reason is that class templates can be partially specialized, whereas function templates cannot; they can only be overloaded, which does not behave in quite the same way. So it’s a really powerful technique to use a class template inside a function template: it allows you to provide a different specialization that customizes the behavior of the function template without directly touching the code within. This is a very nuanced use, I contend, but it is very useful. I’ve used this pattern a few times in my code, and I’ve seen it in other code as well. I think the first time I saw this was in NVIDIA’s CUTLASS library, and I think I’d used it before that, but without being conscious of the fact that I was using this particular pattern. I think it’s somewhat analogous to a sort of bridge or command interface that you might find in the Gang of Four, but with templates instead of virtual classes.
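
The customization-point pattern can be sketched as follows. This is an illustrative example under assumptions, not code from CUTLASS or from the interview: the names `Describe` and `describe` are hypothetical.

```cpp
#include <string>
#include <vector>

// Customization point: the primary class template handles the general case.
template <typename T>
struct Describe {
    static std::string apply(const T&) { return "value"; }
};

// Partial specialization: matches every std::vector<U>, whatever U is.
// A function template cannot be partially specialized like this.
template <typename U>
struct Describe<std::vector<U>> {
    static std::string apply(const std::vector<U>& v) {
        return "vector of " + std::to_string(v.size());
    }
};

// The function template itself never changes; it forwards to the class
// template, so new behavior is added by specializing Describe.
template <typename T>
std::string describe(const T& value) {
    return Describe<T>::apply(value);
}
```

A later developer can add a specialization of `Describe` for their own type in their own header, and every existing call to `describe` picks it up without the function template being edited.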

So those are my guidelines. If you have other uses, I’d be interested to hear what kind of reasons you would use a class rather than just sticking to a simple function.

5: One key to proficient C++ is knowing the standard library. How important is it for developers to leverage the STL’s algorithms and containers instead of writing their own from scratch?

Sam Morley: OK, so there are two things about the STL which are really important to remember. First is that it is a set of very flexible and very generic algorithms and containers for a very wide range of purposes. And secondly—and probably more importantly—is that it is there always and you can always use it. “Always” being a little bit tricky there—embedded developers, please don’t get angry with me—but for most C++ developers the STL is a sort of thing that you can rely on and use.

It is there whenever you need it, and generally speaking, these are very good, very high-performance facilities, and they can make your life much easier. So what the STL does, in effect, is shorten your development window: you spend less time implementing standard things and more time implementing the difficult things. It raises the floor of the base-level problems that you can solve without thinking about them.

If you remember when I was talking through my example earlier, you have these layers of problems. You start with big problems, you make them smaller, you make them smaller, and eventually you get down to a set of problems—maybe not at the same level everywhere—but you get down to problems which you know how to solve using standard tools or libraries. So what the STL does is it gives you one level up from having to write those things for yourself. It’s one less problem to solve, and this means you can move much faster. You can develop much faster.

Now, they might not give you the performance that you need. You might have to change the way that these work in order to get the performance that you need, but a large amount of the time the STL will probably give you all the performance you need, providing that you’re using the right algorithms and containers. OK, but that’s a separate question. The thing that it does do is speed up the development cycle.

If you implement something from scratch the first time and it doesn’t perform as well as you need, then fixing that might become problematic. Moreover, when something goes wrong, you might not know whether the new thing you’ve implemented is the source of the bug or whether there’s some characteristic that you missed somewhere else in the problem. You can get around that with testing, but really, if you’re prototyping something, building it with the STL at the beginning is the right way to get started, even if you know that you can’t use those things in the future because they won’t perform well enough. It means that you can find a solution. It doesn’t have to be the best solution.

Solving problems is an iterative process. You don’t always find a solution—let alone the best solution—the first time round. You probably have to take many bites at the apple. So first you solve the problem, and then you make it fast. And only by measuring do you know which bits are not fast. So starting with the STL will probably get you most of the way, and you’ll probably find that other parts of your software are the slow parts.
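
As a small illustration of "measure first", here is a minimal timing helper built on `std::chrono::steady_clock`. The name `time_it` is hypothetical, and this is only a sketch: a single timing is noisy, so real measurements would repeat the work many times or use a profiler or benchmarking library.

```cpp
#include <chrono>
#include <utility>

// Runs the callable once and returns how long it took on a monotonic clock.
template <typename Work>
std::chrono::nanoseconds time_it(Work&& work) {
    const auto start = std::chrono::steady_clock::now();
    std::forward<Work>(work)();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start);
}
```

Wrapping each stage of a pipeline in a helper like this is often enough to show which stage is actually slow before any rewriting starts.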

Now, there are some caveats. First, a lot of libraries provide faster, or slightly more flexible, or things with different properties which are basically drop-in replacements for the STL. For example, Boost containers are a set of more expanded and more flexible container types that are drop-in replacements in most cases for STL equivalents. Abseil has the same set of things, and probably other libraries too. These are really great if you’re already working in, say, a project that’s using Abseil—you already have all of those container types at your fingertips—and sometimes they do perform better. And things like small inline vectors are extremely useful for a lot of things, and both of those libraries provide such a thing.

Now, the other side of that is the algorithms. Similarly, there are other libraries that provide standard STL-like algorithms. NVIDIA Thrust is one that comes to mind; it provides parallel algorithms. C++17 introduced execution policies, I think they’re called, for the standard algorithms, which let an algorithm run multi-threaded or vectorized. Thrust came before that, and it’s specifically geared towards running on NVIDIA GPUs and NVIDIA libraries, but it’s the same set of functionality, actually. It’s a set of very general-purpose algorithm template functions which dispatch very cleverly through various pathways to give you a fast implementation of whatever that algorithm is doing on whatever device you’re doing it on. And it’s a very clean and efficient way of writing very parallelizable, very general-purpose code.

There is one more caveat that I want to mention, and that is that writing custom containers is a very dangerous game to play. Writing containers is hard. There are so many things you have to keep track of. You have to keep track of the construction and destruction of your elements. If that’s not a trivial thing, that is something you have to be very careful of. If you’re doing bulk allocations, you need to be careful that you have properly moved everything, and how you handle the errors. If something goes wrong during the copy, during the allocation, how do you unwind that? What guarantees can you give to the outside world—the rest of your program—about how that process happens? And moreover, how do you efficiently move things from an old allocation to a new allocation?

These are all very complicated and difficult things. I’m not saying that people aren’t capable of doing it, but I am saying that it’s very difficult to get right. If you are reimagining containers, then you should be asking why rather than how. There are genuine reasons to use different containers, but I don’t think you should be implementing them necessarily yourself. I would reach for a standard container library—like Boost or Abseil containers—and rely on the work of a lot of people to maintain those good implementations rather than trying to hack together something yourself.

6: Do you find that mastery of the standard library is a distinguishing factor in how efficiently developers can solve problems in C++?

Sam Morley: It surely can be. This goes back to the notion of what is the smallest problem that you know how to solve without thinking, and having a very good understanding of what is in the standard library—what the things in the standard library are capable of delivering, and how you might reasonably do that—will certainly raise this floor.

If you know that the standard library contains binary search functions, for instance, then that immediately is taking the place of having to solve a problem of how you binary search through something. Obviously this is a very well-understood thing; it’s just an example. But knowing how to make use of some of the more tricky and multifaceted std algorithms—for example, transform, reduce—knowing how to make use of that efficiently will make the range of problems that you can solve without doing a lot of hard work yourself quite a lot larger.
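
To make the two examples concrete, here is a hedged sketch using `std::lower_bound` for binary search and `std::transform_reduce` (C++17) for a combined transform-and-sum. The wrapper names `first_not_less`, `dot`, and `sum_of_squares` are hypothetical, and reading "transform, reduce" as `std::transform_reduce` is an assumption.

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <numeric>
#include <vector>

// Binary search over a sorted range: index of the first element >= target,
// or sorted.size() if there is none.
inline std::size_t first_not_less(const std::vector<int>& sorted, int target) {
    const auto it = std::lower_bound(sorted.begin(), sorted.end(), target);
    return static_cast<std::size_t>(it - sorted.begin());
}

// Dot product: multiplies elements pairwise, then sums the products.
// Assumes a and b have the same length.
inline double dot(const std::vector<double>& a, const std::vector<double>& b) {
    return std::transform_reduce(a.begin(), a.end(), b.begin(), 0.0);
}

// Sum of squares: a unary transform followed by a reduction.
inline double sum_of_squares(const std::vector<double>& v) {
    return std::transform_reduce(v.begin(), v.end(), 0.0, std::plus<>{},
                                 [](double x) { return x * x; });
}
```

Each wrapper is a line or two because the standard algorithm already solves the base-level problem.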

However, it’s not necessarily true that you can’t be efficient without the STL. You can absolutely be very, very productive—productive is probably a better word than efficient. The factor is speed, speed and convenience. Like I said, the STL allows you to get going very quickly because it’s there, it’s ready to use. You don’t have to worry about linking or importing or doing anything difficult. And moreover, you don’t have to worry about licensing and things, which do come up occasionally. It’s there ready to go, and you can just use it. So it makes a big difference to how quickly you can deliver solutions.

It also makes a big difference in how quickly you can iterate on solutions. If you build something that works but is slow, then you can make it faster. I don’t think it’s necessarily important for you to use the standard library exclusively. If you’re already working in an ecosystem that provides standard-library-like abstractions, possibly more flexibly, then by all means use those things. If you always have Boost available to you, then use Boost. Boost also provides a great set of many, many more features besides what is in the standard library, and making use of those things will also enhance your productivity.

Similarly, if you’re in Abseil, then use Abseil. But you should still keep track of what is in the standard library, because the library stack you’re using now, whether that’s Boost, Abseil, Folly, or something else, might not be the library stack you’re using tomorrow. The STL is a constant factor. If you’re using C++, you more or less always have the STL, so having it in the back of your mind all the time is always a good idea. And it certainly will make you faster, not necessarily in code execution time, but certainly in development time.

7: C++ is a multi-paradigm language with many powerful features, some of which can be a double-edged sword for maintainability. Since the goal is to build scalable, maintainable solutions, what best practices do you suggest to keep C++ codebases clean and manageable?

Sam Morley: Yeah, this is a tricky question. There are, of course, a lot of general-purpose good practices that apply here—things like documenting your code and leaving lots of comments about how your function operates, what guarantees it expects, and what guarantees it gives, and understanding that.

Before we jump into this, I want to introduce the notion of “future you.” Future you is your future self, and for all intents and purposes, this is a different person. Because when you’re writing some code, you understand things in the context of what you’re doing at the moment. Future you will have lost this context. So when you come back to your code in a month, six months, a year’s time, and you look at it and you think, “What was I thinking to make this code?” almost surely the answer is, “I don’t know.”

So writing comments is not just for other people—it’s also for yourself. You don’t have to go overboard and say, “I add these two numbers together,” because that’s not a useful comment. But I’ve taken to doing this quite recently where I’ve been working on some very intricate mathematical expressions and processes: I’ve taken to writing very big, chunky block comments. It’s like, “Right, OK, this is where we are in this process. This is how the next set of things works. This is what it should do. This is broadly how I’m going to implement the algorithm to do this.”

These comments save me so much pain when I jump off the project for a week and then go back and have to remember exactly what I was trying to do. It takes you a few minutes to sit and think about what that thing was, but that’s time well spent because now you’re thinking about the problem. This is where you can do some of this work of breaking down the problem—abstracting, finding common patterns, things that you recognize, things that you know how to implement—and then you should be able to spot those elements in the thing that follows. Doing this work in the code, in the body of the code, will keep it there so that when you come back to it, you can remember what you were thinking.

And moreover, this also applies to other people—not just future you. But that’s general-purpose advice.

Specifically for C++ things, and more with scalability in mind: having a very strict separation of concerns is a very good idea. You want to keep code that does numerical computations away from code that talks to users. You want to separate different functionalities as much as possible, and ideally you want to test those in isolation. Having a very modular, very pick-and-choose kind of situation will really help with that.

Sometimes it’s not possible to do this easily. Sometimes separating things can be really hard work. But being able to test and benchmark your high-performance components in isolation can really help you understand what they’re doing, how they’re doing it, how fast they’re doing it, and make sure that everything there is correct before you integrate that into the rest of your program.

It also means that if you’re doing some work that involves distributing large computations over a large cluster or on the cloud or something, you can write the different distribution mechanisms separately and then just reuse your tight-loop computation routines inside those. So it affords you a great deal of flexibility to modularize your code and separate them into separate libraries, or even just separate namespaces within a library. These kinds of things can make a big difference in the way that you can test and run your code.

A couple more points: you should always pay attention to thread safety, even if your application is not going to be multi-threaded. You should be thinking, at some point, this might be multi-threaded; I might need to access this class, these class members, from different threads—so how do I make sure that that’s a thread-safe thing to do?
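
One common way to make member access thread-safe is to guard the members with a mutex inside the class. The sketch below is illustrative only; the class name `Stats` and its members are hypothetical, and a mutex is just one of several possible designs.

```cpp
#include <cstddef>
#include <mutex>

// Hypothetical accumulator whose members may be touched from several threads.
class Stats {
public:
    void add(double sample) {
        std::lock_guard<std::mutex> lock(mutex_);
        sum_ += sample;
        ++count_;
    }

    // Reads both members under one lock, so the result is never computed
    // from a sum and a count that belong to different moments in time.
    double mean() const {
        std::lock_guard<std::mutex> lock(mutex_);
        return count_ == 0 ? 0.0 : sum_ / static_cast<double>(count_);
    }

private:
    mutable std::mutex mutex_;  // mutable so that const readers can lock it
    double sum_ = 0.0;
    std::size_t count_ = 0;
};
```

Keeping the lock private means callers cannot forget to take it, which is part of treating thread safety as a design decision rather than something added at each call site.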

And the third thing is to make sure that you keep your build system clean. I use CMake, typically. Make sure you keep that clean, and keep it in a way that is easy to see what the individual components are. Moreover, if you need to extract bits and put them in their own library, make sure that’s an easy process, because build systems can get left behind, and having a broken build system is far worse than having broken code. It’s much harder to figure out what exactly has gone wrong if your build system is broken. So those are my points.

8: When using advanced features like template metaprogramming, clever lambdas, or other C++ “power tools,” how do you ensure the code stays readable and team-friendly rather than turning into an overly complex “wizardry”?

Sam Morley: Yeah—I mean, wizardry is the right word. I’ve seen some horrendous template metaprogramming in my life. I’ve written some horrendous template metaprogramming in my life. I’m going to be the first one to admit that it’s never worth it.

Generally, I stay away from template metaprogramming nowadays. The need for it has diminished somewhat with concepts and constexpr functions being part of the standard now, and the amount of flexibility that those afford you going up. The need for very complicated template metaprogramming has gone down.
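
A small illustration of how `constexpr` functions reduce the need for template metaprogramming. Both versions below compute a factorial at compile time; the names `FactorialTmp` and `factorial` are hypothetical. Concepts, the other feature mentioned above, require C++20 and are not shown here.

```cpp
#include <cstdint>

// Older style: a recursive template metaprogram with a base-case
// specialization.
template <unsigned N>
struct FactorialTmp {
    static constexpr std::uint64_t value = N * FactorialTmp<N - 1>::value;
};

template <>
struct FactorialTmp<0> {
    static constexpr std::uint64_t value = 1;
};

// Newer style: an ordinary-looking function that the compiler can also
// evaluate at compile time (loops in constexpr need C++14 or later).
constexpr std::uint64_t factorial(unsigned n) {
    std::uint64_t result = 1;
    for (unsigned i = 2; i <= n; ++i) {
        result *= i;
    }
    return result;
}

// Both are usable in constant expressions.
static_assert(FactorialTmp<10>::value == 3628800, "template version");
static_assert(factorial(10) == 3628800, "constexpr version");
```

The `constexpr` version reads like normal code and can also be called with run-time arguments, which the template version cannot.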

There are other reasons, of course. Templates are very expensive from a build-time point of view. Instantiating a complicated template metaprogramming construct can easily double the compile time for a particular C++ file. And that’s not healthy if you’re building 10,000 of these—that’s a lot of time. There’s a good reason why Google, when they wrote Abseil, kept their metaprogramming to an absolute minimum. They’re very explicit about this fact. It’s because the compile-time costs are just too high.

And moreover, going back to the “future you” idea: if you write template metaprogramming code, future you will have a hard time understanding it, because it’s one of those things that makes sense while you’re writing it, and then it becomes immediately impenetrable. So I would stay away from template metaprogramming as much as possible. There are some isolated things that are useful—like using SFINAE to enable or disable particular instantiations of templates and things—but always keep that as minimal as possible.
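
For reference, a minimal example of the isolated SFINAE use mentioned above: `std::enable_if` switches particular overloads on or off depending on the argument type. The function name `kind_of` is hypothetical.

```cpp
#include <string>
#include <type_traits>

// Participates in overload resolution only when T is an integral type.
template <typename T,
          typename std::enable_if<std::is_integral<T>::value, int>::type = 0>
std::string kind_of(T) {
    return "integral";
}

// Participates in overload resolution only when T is a floating-point type.
template <typename T,
          typename std::enable_if<std::is_floating_point<T>::value, int>::type = 0>
std::string kind_of(T) {
    return "floating point";
}
```

Calling `kind_of` with a type that matches neither condition, such as `std::string`, fails to compile because no overload remains. Kept this small, the technique stays readable.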

For lambdas, lambdas are interesting because, used correctly, they can really enhance the readability of your code. They can really make it much easier to understand. On the flip side of that, they can really, really make it hard to understand what the code is doing. So my general advice for using lambdas is: keep them relatively short, and avoid having lambdas which capture and modify values that are a long way away.

What I mean by that is: suppose you have a big function that is performing some kind of calculation, and at the top you have a couple of lambdas which capture a row number. Let’s say you’re doing a matrix multiplication. It captures a row number, and the lambda accesses data from a particular row and then advances the row number. Now using that lambda will always cause confusion because the row number is a long way away from where the lambda is used. So every time you think, “What is this lambda doing?” it’s modifying something that you’ve not looked at for a long time because your screen has been further down the page.

Done correctly, this can be quite a powerful pattern. Done incorrectly, it really is a hindrance to you remembering what your code is doing. Almost surely in this instance, if you have a value which is initialized and then only ever modified or used by a lambda, it would almost surely be better encapsulated in a class of some description separately, so that the dependency on this thing—and the fact that this is a value that’s only modified or used by the class—is very explicit.
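One way to read that advice is sketched below, assuming a row-major matrix stored in a std::vector. RowCursor and row_sums are hypothetical names invented for this illustration; the point is only that the row counter lives inside a small class instead of being mutated by a lambda defined far from where it is used.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: owns the "current row" state that a capturing lambda
// would otherwise modify from a distance.
class RowCursor {
public:
    RowCursor(const std::vector<double>& data, std::size_t cols)
        : data_(data), cols_(cols) {
        assert(cols_ > 0 && data_.size() % cols_ == 0);
    }

    bool done() const { return row_ * cols_ >= data_.size(); }

    // Returns the sum of the current row, then advances to the next row.
    double sum_and_advance() {
        assert(!done());
        double total = 0.0;
        for (std::size_t c = 0; c < cols_; ++c) {
            total += data_[row_ * cols_ + c];
        }
        ++row_;
        return total;
    }

private:
    const std::vector<double>& data_;
    std::size_t cols_;
    std::size_t row_ = 0;  // only ever modified by sum_and_advance()
};

// The mutation is now visible at the call site through the method name.
std::vector<double> row_sums(const std::vector<double>& data, std::size_t cols) {
    RowCursor cursor(data, cols);
    std::vector<double> sums;
    while (!cursor.done()) {
        sums.push_back(cursor.sum_and_advance());
    }
    return sums;
}
```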

So that’s my thoughts, but that only really applies if the lambda is modifying a value. If it’s just capturing and doing something to it, that’s different. One of my favorite uses of lambdas is to capture a pointer that’s come in as a span or something that’s come in as a function argument, and then return particular subspans or particular elements from that span. For writing a matrix multiplication, for example, you might want to return a submatrix, or you might want to return a row or a column, and using a lambda for that purpose is really helpful because it saves the amount of work that you have to write again and again. And also it’s not modifying anything. Modifying is the problem.

As soon as you’re just returning a particular row, a particular column, or a particular element, that’s less problematic. In the past, you probably would have used a macro for doing these kinds of operations, but this is just C++. We don’t use macros anymore.

So those kinds of uses are fine, but I would generally try to keep your lambdas very short—and if they do need to capture things, remember the locality in the code of where you’re capturing from, and try not to let that drift too much.

9: Let’s talk a little bit about performance, concurrency, and safety, specifically in C++. You have a chapter in your book on understanding the machine, covering topics like modern CPU architecture, memory errors, SIMD instructions, and branch prediction. Why should today’s C++ developers care about these low-level details?

Sam Morley: OK, so let’s think of it like this. Suppose you are driving down a road. If you’re going along an unfamiliar road, you have to drive slower. You don’t know where the turns are. Suppose it’s dark—you don’t know where the turns are. You don’t know what the traffic is like. You don’t know what the road condition is like, so you drive slower to be cautious. And this is what writing code without thinking about the system is like. In this world, the system that you’re running the code on is the road, the code that you’re writing is the car, and you’re thinking ahead about what the road conditions are going to be like—although you actually know what the road condition is going to be like in a lot of cases.

And in those conditions—like if the road is flat and straight, the road condition is good, there’s good visibility, there’s little traffic—you can go faster. And this is really what understanding the machine is all about: understanding how the different levels of cache interact, and how one retrieves data and then operates on it efficiently is a big part of how you make applications fast. If you ignore the cache, the code will work, but it will be much, much slower.

So, for example, most people in computer games have this discussion of structure of arrays or arrays of structs. The pattern is very simple. If you have, say, a set of objects inside your game, do you put those in a vector of structs, where the struct has all the different properties—like position, velocity, mass, whatever—or do you put them in separate arrays? One array for positions, one array for velocities, one array for masses, and so on. And this makes a big difference because of the cache and also because of vectorization. If you’re going to operate on positions only, then having a contiguous set of positions in memory means you can fetch them and operate on them very efficiently. Whereas if you have an array of structs, then you’re fetching positions but you’re also fetching velocities and masses and all the other stuff that you don’t need at that point in time, and you’re wasting bandwidth and you’re wasting cache.
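The two layouts can be sketched as follows. The Particle and Particles types and the single-float fields are simplifying assumptions made for illustration (a real game would likely use vector types for position and velocity).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Array of structs: each particle's fields sit next to each other in memory.
struct Particle {
    float position;
    float velocity;
    float mass;
};

// Struct of arrays: each field is stored contiguously on its own.
struct Particles {
    std::vector<float> position;
    std::vector<float> velocity;
    std::vector<float> mass;
};

// Uses position and velocity, but every cache line fetched also carries mass.
void advance_aos(std::vector<Particle>& particles, float dt) {
    for (auto& p : particles) {
        p.position += p.velocity * dt;
    }
}

// Streams through two dense arrays; the mass array is never touched.
void advance_soa(Particles& particles, float dt) {
    assert(particles.position.size() == particles.velocity.size());
    for (std::size_t i = 0; i < particles.position.size(); ++i) {
        particles.position[i] += particles.velocity[i] * dt;
    }
}
```

Both functions compute the same result; the difference is in how much unrelated data travels through the cache and how easily the compiler can vectorize the loop.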

So that’s one of the really classic examples. Another really classic example is matrix multiplication. Matrix multiplication is interesting because, in one direction of your matrix, you’re accessing data sequentially, which is really good. That’s really great for cache hierarchy. In the other direction, you’re accessing it with a huge stride, so the elements that you touch as you move from row to row down a particular column are far apart in memory, so you have to go a long way between these elements. So this is really bad for cache locality.

In order to address this, you do tiling. You take a small chunk of your matrix and use the data in that as much as possible so that you make the most of those expensive load operations, and you do as much operation as you can on that small tile of matrix. Then you move to the next tile.

In the book, I show a very marked improvement over a very naive implementation (something like a factor of four), and this was the point at which I started to engage a bit more with the pipelining and SIMD parts of this. You can speed things up dramatically.

And if you want examples of this kind of thing, FFTW is a really great code base to look at. It’s a very difficult code base to read because it’s a C code base and it’s full of macros, but you can spot some elements of what they’re doing. Pipelining is the technique they’re using: the idea is to hit the compiler with lots of operations at once so that they can be stacked up, which makes execution much faster than the situation where, “I need this value, but now I have to wait for it.”

Also, they will use lots of SIMD operations and vectorization at the end. So that’s where I would suggest that people look. This is prevalent across all compute domains. It’s just about understanding what is the limiting factor in the performance of your software and then having some knowledge of the underlying computer—or whatever system you happen to be operating on—and really making use of every part of that.

For machine learning, for example, the models are huge now—billions of parameters, trillions of parameters even—and throughput really matters. Taking an extra microsecond to do a computation might not sound like much, but those micro-efficiencies really make a big difference in the long run. For general-purpose compute, if you’re interacting with a disk, or interacting with a network, or interacting with a user, then those details might not matter because you’re limited by something else. So it’s all about understanding where and when it’s appropriate.

10: Can you share an example of how understanding hardware behavior can guide a C++ programmer to write more efficient or optimized code?

Sam Morley: Well, I mean, OK—this structure of arrays discussion is certainly one example of this. I come from a sort of scientific computing, high-performance compute for machine learning kind of background, or at least that’s where I am now, and here I always have to think about this.

One of the real classic examples of where you really need to understand these things is matrix multiplication. Matrix multiplication is interesting because, in one direction of your matrix, you’re accessing data sequentially, which is really good. That’s really great for cache hierarchy. In the other direction, you’re accessing it with a huge stride, so the elements that you touch as you move from row to row down a particular column are far apart in memory, so you have to go a long way between these elements. So this is really bad for cache locality.

So in order to address that, you do tiling. You take a small chunk of your matrix and use the data in that as much as possible so that you make the most of those expensive load operations, and you do as much operation as you can on that small tile of matrix. Then you move to the next tile.

This is something that you have to think about if you’re writing high-performance code, because you can’t just write the naive triple loop and expect it to be fast. It will work, but it will not be fast. If you want it to be fast, you have to structure your computation so that it plays nicely with the cache and the memory hierarchy. And the same kind of thinking applies to lots of other algorithms as well.
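The naive triple loop and a tiled variant can be sketched side by side. This is an illustration under stated assumptions (square row-major matrices, a hypothetical default tile size of 32) and not the implementation from the book; a sensible tile size depends on the cache sizes of the target machine.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

using Matrix = std::vector<double>;  // n x n, stored row-major

// Naive triple loop: the inner loop walks down a column of b with stride n.
Matrix multiply_naive(const Matrix& a, const Matrix& b, std::size_t n) {
    assert(a.size() == n * n && b.size() == n * n);
    Matrix c(n * n, 0.0);
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < n; ++j)
            for (std::size_t k = 0; k < n; ++k)
                c[i * n + j] += a[i * n + k] * b[k * n + j];
    return c;
}

// Tiled version: works on tile x tile blocks so that data loaded into the
// cache is reused many times before the loops move on to the next block.
Matrix multiply_tiled(const Matrix& a, const Matrix& b, std::size_t n,
                      std::size_t tile = 32) {
    assert(a.size() == n * n && b.size() == n * n && tile > 0);
    Matrix c(n * n, 0.0);
    for (std::size_t ii = 0; ii < n; ii += tile)
        for (std::size_t kk = 0; kk < n; kk += tile)
            for (std::size_t jj = 0; jj < n; jj += tile)
                for (std::size_t i = ii; i < std::min(ii + tile, n); ++i)
                    for (std::size_t k = kk; k < std::min(kk + tile, n); ++k) {
                        const double a_ik = a[i * n + k];
                        // Innermost loop is sequential in both b and c.
                        for (std::size_t j = jj; j < std::min(jj + tile, n); ++j)
                            c[i * n + j] += a_ik * b[k * n + j];
                    }
    return c;
}
```

Any speedup should be measured on the target hardware; the factor depends on matrix size, tile size, and compiler flags.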

So that’s a really quick example of how understanding hardware behavior—specifically cache locality and memory access patterns—can guide you to write code that’s much more efficient.

11: Your book also delves into parallel computing and even GPU programming, which is notoriously difficult with pitfalls like data races and deadlocks. Coming back to the mindset aspect of things, what mental models or strategies do you recommend for designing multi-threaded C++ applications?

Sam Morley: Yeah, thankfully modern C++ really does make this a lot easier. There are two different scenarios I want to highlight.

The first is where you have a large amount of data that you need to process and you want to do this in parallel. Now, with some caveats, this is relatively safe to do in a multi-threaded environment because you just give each thread a different range of values to operate on. There’s never any overlap, and each thread goes away, does its work, and the results are put in the buffer, and there’s no overlapping. There are no data races; there are no problems there.

And this is a safe thing to do, and it’s very easy to do with parallel algorithms or OpenMP and things like that, which will do a lot of the hard work of checking that these things are not violated for you. There are some conditions on setting up the problem so that it works. Operating on self-referential data, or data that refers to other parts of the data, is obviously going to cause problems. But that wouldn’t be an appropriate usage of those things anyway.
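The disjoint-range idea can be sketched with plain std::thread, used here in place of the parallel algorithms or OpenMP mentioned above so that the partitioning is visible. The function name square_in_parallel is hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Squares every element in place, splitting the data into disjoint chunks.
void square_in_parallel(std::vector<double>& data, std::size_t num_threads) {
    assert(num_threads > 0);
    const std::size_t chunk = (data.size() + num_threads - 1) / num_threads;

    std::vector<std::thread> workers;
    for (std::size_t t = 0; t < num_threads; ++t) {
        const std::size_t begin = std::min(t * chunk, data.size());
        const std::size_t end = std::min(begin + chunk, data.size());
        if (begin == end) break;  // no elements left to hand out
        // Each worker reads and writes only [begin, end); ranges never overlap.
        workers.emplace_back([&data, begin, end] {
            for (std::size_t i = begin; i < end; ++i) {
                data[i] *= data[i];
            }
        });
    }
    for (auto& worker : workers) {
        worker.join();
    }
}
```

No locks are needed because no element is visible to more than one worker. That stops being true as soon as one element's result depends on another element.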

The other type of multi-threaded environment that you might have is where you have several worker threads that are handling different events within a bigger system, and here you have shared state. So each of the threads has some kind of global—or inter-thread, at least—state that they need to access. This could be for communicating between threads. So, for example, you might have one worker which is dispatching work to all of the other worker threads. This would be your main thread stacking up operations it needs performing, and the typical way that you would do this is with a queue.

So you’d have a thread-safe queue that you put work into. Each thread comes along, queries the queue, and says, “Is there any more work for me to do?” If so, it takes the job out and works on it in isolation, and this operation is thread-safe. It has to be thread-safe.
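A minimal sketch of such a queue, built from a mutex and a condition variable, is shown below. WorkQueue, close, and sum_with_workers are hypothetical names for this illustration; production code would more likely use an existing, well-tested queue.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <optional>
#include <queue>
#include <thread>
#include <utility>
#include <vector>

template <typename T>
class WorkQueue {
public:
    void push(T item) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            assert(!closed_);
            items_.push(std::move(item));
        }
        ready_.notify_one();
    }

    // Signals that no more work will arrive.
    void close() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            closed_ = true;
        }
        ready_.notify_all();
    }

    // Blocks until a job is available; returns nullopt once closed and empty.
    std::optional<T> pop() {
        std::unique_lock<std::mutex> lock(mutex_);
        ready_.wait(lock, [this] { return !items_.empty() || closed_; });
        if (items_.empty()) return std::nullopt;
        T item = std::move(items_.front());
        items_.pop();
        return item;
    }

private:
    std::mutex mutex_;
    std::condition_variable ready_;
    std::queue<T> items_;
    bool closed_ = false;
};

// The main thread dispatches 1..count; each worker sums the jobs it takes.
long long sum_with_workers(int count, int num_workers) {
    WorkQueue<int> queue;
    std::vector<long long> partial(num_workers, 0);
    std::vector<std::thread> workers;
    for (int w = 0; w < num_workers; ++w) {
        workers.emplace_back([&queue, &partial, w] {
            // Each worker writes only to its own slot of partial.
            while (auto job = queue.pop()) partial[w] += *job;
        });
    }
    for (int i = 1; i <= count; ++i) queue.push(i);
    queue.close();
    for (auto& worker : workers) worker.join();
    long long total = 0;
    for (long long p : partial) total += p;
    return total;
}
```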

But also, you might have some global configuration or some kind of global data that you need to access everywhere. And there it becomes really important to understand what it really means to be thread-safe. Thread safety is a tricky thing. You need to understand where things can be mutated, who has ownership over particular things, and where that ownership can change.

Ideally—and this is something that will come up later, I’m sure—you want to have this model where only one place in your code—one thread, one function, one whatever—can modify a value at any given time. This can be achieved in one of two ways. Either you design the architecture of the program so that one thread can only ever touch one value—this is the distributed data type model—or you have a synchronization mechanism like an atomic or a mutex-locked value, or some other kind of thread mechanism for controlling access to a particular resource.

In the latter case, it’s very easy to get this wrong. Deadlocks can happen. You can still end up with data races if you use these things inappropriately.
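One common deadlock arises when two threads lock the same pair of mutexes in opposite orders. The sketch below shows one way to avoid it with std::scoped_lock (C++17), which acquires several mutexes using a deadlock-avoidance algorithm. The Account type and function names are hypothetical.

```cpp
#include <cassert>
#include <mutex>
#include <thread>

struct Account {
    std::mutex mutex;
    long balance = 0;
};

// Locks both accounts together, so transfer(a, b) and transfer(b, a) can run
// at the same time without each holding one mutex and waiting for the other.
void transfer(Account& from, Account& to, long amount) {
    assert(&from != &to);
    std::scoped_lock lock(from.mutex, to.mutex);
    from.balance -= amount;
    to.balance += amount;
}

// Runs opposing transfers on two threads and returns the combined balance.
long stress_transfers(Account& a, Account& b, int rounds) {
    std::thread t1([&] { for (int i = 0; i < rounds; ++i) transfer(a, b, 1); });
    std::thread t2([&] { for (int i = 0; i < rounds; ++i) transfer(b, a, 2); });
    t1.join();
    t2.join();
    std::scoped_lock lock(a.mutex, b.mutex);
    return a.balance + b.balance;
}
```

Locking from.mutex and then to.mutex with two separate std::lock_guard objects would compile and usually appear to work, which is what makes this class of bug hard to catch in testing.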

So what I would suggest is that if you do have to do multi-threaded code, you read very carefully the documentation on cppreference or some other equivalent source for all of the different synchronization mechanisms that are available in C++, and you really try and understand what each one of those things is for and how it operates. Then you’ll be much better equipped when you are trying to design a class that needs to be shared between multiple threads, and in particular to decide how you manage the mutability. That might be interior mutability, meaning mutable values within the class, or exterior mutability, where you need to take a mutable instance of the class and actually do something with it.

Ideally, you need all of that to be thread-safe, and knowing what the different options are will enable you to actually write this code. Hopefully that will mean that you don’t have deadlocks or data races. Always test your code.

12: Robustness and security are critical in systems programming. With C++’s manual memory management and the risks of undefined behavior, how can C++ engineers improve the safety of their code?

Sam Morley: Go and learn some Rust. I know a lot of C++ programmers turn their nose up when Rust is mentioned, and generally the feeling that I get from a lot of people is that, “Oh, we don’t need Rust. We can do all of this in C++.” But that’s not the point. The point is that Rust has a bit of a learning curve, particularly for C++ developers, because they go into it with a C++ attitude, and the Rust compiler isn’t having any of that.

The Rust compiler forces you to think very carefully about ownership and lifetimes, and whether it’s safe to move things from one thread to another. That’s its whole design: managing access, the validity of values across an entire system, and very carefully managing the enforced properties—whether it’s safe to send things or share things between threads. They have these two traits called Sync and Send, which basically determine whether you can share things or send things between threads safely.

The same applies to async programming. Even if you’re not using multiple threads, you still need to think about this for async programming as well. Learning a bit of Rust will force you to think about these things up front, and many other good things that you should definitely think about—like unsafe code. These are things that C++ programmers sort of take for granted without actually thinking about what they’re doing.

When is it actually safe to dereference a pointer? The answer is almost never. It’s almost never safe to dereference a pointer. That’s fundamentally an unsafe thing to do. You don’t know where that pointer came from. You may do, but you don’t really know where that pointer came from. You don’t know whether it’s valid or not. These are things that you have to reason about as the developer.

Rust forces you to think of this as an unsafe operation, and because of that you’re far more cautious about actually doing it. And these concepts, this way of thinking, are transferable. Learning a bit of Rust will make you better at writing safe C++.

The reverse is not true. Learning C++ will not make you good at writing Rust code. In fact, it will probably make you very frustrated. But getting over that frustration and understanding why Rust enforces these things is important, because these are the same principles that allow you to write safe code anywhere, not just in Rust.

13: Are there any particular practices or modern C++ features you advocate for to prevent things like buffer overflows, memory leaks, things like that—while retaining the performance and control that C++ offers?

Sam Morley: Yeah, absolutely. I mean, it’s not exactly a new feature, but using std::array rather than C-style arrays is definitely a huge win. Smart pointers mean you don’t ever manage memory by hand.

There are some cases where you might actually do this, but most of the time, writing operator new in your code is an anti-pattern by this point. Use a smart pointer; use a container.

The mantra of my containers section is: just use std::vector. It applies most of the time. And use std::span rather than raw pointers or C-style arrays for passing data around. It adds an extra layer of memory safety. Yes, it does carry a small runtime performance cost, but that’s negligible compared to the risk of your code crashing because of an invalid memory access or, worse, producing garbage that goes unnoticed.

The best-case scenario for a bad memory access is a crash. That’s the computer responding to a bad thing. If it goes unnoticed, it could happen for months before you notice that this has been producing garbage the entire time, by which point you’ve wasted months. So those are the things that I would reach for first.

But the other thing is: stop using C functions. The C functions that existed a long time ago have numerous documented vulnerabilities in this sense. gets—the function from the C library which does an unchecked read from standard input to read a line of text—is fundamentally unsafe. I can make a line of terminal input as long as I need, and that’s a sure way of getting a buffer overflow. There are safer equivalents, but generally speaking, don’t use the C library if you can avoid it. It’s not safe, and using it will always cause some problems somewhere—especially the I/O functions like gets and puts and sprintf and things like that. These things you have to be very, very careful about.
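As a sketch of what the C++ standard library offers in place of those functions, the example below uses std::getline instead of gets and a string stream instead of sprintf. The function names read_line and format_score are hypothetical.

```cpp
#include <cassert>
#include <istream>
#include <sstream>
#include <string>

// Reads one line of any length. std::string grows as needed, so there is no
// fixed-size buffer for a long input line to overflow.
std::string read_line(std::istream& input) {
    std::string line;
    std::getline(input, line);
    return line;
}

// Builds formatted text without a caller-supplied character buffer.
std::string format_score(const std::string& name, int score) {
    assert(!name.empty());
    std::ostringstream out;
    out << name << ": " << score;
    return out.str();
}
```

With std::cin as the stream, read_line(std::cin) accepts a line of terminal input however long the user makes it, limited only by available memory.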

14: Let’s finally talk about your book, The C++ Programmer’s Mindset. You’re both a research engineer and a mathematician, and you maintain a high-performance C++/Python library for data science. The book itself combines practical insight with academic rigor. What drove you to write The C++ Programmer’s Mindset? Did you observe a gap in how C++ developers approach problem-solving that you wanted to address with this book?

Sam Morley: So it’s an interesting question. Going in, of course I had to do a bit of market research around this, but my feeling was: I like solving problems.

The main motivation for me writing this book was to share my feelings about solving problems, and my enthusiasm for solving problems. There will always be a new problem to solve. You will almost surely never encounter a situation where you’ve solved all the problems. There will always be a new one, and it will be interesting because it’s new. And the more problems you solve, the better you get at it, for sure.

But this is not just a passive process. As I mentioned at the beginning, a lot of people are doing this process of computational thinking using this framework that we described. A lot of people are doing this without thinking about it, and one of the things I wanted to highlight in this book was: in order to get better at solving problems, you need to be conscious of what you’re doing to solve the problems. You need to think about what it is that you actually need to do and how you can do it—not just in the context of the problem, but in the context of thinking about the problem, understanding the problem.

And something else that I feel quite strongly about is that I feel like a lot of C++ developers could benefit from being conscious of the environment in which they operate—thinking about the operating system, the underlying hardware, thinking about what the different mechanisms that they’re using are, how those things are informed by and inform the problem-solving process.

Do I need a map, or do I need a hash map, or do I need a vector? These are design questions that are informed by the implementation, and those relationships are really what the book is about. It’s about thinking about the language, the hardware, the operating system—all of those things combined—in the context of solving problems, and how the process of solving problems is informed by, and informs, the choices that you make elsewhere.

So that’s the message that I eventually decided was going to be the topic of the book.

15: What mindset shift or new capabilities do you expect a seasoned C++ developer to gain after reading your book?

Sam Morley: Seasoned developers might feel that they already have a pretty strong grasp of solving problems, and this is probably true. There are a lot of very talented engineers out there. I would suggest, though, that everybody has something to learn. You can’t ever know everything. So the mindset shift is exactly that: you can’t know everything, so learn as much as you can from as many people as you can, and hope that fills in as many gaps as you need. That’s the philosophy that I would hope seasoned developers would take away from this.

In terms of new capabilities, seasoned developers might already be pretty familiar with cache hierarchy and things like that. What they may not be so familiar with is this linkage between the problem-solving process and the implementation details and the other factors. Computers are complicated machines, so understanding all of these things is impossible, of course, but you can understand parts of it, and moreover you can tune your problem-solving process to fit what you have and where you’ll be working. It’s a two-way street, and that I hope is something that even senior engineers can think about while they’re reading.

One of the key things that I mentioned very early on in the book is this “future you” idea. That will be helpful for you in the future, but it will also be helpful for less senior people who are learning this process for themselves. Being able to point out to them where and why certain parts of the process can be so tremendously helpful, and imbuing this understanding of how all of these different moving parts interact with one another, can be really, really powerful. That is something that I hope that even a seasoned engineer can gain from this book.

To go deeper into the ideas Sam Morley discusses in this interview—treating C++ problem-solving as a deliberate process, choosing abstractions with a clear-eyed view of their costs, and connecting design decisions to the realities of hardware, build systems, and team maintainability—see The C++ Programmer’s Mindset (Sam Morley, Packt, 1st ed., Nov 2025). The book introduces computational thinking as a practical framework—decomposition, abstraction, and pattern recognition—and shows how to apply it using modern C++ features to build solutions that are maintainable, efficient, and reusable. Across small examples and a larger case study, Morley covers using algorithms and data structures effectively, designing modular code, analyzing performance, and scaling work with concurrency, GPUs, and profiling tools—aimed at intermediate C++ developers who want to strengthen both their technical toolkit and the way they approach complex software challenges.

Rethinking Test-Driven Development for the AI Era: A Conversation with Kevlin Henney

Divya Anne Selvaraj — Thu, 11 Dec 2025 05:11:40 GMT

Test-driven development sits in an awkward place in many teams: widely cited, unevenly practiced, and often misunderstood. For some developers, TDD is a niche technique that only applies to greenfield code; for others, it is reduced to “writing some unit tests” after the fact. In between those extremes are practical concerns about legacy systems, language ecosystems, CI pipelines, AI-generated code, and the day-to-day pressures of shipping software with limited time and attention.

In this Q&A, we speak with Kevlin Henney — independent consultant, speaker, writer, and trainer — whose career sits at the intersection of software design and everyday development practice. Kevlin works with companies on code, design, practices, and people; contributes to the Modern Software Engineering YouTube channel; and is co-author of A Pattern Language for Distributed Computing and On Patterns and Pattern Languages in the Pattern-Oriented Software Architecture series, as well as editor of 97 Things Every Programmer Should Know and co-editor of 97 Things Every Java Programmer Should Know.

Across the conversation, Kevlin unpacks why TDD adoption stalls even for experienced developers, the misconceptions that blur the line between “developer testing” and true TDD, and how tests shape design without losing sight of the bigger architectural picture. He talks through introducing tests into large legacy codebases, how language and ecosystem culture influence testing practice, and what distinguishes good, specification-like tests from brittle method-by-method checks. We also explore tooling choices, where TDD fits alongside integration, acceptance, contract, and performance testing, and how team leaders can sustain testing discipline under deadline pressure. Finally, Kevlin shares his perspective on AI-assisted development, the risks of outsourcing tests to generators, and why, in an era of increasingly automated code, testing and review skills matter more than ever.

You can watch the full conversation below or read on for the complete Q&A transcript.

1: Adopting TDD can be tricky, even for seasoned developers. In your experience, what are the main reasons that experienced developers struggle when first adopting TDD?

Kevlin Henney: I think there are different kinds of developers, and they will have different reasons for struggle. At one level, you are asking people to do something different from what they normally do. That is the first challenge. Just as a human being, that is always going to be difficult, particularly when you already have a set of habits in place. Regardless of how effective those habits actually are, we always perceive the habits that we have as being comfortable. That is why they are habits, and sometimes we have a justification for them.

So trying to get anybody to do something different from something they already do is going to be a challenge. The more experience you have, in this case, the more at a disadvantage you may be, interestingly. If you are relatively new to software development, then everything is fresh and every new idea is more likely to be treated equally by you.

But even then, we need to understand that a novice developer can sometimes struggle, and sometimes we have the issue with people who are in that overlap space where they are not necessarily formally a developer, but they do a lot with code. I am thinking particularly of data scientists and engineers who might not consider themselves to be developers, but who have worked extensively with Python and associated libraries such as NumPy, Pandas, and things like that. They are in the development space, but they do not necessarily have the insight of development culture and concepts, and often they have semi-effective workarounds that they have created, which get them by every day. The point is that for most people, this is the case.

When it comes to testing of any kind, we do not necessarily have as good a story for people as we do for creating a feature and doing a demo. These are very well-practiced within the software development space, and often videos and books will emphasize these, and there is much less on any kind of testing, let alone TDD. So testing tends to be more ad hoc. When you are trying to get somebody to do something like TDD, which is a very structured workflow, that is your challenge: you are trying to get them to do something different.

Then one of the other challenges is often the way that TDD is described. There is a simple mantra, “red, green, refactor.” You write a failing test for something, then you make it pass, and then you refactor. Although that is a very simple mechanical description, and it is not wrong, it is not very motivating. It leads to the reaction, “Why do you want me to write something that does not work?” That is not the right mindset. “Write a thing that does not work and then make it work” does not feel like a motivating mindset.

So I think that is another obstacle. Often the examples or the way that TDD is taught make a lot more sense to somebody who has expertise in it, or when you are coaching alongside somebody, than they do when you are just offering somebody the mantra. It is not compelling. I will be the first to say this. When I do workshops and training courses for companies, I will describe the red-green-refactor cycle. You need to know that. But then I go into it, I take it apart, and I say what is really going on.

At that point, it becomes easier to motivate. The first point is that you are not just writing a failing test. You are writing something for a behavior that you do not have. Because you do not have that behavior, of course it is not going to pass. But the goal is not simply to make it pass. The goal is to write what you want for the new behavior.

The next motivation is actually a simple constraint. In many cases, we can end up yak shaving or just running off into the horizon with complex behaviors, saying, “I will just write everything, and then I will come back and test it later.” If we do that, we often end up with things that are not as simple as they should be, and we do not ask ourselves the questions, “Is this what I need? Is there a simpler way to do this?”

So TDD is literally a limiting factor. It is like throttling back the instinct to just throw everything at the screen. Instead, you are going to take steps so that you understand every step and consider it. It is really a scoping mechanism. The idea is: now I am going to make it pass with something that is no more complex than necessary, so that I fully understand what the next step is going to be. I can guarantee that this is always working, but I am also going to give myself the opportunity for refactoring.

When explained like this, I am not going to say that it suddenly turns on all the lights, but it does make more sense. Then we move to the next level, where we say, “Let us forget the red and green. Red and green are side effects.” Your real goal is: tell me what you want to have working. Here is a piece of code. It has a certain amount of functionality. You want it to do something else. What does that extra bit look like? Show me an example. Somebody says, “Well, it should do this.”

“OK, great. Does it do that now?” “No, it does not.” That is why it either does not compile or it does not pass a test, because you are asking for something new. A test is a change request. When you describe it that way, a lot of people say, “Oh, I see. You are writing a change request to yourself. You are saying, ‘I want to have a piece of code that does this, but it does not do that yet. Here is a really concrete example of what I want.’”

Now I am going to work towards that, and I am done within a couple of minutes, and I can continue from there. The point is that you are not simply teaching somebody how to test, although there is a truth in that. You are actually trying to rewire how they think about the very act of coding, and that is hard.
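The "test as a change request" framing can be sketched in a few lines. The leap-year example, the function name, and the plain assert-based tests are assumptions chosen for illustration and are not taken from the conversation. Each test is one concrete example of a wanted behavior, written before the line of the implementation that satisfies it.

```cpp
#include <cassert>

// Minimal implementation, grown one step at a time as each test below
// asked for a behavior that did not exist yet.
bool is_leap_year(int year) {
    if (year % 400 == 0) return true;
    if (year % 100 == 0) return false;
    return year % 4 == 0;
}

// Each test is a single example, named for the behavior it requests.
void years_divisible_by_4_are_leap_years() {
    assert(is_leap_year(2024));
}

void century_years_are_not_leap_years() {
    assert(!is_leap_year(1900));
}

void century_years_divisible_by_400_are_leap_years() {
    assert(is_leap_year(2000));
}
```

Before the century rule existed, century_years_are_not_leap_years would have failed. That failure is the "red" step, and it simply records that the requested behavior had not been written yet.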

That is why you will find that it is a skill. It is something worth practicing, and it is a practice that, once you have it, you can draw upon. It does not mean you have to do it all the time, but if you have never practiced it, how can you say, “That would be appropriate now as a tool or a technique”? You are trying to rewire how people think about the act of coding, and that is difficult. So you will meet resistance because of change, but also resistance because it is a fundamentally different way of doing something for which people already have some behaviors.

2: You have talked a lot about the mindset shifts required, and you said that adopting TDD itself is a skill. Are there any specific skill gaps you can point out that tend to be the biggest hurdles for developers who do not adopt TDD?

Kevlin Henney: Honestly, sometimes the problem in terms of skill gaps is simply testing itself, unit testing itself. In other words, developers do not have a habit of any kind for testing, or testing is something that happens later and sometimes in a mad rush. Therefore, the tests that are written are quite difficult to read, and people often have this idea of tests being second-class citizens.

Often you look at tests and you say, “Yes, they look like second-class citizens,” and people create tests that are difficult to maintain, sometimes because they have never been shown what a good test looks like. For many people, when they are learning TDD, they encounter the fact that there are many things they are trying to learn at the same time, and one of them is, “What does a good test look like?”

That is an issue. Picking up on what I said earlier, tests are specifications. There are many ways of thinking about testing, but the way that we are encouraging here is specifying. Your test should be an explanation, a description that captures intent, and it should have an example. The example is the centerpiece, and you want to capture the intent in the name. If you have a test that does too much, it is not a good test.

It turns out that many of these things are things that people do not already know or do. So in addition to the workflow, there is an additional skill: how do you write a good test?

The other skill that is often missing is that people do not necessarily have a good code sense or design sense. By that I mean they often do not know what good code looks like. When you say, “And now refactor,” they do not know what to do, because although they may have a refactoring menu available to them, and although they may know the meaning of the word, they do not actually know what “better than this” looks like.

So you end up with code that just gets bigger and bigger, with more ifs and whiles. That is not what I have in mind. Where is the simplification? They are not actively looking for simplification. That is a design skill, and that is quite difficult to teach.

Therefore, if you are going to make the best use of any workflow, and this is not unique to TDD, you need to be actively looking for good design. Many workflows suffer because people are not asking, “How do I make this simpler? How do I make sure I have less code to maintain in future?” You want to write less code. Your goal is not to produce more code; it is to produce less code.

The most common thing I see is that programmers do not know how to write less code. They often go in with more code than they need, and there is absolutely nothing wrong with writing too much code as your first draft. What matters is what you do with your second draft, and that is the problem. Many people do not have that second draft because they have not worked alongside somebody who can show them what that looks like.

You cannot expect this to be an act of magic, where you start learning how to code and suddenly develop a good instinct for the right balance and structure of a method or a class, and for what an interface should or should not look like to be effective and easy to change. Without helping people develop that sense, almost any workflow you throw at them is potentially going to make things worse.

We see that with AI: people who do not know how to code can produce a lot of code. They need to learn to produce less. You can use AI to produce less. The skill is to produce less that does exactly what you want, because then you have less that can go wrong and less to read.

This is something that I do not think we get across well enough. For me, TDD helps with that, because it always reminds me: “OK, now cook it down. You have this; cook it down.” You have tests; they work. You have a safety net. There is a skill there, which is very much code sense, both for the tests and for the body of the code itself.

3: What do you see as the biggest misconceptions or myths about TDD among developers and teams today?

Kevlin Henney: I do not know that there is necessarily just one, but there are a few. One is that it is only something you can do with new code; to be precise, that it can only be used in a greenfield situation. Another is that TDD is very much centered on your unit testing framework and things like that.

So there are these kinds of ideas, and we live and work in an industry where jargon is often thrown around, sometimes very imprecisely. In a number of companies, something is described as TDD when what people mean is, “Oh yeah, we are doing testing.” That is great. There is nothing wrong with that. The code leads and the tests follow, which is a different workflow. That is perfectly fine. I am not here to tell people that TDD is the only way to work.

What I am trying to avoid is a kind of semantic dilution, where “TDD” comes to mean just that developers are testing. That is great, but we would like to call that “developer testing” as a general term, rather than TDD, which is a very specific workflow.

4: Are there any particular false beliefs you frequently find yourself debunking?

Kevlin Henney: Oh, yes. There is one that has two sides to it. First of all, I separate out a couple of things. Sometimes when people are being negative about TDD, they are not talking about TDD; they are talking about unit testing. They are using “TDD” to stand in for unit testing in an environment where, culturally, within the organization, you do not test. That becomes a reinforcing thing. It is not about TDD at all in that case.

Then there is another one that does come from people who do practice TDD. Every now and then you will hear the slogan that TDD is not about testing, it is about design. I know what they are trying to do. They are trying to emphasize that testing is not just an act of verification. We often have this idea of testing as purely about verification, a kind of gatekeeping activity. But saying “TDD is not about testing” is not a true statement, and I always have problems when people present it that way.

At least half of my work is with companies who say, “We want to do TDD,” when what they really want to do is testing. TDD is a discipline, a workflow. You can tell when people are doing it. It is also the most extreme thing they have probably done in terms of testing discipline, so why give it a name that you also use for everything else? I always make sure people are aware there are many workflows.

What I try to make sure they understand is that when we say TDD, we mean something specific. It is not a magic spell. It is a particular way of working that gives you certain kinds of feedback and certain kinds of design pressure. Part of that pressure is on you as a developer to ask, “What am I actually asking for? Do I know what I want from the code?” Sometimes the honest answer is that you do not know what you want. That is a recognition of ignorance, that you do not yet have enough knowledge. At that point you may need to discover that knowledge, perhaps by spiking something or exploring, rather than pretending you are doing TDD when you are not.

Another part of this is the tests themselves. Some of them are actually quite large, and you have to ask, “Do I really want that? Is that genuinely helpful, or is that telling me something about the design?” Often the test is large because the design is causing that. If the interface feels very clunky, then that is telling you something about the design as well.

So as to what it feels like: testing is in fact the way you experience the design. Rather than looking at testing as a purely quantitative activity—“I got this percentage of statement coverage, I have done my job”—you can ask, “What does it feel like to write the test? What does it feel like to use the code I am providing?” If the answer is, “Yes, it is quite easy, it feels natural,” that is good feedback. If it feels like having a code review where you have to do most of the work yourself, then the tests are giving you a signal that there may be something up with the design.

5: There is an ongoing debate about TDD’s effect on software design and architecture. Some argue that focusing on small tests leads to fragmented design or lack of “big picture” thinking. How do you believe TDD influences software design?

Kevlin Henney: Hmm. I think it goes back to what I was saying about having this kind of design sense or code sense. If you are only ever going to think small, then yes, TDD will have those effects and you will end up with fragmentation rather than a cohesive design. That is one of the reasons it is quite important to make sure that you have a reasonable test hierarchy, that you are testing at all levels, and why, when you are doing this, you should always be taking the big picture view as well.

And this is, I guess, where the driving metaphor that is used extensively when talking about TDD becomes even more appropriate these days. When I drive, there are three places that I am typically looking. I am looking at the road immediately in front of me. I am also looking down at the dashboard to see what my car is telling me. And I am also looking at a map to see what the big picture is.

The problem is that I get the feeling that many people are only ever looking at one of these at a time. This is again not just a TDD thing; I find it with different roles in development. Of course, if you are only looking at the dashboard, you are not going to see what is in the road in front of you; you are going to slam into something. But if you are only looking at the road in front of you, that does not tell you what the bigger picture looks like and what the trends are in traffic, for example. So you are not getting feedback at all three levels. You are only ever looking at one and ignoring the feedback from the others.

So here is the thing. If you are using TDD and that has caused you to end up with a fragmented design, you are not looking at the bigger picture. When you launch into TDD, you should already have a vision of where you are going to go. The problem is that sometimes people do not actually have an idea of where they are going to go. I often have this thought of sketching out an approach. Do not commit yourself to detail. This is not a committed design; it is literally a sketch.

Given that sketch, what TDD is going to do is fill in the details. For anybody who does draw, and I know that drawing is not a very common skill among developers, it is one of those things where I always ask people what they do. Music is very common. Gaming is very common, whether it is computer-based games or board games. Certain sports are very common. Drawing is not very common, but when you draw, you often sketch the form and then you put the detail in, and that detail sometimes tells you that maybe the form is not right.

So for me, people often launch in hoping that if they start drawing in the bottom right-hand corner a miracle will occur. If you are a brilliant artist, yes, a miracle will occur; you will produce a great picture. But sometimes people are not looking at the big picture. They always need to be asking, how is this going to be used, how does this affect that? So for me, I think that we can look at that.

When some people say, “TDD does not do this,” my answer is, “No, that is your job.” TDD’s job is to fill in the detail of the sketch. It is your job as the artist to see the bigger picture and say, “I am drawing the wrong thing,” or “Maybe that needs to be moved,” or to take the feedback. If you are only taking feedback at one level, that is great; many people take feedback at zero levels. However, you need to be looking at multiple perspectives. Some of them are closer and some of them are further away.

So I do not really accept the criticism that TDD causes this. I accept that there may be a misunderstanding of the role of TDD, that people are sometimes saying, “If I do TDD, magic will occur.” As I told my kids when they were growing up, there is no such thing as magic. There is you, a tool, and a technique. That is it. If you are misapplying the technique, that is not the technique’s fault, so there is a learning opportunity.

6: When it comes to scaling TDD in a larger organization, what challenges do enterprises face in rolling out TDD across teams? Based on what you have seen, what strategies help make TDD stick in the long run at the organizational level?

Kevlin Henney: Although I am very keen on TDD, I do not know that an organization necessarily wants to roll out TDD. It is a workflow practice, and I think if you can get that working within a team, that is great, but there is no reason that another team has to do it. I think it might not be helpful for an organization to be mandating these things.

I think what the organization needs to care about is not so much the way that we are producing the tests, but questions like: do we have builds that work together, and do we have comparable testing philosophies across different teams? If you have a team that is taking a more traditional “test later, towards the end of the sprint” approach, and let us say they are really effective, they have some really good design, and their interfaces evolve really nicely, I would not mess with that. They are doing a perfectly good job, and because we have organized around teams, that does not really interfere. As long as our teams have some kind of alignment and relationship with an architecture, then I do not think there is a problem there to be solved.

What we do want is a reasonably consistent and compatible view of testing across the organization, and if TDD helps me get that, then that is what I should be encouraging. But I am not going to say that it is the thing I should focus on. I think that what an organization probably wants to focus on at the organizational level is: if we have various build pipelines, do these build pipelines follow similar philosophies of testing?

Because a build pipeline that does not have any testing in it is not really a CI/CD pipeline. I will be very careful here: I am using the term “build pipeline” because people will often say, “Oh, it is our CI/CD pipeline.” Is it? Are you doing continuous integration and are you doing continuous delivery? Because CI/CD is predicated on the idea that you have tests. In fact, to be fair, CI/CD is predicated on you doing trunk-based development and you doing a lot of tests. That is what that means. You can go and look at the original books and they are very clear on this. So definitely a lot of companies have build pipelines, but do they qualify as CI/CD pipelines? Not always, not by the strictest definition. I think that is the more valuable question.

So let us put it this way. I do not think an organization needs to worry about what people are doing in their homes, but they probably need to worry about the road system. In this sense, organizationally, when we look at software architecture, we need to be thinking of software architecture more like urban planning. We want to have consistent rules and models for the roads. We want to have a consistent layout, see what the issues are, and agree on things about roads and services. What people do in their homes and how they structure their homes and how they do it, I think that can be a lot more freeing, as long as we have the knowledge available and maybe one team can coach another and we can say, “You can become our enabling team; we are going to try this practice.” I think that is great, but I am not sure I want the organization to get involved in some of the more detailed practices that support what goes on. I think what goes on, what the output is, and how teams integrate is probably more important than specifically what they do on the inside.

As for what can make it stick, let us build off what I have said. One of the things is that a team needs to feel that it owns its practices. Teams respond, and individuals respond, sometimes quite poorly when they are told what they are going to do and they do not really feel it. If a team is told, “You are going to do TDD,” that is not a way of getting them to do anything well.

If they can make it their own habit, if they can create it, if it is their decision, then that is really important, but so is feeling that they have learned something. Again, in any large organization, and this obviously varies with the organization and its scale, we are going to find that there are different kinds of teams trying to produce different kinds of products.

Some people say, “We do not do TDD here.” Be very careful that when somebody says, “We do not do TDD here,” this is not also, “We do not do testing here.” Again, going back to what we have already discussed, that is what I hear when many people say, “We do not do TDD” or “TDD is not appropriate for us.” They are using TDD to mean any kind of testing, so they are using the wrong word. What they are actually saying is something much more blunt: “We do not have tests.” If they said that, it would be far more direct and we could work with that.

We need to work out whether that affects us or not. If it is a team that is just prototyping and giving us the results of prototypes, then that is not important. If it is a team that is prototyping a design, yes, we want tests, because you are telling us that this code, which we do not know whether or not it works, is the basis for what we are going to build. Prototyping can involve TDD and it can involve tests. I have done that a number of times in the past. So really it is a case of trying to understand from the organizational level how to get the knowledge out there and make the knowledge feel much more natural.

For many people, any kind of unit testing habit is the challenge. Having tests that run quickly is the challenge, and I would treat those as the questions to address. What we may find is that TDD may then follow by example, particularly if we have somebody within the organization who has experience of it, and that is how they drive it, show it, and demonstrate it. Then that may become a lot easier.

Lead by example in this case rather than by mandate. Basically say, “Look, there are a lot of different testing workflows. Our objective is to get better testing, to make testing of any kind more convenient. Let me show you this. I am going to use a test-driven workflow.” When you do that, it is much more open, and I think people are more likely to adopt it. Whereas if we have somebody going around measuring different teams in a very obvious way, teams justifiably offer a little bit of resistance.

7: Legacy code is a reality for most teams. If a team inherits a large untested codebase, how would you recommend they approach introducing TDD or even more testing in that scenario?

Kevlin Henney: I think that is a really good question, because it matches a lot of people’s lived experience. The key point is that you have to prioritize. From where you are, perfection is impossible, so you have to look at what is possible, and that is going to be a little different for each codebase and for each team. A lot depends on whether you have what I would call a “maintenance mindset.” If you have that mindset, it is going to be very difficult to adopt TDD.

By “maintenance mindset,” I do not just mean software maintenance in the narrow sense. I mean the broader idea that “we are just maintaining whatever it is we do.” You often see this where initial development has been done in one location, and then the work is offshored to another group. The second group is told, “You are just maintaining it,” and people there may not think of themselves as doing software development. In reality, they are. There is no real separate thing called “maintenance” when it comes to software products. It is all software development. There is not “software development plus maintenance”; there is just software development.

So the first step is to reclaim the right words. You are doing software development. Everything you do has the potential to change the architecture. It is your responsibility not to preserve the problems in the existing codebase, but to eliminate them. “Maintenance” as stasis is not what you want. Your job is to be more ambitious: to make the product better than it was when you received it. How do you do that? One obvious obstacle is that you would love to test everything, but you have poor test coverage. In that case, do not try to test everything. Instead, decide how to prioritize what you test.

A useful way to do that is to look at what is going to happen in the next quarter. Suppose over the next three months you are going to add features in a particular part of the codebase. If that corner of the codebase is already relatively well isolated, then you lean into that. Reinforce the isolation. Make sure you have good automated refactoring tools available. Remember that your compiler will still catch many type-based errors. You can introduce separation and decouple tightly coupled code without relying on a large pre-existing test suite. You can lean on automated refactoring, appropriate review, and, as Michael Feathers puts it in Working Effectively with Legacy Code, “lean on the compiler.”

I have done this with teams: we deliberately ignore much of the rest of the codebase for the moment and decide, “We are going to make this part really good.” Once you have something isolated, it becomes easier to test. Unit testing and even integration testing are really about understanding isolation, loosening coupling, and improving cohesion of code units. Those are the practices that improve your code sense. Many developers these days do not have a clear understanding of coupling and cohesion. They get distracted by principle catalogs that are not very coherent. For example, the SOLID principles do not form a coherent set of design ideas; they are a bunch of things thrown together and they miss many important aspects. I know I will get comments for saying this, but I have been doing this long enough to say that SOLID principles are next to useless if you want to learn how to write good code. You are better off learning and reinforcing the fundamentals.

If you can isolate a small part of the system, that becomes your zone of “new development.” This is a bit like urban planning. In a city you cannot change everything at once, but you can change a particular district. Because that area is separated, you can make it good and benefit from that separation. That is one technique for allowing a team to claim territory and improve not only their testing but also their design. The important idea is that you are not just improving testing habits; you are improving the code itself. Testing and design are not separate activities. Treating tests as something separate is part of the misconception. Tests show you how the code fits together and whether your design is good. If you say testing is difficult, you are actually saying the design is difficult. That is useful feedback: “What do we change so this becomes easier?”

A practical goal to hold is that in six months’ time it should be easier to work on this codebase than it is now. That will involve more than just testing. It will involve changing code, build settings, and all kinds of small things. You are trying to improve the overall situation. Another way to prioritize is to “ask the system itself.” Treat the legacy system as having a body of knowledge and let it tell you what to focus on. If you have a million lines of code, a team of ten is not going to transform it overnight, so do not try. Instead, look at the system’s history. What changed? What keeps changing? Look at the parts of the code that change most often.

It does not matter whether those parts are changing for good reasons or bad reasons. If they are changing frequently, that is where you want to improve both testing and developer experience. If you are frequently changing something, you are more likely to break it, and you are also more likely to benefit from making it better. Parts of the system that are incredibly stable do not need the same attention. That does not mean they are automatically good. Some things are stable because they are terrible and people are frightened to touch them. But if they are not changing, they are no more broken than they were before, and they already “work” in the sense that people rely on them as they are.

So use the system to tell you what to change. The system already has an opinion, visible in its history and defect patterns. Do you have a heat map of where your defects are? That is where you want your tests. In that sense, you can use the legacy nature of the system constructively and positively. I think we often overlook that because it is not immediately obvious, but it is a very practical way to introduce more testing and TDD-like practices into a legacy codebase.
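One way to "ask the system" which parts change most often is to count how many commits touch each file. The sketch below is only an illustration under stated assumptions: it takes text in the shape produced by `git log --name-only --pretty=format:` (one changed path per line, blank lines between commits), and the file names in `sample_log` are made up. It covers change frequency only; a defect heat map would need data from an issue tracker, which is not modeled here.

```python
from collections import Counter


def change_counts(log_text):
    """Count how often each file path appears in a commit log listing.

    `log_text` is assumed to hold one changed path per line, with blank
    lines separating commits.
    """
    paths = (line.strip() for line in log_text.splitlines())
    return Counter(path for path in paths if path)


def hotspots(log_text, top=3):
    """Return the `top` most frequently changed paths, most changed first."""
    return change_counts(log_text).most_common(top)


# Hypothetical history: three commits touching made-up files.
sample_log = """\
billing/invoice.py
billing/tax.py

billing/invoice.py
ui/report.py

billing/invoice.py
billing/tax.py
"""

for path, changes in hotspots(sample_log, top=2):
    print(f"{changes:3d}  {path}")
```

On a real repository the input would come from the version control history rather than a string literal. The files at the top of the list are candidates for the first tests and the first decoupling work.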

8: You have worked with a variety of programming languages, from C++ to higher-level languages like Python. Do you find that TDD plays out differently depending on the language or tech stack?

Kevlin Henney: Yes, it does, but not always for the reasons people might think. Sometimes it is more about culture than language features. Just as natural languages are associated with different cultures, programming languages have associated cultures, idioms, and toolchains. So you have the syntax of a language, but you also have the tools that are available and the habits that have grown up around them.

Culturally, testing as developer testing is far more prevalent in the Java world. There is nothing inherent in the Java language that makes it more amenable to testing than a language like Python, but testing is more likely to be present. That is because modern unit testing, at least in the popular sense, grew up around Java. The JUnit framework appeared in the late 1990s and was integrated with Eclipse. That made it normal for unit testing frameworks to be integrated into IDEs. Java was the language in which those practices and cultural habits were first formed. As a result, Java developers are much more likely to encounter JUnit and similar tools in an integrated environment early in their careers. In that sense, Java is “better suited” to TDD than Python, not because of the language itself, but because of the surrounding ecosystem.

Python, by contrast, does not have a single standard IDE in the same way. If you are working with Java, you are very likely using IntelliJ or a similar environment. If you are using Python, you might be coming from many different directions. If you are a data scientist, you have a different view of the world. Data scientists do not usually use Java; they use languages like Python. With Python you have people who consider themselves software developers, children who are learning to code, people who are scripting, people who are doing data science, and so on. There is not a single core culture, so you end up with disparate practices. Python itself also predates the period when automated unit testing became a strong habit. That is not to say Python developers do not test, but the cultural environment around Python does not have the same unified testing norm as the Java ecosystem. So in that sense, what you see with TDD or testing is often more about development culture, who is around you, and what information and tools are available.

If we move to C, or C++ for classic systems programming and embedded work, we see yet another culture. These are contexts where you are much less likely to find unit testing. If people are testing, they often test at the system level and not even at the integration level, let alone at the level of small units of code. So culturally that is an obstacle to TDD.

Then there are the language characteristics themselves. Python is a much “looser” language; it is dynamically typed, and that can actually make some aspects of TDD easier. I sometimes joke that when I am using Python, I do not need a mocking framework because Python is the mocking framework. Mocking frameworks were invented for statically typed languages like Java, where the language does not easily support meta-level behavior. Those languages are less elastic and less plastic. In Python, I can reshape almost anything. The language itself is a tool that can be used to modify itself. At that level, from a purely linguistic point of view, Python can make testing easier.
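A small sketch of what "Python is the mocking framework" can look like in practice. All names here (`FixedClock`, `greeting`, `Sensor`, `is_overheating`) are hypothetical and chosen only for illustration. The standard library does ship `unittest.mock`; the point of the sketch is how far the plain language gets without it.

```python
class FixedClock:
    """Hand-written stand-in for a clock; hypothetical, for illustration."""

    def __init__(self, hour):
        self.hour = hour

    def current_hour(self):
        return self.hour


def greeting(clock):
    """Hypothetical code under test: it only needs `clock.current_hour()`."""
    return "Good morning" if clock.current_hour() < 12 else "Good afternoon"


class Sensor:
    """Hypothetical dependency that cannot run on a developer machine."""

    def read(self):
        raise RuntimeError("needs real hardware")


def is_overheating(sensor, limit=100):
    return sensor.read() > limit


# Duck typing: any object with a current_hour() method is accepted.
assert greeting(FixedClock(9)) == "Good morning"
assert greeting(FixedClock(15)) == "Good afternoon"

# Reshaping an object at run time: rebind `read` on this one instance.
sensor = Sensor()
sensor.read = lambda: 120
assert is_overheating(sensor)
```

Neither substitution needs an interface declaration or a generated proxy, which is the kind of support a statically typed language usually needs a mocking library for.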

However, cultural habits can get in the way even there. For example, many Python developers, especially in more data-science-oriented contexts, have a habit of reading and writing files everywhere and accessing files in every function. That makes testing harder, and it is something I try to encourage people to stop doing. In C and C++, there are language constructs that encourage longer build times and more source-file dependencies. There are also design habits that do not lead to natural decoupling or obvious substitution points where you can say, “I can easily put something else here because the design allows it.” In those environments, you sometimes have to push uphill against the prevailing culture of the codebase to get to a design that is test-friendly.
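On the Python habit of touching files in every function, one common remedy is to keep file access at the edge and let the logic work on lines of text. The following is a sketch with hypothetical function names, not a prescription from the interview.

```python
import io


def count_error_lines(lines):
    """Pure logic: works on any iterable of text lines, no file access."""
    return sum(1 for line in lines if "ERROR" in line)


def count_errors_in_file(path):
    """Thin I/O edge: opens the file and delegates to the logic above."""
    with open(path, encoding="utf-8") as handle:
        return count_error_lines(handle)


# In a test, an in-memory stream or a plain list stands in for the file.
fake_log = io.StringIO("INFO start\nERROR disk full\nERROR retry failed\n")
assert count_error_lines(fake_log) == 2
assert count_error_lines(["INFO only"]) == 0
```

Because `count_error_lines` never opens anything, its tests run quickly and need no temporary files. Only the thin wrapper depends on the file system.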

So yes, languages can make TDD easier or harder, but only sometimes is that because of the language features themselves. Very often it is due to the surrounding culture: the design culture, the testing culture, and the practices that have grown up in and around that language.

9: The quality of tests is crucial in TDD. What are some best practices you recommend for writing good tests in a TDD cycle?

Kevlin Henney: One of the techniques that I always think of is that your tests should be testing one concept, one idea. That does not mean they necessarily have just a single assertion, but they should have a single focus. What is the thing that you are trying to demonstrate? That should be easily summarized by the test name. This is one of those cases where naming something is not merely labeling; it is actually testing as design, because it will cause you to create different tests if you use a different design approach or a different naming approach.

My preferred habit is to use tests that are propositions. So let us just take this cup, for example. Some developers might say, “I have a constructor,” and they will write a method testConstructor, and then testFill and testDrink. What you are doing is just going through the shopping list of methods and writing a test for each one, and you cannot test like this. There is no way to produce good tests using that technique. I actually demonstrate this when I run training courses. If you just go one method at a time and say, “I am going to write a test for this method,” you cannot write good tests, partly because, in order to drink from a cup, I first need to create it.

So therefore I have already involved the constructor. I am not just testing the drink method. I also need to fill it, so I am using the fill method, and then I can drink from it, and then I need to determine whether or not it became empty. I have just used four different operations there. I am not testing a single operation, I am actually testing the interaction. This is why, when we look at the perspective from BDD, behaviour-driven development, that gives us a different way of understanding what you are after. You are after testing the behaviour. The behaviour is not just in a single method, it is the composition of different methods in different scenarios.

Another reason you do not want to end up doing testDrink, for example, is that I can drink in two different scenarios. I can drink from an empty cup, and I can drink from a full cup. That is not one test case, that is two. They are very different, and they have different outcomes. So the first thing is, if you are currently doing that, it is a huge test smell if I see that pattern. If I just see tests that are “here is a method, here is a test method that corresponds to it” — testA, testB, testC — you do not have the tests. It is as simple as that.

I always lay it down as a challenge to people: show me if you have any counterexamples. Nobody has ever been able to come up with a good example that contradicts that observation. What you need to be doing is testing behaviours, or in some cases we would look at it as testing a property. There is a fluid overlap between these approaches. You make a statement, a propositional statement. By propositional statement, I mean that we describe something and the way that it is.

“A new cup is empty.” “Drinking from a full cup empties it.” These two sentences are the test names. So literally your test name should be as easy to read as if it were a specification, which is what I said earlier. In other words, each test needs to be organized and thought of as a specification with an example. Here is the thing that we are showing. This is the expected outcome in this scenario. This is the behaviour or the property that we are entitled to, and that we are requiring at this particular point.
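As a minimal sketch of that naming style, the cup example might look like this in Python. The `Cup` class, its method names, and the choice that drinking from an empty cup raises an error are all assumptions made for illustration; the interview names only the propositions.

```python
class Cup:
    """A hypothetical cup, used only to illustrate propositional test names."""

    def __init__(self):
        self._full = False

    def fill(self):
        self._full = True

    def drink(self):
        # Assumption for this sketch: drinking from an empty cup is an error.
        if not self._full:
            raise ValueError("cannot drink from an empty cup")
        self._full = False

    def is_empty(self):
        return not self._full


# Each test name states a proposition about behaviour, not the name of a method.
def test_a_new_cup_is_empty():
    assert Cup().is_empty()


def test_a_filled_cup_is_not_empty():
    cup = Cup()
    cup.fill()
    assert not cup.is_empty()


def test_drinking_from_a_full_cup_empties_it():
    cup = Cup()
    cup.fill()
    cup.drink()
    assert cup.is_empty()


def test_drinking_from_an_empty_cup_is_an_error():
    cup = Cup()
    try:
        cup.drink()
    except ValueError:
        return
    raise AssertionError("expected drinking from an empty cup to fail")
```

Note that drinking appears in two tests with two different outcomes, and that every test touches more than one method of the class.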

When we start looking at it like that, you suddenly realize your tests are not just a bunch of assertions with bits of setup. You are telling a story. You are describing the system from a specification-oriented point of view. You are giving people a series of logical propositions. If the test fails — if I say, “A new cup is empty” — that is a proposition. If that test fails, what does that mean? It means a new cup is not empty. I can tell immediately by looking at the test name what is wrong. I might not know why, but I know what. Whereas if I say testConstructor and that fails, I have no idea what that even means.

So the point is, your tests are units of meaning. Or, put another way, they are not just verification, they are communication. You are communicating actual meanings. If your testing philosophy is that you are just poking your code to verify it, you are going to end up with tests named after methods, or even worse — and I have seen this a few times — test1, test2, test3. Honestly, that is not going to help anybody.

You can always tell whether or not a team has really understood or has a good testing habit, because if they are testing like this, there is no way they have a good testing habit. They are doing testing as an afterthought. It does not feel good. I would not like to write tests like that, and if somebody said, “Kevlin, why are you not writing tests?” I am going to say, “Because it feels wasteful and it is annoying, it is frustrating.” If you adopt those practices, it is annoying. I would not want to write tests like that.

So test quality needs to be quite high; otherwise you are going to end up with unmanaged technical debt in your test base as well as problems in your code base. You do not want to double your problems, you want to reduce them. Your tests should be a clear explanation of what your system does in detail, along with intention. For me, that is what I put under the heading of GUTs (good unit tests), and that is a term from Alistair Cockburn. TDD does not miraculously cause you to do GUTs. Again, you need to realize that you are in the driving seat. Having a nice car does not cause you to be a better driver, and I think there are a lot of people who would benefit from that analogy.

Then you need to listen to your tests. What are your tests telling you? Your tests are telling you, “This is not cohesive.” Everything is bound up, and you have too much in one place. If you want to test a behaviour in that, or a related group of behaviours, then that related group of behaviours is its own module or its own class. Why is it hiding inside another class? This is design feedback. Again, sometimes the difficulty of testing comes from the difficulty of the code.

So I would say listen to your tests. My standard answer when people say, “How do I test the private stuff?” — my stock answer is that, generally speaking, you do not. That is a signal that you need to separate something out. You are not dealing with one idea, you are dealing with two ideas, and one of them is hidden inside the other. Pull it out. Do an Extract Class and focus on that. It is clearly important because you value it. You just said, “I want to test these behaviours.” You probably even have words and names for it, but it is hiding embedded inside another class. So give it first-class citizenship and extract it.
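A small sketch of what that extraction could look like, assuming a hypothetical `Invoice` class whose discount rule used to be a private helper. All names and the discount rule itself are invented for illustration.

```python
class DiscountPolicy:
    """Formerly a private helper inside Invoice; now a class with its own tests."""

    def __init__(self, threshold, rate_percent):
        self.threshold = threshold
        self.rate_percent = rate_percent

    def discount_for(self, subtotal):
        # Amounts are whole cents, so integer arithmetic is enough here.
        if subtotal >= self.threshold:
            return subtotal * self.rate_percent // 100
        return 0


class Invoice:
    """Invoice keeps its public behaviour and delegates the discount rule."""

    def __init__(self, discount_policy):
        self._discount_policy = discount_policy
        self._line_totals = []

    def add_line(self, amount):
        self._line_totals.append(amount)

    def total(self):
        subtotal = sum(self._line_totals)
        return subtotal - self._discount_policy.discount_for(subtotal)


# The extracted idea is tested through its own public interface.
def test_a_subtotal_below_the_threshold_earns_no_discount():
    policy = DiscountPolicy(threshold=10_000, rate_percent=10)
    assert policy.discount_for(9_999) == 0


def test_a_subtotal_at_the_threshold_earns_the_discount():
    policy = DiscountPolicy(threshold=10_000, rate_percent=10)
    assert policy.discount_for(10_000) == 1_000
```

Nothing private is being tested any more: the behaviour that mattered has a name and an interface of its own.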

At the same time, I recognize that there is a point here. If I told you that and you have a major release tomorrow, that is probably not helpful advice from me. So that is why I do not say, “Do not test,” or “You should never test private stuff.” What I say is that you should take it as a signal, and you probably do not want to do that. So in those cases where we need a little pragmatism, I would say, “Yes, I am going to weaken the encapsulation on the class in one way or another, but I need to put a huge deprecation notice on it, or a note saying, ‘This is technical debt I need to manage.’”

If I have ignored that warning three times, take it as a “three times and you are out” rule. If I keep coming back to the same code, and my colleagues and I keep coming back to the same code and saying, “Yes, we said we would fix this,” now you need to properly schedule it as a piece of work, because you are always working around it. You are not working with your code, you are working around your code.

That would be something I would say. Again, that is not really so much a tooling thing as understanding what your test is telling you. When it comes to mocking, I do not have any strong opinions, except that most people mock too much, rather than understanding that excessive mocking is an indication that you have a problem that you should be solving. Do not lean into it by adding more mocking. Lean into it by asking a different question: “How do I mock less? What rearrangement of interfaces or class responsibilities would make this easier?”

I generally think that people use too much mocking anyway, even in quite good designs. There are simpler ways of looking at it, and they confuse themselves. So you end up with a lot of mocks and a lot of mock noise, which is not to say that mocking is not useful. It is just that most of the time, I think the guidance I gave to one team years ago still holds: if you are not mocking, you probably need to learn how to mock. Learn how to mock. But if you are already mocking, you probably need to learn to mock less.

I do not have specific feedback on mocking tools, except to say that sometimes I do not find them particularly necessary because of the language. I made the comment about Python earlier. In some languages the language itself is effectively the mocking framework. So for me the emphasis is less on specific tools and more on what your tests are telling you about your design, your responsibilities, and your coupling.
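One way to read that remark about Python, sketched under assumptions: a dependency can be passed in as a plain callable, so a test substitutes it with a lambda and needs no mocking library. The `Greeter` class and its `current_hour` parameter are hypothetical.

```python
class Greeter:
    """Hypothetical class whose behaviour depends on the time of day."""

    def __init__(self, current_hour):
        # current_hour is any zero-argument callable returning an hour (0 to 23).
        self._current_hour = current_hour

    def greeting(self):
        if self._current_hour() < 12:
            return "Good morning"
        return "Good afternoon"


# A lambda stands in for the real clock; no mocking framework is involved.
def test_the_greeting_before_noon_is_good_morning():
    assert Greeter(current_hour=lambda: 9).greeting() == "Good morning"


def test_the_greeting_from_noon_onwards_is_good_afternoon():
    assert Greeter(current_hour=lambda: 12).greeting() == "Good afternoon"
```

Production code could pass something like `lambda: datetime.now().hour`, while tests stay fully under the developer's control.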

10: Does tooling make or break the TDD experience?

Kevlin Henney: If you are able to establish the workflow, and the code that you are working with has the right properties, or you are moving in the direction of it being loosely coupled and highly cohesive — you are using good, classic design practices to organize your code — then you are going to get most of the experience that you need, and that will not change too much between testing frameworks.

I used a technique years ago where I would get people initially testing with just plain asserts, just a straight assertion, whatever is available in the language or library, without using a testing framework. Then I would get them to refactor towards a framework. That actually turns out to be quite useful. One company did this for their C and C++ code and actually created a framework that they then controlled, which was very useful for their embedded environment. It is not something I ever really did with the Java folks, and I occasionally do it, sometimes as a bit of fun with Python. But I do not do that very much anymore because these are solved problems.

The point of that exercise was to show people that, first of all, the fundamental ideas in a testing framework are not too complex, but also that you would be surprised how little you need to get a testing workflow. But that said, I like to have a testing framework that supports a number of basic features. Obviously, when a test fails, I want to continue with the rest of the tests. I want to be able to have parameterized tests so that my tests can be data-driven.

Any testing framework that does not support that in 2025 is, in my view, an interesting beta, but it is not yet a proper testing framework. It is a 2000s testing framework. I like to have a testing framework that allows me a way of organizing and grouping tests easily. These features streamline the overall testing experience, but they also allow you to have more expressive tests.

Whether that makes or breaks TDD, I do not think it goes quite that far, although I can imagine being sufficiently frustrated in some cases that it would break. However, I think good tooling improves the experience dramatically, and if you get a better experience, you are going to do more of it. There is something to be said for that: good tooling can encourage better habits and more frequent testing.

In terms of specific tools or frameworks that I personally like using for TDD, I have mentioned some of them already a couple of times — the ones that I think flow best for me. Obviously some of these are going to be a matter of personal experience. If I am using Java, then JUnit 5 is my choice, and that is actually a little bit different from JUnit 4. I found JUnit 4 got in my way a little bit, but JUnit 5 has just enough that allows me not to be working around the framework.

In the C++ space, I mentioned Catch as the framework of choice. I would also encourage the use of Catch for C. In other words, if you are in an environment where you are doing the production code in C, do your testing in C++, because the tools are generally more powerful. That is a common pattern anyway, but I would use Catch there. It allows you to be much more specification-oriented.

There is no surprise, I think, if I say that I am comfortable using Jest with TypeScript and JavaScript. With C#, I have already mentioned NUnit as being the one of choice. Occasionally I will do work in languages where I have less familiarity with frameworks. I did something for a client a couple of years back in Ruby and we used RSpec, and I was quite impressed with RSpec. It had been years since I had used it. I found it still a little limited in some senses, but I also found that I could create some really nice testing approaches with it.

My opinions on that and other frameworks are slightly less strong, but those are the ones that normally stand out. There is nothing unusual in that list. The key point is that the workflow and the design properties of your code matter most, and the tooling, when it supports that well, can significantly improve your TDD experience.

11: Putting TDD in the context of overall testing, how does TDD fit with other testing practices on a project? You have talked about it and hinted at this briefly throughout, but if you were to just focus on this aspect, how do you see it?

Kevlin Henney: Yes, I think normally when we talk about TDD, we tend to lean towards the unit testing side, because that gives us the fastest feedback cycle. There is no single standard definition of what a unit test is, but the one that is perhaps most widely used and accepted is very much about the isolation question: can I isolate a piece of code from external dependencies, external runtime dependencies?

If the answer is yes, then that is a unit. It is not about language constructs. It is not “is it a class, is it a module, is it a function,” whatever. It is about isolatability, and the idea that I am not going across a significant boundary of communication. I am not hitting the file system, I am not communicating with another service. The idea is that I am contained and therefore, as a nice consequence, it is going to run fast, but also I control everything about it. You do not control the file system, you do not control the network. Those are outside your control. They may be under your influence, but not your control. So if we define “unit” from that point of view, then we get a fast feedback cycle and it tells us something about the internal structure and looseness of coupling and the strength of cohesion of what we are building.
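A small illustration of that boundary, with invented function names: the logic accepts any iterable of lines, so its tests never touch the file system, and only a thin wrapper crosses the boundary.

```python
def count_non_blank_lines(lines):
    """Pure logic: works on any iterable of strings, no file system involved."""
    return sum(1 for line in lines if line.strip())


def count_non_blank_lines_in_file(path):
    """Thin wrapper that crosses the boundary; exercising it is an integration test."""
    with open(path, encoding="utf-8") as handle:
        return count_non_blank_lines(handle)


# Unit tests: fast, and everything they depend on is under our control.
def test_blank_and_whitespace_only_lines_are_not_counted():
    assert count_non_blank_lines(["first", "", "   ", "second\n"]) == 2


def test_no_lines_means_a_count_of_zero():
    assert count_non_blank_lines([]) == 0
```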

From that point of view, it feels like TDD is very much in the unit testing space. But that said, exactly the same workflow works for integration tests. There is nothing different there. You can use the same workflow. All of the same test recommendations pretty much apply, but I will probably be using other aspects of my unit testing framework. For example, if I can pull in data files, then it starts becoming a little more serious. It just means that when a test fails or when a test passes, I cannot guarantee that the reason it passes or fails is only to do with correctness of code. If your network is down and what you are testing involves wandering across the network, your test will fail and it will not be because your test is wrong or your code is wrong; it will be because the outside world is wrong. It is to do with the nature of feedback, and it will also be a bit slower. But everything else about it—the sensibility, the mindset, the structure, the naming, the partitioning—all of these other things are kind of the same.

Then we hit things like acceptance test-driven development, ATDD, which I always find difficult to say. It is a lot easier to write it down. Acceptance test-driven development is where we are actually looking just beneath the UI skin of an app, so potentially very much end-to-end without UI interaction, or at an integration level, but it is still code testing code. Clearly, the small iterative steps will not be as small. They are probably very much more at the feature level. We are looking not at minutes to hours, but hours to days before a certain test may pass, and that is acceptable. The same sensibility applies. This is also where many people place behaviour-driven development, although behaviour-driven development is a philosophy that also applies to unit tests. Many people think of BDD in this higher-level space. I want to be very careful to say it is not just in that space; it is just the way that many people approach it. But again, that can lead you to structuring your workflow in the same way, so you can see there is a sympathy between many of these kinds of testing.

We can also look at other forms of testing. Contract testing is where we would say we are testing external code. Historically I call that conformance testing. For me, contract testing is what you always do because you are testing the contract of the class; you are defining it. It is not about third-party code, but the term has come to mean code that is external to the code that you are writing and that you want to test conforms to your expectations. I think that is a really important point because it is complementary; it is not the same. It addresses an issue that sometimes people have when they are writing TDD. Let us say, for example, that I am using somebody’s cup framework that I have obtained from GitHub, and maybe I have had some bad experiences with that framework in the past and there were some bugs in it. That is annoying, because I am trying to focus on my code, but I am discovering bugs in somebody else’s third-party code.

The problem is that you end up putting an extra little test into your tests just to check that the bit that broke before does not break again, and so on. You end up with a lot of tests that are “drive-by” tests. In other words, you are testing your thing, but you are also testing this other thing. Do not do that. Your test has now got mixed responsibilities. You want to separate that. That is where contract or conformance tests fit in. The idea is that you want to say, “Everything we depend on that might cause a problem, we have tests that check that.” If those tests fail, we do not even bother running our tests, because there is no point. If the foundation of what we are building on does not work, then why would we bother testing our code? The foundation is already broken. So the idea is that there is a clear separation, and those tests are written in what we might call a defect-driven style rather than a test-driven style: “This is not working,” or “It was not working historically,” so I will write a test to make sure the latest version, or the versions we use in future, are always working.
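The separation described here can be sketched roughly as follows. `VendorCup` is a stand-in for the third-party cup in the example, not a real library, and `Cafe`, `failures_in`, and `run_build` are hypothetical helpers written for this illustration.

```python
class VendorCup:
    """Stand-in for the third-party cup from the example; not a real library."""

    def __init__(self):
        self.full = False

    def fill(self):
        self.full = True

    def drink(self):
        self.full = False


class Cafe:
    """Our own code, built on top of the vendor's cup."""

    def __init__(self):
        self.served = 0

    def serve(self):
        cup = VendorCup()
        cup.fill()
        self.served += 1
        return cup


def vendor_cup_is_empty_after_drinking():
    """Conformance test: pins down vendor behaviour that broke in the past."""
    cup = VendorCup()
    cup.fill()
    cup.drink()
    assert not cup.full


def serving_a_customer_is_counted():
    """Our own test: about our code only, with no drive-by vendor checks."""
    cafe = Cafe()
    cafe.serve()
    assert cafe.served == 1


def failures_in(tests):
    failed = []
    for test in tests:
        try:
            test()
        except AssertionError:
            failed.append(test.__name__)
    return failed


def run_build(conformance_tests, own_tests):
    """Run conformance tests first; skip our own tests if the foundation is broken."""
    broken = failures_in(conformance_tests)
    if broken:
        return {"conformance_failures": broken, "own_failures": None}
    return {"conformance_failures": [], "own_failures": failures_in(own_tests)}
```

In a real pipeline the same gating would more likely be expressed as two stages, with the second stage depending on the first.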

We may also have other tests like performance tests. Performance tests are typically going to be something slightly different because they follow different experimental design. They have a different workflow. If I drink from a full cup and it does not empty it, then that is a bug straight out. But if I say that we have a particular availability or there is a particular performance limit, then statistically we may find that sometimes when we run the test we pass and sometimes when we run the test we do not, because of the way the operating system schedules and so on. We are not dealing with something that is simply about true and false. We are dealing with something that is better or worse, and there is a kind of grey area. We really want 90 percent of the time to be in this performance space and we will tolerate 10 percent outside it. That suggests that the nature of our test requires a different philosophy. We can pass and fail at certain limits, but not in the same way, and we do not just run it once and say, “That passed.” We need to draw from different samples, sometimes scaling-based samples. Those tests feel very different in that sense. Again, they are complementary, but not in a way that fits with our TDD; it is actually quite separate. They are testing behaviors that are outside the basic semantics. They are testing performance characteristics and so on, and that requires a different mindset, I feel.
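As a rough sketch of that different philosophy, a performance check can take many samples and ask what fraction fall inside a budget, rather than asserting on a single run. The function names, the 90 percent default, and the sample durations in the usage note are assumptions for illustration.

```python
import time


def sample_durations(operation, runs):
    """Time the operation several times; a single run proves very little."""
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        operation()
        durations.append(time.perf_counter() - start)
    return durations


def fraction_within(durations, budget_seconds):
    """Share of samples that finished inside the time budget."""
    within = sum(1 for duration in durations if duration <= budget_seconds)
    return within / len(durations)


def meets_performance_target(durations, budget_seconds, required_fraction=0.9):
    """Pass when enough samples are within budget, tolerating some outliers."""
    return fraction_within(durations, budget_seconds) >= required_fraction
```

With nine samples at 10 ms and one at 500 ms against a 50 ms budget, the target is met; with two slow samples out of ten it is not. A real performance test would also need to control for warm-up, machine load, and sample size.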

So for me, TDD sits largely in the automated developer testing space, mostly centered on unit tests, but the same workflow and sensibility extend into integration tests and acceptance-level tests. Around that, we have complementary practices such as contract or conformance tests, characterization tests for third-party APIs, and performance tests with their own experimental mindset. All of that lives together in the broader testing picture on a project.

12: Maintaining TDD discipline under pressure: from a team leadership perspective, how can leads or senior developers encourage the team to stick to TDD when deadlines are tight or when people feel tempted to “just code it and test later”? Are there any habits or cultural practices that help sustain TDD in the real world of rapid timelines?

Kevlin Henney: Yes. I think the thing is, again, it comes back to whether or not it is your idea, whether or not you feel you own that idea. If TDD is something you only do when the team leader is in the room, and the minute they walk out of the room you stop doing it, then you do not have it. Your team is not doing TDD. It is a kind of performative TDD. You are doing it because you are supposed to, and that is understandable. But it means that the minute you feel any kind of pressure, you are going to throw that out of the window. We see this with a lot of different practices; it is not unique to TDD.

The point is that you need to get to the point where it is a habit and you embody it. You know, “This is what we do.” Also, if you have enough experience, you start realizing that the minute you start throwing out certain disciplines, you are going to pay for that later. This is where our unmanaged technical debt comes from. It does not come from some kind of magic genie that pops into your code. Well, actually, maybe agentic AI can reduce the quality of your code while your back is turned, but the point is that technical debt does not magically appear in your code. You know it got there for a reason.

People often like to say that certain practices only work in certain cases. Honestly, there is a truth to that, but the chances are that whatever you are working on is not so special that practices you normally find useful suddenly stop applying. If you are already finding TDD useful, lean into that. Lean into it a bit more. If you are not, then that is a different discussion. But from the team lead perspective, the job of a team leader is not a controlling role. It is a leadership role. Leadership is not about managing; it is mostly about example, about enabling, about making people see opportunities and making it somehow easier for them to try the right thing than to try the wrong thing.

In some cases, TDD may be a good thing for them. That is great. How do we make that feel like it belongs to them and it is their practice, not the team lead’s practice? Not the organization’s practice, but my practice. How do I make it my practice so that when I start on a new team, that is how I work? When I go for an interview, that is how I describe what I do, because it is my practice, not the team’s practice or the organization’s practice or the team leader’s practice. This is not a practice that belongs to that person or that entity; it is my practice.

For me, that is the skill, which means that there is no easy answer. I am afraid if anybody is watching and hoping for an easy answer, there is not one. But that is the skill and also the subtlety in it: moving from performative compliance under pressure to a place where TDD is something the developers feel they own, something they practice because it helps them, so they are less likely to abandon it when deadlines are tight.

13: AI can draft tests fast, but quality is uneven. What acceptance gates—e.g., minimum mutation score (PIT), property-based invariants, automated test-smell checks, and explicit review rules—would you require so they increase fault detection? How would you enforce this in CI to scale safely?

Kevlin Henney: I would actually take a step back before talking about specific acceptance gates and ask why you are using AI in the first place. Why are you using AI to generate tests? What problem are you trying to solve by doing that? A lot of teams cannot answer that question. They say, “We are using AI because we were told to use AI,” and then we are right back at, “You were told to do stuff; this is not your practice, this is somebody else’s practice.” For many people the story is, “We do not have many tests,” so now they generate a lot of tests with AI, but they do not understand those tests. They do not know what is being tested, or whether the tests are correct.

Recently I wrote a blog post called “Think For Yourself,” where I gave people four things to consider whenever they want to integrate anything that is AI generated into their code base. The first question is: does it work? Do the tests pass in a way that gives you confidence that they are actually verifying the right behavior? The second question is: do you understand the generated code or tests? If you do not know that something works, or you do not understand it, step away. Do not pretend that you are being productive.

There is a common illusion here. Somebody will say, “AI has boosted my productivity,” and then you ask them how they know. How are they measuring that? The answer is often that they are not measuring it. They just have the feeling of speed because the AI produces a lot of stuff in ten minutes. Then they spend the rest of the week fixing it. That is not productivity; that is the creation of legacy. In my view, one working definition of legacy code is “code written by somebody else.” AI-generated code fits that definition perfectly. I am not saying, “Do not use AI.” I am saying that good use of AI requires more understanding, not less, and that requires tests and review.

The number of times I have been fooled or could have been fooled if I did not have tests is significant. That is why I write my own tests. I do not trust AI to do a better job than I can, because I still have to explain the behavior I want. In the time it takes me to explain that precisely enough to an AI, I could have written the tests myself, and I would know exactly why I wrote them and what design decisions they encode. AI might be useful for generating certain coverage-oriented tests in situations where coverage is very poor, but even there I have questions. If I am using AI to generate tests as well as code, I must spend most of my time reviewing, and reviewing is a skill that many people do not have.

I spent years learning how to review: fiction, non-fiction, technical material, books, articles, and code. Code review is not “I glanced at it and it looks good to me.” That is not review. If you do not understand the generated code and tests, you have a problem waiting to happen. The upside is that you can treat AI as a teacher as well as a generator. If you are using AI-generated tests, ask yourself: do you understand what your code is doing when viewed through those tests? What can you learn from them?

Then there is the question of taking control. What is the difference between the generated code or tests and what you would have written? If you were to write that test yourself, what would you have done? They might be similar, but they will often be different. Understanding that difference is an education. Sometimes I look at the generated result and think, “That is a really good way of doing it.” In other cases I look at it and think, “That is not a very good way of doing it at all.” Either way, I have learned something.

So my final recommendation in that space is to add one more gate: can you think of at least one way to improve what has been generated? Do not treat AI output as something that is simply “good enough to accept.” Treat it as a starting point. That gives you two big groups of questions. The first is: why are you using AI to generate your tests? Do you have a clear understanding of the benefit and how you will measure that benefit? If you do not, do not do it. Being busier is not the same as being productive.

The second group applies if you do decide to generate code or tests with AI. Use this list of four gates: does it work, do you understand what you have, what is the difference between what AI produced and what you would have done, and can you think of at least one way to improve it. If you habitually apply those checks, you will learn a lot. You will be using AI as a possibility generator, not as an autopilot. You will be interacting with it, passing judgment, using your design sense, and either accepting the result because you know why it is good, or changing it because you know what is wrong and how to fix it.

In other words, you turn AI into an assistant or a coach. The problem I see at the moment is that many people are backseat drivers with AI. They have no idea what is being generated on their behalf. They do not understand what is being tested. When they have to fix an issue or extend the code, they discover that they do not know enough, and it takes them longer. They are not using AI in the right way.

So my general advice is this: be crystal clear about why you are doing something, especially with tests. For me, the strength lies in you writing the tests, not in outsourcing them to AI. Tests are your executable specification and your feedback loop. If you hand that over to a tool without understanding or review, no mutation threshold in CI will save you. You may use metrics and gates in your pipeline, but the real acceptance gates are still the human ones: clarity of purpose, understanding, comparison with your own judgment, and deliberate improvement.

14: Finally, looking forward: What do you see as the future of TDD and automated testing practices? Are there emerging trends—perhaps in tooling (like property-based testing, AI-assisted test generation) or in process (like BDD, Continuous Deployment practices)—that you believe will shape how experienced developers approach TDD in the coming years?

Kevlin Henney: If you do not already have an automated testing habit, now is a very good time to start. With AI in the mix, you will find that generated code sometimes fails in ways that are quite intuitive, where you look at the mistake and think, “Yes, I can see how that happened given the training data.” In other cases, the mistakes are very odd: failures where you think, “Why would you ever do that?” You need to become better at testing to deal with both.

Interestingly, this is something I was saying years before large language models. Around 2016 or 2017, at a conference in Poland called MobiConf, somebody asked me about the future of AI. At that point everyone was guessing about where AI would go. My answer was that you need to get better at testing. That is still my answer. The more AI we add, the more testing skill we need. I do not want AI anywhere near company-critical code without tests and without proper reviewing skills.

So one part of the future is skills. You need to get good at testing, and you need to get really good at reviewing. Reviewing is not just a testing skill; it is a design skill. You cannot review code effectively unless you have deep design experience, which means you also need to learn to code. Coding remains relevant, because otherwise you do not know what you are looking at. This is not about language tricks. It is about familiarity with a kind of precision that you normally only see in areas like science and mathematics. Those are the skills you want to build, and they sit in the same space as the precision and specification thinking that testing requires.

In terms of the specific workflow of TDD, I find it hard to make strong predictions. My sense is that TDD adoption will always be relatively low compared to the overall population of developers. As things stand, many people still do not test at all. There is a significant and increasing proportion of developers who do test, and that has changed over the last couple of decades. The needle has moved. Within that group there is a smaller subset who will try TDD, or have TDD as part of their toolkit and can employ it when appropriate.

I would like that number to go up. Leaving AI out of it for a moment, I think TDD is a good practice. I have thought that for a long time. It is helpful because it encourages incremental thinking and clarity. Done with the right sensibility, it leaves behind something worth inheriting, rather than something people curse you for.

If you bring AI back into the picture, those same qualities remain valuable. Clear tests, incremental feedback, and a strong sense of specification help you reason about AI-generated code and about changes in general. I would like to think that AI might even increase the uptake of TDD, because it will force people to confront questions about correctness and understanding more directly. But whether that happens, and to what extent, is difficult to predict.

So my view of the future is less about a specific new tool or fashionable acronym and more about emphasis. We will see more AI and more automation, but the teams that thrive will be the ones that double down on testing skill, review skill, design sense, and the ability to work with precise specifications. TDD is one of the workflows that aligns naturally with that direction, and that makes it a practice that will continue to be relevant, even if it never becomes universal.

Architecting AI Software Systems for the Real World: A Conversation with Imran Ahmad

Divya Anne Selvaraj — Wed, 03 Dec 2025 11:18:19 GMT

AI systems are now everywhere in software, but turning a promising model into a reliable, cost-effective, and sustainable product is still hard work. Teams are discovering that “just add a model” is not enough; you need end-to-end architecture that can take an idea from a lab-style proof of concept to a production system that meets real constraints around cost, latency, security, and operations.

Imran Ahmad is a data scientist, educator, and author focused on algorithms, AI, and cloud computing. He leads machine learning projects for the Canadian government, teaches at Carleton University, and is an authorized instructor for AWS and Google Cloud. With Packt, he has authored 40 Algorithms Every Programmer Should Know (2020), 50 Algorithms Every Programmer Should Know (2023), Architecting AI Software Systems (2025, with co-author Richard D Avila), and the upcoming 30 Agents Every AI Engineer Should Know. Outside of work, he enjoys photography, biking, and mentoring developers through his Discord community and workshops.

In this conversation, we dig into how Imran thinks about AI architecture in practice: from the fundamentals of good software architecture and elastic cloud patterns to the “five pillars” he uses to evaluate AI systems, namely security, reliability, performance efficiency and cost optimization, sustainability, and operational excellence. We discuss separating data and compute for sustainability, designing differently for heavy training workloads versus real-time inference, and avoiding hard coupling to any single AI or cloud vendor. Imran also shares his perspective on agentic AI and agentic RAG, what changes as AI becomes a core concern for software architects, and why UX, cross-functional collaboration, and long-term operational thinking are now central to successful AI systems.

About the Book

1: What inspired you to write Architecting AI Software Systems now, and what gap in the industry or knowledge are you hoping to fill with this book?

Imran Ahmad: When you design AI systems, or when you have a Gen AI solution, you need an end-to-end solution, so you have to look at it in its totality. Whenever there is a new technology or a new technical idea, we start by focusing on depth. We develop solutions, experiment with them, and iterate through different versions until we are ready to use them for large-scale problems, beyond toy tasks like telling pictures of cats from pictures of dogs.

Now AI has come a long way, and our development processes have matured. Then we need to deploy these solutions. AI solutions are only useful when they can solve a problem in production, and when you bring these ideas to production they need to work properly from start to end, so you have to design an architecture around them. I will also talk later about how you quantify a good AI architecture. There are five pillars, as we call them: security, reliability, performance efficiency and cost optimization, sustainability, and, perhaps most important, operational excellence. The need for this comes from bringing these ideas to production. That can only be done if we give the design proper thought and bring all the best design patterns to the table, and this is what motivated me to write this book.

2: Your book takes a very practical, architecture-centric approach to AI, doesn’t it? It mentions a structured journey with real-world examples, hands-on exercises, and even a fictional AI system’s architecture as a learning tool. So can you give us a quick overview of the key themes or unique features of this book?

Imran Ahmad: We have divided the book into two parts. The first part is about the fundamentals of architecture. It zooms out and discusses, in general rather than specifically for AI, what the principles of good architecture are.

Of course we are talking about AI, so we start with the fundamentals of AI systems, and this is where we define terms. We define microservice architecture and discuss a couple of actual use cases. We also define terminology such as data lake and data warehouse, and explain why they are so important in the context of designing AI systems.

When we bring these systems to cloud computing, you get elastic architectures, and as soon as you move to elastic architectures, your system can become cost-effective and performant at the same time. Usually, performant systems are not cost-effective. It is like buying a Ferrari: it is one of the fastest cars, but it is also one of the most expensive. What we are trying to do is buy a Ferrari at the cost of a Toyota Corolla. We want systems that are cost-effective and performant at the same time, and elastic architectures get you there, because your systems can expand and shrink based on immediate needs. This is where we talk about why cloud computing is so important. This is chapter one.
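
To make the cost argument concrete, here is a minimal Python sketch comparing a fixed fleet sized for peak demand with an elastic fleet that is resized every hour. The demand profile, the per-server capacity, and the hourly price are invented for illustration and do not come from the book.

```python
import math

# Hypothetical numbers, chosen only for illustration.
REQUESTS_PER_SERVER_PER_HOUR = 1_000
PRICE_PER_SERVER_HOUR = 0.50  # assumed flat hourly price

# Assumed demand over one day: quiet at night, busy during the day.
hourly_demand = [200] * 8 + [4_000] * 8 + [1_500] * 8


def servers_needed(requests: int) -> int:
    """Smallest number of servers that can absorb the load (at least one)."""
    return max(1, math.ceil(requests / REQUESTS_PER_SERVER_PER_HOUR))


def fixed_cost(demand: list[int]) -> float:
    """Fleet sized for the peak hour and left running for the whole period."""
    peak_servers = servers_needed(max(demand))
    return peak_servers * len(demand) * PRICE_PER_SERVER_HOUR


def elastic_cost(demand: list[int]) -> float:
    """Fleet resized every hour to match that hour's demand."""
    return sum(servers_needed(d) for d in demand) * PRICE_PER_SERVER_HOUR


print(fixed_cost(hourly_demand))    # 48.0
print(elastic_cost(hourly_demand))  # 28.0
```

Both fleets serve the peak hour with four servers, so performance at peak is the same; the elastic fleet simply stops paying for idle capacity in the quiet hours.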

Then we talk about the case for architecture. My co-author, Richard, is a great architect, and this is where he brings his 30 years of experience. He talks about the role of the architect, the vision, the style, and, most importantly, the implications if the architecture is not properly designed.

Then chapter three is about bringing software engineering into the picture: what are the software-engineering-specific topics related to architecture? That completes part one. Part two is specific to AI systems: what are the architecture templates and architecture philosophies that are relevant to them?

Here we talk about something called the “concept of operations” and how it is relevant to AI systems. Then, as you mentioned, we cover large-scale use cases. We go over a complete use case so that, if you want to design a RAG system, you know how to apply the ideas developed in this book to an actual problem.

3: What key skills can readers expect to gain from the case studies and exercises, and also perhaps changes in mindset?

Imran Ahmad: The first skill is how to convert an idea into a design. Requirements capture the idea in a formal way, but requirements are usually written by a non-technical person. The first part of the solution is to convert those requirements into a technical design, and that is the architecture. That is skill number one.

The second skill is dealing with surprises. The design that you came up with may not be the perfect one, so you have to iterate. You come up with something and it may or may not work. You then need ways to differentiate between a good design and a bad design. Functionally, both will work; the functional requirements are met. It is the non-functional requirements that differentiate a top-notch design from one that is not that great. This is about looking at how cost-effective your architecture is, how performant it is, how secure it is, how reliable it is, whether you are applying the principles of sustainability, and whether you are bringing operational excellence.

Operational excellence is a concept that we discuss in this book. It means that, when you are designing these architectures, you are looking at the long term. You are using YAML files, orchestrators, and JSON files. You are using parameterization wherever possible so that you can reuse and maintain what you build.

So the second skill is accepting that your first attempt at the design may not be the best one. You have to rethink and evolve. You have to quantify how good your design is, and then, before you start implementing it, it is a good idea to begin with a pilot project. A pilot project is usually about one-tenth of the scale of the original project, but it covers all the critical design parts. It validates them, and then you move towards the full-blown solution.
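
As a small illustration of parameterization, the Python sketch below keeps environment-specific settings in JSON and loads them into a typed object, so the same pipeline code can be reused with different parameters. The field names and values are hypothetical, not taken from the book.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class PipelineConfig:
    """Settings that would otherwise be hard-coded inside the pipeline."""
    model_name: str
    batch_size: int
    max_workers: int


def load_config(raw_json: str) -> PipelineConfig:
    """Parse a JSON document into a typed config; a missing key raises KeyError."""
    data = json.loads(raw_json)
    return PipelineConfig(
        model_name=data["model_name"],
        batch_size=int(data["batch_size"]),
        max_workers=int(data["max_workers"]),
    )


# The same code runs in both environments; only the parameters differ.
# In practice these strings would live in separate files under version control.
DEV_JSON = '{"model_name": "demo-model", "batch_size": 8, "max_workers": 1}'
PROD_JSON = '{"model_name": "demo-model", "batch_size": 256, "max_workers": 16}'

dev = load_config(DEV_JSON)
prod = load_config(PROD_JSON)
```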
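
One simple way to quantify how good a design is, sketched here as an assumption rather than the method from the book, is a weighted score over the non-functional criteria mentioned above. The weights and the 1 to 5 ratings below are hypothetical values a team would agree on for its own project.

```python
# Weights in percent; they must add up to 100. Hypothetical values.
WEIGHTS = {
    "cost": 25,
    "performance": 20,
    "security": 20,
    "reliability": 15,
    "sustainability": 10,
    "operational_excellence": 10,
}


def score(ratings: dict[str, int]) -> float:
    """Weighted average of 1-5 ratings across all criteria."""
    if sum(WEIGHTS.values()) != 100:
        raise ValueError("weights must add up to 100")
    return sum(WEIGHTS[name] * ratings[name] for name in WEIGHTS) / 100


# Two candidate designs that both meet the functional requirements.
design_a = {"cost": 2, "performance": 5, "security": 4,
            "reliability": 4, "sustainability": 2, "operational_excellence": 3}
design_b = {"cost": 4, "performance": 4, "security": 4,
            "reliability": 4, "sustainability": 4, "operational_excellence": 4}

print(score(design_a))  # 3.4
print(score(design_b))  # 4.0
```

Design A is faster, but design B scores higher overall because it is cheaper to run and easier to sustain, which is the kind of trade-off a pilot project can then validate.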

Key Challenges in AI Architecture

4: In the current context, many teams struggle to turn AI prototypes into reliable products. From your experience, what are the main challenges in bridging the gap between a promising AI demo and a production-ready system?

Imran Ahmad: You can look at it from three aspects, and it is a very important topic when you move towards an AI solution. Initially, while you are experimenting, you feel that AI is a silver bullet that can solve any problem. But the success of an AI project depends on three factors: cost, performance, and accuracy.

Let me explain that. Usually, you are applying AI to an existing problem. You already have an alternative solution; you know that things are already working. Now you are upgrading that and bringing AI into the picture. You want to do things in a new way. So, while you are moving to this new world, if I can call it that, first you need to quantify what the effect on the cost is—whether the investment in the AI, the initial investment and then the running expenditures, can be justified by the aggregated cost saving that you expect to get. This is point number one.

The second one is performance, which here really means time. Will your current processes be optimized, and will the work needed to meet the requirements get done faster? Let me give you an example. If a bank manager reviews mortgage applications and the bank now wants to use AI, then perhaps a review that took the manager four hours will take four seconds. That is the time that has been saved. So this is the second dimension.

The third one is accuracy. You have to check whether the new ideas you are introducing through AI will be more accurate than your current systems and processes. You have to make progress in these three dimensions. Maybe the new solution will be more costly, but if you can justify it in terms of time and accuracy, then you may be able to sell the idea to senior management.

I encourage teams to do this right at the beginning. It should not be an afterthought. I have seen people come up with an AI architecture, implement it, and only then run some sort of time-and-motion study to see whether it was justified. By that time, you have already implemented it, you have already invested, and you are already using the system. It is like buying a plug-in hybrid, paying for it, and only then checking whether the purchase made sense. You need to check that before buying the car.
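
The up-front check on the three dimensions can be sketched in a few lines of Python. Only the four hours and four seconds come from the mortgage example above; the cost and accuracy figures, and the decision rule itself, are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Option:
    """One way of running a process, measured on the three dimensions."""
    yearly_cost: float       # amortized initial investment plus running costs
    seconds_per_case: float  # time to handle one case
    accuracy: float          # fraction of correct decisions, 0.0 to 1.0


def compare(current: Option, proposed: Option) -> dict[str, bool]:
    """Report, per dimension, whether the proposal improves on the current process."""
    return {
        "cost": proposed.yearly_cost < current.yearly_cost,
        "time": proposed.seconds_per_case < current.seconds_per_case,
        "accuracy": proposed.accuracy > current.accuracy,
    }


def worth_proposing(current: Option, proposed: Option) -> bool:
    """Assumed rule of thumb: a higher cost is acceptable only if
    both time and accuracy improve."""
    result = compare(current, proposed)
    return result["cost"] or (result["time"] and result["accuracy"])


manual_review = Option(yearly_cost=400_000, seconds_per_case=4 * 3600, accuracy=0.92)
ai_review = Option(yearly_cost=550_000, seconds_per_case=4, accuracy=0.95)

print(compare(manual_review, ai_review))
# {'cost': False, 'time': True, 'accuracy': True}
print(worth_proposing(manual_review, ai_review))  # True
```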

5: How can software architects specifically help ensure an AI proof of concept scales into a sustainable real-world solution?

Imran Ahmad: Sustainability is handled at two levels. If you are using cloud computing, then maybe it is not your job; it is the job of the vendor, whether AWS, Azure, or Google Cloud. But you can still do a lot. For example, in all of these cloud offerings you can subscribe to virtual machines and keep them running 24/7 for a flat fee. The same goes for the servers that you are running in-house. What you need to remember is that these are power-hungry machines.

The second thing is that these days a good design is where we design the compute dimension in an ephemeral way and the data dimension separately. So there are two dimensions: when we talk about the architecture of these systems, there is a data dimension and there is a compute dimension. The design pattern that we suggest is that, first of all, there should be clear bifurcation, so the data and compute dimensions should be separate. The data dimension should be long term — it means your data is stored there for two or three years. The compute dimension should be ephemeral; it should be temporary. When there is a need to process the data, you provision the compute dimension, you process the data, and then you suspend it or just remove it.

Let me give you a simple analogy. All of us work in our favorite word processor like Microsoft Word. When you open Microsoft Word, the file is usually already there. Let us say that you are working on a research paper: the file is there, and whenever you find time, you open Microsoft Word, you work on the file, you store it, and then you go back to your work. So in the data dimension, your paper is stored for those four or five years, perhaps on your hard disk. But the compute dimension is your word processor. Whenever you want to change the paper, you open the word processor — for example, Microsoft Word — you change it, and once you are done, you close Microsoft Word. Now we can have the same design pattern for AI systems.

So whenever there is a need to change something, train a model, change the processing pipeline, or process the data, you provision your compute dimension. The compute dimension should be need-based and the data dimension should be long term. If you follow this clear bifurcation between the data and compute dimensions, your AI system will be cost-effective, it will be performant, and it will also meet the needs of sustainability.
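
The pattern of long-lived data with need-based compute can be sketched with a Python context manager. This is a toy model under stated assumptions: a dictionary stands in for durable storage, and "provisioning" is just an entry in an event log rather than a call to any real cloud API.

```python
from contextlib import contextmanager

# Long-lived data dimension: a dict stands in for durable storage.
data_store: dict[str, list[int]] = {"transactions": [3, 1, 2]}

# Lifecycle log, to show exactly when compute exists.
events: list[str] = []


@contextmanager
def ephemeral_compute(name: str):
    """Provision compute for the duration of a block, then always release it."""
    events.append(f"provision {name}")
    try:
        yield name
    finally:
        # Runs even if the job fails, so compute never outlives its task.
        events.append(f"release {name}")


def run_job() -> None:
    """Process stored data using compute that exists only for this job."""
    with ephemeral_compute("sort-job"):
        events.append("process data")
        data_store["transactions_sorted"] = sorted(data_store["transactions"])


run_job()
print(events)  # ['provision sort-job', 'process data', 'release sort-job']
```

After the job finishes, the results remain in the data store while the compute is gone, which mirrors closing the word processor while the document stays on disk.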

6: There’s a concern about the sustainability and vendor lock-in of today’s AI platforms. For example, OpenAI reportedly reached 10 billion in revenue but is losing around 5 billion a year, a situation some have dubbed a “subprime AI” crisis. Now, if enterprise architects build around such providers, they face continuity and lock-in risks. In your view, how should architects mitigate these risks?

Imran Ahmad: Wherever there is a choice, do not use the proprietary vendor-specific APIs. Almost always we have two choices: you can use the vendor-specific APIs, or you can use a higher-level generic API. I will give you a specific example. When you are working with large language models, you can use the APIs provided by OpenAI. This is choice one, and because you were talking about OpenAI, let us go with that example. The second choice is to use LangChain, which is an orchestrator. If you use the LangChain API, your code talks to LangChain, and LangChain in turn talks to the OpenAI-specific API.

Now let us look at a scenario that is unlikely but can happen: OpenAI goes bankrupt. If your code is not talking to OpenAI directly, the connection from LangChain to OpenAI changes to, perhaps, LangChain to Gemini or LangChain to Claude. That is all that is needed, because your code is not dependent on OpenAI. You can repeat this design pattern for clouds as well, by building on open source technology. For example, if you are using Docker containers, then your cloud platform is just a living space for those containers. If you are using Kubernetes as an orchestrator for Docker containers, that is even better.

All you need then is for your cloud computing platform to act as a Kubernetes enabler. If one of those enablers goes bankrupt, you move to a different one and you do not have to change even a single line of code. But this approach has issues as well. Vendors sometimes provide their best tools through vendor-specific APIs. For example, if you use Google Cloud, one of their most polished tools is BigQuery, which is vendor-specific. The AWS counterpart is Redshift, which is vendor-specific as well.

My recommendation is that the risk of being hard-coupled to a vendor-specific tool still outweighs those benefits, especially now, when things are changing so fast. We should be cautious, and we should put effort into being as vendor-agnostic as possible.
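
The indirection described above can be sketched in plain Python. This is not LangChain's API; it is a minimal illustration of the same idea, in which application code depends only on a small interface and each vendor sits behind an adapter. The class names and the fake responses are hypothetical.

```python
from typing import Protocol


class TextModel(Protocol):
    """The only interface application code is allowed to depend on."""

    def complete(self, prompt: str) -> str: ...


class VendorAModel:
    """Stand-in for an adapter that would wrap one vendor's SDK."""

    def complete(self, prompt: str) -> str:
        return f"[vendor-a] {prompt}"


class VendorBModel:
    """Stand-in for an adapter around a different vendor's SDK."""

    def complete(self, prompt: str) -> str:
        return f"[vendor-b] {prompt}"


def summarize(model: TextModel, text: str) -> str:
    """Application logic: knows nothing about which vendor is behind `model`."""
    return model.complete(f"Summarize: {text}")


# Switching vendors changes one constructor call, not the application logic.
print(summarize(VendorAModel(), "quarterly report"))
print(summarize(VendorBModel(), "quarterly report"))
```

The cost of this design is the one noted in the interview: a generic interface cannot expose every vendor-specific feature, so the adapter layer has to stay deliberately narrow.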

7: Ensuring quality and maintainability in AI software is an emerging concern. Studies find many AI/ML codebases have minimal testing and documentation, often due to “lab-style” development by data scientists. And according to the State of Software 2025 report, only about 1.5% of the code in AI/big-data systems is test code (versus around 43% in traditional systems). Why do you think AI projects often end up with weaker software engineering practices? And what can architects do to instill better rigor—for example, would organizing cross-functional teams of data scientists and software engineers help bridge the gap and improve things like testing, security, and code quality?

Imran Ahmad: It is a new technology. In many cases, people are learning as they are implementing, and this is a byproduct of that. With mature technologies, the test cases have already been established, so we know from other similar projects what the criteria of success are, both in a functional and a non-functional way.

With something new, we first just get it working, and then we need to ask ourselves: what is the best way to test its functionality? For example, if hallucination is a concern, how do we test whether our solution is hallucinating or not? If accuracy is a concern, how do we test that? In a Gen AI solution, the metrics themselves are still evolving, and testing is all about quantifying whether our solution is meeting the agreed-upon goals or not. Those agreed-upon goals are still evolving, and that is one of the reasons, as you said, that these projects are not well tested. Yes, that is a concern, but things will improve.

AI is quite subjective in different ways, so a project may be successful for me, but for you it may be a failure. There is some subjectivity there, and as you brought up, one way of mitigating that is to come up with a consensus among people with different roles and different skills—a data scientist, a data engineer, a project manager, and perhaps a business analyst or a person who is in charge of production. They will have different views of the success of a project.

For a data scientist or an AI engineer, success is mostly about meeting the functional requirements. For a person who is in production, they may have no idea about algorithms, ROC curves, AUC, recall, or precision. For them, success is putting the Dockerized solution on a server and making sure that the non-functional requirements of reliability, security, performance, and availability are met. If it is an application for approval or refusal of a mortgage application, for the person who is in production it is all about whether the service is available or not, whereas for the data scientist it is all about the metrics related to data science.

So we have to bring them all to the table and come up with a consensus: what does end-to-end success for this project look like, both in development and in production? Whenever there is a problem, people do not have to agree on everything, but they have to speak out about what they think of the solution. Then they discuss, they understand each other’s world, and they come up with a consensus—a compromise. Once that is made, you follow that as the criteria of success. This is something that needs to happen, especially for large-scale projects.
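
Once a consensus exists, it can be written down as a single set of criteria that covers both the data science view and the production view. The Python sketch below shows one possible shape for that; the metric names and thresholds are hypothetical.

```python
# Agreed success criteria: ("min", x) means the measured value must be at
# least x, ("max", x) means it must be at most x. All values are hypothetical.
CRITERIA = {
    "recall": ("min", 0.90),                  # data science view
    "precision": ("min", 0.85),               # data science view
    "availability": ("min", 0.999),           # production view
    "p95_latency_seconds": ("max", 2.0),      # production view
}


def evaluate(measured: dict[str, float]) -> dict[str, bool]:
    """Check every agreed criterion against the measured value."""
    results = {}
    for name, (kind, threshold) in CRITERIA.items():
        value = measured[name]
        results[name] = value >= threshold if kind == "min" else value <= threshold
    return results


measured = {
    "recall": 0.93,
    "precision": 0.88,
    "availability": 0.995,
    "p95_latency_seconds": 1.4,
}

report = evaluate(measured)
print(report)
print(all(report.values()))  # False: the model metrics pass, availability does not
```

In this example the project would look like a success to the data scientist and a failure to the person running production, which is exactly the disagreement a shared definition is meant to surface early.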

Designing Scalable and Robust AI Systems

8: AI systems must be built with scale in mind from the start. On the training side, deep learning models demand substantial compute (GPUs/TPUs) and efficient distribution of tasks. On the inference side, serving many users requires horizontal scaling, containerization, and load balancing to keep latency low. What are some architectural strategies you recommend to handle scalability for AI? How do you approach designing for heavy training workloads versus high-volume real-time inference in a production system?

Imran Ahmad: First of all, it depends on the problem you are trying to solve. The scalability requirements are different for different problems. Let me give you an idea. When you are training a model, this is where most of the costs are incurred. You need GPUs—GPUs are expensive—you need CPUs as well, and you need to experiment and train over and over again.

However, scalability requirements in development have two characteristics. Number one is that once the training is done, you do not need those resources anymore. You still need to retrain the model, of course, but if you are developing the solution for two, three, or four months and then your model is trained and in production, all those 20 machines that you brought in will sit there doing nothing. That is why cloud computing is really good there: you can provision resources and release them once you are done, and that is where elasticity becomes important.

Point number two is that there is no hard deadline associated with the training process. At inference, there is a deadline. If you swipe your credit card, the fraud detection result needs to come back within a few seconds. If someone is paying at a restaurant, that person cannot wait for 40 seconds. So at inference you have those deadlines.

When you are training the model in development, there is no such hard deadline; it is more about your comfort factor. If you can live with evolving the solution on a scaled-down system during the daytime, then at night you can submit the full-scale training job before you leave for home. It runs overnight, and you come back in the morning and the solution is there. During the daytime again, you work on a scaled-down system—one-tenth of the size—and you evolve it. If you follow that pattern, you can save a lot of cost. You use the off-hours for training, and in that case you can use a much smaller number of resources. You need to be innovative and creative there.
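
A back-of-the-envelope calculation shows why this pattern saves money. The 20 machines and the one-tenth scale come from the example above; the eight-hour working day and the ten-hour overnight run are assumptions.

```python
FULL_FLEET = 20         # the 20 machines from the example above
SCALE_DOWN_FACTOR = 10  # the daytime system is one-tenth of full size
DAY_HOURS = 8           # assumed length of the working day
NIGHT_JOB_HOURS = 10    # assumed duration of the overnight training run


def always_on_machine_hours() -> int:
    """Full fleet left running around the clock."""
    return FULL_FLEET * 24


def day_night_machine_hours() -> int:
    """Small fleet during the day, full fleet only for the overnight job."""
    small_fleet = FULL_FLEET // SCALE_DOWN_FACTOR
    return small_fleet * DAY_HOURS + FULL_FLEET * NIGHT_JOB_HOURS


print(always_on_machine_hours())  # 480
print(day_night_machine_hours())  # 216
```

Under these assumptions the day and night pattern uses 216 machine-hours per day instead of 480, a reduction of 55%.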

The second part of the equation is scalability for inference. Now let us say the model is trained and put into production. We need to carefully analyze the scalability requirements there. We should not over-design; we should not under-design. Let me give you a couple of scenarios.

Again, take the example of a credit card. Each time you swipe the card, the result—whether it is a fraudulent transaction or a regular one—needs to come back in about two seconds. That is a hard deadline, so you have to make your servers performant enough to meet that deadline. On the other hand, imagine a bank manager who, at the end of the day, just needs to look at a spreadsheet of the transactions that went through and see how many were likely to be fraudulent so they can be reviewed. In that case, the requirement is “end of day.”

There, we do not need real-time endpoints. We can live with batch-mode inference and save a lot of cost. You do not need to provision real-time HTTP endpoints. All you need is to gather your unlabeled data and create a batch—at the end of the day, the top of the hour, the end of the week, whatever granularity works—submit it to the server, and it produces the labels: how many are likely to be fraudulent and how many are not.

So real-time inference is not always needed; if you use it everywhere, it is expensive and you may be over-designing the system. To get scalability right, you have to carefully analyze the requirements first and, based on that, design and architect the system.
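
The two scenarios can be contrasted in a short Python sketch. The fraud rule below is a placeholder rather than a trained model, and the 60-second cutoff between real-time and batch serving is an arbitrary assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Transaction:
    tx_id: str
    amount: float


def is_suspicious(tx: Transaction) -> bool:
    """Placeholder for a trained model: flags large amounts. Illustrative only."""
    return tx.amount > 1_000


def score_batch(batch: list[Transaction]) -> dict[str, int]:
    """Label an accumulated batch in one pass, as an end-of-day job would."""
    flagged = sum(1 for tx in batch if is_suspicious(tx))
    return {"flagged": flagged, "clear": len(batch) - flagged}


def choose_mode(deadline_seconds: float, realtime_cutoff_seconds: float = 60.0) -> str:
    """Pick the cheapest serving mode that still meets the deadline."""
    if deadline_seconds <= realtime_cutoff_seconds:
        return "real-time endpoint"
    return "batch job"


print(choose_mode(2))         # card swipe -> real-time endpoint
print(choose_mode(8 * 3600))  # end-of-day report -> batch job

end_of_day = [Transaction("t1", 25.0), Transaction("t2", 4_800.0), Transaction("t3", 310.0)]
print(score_batch(end_of_day))  # {'flagged': 1, 'clear': 2}
```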

9: Integrating AI into existing enterprise environments can be complex. Teams often need to balance cloud-native AI services with capabilities within the customer’s current on-premises infrastructure so they can leverage existing investments and avoid disruption. How do you evaluate which deployment strategy is appropriate for a given project? What factors—for example, data sensitivity, legacy system constraints, regulatory requirements, or team skills—should influence whether AI systems run on-premises versus in the cloud or in a completely new environment?

Imran Ahmad: OK, so there are two things here. First of all, I suggest that we carefully determine the maturity level. There are four maturity levels, and those maturity levels are about the technical infrastructure maturity and the skill maturity level as well.

Let us imagine a company. There are 30 people working in that company, and they are working on developing a product that deals with recommendations. It is a recommendation engine that recommends products to their existing customers, and they are using some algorithms, but now they want to modernize that. They want to use deep learning, they want to use Gen AI, and they want to use cloud computing.

The first requirement is that they cannot afford any disruption. So, first you need to look at what maturity level you are going for, but you also have the hard requirements that you have to use the existing infrastructure and you have to use the existing people. Then we have to develop a phased approach. Usually there are four phases.

Phase one is where we come up with the plan, looking at the current situation and deciding what the path forward is. This is where we start. In that phased approach, depending on the maturity level, we may say that in phase one perhaps we can move this part of the system and keep the other part on-premises. Using the example of that company, perhaps accounting can stay on-premises, but the algorithms can move to the cloud. That is one thing we can do.

Then we have to figure out how we are going to create a pipeline that can link the on-premises environment with the cloud. Usually what we do is keep redundant systems both in the cloud and on-premises, test them gradually, and then remove the part that is no longer needed on-premises. This phased approach will be vertical and company-dependent, and it reduces the risk. That usually works.

In some cases we do not have a choice. If you are working for a government organization or at a financial company, then sometimes there are regulatory requirements that your data cannot be on the cloud. There are three sectors—usually government, healthcare, and the financial industry—and in these three, some of their data needs to be compliant with existing regulations. It is not impossible, but it is more difficult for them to bring the data to the cloud. For government, sometimes it is not even possible to bring the data to the cloud.

Let me give you an example. There is a tool from IBM called IBM SPSS Modeler, and companies in the banking industry are still using it. If your processes depend on it and it is working fine, you will not get the same level of comfort if you move to the cloud, because you are using a legacy system with a lot of embedded knowledge. All of that embedded knowledge will not be available, so you are tied to your legacy system. Unless and until you are ready to retire your legacy software, there is no way you can move to the cloud.

Then sometimes what happens is that companies, when they say that they will move to the cloud, think mainly in terms of cost savings. I will give you an example here. The Canadian federal government, about four or five years ago, thought that they would move to the cloud, and they started that journey. The infrastructure to support the Canadian federal government is worth billions of dollars. They thought that they would save money, and the study was that it would save about 20% of the cost. That was the initial study.

Now, five years down the road, that did not happen. They moved to the cloud and now they have spent more money. Cost has increased by about 12%. That is the number. And there is a reason for it. The reason is that if you do not make a conscious effort, the simplest architectures in the cloud are not cost-effective. If you run a virtual machine 24/7, it will meet the functional requirements, but it is not elastic—yet that is the easiest solution.

That is why, throughout this talk, we have been talking about the case for architecture: taking a step back and spending some time there. In that example, they initially thought the cost would be 20% less, and it turned out to be 12% more. What that tells us is that we should not rush into the cloud. We should first understand what architecture we need, and once you have that clear architectural vision, you implement it so that you save money in the long run.

If, in a hurry, you have already started with the easiest possible solution, it will be very difficult to change it down the road, when your computing resources are provisioned and your compute dimension, data dimension, and functional dimension are already running. Changing it at that point is difficult and risky, because you are doing something that you should have done a couple of years earlier. That is why there is a whole section in our book on the case for architecture: why system architecture is needed for AI systems.
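
The size of that miss is easier to see when the two percentages are put on the same scale. The short Python calculation below normalizes the original on-premises cost to 100 units, which is an assumption made purely for the arithmetic.

```python
BASELINE = 100  # original on-premises cost, normalized to 100 units

forecast = BASELINE * (100 - 20) // 100  # expected cost after a 20% saving
actual = BASELINE * (100 + 12) // 100    # observed cost after a 12% increase

gap_vs_baseline = actual - forecast                          # in baseline units
overrun_vs_forecast_pct = (actual - forecast) * 100 // forecast

print(forecast, actual)         # 80 112
print(gap_vs_baseline)          # 32
print(overrun_vs_forecast_pct)  # 40
```

In other words, the outcome was 32 points of the original budget away from the plan, and 40% above the forecast figure.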

10: User experience is a critical yet often overlooked aspect of AI systems. Even if the model is accurate, poor UX can block adoption. What can architects and designers do to ensure an AI system delivers a good UX and drives user adoption? For example, what is your view on using user-centered design practices or designing for diverse user needs such as voice UIs and accessibility features? Do you have any best practices for aligning AI architecture with great UX design?

Imran Ahmad: Yes. For UX, we should always think of the system we are designing as a service. If you are a technical person and you have a spouse, a brother, or a sister who is non-technical, think about that person and whether they could use this service.

My brother is a medical doctor, so I always think: OK, the eventual service that I will provide, can he use it or not? Sometimes what happens is that we bring too much technicality to the front. We are very impressed with our own algorithms, our own models, and our own infrastructure, but the end user is a non-technical person. They should not even need to know the details in the data dimension or the compute dimension or which models we are using. That all should be a black box once things are done.

It is a good idea to always try to see, from the eye of a non-technical person, how easy it is for a non-technical person to use it as a service. So think about it as a service. Your solution should be a service to the end user. There are different zoom levels. You can think of your solution as a microservice architecture. Now, microservice architecture is quite technical; it is great for providing abstraction to a data engineer, but not to the end user. We need to zoom out more.

I am into photography, so I give examples from zoom levels. Zoom out more and think of it as a service. At the highest zoom level the user just sees, “This is a service that helps me do X,” and everything else is hidden.

An example is that we sometimes use AI without noticing it. The best example is Google Maps. When you use Google Maps, it uses an optimization algorithm to get you from point A to point B. If you look under the hood (my PhD was in algorithms), optimization is one of the hard areas. A famous example is the travelling salesman problem: you have a list of cities, city one, city two, city three, city four, and you try to find the shortest route that visits all of them. This is an NP-hard problem.
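
To give a feel for why the travelling salesman problem is hard, the Python sketch below solves a four-city instance by brute force and then counts how many tours the same approach would have to check for a larger instance. The cities and distances are made up, and this is not how a production routing service works.

```python
from itertools import permutations
from math import factorial

# Symmetric distances between four hypothetical cities.
DIST = {
    ("A", "B"): 10, ("A", "C"): 15, ("A", "D"): 20,
    ("B", "C"): 35, ("B", "D"): 25, ("C", "D"): 30,
}


def distance(a: str, b: str) -> int:
    """Look up a distance regardless of the order of the two cities."""
    return DIST[(a, b)] if (a, b) in DIST else DIST[(b, a)]


def tour_length(tour: tuple[str, ...]) -> int:
    """Length of the closed tour, including the trip back to the start."""
    return sum(distance(tour[i], tour[(i + 1) % len(tour)]) for i in range(len(tour)))


def shortest_tour(cities: list[str]) -> tuple[tuple[str, ...], int]:
    """Try every ordering of the remaining cities from a fixed start city."""
    start, rest = cities[0], cities[1:]
    best = min(((start,) + p for p in permutations(rest)), key=tour_length)
    return best, tour_length(best)


def candidate_tours(n_cities: int) -> int:
    """Number of orderings brute force must check with a fixed start city."""
    return factorial(n_cities - 1)


best_tour, best_length = shortest_tour(["A", "B", "C", "D"])
print(best_tour, best_length)  # ('A', 'B', 'D', 'C') 80
print(candidate_tours(4))      # 6
print(candidate_tours(20) > 10**17)  # True
```

Four cities need only six candidate tours, but twenty cities already need more than 10^17, which is why practical systems rely on heuristics and approximations instead of exhaustive search.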

So it means that whenever you say, “OK, I want to go from point A to point B,” you do not know that under the hood there is a lot going on. First of all, your GPS location needs to be tracked. Then the destination needs to be there, and the traffic situation needs to be there—what are the real-time traffic conditions on each of the possible routes—so it is dynamic in nature as well. And then you reach your destination, and it asks you for feedback, and we do not even realize that for this simple use case there is so much power being used.

This is the best example. People use it. My daughter can use it; she uses it to go to her school. People will use it if they find the service easy to use, and we do not need to know what is under the hood. That is the UX.

And I will tell you the gap there as well, the one I talked about earlier. To know real-time conditions on those routes, it assumes people are carrying their devices in their cars, and if those cars slow down, it infers that there is traffic congestion. It works most of the time. But where I live in the north, there is a place called Gatineau Park. It is about 80 kilometers long. People cycle there with Google Maps running on their devices, and Google Maps always thinks there is traffic congestion. It is always red. But if you go there, there is no one there. So there will be failures. It is not that algorithms always work.

Still, as a user, you trust it because of the overall experience: it is easy to use, it hides the complexity, and most of the time it works. That is what we should be aiming for when we align AI architecture with great UX design.

Emerging Trends and Future Outlook

11: The rise of “agentic AI” is a hot topic in 2025. We touched on it in the last conversation we had. Major platforms are jumping in—for instance, Microsoft’s new Azure AI Agent service helps orchestrate multiple specialized agents and tools. What might this shift from single AI applications to multi-agent systems mean for software architects? How might architectures evolve to accommodate networks of AI agents that can plan, collaborate, and act autonomously? What challenges should we be prepared for in areas like agent coordination, security, or reliability?

Imran Ahmad: OK, first, let us think about this. Right now, when we design an AI system, the goal is to mimic human wisdom. That is what artificial intelligence is: mimicking human wisdom.

Imagine a person who wants to develop a fraud detection system and wants to get it done by the end of Monday. The first step in the human mind is discovery: OK, what are the requirements, and what are the tools that are available? Maybe there are existing tools, maybe there are friends to ask about which tools exist. In my mind, I will orchestrate. I will use those tools in different ways, I will come up with a plan, and I will start using those tools. Some of those tools will work, some will not, and the solution that I deliver will be the result of using existing tools, being aware of the tools that are available to me, and combining them in a meaningful way.

An AI agent is mimicking exactly this human behavior. An ideal agent should be aware of the tools that are available. Second, it should be able to orchestrate those tools in a way that leads to a meaningful solution. Third, it should be ready for surprises. Just like I can change my plan when something unexpected happens, the agent should be dynamic enough that it can change and re-plan as it goes. These are the three attributes of an agent.
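The three attributes can be sketched in a few lines of Python. This is a toy illustration, not any vendor's agent framework: the tool names, the `plan` function, and the failure being simulated are all hypothetical.

```python
from typing import Callable, Dict, List

def rule_engine(task: str) -> str:
    return f"rules applied to {task}"

def flaky_model(task: str) -> str:
    # Simulates a tool that fails unexpectedly.
    raise RuntimeError("model endpoint unavailable")

# Attribute 1: the agent is aware of the tools available to it.
TOOLS: Dict[str, Callable[[str], str]] = {
    "model": flaky_model,
    "rules": rule_engine,
}

def plan(task: str, exclude: set) -> List[str]:
    """Attribute 2: orchestrate, i.e. pick an ordered list of tools to try."""
    preferred = ["model", "rules"]
    return [name for name in preferred if name in TOOLS and name not in exclude]

def run_agent(task: str) -> str:
    failed: set = set()
    while True:
        steps = plan(task, failed)
        if not steps:
            raise RuntimeError("no tools left to try")
        tool = steps[0]
        try:
            return TOOLS[tool](task)
        except RuntimeError:
            # Attribute 3: a surprise happened, so remember it and re-plan.
            failed.add(tool)

result = run_agent("fraud check")
print(result)
```

Here the preferred tool fails, the agent records the failure, and the next plan falls back to the rule engine. In this sketch the language model would simply be one more entry in `TOOLS`.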

In an agentic system, a large language model is just one of the tools. It is one of the important tools, but right now the large language model sometimes becomes the “king” and everything else is forgotten. What Azure has provided, and what Google has also provided with their own agent solutions—for example, agent spaces, agent design tools—is a way to step back and see these as orchestration platforms. We can zoom out and look at them in a vendor-agnostic manner; essentially, they are all doing almost the same thing.

Now, for architects, the first thing is that they should be aware of these new developments. That is why this book is about the architecture of AI systems. We are entering a time—2025 and 2026—where AI architecture itself is becoming a specialty. You need to be aware of these developments and track them on a regular basis. One way I keep up is by subscribing to good YouTube channels and other high-quality sources. There is a lot of content out there where people give talks but do not really know what they are talking about, so you have to be selective. And you have to recognize that what is relevant today may not be relevant at the end of 2025.

At the same time, some fundamentals do not change: the need for good architectures, the need for performant architectures, the need to create operational excellence, and the need to have data that is reliable. If agentic systems are one way of doing things, they are not the only way. There will always be new ways coming. You should keep an eye on them and keep incorporating new ideas as they come along.

The challenges are very similar to what we saw with Kubernetes. When Kubernetes was introduced, there was so much excitement. I used to teach courses on Kubernetes, and people mainly wanted to learn how to design and manage applications on it; they were less interested in the internals. Now, if you use a managed service like Vertex AI, under the hood it provisions a Kubernetes system for you and you do not need to think about those details; you just use it.

Right now, these agentic systems are like Kubernetes in its early days. They are still being developed, so sometimes they will work, sometimes they will not. But you will see that in less than a year these systems will become mature. As an architect, you should expect that maturity. Things like agents talking to each other should come out of the box; multi-agent systems, where each agent is a specialist with its own piece of wisdom for a particular vertical, will become the norm.

Our responsibility as architects is to start bringing these entities into our architecture and then let the system evolve and mature in the coming months. Some glitches will be there, but over time those glitches will be resolved.

12: Enterprises also grapple with how to integrate their data with AI models effectively. One common pattern is bringing domain knowledge into AI workflows so that models can reason over real enterprise context. What is the right approach for infusing domain knowledge into AI systems? Do you think Retrieval-Augmented Generation (RAG) will remain the dominant architecture for bringing enterprise data into AI workflows, or will other patterns become more prominent as AI capabilities evolve?

Imran Ahmad: Yes. RAG is becoming obsolete in some ways—you are right about that—because context windows are becoming larger and larger, and that can remove the need for RAG. But it also means that with each request you may have to send a lot of information, and that may not be an efficient use of the model.

The advantage of RAG is that it is more efficient. Instead of sending everything, you only attach the right vectors or the right text. So our requests become more focused, and we are not wasting capacity on irrelevant context.
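A minimal sketch of that idea follows, with word counts standing in for a real embedding model. The documents, the query, and the function names are hypothetical. The point is only that the prompt carries the best-matching text instead of the whole corpus.

```python
import math
from collections import Counter

# Hypothetical document store; a real system would hold embeddings instead.
DOCS = [
    "refund policy for wallet transactions",
    "office holiday calendar for next year",
    "chargeback rules for card payments",
]

def vectorize(text):
    """Toy stand-in for an embedding model: word counts."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(query, k=1):
    """Return the k documents most similar to the query."""
    q = vectorize(query)
    ranked = sorted(DOCS, key=lambda d: cosine(q, vectorize(d)), reverse=True)
    return ranked[:k]

def build_prompt(query, k=1):
    """Attach only the retrieved context to the request."""
    context = "\n".join(retrieve(query, k))
    return f"Context:\n{context}\n\nQuestion: {query}"

prompt = build_prompt("what is the refund policy for my wallet")
print(prompt)
```

Only the refund document reaches the model; the holiday calendar and chargeback rules stay out of the request.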

Agentic RAG is a step ahead. This is something that is still being developed, and classical RAG may become obsolete eventually. That is why I was saying earlier that these systems are expected to evolve. But RAG is still important, because you need to understand RAG in order to get to agentic RAG. In the book we have talked about RAG, and I feel that this is the right learning path: learn the simple use case before moving to the more complex one.

Coming back to your question, there are always multiple ways of doing things. You can have agentic RAG, you can have a large context window, you can have what I would call “classical” RAG. There will be an overlap in functionality between these approaches. In that case, it becomes subjective. You have to carefully see what the advantages and disadvantages are for each option, and then choose the approach that gives you the best solution that is available currently.

13: Some say the “AI architect” is no longer just a technologist, but a strategic leader at the intersection of data, infrastructure, and product. How do you see the role of architects changing as AI becomes a core part of software systems?

Imran Ahmad: Yes. So the traditional architect was basically operating in the days of the waterfall methodology, where you had clearly defined phases: your project gets approved, it gets funding, then someone writes the business requirements for you. Then there is a layer of red tape. After that comes the architect, who designs the system—and whatever that person designs is written in stone. Then the technical team needs to implement it, and the criteria of success is meeting that design in the most precise way. Gone are those days.

The reason is that now the architect needs to be involved in the iterative process. When you are doing AI, you are trying new things, you are experimenting, and sometimes ideas will not work. So it means that the role of the architect is more dynamic in nature. As you move towards AI systems, the architect has to be involved in the pilot project; the architect may need to refactor, may need to redesign the data dimension or the compute dimension if they see performance bottlenecks. So the role has become more agile, but the need for the software architect is still there. It is very important—it has become more important than ever.

Let me give you a reason. A large-scale project is like building a home. In some villages, people still build houses without an architect. They have bricks, they have an idea—“let us build a room here, let us put a kitchen there”—and they just start. But in an organized way, an architect first plans: “OK, this is the room, this is the hall, this is the kitchen,” makes a blueprint, gets it approved, and then we start building the home.

Now think about this: if the architecture is wrong—let us say the bedroom was supposed to be on the ground floor because the owner has a knee problem and cannot climb the stairs—but that decision was not captured, and the bedroom ends up on the first floor, then you have a serious problem. You can imagine how expensive and disruptive it is to change the structure after everything has been built. The same goes for large-scale software architecture. The basic templates need to be decided before you build the system.

That is why there needs to be an architect who designs the large-scale components, and then someone starts filling in the details. Otherwise, you end up with very costly mistakes. If you look at some real-world stories—for example, JP Morgan—you will find cases where they designed their system and spent minimal time on architecture. They picked, for example, MongoDB, went ahead with their design, and eight months down the road they realized that this was the wrong choice. There was a loss of revenue, a loss of time, and this is something we want to avoid at all costs.

So the role of the architect in the age of AI is not going away. It is becoming more central: more dynamic, more involved throughout the lifecycle, and more responsible for making sure we do not build the “bedroom upstairs” when the user cannot climb the stairs.

14: What new responsibilities or skills—for example, understanding model behavior, data governance, or AI ethics—should architects cultivate now to successfully design and oversee AI-enabled software in the coming years?

Imran Ahmad: This is essentially about making yourself aware of what technologies are available and what is happening in AI. The architect should not treat AI as a black box or something that is “someone else’s job.” You should be able to understand, at least at a high level, what these AI components do and how they behave.

A key skill is the ability to choose the right AI components under given requirements: which model to use, what kind of data pipeline is needed, what kind of storage is appropriate, and how the compute dimension should be designed. You should be able to look at the requirements and say, “Under these constraints, this combination of components will work best.” That selection ability is very important.

Another responsibility is to understand the implications of AI decisions on things like data governance, security, and compliance. When you bring AI into the system, you are also bringing in new questions: how the data is collected, how it is stored, how it is used for training, how it is monitored in production, and how you make sure that you are meeting ethical and regulatory expectations.

So for many architects, this means retraining themselves in AI. For some, AI is a blind spot at the moment. Closing that blind spot is crucial: keep learning about AI concepts, stay current with the tools and patterns, and build enough understanding that you can make informed architectural decisions. You do not have to be the person implementing every model, but you should be comfortable enough with AI that you can confidently design, review, and oversee AI-enabled systems end to end.

To go deeper on designing robust, scalable AI-enabled systems—from integrating machine learning into existing architectures to managing risks like underperformance, cost overruns, and operational complexity—check out Architecting AI Software Systems by Richard D Avila and Imran Ahmad (Packt, 2025). Through a structured progression of architectural concepts, real-world case studies, and hands-on exercises (including a fictional AI-enabled system you can dissect end to end), it shows software and systems architects, CTOs, VPs of Engineering, AI/ML engineers, and developers how to select the right models and data pipelines, use architectural models to ensure cohesion, simulate and optimize AI performance through iteration, and apply patterns and heuristics to integrate AI into large-scale systems with strong user experience and performance—so you can confidently architect AI-driven products across a range of domains.

Architecting AI-Native Platforms in the Real World: A Conversation with Amar Akshat

Divya Anne Selvaraj — Wed, 19 Nov 2025 10:52:24 GMT

AI is already in the loop for writing code, reviewing changes, and even drafting architecture diagrams—but turning those capabilities into resilient, auditable, production-grade systems in regulated domains is still hard. In payments and financial services especially, architects have to reconcile non-deterministic models with deterministic guarantees around correctness, security, and compliance.

In this conversation, we speak with Amar Akshat—SVP of Architecture at Paysafe Group and author of the forthcoming book Decode the Compiler (Packt, 2026). At Paysafe, Amar has led large-scale modernization and AI-native transformation across payments, wallets, and compliance platforms. Earlier, at Apple, he helped shape the architectural foundations of Apple Pay and contributed to wallet and tokenization frameworks. His work focuses on making architecture itself intelligent—blending principles like CAP, Twelve-Factor, and Zero Trust with AI-driven reasoning and automation.

Over the course of the interview, Amar explains how his teams are bringing AI into the architecture loop through MCPX, ArchX, and “cell” architectures that keep analysis and decision paths safely bounded. We dig into when to keep workflows purely deterministic versus putting an AI in the path, how to structure data, guardrails, and system prompts as first-class design elements, and how to choose between modular monoliths and microservices for AI-heavy workloads. Amar also shares concrete practices around confidence-based routing and trust deltas, prompts-as-code and AI Behavior Reviews, prompt manifests as “Dockerfiles for AI,” cost control with “cache, batch, distill,” and vendor-neutral orchestration via protocols like chat completions and MCP.

Looking ahead, Amar reflects on the skills architects now need and how compiler-level thinking informs the design of AI-driven systems. We close with a preview of Decode the Compiler and why understanding what compilers actually do to our code can change how we reason about performance, optimization, and large-scale architecture.

You can watch the full conversation below or read on for the complete Q&A transcript.

Introduction

1: Can you give us a quick overview of your current focus areas at the AI–architecture intersection? Which lens do you think we should use today—compiler-centric system design or product architecture, and why?

Amar Akshat: Right now my focus is what I call agentic architecture—designing systems that can reason about themselves. At Paysafe, that’s embodied in MCPX, which you talked about.

And we have something else called ArchX, which is basically an AI-native workflow that powers things like onboarding, fraud analysis, and observability, but also reasoning about systems and their own capabilities, such as Zero Trust and the CAP theorem. A lot of what we do today is about codifying architectural experience.

For example, we have trained internal AI agents to analyze architectural decision records, or ADRs, and suggest reusable design patterns, effectively learning from the scars of every project before it.

And when you talk about lens, it’s an interesting analogy you use with compilers and system design. I would want to use the system design lens. You see, architecture isn’t abstract for me—it’s very progressive, it’s very pragmatic, and it has building blocks like servers, data flows, queues, failure domains.

My work sits between things like compiler intelligence and distributed systems and their logic. So if you think about it, the compiler is just an early architect. It takes intent, it optimizes it under constraints, and it produces an executable structure—and that’s the same mental loop I want our AI agents to have when designing complex systems through architecture.

AI’s Impact on the Architect’s Role

2: I think you are probably one of the best people to ask my next question to, because you sit exactly at this intersection, which I think a lot of people are still trying to make sense of. So where does AI practically help architects now, and where is human judgment non-negotiable?

Amar Akshat: I get this asked a lot in my team as well. We have tools like Cursor, for example, or Replit Agent 3, or GitHub Copilot Workspace. They basically act as junior architects today.

They help me generate documentation. They suggest failure patterns from known premises and known previous experiences, and they help me validate deployment diagrams. For example, every good generative AI can create brilliant Mermaid diagrams. So it can start off as your starting whiteboard—where you throw in your constraints and components—and it’ll start with a basic Mermaid diagram that would take an architect a few hours to actually come up with.

At Paysafe, we are using AI during architectural reviews. It will ingest our ADRs from before, diagrams, and codebases, and then it will flag inconsistencies between what we said we would build and what we actually deployed, because it has the whole lineage of design from scratch all the way to deployment. So it can reason and tell you, with evidence, that this was the original plan to be deployed, this was the scale pattern, and what we ended up deploying.

Human judgment, on the other hand, still owns context, risk appetite, regulatory nuance, and product trade-offs—the politics between product and regulatory. That is all still owned by humans. The AI can propose, but humans prioritize.

I know that my business can reasonably make money in EMEA and Europe for now, so I will prioritize regulatory nuances of EMEA and Europe, and then put through my roadmap what will come in the Americas. So that is the beautiful mix between how humans and AI interact in architectural designs.

3: What’s a good mental model for deciding when to put AI in the loop versus keeping a purely deterministic path?

Amar Akshat: If a task mainly benefits from pattern recognition, that’s where putting AI in the loop makes sense. If it instead requires legal, financial, or compliance certainty, we keep it deterministic. I think of AI as a kind of auto-complete for patterns: it can look at data and say, “this is PII,” “this is PCI,” “these are the compliance guardrails you’ll need.” That’s where this kind of AI can work. The parts that demand strict, predictable behavior should stay more deterministic, and that’s where we sometimes choose not to use AI.

Architectural Patterns for AI‑Infused Systems

4: What baseline architecture patterns do you recommend for shipping AI features reliably?

Amar Akshat: The first one is the data postulate, the second one is the guardrail postulate and the third one is the system prompt package itself.

So when I say data, I mean what is the current state of the data that is made available by MCP servers. Such as transactions, such as user records, addresses, etc.

Guardrail is about making sure what is allowed to be done and what is not allowed to be done. Do you want to completely ground the system? Do you want to have fairly only deterministic responses or do you want to use the existing LLMs?

And then the system prompt is about saying, what is my input format and what am I expecting the results to be in? And what are the other nuances I want my system to take care of automatically? So, for example, do you want more deterministic performance or do you want more accuracy? Do you want more transparency or do you want more speed? These are the kinds of trade-offs that we encode into that system prompt package. This also includes things like LangChain and OpenAI or Azure AI Foundry for RAG.

And we have our own prompt manifests for governance. So, each inference has a manifest attached to it, and that is published to a data plane. So you can imagine things like Kafka plus FastAPI. And each inference is observable, so you can actually observe the latency, the accuracy, etc. That is the current model which works for us. Where it breaks is where execution is critical and user experience is critical. If decisions need to be made quickly, then you cannot rely on LLMs. Then we deploy lightweight sidecar models. Things like OpenAI Mini or Llama 3 local for shared and even fraud scoring, which has to be real time in a transaction. We try to do these things in a centralized fashion.

5: How do you decide boundaries for AI components such as separate services, sidecars or embedded libraries?

Amar Akshat: So it is all what I call an architecture based on cells. A cell is the smallest unit of life. So all of our AI deployments are the smallest units of life, each with its own regulatory nuances contained within it.

So, for example, if I’m talking about a wallet cell, everything which can support the wallet—its guardrails, its prompt package, its MCP servers, RAG, plus its fine-tuned models—will also participate in that cell and it won’t leak any data. The idea is the critical path of analysis never leaves the cell boundary, and it only leaves the cell boundary for audit and storage purposes. That keeps us safe first of all, fast, and then deterministic. No other data is going to change the way my wallet cell behaves, for example. The same thing applies to payment execution, the same thing applies to transaction ledgers, and so on and so forth.

6: When would you favor modular monolith over microservices for AI and vice versa?

Amar Akshat: If shared memory and stateful context really matter—say in a conversational commerce system—a modular monolith with well-defined internal modules works best. Imagine two people asking what to buy for Diwali (an Indian festival) in different parts of India, but their shared history and the same product catalog matter for recommendations—that’s a great case for a modular monolith with clear internal boundaries.

If scalability and isolation are paramount, for example in fraud detection, microservices tend to win. AI workloads often oscillate between those two needs. Many of us think of this as the context–isolation trade-off: which is more important for your use case, rich shared context or strong isolation?

Reliability, Safety, and Testing

7: How do you design for correctness and failure isolation when models are non-deterministic?

Amar Akshat: We route by confidence, really. If the model’s output confidence is less than a threshold, it escalates to a deterministic rule-based system, or we bring a human in the loop. We use things like LangSmith and internal logging to track trust deltas per request. We have effective guardrails and fallbacks: prompt validation and schema enforcement. We are a big Python shop for some AI-based workflows, and we use Pydantic plus semantic and sanity checks.
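A rough sketch of confidence-based routing with schema enforcement might look like the following. It is not Paysafe's implementation. The threshold, labels, and fallback rule are assumptions, and the validation is written by hand with a standard-library dataclass to keep the example dependency-free, where the team described here would use Pydantic.

```python
from dataclasses import dataclass

CONFIDENCE_THRESHOLD = 0.8  # hypothetical cut-off; tune per use case

@dataclass
class ModelOutput:
    label: str
    confidence: float

    def __post_init__(self):
        # Schema and sanity checks, done by hand here instead of Pydantic.
        if self.label not in {"approve", "decline"}:
            raise ValueError(f"unexpected label: {self.label!r}")
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence must be between 0 and 1")

def rule_based_decision(amount: float) -> str:
    """Deterministic fallback: a fixed, auditable rule (assumed for the demo)."""
    return "decline" if amount > 1000 else "approve"

def route(raw: dict, amount: float) -> tuple:
    """Return (decision, path), where path records which branch decided."""
    try:
        output = ModelOutput(**raw)
    except (TypeError, ValueError):
        return rule_based_decision(amount), "fallback:invalid_output"
    if output.confidence < CONFIDENCE_THRESHOLD:
        return rule_based_decision(amount), "fallback:low_confidence"
    return output.label, "model"

print(route({"label": "approve", "confidence": 0.95}, 50.0))
print(route({"label": "approve", "confidence": 0.40}, 5000.0))
print(route({"label": "maybe", "confidence": 0.99}, 50.0))
```

Returning the path alongside the decision is one way to make each request's routing visible in logs. A human-review queue could be added as a third branch in the same function.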

So a human only steps in for logic failures, not syntax, really. And we have a comprehensive testing strategy for AI features. For example, one of my cohorts runs drift pipelines. They will evaluate daily by comparing outputs to gold datasets—datasets that are deterministic and known to be correct—and any semantic drift triggers a review. Basically, you have to look at AI prompts as code. That’s it. Our CI/CD basically treats prompts as code. Every change goes through the peer-review process, with automated regression and some kind of sandbox deployment to test these against the gold dataset.
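The gold-dataset check can be sketched as a small gate that a CI job could run on every prompt change. The dataset, the tolerance, and `classify_v2` are hypothetical stand-ins. Exact-match comparison is also a simplification, since measuring semantic drift in free-text outputs would need a similarity metric rather than equality.

```python
# Hypothetical gold dataset: inputs paired with known-correct outputs.
GOLD = [
    {"input": "card declined twice", "expected": "payments"},
    {"input": "cannot log in to wallet", "expected": "access"},
    {"input": "refund not received", "expected": "payments"},
]

MAX_MISMATCH_RATE = 0.10  # assumed tolerance before a review is triggered

def classify_v2(text: str) -> str:
    """Stand-in for the system under test (a prompt version plus a model)."""
    return "access" if "log in" in text else "payments"

def evaluate(classify, gold):
    """Compare outputs to the gold set and report the mismatch rate."""
    mismatches = [row for row in gold if classify(row["input"]) != row["expected"]]
    rate = len(mismatches) / len(gold)
    return rate, mismatches

def gate(classify, gold, max_rate=MAX_MISMATCH_RATE):
    """Return True if the change may ship, False if it needs review."""
    rate, _ = evaluate(classify, gold)
    return rate <= max_rate

print(gate(classify_v2, GOLD))
```

A prompt change that shifts even one of these three answers would push the mismatch rate past the tolerance and block the change until someone reviews it.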

8: You mentioned guardrails and fallbacks earlier. If you had to distill it down to a couple, which guardrails, fallbacks, or human-in-the-loop steps have been most effective in practice?

Amar Akshat: In terms of guardrails and fallbacks, first of all, as I was describing before, guardrails are learned. We learn these guardrails from execution. Every prompt package has a version, and with every new execution and failure we put in more guardrails.

For example, if the AI system ended up putting someone’s email address from the RAG into a response that was meant to be PII-sanitized, we will augment the guardrail to include that sanity check. Those guardrails are implemented by tools like Prompt Security, which ensure that nothing the guardrail filters catch is passed back to the customer.

If you apply a middleware kind of concept like Prompt Security, or any similar tool that lets you apply these guardrail policies before the prompt goes into the LLM and before the response comes back to the user, you will have effectively masked your failure pattern.
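The middleware idea, with a policy applied on the way in and again on the way out, can be sketched as below. This is a toy example and not how any commercial guardrail product works. The regular expression is deliberately simplified, and `fake_llm` is a hypothetical stand-in that leaks an address on purpose.

```python
import re

# Simplified email pattern for illustration; real PII detection needs more.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

def redact(text: str) -> str:
    """Replace anything that looks like an email address with a placeholder."""
    return EMAIL.sub("[REDACTED_EMAIL]", text)

def fake_llm(prompt: str) -> str:
    """Stand-in model that leaks an address pulled from retrieved context."""
    return f"Contact jane.doe@example.com about: {prompt}"

def guarded_call(prompt: str, llm=fake_llm) -> str:
    safe_prompt = redact(prompt)      # policy applied before the model
    response = llm(safe_prompt)
    return redact(response)           # policy applied before the user

answer = guarded_call("My email is sam@example.org, where is my refund?")
print(answer)
```

Both the address typed by the user and the one leaked by the model are masked. Each new failure found in production would become another rule inside `redact` or a sibling filter.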

Human-in-the-loop is usually very, very important when it comes to response quality. Every response has a confidence score, and if the confidence score goes below a certain threshold, a human will come in and try to analyze what was wrong. Was the data too noisy? Was there too much guardrail or too little guardrail? Or was it a format problem: did we come back with bad formats, like something breaking the CI/CD somewhere? So the combination of middleware components like Prompt Security and the usage of guardrails with a human in the loop is very important.

9: Can you describe a sensible testing strategy for AI features covering eval data drift and regressions?

Amar Akshat: I think the testing strategy for AI systems is fundamentally about learning from mistakes. Similarly, we have to make sure the AI learns from its own mistakes. The idea is that we have to monitor things like semantic drift, hallucination rate, and related metrics, and you have to monitor them with real-world data in sandboxes.

And then you have to, first of all, come up with a reasonable notion of success for yourself. So let’s say you are dealing with a lot of complaints. You have an AI system which analyzes your complaints and makes sure they’re being handled correctly. You run it in sandboxes with masked PII so that you have a reasonable testing ground around them, and then per execution you look at things like their semantic drift, hallucination rate, and the trust delta, right?

Every pipeline will come back with these metrics, and those evals plugging into your CI/CD are very important because your prompt is changing—changing just like code on a daily basis—and your prompt changes can sometimes be exponentially impacting your determinism.

Observability and Operability

10: What signals matter most in production for AI features?

Amar Akshat: Yeah. So these signals—basically everything we tested for—now start to matter in production as well.

The first is cost. We are a financial company, we have millions of transactions going through, and a small change in cost per transaction can exponentially impact our revenue or margins.

The second thing is hallucination rate. Each hallucination in something as deterministic as fraud analysis costs us money, because it can lead to incorrect decisions on transactions.

And then the third part is obviously things around the sanity of the whole system itself. You should be making sure that, as you introduce or tune AI, you’re not unintentionally impacting real transactions or degrading the user experience—you might otherwise be causing attrition in your user base. These things matter for us, and we monitor them very closely in our production systems.

11: How do you set up automated and human feedback loops to improve models or prompts without breaking user-reliable behavior?

Amar Akshat: Yeah, so feedback is pretty automated. The agent logs all low-confidence events, a human reviews them, and the agent relabels them accordingly. And as I was telling you, prompts are versioned with Git tags so we can replay failures exactly. Because it is an agent, it can take on asynchronous activities by itself. So what we have today is that every failure is then analyzed by a different model so that we avoid model bias.

For example, a failure that came from an OpenAI model will be reviewed by a Claude Sonnet model. And the feedback we obtain from there is applied asynchronously to the prompt package that went into OpenAI. Over time, we figure out what is working better for us: which model is best at reviewing the failures of a different model? We then form these model couplings, and all of it is tracked via Git tags. So every release has a JSON in it which says, here was the analysis and scoring, here were the recommended prompt or guardrail changes, here is what was applied, and here is the final score. So auditability is incredibly important in our ecosystem because this is real data you’re dealing with, this is real developers’ time you’re dealing with, and sometimes you’re also dealing with real transactional data. So we need to understand which particular change and which particular recommendation gave us transactional benefit across the feedback loops.

Data, Privacy, and Governance

12: How do you protect sensitive data while keeping AI useful?

Amar Akshat: That’s a great topic, and it’s at the top of every executive’s mind in the industry right now. We basically redact personally identifiable data before inference and use hybrid RAG, where private embeddings always stay in-house.

For example, we can use something like Pinecone Local, where it runs as a local instance and private embeddings never leave our environment. Public context is then fetched externally in a secure and deterministic way—for example, a regulatory change, the impact of that change, or human sentiment around a new law. Those external signals are handled in a more deterministic, controlled way.

At the heart of all this is our middleware. That’s where we apply these policies: even if you wanted things like sentiment or PII, it will not flow into the inference layer if we don’t want it to. All AI access is integrated with SAML-based authentication, so we know who is accessing it and can augment their prompt with their role, etc. On top of that, there’s a guardrail middleware where we always apply a particular set of rules based on their role and permissions.

So even if you accidentally put my email address into the prompt, it will be filtered out before it leaves the system. That’s where our middleware stack plays a huge role, along with lightweight governance around prompting. Our prompt manifest defines who owns the prompt, what its data scope is, and its safety rating. You can think of a prompt manifest as a Dockerfile for AI—basically, it’s auditable but still fast to work with.
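
To make the email example concrete, here is a minimal sketch of a redaction step paired with a prompt manifest. It assumes a simple regular expression is enough to catch email addresses, which real PII detection would not rely on alone, and the `PromptManifest` fields and the `prepare_for_inference` helper are hypothetical names based on the owner, data scope, and safety rating mentioned above.

```python
import re
from dataclasses import dataclass

# Simple email pattern for illustration; production PII detection needs much more.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


@dataclass(frozen=True)
class PromptManifest:
    """Hypothetical manifest: who owns the prompt, what data it may see, how safe it is."""
    owner: str
    data_scope: str
    safety_rating: str


def redact_emails(prompt: str) -> str:
    """Replace anything that looks like an email address with a placeholder."""
    return EMAIL_RE.sub("[REDACTED_EMAIL]", prompt)


def prepare_for_inference(prompt: str, manifest: PromptManifest) -> dict:
    """Always redact before the prompt leaves the middleware, then attach the manifest."""
    return {
        "prompt": redact_emails(prompt),
        "owner": manifest.owner,
        "data_scope": manifest.data_scope,
        "safety_rating": manifest.safety_rating,
    }


manifest = PromptManifest(owner="payments-team", data_scope="public", safety_rating="A")
request = prepare_for_inference("Refund status for amar@example.com please", manifest)
print(request["prompt"])
```

The redaction runs unconditionally here, which matches the point that PII does not reach the inference layer even when a caller includes it on purpose.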

And finally, for governance, auditability and traceability are paramount for us. We log every inference as an “architectural replay,” which includes things like model ID, prompt version, and data snapshot. That way, our compliance teams can reproduce any decision path deterministically.
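
A replay record of this kind might look like the sketch below. The `ReplayRecord` type and `make_replay_record` helper are hypothetical; the only fields taken from the interview are model ID, prompt version, and data snapshot, and hashing a canonical JSON form of the data is an assumption about how a snapshot could be pinned.

```python
import hashlib
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class ReplayRecord:
    """Enough information to re-run one inference with the same inputs."""
    model_id: str
    prompt_version: str
    data_snapshot_sha256: str


def make_replay_record(model_id: str, prompt_version: str, data: dict) -> ReplayRecord:
    """Hash a canonical JSON form of the input data so equal data yields equal records."""
    canonical = json.dumps(data, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return ReplayRecord(model_id, prompt_version, hashlib.sha256(canonical).hexdigest())


# Key order in the input does not change the snapshot hash.
first = make_replay_record("model-a", "v12", {"amount": 100, "currency": "EUR"})
second = make_replay_record("model-a", "v12", {"currency": "EUR", "amount": 100})
print(first == second)
```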

13: What is your take on auditability and traceability for AI decisions when regulation applies?

Amar Akshat: Regulation is paramount. I’m dealing with EU regulation and the AI Regulation Act on a daily basis, and basically it goes back to how deterministic you can make your decision paths.

Our whole goal is that anytime anything breaks our determinism, or that score, we either chuck it out of production immediately or we treat it as a P1—like a priority-one incident, right? So any production workflow losing determinism at a given threshold will be treated as a production incident. It is no longer a developer playground or anything.

And because we are able to log the architectural snapshot, the replay of it (the model ID, prompt version, and the data), we can go back to the decision path and change any of these variables to make sure determinism can be achieved immediately. Our Ops teams are actually trained to do this on a daily basis.
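
One way to turn "losing determinism at a given threshold" into something checkable is sketched below. The scoring rule (the share of replays that agree with the most common output) and the 0.95 threshold are assumptions made for illustration; the interview does not say how the score is computed.

```python
from collections import Counter


def determinism_score(outputs: list) -> float:
    """Fraction of replayed outputs that match the most common output."""
    if not outputs:
        raise ValueError("need at least one replayed output")
    most_common_count = Counter(outputs).most_common(1)[0][1]
    return most_common_count / len(outputs)


def classify(outputs: list, threshold: float = 0.95) -> str:
    """Return 'P1' when the score drops below the (assumed) threshold, else 'OK'."""
    return "P1" if determinism_score(outputs) < threshold else "OK"


# Replaying the same request four times and comparing the decisions.
print(classify(["approve", "approve", "approve", "approve"]))
print(classify(["approve", "approve", "approve", "decline"]))
```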

Cost, Performance, and Vendor Strategy

14: How do you avoid provider lock-in without slowing delivery?

Amar Akshat: We try to stick to the protocols the market is standardizing on—for example, the chat completions APIs and MCP. These may start with a single company, but over time they become common practice across the industry. So we abstract orchestration through these well-known protocol APIs.

When I talked about MCPX, that’s essentially our multi-provider orchestrator across OpenAI, Azure, Anthropic, and our on-prem models. The reason this works is that all of them support chat completions–style APIs and MCP-compatible patterns. So as long as any external or internal AI provider follows those APIs and protocols, we’re fine.

On top of that, we put an AI gateway in front. Based on things like request headers or your SAML identity information, we can route you to an Azure model versus an OpenAI model, or to an internal model. That is how we avoid lock-in in practice while still moving fast.
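
The routing decision at the gateway can be pictured as a small policy function. The sketch below is purely illustrative: the `x-ai-provider` header name, the role names, and the default provider are hypothetical, and a real gateway would read identity from a verified SAML assertion rather than a plain dictionary.

```python
ALLOWED_PROVIDERS = {"openai", "azure", "anthropic", "on-prem"}
RESTRICTED_ROLES = {"compliance", "payments-ops"}  # hypothetical role names
DEFAULT_PROVIDER = "azure"  # hypothetical default


def route(headers: dict, saml_attributes: dict) -> str:
    """Pick a provider from identity first, then from an optional request header."""
    # Restricted roles always stay on the internal model, whatever the header says.
    if saml_attributes.get("role") in RESTRICTED_ROLES:
        return "on-prem"
    requested = headers.get("x-ai-provider", "").lower()
    if requested in ALLOWED_PROVIDERS:
        return requested
    return DEFAULT_PROVIDER


print(route({"x-ai-provider": "OpenAI"}, {"role": "developer"}))
print(route({"x-ai-provider": "openai"}, {"role": "compliance"}))
```

Because every provider sits behind the same chat-completions-style interface, swapping the value returned here does not change the calling code.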

15: If we talk about capacity planning and cost control, what has worked for you in terms of caching, batching, smaller models, etc.?

Amar Akshat: I think the mantra is very simple: cache, batch, distill. We use a tiny Llama for high-volume routing tasks and GPT-4 Turbo for design-time reasoning. So if it is dynamic data like customer support or architecture, design, etc., we stay with prompt engineering because, in that case, flexibility beats precision.
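
The "cache" and "batch" parts of that mantra, plus the small-versus-large model split, can be sketched as follows. The `call_model` function is a stub standing in for a real provider call, and the tier names and task kinds are hypothetical; distillation is a training-time step and is not shown.

```python
from functools import lru_cache
from itertools import islice

call_log = []  # records every call that actually reaches the (stubbed) model


def call_model(tier: str, prompt: str) -> str:
    """Stub for a real inference call; it only records that it was invoked."""
    call_log.append((tier, prompt))
    return f"[{tier}] {prompt}"


def pick_tier(task_kind: str) -> str:
    """High-volume routing goes to the small model, everything else to the large one."""
    return "small" if task_kind == "routing" else "large"


@lru_cache(maxsize=1024)
def cached_completion(tier: str, prompt: str) -> str:
    """Identical (tier, prompt) pairs are answered from the cache."""
    return call_model(tier, prompt)


def in_batches(items, size: int):
    """Yield consecutive chunks of at most `size` items."""
    iterator = iter(items)
    while chunk := list(islice(iterator, size)):
        yield chunk
```

Calling `cached_completion` twice with the same arguments reaches the stub only once, which is where the cost saving on repeated routing prompts would come from.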

16: When do you feel training or fine-tuning is worth it versus prompt engineering when it comes to a foundation model?

Amar Akshat: I think if your domain is stable—maybe KYC or risk scoring—the signals are very well known, the domain is stable, and then we use fine-tuning, because it’s a very well-known, stable, signal-based domain.

And as I was saying before, if it is dynamic—if it is changing a lot—Spanish customers complain in a different way, English customers complain in a very sarcastic tone, and Indian customers complain in a very direct tone, sometimes in Hindi or regional languages. Then we stay with prompt engineering, because we have specialized customer teams who know how their customers complain and can create prompts more easily to manage those customer complaints. So yeah, that’s my short answer.

Team, Skills, and Process

17: What new skills should architects or senior engineers acquire in 2025 and beyond to stay effective with AI in the stack?

Amar Akshat: Architects and senior engineers must learn prompt literacy, model evaluation, and probabilistic reasoning. That’s paramount. You don’t need to train models; you need to design systems that can survive their uncertainty.

18: How do you adapt design reviews, ADRs, and incident response for AI-specific risks and ongoing learning?

Amar Akshat: Our design reviews have introduced a first-class concept called AI Behavior Reviews. We explicitly acknowledge that AI behavior is non-deterministic, and we treat that as a first-class part of the review process.

ADRs now capture prompt decisions and fallback strategies as part of the architecture record. And on the operations side, our SREs include an AI SRE role—someone who understands when it’s model drift, not code, that broke the system.

As I mentioned earlier, we’ve trained Ops people to understand the determinism profile of every AI pipeline. So now they can recognize that it wasn’t the code that failed; it was drift in the AI behavior—and they know when to switch off that pipeline or replace it with something else.

Case Study

19: Can you walk us through a recent AI-related technical decision you’ve made: the options, the trade-offs, and how you validated the outcome?

Amar Akshat: That’s actually a very good question. I recently had a very interesting case. We create wallet workflows almost daily, and one of my teams was tasked with designing the checkout experience I was telling you about earlier—for our digital wallet.

This is the same problem I’ve solved multiple times before, in products like Paysafe and Paysafe Checkout. So it’s a problem I know well, and I had a clear sense of where I wanted to end up. What we did this time was use an AI assistant to generate candidate designs and then critique its own designs whenever they broke our zero-trust rules.

Eventually it produced essentially the same Mermaid diagram I would have drawn myself. It compressed many years of my experience into about 35 minutes of assessment, and it did a beautiful job of reproducing that design while honoring the constraints: partition tolerance was paramount, zero trust was paramount, and it respected those.

Then we validated it against failure scenarios—almost like a chaos check. For example: what if the system crashed at point A, B, or C? Does the system remain deterministic? Is the integrity of transaction persistence still correct? As I mentioned, the AI kept iterating until all the constraints were met. Its initial few iterations achieved consistency but not zero trust.

Next time, I plan to pair it with a chaos agent of some sort to automatically explore failure domains, and we’ll see how that goes.

20: Are there any emerging patterns or standards you’re watching that could reshape how AI components integrate?

Amar Akshat: I think all of this starts from orchestration. You can look at things like OpenAI’s protocols, Google’s APIs, or Visa’s agentic commerce protocol—everything starts from orchestration. And when orchestration is involved, zero trust is involved. And when zero trust is involved, deterministic fallback is also involved.

You’re an orchestrator: you’re orchestrating tasks, and you cannot blindly trust anyone in the world. So you apply zero trust, and then you ask, “When something fails, how do I fall back?” That’s where workflow engines come in, and I’m watching patterns that bring those engines together with AI.

I’m especially interested in cases where ambiguity is not known until the ambiguity actually shows up. I don’t think the existence of ambiguity has been mathematically described yet—when will ambiguity occur? When ambiguity occurs, it’s obviously not a clean mathematical situation, but predicting when it will occur is still unknown to systems.

That’s why I want to see chaos agents enter the market—agents whose job is to disrupt AI workflows. Right now we live too much in “happy path syndrome,” where we assume the happy path is the only path that really happens in execution. That is not true; anything can happen, anything can fail.

Every design must still be explainable by a junior engineer, basically. And simplicity is still the ultimate scaling factor. That’s all.

Hot takes

21: According to you, which production metric most correlates with perceived quality?

Amar Akshat: The trust delta.

22: OK, and what’s the smallest useful model card or change log for shipped prompts or agents?

Amar Akshat: We use Microsoft Guidance, and Microsoft Guidance lets you treat your prompt as code. So even the commit messages become the smallest kind of change log that tells you what changed between two versions of a prompt. I would say commit messages, now.

Looking Ahead

23: What constraints or first principles do you feel keep AI projects grounded, and what will look obvious in five years from now about architecting AI-heavy systems?

Amar Akshat: So first principles still apply, as in, any AI project will still not break the CAP theorem. The CAP theorem, when it applies to the determinism of applications—distributed applications—will still apply. So you will have trade-offs when you want consistency and partition tolerance. Availability will suffer irrespective of whether a human or an AI is writing the code or designing the system, right?

So those first principles remain, and an app will still be judged against the Twelve-Factor App principles. AI apps are no exception. They may be self-healing, but their app constraints are still Twelve-Factor. Zero Trust is a model defined to safely execute critical workloads in the world, and that will still continue to apply.

One thing AI will add is the ability to self-heal with ample data and context at hand, which is a great principle we should actually capitalize on and try to create systems which, over time, go towards determinism rather than away from it. And “fail fast” is still very important, right? If something is not working for you—if determinism is not there—we should fail fast rather than have our transactional integrity or our customers suffer.

Looking ahead, I think if all the architects are on the same page, we should start versioning and feeding our contexts back into the AI. All the ADRs should go into the AI. The codebase should be scanned and understood by the AI on a regular basis. And then we should keep ourselves honest whenever the AI tells us that our ADRs and our codebases have diverged, which means we haven’t been true to our architectural design, right?

That will allow our AIs to have even more context in the world, and then they can apply these contextual patterns to create any advanced AI system, right? Any advanced AI system will still have deterministic models and dimensions—it is still working under those same constraints of the CAP theorem, etc. These are solved problems in every nook and cranny of the world. We just have to bring them together in an architectural model—not just a conversation, but an actual architectural model out there—and then let it weigh in with you on your high-scale design as a senior architect.

About the book: Decode the Compiler

24: You’re working on Decode the Compiler. According to you, how does deeper knowledge of compilation or codegen inform how we design AI-driven systems today?

Amar Akshat: Actually, that’s a great question. I’ll start with an anecdote. When I was growing up, I read a book by Yashavant Kanetkar called Let Us C++, and I was taught that when you allocate memory for a pointer in C using malloc, you must always typecast the result to the pointer type you’re using. I kept that in my head; it was my first education, and it stayed with me until I went into the depths of the compiler at Apple, Clang.

And I realized that I should not be doing this extra typecasting, because I am now telling the compiler what to do. The compiler knows what to do. It has seen your system, it has seen your code—however beautiful or ugly it may be. It has known your system constraints. It knows what to do, right? Let it do what it does best. The problem is we don’t understand what it does.

How a compiler makes your for loop efficient, for example, or makes your incrementing variable within the loop efficient, for example—we don’t know. Many of us don’t know that compilers will automatically make some variables register variables in C and C++, and it is very important for us to know that so that, when we are writing more advanced code, those design patterns can stick with us. And those same patterns we can apply in larger-scale habits.

In one way, compilers are trying to spoil us—trying to make us lazy—because they let us not take care of those finer performance details by ourselves and do it on our behalf, which is great, but then we are also losing that sharp curve of learning there. So my book is about understanding, from the compiler’s own output or the compiler’s own dump of what it has performed, what it has done on your code, right?

You shouldn’t be surprised. I think it’s a very, very interesting thing to learn—even for a simple for or while loop—how many performance improvements the compiler is making on your behalf. And that is what my book is all about: trying to decode the compiler’s kindness towards us.

25: Is there a personal motivation or vision that led you to, you know, make the decision to write this book at this point in your life?

Amar Akshat: Oh yes, of course. So when I was at Morgan Stanley and when I came into Apple, I was deeply involved in build and integration systems. I was deeply involved in deep compiler workflows—understanding common build failure patterns—and I was at the heart of a team which was basically accepting code changes from the entire Apple operating system developer base inside Apple.

So I was seeing these common failure patterns across, and I was like, I wish I could run a podcast and almost every week tell people that this is a very common failure pattern all of you have. It’s just that it is not well documented. And, you know, sometimes the compiler steps in and does it for you and things like that. So syntax failures—sure, the compiler will reject you. But the subtle efficiency improvements which the compiler does, or sometimes we as humans do to make a couple of integrations work correctly, were almost too beautiful for me to just keep to myself.

So I wanted people to understand—when students go into engineering college today, or write their first few C programs—that they should be surprised to see what is happening beneath the compiler, right? Even if a “hello world” just comes in front of you, what it took the compiler to do it for you is a beautiful experience I went through, and I want the world to go through that as well.

26: Who is the ideal reader for this book? According to you, who will find it to be the most useful?

Amar Akshat: I think architects and senior developers would be the ideal readers, because when they look at how the compiler optimizes their code, they will be surprised and inspired. Those optimizations apply to us in real-world architecture as well. You would realize that the compiler does so many things to scale your tokens, your token chunking, or to make your lookup of a particular data structure faster.

And those are the same patterns which we apply in our day-to-day architecture as well, like when we do caching or when we do streaming of tokens, etc. So senior developers and architects will be inspired. Junior developers and people who are upcoming in the market will be surprised. So it will also apply to them—to get surprised, beautifully surprised, flabbergasted, I would say.

Mastering GitHub in the Real World: A Conversation with Ayodeji Ayodele

Divya Anne Selvaraj — Thu, 06 Nov 2025 07:43:57 GMT

From secure collaboration and branch protections to reusable workflows and AI-assisted development, GitHub now sits at the center of how software gets built—and scaled—inside modern organizations. In this conversation, we speak with Ayodeji Ayodele—author of the GitHub Foundations Certification Guide (Packt, 2025)—about helping teams move from “using Git” to leading with GitHub: collaborating transparently, automating confidently, and protecting the software supply chain without slowing delivery.

Ayodeji is a seasoned architect, DevOps evangelist, and Agile coach with over 18 years of experience across Financial Services, Tech, FMCG, Manufacturing, and the Public Sector. He’s worked with CIOs and engineering leaders throughout Asia, Oceania, and Africa to drive enterprise adoption of DevOps and Agile practices—helping teams ship better software, faster. Currently a Senior Customer Success Architect at GitHub, Ayodeji partners with large organizations to align GitHub’s tools and workflows to real business outcomes—improving developer velocity, security, and collaboration at scale.

In this interview, we dig into what the GitHub Foundations Certification covers in practice, how to level up from issues and pull requests to governance with rule sets and quality gates, and where GitHub Copilot (and emerging agentic capabilities) can responsibly boost productivity. We also discuss inner source as a cultural shift, strategies for CI/CD that avoid pipeline bloat, and pragmatic approaches to secrets management, dependency hygiene, and build provenance.

Looking ahead, Ayodeji shares how AI is reshaping developer workflows, what skills will keep engineers relevant, and how to cultivate a documentation-first, asynchronous culture across time zones.

You can watch or listen to/download the full conversation below—or read on for the complete transcript.

1: What gap does your book, GitHub Foundations Certification Guide, fill for today’s developers?

Ayodeji Ayodele: My book bridges the gap between knowing Git basics and truly mastering GitHub as a collaborative, secure, and scalable platform. Many developers know how to commit and push code, right? But sometimes they struggle with collaboration, automation, and security, so I believe the book helps with that. Secondly, the GitHub Foundations certification is a great benchmark, but in the past there wasn’t a practical, hands-on guide to help people prepare and apply those skills in real terms, given that GitHub certifications in general are barely two years old. So we don’t have that many resources out there. I wanted to create a resource that’s not just exam-focused, but also helps developers become confident contributors in any organization. So I wrote this book to help developers go from “I can use GitHub” to “I can lead with GitHub.”

2: Your book promises to take readers from fundamentals to advanced GitHub features—from better collaboration and project management to secure workflows and even AI-powered coding with Copilot. Now, as you said, many developers use GitHub daily but may not be leveraging features like issues and pull requests. How can even experienced developers benefit from this book, and what are some important GitHub best practices or features that even seasoned engineers often overlook?

Ayodeji Ayodele: Yes, you’re correct—even seasoned engineers often miss out on GitHub’s advanced features that can supercharge collaboration and code quality. For example, GitHub Copilot—our latest product—is a massive game changer in the developer world, particularly if you’re using it both within the IDE and on the github.com platform. So GitHub Copilot is not just for the IDE; you can use and benefit from the great values that Copilot brings even on the github.com platform. Not just that—it supports multiple IDEs, up to about six, which means you don’t have to learn a new IDE for some of those other AI products in the software development space today. You can bring those models and still use the development environment you already use today.

So Copilot supports multiple models: you can use all the GPT models, you can use the Claude Sonnet models, and there are also the Google Gemini models, all within GitHub Copilot. And then there are the agentic capabilities that we now see as the future of AI in software development, whereby you can have this huge backlog of issues, assign those issues to Copilot, and Copilot will spin up its own separate environment, triage the issue, and write the code that fits the description to implement that issue. That may be a feature request, or it may be fixing a bug and things like that. And going down the line, there are so many other things coming out in the next few months around helping to improve code quality as well. There is also a feature called GitHub Advanced Security that helps people manage vulnerabilities, and you can also bring in vulnerabilities that are reported by other security tools and even fix them within the same platform. Those are things you can use today.

Then, in terms of best practices, we’ve got pull request summaries for improving the semantics of pull requests and titles. We have rule sets for protecting branches and for protecting workflows when you run them; there’s a huge number of different rules you can apply to improve governance and CI/CD automated checks. And there is leveraging issues for transparent communication and collaboration as well. Finally, mastering these tools elevates you from an individual contributor to a team enabler. GitHub isn’t just a code repository; it’s a platform for building better software together.

3: As we know, open-source code now underpins almost everything. Given this ubiquity, what advice do you have for developers to harness open source effectively?

Ayodeji Ayodele: Open source is the backbone of modern software. Contributing is the best way to learn—contributing to open-source projects helps you grow and give back to the community. In fact, roughly 50–60% of software today is built in open source or on top of open-source components and libraries. So open source is integral to how we build software in the world today. I’d say start small—fix typos in public repositories and improve documentation. You’re just like everyone else, and the GitHub platform is home to over 150 million developers across different skill sets and interests, so you’ll always find a space that fits you. If you’re worried your contribution won’t meet standards, there are tools to help. GitHub Copilot—free for open-source projects—can suggest or improve code; after you’ve written code, you can ask Copilot to review it for standards.

We also have GitHub Advanced Security, and many of its security components are free for public repositories. GitHub takes its role as the home where the world builds software seriously, so we provide security and AI-powered tools to open-source communities at no cost. Beyond that, there’s the GitHub Community Discussions space where people suggest improvements to GitHub itself—join in and learn by doing. Open source is a two-way street: you give, you learn, you grow.

4: How can engineers get involved in open-source projects on GitHub while balancing the risks and rewards of depending on community-maintained code?

Ayodeji Ayodele: Rephrasing that: if you want to get involved, start at github.com/explore. You’ll also find /trending, where you can see repositories gaining popularity—sometimes a new repo skyrockets because it’s exactly what everyone was looking for, whether a library, a design template, or a scaffolding component. There’s no judgment—you’re just like everyone else among 150 million developers on GitHub, spanning all experience levels and interests. About the risks: you may worry your contributions aren’t up to standard, or fear embarrassment. There’s no shame.

Use GitHub Copilot—currently free for open source—to suggest or improve code, and even ask it to review your code for standards after you’ve written it. Plus, GitHub Advanced Security offers many features free for public projects. As part of our commitment to the open-source community, we provide these security and AI-powered tools free to help you get started and build with confidence.

5: How can the principles of open source—such as open collaboration, transparency, fork and pull—be used within companies to improve teamwork and code reuse?

Ayodeji Ayodele: Bringing open-source practices inside companies—what we call inner source—breaks down silos and accelerates innovation. Transparency, forking and pulling workflows, and opening discussions all drive better code and teamwork. If companies are looking to get started, the InnerSource Commons website has very useful resources; my colleague Yuki in Tokyo is involved there.

To mitigate common issues like unclear ownership, be transparent and make it easy for people to contribute: add a CONTRIBUTING.md file with clear guidelines, explain what the project is, what it does, and where help is needed.

Use GitHub Discussions internally so people can collaborate and ask questions, and look to the GitHub Community for inspiration on how a discussion forum works. Leadership support matters, too—secure buy-in and celebrate contributions, whether at the community level or with incentives like badges or prizes. Inner source turns every developer into a potential innovator, not just a code consumer.

6: Have you seen any challenges in adopting this inner source model, and what strategies can help overcome them?

Ayodeji Ayodele: Resistance to change is common—people feel comfortable with the status quo.

Start by explaining what inner source is and keep the environment judgment-free so everyone feels supported.
Provide clear guidelines on what can be done, and communicate—over-communicate—so no one is surprised by the rollout.
Keep everyone in the loop on what you’re doing as a program, when you’re starting, and how you’ll deliver the change.
And don’t underestimate leadership involvement; having leadership support is critical to driving the change internally.

7: Modern software teams automate extensively. How critical is it for developers to integrate CI/CD pipelines and automation into their workflow?

Ayodeji Ayodele: CI/CD is one of my favorite topics. CI/CD is hard. For example, you can introduce pipelines. In GitHub we have three: one for the build phase, one for the test phase, and one for the deployment phase. And so when you think about CI/CD, you think of a process and a practice or a methodology that helps you automate all of those steps from the point where you collect requirements and analyze what needs to be developed. Then you build. You want to build and test as you build. So test-driven development, if that’s the model you follow, will require that you write your tests before you write your code. And sometimes you write your code first and then your tests, depending on your methodology, whether it’s an agile methodology or something like that. So, for me, CI/CD is a practice and a set of standards that you want to have within your organization that helps you automate all of the very tedious, boring, repetitive tasks that you would ordinarily have to do by hand.

And when you think of CI/CD, you want to think about the entire journey of software development: from the point of having an idea, building a prototype, and then having a feature request that you want to design, to testing that feature, deploying it, and making it available to your customers. That’s the journey; it’s a life cycle. And when you think of life cycles, you’re thinking in terms of product management. You’re thinking about getting your ideas and turning those ideas into a reality for customers, for your users. And if you are going to really do that at scale, you cannot do it by having people on your team run commands manually on their terminal, copying configuration files here and there. You want to eliminate the room for human error. You want to make sure it’s repeatable. You want to make sure that the process that you follow in Team A is the same process Team B follows.

So CI/CD allows you to have those standards built so that your teams are able to build with confidence, and you’re able to ship more frequently. The more frequently you ship a feature, the faster you are going to meet customer needs. And so CI/CD gives you that leverage to repeatedly ship features to your customers and not have to wait so long just because you’re following a very careful manual process. And the second thing I want to say is that CI/CD helps you to test early, test often. When you test early, you reduce the cost of bugs: you reduce the cost of fixing them because you can discover them very early in the production life cycle. And when you do that, it’s cheaper to fix as opposed to discovering a bug after you have launched a feature to production. And then the last thing about CI/CD that I want to highlight is that you want to make sure that it’s repeatable. And so you want to write your configuration as code so that you are able to repeatedly do the same thing over and over again and in the shortest time possible. And that just lets you ship faster and safer.

There are a number of things that you can do to improve the CI/CD pipeline. You can introduce automated quality gates within your pipelines so that there are quality checkpoints to prevent low-quality code from going into your main branch. And for that, for example, in GitHub today, we have code scanning tools, which are available through GitHub Advanced Security. You can set up a rule set that scans your code for vulnerabilities, and you can then block a pull request from being merged based on that. You can also block pull requests by checking for tests, checking that your tests are passing, and checking that your build pipeline, the build process, completes and is successful. And then checking that the deployment pipeline meets all of the criteria or standards required before you actually deploy to production.
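
The gate logic described here boils down to "merge only when every required check has passed." The sketch below models that rule in plain Python so the behavior is easy to see; it is not GitHub's implementation or API, and the check names in `REQUIRED_CHECKS` are hypothetical.

```python
# Hypothetical set of checks a rule set might require before merging.
REQUIRED_CHECKS = ("build", "unit-tests", "code-scanning")


def blocking_checks(check_results: dict) -> list:
    """List required checks that are missing or did not succeed."""
    return [name for name in REQUIRED_CHECKS if check_results.get(name) != "success"]


def can_merge(check_results: dict) -> bool:
    """A pull request is mergeable only when no required check is blocking."""
    return not blocking_checks(check_results)


results = {"build": "success", "unit-tests": "success", "code-scanning": "failure"}
print(can_merge(results), blocking_checks(results))
```

Treating a missing check the same as a failed one is a deliberate choice in this sketch: a gate that passes when a check never ran would defeat its purpose.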

And so once you have those things set up, essentially a very solid, repeatable process with automation and with quality gates in place, you are going to be able to confidently ship production-grade software again and again. And when you centralize CI/CD, for example by building it as reusable workflows, you are able to introduce them across all your repositories whenever you want to merge a pull request, and you know you have a central place where you can manage the quality gates, the rule sets, and the different things that need to be checked across your organization. This gives you a very repeatable process across the organization, across different business units, and you’re able to have that single pane of glass that lets you see every single repository and the quality within them. So I would say CI/CD is an integral part of software development today.

8: What guidance would you give for using tools like GitHub Actions to streamline testing, integration, and deployment, and to avoid “pipeline overload” while maintaining software quality?

Ayodeji Ayodele: I would say start with one of the most important points around deployments—which is continuous deployment. If you’re working in an organization where you are able to build a great testing culture—so you are building both unit tests, and you have automated tests running end-to-end, and you are able to bake quality into your check-in process—then you can enable continuous deployment so that you have frequent deployments to production without, you know, requiring someone to push a button after every merge into main. And the reason is that you have discipline, and you have the culture and processes to ensure that they are repeatedly passing. So when a pull request lands on main, the deployment happens automatically—so that is one.

Another point is that release gates help with staging: having an environment that lets you validate changes before you deploy to production. That’s really helpful when you have many teams, the product is really important, and there are key delivery timelines. Release gates make sure that the staging environment is actually very close to the production environment, and you can do some scale testing and even chaos testing there. Now, in terms of avoiding pipeline overload, what I’ve seen is that when teams keep adding more and more steps (“Let’s add this because it’s nice to have”), you quickly get to a point where your pipelines take a long time to run. That means you’re blocking people from shipping because you now have this huge, long pipeline. So what I would say is that you want to keep your pipelines small in terms of the number of steps for a particular task. Break your tasks down: unit tests are separate, integration tests are separate, and then you have performance tests and any other tests you want to run. By breaking these down into a pipeline for each purpose, you keep them short and small, so that when you run the pipeline for unit tests, it does not include all the steps you need for, say, integration tests.

So you want to break them down, and you want to make sure that they are not blocked by each other. For example, let’s say you want to run a build pipeline, and you also want to run unit tests and integration tests. You should be able to run them in parallel, so that when your build and your unit tests are done, you can do code coverage and other types of analysis. The integration tests can run alongside the unit tests, rather than having them chained where one has to finish before the other can begin. So run in parallel: parallelization lets you get things done faster. The other thing I would say is: cache artifacts between steps. When you are building artifacts, you want to be able to reuse them easily between pipelines or even within a pipeline. When you cache artifacts, you save the time it takes for a build to start from scratch every time.
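To make the chained-versus-parallel point concrete, here is a small Python sketch that times the same three stages both ways. The stage names and durations are made up, and `time.sleep` stands in for real work; in GitHub Actions the equivalent would be separate jobs that do not depend on each other.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def run_stage(name, seconds):
    """Stand-in for a pipeline job; sleeps instead of doing real work."""
    time.sleep(seconds)
    return name


# Hypothetical stages with made-up durations in seconds.
STAGES = [("build", 0.2), ("unit-tests", 0.2), ("integration-tests", 0.2)]


def run_chained(stages):
    """Run each stage only after the previous one has finished."""
    start = time.perf_counter()
    finished = [run_stage(name, secs) for name, secs in stages]
    return finished, time.perf_counter() - start


def run_parallel(stages):
    """Start all independent stages at once and wait for all of them."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=len(stages)) as pool:
        futures = [pool.submit(run_stage, name, secs) for name, secs in stages]
        finished = [f.result() for f in futures]
    return finished, time.perf_counter() - start


chained_names, chained_time = run_chained(STAGES)
parallel_names, parallel_time = run_parallel(STAGES)
print(f"chained:  {chained_time:.2f}s")
print(f"parallel: {parallel_time:.2f}s")
```

The chained run takes roughly the sum of the stage durations, while the parallel run takes roughly as long as the slowest stage. That only holds when the stages really are independent of each other.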

For example, in Node.js, Python, or Java, you can cache the dependencies and reuse them across runs. This makes sure that if you’re making a small change, you don’t have to start from scratch. And then, finally, when you’re thinking about CI/CD, you want to think about making it reusable, so reusable workflows. That way, when you have standard, common steps, you can reuse them across teams. And then also make it modular. Each task should be a complete unit, so that once it finishes you can run the next task in parallel, or run the next task that depends on its outcome. And so with all of that, I would say: automate the boring stuff. Focus on what makes your products unique.
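The dependency-caching advice usually comes down to one idea: derive the cache key from the lockfile, so the cache is reused until the dependencies actually change. GitHub Actions' `actions/cache` is commonly configured this way, with a key built from a hash of the lockfile. The Python sketch below illustrates the idea only; `DependencyCache` is a hypothetical in-memory stand-in, not a real cache client, and the lockfile contents are invented.

```python
import hashlib


def cache_key(prefix, lockfile_bytes):
    """Build a cache key that changes only when the lockfile changes."""
    digest = hashlib.sha256(lockfile_bytes).hexdigest()
    return f"{prefix}-{digest[:16]}"


class DependencyCache:
    """In-memory stand-in for a CI cache service (illustrative only)."""

    def __init__(self):
        self._store = {}
        self.installs = 0

    def restore_or_install(self, key):
        if key not in self._store:
            # Cache miss: pretend to download and install dependencies.
            self.installs += 1
            self._store[key] = f"deps-for-{key}"
        return self._store[key]


cache = DependencyCache()
lock_v1 = b"requests==2.32.3\n"
lock_v2 = b"requests==2.32.3\nrich==13.7.1\n"

cache.restore_or_install(cache_key("pip", lock_v1))  # miss: installs
cache.restore_or_install(cache_key("pip", lock_v1))  # hit: reuses
cache.restore_or_install(cache_key("pip", lock_v2))  # lockfile changed: miss
print(cache.installs)  # 2
```

A source-code change that leaves the lockfile untouched produces the same key, so the install step is skipped on that run.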

9: In your experience, what branching strategy works best for teams on GitHub?

Ayodeji Ayodele: I have a preference, so disclaimer first: this is an opinionated preference. For me, I prefer trunk-based development. A lot—because with the trunk-based development model, your history stays clean and you have only one single branch at every point in time. And you might ask, what if I need to go back in time and roll… the Git flow, which is always, you know, very good… the best branching strategy is the one that your team understands and they can follow consistently. Gitflow can work for… you know, complex release cycles and… and you want to have different release version numbers, and you jus… consistency and clarity matter, that’s—uh, what I would say, yeah.

10: Now we know what you prefer personally. But for teams—what would you recommend for them? Would it be trunk-based development or Git Flow?

Ayodeji Ayodele: Yeah, I would say trunk-based development… most—that’s your speed because you… you know that you know how to… You can quickly make quick changes, validate… collaborate. So that is particularly important for teams, yeah.

11: Let us talk about the pros and cons of branching strategies. Could you quickly summarize what trunk-based development is best suited for versus Git Flow?

Ayodeji Ayodele: If you’re building software and want to release features more often, say once every day, with frequent changes to production or very short sprint cycles, trunk-based development is the best approach. If you tend to have multiple release versions in production, or you have the kind of application where customers expect separate versioned releases, then Git Flow is good, because you can keep those separate branches of the same codebase and use them regularly. So yeah, that’s what I’ll say.

12: With supply-chain attacks on the rise, security is a huge concern. What steps should developers and teams take on platforms like GitHub—enabling two-factor authentication, etc.—to prevent such attacks?

Ayodeji Ayodele: Security is everyone’s job. Every role needs to consider security as very important—not just the security architect. Every developer needs to factor security into the code they build, so use the built-in tools you have on GitHub to protect your code at every stage. For example, at the development stage, GitHub push protection helps block secrets from leaking from your IDE when you’re about to push to the remote—so that helps even before you get to CI/CD. When you’re about to merge, there are rule sets to protect code—for example, from accidental deletions of branches—and from dependencies and vulnerabilities. You can run vulnerability scans out of the box with a single-click default setup for code scanning. For secret scanning, you can scan code at rest or even code in PRs.

There are also security configurations you can apply to different sets of repositories—and you can have code scanning run on a predetermined frequency, whether weekly or monthly, across the organization. There’s also Dependabot, which looks at your dependencies and recommends updates; it tends to keep dependencies up to date and even opens automatic pull requests you can review and merge. For software supply chain integrity, you can implement build provenance using SLSA (a build attestation standard)—GitHub Actions is always on SLSA Build Level 2. Within GitHub, you can generate attestations to prove dependencies are valid—like a blockchain of your deployments—to show nothing was tampered with; you can store that on GitHub or as an artifact in Artifactory or other external package managers.
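The core of the tamper-evidence idea behind build provenance can be sketched in a few lines of Python: record a digest of the artifact at build time, then recompute and compare it before deploying. This is a deliberately simplified sketch. Real attestations are cryptographically signed and follow a defined schema, and the record fields, workflow path, and commit value below are assumptions made for illustration.

```python
import hashlib
import hmac


def sha256_digest(data: bytes) -> str:
    """Hex SHA-256 digest of the artifact bytes."""
    return hashlib.sha256(data).hexdigest()


def make_record(artifact: bytes, workflow: str, commit: str) -> dict:
    """Record what was built and where it came from (simplified, unsigned)."""
    return {
        "sha256": sha256_digest(artifact),
        "workflow": workflow,
        "commit": commit,
    }


def matches_record(artifact: bytes, record: dict) -> bool:
    """True only if the artifact is byte-for-byte what the record describes."""
    return hmac.compare_digest(sha256_digest(artifact), record["sha256"])


# Hypothetical build output and provenance details.
built = b"binary contents produced by the build"
record = make_record(built, workflow=".github/workflows/release.yml", commit="abc1234")

print(matches_record(built, record))                # True
print(matches_record(built + b" tampered", record))  # False
```

Without a signature, anyone who can alter the artifact could also alter the record, which is why real provenance systems sign the statement and verify the signer's identity as well as the digest.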

Definitely use two-factor authentication. GitHub supports that, and you can use single sign-on with Okta or Azure AD. You can also use Teams to manage roles, create custom roles with different permissions, and make sure every commit is signed and verified to add confidence that the person changing the code is the right person.

13: How do Git and GitHub enable effective asynchronous collaboration across different time zones and teams?

Ayodeji Ayodele: Yep, like you rightly hinted—yes, I work remotely. In fact, Git… for asynchronous teamwork. It helps. You don’t have to be in a meeting to make changes on GitHub, and clear communication is built into the platform. You can use GitHub Projects for planning. You can track status for each task—who’s working on what—and then it also provides a timeline view where you’re seeing the timeline for the different tasks. That helps when you have teams in different parts of the world, even people in the U.S. and [elsewhere]. GitHub Issues is that central place where we collaborate on a particular issue—you can create an issue to capture an idea, a bug, or a feature request. You can tag people, add labels that are customized to the way your organization works, and then you can create task lists within the issue. So it helps pull requests as well—it comes out of the box.

Pull requests are one very fantastic feature of GitHub, and I think… Peter… You can also use GitHub Discussions for conversations that are not specific to a particular issue or pull request, and discussions are really trackable. You can also use discussions for social collaboration. Yep, these are the different features I can think of now. And then there’s documentation: wiki pages and README files. You can also create road map views and use them to collaborate on different projects.

14: Are there any secret tips you’d like to share—little things we might have missed—about GitHub collaboration in a remote-team context?

Ayodeji Ayodele: Yeah, absolutely. The fact that you work with different people across different time zones means documentation becomes very important. We’re developers, and we’re geeks, so sometimes we just want to write in shorthand and short comments. But if you’re working asynchronously, you want to provide context in your issues and pull requests so people can understand the “why” and the “how” behind changes without a meeting. Use templates for issues and PRs, and follow a consistent convention for titles and descriptions. Use labels and project boards to make status clear at a glance. Encourage code owners and reviewers to leave actionable comments, not just approvals. And you want to have regular check-ins with your team, maybe a sync once a week or once every two weeks, so people feel connected while still relying on async work.

15: You have a background in DevOps and change management—how do you see platforms like GitHub influencing team culture and process?

Ayodeji Ayodele: GitHub is a fantastic tool for collaboration—especially when you want to bring people from silos to becoming more collaborative. In the development world, we call GitHub “social coding” because you’re writing code and working together with people. There’s transparency: when you’re making changes, you’re seeing what others are doing, and others can see what you’re doing—that transparency is really important for providing feedback. When you’re reviewing code, you can add inline comments, and that feedback can also drive continuous improvement. When you put automation in place, it saves people time, and they can use that time to work on bigger problems—so having that automation helps teams become more collaborative. People feel this developer happiness when they use GitHub in a transparent, collaborative way. Collaboration is a very core pillar of how the platform is built and shaped. Yeah.

16: Are there any common pitfalls teams should avoid when integrating GitHub into their DevOps workflows?

Ayodeji Ayodele: Over-customization is one I see often. Sometimes platform engineering teams or DevOps engineers want to customize everything, and that can take away from the standard way of doing things. There’s overhead for you to maintain the customization, and overhead for the people who have to use it. You want to reduce that so your end users and consumers can use your application or software more easily—so avoid over-customization. Neglecting documentation is another. I’ve seen people create pull requests with great changes but little or no context.

Today, with GitHub Copilot, you can easily summarize the changes you’ve made in the pull request—beautifully—so try not to neglect documentation. Also, skipping retrospectives is a pitfall. Retrospectives help you look at what you did well, what you didn’t do well, and where you can improve—so don’t skip them. Over-customizing your platform, neglecting documentation, and skipping retrospectives are common pitfalls—and culture matters. The right tools can change what’s on your menu.

17: Let’s talk specifically about Copilot and the future of coding. Is AI-assisted coding a boon to productivity—that’s the debate. From your perspective, how is AI changing the day-to-day work of developers?

Ayodeji Ayodele: I’ve been in the industry for about 20 years writing software, and I’ve never seen anything like this before. It fundamentally improves and changes the way we write software today, and it’s not just a buzzword—AI has come to stay. We’ve seen people reduce the time it takes to introduce new features; we’ve seen improvements in the quality of the code as well—higher test coverage and fewer vulnerabilities when they scan the code.

At GitHub, we build GitHub on the GitHub platform, and Copilot is the number one contributor there today, with the highest number of contributions per week and per month. We believe in the platform and in what AI has come to improve. If I had had the agentic capabilities I see today earlier in my career, I would have achieved a lot of great things.

So if you’re not using AI today, you’re likely playing catch-up, because AI allows you to move at a much faster rate, safely and securely. AI is now the pair programmer. You’ll be seeing agents, so you can assign some complex or boring work to an agent within the GitHub platform. You can have multiple issues in your backlog, and one agent can hand work off to another; GitHub Copilot has different agents. We will also be releasing many new agents, so you’ll have agents as your pair programmers or even your teammates. In addition to a development team of humans, you’ll now have these agents alongside the team, doubling and tripling its output and throughput compared to teams that don’t have AI today.

18: Do you foresee AI assistance becoming a standard part of development? How should developers—especially junior engineers—take advantage of tools like Copilot while continuing to hone their coding skills?

Ayodeji Ayodele: GitHub Copilot not only helps you write code, it also helps you understand the code. You can ask Copilot to give you a better understanding of the codebase. Let’s say you inherited it from a senior developer; Copilot can help you understand the different components of the code and what it thinks the code does. It can explain concepts for you, such as coding terminology and development practices, and even identify the components in the stack you use: “These are the components you use here.” You can ask questions such as, “How can I test-run this?” and GitHub Copilot can help you go through that flow. Secondly, it helps with prototyping. You’re given the requirements, and you need to quickly prototype and experiment, which is an important task that developers do today and that many people underestimate.

From translating ordinary business needs into what software should do, Copilot can help with ideation, brainstorming, and prototyping as well. These are very good areas where a junior developer can really benefit. It also ensures the code follows certain coding practices, which means a junior developer can work as if the person is an experienced senior developer—because they have that assistant by their side. So you can solve problems with AI, not just write code.

19: Let’s talk about caveats a little bit. Can Copilot negatively impact code quality? What is your take on this?

Ayodeji Ayodele: Oh, that’s a tricky one. Yes and no. Yes—in the sense that if you don’t know what you’re doing and something goes wrong, it will be hard for you to understand the code base and figure out how to fix it yourself. And if you find yourself in a remote area or you don’t have internet access, or you’re on a system where AI is not allowed—say, you’re building some, you know, covert, highly secure environment software—how will you be able to cope? So you also want to make sure that you understand the language it’s written in, so that you can triage some of those things. And in terms of whether AI in general can introduce bugs in the code—there are times when, if there is no good prompting, Copilot can build (or AI tools can build) code in a different way, because the model has its own preferences. You can read up on how to write good prompts for AI tools—for GitHub Copilot.

Then there are times where you can say, “I want you to do it in this particular way, and I want you to use these libraries,” because typically there is more than one way of solving a problem. If you have a kind of library that is homegrown or approved for use, you can create what we call Copilot Instructions to instruct Copilot to write code in a way that is accepted by the organization you’re working in. And whenever it introduces bugs, there is a review agent you can use to review the entire code base itself and review it against best practices—and even against your internal standard practices—within an enterprise or a team.

20: How do you feel teams can use AI coding assistance responsibly to ensure the generated code meets quality and security standards?

Ayodeji Ayodele: Good question. Not all AI coding tools are the same, and the dependency models you use can determine what kind of responsible use is available. For example, GitHub Copilot—being part of the Microsoft ecosystem—has multiple layers internally for the responsible use of AI. First, it looks at the kind of prompts you’re sending, checks them against responsible-use standards, and sanitizes them. When it sends back the code, it looks at that code to be sure it’s a responsible use—making sure there are no personal data leaks or similar risks—within the system itself, depending on which AI system or product you’re using. And for the human being, I would say always review—and then use automated checks—because the volume of what AI will be contributing to your code will increase.

That means it will be time-consuming to review everything manually, so you want to reduce that burden with automation while still reviewing. Make sure a lot of the review work has been done in advance with automated checks, and treat Copilot as a collaborator, not a replacement—that’s why GitHub calls them “co-pilots,” not “the pilot.” Someone is still in charge, driving what it does, yeah.

21: Continuing on AI and its impact: the latest Stack Overflow developer survey shows a paradox. Nearly 80% of devs are now using AI tools in some form, yet only ~3% highly trust the answers from AI. In fact, 75% of developers say they turn to a human colleague when they don’t trust an AI’s answer. How do you envision the collaboration between developers and AI tools going forward? For example, will coding become more about validating and refining AI-generated solutions?

Ayodeji Ayodele: Yeah, I think I’d be keen to see that report and see, you know, what tools they’re using. It would be good to see a breakdown of that.

Because I feel it may not be the same experience for every tool. With that said, this is a revolution. It is another of those times in human history when you look back and see that a huge change has occurred. And there is, as expected, a bit of resistance when change comes. Some people have not used it, and some people who have used it don’t understand everything it does and how it runs in the back end. It’s difficult to trust something when you don’t know how it runs or how it works. You may need to understand the design and the architecture of that AI tool to know what’s going on under the hood.

And many of these AI tools are also configurable, so you can control what the AI is able to do, how it works, and what it’s allowed to see. For example, with GitHub Copilot, there’s a set of files you can exclude, maybe secret files and things like that, and the AI will never look at them, even though they are in your codebase. So this helps. And then knowledge sharing as well: some people love it because they’ve used it very well and they’ve seen the impact it has had on their productivity. Knowledge sharing will help the community; it will help people level up and better understand how some of these tools work. So, yeah, I’d encourage knowledge sharing, and then understanding the design, the architecture, and the documentation on how it runs.
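The file-exclusion idea mentioned in this answer can be pictured as a filter applied before any file reaches an AI tool. The Python sketch below shows that general pattern with glob matching from the standard library. The patterns and paths are invented for the example, and this is not how Copilot's content exclusion is actually configured or implemented.

```python
from fnmatch import fnmatchcase

# Hypothetical exclusion patterns; "*" here also matches across "/".
EXCLUDED_PATTERNS = ["secrets/*", "*.env", "*.pem"]


def is_excluded(path, patterns=EXCLUDED_PATTERNS):
    """True if the path matches any exclusion pattern."""
    return any(fnmatchcase(path, pattern) for pattern in patterns)


def visible_files(paths, patterns=EXCLUDED_PATTERNS):
    """Keep only the files an assistant would be allowed to read."""
    return [p for p in paths if not is_excluded(p, patterns)]


repo_files = [
    "src/app.py",
    "secrets/api_key.txt",
    "deploy/prod.env",
    "certs/server.pem",
    "README.md",
]

print(visible_files(repo_files))  # ['src/app.py', 'README.md']
```

Filtering up front, rather than asking the tool to ignore certain files after the fact, means the excluded content never enters the prompt at all.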

22: That’s some really good advice. But how do we ensure that junior devs still learn critical thinking instead of blindly accepting AI output? Because that’s a genuine risk the community is facing at this time.

Ayodeji Ayodele: Yes. I would say use AI as a learning tool, not a crutch. Use AI to understand and know things better. And you can also use AI to augment knowledge. At times, knowledge is scattered within an enterprise across different sources, such as SharePoint, Jira, ServiceNow, and other platforms, and it’s hard to have access to, or remember, all of them. Use AI to augment and consolidate this knowledge base so that people have a richer source to derive information from. Consolidating the knowledge base can really help.

And of course you can also have meetups; they can really help to improve knowledge sharing. You can come and showcase what you’ve built, and people can ask questions to test your assumptions about the software you built, and that can help. AI itself can help in scaling that and in bringing all of it together. Along the way, you can also capture notes, conversations, improvements, and suggestions, and incorporate them back into the code. Or document those comments and feedback in your repositories. AI can help with that for you, but humans have to ask the right questions.

23: There’s a lot of debate about whether AI makes people more productive or makes them worse, you know, more dependent on it. And I think at this point it’s a bit pointless to go into that debate. But complex, creative problem-solving is still uniquely human. In the same Stack Overflow survey, roughly 40% of developers said that AI tools performed poorly on complex tasks. Having said that, the space is moving fast and there are developments and improvements. Given AI’s current limitations, what uniquely human skills should developers focus on strengthening now to remain relevant?

Ayodeji Ayodele: First and foremost, there was that comment about AI not being able to perform complex tasks. In the last three months, I’ve seen a major change. The models have evolved and people are beginning to say, “You know what, this is getting really magical.” So I’ve seen some AI models out there that can perform really complex tasks. There are other models that take a shorter time and just give you an answer on the fly. So there are different strengths to different models.

So yes, AI can help with complex tasks, but AI can’t replace creativity. AI cannot replace empathy. It cannot replace problem-solving. These are innate skills for humans. I know I’ve seen some people trying things like that, but I don’t think they can ever be like humans. So I don’t think AI will replace human beings, you know. So you want to focus on creativity, on communication, on design thinking, and, you know, adaptability as well.

24: What core competencies will define a successful developer in the next 5–10 years as tools like Copilot evolve?

Ayodeji Ayodele: So the developer role is evolving—more than just writing code, it’s about understanding systems, collaborating across functions, and adapting as tools change. You want to build strong fundamentals, but stay curious and keep learning new paradigms, frameworks, and practices as they emerge. Communication and collaboration matter a lot—being able to explain your thinking and work well with others. So the most valuable skill in tech isn’t coding—it’s learning.

To build practical mastery of Git and GitHub, from version control basics to collaborative workflows, secure automation, and AI-assisted productivity, check out GitHub Foundations Certification Guide by Ayodeji Ayodele (Packt, 2025). Through step-by-step labs, real-world projects, and exam strategies, it helps you prepare for the GitHub Foundations certification while adopting best practices for issues and pull requests, GitHub Projects, privacy and security controls, and GitHub Copilot, so you can level up your skills and ship better software, faster.


From Enablement to Reliability: How Platform Engineering Aligns with SRE Goals – A conversation with Sean P Alvarez and Ajay Chankramath

Sushma Reddy — Thu, 25 Sep 2025 09:06:27 GMT

Platform Engineering is often confused with Site Reliability Engineering (SRE) or seen as the latest rebranding of DevOps. In reality, it represents a distinct shift: treating internal platforms as products, designed for adoption and developer experience. In this conversation, we speak with Sean P. Alvarez and Ajay Chankramath—co-authors of the forthcoming Platform Engineer’s Handbook (Packt, 2026)—about where SRE and Platform Engineering converge, where they differ, and why collaboration between them is essential in today’s organizations.

Sean P Alvarez is the Chief Technology Officer of the Life Sciences business at Brillio, where he leads engineering teams and advises clients on cloud strategy and platform modernization. With over 15 years of experience in regulated industries and consulting, he specializes in applying Platform Engineering principles to drive enterprise-scale transformation.

Ajay Chankramath is the Cofounder and CEO of Platformetrics. With more than 35 years of global experience, he has led platform engineering at Thoughtworks, Oracle, Broadridge, and Xilinx, and is recognized as a Platform Engineering Ambassador and Team Topologies Advocate.

Together, they are writing The Platform Engineer’s Handbook on building secure, developer-focused platforms that streamline modern software delivery. The book takes a hands-on, “build first, clarify as you go” approach—guiding readers from source control governance and Kubernetes runtimes to observability, self-service onboarding, and AI-augmented tooling. Built for practitioners, it equips engineers to design platforms that scale without disrupting delivery.

You can watch the full conversation below—or read on for the complete transcript.

1: Sean, can you take us back to the moment when you first realized you were doing what we now call Platform Engineering? I imagine the term might not have existed yet, but the work was already there.

Sean Alvarez: It’s a great question, and thank you. I’ve always worked in somewhat regulated industries—life sciences and the financial industry. In those industries, the release process tends to be cumbersome. There are a lot of compliance issues, a lot of sign-offs that need to happen, and a lot of double and triple checking to make sure things won’t go wrong or that auditing won’t get messed up.

When I first started working with companies that would burn deliverables onto CDs and move things back and forth, I tried to introduce automation, but there wasn’t much trust in it at first. That led to centralized teams who stayed up all hours of the night doing deployments over and over again. There really wasn’t trust to move to something like what we’d now call DevOps.

At the same time, there was a desire to speed up development as the industry matured and as more start-ups entered the space and moved more nimbly. We wanted to make the deployment process—and the SDLC overall—less of a dreaded gate. Instead of developers thinking, “Oh no, I need to submit this to the DevOps team or the Operations team,” we wanted to turn it into an enabler.

To do that, we had to work across silos—security teams, networking teams, compliance, even upper management and governance—to put automation in place. That’s really where Platform Engineering took off for me: ensuring deployments were safe, compliant, and reliable, while allowing developers to move faster and giving the organization confidence that releases would go smoothly.

2: Every company has its own unique engineering culture. In your journey across different organizations, how have those environments shaped your understanding of what Platform Engineering really is?

Sean Alvarez: As I moved into more of a consulting role, I worked across industries and saw more organizations that operated closer to start-ups—moving fast even at the risk of breaking things. As the industry matured, I realized Platform Engineering wasn’t just about enabling speed. Developers also had to want to use it.

When you have individual full-stack teams in control of their deployments, Platform Engineering can look like extra work—just another backlog item or another check. If the platform feels forced, adoption won’t happen. Instead, it has to be something they want to use.

It’s not a “build it and they will come” situation—it has to be product-oriented. Just like external products compete for market share, internal platforms have to attract developer adoption. If we think of engineers as internal customers, that adoption becomes the real measure of success. It’s where the interests of leadership and developers align, and that’s when Platform Engineering really becomes powerful.

3: Ajay, in your journey across organizations, how did different environments shape your understanding of Platform Engineering?

Ajay Chankramath: My journey has been slightly different, but not too dissimilar to what Sean described.

Many years ago, I had a role supporting developers, though I was a developer myself. My first job involved building complex algorithmic software. But I noticed that developers—including my own team—always encountered friction. They couldn’t get things done the way they wanted.

Back then, roles were structured differently. There was a clear divide between developers writing code and then “throwing it over the wall” to another group—experienced engineers who built the product software. I started looking at this challenge through fundamental principles of software design, like loose coupling and high cohesion, which have been around for decades.

The first thing I did was identify common developer pain points and create reusable components to improve productivity. That led to a patent we called ROMS—Reusable Object Modules. This was 25 years ago, before DevOps, SRE, or Platform Engineering existed. Looking back, ROMS was essentially a fundamental building block of a platform: reusable capability packaged as a library.

We presented it at conferences, and that became my first foray into what I’d now recognize as platform thinking. The spark came from applying core software principles, which is probably the theme of today’s conversation—how principles extend into SRE and Platform Engineering.

Eventually, leadership asked me to build a team around this work. That transition took me from software development into support-related activities, which we then called Software Productivity Automation. We didn’t have today’s terminology, but the idea was similar.

Over time, working mainly in large companies, I saw the federated model in action: a centralized set of services with individual teams building on top. It’s a strong model because it gives smaller teams autonomy. But the downside is that centralized teams face huge backlogs. That almost never works out, and so the real question becomes: are individual teams able to be self-sufficient?

With the advent of public cloud, that’s changed significantly. Over the last 10–15 years, cloud services have made it much easier for smaller organizations to adopt these practices and integrate them into their ecosystems.

4: What are some of the biggest misconceptions you see today when teams practice SRE and Platform Engineering?

Ajay Chankramath: One of the biggest misconceptions is that constructs like SRE, Platform Engineering, DevOps, or DevEx are simply newer terms replacing older ones. That’s absolutely not the case. Each has its own role in the larger scheme.

We emphasize this in our book: DevOps is not a team and should not be a job title. It’s a cultural paradigm—about improving collaboration and communication across the SDLC.

SRE is about applying software engineering principles to operations to create highly reliable production systems. DevEx—developer experience—has always been there. It’s about how developers interact with tools, frameworks, and processes throughout the SDLC.

If you define these clearly, the overlap becomes easier to see. Platform Engineering sits at the center, enabling all three—giving developers self-service capabilities, enabling SREs, and supporting the DevOps culture.

That’s why it’s wrong to think Platform Engineering replaces SRE. SRE has existed since 2004 and continues to serve a distinct purpose. Platform Engineering is complementary—it enables SRE and DevOps to succeed together.

Sean Alvarez: I’ve also seen misconceptions where SRE is viewed only as production support. People then ask: does Platform Engineering automate that away? The answer is no.

SRE plays a vital role in ensuring production reliability. Platform Engineering spans the entire SDLC—from onboarding new team members to building, deploying, and running software in production. SRE should inform what Platform Engineering builds to make their jobs more efficient, but they aren’t the only users of the platform.

Platform success depends on collaboration—SREs, developers, and business owners all need to work together.

5: There’s a lot of overlap between SRE, DevOps, DevEx, and Platform Engineering. Would you like to talk a bit about what each brings to the table, how you personally draw boundaries between them, and where they are similar or different?

Ajay Chankramath: Absolutely, that’s a great question. Let me step back and offer one-line definitions for clarity.

DevOps is the cultural movement aimed at breaking down silos. It improves collaboration across the SDLC—from planning and coding to testing, deployment, and operations.

SRE began at Google as the application of software engineering principles to operations. Historically, operators knew systems well but weren’t software developers. Google shifted that model, expecting SREs to understand both software design and systems. Today, not every organization does it the same way, but the principle remains: SRE is more than system administration—it’s a role that requires deeper software knowledge.

DevEx is the outcome of enabling developers to be more productive. It’s well-studied now, with research from books like Accelerate and companies such as GetDX. Measurement is critical: you can’t improve productivity without first understanding where you stand.

Platform Engineering, then, is about the tools, processes, and techniques that improve all these aspects. It’s not just building automation or software—it’s building capabilities as a product. That’s why it’s different. Platform engineers don’t “take code and push it to production.” That’s an anti-pattern. Instead, the role is about enabling developers to be self-sufficient, reducing friction, and making the SDLC as fast and efficient as possible.

Think of it like providing APIs or libraries. Developers need to ask: “What do I need to be productive?” Platform Engineering exists to deliver those capabilities.

Sean Alvarez: What makes our Platform Engineer’s Handbook unique is its practical approach. We’re less about theory and more about how to actually create these capabilities.

If you look at tooling, the distinctions become clearer. Platform Engineering often brings to mind Kubernetes clusters and deployment automation—but it’s much broader than that. It includes enabling new projects, automating pipelines, and creating self-service capabilities.

SRE, on the other hand, focuses on reliability—SLIs, SLOs, SLAs, and observability. Their responsibility is to ensure systems are running well in production.

Developer experience isn’t just about portals like Backstage. It’s about making platforms easy to use. Whether through APIs, CLI tools, or portals, DevEx ensures developers adopt what Platform Engineering provides.

Together, these roles form a layered model: Platform Engineering builds capabilities, DevEx makes them usable, and SRE ensures reliability. The overlap is real, but the responsibilities remain distinct.

6: Looking back on your years building and scaling platforms, what was one of the hardest lessons you learned at the enterprise level?

Sean Alvarez: When people hear “enterprise,” they often think “process”—multiple levels of management, sign-offs, and compliance. That usually leads to a push for standardization.

In many organizations, Platform Engineering is introduced as a way to rein in DevOps sprawl. Instead of every team building its own pipeline or observability, leadership wants a single, standardized platform for everyone.

But forcing a single solution across the enterprise is often a recipe for failure. It creates complexity, slows delivery, and leaves some teams feeling stifled. They start working around the platform to meet their needs, and friction grows.

A better approach is the 80/20 rule: serve the 80% of teams whose needs can be standardized, and let the remaining 20% adapt their processes where necessary. That reduces time-to-value, avoids endless edge-case debates, and ensures most teams actually benefit from the platform.

Ajay Chankramath: Sean made a great point about the 80/20 rule. You’ll never get 100% success, and you shouldn’t aim for it. The question is: how do you achieve that 80%?

In our book, we outline seven principles that guide successful platforms. A few highlights:

Measure what you improve. Quantify waste and friction so improvements are visible.
Treat platforms as products. This isn’t just automation—it’s technical product management. Capabilities must be managed like products with clear value propositions.
Balance build vs. buy. Engineers love to build, but with so many products available, organizations must consider total cost of ownership.
Design for composability. Platforms aren’t just Kubernetes. They consist of multiple components. Each must be composable, extensible, and replaceable.
Prioritize observability. Don’t limit it to applications—extend it across the SDLC.
Enable team autonomy. If teams wait on others for security or approvals, waste accumulates. Platforms must empower autonomy.
Articulate value. This is critical. Success depends on clearly communicating the value unlocked, not just building capabilities.

Without these principles—especially value articulation—platform engineering risks losing executive support and ultimately failing.

7: Do you think there is confusion or resistance when people start working on Platform Engineering or its principles? Are there gaps, and what has worked for you in bridging those gaps?

Sean Alvarez: One reason organizations adopt Platform Engineering is to standardize and make developers’ jobs easier. But in doing so, we often take control away from developers.

For example, if we say, “Every developer must use this deployment pipeline,” it might simplify things for the organization, but inevitably some teams will run into cases it doesn’t support. Maybe the pipeline was designed for a single service, but a team needs to deploy three. Now they’re frustrated, waiting on Platform Engineering to adapt the pipeline—or worse, being told to make their work fit into it.

That dynamic quickly creates confusion and resentment. Developers no longer want to use the platform, and adoption stalls.

The way to bridge this gap is to treat the platform as an internal product. The most effective way is to have a technical product owner—someone who understands product management practices but also has the technical depth to talk with developers.

This role continuously interviews developers, identifies gaps in their day-to-day work, and ensures the platform gives them flexibility—guardrails where needed, but also options to override defaults when necessary. By organizing and prioritizing a backlog around developer needs, a technical product owner ensures the platform provides real value, which drives adoption and organizational success.

Ajay Chankramath: Sean covered most of it, but I’ll add this: the technical product owner role is not about saying “yes” to everything developers ask for. Every developer comes with their own perspective, and building every request would be unsustainable.

The job is to balance ROI across the organization. Building platform capabilities costs time and money. The question is: what value will this unlock across the enterprise, not just for an individual team?

That’s where challenging requirements becomes important. By pushing back and prioritizing based on organizational value, the technical product owner ensures investments make sense.

This ties back to the seventh principle we mentioned earlier—articulating value. Without it, platforms risk losing executive sponsorship. We’ve seen it happen: executives reassign platform engineers back into product teams because they don’t see visible value.

A strong technical product owner prevents that by showing how the platform delivers ROI across the organization. That value articulation is often the difference between success and failure.

8: Your book, Platform Engineer’s Handbook, takes a “build first, clarify as you go” approach. Instead of starting with strategy decks and frameworks, you dive straight into building. Why do you think this approach works better, especially for technologists new to Platform Engineering? And what makes it more effective than leading with theory?

Sean Alvarez: Great question. If you think about who gets into Platform Engineering, it’s often people from two backgrounds. Some are software developers who understand APIs and architecture but aren’t used to handling infrastructure. Others come from operations or SRE roles, familiar with Terraform, ServiceNow, or Dynatrace, but less with software development practices like GitHub projects or release pipelines.

Platform Engineering covers the entire SDLC, which is a big scope. You can’t really “practice” it in a small sandbox—it requires working with real teams, real projects, and real deployments.

That’s why starting with building makes sense. It’s similar to “dogfooding”—using what you create. If you build even a small demo platform and deploy an application on it, you immediately see the friction developers face. That teaches you what features are needed. You then build those features, measure their impact, and learn from the experience.

The theory and strategy become clearer once you’ve lived through the practice. You’re not just reading slides—you’ve felt the difference. That makes the lessons stick and prepares you to scale up to enterprise scenarios.

Ajay Chankramath: I agree. Developers understand software best. If you show them real builds and workflows, you bring them along much faster than if you start with strategy.

When this idea was first proposed, Sean and I knew it would be a challenge because we were used to explaining the “why” first. But flipping the model is powerful. It engages developers with what they already enjoy—building—and then shows them why it matters.

That’s why we believe this approach will make the book stand out. It’s not theory-heavy; it’s practical, hands-on, and aligned with how developers actually learn.

9: Ajay, you’ve repeatedly emphasized the seven principles and the importance of measuring value. Do you think taking a “build first, clarify as you go” approach will help highlight that key aspect of value measurement?

Ajay Chankramath: That was the first question that came to my mind when we considered writing this book. We’ve always said you need to measure something before you can improve it. How would that work with a “build first” model?

The more I thought about it, the more I realized this approach makes value measurement easier. When you’re actually building, the value is visible right there—not in a spreadsheet or a theoretical model. Developers can see what they’ve created, how much effort it took, and what productivity gains it delivers.

Instead of abstract discussions, you have concrete examples: “I built this, and here’s the measurable improvement.” That makes the articulation of value much stronger.

This is also one of the unique aspects of the book. I haven’t seen another resource take this approach—making value measurement practical and hands-on. It’s a challenge, but we’re confident it will provide a powerful way to connect principles with real-world practice.

10: Looking ahead, as more organizations mature their platform teams, how do you see the SRE role evolving? Do you expect a dramatic shift in the next few years?

Ajay Chankramath: Absolutely—and it’s a great question. Let me break it down across a few dimensions:

Specialization: SREs focus on complex reliability challenges, while platform engineers focus on building capabilities that enable both developers and SREs. The partnership between these roles will strengthen. Instead of friction over “who owns what,” SREs and platform engineers will complement each other.
Abstraction: SREs will increasingly work at higher levels, such as service meshes and clusters, rather than building individual features. Their focus will stay on ensuring reliability under pressure, not on building platform products. That’s where collaboration with platform engineers becomes critical.
Domain-driven platform engineering: This means applying platform principles directly into products, not just infrastructure. For SREs, it requires more domain knowledge—something Google emphasized in its original SRE model but that has been diluted over time. I believe we’ll see a return to that principle.
AI: SREs are already using AI for anomaly detection, root cause analysis, and automated remediation. Platform engineers will need to provide capabilities that make this easier. AI won’t replace these roles, but it will reshape them, moving focus from rote tasks to domain-driven decision-making.
Risk and compliance: Especially in industries like finance and healthcare, SREs will need to take on more responsibility here, supported by platform capabilities. Compliance is not going away—it’s only becoming more central.

We’ll also see some fluidity between roles. Some SREs will transition into Platform Engineering if their interests and skills align more with building capabilities, and vice versa. This cross-pollination will strengthen both practices.

11: Sean, you mentioned that much of Platform Engineering is about automating deployments and similar processes. Do you think AI can really make a big difference here, or is it still hype and early experimentation?

Sean Alvarez: Ajay mentioned the growing importance of domain knowledge, and I think that’s the key. Every time there’s a wave of abstraction, people ask if their jobs will disappear.

When cloud and serverless databases arrived, DBAs wondered if they were obsolete. Now with AI, people are asking the same thing: will software developers vanish because AI can write code? Do we even need Platform Engineering if AI can generate infrastructure as code, analyze logs, and raise alerts automatically?

The reality is that AI can take over rote, repetitive tasks—like writing observability queries or scanning logs for anomalies. That frees SREs and platform engineers to focus on higher-value work: understanding what uptime means in a given domain, or which processes truly require five nines of reliability.

For example, in fintech, stock-trading throughput during the day has a very different priority than a nightly batch process. One demands higher uptime even if it costs millions more; the other can tolerate delays. SREs and platform engineers, with their domain understanding, are the ones who can make those calls and guide how AI should be applied.
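To make the trade-off concrete, here is a small worked calculation of how much downtime an availability target leaves per year. The targets and the function name are illustrative assumptions, not figures from the conversation.

```python
# Allowed downtime per year for a given availability target.
# A non-leap year is assumed: 365 days * 24 hours * 60 minutes.
MINUTES_PER_YEAR = 365 * 24 * 60


def downtime_minutes_per_year(availability: float) -> float:
    """Return the yearly downtime budget, in minutes, for a target such as 0.999."""
    return (1.0 - availability) * MINUTES_PER_YEAR


# Illustrative targets only: a trading path might justify five nines,
# while a nightly batch job might be fine with three.
for label, target in [("three nines", 0.999), ("five nines", 0.99999)]:
    print(f"{label}: {downtime_minutes_per_year(target):.2f} minutes/year")
```

Three nines leaves roughly 526 minutes a year, while five nines leaves a little over 5. Deciding which workloads deserve the tighter budget is the kind of domain call described above.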

So I see AI as an inflection point. It won’t replace these roles—it will elevate them. The day-to-day debugging and manual tasks will shrink, while the focus shifts to domain analysis and delivering business value.

12: For someone currently in an SRE or DevOps role who is now expected to build or contribute to an internal platform, what’s the one mindset shift or practical skill you’d suggest they prioritize? And what’s the first practical step they should take to set themselves up for success?

Sean Alvarez: The mindset shift is this: whenever you build something—whether it’s a script or an automation—ask yourself, “Will someone need to contact me to use this?” If the answer is yes, it’s not done yet.

The goal is self-sufficiency. If a developer needs help every time they use your tool, you’re stuck in an operations role—answering tickets all day—instead of moving on to the next feature. True platform engineering means building things that others can use independently. That mindset is critical.

Ajay Chankramath: Sean gave a great example. To tie it together: the biggest shift is adopting a product mindset. Every script or automation should be treated like a product with long-term viability.

The second piece—and sometimes more important than technical skills—is communication. Build your soft skills. Relationships, collaboration, and communication determine whether a platform succeeds. Developers historically avoided this, but today it’s essential.

AI or not, tools will come and go. The constant is people. If you can communicate, align stakeholders, and articulate value, you’ll set yourself—and your platform—up for success.

To explore the principles and practices discussed in this conversation in greater depth—including building developer-focused platforms from a blank slate, embedding observability and security, enabling self-service onboarding, and layering in AI-augmented services—keep an eye out for The Platform Engineer’s Handbook by Sean P Alvarez and Ajay Chankramath, coming August 2026.

The book provides a hands-on, progressive journey: from source control governance and Kubernetes runtimes to developer portals, reusable CI/CD workflows, infrastructure blueprints, and FinOps observability. Each chapter combines concepts with lab-based exercises and production-ready patterns, equipping engineers to build scalable, secure platforms that streamline software delivery.

Pragmatic Clean Architecture in Python: A Conversation with Sam Keen

Divya Anne Selvaraj — Thu, 18 Sep 2025 06:28:01 GMT

From structuring APIs and isolating domain logic to refactoring legacy systems, Python’s flexibility presents both opportunities and challenges for building sustainable software. In this conversation, we speak with Sam Keen, author of Clean Architecture with Python (Packt, 2025), about applying architectural principles to real-world Python projects without sacrificing the language’s ethos.

Sam is a software engineering leader with over 25 years of experience, a polyglot developer who has used Python everywhere from early-stage startups to large-scale systems at AWS, Lululemon, and Nike. At Lululemon, he led the company’s first cloud-native development team, setting foundational standards for distributed architecture. Currently a Principal Engineer at Pluralsight, he focuses on leveraging generative AI for software engineering enablement—building tools that amplify developer productivity while preserving architectural integrity.

In this interview, we explore how clean architecture can be adapted to Python’s dynamic nature, where SOLID principles prove tricky, and how to keep frameworks like Django and FastAPI from leaking into core logic. We also discuss pragmatic strategies for enforcing the dependency rule, modeling entities and value objects with Python’s modern features, and managing testing and refactoring in complex systems. Looking ahead, Sam offers insights on how AI is reshaping development workflows and what it means for applying clean architecture across services and scaling applications.

You can watch the full conversation below—or read on for the complete transcript.

1: What motivated you to write Clean Architecture with Python, and why do you think Python as a language needs its own treatment of these ideas?

Sam Keen: I’ve been in development for quite some time—and in Python for a big portion of that. As for the topic, clean architecture has been around; Uncle Bob’s book has been out for quite some time. I’d always seen it discussed in the context of static languages like Java and C#. In a lot of Python communities, the thinking was that we didn’t quite need that—that it seemed overburdensome. So I wanted to take an approach and see: can we take from clean architecture the aspects that help us maintain larger Python codebases—without trying to turn it into Java? I knew there would be pushback, so I definitely wanted to keep it Pythonic.

As for why write a book—this is my first published book. During the COVID years, I had a lot of time and got into doing YouTube tutorials for game development. I really liked that content-creation process, and when the opportunity came up to write this book, I thought, yeah, that’s the next step. I’ve always wanted to write a book, so those motivations aligned with good timing.

2: In your book, how do you explain clean architecture in Pythonic terms—especially to developers who know the classic concepts but struggle to apply them cleanly in real-world Python projects?

Sam Keen: Returning to the previous answer: clean architecture aligns well with the Pythonic ethos. One of Python’s guiding principles is that explicit is better than implicit, and clean architecture gives you that explicit map for your application. If you’re doing object-oriented development, the first things you’re concerned with are the classes and how to design them. I’m sure we’ll talk about SOLID. You use those SOLID principles on the class itself to ensure it’s cohesive. Clean architecture takes that and expands on it: how do we apply some of these same principles to the entire application as a whole? The original “clean architecture” was explicit that it’s not a framework and not a rigid set of rules; it’s a set of principles to follow and adapt to your needs. There isn’t one playbook for clean architecture; it depends on what you’re building. You don’t want to build a skyscraper if it’s a little cottage codebase.

3: SOLID principles are often cited as foundational to clean architecture. According to you, which of them are easiest and which of them are the hardest to apply in a dynamic language like Python?

Sam Keen: I think it’s kind of the usual suspects. With SOLID, the S—the single responsibility principle—a lot of folks comprehend pretty well. It’s the first letter, and people understand that a class should have a single concern, that sort of thing. That translates well into Python. The compiler isn’t going to help you much in that regard—it’s more of a human design decision.

It gets more into the nuance with the I—the interface segregation principle—having well-structured and focused interfaces. That’s where it gets tricky. For instance, you might have a vehicle class. You wouldn’t want to have the engine be part of that vehicle base class, because you may have gasoline cars but also electric cars. If you couple the concept of an engine directly into the vehicle base class, you’ll end up with a class that has a power level and a fuel level in liters. You’ll have parts of that interface that don’t make sense for all of the concrete classes that inherit from it.
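One way to sketch the vehicle example is with small, focused `typing.Protocol` interfaces instead of one wide base class. All class, attribute, and function names below are hypothetical, chosen only to illustrate the idea.

```python
from typing import Protocol


class Drivable(Protocol):
    def drive(self, km: float) -> None: ...


class Refuelable(Protocol):
    fuel_liters: float

    def refuel(self, liters: float) -> None: ...


class Rechargeable(Protocol):
    charge_percent: float

    def recharge(self, percent: float) -> None: ...


class GasolineCar:
    """Satisfies Drivable and Refuelable; knows nothing about batteries."""

    def __init__(self) -> None:
        self.fuel_liters = 0.0
        self.odometer_km = 0.0

    def drive(self, km: float) -> None:
        self.odometer_km += km

    def refuel(self, liters: float) -> None:
        self.fuel_liters += liters


class ElectricCar:
    """Satisfies Drivable and Rechargeable; has no fuel tank to misreport."""

    def __init__(self) -> None:
        self.charge_percent = 0.0
        self.odometer_km = 0.0

    def drive(self, km: float) -> None:
        self.odometer_km += km

    def recharge(self, percent: float) -> None:
        self.charge_percent = min(100.0, self.charge_percent + percent)


def fill_up(car: Refuelable, liters: float) -> None:
    # Callers depend only on the narrow interface they actually need.
    car.refuel(liters)


gas_car = GasolineCar()
fill_up(gas_car, 40.0)
print(gas_car.fuel_liters)
```

Because each interface is narrow, neither concrete class carries attributes that make no sense for it, and a type checker can reject passing an `ElectricCar` to `fill_up`.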

Another one is the L—the Liskov substitution principle. This is about ensuring that if you have a base class and then classes implementing it, none of those subclasses disrupt the contract of the base class. Anywhere in your code where you’re referring to the base class, you should be able to insert one of the child classes and have it function fine. That’s something a compiler in a static language would help with. In Python, you don’t have that, so type hinting—and I’m sure we’ll talk more about this since it’s core to the book—paired with mypy gives you a little bit of that compiler-like type checking. That can be very helpful in Python.
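A minimal sketch of how type hints surface a Liskov violation. The `Notifier` classes are hypothetical, and the remark about mypy describes the kind of report a static checker is expected to give, not output captured from a run.

```python
from typing import Sequence


class Notifier:
    def send(self, message: str) -> bool:
        """Return True when the message was accepted."""
        return True


class EmailNotifier(Notifier):
    # Same signature as the base class, so it can stand in anywhere
    # a Notifier is expected.
    def send(self, message: str) -> bool:
        return len(message) > 0


class BrokenNotifier(Notifier):
    # The extra required parameter breaks the base-class contract.
    # A checker such as mypy is expected to report this override as
    # incompatible with the supertype; Python itself only fails at call time.
    def send(self, message: str, retries: int) -> bool:
        return retries > 0


def notify_all(notifiers: Sequence[Notifier], message: str) -> int:
    """Count how many notifiers accepted the message."""
    return sum(1 for notifier in notifiers if notifier.send(message))


print(notify_all([Notifier(), EmailNotifier()], "build finished"))

try:
    notify_all([BrokenNotifier()], "build finished")
except TypeError as error:
    # Without a static check, the violation shows up only here, at runtime.
    print(f"runtime failure: {error}")
```

The code that calls `notify_all` is written against the base class, which is exactly where a non-substitutable subclass does its damage.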

And then of course, unit testing is always good. So the L and the I are the tougher ones, while S is the easiest.

4: How do you enforce the dependency rule in your projects so that business logic stays free from frameworks, ORMs, or infrastructure code?

Sam Keen: The dependency rule is really core to clean architecture. If you get that one thing right, you’re doing quite well. Conceptually, clean architecture is built in layers. You have the inner domain layer—core business objects with very few dependencies. Then you move out to the application layer, which orchestrates workflows and manages those entities—for example, “save task” or “complete task.”

Then there’s the interfaces layer. It should be a thin layer, often with controllers that translate between the outer layer and the inner business core—the application and domain layers. Finally, the outer layer is your frameworks and drivers. That’s where all the volatility is—things like SQLAlchemy and external dependencies that you don’t control.

Back to your question: how do you enforce the dependency rule? The principle is that dependencies all face inward. The domain layer shouldn’t be aware of anything outside it. The application layer shouldn’t be aware of anything outside it either. They should only depend on what’s inside.

In Python, just like in other languages, you can check this. One way is pragmatic—make sure the team understands the principle and why it matters. Another way is structural—use a folder structure: a folder for domain objects, a folder for application objects, and so on. That way, each file is bound by the rules of its layer.

You can also get precise with automation. For example, in the book we give a simple fitness function test. It runs linting across import statements and checks them against the known directory structure of the layers. If it finds that a use case in the application layer is importing from, say, a driver, it fails the build. That shifts knowledge of violations left, so developers can correct them quickly.

So, it’s a hierarchy: first ensure the team understands the principle, then reinforce it with clear folder structure, and finally back it up with tests that fail on violations.
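The automated check described above could look roughly like the following. This is a sketch under stated assumptions, not the test from the book: it assumes absolute imports and top-level packages named after the layers, and the layer names and `find_violations` helper are hypothetical.

```python
import ast

# Hypothetical layer packages, ordered from innermost to outermost.
LAYERS = ["domain", "application", "interfaces", "infrastructure"]


def find_violations(layer: str, source: str) -> list[str]:
    """Return modules imported by `source` that live in a layer outside `layer`."""
    allowed = set(LAYERS[: LAYERS.index(layer) + 1])
    violations = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            modules = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.module:
            modules = [node.module]
        else:
            continue
        for module in modules:
            top_level = module.split(".")[0]
            # Only imports of known layer packages are judged; the standard
            # library and third-party packages are ignored in this sketch.
            if top_level in LAYERS and top_level not in allowed:
                violations.append(module)
    return violations


# A use case in the application layer that reaches outward into a driver.
use_case_source = (
    "from domain.task import Task\n"
    "from infrastructure.db import Session\n"
)
print(find_violations("application", use_case_source))
```

In a real project, a test would walk each layer's folder, call a function like this on every file, and assert that the result is empty so that a violation fails the build.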

5: What kind of project structure do you recommend? Do you prefer separation by layer—domain, interfaces, etc.—or by feature? And how do you keep the layout from becoming overly rigid?

Sam Keen: One example we mentioned earlier is having a folder per layer—that’s definitely a possibility. But to step back, the bigger principle is: always have the simplest solution that meets the needs of your project and your team. You don’t want to overbuild.

For example, you might have a very simple CRUD application, essentially an API fronting a database with create, read, update, and delete functionality. There’s not much to it. In that case, you could build it in more of a feature structure (say, a task microservice) and just go with FastAPI in a one- or two-file implementation. That makes sense because there isn’t much domain logic in that particular service. I see a lot of single-file frameworks that work quite well for these small cases. That’s one end of the spectrum.

On the other end, as projects and codebases grow larger, with more complex business rules, you start to shift toward the domain structure we talked about earlier—a folder per layer. That helps put in a structure where the right thing to do is also the easiest. For example, if your team decides to support multiple users instead of just one, you now need a User domain object. With a folder-per-layer design, the map is already there: the business object goes in the domain folder, the orchestration goes in the application folder, and so on.

It’s not one-size-fits-all. It’s something you can evolve into. And when your team makes an intentional decision to bend a common practice—for pragmatic reasons—document it in an ADR, an architectural decision record. That way it’s explicit and intentional, not accidental complexity. It also helps developers who come later—or even yourself six months down the road—understand why that choice was made.

6: Domain-driven design is a big part of your approach. How do you model things like entities and value objects in idiomatic Python?

Sam Keen: Something very popular in modern Python is the use of dataclasses. They’re a great way to model domain objects because they eliminate boilerplate and make classes very easy to comprehend—you see just the attributes and functions.

For entities, which have an identity, I use a thin base entity class that every entity extends. A Task, a User, any entity extends this base entity class. That gives you a primary ID field—a reserved field for all entities. Anything in the system that knows it’s dealing with an entity knows that ID field is there and that it’s universally unique.

In Python specifically, in that entity class you’d also implement the __hash__ and __eq__ methods. That keeps you in line with the concept of an entity having identity. You can change all the attributes of that class, but it will always represent the same person or the same task—it’s just that its attributes have changed.
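A minimal sketch of such a base entity, assuming a UUID as the reserved ID field. The `Entity` and `Task` names and their fields are illustrative, not taken from the book.

```python
from dataclasses import dataclass, field
from uuid import UUID, uuid4


@dataclass(eq=False)  # eq=False: keep the identity-based __eq__/__hash__ below
class Entity:
    # Reserved identity field; generated automatically, not passed to __init__.
    id: UUID = field(default_factory=uuid4, init=False)

    def __eq__(self, other: object) -> bool:
        if not isinstance(other, Entity):
            return NotImplemented
        return self.id == other.id

    def __hash__(self) -> int:
        return hash(self.id)


@dataclass(eq=False)  # subclasses also opt out of field-by-field equality
class Task(Entity):
    title: str
    done: bool = False


first = Task(title="Write report")
second = Task(title="Write report")
print(first == second)  # different identities despite identical attributes

first.title = "Write final report"
print(first == first)  # still the same task after its attributes change
```

Passing `eq=False` on the subclass matters: otherwise the dataclass decorator would generate a field-by-field `__eq__` and make the class unhashable.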

A value object is different. It doesn’t have an ID field; it’s defined solely by its properties. Again, a dataclass works well, but here you’d set frozen=True to make it immutable. For example, an exchange rate could be modeled this way. If any of its attributes change, it’s no longer the same object. By making it immutable, you guarantee that once an exchange rate object is created, its values won’t change, which fits the definition of a value object.
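The exchange-rate example might be sketched like this. The field names are assumptions made for illustration.

```python
from dataclasses import FrozenInstanceError, dataclass
from decimal import Decimal


@dataclass(frozen=True)
class ExchangeRate:
    """Value object: no ID field, defined entirely by its attributes."""

    base: str
    quote: str
    rate: Decimal


usd_eur = ExchangeRate("USD", "EUR", Decimal("0.92"))

# Two value objects with the same attributes are interchangeable.
print(usd_eur == ExchangeRate("USD", "EUR", Decimal("0.92")))

try:
    usd_eur.rate = Decimal("0.95")  # type: ignore[misc]
except FrozenInstanceError:
    # A different rate means a different object, so create a new one instead.
    print("ExchangeRate is immutable")
```

With `frozen=True` and the default equality, the dataclass is also hashable, so value objects can be used as dictionary keys or set members.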

So, as a Python developer, the tools for following domain-driven design are built into the language. It aligns well with the principles of clean architecture.

7: Frameworks like FastAPI or Django can easily creep into core logic. How do you recommend engineers keep frameworks out of the domain and use case layers?

Sam Keen: Again, it builds on what we talked about—deciding on the directory structure that makes sense for your project. You want to make the right thing to do the easy thing. A clear structure gives you context: when you look at a class in the application layer, it should be obvious if it’s behaving abnormally by depending on a framework.

Beyond structure, it comes down to applying principles and patterns. Take databases, for example. SQLAlchemy is a framework. You shouldn’t see any reference to it in the domain or application layers. The anti-pattern would be a User class in the domain layer with SQLAlchemy methods directly on it to save to the database—that’s direct coupling.

Instead, you use the repository pattern. You define an interface with simple methods like save_user, get_user, delete_user. Your User object depends only on that contract. Then, in the frameworks layer, you implement that repository using SQLAlchemy or whatever tool you need.

That way, the domain isn’t directly coupled to the framework—both the domain object and the implementation simply agree to the interface. This is the general pattern across the board: keep frameworks out of your core logic by making them details implemented at the outer layers.
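A minimal sketch of the repository pattern as described, using the method names from the answer. The in-memory class stands in for an outer-layer implementation (a SQLAlchemy-backed one would expose the same methods), and the `rename_user` function is a hypothetical caller added for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Protocol


@dataclass
class User:
    user_id: str
    name: str


class UserRepository(Protocol):
    """The contract the inner layers code against."""

    def save_user(self, user: User) -> None: ...

    def get_user(self, user_id: str) -> Optional[User]: ...

    def delete_user(self, user_id: str) -> None: ...


class InMemoryUserRepository:
    """Outer-layer implementation; storage details never leak inward."""

    def __init__(self) -> None:
        self._users: dict[str, User] = {}

    def save_user(self, user: User) -> None:
        self._users[user.user_id] = user

    def get_user(self, user_id: str) -> Optional[User]:
        return self._users.get(user_id)

    def delete_user(self, user_id: str) -> None:
        self._users.pop(user_id, None)


def rename_user(repo: UserRepository, user_id: str, new_name: str) -> User:
    """Hypothetical use case: it only knows about the interface."""
    user = repo.get_user(user_id)
    if user is None:
        raise KeyError(user_id)
    user.name = new_name
    repo.save_user(user)
    return user


repo = InMemoryUserRepository()
repo.save_user(User(user_id="u1", name="Ada"))
print(rename_user(repo, "u1", "Ada Lovelace").name)
```

Swapping the storage technology means writing another class with the same three methods; nothing in the core code changes.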

8: Do you use Pydantic or dataclasses in your domain models, or do you restrict them to boundaries? How do you handle input validation and transformation cleanly?

Sam Keen: That’s an interesting one. When I was writing the book, I actually drifted into being too strict with clean architecture—treating it almost like a framework. In some early drafts, I found myself duplicating property validation in two places: once in the interface layer and again in the domain layer, but using different mechanisms. Anytime you’re duplicating validation, that’s a red flag.

The framework I was using was Pydantic, which is very popular in Python. I use it extensively. It has strong validation and serialization methods. In practice, I made a calculated choice: in some applications, I would allow Pydantic into the domain layer. That’s because it’s mainstream, well supported, and it reduced a lot of boilerplate and duplicated code in that specific case.

That’s the bigger point—clean architecture is a set of principles, not a rigid framework. Sometimes you’ll make compromises to reduce complexity, and that’s OK. The important thing is to be transparent about it. Document the decision in an ADR—an architectural decision record—so it’s clear to the team and future developers that it was a conscious, intentional choice. That way, you’ve managed the trade-off explicitly rather than letting accidental complexity creep in.

9: What’s your testing strategy for a clean architecture codebase? You mentioned unit testing and things like that a bit earlier, but where do unit, integration, and end-to-end tests fit within this concept?

Sam Keen: A very common approach to testing is the test pyramid—Martin Fowler and others have popularized this. You want the base of that to be unit tests, which just test individual functions. They’re very quick—testing the behavior of a class on its own—so you want a large number of those. Above that are integration tests, where classes work together; in clean architecture, that often means classes working across layers. At the very top are end-to-end tests, where you actually boot up infrastructure and test as a real user against the application. Those are slow and often brittle because they test interfaces that change quite a bit, so maintaining many of them is a burden.

If you have a tightly coupled application, it’s hard to implement that pyramid—you can end up with an “ice cream cone,” because unit tests are hard to write and you brute-force a lot of end-to-end tests. Clean architecture enables you to have the true pyramid. The domain layer has no dependencies, so you can easily write unit tests without starting a database or worrying about network calls. Up into the application layer, because you’ve used dependency inversion and coded use cases against interfaces rather than concrete classes, a test can insert a mock that implements the interface. That keeps those tests very quick as well. You still need end-to-end tests for critical workflows as final validation, but the confidence comes from a plethora of fast unit tests and some integration tests. Clean really enables that true test pyramid.

10: What’s your approach to refactoring legacy Python code? How do you introduce clean architecture there without overhauling everything at once?

Sam Keen: We have a chapter devoted to this, and much of it is common practice regardless of language. What you don’t want to do is a Big Bang release: rewriting the entire stack and trying to release it all at once hardly ever works. It takes too long, requirements change, and the system you’re rebuilding keeps changing. You have to figure out how to slice up the problem. The “strangler fig” is a common metaphor: it’s a fig vine that grows up around a tree and gradually engulfs it, the way a new system grows around the legacy one until it can replace it.

Clean helps because you’re going from a system without a map—something chaotic—to one that has a map and discrete components. That lets you take Service A and split it off into a clean architecture with full test coverage. Where to start: begin at the lower layers. Look at your legacy system and find all the parts that in aggregate build the concept of a user. Extract that domain logic and build your true domain User in clean architecture. Then, using gateways and feature flags, parallelize traffic to the old system and the new system, compare, and ensure parity in state changes for the user across both.

Beyond domain objects, look for natural bounded contexts—returning to domain-driven design—and extract them into domains with their use cases. Do everything you can to avoid a Big Bang release. Clean architecture gives you the guidance to restructure incrementally and release with confidence.

11: How is AI changing the way we approach clean architecture?

Sam Keen: AI is the definition of disruptive. This generative AI wave we’ve come in on is really interesting because, again, we started the book a little over a year ago and, in AI years, that’s forever ago. I think baby ChatGPT-3 was coming out or something, and LLMs were just starting to be able to build Snake—that was the extent of what they could build. And then you look at where we are now.

There are a couple of dimensions. If you’re thinking of AI as a feature you would add to an application—integrating an LLM into an application—nothing really changes. That’s a driver—a framework driver. So the knowledge of, say, LangChain or LlamaIndex—really common frameworks—that’s all going to stay in the outer layers. Same playbook, and then you integrate that down to your pure domain objects. So that part doesn’t change.

The other part is using AI to build with—coding tools and these sorts of things. It helps. We talked about writing tests—AIs are great at writing unit tests, so you can definitely leverage it there. That’s where you mitigate hallucination—you’re writing tests to validate, so you can know the AI is doing the right thing. Another example: we have that User that needs to be saved to a database, and it has an interface contract. You can use AI to build the concrete class against that interface—at least get a start on it. Some might think of that as boilerplate, but it can do that.

Overall, the advantage is this idea of “context engineering.” The AI’s ability to help is only as good as the context you give it. If you let it index legacy, tightly coupled systems, it’s not really sure what the plan was (there kind of wasn’t one), so the AI will continue to build against that codebase without much of a plan. Whereas, if you’re explicit about your approach and you have these four folders with an easily defined rationale for what goes into each folder, all that context goes to the LLM, just as it would help a human. Using clean architecture and having that playbook for how to build out your application helps AIs do the right thing and not go off the rails. It’s exciting.

12: How do you apply clean architecture across multiple services or modules? Should each follow the full layering pattern or do you recommend something else?

Sam Keen: We touched on this a little bit. It’s not one pattern to fit everything, and with multiple services it’s the same idea: the purpose of each service matters. You may have portions that are really straightforward CRUD applications. You just know that you don’t want direct database access, so you put an API in front of that. Maybe, at this point in time, it’s a one-to-one mapping, and the API is there as a defensive layer in case you need to change things in the future. In those cases, FastAPI and using Pydantic throughout might be the right mechanism.

But your larger, business-rule-centric, orchestrating parts of the application—those are where you invest in a fully layered approach. And even though you’ve built services, you treat them as framework drivers with respect to one another. You have Service A you’ve built; you have Service B you’ve built. Service B treats Service A as a detail and doesn’t let Service A’s implementation details get pulled into the deeper layers of Service B.

13: When building scalable applications, how do you decide what belongs in the core versus what stays at the edge—for example, pagination, caching, or authentication?

Sam Keen: There’s a bit of a thought process to that. You get into the domain-driven design mentality of what’s core to the domain: what’s core to a user, versus transport protocols and that sort of thing. Those are just details, computer concepts.

A handy heuristic is: what would make sense even if there were no computer program? For example, a user having a schedule is a concept that made sense before computers were invented. But the knowledge that we’re transferring data using JSON or gRPC is a computer concept; it’s a detail of the platform we’re implementing these user domain objects on. To be facetious, if we switch to quantum computers in ten years, we’ll bring our domain objects with us, but all those other details are going to change.

So that’s the macroscopic way to think about it—what are the nouns, the objects of our system—versus what’s a transport or technology detail that’s not core and should stay at the edge. There is some nuance to authentication. You may have a system that needs to be impenetrable, so every layer may need to validate the authentication—like a zero-trust approach. But you may not be in that case, so you may stop authentication at, say, the adapters layer. Then everything below either assumes authentication or doesn’t have knowledge of it. That’s an example where that computer concept could come all the way down to the domain out of need.

14: How do you manage inter-service communication in Python systems built with clean architecture?

Sam Keen: Again, that’s a common practice regardless, but clean architecture helps you with it. In event-driven architecture, messages are another way you can leak implementation details. Be very cognizant of the information you put in the message. It should be named in the past tense, as in “this happened,” and follow those sorts of practices.

Be cognizant that it’s a common way for implementation details to get transmitted across the wire, versus, you know, a leaky interface at the code layer. Even at the transport level, you can leak implementation details that you don’t want to, and that will cause coupling as well.

To learn Clean Architecture through a series of real-world, code-centric examples and exercises, optimize system componentization, and significantly reduce maintenance burden and overall complexity, check out Clean Architecture with Python by Sam Keen. The book helps you apply Clean Architecture concepts confidently to new Python projects and legacy code refactoring.

Inside Go Systems Programming: A Conversation with Mihalis Tsoukalos

Divya Anne Selvaraj — Wed, 20 Aug 2025 09:39:38 GMT

From goroutine scheduling quirks to profile-guided optimizations, Go has grown into a language that balances simplicity with systems-level power. In this conversation, we speak with Mihalis Tsoukalos—author of Mastering Go, Fourth Edition (Packt, 2025)—about what it takes to write high-performance, maintainable Go in today’s evolving ecosystem.

Mihalis is a Unix systems engineer and prolific technical author whose books Go Systems Programming and Mastering Go have become staples for developers working close to the metal with Go and Linux. He holds a BSc in Mathematics from the University of Patras and an MSc in IT from University College London, and his work has appeared in Linux Journal, USENIX ;login:, and C/C++ Users Journal. His expertise spans systems programming, time series data, and databases, but his reputation in the Go community comes from distilling that low-level experience into accessible, practical guidance.

In this interview, we explore what motivated the fourth edition of Mastering Go and the audiences it serves, the realities of structuring goroutines and channels correctly, and the concurrency patterns that actually hold up under production workloads. We also dive into Go’s runtime improvements, profiling and memory-management workflows, and the maturing role of generics in real-world projects. Beyond language features, Mihalis shares his perspective on observability, the expanding standard library, and how Go compares with Rust and Zig for systems programming. Looking ahead, he offers a candid view of where Go is headed—from concurrency safety to ecosystem maturity—without losing sight of its defining trait: clarity without unnecessary complexity.

You can watch the full conversation below—or read on for the complete transcript.

1: What motivated you to write the 4th edition of Mastering Go? Who should pick it up, and what kind of projects will the book help them with?

Mihalis Tsoukalos: First of all, I want to start with a disclaimer—nothing, no book or any other resource, can replace experience. You have to try things. That’s the general idea, and that’s why I wrote the book—to make you try.

The main reason for the 4th edition of Mastering Go was the continued growth and evolution of Go. Since the last edition, the language has seen major changes. The most important was the addition of generics in Go 1.18, a long-awaited feature that really shifted how developers think about type safety and code reuse. Alongside that, we’ve had improvements to modules, better WebAssembly support, and lots of enhancements across the standard library and toolchain, including a faster testing workflow. So it made perfect sense to update the book to reflect where Go is today.

Another big motivator was feedback from the Go community. Mastering Go has always aimed to be a practical, hands-on guide, and readers kept asking for more real-world examples—especially about things like concurrency, networking, and systems-level programming. This edition builds on that by going deeper into those topics and refining the guidance on writing idiomatic, maintainable Go code. Again, I have to say it: nothing can replace experience. You have to try things all the time. That’s the point of learning something new.

The book is best suited for intermediate to advanced developers—people who already understand the basics of programming and want to work with Go and take their skills further. It is particularly useful for engineers working on backend systems, command-line infrastructure tools, high-performance network applications, or cloud-native services. It’s also a solid resource for developers coming from languages like C, C++, Java, or Python who want to build scalable, efficient systems.

For example, it can be applied in projects like migrating a payments platform from Node.js to Go to reduce latency in transaction processing and better handle traffic, writing a custom reverse proxy in Go to maximize performance and manage connection concurrency efficiently, or building high-throughput log collectors or trace aggregators for an observability team that must process and forward millions of events per second.

So overall, this edition is not a minor update—it’s a reflection of how far Go has come as a language and how central it is becoming in areas like cloud computing, DevOps, and systems engineering. It is really for anyone serious about mastering Go and using it to build high-performance, real-world applications.

2: Today, Go’s concurrency model remains a major draw, with Go 1.22 fixing the long-standing loop variable capture issue. How do you now recommend structuring goroutines and channels to avoid common bugs?

Mihalis Tsoukalos: The concurrency model of Go has always been one of its strongest features because it’s simple, easy to understand, yet powerful. You can’t have everything, but what Go offers is pretty much what programmers want. Goroutines are lightweight, channels give you a clear way to coordinate between them, and it is generally easy to express concurrent logic in a readable way.

But you can always go wrong, especially when working on more complex or high-performance systems. One issue that has tripped up a lot of developers over the years was the loop variable capture problem. If you launched goroutines inside a loop, you could accidentally end up capturing the loop variable in a closure, which meant that all your goroutines might reference the same variable—not what you intended. Usually, you wanted each goroutine to take a different variable value. The typical workaround was to reassign the variable inside the loop, but it was error-prone. This was finally fixed in Go 1.22: now the language creates a new instance of the loop variable on each iteration, so closures and goroutines get the correct value automatically. It’s a small change in behavior, but it eliminates a very common class of bugs and makes concurrent code cleaner and more predictable.
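
As an illustration (not code from the interview), the sketch below launches one goroutine per loop iteration. The function name collect is hypothetical. It keeps the explicit per-iteration copy that was the usual workaround before Go 1.22, so it behaves the same on older and newer toolchains; from Go 1.22 onward the copy is redundant.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// collect starts one goroutine per value and records what each goroutine saw.
func collect(values []int) []int {
	var (
		mu  sync.Mutex
		wg  sync.WaitGroup
		out []int
	)
	for _, v := range values {
		v := v // pre-1.22 workaround: a fresh copy per iteration (redundant on Go 1.22+)
		wg.Add(1)
		go func() {
			defer wg.Done()
			mu.Lock()
			out = append(out, v)
			mu.Unlock()
		}()
	}
	wg.Wait()
	sort.Ints(out) // goroutines finish in any order, so sort for a stable result
	return out
}

func main() {
	fmt.Println(collect([]int{3, 1, 2})) // [1 2 3]
}
```

Without the copy, a pre-1.22 compiler would let every closure share one variable, and the goroutines could all observe the last value.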

That said, even with this fix in place, you still need to be cautious when writing concurrent code. A few best practices:

Always be explicit about goroutine ownership and lifecycle. The best way to do that is by using context.Context to manage cancellation and timeouts. This ensures goroutines don’t hang around longer than they should, avoiding memory leaks and unpredictable behavior.
Limit concurrency when needed. Just because goroutines are lightweight doesn’t mean you should spin up thousands of them without thinking. If you’re processing a large number of tasks or I/O operations, use worker pools, semaphores, or bounded channels to keep things under control.
Avoid unbuffered channels for high-volume communication. They’re great for synchronization, but if you’re passing a lot of data around, buffered channels reduce blocking and improve performance.
Always close channels properly. Only the sender should close the channel, and only once. Closing channels from multiple places or from the receiver side can cause panics or race conditions.
Use the select statement defensively, especially when working with multiple channels. A default case can help you avoid blocking in situations where responsiveness matters, like event loops or fault-tolerant systems.
Don’t force everything through channels. Although they look practical at first, sometimes mutexes or atomic operations are a better fit. Think carefully before you start writing code and designing your program.
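
A minimal sketch of several of these practices together, under assumed names (produce, sum, and sumFirst are hypothetical): the sending goroutine owns the channel and is the only place that closes it, the channel has a small buffer, and a context lets the caller stop the goroutine early.

```go
package main

import (
	"context"
	"fmt"
)

// produce is the only sender on the channel, so it is the only closer.
// It stops early if ctx is cancelled, which prevents a goroutine leak.
func produce(ctx context.Context, n int) <-chan int {
	out := make(chan int, 4) // small buffer to reduce blocking on bursts
	go func() {
		defer close(out) // closed exactly once, by the sender
		for i := 0; i < n; i++ {
			select {
			case out <- i:
			case <-ctx.Done():
				return
			}
		}
	}()
	return out
}

// sum drains the channel until the sender closes it.
func sum(ch <-chan int) int {
	total := 0
	for v := range ch {
		total += v
	}
	return total
}

// sumFirst wires the two together with a cancellable context.
func sumFirst(n int) int {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel() // releases the producer if we return early
	return sum(produce(ctx, n))
}

func main() {
	fmt.Println(sumFirst(5)) // 10
}
```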

So overall, Go 1.22 makes life easier for concurrent programming, but writing robust concurrent code still requires discipline, clear design, and a good understanding of how goroutines and channels behave under the hood. That’s what really helps you build systems that are both maintainable and production-ready. Again—think before you start writing code, and don’t just throw in goroutines because they’re lightweight.

3: When you think about concurrency at a systems level, which patterns do you find most effective for real-world workloads? Are there particular idioms you keep returning to, like worker pools or pipelines?

Mihalis Tsoukalos: At the systems level, concurrency is not just a feature—it’s a design principle. It influences how your software scales, how efficiently it uses resources, and how it behaves under pressure. Go gives you the primitives—goroutines and channels—but using them well requires a solid set of patterns you can rely on.

The first is worker pools. They are probably the most universally effective pattern. The Apache web server used to do this with threads. Instead of spawning a new goroutine for every task, you maintain a fixed set of workers that pull from a task queue. This gives you controlled concurrency—you’re not overloading the system with thousands of goroutines, and you stay within limits like memory, file descriptors, or database connections. This makes system behavior under load much more predictable because you know exactly what resources you’re using. For example, on a project I worked on, we used worker pools in a log processing service that handled thousands of files per hour without any issues.
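
The pattern can be sketched as follows. This is an illustrative example rather than the log processing service described above; runPool is a hypothetical name and squaring a number stands in for real work.

```go
package main

import (
	"fmt"
	"sync"
)

// runPool processes jobs with a fixed number of workers and returns the
// results in the same order as the input.
func runPool(workers int, jobs []int) []int {
	results := make([]int, len(jobs))
	queue := make(chan int) // carries job indexes

	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := range queue {
				results[i] = jobs[i] * jobs[i] // each index is written by exactly one worker
			}
		}()
	}

	for i := range jobs {
		queue <- i
	}
	close(queue) // sender closes; workers exit once the queue drains
	wg.Wait()
	return results
}

func main() {
	fmt.Println(runPool(3, []int{1, 2, 3, 4, 5})) // [1 4 9 16 25]
}
```

The number of goroutines is fixed by the workers argument regardless of how many jobs arrive, which is what keeps resource usage predictable.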

The second pattern is pipelines. These are great when you want to break a task into stages and process each stage concurrently. Each stage runs in its own goroutine and passes data to the next using a channel chain. It’s a clean way to handle streaming data transformations or multi-step processing. It encourages modularity and makes it easier to deal with backpressure and separation of concerns.
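
A small illustrative pipeline, with hypothetical stage names (source, square, onlyEven, drain). Each stage runs in its own goroutine, and the unbuffered channels between stages provide natural backpressure: a stage cannot run ahead of the one consuming its output.

```go
package main

import "fmt"

// source emits the given numbers and then closes its output.
func source(nums ...int) <-chan int {
	out := make(chan int)
	go func() {
		defer close(out)
		for _, n := range nums {
			out <- n
		}
	}()
	return out
}

// square is a transformation stage.
func square(in <-chan int) <-chan int {
	out := make(chan int)
	go func() {
		defer close(out)
		for n := range in {
			out <- n * n
		}
	}()
	return out
}

// onlyEven is a filtering stage.
func onlyEven(in <-chan int) <-chan int {
	out := make(chan int)
	go func() {
		defer close(out)
		for n := range in {
			if n%2 == 0 {
				out <- n
			}
		}
	}()
	return out
}

// drain collects everything the final stage produces.
func drain(in <-chan int) []int {
	var got []int
	for n := range in {
		got = append(got, n)
	}
	return got
}

func main() {
	fmt.Println(drain(onlyEven(square(source(1, 2, 3, 4))))) // [4 16]
}
```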

Another critical piece is context.Context, which I consider non-negotiable in any serious concurrent Go application. It’s the standard way to manage timeouts, cancellations, and deadlines across goroutines. If you’re handling HTTP requests, running background jobs, or coordinating distributed tasks, context helps you shut things down cleanly and avoid goroutine leaks. This is especially important when interacting with external systems like databases or APIs, where you don’t want calls hanging indefinitely. For example, if you’re writing a TCP server and connections are not closed properly, you can exhaust the process’s file descriptors and be unable to accept new connections.
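
As a sketch of the timeout case, assuming a hypothetical slowCall that stands in for a database or API request (durations are passed as milliseconds only to keep the example short):

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// slowCall simulates an external call that needs `work` to finish,
// unless the context is cancelled or times out first.
func slowCall(ctx context.Context, work time.Duration) error {
	select {
	case <-time.After(work):
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

// callWithTimeout bounds slowCall with a deadline.
func callWithTimeout(timeoutMs, workMs int) error {
	ctx, cancel := context.WithTimeout(context.Background(),
		time.Duration(timeoutMs)*time.Millisecond)
	defer cancel() // always release the context's timer
	return slowCall(ctx, time.Duration(workMs)*time.Millisecond)
}

// timedOut reports whether err was caused by the deadline expiring.
func timedOut(err error) bool {
	return errors.Is(err, context.DeadlineExceeded)
}

func main() {
	fmt.Println(callWithTimeout(500, 1))  // <nil>
	fmt.Println(callWithTimeout(10, 2000)) // context deadline exceeded
}
```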

Another pattern I use is fan-out/fan-in. Fan-out means launching multiple goroutines to handle parts of a job in parallel, and fan-in means collecting the results into a single place. Combined with worker pools, this is a powerful way to parallelize work and aggregate results efficiently. I used fan-out/fan-in for a monitoring aggregator service that queried many microservices for health and metrics data, some over HTTP, and then collected the results into a single response.
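
A minimal fan-out/fan-in sketch. The check function is a hypothetical stand-in for querying one service, and the service names are made up; the real aggregator described above is not shown in the interview.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// check is a placeholder for a real health or metrics request.
func check(service string) string {
	return service + ":ok"
}

// gather fans out one goroutine per service and fans the answers
// back into a single slice.
func gather(services []string) []string {
	results := make(chan string, len(services)) // room for every answer, so no sender blocks
	var wg sync.WaitGroup
	for _, s := range services {
		wg.Add(1)
		go func(name string) {
			defer wg.Done()
			results <- check(name)
		}(s)
	}
	wg.Wait()
	close(results)

	var out []string
	for r := range results {
		out = append(out, r)
	}
	sort.Strings(out) // arrival order is not deterministic
	return out
}

func main() {
	fmt.Println(gather([]string{"search", "auth", "billing"})) // [auth:ok billing:ok search:ok]
}
```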

I also rely heavily on select statements. Being able to multiplex across multiple channels or listen for a cancellation signal or timeout is incredibly powerful. It helps you write responsive systems that can recover from delays, retry on failure, or time out gracefully.
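
For instance, a select can wait on two channels and a timeout at once. The helper name firstOf below is hypothetical, and the wait is given in milliseconds for brevity.

```go
package main

import (
	"fmt"
	"time"
)

// firstOf returns the first value available on either channel, or false
// if nothing arrives within waitMs milliseconds.
func firstOf(a, b <-chan string, waitMs int) (string, bool) {
	timeout := time.After(time.Duration(waitMs) * time.Millisecond)
	select {
	case v := <-a:
		return v, true
	case v := <-b:
		return v, true
	case <-timeout:
		return "", false
	}
}

func main() {
	a, b := make(chan string, 1), make(chan string, 1)
	b <- "metrics ready"
	fmt.Println(firstOf(a, b, 100)) // metrics ready true

	_, ok := firstOf(a, b, 10) // both channels are empty now
	fmt.Println(ok)            // false
}
```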

One principle I’ve learned over time: don’t reach for channels by default. A mutex or an atomic operation might be more appropriate, creating a simpler, cleaner, and less error-prone design.
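
Two illustrative cases where shared state is simpler without channels: an atomic counter and a mutex-guarded map. The names countRequests and hitCounter are hypothetical, and the typed atomic.Int64 assumes Go 1.19 or later.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// countRequests increments one shared counter from many goroutines.
func countRequests(goroutines, perGoroutine int) int64 {
	var total atomic.Int64
	var wg sync.WaitGroup
	for g := 0; g < goroutines; g++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := 0; i < perGoroutine; i++ {
				total.Add(1)
			}
		}()
	}
	wg.Wait()
	return total.Load()
}

// hitCounter guards a map with a mutex; no extra goroutine is needed.
type hitCounter struct {
	mu   sync.Mutex
	hits map[string]int
}

func (h *hitCounter) inc(path string) {
	h.mu.Lock()
	defer h.mu.Unlock()
	h.hits[path]++
}

func main() {
	fmt.Println(countRequests(8, 1000)) // 8000

	h := &hitCounter{hits: map[string]int{}}
	h.inc("/health")
	h.inc("/health")
	fmt.Println(h.hits["/health"]) // 2
}
```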

Finally, goroutine supervision is critical. You need to track what your goroutines are doing, make sure they shut down cleanly, and prevent them from sitting idle in the background.

To sum up: the patterns I find most effective are worker pools, pipelines, the context package, fan-out/fan-in, and select statements. These help you create reliable, maintainable concurrent Go code.

4: Let’s talk about Go’s runtime performance, which has improved noticeably. Tail latencies are down and the garbage collector is smarter. What profiling techniques do you recommend to help teams actually realize these gains?

Mihalis Tsoukalos: That’s a good question, because sometimes you have issues and you don’t know what’s going on behind the scenes. Go has made real progress on runtime performance. The garbage collector in particular has seen big gains in terms of pause times and CPU usage. The garbage collector does most of its work in background goroutines that run concurrently with your code. The special thing about it is that, at certain points in a collection cycle, everything else must freeze briefly (a stop-the-world pause) so the collector can work from a consistent view of memory.

Tail latencies have also come down, which makes Go a strong option for performance-sensitive systems like APIs, proxies, and backend infrastructure. But those gains don’t happen automatically—you need to profile and measure to benefit from them. Optimization without measurement is guesswork.

Go provides excellent tools for profiling. With the right approach, those tools can lead to real improvements. A practical point: don’t wait until you have a problem to learn about optimization and measurement. Experiment ahead of time so that when a problem arises, you’re ready to use the tools. Also, be very careful when using them on production systems—you might crash them. Don’t run measurements during peak hours unless it’s absolutely necessary.

The first tool I recommend is pprof, which is built in and very powerful. The net/http/pprof package exposes several types of profiles (CPU, heap, goroutines, blocking operations, mutex contention), and you can access them through an HTTP endpoint in your web browser. Then you can visualize them using go tool pprof or one of the newer web interfaces.
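
One way to wire this up, sketched under assumptions: the handlers are registered on a dedicated mux (newDebugMux is a hypothetical name) so they can be served on an admin-only listener, and the status helper uses httptest purely as an in-process smoke test. The listen address in the final comment is also an assumption.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"net/http/pprof"
)

// newDebugMux registers the pprof handlers on their own mux instead of
// relying on http.DefaultServeMux, which net/http/pprof also registers on
// as an import side effect.
func newDebugMux() *http.ServeMux {
	mux := http.NewServeMux()
	mux.HandleFunc("/debug/pprof/", pprof.Index) // also serves heap, goroutine, block, mutex
	mux.HandleFunc("/debug/pprof/cmdline", pprof.Cmdline)
	mux.HandleFunc("/debug/pprof/profile", pprof.Profile) // CPU profile
	mux.HandleFunc("/debug/pprof/symbol", pprof.Symbol)
	mux.HandleFunc("/debug/pprof/trace", pprof.Trace)
	return mux
}

// status issues an in-process request and returns the HTTP status code.
func status(path string) int {
	rec := httptest.NewRecorder()
	newDebugMux().ServeHTTP(rec, httptest.NewRequest(http.MethodGet, path, nil))
	return rec.Code
}

func main() {
	fmt.Println(status("/debug/pprof/"))     // 200
	fmt.Println(status("/debug/pprof/heap")) // 200
	// In a real service, serve it on a private address, for example:
	// http.ListenAndServe("localhost:6060", newDebugMux())
}
```

With the server running, a profile can then be fetched and inspected with go tool pprof pointed at the endpoint URL.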

I usually start with CPU profiles. Run them under realistic load and see where your code is spending time—it’s often not where you expect. Heap profiles are equally important, especially now that the garbage collector is more efficient. If you can cut down unnecessary allocations, the collector has less to clean, and your application runs more smoothly.

One mistake teams make is profiling only with benchmarks or local tests. You need to profile under real workloads in production or in a staging environment that closely mimics production. Many teams now include pprof endpoints in production, behind secure admin-only routes, so they can safely collect data without affecting users.

For deeper insight, I recommend runtime tracing. The runtime/trace package provides a timeline of goroutine scheduling, system calls, garbage collection, and other events. Paired with pprof, it helps explain why a goroutine was delayed or what caused a latency spike. You can collect traces with go test -trace or via code, and then explore them with go tool trace.
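
Collecting a trace via code can look like the sketch below. The helper traceWork is hypothetical and writes into memory to stay self-contained; in practice you would write to a file and open it with go tool trace.

```go
package main

import (
	"bytes"
	"fmt"
	"runtime/trace"
	"sync"
)

// traceWork records an execution trace of fn and returns the trace size in bytes.
// trace.Start returns an error if tracing is already enabled.
func traceWork(fn func()) (int, error) {
	var buf bytes.Buffer
	if err := trace.Start(&buf); err != nil {
		return 0, err
	}
	fn()
	trace.Stop() // flushes all pending trace data into buf
	return buf.Len(), nil
}

func main() {
	n, err := traceWork(func() {
		// A little concurrent work so the trace has scheduling events.
		var wg sync.WaitGroup
		for i := 0; i < 4; i++ {
			wg.Add(1)
			go func() { defer wg.Done() }()
		}
		wg.Wait()
	})
	fmt.Println(n > 0, err) // true <nil>
}
```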

If you’re doing micro-optimizations, the Go benchmarking framework is excellent. Metrics like allocations per operation, bytes per operation, or nanoseconds per operation help you track how small changes affect performance, especially in tight loops or hot paths like serialization or hashing. Even one extra allocation can have a big impact under heavy load, so it’s worth running go test -bench regularly if you’re tuning critical functions.
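
To illustrate the allocations-per-operation metric, the sketch below compares two hypothetical string-joining functions. Benchmarks normally live in a _test.go file and run through go test -bench; testing.Benchmark is used here only so the example runs as a standalone program.

```go
package main

import (
	"fmt"
	"strings"
	"testing"
)

// joinConcat builds the result with repeated concatenation.
func joinConcat(parts []string) string {
	s := ""
	for _, p := range parts {
		s += p
	}
	return s
}

// joinBuilder sizes a strings.Builder once and appends into it.
func joinBuilder(parts []string) string {
	n := 0
	for _, p := range parts {
		n += len(p)
	}
	var b strings.Builder
	b.Grow(n)
	for _, p := range parts {
		b.WriteString(p)
	}
	return b.String()
}

var sink string // keeps the compiler from discarding the results

// allocsPerOp benchmarks join and returns its allocations per operation.
func allocsPerOp(join func([]string) string) int64 {
	parts := make([]string, 64)
	for i := range parts {
		parts[i] = "chunk"
	}
	res := testing.Benchmark(func(b *testing.B) {
		b.ReportAllocs()
		for i := 0; i < b.N; i++ {
			sink = join(parts)
		}
	})
	return res.AllocsPerOp()
}

func main() {
	fmt.Println("concat allocs/op: ", allocsPerOp(joinConcat))
	fmt.Println("builder allocs/op:", allocsPerOp(joinBuilder))
}
```

The exact numbers depend on the Go version and platform, but the builder variant should report far fewer allocations per operation.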

It’s also important to watch for goroutine leaks or contention. Use goroutine and block profiles to track how many goroutines are running and whether they’re getting stuck. If the goroutine count keeps rising, that’s often a sign of a leak or unexpected blocking.

Beyond profiling, observability matters. The best-performing teams invest in continuous metrics and dashboards. Tools like Prometheus, combined with Go’s ability to export metrics, let you track garbage collection pause times, allocation rates, goroutine counts, and more. With alerting, you can catch issues before they impact users or your boss—which is never a good surprise.

A concrete case: I once worked on a high-throughput telemetry pipeline. The team was seeing unusually high CPU usage during peak hours, even though the runtime looked idle. The issue turned out to be repeated use of json.Marshal inside a loop, which was allocating and copying far more data than necessary. Replacing it with a streaming encoder solved the problem and made everything much faster.
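
The interview does not show the actual code, so the following is only a guess at the shape of such a fix, assuming "streaming encoder" means json.Encoder from encoding/json. The event type and function names are hypothetical. A single encoder writes each event straight to the destination instead of building a separate byte slice per json.Marshal call and copying it.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"io"
)

// event is a made-up telemetry record.
type event struct {
	ID   int    `json:"id"`
	Name string `json:"name"`
}

// writeEvents streams events to w as newline-delimited JSON,
// reusing one encoder for the whole batch.
func writeEvents(w io.Writer, events []event) error {
	enc := json.NewEncoder(w)
	for _, e := range events {
		if err := enc.Encode(e); err != nil { // Encode appends a newline after each value
			return err
		}
	}
	return nil
}

// encodeAll is a convenience wrapper that captures the stream in memory.
func encodeAll(events []event) (string, error) {
	var buf bytes.Buffer
	err := writeEvents(&buf, events)
	return buf.String(), err
}

func main() {
	out, err := encodeAll([]event{{ID: 1, Name: "start"}, {ID: 2, Name: "stop"}})
	fmt.Print(out)
	fmt.Println(err)
}
```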

So in short, Go’s runtime has improved, but to realize those gains you must measure continuously, profile under real workloads, and act on what you find.

5: Profile-guided optimization became stable in Go 1.21. Where does PGO make a real difference, and when might it not be worth the effort?

Mihalis Tsoukalos: The stabilization of profile-guided optimization (PGO) in Go 1.21 was a big milestone for performance-focused developers. Go has traditionally emphasized implicit, fast compiler optimizations, but PGO changes that. It gives us a new way to fine-tune performance based on how our code actually runs in production.

In simple terms, PGO lets the compiler make smarter decisions using real-world runtime data—things like which functions are called most often, which branches get taken, and where the hot paths are. With that information, the compiler can reorder functions to improve caching, inline code more intelligently, and reduce indirect calls. The result is lower CPU usage and better latency, especially in high-throughput or tight-loop scenarios.
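
The input to PGO is an ordinary CPU profile. As a rough sketch, the code below captures one with runtime/pprof; hotLoop, profileTo, and profileSize are hypothetical names, and the profile is written to memory only to keep the example self-contained. In a real workflow you would typically take the profile from production, save it as default.pgo in the main package directory, and rebuild; since Go 1.21 the go command picks that file up by default (the -pgo=auto setting).

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"runtime/pprof"
)

// hotLoop stands in for the CPU-bound code you want the compiler to learn about.
func hotLoop(n int) int {
	sum := 0
	for i := 0; i < n; i++ {
		sum += i % 7
	}
	return sum
}

// profileTo records a CPU profile of fn into w.
func profileTo(w io.Writer, fn func()) error {
	if err := pprof.StartCPUProfile(w); err != nil {
		return err
	}
	defer pprof.StopCPUProfile() // returns only after the profile is fully written
	fn()
	return nil
}

// profileSize profiles hotLoop into memory and reports the profile size.
func profileSize() (int, error) {
	var buf bytes.Buffer
	err := profileTo(&buf, func() { hotLoop(50_000_000) })
	return buf.Len(), err
}

func main() {
	n, err := profileSize()
	fmt.Println(n > 0, err) // true <nil>
	// To feed PGO, write to a file instead, e.g. os.Create("default.pgo").
}
```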

So where does PGO shine? It’s great for performance-critical systems with stable workloads—things like low-latency services, backend infrastructure, proxies, or message brokers. In these environments, even small improvements in CPU can translate into real wins. It also makes a difference in hot-path code: tight loops that run millions of times, or CPU-bound routines like encoders, parsers, or math-heavy computations. PGO helps optimize layout and branching in those areas, reducing stalls and improving instruction-cache behavior.

If you’re running large-scale or long-lived services, even small gains add up—a 5% CPU saving across hundreds of instances is significant.

That said, PGO isn’t always worth the effort. For applications with unpredictable or highly variable workloads, the profile you generate today might not reflect tomorrow’s behavior. It’s also not ideal for short-lived command-line tools or scripts. And if your codebase is still changing rapidly, PGO is premature. Finish stabilizing your application first, then consider it.

In general, PGO is a powerful tool, but like any optimization technique, it’s most effective when used deliberately. If you’ve already profiled your application, you know where the bottlenecks are, and you want to squeeze out more performance without rewriting code, then PGO is a great next step. But it won’t solve every problem. My advice is to experiment with it on your own time so you’re ready to use it when it’s truly needed.

6: Memory is always a tricky area. What is your typical workflow for diagnosing memory leaks or reducing high allocation rates in Go systems? Do you have any favorite tools or patterns you like using?

Mihalis Tsoukalos: Although modern computers have plenty of memory, we still need to watch for leaks and excessive allocations. When I’m diagnosing memory issues in Go—whether a potential leak or just high allocation pressure—the first step is to establish a baseline. That means running the service under real or representative load and collecting memory data that reflects actual behavior, not just synthetic benchmarks.

From there, I rely heavily on Go’s built-in tooling, especially pprof. I usually instrument the service with an HTTP endpoint using net/http/pprof, then capture heap profiles at different points, typically one right after startup and another after the service has been running under load. Comparing these snapshots helps answer key questions: Are allocations growing continuously? Which types are taking the most memory? Is the garbage collector doing more work than expected?

I load these profiles into go tool pprof or use the web interface, focusing on views like “in-use space” or “in-use objects.” If I see unexpected memory growth, I look for object types that shouldn’t be long-lived but are still hanging around. I also use the -alloc_space and -alloc_objects views to see where allocations are happening most frequently. That helps distinguish between a true leak and simply too many short-lived allocations.

A common pattern I follow is taking delta comparisons between snapshots. If memory usage looks flat but allocation counts are high, that’s usually a sign of churn, not a leak. Tools like go test -bench=. -benchmem are useful here: they show allocation behavior in tight loops or hot paths and help validate changes quickly.
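
The churn-versus-leak distinction can be demonstrated without pprof by comparing two runtime.MemStats snapshots. This is a simplified stand-in for comparing heap profiles, not the workflow described above; churn, leak, and measure are hypothetical names.

```go
package main

import (
	"fmt"
	"runtime"
)

var (
	scratch  []byte   // overwritten on every iteration: churn
	retained [][]byte // grows on every iteration: a leak
)

func churn() {
	for i := 0; i < 1000; i++ {
		scratch = make([]byte, 1024)
	}
}

func leak() {
	for i := 0; i < 1000; i++ {
		retained = append(retained, make([]byte, 1024))
	}
}

// measure reports how much live heap fn left behind after a GC and how
// many bytes it allocated in total.
func measure(fn func()) (liveDelta int64, allocated uint64) {
	var before, after runtime.MemStats
	runtime.GC()
	runtime.ReadMemStats(&before)
	fn()
	runtime.GC()
	runtime.ReadMemStats(&after)
	return int64(after.HeapAlloc) - int64(before.HeapAlloc),
		after.TotalAlloc - before.TotalAlloc
}

func main() {
	live, total := measure(churn)
	fmt.Printf("churn: live delta %d bytes, allocated %d bytes\n", live, total)
	live, total = measure(leak)
	fmt.Printf("leak:  live delta %d bytes, allocated %d bytes\n", live, total)
}
```

Both functions allocate about the same amount, but only the leak leaves roughly a megabyte of live heap behind after garbage collection.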

When reducing allocations, I start with escape analysis. Running go build -gcflags=-m tells you which variables are escaping to the heap and why. Small changes, like returning a value instead of a pointer, or reusing a buffer, can keep data on the stack and reduce garbage collector pressure. If I see repeated allocations of slices, maps, or temporary structs in performance-sensitive areas, I consider sync.Pool, preallocating, or reusing buffers carefully. Even avoiding repeated string concatenations in loops or unnecessary interface conversions can make a noticeable difference.
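
A short sketch of two of these techniques, buffer reuse through sync.Pool and slice preallocation. The names bufPool, render, and squares are hypothetical.

```go
package main

import (
	"bytes"
	"fmt"
	"sync"
)

// bufPool hands out reusable buffers so hot paths do not allocate a new one per call.
var bufPool = sync.Pool{
	New: func() any { return new(bytes.Buffer) },
}

// render builds a greeting using a pooled buffer.
func render(name string) string {
	buf := bufPool.Get().(*bytes.Buffer)
	buf.Reset()            // a pooled buffer may still hold old data
	defer bufPool.Put(buf) // return it for the next caller
	buf.WriteString("hello, ")
	buf.WriteString(name)
	return buf.String() // String copies, so the result is safe after Put
}

// squares preallocates the result slice to its final capacity,
// avoiding repeated growth inside the loop.
func squares(n int) []int {
	out := make([]int, 0, n)
	for i := 0; i < n; i++ {
		out = append(out, i*i)
	}
	return out
}

func main() {
	fmt.Println(render("gopher")) // hello, gopher
	fmt.Println(squares(4))       // [0 1 4 9]
}
```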

For long-running services, I also recommend taking full memory dumps periodically and tracking object retention over time. That helps catch leaks caused by forgotten references. Continuous monitoring with Prometheus and visualization in Grafana is also valuable—it makes unexpected trends easy to spot.

Ultimately, avoiding memory leaks comes down to habits: profile early, understand your allocation patterns, avoid global state, and monitor in production. It’s not just about saving memory—it’s about running a system that behaves predictably under load and doesn’t wake you up in the middle of the night.

One memorable case involved a team whose Go service gradually climbed in memory usage over several days, even under steady load. Garbage collection seemed fine, but comparing heap profiles revealed that a map of cached Protobuf messages was never shrinking. The problem was a custom cache with no eviction policy—it just kept growing. To make matters worse, the keys were strings derived from user input, so the cardinality was unbounded. The fix was introducing a bounded LRU cache with periodic cleanup. The key insight came from seeing that the live object count of a specific type kept rising across heap snapshots. Without those profiles, it would have been much harder to pinpoint and fix.
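
A bounded LRU cache can be sketched with container/list and a map. This is an illustrative version, not the team's actual fix: the LRU type and its methods are hypothetical, it is not safe for concurrent use, and it omits the periodic cleanup mentioned above.

```go
package main

import (
	"container/list"
	"fmt"
)

type entry struct{ key, value string }

// LRU is a fixed-capacity cache that evicts the least recently used entry.
type LRU struct {
	capacity int
	order    *list.List // front = most recently used
	items    map[string]*list.Element
}

func NewLRU(capacity int) *LRU {
	return &LRU{capacity: capacity, order: list.New(), items: map[string]*list.Element{}}
}

// Get returns the value and marks the key as recently used.
func (c *LRU) Get(key string) (string, bool) {
	el, ok := c.items[key]
	if !ok {
		return "", false
	}
	c.order.MoveToFront(el)
	return el.Value.(*entry).value, true
}

// Put inserts or updates a key, evicting the oldest entry when over capacity.
func (c *LRU) Put(key, value string) {
	if el, ok := c.items[key]; ok {
		el.Value.(*entry).value = value
		c.order.MoveToFront(el)
		return
	}
	c.items[key] = c.order.PushFront(&entry{key, value})
	if c.order.Len() > c.capacity {
		oldest := c.order.Back()
		c.order.Remove(oldest)
		delete(c.items, oldest.Value.(*entry).key)
	}
}

func (c *LRU) Len() int { return c.order.Len() }

func main() {
	c := NewLRU(2)
	c.Put("a", "1")
	c.Put("b", "2")
	c.Get("a")      // "a" is now the most recently used
	c.Put("c", "3") // evicts "b"
	_, ok := c.Get("b")
	fmt.Println(ok, c.Len()) // false 2
}
```

Because the size is capped, unbounded key cardinality from user input can no longer grow the cache without limit.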

7: Generics have been around for a couple of releases now. What patterns have you seen work well, and where do you think developers are overusing or misusing them?

Mihalis Tsoukalos: Now that generics have had time to mature over a few Go releases, we are starting to see clear patterns around where they shine and where they can go off the rails.

One of the most effective use cases has been writing reusable, type-safe data structures and algorithms. Things like generic slices, sets, maps, or utility functions—map, filter, reduce—have become much easier to implement in a way that’s both clean and performant. This has led to better library code, especially in packages dealing with collections, number crunching, or parsing. Libraries that used to rely on the empty interface and type assertions now benefit from compile-time safety with very little extra syntax. That’s a big improvement in terms of both correctness and readability.
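For readers who have not written these helpers yet, here is one possible shape for generic Map, Filter, and Reduce over slices. The names and signatures are a sketch, not a standard library API.

```go
package main

import "fmt"

// Map applies f to every element and returns the results in order.
func Map[T, U any](in []T, f func(T) U) []U {
	out := make([]U, 0, len(in))
	for _, v := range in {
		out = append(out, f(v))
	}
	return out
}

// Filter returns the elements for which keep reports true.
func Filter[T any](in []T, keep func(T) bool) []T {
	var out []T
	for _, v := range in {
		if keep(v) {
			out = append(out, v)
		}
	}
	return out
}

// Reduce folds the slice into a single value, starting from init.
func Reduce[T, A any](in []T, init A, f func(A, T) A) A {
	acc := init
	for _, v := range in {
		acc = f(acc, v)
	}
	return acc
}

// sumOfEvenSquares chains the three helpers on a slice of ints.
func sumOfEvenSquares(nums []int) int {
	evens := Filter(nums, func(n int) bool { return n%2 == 0 })
	squares := Map(evens, func(n int) int { return n * n })
	return Reduce(squares, 0, func(acc, n int) int { return acc + n })
}

func main() {
	fmt.Println(sumOfEvenSquares([]int{1, 2, 3, 4})) // 20
}
```

The compiler checks every element type here, which is the compile-time safety that the older empty-interface versions could not offer.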

Another area where generics work really well is domain-specific helper functions. For example, a pagination utility that works across different types of records, or a retry wrapper that can handle arbitrary operations. These kinds of generics eliminate boilerplate and keep APIs consistent without losing clarity. When used thoughtfully, they make code more declarative and reduce the need for duplicating logic across packages or modules.
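A retry wrapper of the kind mentioned above might look like the following sketch. Retry and flaky are hypothetical names. A production version would usually add backoff and accept a context.Context for cancellation, both omitted here for brevity.

```go
package main

import (
	"errors"
	"fmt"
)

// Retry calls op up to attempts times and returns the first successful result.
func Retry[T any](attempts int, op func() (T, error)) (T, error) {
	var zero T
	if attempts < 1 {
		return zero, errors.New("retry: attempts must be at least 1")
	}
	var lastErr error
	for i := 0; i < attempts; i++ {
		v, err := op()
		if err == nil {
			return v, nil
		}
		lastErr = err
	}
	return zero, fmt.Errorf("retry: gave up after %d attempts: %w", attempts, lastErr)
}

// flaky returns an operation that fails the given number of times, then succeeds.
// It exists only to exercise Retry in this example.
func flaky(failures int) func() (string, error) {
	calls := 0
	return func() (string, error) {
		calls++
		if calls <= failures {
			return "", errors.New("transient failure")
		}
		return "ok", nil
	}
}

func main() {
	v, err := Retry(3, flaky(2))
	fmt.Println(v, err) // ok <nil>

	_, err = Retry(2, flaky(5))
	fmt.Println(err != nil) // true
}
```

The same wrapper works for any result type, so callers do not need a separate retry loop per operation.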

That said, there have also been missteps. A common one is overgeneralization—creating overly abstract, flexible APIs just because the language allows it. Another is wrapping generic types in ways that obscure intent. Instead of simply using a slice of type T, some developers introduce unnecessary abstractions that add layers without real benefit, making the codebase harder to understand.

There’s also a tendency among some developers to import functional programming paradigms wholesale—monads, chaining combinators, deeply nested generic utilities. While elegant in languages designed for them, these patterns often clash with Go’s core philosophy of clarity, simplicity, and explicit flow of control. The result can be clever-looking code that’s hard to read and even harder to debug.

In short, generics are a powerful addition to Go, but like any powerful tool, they need to be used with purpose and restraint. Think before you reach for them, and prefer clear designs. The goal should always be code that is easy to understand and maintain.

8: Go 1.23 adds iterator functions and generic type aliases. How do you see those changing how we write Go, especially in libraries?

Mihalis Tsoukalos: The addition of iterator functions and generic type aliases in Go 1.23 might look like a quiet update, but it’s actually a significant step forward in writing more expressive, reusable, and composable code—particularly in libraries. These features build on the foundation of generics and help capture common programming patterns more naturally, while still keeping Go’s strengths of simplicity and clarity.

Take iterator functions. Go has always relied on for loops and range for iteration, and that worked well. But now, with iterator functions, we can encapsulate iteration logic as values—functions that yield elements one at a time. That might sound like a small shift, but it opens up powerful patterns like lazy evaluation, functional-style pipelines, and composable data flows. You’re no longer stuck rewriting the same loop boilerplate; you can abstract iteration into helpers that are both type-safe and ergonomic.

Then there are generic type aliases, which reduce friction when using generic types across packages. Before, if you wanted to tailor a generic type like map[K]V or Option[T] to your domain, you often had to rewrap or reimplement it. That made things verbose and diluted the usefulness of generic libraries. Now, with type aliases that support generics, you can define concise, strongly typed shortcuts for common patterns. This improves readability and makes code easier to work with, without introducing runtime overhead.

I think these features will lead to more expressive APIs and more composable, domain-agnostic utility packages. We’ll likely see libraries offering richer iterator utilities—things like filter, map, and reduce—implemented in a way that feels native to Go.

That said, the real challenge for library authors will be balance—using these tools to enhance code, not overcomplicate it. If done well, these features could significantly modernize the Go ecosystem, especially in areas like data processing and systems-level programming, where reusable containers, iterators, and higher-order utilities really shine.

9: Fuzzing is built into Go now. How have you seen teams make fuzz testing practical, and what are some tips to get value from fuzzing beyond just turning it on?

Mihalis Tsoukalos: Fuzz testing is a powerful technique for uncovering edge cases, subtle bugs, and even security issues—things that traditional unit tests often miss. Since Go 1.18 added fuzzing support directly into the go test tool, we’ve seen some teams begin experimenting with it. Again, it’s important to experiment first.

But as you said, just enabling fuzzing isn’t enough. To really benefit, teams need a focused, deliberate approach. The teams that get the most out of fuzz testing usually start by targeting critical code paths—places where the software processes complex or untrusted inputs. Think parsers, codecs, or deserialization logic. These are prime candidates because they’re hard to reason about and easy to break with unexpected input. And one important rule here: never trust user input. Writing fuzz tests for these areas helps surface bugs that could otherwise go unnoticed.

It’s also important to seed the fuzzer well, instead of letting it start with purely random inputs. Give it examples representative of real data—this helps the fuzzing engine explore the space more intelligently and find meaningful values faster.
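Here is a minimal sketch of a seeded fuzz target. ParsePort is a hypothetical function written for this example, and the seeds are guesses at representative inputs. In a real project the fuzz function lives in a _test.go file and runs with go test -fuzz=FuzzParsePort.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
	"testing"
)

// ParsePort extracts the port from a "host:port" string.
func ParsePort(addr string) (int, error) {
	i := strings.LastIndexByte(addr, ':')
	if i < 0 {
		return 0, fmt.Errorf("missing port in %q", addr)
	}
	p, err := strconv.Atoi(addr[i+1:])
	if err != nil || p < 1 || p > 65535 {
		return 0, fmt.Errorf("invalid port in %q", addr)
	}
	return p, nil
}

// FuzzParsePort seeds the corpus with realistic inputs, then checks one invariant:
// a successful parse always yields a port in the valid range, and no input panics.
func FuzzParsePort(f *testing.F) {
	for _, seed := range []string{"localhost:8080", "[::1]:443", "example.com", ":0"} {
		f.Add(seed)
	}
	f.Fuzz(func(t *testing.T, addr string) {
		port, err := ParsePort(addr)
		if err == nil && (port < 1 || port > 65535) {
			t.Errorf("ParsePort(%q) = %d, out of range", addr, port)
		}
	})
}

func main() {
	fmt.Println(ParsePort("localhost:8080"))
	fmt.Println(ParsePort("example.com"))
}
```

The fuzz body is deterministic and touches no external systems, so any failing input the engine finds can be replayed as an ordinary test case.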

Integration is key. The most effective teams make fuzz testing part of continuous integration. They run short fuzzing sessions locally during development for quick feedback, and then schedule longer runs overnight or during off-hours on CI servers. That way, fuzzing becomes a continuous part of testing, not just something you do once in a while.

Beyond just finding crashes, fuzz testing is excellent for hardening error handling. It ensures your code doesn’t panic, leak resources, or hang when it gets bad input. And when you combine fuzz tests with other tools like the race detector, you can catch data races that wouldn’t show up otherwise. That combination improves reliability across the board.

One more tip: keep your fuzz functions deterministic and free of side effects. Avoid calling external systems or relying on randomness inside the test itself. Deterministic behavior makes failures easier to reproduce and debug.

In short, fuzz testing is most valuable when used deliberately—targeting the right parts of your code, seeding it well, integrating it into workflows, and combining it with other tools. Done right, it’s not just about uncovering obscure crashes—it’s about building more robust, resilient Go systems.

10: Observability is another area you cover in your book. What do you recommend for monitoring and tracing Go systems effectively, especially under high concurrency?

Mihalis Tsoukalos: Observability is absolutely essential when running Go systems at scale, especially in high-concurrency environments. It gives you the visibility to understand how your application behaves in production, diagnose issues quickly, and keep performance and reliability where they need to be.

For monitoring, the first step is always metrics—both system-level and application-specific. We mentioned Prometheus earlier; it’s the go-to choice in the Go ecosystem, largely because of its flexibility and strong community support. The key is to instrument your code with meaningful metrics. Put simply: if you don’t collect the right metrics, you won’t solve your issues.

So collect things like request rates, error counts, latency percentiles, goroutine counts, and garbage collection pauses. These tell you how the system is behaving and where things might degrade under load. You also get a lot of value from the Go runtime metrics exposed through the runtime/metrics package. These provide insight into memory usage, garbage collection activity, and goroutine scheduling—crucial when dealing with thousands of concurrent operations.
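As a small illustration of the runtime/metrics package, the sketch below reads a few runtime values directly. The readUint64 helper is made up for this example. In practice these values would usually be exported through a metrics library rather than printed.

```go
package main

import (
	"fmt"
	"runtime/metrics"
)

// readUint64 returns the value of a uint64-valued runtime metric.
// It reports false when the metric is unknown or has a different kind.
func readUint64(name string) (uint64, bool) {
	sample := []metrics.Sample{{Name: name}}
	metrics.Read(sample)
	if sample[0].Value.Kind() != metrics.KindUint64 {
		return 0, false
	}
	return sample[0].Value.Uint64(), true
}

func main() {
	names := []string{
		"/sched/goroutines:goroutines",       // live goroutines
		"/gc/heap/objects:objects",           // live heap objects
		"/memory/classes/heap/objects:bytes", // bytes occupied by heap objects
	}
	for _, name := range names {
		if v, ok := readUint64(name); ok {
			fmt.Printf("%s = %d\n", name, v)
		}
	}
}
```

Calling metrics.All() lists every metric the running Go version supports, which is a safer starting point than hard-coding names across versions.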

Metrics give you an aggregated view, but tracing lets you zoom in. With distributed tracing—using something like OpenTelemetry—you can follow individual requests as they move through different parts of your system. That’s where you see latency accumulation, service interactions, or contention points. Under high concurrency, tracing is especially useful for catching queuing delays, lock contention, or slow dependencies—issues that metrics alone might mask.

One of the most important practices here is context propagation. We’ve already discussed the context.Context type. This is your mechanism for passing timeouts, cancellations, and tracing data across API boundaries and goroutines. If you don’t propagate context properly, you’ll miss spans or lose correlation in your traces. End-to-end consistency in instrumentation is critical, especially for workloads where a request might fan out into multiple goroutines.
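The fan-out case can be sketched as follows. A plain string trace ID stands in for a real span context, which a tracing library such as OpenTelemetry would store in the context instead. All names here are hypothetical.

```go
package main

import (
	"context"
	"fmt"
	"sync"
	"time"
)

// traceKey is an unexported key type, so no other package can collide with it.
type traceKey struct{}

func withTraceID(ctx context.Context, id string) context.Context {
	return context.WithValue(ctx, traceKey{}, id)
}

func traceID(ctx context.Context) string {
	id, _ := ctx.Value(traceKey{}).(string)
	return id
}

// fanOut starts n workers that all inherit ctx: its deadline, cancellation, and trace ID.
func fanOut(ctx context.Context, n int) []string {
	results := make([]string, n)
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			select {
			case <-ctx.Done():
				results[i] = "cancelled"
			case <-time.After(time.Millisecond): // stand-in for real work
				results[i] = fmt.Sprintf("trace=%s worker=%d", traceID(ctx), i)
			}
		}(i)
	}
	wg.Wait()
	return results
}

// handle plays the role of a request handler that creates the context once
// and passes it down to every goroutine it spawns.
func handle(id string, n int) []string {
	ctx, cancel := context.WithTimeout(withTraceID(context.Background(), id), time.Second)
	defer cancel()
	return fanOut(ctx, n)
}

func main() {
	for _, line := range handle("abc123", 3) {
		fmt.Println(line)
	}
}
```

If a worker were started with context.Background() instead of ctx, it would lose both the trace ID and the deadline, which is exactly the broken correlation described above.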

Of course, high concurrency also means generating a lot of telemetry data, so you need to be smart about sampling and rate limiting. Adaptive sampling works well—prioritizing traces based on latency, errors, or unusual behavior. This way, you capture the most informative data without overwhelming your observability systems or introducing overhead.

And observability isn’t just about collecting data—it’s about acting on it. Instead of relying only on fixed thresholds, use anomaly detection and pattern-based alerts. Dashboards that track Go-specific behaviors—like spikes in goroutines or increased garbage collector pauses—make it easier to spot problems early and understand what’s really happening.

In short, effective observability in high-concurrency Go systems means combining detailed metrics, distributed tracing with proper context propagation, smart sampling, and ongoing analysis. With those in place, you’re in a much better position to detect issues early, debug complex behavior, and keep systems running smoothly at scale.

11: The standard library keeps expanding with utility packages and smarter routing in net/http. Do these reduce the need for external frameworks? What do you feel is still missing?

Mihalis Tsoukalos: The standard library of Go has always been one of its strongest features—clean, composable, well-tested, and rich. Over time, the maintainers have added to it in a very deliberate way. Things like smarter routing in net/http and new utility packages like slices, maps, and cmp have made it easier to build web services, command-line tools, and system-level software directly on top of the standard library.

Yes, these improvements are definitely reducing the need for external frameworks, especially for small to mid-sized applications. One of the best examples is net/http, which has steadily improved: better routing logic, smoother integration with middleware patterns, improved support for HTTP/2 and structured headers, and overall better ergonomics. For teams that prioritize simplicity, performance, and long-term maintainability, that’s a big win.

The new utility packages also help. Tasks like filtering, slicing, comparing maps, or writing type-safe logic can now be done concisely and idiomatically, reducing boilerplate and external dependencies.
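A brief sketch of the slices and cmp packages working together follows. The Service type and its sample data are made up. cmp.Or needs Go 1.22 or later, and the rest works from Go 1.21.

```go
package main

import (
	"cmp"
	"fmt"
	"slices"
)

type Service struct {
	Name    string
	Latency int // milliseconds
}

// sampleServices returns fixed example data.
func sampleServices() []Service {
	return []Service{{"api", 20}, {"db", 45}, {"cache", 20}}
}

// slowestFirst returns a sorted copy: highest latency first, ties broken by name.
func slowestFirst(in []Service) []Service {
	out := slices.Clone(in)
	slices.SortFunc(out, func(a, b Service) int {
		return cmp.Or(
			cmp.Compare(b.Latency, a.Latency), // descending latency
			cmp.Compare(a.Name, b.Name),       // ascending name
		)
	})
	return out
}

// hasSlow reports whether any service exceeds the given latency limit.
func hasSlow(in []Service, limit int) bool {
	return slices.ContainsFunc(in, func(s Service) bool { return s.Latency > limit })
}

func main() {
	for _, s := range slowestFirst(sampleServices()) {
		fmt.Println(s.Name, s.Latency)
	}
	fmt.Println(hasSlow(sampleServices(), 40)) // true
}
```

Before these packages existed, the same sort needed sort.Slice with index-based comparisons or a hand-written sort.Interface implementation.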

That said, the standard library doesn’t replace third-party libraries entirely, especially when working on more complex systems or domain-specific problems. For example, I’ve written HTTP services in Go using Gorilla rather than plain net/http, and for building command-line tools I’ve used Cobra and Viper. The Docker CLI is written in Go using Cobra, and the Hugo static site generator also relies on Cobra and Viper. These are powerful tools for real-world utilities.

So, while the standard library is strong and keeps evolving, there are still gaps—particularly in areas like higher-level CLI frameworks or more sophisticated HTTP tooling. I expect the standard library will continue to improve, but tools like Cobra and Viper still fill important roles.

12: Even experienced Go developers make mistakes. What are some of the less obvious ones you still see when people work on performance-sensitive or concurrent systems?

Mihalis Tsoukalos: Everyone makes mistakes—that’s how we learn. The important part is being careful not to make them in production systems.

Even experienced Go developers can run into subtle issues when working on performance-sensitive or concurrent systems. A lot of this comes from Go’s simplicity. Goroutines are lightweight, channels are first-class, and the standard library gives you powerful tools. But that simplicity can hide complexity, and mistakes often come from relying too much on defaults or making assumptions about how the runtime behaves.

One common pitfall is spinning up goroutines without proper cancellation or lifecycle management. We’ve discussed before that using context.Context gives you control—allowing you to cancel goroutines properly and avoid memory leaks.
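A minimal sketch of a goroutine with an explicit lifecycle follows. The worker stops when its context is cancelled and signals that it has exited. The function names and tick interval are arbitrary choices for this example.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// startTicker launches a worker that counts ticks until ctx is cancelled.
// The returned channel delivers the final count once the goroutine has exited.
func startTicker(ctx context.Context, every time.Duration) <-chan int {
	done := make(chan int, 1)
	go func() {
		t := time.NewTicker(every)
		defer t.Stop() // release the ticker when the worker exits
		n := 0
		for {
			select {
			case <-ctx.Done():
				done <- n
				return
			case <-t.C:
				n++
			}
		}
	}()
	return done
}

// runFor runs the worker for the given number of milliseconds and waits for it to stop.
func runFor(ms int) int {
	ctx, cancel := context.WithTimeout(context.Background(), time.Duration(ms)*time.Millisecond)
	defer cancel()
	return <-startTicker(ctx, time.Millisecond)
}

func main() {
	fmt.Println("ticks before shutdown:", runFor(50))
}
```

Without the ctx.Done case, the goroutine and its ticker would outlive the caller, which is the kind of slow leak that only shows up after days in production.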

Another mistake is assuming channels are always the right concurrency primitive. When I first learned about channels, I thought they could solve every concurrency problem. But that’s not true. In some cases, a mutex or an atomic variable is more efficient and easier to work with. Think carefully before using channels.
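To illustrate the point, the sketch below counts events from many goroutines twice: once with an atomic counter and once with a mutex guarding a map. Neither needs a channel. The names are hypothetical.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// countAtomic increments a single shared counter from n goroutines.
func countAtomic(n int) int64 {
	var hits atomic.Int64
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			hits.Add(1)
		}()
	}
	wg.Wait()
	return hits.Load()
}

// Stats guards a map with a mutex, because an atomic cannot protect compound state.
type Stats struct {
	mu     sync.Mutex
	byPath map[string]int
}

func (s *Stats) Inc(path string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.byPath[path]++
}

// countMutex records n hits for one path from n goroutines.
func countMutex(n int) int {
	s := &Stats{byPath: make(map[string]int)}
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			s.Inc("/home")
		}()
	}
	wg.Wait()
	return s.byPath["/home"]
}

func main() {
	fmt.Println(countAtomic(100), countMutex(100)) // 100 100
}
```

Channels remain a good fit for handing work or ownership between goroutines. For shared counters and small pieces of state, the primitives above are usually simpler.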

Memory allocation is another big one. Developers often overlook how temporary allocations, like slices created in a tight loop or values boxed into interfaces, can lead to heavy garbage collection overhead, which gets worse under high concurrency. Tools like pprof or go test -bench=. -benchmem help you spot these patterns, but ideally, you should design with memory efficiency in mind from the start.

Another mistake is making false assumptions about how the scheduler works. Developers sometimes expect goroutines to be preempted fairly, but in CPU-bound loops without I/O or channel operations, goroutines might not yield control. This can lead to starvation or uneven workload distribution. Newer versions of Go have improved scheduling and preemption, but in rare cases you still need to explicitly yield with runtime.Gosched to let other goroutines run.

So overall, the issues I see are not usually about syntax—they’re about architecture. They come from assumptions about how Go handles concurrency and performance under the hood. The way to avoid them is by profiling continuously, testing under realistic loads, and building a solid mental model of how the Go runtime behaves at scale. In other words, learn the internals—don’t just assume.

13: You’ve worked very close to the metal for many years now. How would you compare Go and Rust for systems programming, especially in terms of performance, safety, and maintainability?

Mihalis Tsoukalos: This is a question I get often. Go and Rust take very different approaches to systems programming, and choosing between them depends on the specific priorities of your project.

Rust gives you fine-grained control over memory and concurrency with zero-cost abstractions that can deliver exceptional performance. Its ownership model and borrow checker eliminate entire classes of bugs at compile time—things like data races or use-after-free errors. That makes Rust a great choice for low-level systems where correctness and reliability are absolutely critical—think operating system components, device drivers, or performance-sensitive networking.

But that level of control comes with a steep learning curve. Rust’s mental model—ownership, lifetimes, trait bounds—can slow teams down, especially if they’re new to the language. Refactoring or prototyping requires great care to satisfy the compiler. Rust’s tooling (Cargo, Clippy, Rust Analyzer) is excellent, but the language demands precision. That pays off in safety and performance, but it can be a barrier in fast-moving or exploratory environments.

Go, by contrast, is all about simplicity and development speed. Its concurrency model with goroutines and channels is approachable and powerful. The garbage collector handles memory management, so you don’t need to think about it most of the time. Go may not match Rust in raw performance for compute-heavy workloads, but its performance is consistent and more than good enough for most system-level use cases.

That predictability, combined with readability and minimalism, makes Go practical for building high-throughput services, container tools, infrastructure automation, and other backend-heavy systems. It’s also easier to onboard new developers, and because Go code tends to look the same across teams, long-term maintainability is a real strength.

On safety, Go doesn’t give you compile-time guarantees like Rust. It won’t catch data races at compile time, but it does offer good tools: the race detector, a solid testing framework, and a culture that values clarity and explicitness. Go avoids complexity by design (no macro-heavy DSLs, no surprising inference), so the code stays understandable even as systems grow.

To sum up: Rust is the right tool when performance and safety are top priorities and you’re ready to invest in upfront complexity. Go shines when development speed, operational simplicity, and long-term maintainability matter more.

As an example, we once evaluated Rust for a packet inspection engine but chose Go due to faster development time and easier team onboarding.

In the past few months, I’ve also had the chance to explore Zig. It sits closer to Rust in terms of low-level control, but it’s much easier to learn. Zig has no garbage collector—you manage memory manually—but it’s far simpler than Rust. It may be a sweet spot between Go and Rust when you want to go lower without Rust’s complexity.

14: Looking ahead, what are you most excited about in Go’s evolution over the next few releases? Where do you see the ecosystem heading?

Mihalis Tsoukalos: What excites me most is how Go continues to evolve while staying true to its roots—pragmatic, simple, but increasingly powerful.

One area I’m watching closely is the ongoing evolution of generics. Since type parameters were introduced in Go 1.18, each release has built on that foundation, most recently with features like generic type aliases and iterator functions in Go 1.23. These aren’t flashy changes, but they’re meaningful. They enable more expressive and reusable code across the ecosystem—richer data structures, functional-style APIs, and cleaner abstractions in libraries. I look forward to seeing how the standard library and open-source projects embrace these tools to offer more composable, idiomatic patterns without losing Go’s clarity.

Performance tuning and runtime observability are also maturing quickly. Built-in fuzzing, profile-guided optimizations, and expanded runtime metrics are pushing Go beyond being just easy to use—it’s becoming easy to optimize too. For teams building high-performance systems, this is a big deal. I think profiling and performance tuning will become a routine part of development workflows, just like writing tests.

Concurrency is another area evolving. Go has always had a clean concurrency model, but with Go used increasingly in multicore, high-load environments—APIs, networking layers, real-time systems—there’s more attention on scheduler improvements, memory footprint reduction, and smarter resource usage. The recent fix to the goroutine loop variable capture bug is a good example: a small change, but it eliminates a long-standing issue and makes concurrent programming safer without adding complexity.

Beyond the language, the ecosystem is maturing fast. We’re seeing better libraries, stronger tooling for testing, static analysis, and cross-compilation, and an overall improved developer experience. Projects like TinyGo, Go Cloud, and Go’s growing presence in WebAssembly and embedded environments point to a future where Go isn’t just a server-side language—it’s part of a broader portable systems toolkit.

At the same time, community efforts around formal APIs, versioning best practices, and module proxy infrastructure show that Go is becoming more production-hardened and resilient.

So in short, I’m excited that Go is getting more powerful without becoming more complicated. That’s rare in programming languages. Go is investing in performance, safety, and tooling in a way that feels very Go-like: minimal, orthogonal, and deliberate. The future looks bright because Go isn’t chasing trends—it’s solving real problems with clarity and focus. I think we’ll see it used in even more places—cloud, systems, edge, maybe even mobile—while continuing to be a language teams can rely on for the long haul.

To explore the ideas discussed in this conversation—including concurrency design patterns, profiling techniques, and Go’s evolving support for generics and fuzzing—check out Mastering Go, Fourth Edition by Mihalis Tsoukalos, available from Packt. This 740-page comprehensive guide dives deep into advanced Go concepts such as RESTful servers, memory management, the garbage collector, TCP/IP, and observability.

Fully updated with coverage of Go generics, fuzz testing, Docker integration, and performance optimization, the book combines detailed explanations with real-world exercises. Readers build high-performance servers, develop robust command-line utilities, work with JSON and databases, and refine their understanding of Go’s internals. Each chapter is designed to strengthen both conceptual mastery and hands-on practice, from error handling and data types to concurrency, profiling, and advanced testing.

Whether you’re building network systems, optimizing cloud-native applications, or simply aiming to deepen your Go expertise, Mastering Go provides a practical foundation for writing professional, production-grade software.

Designing for Decades: A Conversation with Alexander Kushnir on Longevity, Maintainability, and Embedded Systems at Scale

Divya Anne Selvaraj — Tue, 12 Aug 2025 11:22:40 GMT

In safety-critical domains, code longevity isn’t a nice-to-have—it’s a baseline constraint. Software must coexist with hardware for ten years or more, while withstanding evolving standards, team turnover, and limited upgrade paths. In this Deep Engineering Q&A, we ask industry veteran Alexander Kushnir about the realities of building and maintaining embedded systems that endure. We explore long-term technical debt, the discipline of software rejuvenation, and why modern C++ idioms are reshaping how engineers think about embedded maintainability.

Alexander Kushnir is a principal software engineer at Johnson & Johnson MedTech, specializing in electrophysiology systems. With about 20 years of experience across medical devices, industrial controllers, and networked embedded platforms, he has worked on everything from motion control firmware and network switches to VoIP and medical device software. His core expertise lies in embedded Linux, modern C++, cross-platform development, and HW/SW integration. He has also built and led a two-day workshop on CMake.

1: How do you approach the challenge of managing architectural technical debt in systems with 10+ year hardware lifecycles, especially in regulated environments where major refactoring or redesign is costly and risky?

Alexander Kushnir:

Technical debt is actually a real problem. However, we can follow several strategies to mitigate the issue:

Build modular software: This strategy pays off again and again. It helps us to isolate a specific functionality, which makes the task of “replacing the wheel in a moving car” easier.
“Divide and conquer”: Separate your application logic from the hardware-dependent logic. That lets you run the hardware-independent logic off the target, for instance in a simulator or against software mocks that simulate hardware behavior.
Test, test, test: If you follow the previous advice, you should be able to test the logic on your development PC, not just on your target. Why is that good? You can write and run your unit tests with much shorter cycles (think of compiling, loading, and debugging, all on your PC instead of the device).
Use industry-standard and up-to-date tools: Even though it is not a hard requirement, tools keep evolving, and if you fall too far behind, then when you eventually need to investigate an issue in the field, you may find yourself forced to use newer tools you’ve never worked with—leaving you at a disadvantage.
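The separation of application logic from hardware can be sketched as an interface plus a software fake. The example is written in Go to stay consistent with the other examples in this collection; in C++ the same shape would be an abstract base class. TempSensor, FakeSensor, and Overheated are hypothetical names.

```go
package main

import "fmt"

// TempSensor is the hardware abstraction: application logic sees only this interface.
type TempSensor interface {
	ReadCelsius() (float64, error)
}

// FakeSensor replays scripted readings, so the logic can be tested on a development PC.
type FakeSensor struct {
	readings []float64
	next     int
}

func (f *FakeSensor) ReadCelsius() (float64, error) {
	if f.next >= len(f.readings) {
		return 0, fmt.Errorf("fake sensor: no more readings")
	}
	v := f.readings[f.next]
	f.next++
	return v, nil
}

// Overheated is pure application logic: it averages a number of samples and
// compares the result to a limit, with no knowledge of the underlying hardware.
func Overheated(s TempSensor, limit float64, samples int) (bool, error) {
	if samples < 1 {
		return false, fmt.Errorf("samples must be at least 1")
	}
	sum := 0.0
	for i := 0; i < samples; i++ {
		v, err := s.ReadCelsius()
		if err != nil {
			return false, fmt.Errorf("reading %d: %w", i, err)
		}
		sum += v
	}
	return sum/float64(samples) > limit, nil
}

func main() {
	fake := &FakeSensor{readings: []float64{70, 80, 90}}
	hot, err := Overheated(fake, 75, 3)
	fmt.Println(hot, err) // true <nil>
}
```

On the target, a driver-backed type would implement the same interface, and the logic in Overheated would stay unchanged.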

2: What strategies do you use to mitigate hardware obsolescence in long-lived systems?

Alexander Kushnir:

Of course. It is not exactly my responsibility, but I am in the loop. When designing a hardware platform, the engineer must ensure that the chosen components have long-term support. Having said that, I prefer to use an off-the-shelf System-on-Module (SOM) integrated on a custom board, rather than developing a board around the same CPU (or FPGA) and having to address the most basic interfaces, such as memory or flash storage, during board bring-up. This reduces the complexity of board bring-up and makes it easier to handle hardware obsolescence, because the SOM vendor typically manages low-level design, interface validation, and long-term component sourcing.

3: How do you reconcile the need for regular updates (e.g. for security patches or feature improvements) with the need to minimize disruption and regulatory overhead?

Alexander Kushnir:

Every change needs to be justified.

One of the projects I am most proud of was adding a firmware update capability to a device my team was developing.

However, the regulatory burden remains — any update that could affect safety or compliance still requires formal review and, if necessary, re-certification. In practice, we minimize disruption by:

Separating safety-critical functions into a stable, validated firmware baseline that is rarely touched.
Isolating updatable modules (non-critical logic, UI features, analytics, etc.) so they can evolve without impacting certified components.
Using risk-based change management to decide when an update is worth the cost of triggering the regulatory process — for example, prioritizing security patches and critical bug fixes, while bundling minor enhancements into larger, less frequent releases.

In this way, the need to keep embedded software up to date becomes operationally similar to maintaining conventional PC or cloud-based software, but with the extra discipline required for regulated environments.

4: What architectural patterns help maintain software flexibility in these conditions? For instance, have you used hardware abstraction layers, multi-process architectures, or IPC frameworks to decouple software from specific hardware so you can update or add features without a full redesign? How effective have these methods been in extending the usable life of older platforms in your experience?

Alexander Kushnir:

Abstract all you can. Whether one takes the OOP approach (C++, my love) or a procedural one, abstraction and modularity must be applied. A Hardware Abstraction Layer (HAL) is an excellent example of abstraction, as the application logic is not aware of the hardware. Linux, for example, takes abstraction to the extreme: everything is a file, whether it is a network connection, a hardware device, or a real file, and the user simply reads from and writes to it.

Multi-process architecture makes sense when the software has many functions and a malfunction in one of them must not affect the others. For instance, I once worked on an infrastructure that included a terminal (CLI), a database engine, and several more features. If the DB engine crashed, the terminal would continue running unaffected thanks to the isolation between processes.

Another, trickier use of multi-process architecture is when a programmer needs to use a GPL-licensed library in a proprietary environment and is not interested in exposing the code. In such a case they can create a process that links with the GPL-licensed library and communicates with the main software through a well-defined interface such as a pipe, socket, or shared memory.

I will repeat myself - abstract all you can. However, you must pay attention to the cost of these abstractions. For example, if you use runtime polymorphism, you’ll need to profile your virtual dispatches to verify that they create no bottleneck in your critical path.

5: How do you decide what to keep backward compatible versus when to break from legacy constraints? Are there lessons from enduring platforms (for example, the VMEbus standard stayed relevant for 40+ years by emphasizing modularity and backward compatibility) that you apply to provide a clear migration path for long-term customers?

Alexander Kushnir:

Well, that’s a tough question. If the device interfaces with the outer world, changing that interface will always be the last priority. However, if changes are inevitable, they can be mitigated. For example, if you think ahead when designing the protocol, you can add versioning so that new features or changes do not affect older generations of devices. In some cases, you can run multiple versions in parallel or provide adapters to bridge old and new systems, giving customers a clear migration path. This approach is similar to what made platforms like VMEbus last for decades—keep the external contracts stable, design for modularity, and plan for evolution without forcing everyone to upgrade at once.
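Protocol versioning of this kind can be sketched as a decoder that dispatches on a leading version byte. The example is in Go for consistency with the other examples in this collection. The frame layout, the Reading type, and the meaning of each version are assumptions made up for illustration.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// Reading is the decoded form shared by all protocol versions.
type Reading struct {
	Value uint16
	Unit  string
}

// Decode parses a frame whose first byte is the protocol version (assumed layout).
//
//	v1: version, 2-byte big-endian value          (unit is implicitly "C")
//	v2: version, 2-byte big-endian value, 1 unit byte
func Decode(frame []byte) (Reading, error) {
	if len(frame) < 1 {
		return Reading{}, fmt.Errorf("empty frame")
	}
	switch frame[0] {
	case 1:
		if len(frame) != 3 {
			return Reading{}, fmt.Errorf("v1 frame: want 3 bytes, got %d", len(frame))
		}
		return Reading{Value: binary.BigEndian.Uint16(frame[1:3]), Unit: "C"}, nil
	case 2:
		if len(frame) != 4 {
			return Reading{}, fmt.Errorf("v2 frame: want 4 bytes, got %d", len(frame))
		}
		return Reading{Value: binary.BigEndian.Uint16(frame[1:3]), Unit: string(frame[3:4])}, nil
	default:
		return Reading{}, fmt.Errorf("unsupported protocol version %d", frame[0])
	}
}

func main() {
	fmt.Println(Decode([]byte{1, 0, 25}))      // an older device, v1
	fmt.Println(Decode([]byte{2, 0, 77, 'F'})) // a newer device, v2
	fmt.Println(Decode([]byte{9}))             // unknown version is rejected
}
```

Older devices keep sending v1 frames and continue to work, while newer ones can use the extra field, so nobody is forced to upgrade at once.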

6: In a system meant to last a decade or more, how do you design for maintainability to slow down software aging? Can you share practices you use to avoid “bit rot” that ensure the codebase remains clean and adaptable to new requirements over time?

Alexander Kushnir:

All principles mentioned in my answer to the first question apply here. You can’t avoid software aging, as the ecosystem moves quickly. However, if your system is modular enough, the changes can be rolled out gradually, for instance, refactoring module by module, after testing each one thoroughly.

Additionally, CI tests are a must. I would even say that every pull request should be gated, i.e. it should be merged only if it passes all the tests. Many developers don’t like writing tests, but the tests protect them and give them the confidence to make major changes without breaking things.

7: Have you observed issues like memory leaks, data corruption, or performance degradation creeping in over long uptimes in embedded systems? If so, what proactive fault-tolerance techniques do you recommend to address this?

Alexander Kushnir:

I don’t believe in regular restarts or “scheduled maintenance” where the only action is a reboot. If there’s a problem like a memory leak, it should be fixed—not hidden—especially on a resource-tight device.

Memory leaks are possible, of course, but they can be avoided. In modern C++, for example, using smart pointers eliminates most manual memory management errors. During development, I also recommend dynamic memory analysis tools such as Valgrind, which is still underrated in pre-release testing. Combined with thorough code reviews and targeted stress tests, these measures catch leaks and other resource issues before deployment, reducing the need for reactive “rejuvenation” in the field.

8: What fault-tolerance strategies do you build in to ensure long-term reliability? Can you share how you determine the right level of redundancy or self-diagnostic capability for a design that needs to last a decade?

Alexander Kushnir:

All the systems I’ve built have interacted with a human at some point—whether an operator, a technician, or an end user. In such cases, the most practical solution is a periodic health check, or Built-In Test (BIT), that monitors critical components and manages system state when a fault is detected. Typically, this means indicating the issue to the user—via an LED, buzzer, or display—so corrective action can be taken.

The specifics depend on the criticality of the system. For non-safety-critical designs, the goal is early detection and clear reporting so the failure can be fixed before it escalates. For higher-reliability requirements, BIT can be combined with fault isolation, allowing unaffected subsystems to keep running, or with limited redundancy (e.g., a backup sensor or communication path) to maintain partial functionality. The “right” level of redundancy or self-diagnostics is always a trade-off between cost, power, size, and the consequences of downtime—but even in minimal designs, proactive monitoring and clear fault signaling are essential for long-term reliability.
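As a rough sketch of the periodic health check described above, the following shows one way a BIT pass could be structured. All names here (`BuiltInTest`, `Check`, `Severity`) are assumptions made for illustration; the split between "critical" and "non-critical" checks is one possible way to model fault isolation, not the design of any specific device.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <utility>
#include <vector>

enum class Severity { Ok, Degraded, Fault };

// One monitored component. The probe returns true when the component is healthy.
struct Check {
    std::string name;
    bool critical;  // a failed critical check means the whole system is in fault
    std::function<bool()> probe;
};

struct BitReport {
    Severity severity = Severity::Ok;
    std::vector<std::string> failed;  // names to show on an LED, buzzer, or display
};

class BuiltInTest {
public:
    void add(std::string name, bool critical, std::function<bool()> probe) {
        assert(probe != nullptr);
        checks_.push_back(Check{std::move(name), critical, std::move(probe)});
    }

    // One pass over all checks; intended to be called periodically by the caller.
    BitReport run_once() const {
        BitReport report;
        for (const auto& check : checks_) {
            if (check.probe()) {
                continue;
            }
            report.failed.push_back(check.name);
            if (check.critical) {
                report.severity = Severity::Fault;
            } else if (report.severity == Severity::Ok) {
                // Non-critical failure: unaffected subsystems may keep running.
                report.severity = Severity::Degraded;
            }
        }
        return report;
    }

private:
    std::vector<Check> checks_;
};
```

The caller decides what each severity means in practice: for example, `Degraded` might light a warning indicator while `Fault` moves the device to a safe state. How often to run the pass and which checks count as critical is the cost, power, and downtime trade-off mentioned above.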

9: How do you ensure that devices you design today can be kept secure 10+ years down the line?

Alexander Kushnir:

As I mentioned before, one of the features I’m most proud of is the firmware update capability we built into one of the devices I worked on. I think this is a crucial capability, not just for delivering new functionality but also for applying OS and security patches over the device’s entire lifetime.

To keep a system secure for 10+ years, the update mechanism itself must be secure: signed and verified updates, encrypted transport, and a rollback option in case an update fails. In regulated environments, it also needs to integrate with compliance workflows so updates can be deployed without breaking certification. In some cases, it’s wise to design for network segmentation or controlled update channels, so that only trusted endpoints can initiate the process. Without this foundation from day one, long-term patching becomes either risky or impossible.
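The verify-then-rollback flow can be sketched as follows. This assumes an A/B slot scheme, which is a common approach but not something the interview states was used. `Updater`, `Verifier`, and `SelfTest` are hypothetical names; the signature check and the post-install self-test are injected as placeholders because the real ones depend on the platform's crypto library and boot process. Encrypted transport is out of scope for this sketch.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <utility>
#include <vector>

enum class Slot { A, B };
enum class UpdateResult { Applied, RejectedSignature, RolledBack };

struct Image {
    std::vector<std::uint8_t> payload;
    std::vector<std::uint8_t> signature;
};

// Placeholder hooks (assumptions): real implementations would call a crypto
// library and boot the candidate image under a watchdog.
using Verifier = std::function<bool(const Image&)>;
using SelfTest = std::function<bool(Slot)>;

class Updater {
public:
    Updater(Verifier verify, SelfTest self_test)
        : verify_(std::move(verify)), self_test_(std::move(self_test)) {
        assert(verify_ && self_test_);
    }

    Slot active() const { return active_; }

    UpdateResult apply(const Image& image) {
        // 1. Refuse anything that does not pass signature verification.
        if (!verify_(image)) {
            return UpdateResult::RejectedSignature;
        }
        // 2. Install into the inactive slot; the running image stays untouched.
        const Slot candidate = (active_ == Slot::A) ? Slot::B : Slot::A;
        // (Writing image.payload to the candidate slot would happen here.)

        // 3. Switch only if the candidate passes its self-test; otherwise the
        //    device keeps running the previous slot, which is the rollback.
        if (!self_test_(candidate)) {
            return UpdateResult::RolledBack;
        }
        active_ = candidate;
        return UpdateResult::Applied;
    }

private:
    Verifier verify_;
    SelfTest self_test_;
    Slot active_ = Slot::A;
};
```

The key design choice is that the active slot changes only after both checks pass, so a failed or tampered update never leaves the device without a known-good image.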

10: Are there insights or practices—whether from automotive, avionics, or industrial IoT—that you find relevant or transferable to your work? Are there philosophies or practices from other domains that you think MedTech could borrow—or should avoid?

Alexander Kushnir:

I think the processes in MedTech are good, but slow. Code review, documentation, testing: these all have a clear purpose, and they exist for good reasons. But no process should be treated as sacred. Code review isn’t done just because “that’s the rule”; it’s done to catch defects and improve design. The same goes for documentation and tests: they’re tools, not rituals.

That’s something I see in other industries as well. Automotive has learned to speed up iterations without skipping the essentials, especially with OTA updates. Avionics shows how you can lock down safety-critical code while still evolving peripheral systems. From these, I think MedTech can borrow the idea of tailoring process intensity to the context—keeping rigorous control where safety demands it, but streamlining where it doesn’t. The key is to always ask: how crucial is this step at the stage we’re in right now?