Architecting AI Software Systems for the Real World: A Conversation with Imran Ahmad
Designing scalable, sustainable AI—from lab prototypes to production systems
AI systems are now everywhere in software, but turning a promising model into a reliable, cost-effective, and sustainable product is still hard work. Teams are discovering that “just add a model” is not enough; you need end-to-end architecture that can take an idea from a lab-style proof of concept to a production system that meets real constraints around cost, latency, security, and operations.
Imran Ahmad is a data scientist, educator, and author focused on algorithms, AI, and cloud computing. He leads machine learning projects for the Canadian government, teaches at Carleton University, and is an authorized instructor for AWS and Google Cloud. With Packt, he has authored 40 Algorithms Every Programmer Should Know (2020), 50 Algorithms Every Programmer Should Know (2023), and Architecting AI Software Systems (2025, with co-author Richard D. Avila); 30 Agents Every AI Engineer Should Know is upcoming. Outside of work, he enjoys photography, biking, and mentoring developers through his Discord community and workshops.
In this conversation, we dig into how Imran thinks about AI architecture in practice: from the fundamentals of good software architecture and elastic cloud patterns to the pillars he uses to evaluate AI systems (security, reliability, performance efficiency, cost optimization, sustainability, and operational excellence). We discuss separating data and compute for sustainability, designing differently for heavy training workloads versus real-time inference, and avoiding hard coupling to any single AI or cloud vendor. Imran also shares his perspective on agentic AI and agentic RAG, what changes as AI becomes a core concern for software architects, and why UX, cross-functional collaboration, and long-term operational thinking are now central to successful AI systems.
About the Book
1: What inspired you to write Architecting AI Software Systems now, and what gap in the industry or knowledge are you hoping to fill with this book?
Imran Ahmad: When you design AI systems, or when you build a Gen AI solution, you need an end-to-end solution, so you have to look at it in its totality. Whenever there is a new technology or a new technical idea, we start by focusing on depth. We develop solutions, experiment with them, and iterate through different versions until they are ready to solve large-scale problems, well beyond telling pictures of cats from pictures of dogs.
Now AI has come a long way, and our development processes have matured. The next step is deployment. AI solutions are only useful when they solve a problem in production, and when you bring them to production they need to work properly from start to end. That is what you have to design the architecture around. I will also talk about how you quantify a good AI architecture. We evaluate it against what we call the pillars: security, reliability, performance efficiency, cost optimization, sustainability, and, perhaps the most important, operational excellence. We need all of this because we are bringing these ideas to production, and that only succeeds if we give the design proper thought and bring all the best design patterns to the table. That is what motivated me to write this book.
2: Your book takes a very practical, architecture-centric approach to AI, doesn’t it? It mentions a structured journey with real-world examples, hands-on exercises, and even a fictional AI system’s architecture as a learning tool. So can you give us a quick overview of the key themes or unique features of this book?
Imran Ahmad: We have divided the book into two parts. The first part is about the fundamentals of architecture. It zooms out and talks, in general rather than specifically about AI, about the principles of good architecture.
Of course we are talking about AI, so we start with the fundamentals of AI systems, and this is where we define terms. We define microservice architecture and discuss a couple of actual use cases. We also define terms such as data lake and data warehouse, and explain why they are so important when designing AI systems.
When we bring these systems to the cloud, we get elastic architectures, and with an elastic architecture your system can be cost-effective and performant at the same time. Usually, performant systems are not cost-effective. It is like buying a Ferrari: it is one of the fastest cars, but it is also one of the most expensive. What we are trying to do is buy a Ferrari at the cost of a Toyota Corolla. Elastic architectures get you there, because your systems can expand and shrink based on immediate needs. This is why cloud computing is so important, and that is chapter 1.
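The “expand and shrink” behavior can be sketched as a target-tracking scaling rule. The sketch below borrows the formula documented for the Kubernetes Horizontal Pod Autoscaler; the function name, the 60% CPU target, and the 1 to 20 replica bounds are illustrative assumptions, not anything prescribed in the book.

```python
import math


def desired_replicas(current_replicas: int,
                     avg_cpu_percent: float,
                     target_cpu_percent: float = 60.0,
                     min_replicas: int = 1,
                     max_replicas: int = 20) -> int:
    """Return how many replicas an elastic service should run next.

    Target-tracking rule, as described in the Kubernetes HPA docs:
        desired = ceil(current_replicas * current_metric / target_metric)
    """
    if current_replicas < 1 or target_cpu_percent <= 0:
        raise ValueError("need at least one replica and a positive target")
    raw = math.ceil(current_replicas * avg_cpu_percent / target_cpu_percent)
    # Clamp so the service never drops below the floor or grows without bound.
    return max(min_replicas, min(max_replicas, raw))


# Busy period: 4 replicas averaging 90% CPU scale out to 6.
print(desired_replicas(4, 90.0))   # 6
# Quiet period: 4 replicas averaging 15% CPU shrink to 1.
print(desired_replicas(4, 15.0))   # 1
```

In practice the platform's autoscaler applies a rule like this for you; the point is that capacity follows demand instead of being sized permanently for the peak.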
Then we talk about the case for architecture. My co-author, Richard, is a great architect, and this is where he brings his 30 years of experience: he talks about the role of the architect, the vision, and the style, and, most importantly, what the implications are if the architecture is not properly designed.
Then chapter 3 brings software engineering into the picture: what are the software-engineering-specific topics related to architecture? That completes Part 1. Part 2 is specific to AI systems: what are the architecture templates and architecture philosophies that are relevant to AI systems?
Here we talk about the “concept of operations” and how it applies to AI systems. Then, as you mentioned, we cover large-scale use cases. We walk through a complete use case so that, if you want to design a RAG system, you know how to apply the ideas we develop in this book to an actual problem.
3: What key skills, and perhaps what changes in mindset, can readers expect to gain from the case studies and exercises?
Imran Ahmad: The first skill is how to convert an idea into a design. Requirements capture the idea in a formal way, but they are usually written by non-technical people. The first part of the solution is to convert those requirements into a technical design, and that is the architecture. That is skill number one.
The second skill is dealing with surprises. The design you come up with may not be the perfect one, so you have to iterate: you try something, and it may or may not work. You also need ways to differentiate between a good design and a bad one. Functionally, both will work, because the functional requirements are met. It is the non-functional requirements that separate a top-notch design from a mediocre one: how cost-effective your architecture is, how performant, secure, and reliable it is, whether you are applying the principles of sustainability, and whether you are bringing operational excellence.
Operational excellence is a concept we discuss in this book. It means that, when you design these architectures, you are looking at the long term. You use YAML or JSON files for configuration, you use orchestrators, and you parameterize whenever possible so that your components can be reused and maintained.
So the second skill is also about recognizing that your first attempt at the design may not be the best one. You have to rethink it, evolve it, and quantify how good it is. Then, before implementing it in full, it is a good idea to start with a pilot project. A pilot is usually about one-tenth of the scale of the original project, but it covers all the critical points and all the critical parts of the design. It validates the design, and then you move toward the full-blown solution.
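As a sketch of the parameterization Imran describes, the example below keeps every environment-specific value in a JSON document and feeds it to a single pipeline definition. JSON is used so the example needs only the standard library (YAML would need a third-party parser such as PyYAML); the field names, model name, and paths are hypothetical.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class PipelineConfig:
    """All tunable values live in configuration, not in the pipeline code."""
    environment: str
    data_path: str
    model_name: str
    batch_size: int = 32
    max_workers: int = 4


def load_config(text: str) -> PipelineConfig:
    raw = json.loads(text)
    # Unknown or missing keys raise TypeError, so typos in config files fail fast.
    return PipelineConfig(**raw)


def describe_run(cfg: PipelineConfig) -> str:
    # The pipeline logic is identical everywhere; only the parameters differ.
    return (f"{cfg.environment}: {cfg.model_name} on {cfg.data_path} "
            f"(batch={cfg.batch_size}, workers={cfg.max_workers})")


# Hypothetical config files for two environments, inlined here for brevity.
DEV_JSON = '{"environment": "dev", "data_path": "data/sample/", "model_name": "recommender-v1", "batch_size": 8, "max_workers": 1}'
PROD_JSON = '{"environment": "prod", "data_path": "data/full/", "model_name": "recommender-v1", "max_workers": 16}'

print(describe_run(load_config(DEV_JSON)))
print(describe_run(load_config(PROD_JSON)))
```

Promoting a change from development to production then means editing a config file under version control, not editing code.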
Key Challenges in AI Architecture
4: In the current context, many teams struggle to turn AI prototypes into reliable products. From your experience, what are the main challenges in bridging the gap between a promising AI demo and a production-ready system?
Imran Ahmad: You can look at it from three aspects, and it is a very important topic when you move toward an AI solution. Initially, AI feels like a silver bullet that can solve any problem; that is the experimentation phase. But the success of an AI project depends on three factors: cost, performance, and accuracy.
Let me explain. Usually, you are applying AI to an existing problem. You already have an alternative solution, and things are already working. Now you want to upgrade it and bring AI into the picture to do things in a new way. As you move to this new world, if I can call it that, you first need to quantify the effect on cost: whether the investment in AI, meaning the initial investment plus the running expenditure, can be justified by the aggregated cost savings you expect. This is point number one.
The second factor, which I called performance, is really about time. Will your current processes be optimized, and will things get done and requirements be met in a more timely way? For example, if a bank manager reviews mortgage applications and the bank now wants to use AI, perhaps a review that takes the manager four hours will take four seconds. That is the time saved, and it is the second dimension.
The third factor is accuracy. Will the new approach you are introducing through AI be more accurate than your current systems and processes? You have to make progress along these three dimensions. Maybe the AI solution will be more costly, but if you can justify it in terms of time and accuracy, you may be able to sell the idea to senior management.
I encourage doing this right at the beginning; it should not be an afterthought. I have seen teams come up with an AI architecture, implement it, and only then do some sort of time-and-motion study to see whether it is justified. By that time, you have already implemented it, already invested, and are already using the system. It is like buying a plug-in hybrid car: you would not pay for it first and only then check whether it made sense. You check before buying.
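The three-way check can be written down before any architecture is built. The sketch below compares a current process with an AI option over a fixed horizon; the class, the field names, and every number are hypothetical, and a real business case would also account for things like risk and discounting.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProcessOption:
    name: str
    upfront_cost: float        # one-time investment (build, integration, training)
    monthly_cost: float        # running expenditure (people, compute, licences)
    seconds_per_case: float    # time to reach a decision on one application
    accuracy: float            # share of correct decisions on a held-out set


def business_case(current: ProcessOption, ai: ProcessOption,
                  cases_per_month: int, months: int) -> dict:
    """Compare the AI option against the current process on cost, time, and accuracy."""
    def total_cost(option: ProcessOption) -> float:
        return option.upfront_cost + option.monthly_cost * months

    return {
        "extra_cost": total_cost(ai) - total_cost(current),
        "hours_saved": (current.seconds_per_case - ai.seconds_per_case)
                       * cases_per_month * months / 3600,
        "accuracy_gain": ai.accuracy - current.accuracy,
    }


# All figures are hypothetical: a manager needs 4 hours per mortgage file, the model 4 seconds.
manual = ProcessOption("manual review", 0, 20_000, 4 * 3600, 0.88)
model = ProcessOption("AI-assisted", 150_000, 12_000, 4, 0.91)
print(business_case(manual, model, cases_per_month=100, months=12))
```

Here the AI option costs more over the year, but it saves close to 4,800 hours and gains three percentage points of accuracy, which is the kind of trade-off to put in front of senior management before building anything.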
5: How can software architects specifically help ensure an AI proof of concept scales into a sustainable real-world solution?
Imran Ahmad: Sustainability is handled at two levels. If you are using cloud computing, part of it is the vendor's job, whether that is AWS, Azure, or Google Cloud. But you can still do a lot yourself. For example, with virtual machines in any of these cloud offerings, you can pay a flat fee and keep them running 24/7, and the same goes for the servers you run in-house. What you need to keep in mind is that these are performance-hungry, power-hungry machines, so leaving them running when there is no work wastes both money and energy.
The second level is design. A good design today treats the compute dimension as ephemeral and keeps the data dimension separate. When we talk about the architecture of these systems, there are two dimensions: data and compute. The design pattern we suggest is, first of all, a clear bifurcation, so that the data and compute dimensions are separate. The data dimension should be long term, meaning your data is stored there for two or three years. The compute dimension should be ephemeral, or temporary: when you need to process the data, you provision compute, process the data, and then suspend or remove it.
Let me give you a simple analogy. Most of us work in a word processor such as Microsoft Word. Say you are working on a research paper. The file already exists; whenever you find time, you open Word, work on the file, save it, and close Word. Your paper, stored perhaps on your hard disk for four or five years, is the data dimension. The word processor is the compute dimension: you open it only when you want to change the paper, and you close it once you are done. We can use the same design pattern for AI systems.
So whenever you need to change something, train a model, modify the processing pipeline, or process data, you provision your compute dimension. Compute should be need-based, and data should be long term. If you follow this clear bifurcation between data and compute, your AI system will be cost-effective and performant, and it will also meet the needs of sustainability.
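The data/compute split maps naturally onto a “provision, process, release” block. In this sketch, DataStore and ComputeCluster are hypothetical in-memory stand-ins (a real system might use object storage and a managed cluster or batch service); the point is that compute exists only inside the with block, while the data outlives it.

```python
from contextlib import contextmanager


class DataStore:
    """Long-lived data dimension (stand-in for object storage or a data lake)."""
    def __init__(self) -> None:
        self._objects: dict = {}

    def read(self, key: str):
        return self._objects[key]

    def write(self, key: str, value) -> None:
        self._objects[key] = value


class ComputeCluster:
    """Hypothetical ephemeral compute; a real version would call a cloud provisioning API."""
    def __init__(self, workers: int) -> None:
        self.workers = workers
        self.running = True

    def run(self, job, data):
        return job(data)

    def shutdown(self) -> None:
        self.running = False


@contextmanager
def ephemeral_compute(workers: int, audit_log: list):
    cluster = ComputeCluster(workers)
    audit_log.append(f"provisioned {workers} workers")
    try:
        yield cluster
    finally:
        # Always release compute, even if the job fails, so nothing sits idle.
        cluster.shutdown()
        audit_log.append("released compute")


store = DataStore()
store.write("raw/transactions", [120.0, 75.5, 9800.0])  # persists between jobs

log: list = []
with ephemeral_compute(4, log) as cluster:
    flagged = cluster.run(lambda amounts: [a for a in amounts if a > 5000],
                          store.read("raw/transactions"))
    store.write("derived/flagged", flagged)

print(store.read("derived/flagged"), cluster.running, log)
```

The derived result stays in the store after the cluster has been released, which is the Word-document analogy in code: the file persists, the application closes.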
6: There’s a concern about the sustainability and vendor lock-in of today’s AI platforms. For example, OpenAI reportedly reached $10 billion in revenue but is losing around $5 billion a year, a situation some have dubbed a “subprime AI” crisis. If enterprise architects build around such providers, they face continuity and lock-in risks. In your view, how should architects mitigate these risks?
Imran Ahmad: Wherever there is a choice, do not use proprietary, vendor-specific APIs. Almost always there are two choices: a vendor-specific API or a higher-level, generic API. Here is a specific example. When you work with large language models, you can use the APIs provided by OpenAI; since you mentioned OpenAI, let us go with that. That is choice one. The second choice is LangChain, which is an orchestrator. If you use the LangChain API, your code talks to LangChain, and LangChain talks to the OpenAI-specific API.
Now consider a scenario that is unlikely but possible: say OpenAI goes bankrupt. Because your code does not talk to OpenAI directly, the only thing that changes is the connection between LangChain and OpenAI, which becomes LangChain to Gemini or LangChain to Claude. Your code does not depend on OpenAI. You can repeat this design pattern for the cloud as well, by building on open source technologies such as Docker. If you package your system in Docker containers, your cloud platform is just a living space for those containers, and if you use Kubernetes to orchestrate them, that is even better.
In other words, all you need from your cloud computing platform is that it is a Kubernetes enabler. If one of those enablers goes bankrupt, you move to a different one without changing a single line of application code. This approach has trade-offs, though. Vendors sometimes offer their best tools only through vendor-specific APIs. For example, one of Google Cloud’s most polished tools is BigQuery, which is vendor-specific; on AWS, the equivalent is Redshift, which is vendor-specific as well.
My recommendation is that the risk of being hard-coded to a vendor-specific tool still outweighs the benefit, especially now, when things are changing so fast. We should be cautious and put effort into being as vendor-agnostic as possible.
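LangChain gives you this indirection off the shelf. To show the underlying pattern without depending on any particular library version, the sketch below hand-rolls a small interface with two adapters; it is not LangChain's API, and the adapter classes are hypothetical placeholders that return canned strings instead of calling a vendor.

```python
from typing import Callable, Dict, Protocol


class ChatModel(Protocol):
    """The only LLM interface the application code is allowed to see."""
    def complete(self, prompt: str) -> str: ...


class OpenAIAdapter:
    """Hypothetical adapter; a real one would wrap the vendor's SDK call here."""
    def complete(self, prompt: str) -> str:
        return f"[openai] {prompt}"


class ClaudeAdapter:
    """Hypothetical adapter for a second vendor."""
    def complete(self, prompt: str) -> str:
        return f"[claude] {prompt}"


PROVIDERS: Dict[str, Callable[[], ChatModel]] = {
    "openai": OpenAIAdapter,
    "claude": ClaudeAdapter,
}


def get_model(provider: str) -> ChatModel:
    # The provider name would come from configuration, not be hard-coded.
    return PROVIDERS[provider]()


def summarize_ticket(model: ChatModel, ticket: str) -> str:
    # Business logic depends only on ChatModel, never on a vendor SDK.
    return model.complete(f"Summarize: {ticket}")


print(summarize_ticket(get_model("openai"), "card declined at checkout"))
print(summarize_ticket(get_model("claude"), "card declined at checkout"))
```

Switching vendors then means adding one adapter and changing one configuration value; summarize_ticket and everything that calls it stay untouched.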
7: Ensuring quality and maintainability in AI software is an emerging concern. Studies find many AI/ML codebases have minimal testing and documentation, often due to “lab-style” development by data scientists. And according to the State of Software 2025 report, only about 1.5% of the code in AI/big-data systems is test code (versus around 43% in traditional systems). Why do you think AI projects often end up with weaker software engineering practices? And what can architects do to instill better rigor—for example, would organizing cross-functional teams of data scientists and software engineers help bridge the gap and improve things like testing, security, and code quality?
Imran Ahmad: It is a new technology. In many cases, people are learning as they implement, and weaker practices are a byproduct of that. With mature technologies, test cases are already established; we know from similar projects what the criteria of success are, both functional and non-functional.
With something new, the first goal is simply to get it working, and only then do we ask: what is the best way to test its functionality? If hallucination is a concern, how do we test whether our solution is hallucinating? If accuracy is a concern, how do we test that? In Gen AI, the metrics themselves are still evolving. Testing is about quantifying whether a solution meets the agreed-upon goals, and because those goals are still evolving, these projects are often not well tested, as you said. That is a concern, but things will improve.
AI success is also subjective in many ways: a project may be a success for me but a failure for you. As you suggested, one way to mitigate that is to reach a consensus among people with different roles and skills: a data scientist, a data engineer, a project manager, and perhaps a business analyst or the person in charge of production. Each of them will have a different view of what success looks like.
For a data scientist or an AI engineer, success is mostly about meeting the functional requirements. The person in production may know nothing about algorithms, ROC curves, AUC, recall, or precision. For them, success is deploying the Dockerized solution on a server and making sure the non-functional requirements of reliability, security, performance, and availability are met. If the application approves or refuses mortgage applications, for the person in production it is all about whether the service is available, whereas for the data scientist it is all about the data science metrics.
So we have to bring them all to the table and agree on what end-to-end success for the project looks like, both in development and in production. People do not have to agree on everything, but they have to speak up about what they think of the solution. Then they discuss, understand each other’s worlds, and reach a consensus, or a compromise. Once it is agreed, that becomes the criteria of success. This needs to happen, especially on large-scale projects.
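One way to make the agreed criteria concrete is to encode them as a single check that reports both views side by side. The thresholds, the toy labels, and the request counts below are placeholders; the sketch simply shows that model metrics and production metrics can live in the same acceptance test.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


def precision_recall(y_true: List[int], y_pred: List[int]) -> Tuple[float, float]:
    """Binary precision and recall; 1 is the positive class (e.g. 'refuse the mortgage')."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


@dataclass(frozen=True)
class SuccessCriteria:
    """Thresholds agreed on jointly by data science, engineering, and production."""
    min_precision: float
    min_recall: float
    min_availability: float


def check_success(criteria: SuccessCriteria, y_true: List[int], y_pred: List[int],
                  ok_requests: int, total_requests: int) -> Dict[str, bool]:
    precision, recall = precision_recall(y_true, y_pred)
    availability = ok_requests / total_requests if total_requests else 0.0
    return {
        "precision": precision >= criteria.min_precision,           # data science view
        "recall": recall >= criteria.min_recall,                     # data science view
        "availability": availability >= criteria.min_availability,   # production view
    }


agreed = SuccessCriteria(min_precision=0.7, min_recall=0.8, min_availability=0.995)
labels = [1, 1, 1, 0, 0, 0, 1, 0]
predictions = [1, 1, 0, 0, 1, 0, 1, 0]
print(check_success(agreed, labels, predictions, ok_requests=9_990, total_requests=10_000))
```

Here the model passes on precision and the service passes on availability, but recall misses the agreed bar, so by the team's own definition the project is not done yet.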
Designing Scalable and Robust AI Systems
8: AI systems must be built with scale in mind from the start. On the training side, deep learning models demand substantial compute (GPUs/TPUs) and efficient distribution of tasks. On the inference side, serving many users requires horizontal scaling, containerization, and load balancing to keep latency low. What are some architectural strategies you recommend to handle scalability for AI? How do you approach designing for heavy training workloads versus high-volume real-time inference in a production system?
Imran Ahmad: First of all, it depends on the problem you are trying to solve; scalability requirements differ from problem to problem. Training a model is where most of the cost is incurred. You need GPUs, which are expensive, you need CPUs as well, and you experiment and train over and over again.
However, scalability requirements in development have two characteristics. First, once training is done, you no longer need those resources. You will retrain the model, of course, but if you develop the solution for two, three, or four months and the trained model then goes into production, the 20 machines you brought in will sit idle. That is why cloud computing works so well here: you provision resources, release them when you are done, and elasticity becomes important.
Second, there is no hard deadline associated with training. At inference, there is. When you swipe your credit card, the fraud detection result needs to come back within a few seconds; someone paying at a restaurant cannot wait 40 seconds.
When you are training a model in development, there is no such deadline; it is more about your comfort. If you can live with evolving the solution on a scaled-down system during the day, you can submit the full-scale training job in the evening before you leave. It runs overnight, and the result is there in the morning. During the day, you again work on a scaled-down system, say one-tenth of the size, and keep evolving it. Following that pattern saves a lot of cost: you use off-hours for full training and need far fewer resources overall. You need to be innovative and creative there.
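The day/night pattern can be captured as a small policy function that a job scheduler (cron, a workflow orchestrator, or a cloud batch service) would call before submitting a training job. The window times, the worker counts, and the function itself are hypothetical; the one-tenth sample mirrors the scaled-down daytime system Imran describes.

```python
from datetime import datetime, time


def training_plan(now: datetime, full_rows: int,
                  night_start: time = time(19, 0),
                  night_end: time = time(7, 0),
                  daytime_divisor: int = 10) -> dict:
    """Pick the training scale for a job submitted at `now`.

    Daytime: iterate on a one-tenth sample with a small worker pool.
    Off-hours: run the full dataset on the large pool.
    """
    t = now.time()
    # The night window wraps past midnight, hence the "or".
    off_hours = t >= night_start or t < night_end
    if off_hours:
        return {"mode": "full", "rows": full_rows, "workers": 20}
    return {"mode": "dev", "rows": full_rows // daytime_divisor, "workers": 2}


print(training_plan(datetime(2025, 3, 3, 14, 0), full_rows=1_000_000))
print(training_plan(datetime(2025, 3, 3, 22, 30), full_rows=1_000_000))
```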
The second part of the equation is scalability for inference. Say the model is trained and in production. We need to analyze the scalability requirements carefully; we should neither over-design nor under-design. Let me give you a couple of scenarios.
Take the credit card example again. Each time you swipe the card, the result, fraudulent or regular, needs to come back in about two seconds. That is a hard deadline, so your servers must be performant enough to meet it. Now imagine a bank manager who, at the end of the day, just needs a spreadsheet of the day's transactions showing which ones are likely to be fraudulent so they can be reviewed. There, the requirement is “end of day.”
In that case we do not need real-time HTTP endpoints; batch-mode inference is enough and saves a lot of cost. You gather your unlabeled data into a batch at whatever granularity works (end of day, top of the hour, end of week), submit it to the server, and it produces the labels: how many transactions are likely to be fraudulent and how many are not.
So real-time inference is not always needed; using it everywhere is expensive and may mean over-designing the system. To get scalability right, analyze the requirements first, and then design and architect the system around them.
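The decision can be expressed as a requirement-driven choice, followed by what batch scoring looks like. Everything here is a hypothetical sketch: the one-minute cutoff, the transaction fields, and the toy model are assumptions, and in a real system the batch path would typically be a scheduled job reading from storage rather than an in-memory list.

```python
from datetime import timedelta
from typing import Callable, Dict, List


def choose_serving_mode(max_latency: timedelta) -> str:
    """Real-time endpoint only when the business deadline is measured in seconds."""
    # The one-minute cutoff is an illustrative assumption, not a fixed rule.
    return "realtime" if max_latency <= timedelta(minutes=1) else "batch"


def score_batch(transactions: List[dict], model: Callable[[dict], int]) -> Dict[str, int]:
    """Label a whole batch at once, e.g. at end of day, instead of per request."""
    labels = [model(tx) for tx in transactions]
    flagged = sum(labels)
    return {"likely_fraud": flagged, "regular": len(labels) - flagged}


def toy_fraud_model(tx: dict) -> int:
    # Hypothetical stand-in for a trained classifier: large amounts abroad look risky.
    return int(tx["amount"] > 5000 and tx["country"] != tx["home_country"])


print(choose_serving_mode(timedelta(seconds=2)))    # card swipe
print(choose_serving_mode(timedelta(hours=24)))     # end-of-day report
day = [
    {"amount": 40, "country": "CA", "home_country": "CA"},
    {"amount": 9000, "country": "FR", "home_country": "CA"},
    {"amount": 7000, "country": "CA", "home_country": "CA"},
]
print(score_batch(day, toy_fraud_model))
```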
9: Integrating AI into existing enterprise environments can be complex. Teams often need to balance cloud-native AI services with capabilities within the customer’s current on-premises infrastructure so they can leverage existing investments and avoid disruption. How do you evaluate which deployment strategy is appropriate for a given project? What factors—for example, data sensitivity, legacy system constraints, regulatory requirements, or team skills—should influence whether AI systems run on-premises versus in the cloud or in a completely new environment?
Imran Ahmad: There are two things here. First, I suggest carefully determining the maturity level. There are four maturity levels, covering both technical infrastructure maturity and skill maturity.
Imagine a company with 30 people developing a recommendation engine that recommends products to their existing customers. They use some algorithms today, but now they want to modernize: they want to use deep learning, Gen AI, and cloud computing.
Their first requirement is that they cannot afford any disruption. So you need to look at which maturity level you are aiming for, but you also have the hard requirements of using the existing infrastructure and the existing people. Then we develop a phased approach, usually with four phases.
In phase one, we come up with the plan: we look at the current situation and decide on the path forward. Depending on the maturity level, we may decide to move one part of the system in phase one and keep the other part on-premises. In that company, for example, accounting might stay on-premises while the algorithms move to the cloud.
Then we have to figure out how to build a pipeline that links the on-premises environment with the cloud. Usually we keep redundant systems running both in the cloud and on-premises, test them side by side, and gradually remove the parts that are no longer needed on-premises. This phased approach depends on the industry vertical and on the company, it reduces risk, and it usually works.
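The “keep both running, compare, then remove” step is often called a parallel run or shadow deployment. A minimal sketch, assuming both environments expose a scoring function for the same customer IDs; the functions, the simulated bug, and the 99.5% agreement threshold are hypothetical.

```python
from typing import Callable, Iterable, List, Tuple


def parallel_run(inputs: Iterable[int],
                 on_prem: Callable[[int], float],
                 cloud: Callable[[int], float],
                 tolerance: float = 1e-9) -> Tuple[float, List[int]]:
    """Feed identical inputs to both systems and record where they disagree."""
    items = list(inputs)
    mismatches = [x for x in items if abs(on_prem(x) - cloud(x)) > tolerance]
    agreement = (len(items) - len(mismatches)) / len(items)
    return agreement, mismatches


def ready_to_retire_on_prem(agreement: float, required: float = 0.995) -> bool:
    return agreement >= required


# Hypothetical scoring functions for the recommendation engine.
def legacy_score(customer_id: int) -> float:
    return (customer_id * 37 % 100) / 100


def cloud_score(customer_id: int) -> float:
    # Simulated migration bug affecting a single customer.
    return 0.0 if customer_id == 13 else legacy_score(customer_id)


agreement, mismatches = parallel_run(range(1, 101), legacy_score, cloud_score)
print(agreement, mismatches, ready_to_retire_on_prem(agreement))
```

The single mismatch keeps agreement below the threshold, so the on-premises system stays in place until the discrepancy for customer 13 is explained.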
In some cases we do not have a choice. In government organizations or financial companies, regulatory requirements sometimes prevent data from being on the cloud. There are usually three such sectors, government, healthcare, and finance, where some data must comply with existing regulations. Moving that data to the cloud is not impossible, but it is more difficult, and for government it is sometimes not possible at all.
Let me give you an example. IBM has a tool called IBM SPSS Modeler that banks and other companies in the banking industry still use. If your processes depend on it and it is working fine, you will not get the same level of comfort in the cloud, because your legacy system carries a lot of embedded knowledge that will not be available there. You are tied to your legacy system until you are ready to retire that software, and until then there is no way you can move to the cloud.
Sometimes companies think about moving to the cloud mainly in terms of cost savings. Here is an example. About four or five years ago, the Canadian federal government decided to move to the cloud and started that journey. The infrastructure supporting the Canadian federal government is worth billions of dollars. The initial study estimated that the move would save about 20% of the cost.
Five years down the road, that did not happen. They moved to the cloud and have spent more money: costs have increased by about 12%. There is a reason for that. Without a conscious effort, the simplest cloud architectures are not cost-effective. A virtual machine running 24/7 meets the functional requirements, but it is not elastic, and yet it is the easiest solution.
That is why, throughout this talk, we have been making the case for architecture: taking a step back and spending time on design. In that example, the expected cost was 20% lower, and the actual cost is 12% higher. What that tells us is that we should not rush into the cloud. We should first understand what architecture we need, and once that architectural vision is clear, implement it so that we save money in the long run.
If you rush in with the easiest possible solution, it will be very difficult to change later, once your computing resources are in place and your compute, data, and functional dimensions are already running. Changing it then is difficult and risky; you are doing something you should have done a couple of years earlier. That is why our book has a whole section on the case for architecture: why system architecture is needed for AI systems.
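A back-of-the-envelope calculation shows why the always-on VM is the expensive default. The hourly rate and usage pattern below are hypothetical, and the sketch ignores storage, start-up time, and reserved or committed-use discounts, which can change the comparison in practice.

```python
HOURS_PER_MONTH = 730  # 8,760 hours per year / 12, a common approximation


def monthly_vm_cost(hourly_rate: float, busy_hours_per_day: int,
                    busy_days_per_month: int) -> dict:
    """Compare a VM left on 24/7 with one that runs only while there is work."""
    always_on = hourly_rate * HOURS_PER_MONTH
    elastic = hourly_rate * busy_hours_per_day * busy_days_per_month
    return {
        "always_on": always_on,
        "elastic": elastic,
        "savings_fraction": 1 - elastic / always_on,
    }


# Hypothetical: $2.00/hour instance, needed 8 hours a day on 22 working days.
print(monthly_vm_cost(2.0, 8, 22))
```

In this made-up case the elastic option costs about a quarter of the always-on one; the actual ratio depends entirely on how many hours the workload really needs.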
10: User experience is a critical yet often overlooked aspect of AI systems. Even if the model is accurate, poor UX can block adoption. What can architects and designers do to ensure an AI system delivers a good UX and drives user adoption? For example, what is your view on using user-centered design practices or designing for diverse user needs such as voice UIs and accessibility features? Do you have any best practices for aligning AI architecture with great UX design?
Imran Ahmad: Yes. So for UX we should always be designing the system, we should always be thinking of it as a service. If you are a technical person and you have a spouse who is non-technical—or if you have a brother or sister who is non-technical—think about that person and whether that person can use this service or not.
My brother is a medical doctor, so I always think: OK, the eventual service that I will provide, can he use it or not? Sometimes what happens is that we bring too much technicality to the front. We are very impressed with our own algorithms, our own models, and our own infrastructure, but the end user is a non-technical person. They should not even need to know the details in the data dimension or the compute dimension or which models we are using. That all should be a black box once things are done.
It is a good idea to always try to see, from the eye of a non-technical person, how easy it is for a non-technical person to use it as a service. So think about it as a service. Your solution should be a service to the end user. There are different zoom levels. You can think of your solution as a microservice architecture. Now, microservice architecture is quite technical; it is great for providing abstraction to a data engineer, but not to the end user. We need to zoom out more.
I am into photography, so I give examples from zoom levels. Zoom out more and think of it as a service. At the highest zoom level the user just sees, “This is a service that helps me do X,” and everything else is hidden.
The example of that is that sometimes we are using AI without noticing it. The greatest example is when you use Google Maps. When you use Google Maps, it uses an optimization algorithm to get you from point A to point B. If you look under the hood—because my PhD was in algorithms—optimization algorithms are one of the hard areas. There is a famous example of the travelling salesman’s algorithm, and the travelling salesman’s algorithm is basically that you have a list of cities—city one, city two, city three, city four—and you try to find the optimal route. This is an NP-hard algorithm.
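To see why the travelling salesman problem is hard, here is an exact brute-force solver. It is a teaching sketch with a made-up distance matrix: it checks every possible tour, and the number of tours grows factorially with the number of cities. It illustrates the hardness Imran mentions; it is not how a navigation service computes a route from A to B, which is typically treated as a shortest-path problem.

```python
from itertools import permutations
from math import factorial
from typing import List, Sequence, Tuple


def tour_length(route: Sequence[int], dist: List[List[int]]) -> int:
    # Closed tour: sum each leg, including the final leg back to the start.
    return sum(dist[a][b] for a, b in zip(route, list(route[1:]) + [route[0]]))


def brute_force_tsp(dist: List[List[int]]) -> Tuple[int, Tuple[int, ...]]:
    """Exact TSP by trying every ordering of cities 1..n-1 (city 0 is fixed as start)."""
    best_len, best_route = None, None
    for perm in permutations(range(1, len(dist))):
        route = (0,) + perm
        length = tour_length(route, dist)
        if best_len is None or length < best_len:
            best_len, best_route = length, route
    return best_len, best_route


def tours_to_check(n_cities: int) -> int:
    # Fixing the start city leaves (n - 1)! orderings to examine.
    return factorial(n_cities - 1)


dist = [
    [0, 2, 9, 10],
    [2, 0, 6, 4],
    [9, 6, 0, 3],
    [10, 4, 3, 0],
]
print(brute_force_tsp(dist))   # (18, (0, 1, 3, 2))
print(tours_to_check(4), tours_to_check(20))
```

Four cities mean only 6 candidate tours, but 20 cities already mean 19!, roughly 1.2 × 10^17, which is why practical systems rely on heuristics and approximations rather than exhaustive search.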
So whenever you say, “OK, I want to go from point A to point B,” you do not see how much is going on under the hood. First, your GPS location needs to be tracked. Then the destination needs to be known, and so do the real-time traffic conditions on each of the possible routes, so the problem is dynamic as well. Then you reach your destination, it asks you for feedback, and we do not even realize how much computing power is being used for this simple use case.
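Point-to-point routing is usually framed as a shortest-path problem rather than a travelling-salesman tour, and the dynamic part can be modelled by re-weighting road segments with live traffic before each search. The sketch below uses Dijkstra's algorithm as a stand-in; the road network, the traffic multipliers, and `fastest_route` are hypothetical and not a description of Google Maps internals.

```python
import heapq


def fastest_route(graph, traffic, source, target):
    """Dijkstra's algorithm where each edge's free-flow time is scaled by live traffic.

    graph:   {node: {neighbor: free_flow_minutes}}
    traffic: {(node, neighbor): slowdown multiplier}; missing edges default to 1.0
    """
    best = {source: 0.0}
    prev = {}
    heap = [(0.0, source)]
    done = set()
    while heap:
        minutes, node = heapq.heappop(heap)
        if node in done:
            continue  # stale heap entry for an already-settled node
        done.add(node)
        if node == target:
            break
        for nxt, free_flow in graph.get(node, {}).items():
            cost = minutes + free_flow * traffic.get((node, nxt), 1.0)
            if cost < best.get(nxt, float("inf")):
                best[nxt] = cost
                prev[nxt] = node
                heapq.heappush(heap, (cost, nxt))
    if target not in best:
        return float("inf"), []
    path = [target]
    while path[-1] != source:
        path.append(prev[path[-1]])
    return best[target], path[::-1]


# Hypothetical road network (free-flow minutes).
ROADS = {"A": {"X": 5, "Y": 7}, "X": {"B": 5}, "Y": {"B": 4}}

print(fastest_route(ROADS, {}, "A", "B"))                 # (10.0, ['A', 'X', 'B'])
print(fastest_route(ROADS, {("X", "B"): 3.0}, "A", "B"))  # (11.0, ['A', 'Y', 'B'])
```

When traffic on the X to B segment triples its travel time, the same query switches to the route via Y, which is the dynamic behaviour described above.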
This is the best example. People use it; my daughter uses it to get to her school. People will use a service if they find it easy to use, and they do not need to know what is under the hood. That is the UX.
And I will also tell you about the gap I mentioned earlier. To know the real-time traffic on those routes, it assumes that people are carrying their devices in their cars, and if those cars slow down, it infers that there is traffic congestion. That works most of the time. But where I live in the north, there is a place called Gatineau Park. It is about 80 kilometers long. People cycle there with Google Maps running on their phones, and Google Maps always thinks there is traffic congestion. It is always red, but if you go there, there is no one on the road. So there will be failures; algorithms do not always work.
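The Gatineau Park failure is easy to reproduce with a naive rule that infers congestion from slow-moving phones. In this sketch the thresholds, the probe data, and the `mode` field are all hypothetical; a real system would have to infer the travel mode itself, which is exactly where such errors creep in.

```python
from statistics import mean


def congested(speeds_kmh: list[float], speed_limit_kmh: float,
              ratio: float = 0.5, min_probes: int = 3) -> bool:
    """Naive rule: flag congestion when probes average under `ratio` of the limit."""
    if len(speeds_kmh) < min_probes:
        return False  # not enough evidence to claim congestion
    return mean(speeds_kmh) < ratio * speed_limit_kmh


# Hypothetical probes on a parkway with an 80 km/h limit: only cyclists are out.
probes = [
    {"speed": 18.0, "mode": "cycling"},
    {"speed": 22.0, "mode": "cycling"},
    {"speed": 15.0, "mode": "cycling"},
]

# Every slow phone counts as a slow car, so an empty road shows up as "red".
print(congested([p["speed"] for p in probes], 80))  # True

# Keeping only probes believed to be in cars removes the false alarm.
driving = [p["speed"] for p in probes if p["mode"] == "driving"]
print(congested(driving, 80))  # False
```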
Still, as a user, you trust it because of the overall experience: it is easy to use, it hides the complexity, and most of the time it works. That is what we should be aiming for when we align AI architecture with great UX design.
Emerging Trends and Future Outlook
11: The rise of “agentic AI” is a hot topic in 2025. We touched on it in the last conversation we had. Major platforms are jumping in—for instance, Microsoft’s new Azure AI Agent service helps orchestrate multiple specialized agents and tools. What might this shift from single AI applications to multi-agent systems mean for software architects? How might architectures evolve to accommodate networks of AI agents that can plan, collaborate, and act autonomously? What challenges should we be prepared for in areas like agent coordination, security, or reliability?
Imran Ahmad: OK, first, let us think about this. Right now, when we design an AI system, the goal is to mimic human wisdom. That is what artificial intelligence is: mimicking human wisdom.
Imagine a person who wants to develop a fraud detection system and wants to get it done by the end of Monday. The first step in the human mind is discovery: OK, what are the requirements, and what are the tools that are available? Maybe there are existing tools, maybe there are friends to ask about which tools exist. In my mind, I will orchestrate. I will use those tools in different ways, I will come up with a plan, and I will start using those tools. Some of those tools will work, some will not, and the solution that I deliver will be the result of using existing tools, being aware of the tools that are available to me, and combining them in a meaningful way.
An AI agent is mimicking exactly this human behavior. An ideal agent should be aware of the tools that are available. Second, it should be able to orchestrate those tools in a way that leads to a meaningful solution. Third, it should be ready for surprises. Just like I can change my plan when something unexpected happens, the agent should be dynamic enough that it can change and re-plan as it goes. These are the three attributes of an agent.
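The three attributes (tool awareness, orchestration, and readiness for surprises) can be sketched as a small control loop. This is a minimal sketch under simplifying assumptions: the candidate plans are listed up front rather than generated by a model, re-planning is simulated by falling back to the next candidate, and the fraud-detection tools (`load_transactions`, `ml_scorer`, `rules_engine`, `graph_check`) are hypothetical stubs.

```python
from typing import Callable, Dict, List

Tool = Callable[[str], str]


class ToolFailed(Exception):
    """Raised by a tool when something unexpected happens."""


def run_agent(goal: str, tools: Dict[str, Tool], plans: List[List[str]]) -> str:
    """Try candidate plans in order, moving on when one cannot run or breaks."""
    for plan in plans:
        # 1. Awareness: skip any plan that needs a tool we do not have.
        if any(name not in tools for name in plan):
            continue
        result = goal
        try:
            # 2. Orchestration: chain the tools, feeding each output into the next.
            for name in plan:
                result = tools[name](result)
        except ToolFailed:
            # 3. Surprise: this plan failed mid-way, so fall back to the next one.
            continue
        return result
    raise RuntimeError("no plan succeeded")


def load_transactions(goal: str) -> str:
    return "txns:1042"  # stub: pretend we loaded 1,042 transactions


def ml_scorer(data: str) -> str:
    raise ToolFailed("model endpoint unavailable")  # stub: simulate an outage


def rules_engine(data: str) -> str:
    count = data.split(":")[1]
    return f"flagged 3 of {count} via rules"  # stub result


tools = {"load_transactions": load_transactions, "ml_scorer": ml_scorer, "rules_engine": rules_engine}
plans = [
    ["load_transactions", "ml_scorer"],
    ["load_transactions", "graph_check"],  # needs a tool that is not registered
    ["load_transactions", "rules_engine"],
]
print(run_agent("flag fraudulent transactions", tools, plans))  # flagged 3 of 1042 via rules
```

In a model-driven agent, the plans would be produced and revised at run time, and the language model would be one component alongside the registered tools.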
In an agentic system, a large language model is just one of the tools. It is an important tool, but right now the large language model sometimes becomes the “king” and everything else is forgotten. What Azure has provided, and what Google has provided with its own agent solutions (for example, Agentspace and agent design tools), is a way to step back and see these as orchestration platforms. We can zoom out and look at them in a vendor-agnostic manner; essentially, they are all doing almost the same thing.
Now, for architects, the first thing is that they should be aware of these new developments. That is why this book is about the architecture of AI systems. We are entering a time—2025 and 2026—where AI architecture itself is becoming a specialty. You need to be aware of these developments and track them on a regular basis. One way I keep up is by subscribing to good YouTube channels and other high-quality sources. There is a lot of content out there where people give talks but do not really know what they are talking about, so you have to be selective. And you have to recognize that what is relevant today may not be relevant at the end of 2025.
At the same time, some fundamentals do not change: the need for good architectures, the need for performant architectures, the need to create operational excellence, and the need to have data that is reliable. If agentic systems are one way of doing things, they are not the only way. There will always be new ways coming. You should keep an eye on them and keep incorporating new ideas as they come along.
The challenges are very similar to what we saw with Kubernetes. When Kubernetes was introduced, there was so much excitement. I used to teach courses on Kubernetes, and people mainly wanted to learn how to design and manage applications on it; they were less interested in the internals. Now, if you use a managed service like Vertex AI, under the hood it provisions a Kubernetes system for you and you do not need to think about those details; you just use it.
Right now, these agentic systems are like Kubernetes in its early days. They are still being developed, so sometimes they will work, sometimes they will not. But you will see that in less than a year these systems will become mature. As an architect, you should expect that maturity. Things like agents talking to each other should come out of the box; multi-agent systems, where each agent is a specialist with its own piece of wisdom for a particular vertical, will become the norm.
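A multi-agent system of specialists needs some way to decide which agent takes which task. The sketch below uses plain keyword matching in a hypothetical `Coordinator` with a generalist fallback; managed agent platforms typically route with a model instead, so treat this only as an illustration of the specialist-plus-fallback structure.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class SpecialistAgent:
    """One agent with expertise in a single vertical (hypothetical)."""
    name: str
    keywords: List[str]
    handle: Callable[[str], str]

    def can_handle(self, task: str) -> bool:
        text = task.lower()
        return any(word in text for word in self.keywords)


@dataclass
class Coordinator:
    """Routes each task to the first specialist that claims it, else to a generalist."""
    specialists: List[SpecialistAgent]
    generalist: SpecialistAgent

    def dispatch(self, task: str) -> Tuple[str, str]:
        for agent in self.specialists:
            if agent.can_handle(task):
                return agent.name, agent.handle(task)
        return self.generalist.name, self.generalist.handle(task)


fraud = SpecialistAgent("fraud", ["chargeback", "fraud"], lambda t: f"risk review opened: {t}")
tax = SpecialistAgent("tax", ["tax", "gst"], lambda t: f"tax estimate drafted: {t}")
general = SpecialistAgent("general", [], lambda t: f"answered generally: {t}")

coordinator = Coordinator([fraud, tax], general)
for task in ["Review chargeback on order 17", "Estimate GST for Q3", "Summarize this meeting"]:
    print(coordinator.dispatch(task))
```

Substring matching is deliberately crude (for example, "tax" would also match "syntax"); the structural point is that each specialist stays narrow while the coordinator guarantees every task lands somewhere.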
Our responsibility as architects is to start bringing these entities into our architecture and then let the system evolve and mature in the coming months. Some glitches will be there, but over time those glitches will be resolved.
12: Enterprises also grapple with how to integrate their data with AI models effectively. One common pattern is bringing domain knowledge into AI workflows so that models can reason over real enterprise context. What is the right approach for infusing domain knowledge into AI systems? Do you think Retrieval-Augmented Generation (RAG) will remain the dominant architecture for bringing enterprise data into AI workflows, or will other patterns become more prominent as AI capabilities evolve?
Imran Ahmad: Yes. In some ways RAG is becoming obsolete, because context windows keep getting larger, and that can remove the need for retrieval. But it also means that with each request you may have to send a lot of information, and that may not be an efficient use of the model.
The advantage of RAG is efficiency. Instead of sending everything, you retrieve only the relevant passages of text (usually found by comparing embedding vectors) and attach those to the request. So our requests become more focused, and we do not waste capacity on irrelevant context.
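The efficiency argument can be made concrete with a toy retriever. This sketch ranks hypothetical policy snippets by cosine similarity over word counts, a simple stand-in for the learned embeddings a production RAG system would use, and attaches only the top `k` snippets to the prompt instead of the whole corpus.

```python
import math
import re
from collections import Counter

# Hypothetical enterprise snippets.
DOCS = [
    "Refunds are issued within 14 days of a returned item being received.",
    "Our head office is located in Ottawa.",
    "Expense claims must be submitted within 30 days.",
    "Refunds for digital purchases are not available after download.",
    "The cafeteria opens at 8 a.m.",
]


def vectorize(text: str) -> Counter:
    """Bag-of-words counts; a stand-in for a real embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query: str, chunks: list[str], k: int = 2) -> list[str]:
    """Return the k chunks most similar to the query."""
    q = vectorize(query)
    return sorted(chunks, key=lambda c: cosine(q, vectorize(c)), reverse=True)[:k]


def build_prompt(query: str, context_chunks: list[str]) -> str:
    context = "\n".join(f"- {c}" for c in context_chunks)
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"


QUERY = "How many days until refunds are issued?"
top = retrieve(QUERY, DOCS)
print(top)
print(len(build_prompt(QUERY, DOCS).split()), "words with every snippet attached")
print(len(build_prompt(QUERY, top).split()), "words with retrieval")
```

With five snippets the saving is small, but the full-context prompt grows with the whole corpus on every request, while the retrieved prompt grows only with `k`.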
Agentic RAG is a step ahead. This is something that is still being developed, and classical RAG may become obsolete eventually. That is why I was saying earlier that these systems are expected to evolve. But RAG is still important, because you need to understand RAG in order to get to agentic RAG. In the book we have talked about RAG, and I feel that this is the right learning path: learn the simple use case before moving to the more complex one.
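Agentic RAG is still evolving, as the answer notes, and there is no single standard design. One common reading is that an agent inspects what it retrieved and decides the next step: answer, reformulate the query, or give up rather than answer from weak context. The loop below sketches that idea with hypothetical sources, a word-overlap relevance score, and a synonym-based `rewrite` standing in for an LLM.

```python
import re

SYNONYMS = {"reimbursement": "refund"}  # hypothetical rewrite table

# Hypothetical knowledge sources the agent can search.
SOURCES = {
    "hr_policies": ["Expense claims must be submitted within 30 days.",
                    "Vacation requests need manager approval."],
    "finance_faq": ["A refund is paid within 14 days.",
                    "Invoices are issued monthly."],
}


def words(text: str) -> set[str]:
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def score(query: str, chunk: str) -> float:
    """Fraction of query words present in the chunk (a crude relevance stand-in)."""
    q = words(query)
    return len(q & words(chunk)) / len(q) if q else 0.0


def rewrite(query: str) -> str:
    """Hypothetical query reformulation; an LLM would usually do this step."""
    return " ".join(SYNONYMS.get(w.lower(), w) for w in query.split())


def agentic_rag(query, sources, threshold=0.5, max_steps=3):
    """Retrieve, judge the evidence, and decide what to do next instead of retrieving once."""
    trace = []
    for step in range(max_steps):
        for name, chunks in sources.items():
            best = max(chunks, key=lambda c: score(query, c))
            relevance = score(query, best)
            trace.append((step, name, query, round(relevance, 2)))
            if relevance >= threshold:
                return best, trace  # good enough evidence: answer from it
        query = rewrite(query)  # weak evidence everywhere: reformulate and retry
    return None, trace  # give up rather than answer from weak context


answer, trace = agentic_rag("When is my reimbursement paid?", SOURCES)
print(answer)  # A refund is paid within 14 days.
for entry in trace:
    print(entry)
```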
Coming back to your question, there are always multiple ways of doing things. You can have agentic RAG, you can have a large context window, you can have what I would call “classical” RAG. There will be an overlap in functionality between these approaches. In that case, it becomes subjective. You have to carefully see what the advantages and disadvantages are for each option, and then choose the approach that gives you the best solution that is available currently.
13: Some say the “AI architect” is no longer just a technologist, but a strategic leader at the intersection of data, infrastructure, and product. How do you see the role of architects changing as AI becomes a core part of software systems?
Imran Ahmad: Yes. The traditional architect operated in the days of the waterfall methodology, with clearly defined phases: your project gets approved, it gets funding, then someone writes the business requirements for you. Then there is a layer of red tape. After that comes the architect, who designs the system, and whatever that person designs is written in stone. Then the technical team implements it, and the criterion of success is meeting that design as precisely as possible. Those days are gone.
The reason is that now the architect needs to be involved in the iterative process. When you are doing AI, you are trying new things, you are experimenting, and sometimes ideas will not work. So it means that the role of the architect is more dynamic in nature. As you move towards AI systems, the architect has to be involved in the pilot project; the architect may need to refactor, may need to redesign the data dimension or the compute dimension if they see performance bottlenecks. So the role has become more agile, but the need for the software architect is still there. It is very important—it has become more important than ever.
Let me give you a reason. A large-scale project is like building a home. In some villages, people still build houses without an architect. They have bricks, they have an idea—“let us build a room here, let us put a kitchen there”—and they just start. But in an organized way, an architect first plans: “OK, this is the room, this is the hall, this is the kitchen,” makes a blueprint, gets it approved, and then we start building the home.
Now think about this: if the architecture is wrong—let us say the bedroom was supposed to be on the ground floor because the owner has a knee problem and cannot climb the stairs—but that decision was not captured, and the bedroom ends up on the first floor, then you have a serious problem. You can imagine how expensive and disruptive it is to change the structure after everything has been built. The same goes for large-scale software architecture. The basic templates need to be decided before you build the system.
That is why there needs to be an architect who designs the large-scale components, and then someone starts filling in the details. Otherwise, you end up with very costly mistakes. If you look at some real-world stories—for example, JP Morgan—you will find cases where they designed their system and spent minimal time on architecture. They picked, for example, MongoDB, went ahead with their design, and eight months down the road they realized that this was the wrong choice. There was a loss of revenue, a loss of time, and this is something we want to avoid at all costs.
So the role of the architect in the age of AI is not going away. It is becoming more central: more dynamic, more involved throughout the lifecycle, and more responsible for making sure we do not build the “bedroom upstairs” when the user cannot climb the stairs.
14: What new responsibilities or skills—for example, understanding model behavior, data governance, or AI ethics—should architects cultivate now to successfully design and oversee AI-enabled software in the coming years?
Imran Ahmad: This is essentially about making yourself aware of what technologies are available and what is happening in AI. The architect should not treat AI as a black box or something that is “someone else’s job.” You should be able to understand, at least at a high level, what these AI components do and how they behave.
A key skill is the ability to choose the right AI components under given requirements: which model to use, what kind of data pipeline is needed, what kind of storage is appropriate, and how the compute dimension should be designed. You should be able to look at the requirements and say, “Under these constraints, this combination of components will work best.” That selection ability is very important.
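The selection skill described here can be framed as filtering options by hard constraints and then ranking whatever survives. In the sketch below, every option name, latency, cost, and accuracy figure is hypothetical, and ranking by accuracy with cost as a tie-breaker is just one possible policy.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Option:
    name: str
    p95_latency_ms: int
    cost_per_1k_requests: float
    accuracy: float
    on_prem: bool


def choose(options: List[Option], max_latency_ms: int, max_cost: float,
           require_on_prem: bool = False) -> Optional[Option]:
    """Drop options that break a hard constraint, then prefer accuracy, then lower cost."""
    feasible = [
        o for o in options
        if o.p95_latency_ms <= max_latency_ms
        and o.cost_per_1k_requests <= max_cost
        and (o.on_prem or not require_on_prem)
    ]
    if not feasible:
        return None  # no combination meets the requirements; revisit them
    return max(feasible, key=lambda o: (o.accuracy, -o.cost_per_1k_requests))


# Hypothetical catalog; the figures are illustrative, not benchmarks.
CATALOG = [
    Option("large-hosted-llm", 1200, 4.00, 0.93, False),
    Option("small-hosted-llm", 300, 0.60, 0.88, False),
    Option("distilled-on-prem", 150, 0.90, 0.86, True),
    Option("gradient-boosted-tree", 20, 0.05, 0.84, True),
]

print(choose(CATALOG, max_latency_ms=500, max_cost=1.0).name)  # small-hosted-llm
print(choose(CATALOG, 500, 1.0, require_on_prem=True).name)    # distilled-on-prem
print(choose(CATALOG, 50, 1.0, require_on_prem=True).name)     # gradient-boosted-tree
```

Returning `None` when nothing is feasible makes the conflict between requirements explicit instead of silently picking the least-bad option.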
Another responsibility is to understand the implications of AI decisions on things like data governance, security, and compliance. When you bring AI into the system, you are also bringing in new questions: how the data is collected, how it is stored, how it is used for training, how it is monitored in production, and how you make sure that you are meeting ethical and regulatory expectations.
So for many architects, this means retraining themselves in AI. For some, AI is a blind spot at the moment. Closing that blind spot is crucial: keep learning about AI concepts, stay current with the tools and patterns, and build enough understanding that you can make informed architectural decisions. You do not have to be the person implementing every model, but you should be comfortable enough with AI that you can confidently design, review, and oversee AI-enabled systems end to end.
To go deeper on designing robust, scalable AI-enabled systems—from integrating machine learning into existing architectures to managing risks like underperformance, cost overruns, and operational complexity—check out Architecting AI Software Systems by Richard D Avila and Imran Ahmad (Packt, 2025). Through a structured progression of architectural concepts, real-world case studies, and hands-on exercises (including a fictional AI-enabled system you can dissect end to end), it shows software and systems architects, CTOs, VPs of Engineering, AI/ML engineers, and developers how to select the right models and data pipelines, use architectural models to ensure cohesion, simulate and optimize AI performance through iteration, and apply patterns and heuristics to integrate AI into large-scale systems with strong user experience and performance—so you can confidently architect AI-driven products across a range of domains.