Patience and Perseverance: The Trials of Testing Agentic AI
We’ve learned a lot about testing bots over the past few years. As our technology grew more sophisticated, it became clear that our testing methods had to evolve along with it.
Classic bots of old required little more than functional testing. They were designed solely around scripted interactions: you would define specific inputs (e.g., a user’s question), associate them with specific outputs (e.g., the answer to that question), and then test every combination to make sure it worked. If a pairing worked once, you could be confident it would work every time, and a failure usually just meant adding a few extra input variants for the bot to recognize.
The addition of Generative AI changed everything. While large language models (LLMs) have the potential for tremendous benefits in this space, they also open the product to a lot of risks if not properly accounted for.
One such risk is unpredictability. Unlike a classic bot, where every recognized input is tied to a single output, LLMs have some amount of randomness built in. It’s inherent to the technology. LLMs have a parameter called “temperature,” which governs how creative or focused a model is by influencing how it chooses the next words when generating a response. A higher temperature generally produces more creative or interesting responses, while a lower one keeps the output focused and repeatable. In other words, the reason they’re able to converse naturally is the same reason they’re unlikely to provide the same response to the same query twice in a row.
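To make that concrete, here is a minimal sketch of how temperature is typically exposed by an LLM API. The OpenAI-style client, model name, and prompt are illustrative assumptions, not a description of our stack.

```python
# Illustrative only: assumes an OpenAI-style chat API; the model name and
# prompt are placeholders, not Rezolve.ai's configuration.
from openai import OpenAI

client = OpenAI()

def ask(question: str, temperature: float) -> str:
    """Send a question at a given temperature and return the reply."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        temperature=temperature,  # near 0.0 = focused and repeatable; higher = more varied
        messages=[{"role": "user", "content": question}],
    )
    return response.choices[0].message.content

# Ask the same question twice at a high temperature and the wording will
# usually differ; near 0.0 the two replies converge.
print(ask("How are you doing today?", temperature=1.0))
print(ask("How are you doing today?", temperature=1.0))
```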
Therein lies a problem: If some amount of randomness is required for a bot to feel natural and conversational, how do you ensure its responses are accurate?
You’re likely familiar with the concept of “hallucinations,” where an LLM generates a response that contains factually incorrect information, often in a way that sounds confident or entirely plausible. These can occur even when the model is trained on the correct information. To use Generative AI in a production setting, these kinds of issues must be accounted for.
One solution to the problem is a Retrieval-Augmented Generation (RAG) architecture, in which Generative AI is combined with an external retrieval step that grounds its output. At Rezolve.ai, our state-of-the-art RAG architecture is built to find the answer to a query on its own, and only once we’re confident we have found the answer do we let the LLM respond. In short, we tell the LLM, “Here’s a question, and here’s an answer. Reword the answer into a contextually appropriate response.” By reducing the scope of the task and providing the context, we can bring predictability to the unpredictable. The exact wording or formatting may vary, but the meaning of the response remains consistent.
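The retrieve-then-reword pattern looks roughly like the sketch below. This is a generic illustration, not our production pipeline; the knowledge-base lookup, prompts, and model name are all hypothetical.

```python
# A minimal sketch of the retrieve-then-reword pattern, assuming an
# OpenAI-style API. The lookup table and prompts are hypothetical.
from openai import OpenAI

client = OpenAI()

KNOWLEDGE_BASE = {
    "how do i reset my password": "Go to Settings > Security and click 'Reset password'.",
}

def retrieve_answer(query: str) -> str | None:
    """Stand-in for the real retrieval step (vector search, ranking, etc.)."""
    return KNOWLEDGE_BASE.get(query.lower().strip("?! ."))

def answer(query: str) -> str:
    retrieved = retrieve_answer(query)
    if retrieved is None:
        # No verified answer found, so the LLM is never asked to guess.
        return "Sorry, I couldn't find that in the knowledge base."
    # Scope the LLM's task down to rewording a known-good answer.
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        temperature=0.3,
        messages=[
            {"role": "system", "content": "Reword the provided answer into a friendly, "
                                          "contextually appropriate reply. Do not add new facts."},
            {"role": "user", "content": f"Question: {query}\nAnswer: {retrieved}"},
        ],
    )
    return response.choices[0].message.content

print(answer("How do I reset my password?"))
```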
But Rezolve.ai doesn’t just provide generated answers; our bot also offers scripted experiences, such as task automation, and even open-ended interactions, like small talk. This puts us in a difficult position: we can’t have the bot attempt to answer questions it has no knowledge of, but we also need it to give a proper answer to small-talk questions like “How are you doing today?”
Early versions of our product avoided small talk altogether. A couple of years ago that sort of query would have been met with a response like, “Sorry, I don’t have any information in my knowledge base regarding how I am doing today.” This was far from ideal.
We solved this issue by introducing Generative AI Agents. If you’ve been following recent developments in AI, you may already be familiar with the term “Agentic AI”. Agents are essentially bespoke systems powered by AI that are given some amount of autonomy to accomplish a given task. In short, agents are permitted to make their own decisions using context and natural language.
In our case, we took all the different core functions of our product (Q&A, Task Automation, Small Talk, etc.) and delegated each to a different agent with its own set of instructions. We then added a new agent whose sole job is to analyze the user’s query, determine their intent, and decide which downstream Agent is best for that task.
By separating out tasks, we were able to give the agent in charge of Small Talk the freedom required to provide a creative, friendly response, while ensuring the agent in charge of Q&A was strict and grounded in verified knowledge.
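As a rough illustration of the pattern (not our production code), a router can classify the user’s intent and hand the query to an agent whose own instructions and temperature match its job. The agent names, prompts, and keyword-based routing below are hypothetical; in practice the routing decision is itself made with an LLM.

```python
# Hypothetical sketch of intent routing with per-agent settings.
from dataclasses import dataclass

@dataclass
class Agent:
    name: str
    system_prompt: str
    temperature: float  # looser for small talk, stricter for Q&A

AGENTS = {
    "small_talk": Agent("small_talk",
                        "Respond in a warm, conversational tone.", 0.9),
    "qa": Agent("qa",
                "Answer strictly from the provided knowledge base excerpt. "
                "If the answer is not present, say you don't know.", 0.2),
    "task_automation": Agent("task_automation",
                             "Collect the fields needed to run the requested automation.", 0.2),
}

def route(query: str) -> Agent:
    """Toy stand-in for the routing agent that picks a downstream agent."""
    normalized = query.rstrip("?!.").lower()
    if "ticket" in normalized:
        return AGENTS["task_automation"]
    if normalized in {"hi", "hello", "how are you doing today"}:
        return AGENTS["small_talk"]
    return AGENTS["qa"]

print(route("How are you doing today?").name)         # small_talk
print(route("What's the status of my ticket?").name)  # task_automation
```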
Getting a system like this to work effectively requires extremely precise prompt engineering, intelligent scripting, and a lot of patience. We’ve learned a lot while building it over the past few years, and we feel confident that our results speak for themselves. But how do you test a system that is able to make its own decisions?
The key is to break the bot’s capabilities down into their base functions. A single conversation with the bot can involve several agents, each with functions that must be tested individually. For example, a user might start with small talk, report a vague issue, answer a follow-up question to provide more context, and then ultimately have an automation run on their behalf to resolve that issue. Throughout this process, various agents are involved in responding and determining the next step, and there are dozens of factors to pay attention to. Can the bot maintain context throughout the conversation? Can it switch from Small Talk mode to Q&A mode seamlessly? Does it ask a follow-up question that makes sense in this context? It’s critical to list every possible interaction and define how you want the bot to respond.
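One way to keep that inventory manageable is to write each expected interaction down as structured data the test suite can iterate over. The fields and example turns below are purely illustrative.

```python
# Hypothetical structure for enumerating multi-turn test scenarios.
from dataclasses import dataclass

@dataclass
class Turn:
    user_message: str
    expected_agent: str   # which agent should handle this turn
    expectation: str      # what a correct response looks like

SCENARIO_PASSWORD_RESET = [
    Turn("Hi there!", "small_talk", "Friendly greeting that offers help"),
    Turn("Something's wrong with my login", "qa", "Asks a clarifying follow-up question"),
    Turn("I forgot my password", "task_automation", "Starts the password-reset automation"),
    Turn("Thanks, that worked!", "small_talk", "Acknowledges and closes the conversation"),
]
```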
From there, we separate those tests into two buckets: functional testing and qualitative testing.
Functional testing is easier. A query like “What’s the status of my ticket?” should always trigger an automation that returns the status of a recent ticket to the user. We test repeatedly with a variety of queries to ensure we get the desired response consistently.
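In practice, that kind of check can be automated with repeated, parameterized test runs. The `bot` fixture and its `handle` interface below are assumptions for illustration, not a real API.

```python
# Hypothetical functional test: every phrasing of the ticket-status intent
# should trigger the ticket-status automation, consistently across repeats.
# The `bot` fixture is assumed to wrap a test client for the chat backend.
import pytest

TICKET_STATUS_QUERIES = [
    "What's the status of my ticket?",
    "Any update on my ticket?",
    "Has my ticket been resolved yet?",
]

@pytest.mark.parametrize("query", TICKET_STATUS_QUERIES)
@pytest.mark.parametrize("attempt", range(5))  # repeat to catch nondeterminism
def test_ticket_status_triggers_automation(bot, query, attempt):
    result = bot.handle(query)                 # hypothetical interface
    assert result.agent == "task_automation"
    assert result.action == "get_ticket_status"
```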
Qualitative testing is a bit more involved. It’s not enough for the bot to provide an answer and a citation. You need to compare the answer to the source to confirm that it’s grounded, and then ultimately make a judgment call about whether you like the answer. Is it too long? How’s the formatting? Does it sound natural? Did it take too long to generate? Generative AI testing is full of these little decisions.
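Those judgment calls are easier to track when each reviewed answer is scored against a consistent rubric. The fields below simply mirror the questions above; they are an illustrative structure, not a formal scoring system.

```python
# Hypothetical rubric for recording a qualitative review of a single answer.
from dataclasses import dataclass

@dataclass
class QualitativeReview:
    grounded_in_source: bool   # does the answer match the cited knowledge article?
    concise: bool              # not too long for the question asked
    well_formatted: bool       # lists, links, and emphasis used sensibly
    natural_tone: bool         # reads like a helpful colleague, not a template
    latency_acceptable: bool   # generated quickly enough for a chat experience
    notes: str = ""

    def passed(self) -> bool:
        return all([self.grounded_in_source, self.concise, self.well_formatted,
                    self.natural_tone, self.latency_acceptable])
```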
Failure could point to any number of different issues: The prompt may not be strong enough, the information in the Knowledge Base may not have enough context to seem relevant, the language model we chose for that agent may not be very good at those types of tasks, or there may simply be a bug in the code. Identifying which change would be the most impactful requires insight, experience, and excellent server-side logging.
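As one example of what that logging might capture, the sketch below records the context behind each agent decision so a failed conversation can be replayed later. The field names are illustrative assumptions, not our actual log schema.

```python
# Hypothetical structured logging of each agent decision.
import json
import logging

logger = logging.getLogger("bot.agents")

def log_agent_decision(conversation_id: str, agent: str, model: str,
                       prompt_version: str, retrieved_docs: list[str],
                       latency_ms: int) -> None:
    """Record everything needed to explain why an agent responded the way it did."""
    logger.info(json.dumps({
        "conversation_id": conversation_id,
        "agent": agent,
        "model": model,
        "prompt_version": prompt_version,
        "retrieved_docs": retrieved_docs,
        "latency_ms": latency_ms,
    }))
```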
Prompt changes, when needed, can be particularly tricky. Frequently we find that adjusting a prompt to encourage or discourage behavior results in an extreme shift by the LLM. For example, at one point, one of our attempts to encourage follow-up questions from the bot unintentionally resulted in the bot asking a follow-up question to every user query.
The name of the game here really is patience and perseverance. Our solutions rarely work on the first try. You need to trust that good solutions will prove themselves out, as you continue to test and tweak until you get the desired results.
It’s an ongoing process but it’s also incredibly rewarding. There’s nothing quite like working on the bleeding edge of technology and seeing a new idea pass testing. I can’t tell you how many times I’ve called up a coworker just to say, “Hey, check this out!”
That excitement, that joy of seeing our ingenuity made manifest – that’s what drives us. At Rezolve.ai, we’re building the future of HR and ITSM. The ambitiousness of this goal is not lost on us. But we believe we can do it because we have first-hand experience with how powerful and transformative this technology can be. As we continue pushing the boundaries of what an AI product can be, we will also continue to refine our testing methods so that we can bring the best possible product to market.
We have a ton of ideas in the works right now, and we can’t wait to share them with you when they’re ready. To the future!