"Don't Trust What You Haven't Tested Yourself"
A conversation with Leonard Wossnig from Forgent AI on Trust, Benchmarks, and Building Reliable AI Agents.


Felicitas von Rauch
Founding Growth Lead
Leo, Forgent AI is building AI agents for public procurement — a field where mistakes can be costly. How do you think about trust when deploying AI in that environment?
Trust in AI agents isn't something you declare. It's something you earn through rigorous testing. In public procurement, a single missed requirement can get a company disqualified from a tender. That's not a UX issue; that's a business-critical failure: a €500,000 tender lost because the AI made a mistake. So from day one, we asked ourselves: how do we actually know our system works? And the honest answer is: you have to build the infrastructure to prove it, over and over again.
In an article you published a while ago, you talk a lot about not trusting general benchmarks. That seems almost provocative. Why shouldn't people trust published benchmarks?
Benchmarks are not optimized for your specific, real-world problem. We went into our project expecting that state-of-the-art foundation models and AI-based document analysis would work out of the box. They didn't. For example, tools that scored brilliantly on public leaderboards extracted only 50-80% of the requirements from our tender documents. For RFP response writing, that's simply not good enough. This failure of general benchmarks is a phenomenon called “overfitting to evaluations”, and it's rampant. Our advice: run your own tests on your own data before you trust anything you read on LinkedIn or X.
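To make that concrete, here is a minimal sketch of what "running your own tests" can look like. The extractor under test and the dataset layout are hypothetical placeholders, and the metric is plain recall against a hand-labeled set:

```python
# Minimal benchmark harness: score an extraction tool on your own labeled data.
# extract_requirements() and the dataset layout are hypothetical placeholders.
import json

def recall(predicted: set[str], expected: set[str]) -> float:
    """Fraction of expected requirements the tool actually found."""
    return len(predicted & expected) / len(expected) if expected else 1.0

def evaluate(extract_requirements, dataset_path: str) -> float:
    with open(dataset_path) as f:
        cases = json.load(f)  # [{"document": ..., "expected": [...]}, ...]
    scores = [
        recall(set(extract_requirements(c["document"])), set(c["expected"]))
        for c in cases
    ]
    return sum(scores) / len(scores)

# A tool that scores brilliantly on a public leaderboard may still land at
# 50-80% recall on your own tenders; measure before you trust it.
```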
You found that even human experts disagreed on what to extract. How do you build trust in an AI agent when the ground truth itself is fuzzy?
That was probably our biggest early lesson. We brought in domain experts to create our evaluation dataset and discovered they weren't even consistent with each other. Some would skip "obvious" requirements based on tacit knowledge. Others would add implicit assumptions. That means agents can easily fail because they lack that context, and you might not even notice. The fix was to step back and first define what an ideal workflow would look like, one that all experts could agree on. That gave us consistent outputs across our different expert users. Then we redesigned the agent's task to match: first, extract everything comprehensively; then, filter based on consistent, explicit criteria that our expert committee defined. Modularizing the problem also gave us much sharper control over where errors crept in and made each step individually testable.
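A rough sketch of that extract-then-filter split; all names and the keyword heuristics here are illustrative stand-ins for the actual LLM-based steps, but the structural point holds: each stage can be evaluated on its own:

```python
# Stage 1 is judged on recall (did we miss anything?),
# stage 2 on precision (is everything we kept relevant?).
from dataclasses import dataclass

@dataclass
class Requirement:
    text: str

def extract_all(document: str) -> list[Requirement]:
    """Stage 1: pull out every candidate, erring on the side of too many."""
    return [Requirement(line.strip()) for line in document.splitlines()
            if any(k in line.lower() for k in ("must", "shall", "required"))]

def keep(req: Requirement, criteria: list[str]) -> bool:
    """Stage 2: filter on explicit, expert-agreed criteria."""
    return any(c in req.text.lower() for c in criteria)

def pipeline(document: str, criteria: list[str]) -> list[Requirement]:
    return [r for r in extract_all(document) if keep(r, criteria)]
```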
What role does evaluation infrastructure play in building trustworthy agents?
It's everything. Early on we tracked experiments in notebooks and a standard tracing tool. We were essentially flying blind: we couldn't quickly spot patterns, non-technical team members couldn't participate, and iteration was painfully slow. The solution was to build a proper evaluation UI where both experts and technical users can kick off a run, inspect results visually, and compare metrics side by side. Our pace of learning accelerated dramatically. Trust in your agent starts with visibility: if you can't see what it's doing and where it fails, you can't fix it, and you certainly can't stand behind it.
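As a toy illustration of the kind of record such a UI can be built on, every run might store its configuration and per-case scores so any two runs can be compared side by side. The field names here are hypothetical:

```python
# Each evaluation run is a durable record; comparing two records surfaces
# regressions, which are the first thing to inspect after a change.
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class EvalRun:
    model: str
    prompt_version: str
    scores: dict[str, float]  # case_id -> score
    started_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc))

    @property
    def mean_score(self) -> float:
        return sum(self.scores.values()) / len(self.scores)

def regressions(baseline: EvalRun, candidate: EvalRun) -> list[str]:
    """Case IDs where the candidate run scored worse than the baseline."""
    return [cid for cid, s in candidate.scores.items()
            if s < baseline.scores.get(cid, 0.0)]
```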
Models are changing so fast. How do you maintain trust in a system built on foundations that shift every few months?
You have to build for the future, not the present. As we moved from Gemini 2.5 to 3.0 and 3.1, many of the third-party solutions we had been comparing against simply became uncompetitive overnight. That's a signal: frontier model capabilities are advancing faster than specialized tooling can keep up. So we stay close to the latest models, use our evaluation infrastructure to enable quick testing, keep our architecture modular so we can swap components, and never stop re-evaluating. Trust isn't a one-time certification; it's a continuous process. The moment you stop testing, you're flying blind again.
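One common way to keep an architecture swappable along these lines is to hide the model behind a narrow interface, so a model upgrade is a one-line change followed by a full re-run of the evals. This sketch uses illustrative names and deliberately stubs out the provider call rather than guessing at any SDK:

```python
# The agent logic depends only on the ChatModel protocol, never on a
# specific provider, so models can be swapped and re-evaluated in isolation.
from typing import Protocol

class ChatModel(Protocol):
    def complete(self, prompt: str) -> str: ...

class GeminiModel:
    def __init__(self, model_name: str):
        self.model_name = model_name  # swap in a newer model name here

    def complete(self, prompt: str) -> str:
        raise NotImplementedError("call the provider SDK here")

def extract_requirements(model: ChatModel, document: str) -> str:
    return model.complete(f"List every requirement in this tender:\n{document}")
```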
Leonard Wossnig is a co-founder and CTO at Forgent AI, a Berlin-based startup building domain-specific AI agents for winning public tenders.