AI-Generated Tests Are Only As Good As What You Feed the AI

What we learned building a RAG pipeline for test generation, and where human judgment still wins

Apr 11, 2026

AI-generated tests are promising, but by default, the AI has never seen your app.

Out of the box, an AI test generation tool has no knowledge of your specific application. It does not know your selectors, your valid usernames, or your known failure scenarios. It is working from general knowledge, not your actual codebase.

That is where Retrieval-Augmented Generation (RAG) comes in. Instead of relying on a generic AI, you feed it your own documentation. Your API spec. Your component library. Your bug history. When a test is being generated, the pipeline pulls what is relevant and hands it to the AI as its foundation.

We ran this internally using Sauce Demo, a free e-commerce app built for testing practice. We created three simple docs:

An API spec covering login, inventory, cart, and checkout endpoints.
A component doc with exact CSS selectors for every page.
A bug history doc with known failure scenarios.

We indexed these into ChromaDB using Google Gemini embeddings. When we queried “user login with valid credentials”, it retrieved exactly the right context. The API spec. The correct selectors. The locked-out user bug. No guessing.

What the Pipeline Actually Looks Like

The pattern is straightforward. You index your app’s documents into a vector database, and at query time the most relevant chunks are retrieved and passed to the AI as context. The AI generates tests grounded in your actual docs rather than guessing.
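The index-then-retrieve pattern can be sketched without any external services. This is a toy, not our pipeline: word counts stand in for Gemini embeddings, a plain dict stands in for ChromaDB, and the document snippets and names (`DOCS`, `embed`, `retrieve`) are hypothetical.

```python
import math
import re
from collections import Counter

# Hypothetical one-line stand-ins for indexed doc chunks.
DOCS = {
    "api_spec": "POST login endpoint accepts username and password credentials",
    "components": "login page selectors: #user-name, #password, #login-button",
    "bug_history": "known bug: locked out user sees an error on login",
    "checkout_api": "POST checkout endpoint accepts cart items and shipping address",
}


def embed(text):
    """Toy embedding: lowercase word counts. A real pipeline calls an embedding model."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[term] * b[term] for term in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# "Indexing": embed every chunk once, up front.
INDEX = {doc_id: embed(text) for doc_id, text in DOCS.items()}


def retrieve(query, k=3):
    """Return the ids of the k chunks most similar to the query."""
    query_vec = embed(query)
    ranked = sorted(INDEX, key=lambda d: cosine(query_vec, INDEX[d]), reverse=True)
    return ranked[:k]


top_chunks = retrieve("user login with valid credentials")
```

With these snippets, the login query ranks the three login-related chunks above the checkout one. A real embedding model matches on meaning rather than shared words, which is why it copes with phrasing the docs never used.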

Not everything needs to be indexed. The component selectors and the bug history made the biggest difference: the AI stopped guessing at button labels and started working from real data.

Here is what came back when we queried “user login with valid credentials”:

Chunk 1: Login API spec with username and password structure.
Chunk 2: Exact CSS selectors for the login page.
Chunk 3: Known bug history related to login and the locked-out user.

The AI had everything it needed. The right selectors. The right endpoints. The known failure scenarios. All retrieved automatically from the docs.

A few things are worth knowing before you try this. If you run ChromaDB as a separate service, it needs to be up before you index or query anything. The embedding model name matters: for Google Gemini, the model that worked for us is gemini-embedding-001; text-embedding-004 returned a 404. And RAG is a pre-processing layer, not a direct injection into the test generation tool. You need a wrapper to bridge the two.
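A wrapper of this kind can be quite small. The sketch below assumes the retriever and the test generation tool are each available as a plain Python callable. The names `build_prompt` and `generate_test` and the prompt wording are hypothetical, not part of any tool's API.

```python
def build_prompt(query, chunks):
    """Combine retrieved chunks and the test request into one prompt string."""
    context = "\n\n".join(
        f"[{i}] {chunk.strip()}" for i, chunk in enumerate(chunks, start=1)
    )
    return (
        "Use only the context below when choosing selectors, endpoints, "
        "and known failure scenarios.\n\n"
        f"Context:\n{context}\n\n"
        f"Write a test for: {query}"
    )


def generate_test(query, retrieve, generate, k=3):
    """Bridge retrieval and generation.

    retrieve(query, k) -> list of text chunks (for example, a vector DB query).
    generate(prompt)   -> generated test code (for example, an LLM call).
    Both are injected so the wrapper stays independent of specific tools.
    """
    chunks = retrieve(query, k)
    return generate(build_prompt(query, chunks))
```

Passing the retriever and generator in as arguments keeps the wrapper testable with fakes, and lets you swap the vector database or the model without touching the prompt logic.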

Human vs AI: Who Actually Wins

We went in expecting the human to win. That is not quite what happened.

After indexing the three docs, we ran two tests against the same app and the same flows: one written by a human and one generated by the AI with RAG context.

The AI knew the locked-out user scenario because it was in the bug history doc. It knew the exact selectors because they were in the component doc. It did not guess. It worked from what we gave it. Both tests passed.

But here is where it gets interesting. The AI verified that an error message existed. It did not verify that the message said “Sorry, this user has been locked out.” That is intent knowledge. It lives in someone’s head, not in a doc. The human catches that. The AI does not.
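The gap between the two checks is easy to see in isolation. In this minimal sketch the function names are hypothetical and the error text is passed in as a plain string, so no browser is needed. In a real test it would come from the page's error element.

```python
EXPECTED_LOCKED_OUT = "Sorry, this user has been locked out."


def error_is_shown(error_text):
    """Existence check: passes for any non-empty error message."""
    return bool(error_text and error_text.strip())


def error_matches_intent(error_text):
    """Intent check: passes only when the expected wording is present."""
    return error_text is not None and EXPECTED_LOCKED_OUT in error_text


# Hypothetical regression: the app shows the wrong error for a locked-out user.
wrong_message = "Username and password do not match."

existence_result = error_is_shown(wrong_message)       # True: the weak check passes
intent_result = error_matches_intent(wrong_message)    # False: the strict check fails
```

The existence check stays green through a regression that swaps in the wrong message. Only the intent check notices.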

And anything that was never documented will not show up in the tests either. A flow built last Tuesday that never made it into any spec or component doc is invisible to the pipeline. The RAG context is only as good as what you indexed.

So neither wins cleanly. The AI covers breadth. The human covers intent. The most useful thing is not picking a winner but understanding where each one has blind spots and using both accordingly.

This is part of how we approach AI-augmented quality engineering at QualityBridge. Not theory. Real experiments with honest observations about what worked and what did not. More at qualitybridgeconsulting.com

We are curious. If you have tried this, did the AI surprise you with what it caught or what it missed? And if you have found a better chunking strategy for API specs, we would love to hear it.

QualityBridge Consulting
