Designed and built the rule-governed AI agent skill that powers MongoDB Atlas's conversational search quickstart, from noticing the gap to shipping it.
I spend a lot of time in developer tools. Not as a developer exactly, but close enough to feel the friction. I was poking around MongoDB Atlas Search, genuinely trying to understand what it could do, and I noticed something: the documentation was thorough, but it didn't respond to me. It couldn't tell when I was lost. It had no idea I was on a free tier and was routing myself toward instructions that wouldn't work for my setup.
We had skills. We had MCP integrations. What we didn't have was a proper getting-started experience that met developers where they actually are — confused, impatient, and usually two environments away from the example in the docs. That felt like a real gap, not just a nice-to-have.
So I picked it up. No one asked me to. I just thought: this should exist, and I can build it.
"The docs can walk someone through the API. They can't say 'you're on a free tier — here's your path instead' in real time."
The first thing I did was resist the urge to write a prompt. A single prompt that tries to handle every developer's situation gets flabby fast. What I needed was structure: a branching ruleset that governs the agent's behavior at every fork, while leaving room for it to be genuinely conversational in between.
The decision tree below isn't implementation detail. It was the primary design artifact. I drafted it before writing any skill code, walked through it against real developer scenarios, and rebuilt it several times before the routing logic felt right.
The thing I kept coming back to: every path had to converge at hybrid search. Regardless of what a developer came in wanting to do, they should leave having seen $rankFusion run on real data. That wasn't just a nice outcome. It was a design constraint I built the whole tree around.
One of the principles I set for myself when building this: the skill should have an explicit behavioral contract. Not vibes. Actual rules. If a new scenario came up, I wanted a framework to evaluate it against, not a prompt I'd have to rewrite from scratch.
MCP connection and dataset availability verified silently before anything else. No one fails mid-tutorial because we skipped preconditions.
Before each operation, the agent explains what it's doing and why. Developers leave understanding the concepts, not just having followed instructions.
Index creation takes time. The agent holds until active. Running a query against a building index is a silent failure most tutorials don't catch.
Routes to auto-embed or manual embed based on actual cluster tier. This was the original insight: the docs assumed M10+. Most people aren't on M10+.
Offers iteration, genre filters, weight-flipping — but only after the core experience has landed. You earn the advanced stuff.
If something fails, there's always an alternative path. The agent doesn't leave someone staring at an error message with nowhere to go.
I want to be honest about what this process looked like, because it wasn't a tidy waterfall. It was: have an idea, build a rough version, run it against a bunch of scenarios, find the problems, fix them, repeat. The design and the building happened at the same time, by the same person.
The core sequence was:
Mapped every friction point in the Atlas Search onboarding. Found where and why people were dropping out. The cluster tier problem jumped out immediately.
Before writing any code, drafted the full routing logic — all paths, all branches, all fallbacks. This document was shared for alignment before implementation.
Specified the behavioral rules explicitly. Always/conditionally/never. This made testing tractable because I knew what to test against.
Built a structured scenario set covering infrastructure variety, intent variety, and edge cases. Ran every version of the skill through it. Found real bugs.
The weight-flipping comparison at hybrid search was treated as a product moment, not a feature. Designed the agent's framing to make it land.
Added a Python script at wrap-up so developers leave with something immediately usable. The tutorial becomes a starting point, not an endpoint.
One thing that shaped the whole approach: I kept asking "what does a developer walk away knowing?" Not what they did, but what they understand. That question changed how I wrote the agent's explanations at each step.
Testing an AI agent skill is a bit like QA-ing a conversation. You can't cover every possible exchange, but you can design for the most likely failure modes. Here's a selection from the scenario set:
| Scenario | Condition | Result |
|---|---|---|
| Free-tier user wants semantic search | No M10+ cluster | Rerouted to manual embed path correctly |
| User skips every optional step | Declines fuzzy, autocomplete, genre filter | Arrives at hybrid cleanly, no dead end |
| Dataset not loaded | sample_mflix absent | Holds, explains, waits — doesn't proceed |
| User types a custom semantic query | Freeform input mid-flow | Runs it, explains results, continues |
| User wants keyword only | Declines hybrid at Step 7b | Offers Python script, wraps gracefully |
| Index creation stalls | Atlas index stuck in building state | Bug found, fixed: early version continued anyway |
| User flips hybrid weights | Requests keyword-heavy query | Re-runs 70/30 swap, shows side-by-side results |
The stalled index bug was the most important catch. An agent that quietly runs queries against a building index gives wrong results and the developer has no idea why. That's worse than failing loudly.
The skill is complete and currently moving through the publishing approval process. Every developer who goes through it arrives at a live hybrid search comparison, a working understanding of $rankFusion, and a runnable Python script to take with them.
The decision tree pattern itself became reusable. It's now a template for structuring future quickstart skills — which means this project produced a methodology, not just a single deliverable.
The biggest shift was realizing that designing for AI agents is fundamentally a systems design problem, not a copywriting problem. Writing better prompts got me maybe 20% of the way there. The other 80% was the tree, the rules, and the testing protocol.
A well-structured decision tree with boring, explicit rules outperforms a clever, freeform prompt every time. The agent's behavior became predictable and testable. That made iteration possible instead of chaotic.
I found a real bug — the stalled index issue — through scenario testing that I would never have caught in normal use. Building a test matrix before the skill was done changed what I found and when.
Getting the agent to run queries is easy. Getting it to explain each step in a way that builds actual understanding, without sounding like a manual, took far more iteration than anything else.
I moved fast on this because I had one clear question anchoring every decision: "What does a developer walk away knowing?" When I got stuck, I'd go back to that. It cut a lot of noise.