I spent the last few weeks building Helper AI — a Slack bot and remote MCP server that helps engineers investigate production issues. It started as "could I save my team an hour a day?" and turned into a small lesson in restraint.
The shape of the problem
The same questions kept showing up in our channels. "Why is this user failing to log in?" "Why did this payment 500?" "What's our retry config for the inventory service?" Every one of them required the same loop — read a runbook, run a SQL query, tail some logs, stitch the answer together.
What I really wanted was a teammate who had done that loop a hundred times and could do it again, patiently, at 2 AM.
Letting Claude drive the loop
My first sketch had me writing the orchestration myself — pseudo-code for planning steps, retries, tool dispatch. Then I tried the Claude Agent SDK, defined three tools as plain Python functions with @tool, and... that was it. Claude decided which tools to call, in what order, until it had a confident answer.
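For a sense of scale, here's a stripped-down sketch of one tool wired into that loop. Treat it as illustrative: corpus_search and the prompt are placeholders, and the decorator and option names follow my reading of the SDK's Python interface rather than any particular repo.

```python
# Illustrative sketch: one tool, registered as an in-process MCP server,
# handed to the SDK's query() loop. corpus_search() is a stand-in.
import asyncio
from claude_agent_sdk import ClaudeAgentOptions, create_sdk_mcp_server, query, tool


def corpus_search(q: str) -> list[str]:
    # Placeholder retrieval; the real corpus lookup goes here.
    return [f"(no corpus wired up yet; query was: {q})"]


@tool("search_documentation", "Search runbooks, schemas, and known-issue docs", {"query": str})
async def search_documentation(args):
    hits = corpus_search(args["query"])
    return {"content": [{"type": "text", "text": "\n\n".join(hits)}]}


async def main():
    server = create_sdk_mcp_server(name="helper", version="1.0.0", tools=[search_documentation])
    options = ClaudeAgentOptions(
        mcp_servers={"helper": server},
        allowed_tools=["mcp__helper__search_documentation"],
    )
    # Claude decides whether and how often to call the tool; we just stream output.
    async for message in query(prompt="Why did payment 4821 return a 500?", options=options):
        print(message)


asyncio.run(main())
```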
The temptation was to build a clever planner on top of the SDK. I'm glad I didn't. The agent loop is a primitive — give it good tools, give it good context, get out of the way.
Three tools, no more
I gave it exactly three:
- search_documentation — keyword-scored retrieval over a corpus of runbooks, schemas, and known-issue docs that ships with the deployment.
- query_database — read-only SQL through the RDS Data API, results capped at 50 rows, regex blocklist on every write keyword.
- get_cloudwatch_logs — Logs Insights queries against an explicit allowlist of log groups.
Every tool has a guardrail. After the demo mishap I wrote about a couple of years back, I'm allergic to AI that can do anything it wants. Claude is impressive, but it's also confidently wrong sometimes. The blocklist and allowlist exist because I trust the model right up until it suggests DROP TABLE users;.
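Concretely, the guardrails are small and boring, which is the point. A sketch, with illustrative patterns and placeholder names:

```python
# Sketch of the write-keyword blocklist, row cap, and log-group allowlist.
# The ARNs, database name, and allowlist contents are placeholders.
import re
import boto3

WRITE_KEYWORDS = re.compile(
    r"\b(insert|update|delete|drop|alter|truncate|create|grant|revoke)\b",
    re.IGNORECASE,
)
MAX_ROWS = 50
ALLOWED_LOG_GROUPS = {"/aws/lambda/payments-api", "/aws/lambda/auth-service"}

DB_CLUSTER_ARN = "arn:aws:rds:..."            # placeholder
DB_SECRET_ARN = "arn:aws:secretsmanager:..."  # placeholder
DB_NAME = "app"                               # placeholder

rds_data = boto3.client("rds-data")


def query_database(sql: str) -> list:
    # Read-only by construction: refuse anything that smells like a write.
    if WRITE_KEYWORDS.search(sql):
        raise ValueError("query_database is read-only; write keywords are blocked")
    response = rds_data.execute_statement(
        resourceArn=DB_CLUSTER_ARN,
        secretArn=DB_SECRET_ARN,
        database=DB_NAME,
        sql=sql,
    )
    return response.get("records", [])[:MAX_ROWS]


def check_log_group(log_group: str) -> None:
    # Runs before any Logs Insights query is started.
    if log_group not in ALLOWED_LOG_GROUPS:
        raise ValueError(f"log group not on the allowlist: {log_group}")
```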
Keyword filtering before vector search
I almost reached for pgvector immediately. Embeddings, similarity search, the whole pipeline — it's the obvious move for "search a knowledge base."
I didn't. Instead, the corpus is bundled with the Lambda image and pre-filtered by keyword scoring, with weighted matches against service, domain, tags, and body content. Schema documents for matched domains get a bonus so Claude sees table structure before it writes any SQL.
It works. The corpus is small enough that keyword scoring beats embeddings on both latency and accuracy. When it crosses ~150 docs, I'll swap in vectors. Until then, the simpler thing wins.
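If you're wondering what "keyword scoring" means here, it's roughly the shape below, with made-up field weights and a made-up bonus:

```python
# Illustrative scoring only; the weights, bonus, and document fields are
# examples, not values pulled from the real corpus.
FIELD_WEIGHTS = {"service": 5.0, "domain": 4.0, "tags": 3.0, "body": 1.0}
SCHEMA_BONUS = 10.0


def score_doc(doc: dict, keywords: list[str], matched_domains: set[str]) -> float:
    score = 0.0
    for field, weight in FIELD_WEIGHTS.items():
        text = str(doc.get(field, "")).lower()
        score += weight * sum(1 for kw in keywords if kw.lower() in text)
    # Schema docs for a domain the query already hit jump the queue,
    # so table structure shows up before Claude writes any SQL.
    if doc.get("type") == "schema" and doc.get("domain") in matched_domains:
        score += SCHEMA_BONUS
    return score


def search_corpus(corpus: list[dict], keywords: list[str], top_k: int = 5) -> list[dict]:
    matched_domains = set()
    for doc in corpus:
        domain = str(doc.get("domain", "")).lower()
        if domain and any(kw.lower() in domain for kw in keywords):
            matched_domains.add(doc["domain"])
    ranked = sorted(
        (doc for doc in corpus if score_doc(doc, keywords, matched_domains) > 0),
        key=lambda doc: score_doc(doc, keywords, matched_domains),
        reverse=True,
    )
    return ranked[:top_k]
```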
Two interfaces, one agent
The Slack path and the remote MCP path share the same run_agent() function. Slack pushes events into a thin Lambda that acks within 3 seconds and async-invokes the heavy one. The MCP server is a separate Lambda holding an SSE connection open for the lifetime of a client session — same tools, different lifecycle, different timeouts.
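The thin Slack-facing Lambda is barely code. A sketch, with a placeholder function name and Slack signature verification left out:

```python
# Sketch of the thin Slack-events Lambda: acknowledge inside Slack's
# 3-second window, then hand the event to the heavy agent Lambda
# asynchronously. HEAVY_FUNCTION_NAME is a placeholder.
import json
import boto3

lambda_client = boto3.client("lambda")
HEAVY_FUNCTION_NAME = "helper-ai-agent"  # placeholder


def handler(event, context):
    body = json.loads(event.get("body") or "{}")

    # Slack's URL verification handshake.
    if body.get("type") == "url_verification":
        return {"statusCode": 200, "body": body["challenge"]}

    # Fire-and-forget: InvocationType="Event" returns immediately, so the
    # 200 ack gets back to Slack well inside the 3-second deadline.
    lambda_client.invoke(
        FunctionName=HEAVY_FUNCTION_NAME,
        InvocationType="Event",
        Payload=json.dumps(body).encode("utf-8"),
    )
    return {"statusCode": 200, "body": ""}
```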
Splitting them was the right call. Cramming both into one function would have meant the worst timeouts of both worlds.
What surprised me
How little orchestration code I ended up writing. The bulk of the repo is tools and corpus — the parts that are actually specific to my team. The loop, the planning, the retries — all came free from the SDK.
That's the lesson, I think. The interesting work was never the agent. It was the hundred little decisions about what context to give it, what guardrails to wrap around it, and what to leave out.