I’ve been on-call long enough to know what those first ten minutes of a production incident look like. Someone gets paged. They open their laptop. They start hunting: logs, run history, error messages, the runbook they half-remember from last quarter. By the time there’s enough context to form a hypothesis, ten minutes are gone and the pressure is already mounting.
I’m often the next-level escalation, the one who gets pulled in when the first responder needs backup. And I noticed that what I was doing when I got pulled in was almost mechanical: find the run, get the error, check the config, look for a matching runbook, see if we’d dealt with this before. Repeatable steps. Existing data. Stuff that could have been surfaced before I even opened my laptop.
That was the premise for Scentry: automate the first ten minutes so that by the time an engineer looks at their phone, the context is already waiting for them in Slack.
What it does
Scentry sits in front of your alert sources. When a job fails, a monitor triggers, or a teammate fires a Slack command, Scentry intercepts the webhook, normalizes the payload, and immediately starts a pipeline of AI agents to diagnose the incident. Roughly ten seconds in, you get a first hypothesis posted to Slack. By sixty seconds, you have a structured root cause with confidence scores, relevant runbook links, and a summary of how similar past incidents were resolved.
Everything surfaces in Slack. No dashboards to open, no portal to log into. The diagnosis comes to you.
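To make the "normalizes the payload" step concrete, here is a minimal sketch of mapping source-specific webhook bodies onto one shared shape. It uses a plain dataclass so it runs without dependencies (the project itself uses Pydantic v2). The `Alert` fields, the Databricks payload layout, and the "first word is the service" convention for Slack commands are all assumptions made for illustration, not Scentry's actual schema.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass(frozen=True)
class Alert:
    """Common shape downstream agents can rely on (hypothetical fields)."""
    source: str
    service: str
    error_message: str
    run_id: Optional[str] = None


def normalize(source: str, payload: dict[str, Any]) -> Alert:
    """Map a source-specific webhook body onto the shared Alert shape."""
    if source == "databricks":
        # Assumed payload layout, for illustration only.
        run = payload.get("run", {})
        return Alert(
            source=source,
            service=payload.get("job_name", "unknown"),
            error_message=run.get("state_message", ""),
            run_id=str(run["run_id"]) if "run_id" in run else None,
        )
    if source == "slack_command":
        # Assumed convention: "<service> <free-text error description>".
        service, _, message = payload.get("text", "").partition(" ")
        return Alert(
            source=source,
            service=service or "unknown",
            error_message=message,
        )
    raise ValueError(f"unsupported alert source: {source}")
```

The point of the shared shape is that every agent downstream reads the same fields regardless of where the alert came from.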
Three phases, nine agents
I designed the pipeline in three phases, each with its own latency target.
Phase 1 — under 10 seconds
Three agents run in parallel the moment the alert arrives:
- Error Classifier: categorizes the error type (infrastructure, SQL, memory, timeout) so everything downstream has a taxonomy to work from
- Context Gatherer: calls the source system API directly to pull full job run details, cluster config, and task-level errors. No LLM involved, just fast API calls
- Quick Triage: drafts a 2–3 sentence human-readable hypothesis from the raw error. First plain-English explanation that gets threaded into Slack
Phase 2 — under 60 seconds
Four more agents run in parallel once Phase 1 enrichments are available:
- Knowledge Search: searches Confluence for matching runbooks by error category and service name, returns links and excerpts
- History Search: embeds the error message and queries past resolved incidents by vector similarity. Surfaces how the last similar incident was actually fixed
- Spark Expert: domain-specialist agent that pattern-matches known Spark failure signatures
- Root Cause Analyst: synthesizes all Phase 1 context into a structured diagnosis with a root cause, confidence score, and suggested remediation actions
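The History Search idea, ranking past incidents by vector similarity to the new error, can be sketched in a few lines. This is an in-memory toy with hand-written three-dimensional vectors standing in for real embeddings. The incident records and the `most_similar` helper are hypothetical, and a real version would query stored embeddings rather than a Python list.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def most_similar(query: list[float], incidents: list[dict], top_k: int = 1) -> list[dict]:
    """Return the top_k past incidents ranked by similarity to the query embedding."""
    ranked = sorted(
        incidents,
        key=lambda inc: cosine_similarity(query, inc["embedding"]),
        reverse=True,
    )
    return ranked[:top_k]


# Toy data: made-up embeddings and resolutions, for illustration only.
past_incidents = [
    {"id": "INC-1", "embedding": [1.0, 0.0, 0.0], "resolution": "increased driver memory"},
    {"id": "INC-2", "embedding": [0.0, 1.0, 0.0], "resolution": "fixed malformed SQL"},
    {"id": "INC-3", "embedding": [0.0, 0.0, 1.0], "resolution": "raised job timeout"},
]

new_error_embedding = [0.9, 0.1, 0.0]
best = most_similar(new_error_embedding, past_incidents)[0]
```

Here `best` is the incident whose embedding points in nearly the same direction as the new error, and its `resolution` field is what would be surfaced to the engineer.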
Phase 3 — on-demand
Two agents triggered by Slack buttons, only when the engineer wants to go deeper:
- Log Deep Dive: fetches detailed logs and surfaces key anomalies
- Remediation Suggester: combines the root cause, runbook content, and incident history to generate a step-by-step fix guide with rollback instructions
Each agent decides its own relevance, runs within a hard timeout, and fails without blocking others.
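Those three properties (self-selected relevance, a hard timeout, isolated failure) map onto a familiar concurrency pattern. The sketch below uses plain `asyncio` rather than LangGraph, so it shows the pattern and not the project's actual orchestration code. The agent functions and the result dictionaries are hypothetical stand-ins; an agent returning `None` is my assumed way of signalling "not relevant".

```python
import asyncio
from typing import Awaitable, Callable, Optional

Agent = Callable[[dict], Awaitable[Optional[dict]]]


async def run_agent(name: str, agent: Agent, context: dict, timeout_s: float):
    """Run one agent under a hard timeout; never let its failure propagate."""
    try:
        return name, await asyncio.wait_for(agent(context), timeout=timeout_s)
    except asyncio.TimeoutError:
        return name, {"status": "timeout"}
    except Exception as exc:  # isolate any agent failure from the rest of the phase
        return name, {"status": "error", "detail": str(exc)}


async def run_phase(agents: dict[str, Agent], context: dict, timeout_s: float) -> dict:
    """Run all agents concurrently and collect results from those that were relevant."""
    results = await asyncio.gather(
        *(run_agent(name, agent, context, timeout_s) for name, agent in agents.items())
    )
    # An agent that returned None opted out, so it is left out of the results.
    return {name: result for name, result in results if result is not None}


# Hypothetical toy agents, for illustration only.
async def error_classifier(ctx: dict) -> Optional[dict]:
    category = "memory" if "OutOfMemory" in ctx["error"] else "unknown"
    return {"status": "ok", "category": category}


async def spark_expert(ctx: dict) -> Optional[dict]:
    if "spark" not in ctx["error"].lower():
        return None  # decides it is not relevant to this incident
    return {"status": "ok", "note": "known Spark failure signature"}


async def slow_agent(ctx: dict) -> Optional[dict]:
    await asyncio.sleep(1)  # exceeds the timeout used in the example below
    return {"status": "ok"}


async def broken_agent(ctx: dict) -> Optional[dict]:
    raise RuntimeError("upstream API returned 500")
```

Calling `asyncio.run(run_phase(agents, {"error": "java.lang.OutOfMemoryError"}, timeout_s=0.05))` with these four agents returns a classification from `error_classifier`, a timeout entry for `slow_agent`, an error entry for `broken_agent`, and nothing at all for `spark_expert`. No single agent can hold up or crash the phase.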
The stack
Core: FastAPI for the REST layer and webhook endpoints, Pydantic v2 for typed models throughout.
Orchestration: LangGraph. I specifically wanted to learn how to think in a proper multi-agent framework rather than chaining sequential API calls.
LLM: GPT-4o-mini. Cheap enough to experiment with, fast enough to hit the latency targets.
Storage: Databricks SQL with Delta Lake. The plan is to store past resolved incidents and vectorize them for the History Search agent; that part is still being implemented.
Integrations: Slack SDK with Block Kit cards for incident output, Databricks REST API for job run context, Confluence REST API for runbook search.
What was actually hard
Slack integrations were more involved than I expected — Block Kit cards, interactive buttons triggering Phase 3 agents, threading replies by phase so the channel stays readable. There’s a lot of surface area in a well-designed Slack bot.
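For a sense of that surface area, here is a minimal sketch of one diagnosis card: a Block Kit section, an actions block with two buttons, and an optional `thread_ts` so the message lands in an existing thread. The function name, the `action_id` values, and the card wording are hypothetical. The block structure itself follows Slack's Block Kit format.

```python
from typing import Any, Optional


def build_diagnosis_message(
    channel: str,
    incident_id: str,
    summary: str,
    thread_ts: Optional[str] = None,
) -> dict[str, Any]:
    """Build keyword arguments for posting a diagnosis card with Phase 3 buttons."""
    blocks = [
        {
            "type": "section",
            "text": {"type": "mrkdwn", "text": f"*Root cause hypothesis*\n{summary}"},
        },
        {
            "type": "actions",
            "elements": [
                {
                    "type": "button",
                    "text": {"type": "plain_text", "text": "Log deep dive"},
                    "action_id": "log_deep_dive",  # hypothetical identifier
                    "value": incident_id,
                },
                {
                    "type": "button",
                    "text": {"type": "plain_text", "text": "Suggest remediation"},
                    "action_id": "suggest_remediation",  # hypothetical identifier
                    "value": incident_id,
                },
            ],
        },
    ]
    # "text" is the plain-text fallback shown in notifications.
    message: dict[str, Any] = {"channel": channel, "text": summary, "blocks": blocks}
    if thread_ts is not None:
        # Reply inside the incident's thread instead of the channel root.
        message["thread_ts"] = thread_ts
    return message
```

With the Slack SDK, the resulting dictionary can be passed as `client.chat_postMessage(**message)`. When a button is clicked, Slack sends an interaction payload carrying the `action_id` and `value`, which is enough to route the click to the right Phase 3 agent for the right incident.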
The system design was the bigger challenge intellectually, and also the more interesting one. My senior year concentration in undergrad was concurrent and distributed systems, so the instinct to think in parallel was already there. But applying those patterns inside an AI agent framework is a different exercise: agents running concurrently, sharing context through a structured enrichment model, failing independently without blocking the pipeline. I got to dust off an old way of thinking and point it at something new.
Current state
Scentry started as a hackathon project and is currently running in a local dev environment. I’ve tested the full pipeline end-to-end with real webhook payloads: the agents run, the Slack cards post, and the diagnoses surface. Claude helped me cover a lot of ground quickly getting here.
What’s left is the harder part: testing each agent in isolation, tuning confidence calibration, and refining behavior with feedback from real incidents. There are a lot of moving parts, and tightening each one is going to take time and real-world signal.
What I’ve been learning
Multi-agent AI systems share a lot of DNA with distributed services: timeouts, independent failure modes, structured contracts between components, clear separation of concerns. What’s new is calibrating LLM behavior. A confidence score of 85% means something different than an HTTP 200, and knowing when not to use an LLM matters as much as knowing when to.
The Context Gatherer, pure API calls with no LLM, is one of the most valuable agents in Phase 1. Fast, reliable, high confidence. The Spark Expert uses an LLM and has the lowest confidence score in the pipeline. That gap says something worth paying attention to.