Evidence-Driven AI · Incident Investigation

From outage to root cause in minutes.

Diagnara uses governed AI agents to investigate incidents across logs, metrics, APIs, and deployments, with every conclusion backed by traceable evidence.

Faster mean time to resolution
Source-linked, auditable evidence
Autonomous triage coverage
−92% MTTR
vs. manual triage
14 sources
cited in this trace
diagnara · investigation · SEV-1 · live
High latency after deploy · contract-provider
INC-4821 · 02:14 UTC
Detect +0.4s
p95 latency 4.2× baseline across 3 regions after release.
metrics://datadog · checkout.p95
Investigate +6.1s
Runbook executed — pulled logs, deploy events, and the change ticket.
runbook://latency-after-deploy
Correlate running…
Mapping deploy #9f2a1c → slow query cluster on payments-db.
github://PR-2287 · diff
Conclude
Synthesizing evidence into a ranked root-cause hypothesis.
Root cause identified · confidence 96%
PR #2287 removed an index used by the contract lookup query, causing full-table scans under load. Rollback recommended.
The problem

Critical context is scattered across every tool you own.

During an incident, engineers manually correlate logs, metrics, deployments, API behavior, tickets, and institutional knowledge — under pressure, against the clock. Most AI tools just summarize. They answer without defensible justification. Diagnara investigates instead.

// fragmented during an incident
Logs (Elastic) · Metrics (Datadog) · Deploys & PRs (GitHub) · Jira / ServiceNow · Postgres · APIs · Feature flags · Kafka / SQS · AWS / Infrastructure
governed investigation
One evidence-backed conclusion
Root cause · confidence · impact · next actions · full audit trail
Why Diagnara is different

Not a chatbot. An investigation.

Every answer is the result of a structured investigation — with a reproducible reasoning path you can defend to engineering, leadership, compliance, and audit.

Evidence over conversation

Every conclusion links to a cited data point — a log line, metric, ticket, deploy, or contract record. No claim without a source, no source without a tool call.
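
As a rough illustration of the "no claim without a source, no source without a tool call" rule, here is a minimal Python sketch. The names `ToolCall`, `Evidence`, `Claim`, and `call_id` are hypothetical and chosen for illustration only; they are not Diagnara's actual data model.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ToolCall:
    call_id: str  # hypothetical identifier for one governed tool invocation
    tool: str     # e.g. "metrics" or "github"


@dataclass(frozen=True)
class Evidence:
    source_uri: str      # e.g. "github://PR-2287 · diff"
    tool_call: ToolCall  # every source must trace back to a tool call


@dataclass
class Claim:
    text: str
    evidence: list[Evidence] = field(default_factory=list)

    def __post_init__(self) -> None:
        # Reject any claim that arrives without at least one cited source.
        if not self.evidence:
            raise ValueError("claim has no cited evidence")


call = ToolCall(call_id="tc-1", tool="github")
claim = Claim(
    text="PR #2287 removed an index used by the contract lookup query",
    evidence=[Evidence("github://PR-2287 · diff", call)],
)
```

Because `Evidence` cannot be built without a `ToolCall`, and `Claim` refuses an empty evidence list, an unsupported statement fails at construction time rather than at review time.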

Protocol over improvisation

Investigations follow versioned runbooks that define the steps, required tools, evidence thresholds, and approval gates — so outputs are consistent and reproducible.

Governance over open access

RBAC, quotas, rate limits, data masking, model routing, and audit events are built in — agents only touch approved tools, within approved environments.

The evidence pipeline

A structured protocol, end to end.

Diagnara turns a signal into a defensible conclusion through five explicit stages — every run reproducible, every step auditable.

Stage 01 · Detect

Catch the signal, from anywhere

An alert, a Jira ticket, or a manual prompt opens an investigation. Diagnara intakes the signal and scopes it to the right environment.

CRIT p95 latency 4.2× baseline · contract-provider 02:14:06
WARN slow query cluster detected on payments-db 02:14:09
INTAKE environment=Production · runbook matched 02:14:12
Stage 02 · Investigate

Run a runbook, not a guess

A versioned investigation runbook defines the steps, required tools, evidence thresholds, and approval gates for this class of incident — executed inside guardrails.

STEP 1 pull deploy events (last 60m) tool: deployments
STEP 2 compare pre/post p95 latency tool: metrics
STEP 3 fetch change ticket + PR diff tool: jira, github
GATE sensitive query → human approval policy
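
One way to picture a runbook like the one above is as ordered data plus a small executor. The sketch below is an assumption about shape, not Diagnara's runbook format: `Step`, `execute`, the `approve` callback, and the `sql` tool attached to the gated step are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    name: str
    tools: tuple[str, ...]
    needs_approval: bool = False  # models the GATE line: pause for a human


# Steps mirror the example above; the structure itself is an assumption.
RUNBOOK = (
    Step("pull deploy events (last 60m)", ("deployments",)),
    Step("compare pre/post p95 latency", ("metrics",)),
    Step("fetch change ticket + PR diff", ("jira", "github")),
    Step("run sensitive query", ("sql",), needs_approval=True),
)


def execute(runbook, approved_tools, approve):
    """Run steps in order; stop at an unapproved tool or a refused gate."""
    log = []
    for step in runbook:
        missing = [t for t in step.tools if t not in approved_tools]
        if missing:
            log.append(f"BLOCKED {step.name}: {missing} not approved")
            break
        if step.needs_approval and not approve(step):
            log.append(f"PAUSED {step.name}: awaiting human approval")
            break
        log.append(f"DONE {step.name}")
    return log


log = execute(
    RUNBOOK,
    approved_tools={"deployments", "metrics", "jira", "github", "sql"},
    approve=lambda step: False,  # nobody has approved the gated step yet
)
```

Here the first three steps complete and the run pauses at the gated step, since the `approve` callback declines it.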
Stage 03 · Correlate

Test hypotheses against real data

Diagnara links evidence across systems, forms competing hypotheses, and validates each against the data — accepting and discarding with reasons.

H1 ✓ PR #2287 dropped index → full-table scans 96%
H2 ✗ upstream provider degradation discarded · 31%
H3 ✗ traffic surge / capacity discarded · 12%
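
The final ranking step can be sketched in a few lines of Python. This covers only the ordering and accept/discard decision and assumes the confidence scores were already computed elsewhere; `Hypothesis`, `rank`, and the `accept_at` threshold of 0.9 are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hypothesis:
    label: str
    summary: str
    confidence: float  # 0.0 to 1.0, assumed to be computed upstream


def rank(hypotheses, accept_at=0.9):
    """Sort by confidence; accept the top one only if it clears the threshold."""
    ordered = sorted(hypotheses, key=lambda h: h.confidence, reverse=True)
    top = ordered[0]
    accepted = top if top.confidence >= accept_at else None
    discarded = [h for h in ordered if h is not accepted]
    return accepted, discarded


accepted, discarded = rank([
    Hypothesis("H2", "upstream provider degradation", 0.31),
    Hypothesis("H1", "PR #2287 dropped index, full-table scans", 0.96),
    Hypothesis("H3", "traffic surge / capacity", 0.12),
])
```

With the figures from the example above, H1 is accepted and H2 and H3 are returned as discarded, in descending order of confidence.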
Stage 04 · Conclude

A conclusion you can defend

The output is a ranked root cause, a confidence label, an impact assessment, and recommended next actions — every claim source-linked and reproducible.

ROOT CAUSE missing index on contract_lookup conf 96%
IMPACT 3 regions · ~14m · 38k requests assessed
ACTION rollback PR #2287 / hotfix index recommended
Stage 05 · Institutionalize

Every investigation makes the next one faster

The timeline, evidence graph, decisions, and report are stored as reusable, auditable knowledge — so when a similar signal returns, the investigation builds on what's known and reaches a confident answer far faster.

SAVED investigation timeline + evidence graph knowledge base
AUDIT 14 tool calls · 2 approvals logged immutable
REUSE linked to runbook v3 · postmortem indexed
Architecture

Governed, tool-driven, multi-agent by design.

Every input is scoped, every action is policy-checked, and every fact comes from a registered tool — so you can see exactly where your data flows before you ever book a demo.

01 · Signals

Inputs

alerts · tickets · prompts
02 · Governance

Control plane

rbac · policy · audit
03 · Multi-agent

Investigation

runbook · agents
04 · Tool-first

Governed tools

logs · metrics · sql
05 · Sources

External systems

elastic · datadog · github
Evidence store & timeline
Every tool call returns a cited, source-linked fact into one reproducible, auditable record.
all evidence converges here
Governed tools

Connected to where the evidence lives.

Agents access data only through registered tools — each with permissions, rate limits, timeouts, and human-in-the-loop approval.
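
A registered tool with permissions, a rate limit, a timeout, and an approval flag could be modeled roughly as below. This is a minimal sketch under stated assumptions: `ToolPolicy`, `REGISTRY`, and the specific limits are hypothetical, and the role names are borrowed from the Admin, Manager, and User roles this page describes.

```python
import time
from dataclasses import dataclass, field


@dataclass
class ToolPolicy:
    allowed_roles: frozenset
    max_calls_per_minute: int
    timeout_s: float  # recorded here; enforcing it is left to the caller
    needs_approval: bool = False
    _calls: list = field(default_factory=list)  # timestamps of recent calls

    def check(self, role, now=None, approved=False):
        """Return (allowed, reason) for one attempted call."""
        now = time.monotonic() if now is None else now
        if role not in self.allowed_roles:
            return False, "role not permitted"
        if self.needs_approval and not approved:
            return False, "human approval required"
        # Sliding one-minute window for the rate limit.
        self._calls = [t for t in self._calls if now - t < 60.0]
        if len(self._calls) >= self.max_calls_per_minute:
            return False, "rate limit exceeded"
        self._calls.append(now)
        return True, "ok"


# Hypothetical registry: limits and role sets are illustrative only.
REGISTRY = {
    "logs": ToolPolicy(frozenset({"Admin", "Manager", "User"}), 2, 10.0),
    "sql": ToolPolicy(frozenset({"Admin"}), 5, 30.0, needs_approval=True),
}
```

An agent would call `REGISTRY["sql"].check("Admin", approved=True)` before touching the tool; anything not in the registry is simply unreachable.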

Logs

Search and correlate application and gateway logs across services and regions.

Elastic · REST

Metrics & APM

Compare latency, error rates, and saturation; trace spans across dependencies.

Datadog · traces

Deployments & PRs

Tie incidents to releases — deploy events, PR diffs, and approval trails.

GitHub · deploy events

Tickets

Pull change tickets and incident records to reconstruct context and timeline.

Jira · ServiceNow

Databases (SQL)

Query production databases and read replicas directly through governed SQL.

Postgres · SQL

APIs

Call internal and partner business services — billing, contracts, and ledger — as governed REST, SOAP, and GraphQL endpoints.

REST · SOAP · GraphQL
The output

An evidence-backed report, not a hunch.

Every investigation ends in a structured report your whole organization can trust — and auditors can review.

Ranked root cause & confidence
A primary hypothesis with a calibrated confidence label — and the alternatives that were ruled out.
Source-linked evidence
Each conclusion cites the exact log, metric, ticket, or deploy — traceable back to the tool call.
Impact & next actions
Scope of impact and a recommended remediation — rollback, hotfix, replay, or escalation.
diagnara · report · resolved
Connection errors spike · checkout-api
INC-4906 · checkout-api · Production
Root cause · 93% confidence
The orders-db RDS instance hit its max_connections ceiling (820/820) after release v3.4 opened a new client per request without pooling, throwing FATAL: too many connections across checkout.
Recommended next action
Raise max_connections and recycle the leaked pool, then roll back v3.4. Add a connection-pool leak assertion to the load-test gate.
Evidence · 4 sources
rds://DatabaseConnections pinned at 820 ceiling · 04:12
metrics://checkout 5xx 0.2% → 31% · spike
github://v3.4 opens a client per request · diff
Tools used · 4
aws CloudWatch · aws RDS console · logs Kibana · github PR #341
Resolved in 3m12s · 9 tool calls · 1 approval · full audit trail
Schema drift after release · billing-service
INC-4877 · billing-service · Production
Root cause · 94% confidence
Liquibase changeset add-invoice-status never ran — the deploy migration job exited on a stale DATABASECHANGELOGLOCK, so v2.9 queried a column that doesn't exist yet: column "invoice_status" does not exist.
Recommended next action
Release the stale changelog lock and re-run the Liquibase migration, then add a post-deploy migration verification gate to the release runbook.
Evidence · 4 sources
logs://column "invoice_status" does not exist · ×2.3k
db://DATABASECHANGELOGLOCK held 41m · lock
deploy://migrate job exit code 1 · step
Tools used · 4
logs Kibana · sql Postgres · ci Liquibase · github changeset #512
Resolved in 1m54s · 11 tool calls · 2 approvals · full audit trail
High latency after deploy · contract-provider
INC-4821 · contract-provider · Production
Root cause · 96% confidence
PR #2287 dropped idx_contract_lookup, the index used by the contract_lookup query, causing full-table scans under peak load and a 4.2× p95 latency increase across three regions.
Recommended next action
Roll back PR #2287 or ship a hotfix re-adding the index. Add a migration check to the deploy runbook.
Evidence · 5 sources
metrics://p95 latency 210ms → 880ms post-deploy · 02:14
github://PR #2287 drops idx_contract_lookup · diff
logs://seq scan on contracts (1.2k slow queries) · 1.2k
Tools used · 4
logs Kibana · metrics Grafana · github contract-query · github PR #2287
Resolved in 2m38s · 14 tool calls · 2 approvals · full audit trail
Governance & Trust

Move faster without giving up control.

Governance isn't an add-on — it's the product. RBAC, quotas, audit, and human-in-the-loop approval are enforced by default, so teams accelerate while compliance stays intact.

RBAC & scope

Admin, Manager, and User roles — every action scoped to approved tools and environments.

Quotas & rate limits

Execution and tool usage are bounded to protect system stability and prevent abuse.

Audit trail

Runs, policy changes, and tool access are logged as immutable, reviewable events.
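
One common way to make a log tamper-evident is hash chaining, where each entry's hash covers the previous entry's hash. The sketch below shows that general technique only; this page does not say how Diagnara stores its audit events, and `append_event` and `verify` are hypothetical names.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the first entry


def _digest(prev_hash, event):
    """Hash the previous hash together with the canonical JSON of the event."""
    payload = json.dumps(event, sort_keys=True)
    return hashlib.sha256((prev_hash + payload).encode("utf-8")).hexdigest()


def append_event(log, event):
    """Append an event whose hash covers its content and the previous hash."""
    prev_hash = log[-1]["hash"] if log else GENESIS
    log.append({"event": event, "prev": prev_hash,
                "hash": _digest(prev_hash, event)})


def verify(log):
    """Recompute every hash; any edited or reordered entry breaks the chain."""
    prev_hash = GENESIS
    for entry in log:
        if entry["prev"] != prev_hash:
            return False
        if entry["hash"] != _digest(prev_hash, entry["event"]):
            return False
        prev_hash = entry["hash"]
    return True


audit_log = []
append_event(audit_log, {"type": "tool_call", "tool": "metrics"})
append_event(audit_log, {"type": "approval", "by": "on-call"})
```

A reviewer can run `verify(audit_log)` at any time; changing a stored event after the fact makes verification fail from that entry onward.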

Human-in-the-loop

Low-confidence, sensitive, or costly actions pause for explicit human approval.
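
The pause rule can be expressed as a small predicate. This is a sketch under assumed thresholds: `Action`, `approval_reasons`, `min_confidence`, and `max_cost` are hypothetical, and the cost unit is arbitrary.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    name: str
    confidence: float      # 0.0 to 1.0
    sensitive: bool
    estimated_cost: float  # arbitrary units, for illustration only


def approval_reasons(action, min_confidence=0.8, max_cost=100.0):
    """Return the reasons an action must pause; an empty list means proceed."""
    reasons = []
    if action.confidence < min_confidence:
        reasons.append("low confidence")
    if action.sensitive:
        reasons.append("sensitive")
    if action.estimated_cost > max_cost:
        reasons.append("costly")
    return reasons


rollback = Action("rollback PR #2287", confidence=0.96,
                  sensitive=True, estimated_cost=5.0)
reasons = approval_reasons(rollback)
```

Returning the list of reasons, rather than a bare boolean, lets the approval request tell the human why the action was paused.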

See Diagnara run a live investigation.

Book a guided demo tailored to your incident workflow — from signal detection to evidence-backed conclusion, with full traceability and governance.