The agentic SRE — now in private beta

The autonomous SRE that doesn't just find the bug. It ships the fix.

selfheal starts investigating the moment an alert fires: it correlates signals across logs, traces and configs, tests hypotheses in parallel, isolates the root cause, then opens a pull request for your review with a regression test attached. Incident in. Diff out. MTTR measured in seconds, not hours.

On-call MTTR · minutes → seconds
Read-only access · no agent in your build
Human-in-the-loop merge
heal-agent · on shift
incident-9f3a
00.00 ALERT PagerDuty · checkout-api p99 latency 6.2s (SLO 1.5s)
One agent · every signal · live

selfheal sits in the middle of your stack — 24 integrations, one continuous chain of custody from page to pull request.

selfheal
24 live
Observability
Datadog · Grafana · Sentry · CloudWatch · New Relic
Alerts
PagerDuty · OpsGenie · Grafana OnCall · Splunk
Comms
Slack · MS Teams · Linear
Source control
GitHub · GitLab · Bitbucket
Data platform
Airflow · Kafka · Spark · dbt
Infrastructure
AWS · GCP · Azure · Kubernetes · Vercel
01
10× faster

Mean time to resolution — investigation that lands while you're still loading the dashboard.

02
Durable fixes, not patches

Every PR ships with a regression test that reproduces the incident in under two seconds.

03
Senior context for everyone

Every on-call gets the depth of investigation that used to live in a single staff engineer's head.

What selfheal does

Most "AI SRE" tools find the root cause.
selfheal closes the loop with a pull request.

Every page these days ends the same way — a tired engineer translating a great RCA into a small diff at 3 AM. selfheal writes that diff. Reviewed by you. Tested before you read it.

01

Investigate

Structured, deterministic investigation — parallel hypotheses, evidence collected from your logs, metrics and configs, with a fully auditable trail.

evidence-backed
02

Heal

The missing last mile. selfheal opens a pull request for review on the offending service, with the minimal patch and a test that reproduces the incident.

the diff that closes the page
03

Verify

The fix is canary-tested against the original failure mode before anyone sees a Slack message. Verified before it's escalated, not after it's merged.

canary held
04

Learn

Every closed incident becomes a runbook entry, a regression test, and a sharper signal for the next page. selfheal gets quieter over time on purpose.

quieter every quarter
The on-call tax · 2026

Sure, you can keep investigating alerts manually. Or you can let selfheal do it before the page even reaches your phone.

The cost of an incident isn't the bug — it's the hours of human attention spent rebuilding context every single time. selfheal was built to take that work back.

See the Heal Loop
Dig through logs across five tools

Pull traces from Datadog, switch to Grafana for metrics, drop into Sentry for the stack — context loss is the actual on-call tax.

Page colleagues who already left for the day

The senior engineer who knows this corner of the stack is asleep. You write a Slack thread that won't be answered for six hours.

Spend hours playing detective

Reproduce locally, eyeball charts, guess what changed. Under pager pressure, you ship a patch — not a fix — just to go to bed.

↑ this is the part selfheal replaces.
The Heal Loop · six steps · one agent

Alert in. Pull request out. Everything in between is automated, auditable, and parallel.

STEP 01

Detect

Ingest the alert from Datadog, Grafana, Sentry, CloudWatch or PagerDuty. selfheal reads the page like a human on-call — context first.

STEP 02

Correlate

Assemble evidence from logs, metrics, traces, configs and dependency graphs. Only what's tied to this incident — nothing else leaves your cloud.

STEP 03

Hypothesize

Frame the candidate failure modes — bad deploy, dependency flake, leak, quota, regression. Each becomes a parallel investigation branch.

STEP 04

Investigate

Run all hypotheses in parallel against your observability stack. Drop the ones the evidence refutes. Lock in the one it confirms.

STEP 05

Heal

Open a pull request on the source repo — minimal diff, regression test, RCA in the description, every claim linked to the evidence that produced it.

closes the page
STEP 06

Learn

Promote the verified fix into a runbook entry and a continuous regression check. The next incident of this shape never makes it to PagerDuty.

Why one agent for the whole loop
Investigation and remediation are the same job.

Tools that stop at "root cause report" force someone to re-derive the patch from the evidence — at 3 AM, under pager pressure, with the context already half-gone. selfheal keeps every link between symptom, hypothesis, evidence and diff in one continuous chain, so the PR description is just the investigation, written down.

01 → 04 · investigate
05 → 06 · heal & learn
one agent · one chain of custody
A real PR description, written by selfheal
fix(payments-worker): close DB connection on retry-after-error path

incident:    incident-9f3a · checkout-api p99 6.2s (SLO 1.5s)
root cause:  pool exhausted — connections leaked when a retry was
             scheduled before close() ran. confirmed against
             40m of cloudtrail + pg_stat_activity diff.

evidence:    https://selfheal.ai/i/9f3a   (auditable trail)
regression:  added e2e test that reproduces the leak in <2s.

verified:    canary held 320ms p99 for 12m. ready to merge.

Co-authored-by: heal-agent <agent@selfheal.ai>
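To make the root cause above concrete, here is a minimal Python sketch of the leak and of the kind of regression test the PR describes. It is an illustration under stated assumptions, not selfheal's actual patch: `FakePool`, `FakeConn`, `flaky_query`, `charge_leaky` and `charge_fixed` are hypothetical names, and the pool is an in-memory stand-in for a real database driver.

```python
import asyncio


class FakePool:
    """Hypothetical in-memory stand-in for a DB connection pool."""

    def __init__(self, size: int) -> None:
        self.size = size
        self.in_use = 0

    async def acquire(self) -> "FakeConn":
        if self.in_use >= self.size:
            raise RuntimeError("pool exhausted")
        self.in_use += 1
        return FakeConn(self)


class FakeConn:
    def __init__(self, pool: FakePool) -> None:
        self.pool = pool
        self.closed = False

    async def close(self) -> None:
        # Idempotent: closing the same connection twice must not double-count.
        if not self.closed:
            self.closed = True
            self.pool.in_use -= 1


async def flaky_query(conn: FakeConn) -> None:
    # Assumed failure mode: every attempt hits a transient error.
    raise ConnectionError("transient failure")


async def charge_leaky(pool: FakePool, attempts: int) -> None:
    # Buggy shape: the retry starts before close() runs, so each
    # failed attempt keeps its connection checked out.
    for _ in range(attempts):
        conn = await pool.acquire()
        try:
            await flaky_query(conn)
        except ConnectionError:
            continue  # retry without closing: this is the leak
        await conn.close()
        return


async def charge_fixed(pool: FakePool, attempts: int) -> None:
    # Fixed shape: close() always runs before the next retry.
    for _ in range(attempts):
        conn = await pool.acquire()
        try:
            await flaky_query(conn)
            return
        except ConnectionError:
            pass
        finally:
            await conn.close()


def test_retry_path_returns_connections() -> None:
    # Regression check: after any number of retries, nothing stays checked out.
    pool = FakePool(size=2)
    asyncio.run(charge_fixed(pool, attempts=5))
    assert pool.in_use == 0
```

With a pool of two, `charge_leaky` exhausts the pool on its third attempt, while `charge_fixed` leaves `in_use` at zero after any number of retries.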
How selfheal reasons

Test every hypothesis in parallel. Correlate every signal simultaneously. Land the answer in seconds, not hours.

Manual investigation is sequential — open a dashboard, eyeball a chart, page a colleague, repeat. selfheal fans out the investigation across every candidate failure mode at once, lets the evidence vote, and stops the moment one hypothesis is decisively confirmed.

4 hypotheses · running in parallel
Bad deploy
refuted
DB pool exhausted
confirmed
Upstream provider
refuted
Cache stampede
refuted
Decisively confirmed in 8s — three branches discarded with evidence, one survived with proof.
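The fan-out described above can be sketched with Python's `asyncio`. This is a minimal illustration, not selfheal's implementation: `Verdict`, `check` and `investigate` are hypothetical names, the delays stand in for real queries against an observability stack, and the evidence strings are made up for the example.

```python
import asyncio
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Verdict:
    hypothesis: str
    confirmed: bool
    evidence: str


async def check(name: str, delay: float, confirmed: bool, evidence: str) -> Verdict:
    # Stand-in for a real query against logs, metrics or traces.
    await asyncio.sleep(delay)
    return Verdict(name, confirmed, evidence)


async def investigate() -> Tuple[Optional[Verdict], List[Verdict]]:
    # One task per candidate failure mode; all start immediately.
    # Outcomes and evidence strings are illustrative values only.
    tasks = [
        asyncio.create_task(check("Bad deploy", 0.01, False, "no deploy in window")),
        asyncio.create_task(check("Upstream provider", 0.03, False, "provider latency flat")),
        asyncio.create_task(check("Cache stampede", 0.05, False, "hit rate unchanged")),
        asyncio.create_task(check("DB pool exhausted", 0.08, True, "pool at max connections")),
    ]
    confirmed: Optional[Verdict] = None
    refuted: List[Verdict] = []
    # Handle results in completion order, not submission order.
    for finished in asyncio.as_completed(tasks):
        verdict = await finished
        if verdict.confirmed:
            confirmed = verdict
            break  # stop as soon as one hypothesis is confirmed
        refuted.append(verdict)
    # Cancel any branch still running; a no-op for finished tasks.
    for task in tasks:
        task.cancel()
    await asyncio.gather(*tasks, return_exceptions=True)
    return confirmed, refuted


if __name__ == "__main__":
    winner, discarded = asyncio.run(investigate())
    print("confirmed:", winner)
    print("refuted:", [v.hypothesis for v in discarded])
```

Because every branch starts at once, total wall time is bounded by the slowest branch that still matters rather than by the sum of all of them. That is the difference from the sequential, dashboard-by-dashboard approach.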
A single source of truth · connected

Investigations stop being scattered.

Logs in Datadog, metrics in Grafana, configs in Git, runbooks in Notion, post-mortems in Slack. Each tool saw part of what happened. None of them saw the whole thing. selfheal reads them all and stitches the evidence into one timeline — the same one that ships in the PR.

Observability
Datadog · Grafana · Sentry
Infrastructure
AWS · k8s · Vercel
Source
GitHub · GitLab
Team knowledge
Slack · Notion · runbooks

Every closed incident makes the next page easier.

The verified fix becomes a runbook entry, a regression test, and a memorised pattern. Re-occurrences are investigated in a fraction of the original time — or prevented before the page ever fires. selfheal gets quieter every quarter, on purpose.

Repeat incidents
↓ 73%
P1 pages / wk
↓ 41%
MTTR drift
−0.4× / qtr
Two architectural bets

Two decisions an incident-response vendor structurally can't copy without rebuilding from zero.

Most "AI on-call" tools are wrappers around a chat window. They paste a stack trace into a model and hand the answer back to you. selfheal is built on two assumptions that change what an SRE agent is allowed to do — and what it's allowed to ship.

BET 01 · TRUST

The fix ships as a pull request, not a deploy.

Auto-remediation tools want to push the change. We don't. selfheal opens a PR on your repo, with the diff, the regression test, and the evidence trail. Your reviewer is still the merge button. Production stays under human gates — the agent just removes the 3 AM typing.

PR #4831 · payments-worker · awaiting review
- conn.retry(after).then(close)
+ await conn.close(); retry(after)
+12 -3 · ci · 14 checks · all green
Reviewer keeps the merge button. Always.
BET 02 · CUSTODY

Your logs stay in your cloud. We pull a slice, on-incident.

The agent reads against a read-only role inside your account. There is no selfheal data lake. When an incident fires, we fetch the narrow time-and-actor slice the investigation needs — nothing else. Zero standing access to your code, zero bulk egress of your telemetry.

Logs egressed at rest
0 bytes
Standing prod write access
None
Slice pulled on incident
≈ 14m window
Audit trail
every query
SOC 2 path · Your VPC, your role · No vendor lake
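As a rough illustration of a "time-and-actor slice", the sketch below filters records down to one service inside a window around the alert. Everything here is an assumption made for the example: `LogRecord` and `incident_slice` are hypothetical names, "actor" is read as the service name, and the 14-minute window is split as 12 minutes before the alert and 2 minutes after.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Iterable, List


@dataclass(frozen=True)
class LogRecord:
    timestamp: datetime
    service: str
    message: str


def incident_slice(
    records: Iterable[LogRecord],
    fired_at: datetime,
    service: str,
    before: timedelta = timedelta(minutes=12),  # assumed split of the window
    after: timedelta = timedelta(minutes=2),
) -> List[LogRecord]:
    """Keep only records for one service inside the incident window."""
    start, end = fired_at - before, fired_at + after
    return [
        r for r in records
        if r.service == service and start <= r.timestamp <= end
    ]


# Illustrative data only.
fired = datetime(2026, 1, 1, 3, 0, tzinfo=timezone.utc)
records = [
    LogRecord(fired - timedelta(minutes=30), "payments-worker", "too early"),
    LogRecord(fired - timedelta(minutes=5), "payments-worker", "pool at capacity"),
    LogRecord(fired - timedelta(minutes=5), "search-api", "unrelated service"),
    LogRecord(fired + timedelta(minutes=1), "payments-worker", "retry scheduled"),
]
kept = incident_slice(records, fired, "payments-worker")
```

In practice the same bounds would be pushed into the query issued through the read-only role, so records outside the window are never fetched. The in-memory filter here only shows the shape of the slice.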
The on-call rota, before and after

Page the on-call. Or merge the fix.

Modern incident response is mostly handoff cost — pager to human, human to dashboard, dashboard to repo, repo to reviewer. selfheal removes every handoff that doesn't involve a human judgement call.

Dimension | Today's on-call | With selfheal
What lands on you at 3 AM | Pager, stack trace, blank Slack thread | PR draft, evidence trail, reviewer button
Time to first hypothesis | 15–40 min · human paging the team | ≈ 8 seconds · 4 hypotheses in parallel
Output of the investigation | A doc someone has to translate to a diff | A diff someone just reviews
Regression coverage after the fix | Maybe a Jira ticket, maybe never | Test that reproduces the incident, in the PR
Logs leaving your cloud | All of them, on a billable pipe | An incident slice. Audited.
Standing prod write access | Vendor agent has it | Nobody. Fixes go through your reviewer
Integrations · plug into what you already run

Reads from your stack. Writes only to your repo.

24 connectors across observability, alerting, data platform, infrastructure, source control and communication. Connect what you already run; selfheal does the correlation, the investigation and the fix-in-a-PR on top.

Read-only roles · No vendor data lake · MCP-compatible
Alerts
  • PagerDuty
  • OpsGenie
  • Datadog
  • Grafana OnCall
Observability
  • Datadog
  • Grafana / Loki
  • Sentry
  • CloudWatch
  • New Relic
Infra
  • AWS
  • GCP
  • Azure
  • Kubernetes
  • Vercel
Data platform
  • Apache Airflow
  • Apache Kafka
  • Apache Spark
  • dbt
Source
  • GitHub
  • GitLab
  • Bitbucket
Comms
  • Slack
  • Microsoft Teams
  • Linear
Put selfheal on shift tonight

Connect a read-only role. Get your first PR by morning standup.

Twenty minutes to connect. Forty minutes to verify the agent against a synthetic failure. After that — every page comes with a draft fix attached.

20m
to connect a read-only role
1h
to connect and validate against a synthetic incident
0
bytes of logs leaving your cloud at rest