The agentic SRE — now in private beta

The autonomous SRE that doesn't just find the bug. It ships the fix.

selfheal investigates the moment an alert fires — correlates signals across logs, traces and configs, tests hypotheses in parallel, isolates the root cause, then opens a reviewed pull request with a regression test attached. Incident in. Diff out. MTTR measured in seconds, not hours.

Book a demo See the Heal Loop

On-call MTTR · minutes → secondsRead-only access · no agent in your buildHuman-in-the-loop merge

heal-agent · on shift

incident-9f3a

00.00ALERTPagerDuty · checkout-api p99 latency 6.2s (SLO 1.5s)P1

▍_listening…

PR opened in 24s

One agent · every signal · live

selfheal sits in the middle of your stack — 24 integrations, one continuous chain of custody from page to pull request.

Datadog

Grafana

Sentry

CloudWatch

New Relic

PagerDuty

OpsGenie

Grafana OnCall

Splunk

Slack

MS Teams

Linear

GitHub

GitLab

Bitbucket

Airflow

Kafka

Spark

dbt

AWS

GCP

Azure

Kubernetes

Vercel

selfheal

heal-agent

on shift

Observability· 5Alerts· 4Comms· 3Source control· 3Data platform· 4Infrastructure· 5

24 integrations · 1 agent · live

selfheal

heal-agent · on shift

24 live

Observability

DatadogGrafanaSentryCloudWatchNew Relic

Alerts

PagerDutyOpsGenieGrafana OnCallSplunk

Comms

SlackMS TeamsLinear

Source control

GitHubGitLabBitbucket

Data platform

AirflowKafkaSparkdbt

Infrastructure

AWSGCPAzureKubernetesVercel

10× faster

Mean time to resolution — investigation that lands while you're still loading the dashboard.

Durable fixes, not patches

The PR description ships with a regression test that reproduces the incident in under two seconds.

Senior context for everyone

Every on-call gets the depth of investigation that used to live in a single staff engineer's head.

What selfheal does

Most "AI SRE" tools find the root cause.
selfheal closes the loop with a pull request.

Every page these days ends the same way — a tired engineer translating a great RCA into a small diff at 3 AM. selfheal writes that diff. Reviewed by you. Tested before you read it.

Investigate

Structured, deterministic investigation — parallel hypotheses, evidence collected from your logs, metrics and configs, with a fully auditable trail.

evidence-backed

Heal

The missing last mile. selfheal opens a reviewed pull request on the offending service with the minimal patch and a test that reproduces the incident.

the diff that closes the page

Verify

The fix is canary-tested against the original failure mode before anyone sees a Slack message. Verified before it's escalated, not after it's merged.

canary held

Learn

Every closed incident becomes a runbook entry, a regression test, and a sharper signal for the next page. selfheal gets quieter over time on purpose.

quieter every quarter

The on-call tax · 2026

Sure, you can keep investigating alerts manually. Or you can let selfheal do it before the page even reaches your phone.

The cost of an incident isn't the bug — it's the hours of human attention spent rebuilding context every single time. selfheal was built to take that work back.

See the Heal Loop

Dig through logs across five tools

Pull traces from Datadog, switch to Grafana for metrics, drop into Sentry for the stack — context loss is the actual on-call tax.

Page colleagues who already left for the day

The senior engineer who knows this corner of the stack is asleep. You write a Slack thread that won't be answered for six hours.

Spend hours playing detective

Reproduce locally, eyeball charts, guess what changed. Under pager pressure, you ship a patch — not a fix — just to go to bed.

↑ this is the part selfheal replaces.

The Heal Loop · six steps · one agent

Alert in. Pull request out. Everything in between is automated, auditable, and parallel.

STEP 01

Detect

Ingest the alert from Datadog, Grafana, Sentry, CloudWatch or PagerDuty. selfheal reads the page like a human on-call — context first.

STEP 02

Correlate

Assemble evidence from logs, metrics, traces, configs and dependency graphs. Only what's tied to this incident — nothing else leaves your cloud.

STEP 03

Hypothesize

Frame the candidate failure modes — bad deploy, dependency flake, leak, quota, regression. Each becomes a parallel investigation branch.

STEP 04

Investigate

Run all hypotheses in parallel against your observability stack. Drop the ones the evidence refutes. Lock in the one it confirms.

STEP 05

Heal

Open a pull request on the source repo — minimal diff, regression test, RCA in the description, every claim linked to the evidence that produced it.

closes the page

STEP 06

Learn

Promote the verified fix into a runbook entry and a continuous regression check. The next incident of this shape never makes it to PagerDuty.

Why one agent for the whole loop

Investigation and remediation are the same job.

Tools that stop at "root cause report" force someone to re-derive the patch from the evidence — at 3 AM, under pager pressure, with the context already half-gone. selfheal keeps every link between symptom, hypothesis, evidence and diff in one continuous chain, so the PR description is just the investigation, written down.

01 → 04 · investigate05 → 06 · heal & learnone agent · one chain of custody

A real PR description, written by selfheal

fix(payments-worker): close DB connection on retry-after-error path

incident:    incident-9f3a · checkout-api p99 6.2s (SLO 1.5s)
root cause:  pool exhausted — connections leaked when a retry was
             scheduled before close() ran. confirmed against
             40m of cloudtrail + pg_stat_activity diff.

evidence:    https://selfheal.ai/i/9f3a   (auditable trail)
regression:  added e2e test that reproduces the leak in <2s.

verified:    canary held 320ms p99 for 12m. ready to merge.

Co-authored-by: heal-agent <agent@selfheal.ai>

How selfheal reasons

Test every hypothesis in parallel. Correlate every signal simultaneously. Land the answer in seconds, not hours.

Manual investigation is sequential — open a dashboard, eyeball a chart, page a colleague, repeat. selfheal fans out the investigation across every candidate failure mode at once, lets the evidence vote, and stops the moment one hypothesis is decisively confirmed.

4 hypotheses · running in parallel

Bad deploy

git history quiet for 14hrefuted

DB pool exhausted

pg_stat_activity diff +812confirmed

Upstream provider

downstream peers greenrefuted

Cache stampede

hit-rate unchangedrefuted

Decisively confirmed in 8s — three branches discarded with evidence, one survived with proof.

A single source of truth · connected

Investigations stop being scattered.

Logs in Datadog, metrics in Grafana, configs in Git, runbooks in Notion, post-mortems in Slack. Each tool saw part of what happened. None of them saw the whole thing. selfheal reads them all and stitches the evidence into one timeline — the same one that ships in the PR.

Observability

Datadog · Grafana · Sentry

Infrastructure

AWS · k8s · Vercel

Source

GitHub · GitLab

Team knowledge

Slack · Notion · runbooks

Every closed incident makes the next page easier.

The verified fix becomes a runbook entry, a regression test, and a memorised pattern. Re-occurrences are investigated in a fraction of the original time — or prevented before the page ever fires. selfheal gets quieter every quarter, on purpose.

Repeat incidents

↓ 73%

P1 pages / wk

↓ 41%

MTTR drift

−0.4× / qtr

Two architectural bets

Two decisions an incident-response vendor structurally can't copy without rebuilding from zero.

Most "AI on-call" tools are wrappers around a chat window. They paste a stack trace into a model and hand the answer back to you. selfheal is built on two assumptions that change what an SRE agent is allowed to do — and what it's allowed to ship.

BET 01 · TRUST

The fix ships as a pull request, not a deploy.

Auto-remediation tools want to push the change. We don't. selfheal opens a PR on your repo, with the diff, the regression test, and the evidence trail. Your reviewer is still the merge button. Production stays under human gates — the agent just removes the 3 AM typing.

PR #4831 · payments-workerawaiting review

-conn.retry(after).then(close)+await conn.close(); retry(after)

+12-3ci · 14 checks · all green

Reviewer keeps the merge button. Always.

BET 02 · CUSTODY

Your logs stay in your cloud. We pull a slice, on-incident.

The agent reads against a read-only role inside your account. There is no selfheal data lake. When an incident fires, we fetch the narrow time-and-actor slice the investigation needs — nothing else. Zero standing access to your code, zero bulk egress of your telemetry.

Logs egressed at rest

0 bytes

Standing prod write access

None

Slice pulled on incident

≈ 14m window

Audit trail

every query

SOC 2 pathYour VPC, your roleNo vendor lake

The on-call rota, before and after

Page the on-call. Or merge the fix.

Modern incident response is mostly handoff cost — pager to human, human to dashboard, dashboard to repo, repo to reviewer. selfheal removes every handoff that doesn't involve a human judgement call.

Dimension

Today's on-call

With selfheal

What lands on you at 3 AM

Pager, stack trace, blank Slack thread

PR draft, evidence trail, reviewer button

Time to first hypothesis

15–40 min · human paging the team

≈ 8 seconds · 4 hypotheses in parallel

Output of the investigation

A doc someone has to translate to a diff

A diff someone just reviews

Regression coverage after the fix

Maybe a Jira ticket, maybe never

Test that reproduces the incident, in the PR

Logs leaving your cloud

All of them, on a billable pipe

An incident slice. Audited.

Standing prod write access

Vendor agent has it

Nobody — fixes go through your reviewer

Integrations · plug into what you already run

Reads from your stack. Writes only to your repo.

60+ connectors across observability, infrastructure, source control and communication. Connect what you already run — selfheal does the correlation, the investigation and the fix-in-a-PR on top.

Read-only rolesNo vendor data lakeMCP-compatible

Alerts

PagerDuty
OpsGenie
Datadog
Grafana OnCall

Observability

Datadog
Grafana / Loki
Sentry
CloudWatch
New Relic

Infra

AWS
GCP
Azure
Kubernetes
Vercel

Data platform

Apache Airflow
Apache Kafka
Apache Spark
dbt

Source

GitHub
GitLab
Bitbucket

Comms

Slack
Microsoft Teams
Linear

PagerDutyOpsGenieDatadogGrafana OnCallDatadogGrafana / LokiSentryCloudWatchNew RelicAWSGCPAzureKubernetesVercelApache AirflowApache KafkaApache SparkdbtGitHubGitLabBitbucketSlackMicrosoft TeamsLinearPagerDutyOpsGenieDatadogGrafana OnCallDatadogGrafana / LokiSentryCloudWatchNew RelicAWSGCPAzureKubernetesVercelApache AirflowApache KafkaApache SparkdbtGitHubGitLabBitbucketSlackMicrosoft TeamsLinear

Put selfheal on shift tonight

Connect a read-only role. Get your first PR by morning standup.

Twenty minutes to connect. Forty minutes to verify the agent against a synthetic failure. After that — every page comes with a draft fix attached.

Book a demo Request early access or email the founders →

20m

to connect a read-only role

to validate against a synthetic incident

bytes of logs leaving your cloud at rest

The autonomous SRE that doesn't just find the bug. It ships the fix.

selfheal sits in the middle of your stack — 24 integrations, one continuous chain of custody from page to pull request.

Most "AI SRE" tools find the root cause.selfheal closes the loop with a pull request.

Investigate

Heal

Verify

Learn

Sure, you can keep investigating alerts manually. Or you can let selfheal do it before the page even reaches your phone.

Alert in. Pull request out. Everything in between is automated, auditable, and parallel.

Detect

Correlate

Hypothesize

Investigate

Heal

Learn

Test every hypothesis in parallel. Correlate every signal simultaneously. Land the answer in seconds, not hours.

Investigations stop being scattered.

Every closed incident makes the next page easier.

Two decisions an incident-response vendor structurally can't copy without rebuilding from zero.

The fix ships as a pull request, not a deploy.

Your logs stay in your cloud. We pull a slice, on-incident.

Page the on-call. Or merge the fix.

Reads from your stack. Writes only to your repo.

Connect a read-only role. Get your first PR by morning standup.

Most "AI SRE" tools find the root cause.
selfheal closes the loop with a pull request.