Automated SRE · on shift · scanning 14 services

The Automated SRE
for your cloud.
On call. Always.

Meet Radar, the watchdog on shift for your AWS accounts. Ears up 24/7, Radar sniffs out real incidents, ignores the squirrels (known noise), correlates deploys, and only barks when production actually needs a human.

Signals/24h

Noise suppressed

412

Pages fired

payments-api

checkout-web

edge-cdn

auth-svc

primary-db

search-api


Radar, the InfraWatchdog mascot — an alert shepherd dog with a glowing cyan signal collar tag

radar · on shift · us-east-1

Time to first action

30 seconds. Not 30 minutes.

The same 3 a.m. PagerDuty alert. Two timelines. The difference is whether your on-call engineer is reading evidence or hunting for it.

Without InfraWatchdog

~30 min

00:00 PagerDuty fires. Wake up. Open laptop.
02:30 Squint at CloudWatch dashboard. Which service?
06:00 Switch to Logs Insights. Write a query. Wait.
11:00 Rewrite the query. Filter the right log group.
16:00 Check #deploys. Was anything shipped?
22:00 Grep correlated services. Cross-reference SQS depth.
27:00 Form a hypothesis. Page a teammate to verify.
30:00 Roll back. Pray.

With InfraWatchdog

~30 sec

00:00 Slack page from Radar. Headline + verdict in one line.
00:05 AI summary: which service, which deploy, which queue.
00:12 Linked log groups + the dominant error pattern.
00:18 Correlated deploy chip. Confidence: high.
00:25 One click to investigation. Evidence rendered.
00:30 Roll back. Confident.


Everything in the box

The whole investigation surface. Not just an AI.

Radar is the brain. These are the senses, the muscle memory, and the comms layer it runs on. Every one is a real, working surface — click any tile to see exactly what you'll get.

AI investigations

An autonomous agent triages every incident.

p99 → 11.4s · deploy v3.42.1 · roll back

Trace journey

Follow one ID across every log group in a service.

req_a1b2c3 · 47 events · 6 log groups

Natural-language search

Ask CloudWatch in English. Get answers, not queries.

"5xx on payments-api last 30m"

Service map

Group log groups into the services they actually belong to.

acme-api · 3 ECS services · 1 RDS

Cluster reviews

Periodic AI health reports across ECS, RDS, EC2, Atlas.

weekly · 7 findings · 2 actionable

Error watch

Continuous baseline-aware error pattern detection.

users-api 5xx +820% vs baseline

Deploy correlation

Webhook your deploys. Incidents get linked automatically.

v3.42.1 · 3m before INC-2041

Slack & webhook alerts

Pages with the headline, the evidence, and a verdict.

#infra-alerts · confidence · high

Public status pages

Customer-facing incident comms on your own slug.

status.acme.com · 3 services · investigating

Incident simulator

Realistic AWS chaos for safe demos and runbook drills.

scenario · api-cascade · 12 events

Signal · Noise

Most alerts are noise. Radar finds signal.

suppressed

412

Known noise

14:02 · edge-cdn 502 burst · known upstream flap
13:48 · search-api timeout pattern · recurring
13:31 · auth-svc cold-start spike · within baseline
13:14 · primary-db slow-query · on follow-up cooldown

surfaced

Signal

14:21 · payments-api p99 11.4s · deploy v3.42.1 correlated
12:08 · users-api 5xx +820% · SQS depth ×4.2
09:44 · rds-prod-1 IOPS saturation · customer-facing

How Radar scans

A continuous loop. Scan, correlate, alert.

step · 01

Scan

Every 60 seconds Radar samples error rates, latency, deploy events, and stack telemetry across every connected service.

step · 02

Correlate

It cross-references against the 6h baseline, recent deploys, prior noise lessons, and customer-facing impact signals.

step · 03

Alert

If the gates fire, Radar pages Slack with the headline, the evidence, and a one-paragraph explanation in plain English.
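The three steps above can be sketched in a few lines of Python. This is a minimal illustration, not InfraWatchdog's implementation: the `Sample` type, the `should_page` gate, and both spike factors are assumptions made for the example. Only the 60-second cadence and the 6h baseline come from the description above.

```python
from dataclasses import dataclass
from statistics import mean

SCAN_INTERVAL_S = 60           # from the text: one scan every 60 seconds
BASELINE_WINDOW_S = 6 * 3600   # from the text: 6h baseline
SPIKE_FACTOR = 3.0             # assumption: page at 3x baseline
DEPLOY_SPIKE_FACTOR = 2.0      # assumption: lower bar right after a deploy


@dataclass
class Sample:
    ts: int             # Unix time in seconds
    error_rate: float   # errors per minute


def pct_over_baseline(current: float, baseline: float) -> float:
    """Percent change of `current` relative to `baseline`."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (current - baseline) * 100.0 / baseline


def should_page(history, current, recent_deploy=False, known_noise=False) -> bool:
    """Hypothetical gate: compare the newest sample with the 6h baseline."""
    if known_noise:
        return False  # suppressed pattern, never pages in this sketch
    window = [s.error_rate for s in history
              if current.ts - BASELINE_WINDOW_S <= s.ts < current.ts]
    if not window:
        return False  # no baseline yet, so nothing to compare against
    baseline = mean(window)
    factor = DEPLOY_SPIKE_FACTOR if recent_deploy else SPIKE_FACTOR
    return baseline > 0 and current.error_rate >= factor * baseline
```

With a flat baseline of 10 errors per minute, a sample of 92 is 820% over baseline and passes the gate. The same sample flagged as known noise does not.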

What your Automated SRE does

A full SRE shift, automated. From blip to root cause.

Everything a senior SRE does on call — watch dashboards, grep logs, correlate deploys, write the post-mortem — InfraWatchdog does continuously. Read-only AWS access, deterministic retrieval, AI on top.

investigations

AI Investigations

Ask Radar a question — “why did users-api 5xx spike at 14:22?” It pulls logs, metrics, deploys, and traces, then writes a timeline with evidence and a verdict.

module · live

logs

Logs Search

CloudWatch Logs Insights without the syntax tax. Natural-language queries, saved patterns, multi-log-group fan-out, and result diffing across time windows.

module · live
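To make "without the syntax tax" concrete, here is roughly what a request like "5xx on payments-api last 30m" might expand to. This is a hedged sketch: the function name, the `/ecs/payments-api` log group, and the numeric `status` field are assumptions, and the page does not say how Radar actually translates questions. The returned keys mirror the parameter names of the CloudWatch Logs `StartQuery` API.

```python
import time
from typing import Optional


def build_5xx_query(log_group: str, minutes: int,
                    now: Optional[float] = None) -> dict:
    """Build StartQuery-style parameters for recent 5xx responses.

    Assumes structured logs that expose a numeric `status` field.
    """
    end = int(now if now is not None else time.time())
    query = (
        "fields @timestamp, @message"
        " | filter status >= 500"
        " | sort @timestamp desc"
        " | limit 50"
    )
    return {
        "logGroupNames": [log_group],
        "startTime": end - minutes * 60,  # epoch seconds
        "endTime": end,
        "queryString": query,
    }


# Hypothetical log group name, for illustration only.
params = build_5xx_query("/ecs/payments-api", minutes=30)
```

The dictionary could be passed to a CloudWatch Logs client as keyword arguments. Fanning out across several log groups would mean adding more names to `logGroupNames`.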

error watch

Error Watch

Continuously scans every connected log group for new error fingerprints, ranks them by blast radius, and groups recurrences so the same stack trace doesn’t page twice.

module · live
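One common way to build error fingerprints is to strip the variable parts of a message and hash what remains, so recurrences of the same error collapse into one group. The sketch below shows that general technique. It is not Error Watch's actual algorithm, and the placeholder tokens and 12-character digest length are arbitrary choices.

```python
import hashlib
import re
from collections import Counter

_UUID = re.compile(
    r"\b[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}\b", re.I)
_HEX = re.compile(r"\b0x[0-9a-f]+\b", re.I)
_NUM = re.compile(r"\d+")


def fingerprint(message: str) -> str:
    """Hash a log message after masking IDs, addresses, and numbers."""
    normalized = _UUID.sub("<uuid>", message)
    normalized = _HEX.sub("<hex>", normalized)
    normalized = _NUM.sub("<n>", normalized)
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()[:12]


def group_errors(messages) -> Counter:
    """Count occurrences per fingerprint."""
    return Counter(fingerprint(m) for m in messages)


groups = group_errors([
    "timeout after 3000 ms for user 42",
    "timeout after 1500 ms for user 7",
    "connection refused by 10.0.3.17",
])
# Two distinct fingerprints: the two timeouts share one.
```

Ranking by blast radius would need extra inputs, such as which services or customers each fingerprint touches, which this sketch leaves out.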

resources

Resource Inventory

Live read-only view of ECS services, RDS clusters, EC2 fleets and their health, capacity, and recent change events — keyed by workspace, scoped per IAM role.

module · live

service map

Service Maps

Auto-built dependency graph from traces and SQS/SNS topology. See what depends on the failing service before opening a single dashboard.

module · live

trace

Distributed Trace

Jump from a Slack alert into the exact slow span. Trace timelines align with deploys and metric anomalies on a single x-axis.

module · live

timelines

Incident Timelines

A merged stream of deploys, alarms, scale events, error blips, and manual notes — the runbook you wished post-mortems already had.


module · live

reviews

Cluster Reviews

Weekly per-service review: top errors, latency drift, deploy churn, noisy alarms, and Radar’s recommendations to quiet them.

module · live

history

Searchable History

Every investigation, query, alert, and verdict is auditable and replayable. Onboarding the next on-call is a link, not a meeting.

module · live

Slack alerts

Alerts that explain themselves.

Every page links the evidence: the metric, the deploy, the dominant error pattern, and whether customers are seeing it. No grepping. No tab archaeology. Just signal.

Headline that fits in a glance
Threaded follow-ups for related blips
Confidence + customer-impact verdict
Direct link to the timeline + raw logs

#infra-alerts · radar · 14:22 UTC

Radar APP · agent · incident · INC-2041

🛰 Signal detected · payments-api

p99 latency rose to 11.4s (baseline 2.1s) shortly after deploy v3.42.1. Cascading 5xx on users-api via job-events queue (depth ×4.2).

customer-facing · deploy-correlated · confidence · high

threaded

14:25 · auto-rollback completed. Watching for recurrence over next 30m.

14:55 · No recurrence. Marking resolved.

Memory

Radar remembers.
It doesn’t bark twice.

Once a pattern is acknowledged or labeled noise, Radar suppresses recurrences inside cooldown windows and escalates only when severity, frequency, or customer impact changes.

cluster_memory · acme-api · 4 active

watching    edge-cdn 502 burst            ×7    2h cooldown
noise       search-api timeout pattern    ×23   suppressed
follow-up   rds-prod-1 IOPS saturation    ×2    due 18:00
resolved    auth-svc cold-start           ×4    -
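The suppression rule described above can be sketched as a small decision function. Everything here is illustrative: `MemoryEntry`, its fields, and the integer severity scale are assumptions, and the sketch covers only the severity trigger, not frequency or customer impact.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class MemoryEntry:
    label: str          # e.g. "watching" or "noise"
    last_alert_ts: int  # Unix seconds of the last page for this pattern
    cooldown_s: int     # how long recurrences stay quiet
    last_severity: int  # higher means worse (hypothetical scale)


def should_alert(entry: Optional[MemoryEntry], now_ts: int, severity: int) -> bool:
    """Decide whether a recurrence of a known pattern should page again."""
    if entry is None:
        return True   # never seen before
    if severity > entry.last_severity:
        return True   # escalation overrides both noise labels and cooldowns
    if entry.label == "noise":
        return False  # labeled noise stays suppressed at the same severity
    return now_ts - entry.last_alert_ts >= entry.cooldown_s


watching = MemoryEntry("watching", last_alert_ts=1_000, cooldown_s=2 * 3600,
                       last_severity=2)
should_alert(watching, now_ts=1_000 + 3600, severity=2)  # inside cooldown
should_alert(watching, now_ts=1_000 + 60, severity=3)    # severity rose
```

The first call stays quiet because the recurrence falls inside the cooldown. The second pages because severity increased.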

Correlation

Deploys, alarms, telemetry — linked automatically.

time · UTC   event                            verdict

14:21:08     deploy payments-api · v3.42.1    candidate
14:22:14     p99 latency · 2.1s → 11.4s       anomaly
14:22:31     SQS depth · ×4.2                 cascade
14:22:44     users-api 5xx · +820%            customer-facing
14:25:02     auto-rollback initiated          recovering
14:55:00     no recurrence · 30m window       resolved
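The timeline above pairs an anomaly with the deploy that landed just before it. A minimal sketch of that time-window matching follows. The 15-minute lookback, the function name, the calendar date, and the earlier `v3.42.0` deploy are assumptions added for illustration. Only the two timestamps and `v3.42.1` come from the timeline.

```python
from datetime import datetime, timedelta
from typing import List, Optional, Tuple

LOOKBACK = timedelta(minutes=15)  # assumption: how far back a deploy still counts

Deploy = Tuple[str, datetime]  # (version, deployed_at)


def correlate_deploy(anomaly_at: datetime,
                     deploys: List[Deploy]) -> Optional[Deploy]:
    """Return the latest deploy that landed within LOOKBACK before the anomaly."""
    candidates = [d for d in deploys
                  if timedelta(0) <= anomaly_at - d[1] <= LOOKBACK]
    return max(candidates, key=lambda d: d[1], default=None)


# Hypothetical date; the times of day match the timeline above.
deploys = [
    ("v3.42.0", datetime(2024, 1, 1, 13, 2, 0)),
    ("v3.42.1", datetime(2024, 1, 1, 14, 21, 8)),
]
anomaly = datetime(2024, 1, 1, 14, 22, 14)
candidate = correlate_deploy(anomaly, deploys)
gap_s = (anomaly - candidate[1]).total_seconds()
```

Here `candidate` is the `v3.42.1` deploy and `gap_s` is 66 seconds. A real confidence score would likely weigh more than timing, for example whether the deployed service is the one showing the anomaly.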

Simulator · Demo Mode

Try Radar against a fake outage.

Don’t want to wire AWS just to evaluate? Flip on Demo Mode to spin up a synthetic Acme Robotics workspace — full ECS / RDS / EC2 inventory, CloudWatch logs, deploy history, and an in-process incident simulator that lets you trigger real-feeling outages and watch Radar respond.

Synthetic infra: ECS services, RDS clusters, EC2 fleets
Replayable scenarios: deploy regression, queue backlog, IOPS saturation
Slack output suppressed — nothing leaves the workspace
Reset to a clean state in one click

sim · scenarios · demo · acme-robotics

SCN-01   payments-api · deploy regression    ready
SCN-02   users-api · SQS backlog cascade     ready
SCN-03   rds-prod-1 · IOPS saturation        running
SCN-04   auth-svc · cold-start storm         ready
SCN-05   edge-cdn · upstream 502 burst       ready

5 scenarios · 0 side effects · read-only · sandboxed

Encrypted at rest, optional storage

Every log body, agent excerpt, and alert payload we keep is encrypted with AES-256 (pgcrypto). Don't want us to store any of it? One toggle in settings — counts stay, log lines never touch our database.

Read-only AWS, scoped per workspace

Radar requests logs:Filter*, logs:Describe*, cloudwatch:Get*. No write actions. Connections, IAM roles, and queries are isolated per workspace and audit-logged.
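For readers who want to see what that scope looks like as an IAM policy document, here is a sketch that builds one from the three action patterns listed above. The `Sid` value and the wildcard `Resource` are assumptions. The page does not publish the actual policy, and a real deployment would probably narrow the resource list.

```python
import json


def read_only_policy() -> dict:
    """Build an IAM policy document from the three read-only action patterns."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "InfraWatchdogReadOnly",  # hypothetical statement ID
                "Effect": "Allow",
                "Action": [
                    "logs:Filter*",
                    "logs:Describe*",
                    "cloudwatch:Get*",
                ],
                "Resource": "*",  # assumption: narrow to specific ARNs in practice
            }
        ],
    }


def policy_json() -> str:
    """Render the policy as JSON, ready to paste into an IAM role."""
    return json.dumps(read_only_policy(), indent=2)
```

Every listed action is a read (`Filter*`, `Describe*`, `Get*`), which is how the policy supports the "no write actions" claim.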

Deterministic first

Retrieval is deterministic. AI explains evidence — it never invents log lines or metrics.

Pricing

Hire your Automated SRE. Pay less than a coffee run.

All tiers are read-only by design. Cancel anytime. Annual billing saves 20%.

tier

Free

Bring your own OpenAI key. Kick the tires.

$0 · BYO key

Bring your own OpenAI API key
You pay OpenAI directly for tokens
1 AWS account · read-only
Up to 3 monitored services
7-day signal history
Slack alerts · 1 channel
Demo Mode + Simulator
Community support

tier

Solo

One operator, one account, full visibility.

$25 / seat / mo

BYO key, or +$14 for managed AI
1 AWS account
Up to 10 services · 14-day history
250 investigations / seat / month
Email + Slack alerts · weekly cluster reviews
30-day audit log · 1 status page
Email support

most teams

tier

Team

On-call rotations sharing context.

$59 / seat / mo (3-seat minimum)

AI included · or BYO key
Up to 3 AWS accounts · 40 services
30-day signal history
500 investigations / seat / month
Slack + PagerDuty + webhooks
Daily cluster reviews · Google SSO
90-day audit log · 3 status pages
Priority email support

tier

Business

Multi-account orgs needing audit + SSO.

$119 / seat / mo (5-seat minimum)

Up to 10 AWS accounts · 150 services
90-day signal history
1,000 investigations / seat / month
SAML SSO · 1-year audit log
Unlimited status pages
Dedicated Slack channel

All plans include: Investigations · Logs Search · Resource Inventory · Service Maps · Demo Mode
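A quick way to estimate a bill from the tiers above is shown below. The sketch assumes that the parenthesised minimums on Team and Business are seat minimums and that the 20% annual discount applies to twelve times the monthly price. Neither is spelled out on the page. The managed-AI add-on for Solo is left out.

```python
# (price per seat per month in USD, minimum seats): minimums are an assumption
TIERS = {"Solo": (25, 1), "Team": (59, 3), "Business": (119, 5)}
ANNUAL_DISCOUNT = 0.20  # from the page: annual billing saves 20%


def monthly_cost(tier: str, seats: int) -> int:
    """Monthly bill in USD, billing at least the tier's minimum seats."""
    price, min_seats = TIERS[tier]
    return price * max(seats, min_seats)


def annual_cost(tier: str, seats: int) -> float:
    """Yearly bill in USD with the annual discount applied."""
    return round(monthly_cost(tier, seats) * 12 * (1 - ANNUAL_DISCOUNT), 2)


monthly_cost("Team", 2)   # billed as 3 seats under the assumed minimum
annual_cost("Team", 4)    # 4 seats, 12 months, 20% off
```

Under these assumptions a two-person team on the Team tier pays for three seats, $177 a month, and four seats billed annually come to $2,265.60.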

early access · limited

Put your Automated SRE on shift.

Connect AWS read-only. Point InfraWatchdog at the services that matter. Stop being the first responder for every CloudWatch hiccup.
