Automated SRE · on shift · scanning 14 services

The Automated SRE
for your cloud.
On call. Always.

Meet Radar — the watchdog on shift for your AWS. Ears up 24/7, sniffs out real incidents, ignores the squirrels (known noise), correlates deploys, and only barks when production actually needs a human.

Signals/24h: 3
Noise suppressed: 412
Pages fired: 1
payments-api
checkout-web
edge-cdn
auth-svc
primary-db
search-api
Radar, the InfraWatchdog mascot — an alert shepherd dog with a glowing cyan signal collar tag
radar · on shift · us-east-1
Time to first action

30 seconds. Not 30 minutes.

The same 3 a.m. PagerDuty alert. Two timelines. The difference is whether your on-call engineer is reading evidence or hunting for it.

Without InfraWatchdog
~30 min
  1. 00:00 PagerDuty fires. Wake up. Open laptop.
  2. 02:30 Squint at CloudWatch dashboard. Which service?
  3. 06:00 Switch to Logs Insights. Write a query. Wait.
  4. 11:00 Re-write the query. Filter the right log group.
  5. 16:00 Check #deploys. Was anything shipped?
  6. 22:00 Grep correlated services. Cross-reference SQS depth.
  7. 27:00 Form a hypothesis. Page a teammate to verify.
  8. 30:00 Roll back. Pray.
With InfraWatchdog
~30 sec
  1. 00:00 Slack page from Radar. Headline + verdict in one line.
  2. 00:05 AI summary: which service, which deploy, which queue.
  3. 00:12 Linked log groups + the dominant error pattern.
  4. 00:18 Correlated deploy chip, confidence: high.
  5. 00:25 One click to investigation. Evidence rendered.
  6. 00:30 Roll back. Confident.
Signal · Noise

Most alerts are noise. Radar finds signal.

suppressed
412
Known noise
  • 14:02 edge-cdn 502 burst, known upstream flap
  • 13:48 search-api timeout pattern, recurring
  • 13:31 auth-svc cold-start spike, within baseline
  • 13:14 primary-db slow-query, on follow-up cooldown
surfaced
3
Signal
  • 14:21 payments-api p99 11.4s, deploy v3.42.1 correlated
  • 12:08 users-api 5xx +820%, SQS depth ×4.2
  • 09:44 rds-prod-1 IOPS saturation, customer-facing
How Radar scans

A continuous loop. Scan, correlate, alert.

step · 01

Scan

Every 60 seconds Radar samples error rates, latency, deploy events, and stack telemetry across every connected service.

step · 02

Correlate

It cross-references against the 6h baseline, recent deploys, prior noise lessons, and customer-facing impact signals.

step · 03

Alert

If a signal clears every alert gate, Radar pages Slack with the headline, the evidence, and a one-paragraph explanation in plain English.
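The three steps above can be sketched in a few lines. This is only an illustration, assuming a simple ratio-over-baseline gate: the 3× threshold, the 10-minute deploy window, and the names `Sample` and `should_page` are hypothetical, not Radar's actual rules.

```python
from dataclasses import dataclass
from statistics import median
from typing import List, Optional


@dataclass
class Sample:
    service: str
    p99_seconds: float
    customer_facing: bool
    seconds_since_deploy: Optional[float]  # None when no recent deploy is known


def should_page(sample: Sample, baseline_p99: List[float],
                ratio_gate: float = 3.0, deploy_window_s: float = 600.0) -> bool:
    """Return True only when the sample clears every (hypothetical) gate."""
    if not baseline_p99:
        return False  # no baseline yet, so stay quiet
    baseline = median(baseline_p99)
    anomalous = sample.p99_seconds >= ratio_gate * baseline
    deploy_correlated = (
        sample.seconds_since_deploy is not None
        and sample.seconds_since_deploy <= deploy_window_s
    )
    # Page only for anomalies that are customer-facing or tied to a deploy.
    return anomalous and (sample.customer_facing or deploy_correlated)


# The payments-api numbers from this page: 11.4s against a ~2.1s baseline,
# 66 seconds after a deploy.
incident = Sample("payments-api", 11.4, True, 66.0)
print(should_page(incident, [2.0, 2.1, 2.2]))
```

A within-baseline blip (say 2.3s with no deploy nearby) fails the first gate and never reaches Slack.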

What your Automated SRE does

A full SRE shift, automated. From blip to root cause.

Everything a senior SRE does on call — watch dashboards, grep logs, correlate deploys, write the post-mortem — InfraWatchdog does continuously. Read-only AWS access, deterministic retrieval, AI on top.

investigations

AI Investigations

Ask Radar a question — “why did users-api 5xx spike at 14:22?” It pulls logs, metrics, deploys, and traces, then writes a timeline with evidence and a verdict.

module · live
logs

Logs Search

CloudWatch Logs Insights without the syntax tax. Natural-language queries, saved patterns, multi-log-group fan-out, and result diffing across time windows.

module · live
error watch

Error Watch

Continuously scans every connected log group for new error fingerprints, ranks them by blast radius, and groups recurrences so the same stack trace doesn’t page twice.

module · live
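One common way to group recurrences is to fingerprint each error after stripping the parts that change between occurrences. The sketch below assumes that approach; the normalization rules and the function names are illustrative, not a description of how Error Watch is built.

```python
import hashlib
import re
from collections import Counter
from typing import List

# Volatile tokens that differ between otherwise identical errors (assumed list).
_VOLATILE = [
    (re.compile(r"\b[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}\b"), "<uuid>"),
    (re.compile(r"\b0x[0-9a-f]+\b"), "<hex>"),
    (re.compile(r"\d+"), "<n>"),
]


def fingerprint(message: str) -> str:
    """Hash a log message after replacing volatile tokens with placeholders."""
    normalized = message.strip().lower()
    for pattern, placeholder in _VOLATILE:
        normalized = pattern.sub(placeholder, normalized)
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()[:12]


def group_recurrences(messages: List[str]) -> Counter:
    """Count how often each fingerprint recurs."""
    return Counter(fingerprint(m) for m in messages)


logs = [
    "Timeout after 3000 ms calling order 8812",
    "Timeout after 4500 ms calling order 17",
    "Connection refused",
]
print(group_recurrences(logs))
```

The two timeout lines collapse into one fingerprint, so a recurrence raises a counter instead of a second page.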
resources

Resource Inventory

Live read-only view of ECS services, RDS clusters, and EC2 fleets, with their health, capacity, and recent change events. Keyed by workspace, scoped per IAM role.

module · live
service map

Service Maps

Auto-built dependency graph from traces and SQS/SNS topology. See what depends on the failing service before opening a single dashboard.

module · live
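"What depends on the failing service" is a reverse walk over the dependency graph. A minimal sketch, assuming the graph is already available as a mapping from each service to the services it calls; the edges in the example are made up for illustration.

```python
from collections import deque
from typing import Dict, Set


def dependents_of(failing: str, depends_on: Dict[str, Set[str]]) -> Set[str]:
    """Return every service that directly or transitively depends on `failing`."""
    # Invert the edges: callee -> set of callers.
    callers: Dict[str, Set[str]] = {}
    for service, deps in depends_on.items():
        for dep in deps:
            callers.setdefault(dep, set()).add(service)

    seen: Set[str] = set()
    queue = deque([failing])
    while queue:
        current = queue.popleft()
        for caller in callers.get(current, ()):
            if caller not in seen and caller != failing:
                seen.add(caller)
                queue.append(caller)
    return seen


# Hypothetical edges between services named on this page.
graph = {
    "checkout-web": {"payments-api"},
    "payments-api": {"primary-db"},
    "users-api": {"primary-db"},
    "search-api": set(),
}
print(sorted(dependents_of("primary-db", graph)))
```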
trace

Distributed Trace

Jump from a Slack alert into the exact slow span. Trace timelines align with deploys and metric anomalies on a single x-axis.

module · live
timelines

Incident Timelines

A merged stream of deploys, alarms, scale events, error blips, and manual notes: the timeline you wish every post-mortem already had.

module · live
reviews

Cluster Reviews

Weekly per-service review: top errors, latency drift, deploy churn, noisy alarms, and Radar’s recommendations to quiet them.

module · live
history

Searchable History

Every investigation, query, alert, and verdict is auditable and replayable. Onboarding the next on-call is a link, not a meeting.

module · live
Slack alerts

Alerts that explain themselves.

Every page links the evidence: the metric, the deploy, the dominant error pattern, and whether customers are seeing it. No grepping. No tab archaeology. Just signal.

  • Headline that fits in a glance
  • Threaded follow-ups for related blips
  • Confidence + customer-impact verdict
  • Direct link to the timeline + raw logs
#infra-alerts · radar · 14:22 UTC
Radar (APP) · agent · incident · INC-2041
🛰 Signal detected · payments-api

p99 latency rose to 11.4s (baseline 2.1s) shortly after deploy v3.42.1. Cascading 5xx on users-api via job-events queue (depth ×4.2).

customer-facing · deploy-correlated · confidence · high
threaded

14:25 · auto-rollback completed. Watching for recurrence over next 30m.

14:55 · No recurrence. Marking resolved.

Memory

Radar remembers.
It doesn’t bark twice.

Once a pattern is acknowledged or labeled noise, Radar suppresses recurrences inside cooldown windows and escalates only when severity, frequency, or customer impact changes.

cluster_memory · acme-api · 4 active
  • watching · edge-cdn 502 burst · ×7 · 2h cooldown
  • noise · search-api timeout pattern · ×23 · suppressed
  • follow-up · rds-prod-1 IOPS saturation · ×2 · due 18:00
  • resolved · auth-svc cold-start · ×4
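The suppression rule described above can be sketched as a single check. This is a simplified illustration that covers severity and customer impact only (frequency is left out), and `MemoryEntry`, `should_bark`, and the numeric severity scale are assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class MemoryEntry:
    pattern: str
    severity: int          # higher is worse (hypothetical scale)
    customer_facing: bool
    last_alerted: datetime
    cooldown: timedelta


def should_bark(entry: MemoryEntry, now: datetime,
                severity: int, customer_facing: bool) -> bool:
    """Suppress recurrences inside the cooldown unless the situation got worse."""
    escalated = severity > entry.severity or (
        customer_facing and not entry.customer_facing
    )
    if escalated:
        return True  # escalation always breaks through the cooldown
    return now - entry.last_alerted >= entry.cooldown


# Arbitrary date; a remembered pattern with a hypothetical 2-hour cooldown.
t0 = datetime(2024, 1, 1, 12, 0)
entry = MemoryEntry("edge-cdn 502 burst", 2, False, t0, timedelta(hours=2))
print(should_bark(entry, t0 + timedelta(minutes=30), 2, False))  # same blip, inside cooldown
print(should_bark(entry, t0 + timedelta(minutes=30), 2, True))   # now customer-facing
```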
Correlation

Deploys, alarms, telemetry — linked automatically.

time (UTC) | event | verdict
  • 14:21:08 | deploy payments-api · v3.42.1 | candidate
  • 14:22:14 | p99 latency · 2.1s → 11.4s | anomaly
  • 14:22:31 | SQS depth · ×4.2 | cascade
  • 14:22:44 | users-api 5xx · +820% | customer-facing
  • 14:25:02 | auto-rollback initiated | recovering
  • 14:55:00 | no recurrence · 30m window | resolved
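The first two rows of the timeline show the core idea: a deploy that lands shortly before an anomaly becomes a candidate cause. A rough sketch of that time-window match follows; the 10-minute window, the function name, and the unrelated `search-api` deploy are assumptions added for the example.

```python
from datetime import datetime, timedelta
from typing import List, Tuple

Deploy = Tuple[str, datetime]  # (label, deployed_at)


def correlated_deploys(anomaly_at: datetime, deploys: List[Deploy],
                       window: timedelta = timedelta(minutes=10)) -> List[Deploy]:
    """Return deploys that landed within `window` before the anomaly, newest first."""
    candidates = [
        d for d in deploys
        if timedelta(0) <= anomaly_at - d[1] <= window
    ]
    return sorted(candidates, key=lambda d: d[1], reverse=True)


day = datetime(2024, 1, 1)  # arbitrary date; only the times come from the timeline
deploys = [
    ("search-api v1.9.0", day.replace(hour=11, minute=5)),  # hypothetical, unrelated
    ("payments-api v3.42.1", day.replace(hour=14, minute=21, second=8)),
]
anomaly = day.replace(hour=14, minute=22, second=14)
for label, at in correlated_deploys(anomaly, deploys):
    print(label, "deployed", (anomaly - at).seconds, "seconds before the anomaly")
```

Only the payments-api deploy, 66 seconds before the latency jump, survives the filter.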
Simulator · Demo Mode

Try Radar against a fake outage.

Don’t want to wire AWS just to evaluate? Flip on Demo Mode to spin up a synthetic Acme Robotics workspace — full ECS / RDS / EC2 inventory, CloudWatch logs, deploy history, and an in-process incident simulator that lets you trigger real-feeling outages and watch Radar respond.

  • Synthetic infra: ECS services, RDS clusters, EC2 fleets
  • Replayable scenarios: deploy regression, queue backlog, IOPS saturation
  • Slack output suppressed — nothing leaves the workspace
  • Reset to a clean state in one click
sim · scenarios · demo · acme-robotics
  • SCN-01 · payments-api · deploy regression · ready
  • SCN-02 · users-api · SQS backlog cascade · ready
  • SCN-03 · rds-prod-1 · IOPS saturation · running
  • SCN-04 · auth-svc · cold-start storm · ready
  • SCN-05 · edge-cdn · upstream 502 burst · ready
5 scenarios · 0 side effects · read-only · sandboxed

Encrypted at rest, optional storage

Every log body, agent excerpt, and alert payload we keep is encrypted with AES-256 (pgcrypto). Don't want us to store any of it? One toggle in settings — counts stay, log lines never touch our database.

Read-only AWS, scoped per workspace

Radar requests logs:Filter*, logs:Describe*, cloudwatch:Get*. No write actions. Connections, IAM roles, and queries are isolated per workspace and audit-logged.
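For a sense of what that looks like in IAM terms, here is a sketch that builds a policy document from the three action patterns listed above. The `Sid`, the wildcard `Resource`, and the helper name are assumptions; the policy InfraWatchdog actually asks you to attach may be scoped differently or include additional read actions.

```python
import json
from typing import Dict, List

# The three action patterns named on this page.
READ_ONLY_ACTIONS = ["logs:Filter*", "logs:Describe*", "cloudwatch:Get*"]


def read_only_policy(actions: List[str] = READ_ONLY_ACTIONS) -> Dict:
    """Build an IAM policy document that allows only the given read actions."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "RadarReadOnly",   # hypothetical statement ID
                "Effect": "Allow",
                "Action": list(actions),
                "Resource": "*",          # assumption: narrow this to your log groups
            }
        ],
    }


print(json.dumps(read_only_policy(), indent=2))
```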

Deterministic first

Retrieval is deterministic. AI explains evidence — it never invents log lines or metrics.

Pricing

Hire your Automated SRE. Pay less than a coffee run.

All tiers are read-only by design. Cancel anytime. Annual billing saves 20%.

tier
Free

Bring your own OpenAI key. Kick the tires.

$0 · BYO key
  • Bring your own OpenAI API key
  • You pay OpenAI directly for tokens
  • 1 AWS account · read-only
  • Up to 3 monitored services
  • 7-day signal history
  • Slack alerts · 1 channel
  • Demo Mode + Simulator
  • Community support
tier
Solo

One operator, one account, full visibility.

$25 / seat / mo
  • BYO key, or +$14 for managed AI
  • 1 AWS account
  • Up to 10 services · 14-day history
  • 250 investigations / seat / month
  • Email + Slack alerts · weekly cluster reviews
  • 30-day audit log · 1 status page
  • Email support
most teams
tier
Team

On-call rotations sharing context.

$59 / seat / mo (3-seat minimum)
  • AI included · or BYO key
  • Up to 3 AWS accounts · 40 services
  • 30-day signal history
  • 500 investigations / seat / month
  • Slack + PagerDuty + webhooks
  • Daily cluster reviews · Google SSO
  • 90-day audit log · 3 status pages
  • Priority email support
tier
Business

Multi-account orgs needing audit + SSO.

$119 / seat / mo (5-seat minimum)
  • Up to 10 AWS accounts · 150 services
  • 90-day signal history
  • 1,000 investigations / seat / month
  • SAML SSO · 1-year audit log
  • Unlimited status pages
  • Dedicated Slack channel
All plans include: Investigations · Logs Search · Resource Inventory · Service Maps · Demo Mode
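To estimate a bill, multiply the seat price by the number of billable seats and apply the annual discount if you pay yearly. The helper below is a back-of-the-envelope sketch: it assumes "(3 min)" and "(5 min)" are seat minimums and that the 20% annual saving applies to the per-seat price, neither of which this page spells out.

```python
def monthly_cost(price_per_seat: float, seats: int, min_seats: int = 1,
                 annual: bool = False, annual_discount: float = 0.20) -> float:
    """Estimated monthly cost in dollars, under the assumptions stated above."""
    billable = max(seats, min_seats)  # seats below the minimum still bill at the minimum
    total = price_per_seat * billable
    if annual:
        total *= 1 - annual_discount
    return round(total, 2)


print(monthly_cost(59, seats=2, min_seats=3))                # Team, billed monthly
print(monthly_cost(59, seats=2, min_seats=3, annual=True))   # Team, billed annually
print(monthly_cost(119, seats=5, min_seats=5, annual=True))  # Business, billed annually
```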
early access · limited

Put your Automated SRE on shift.

Connect AWS read-only. Point InfraWatchdog at the services that matter. Stop being the first responder for every CloudWatch hiccup.