Meet Radar — the watchdog on shift for your AWS. Ears up 24/7, sniffs out real incidents, ignores the squirrels (known noise), correlates deploys, and only barks when production actually needs a human.

The same 3 a.m. PagerDuty alert. Two timelines. The difference is whether your on-call engineer is reading evidence or hunting for it.
Radar is the brain. These are the senses, the muscle memory, and the comms layer it runs on. Every one is a real, working surface — click any tile to see exactly what you'll get.
An autonomous agent triages every incident.
Follow one ID across every log group in a service.
Ask CloudWatch in English. Get answers, not queries.
Group log groups into the services they actually belong to.
Periodic AI health reports across ECS, RDS, EC2, Atlas.
Continuous baseline-aware error pattern detection.
Webhook your deploys. Incidents get linked automatically.
Pages with the headline, the evidence, and a verdict.
Customer-facing incident comms on your own slug.
Realistic AWS chaos for safe demos and runbook drills.
Every 60 seconds Radar samples error rates, latency, deploy events, and stack telemetry across every connected service.
It cross-references against the 6h baseline, recent deploys, prior noise lessons, and customer-facing impact signals.
If the gates fire, Radar pages Slack with the headline, the evidence, and a one-paragraph explanation in plain English.
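As a rough illustration of the three steps above, the sketch below compares one 60-second sample against the median of a baseline window and returns the reasons a page would be justified. It is only a sketch under stated assumptions: the `Sample` type, the `evaluate_gates` function, the 3x ratio, and the minimum error-rate floor are all hypothetical and are not Radar's actual gates.

```python
from dataclasses import dataclass
from statistics import median


@dataclass(frozen=True)
class Sample:
    error_rate: float      # errors per minute in one 60-second sample
    p99_latency_s: float   # p99 latency in seconds


def evaluate_gates(current, baseline, recent_deploy, known_noise,
                   ratio=3.0, min_error_rate=1.0):
    """Return a list of reasons to page; an empty list means stay quiet.

    All thresholds here are illustrative assumptions.
    """
    # Patterns already labeled as noise never page; no baseline, no verdict.
    if known_noise or not baseline:
        return []

    base_err = median(s.error_rate for s in baseline)
    base_lat = median(s.p99_latency_s for s in baseline)

    reasons = []
    # The floor keeps a near-zero baseline from turning one error into a page.
    if current.error_rate > ratio * max(base_err, min_error_rate):
        reasons.append(
            f"error rate {current.error_rate:.1f}/min vs baseline {base_err:.1f}/min"
        )
    if base_lat > 0 and current.p99_latency_s > ratio * base_lat:
        reasons.append(
            f"p99 latency {current.p99_latency_s:.1f}s vs baseline {base_lat:.1f}s"
        )
    # A deploy is supporting evidence, not a reason to page on its own.
    if reasons and recent_deploy:
        reasons.append("a deploy landed shortly before the anomaly")
    return reasons
```

Calling `evaluate_gates(Sample(0.5, 11.4), [Sample(0.5, 2.1)] * 6, recent_deploy=True, known_noise=False)` returns two reasons: the latency breach and the deploy correlation. The same call with `known_noise=True` returns an empty list.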
Everything a senior SRE does on call — watch dashboards, grep logs, correlate deploys, write the post-mortem — InfraWatchdog does continuously. Read-only AWS access, deterministic retrieval, AI on top.
Ask Radar a question — “why did users-api 5xx spike at 14:22?” It pulls logs, metrics, deploys, and traces, then writes a timeline with evidence and a verdict.
CloudWatch Logs Insights without the syntax tax. Natural-language queries, saved patterns, multi-log-group fan-out, and result diffing across time windows.
Continuously scans every connected log group for new error fingerprints, ranks them by blast radius, and groups recurrences so the same stack trace doesn’t page twice.
Live read-only view of ECS services, RDS clusters, and EC2 fleets, with their health, capacity, and recent change events. Keyed by workspace, scoped per IAM role.
Auto-built dependency graph from traces and SQS/SNS topology. See what depends on the failing service before opening a single dashboard.
Jump from a Slack alert into the exact slow span. Trace timelines align with deploys and metric anomalies on a single x-axis.
A merged stream of deploys, alarms, scale events, error blips, and manual notes — the runbook you wished post-mortems already had.
Weekly per-service review: top errors, latency drift, deploy churn, noisy alarms, and Radar’s recommendations to quiet them.
Every investigation, query, alert, and verdict is auditable and replayable. Onboarding the next on-call is a link, not a meeting.
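The error-fingerprint grouping described above can be pictured with a small sketch: strip the parts of a log line that vary between occurrences (IDs, numbers), hash what remains, and count recurrences per hash. The normalization rules and the `fingerprint` and `group_recurrences` helpers are assumptions made for illustration, not Radar's actual algorithm, and blast-radius ranking is left out.

```python
import hashlib
import re
from collections import Counter

# Illustrative normalization rules; a real system would likely need more.
_UUID = re.compile(
    r"\b[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}\b", re.I
)
_HEX = re.compile(r"\b0x[0-9a-f]+\b", re.I)
_NUM = re.compile(r"\d+")


def fingerprint(line):
    """Return a short, stable ID for the shape of a log line."""
    shape = _UUID.sub("<uuid>", line)   # replace UUIDs before bare digits
    shape = _HEX.sub("<hex>", shape)
    shape = _NUM.sub("<n>", shape)
    shape = " ".join(shape.split())     # collapse runs of whitespace
    return hashlib.sha1(shape.encode("utf-8")).hexdigest()[:12]


def group_recurrences(lines):
    """Count how often each fingerprint occurs across the given lines."""
    return Counter(fingerprint(line) for line in lines)
```

With this approach, "Timeout after 3000 ms for user 42" and "Timeout after 4500 ms for user 7" collapse into one pattern, so a recurrence bumps a counter instead of creating a second alert.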
Every page links the evidence: the metric, the deploy, the dominant error pattern, and whether customers are seeing it. No grepping. No tab archaeology. Just signal.
p99 latency rose to 11.4s (baseline 2.1s) shortly after deploy v3.42.1. Cascading 5xx on users-api via job-events queue (depth ×4.2).
14:25 · auto-rollback completed. Watching for recurrence over next 30m.
14:55 · No recurrence. Marking resolved.
Once a pattern is acknowledged or labeled noise, Radar suppresses recurrences inside cooldown windows and escalates only when severity, frequency, or customer impact changes.
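One way to picture this rule is a check that runs each time a known pattern recurs. The sketch below is an assumption-laden illustration: `PatternState`, `should_escalate`, the 30-minute cooldown, and the 2x frequency factor are hypothetical, and letting a pattern page again once the cooldown has expired is a design choice of this sketch, not something the text confirms.

```python
from dataclasses import dataclass


@dataclass
class PatternState:
    last_paged_at: float     # Unix timestamp of the last page for this pattern
    severity: int            # higher means worse
    events_per_min: float    # frequency observed at the last page
    customer_impact: bool    # whether customers were affected at the last page


def should_escalate(prev, now, severity, events_per_min, customer_impact,
                    cooldown_s=1800.0, freq_factor=2.0):
    """Decide whether a recurring pattern deserves a new page.

    prev is None for a pattern that has never paged before.
    """
    if prev is None:
        return True
    # Inside or outside the cooldown, a worsening incident always escalates.
    if severity > prev.severity:
        return True
    if customer_impact and not prev.customer_impact:
        return True
    if events_per_min > freq_factor * prev.events_per_min:
        return True
    # Otherwise stay quiet until the cooldown window has passed (assumption).
    return now - prev.last_paged_at >= cooldown_s
```

The ordering matters: the worsening checks run before the cooldown check, so an acknowledged pattern stays quiet only while it stays the same.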
Don’t want to wire AWS just to evaluate? Flip on Demo Mode to spin up a synthetic Acme Robotics workspace — full ECS / RDS / EC2 inventory, CloudWatch logs, deploy history, and an in-process incident simulator that lets you trigger real-feeling outages and watch Radar respond.
Every log body, agent excerpt, and alert payload we keep is encrypted with AES-256 (pgcrypto). Don't want us to store any of it? One toggle in settings — counts stay, log lines never touch our database.
Radar requests logs:Filter*, logs:Describe*, cloudwatch:Get*. No write actions. Connections, IAM roles, and queries are isolated per workspace and audit-logged.
Retrieval is deterministic. AI explains evidence — it never invents log lines or metrics.
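For readers who want to see what the read-only permissions listed above look like as an IAM policy document, here is a minimal sketch that builds one in Python. The three action patterns come from the text; the `RadarReadOnly` statement ID, the `"*"` resource, and the `build_read_only_policy` helper are assumptions for illustration, and the policy Radar actually asks you to attach may differ.

```python
import json

# Action patterns taken from the text above; everything else is illustrative.
READ_ONLY_ACTIONS = ["logs:Filter*", "logs:Describe*", "cloudwatch:Get*"]


def build_read_only_policy(actions=READ_ONLY_ACTIONS, resource="*"):
    """Return an IAM policy document that allows only the given read actions."""
    return {
        "Version": "2012-10-17",          # current IAM policy language version
        "Statement": [
            {
                "Sid": "RadarReadOnly",   # hypothetical statement ID
                "Effect": "Allow",
                "Action": list(actions),
                "Resource": resource,     # assumption; narrow this in practice
            }
        ],
    }


def render_policy(policy):
    """Serialize the policy as JSON, ready to paste into the IAM console."""
    return json.dumps(policy, indent=2)
```

Running `print(render_policy(build_read_only_policy()))` prints the JSON document. Narrowing `resource` to specific log group ARNs tightens the scope further.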
All tiers are read-only by design. Cancel anytime. Annual billing saves 20%.
Bring your own OpenAI key. Kick the tires.
One operator, one account, full visibility.
On-call rotations sharing context.