Production Hardening for EKS Observability (Part 5 of 5)

The final layer: seven production alert rules in Amazon Managed Prometheus, SNS and PagerDuty routing, fully private clusters via VPC endpoints, and a go-live checklist.

Dr Salek Ali, 8 June 2026

This is the final part of my five-part series on the eks-observability-stack. Part 1 was the architecture, Part 2 metrics, Part 3 logging, and Part 4 tracing. You now have signals flowing. This post is about turning that into something you can actually run in production: alerts that fire, routing that reaches a human, a fully private deployment option, and the checklist I use before signing off.

The code is in modules/alerting/ and modules/vpc_endpoints/.

Alerting: Collecting Signals Is Not the Point

Dashboards you have to remember to look at are not monitoring. Monitoring is when the system tells you something is wrong. The alerting module ships seven production-ready Prometheus rules, defined as a single rule group in rules/alerts.yaml:

Alert               Severity  Fires when                           For
HighErrorRate       critical  5xx responses exceed 5% of requests  5m
ContainerOOMKilled  critical  a container is OOMKilled             immediate
PodCrashLooping     warning   container restart rate > 0           5m
NodeMemoryPressure  warning   node memory usage > 90%              10m
HighLatency         warning   p95 latency > 500ms                  10m
NodeDiskPressure    warning   root filesystem > 85% used           5m
KubePodNotReady     warning   a pod is not ready                   5m

The thresholds are opinionated on purpose. A few worth explaining:

  • HighErrorRate at 5% for 5 minutes, not “any 5xx”. A single 500 during a deploy is noise. Five percent sustained for five minutes is an incident.
  • ContainerOOMKilled fires immediately (no for window). An OOM kill is a discrete event, not a trend. You want to know the moment it happens, because it usually means a memory limit is wrong or there’s a leak.
  • The for durations exist to kill alert flapping. A node briefly touching 90% memory during a GC pause should not page anyone. Ten minutes of sustained pressure should.

These are a starting point, not gospel. Tune them to your workload, but they’ll catch the things that actually take clusters down.

How Rules Get Into Amazon Managed Prometheus

The rules are injected into your AMP workspace as a rule group namespace:

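resource "aws_prometheus_rule_group_namespace" "alerting_rules" {
  # Create the rule group only when an AMP workspace has been wired in.
  count = var.amp_workspace_id != null ? 1 : 0

  name         = "eks-observability-alerting-rules"
  workspace_id = var.amp_workspace_id
  # Load the rule definitions verbatim from the module's rules directory.
  data         = file("${path.module}/rules/alerts.yaml")
}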

The file() call loads the YAML as-is, and the root module wires amp_workspace_id straight from the metrics module (Part 2). If you enabled metrics, alerting plugs in with no extra wiring. AMP evaluates the rules server-side and provides a managed Alertmanager, so there’s no ruler or Alertmanager pod to run, scale, or patch.
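To give a sense of what a rule in rules/alerts.yaml might look like, here is a minimal sketch of the same resource with one rule inlined as a heredoc instead of loaded with file(). The resource name, the namespace name, and the metric http_requests_total with its status label are assumptions for illustration; the module's actual expressions may differ.

```hcl
# Illustrative only: the module itself loads rules/alerts.yaml via file().
resource "aws_prometheus_rule_group_namespace" "example" {
  name         = "example-alerting-rules" # hypothetical name
  workspace_id = var.amp_workspace_id

  data = <<-EOT
    groups:
      - name: example
        rules:
          - alert: HighErrorRate
            # Assumed metric and label names; substitute whatever your apps export.
            expr: |
              sum(rate(http_requests_total{status=~"5.."}[5m]))
                / sum(rate(http_requests_total[5m])) > 0.05
            # The condition must hold for 5 minutes before the alert fires.
            for: 5m
            labels:
              severity: critical
            annotations:
              summary: "5xx responses exceed 5% of requests"
  EOT
}
```

An immediate alert such as ContainerOOMKilled would simply omit the for field, so the alert fires on the first evaluation where the expression matches.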

Routing: From a Firing Rule to a Human

A firing rule is useless if it lands in a void. The module creates an SNS topic and subscribes your endpoints:

resource "aws_sns_topic" "alerts" {
  name = "${local.name_prefix}-alerts"
}

resource "aws_sns_topic_subscription" "email" {
  for_each  = toset(var.alert_email_endpoints)
  topic_arn = aws_sns_topic.alerts.arn
  protocol  = "email"
  endpoint  = each.value
}

Email subscriptions work out of the box: pass a list to alert_email_endpoints. For real on-call, there’s optional PagerDuty:

resource "aws_sns_topic_subscription" "pagerduty" {
  count                = var.enable_pagerduty ? 1 : 0
  topic_arn            = aws_sns_topic.alerts.arn
  protocol             = "https"
  endpoint             = "https://events.pagerduty.com/integration/${var.pagerduty_integration_key}/enqueue"
  raw_message_delivery = false
}

Set enable_pagerduty = true and drop in your integration key, and alerts route to your existing escalation policy. The key is a sensitive variable, so it stays out of plan output; it is still written to Terraform state as part of the endpoint URL, so protect your state backend accordingly. raw_message_delivery = false is deliberate: PagerDuty wants the full SNS JSON envelope, not just the message body.
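The snippets above cover the SNS side. For a firing rule to reach the topic, AMP's managed Alertmanager also needs a definition that names the topic as a receiver, and the topic needs a policy that lets the AMP service publish to it. The following is a minimal sketch of that link, assuming it is not already handled elsewhere in the module; the resource names and var.aws_region are hypothetical.

```hcl
# Allow the Amazon Managed Prometheus service to publish to the alerts topic.
data "aws_iam_policy_document" "amp_publish" {
  statement {
    actions   = ["sns:Publish", "sns:GetTopicAttributes"]
    resources = [aws_sns_topic.alerts.arn]

    principals {
      type        = "Service"
      identifiers = ["aps.amazonaws.com"]
    }
  }
}

resource "aws_sns_topic_policy" "amp_publish" {
  arn    = aws_sns_topic.alerts.arn
  policy = data.aws_iam_policy_document.amp_publish.json
}

# Point the managed Alertmanager at the topic with a single catch-all route.
resource "aws_prometheus_alert_manager_definition" "sns" {
  workspace_id = var.amp_workspace_id

  definition = <<-EOT
    alertmanager_config: |
      route:
        receiver: default
      receivers:
        - name: default
          sns_configs:
            - topic_arn: ${aws_sns_topic.alerts.arn}
              sigv4:
                region: ${var.aws_region}
  EOT
}
```

A single default receiver keeps the sketch short. Routing critical and warning alerts to different topics would mean adding child routes that match on the severity label.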

Running the Whole Thing in a Private Cluster

This is the part that matters for regulated environments. Set private_cluster_mode = true and the stack deploys VPC endpoints so the entire observability pipeline works with no internet access and no NAT gateway:

module "vpc_endpoints" {
  source = "./modules/vpc_endpoints"
  count  = var.private_cluster_mode ? 1 : 0

  vpc_id          = var.vpc_id
  enable_metrics  = var.enable_metrics
  enable_logging  = var.enable_logging
  enable_tracing  = var.enable_tracing
  logging_backend = var.logging_backend
}

It creates only the endpoints your enabled features actually need:

  • Always: STS (for IRSA), EC2, and an S3 gateway endpoint.
  • With metrics: aps-workspaces (AMP) and monitoring (CloudWatch).
  • With logging: logs (CloudWatch Logs), plus es if you chose the OpenSearch backend.
  • With tracing: xray.

Every interface endpoint gets private DNS and a security group locked to the VPC CIDR. The result: metrics reach AMP, logs reach CloudWatch or OpenSearch, and traces reach X-Ray, all over private AWS networking. Nothing traverses the public internet. For government, financial services, and anyone working to an ISM-style baseline, this is the difference between “we can use it” and “we can’t.”
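As a rough picture of what "private DNS and a security group locked to the VPC CIDR" means in Terraform, here is a minimal sketch of an interface endpoint set. It is not the module's code: var.aws_region and var.private_subnet_ids are hypothetical inputs, and the service list is hard-coded here, whereas the module derives it from the enabled features.

```hcl
# Look up the VPC so the security group can reference its CIDR block.
data "aws_vpc" "this" {
  id = var.vpc_id
}

# Accept HTTPS from inside the VPC only.
resource "aws_security_group" "endpoints" {
  name_prefix = "observability-endpoints-" # hypothetical name
  vpc_id      = var.vpc_id

  ingress {
    description = "HTTPS from within the VPC"
    from_port   = 443
    to_port     = 443
    protocol    = "tcp"
    cidr_blocks = [data.aws_vpc.this.cidr_block]
  }
}

# One interface endpoint per service, each with private DNS enabled so the
# standard AWS service hostnames resolve to private IPs inside the VPC.
resource "aws_vpc_endpoint" "interface" {
  for_each = toset(["sts", "aps-workspaces", "logs", "xray"])

  vpc_id              = var.vpc_id
  service_name        = "com.amazonaws.${var.aws_region}.${each.key}"
  vpc_endpoint_type   = "Interface"
  subnet_ids          = var.private_subnet_ids # hypothetical input
  security_group_ids  = [aws_security_group.endpoints.id]
  private_dns_enabled = true
}
```

With private DNS enabled, the agents need no endpoint-specific configuration: they keep calling the regular service hostnames and the traffic stays on the endpoint.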

IRSA, Everywhere, by Default

I’ve said this in every part of the series because it’s the thing most home-grown stacks get wrong. Every workload in this stack (Fluent Bit, the OTel Collector, AMP ingestion, Grafana) authenticates through IAM Roles for Service Accounts, each scoped to exactly what it needs and bound by trust policy to a single Kubernetes service account in a single namespace. There are no AWS access keys anywhere in the cluster. If you take one idea from this series into your own setup, take that one.
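The phrase "bound by trust policy to a single Kubernetes service account in a single namespace" can be sketched as follows. This is a minimal illustration of the usual IRSA trust pattern, not the module's code; all four variables and the role name are hypothetical, and var.oidc_provider_url is assumed to be the cluster's OIDC issuer without the https:// prefix.

```hcl
# Trust policy: only one service account in one namespace may assume the role.
data "aws_iam_policy_document" "irsa_trust" {
  statement {
    actions = ["sts:AssumeRoleWithWebIdentity"]

    principals {
      type        = "Federated"
      identifiers = [var.oidc_provider_arn]
    }

    # Pin the token subject to exactly one namespace and service account.
    condition {
      test     = "StringEquals"
      variable = "${var.oidc_provider_url}:sub"
      values   = ["system:serviceaccount:${var.namespace}:${var.service_account}"]
    }

    # Require the token to have been issued for STS.
    condition {
      test     = "StringEquals"
      variable = "${var.oidc_provider_url}:aud"
      values   = ["sts.amazonaws.com"]
    }
  }
}

resource "aws_iam_role" "workload" {
  name               = "example-irsa-role" # hypothetical name
  assume_role_policy = data.aws_iam_policy_document.irsa_trust.json
}
```

Using StringEquals rather than StringLike on the sub claim is what keeps the binding to a single service account; a wildcard there would quietly widen the trust.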

Validate Before You Trust It

The stack ships a sample app (enable_sample_app = true) that emits metrics in the format the dashboards expect, writes logs, and generates traces. Deploy it first. It’s a smoke test for the whole pipeline: if the sample app shows up in Grafana, its logs land in your backend, and its traces appear in X-Ray, you know every layer is wired correctly before you point real workloads at it.

Optional Add-On: Service Mesh

If you want service-to-service mTLS and L7 traffic visibility on top of the observability stack, there’s an optional service_mesh module (enable_service_mesh = true, off by default). It deploys Istio via the managed EKS add-ons (aws-istio and aws-istio-cni) and enforces mesh-wide STRICT mTLS through a PeerAuthentication policy, with sidecar injection enabled per namespace. It isn’t part of the core five-layer observability stack and its telemetry contribution is passive (the sidecars emit metrics and traces automatically once the metrics and tracing modules are running), so treat it as a security/networking add-on you reach for when you specifically need encrypted service-to-service traffic, not a required piece.
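For reference, a mesh-wide STRICT mTLS policy in Istio is a PeerAuthentication named default in the mesh root namespace. A minimal sketch using the Kubernetes provider's kubernetes_manifest resource follows; it assumes the root namespace is istio-system and that the Istio CRDs are already installed when Terraform plans, and it may not match how the service_mesh module applies the policy.

```hcl
# Mesh-wide policy: a PeerAuthentication named "default" in the root namespace
# applies to every workload in the mesh.
resource "kubernetes_manifest" "mesh_strict_mtls" {
  manifest = {
    apiVersion = "security.istio.io/v1beta1"
    kind       = "PeerAuthentication"

    metadata = {
      name      = "default"
      namespace = "istio-system" # assumed root namespace
    }

    spec = {
      mtls = {
        mode = "STRICT" # reject plaintext service-to-service traffic
      }
    }
  }
}
```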

The Go-Live Checklist

Before I call any deployment of this stack production-ready, I run through this:

  1. Sample app validated end to end (metrics in Grafana, logs in the backend, traces in X-Ray).
  2. Alert routing tested with a real fire, not just a green plan. Trigger a test alert and confirm it reaches email/PagerDuty.
  3. IRSA roles reviewed for least privilege. No role does more than its one job.
  4. Retention set for your compliance needs (CloudWatch log_retention_days, AMP retention).
  5. Private endpoints confirmed resolving if you’re in private_cluster_mode (no accidental NAT egress).
  6. Cost baseline understood. AMP, AMG, X-Ray, and your log backend each have a meter. Know roughly what a normal month looks like so an anomaly stands out.
  7. Thresholds tuned to your traffic. The defaults are sensible; your workload is specific.
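One way to run the "real fire" in item 2 is a temporary rule whose expression is always true, which exercises the whole path from rule evaluation to the inbox or pager. The sketch below assumes the workspace's Alertmanager is already routing to the SNS topic; the resource, group, and alert names are hypothetical.

```hcl
# Temporary smoke test: vector(1) always returns a sample, so the alert fires
# on the first evaluation. Remove this resource once delivery is confirmed.
resource "aws_prometheus_rule_group_namespace" "routing_smoke_test" {
  name         = "routing-smoke-test" # hypothetical name
  workspace_id = var.amp_workspace_id

  data = <<-EOT
    groups:
      - name: routing-smoke-test
        rules:
          - alert: RoutingSmokeTest
            expr: vector(1)
            labels:
              severity: warning
            annotations:
              summary: "Test alert to verify routing. Safe to ignore."
  EOT
}
```

Apply it, wait for the notification to arrive on every subscribed endpoint, then delete the resource and apply again so the test alert resolves.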

That’s the Series

Five parts, one stack:

  1. Architecture and design decisions
  2. Metrics with Amazon Managed Prometheus + Grafana
  3. Logging with Fluent Bit
  4. Tracing with OpenTelemetry + X-Ray
  5. Production hardening (this post)

It’s all open source, Apache 2.0, on GitHub. Metrics, logs, traces, alerting, private-cluster support, all AWS-managed, all IRSA-scoped, no static credentials. It’s running in production.

If you want this deployed on your EKS clusters, configured for your compliance requirements, and integrated with your existing infrastructure, book a free 30-minute discovery call. I’ll scope what you need and give you a straight answer on timeline and cost.