Production Hardening for EKS Observability (Part 5 of 5)
The final layer: seven production alert rules in Amazon Managed Prometheus, SNS and PagerDuty routing, fully private clusters via VPC endpoints, and a go-live checklist.
This is the final part of my five-part series on the eks-observability-stack. Part 1 was the architecture, Part 2 metrics, Part 3 logging, and Part 4 tracing. You now have signals flowing. This post is about turning that into something you can actually run in production: alerts that fire, routing that reaches a human, a fully private deployment option, and the checklist I use before signing off.
The code is in modules/alerting/ and modules/vpc_endpoints/.
Alerting: Collecting Signals Is Not the Point
Dashboards you have to remember to look at are not monitoring. Monitoring is when the system tells you something is wrong. The alerting module ships seven production-ready Prometheus rules, defined as a single rule group in rules/alerts.yaml:
| Alert | Severity | Fires when | For |
|---|---|---|---|
| HighErrorRate | critical | 5xx responses exceed 5% of requests | 5m |
| ContainerOOMKilled | critical | a container is OOMKilled | immediate |
| PodCrashLooping | warning | container restart rate > 0 | 5m |
| NodeMemoryPressure | warning | node memory usage > 90% | 10m |
| HighLatency | warning | p95 latency > 500ms | 10m |
| NodeDiskPressure | warning | root filesystem > 85% used | 5m |
| KubePodNotReady | warning | a pod is not ready | 5m |
The thresholds are opinionated on purpose. A few worth explaining:
- HighErrorRate at 5% for 5 minutes, not “any 5xx”. A single 500 during a deploy is noise. Five percent sustained for five minutes is an incident.
- ContainerOOMKilled fires immediately (no `for` window). An OOM kill is a discrete event, not a trend. You want to know the moment it happens, because it usually means a memory limit is wrong or there’s a leak.
- The `for` durations exist to kill alert flapping. A node briefly touching 90% memory during a GC pause should not page anyone. Ten minutes of sustained pressure should.
These are a starting point, not gospel. Tune them to your workload, but they’ll catch the things that actually take clusters down.
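To make the table concrete, here is a rough sketch of how two of those rules might be written. It is not the contents of rules/alerts.yaml, which this post does not reproduce. The metric names (`http_requests_total` with a `status` label, and the node_exporter memory gauges) and the local name `example_alert_rules` are assumptions for illustration only.

```hcl
# Illustrative only: metric names below are assumptions, not the shipped rules.
locals {
  example_alert_rules = <<-YAML
    groups:
      - name: example-alerts
        rules:
          - alert: HighErrorRate
            # Assumes an http_requests_total counter with a "status" label.
            expr: |
              sum(rate(http_requests_total{status=~"5.."}[5m]))
                / sum(rate(http_requests_total[5m])) > 0.05
            for: 5m
            labels:
              severity: critical
          - alert: NodeMemoryPressure
            # Assumes node_exporter memory metrics are being scraped.
            expr: |
              (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) > 0.90
            for: 10m
            labels:
              severity: warning
  YAML
}
```

The `for` field is what separates a blip from an incident: the expression has to stay true for the whole window before the alert moves from pending to firing.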
How Rules Get Into Amazon Managed Prometheus
The rules are injected into your AMP workspace as a rule group namespace:
resource "aws_prometheus_rule_group_namespace" "alerting_rules" {
  count        = var.amp_workspace_id != null ? 1 : 0
  name         = "eks-observability-alerting-rules"
  workspace_id = var.amp_workspace_id
  data         = file("${path.module}/rules/alerts.yaml")
}
The file() call loads the YAML as-is, and the root module wires amp_workspace_id straight from the metrics module (Part 2). If you enabled metrics, alerting plugs in with no extra wiring. AMP evaluates the rules server-side, so there’s no Alertmanager pod to run, scale, or patch.
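As a rough picture of that wiring, the root module might look something like the sketch below. The `modules/metrics` path, the `amp_workspace_id` output name, and the example email address are assumptions, not taken from the repository.

```hcl
# Hypothetical root-module wiring; module path and output name are assumed.
module "metrics" {
  source = "./modules/metrics"
  count  = var.enable_metrics ? 1 : 0
}

module "alerting" {
  source = "./modules/alerting"

  # one() returns null when the metrics module is disabled (count = 0),
  # which in turn switches off the rule group via its count guard.
  amp_workspace_id      = one(module.metrics[*].amp_workspace_id)
  alert_email_endpoints = ["oncall@example.com"]
}
```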
Routing: From a Firing Rule to a Human
A firing rule is useless if it lands in a void. The module creates an SNS topic and subscribes your endpoints:
resource "aws_sns_topic" "alerts" {
  name = "${local.name_prefix}-alerts"
}

resource "aws_sns_topic_subscription" "email" {
  for_each  = toset(var.alert_email_endpoints)
  topic_arn = aws_sns_topic.alerts.arn
  protocol  = "email"
  endpoint  = each.value
}
Email subscriptions work out of the box: pass a list to alert_email_endpoints. For real on-call, there’s optional PagerDuty:
resource "aws_sns_topic_subscription" "pagerduty" {
  count                = var.enable_pagerduty ? 1 : 0
  topic_arn            = aws_sns_topic.alerts.arn
  protocol             = "https"
  endpoint             = "https://events.pagerduty.com/integration/${var.pagerduty_integration_key}/enqueue"
  raw_message_delivery = false
}
Set enable_pagerduty = true, drop in your integration key (it’s a sensitive variable, so Terraform redacts it from plan output, although it is still written to state), and alerts route to your existing escalation policy. raw_message_delivery = false is deliberate: PagerDuty wants the full SNS JSON envelope, not just the message body.
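The snippets above cover the topic and its subscribers, but not the hop from AMP to the topic. With AMP that hop generally needs two things: an Alertmanager definition that names the topic as a receiver, and a topic policy that lets the AMP service publish to it. The sketch below shows one way that could look. Whether the module does it exactly this way is not shown in this post, and `var.aws_region` and `var.amp_workspace_arn` are hypothetical inputs.

```hcl
# Hypothetical inputs, not confirmed by the module shown in this post.
variable "aws_region" { type = string }
variable "amp_workspace_arn" { type = string }

data "aws_caller_identity" "current" {}

# Let the AMP service publish to the alerts topic, scoped to this
# account and this workspace.
data "aws_iam_policy_document" "amp_publish" {
  statement {
    effect    = "Allow"
    actions   = ["sns:Publish", "sns:GetTopicAttributes"]
    resources = [aws_sns_topic.alerts.arn]

    principals {
      type        = "Service"
      identifiers = ["aps.amazonaws.com"]
    }

    condition {
      test     = "StringEquals"
      variable = "aws:SourceAccount"
      values   = [data.aws_caller_identity.current.account_id]
    }
    condition {
      test     = "ArnEquals"
      variable = "aws:SourceArn"
      values   = [var.amp_workspace_arn]
    }
  }
}

resource "aws_sns_topic_policy" "amp_publish" {
  arn    = aws_sns_topic.alerts.arn
  policy = data.aws_iam_policy_document.amp_publish.json
}

# Managed Alertmanager config: send every alert to the SNS receiver.
resource "aws_prometheus_alert_manager_definition" "sns" {
  workspace_id = var.amp_workspace_id
  definition   = <<-YAML
    alertmanager_config: |
      route:
        receiver: sns
      receivers:
        - name: sns
          sns_configs:
            - topic_arn: ${aws_sns_topic.alerts.arn}
              sigv4:
                region: ${var.aws_region}
  YAML
}
```

A single catch-all route keeps the sketch short. Splitting critical and warning alerts across receivers would be a natural next step once the basic path is proven.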
Running the Whole Thing in a Private Cluster
This is the part that matters for regulated environments. Set private_cluster_mode = true and the stack deploys VPC endpoints so the entire observability pipeline works with no internet access and no NAT gateway:
module "vpc_endpoints" {
  source          = "./modules/vpc_endpoints"
  count           = var.private_cluster_mode ? 1 : 0
  vpc_id          = var.vpc_id
  enable_metrics  = var.enable_metrics
  enable_logging  = var.enable_logging
  enable_tracing  = var.enable_tracing
  logging_backend = var.logging_backend
}
It creates only the endpoints your enabled features actually need:
- Always: STS (for IRSA), EC2, and an S3 gateway endpoint.
- With metrics: `aps-workspaces` (AMP) and `monitoring` (CloudWatch).
- With logging: `logs` (CloudWatch Logs), plus `es` if you chose the OpenSearch backend.
- With tracing: `xray`.
Every interface endpoint gets private DNS and a security group locked to the VPC CIDR. The result: metrics reach AMP, logs reach CloudWatch or OpenSearch, and traces reach X-Ray, all over private AWS networking. Nothing traverses the public internet. For government, financial services, and anyone working to an ISM-style baseline, this is the difference between “we can use it” and “we can’t.”
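A minimal sketch of the interface-endpoint pattern described above might look like the following. It is not the module's actual source: `var.aws_region`, `var.vpc_cidr`, and `var.private_subnet_ids` are hypothetical inputs, the `"opensearch"` backend value is assumed, and the S3 gateway endpoint is left out for brevity.

```hcl
# Hypothetical inputs; the real module's variables may differ.
variable "aws_region" { type = string }
variable "vpc_cidr" { type = string }
variable "private_subnet_ids" { type = list(string) }

locals {
  # Service short names from the list above, selected by feature flag.
  # The "opensearch" value for logging_backend is an assumption.
  interface_services = toset(concat(
    ["sts", "ec2"],
    var.enable_metrics ? ["aps-workspaces", "monitoring"] : [],
    var.enable_logging ? ["logs"] : [],
    var.enable_logging && var.logging_backend == "opensearch" ? ["es"] : [],
    var.enable_tracing ? ["xray"] : []
  ))
}

resource "aws_security_group" "endpoints" {
  name_prefix = "obs-vpce-"
  vpc_id      = var.vpc_id

  ingress {
    description = "HTTPS from inside the VPC only"
    from_port   = 443
    to_port     = 443
    protocol    = "tcp"
    cidr_blocks = [var.vpc_cidr]
  }
}

# One interface endpoint per required service, with private DNS so the
# standard AWS hostnames resolve to addresses inside the VPC.
resource "aws_vpc_endpoint" "interface" {
  for_each = local.interface_services

  vpc_id              = var.vpc_id
  service_name        = "com.amazonaws.${var.aws_region}.${each.value}"
  vpc_endpoint_type   = "Interface"
  subnet_ids          = var.private_subnet_ids
  security_group_ids  = [aws_security_group.endpoints.id]
  private_dns_enabled = true
}
```

Driving the set from the feature flags means disabling a feature also removes its endpoints on the next apply, so you are not paying for endpoints nothing uses.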
IRSA, Everywhere, by Default
I’ve said this in every part of the series because it’s the thing most home-grown stacks get wrong. Every workload in this stack (Fluent Bit, the OTel Collector, AMP ingestion, Grafana) authenticates through IAM Roles for Service Accounts, each scoped to exactly what it needs and bound by trust policy to a single Kubernetes service account in a single namespace. There are no AWS access keys anywhere in the cluster. If you take one idea from this series into your own setup, take that one.
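For readers who haven't written one, an IRSA trust policy scoped to one service account in one namespace could be sketched like this. The namespace `logging`, the service account `fluent-bit`, the role name, and both variables are hypothetical, chosen only to illustrate the pattern.

```hcl
# Hypothetical inputs describing the cluster's IAM OIDC provider.
variable "oidc_provider_arn" { type = string }
variable "oidc_issuer" { type = string } # issuer URL without "https://"

data "aws_iam_policy_document" "fluent_bit_trust" {
  statement {
    effect  = "Allow"
    actions = ["sts:AssumeRoleWithWebIdentity"]

    principals {
      type        = "Federated"
      identifiers = [var.oidc_provider_arn]
    }

    # Bind the role to exactly one service account in one namespace.
    condition {
      test     = "StringEquals"
      variable = "${var.oidc_issuer}:sub"
      values   = ["system:serviceaccount:logging:fluent-bit"]
    }
    # Only accept tokens issued for STS.
    condition {
      test     = "StringEquals"
      variable = "${var.oidc_issuer}:aud"
      values   = ["sts.amazonaws.com"]
    }
  }
}

resource "aws_iam_role" "fluent_bit" {
  name               = "example-fluent-bit-irsa"
  assume_role_policy = data.aws_iam_policy_document.fluent_bit_trust.json
}
```

Using `StringEquals` on the `sub` claim, rather than `StringLike` with a wildcard, is what stops any other pod in the cluster from assuming the role.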
Validate Before You Trust It
The stack ships a sample app (enable_sample_app = true) that emits metrics in the format the dashboards expect, writes logs, and generates traces. Deploy it first. It’s a smoke test for the whole pipeline: if the sample app shows up in Grafana, its logs land in your backend, and its traces appear in X-Ray, you know every layer is wired correctly before you point real workloads at it.
Optional Add-On: Service Mesh
If you want service-to-service mTLS and L7 traffic visibility on top of the observability stack, there’s an optional service_mesh module (enable_service_mesh = true, off by default). It deploys Istio via the managed EKS add-ons (aws-istio and aws-istio-cni) and enforces mesh-wide STRICT mTLS through a PeerAuthentication policy, with sidecar injection enabled per namespace. It isn’t part of the core five-layer observability stack and its telemetry contribution is passive (the sidecars emit metrics and traces automatically once the metrics and tracing modules are running), so treat it as a security/networking add-on you reach for when you specifically need encrypted service-to-service traffic, not a required piece.
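As a sketch of what mesh-wide STRICT mTLS and per-namespace injection can look like from Terraform, assuming the hashicorp/kubernetes provider and that the Istio CRDs are already installed (the `kubernetes_manifest` resource needs them at plan time). The resource names, the `example-app` namespace, and the `istio-system` root namespace are assumptions, not taken from the service_mesh module.

```hcl
# A PeerAuthentication named "default" in the mesh root namespace
# applies to every workload in the mesh.
resource "kubernetes_manifest" "mesh_strict_mtls" {
  manifest = {
    apiVersion = "security.istio.io/v1beta1"
    kind       = "PeerAuthentication"
    metadata = {
      name      = "default"
      namespace = "istio-system" # assumed root namespace
    }
    spec = {
      mtls = { mode = "STRICT" }
    }
  }
}

# Sidecar injection is opted into per namespace via a label.
resource "kubernetes_namespace" "app" {
  metadata {
    name   = "example-app" # hypothetical namespace
    labels = { "istio-injection" = "enabled" }
  }
}
```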
The Go-Live Checklist
Before I call any deployment of this production-ready, I run through this:
- Sample app validated end to end (metrics in Grafana, logs in the backend, traces in X-Ray).
- Alert routing tested with a real fire, not just a green plan. Trigger a test alert and confirm it reaches email/PagerDuty.
- IRSA roles reviewed for least privilege. No role does more than its one job.
- Retention set for your compliance needs (CloudWatch `log_retention_days`, AMP retention).
- Private endpoints confirmed resolving if you’re in `private_cluster_mode` (no accidental NAT egress).
- Cost baseline understood. AMP, AMG, X-Ray, and your log backend each have a meter. Know roughly what a normal month looks like so an anomaly stands out.
- Thresholds tuned to your traffic. The defaults are sensible; your workload is specific.
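One way to run the "real fire" test from the checklist is a throwaway rule that is always true, so it exercises the full path from rule evaluation to your inbox or pager. This is a sketch rather than something the stack ships: the `enable_routing_test` variable and the `routing-test` names are hypothetical.

```hcl
# Hypothetical toggle: turn on, confirm delivery, then turn off again.
variable "enable_routing_test" {
  type    = bool
  default = false
}

resource "aws_prometheus_rule_group_namespace" "routing_test" {
  count        = var.enable_routing_test ? 1 : 0
  name         = "routing-test"
  workspace_id = var.amp_workspace_id
  data         = <<-YAML
    groups:
      - name: routing-test
        rules:
          - alert: RoutingTest
            # vector(1) always returns a sample, so this alert always fires.
            expr: vector(1)
            labels:
              severity: warning
            annotations:
              summary: Synthetic alert to verify SNS and PagerDuty delivery
  YAML
}
```

Apply with the toggle on, wait for the notification, then apply again with it off so the synthetic alert resolves.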
That’s the Series
Five parts, one stack:
- Architecture and design decisions
- Metrics with Amazon Managed Prometheus + Grafana
- Logging with Fluent Bit
- Tracing with OpenTelemetry + X-Ray
- Production hardening (this post)
It’s all open source, Apache 2.0, on GitHub. Metrics, logs, traces, alerting, private-cluster support, all AWS-managed, all IRSA-scoped, no static credentials. It’s running in production.
If you want this deployed on your EKS clusters, configured for your compliance requirements, and integrated with your existing infrastructure, book a free 30-minute discovery call. I’ll scope what you need and give you a straight answer on timeline and cost.