EKS Distributed Tracing: OpenTelemetry + X-Ray (Part 4 of 5)
How the tracing module works: an OpenTelemetry Collector on every node, tail-based sampling that always keeps errors and slow requests, and IRSA-scoped export to X-Ray.
This is Part 4 of my five-part series on the eks-observability-stack. Part 1 covered the architecture, Part 2 the metrics layer, and Part 3 logging. This post is about traces: following a single request across services to find out which one is actually slow.
The code is in modules/tracing/.
Why Tracing Is the Pillar People Skip
Metrics tell you the system is slow. Logs tell you what each service said. Neither tells you where the time went on a request that touched six services.
That’s what distributed tracing answers. Every request gets a trace ID, every service it hits adds a span, and you end up with a waterfall showing exactly where the 800ms went. On a microservices cluster, this is the difference between “checkout is slow” and “checkout is slow because the payments service is waiting 600ms on a downstream call to the fraud check.”
Most teams skip it because the tooling has a reputation for being heavy. This module keeps it light.
What the Tracing Module Deploys
It deploys an OpenTelemetry Collector via the official Helm chart (pinned to 0.78.0), running as a DaemonSet:
resource "helm_release" "otel_collector" {
  name       = "otel-collector"
  repository = "https://open-telemetry.github.io/opentelemetry-helm-charts"
  chart      = "opentelemetry-collector"
  version    = "0.78.0"
  namespace  = var.namespace
  # ... values configure mode = daemonset
}
Two deliberate choices here:
Upstream OpenTelemetry, not ADOT. This is the vanilla OTel Collector, not AWS Distro for OpenTelemetry. ADOT gives you tighter AWS defaults, but upstream OTel gives you the full processor set (including the tail sampler below) and keeps you portable. If you ever move off X-Ray, the collector config barely changes.
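DaemonSet, not a central Deployment. One collector per node means your applications can send spans to a collector on the same node (low latency, no cross-node hop), and a single collector falling over only affects one node’s telemetry. Applications point at otel-collector.tracing.svc.cluster.local:4317 (gRPC) or :4318 (HTTP). One caveat: a Service address only keeps traffic on the same node when the Service sets internalTrafficPolicy: Local. With the default policy, kube-proxy may route a request to a collector on any node.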
The Pipeline
The whole behaviour lives in collector-config.yaml.tpl. The trace pipeline is:
OTLP receiver → memory_limiter → tail_sampling → batch → resource → [X-Ray, logging]
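In collector configuration terms, that ordering is declared under service.pipelines. The following is a minimal sketch, assuming the component names in the template match the ones used in this post; the real collector-config.yaml.tpl may differ in detail.

```yaml
# Sketch of the traces pipeline wiring (component names assumed from this post).
service:
  pipelines:
    traces:
      receivers: [otlp]
      # Processors run in the order listed: the limiter sits first so it can
      # push back before the tail sampler starts buffering spans.
      processors: [memory_limiter, tail_sampling, batch, resource]
      exporters: [awsxray, logging]
```

The order of the processors list is significant: each span passes through the processors left to right before it reaches the exporters.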
Receivers
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: "0.0.0.0:4317"
      http:
        endpoint: "0.0.0.0:4318"
Standard OTLP in both gRPC and HTTP. Whatever your apps are instrumented with (the OpenTelemetry SDKs all speak OTLP), they can send here.
Tail-Based Sampling: The Part That Matters
This is the design decision worth understanding. Most setups use head-based sampling: decide at the start of a request whether to trace it, usually a flat percentage. The problem is obvious once you say it out loud. If you sample 10% at the head, you keep 10% of your errors and 10% of your slow requests, which are exactly the traces you actually wanted.
The collector uses tail-based sampling instead. It buffers all the spans in a trace, waits for the trace to finish, then decides whether to keep it based on what actually happened:
processors:
  tail_sampling:
    decision_wait: 10s
    num_traces: 100000
    expected_new_traces_per_sec: 1000
    policies:
      - name: always-sample-errors
        type: status_code
        status_code:
          status_codes: [ERROR]
      - name: always-sample-slow
        type: latency
        latency:
          threshold_ms: 500
      - name: probabilistic
        type: probabilistic
        probabilistic:
          sampling_percentage: ${sampling_rate_percentage}
Three policies, evaluated together. A trace is kept if any one of them matches:
- Every error trace is kept. 100% of traces with an error status.
- Every slow trace is kept. 100% of traces over 500ms.
- Everything else is sampled at sampling_rate_percentage (default 10%).
So you never lose a failure or a slow request, and you keep cost down on the boring successful traffic. The decision_wait: 10s is the trade-off: the collector holds spans in memory for up to 10 seconds waiting for the trace to complete before deciding. That’s why there’s a memory limiter in front of it:
memory_limiter:
  check_interval: 1s
  limit_mib: 400
  spike_limit_mib: 100
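limit_mib: 400 is the hard ceiling. spike_limit_mib: 100 puts the soft limit at 300 MiB (limit_mib minus spike_limit_mib): above the soft limit the collector starts refusing incoming data, and above the hard limit it also forces garbage collection. This is tuned for moderate-volume clusters. If you run very high span throughput per node, raise these or the collector will start dropping spans to protect the node, which is the correct failure mode but worth knowing about.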
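For a higher-throughput node, a raised limiter might look like the sketch below. The numbers are illustrative assumptions, not recommendations; the right values depend on your span volume and on how much memory the collector pod is allowed to use.

```yaml
# Illustrative values only: doubles the defaults shown above.
processors:
  memory_limiter:
    check_interval: 1s
    limit_mib: 800         # hard limit
    spike_limit_mib: 200   # soft limit = 800 - 200 = 600 MiB
```

Whatever values you pick, keep the pod's own memory limit comfortably above limit_mib. Otherwise the kubelet may OOM-kill the collector before the limiter gets a chance to push back.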
Export to X-Ray
exporters:
  awsxray:
    region: ${aws_region}
    index_all_attributes: true
Sampled traces go to AWS X-Ray. index_all_attributes: true makes every span attribute searchable in the X-Ray console, which is great for debugging and worth a note on cost: X-Ray bills partly on indexed data, so on a high-volume cluster you may want to index selectively. The resource processor stamps every span with cluster_name first, so multi-cluster setups stay separable.
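The cluster stamping can be expressed with the resource processor's attribute actions. This is a sketch, assuming the template exposes a ${cluster_name} variable in the same way it exposes ${aws_region}; the module's actual processor block may differ.

```yaml
# Sketch: add (or overwrite) a cluster_name resource attribute on all telemetry.
processors:
  resource:
    attributes:
      - key: cluster_name
        value: ${cluster_name}   # assumed template variable
        action: upsert           # insert if missing, overwrite if present
```

Using upsert rather than insert means the collector's value wins even if an application already set cluster_name itself.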
Trace-Derived Metrics to AMP (Optional)
If you set amp_remote_write_endpoint, the module adds a second pipeline that remote-writes metrics to Amazon Managed Service for Prometheus (AMP) with SigV4 auth:
prometheusremotewrite:
  endpoint: "${amp_remote_write_endpoint}"
  auth:
    authenticator: sigv4auth
  resource_to_telemetry_conversion:
    enabled: true
One honest clarification, because it’s easy to overstate: this pipeline forwards the OTLP metrics your applications already emit (it converts resource attributes like service.name into Prometheus labels). It does not synthesise RED metrics from spans on its own. If your apps emit OTLP metrics, this gets them into the same Grafana you set up in Part 2. If they don’t, this pipeline sits idle, which is fine.
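For context, the sigv4auth authenticator referenced by the exporter has to be declared as an extension and enabled under service. A rough sketch of how the pieces might fit together follows; the processor list in the metrics pipeline is an assumption, not taken from the module.

```yaml
# Sketch: SigV4 signing extension plus a metrics pipeline that uses it.
extensions:
  sigv4auth:
    region: ${aws_region}
    service: aps             # signing name for Amazon Managed Service for Prometheus

service:
  extensions: [sigv4auth]
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [memory_limiter, batch, resource]   # assumed
      exporters: [prometheusremotewrite]
```

Because the extension signs with the pod's IRSA credentials, no access keys appear anywhere in the collector config.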
IRSA: Scoped to Writing Traces
The collector authenticates through IAM Roles for Service Accounts. No static keys. The X-Ray write policy is always attached:
{
  "Effect": "Allow",
  "Action": [
    "xray:PutTraceSegments",
    "xray:PutTelemetryRecords"
  ],
  "Resource": "*"
}
And the AMP remote-write permission is only added when you enable the metrics pipeline. One caveat I’ll be straight about: X-Ray’s IAM actions don’t support resource-level ARNs, so the resource is * by necessity. That’s an X-Ray limitation, not a shortcut. The role still can’t do anything except put trace segments and telemetry records, and the trust policy binds it to one service account in one namespace.
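On the Kubernetes side, IRSA comes down to one annotation on the collector's service account. The sketch below uses a placeholder account ID and hypothetical service account and role names; the module's actual names may differ, and the real ARN is exposed as a module output.

```yaml
# Sketch: service account bound to the IAM role via the IRSA annotation.
apiVersion: v1
kind: ServiceAccount
metadata:
  name: otel-collector          # hypothetical name
  namespace: tracing
  annotations:
    # Placeholder account ID and role name.
    eks.amazonaws.com/role-arn: arn:aws:iam::111122223333:role/otel-collector-tracing
```

The role's trust policy should name exactly this namespace and service account, which is what stops any other pod in the cluster from assuming it.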
The Variables You’ll Touch
| Variable | Default | What it controls |
|---|---|---|
| namespace | "tracing" | Namespace for the collector |
| sampling_rate_percentage | 10 | Probabilistic rate for normal traffic (errors and slow traces are always kept) |
| amp_remote_write_endpoint | null | Set it to enable the OTLP metrics → AMP pipeline |
| cluster_name | (required) | Stamped onto every span |
Outputs hand you the collector endpoints (otel-collector.tracing.svc.cluster.local:4317 for gRPC, :4318 for HTTP) so you can point your application instrumentation at them, plus the IRSA role ARN.
Instrumenting Your Apps
The collector is only half the job. Your applications have to emit spans. The good news is you rarely write tracing code by hand anymore: OpenTelemetry’s auto-instrumentation agents and packages cover most popular frameworks (Express, Spring, Django, .NET). Once one is attached to your service, two environment variables are enough to get started:
env:
  - name: OTEL_EXPORTER_OTLP_ENDPOINT
    value: "http://otel-collector.tracing.svc.cluster.local:4317"
  - name: OTEL_SERVICE_NAME
    value: "checkout-service"
Point the SDK at the collector, give the service a name, and spans start flowing. The collector handles sampling and export from there.
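Port 4317 is the gRPC listener. Some SDKs default to OTLP over HTTP instead, in which case the endpoint needs the HTTP port. Here is a sketch of that variant, with the protocol pinned explicitly so the behaviour does not depend on an SDK default:

```yaml
# Variant for SDKs that export OTLP over HTTP rather than gRPC.
env:
  - name: OTEL_EXPORTER_OTLP_ENDPOINT
    value: "http://otel-collector.tracing.svc.cluster.local:4318"
  - name: OTEL_EXPORTER_OTLP_PROTOCOL
    value: "http/protobuf"
  - name: OTEL_SERVICE_NAME
    value: "checkout-service"
```

If spans never arrive, a mismatch between the port and the protocol is one of the first things worth checking.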
What I’d Tell You Before Deploying
- Tail sampling trades memory for accuracy. The 10-second decision window means the collector holds traces in RAM. On high-throughput nodes, watch the collector’s memory and tune limit_mib. The default 400 MiB suits most clusters.
- index_all_attributes is convenient but has a cost dimension. Great default for getting started; revisit it if your X-Ray bill climbs.
- Sampling is not filtering. Every sampled trace is exported as-is; there’s no per-service allowlist. If you want to drop a noisy health-check endpoint, do it at the SDK or add a filter processor.
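For the filter-processor route, a sketch might look like the following. It assumes the contrib filter processor is present in the collector image and that health-check spans carry an http.route attribute equal to /healthz. The processor instance name and the attribute are illustrative, so check what your instrumentation actually emits.

```yaml
# Sketch: drop spans for a health-check route before they reach the sampler.
processors:
  filter/healthcheck:            # hypothetical instance name
    error_mode: ignore           # skip spans where the condition cannot be evaluated
    traces:
      span:
        # Spans matching any listed condition are dropped.
        - 'attributes["http.route"] == "/healthz"'
```

To take effect, the processor also has to be added to the traces pipeline's processors list, ideally before tail_sampling so dropped spans never occupy buffer memory. Note that this drops matching spans, not whole traces.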
What’s Next
Part 5 closes the series: the seven production alert rules, SNS and PagerDuty routing, running the whole stack in a fully private cluster with VPC endpoints, and the checklist I run before calling it production-ready.
The full stack is on GitHub. Star it, fork it, open issues.
If you want this deployed on your EKS clusters, configured for your compliance requirements, and integrated with your existing tooling, book a free 30-minute discovery call. I’ll scope what you need and give you a straight answer on timeline and cost.