EKS Distributed Tracing: OpenTelemetry + X-Ray (Part 4 of 5)

This is Part 4 of my five-part series on the eks-observability-stack. Part 1 covered the architecture, Part 2 the metrics layer, and Part 3 logging. This post is about traces: following a single request across services to find out which one is actually slow.

The code is in modules/tracing/.

Why Tracing Is the Pillar People Skip

Metrics tell you the system is slow. Logs tell you what each service said. Neither tells you where the time went on a request that touched six services.

That’s what distributed tracing answers. Every request gets a trace ID, every service it hits adds a span, and you end up with a waterfall showing exactly where the 800ms went. On a microservices cluster, this is the difference between “checkout is slow” and “checkout is slow because the payments service is waiting 600ms on a downstream call to the fraud check.”

Most teams skip it because the tooling has a reputation for being heavy. This module keeps it light.

What the Tracing Module Deploys

It deploys an OpenTelemetry Collector via the official Helm chart (pinned to 0.78.0), running as a DaemonSet:

resource "helm_release" "otel_collector" {
  name       = "otel-collector"
  repository = "https://open-telemetry.github.io/opentelemetry-helm-charts"
  chart      = "opentelemetry-collector"
  version    = "0.78.0"
  namespace  = var.namespace
  # ... values configure mode = daemonset
}

Two deliberate choices here:

Upstream OpenTelemetry, not ADOT. This is the vanilla OTel Collector, not AWS Distro for OpenTelemetry. ADOT gives you tighter AWS defaults, but upstream OTel gives you the full processor set (including the tail sampler below) and keeps you portable. If you ever move off X-Ray, the collector config barely changes.

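DaemonSet, not a central Deployment. One collector per node means your applications can send spans to a collector on the same node (low latency, no cross-node hop), and a single collector falling over only affects one node’s telemetry. Applications point at otel-collector.tracing.svc.cluster.local:4317 (gRPC) or :4318 (HTTP). One thing to verify in your cluster: a plain ClusterIP Service load-balances across every collector pod, so the same-node path only holds if the Service sets internalTrafficPolicy: Local or the apps target their own node’s IP.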
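
Whether spans really stay on the node depends on how the Service routes traffic. As a sketch of one way to get that behaviour, the Service below sets internalTrafficPolicy: Local so that kube-proxy only forwards to a collector pod on the caller’s node. The selector label is an assumption; check what the chart actually renders in your release before relying on it.

```yaml
# Sketch only: a Service that keeps OTLP traffic on the sending node.
apiVersion: v1
kind: Service
metadata:
  name: otel-collector
  namespace: tracing
spec:
  type: ClusterIP
  # Route only to endpoints on the same node as the client pod.
  internalTrafficPolicy: Local
  selector:
    app.kubernetes.io/name: opentelemetry-collector  # assumed label
  ports:
    - name: otlp-grpc
      port: 4317
      targetPort: 4317
    - name: otlp-http
      port: 4318
      targetPort: 4318
```

With the Local policy, a node whose collector pod is down has no endpoint at all, so traffic from that node is dropped rather than sent elsewhere. That matches the one-node blast radius described above.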

The Pipeline

The whole behaviour lives in collector-config.yaml.tpl. The trace pipeline is:

OTLP receiver → memory_limiter → tail_sampling → batch → resource → [X-Ray, logging]
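
In collector terms, that chain is the order of the processors list in the service section. A minimal sketch of the wiring, assuming the component names match the ones used in this post (the actual template may name them differently):

```yaml
service:
  pipelines:
    traces:
      receivers: [otlp]
      # Processors run in list order: protect memory first, sample, then batch and tag.
      processors: [memory_limiter, tail_sampling, batch, resource]
      exporters: [awsxray, logging]
```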

Receivers

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: "0.0.0.0:4317"
      http:
        endpoint: "0.0.0.0:4318"

Standard OTLP in both gRPC and HTTP. Whatever your apps are instrumented with (the OpenTelemetry SDKs all speak OTLP), they can send here.

Tail-Based Sampling: The Part That Matters

This is the design decision worth understanding. Most setups use head-based sampling: decide at the start of a request whether to trace it, usually a flat percentage. The problem is obvious once you say it out loud. If you sample 10% at the head, you keep only 10% of your errors and 10% of your slow requests, and throw away the other 90% of exactly the traces you wanted.

The collector uses tail-based sampling instead. It buffers the spans of each trace, waits a fixed window for the rest of the trace to arrive, then decides whether to keep it based on what actually happened:

processors:
  tail_sampling:
    decision_wait: 10s
    num_traces: 100000
    expected_new_traces_per_sec: 1000
    policies:
      - name: always-sample-errors
        type: status_code
        status_code:
          status_codes: [ERROR]
      - name: always-sample-slow
        type: latency
        latency:
          threshold_ms: 500
      - name: probabilistic
        type: probabilistic
        probabilistic:
          sampling_percentage: ${sampling_rate_percentage}

Three policies, evaluated together:

Every error trace is kept. 100% of traces with an error status.
Every slow trace is kept. 100% of traces over 500ms.
Everything else is sampled at sampling_rate_percentage (default 10%).

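So within each collector, errors and slow requests are always kept, and you keep cost down on the boring successful traffic. The decision_wait: 10s is the trade-off: the collector has no signal that a trace is complete, so it holds every trace in memory for 10 seconds after its first span arrives before deciding. One consequence of running a collector per node: each collector only sees the spans sent to it, so a trace that crosses nodes is judged in pieces. If you need whole-trace decisions, all spans for a given trace ID have to reach the same collector. The buffering is also why there’s a memory limiter in front of it: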

  memory_limiter:
    check_interval: 1s
    limit_mib: 400
    spike_limit_mib: 100

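A hard 400 MiB ceiling, with a soft limit 100 MiB below it: once usage passes 300 MiB (limit_mib minus spike_limit_mib) the collector starts refusing new data, which leaves room for a burst before the hard limit. This is tuned for moderate-volume clusters. If you run very high span throughput per node, raise these or the collector will start refusing spans to protect the node, which is the correct failure mode but worth knowing about.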
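
The numbers in the two processors relate to each other, and it helps to do the arithmetic once. The block below is the same configuration with the sizing written out as comments; the in-flight estimate assumes traffic really does arrive at roughly expected_new_traces_per_sec.

```yaml
processors:
  memory_limiter:
    check_interval: 1s
    limit_mib: 400        # hard limit
    spike_limit_mib: 100  # soft limit = 400 - 100 = 300 MiB; data is refused above this
  tail_sampling:
    decision_wait: 10s
    expected_new_traces_per_sec: 1000
    # Traces held at once ≈ 1000 traces/s × 10 s = 10,000.
    num_traces: 100000    # buffer capacity, about 10× that estimate
    # policies: unchanged from the block above
```

If your per-node rate is higher, scale the estimate the same way and revisit limit_mib alongside it.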

Export to X-Ray

exporters:
  awsxray:
    region: ${aws_region}
    index_all_attributes: true

Sampled traces go to AWS X-Ray. index_all_attributes: true makes every span attribute searchable in the X-Ray console, which is great for debugging and worth a note on cost: X-Ray bills partly on indexed data, so on a high-volume cluster you may want to index selectively. The resource processor stamps every span with cluster_name first, so multi-cluster setups stay separable.
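
If indexing everything turns out to be too much, the awsxray exporter also accepts an explicit list through indexed_attributes. A sketch of selective indexing alongside a resource processor that stamps the cluster name; the attribute names in the list and the ${cluster_name} template variable are illustrative assumptions, not what the module ships:

```yaml
processors:
  resource:
    attributes:
      - key: cluster_name
        value: ${cluster_name}   # assumed template variable
        action: upsert           # add the attribute, or overwrite it if already present

exporters:
  awsxray:
    region: ${aws_region}
    index_all_attributes: false
    # Only these span attributes become searchable in X-Ray (example names).
    indexed_attributes:
      - http.route
      - rpc.method
```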

OTLP Metrics to AMP (Optional)

If you set amp_remote_write_endpoint, the module adds a second pipeline that remote-writes metrics to Amazon Managed Service for Prometheus (AMP) with SigV4 auth:

  prometheusremotewrite:
    endpoint: "${amp_remote_write_endpoint}"
    auth:
      authenticator: sigv4auth
    resource_to_telemetry_conversion:
      enabled: true

One honest clarification, because it’s easy to overstate: this pipeline forwards the OTLP metrics your applications already emit (it converts resource attributes like service.name into Prometheus labels). It does not synthesise RED metrics from spans on its own. If your apps emit OTLP metrics, this gets them into the same Grafana you set up in Part 2. If they don’t, this pipeline sits idle, which is fine.
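
For orientation, a sketch of what that second pipeline needs besides the exporter: the sigv4auth extension has to be declared and enabled, and a metrics pipeline has to reference the exporter. The processor list and the use of ${aws_region} here are assumptions about the template, not a copy of it.

```yaml
extensions:
  sigv4auth:
    region: ${aws_region}
    service: aps   # SigV4 service name for Amazon Managed Service for Prometheus

service:
  extensions: [sigv4auth]
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [memory_limiter, batch]   # assumed processor list
      exporters: [prometheusremotewrite]
```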

IRSA: Scoped to Writing Traces

The collector authenticates through IAM Roles for Service Accounts. No static keys. The X-Ray write policy is always attached:

{
  "Effect": "Allow",
  "Action": [
    "xray:PutTraceSegments",
    "xray:PutTelemetryRecords"
  ],
  "Resource": "*"
}

And the AMP remote-write permission is only added when you enable the metrics pipeline. One caveat I’ll be straight about: xray:PutTraceSegments and xray:PutTelemetryRecords don’t support resource-level permissions, so the resource is * by necessity. That’s an X-Ray limitation, not a shortcut. Without the metrics pipeline, the role can’t do anything except put trace segments and telemetry records, and the trust policy binds it to one service account in one namespace.
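
On the Kubernetes side, IRSA comes down to one annotation on the collector’s service account. A sketch with a placeholder account ID and role name (the module outputs the real role ARN), and a service account name that is assumed to match the Helm release:

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: otel-collector   # assumed name
  namespace: tracing
  annotations:
    # EKS injects web-identity credentials for this role into pods that use the account.
    eks.amazonaws.com/role-arn: arn:aws:iam::111122223333:role/example-otel-collector
```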

The Variables You’ll Touch

Variable	Default	What it controls
`namespace`	`"tracing"`	Namespace for the collector
`sampling_rate_percentage`	`10`	Probabilistic rate for normal traffic (errors and slow traces are always kept)
`amp_remote_write_endpoint`	`null`	Set it to enable the OTLP metrics → AMP pipeline
`cluster_name`	(required)	Stamped onto every span

Outputs hand you the collector endpoints (otel-collector.tracing.svc.cluster.local:4317 for gRPC, :4318 for HTTP) so you can point your application instrumentation at them, plus the IRSA role ARN.

Instrumenting Your Apps

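The collector is only half the job. Your applications have to emit spans. The good news is you rarely write tracing code by hand anymore: the OpenTelemetry auto-instrumentation agents and packages cover most popular frameworks (Express, Spring, Django, ASP.NET Core). Once the agent or package is attached to your app, pointing it at the collector takes three environment variables:

env:
  - name: OTEL_EXPORTER_OTLP_ENDPOINT
    value: "http://otel-collector.tracing.svc.cluster.local:4317"
  # Port 4317 is gRPC. Several SDKs default to http/protobuf, so set the protocol explicitly.
  - name: OTEL_EXPORTER_OTLP_PROTOCOL
    value: "grpc"
  - name: OTEL_SERVICE_NAME
    value: "checkout-service"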

Point the SDK at the collector, give the service a name, and spans start flowing. The collector handles sampling and export from there.
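
If your SDK or agent speaks OTLP over HTTP rather than gRPC, the same idea applies with the other port. A sketch of that variant, which also pins the SDK sampler so the collector sees every trace and tail sampling has the full picture (parentbased_always_on is the default in the OpenTelemetry SDK specification, so that line only makes it explicit):

```yaml
env:
  - name: OTEL_EXPORTER_OTLP_ENDPOINT
    value: "http://otel-collector.tracing.svc.cluster.local:4318"
  - name: OTEL_EXPORTER_OTLP_PROTOCOL
    value: "http/protobuf"
  - name: OTEL_SERVICE_NAME
    value: "checkout-service"
  # Keep head sampling off in the app; the collector decides what to keep.
  - name: OTEL_TRACES_SAMPLER
    value: "parentbased_always_on"
```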

What I’d Tell You Before Deploying

Tail sampling trades memory for accuracy. The 10-second decision window means the collector holds traces in RAM. On high-throughput nodes, watch the collector’s memory and tune limit_mib. The default 400 MiB suits moderate-volume clusters.
index_all_attributes is convenient but has a cost dimension. Great default for getting started; revisit it if your X-Ray bill climbs.
Sampling is not filtering. Every sampled trace is exported as-is; there’s no per-service allowlist. If you want to drop a noisy health-check endpoint, do it at the SDK or add a filter processor.
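
For the health-check case, a filter processor placed ahead of tail_sampling is one option. A sketch, assuming the collector image is the contrib distribution and that your instrumentation records the path in an http.route attribute with the value /healthz (both the attribute and the path are assumptions to adapt):

```yaml
processors:
  filter/healthchecks:
    error_mode: ignore   # if a condition fails to evaluate, log it and keep going
    traces:
      span:
        # Spans matching any condition in this list are dropped.
        - 'attributes["http.route"] == "/healthz"'
```

To use it, add filter/healthchecks to the traces pipeline’s processors list before tail_sampling. It drops matching spans individually, so child spans of a dropped request span can end up orphaned.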

What’s Next

Part 5 closes the series: the seven production alert rules, SNS and PagerDuty routing, running the whole stack in a fully private cluster with VPC endpoints, and the checklist I run before calling it production-ready.

The full stack is on GitHub. Star it, fork it, open issues.

If you want this deployed on your EKS clusters, configured for your compliance requirements, and integrated with your existing tooling, book a free 30-minute discovery call. I’ll scope what you need and give you a straight answer on timeline and cost.