EKS Logging: Fluent Bit, CloudWatch vs OpenSearch (Part 3 of 5)

This is Part 3 of my five-part series on the eks-observability-stack. Part 1 covered the architecture. Part 2 dug into metrics with Amazon Managed Prometheus and Grafana. This post is about logs: collecting them off every pod with Fluent Bit, and shipping them to either CloudWatch Logs or OpenSearch depending on one variable.

The code is all in modules/logging/.

The Decision This Module Forces You to Make

Most logging guides pick a backend and pretend the other one doesn’t exist. The logging module doesn’t. It supports two, and it makes you choose:

variable "logging_backend" {
  description = "Logging backend to deploy. Accepted values: cloudwatch, opensearch."
  type        = string
  default     = "cloudwatch"

  validation {
    condition     = contains(["cloudwatch", "opensearch"], var.logging_backend)
    error_message = "logging_backend must be either \"cloudwatch\" or \"opensearch\"."
  }
}

That’s the entire interface. Set logging_backend = "cloudwatch" and you get a serverless, AWS-managed log destination with almost no operational surface. Set it to "opensearch" and you get a full search and analytics cluster you can build Kibana-style dashboards on top of.

The point of putting both behind one variable is that switching later doesn’t mean re-architecting. The collection layer stays the same. Only the destination changes.
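As a rough sketch of what that looks like from the calling side: the source path and the cluster_name input below are assumptions (this post only shows logging_backend), so check the module’s own variables for the real names.

```hcl
# Sketch only. "source" and "cluster_name" are assumed, not taken from the module.
module "logging" {
  source = "./modules/logging" # assumed path relative to the repo root

  cluster_name    = "my-cluster" # assumed input name
  logging_backend = "cloudwatch" # change to "opensearch" to swap the destination
}
```

Any value outside the two accepted ones is rejected by the validation block before anything is deployed.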

What Fluent Bit Actually Collects

Both backends run Fluent Bit as a DaemonSet, which means one pod on every node, reading logs directly off the local disk. There is no central aggregator to bottleneck on. Each node ships its own logs.

The collection config is the same idea in both paths. Tail the container logs, parse them, enrich them with Kubernetes metadata, then forward:

[INPUT]
    Name              tail
    Path              /var/log/containers/*.log
    multiline.parser  docker, cri
    DB                /var/fluent-bit/state/flb_container.db
    Mem_Buf_Limit     50MB
    Skip_Long_Lines   On

A few things worth understanding here, because they are the difference between logs you can use and logs that lie to you:

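The state database. That DB path tracks how far Fluent Bit has read in each file. If a node reboots or the pod restarts, it picks up where it left off instead of re-shipping everything or dropping what it missed. That only holds when the path sits on a host-mounted volume; a database kept inside the container filesystem disappears with the pod.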

Multiline parsing at the input. multiline.parser docker, cri handles the two container runtime log formats. Without this, a single long log line that the runtime splits into several physical lines (CRI writes each chunk with a partial or full flag) arrives as fragments.

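Mem_Buf_Limit 50MB and Skip_Long_Lines On. These are two separate safety valves. Mem_Buf_Limit is the backpressure protection: if the destination is slow, Fluent Bit buffers up to 50MB in memory for this input, then pauses reading until the backlog flushes instead of growing until it is OOM-killed. The state database lets it resume from the saved offset, but anything rotated away during a long pause is lost. Skip_Long_Lines covers a different failure: a line too long for the read buffer is skipped, where the default would be to stop tailing that file altogether. On a logging agent, dropping a few lines is better than taking down the node it runs on.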

Enrichment: Making Logs Mean Something

Raw container logs are useless without context. A line that says connection refused means nothing if you don’t know which pod, namespace, and cluster it came from. The Kubernetes filter fixes that:

[FILTER]
    Name                kubernetes
    Match               kube.*
    Merge_Log           On
    Merge_Log_Key       log_processed
    K8S-Logging.Parser  On
    Buffer_Size         32k

This attaches pod name, namespace, labels, and annotations to every record. Merge_Log On means if your application already logs JSON, Fluent Bit parses it into real fields rather than leaving your structured log as an escaped string. Because Merge_Log_Key is set, those fields land under log_processed instead of at the top level of the record. K8S-Logging.Parser On lets individual pods declare their own parser via annotation, so a team running a non-standard log format can opt into custom parsing without touching the cluster config.

Then two more filters run on the CloudWatch path:

[FILTER]
    Name    grep
    Match   kube.*
    Exclude $kubernetes['namespace_name'] ^(kube-system|amazon-cloudwatch)$

[FILTER]
    Name    modify
    Match   *
    Add     cluster_name my-cluster

The grep filter drops logs from kube-system and amazon-cloudwatch so you’re not paying to store the logging system’s own chatter. The modify filter stamps every record with the cluster name, which matters the moment you have more than one cluster shipping to the same account.

There’s also a multiline filter that stitches Java and Python stack traces back into single records, because a 40-line traceback arriving as 40 separate log entries is how you miss the actual error.

Path A: CloudWatch Logs (The Default)

The CloudWatch path leans on the amazon-cloudwatch-observability EKS add-on. AWS manages the Fluent Bit DaemonSet itself, and the module manages the configuration and the destinations.

Two log groups get created:

/aws/eks/{cluster_name}/application for your container logs
/aws/eks/{cluster_name}/dataplane for the systemd units (kubelet, containerd)

Retention is a variable (log_retention_days, default 30), and you can pass a KMS key if you need the logs encrypted with your own key instead of the AWS-managed one.
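For orientation, a log group with those two knobs looks roughly like this in Terraform. This is a minimal sketch, not the module’s actual resource: the resource label and the cluster_name variable are assumptions, while retention_in_days and kms_key_id are the standard aws_cloudwatch_log_group arguments.

```hcl
variable "cluster_name" {
  description = "EKS cluster name (assumed input, used to build the log group path)."
  type        = string
}

variable "log_retention_days" {
  description = "CloudWatch retention in days."
  type        = number
  default     = 30
}

variable "kms_key_arn" {
  description = "Optional customer-managed key for log encryption."
  type        = string
  default     = null
}

# Application log group; the dataplane group would follow the same pattern.
resource "aws_cloudwatch_log_group" "application" {
  name              = "/aws/eks/${var.cluster_name}/application"
  retention_in_days = var.log_retention_days
  kms_key_id        = var.kms_key_arn # null keeps the default CloudWatch Logs encryption
}
```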

The CloudWatch output also writes in EMF (json/emf), which means records that carry embedded metric definitions are turned into CloudWatch metrics at ingestion, without standing up anything extra. For everything else there is CloudWatch Logs Insights. If you’ve ever wanted “count of 5xx responses grouped by service, from the logs” without a metrics pipeline, that’s what this buys you.
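The “5xx by service” question can be saved as a Logs Insights query next to the rest of the Terraform. The sketch below is illustrative only: the field names log_processed.status and kubernetes.labels.app are hypothetical and depend on what your applications log and how your pods are labelled, and cluster_name is an assumed variable.

```hcl
variable "cluster_name" {
  description = "EKS cluster name (assumed input)."
  type        = string
}

# Saved Logs Insights query: count of 5xx responses per app label.
resource "aws_cloudwatch_query_definition" "five_xx_by_service" {
  name            = "eks/${var.cluster_name}/5xx-by-service"
  log_group_names = ["/aws/eks/${var.cluster_name}/application"]

  # Field names are hypothetical; adjust them to your own log schema.
  query_string = <<-EOT
    filter log_processed.status >= 500
    | stats count(*) as errors by kubernetes.labels.app
    | sort errors desc
  EOT
}
```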

This is the right default for most teams. It’s serverless, there’s nothing to size, nothing to patch, and nothing to wake up for. You pay per GB ingested and per GB stored, and that’s the whole cost model.

Path B: OpenSearch (When You Need Real Search)

Set logging_backend = "opensearch" and the module deploys a different shape entirely. Fluent Bit comes in via the official Helm chart (pinned to 0.43.0) as a DaemonSet in the logging namespace, and the destination is a managed OpenSearch domain.

The example at examples/logging-opensearch/ shows the full toggle. What you get:

An OpenSearch 2.11 domain with fine-grained access control
Auto-generated master credentials stored in Secrets Manager
A security group restricting HTTPS to the VPC CIDR
Logs landing in an eks-logs-* index, rotated daily by Logstash-format naming

The Fluent Bit output uses IRSA credentials to sign requests, so there are no static keys in the cluster:

[OUTPUT]
    Name               opensearch
    Host               <domain-endpoint>
    Port               443
    TLS                On
    AWS_Auth           On
    Suppress_Type_Name On
    Index              eks-logs
    Logstash_Format    On
    Logstash_Prefix    eks-logs
    Retry_Limit        5

Choose this path when you actually need full-text search across logs, ad-hoc querying, and dashboards that go deeper than CloudWatch Logs Insights. The trade-off is that OpenSearch is a real cluster. You size it (opensearch_instance_type defaults to r6g.large.search, opensearch_instance_count to 1), you pay for it by the hour whether it’s busy or not, and you own its capacity planning. Multi-AZ is a flag (opensearch_multi_az) when you need the durability, and it only makes sense with more than one data node, so raise opensearch_instance_count alongside it.
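A sketch of the OpenSearch side of the toggle, using the variables this module documents. The source path, the cluster_name input, and the VPC and subnet IDs are placeholders I have assumed for illustration; examples/logging-opensearch/ remains the authoritative version.

```hcl
# Sketch only. "source", "cluster_name", and all IDs below are placeholders.
module "logging_opensearch" {
  source = "./modules/logging" # assumed path

  cluster_name    = "my-cluster" # assumed input name
  logging_backend = "opensearch"

  opensearch_instance_type  = "r6g.large.search"
  opensearch_instance_count = 2    # spreading across AZs needs at least two data nodes
  opensearch_multi_az       = true

  vpc_id             = "vpc-0123456789abcdef0"                  # placeholder
  private_subnet_ids = ["subnet-0aaaaaaaaaaaaaaaa", "subnet-0bbbbbbbbbbbbbbbb"] # placeholders
}
```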

IRSA: One Role, Scoped to One Job

Whichever backend you pick, Fluent Bit authenticates through IAM Roles for Service Accounts. No access keys anywhere. The module creates a single role bound to the fluent-bit service account, and attaches exactly the policy that backend needs.

For CloudWatch:

{
  "Effect": "Allow",
  "Action": [
    "logs:CreateLogGroup",
    "logs:CreateLogStream",
    "logs:PutLogEvents",
    "logs:DescribeLogStreams"
  ],
  "Resource": "arn:aws:logs:*:*:log-group:/aws/eks/my-cluster/*"
}

That role can write to this cluster’s log groups and nothing else. It cannot read logs, cannot delete log groups, cannot touch any other cluster’s logs. The resource ARN is scoped to the cluster prefix.

For OpenSearch:

{
  "Effect": "Allow",
  "Action": ["es:ESHttp*"],
  "Resource": "<opensearch-domain-arn>/*"
}

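HTTP operations on one specific domain. No es:*, no admin, no ability to delete the domain or reconfigure it. es:ESHttp* is still every HTTP verb against that domain, so pair it with a fine-grained access control role that only allows writes to eks-logs-* if you want the same write-only guarantee the CloudWatch policy gives you. The trust policy on both restricts the role to a single Kubernetes service account in a single namespace via the OIDC sub claim.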
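The trust policy is the half that isn’t shown above. A minimal sketch of the usual IRSA shape follows, assuming the fluent-bit service account name from this post; the variable names, the role name, and the extra aud condition are my own choices rather than something read out of the module.

```hcl
variable "oidc_provider_arn" {
  description = "ARN of the cluster's IAM OIDC provider (assumed input)."
  type        = string
}

variable "oidc_issuer_host" {
  description = "OIDC issuer URL without the https:// prefix (assumed input)."
  type        = string
}

variable "namespace" {
  description = "Namespace of the fluent-bit service account."
  type        = string
  default     = "logging"
}

data "aws_iam_policy_document" "fluent_bit_assume" {
  statement {
    effect  = "Allow"
    actions = ["sts:AssumeRoleWithWebIdentity"]

    principals {
      type        = "Federated"
      identifiers = [var.oidc_provider_arn]
    }

    # Only this service account in this namespace may assume the role.
    condition {
      test     = "StringEquals"
      variable = "${var.oidc_issuer_host}:sub"
      values   = ["system:serviceaccount:${var.namespace}:fluent-bit"]
    }

    # Only tokens issued for STS are accepted.
    condition {
      test     = "StringEquals"
      variable = "${var.oidc_issuer_host}:aud"
      values   = ["sts.amazonaws.com"]
    }
  }
}

resource "aws_iam_role" "fluent_bit" {
  name               = "fluent-bit-logging" # hypothetical role name
  assume_role_policy = data.aws_iam_policy_document.fluent_bit_assume.json
}
```

Using StringEquals rather than StringLike on the sub claim is what keeps the binding to exactly one service account.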

This is what least-privilege looks like in practice. The log shipper can ship logs. That’s the entire blast radius if its credentials are ever compromised.

The Variables You’ll Actually Touch

The module exposes more than this, but these are the ones that matter day to day:

Variable	Default	What it controls
`logging_backend`	`"cloudwatch"`	The whole decision: `cloudwatch` or `opensearch`
`namespace`	`"logging"`	Namespace for the logging components
`log_retention_days`	`30`	CloudWatch retention (ignored for OpenSearch)
`kms_key_arn`	`null`	Optional CMK for CloudWatch log encryption
`opensearch_instance_type`	`"r6g.large.search"`	Node type when using OpenSearch
`opensearch_instance_count`	`1`	Number of OpenSearch data nodes
`opensearch_multi_az`	`false`	Spread OpenSearch across AZs
`vpc_id` / `private_subnet_ids`	`""` / `[]`	Required for OpenSearch placement

The outputs hand you what the rest of your stack needs: the IRSA role ARN, the active backend, the log destination (a CloudWatch ARN or an OpenSearch endpoint), and the log group details when you’re on CloudWatch.

Things I’d Tell You Before You Deploy This

Three honest notes, because the module has opinions and a couple of them have edges.

The two backends don’t run in the same namespace. CloudWatch’s Fluent Bit lives in the add-on-managed amazon-cloudwatch namespace. OpenSearch’s lives in your logging namespace. If you switch backends, the pod moves. Know where to look.

The grep exclusion is CloudWatch-only. On the CloudWatch path, system namespace logs (kube-system, amazon-cloudwatch) get dropped before shipping. On the OpenSearch path, they don’t. If you switch to OpenSearch and care about index volume, add the equivalent filter, or you’ll be storing and paying for system-component chatter you didn’t ask for.
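One way to add that filter, sketched under assumptions: this post doesn’t show how the module passes chart values through, so the block below is a standalone helm_release using the Fluent Bit chart’s config.filters value. It excludes the logging namespace instead of amazon-cloudwatch, since that is where the shipper runs on this path. Treat the release name and the namespace list as placeholders.

```hcl
# Sketch only: a standalone release, not the module's own resource.
resource "helm_release" "fluent_bit" {
  name       = "fluent-bit" # hypothetical release name
  namespace  = "logging"
  repository = "https://fluent.github.io/helm-charts"
  chart      = "fluent-bit"
  version    = "0.43.0"

  values = [
    yamlencode({
      config = {
        # config.filters replaces the chart's default filter section,
        # so the kubernetes filter has to be restated here.
        filters = <<-EOT
          [FILTER]
              Name                kubernetes
              Match               kube.*
              Merge_Log           On
              Merge_Log_Key       log_processed
              K8S-Logging.Parser  On
              Buffer_Size         32k

          [FILTER]
              Name    grep
              Match   kube.*
              Exclude $kubernetes['namespace_name'] ^(kube-system|logging)$
        EOT
      }
    })
  ]
}
```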

OpenSearch master credentials are generated once and not rotated. The module creates them and stores them in Secrets Manager, but there’s no rotation lifecycle attached. For a regulated environment, wire up rotation yourself or treat that secret as something you rotate on a schedule out of band.
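If you do wire up rotation yourself, the Terraform side is small. The sketch below assumes you already have a rotation Lambda of your own that updates both the secret and the domain’s master user; both input variables are hypothetical and nothing here comes from the module.

```hcl
variable "opensearch_master_secret_arn" {
  description = "ARN of the secret holding the master credentials (hypothetical input)."
  type        = string
}

variable "rotation_lambda_arn" {
  description = "ARN of a rotation Lambda you write and deploy yourself (hypothetical input)."
  type        = string
}

resource "aws_secretsmanager_secret_rotation" "opensearch_master" {
  secret_id           = var.opensearch_master_secret_arn
  rotation_lambda_arn = var.rotation_lambda_arn

  rotation_rules {
    automatically_after_days = 30 # example cadence; use what your policy requires
  }
}
```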

None of these are dealbreakers. They’re the kind of thing you want to know going in rather than discover in an incident.

How to Choose

If you want a straight answer: start with CloudWatch. It’s the default for a reason. Serverless, cheap to run, nothing to operate, and EMF gives you metric extraction with no extra infrastructure. The overwhelming majority of EKS clusters never need more than that.

Move to OpenSearch when you have a concrete need that CloudWatch can’t meet: full-text search across high log volume, complex ad-hoc investigation, or a security team that wants its own dashboards over the raw stream. When that day comes, it’s one variable and a re-apply, not a migration project.

What’s Next

Part 4 covers tracing: the OpenTelemetry Collector running on every node, tail-based sampling that always keeps your errors and slow requests, and exporting to X-Ray.

The full stack is on GitHub. Star it, fork it, open issues.

If you want this deployed on your EKS clusters, configured for your compliance requirements, and integrated with your existing logging, book a free 30-minute discovery call. I’ll scope what you need and give you a straight answer on timeline and cost.