EKS Logging: Fluent Bit, CloudWatch vs OpenSearch (Part 3 of 5)
How the logging module works: Fluent Bit on every node, one variable to switch between CloudWatch Logs and OpenSearch, IRSA-scoped shipping, and a real parsing pipeline.
This is Part 3 of my five-part series on the eks-observability-stack. Part 1 covered the architecture. Part 2 dug into metrics with Amazon Managed Prometheus and Grafana. This post is about logs: collecting them off every pod with Fluent Bit, and shipping them to either CloudWatch Logs or OpenSearch depending on one variable.
The code is all in modules/logging/.
The Decision This Module Forces You to Make
Most logging guides pick a backend and pretend the other one doesn’t exist. The logging module doesn’t. It supports two, and it makes you choose:
variable "logging_backend" {
  description = "Logging backend to deploy. Accepted values: cloudwatch, opensearch."
  type        = string
  default     = "cloudwatch"

  validation {
    condition     = contains(["cloudwatch", "opensearch"], var.logging_backend)
    error_message = "logging_backend must be either \"cloudwatch\" or \"opensearch\"."
  }
}
That’s the entire interface. Set logging_backend = "cloudwatch" and you get a serverless, AWS-managed log destination with almost no operational surface. Set it to "opensearch" and you get a full search and analytics cluster you can build Kibana-style dashboards on top of.
The point of putting both behind one variable is that switching later doesn’t mean re-architecting. The collection layer stays the same. Only the destination changes.
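From the caller's side, the toggle might look something like the sketch below. The relative source path and the cluster_name input are assumptions for illustration; only logging_backend is taken from the variable definition above.

```hcl
# Minimal sketch of calling the module from a root configuration.
# Assumptions: the relative source path and the cluster_name input name.
module "logging" {
  source = "./modules/logging"

  cluster_name = "my-cluster" # hypothetical input, not shown in this post

  # The one switch: "cloudwatch" (default) or "opensearch".
  # Any other value fails the variable's validation block at plan time.
  logging_backend = "cloudwatch"
}
```

Changing the string and re-applying is the whole switch from the caller's point of view; the destination resources for the other backend are created on that apply.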
What Fluent Bit Actually Collects
Both backends run Fluent Bit as a DaemonSet, which means one pod on every node, reading logs directly off the local disk. There is no central aggregator to bottleneck on. Each node ships its own logs.
The collection config is the same idea in both paths. Tail the container logs, parse them, enrich them with Kubernetes metadata, then forward:
[INPUT]
    Name              tail
    Path              /var/log/containers/*.log
    multiline.parser  docker, cri
    DB                /var/fluent-bit/state/flb_container.db
    Mem_Buf_Limit     50MB
    Skip_Long_Lines   On
A few things worth understanding here, because they are the difference between logs you can use and logs that lie to you:
The state database. That DB path tracks how far Fluent Bit has read in each file. If a node reboots or the pod restarts, it picks up where it left off instead of re-shipping everything or dropping what it missed.
Multiline parsing at the input. multiline.parser docker, cri handles the two container runtime log formats. Without this, a single log line split across multiple physical lines (which CRI does constantly) arrives as fragments.
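Mem_Buf_Limit 50MB and Skip_Long_Lines On. These protect the agent and the node it runs on. Mem_Buf_Limit is the backpressure control: if the destination is slow, Fluent Bit buffers up to 50MB in memory for this input, then pauses reading until the buffer drains instead of growing until the pod is OOM-killed. Skip_Long_Lines On means a line that exceeds the read buffer is skipped rather than stalling the whole file. On a logging agent, delaying or dropping a few lines is better than destabilizing the node it runs on.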
Enrichment: Making Logs Mean Something
Raw container logs are useless without context. A line that says connection refused means nothing if you don’t know which pod, namespace, and cluster it came from. The Kubernetes filter fixes that:
[FILTER]
    Name                kubernetes
    Match               kube.*
    Merge_Log           On
    Merge_Log_Key       log_processed
    K8S-Logging.Parser  On
    Buffer_Size         32k
This attaches pod name, namespace, labels, and annotations to every record. Merge_Log On means if your application already logs JSON, Fluent Bit parses it and merges the fields in rather than nesting your structured log inside a string. K8S-Logging.Parser On lets individual pods declare their own parser via annotation, so a team running a non-standard log format can opt into custom parsing without touching the cluster config.
Then two more filters run on the CloudWatch path:
[FILTER]
    Name     grep
    Match    kube.*
    Exclude  $kubernetes['namespace_name'] ^(kube-system|amazon-cloudwatch)$

[FILTER]
    Name   modify
    Match  *
    Add    cluster_name my-cluster
The grep filter drops logs from kube-system and amazon-cloudwatch so you’re not paying to store the logging system’s own chatter. The modify filter stamps every record with the cluster name, which matters the moment you have more than one cluster shipping to the same account.
There’s also a multiline filter that stitches Java and Python stack traces back into single records, because a 40-line traceback arriving as 40 separate log entries is how you miss the actual error.
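The module's own multiline filter isn't reproduced in this post, so what follows is only a sketch of what such a filter can look like, using Fluent Bit's built-in java and python multiline parsers. It is wrapped in a Terraform local because the module renders its Fluent Bit config from Terraform; the local's name and the choice of the log key are assumptions.

```hcl
# Sketch only: a multiline filter of the kind described above.
# Assumptions: the local name, and that the raw message lives under the "log" key.
locals {
  stack_trace_multiline_filter = <<-EOT
    [FILTER]
        Name                   multiline
        Match                  kube.*
        multiline.key_content  log
        multiline.parser       java, python
  EOT
}
```

The filter needs to see the raw line before other filters reshape the record, so in a chain like the one above it would normally sit ahead of the kubernetes filter.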
Path A: CloudWatch Logs (The Default)
The CloudWatch path leans on the amazon-cloudwatch-observability EKS add-on. AWS manages the Fluent Bit DaemonSet itself, and the module manages the configuration and the destinations.
Two log groups get created:
- /aws/eks/{cluster_name}/application for your container logs
- /aws/eks/{cluster_name}/dataplane for the systemd units (kubelet, containerd)
Retention is a variable (log_retention_days, default 30), and you can pass a KMS key if you need the logs encrypted with your own key instead of the AWS-managed one.
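To make the retention and encryption knobs concrete, here is a rough sketch of how two log groups like these are typically declared with the AWS provider. This is not the module's actual source: the resource names and the cluster_name variable are assumptions, while log_retention_days and kms_key_arn mirror the variables described above.

```hcl
# Illustrative sketch, not the module's real code.
variable "cluster_name" {
  type = string # hypothetical input
}

variable "log_retention_days" {
  type    = number
  default = 30
}

variable "kms_key_arn" {
  type    = string
  default = null
}

resource "aws_cloudwatch_log_group" "application" {
  name              = "/aws/eks/${var.cluster_name}/application"
  retention_in_days = var.log_retention_days
  kms_key_id        = var.kms_key_arn # null leaves the default encryption in place
}

resource "aws_cloudwatch_log_group" "dataplane" {
  name              = "/aws/eks/${var.cluster_name}/dataplane"
  retention_in_days = var.log_retention_days
  kms_key_id        = var.kms_key_arn
}
```

If you do pass a customer-managed key, its key policy also has to allow the CloudWatch Logs service to use it, otherwise the log group creation fails.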
The CloudWatch output also sets log_format to json/emf, which tells CloudWatch to treat records as Embedded Metric Format. Records that carry an EMF payload get their metrics extracted by CloudWatch itself, and everything else stays queryable from CloudWatch Logs Insights, all without standing up anything extra. If you’ve ever wanted “count of 5xx responses grouped by service, from the logs” without a metrics pipeline, that’s what this buys you.
This is the right default for most teams. It’s serverless, there’s nothing to size, nothing to patch, and nothing to wake up for. You pay per GB ingested and per GB stored, and that’s the whole cost model.
Path B: OpenSearch (When You Need Real Search)
Set logging_backend = "opensearch" and the module deploys a different shape entirely. Fluent Bit comes in via the official Helm chart (pinned to 0.43.0) as a DaemonSet in the logging namespace, and the destination is a managed OpenSearch domain.
The example at examples/logging-opensearch/ shows the full toggle. What you get:
- An OpenSearch 2.11 domain with fine-grained access control
- Auto-generated master credentials stored in Secrets Manager
- A security group restricting HTTPS to the VPC CIDR
- Logs landing in an eks-logs-* index, rotated daily by Logstash-format naming
The Fluent Bit output uses IRSA credentials to sign requests, so there are no static keys in the cluster:
[OUTPUT]
    Name                opensearch
    Host                <domain-endpoint>
    Port                443
    TLS                 On
    AWS_Auth            On
    Suppress_Type_Name  On
    Index               eks-logs
    Logstash_Format     On
    Logstash_Prefix     eks-logs
    Retry_Limit         5
Choose this path when you actually need full-text search across logs, ad-hoc querying, and dashboards that go deeper than CloudWatch Logs Insights. The trade-off is that OpenSearch is a real cluster. You size it (opensearch_instance_type defaults to r6g.large.search, opensearch_instance_count to 1), you pay for it by the hour whether it’s busy or not, and you own its capacity planning. Multi-AZ is a flag (opensearch_multi_az) when you need the durability.
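A sketch of the OpenSearch variant of the module call, using the sizing variables named above. The source path, the cluster_name input, and the VPC and subnet IDs are placeholders and assumptions; the node count of 2 is my choice for the multi-AZ case, not a module default.

```hcl
# Minimal sketch of the OpenSearch path.
# Assumptions: source path, cluster_name input, and all IDs below are placeholders.
module "logging" {
  source = "./modules/logging"

  cluster_name    = "my-cluster" # hypothetical input
  logging_backend = "opensearch"

  # Sizing: defaults are r6g.large.search and a single node.
  opensearch_instance_type  = "r6g.large.search"
  opensearch_instance_count = 2    # more than one node so multi-AZ has something to spread
  opensearch_multi_az       = true

  # Required for OpenSearch placement.
  vpc_id             = "vpc-0123456789abcdef0"
  private_subnet_ids = ["subnet-0aaaaaaaaaaaaaaaa", "subnet-0bbbbbbbbbbbbbbbb"]
}
```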
IRSA: One Role, Scoped to One Job
Whichever backend you pick, Fluent Bit authenticates through IAM Roles for Service Accounts. No access keys anywhere. The module creates a single role bound to the fluent-bit service account, and attaches exactly the policy that backend needs.
For CloudWatch:
{
  "Effect": "Allow",
  "Action": [
    "logs:CreateLogGroup",
    "logs:CreateLogStream",
    "logs:PutLogEvents",
    "logs:DescribeLogStreams"
  ],
  "Resource": "arn:aws:logs:*:*:log-group:/aws/eks/my-cluster/*"
}
That role can write to this cluster’s log groups and nothing else. It cannot read logs, cannot delete log groups, cannot touch any other cluster’s logs. The resource ARN is scoped to the cluster prefix.
For OpenSearch:
{
  "Effect": "Allow",
  "Action": ["es:ESHttp*"],
  "Resource": "<opensearch-domain-arn>/*"
}
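HTTP operations on one specific domain. No es:*, no admin, no ability to delete the domain or reconfigure it. Keep in mind that es:ESHttp* covers every HTTP verb, including GET and DELETE, so what the role can do to indices inside the domain is decided by the fine-grained access control role mapping rather than by IAM alone. The trust policy on both restricts the role to a single Kubernetes service account in a single namespace via the OIDC sub claim.
This is what least-privilege looks like in practice. The log shipper can ship logs to its one destination. That’s the entire blast radius if its credentials are ever compromised.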
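The trust-policy restriction described above can be sketched with the AWS provider's policy document data source. The two variables are hypothetical names for the cluster's OIDC provider details, and the aud condition is a common IRSA hardening step that I am adding for illustration; the post only confirms the sub condition. The namespace shown is the OpenSearch path's logging namespace.

```hcl
# Sketch of an IRSA trust policy pinned to one service account.
# Assumptions: both variable names, and the extra "aud" condition.
variable "oidc_provider_arn" {
  type = string # ARN of the cluster's IAM OIDC provider
}

variable "oidc_provider_url" {
  type = string # issuer URL without the https:// prefix
}

data "aws_iam_policy_document" "fluent_bit_trust" {
  statement {
    effect  = "Allow"
    actions = ["sts:AssumeRoleWithWebIdentity"]

    principals {
      type        = "Federated"
      identifiers = [var.oidc_provider_arn]
    }

    # Only the fluent-bit service account in the logging namespace may assume the role.
    condition {
      test     = "StringEquals"
      variable = "${var.oidc_provider_url}:sub"
      values   = ["system:serviceaccount:logging:fluent-bit"]
    }

    # Tokens must have been issued for STS.
    condition {
      test     = "StringEquals"
      variable = "${var.oidc_provider_url}:aud"
      values   = ["sts.amazonaws.com"]
    }
  }
}
```

Using StringEquals rather than StringLike on the sub claim is the point: a wildcard there would let any service account matching the pattern assume the role.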
The Variables You’ll Actually Touch
The module exposes more than this, but these are the ones that matter day to day:
| Variable | Default | What it controls |
|---|---|---|
| logging_backend | "cloudwatch" | The whole decision: cloudwatch or opensearch |
| namespace | "logging" | Namespace for the logging components |
| log_retention_days | 30 | CloudWatch retention (ignored for OpenSearch) |
| kms_key_arn | null | Optional CMK for CloudWatch log encryption |
| opensearch_instance_type | "r6g.large.search" | Node type when using OpenSearch |
| opensearch_instance_count | 1 | Number of OpenSearch data nodes |
| opensearch_multi_az | false | Spread OpenSearch across AZs |
| vpc_id / private_subnet_ids | "" / [] | Required for OpenSearch placement |
The outputs hand you what the rest of your stack needs: the IRSA role ARN, the active backend, the log destination (a CloudWatch ARN or an OpenSearch endpoint), and the log group details when you’re on CloudWatch.
Things I’d Tell You Before You Deploy This
Three honest notes, because the module has opinions and a couple of them have edges.
The two backends don’t run in the same namespace. CloudWatch’s Fluent Bit lives in the add-on-managed amazon-cloudwatch namespace. OpenSearch’s lives in your logging namespace. If you switch backends, the pod moves. Know where to look.
The grep exclusion is CloudWatch-only. On the CloudWatch path, system namespace logs (kube-system, amazon-cloudwatch) get dropped before shipping. On the OpenSearch path, they don’t. If you switch to OpenSearch and care about index volume, add the equivalent filter, or you’ll be storing and paying for system-component chatter you didn’t ask for.
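An equivalent exclusion for the OpenSearch path could look roughly like this. The filter body follows the CloudWatch one shown earlier, with logging swapped in for amazon-cloudwatch because that is where the agent runs on this path. The local's name is hypothetical, and how the text gets appended to the chart's filter section depends on how the module renders its Helm values, which this post doesn't show.

```hcl
# Sketch only: the same grep exclusion, adapted for the OpenSearch path.
# Assumptions: the local name, and excluding "logging" instead of "amazon-cloudwatch".
locals {
  exclude_system_namespaces_filter = <<-EOT
    [FILTER]
        Name     grep
        Match    kube.*
        Exclude  $kubernetes['namespace_name'] ^(kube-system|logging)$
  EOT
}
```

The filter reads the namespace from Kubernetes metadata, so it has to come after the kubernetes filter in the chain or the field won't exist yet.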
OpenSearch master credentials are generated once and not rotated. The module creates them and stores them in Secrets Manager, but there’s no rotation lifecycle attached. For a regulated environment, wire up rotation yourself or treat that secret as something you rotate on a schedule out of band.
None of these are dealbreakers. They’re the kind of thing you want to know going in rather than discover in an incident.
How to Choose
If you want a straight answer: start with CloudWatch. It’s the default for a reason. Serverless, cheap to run, nothing to operate, and EMF gives you metric extraction for free. The overwhelming majority of EKS clusters never need more than that.
Move to OpenSearch when you have a concrete need that CloudWatch can’t meet: full-text search across high log volume, complex ad-hoc investigation, or a security team that wants its own dashboards over the raw stream. When that day comes, it’s one variable and a re-apply, not a migration project.
What’s Next
Part 4 covers tracing: the OpenTelemetry Collector running on every node, tail-based sampling that always keeps your errors and slow requests, and exporting to X-Ray.
The full stack is on GitHub. Star it, fork it, open issues.