The all-seeing eye
The first time you deploy a production workload on Kubernetes you discover two truths at once: the platform’s power is intoxicating — and it will betray you the moment you stop watching it.
Pods churn, nodes recycle, network paths mutate, and the CNCF gremlins hold stand‑up meetings in your kube‑system namespace.
Monitoring is both your key defense and an accelerant: with the right visibility you can tighten feedback loops, prune cloud waste before finance pings you, and push features on a release train that never slows for blind corners.
The ecosystem has matured rapidly since the early Prometheus–Grafana days. We now have high‑availability storage layers, eBPF‑powered tracers that speak kernel, and OpenTelemetry shaping itself into a lingua franca for every signal you care about.
The goal of this article is to map that landscape, show how the pieces fit, and help you choose a toolchain that matches your scale, budget and patience.
Monitoring, observability and why the distinction matters
We keep a sharp line between monitoring — the discipline of measuring the health of a system against known expectations — and observability, the larger craft of debugging the unknown. We will stay focused on monitoring while acknowledging the data sources that observability borrows.
Monitoring asks predictable questions: Is my API returning 200s? How much memory does the ingress controller eat? Its outputs feed dashboards, alert pages and weekly capacity reviews.
Observability is what happens when you face a question you did not prepare for: Why did checkout latency spike only for users in Frankfurt?
They share telemetry — metrics, logs and traces — but they emphasise different workflows. Monitoring focuses on fast, often numeric time‑series queries that drive dashboards and alerts. Observability tends to stitch together richer context using traces and logs so you can follow an individual request.
For readers who want to go deeper into the observability (“o11y”) side, check out Melody Hazi’s blog post on the Spectro Cloud site and the accompanying webinar replay.
What a modern Kubernetes monitoring solution must do
Before we dive into the tools, it helps to spell out the requirements that shape today’s choices:
- Horizontal scale without drama. A single‑tenant SaaS provider may run dozens of Kubernetes clusters and 100,000 pods. Scrapes and queries must remain fast when cardinality climbs.
- Multi‑source ingest. Metric data comes from the kubelet, cAdvisor, the control‑plane API, application exporters, network overlays and cloud services. A good stack handles them all.
- Low friction for developers. Engineers should be able to expose a metric with a single dependency import and see it appear in a dashboard within minutes (see the sketch just after this list).
- Actionable alerts, not spam. Pager fatigue is real; alert rules must support sophisticated routing and deduplication.
- Open standards. Teams want to avoid the sunk‑cost trap of proprietary agents and query languages.
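To make that “low friction” point concrete, here is a minimal sketch of the single-dependency experience using the Prometheus Go client, client_golang. The metric name, label and port are illustrative choices for this example, not recommendations.

```go
package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// checkout_requests_total is a hypothetical counter; promauto registers it
// with the default registry so it appears on /metrics automatically.
var checkoutRequests = promauto.NewCounterVec(
	prometheus.CounterOpts{
		Name: "checkout_requests_total",
		Help: "Checkout requests handled, labelled by HTTP status code.",
	},
	[]string{"code"},
)

func main() {
	http.HandleFunc("/checkout", func(w http.ResponseWriter, r *http.Request) {
		checkoutRequests.WithLabelValues("200").Inc() // increment on every request
		w.Write([]byte("ok"))
	})

	// Prometheus (or an OTel Collector) scrapes this endpoint on its usual cycle.
	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```

Annotate the pod for scraping (or add a ServiceMonitor if you run the Prometheus Operator) and the new series shows up on the next scrape cycle.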
With those needs in mind, let’s tour the 2025 landscape.
The Kubernetes monitoring landscape in 2025
Prometheus and Grafana
The grand‑daddy of cloud‑native monitoring, Prometheus still dominates metric collection thanks to its rich label system, the PromQL query language and an HTTP pull‑based model the cluster already understands. Scraping /metrics endpoints requires no sidecars and integrates with kube‑state‑metrics for cluster internals.
Modern Prometheus releases comfortably handle millions of active series on commodity SSDs, but single‑server limits still surface beyond a few hundred nodes. Push much past roughly 10 million active series and a lone Prometheus will topple over, so serious users layer in Thanos or Cortex for horizontal sharding and cheap object‑store retention:
- Thanos stitches multiple Prometheus instances together with an object‑storage backend and query federation that feels native.
- Cortex shards time‑series across a set of horizontally scalable microservices and stores data in the object stores and cloud databases you already trust.
Both approaches give you high‑availability ingestion, horizontal query fan‑out, and virtually limitless retention. Teams usually pick Thanos when they run Prometheus themselves and reach for Cortex when they prefer a single control plane managed by someone else.
If even a small Thanos deployment sounds like weekend work you would rather avoid, the cloud vendors agree. Google Cloud Managed Prometheus deploys a fully compatible collector as a pod and ships data to a hosted backend that bills by samples ingested. AWS Container Insights and Azure Monitor follow similar models. You trade root access to TSDB files for an SLA and a tidy invoice line item.
Prometheus’ partner in crime, Grafana, translates Prometheus queries into the dashboards and valuable insights executives expect. Recent releases fold in a unified alerting engine, alert rule version control and stateful transformations that once required Loki or SQL.
OpenTelemetry as the connective tissue
The Cloud Native Computing Foundation blessed OpenTelemetry (OTel) as the universal specification for metrics, logs and traces, and it has moved rapidly from promising to pervasive.
The OTel Collector runs as a DaemonSet or sidecarless eBPF agent, scrapes Prometheus endpoints, tails container logs and receives spans from instrumented code, then funnels everything to a backend of your choosing.
Teams like OTel because collectors can push to both open‑source stores (Thanos, Loki, Tempo) and commercial APMs (Datadog, New Relic) at the same time — giving you a migration path.
The consensus in 2025 is clear: even if you love PromQL and Grafana, run an OTel Collector in front. You get vendor choice, future‑proof auto‑instrumentation and one place to enforce scrub rules before data leaves the cluster.
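To show what “instrumented code” looks like on the application side, here is a rough sketch using the OpenTelemetry Go SDK to ship spans to a Collector over OTLP/gRPC. The service name and Collector address are assumptions for the example, not canonical values.

```go
package main

import (
	"context"
	"log"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
)

func main() {
	ctx := context.Background()

	// Export spans to a Collector; the address is illustrative and would
	// typically point at the DaemonSet instance on the local node.
	exporter, err := otlptracegrpc.New(ctx,
		otlptracegrpc.WithEndpoint("otel-collector.monitoring:4317"),
		otlptracegrpc.WithInsecure(),
	)
	if err != nil {
		log.Fatal(err)
	}

	// Batch spans in memory and flush them to the exporter.
	tp := sdktrace.NewTracerProvider(sdktrace.WithBatcher(exporter))
	defer tp.Shutdown(ctx)
	otel.SetTracerProvider(tp)

	// Application code only ever touches the vendor-neutral API.
	tracer := otel.Tracer("checkout-service")
	_, span := tracer.Start(ctx, "charge-card")
	// ... do the work ...
	span.End()
}
```

The nice part: where the Collector forwards those spans, whether Tempo, Jaeger or a commercial APM, is a Collector configuration change, not a code change.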
eBPF and event streams: seeing inside the kernel without sidecars
Extended Berkeley Packet Filter, or eBPF, has quietly shifted from niche packet filter to observability superpower. Tools like Pixie, Cilium Hubble and the newcomer Groundcover attach probes to system calls and kernel tracepoints, streaming rich telemetry with near‑zero overhead. We wrote about Cilium and eBPF back in 2022; things have come a long way since.
For monitoring, eBPF excels at network‑flow metrics (latency by service pair), syscall error rates and resource hot spots that userspace exporters miss. Most teams deploy an eBPF layer as a complement to — not a replacement for — Prometheus because kernel events answer different questions.
The commercial tools
Datadog: A full SaaS platform that bundles metrics, logs, traces and security signals under one UI. Datadog’s Kubernetes integration auto‑discovers pods and ships metadata so you can build dashboards without touching YAML. Pricing is per‑host plus per‑metric overage, so keep an eye on cardinality.
Splunk Observability Cloud: Splunk bought SignalFx for its real‑time metrics engine and added it to a portfolio that already handled logs at scale. The SignalFx backend can drink straight from OTel Collectors, meaning you can switch without redeploying agents.
There are others, like New Relic, but this blog is already going to be long enough without covering them all!
Cloud provider native tools
As you might expect, each hyperscaler has some baked-in cloud monitoring tools that you can leverage as a customer.
- AWS CloudWatch Container Insights stitches together Prometheus metrics, container logs and X‑Ray traces into the CloudWatch UI.
- Google Cloud Managed Prometheus offers a drop‑in replacement that speaks Prometheus remote‑write and keeps data in Monarch, Google’s internal timeseries store.
- Azure Monitor Container Insights attaches an agent to each node and feeds metrics into Azure Monitor, Log Analytics and Application Insights.
All three cloud-based options offer excellent cost per time series at massive scale and integrate neatly with IAM and billing. The trade‑off is vendor lock‑in: your PromQL dashboards may not run outside that cloud.
When you stay inside one hyperscale provider, you inherit a data‑plane optimised for that provider’s storage and query engines. For example, Google’s Monarch back‑end stores trillions of points with millisecond latency, something you would struggle to replicate self‑hosted. The price you pay is not only hard dollars but also the agility cost of moving later.
Cross‑cloud outfits, and teams running on‑prem plus edge nodes, tend to prefer DIY stacks such as Prometheus + Thanos because they can run everywhere. Hybrid strategies are common too: write to Managed Prometheus for long‑term retention, and keep a seven‑day buffer in a local Prometheus that feeds fast SLO dashboards in Grafana.
Logs and traces: adjacent signals you can’t ignore
Although this article focuses on monitoring, metrics rarely live alone.
For logs, the traditional answer has been an Elasticsearch–Logstash–Kibana trio, but the storage cost of unstructured data grew faster than disk capacity. Grafana Loki took a different route, indexing only labels and leaving raw text in object storage. In practice this slices log bills by an order of magnitude and keeps latency reasonable.
Distributed traces complete the picture by following a single request hop by hop. Jaeger and Grafana Tempo share the same DNA as Prometheus and Loki: stateless collectors at the edge, cheap object storage at rest. While traces belong to observability more than monitoring, they power two monitoring staples: requests‑per‑second curves and tail‑latency alerts.
Fluentd (or its lighter sibling Fluent Bit) remains a common first hop for logs, fanning them out to Elasticsearch, Splunk or Loki. Fluentd can also convert counter increments into Prometheus metrics, bridging the two worlds.
Other handy tools
Kubernetes dashboard: Still bundled with upstream Kubernetes, the dashboard offers a quick heads‑up view of pods and nodes. It is handy for demos and for the early days of a cluster, but it stores nothing long‑term, and it cannot aggregate metrics across clusters. Think of it as a control‑panel, not a monitoring platform.
Kubewatch: Kubewatch is a lightweight event notifier that listens to the Kubernetes API and forwards changes (pod restarts, deployment updates) to Slack or Microsoft Teams. It is not a metrics store, but it is invaluable for catching accidental rollouts and mis‑configurations.
Choosing a bundle that fits your use case
A ten‑node lab cluster hums along happily with a single Prometheus pod and Loki running on local disks. A fintech platform spread across three regions will insist on replica‑aware Prometheus shards behind Thanos, OTel Collectors emitting to both Loki and Tempo and a dashboard tied to Service Level Objectives that drive contractual commitments.
If you operate at the edge, shipping metrics over flaky links may cost more than storing them locally. Lightweight time‑series databases such as VictoriaMetrics running at the edge, or the built‑in buffering inside the OTel Collector, let you hold data until the connection returns.
For startups with two SREs and no appetite for 3 a.m. Thanos upgrades, a managed backend is rational. The trade‑off is data egress lock‑in; always check whether you can export raw Prometheus blocks or trace spans if you decide to migrate later.
Scaling, cost control and reliability considerations
Large clusters can push Prometheus past safe limits in days. The rule of thumb today is 100,000 active series per 4 vCPU Prometheus shard. When you grow beyond that, add Thanos Querier plus bucket‑store on S3 or GCS, and keep only 24–48 hours hot on disk. Cardinality explosions often come from unbounded label values such as pod_name. Drop or relabel before data hits storage.
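If you want to see how close you are to those limits, one lightweight option is to ask Prometheus itself. The sketch below uses the Prometheus Go API client to read the head‑series gauge and list the top metric names by series count; the in‑cluster address is an assumption, so adjust it to your deployment.

```go
package main

import (
	"context"
	"fmt"
	"log"
	"time"

	"github.com/prometheus/client_golang/api"
	promv1 "github.com/prometheus/client_golang/api/prometheus/v1"
)

func main() {
	// The service address is an assumption; point it at your own Prometheus.
	client, err := api.NewClient(api.Config{Address: "http://prometheus-k8s.monitoring:9090"})
	if err != nil {
		log.Fatal(err)
	}
	promAPI := promv1.NewAPI(client)

	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	// Total active series currently held in the TSDB head block.
	head, _, err := promAPI.Query(ctx, "prometheus_tsdb_head_series", time.Now())
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("active series:", head)

	// Top ten metric names by series count: the usual cardinality suspects.
	top, _, err := promAPI.Query(ctx, `topk(10, count by (__name__)({__name__=~".+"}))`, time.Now())
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(top)
}
```

The same two PromQL expressions work just as well pasted into Grafana’s Explore view; the point is simply to watch the number before it watches you.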
Alerting deserves equal care. Aim for five to ten paging alerts max, all mapped to user‑facing symptoms. Everything else should route to Slack during business hours. Silence flapping alerts ruthlessly — the on‑call rota and your cloud bill will thank you.
Finally, test failure modes. Kill a Prometheus pod and confirm a standby takes over without data loss. Disable the network path to your object store and observe how long WAL buffers hold up.
How Palette helps, without the hard sell
Spectro Cloud Palette is not a monitoring product, but it can make whatever monitoring stack you choose easier to deploy and keep healthy.
With Palette you can define a monitoring pack — for instance an OTel Collector plus Prometheus plus Grafana — and make it part of your ‘Cluster Profile’, then declaratively roll it out consistently to every cluster, whether on EKS, vSphere or bare metal.
Need to patch Prometheus for the latest CVE? Bump the pack version in the Cluster Profile, and Palette automates the rolling replacement.
Palette also surfaces basic cluster health metrics natively so you can catch capacity drifts before you even pivot to your external dashboards.
Conclusion
Monitoring remains the first safety net for any Kubernetes workload, and in 2025 the choices are powerful but sane.
Start with Prometheus because everything speaks its exposition format. Slide an OTel Collector in front to keep your options open. Add Loki for logs, Tempo for traces and sprinkle eBPF where kernel visibility pays dividends. When scale bites, reach for Thanos or a managed backend rather than stretching a single‑node Prometheus into heroic territory.
Above all, treat monitoring as a product. Guard alert quality, review dashboards as if they were API contracts, and budget for the storage you will need a year from now.