Published
May 29, 2023

Best practices for Kubernetes observability at scale

Melody Hazi
Senior Solutions Architect

What is Kubernetes observability?

Observability is all about giving IT teams actionable insights to maximize workload availability, performance and security. It’s essential when managing complex infrastructure like Kubernetes environments.

Observability tools provide near real-time visibility across the complete Kubernetes infrastructure and application stack, by continuously aggregating and analyzing multiple streams of telemetry from the infrastructure to the app layer.

As a result, operations and DevOps teams can quickly monitor and troubleshoot issues, no matter where they originate.

In this blog, we’ll explore the main concepts of observability, what it means for Kubernetes operators, and the best practices and tools available.

Monitoring vs observability

Many people use "monitoring" and "observability" (often abbreviated O11y) interchangeably. But there is a difference between the two.

Monitoring is about collecting and analyzing data for specified systems and applications and measuring those metrics against predetermined benchmarks.

For example, we might monitor:

  • Server A’s CPU utilization, to catch when it exceeds the threshold we expect to cause performance issues.
  • Application B’s latency, to catch when it rises above acceptable levels.

Seen this way, monitoring is static: you define your thresholds and targets in advance. But today’s systems are very dynamic — especially systems like Kubernetes.

Monitoring is also reactive. Teams often receive alerts only once a known metric has reached a known criticality threshold — in other words, the problem is already serious.

This can lead to SREs and other ops roles burning out, because they’re always fighting fires and dealing with incidents.

Observability is both more dynamic and more proactive than monitoring. While it uses many of the same telemetry data streams as monitoring, it doesn’t rely on static thresholds.

Observability can leverage ML and AI to detect and alert on system changes or anomalies that could cause issues, before those issues materialize.

With these leading indicators of system health, capacity and performance, operators can act to prevent an issue before it cascades and starts affecting user experience. This all helps reduce Mean Time To Detection (MTTD).

Because observability looks across and correlates data from multiple instrumented IT systems, it is also a critical tool for root cause analysis of complex systems, reducing Mean Time To Resolution (MTTR).

Essentially, it helps you to understand not just what occurred, but why, given the broader context of system health.

Fine-tuned observability becomes proactive and provides a continuous feedback loop of improvement, reducing further incidents. Observability is a continuous process, not an end state.

Metrics, logs and traces: the three pillars of observability

The three core pillars (or data streams) used in observability are metrics, logs and traces.

Metrics are numeric insights generated by:

  • All core infrastructure components, from servers to networking devices.
  • Most common services from databases to load balancers to Kubernetes itself. For example, the Kubernetes metrics services can report on everything from state to resource utilization, and are visualized easily in the Kubernetes Dashboard.
  • Applications. For example, an ecommerce platform might emit custom metrics about the number of checkouts completed (see the sketch below).
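
To make this concrete, here is a minimal sketch of how an application might expose such a custom metric using the Prometheus Go client library. The metric name, HTTP route and port are hypothetical; in a real service the counter would be incremented wherever a checkout actually completes.

```go
package main

import (
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// checkoutsCompleted is a hypothetical business metric; Prometheus scrapes it
// from the /metrics endpoint alongside the client library's runtime metrics.
var checkoutsCompleted = promauto.NewCounter(prometheus.CounterOpts{
	Name: "shop_checkouts_completed_total",
	Help: "Total number of completed checkouts.",
})

func checkoutHandler(w http.ResponseWriter, r *http.Request) {
	// ... order processing would happen here ...
	checkoutsCompleted.Inc() // increment on every successful checkout
	w.WriteHeader(http.StatusOK)
}

func main() {
	http.HandleFunc("/checkout", checkoutHandler)
	http.Handle("/metrics", promhttp.Handler()) // exposition endpoint for Prometheus
	http.ListenAndServe(":8080", nil)
}
```

Prometheus then scrapes the /metrics endpoint on its normal scrape interval, whether that is configured via a static scrape config or a ServiceMonitor when using the Prometheus Operator.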

Logs are like the black box on a plane. They capture the record of events that actually happened, as they happen. Because logs can contain a lot of unfiltered detail, they are not well suited to real-time monitoring. They do, however, become critical during root cause analysis after the event, when you are looking into the when, where and why.

Distributed tracing is not a core tenet of monitoring, but it is critical to observability for distributed systems that use a microservices architecture. Tracing follows a single distributed transaction through a series of microservices, even as it passes to external dependencies like a cloud-based authentication service. Tracing provides insights not only into the dependencies in your application, but also into where in the distributed architecture there are errors, high latency and so on.
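
As a rough illustration, here is a sketch of manual instrumentation with the OpenTelemetry Go API. The service, span and function names are hypothetical; in a real deployment a tracer provider configured with an exporter (as in the OpenTelemetry example later in this post) would ship the spans to a tracing backend such as Jaeger.

```go
package main

import (
	"context"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
)

// tracer would normally be backed by an SDK tracer provider that exports
// spans to a backend; without one configured, these calls are no-ops.
var tracer = otel.Tracer("checkout-service")

func processOrder(ctx context.Context, orderID string) error {
	// Start a child span; it carries the trace ID propagated from upstream
	// services, so the whole transaction can be stitched back together.
	ctx, span := tracer.Start(ctx, "processOrder")
	defer span.End()

	span.SetAttributes(attribute.String("order.id", orderID))

	if err := chargePayment(ctx, orderID); err != nil { // downstream call carries ctx
		span.RecordError(err) // the error is attached to this span in the trace view
		return err
	}
	return nil
}

func chargePayment(ctx context.Context, orderID string) error {
	_, span := tracer.Start(ctx, "chargePayment")
	defer span.End()
	// ... call the external payment or authentication service here ...
	return nil
}

func main() {
	_ = processOrder(context.Background(), "order-123")
}
```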

Other data streams can be used for observability as well. These include:

Events, which are a point-in-time action (such as a code push or upgrade from a git process) that could affect system or application availability and performance. When correlated on a timeline with metrics, logs and traces, events can help pinpoint a root cause. For example, if application latency spikes after a code push event, it’s likely the root cause was bad code.

Digital User Experience Data. The two main types of Digital Experience Data are Synthetics and Real User Monitoring (RUM).

  • Synthetics tests application access and performance systematically from multiple locations, device types and browser types. This data can help you pinpoint whether an issue is location-specific or caused by an incompatibility with a particular device or browser version.
  • RUM tracks similar information, such as errors or latency by location, device and/or browser type, but from real user transactions.

Finally, many people call visualization the fourth pillar of observability. Visualization is what turns data into intuitively understandable insights, through charts, graphs and dashboards that use color, shape and size to call attention to urgent issues and emerging trends.


Why is observability important?

Every business in every industry depends on applications, and the infrastructure they run on, being available and performing as expected.

Issues and outages of any kind can have immediate, business-critical impact:

  • For a retailer, a point of sale or ecommerce payment system suffering downtime means lost transaction revenue and reputational damage.
  • For a manufacturer, a software issue could mean a factory grinding to a halt.
  • In healthcare, an outage in a diagnostic or patient record system could cost lives.

These scenarios are not just hypothetical. In the 2021 State of Observability report:

  • 53% of respondents said app issues had resulted in customer or revenue loss.
  • 45% reported lower customer satisfaction as a result of service failures.
  • 30% reported losing customers.

In the old world of monolithic applications running on a server, determining the root cause of an issue was much simpler. After the on-call pager went off, an incident response team could quickly identify the problem, review logs to pinpoint the cause, and act to restore service.

But in today’s world of distributed microservices and cloud-native infrastructure, things are very different.

More components, more dependencies

A Kubernetes environment is the opposite of a monolithic architecture:

  • First, the Kubernetes control plane itself is made up of multiple interacting components.
  • Then you have the infrastructure running both the control plane and workloads.
  • Lastly you have your workloads (applications and services) running in pods/deployments.

Traditional monitoring paradigms are not capable of identifying the root cause of an incident in a landscape like this. Only an observability solution can span all these different components, understand their dependencies and correlate the monitoring telemetry.

Ephemeral infrastructure

Kubernetes is dynamic and constantly changing. Both node pools and pods/deployments can scale, and components can be created, replicated and destroyed frequently.

Traditional monitoring only tracks entities it already knows about. Modern observability-based monitoring tools can dynamically recognize new resources that need monitoring in continuously evolving environments.

Real-time application changes

In Kubernetes environments, provisioning or updating applications happens very frequently, often as part of a CI/CD pipeline. Near real-time observability is required to catch any bad pushes before they have a significant impact on performance or stability.

The most popular Kubernetes observability tools

A host of monitoring and observability tools are available for Kubernetes. Many have some stewardship from the Cloud Native Computing Foundation (CNCF) under the Observability TAG, including fluentd for logging, the Jaeger operator for tracing, and Prometheus.

Prometheus and Grafana

Undoubtedly, the dominant Kubernetes observability tool is Prometheus, although to perpetuate the confusion, it calls itself a Kubernetes monitoring tool!

Prometheus has been around for a while. It was first created back in 2012, and became the second CNCF incubated project after Kubernetes itself.

It was created because existing metric formats such as StatsD and Graphite were insufficient for monitoring distributed cloud native workloads such as Kubernetes.

Prometheus includes PromQL, its own query language used to create dashboards and alerts, and Alertmanager, to trigger alerts based on PromQL queries.
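
As a quick illustration of PromQL in action, here is a small sketch that runs a query against a Prometheus server using the official Go API client. The server address and namespace label are assumptions, and the metric is a standard cAdvisor metric; the same expression could just as easily back a Grafana panel or a Prometheus alerting rule routed through Alertmanager.

```go
package main

import (
	"context"
	"fmt"
	"log"
	"time"

	"github.com/prometheus/client_golang/api"
	v1 "github.com/prometheus/client_golang/api/prometheus/v1"
)

func main() {
	// The address is an assumption; point it at your Prometheus server.
	client, err := api.NewClient(api.Config{Address: "http://prometheus:9090"})
	if err != nil {
		log.Fatal(err)
	}
	promAPI := v1.NewAPI(client)

	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	// Per-pod CPU usage over the last 5 minutes in the default namespace.
	query := `sum(rate(container_cpu_usage_seconds_total{namespace="default"}[5m])) by (pod)`

	result, warnings, err := promAPI.Query(ctx, query, time.Now())
	if err != nil {
		log.Fatal(err)
	}
	if len(warnings) > 0 {
		log.Println("warnings:", warnings)
	}
	fmt.Println(result)
}
```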

Prometheus is often paired with Grafana, which is used for dashboard visualizations of Prometheus metrics.

Prometheus is simple, versatile and easy to deploy, but in my experience, it works best in non-production environments or where each business unit/team manages their own observability.

Prometheus can be federated, but its simplicity can make it challenging to scale to support large centralized deployments and enterprise requirements (here’s a great blog on scaling considerations).

Important limitations include the lack of:

  • Long-term storage for historical data analysis.
  • Multi-tenancy and enterprise RBAC, to provide controls for segmenting and isolating certain sets of data to meet compliance requirements or support multi-team/department usage.

Datadog and other commercial tools

The other common Kubernetes observability platforms are commercial solutions, designed with additional features to tackle the complexity and scale of large enterprises.

Common tools in this space are Datadog, Dynatrace, New Relic and Splunk Observability, and the Elastic Stack, formerly the open source ELK stack (Elasticsearch, Logstash, and Kibana).

These commercial solutions handle observability better at scale. They can process billions of data points per minute and store historical data for at least a year. They also include enterprise "service bureau" features, such as data isolation, segmentation and enterprise-level RBAC.

The downside of going the commercial route is that these platforms have proprietary components, such as their query languages (examples are New Relic’s NRQL and Splunk’s SignalFlow). You may need to build up a vendor-specific set of skills and knowledge, which often means a longer time to deployment. You also run the risk of vendor lock-in.

The third way: OpenTelemetry

An interesting area to watch is OpenTelemetry (OTel). OpenTelemetry is now the second-largest CNCF project, and it’s bridging the gap between open source and commercial solutions.

OpenTelemetry is quickly being adopted as the standard for collecting observability telemetry data from cloud native distributed systems.

It is supported by all the major commercial solutions and can be the data collector for any of the main commercial/enterprise backend solutions such as Splunk, Datadog, and Grafana Cloud.
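
To sketch what that vendor neutrality looks like in practice, the snippet below (with an assumed Collector endpoint) configures the OpenTelemetry Go SDK to export spans over OTLP to a local OpenTelemetry Collector; the Collector’s own configuration then decides which backend, open source or commercial, receives the data.

```go
package main

import (
	"context"
	"log"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
)

func main() {
	ctx := context.Background()

	// Export spans over OTLP to an OpenTelemetry Collector. The endpoint is
	// an assumption; the Collector decides where the data is forwarded.
	exporter, err := otlptracegrpc.New(ctx,
		otlptracegrpc.WithEndpoint("otel-collector:4317"),
		otlptracegrpc.WithInsecure(),
	)
	if err != nil {
		log.Fatal(err)
	}

	tp := sdktrace.NewTracerProvider(sdktrace.WithBatcher(exporter))
	defer tp.Shutdown(ctx)
	otel.SetTracerProvider(tp)

	// Application code instruments against the vendor-neutral otel API only.
	tr := otel.Tracer("demo")
	_, span := tr.Start(ctx, "startup")
	span.End()
}
```

Swapping backends then becomes a Collector configuration change rather than an application change.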

Best practices for observability success

Observability is complex and fast changing, but there are a few best practices you can follow.

Be clear on your basic requirements

First, look for a modern observability solution designed for distributed and cloud-based workloads.

Kubernetes is made up of many components, so any solution you choose should be able to collect telemetry from all the components of Kubernetes (nodes, pods, etcd, API server, etc.) out of the box, with minimal configuration.

Prometheus and OpenTelemetry both do this well with their Kubernetes Operators.

Consider your specific telemetry needs

Map out what other telemetry data you may need to collect, now or in the future. For example: do you need metrics from a certain database or message bus type? What about the ability to collect custom metrics from your applications?

Make sure your shortlisted O11y solution supports the data sources you need or allows you to create a plugin/receiver to collect custom metrics.

Again, Prometheus and OpenTelemetry both offer a range of out-of-the-box integrations, community-provided integrations and frameworks for creating custom integrations.

Keep it simple

Kubernetes is complex in itself, so you shouldn’t overcomplicate your O11y solution. Don’t be lured by long lists of features if they don’t map to your requirements. Instead, look for a solution that provides flexibility for the future.

We tend to recommend Prometheus. It’s feature-rich, but more importantly will also work for future requirements with minimal retooling. There’s a great community of solutions to extend its functionality.

For example: you may not need long term data storage today. But if you operate in an industry with compliance reporting requirements, you will probably need it in the future. At that point, you can choose from several TSDBs (time series databases) that work with Prometheus, such as Grafana Mimir OSS or InfluxDB.

O11y made easy: how to deploy the Spectro Cloud Palette Monitoring Stack

In the spirit of keeping it simple, here’s an easy way to deploy an observability solution to your Kubernetes clusters.

We’ve just released the Spectro Cloud Monitoring Stack, which lets you seamlessly deploy a centralized Prometheus instance that includes Alert Manager and Grafana for alerting and dashboarding.


The stack has two parts:

First, there is the Prometheus Operator Pack, designed to be deployed on its own cluster. The Prometheus Operator Pack has a preset to enable remote monitoring (i.e., accept metrics from other clusters as well as metrics from itself).

Next there’s the Prometheus Agent Pack. You can add this pack to the Cluster Profile of any cluster you want to observe. When deployed, it will scrape metrics from the cluster it is installed on and forward those metrics to the centralized Prometheus instance you deployed with the Prometheus Operator Pack.


We’re excited about how this Stack will simplify observability deployment in multi-cluster environments. Observability is so critical to running modern infrastructure, which is why we’re already working on further integrations, including options to address long-term storage with Prometheus, and collection and export of telemetry to an external commercial O11y solution.

As always, we love feedback, so please do come say hi in our Slack channel or on social if you have any questions or ideas to share. We’ll soon be running a webinar on Kubernetes observability, so make sure you sign up to our BrightTalk channel here and you’ll be the first to hear about it.

Tags:
Concepts
Best Practices
Observability