Kubernetes Monitoring: 5 Tools and 4 Best Practices You Must Know About

What is Kubernetes Monitoring?

Kubernetes monitoring helps you identify issues and proactively manage Kubernetes clusters. Effective monitoring makes it easier to manage your containerized workloads by tracking uptime, utilization of cluster resources (such as memory, CPU, and storage), and interaction between cluster components.

Kubernetes monitoring allows cluster administrators and users to monitor the cluster and identify issues such as insufficient resources, failures, pods that are unable to start, or nodes that cannot join the cluster. Many organizations use specialized cloud-native monitoring tools to gain full visibility over cluster activity.

This is part of an extensive series of guides about observability.

In this article, you will learn:

What Kubernetes Metrics Should You Measure?
Top 5 Kubernetes Monitoring Tools
4 Kubernetes Monitoring Best Practices
Kubernetes Monitoring and Observability with Calico

What Kubernetes Metrics Should You Measure?

There are several things that are important to monitor in Kubernetes:

Cluster monitoring – Keeps track of the health of an entire Kubernetes cluster. Helps you verify if nodes are functioning properly and at the right capacity, how many applications run on a node, and how the cluster as a whole utilizes resources.
Pod monitoring – Keeps track of issues affecting individual pods, such as resource utilization of the pod, application metrics, and metrics related to replication or autoscaling of the pod.
Deployment metrics – Track the state of your Kubernetes deployments, for example with Prometheus. Relevant data includes cluster CPU and memory usage, along with the metrics exposed by kube-state-metrics and cAdvisor.
Ingress metrics – Monitoring ingress traffic can help identify and manage various issues. You can use controller-specific mechanisms to configure ingress controllers to track workload health and network traffic statistics.
Persistent storage – Monitor the health of persistent volumes. Container Storage Interface (CSI) drivers can detect abnormal volume conditions and report them, and the external health monitor controller can also watch for node failures.
Control plane metrics – You should monitor schedulers, API servers, and controllers to track and visualize cluster performance for troubleshooting purposes.
Node metrics – Monitoring CPU and memory on each Kubernetes node helps you act before a node runs out of resources. Several conditions describe the status of a running node, such as Ready, MemoryPressure, DiskPressure, PIDPressure, and NetworkUnavailable.
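As a minimal sketch of how node conditions can be checked, the function below takes a node object shaped like one item from `kubectl get nodes -o json` and lists the conditions that point to a problem. The sample node, its name, and the function name are hypothetical and only serve the illustration.

```python
def unhealthy_conditions(node):
    """Return the names of node conditions that indicate a problem.

    `node` is a dict shaped like one item from `kubectl get nodes -o json`.
    """
    problems = []
    for cond in node.get("status", {}).get("conditions", []):
        ctype, status = cond.get("type"), cond.get("status")
        if ctype == "Ready":
            # Ready is the one condition that should be "True" on a healthy node.
            if status != "True":
                problems.append("Ready")
        elif status != "False":
            # Pressure-style conditions are healthy when "False";
            # "True" or "Unknown" both deserve attention.
            problems.append(ctype)
    return problems


# Hypothetical node used only for this example.
sample_node = {
    "metadata": {"name": "worker-1"},
    "status": {
        "conditions": [
            {"type": "MemoryPressure", "status": "False"},
            {"type": "DiskPressure", "status": "True"},
            {"type": "NetworkUnavailable", "status": "False"},
            {"type": "Ready", "status": "True"},
        ]
    },
}

print(unhealthy_conditions(sample_node))  # ['DiskPressure']
```

In practice you would feed this kind of check with data from the Kubernetes API or from a metrics exporter rather than a hard-coded dictionary.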

The following table summarizes important metrics for cluster and pod monitoring.

Monitoring Level	Metrics	Description
Cluster	Nodes	Measures how many nodes are available and healthy, letting you determine the cloud resources you need to run the cluster.
Cluster	Resource utilization	Measures the computing resources utilized by your nodes—including memory, CPU, bandwidth, and disk utilization. Understanding resource utilization helps inform decisions to decrease or increase the size or number of nodes in a cluster.
Pod	Container metrics	Include network, CPU, and memory usage, compared with the prescribed maximum. These metrics can be accessed through metrics-server, which exposes the Metrics API.
Pod	Application metrics	These metrics are specific to the application and relate to its business logic. For example, a web application may provide metrics detailing the number of users accessing the application, user experience metrics, and conversion actions.
Pod	Pod health and availability	Allow you to monitor how the orchestrator handles a specific pod. You can monitor for information such as the actual number of pod instances at a given moment compared to the expected number. These metrics also include health checks, network data, and in-progress deployments.
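To illustrate what comparing container usage with its configured maximum can look like, here is a small sketch that parses Kubernetes-style resource quantities (for example `250m` CPU or `128Mi` memory) and computes utilization as a fraction of the limit. The function names and sample values are assumptions for the example, and the parser covers only the common suffixes.

```python
from decimal import Decimal

# Multipliers for the common Kubernetes quantity suffixes.
_SUFFIXES = {
    "n": Decimal("1e-9"), "u": Decimal("1e-6"), "m": Decimal("1e-3"),
    "k": Decimal(10) ** 3, "M": Decimal(10) ** 6,
    "G": Decimal(10) ** 9, "T": Decimal(10) ** 12,
    "Ki": Decimal(2) ** 10, "Mi": Decimal(2) ** 20,
    "Gi": Decimal(2) ** 30, "Ti": Decimal(2) ** 40,
}


def parse_quantity(text):
    """Convert a quantity string such as '250m' or '128Mi' to a Decimal."""
    # Try two-letter suffixes first so "Mi" is not misread as "M".
    for suffix in sorted(_SUFFIXES, key=len, reverse=True):
        if text.endswith(suffix):
            return Decimal(text[: -len(suffix)]) * _SUFFIXES[suffix]
    return Decimal(text)


def utilization(usage, limit):
    """Return usage as a fraction of the limit (1.0 means at the limit)."""
    return float(parse_quantity(usage) / parse_quantity(limit))


# Hypothetical readings: CPU usage in nanocores against a 250m limit,
# and memory usage against a 128Mi limit.
print(utilization("125000000n", "250m"))  # 0.5
print(utilization("96Mi", "128Mi"))       # 0.75
```

A value that stays close to 1.0 suggests the container is near its limit and may be throttled or terminated.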

Top 5 Kubernetes Monitoring Tools

Kubernetes is a complex environment, and containerized applications can be distributed across multiple environments. Monitoring solutions must be able to aggregate metrics from across the distributed environment, and deal with the ephemeral nature of containerized resources. The following are popular monitoring tools designed for a containerized environment.

1. Kubernetes Dashboard

Kubernetes Dashboard is a web-based user interface for Kubernetes. You can use it to:

Deploy containerized applications to a Kubernetes cluster
Troubleshoot containerized applications
Manage cluster resources
Get an overview of the applications running on a cluster
Create and modify individual Kubernetes resources
Monitor the health of Kubernetes resources and discover errors

The Kubernetes dashboard will give you a bird’s eye view of what’s going on in your clusters, but it’s not enough for production monitoring. This requires a dedicated Kubernetes monitoring tool such as Prometheus.

2. Prometheus

A popular monitoring tool that was developed by SoundCloud before being donated to the Cloud Native Computing Foundation (CNCF), Prometheus provides alerts with detailed metrics and analysis for Kubernetes and Docker. It is designed for monitoring container-based microservices and applications running at scale. Prometheus is often used in combination with Grafana to enable data visualization.

Prometheus metrics are exposed over HTTP(S). There is no need to install a service agent; the service only has to expose a web port. Prometheus servers scrape (pull) that endpoint at regular intervals, so services do not need to push metrics or configure remote endpoints. Prometheus uses a human-readable text format for metrics, which makes it quick to start publishing them.

Some microservices already use HTTP for their functionality. In this case, you can reuse the internal web server and add an endpoint at the /metrics path. Some services expose Prometheus metrics natively, such as the Traefik web proxy, the Kubernetes kubelet, and the Istio service mesh. Services that are not natively integrated can be adapted with an exporter, a service that collects service statistics and turns them into scrape-ready Prometheus metrics.
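The following sketch shows roughly what exposing metrics from an existing service can look like, using only the Python standard library. The metric name `app_requests_total`, the counter values, and port 8000 are assumptions made for the example; a production service would more likely use an official Prometheus client library.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical counters; a real service would update these while handling work.
REQUEST_COUNTS = {("GET", "200"): 42, ("GET", "500"): 3}


def render_metrics(counts):
    """Render counters in the Prometheus text exposition format."""
    lines = [
        "# HELP app_requests_total Requests handled, by method and status code.",
        "# TYPE app_requests_total counter",
    ]
    for (method, code), value in sorted(counts.items()):
        lines.append(f'app_requests_total{{method="{method}",code="{code}"}} {value}')
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Serve metrics only on the /metrics path; everything else is a 404.
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(REQUEST_COUNTS).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8000):
    """Start the HTTP server; blocks until interrupted."""
    HTTPServer(("0.0.0.0", port), MetricsHandler).serve_forever()


print(render_metrics(REQUEST_COUNTS), end="")
```

Calling `serve()` starts the endpoint, after which a Prometheus server can be pointed at `/metrics` on that port.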

Prometheus can collect metrics related to various aspects, including Kubernetes services, orchestration status, and nodes. Here are common metrics exporters:

Node exporter – Collects host-related metrics such as CPU and memory.
Kube-state-metrics – Collects orchestration and cluster-level metrics such as deployments, resource reservation, and pod metrics.
Kubernetes control plane metrics – Collects information about the kubelet, DNS, etcd, and scheduler.

Alert rules in Prometheus are written in PromQL. When a rule fires, Prometheus sends the alert to Alertmanager, which is configured with the receivers and gateways that deliver notifications and which handles the grouping and inhibition of alerts.
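Before turning a PromQL expression into an alert rule, it can help to evaluate it ad hoc. The sketch below builds an instant-query URL for the Prometheus HTTP API (`/api/v1/query`) and picks out the series whose value exceeds a threshold. It is not how Prometheus or Alertmanager evaluates rules internally; the host name, pod names, and sample response are hypothetical.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def build_query_url(base_url, promql):
    """Build an instant-query URL for the Prometheus HTTP API."""
    params = urlencode({"query": promql})
    return base_url.rstrip("/") + "/api/v1/query?" + params


def run_query(base_url, promql):
    """Fetch the raw JSON response for an instant query (needs a reachable server)."""
    with urlopen(build_query_url(base_url, promql), timeout=10) as resp:
        return resp.read().decode("utf-8")


def series_above(response_body, threshold):
    """Return the label sets of instant-vector samples above the threshold."""
    payload = json.loads(response_body)
    if payload.get("status") != "success":
        raise ValueError("query failed")
    matches = []
    for sample in payload["data"]["result"]:
        _timestamp, value = sample["value"]  # the value arrives as a string
        if float(value) > threshold:
            matches.append(sample["metric"])
    return matches


# Hypothetical response in the shape the query API returns for a vector result.
SAMPLE_RESPONSE = json.dumps({
    "status": "success",
    "data": {"resultType": "vector", "result": [
        {"metric": {"pod": "api-7d9f"}, "value": [1700000000, "0.93"]},
        {"metric": {"pod": "api-2b1c"}, "value": [1700000000, "0.41"]},
    ]},
})

print(series_above(SAMPLE_RESPONSE, 0.9))  # [{'pod': 'api-7d9f'}]
```

Against a live server, `run_query("http://prometheus.example:9090", "up == 0")` would return a body in the same shape (the host is a placeholder).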

3. Grafana

This open-source platform for visualization of metrics and analytics provides four built-in dashboards for Kubernetes—Cluster, Node, Pod/Container and Deployment. Kubernetes administrators can create data-rich dashboards in Grafana using the information sourced from Prometheus.

4. EFK Stack

The EFK Stack integrates three tools (Elasticsearch, Fluentd, and Kibana) to collect, store, and visualize log data. Elasticsearch is a search engine that ingests and stores data in a central repository, while Fluentd collects data from the logs of Kubernetes pods and routes it to Elasticsearch. Kibana is the visualization front end for Elasticsearch and functions as the UI for the EFK Stack, enabling the visualization of logs and metrics in the form of custom dashboards.

5. Loki

Grafana Loki is a log aggregation system that supports monitoring in Kubernetes. It uses the same label model as Prometheus, so you can quickly correlate Kubernetes metrics and logs between the two tools. Correlating them helps you locate an issue’s root cause faster and reduces the number of separate technologies you need to configure and manage.

Learn more in our detailed guide to Kubernetes monitoring tools

4 Kubernetes Monitoring Best Practices

Here are several best practices that can help you effectively monitor and troubleshoot Kubernetes environments.

Monitor Kubernetes Metrics Using a Single Pane of Glass

Granular resource metrics (memory, CPU, load, etc.) are important for identifying issues with Kubernetes microservices, but they can be noisy and difficult to interpret. The KPIs that most readily reveal service issues are API metrics, such as request rate, error rate, and latency. These metrics quickly point to degradations in a component within a microservices application.
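As a rough sketch of those API metrics, the function below derives request rate, error ratio, and 95th-percentile latency from a window of request records. The record format (status code plus latency in seconds), the choice to count only 5xx responses as errors, and the nearest-rank percentile are assumptions made for the example.

```python
import math


def api_metrics(requests, window_seconds):
    """Compute request rate, error ratio, and p95 latency for one time window.

    `requests` is a list of (status_code, latency_seconds) tuples.
    """
    if not requests:
        return {"rate": 0.0, "error_ratio": 0.0, "p95_latency": 0.0}
    errors = sum(1 for status, _ in requests if status >= 500)
    latencies = sorted(latency for _, latency in requests)
    # Nearest-rank percentile: smallest value with at least 95% of samples at or below it.
    rank = math.ceil(len(latencies) * 95 / 100)
    return {
        "rate": len(requests) / window_seconds,
        "error_ratio": errors / len(requests),
        "p95_latency": latencies[rank - 1],
    }


# Hypothetical 10-second window: 18 fast successes, one slower success, one failure.
window = [(200, 0.1)] * 18 + [(200, 0.4), (500, 0.9)]
print(api_metrics(window, window_seconds=10))
# {'rate': 2.0, 'error_ratio': 0.05, 'p95_latency': 0.4}
```

In a real cluster these values usually come from the monitoring system itself rather than from application code, but the arithmetic is the same.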

A single pane of glass lets you view all of these metrics in one unified interface instead of switching between multiple tools. This makes the cluster easier to monitor and manage, because all of the relevant data is in one place.

It also makes trends and patterns easier to spot. With all of your metrics side by side, you can see how they relate to each other and how they change over time, which helps you identify potential issues and act on them.

Ensure Monitoring Systems are Scalable and Have Sufficient Data Retention

A scalable monitoring system lets you keep monitoring your Kubernetes cluster effectively as it grows and changes over time.

As the cluster grows, so does the volume of data it generates, and your monitoring systems need to keep up. If they cannot scale, they may be overwhelmed by the volume of data and stop providing accurate or useful information.

In addition, having sufficient data retention is important because it allows you to retain and access historical data from your cluster. This can be useful for troubleshooting problems that occur, as well as for understanding trends and patterns in your cluster’s performance over time.
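To get a feel for how retention and scale interact, the sketch below applies the rough capacity formula from the Prometheus storage documentation: disk space is approximately retention time in seconds, times ingested samples per second, times bytes per sample. The series count, scrape interval, and the figure of 2 bytes per sample are assumptions for the example (Prometheus cites roughly 1 to 2 bytes per sample on average), so treat the result as a planning estimate only.

```python
def estimate_disk_bytes(active_series, scrape_interval_s, retention_days,
                        bytes_per_sample=2.0):
    """Estimate disk space needed to retain metrics for the given period."""
    # Each active series yields one sample per scrape interval.
    samples_per_second = active_series / scrape_interval_s
    retention_seconds = retention_days * 24 * 60 * 60
    return retention_seconds * samples_per_second * bytes_per_sample


# Hypothetical cluster: 150,000 active series scraped every 15 s, kept for 15 days.
needed = estimate_disk_bytes(active_series=150_000, scrape_interval_s=15,
                             retention_days=15)
print(f"{needed / 1e9:.2f} GB")  # 25.92 GB
```

Doubling either the number of series or the retention period doubles the estimate, which is why both need to be planned together.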

Data retention also matters for compliance. Some regulatory frameworks require organizations to keep certain records for a minimum period, while others, such as the General Data Protection Regulation (GDPR), limit how long personal data may be kept. Your retention settings should reflect the requirements that apply to you.

Ensure You Generate the Right Alerts and Deliver Them to the Most Appropriate Staff Members

By generating the right alerts, you can identify potential problems with your Kubernetes cluster as soon as they occur, and take action to address them before they become more serious. For example, if you set up alerts for critical metrics, such as CPU or memory usage, you can be notified when those metrics reach certain thresholds, allowing you to take action before your cluster becomes overloaded.

Delivering alerts to the most appropriate staff members is just as important, because it ensures that the people who can fix an issue hear about it first. For example, if there is a problem with a specific deployment, notify the team responsible for that deployment rather than everyone in your organization.
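Label-based routing is one common way to get alerts to the right team. The sketch below is loosely modeled on that idea and is not the configuration format of Alertmanager or any other tool; the label names, team names, and receiver names are all hypothetical.

```python
# Ordered routing table: the first route whose labels all match wins.
ROUTES = [
    ({"team": "payments"}, "payments-oncall"),
    ({"severity": "critical"}, "platform-pager"),
]
DEFAULT_RECEIVER = "platform-chat"


def pick_receiver(alert_labels, routes=ROUTES, default=DEFAULT_RECEIVER):
    """Return the receiver for an alert based on its labels."""
    for matchers, receiver in routes:
        if all(alert_labels.get(key) == value for key, value in matchers.items()):
            return receiver
    # Nothing matched, so fall back to a catch-all channel.
    return default


print(pick_receiver({"alertname": "HighLatency", "team": "payments"}))   # payments-oncall
print(pick_receiver({"alertname": "NodeDown", "severity": "critical"}))  # platform-pager
print(pick_receiver({"alertname": "DiskFilling", "severity": "warning"}))  # platform-chat
```

Because routes are checked in order, more specific matches should come before broader ones.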

Integrate Monitoring Systems with Your CI/CD Pipeline

Integrating Kubernetes monitoring systems with your CI/CD pipeline lets you monitor applications and infrastructure while they are being deployed, rather than only afterwards. Metrics are collected and analyzed automatically as each release rolls out, which helps you identify potential issues early and act before they become more serious.

In addition, integrating your monitoring systems with your CI/CD pipeline can also help you to automate deployment processes. For example, you can track specific metrics that indicate the success or failure of a deployment, and use these metrics to decide whether to automatically roll back a new release.
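A rollback decision of this kind can be sketched as a simple comparison between the error ratio of the previous release and that of the new one. The thresholds below (an absolute ceiling of 5% and a doubling relative to the baseline) and the function name are assumptions chosen for the example, not recommended values.

```python
def should_roll_back(baseline_error_ratio, release_error_ratio,
                     max_absolute=0.05, max_relative_increase=2.0):
    """Decide whether a new release should be rolled back.

    Roll back when the new release exceeds an absolute error ceiling, or when
    its error ratio is more than `max_relative_increase` times the baseline.
    """
    if release_error_ratio > max_absolute:
        return True
    if (baseline_error_ratio > 0
            and release_error_ratio > baseline_error_ratio * max_relative_increase):
        return True
    return False


print(should_roll_back(0.01, 0.08))   # True: above the absolute ceiling
print(should_roll_back(0.01, 0.03))   # True: more than double the baseline
print(should_roll_back(0.01, 0.015))  # False: within both thresholds
```

A pipeline step could call a function like this after the release has received enough traffic for the error ratios to be meaningful.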

Kubernetes Monitoring and Observability with Calico

Because Kubernetes workloads are highly dynamic and ephemeral, and are deployed on a distributed and agile infrastructure, Kubernetes poses a unique set of monitoring and observability challenges. As such, Kubernetes-native monitoring and observability is required to monitor and troubleshoot communication issues between microservices in the Kubernetes cluster.

More specifically, context about microservices, pods, and namespaces is needed so that multiple teams can collaborate effectively to identify and resolve issues. Calico Cloud and Calico Enterprise help rapidly pinpoint and resolve performance, connectivity, and security policy issues between microservices running on Kubernetes clusters across the entire stack.

Calico Cloud and Calico Enterprise are currently the only Kubernetes monitoring tools that offer the following unique features for Kubernetes observability:

Dynamic Service Graph – A point-to-point, topographical representation of traffic flow and policy that shows how workloads within the cluster are communicating, and across which namespaces. Also includes advanced capabilities to filter resources, save views, and troubleshoot service issues.
DNS Dashboard – Helps accelerate DNS-related troubleshooting and problem resolution in Kubernetes environments by providing an interactive UI with exclusive DNS metrics.
L7 Dashboard – Provides a high-level view of HTTP communication across the cluster, with summaries of top URLs, request duration, response codes, and volumetric data for each service.
Dynamic Packet Capture – Captures packets from a specific pod or collection of pods with specified packet sizes and duration, in order to troubleshoot performance hotspots and connectivity issues faster.
Application-level Observability – Provides a centralized, all-encompassing view of service-to-service traffic in the Kubernetes cluster to detect anomalous behavior like attempts to access applications or restricted URLs, and scans for particular URLs.
Unified Controls – A single, unified management plane provides a centralized point-of-control for unified security and observability on multiple clouds, clusters, and distros. Users can monitor and observe across environments with a single pane of glass.

Learn more about Calico for Kubernetes monitoring and observability
