Kubernetes monitoring helps you identify issues and proactively manage Kubernetes clusters. Effective monitoring makes it easier to manage your containerized infrastructure by tracking uptime, utilization of cluster resources (such as memory, CPU, and storage), and interactions between cluster components.
In Kubernetes, cluster operators monitor the cluster and receive alerts if the required number of pods is not running, if resource utilization is nearing a critical limit, or if a failure or configuration error prevents a pod or node from joining the cluster. Beyond this built-in monitoring functionality, many organizations use specialized cloud-native monitoring tools to gain full visibility over cluster activity.
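The pod-count and resource-utilization checks described above can be sketched in a few lines of Python. This is a minimal illustration, not how Kubernetes itself implements these checks. The function name `deployment_alerts`, the CPU figures passed in, and the 90% critical ratio are assumptions made for the example. The `spec.replicas` and `status.readyReplicas` fields follow the shape of a Deployment object as returned by the Kubernetes API.

```python
from typing import Dict, List


def deployment_alerts(deployment: Dict, cpu_used: float, cpu_limit: float,
                      critical_ratio: float = 0.9) -> List[str]:
    """Return human-readable alerts for one Deployment-shaped dictionary.

    critical_ratio is a hypothetical threshold chosen for this sketch.
    """
    alerts = []
    # spec.replicas defaults to 1 when it is not set.
    desired = deployment.get("spec", {}).get("replicas", 1)
    # status.readyReplicas is omitted when no pods are ready.
    ready = deployment.get("status", {}).get("readyReplicas", 0)
    if ready < desired:
        alerts.append(f"{ready}/{desired} pods ready")
    if cpu_limit > 0 and cpu_used / cpu_limit >= critical_ratio:
        alerts.append(f"CPU at {cpu_used / cpu_limit:.0%} of limit")
    return alerts


# Hypothetical Deployment with one pod missing and CPU close to its limit.
example = {"spec": {"replicas": 3}, "status": {"readyReplicas": 2}}
print(deployment_alerts(example, cpu_used=0.95, cpu_limit=1.0))
```

In practice the dictionary would come from the Kubernetes API and the CPU figures from a metrics source. Both are hard-coded here to keep the sketch self-contained.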
There are two main levels of monitoring in Kubernetes:
The following table summarizes important metrics for cluster and pod monitoring.
Kubernetes is a complex environment, and containerized applications can be distributed across multiple environments. Monitoring solutions must be able to aggregate metrics from across the distributed environment, and deal with the ephemeral nature of containerized resources. The following are popular monitoring tools designed for a containerized environment.
Prometheus is a popular open-source monitoring tool that was developed at SoundCloud before being donated to the Cloud Native Computing Foundation (CNCF). It collects detailed time-series metrics and provides alerting and analysis for Kubernetes and Docker. It is designed for monitoring container-based microservices and applications running at scale. Prometheus is often used in combination with Grafana to enable data visualization.
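As a rough sketch of how metrics can be read back from Prometheus, the following Python builds a URL for the instant-query endpoint of the Prometheus HTTP API (`/api/v1/query`) and parses an instant-vector response. The server address, the PromQL expression, the `pod` label, and the sample response body are assumptions for illustration. No request is actually sent.

```python
import json
from typing import Dict
from urllib.parse import urlencode


def instant_query_url(base_url: str, promql: str) -> str:
    """Build the URL for an instant query against the Prometheus HTTP API."""
    query_string = urlencode({"query": promql})
    return f"{base_url.rstrip('/')}/api/v1/query?{query_string}"


def parse_vector(body: str, label: str = "pod") -> Dict[str, float]:
    """Map the chosen label of each sample to its value."""
    payload = json.loads(body)
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error', 'unknown error')}")
    data = payload["data"]
    if data["resultType"] != "vector":
        raise ValueError(f"expected an instant vector, got {data['resultType']}")
    # Each sample looks like {"metric": {labels}, "value": [timestamp, "value"]}.
    return {s["metric"].get(label, ""): float(s["value"][1]) for s in data["result"]}


# Hypothetical response body with the shape of an instant-vector result.
SAMPLE = json.dumps({
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"pod": "web-1"}, "value": [1700000000, "0.25"]},
            {"metric": {"pod": "web-2"}, "value": [1700000000, "0.5"]},
        ],
    },
})

# Assumed server address and PromQL expression (per-pod CPU usage rate).
url = instant_query_url(
    "http://prometheus:9090",
    "sum by (pod) (rate(container_cpu_usage_seconds_total[5m]))",
)
print(url)
print(parse_vector(SAMPLE))
```

A dashboard tool such as Grafana issues comparable queries on your behalf. The sketch only shows the shape of the data that flows between the two.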
This open-source platform for visualization of metrics and analytics provides four built-in dashboards for Kubernetes—Cluster, Node, Pod/Container and Deployment. Kubernetes administrators can create data-rich dashboards in Grafana using the information sourced from Prometheus.
This open-source tracing system, developed by Uber, is used to monitor and troubleshoot distributed transactions. Jaeger addresses software issues related to distributed context propagation and latency optimization.
Kubernetes Dashboard is a web-based user interface for Kubernetes. You can use it to:
Kiali provides a management UI for service mesh architectures based on Istio. It provides dashboards for visualization, and allows you to operate the mesh with powerful capabilities for configuration and validation. The structure of the service mesh is revealed via inferred traffic topology. Kiali offers detailed metrics and visualization of the health of your mesh, enables access to Grafana and integrates with Jaeger for distributed tracing.
Kubewatch is an open-source Kubernetes watcher written in Go and developed by Bitnami Labs. It complements the monitoring solution by providing an easy-to-use interface between the Kubernetes cluster and collaboration tools.
You can monitor changes to specified Kubernetes resources and report them directly to Slack, or to other collaboration platforms like HipChat, Mattermost, and Flock. Kubewatch can also post to generic webhooks, which enables custom integrations with IT service management (ITSM) tools like ServiceNow.
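To make the idea of reporting a resource change to a chat tool concrete, here is a small Python sketch that prepares a Slack incoming-webhook request. It is not Kubewatch's implementation. The function name `slack_notification`, the message wording, and the webhook URL are hypothetical. The only external convention assumed is that Slack incoming webhooks accept a JSON body with a `text` field. The request is built but not sent.

```python
import json
import urllib.request


def slack_notification(webhook_url: str, kind: str, namespace: str,
                       name: str, action: str) -> urllib.request.Request:
    """Prepare (but do not send) a webhook request describing a resource change."""
    text = f"{kind} {namespace}/{name} was {action}"
    body = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        webhook_url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Placeholder URL; a real webhook URL is issued by the chat platform.
request = slack_notification(
    "https://hooks.example.invalid/services/PLACEHOLDER",
    kind="Pod", namespace="default", name="web-1", action="deleted",
)
print(request.get_method(), request.data.decode("utf-8"))
```

Sending the request would be a call to `urllib.request.urlopen(request)`. It is left out so the example runs without network access.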
The EFK Stack integrates three tools (Elasticsearch, Fluentd, and Kibana) to collect, store, and visualize log data. Elasticsearch is a search engine that ingests and stores data in a central repository, while Fluentd collects data from the logs of Kubernetes pods and routes it to Elasticsearch. Kibana is the visualization front end for Elasticsearch and functions as the UI for the EFK Stack, enabling the visualization of logs and metrics in custom dashboards.
Here are several best practices that can help you effectively monitor and troubleshoot Kubernetes environments.
Granular resource metrics (memory, CPU, load, etc.) are important for identifying issues with Kubernetes microservices, but they can be noisy and difficult to interpret. The most useful KPIs for identifying microservice issues are API metrics, such as request rate, error rate, and latency. These metrics quickly reveal degradation in a component within the microservice.
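As an illustration of the three API metrics named above, the following Python sketch derives request rate, error rate, and 95th-percentile latency from a window of request records. The `Request` record, the choice to count only 5xx responses as errors, and the nearest-rank percentile method are assumptions for the example. A real setup would typically compute these in the monitoring system rather than in application code.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Request:
    status: int        # HTTP status code of the response
    latency_ms: float  # time taken to serve the request


def api_metrics(requests: List[Request], window_seconds: float) -> Dict[str, float]:
    """Summarize one observation window of requests."""
    total = len(requests)
    if total == 0:
        return {"request_rate": 0.0, "error_rate": 0.0, "p95_latency_ms": 0.0}
    # Assumption: only server-side (5xx) responses count as errors.
    errors = sum(1 for r in requests if r.status >= 500)
    latencies = sorted(r.latency_ms for r in requests)
    # Nearest-rank percentile: rank = ceil(0.95 * total), in integer arithmetic.
    rank = (95 * total + 99) // 100
    return {
        "request_rate": total / window_seconds,  # requests per second
        "error_rate": errors / total,            # fraction of failed requests
        "p95_latency_ms": latencies[rank - 1],
    }


# Hypothetical window: ten requests in five seconds, one of which failed.
window = [Request(200, 10.0 * i) for i in range(1, 10)] + [Request(500, 100.0)]
print(api_metrics(window, window_seconds=5.0))
```

Tracking these three values per service gives a comparable signal across services, regardless of how each one uses memory or CPU.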
You can discover service-level metrics by automatically detecting anomalies in REST API requests, for instance at an ingress layer such as the Istio ingress gateway or the NGINX ingress controller. Because these metrics measure every Kubernetes service in the same way, they provide consistent visibility across clusters.
High disk utilization is one of the most common problems on any system. There is no magic solution, and volumes that are statically attached to StatefulSet resources cannot be recovered automatically. A typical approach is to set an alert at 75% to 80% utilization. High disk utilization alerts are always important and usually indicate a problem with your application. Monitor all disk volumes, including the root file system. Detecting changes in usage patterns early can prevent issues later on.
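The 75% to 80% guideline can be expressed as a simple threshold check. The Python sketch below is one possible reading of it, assuming a warning level at 75% and a critical level at 80%. The function names and the two-level split are assumptions for the example. `shutil.disk_usage` is part of the Python standard library and reports usage for the file system that contains the given path.

```python
import shutil


def classify_utilization(used_bytes: int, total_bytes: int,
                         warning: float = 0.75, critical: float = 0.80) -> str:
    """Classify disk utilization against assumed warning/critical thresholds."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    ratio = used_bytes / total_bytes
    if ratio >= critical:
        return "critical"
    if ratio >= warning:
        return "warning"
    return "ok"


def check_path(path: str) -> str:
    """Classify the file system that contains the given path."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free
    return classify_utilization(usage.used, usage.total)


# Check the root file system, which should be monitored like any other volume.
print(check_path("/"))
```

Running the same check against every mounted volume, not only the data volumes, follows the advice to include the root file system.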
End-user experience management is not built into the Kubernetes platform. However, an application’s primary objective is to provide a positive experience to the end-user, and this should be built into your monitoring strategy for Kubernetes.
To understand how your application is performing, you need to collect data via both synthetic and real-user monitoring. This shows how end users interact with Kubernetes workloads, how the application responds, and how user-friendly it is. It also tells you whether you need to adjust anything to improve usability and the frontend experience.
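Synthetic monitoring usually means probing an endpoint on a schedule and recording whether it answered correctly and quickly enough. The Python sketch below shows one such probe. The names `synthetic_probe` and `http_status`, the one-second latency budget, and the example URL are assumptions. The `fetch` parameter is injectable so the probe can be exercised without network access.

```python
import time
import urllib.error
import urllib.request
from typing import Callable, NamedTuple


class ProbeResult(NamedTuple):
    ok: bool
    status: int       # HTTP status code, or 0 if the endpoint was unreachable
    latency_s: float  # wall-clock time spent on the probe


def http_status(url: str) -> int:
    """Fetch the URL and return its HTTP status code."""
    with urllib.request.urlopen(url, timeout=5.0) as response:
        return response.status


def synthetic_probe(url: str, fetch: Callable[[str], int] = http_status,
                    max_latency_s: float = 1.0) -> ProbeResult:
    """Run one probe and judge it against an assumed latency budget."""
    start = time.perf_counter()
    try:
        status = fetch(url)
    except urllib.error.HTTPError as error:
        status = error.code  # 4xx/5xx responses are raised by urlopen
    except Exception:
        status = 0           # DNS failure, timeout, connection refused, ...
    latency = time.perf_counter() - start
    ok = 200 <= status < 400 and latency <= max_latency_s
    return ProbeResult(ok, status, latency)


# Exercise the probe with a stubbed fetch instead of a real endpoint.
print(synthetic_probe("http://example.invalid/health", fetch=lambda url: 200))
```

Real-user monitoring complements this by measuring actual sessions. The probe only tells you what a scripted client would experience.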
If Kubernetes is running in the cloud, certain factors need to be considered when planning your monitoring strategy. In the cloud, you also need to monitor the following:
Because Kubernetes workloads are highly dynamic and ephemeral, and are deployed on a distributed and agile infrastructure, Kubernetes poses a unique set of monitoring and observability challenges. As such, Kubernetes-native monitoring and observability are required to monitor and troubleshoot communication issues between microservices in the Kubernetes cluster.
More specifically, context about microservices, pods, and namespaces is needed so that multiple teams can collaborate effectively to identify and resolve issues. Calico Cloud and Calico Enterprise help rapidly pinpoint and resolve performance, connectivity, and security policy issues between microservices running on Kubernetes clusters across the entire stack.
Calico Cloud and Calico Enterprise are currently the only Kubernetes monitoring tools that offer the following unique features for Kubernetes observability: