Prometheus is an open-source technology designed to provide monitoring and alerting functionality for cloud-native environments, including Kubernetes. It can collect and store metrics as time-series data, recording information with a timestamp. It can also collect and record labels, which are optional key-value pairs.
Key features of Prometheus include:
Prometheus was initially created by SoundCloud back in 2012. Since its inception, Prometheus has become a popular monitoring tool supported by an independent community of contributors. In 2016, Prometheus joined the Cloud Native Computing Foundation (CNCF), and is now a graduated CNCF project.
To get metrics, Prometheus requires an exposed HTTP endpoint. Once an endpoint is available, Prometheus can start scraping numerical data, capture it as a time series, and store it in a local database suited to time-series data. Prometheus can also be integrated with remote storage repositories.
Users can run queries to create temporary time series from the stored data. These series are identified by metric names and labels. Queries are written in PromQL, Prometheus's own query language, which lets users select and aggregate time-series data in real time. PromQL expressions can also define alert conditions, which trigger notifications, routed through Alertmanager, to external systems like email, PagerDuty, or Slack.
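To make the query step more concrete, here is a minimal sketch that builds a request URL for the instant-query endpoint of the Prometheus HTTP API (`/api/v1/query`). The server address, the metric name `http_requests_total`, and the `job="api"` label are assumptions chosen for illustration, not values from a real deployment.

```python
from urllib.parse import urlencode


def build_query_url(base_url: str, promql: str) -> str:
    """Return the instant-query URL for a PromQL expression."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


# Hypothetical metric and label: per-second request rate over the
# last 5 minutes, summed per job.
expr = 'sum by (job) (rate(http_requests_total{job="api"}[5m]))'
url = build_query_url("http://localhost:9090", expr)
print(url)
```

Sending a GET request to this URL against a running server would return the query result as JSON; the sketch stops at building the URL so it can run without a Prometheus instance.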
Prometheus can display collected data in tabular or graph form, shown in its web-based user interface. You can also use APIs to integrate with third-party visualization solutions like Grafana.
Prometheus is a versatile monitoring tool, which you can use to monitor a variety of infrastructure and application metrics. Here are a few common use cases.
Prometheus is typically used to collect numeric metrics from services that run 24/7 and expose their metric data over an HTTP endpoint. You can produce this output manually or with one of the client libraries. Metrics are exposed in a simple text format, with one sample per line. The output is served by an HTTP server, and Prometheus scrapes it from the configured hostname, port, and path.
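As an illustration of the line-per-sample text format, the following sketch renders a few samples by hand. The metric names, label names, and values are made up for the example, and label-value escaping is deliberately left out; in practice a client library would normally generate this output for you.

```python
def render_sample(name: str, labels: dict[str, str], value: float) -> str:
    """Render one sample as a line of the text exposition format.

    Simplification: label values are not escaped here.
    """
    if labels:
        label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
        return f"{name}{{{label_str}}} {value}"
    return f"{name} {value}"


lines = [
    "# HELP http_requests_total Total HTTP requests served.",
    "# TYPE http_requests_total counter",
    render_sample("http_requests_total", {"method": "get", "code": "200"}, 1027),
    render_sample("process_open_fds", {}, 12),
]
body = "\n".join(lines) + "\n"  # each line ends with a line feed
print(body)
```

A service would return `body` from its metrics endpoint (conventionally `/metrics`) for Prometheus to scrape.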
Prometheus can also be used for distributed services, which are run on multiple hosts. Each instance publishes its own metrics and has a name that Prometheus can distinguish.
You can monitor the operating system to identify when a server’s hard disk is full or if a server operates constantly at 100% CPU. You can install a special exporter on the host to collect the operating system information and publish it to an HTTP-reachable location.
Prometheus doesn’t usually monitor website status, but you can use the blackbox exporter to enable this. You pass the target URL to the exporter, which probes the endpoint and reports information such as whether the website is up and its response time. You define the hosts to be queried in the prometheus.yml configuration file, using relabel_configs to ensure Prometheus scrapes them through the blackbox exporter.
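The relabeling step can be hard to picture, so the sketch below mimics in plain Python what the three relabel rules commonly used with the blackbox exporter do to a target's labels. It is a simulation for illustration only, not how Prometheus itself is configured; the target URL is hypothetical, and the exporter address assumes the exporter's default port 9115 on the local host.

```python
def blackbox_relabel(target: str, exporter_address: str) -> dict[str, str]:
    """Mimic the relabel steps typically applied for blackbox probing."""
    labels = {"__address__": target}
    # 1. Pass the original target to the exporter as the ?target= parameter.
    labels["__param_target"] = labels["__address__"]
    # 2. Keep the probed URL visible as the `instance` label.
    labels["instance"] = labels["__param_target"]
    # 3. Point the actual scrape at the exporter instead of the website.
    labels["__address__"] = exporter_address
    return labels


result = blackbox_relabel("https://example.com", "127.0.0.1:9115")
print(result)
```

The net effect is that Prometheus scrapes the exporter, the exporter probes the website, and the resulting series still carry the website URL as their `instance` label.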
To check whether a cron job is running at the specified intervals, you can use the Pushgateway, which holds pushed metrics and exposes them to Prometheus through an HTTP endpoint. You can push the timestamp of the last successful job (e.g. a backup job) to the Pushgateway, and compare it with the current time in Prometheus. If the elapsed time exceeds the specified threshold, the job is considered overdue and an alert is triggered.
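A minimal sketch of this pattern is shown below, using only the Python standard library. The Pushgateway address (default port 9091 on the local host), the job name `nightly_backup`, and the metric name `backup_last_success_timestamp_seconds` are assumptions for illustration. The request is built but not sent, so the example runs without a Pushgateway.

```python
import time
import urllib.request


def build_push_request(gateway_url: str, job: str, timestamp: float) -> urllib.request.Request:
    """Build a PUT request that stores the last-success timestamp for a job."""
    body = (
        "# TYPE backup_last_success_timestamp_seconds gauge\n"
        f"backup_last_success_timestamp_seconds {timestamp}\n"
    ).encode()
    url = f"{gateway_url.rstrip('/')}/metrics/job/{job}"
    return urllib.request.Request(url, data=body, method="PUT")


def is_overdue(last_success: float, now: float, max_age_seconds: float) -> bool:
    """Same comparison an alert would make: has too much time passed?"""
    return now - last_success > max_age_seconds


req = build_push_request("http://localhost:9091", "nightly_backup", time.time())
# urllib.request.urlopen(req)  # uncomment to actually push the metric
print(req.get_method(), req.full_url)
```

On the Prometheus side, the equivalent check could be written as a PromQL expression such as `time() - backup_last_success_timestamp_seconds > 86400`, where the 24-hour threshold is again only an example.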
Prometheus is a common choice for Kubernetes monitoring, because it was built for a cloud-native environment. Here are several key benefits of using Prometheus to monitor Kubernetes workloads:
The client libraries of Prometheus offer four core types of metrics. However, the Prometheus server does not currently save these metrics as different data types. Instead, it flattens all information into an untyped time series.
A counter is a cumulative metric. It represents a single monotonically increasing value, which can only go up or be reset to zero on restart.
Counters suit several use cases. You can, for example, use a counter to represent the number of served requests, errors, or completed tasks. You should never use a counter to expose a value that can decrease, like the number of running processes.
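Because a counter can drop back to zero on restart, anything that reads it has to account for resets. The sketch below shows the basic idea under simplifying assumptions: any decrease between consecutive samples is treated as a reset. Prometheus's own `increase()` and `rate()` functions handle resets in a similar spirit but also extrapolate over the time window, which this example does not attempt.

```python
def increase(samples: list[float]) -> float:
    """Total increase across samples, treating any drop as a reset to zero."""
    total = 0.0
    for prev, curr in zip(samples, samples[1:]):
        if curr >= prev:
            total += curr - prev
        else:
            # Counter restarted: everything counted since the reset is new.
            total += curr
    return total


# The counter restarts between the third and fourth sample.
print(increase([10, 15, 20, 3, 8]))  # 18.0
```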
A gauge represents a single numerical value that can arbitrarily go up and down. Gauges are often used to measure values like current memory usage or temperature.
A histogram samples observations, such as request durations or response sizes, and counts them in configurable buckets. A histogram also provides a total count and a total sum of all the observed values.
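The bucket mechanics can be sketched in a few lines of Python. This is a simplified model rather than a client library implementation: the bucket boundaries and observed values are arbitrary examples, and the class name is hypothetical. It does reflect one real property of Prometheus histograms, namely that buckets are cumulative, so each bucket counts every observation less than or equal to its upper bound.

```python
import bisect


class Histogram:
    """Simplified histogram with cumulative buckets, a sum, and a count."""

    def __init__(self, upper_bounds):
        self.upper_bounds = sorted(upper_bounds)
        # One extra slot at the end acts as the +Inf bucket.
        self.bucket_counts = [0] * (len(self.upper_bounds) + 1)
        self.sum = 0.0
        self.count = 0

    def observe(self, value):
        # Index of the first bucket whose upper bound is >= value.
        idx = bisect.bisect_left(self.upper_bounds, value)
        self.bucket_counts[idx] += 1
        self.sum += value
        self.count += 1

    def cumulative(self):
        """Return (upper_bound, cumulative_count) pairs, ending with +Inf."""
        out, running = [], 0
        bounds = self.upper_bounds + [float("inf")]
        for bound, n in zip(bounds, self.bucket_counts):
            running += n
            out.append((bound, running))
        return out


h = Histogram([0.1, 0.5, 1.0])
for duration in (0.05, 0.3, 0.5, 2.0):
    h.observe(duration)
print(h.cumulative())
```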
Like a histogram, a summary samples observations, such as request durations and response sizes. It also provides a total count of the observations and a total sum of all observed values, and it calculates configurable quantiles over a sliding time window.
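The sliding-window behavior can be illustrated with a small model. This sketch keeps every observation in the window and uses a plain nearest-rank quantile; real client libraries generally use streaming approximations instead, so treat this as an explanation of the concept rather than a description of any library. The class name, window length, and values are assumptions, and the current time is passed in explicitly to keep the example deterministic.

```python
import math
from collections import deque


class Summary:
    """Simplified summary: total sum and count, windowed quantiles."""

    def __init__(self, max_age_seconds):
        self.max_age = max_age_seconds
        self.window = deque()  # (timestamp, value) pairs, oldest first
        self.sum = 0.0
        self.count = 0

    def observe(self, value, now):
        self.window.append((now, value))
        self.sum += value  # sum and count cover all observations, not just the window
        self.count += 1

    def quantile(self, q, now):
        # Drop observations that have aged out of the window.
        while self.window and now - self.window[0][0] > self.max_age:
            self.window.popleft()
        values = sorted(v for _, v in self.window)
        if not values:
            return math.nan
        rank = max(0, math.ceil(q * len(values)) - 1)  # nearest-rank method
        return values[rank]


s = Summary(max_age_seconds=60)
for t, v in enumerate([1, 2, 3, 4, 5]):
    s.observe(v, now=t)
print(s.quantile(0.5, now=10))  # 3
```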
Here are several key best practices for implementing Prometheus monitoring.
Prometheus uses exporters to retrieve metrics from systems that cannot easily be instrumented directly, such as HAProxy or Linux operating systems. An exporter is a separate process deployed alongside the target system. It collects the system's metrics and exposes them over HTTP in the Prometheus format, so that Prometheus can scrape them.
While all Prometheus exporters provide similar functionality, you should choose the most relevant exporter for your purposes. This can critically affect the success of your Kubernetes monitoring strategy. You can research the available exporters and evaluate how each handles the metrics relevant to your workloads. You should also assess the quality of the exporter, according to parameters like user reviews, recent updates, and security advisories.
Consult the documentation of your chosen exporter and learn how to label your metrics in a way that provides context. Learn how to establish consistent labeling across different monitoring targets. While you can customize and define your own labels, remember that every distinct combination of label values creates a separate time series, and each series uses resources. At scale, too many labels can increase your overall resource costs. This is why you should aim to use no more than 10 labels.
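A quick calculation shows why labels add up. For a single metric, the number of possible time series is bounded by the product of the number of distinct values each label can take. The label names and value counts below are invented for the example.

```python
from math import prod


def max_series(distinct_values_per_label: dict[str, int]) -> int:
    """Upper bound on series for one metric: product of distinct values per label."""
    return prod(distinct_values_per_label.values())


# Hypothetical metric with three labels.
print(max_series({"method": 5, "status": 10, "instance": 20}))  # 1000
```

Adding one more label with 50 possible values would raise the bound from 1,000 to 50,000 series for that one metric, which is why labels with many possible values deserve particular caution.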
A well-defined alerting strategy can help you achieve effective performance monitoring. You should first determine which events or metrics are critical to monitor, and then set a reasonable threshold that can catch issues before they can affect your end-users. Ideally, you should define a threshold that does not cause alert fatigue. You should also ensure the notifications are properly configured to reach the appropriate team in a timely manner.
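One common way to reduce alert fatigue is to require that a condition holds for some time before an alert fires, which is what the `for` clause in a Prometheus alerting rule does. The sketch below models that idea in plain Python; the function name, the 90% threshold, and the 5-minute hold period are illustrative assumptions.

```python
def should_fire(samples, threshold, hold_seconds):
    """Decide whether to fire, given (timestamp, value) samples in time order.

    Fires only if the value has stayed above the threshold continuously
    for at least hold_seconds, up to the latest sample.
    """
    breach_start = None
    for ts, value in samples:
        if value > threshold:
            if breach_start is None:
                breach_start = ts
        else:
            breach_start = None  # condition cleared, start over
    if breach_start is None:
        return False
    return samples[-1][0] - breach_start >= hold_seconds


# A brief spike does not fire; sustained high usage does.
spike = [(0, 0.50), (60, 0.95), (120, 0.50)]
sustained = [(0, 0.95), (60, 0.50), (120, 0.92), (180, 0.97),
             (240, 0.93), (300, 0.96), (360, 0.99), (420, 0.91)]
print(should_fire(spike, 0.9, 300), should_fire(sustained, 0.9, 300))
```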
Calico Cloud and Calico Enterprise help rapidly pinpoint and resolve performance, connectivity, and security policy issues between microservices running on Kubernetes clusters across the entire stack. They offer the following key features for container and Kubernetes monitoring and observability, which are not available with Prometheus: