Prometheus is an open-source tool for collecting metrics and sending alerts, originally developed at SoundCloud. Its primary components include the Prometheus server, which scrapes and stores time-series data; client libraries for instrumenting application code; a push gateway for short-lived jobs; exporters for third-party systems; and the Alertmanager, which handles alert routing and notification.
Prometheus monitoring works by identifying a target: an endpoint that supplies metrics for Prometheus to store. A target may be a physical endpoint, or an exporter that attaches to a system and generates metrics from it. Endpoints are supplied either by static configuration or discovered through a service discovery process.
When Prometheus has gathered a list of targets, it can start retrieving metrics. Metrics are retrieved via simple HTTP requests. The configuration directs Prometheus to a specific location on the target that provides a stream of text, which describes the metric and its current value.
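As an illustration, that stream of text can be parsed in a few lines of Python. This is a simplified sketch, not the official client library; the metric name and labels below are hypothetical, and it assumes sample lines carry no trailing timestamps.

```python
# Minimal sketch of parsing the Prometheus text exposition format that a
# scrape of a /metrics endpoint returns. Assumes "name{labels} value" lines
# with no trailing timestamp; real parsers handle more cases.
def parse_metrics(text):
    """Return {metric_name_with_labels: float_value} from exposition text."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):   # skip HELP/TYPE comment lines
            continue
        name, _, value = line.rpartition(" ")  # value is the last token
        samples[name] = float(value)
    return samples

# Hypothetical scrape output:
scrape = """\
# HELP http_requests_total Total HTTP requests.
# TYPE http_requests_total counter
http_requests_total{method="get"} 1027
http_requests_total{method="post"} 3
"""
print(parse_metrics(scrape))
```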
Prometheus monitors endpoints and offers four different types of metrics:
Counter: This cumulative metric is suitable for tracking the number of requests, errors, or completed tasks. It cannot decrease; it can only go up or be reset to zero. Use cases for counters include request count, tasks completed, and error count.
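Counter semantics can be sketched in a few lines of Python. This is an illustration of the behavior only, not the official Prometheus client library:

```python
# Illustrative sketch of counter semantics: a counter only increases,
# and may only be reset to zero (e.g. when a process restarts).
class Counter:
    def __init__(self):
        self.value = 0.0

    def inc(self, amount=1.0):
        if amount < 0:
            raise ValueError("counters cannot decrease")
        self.value += amount

    def reset(self):
        self.value = 0.0  # the only way a counter goes down

requests_total = Counter()
for _ in range(3):
    requests_total.inc()   # one increment per handled request
print(requests_total.value)  # 3.0
```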
Gauge: This point-in-time metric can go both up and down. It is suitable for measuring current values such as memory use and concurrent requests. Use cases for gauges include queue size, memory usage, and the number of requests in progress.
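Gauge semantics look like this in a comparable sketch (again an illustration, not the client library):

```python
# Illustrative sketch of gauge semantics: a point-in-time value that can
# move in either direction, e.g. the number of in-progress requests.
class Gauge:
    def __init__(self):
        self.value = 0.0

    def inc(self, amount=1.0):
        self.value += amount

    def dec(self, amount=1.0):
        self.value -= amount

    def set(self, value):
        self.value = value

in_progress = Gauge()
in_progress.inc()          # request started
in_progress.inc()          # another request started
in_progress.dec()          # first request finished
print(in_progress.value)   # 1.0
```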
Histogram: This metric is suitable for aggregated measures, including request durations, response sizes, and Apdex scores that measure application performance. Histograms sample observations and count them in buckets that you can customize. Use cases for histograms include request duration and response size.
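The bucket mechanics can be sketched as follows; the bucket bounds and observed durations are hypothetical, and this illustrates the semantics rather than reproducing the client library:

```python
# Illustrative sketch of histogram semantics: observations are counted
# into buckets with configurable upper bounds; Prometheus exposes the
# counts cumulatively, keyed by "le" (less-than-or-equal) boundaries.
import bisect

class Histogram:
    def __init__(self, buckets):
        self.bounds = sorted(buckets)                # upper bounds, e.g. seconds
        self.counts = [0] * (len(self.bounds) + 1)   # last slot is the +Inf bucket
        self.total = 0.0
        self.count = 0

    def observe(self, value):
        self.counts[bisect.bisect_left(self.bounds, value)] += 1
        self.total += value
        self.count += 1

    def cumulative(self):
        """Cumulative per-bucket counts, as Prometheus exposes them."""
        out, running = [], 0
        for c in self.counts:
            running += c
            out.append(running)
        return out

h = Histogram(buckets=[0.1, 0.5, 1.0])
for duration in [0.05, 0.2, 0.3, 0.7, 2.0]:
    h.observe(duration)
print(h.cumulative())  # [1, 3, 4, 5]
```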
Summary: This metric is suitable for accurate quantiles. A summary samples observations and provides a total count of observations, a sum of observed values, and configurable quantiles.
Summaries should be used for:
Multiple measurements of a single value, allowing for the calculation of averages or percentiles
Values that can be approximate
A range of values that you cannot determine upfront, so histograms are not appropriate
Use cases for summaries include request duration and response size.
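A summary's count/sum/quantile behavior can be sketched like this. Note the heavy simplification: real client libraries use streaming quantile estimation rather than storing every observation, and the latency values below are hypothetical.

```python
# Illustrative sketch of summary semantics: track a count, a sum, and
# enough observations to report quantiles over what was seen.
class Summary:
    def __init__(self):
        self.observations = []

    def observe(self, value):
        self.observations.append(value)

    @property
    def count(self):
        return len(self.observations)

    @property
    def total(self):
        return sum(self.observations)

    def quantile(self, q):
        """Naive empirical quantile of the stored observations."""
        ordered = sorted(self.observations)
        idx = min(int(q * len(ordered)), len(ordered) - 1)
        return ordered[idx]

latency = Summary()
for v in [0.1, 0.2, 0.3, 0.4, 1.5]:   # hypothetical request durations
    latency.observe(v)
print(latency.count, latency.quantile(0.5))  # 5 0.3
```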
Here are a few common use cases of Prometheus, and the metrics most appropriate to use in each case.
The metric used here is “node_cpu_seconds_total”. This is a counter metric that counts the number of seconds the CPU has been running in a particular mode. The CPU has several modes such as iowait, idle, user, and system. Because the objective is to count usage, use a query that excludes idle time:
sum by (cpu)(node_cpu_seconds_total{mode!="idle"})
The sum function combines all CPU modes. The result shows how many busy seconds the CPU has accumulated since it started. To tell whether the CPU has been busy or idle recently, use the rate function to calculate the growth rate of the counter:
sum by (cpu) (rate(node_cpu_seconds_total{mode!="idle"}[5m])) * 100
The above query produces the per-second rate of increase over the last five minutes, which shows how much computing power the CPU is using; multiplying by 100 expresses the result as a percentage.
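What rate does over the window can be sketched under simplified assumptions: take the per-second increase between the oldest and newest samples in the window. (Prometheus additionally extrapolates to the window boundaries and handles counter resets.) The timestamps and counter values below are hypothetical.

```python
# Simplified sketch of rate(counter[5m]): per-second increase of a counter
# between the first and last samples in the window. Prometheus itself also
# extrapolates and detects counter resets; this sketch does neither.
def simple_rate(samples):
    """samples: list of (unix_timestamp, counter_value) within the window."""
    (t0, v0), (t1, v1) = samples[0], samples[-1]
    return (v1 - v0) / (t1 - t0)

# Hypothetical busy-CPU-seconds samples spanning five minutes:
window = [(0, 1000.0), (150, 1090.0), (300, 1180.0)]
busy_fraction = simple_rate(window)   # busy seconds accrued per wall-clock second
print(busy_fraction * 100)            # as a percentage: 60.0
```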
The following query calculates the total percentage of used memory:
node_memory_Active_bytes/node_memory_MemTotal_bytes*100
To obtain the percentage of memory in use, divide active memory by total memory and multiply by 100.
You need to know your available disk space to understand when your infrastructure nodes need more storage. The same approach used for memory applies here, with different metric names. The following query returns the percentage of disk space still available:
node_filesystem_avail_bytes/node_filesystem_size_bytes*100
Prometheus offers several operators and functions that can help you perform calculations on metrics, to make them more useful. You saw the use of some of these functions in the examples above.
Aggregation operators reduce an instant vector to a new instant vector with the same number of label sets or fewer. This is done either by aggregating the values of multiple label sets or by keeping a chosen subset and discarding the rest. A simple aggregation may appear as avg(metric_per_second), providing the average speed across all instances in the set. Another option that lets you distinguish instances by labels is avg(metric_per_second) by (project, location). This averages speed only across instances that belong to the same project and the same location (based on labels attached to the metrics). You can select the labels you want to keep for the new vector, or alternatively discard a label you don't want.
Several aggregations are available, most notably sum, min, max, and avg. The more complex aggregators may take additional parameters. For example, the following aggregator provides the three highest speeds overall:
topk(3, metric_per_second)
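The grouping behavior of these operators can be sketched in Python; the sample data, label names, and values below are hypothetical.

```python
# Sketch of how aggregation operators reduce a labeled instant vector:
# "by" keeps only the listed labels and groups samples on them, then the
# aggregator (avg here) collapses each group; topk keeps the k largest values.
from collections import defaultdict
import heapq

# Hypothetical instant vector: (labels, value) pairs.
samples = [
    ({"project": "a", "location": "us"}, 4.0),
    ({"project": "a", "location": "us"}, 6.0),
    ({"project": "b", "location": "eu"}, 10.0),
]

def avg_by(vector, labels):
    groups = defaultdict(list)
    for lbls, value in vector:
        key = tuple((name, lbls[name]) for name in labels)  # keep chosen labels only
        groups[key].append(value)
    return {key: sum(vs) / len(vs) for key, vs in groups.items()}

def topk(k, vector):
    return heapq.nlargest(k, (value for _, value in vector))

print(avg_by(samples, ["project", "location"]))
print(topk(2, samples))  # [10.0, 6.0]
```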
An arithmetic binary operator (+, -, *, /, %, ^), where "%" is the modulo operation and "^" is exponentiation, can work with any combination of scalars and instant vectors, which can quickly become mathematically complex.
When an operator combines two instant vectors, Prometheus matches elements on their label sets; by default, entries with no matching label set on the other side are dropped from the result.
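One-to-one vector matching for a binary operator can be sketched like this; the label sets and values are hypothetical.

```python
# Sketch of one-to-one vector matching for a binary operator such as "/":
# elements of the two vectors pair up only when their label sets are
# identical, and unmatched elements drop out of the result.
left = {("instance", "a"): 50.0, ("instance", "b"): 30.0}
right = {("instance", "a"): 100.0, ("instance", "c"): 100.0}

def divide(lhs, rhs):
    # Only label sets present on both sides survive.
    return {labels: lhs[labels] / rhs[labels] for labels in lhs.keys() & rhs.keys()}

print(divide(left, right))  # {('instance', 'a'): 0.5} -- "b" and "c" are dropped
```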
The functions in Prometheus are similar to typical programming functions, except that they are restricted to a predefined set. Many Prometheus functions are approximate: results are extrapolated, so what should be an integer calculation may occasionally produce floating point values. This means you should use Prometheus functions with care in cases that require high precision.
Here are a few especially useful functions. delta operates on gauge metrics and outputs the difference between the beginning and end of a range. increase and rate operate on counter metrics and output the increase over a specified time: increase provides the total increase, while rate provides the per-second increase. All three take a range vector as input and produce an instant vector.
The histogram_quantile function can be used to make sense of histogram buckets. It takes two arguments: the quantile to be calculated and an instant vector of buckets. For example, the query
histogram_quantile(0.95, sum(metric_speed_bucket) by (le))
outputs the estimated speed at the 95th percentile. The clause by (le) sums the buckets across all other labels while preserving the le label (the bucket's upper bound), which the quantile calculation requires.
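The interpolation behind histogram_quantile can be sketched as follows. This is a simplified model, assuming cumulative buckets keyed by their "le" upper bound; Prometheus has additional rules (for example, for the +Inf bucket) that this sketch omits, and the bucket data is hypothetical.

```python
# Sketch of histogram_quantile's bucket interpolation: find the bucket
# where the target rank falls, then linearly interpolate between that
# bucket's lower and upper bounds.
def histogram_quantile(q, buckets):
    """buckets: sorted list of (le_upper_bound, cumulative_count)."""
    total = buckets[-1][1]
    rank = q * total
    lower_bound, lower_count = 0.0, 0
    for le, cumulative in buckets:
        if cumulative >= rank:
            width = le - lower_bound
            within = cumulative - lower_count
            return lower_bound + width * (rank - lower_count) / within
        lower_bound, lower_count = le, cumulative
    return buckets[-1][0]

# Hypothetical cumulative buckets: 10 obs <= 0.1, 70 <= 0.5, 100 <= 1.0.
buckets = [(0.1, 10), (0.5, 70), (1.0, 100)]
print(histogram_quantile(0.95, buckets))
```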
There are various Prometheus exporters you can use to monitor your applications and cloud services. It might be challenging to figure out which exporter you should use.
Have a look at the Exporters and Integrations page on the Prometheus website. Evaluate the suitability of each exporter for the kind of metrics you need. You could also assess the health of an exporter as a software project: how many contributors and stars it has, and whether it is frequently updated. Here are a few guidelines that can help you evaluate the suitability of an exporter for your needs.
Every exporter comes with a set of metrics that it generates. It is a best practice, but not mandatory, for Prometheus exporters to provide this information. You can usually find metrics details on the exporter’s repository main page, but you may need to search in a documentation page or help file. Some exporters use the OpenMetrics format, which can provide fields with additional information regarding the metric, such as the type, info, or units.
Check the exporter documentation to understand how you can label the metrics and make sense of them. Instrumentation labels are useful for analyzing what is happening inside an application, while target labels are useful when aggregating metrics across an entire deployment.
For example, an instrumentation label may capture context from inside the application, such as the HTTP method or endpoint being handled. Target labels, by contrast, describe the deployment, such as whether a metric comes from a dev environment or a production service, or on which host the service is running, and let you answer questions such as: "What is the current CPU usage of all backend applications in North America?"
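As a sketch, target labels are commonly attached in the Prometheus scrape configuration; the job name, hosts, and label values below are illustrative assumptions:

```yaml
# Fragment of a hypothetical prometheus.yml: every metric scraped from
# these targets is tagged with the env and region target labels.
scrape_configs:
  - job_name: "backend"
    static_configs:
      - targets: ["10.0.0.1:9100", "10.0.0.2:9100"]
        labels:
          env: "production"
          region: "na"
```

With labels like these in place, an aggregation can slice metrics by env or region across the whole deployment.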
When you are learning to use a new cloud or application monitoring tool, such as a Prometheus exporter, it can be challenging to handle alerts. If your alerts are set with a low threshold, this could quickly oversaturate your support teams with alerts. On the other hand, if your alerts don’t trigger in time, you could miss important information regarding a condition that may be affecting your end users.
When developing an alert strategy, you first need to understand your applications and the Prometheus exporters you are using. Your organization’s Service Level Indicators and Service Level Objectives, coupled with golden signals for your service or application, can help you determine which elements are critical and require an alert.
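As a sketch of how such a decision translates into a Prometheus alerting rule, consider the following rule file fragment; the threshold, duration, and label values are illustrative assumptions, not recommendations:

```yaml
# Hypothetical alerting rule: fire when a node's average non-idle CPU
# time stays above 90% for ten minutes.
groups:
  - name: cpu-alerts
    rules:
      - alert: HighCpuUsage
        expr: avg by (instance) (rate(node_cpu_seconds_total{mode!="idle"}[5m])) * 100 > 90
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage on {{ $labels.instance }}"
```

The "for" clause keeps short spikes from paging anyone, which is one practical lever for tuning alert noise against responsiveness.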
Calico Cloud and Calico Enterprise provide Kubernetes-native observability and monitoring, helping you rapidly pinpoint and resolve performance, connectivity, and security policy issues between microservices running on Kubernetes clusters, across the entire stack. They offer key observability and monitoring features that are not available with Prometheus alone.
Learn more about Calico for Kubernetes monitoring and observability