Cloud-Native Monitoring: Challenges, Components, and Capabilities

What Is Cloud-Native Monitoring?

Monitoring involves instrumenting an application to collect and analyze metrics and logs that provide insight into its performance. This allows you to see whether the application behaves correctly. Metrics are used to measure specific aspects of a system’s health over time, while logs record specific events.

Cloud-native monitoring differs from traditional application monitoring because monitoring systems must deal with ephemeral objects that are frequently created and destroyed, and with distributed applications made up of multiple independent components.

Like traditional monitoring, cloud-native monitoring should cover a range of parameters including disk space, memory consumption, and CPU usage, as well as whether tasks are performed correctly and are protected from unauthorized access. This is important for tracking performance, cost-effectiveness, and security, which enables operators to quickly respond to any issues detected. Monitoring is also a crucial part of cloud-native security.

Cloud-Native Security Challenges

Cloud-native architectures present challenges for application and infrastructure security. Here are several core challenges:

Large Number of Entities to Secure

Infrastructure and DevOps teams use microservices to build and run cloud-native applications. Previously, several software functions or processes would run on a single virtual machine. Today, developers package each capability or process as a separate container or serverless function, and each of these entities is a potential target. An organization needs to protect these entities from compromise throughout the development lifecycle.

Diverse Architecture Patterns

Cloud-native systems can include a range of private and public clouds, application architectures, and cloud services. Each architectural pattern could display different weaknesses and security demands. Security teams need to visualize the attack surface and find solutions for securing each type of architecture.

Dynamic Environments

Private and public cloud environments are continuously evolving. Rapid release cycles might mean that security teams need to update every component of a microservices application daily. Furthermore, organizations are adopting practices such as infrastructure as code (IaC) and immutability, and so applications are continuously being destroyed and re-created. Security teams find it challenging to secure such deployments without slowing down the release cycle.

Over-Privileged Access

A related issue is over-privileged, non-administrator user accounts. Do not allow users to run programs as root, as this creates a variety of security issues.

A known example of this is Docker-based containers running as root. Just because it is possible to run containers as root does not mean it is a recommended practice. Always try to employ the principle of least privilege. Developers working on a cloud-native application, and entities within the application architecture, should only gain access to the resources they actually need.
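As a minimal sketch of how a check for root containers could be automated, the following Python snippet flags container specifications whose configured user is missing or resolves to root. The `specs` dictionary and the image names are hypothetical, and the check assumes the common Docker behavior that a container runs as root when no user is set.

```python
def runs_as_root(user):
    """Return True if a container user setting resolves to root.

    `user` mirrors the form of a Docker USER value: "name", "uid",
    "name:group", or empty/None when nothing is configured.
    """
    if not user:
        # Assumption: with no user configured, the container runs as root.
        return True
    name_or_uid = user.split(":", 1)[0].strip()
    return name_or_uid in ("root", "0")


# Hypothetical container specs: image name -> configured user.
specs = {
    "web-frontend": "1000:1000",
    "batch-worker": "",
    "legacy-api": "root",
    "metrics-agent": "nobody",
}

root_containers = sorted(name for name, user in specs.items() if runs_as_root(user))
print(root_containers)
```

A check like this could run in a build pipeline so that images without an explicit non-root user are caught before deployment.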

Misconfigurations

Misconfigurations are another typical security gap. It is very common for databases, cloud storage buckets, and other cloud resources to be left accessible over the Internet without authentication. Take all measures to prevent such misconfigurations. The solution is not to fix them individually, but to use automation to gain visibility over these issues and remediate them centrally.
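The idea of central, automated visibility can be illustrated with a short Python sketch that scans an inventory of resources and reports those reachable from the Internet without authentication. The `Resource` type, its fields, and the sample inventory are assumptions made for illustration; a real scanner would read this data from a cloud provider's API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Resource:
    name: str
    kind: str            # e.g. "bucket" or "database" (illustrative labels)
    public: bool         # reachable over the Internet
    requires_auth: bool  # authentication is enforced


def find_exposed(resources):
    """Return resources that are public and do not require authentication."""
    return [r for r in resources if r.public and not r.requires_auth]


# Hypothetical inventory gathered from all accounts in one place.
inventory = [
    Resource("static-site", "bucket", public=True, requires_auth=True),
    Resource("backups-bucket", "bucket", public=True, requires_auth=False),
    Resource("orders-db", "database", public=False, requires_auth=True),
]

exposed = find_exposed(inventory)
for resource in exposed:
    print(f"EXPOSED {resource.kind}: {resource.name}")
```

Running one scan over the whole inventory, rather than inspecting resources one by one, is what makes central remediation possible.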

4 Key Components of Cloud-Native Monitoring

Four pillars capture data and support observability, providing crucial insights into the health and behavior of cloud-native applications. The data collected from each of these pillars can be used to evaluate systems and applications as they are developed and become more complex.

Logs

Logs are records of events. Every service or application in your system should log events when they occur. You can use a log aggregation tool to centralize logs and make them easier to search and view. For example, when an application encounters an error, it logs the error so that developers can identify where the issue occurred.
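As a rough illustration, the sketch below uses Python's standard `logging` module to emit each event as one JSON object per line, a format that log aggregation tools can generally parse and search. The service name `checkout`, the field names, and the in-memory stream are assumptions chosen for the example; a real service would typically write to standard output or a log shipper.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    def format(self, record):
        return json.dumps({
            "level": record.levelname,
            "service": record.name,
            "message": record.getMessage(),
        })


# An in-memory stream stands in for stdout or a log shipper.
stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonFormatter())

logger = logging.getLogger("checkout")  # hypothetical service name
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False  # keep the example's output in our stream only

try:
    total = 100 / 0
except ZeroDivisionError as exc:
    # The application notices the error and logs it as an event.
    logger.error("payment calculation failed: %s", exc)

entries = [json.loads(line) for line in stream.getvalue().splitlines()]
print(entries)
```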

Metrics

Metrics aggregate data from a series of related, measurable events. They tend to be time-based and are measured at regular intervals. This helps characterize an issue: for example, whether errors occur at a steady rate or in a sudden spike.

A cloud-native monitoring tool will typically provide the following metrics to enable different measurement types:

  • Gauges – Metrics for values that can go either up or down arbitrarily. These include changeable elements such as memory usage, temperature of the physical device, and the number of users connected to a system.
  • Counters – Cumulative metrics that increase over time. These include the accumulated number of errors, requests, and completed tasks—anything that can only go up.
  • Meters – These measure the rate of change between events in a series. The rate is typically measured periodically, producing a mean rate spanning the application’s lifetime.
  • Histograms – These measure the statistical distribution of events, such as the duration of requests and size of responses. Because a histogram tracks the number of observations and the sum of their values, users can also derive the average of the observed values.
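The four metric types above can be sketched in a few lines of Python. This is a simplified illustration rather than the API of any particular monitoring tool: the class and method names are assumptions, and timestamps are passed in explicitly to keep the example deterministic.

```python
import bisect


class Counter:
    """Cumulative value that can only go up."""
    def __init__(self):
        self.value = 0

    def inc(self, amount=1):
        if amount < 0:
            raise ValueError("counters cannot decrease")
        self.value += amount


class Gauge:
    """Value that can go up or down arbitrarily."""
    def __init__(self):
        self.value = 0.0

    def set(self, value):
        self.value = value


class Meter:
    """Mean rate of events since a start time (in seconds)."""
    def __init__(self, start):
        self.start = start
        self.count = 0

    def mark(self, n=1):
        self.count += n

    def mean_rate(self, now):
        elapsed = now - self.start
        return self.count / elapsed if elapsed > 0 else 0.0


class Histogram:
    """Distribution of observed values across upper-inclusive buckets."""
    def __init__(self, bounds):
        self.bounds = sorted(bounds)
        self.buckets = [0] * (len(self.bounds) + 1)  # last bucket is overflow
        self.count = 0
        self.total = 0.0

    def observe(self, value):
        self.buckets[bisect.bisect_left(self.bounds, value)] += 1
        self.count += 1
        self.total += value

    def mean(self):
        return self.total / self.count if self.count else 0.0


errors = Counter()
errors.inc()
errors.inc(2)

memory_mb = Gauge()
memory_mb.set(512)
memory_mb.set(384)  # gauges may decrease

requests = Meter(start=0.0)
requests.mark(30)

latency = Histogram(bounds=[0.25, 0.5, 1.0])
for seconds in (0.125, 0.5, 0.375, 2.0):
    latency.observe(seconds)

print(errors.value, memory_mb.value, requests.mean_rate(now=60.0))
print(latency.buckets, latency.mean())
```

The histogram keeps only bucket counts, the observation count, and the running sum, which is enough to compute an average without storing every observation.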

Tracing

Tracing involves recording related events and presenting them in a meaningful order. All events in the chain being traced are linked via a unique ID that is passed from the initial request to every later event. In distributed systems, a single request can reach multiple services, so tracing helps provide a full, application-level view.

For example, in the case of an error, tracing reveals the overall flow from the initial request to the resulting error. If you can observe the trajectory of the request, you can identify which services it passed through and what may be the root cause.
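To make the flow concrete, here is a minimal Python sketch in which three functions stand in for services and pass a trace ID along with the request. The service names, the `x-trace-id` header, and the in-memory `SPANS` list are all hypothetical; production systems usually rely on a tracing library and a standard propagation format instead.

```python
import uuid

# In-memory store standing in for a tracing backend.
SPANS = []


def record(trace_id, service, event):
    """Record one event, tagged with the trace ID that links it to the request."""
    SPANS.append({"trace_id": trace_id, "service": service, "event": event})


def inventory_service(headers):
    trace_id = headers["x-trace-id"]
    record(trace_id, "inventory", "lookup failed: item not found")
    return False


def order_service(headers):
    trace_id = headers["x-trace-id"]
    record(trace_id, "orders", "order received")
    ok = inventory_service(headers)  # the same headers carry the ID onward
    if not ok:
        record(trace_id, "orders", "order rejected")
    return ok


def gateway():
    trace_id = uuid.uuid4().hex  # unique ID created at the initial request
    record(trace_id, "gateway", "request received")
    order_service({"x-trace-id": trace_id})
    return trace_id


trace_id = gateway()
path = [span["service"] for span in SPANS if span["trace_id"] == trace_id]
print(" -> ".join(path))
```

Filtering the recorded events by trace ID reconstructs the path the request took, which shows that the rejection in the orders service originated from the failed inventory lookup.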

Alerts

Alerts draw the attention of developers to a potential issue so they can address it. Alerting tools detect patterns in the data provided by logs, metrics, and tracing to identify anomalies. Activity that departs from the system’s normal state will trigger an alert.

When engineers identify an event (or set of events), they can generate alerts and modify them based on their level of priority. For example, they can set alerts to trigger based on specific thresholds, such as the number or rate of errors. The relevant team receives the alerts and can begin remediating the issue.
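Threshold-based alerting of this kind can be sketched as follows. The rule names, thresholds, and priority labels are illustrative assumptions, not recommended values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AlertRule:
    name: str
    threshold: float  # error rate as a fraction between 0 and 1
    priority: str


def evaluate(rules, errors, requests):
    """Return the rules whose error-rate threshold has been reached."""
    rate = errors / requests if requests else 0.0
    return [rule for rule in rules if rate >= rule.threshold]


# Hypothetical rules with different priorities.
rules = [
    AlertRule("elevated-errors", threshold=0.01, priority="low"),
    AlertRule("error-spike", threshold=0.05, priority="high"),
]

fired = evaluate(rules, errors=30, requests=1000)  # 3% error rate
for rule in fired:
    print(f"[{rule.priority}] {rule.name}")
```

At a 3% error rate only the low-priority rule fires; if errors rose to 80 per 1,000 requests, both rules would fire and the high-priority alert could be routed to the on-call team.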

Key Capabilities of a Cloud-Native Monitoring Platform

Here are a few important capabilities you should look for in a cloud-native monitoring solution.

Decision Support Across Hybrid Systems

In the cloud-native datacenter, you must have visibility across VMs, the host, applications, API services, and containers. A monitoring solution must provide visibility even if services are dynamic, containers short-lived, and applications distributed. The monitoring solution must have an engine that can intelligently gather information from these different layers to enable real-time decision making.

Support for Forensic Investigation at Scale

Security investigations involving cloud-native workloads can be complex because there are multiple, distributed components communicating via API. Effective security investigations require a monitoring architecture that is distributed, can scale according to workloads, and provides sufficient data retention.

Integration with Orchestration and Automation Tools

In a cloud-native environment, you could have OpenShift, Kubernetes, Google GKE, Amazon EKS, ECS, or similar services orchestrating your container workloads. You may also use Puppet, Ansible, or Chef to automate deployments.

A monitoring solution should smoothly integrate with these components. In many cases, cloud-native monitoring solutions will provide the option of deploying an agent within a container cluster or alongside serverless functions—a necessity for the cloud-native environment.

Cloud-Native Monitoring with Calico

Calico offers powerful features for cloud-native monitoring and observability. These include:

  • Dynamic Service Graph – A point-to-point, topographical representation of traffic flow and policy that shows how workloads within the cluster are communicating, and across which namespaces. Also includes advanced capabilities to filter resources, save views, and troubleshoot service issues.
  • Dynamic Packet Capture – Captures packets from a specific pod or collection of pods with specified packet sizes and duration, in order to troubleshoot performance hotspots and connectivity issues faster.
  • Application-level observability – Provides a centralized, all-encompassing view of service-to-service traffic in the Kubernetes cluster to detect anomalous behavior like attempts to access applications or restricted URLs, and scans for particular URLs.
  • DNS Dashboard – Helps accelerate DNS-related troubleshooting and problem resolution in Kubernetes environments by providing an interactive UI with exclusive DNS metrics.
  • Compliance and reporting – Automates and simplifies compliance monitoring, enforcement, and audit, by tracking all policy changes and retaining a daily history of your compliance status. This makes it easy to create audit-ready reports.
