You Cannot ‘Fly Blind’ in the Cloud Native Landscape

One of the chief complexities in running large-scale containerized applications is the need for continuous system and application monitoring. Containers are very different from traditional VMs and the three-tier applications that run on them. Monitoring needs to ensure that the SLAs promised to the business are being met, and it must also be able to forecast usage trends while identifying problem areas such as bugs, capacity challenges, slowing performance, and potential downtime.
These challenges are further exacerbated by the fact that enterprises are deploying Kubernetes across multiple providers, ranging from bare metal, VMware, and OpenStack (on-premises) to public clouds. Further, they are deploying different kinds of application workloads on their k8s clusters, ranging from web servers, message queues, and databases to frameworks such as Apache Spark and Flink.
So what are the five key capabilities that application performance monitoring (APM) tools need to bring to bear in supporting such architectures?

#1 Support In-depth Data Collection for the container infrastructure including the IaaS Layer

Real-world containerized applications are highly complex. They consist not just of a set of servers running microservices on pods; they also communicate with partner systems and run on a range of potential IaaS providers such as bare OS, OpenStack, VMware, AWS, Azure, GCP et al. Administrators need to be able to view metrics on not just the microservices themselves but also the underlying infrastructure.
The table stakes or the basic capability is to perform monitoring on the below areas –
  1. capacity issues with the underlying IaaS hosts where pods can run out of resources, including CPU/Memory/Disk/Network bottlenecks
  2. failed containers that need to be migrated, issues with the k8s master components themselves, such as the API server, etcd etc
  3. Metrics on pods, Deployments, ReplicaSets across Namespaces, their status, etc. Provide a real-time dynamic view of their usage patterns
  4. The status of all the microservices and components such as service meshes, serverless functions, etc
  5. Simple and advanced dashboards on all the above
  6. Enable the prioritization of monitoring; start with the most frequent or highest business value components & then the others in relative order of priority
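
As a minimal sketch of items 1 and 6 above, the snippet below flags hosts whose utilization crosses a threshold and orders components for monitoring rollout. Everything in it is an assumption made for illustration: the `HostUsage` fields, the threshold values, and the `business_value` and `incidents_per_month` scores are hypothetical, and a real platform would pull these numbers from its own collectors.

```python
from dataclasses import dataclass


@dataclass
class HostUsage:
    """Hypothetical utilization snapshot for one IaaS host, in percent."""
    name: str
    cpu_pct: float
    mem_pct: float
    disk_pct: float


# Illustrative thresholds only; tune these per environment.
THRESHOLDS = {"cpu_pct": 85.0, "mem_pct": 90.0, "disk_pct": 80.0}


def capacity_issues(hosts, thresholds=THRESHOLDS):
    """Return (host, resource, value) for every resource at or over its threshold."""
    issues = []
    for host in hosts:
        for resource, limit in thresholds.items():
            value = getattr(host, resource)
            if value >= limit:
                issues.append((host.name, resource, value))
    return issues


def monitoring_order(components):
    """Order component names by business value, then by incident frequency.

    Sorting on this pair is one possible reading of "highest business value
    or most frequent first"; other weightings are equally valid.
    """
    ranked = sorted(
        components,
        key=lambda c: (c["business_value"], c["incidents_per_month"]),
        reverse=True,
    )
    return [c["name"] for c in ranked]
```

Calling `capacity_issues` on each collection cycle gives a simple list to alert on, while `monitoring_order` gives teams a starting sequence for onboarding components.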

#2 Serve as the centralized single source of Monitoring truth

The platform assumes the presence of complex applications with multiple interacting components. A state-of-the-art monitoring platform serves as a single source of truth for infrastructure teams, Cloud SREs, developers and the business.
This assumes four things –
  1. The monitoring platform should support all regions, and the tenants across those regions, where tenants are teams within the organization that have one or more namespaces across the platform. Log and metric management for all tenants is centralized into a platform that collects all data (logs and metrics) across the spectrum – e.g. the daily functioning of the business application, the underlying containers/pods, nodes, the network traffic, etc.
  2. The platform performs constant 24x7 monitoring across the stack – applications, the container platform, and the underlying compute, network and storage. It provides an ability to monitor IaaS resources by ‘tags’ and the container infrastructure using Kubernetes metadata.
  3. The platform provides self-service monitoring capability across the business and enables the delivery of reports, dashboards and analytics that matter to the relevant stakeholder. The “single pane of glass” paradigm is even more important here.
  4. Long-term data storage that enables not just ephemeral data collection but also advanced analytics for quick fault diagnostics
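
To make the tenant and namespace relationship concrete, here is a small sketch of rolling per-namespace samples up to tenants. The tenant names, the namespace names, and the sample shape (`namespace`, `cpu_cores`) are all hypothetical; the only idea taken from the text is that a tenant owns one or more namespaces and that data from all of them lands in one central place.

```python
from collections import defaultdict

# Hypothetical tenant-to-namespace mapping, for illustration only.
TENANT_NAMESPACES = {
    "payments": ["pay-prod", "pay-staging"],
    "search": ["search-prod"],
}


def namespace_to_tenant(tenant_namespaces):
    """Invert the mapping so each namespace resolves to its owning tenant."""
    lookup = {}
    for tenant, namespaces in tenant_namespaces.items():
        for namespace in namespaces:
            lookup[namespace] = tenant
    return lookup


def cpu_by_tenant(samples, tenant_namespaces=TENANT_NAMESPACES):
    """Sum CPU usage per tenant from samples tagged with a namespace.

    Samples from namespaces nobody owns are grouped under "unassigned"
    so that they remain visible instead of being dropped silently.
    """
    lookup = namespace_to_tenant(tenant_namespaces)
    totals = defaultdict(float)
    for sample in samples:
        tenant = lookup.get(sample["namespace"], "unassigned")
        totals[tenant] += sample["cpu_cores"]
    return dict(totals)
```

Keeping an explicit "unassigned" bucket is a design choice: workloads that escape tenant ownership are often exactly the ones a central platform needs to surface.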

#3 Integrate with the Application development lifecycle – DevOps process & CI/CD pipelines

It is a best practice to build monitoring into the application from the get-go. What this implies is that as user stories get written, monitoring should be a part of each story along with other operational concerns such as scalability and security.
Key things to account for here –
  1. API based integration with CI/CD services
  2. Flexible and easy to use UX that can be embedded with 3rd party applications
  3. Ability to perform root cause identification and troubleshooting for test environments right from the development pipeline itself
  4. Autoscaling of monitoring services based on application scale up/scale down and alerting
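
One way to make monitoring part of every story is to have the pipeline refuse a release that ships without it. The sketch below is only an illustration of that idea: the `monitoring` section, the required keys and the service names are hypothetical and do not correspond to any particular CI/CD product or manifest format.

```python
# Hypothetical fields each service is expected to declare before release.
REQUIRED_KEYS = ("dashboard", "alerts", "owner")


def missing_monitoring(services, required=REQUIRED_KEYS):
    """Return {service: [missing keys]} for services lacking monitoring details."""
    gaps = {}
    for name, spec in services.items():
        monitoring = spec.get("monitoring", {})
        # A key counts as missing if it is absent or empty.
        missing = [key for key in required if not monitoring.get(key)]
        if missing:
            gaps[name] = missing
    return gaps
```

A pipeline step could call `missing_monitoring` on the parsed service definitions and fail the build whenever the result is non-empty.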

#4 Easy to Extend & must support fast time to root cause identification

A well-designed monitoring platform will be heavy on visuals in that it enables the user to view, digest and analyze the data in front of her with as few clicks as possible.
  1. Often users are drowning in alerts and need to spend as little time as possible before they understand the importance of what they are looking at as a way of deciding if an alert or trend merits further investigation.
  2. The visualization tool should enable slicing and dicing of large datasets that span time series data and system events, while offering multiple perspectives: user, metadata, business, and statistical. From a visualization standpoint, support needs to be provided for all the usual suspects – min, max, standard deviation and percentiles at a minimum.
  3. Alerts need to be provided for the desired range of conditions – actual vs. desired. Prominent examples include CPU, memory, network, number of pod replicas and internal Kubernetes components, along with an ability to set a custom alert.
  4. A key point about notifications is to ensure that as much business context as possible is added to system & application notifications. At the time of writing, the vast majority of monitoring notification messages are meant for systems to read and interpret, not humans. Helpful hints also need to be provided on how to potentially act on the notification.
  5. While this may introduce AI-Ops into the discussion, the system also needs to continually learn about the environment with the goal of increasing monitoring automation over time. Why is this important? My wager is that around 60% of monitoring notifications, alerts and workflows can be solved with automated actions, thus freeing up valuable and high-cost human resources for other value-added tasks such as helping onboard different lines of business applications into the central platform.
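
The “usual suspects” from item 2 and the replica alert from item 3 can be sketched with nothing beyond the Python standard library. The function names, the alert message format and the choice of the "inclusive" percentile method are assumptions made for this example, not the behaviour of any specific monitoring product.

```python
import statistics


def summarize(samples):
    """Return min, max, mean, standard deviation and common percentiles.

    Needs at least two samples. The "inclusive" method treats the samples as
    the whole population, so percentiles stay within the observed range.
    """
    cuts = statistics.quantiles(samples, n=100, method="inclusive")
    return {
        "min": min(samples),
        "max": max(samples),
        "mean": statistics.fmean(samples),
        "stdev": statistics.stdev(samples),
        "p50": cuts[49],  # cuts[i] holds the (i + 1)th percentile
        "p95": cuts[94],
        "p99": cuts[98],
    }


def replica_alert(name, desired, actual):
    """Return an alert message when fewer replicas are ready than desired."""
    if actual < desired:
        return f"{name}: {actual}/{desired} replicas ready"
    return None
```

A dashboard would typically compute `summarize` over a sliding time window, and `replica_alert` shows how an actual value is compared against a desired one before a notification is raised.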

#5 This one is culture – understand what needs to be monitored and avoid monitoring anti-patterns as much as you can

We discussed creating a culture of monitoring in the organization. Unfortunately, due to cost, inertia or other considerations, many enterprises do not invest in creating a monitoring platform until they are already in production.
This can be a costly mistake as often a single production outage can cause loss of both revenue and reputation.
Other anti-patterns include:
  1. Applying a “one size fits all” or a “cookie cutter” approach to monitoring across the organization by using the same set of metrics & templates for every application. Every application is different and needs to be monitored in a fundamentally different manner
  2. Not building a business case for monitoring that takes the interdisciplinary nature of monitoring into account. By that we mean understanding the core concerns of business, Infrastructure teams/SREs and developers.
  3. Not providing self service & automation across a unified monitoring platform
  4. Not being proactive in using monitoring to detect failures or unfavorable trends in applications before they happen
  5. Not infusing monitoring early on in the development cycle

Conclusion

A business predicated on cloud-native applications is only as good as the monitoring around it. This blog laid the foundation for what makes a good modern monitoring platform. The market offers a myriad of closed source and open source choices. In the follow-up blog, we will discuss the most prominent open source monitoring platform for Kubernetes, Prometheus, along with its architecture and key features.

This article originated from http://www.vamsitalkstech.com/?p=7619

Vamsi Chemitiganti is a Tigera guest blogger. Vamsi Chemitiganti is Chief Strategist at Platform9 Systems. Vamsi works with Platform9’s Client CXOs and Architects to help them on key business transformation initiatives. He holds a BS in Computer Science and Engineering as well as an MBA from the University of Maryland, College Park.
