Kubernetes is a popular, widely used container orchestration platform designed to deploy and manage containerized applications at scale. Its horizontal scaling is strong: a cluster can support up to 5,000 nodes, so in practice the main limit on adding nodes is your budget. Its vertical scaling, however, is restricted by its default configuration, which caps each node at 110 pods. To make fuller use of hardware resources and reduce the need for costly horizontal scaling, users can raise the kubelet maximum pod setting, allowing more pods to run concurrently on a single node.
To avoid network performance issues and achieve efficient horizontal scaling in a Kubernetes cluster that is tasked with running a large number of pods, high-speed links and switches are essential. A reliable and flexible Software Defined Networking (SDN) solution, such as Calico, is also important for managing network traffic efficiently. Calico has been tested and proven by numerous companies for horizontal scaling, but in this post, we will discuss recent improvements made to help vertical scaling of containerized applications just work.
For example, the following chart illustrates the efficiency achieved with the vertical scaling improvements in Calico 3.25.1 and above. As you can see, after the initial preparation at the 00:24 mark, where calico-node surged to 0.500, resource utilization came down and stayed below 0.250. Next, we will look at why this behavior matters.
Note: Utilization in the chart is stacked.
In this blog post, we will explore the benefits of vertical scaling, show how you can implement such a change in your cluster, and prepare you for the impact it can have on your network. We will also explore how Calico 3.25.1 and above helps you run highly efficient, vertically scalable clusters and prepares you for the future of cloud computing.
Vertical scaling highlights
As the popularity of Kubernetes continues to grow, users are becoming increasingly aware of the platform’s soft limits and are looking for ways to raise them. Recently, in response to popular demand, Red Hat announced that they are working on allowing users to run 500 pods (up from 250) on their Kubernetes nodes. This change will bring more attention to such soft limits and will allow more people to take advantage of the flexibility and scalability of Kubernetes.
Kubernetes has a well-known ability to scale horizontally, allowing it to support up to 5,000 nodes. When it comes to vertical scaling, however, the default configuration sets a limit of 110 pods per node. If you want to increase the number of pods that a single node can run, you need to modify the kubelet configuration on each participating node in your cluster. Done right, this can help you optimize your hardware resources and reduce the need for horizontal scaling. Adding nodes to your cluster can significantly increase your cloud spending, which is why vertical scaling can be a cost-effective way to tame cloud bills and achieve the desired performance with your current nodes.
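To make the trade-off concrete, here is a minimal sketch of the arithmetic behind it: how many nodes a given number of pods requires at different per-node limits. The function name is hypothetical, and the calculation deliberately ignores system pods and actual CPU or memory capacity, so treat the result as a lower bound rather than a sizing recommendation.

```python
def nodes_needed(total_pods: int, max_pods_per_node: int) -> int:
    """Return the minimum number of nodes needed to place total_pods.

    Assumption: the per-node pod limit is the only constraint. Real
    clusters are also bounded by CPU, memory, and system pods.
    """
    if total_pods < 0 or max_pods_per_node <= 0:
        raise ValueError("pod counts must be non-negative and the limit positive")
    # Ceiling division without floating point.
    return -(-total_pods // max_pods_per_node)


if __name__ == "__main__":
    for limit in (110, 250, 500):
        print(f"1000 pods at {limit} pods/node -> {nodes_needed(1000, limit)} nodes")
```

With the default limit of 110, 1000 pods need at least 10 nodes; at 500 pods per node the same workload fits on 2, which is the layout used in the test later in this post.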
What about Calico?
Calico was initially designed to support 100 pods per node, with a tolerance allowing it to reach up to 250 pods per node, which provided ample headroom for additional pods when necessary. This additional capacity was particularly useful for testing scenarios that required a higher number of pods than the default Kubernetes configuration allowed.
As a leading provider of container networking and security solutions, Calico is committed to ensuring optimal performance for its users, and we recognized that there could be a performance bottleneck beyond 250 pods per node. There is nothing to worry about, though: Calico version 3.26.0 will include improvements that address and resolve these bottlenecks, opening up new possibilities for deploying a larger number of pods on a single worker. The best part is that you don’t have to wait for the new version to arrive; these improvements have already been backported to Calico v3.25.1, and the installation process can be found on the documentation website.
Vertical scaling in action
Now that we have learned about the recent improvements, let’s test them in a real environment and measure their performance. Keep in mind that this environment is neither the best hardware that money can buy nor tuned to run perfectly; the purpose is to validate these improvements and check how vertical scaling can affect a cluster.
Note: If you wish to run your own test, or know more about the steps and motive behind this section please click here.
Cloud provider: AWS
1x Control plane
- Instance type: m5.2xlarge
- CPU: 1 with 8 vCPU
- RAM: 32 GB
- Storage: 20 GB EBS local
2x Worker nodes
- Instance type: m5.xlarge
- CPU: 1 with 4 vCPU
- RAM: 16 GB
- Storage: 20 GB EBS local
Kubernetes distribution: v1.25.8+k3s1
Operating system: Ubuntu 22.04, kernel 5.15.0-1031-aws
Maximum pod setting: 4000
500 pods on a node with Calico v3.25.0
The Kubernetes cluster will consist of 1000 pods, and each worker node will be running 500 test pods along with the Prometheus node_exporter for gathering statistics. The control plane, on the other hand, will host the Prometheus collector and Grafana WebUI, along with the SQLite database used to store Kubernetes cluster information. Additionally, the cluster will be equipped with Calico v3.25.0, which is a popular networking and network security solution for Kubernetes. Furthermore, the Calico monitoring options will be enabled to allow monitoring and observability within the network control plane.
The following image illustrates the distribution of pods after scaling to 1000 on each participating node:
At this particular moment, the demo cluster had a total of 1021 running pods. Of these, 1000 pods were used to stress the cluster, while the rest were system pods. These system pods included components such as CoreDNS and calico-node, which are essential for the functioning of the cluster. Another noteworthy figure was the 3036 workload endpoints (WEP), which is the total number of virtual interfaces that Calico assigned to resources inside the cluster.
Let’s dig deeper. Calico memory utilization on each node seems fine while running a total of 1000 pods; the worker nodes are dedicating around 94 MB of RAM to each calico-node pod. The calico-node pod packs Felix, the brain of Calico, along with other executables that are involved in the control plane part of your cluster SDN.
The following image shows the memory usage of Calico 3.25.0 while running 1000 pods on two nodes:
The memory chart looks boring. The CPU chart, on the other hand, is more interesting: it looks like Calico is periodically taking a considerable amount of time to calculate something.
The following image shows the CPU usage of Calico 3.25.0 while running 1000 pods on two nodes:
To further investigate the CPU usage, the next stop is to query the calico-node logs, as suggested in the Calico troubleshooting documentation. Upon analyzing the logs, it seems that the resync operation has been taking longer than normal to complete, as indicated by the summary lines.
The following code block is an example of summary logs published by calico node:
2023-04-07 06:04:46.425 [INFO] felix/summary.go 100: Summarising 23 dataplane reconciliation loops over 1m2.6s: avg=16ms longest=155ms (resync-routes-v4,resync-routes-v4,resync-routes-v4,resync-routes-v4,resync-routes-v4,resync-rules-v4,resync-wg)
2023-04-07 06:05:46.779 [INFO] felix/summary.go 100: Summarising 24 dataplane reconciliation loops over 1m0.4s: avg=50ms longest=310ms ()
2023-04-07 06:06:46.823 [INFO] felix/summary.go 100: Summarising 231 dataplane reconciliation loops over 1m0s: avg=79ms longest=1.2s (resync-filter-v4)
2023-04-07 06:12:51.617 [INFO] felix/summary.go 100: Summarising 239 dataplane reconciliation loops over 1m0.6s: avg=169ms longest=3.548s (resync-routes-v4,resync-routes-v4,resync-routes-v4,resync-routes-v4,resync-routes-v4,resync-rules-v4,resync-wg)
The resync operation is an essential part of the procedures that Felix executes periodically to gather information about the available routes that each interface can take. This information, along with an ordered list of policies, WireGuard configurations, and other relevant data, goes into the calculation graph that forms the control plane of your SDN. By periodically resyncing this information, Felix ensures that the control plane has an up-to-date view of the current state of the network, which is crucial for proper routing and forwarding of network traffic.
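If you want to scan a larger log for slow loops instead of reading it by eye, the summary lines are regular enough to parse. The following is a rough sketch that assumes the line format matches the examples shown in this post; the function names and the one-second threshold are illustrative choices, not part of Calico.

```python
import re

# Seconds per unit for Go-style duration strings such as "1m2.6s" or "155ms".
_UNIT_SECONDS = {"h": 3600.0, "m": 60.0, "s": 1.0,
                 "ms": 1e-3, "us": 1e-6, "µs": 1e-6, "ns": 1e-9}
# Multi-character units come first so "ms" is not read as "m".
_PART = re.compile(r"(\d+(?:\.\d+)?)(ms|us|µs|ns|h|m|s)")
_SUMMARY = re.compile(
    r"Summarising (\d+) dataplane reconciliation loops over (\S+): "
    r"avg=(\S+) longest=(\S+)"
)


def parse_duration(text: str) -> float:
    """Convert a duration such as '1m2.6s' or '310ms' to seconds."""
    parts = _PART.findall(text)
    if not parts:
        raise ValueError(f"unrecognised duration: {text!r}")
    return sum(float(value) * _UNIT_SECONDS[unit] for value, unit in parts)


def parse_summaries(log_text: str) -> list[dict]:
    """Extract loop count, window, average, and longest loop per summary."""
    return [
        {
            "loops": int(loops),
            "window_s": parse_duration(window),
            "avg_s": parse_duration(avg),
            "longest_s": parse_duration(longest),
        }
        for loops, window, avg, longest in _SUMMARY.findall(log_text)
    ]


def slow_windows(log_text: str, threshold_s: float = 1.0) -> list[dict]:
    """Return summaries whose longest loop exceeded threshold_s (assumed 1 s)."""
    return [s for s in parse_summaries(log_text) if s["longest_s"] > threshold_s]
```

Run against the four lines above, `slow_windows` would return the last two summaries, whose longest loops took 1.2s and 3.548s.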
500 pods on a node with Calico v3.25.1
After rerunning the test and scaling the test pods to 1000 with the new version of Calico, the memory chart shows the same amount of memory utilized by the calico-node pods, which is expected.
The following image shows the memory usage of Calico 3.25.1 while running 1000 pods on two nodes:
The CPU chart now displays consistent and steady performance, without the multiple peaks and valleys seen previously. After the initial surge at the 00:24 mark where Calico makes the necessary preparations and allocations, the combined CPU utilization of both calico-nodes drops to 0.25 and remains there.
The following image shows the CPU usage of Calico 3.25.1 while running 1000 pods on two nodes:
To gain a better understanding of the improvement that Calico v3.25.1 and above offers, we can simulate a scenario where a custom application queries the routes on one of the nodes and then measures the process performance.
The following picture illustrates the time required for the custom application to iterate through 3036 interfaces, query the routes, and filter them in userspace. (Similar to Calico 3.25.0)
The following picture illustrates the improvement of kernel-side filtering for the same test. (Similar to Calico 3.25.1)
Two nodes, 1500 pods and beyond
With the 1000-pod target reached, it’s time to test the limits of this cluster and see how much further we can push it before it breaks. Next stop: 1500 pods (750 pods on each node).
The following image illustrates the distribution of pods on each participating node:
Scaling to 1500 pods seems to be fine: everything is functional and the cluster is in a healthy state. However, at this point k3s-agent consumes a considerable amount of memory on each worker node to keep up with the demand of scheduling the cluster events.
The following image illustrates the total memory usage on a worker node while running more than 750 pods.
After increasing the replica number to 2000 (1000 pods on each node), I noticed that the k3s agent became so busy that it started to affect the node status, making the cluster node unhealthy and preventing it from communicating with the control plane, even though Felix CPU usage remained low on the same node.
The following image is an “htop” snapshot from one of the worker nodes while running 1000 pods:
Just to clarify, the failure that occurred had nothing to do with the scaling capabilities of k3s, Kubernetes, or Calico, and it could have been prevented if I had followed best practices when designing the lab environment and prepared it with the considerations below in mind.
It’s important to note that adjusting this configuration carelessly can introduce challenges such as network performance issues and resource contention. For example, increasing the maximum pod limit means more pods compete for CPU and memory on the same node, which can cause instability in the cluster and degrade the performance of individual pods. To prevent this, implement monitoring and workload profiling to understand your workloads’ resource utilization, and reserve enough resources for unexpected scenarios that could otherwise result in an outage.
Increased pod capacity and networking
Software Defined Networking (SDN) is the approach for networking in Kubernetes. SDN is very flexible, which makes it ideal for modern cloud-native applications. To better understand the networking, let’s divide it into two logical sections: the data plane and the control plane. These two parts work together to manage and route network traffic within the cluster.
The data plane is responsible for routing and forwarding network traffic between the different pods within the cluster. When working with a cluster that has a large number of pods, it is essential to have high-speed links and switches to ensure optimal network performance. You can also take data plane optimization further by choosing a robust dataplane.
Calico is a networking solution that provides a pluggable data plane architecture with options such as eBPF and standard Linux iptables. This allows users to swap the data plane as necessary to meet their specific requirements.
If you would like to know more about the Calico eBPF data plane and its pluggable architecture, click here.
The control plane, on the other hand, is responsible for managing the overall state of the Kubernetes cluster, including the networking components. To do so, the control plane will regularly update its knowledge about the routes, interfaces, policies and other aspects of the cluster. That is why in the previous section, while testing the scalability of our demo cluster, we focused on the calico-node pods which host the calico-felix binary (the brain of Calico).
The ideal situation is a reasonable resource utilization that doesn’t have too many peaks and valleys.
Scheduling failures and IOPS
In a Kubernetes cluster, the Linux open file limit can affect the ability to vertically scale the cluster when the maximum number of pods per node is increased beyond the default limit. This is because the Kubernetes scheduler needs to open a large number of files when scheduling pods, and if the open file limit is too low, the scheduler will fail to schedule new pods on nodes that have reached the file limit.
To address this issue, the Linux open file limit can be increased to a value that is sufficient for the number of pods that will be running on each node. This can be done by modifying the /etc/security/limits.conf file on each node in the cluster.
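Before changing anything, it can help to see what the limit currently is. The sketch below uses Python’s standard `resource` module to read the open file limit of the current process and compare it with a target value. The target of 65536 is a hypothetical placeholder, not a recommendation, and the limit reported here applies only to the process running the script; a service such as the kubelet may run with a different limit.

```python
import resource


def current_nofile_limits() -> tuple[int, int]:
    """Return the (soft, hard) open file limits of the current process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def is_limit_sufficient(soft_limit: int, required: int) -> bool:
    """Check whether a soft limit covers the required descriptor count."""
    # RLIM_INFINITY means the limit is unbounded, so any requirement fits.
    if soft_limit == resource.RLIM_INFINITY:
        return True
    return soft_limit >= required


if __name__ == "__main__":
    # Hypothetical target; size it from your own workload profiling.
    required = 65536
    soft, hard = current_nofile_limits()
    verdict = "sufficient" if is_limit_sufficient(soft, required) else "too low"
    print(f"soft={soft} hard={hard} required={required}: {verdict}")
```

Running the script on each node gives a quick before-and-after comparison when you adjust the limit.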
High availability and external datastore
Finally, it is crucial to take into account the impact that a large number of pod deployments can have on your Kubernetes datastore. As the number of deployments increases, it can put a strain on the datastore, leading to degraded performance and potential data loss. To mitigate this issue, it is recommended to implement a highly available (HA) design for your cluster. An HA design can distribute the load across multiple nodes, reducing the burden on any single datastore and improving the overall performance and reliability of the system.
If you’re interested in learning more about Kubernetes, I’ve created hands-on workshops that can help you expand your knowledge from the comfort of your web browser.