IBM’s Journey to Tens of Thousands of Production Kubernetes Clusters

IBM Cloud has made a massive shift to Kubernetes. From an initial plan for a hosted Kubernetes public cloud offering, it has snowballed to tens of thousands of production Kubernetes clusters running across more than 60 data centers around the globe, hosting 90% of the PaaS and SaaS services offered by IBM Cloud.

I spoke with Dan Berg, IBM Distinguished Engineer, to find out more about their journey, what triggered such a significant shift, and what they learned along the way.

So take me back to the beginning – how did this thing get started?

It must have been 3 years ago.  At that time, IBM had been building customer-facing services in the cloud for several years implemented as traditional VM or Cloud Foundry workloads. One of the services we offered was a container as a service (CaaS) platform that was built under the covers using Docker and OpenStack.  With this platform, we were the first in the industry to provide a true CaaS experience, but in some ways it was ahead of its time, and people weren’t ready to embrace a pure container as a service platform.

We found that to attract enterprise-level customers we needed to provide a better model for isolation and dedicated resources, all the way down to the host, and potentially down to the bare metal. Those customers were not comfortable with shared infrastructure, and the auditors were not either.  So we knew we needed a different solution for containers, but that was the limit of the scope we were thinking about. We weren’t thinking that we would be building the future platform that would run the broader portfolio of IBM cloud services.

Why did you choose Kubernetes?

We were looking around and at the time it seemed like a race between Kubernetes, Docker Swarm, and Mesos. We took a leap and chose Kubernetes. At that time Kubernetes was not a clear winner. The only cloud provider that provided Kubernetes at that time was Google, with GKE, and it was early days for them.  We chose Kubernetes anyway for a few reasons. There was the strength of the community – it was already a strong community and we saw the trajectory and felt it was going to continue to go up given its diversity when compared to the other technologies. We also appreciated the high quality of the codebase – it was well defined, it was well organized, and it was tested.  We knew it wasn’t complete and it still had bugs, but there was a process and rigor behind it. 

At the time we were making a decision on using Kubernetes for what is now the IBM Cloud Kubernetes Service (IKS). IKS was going to be the next rev of our container platform that we would make available to our customers.  We weren’t thinking we were making a statement or a decision for all of IBM Cloud. But, with that said, it did start the ball rolling. That is the moment in time when the shift started to happen.  

Having chosen Kubernetes, what was the next big turning point?

We had automated much of our previous container platform using Ansible. When we first started building IKS we used that same approach to build and manage the new platform.  We looked at what we were doing and said, hang on a moment, we are basically building lifecycle management capabilities for managing Kubernetes clusters, and Kubernetes is already doing lifecycle management.  At that point the light went on. Let’s utilize the same patterns, the same techniques, and the same API server to suit our needs. We scrapped a lot of what we had started with Ansible and turned to Kubernetes as the tool to manage our control plane for IKS, which in turn managed tenant Kubernetes clusters following a kube on kube model.
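The lifecycle-management pattern referred to here is usually described as a reconcile loop: compare the desired state with the observed state and act on the difference. The following is a minimal sketch of that idea only, not IBM's implementation; the `ClusterSpec` and `ClusterStatus` types and their fields are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ClusterSpec:
    """Desired state for one tenant cluster (hypothetical fields)."""
    name: str
    version: str
    workers: int


@dataclass
class ClusterStatus:
    """Observed state for one tenant cluster (hypothetical fields)."""
    version: str = ""
    workers: int = 0


def reconcile(spec: ClusterSpec, status: ClusterStatus) -> List[str]:
    """Compare desired and observed state and make one corrective pass.

    Returns the actions taken so the caller can log them.
    """
    actions = []
    if status.version != spec.version:
        actions.append(f"upgrade {spec.name} to {spec.version}")
        status.version = spec.version
    if status.workers != spec.workers:
        delta = spec.workers - status.workers
        verb = "add" if delta > 0 else "remove"
        actions.append(f"{verb} {abs(delta)} worker(s) in {spec.name}")
        status.workers = spec.workers
    return actions


spec = ClusterSpec(name="tenant-a", version="1.15", workers=3)
status = ClusterStatus()
first_pass = reconcile(spec, status)
second_pass = reconcile(spec, status)  # already converged, so nothing to do
print(first_pass)
print(second_pass)
```

Running the same pass twice shows the property that makes the pattern attractive for fleet management: once the observed state matches the desired state, further passes are no-ops.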

From there we never looked back! We went from initial design ideas through implementation to GA in six months. The choice of Kubernetes combined with an incredibly valuable upfront IBM Design Thinking project helped us focus our development activities on a true MVP that would be valuable for our customers.  The project moved so rapidly because of the technology selections, the organizational leadership, and IBM Design Thinking.

We were able to make substantial advancements because of the fact it was running on Kubernetes. Kubernetes allowed us to focus on the application we were trying to build (i.e. IKS) and not have to deal with all the underpinnings of the infrastructure and lifecycle management. On top of that, we developed a custom continuous delivery system that we later open sourced, which allowed us to roll out changes to as many clusters as we wanted in under 60 seconds.  To give you an idea of how successful this approach was, we currently push hundreds of deployments a day and sometimes around 1,000 changes a week into those clusters with a level of reliability and stability that would have been hard for us to achieve without Kubernetes.

So IKS was up and running in production, but that was really only the beginning of IBM’s shift to Kubernetes?

Yes, that’s right. We had some service teams who were struggling running on VMs, and they made a decision after watching us build IKS that they were going to go to containers as well. They used IKS as their platform to provide Kubernetes clusters for them.  It started off with the Watson services. And that was another turning point for IBM Cloud because the Watson service team were able to make rapid advancements by moving to Kubernetes and containerizing and running on IKS. They started by moving over two services, and within a month after that, they were bringing in five services. They were just ramping up very quickly, and before we knew what was happening, other IBM teams learned about this and shortly afterwards our identity management service converted over.  Then our billing service. Then the console team realized they were struggling to deploy into all the regions that they needed to, so they began converting over from Cloud Foundry.  

At that point, things just totally snowballed. We were giving individual teams that owned these services, converting over from VMs or Cloud Foundry into Kubernetes, the same operational agility we had experienced ourselves building the IKS control plane on Kubernetes.  And we were giving them the ability to roll their services out consistently into every region, with a standardized set of operational tools and processes required to meet compliance requirements. 

To this day we are still seeing double digit growth month to month from internal IBM teams moving over to Kubernetes.  I think we’re up to around 90% of PaaS and SaaS services in IBM Cloud now running on IKS. And even beyond that, the IaaS team is now starting to run their regional control plane on IKS. I think we may be the only cloud that provides a Kubernetes service where we actually also run almost all of our services on the same service that we provide to our customers.

So that’s a big statement because it introduces the scale challenges that we ran into as we started growing, and how we fixed those.

What were the key scaling challenges?

When we first started hitting scale issues the scale claimed in the community was 2,000 nodes, but those tests were quite rudimentary, focussed mainly on the number of pods and nodes. They didn’t push other resources such as the number of Kubernetes services. We learned this the hard way. We were pushing tens of thousands of Kubernetes services in some clusters, coupled with a lot of pod churn, and no one had really tested this at these scales before. 

At the time we were running on a pretty old kernel and a lot of work had already been done on optimizing the kernel’s implementation of iptables in newer kernels.  But the kernel we had was very slow at processing the number of iptables rules that kube-proxy was generating for such a large number of services. At the same time, we would start hitting bugs in Docker engine and containerd interactions. The kube scheduler was telling the kubelets “go do work”, the kubelets would tell the Docker engine “here’s some more work, go and do it”, and the Docker engine would say “done, I did it”, and it really didn’t do anything because containerd was dead and it had lost track of it.  So if we pushed scale just a little bit beyond what the system could handle then we would get cascading scheduling failures across the cluster which in turn put more load on the cluster leading to more and more scheduling failures.  

It took us a couple of months to figure out everything that was going on. Switching from Docker engine to using systemd to manage containerd eliminated problems we saw with the Docker engine and its interactions with containerd.  And switching to a newer kernel dramatically improved iptables performance addressing the kube-proxy scale issues that we had been seeing. We realized at that time how important having a modern tuned kernel was for Kubernetes. We continuously make improvements to the IKS architecture to make the platform more efficient and more resilient. For example we introduced more advanced tuning for the kernel, we optimized our system reservation algorithms, and we introduced a custom Kubernetes master vertical autoscaler. All of these changes and many others help us to provide a better experience for our users.

Did you consider switching kube-proxy to IPVS-mode for better performance?

At the time switching kube-proxy from iptables mode to IPVS mode was not really an option.  The IPVS work was not mature enough yet. We did some preliminary testing but, at the time, switching to IPVS introduced more problems than it solved. But now IPVS would definitely be an option if we were to need it. Load balancing is a mainline well-optimized capability for IPVS, whereas you could argue that the way kube-proxy uses iptables for load balancing isn’t something iptables is optimized for once you start pushing very large numbers of services.  It’s very different than how Calico uses iptables for policy enforcement which is very efficient.

But the reality is that we are running 25,000+ services on some clusters running kube-proxy in iptables mode and don’t see any performance or stability issues – so we really don’t see a compelling reason to switch. It would be optimizing something that doesn’t have a noticeable impact on the overall end-to-end performance of our system.
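To make the iptables versus IPVS contrast concrete, here is a toy model, not a benchmark. It assumes the simplified view that a sequential rule chain is checked rule by rule until one matches, while a hash table needs a single probe. It ignores connection tracking, kernel optimizations, and everything else that determines real-world cost; the service addresses and backend names are made up.

```python
def chain_lookup(rules, dest):
    """Sequential rule chain: test each rule in order until one matches."""
    for checked, (match, backend) in enumerate(rules, start=1):
        if match == dest:
            return backend, checked
    return None, len(rules)


def hash_lookup(table, dest):
    """Hash table: one probe regardless of how many services exist."""
    return table.get(dest), 1


num_services = 25_000
# Hypothetical service addresses, one rule per service.
services = [
    (f"10.0.{i // 256}.{i % 256}:80", f"backend-{i}")
    for i in range(num_services)
]
table = dict(services)

# Worst case for the chain: the service whose rule comes last.
last = services[-1][0]
chain_backend, chain_cost = chain_lookup(services, last)
hash_backend, hash_cost = hash_lookup(table, last)
print(chain_cost, hash_cost)
```

In this model the worst-case chain lookup grows linearly with the number of services while the hash lookup stays constant, which is the structural difference being described. As the answer above notes, on a modern kernel that difference did not show up as a noticeable end-to-end cost at this scale.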

Tell me about your use of Calico – why did you choose it, and how well has it worked for you?

We knew that choosing the right networking solution was going to be important. IBM had extensive knowledge of OVN from OpenStack, and we really wanted to avoid the over-complexity of it as an SDN for Kubernetes. We evaluated several solutions, including all the usual suspects. As we prototyped, we found Calico fit really well into the IBM Cloud model in which we primarily wanted a lightweight container IP management system supporting network policies.  It performed great and was the only solution to support every feature of the Kubernetes network policy API. Plus Calico network policy provided an even more feature-rich superset of Kubernetes network policy which proved incredibly valuable when it came to securing IKS for our customers.

It seems now that Calico for network policy is pretty much the de facto implementation everywhere. Regardless of the actual SDN or network fabric under the covers, Calico is still used for the policy support.  We found that as we implemented Calico it had great scale and performance characteristics, and it didn’t seem to get in the way. It was very transparent and easy to understand what it was doing. It isn’t trying to be more than we needed in a Kubernetes environment. Truthfully, it’s been a great decision from day one. 

One of the things we did early on was to build in-cluster ingress ALB and NLB load balancers which took advantage of the network capabilities afforded to us by Calico and kube-proxy. Really all we needed to do was manage the portable IP addresses which were provisioned as part of the service. It was incredibly fast to create an NLB which was sufficient for most customer needs. We even introduced a second NLB type that supports IPIP and direct server return for customers that required these capabilities to maximize throughput.

The other thing worth noting is that Calico has been incredibly stable for us as we have scaled out IKS – which is impressive when you consider the number of clusters and the scale at which we have been operating for the last few years.

You mentioned leveraging Calico for securing IKS for your customers?

Yes, Calico offers one of the most robust and complete implementations of Kubernetes network policies.  But it also supports Calico network policies which provide even richer capabilities. 

You can secure the pods within the cluster using Kubernetes network policy (which is enforced under the covers by Calico). This was great for, for example, controlling communication between the pods we have for management purposes versus pods that are just the application code.  Basically you can very easily lock down your pods so the only traffic that is allowed to flow is the traffic you are expecting.
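As an illustration of what such a pod lock-down can look like, the sketch below builds a standard `networking.k8s.io/v1` NetworkPolicy manifest as a Python dictionary and prints it as JSON, which `kubectl apply -f` accepts as well as YAML. The namespace, labels, and port are hypothetical placeholders, not values taken from IKS.

```python
import json


def allow_only_from(namespace, app_label, client_label, port):
    """Build a NetworkPolicy admitting traffic to `app_label` pods
    only from `client_label` pods on one TCP port."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {
            "name": f"{app_label}-allow-{client_label}",
            "namespace": namespace,
        },
        "spec": {
            # Pods this policy applies to.
            "podSelector": {"matchLabels": {"app": app_label}},
            "policyTypes": ["Ingress"],
            # Anything not listed here is dropped once the policy selects a pod.
            "ingress": [{
                "from": [{"podSelector": {"matchLabels": {"app": client_label}}}],
                "ports": [{"protocol": "TCP", "port": port}],
            }],
        },
    }


policy = allow_only_from("demo", "api", "frontend", 8080)
manifest = json.dumps(policy, indent=2)
print(manifest)
```

Once a policy with an `Ingress` policy type selects a pod, only traffic matching one of its rules is admitted, which is what makes "only the traffic you are expecting" enforceable.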

But one of the things that was important for our customers is to ensure that the whole cluster is secure by default, not just the pods.  So we needed a way to lock down the worker nodes. We didn’t want to force customers to purchase or use or configure an external firewall appliance in order to get default protection for the Kubernetes worker nodes.  So we leveraged Calico network policies, setting up a number of default policies that apply to the worker nodes themselves. We use these to block all public ingress except for the node-port range that Kubernetes uses.  And then we document for customers how to lock that down too if they want to, including limiting which external systems are allowed to access each node port and even restricting outbound traffic from the cluster. 
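A default policy of the kind described might look roughly like the sketch below, which generates a Calico `GlobalNetworkPolicy` that allows the default Kubernetes NodePort range (30000-32767) and denies other ingress. This is an assumption-laden illustration rather than the policy IKS ships: the selector label is hypothetical, the policy name and order are arbitrary, and the `preDNAT` and `applyOnForward` settings reflect the general Calico guidance that node-port traffic has to be filtered before kube-proxy's DNAT. It has not been tested against a cluster.

```python
import json

NODE_PORT_RANGE = "30000:32767"  # default Kubernetes NodePort range


def public_host_ingress_policy(selector):
    """Build a Calico GlobalNetworkPolicy for host endpoints matching `selector`."""
    return {
        "apiVersion": "projectcalico.org/v3",
        "kind": "GlobalNetworkPolicy",
        "metadata": {"name": "allow-node-ports-only"},  # arbitrary name
        "spec": {
            "selector": selector,
            "order": 1000,           # arbitrary; lower orders are evaluated first
            "preDNAT": True,         # evaluate before kube-proxy rewrites the destination
            "applyOnForward": True,  # required by Calico when preDNAT is set
            "types": ["Ingress"],
            "ingress": [
                {"action": "Allow", "protocol": "TCP",
                 "destination": {"ports": [NODE_PORT_RANGE]}},
                {"action": "Allow", "protocol": "UDP",
                 "destination": {"ports": [NODE_PORT_RANGE]}},
                {"action": "Deny"},
            ],
        },
    }


# "exposure == 'public'" is a made-up label for public-facing host endpoints.
policy = public_host_ingress_policy("exposure == 'public'")
print(json.dumps(policy, indent=2))
```

The printed JSON could be applied with `calicoctl apply -f`. In a real deployment it would sit alongside policies for whatever management traffic the nodes need.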

The other nice thing is that as we went through and documented the use of Calico for providing firewalling capabilities in IKS clusters, the usage of that grew a lot because teams realized they could manage all of their firewall rules like they manage the rest of their code, like the rest of their Kubernetes resources, as YAML files in Git. This allowed teams to meet their compliance and security requirements without having to have network device experts that knew how to manage firewall rules on traditional network devices. In contrast, managing Calico policies inside a kube cluster is actually quite easy for them to do.

As one example, our internal teams started to use Calico to lock down egress for compliance reasons, blocking both pods and worker nodes from accessing the public internet, even though the worker node may be on the public internet, except for the list of whitelisted addresses they explicitly specify.  If there was an attack or a pod got compromised then it still couldn’t reach anything that wasn’t allowed. So Calico’s security capabilities have worked out incredibly well for us and our users.
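The egress allowlist logic can be modeled in a few lines. The sketch below builds Calico-style egress rules (allow the listed networks, deny everything else) and evaluates a destination against them with first-match-wins semantics. It is a simplified model for reasoning about the behavior, not Calico's enforcement code, and the CIDRs are documentation-range examples.

```python
import ipaddress


def egress_rules(allowed_cidrs):
    """Calico-style egress rules: allow the listed networks, deny the rest."""
    return [
        {"action": "Allow", "destination": {"nets": list(allowed_cidrs)}},
        {"action": "Deny"},
    ]


def evaluate(rules, dest_ip):
    """Return the action of the first rule whose destination matches dest_ip."""
    addr = ipaddress.ip_address(dest_ip)
    for rule in rules:
        nets = rule.get("destination", {}).get("nets")
        # A rule without a destination match applies to every destination.
        if nets is None or any(addr in ipaddress.ip_network(n) for n in nets):
            return rule["action"]
    return "Deny"


rules = egress_rules(["10.0.0.0/8", "203.0.113.10/32"])
print(evaluate(rules, "10.1.2.3"))      # inside an allowed network
print(evaluate(rules, "198.51.100.7"))  # not on the list
```

A compromised pod trying to reach an address that is not on the list falls through to the final deny rule, which is the containment property described above.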

The IBM Red Hat acquisition was big news for the industry – how has this impacted your team?

The big thing for my team is the introduction of OpenShift into the portfolio. In August we released the new Red Hat OpenShift on IBM Cloud service.  We manage the OpenShift Container Platform for you providing the same user experience and operational support that you get with IKS and Kubernetes but now with OpenShift as a cloud managed service. 

We were able to leverage all the same building blocks and the velocity advantages of using Kubernetes to build IKS when it came to building OpenShift on IBM Cloud.  We got to beta incredibly quickly. We opened up the beta in June and have provisioned and been managing around 1200 customer clusters during the first couple months. 

Coming back to Calico, even though OpenShift comes with its own default Red Hat SDN based on OVN, we decided to use Calico to provide consistency in capabilities across the board, including all the great security features I mentioned earlier.  Plus we were so familiar and happy with the great performance characteristics of Calico that we didn’t see value in switching to use OpenShift’s default SDN. Plus OpenShift already supported Calico, so that worked out great.

Any parting thoughts?

For me, one of the key takeaways from this experience is how big a win leveraging Kubernetes to build out our IKS control plane was, and how big a win leveraging IKS Kubernetes clusters has been for our IBM internal teams delivering cloud services as well as our external customers. There’s nothing magical about our use cases – these same wins will apply to many enterprises making the shift to cloud-native.

Kubernetes has allowed us to move incredibly fast – abstracting away lifecycle management, the underlying infrastructure, and giving us a consistent operational model across all data centers in all regions, including huge wins in security without teams needing to be networking or infrastructure experts. There’s a learning curve, as with any new technology, but the benefits are very real.

