In this installment of the Calico Community Spotlight series, I interviewed Ana Shmyglya and Josef Janda, who both work for Jamf. Last year, Josef wrote “Migrating CNI plugin from kube-router to Calico on Kops managed Kubernetes cluster”, and I wanted to dive deeper into his and Ana’s experience based on that blog post. We mainly talked about their respective teams, their responsibilities, and the challenges they have faced whilst using Kubernetes.
Q: What are your current roles and primary responsibilities?
Ana: I work in the Platform team. This basically means I am responsible for a team that maintains the core infrastructure, which includes the Kubernetes clusters that we run. We also own the underlying CNI of the clusters.
Josef: I work as a DevOps engineer on the team that maintains the internal development tools and other systems connected to the software delivery life cycle process.
Q: What orchestrator(s) have you been using?
Josef: We use Kubernetes. That’s basically the only orchestrator in our company.
Ana: Same for us as well, it’s Kubernetes across the company.
Q: What cloud infrastructure(s) has been part of your projects?
Ana: We use a couple of different providers, including AWS, but we only run Kubernetes on AWS. The migration to Kubernetes really only covered AWS.
Josef: Yeah, it’s the same for us. What Ana said about the different providers applies to our team as well.
Q: Could you please tell me about your Kubernetes journey?
Ana: Where we are at right now is, we run our production clusters on Kubernetes. It was that way before I joined the team, but I believe Josef’s team was the one responsible for starting the Kubernetes journey in the company. My team owns 20 Kubernetes clusters in production.
Josef: When I joined the company we didn’t use Kubernetes at all. It was around five years ago that our team, which was called the Delta team, was assigned the task of implementing Kubernetes and helping other teams to migrate all the existing services there.
We started the process about four years ago and it took us two years to completely migrate. Now I can say that we have most of our infrastructure in Kubernetes. There are still some ‘leftovers’, but maybe those parts of our infrastructure will not be migrated due to their nature.
Ana: Josef and his team were the ones pushing the parts of the existing infrastructure onto Kubernetes, then I personally started using Kubernetes after I joined here.
Q: What were some of the challenges that led you to search for a CNI?
Josef: We maintain quite a big cluster, which we call the feature instance: a cluster where each developer can spin up their own instance of our product. It has over 100 nodes. We started having some network-related issues, and then we found that our former CNI solution, Kube-router, wasn’t able to scale and was consuming a lot of CPU resources.
Ana: We also had some periodic pod termination issues in other clusters and they were quite mysterious. We weren’t able to pinpoint an exact cause until we started looking more into Calico Open Source.
Q: Was there a triggering factor for solving this challenge?
Ana: The problems were intermittent, so they didn’t necessarily occur every day, but each occurrence took time to investigate because we couldn’t pinpoint or solve it straight away. It tied up a lot of time, and we would need to re-run the acceptance tests repeatedly. It was just time wasted.
Josef: For our team, it was very similar. We were alerted by some users and they were complaining about the infrastructure functionality so we needed to take a look at what was happening. In the beginning, we didn’t understand it much because when you first look at it, you don’t know what the reasons for the network-related issues are. We also had a lot of people investigating the same thing and it was consuming all of their time. Basically, it wasn’t time well invested.
Q: Could you provide some details about this technical problem, whether it’s security, compliance, or troubleshooting?
Ana: I can’t comment on the security and compliance aspect, but I can comment on the troubleshooting. The pod termination problem I mentioned earlier was happening intermittently and required quite a lot of troubleshooting each time it happened, because it interfered with our acceptance tests at the infrastructure level. It took a lot of engineers quite a lot of time to figure out that this was the problem. We found that it was an issue in the code of Kube-router itself. There is a function that keeps retrying when you try to terminate the pod, and each time the pod’s iptables are busy (which happens quite often, especially on a larger cluster), it would just retry and take forever to stop. After we switched to Calico, because it works in a slightly different way code-wise, this problem went away.
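The retry behaviour Ana describes can be illustrated with a minimal sketch. This is not Kube-router's code: `BusyLock` and `cleanup_with_retry` are hypothetical names invented for illustration, and the lock is a simple stand-in for any contended shared resource. The sketch contrasts a cleanup loop that retries for as long as the resource is busy with one that gives up after a fixed number of attempts.

```python
from typing import Optional


class BusyLock:
    """Hypothetical stand-in for a contended shared resource."""

    def __init__(self, busy_for: int) -> None:
        self.busy_for = busy_for  # number of acquire attempts that will fail
        self.attempts = 0

    def try_acquire(self) -> bool:
        # Succeeds only once the resource has stopped being busy.
        self.attempts += 1
        return self.attempts > self.busy_for


def cleanup_with_retry(lock: BusyLock, max_attempts: Optional[int] = None) -> int:
    """Retry until the lock is free and return the number of attempts used.

    With max_attempts=None the loop keeps going for as long as the lock
    stays busy, so a heavily contended lock delays termination for a long
    time. With a limit, it raises TimeoutError instead.
    """
    attempts = 0
    while True:
        attempts += 1
        if lock.try_acquire():
            return attempts
        if max_attempts is not None and attempts >= max_attempts:
            raise TimeoutError(f"lock still busy after {attempts} attempts")


if __name__ == "__main__":
    # Unbounded: the caller waits through every busy attempt.
    print(cleanup_with_retry(BusyLock(busy_for=500)))
    # Bounded: the caller gets an error it can act on.
    try:
        cleanup_with_retry(BusyLock(busy_for=500), max_attempts=5)
    except TimeoutError as exc:
        print(exc)
```

The more often the resource is busy, which the interview suggests is common on larger clusters, the longer the unbounded variant takes to return.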
Josef: Our cluster was also having some random general networking issues. In our biggest cluster, the most visible issue was the CPU consumption of the Kube-router. Since this issue was in our biggest cluster, it was quite a lot. We can show you some charts later on.
Q: How did you come across Project Calico? What made Calico Open Source stand out from other CNIs?
Josef: Once we identified that we were having networking issues, we did some research. I went through some blog posts and documentation, and I came across Calico. It looked promising to us because it promised good scaling capabilities, and also a lot of features we could leverage. So we tried it and it worked.
Q: Are there any significant metrics around this project that you can share?
Ana: Looking at the charts below, we saw a very significant improvement in CPU usage compared to when we used Kube-router. It depends on how many nodes are in the cluster and how much it’s scaled, but as a rough estimate, the combination of Calico and kube-proxy is currently maybe 10 times more efficient. So across a lot of nodes and all of our clusters, you can imagine we save a lot of resources just by switching to Calico.
Josef: Our team was the first to be surprised by what Ana mentioned. The CPU consumed by Kube-router was even more significant in the bigger clusters, so the results stood out more for our team.
[Chart: CPU usage, feature instance cluster]
Q: Are there other significant, adjacent pieces of the technology stack that played a role? How was the integration between Calico and those pieces?
Josef: Yeah, one thing that comes to mind is that Kube-router handled the CNI-related tasks on its own. With Calico, we chose to use a combination of kube-proxy and Calico, which was the default option for kOps. We knew that Calico could handle this part by itself, but we decided to stick with the default option, and we didn’t have any issues with it. The integration from this point of view was flawless.
Ana: Yeah, it was pretty much the same for our team. I guess Josef’s team sort of dealt with the testing side and then my team, we dealt with pushing it out everywhere else. So a lot of what applies to Josef’s team also applies to mine.
Josef: We did a kind of POC on our development cluster, and then Ana and her team rolled it out to the other clusters, including production.
Q: How has the support been from the Project Calico community?
Josef: The only thing our team referred to was the Calico documentation, which is quite good. I came across a lot of blog posts that were also very helpful. Firstly, we needed to increase our knowledge about CNIs in general—what a CNI is, and what its role is in Kubernetes. Then we started looking at different solutions.
I would consider myself a basic user of CNI and Calico because we didn’t do any advanced settings and we just went through the quickstart guide and tutorials. They are really well done. Also, the support of Calico in kOps is really good. It works pretty much out of the box. This was really another nice surprise for us—that we used this option and it just worked. I think this is one of the best features of the tool: it works without any deep knowledge of it.
Ana: I’m kind of similar to Josef. I started by looking into what a CNI is and how it works. The implementation with Calico was pretty painless, and the documentation is thorough and understandable. I don’t think we needed to dive into the code itself to understand what was going on. It is well documented, clear, and self-explanatory, which I appreciate.
Q: Any improvements we should be making?
Josef: I can’t currently think of any but something might pop up over time. So far we have had a great experience. I wish all the best to Project Calico and its future.
Ana: Yeah, I agree! From my side as well, I don’t think we’ve had a situation where we had to really debug Calico yet. It has pretty much just worked, as Josef said.
Get started with Calico Open Source by reading through Calico docs. For collaboration ideas, you can reach us at devadvocacy [at] tigera.io.