Advanced Kubernetes Network Design


At Tigera, we work with hundreds of Calico and Calico Enterprise customers every year and have learned a very important lesson in the process: Designing networks and troubleshooting a broken network are difficult problems. As a Kubernetes architect, what you get from the network team is real estate (racks/compute infrastructure) and an underlay network (nodes that can talk to each other). You have to plan, architect, get the buy-in and implement the network for the actual applications (pods) running in the cluster. You can’t design something completely new if you are constrained by ToRs (top of rack switches), core network fabric, or compliance/security requirements. A successful network design should include:

  • Native or private IP addressing for pods, driven by organizational constraints and performance
  • IP address allocation and management
  • Ingress and egress traffic routing
  • BGP routing design
  • Integration with the existing network fabric

We’ll begin with a high-level overview of pod networking scenarios and the packet path. Then we will do a deep dive into IP address management and BGP routing design, with an example of each. As part of BGP routing, we’ll walk through various network design options. Finally, we’ll conclude with a recommended template for on-prem network design.

Good morning and welcome, everyone, to the BrightTALK webinar, presented today by Tigera. I’m John Armstrong, director of product marketing at Tigera. Today our presenter is Bikram Gupta, one of the solutions architects on our customer success team here at Tigera. He’s going to be talking about advanced networking with Kubernetes, walking through a number of network design scenarios, and doing a live demo. If any questions come up during Bikram’s presentation, please enter them in the questions panel and we will address them at the end of the presentation. Before we start, I’d like to mention that we have a number of events scheduled online that provide practical information and guidance for DevOps, ITOps, network, and security operations teams who are currently using or planning to deploy a Kubernetes environment. And now it’s my pleasure to introduce you to Bikram Gupta, our presenter today. Over to you, Bikram.
Thank you, John. Hello everyone, and thank you for joining this webinar. Just a brief intro about myself: I have been with Tigera for close to one and a half years now, and before that I worked in various roles in networking and security. For the agenda, we are going to start with the component-level architecture of Calico. Given that there are various traffic paths involved, we’ll do a quick recap of the traffic path from pod to pod, from service to pod, and from outside the cluster to the service to the pod. That will also include things like NodePort and load balancer. After that refresher, the next one is going to be a key topic for discussion: the criteria and various factors that are involved in a successful network design. That’s the planning aspect. After that I’m going to go through the specific features of Calico that are of interest, and we’ll do a deep dive on those. And finally, I’m going to do just one demo, given that we have limited time.
What I’ll show you is this: we’ll start with a cluster that is not configured with anything. Just a plain kubeadm cluster, no CNI configured. And we’ll show how you can install Calico and apply the required BGP configuration to achieve a highly available layer three setup. As soon as you install Calico, everything should just work. As part of the demo we’ll quickly review how things fit together. Along the way, if you have questions, feel free to ask them in the questions box, and we’ll take them as they come. I’m expecting some level of working knowledge, and I’m guessing all of you have that as a prerequisite. Now, there are a lot of slides here, and some of the information might be overwhelming if I go really fast. So what I’ve tried to do is explain why you need a feature and when you need it. That’s important because if you have that in mind, you can dig deeper when the time comes. By no means do I claim to be an expert in Kubernetes and Calico networking. But I’ve been here for one and a half years and learned quite a bit. And the one thing I learned that I really want to emphasize is that Calico networking is Linux networking. It means all the things that we have done in Linux and in data center networks apply perfectly to Kubernetes and Calico. That’s something I’ve experienced all along. Some of the topics that we are not going to cover are eBPF, IPv6 and dual stack, and Windows networking. If you need information, feel free to ask questions and I can guide you to the appropriate sessions that would [inaudible 00:04:51]. But I’ll keep mentioning eBPF once in a while as far as the data path goes.
Looking at the Calico component-level architecture, broadly you should think of three different components. Calico runs as a DaemonSet on each of the nodes.
And these three components do distinct things. The first one is your data path. The data path is doing just one thing: packet in, packet out. It can do that with iptables, or it can do that with eBPF. It’s basically serving as the CNI capability for Kubernetes. Another thing that needs to happen as part of the data path, of course, is the integration with the Calico management path: as the pods are coming up randomly and going down, all these [inaudible 00:05:55] need to be maintained in the data path layers. That’s the data path. The second part, which is somehow not reflected here, is IP address management. It is very, very critical. If you have thousands of pods and services, and the pods are particularly [inaudible 00:06:14], IP address management becomes a big task. And given that many of you have to deal with both on-prem and cloud clusters and networking, you’re not operating in a siloed Kubernetes cluster. You also have to think of integration with the existing network. So the IP address management in Calico provides a lot of fine-tuning features that help you achieve that. That would be the second component. The third one is BGP. BGP is a proven routing protocol used in the data center and on the internet alike. We use a well-known open source tool called BIRD, and it runs as part of calico-node. It does what any BGP software is supposed to do. But in effect, it is working as a BGP router for the pod networks and for the service network. If you install Calico with the default installation, all that magic of BGP is happening between the nodes, because nothing is going outside the cluster. And as the pods are coming up, Calico is using BGP as the engine to update the data paths. So the management path has the knowledge of which pods are where, and the data path is updated accordingly. Now, another component that I’ve listed here is confd.
confd is again an open source product, and we have customized it for BGP and BIRD. What confd does is take the configuration that is in etcd and convert it into a configuration that’s understood by the application. I’ll give an example here. Anything that you configure in Calico is stored in the same Kubernetes data store, which is etcd. You don’t need a separate data store, and we don’t create one for you. So when you configure BGP, that data gets stored in etcd, and we provide you certain configuration options. What confd is doing is basically taking the data from etcd and converting it into a config file that BIRD can understand, because what BIRD understands is the bird.cfg file, which is the configuration file for all that routing information. The second thing confd does is monitor configuration changes in the etcd data store. If anything changes, it dynamically updates BIRD by sending it a SIGHUP. So you don’t actually reset the BIRD control plane; the SIGHUP just makes BIRD load the latest configuration. That’s the broad component-level architecture of Calico.
Next we review the inter-pod traffic and service inbound traffic. I am assuming that some of you are familiar with how Calico works under the hood, and I’ll try to simplify it as much as possible. As I said at the start, the one thing I learned in the last one and a half years is that Calico networking is plain Linux networking. Of course, there are a lot of other things as well. So if you log in to the node and run ip a, or ip address, you will see a bunch of interfaces, and some of those interfaces will be [inaudible 00:10:05] with cali. And if you explore ip route and look for those cali interfaces, you will see that those interfaces are pointing to the pod IP addresses. So as you can see here, in effect what Calico is doing is creating a virtual router on the node.
That virtual router is a plain layer three router, which knows that to reach a pod IP address, like pod A in the traffic diagram here, it has to go through the cali1234 interface. It’s as simple as that. So the way traffic flows is this: the traffic leaves the pod and hits the cali interface through the veth pair, the virtual ethernet pair. calico-node already has the data path set up, so the node looks up the route and forwards the traffic to the next node, because that is the next hop to reach the other pod, which is on node two. The traffic goes to node two and then gets delivered to pod C. It’s pure layer three routing in Linux. Now, just think it over. The traffic is going through the fabric, the switch that you see at the bottom. If that fabric is in the same network, like here where you see 10.0.0.11 can reach 10.1.0.12, we are just assuming it can reach. Now let’s say the fabric needs to do some kind of routing because the nodes belong to different networks. In that case, suddenly we have a new problem. The fabric should know how to route from the source IP address of pod A, which is 10.48.0.1, to the destination, which is pod C at 10.48.0.65. How would the fabric do that? It doesn’t know. So the option you have is to build an overlay network, which is again proven in data centers. There is nothing new in it. The way you do that is you create a tunnel interface for IP-in-IP encapsulation, or a VXLAN interface for VXLAN encapsulation. What Calico does in the process is, as the traffic comes in, instead of sending it directly to the fabric, it routes it to the tunnel interface, IP-in-IP or VXLAN. So what the fabric sees is the IP-in-IP header or the VXLAN header, with the source and destination IP addresses of the outer header being the node IP addresses.
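The routing decision described above can be sketched in a few lines of Python. This is only an illustration, not Calico code: the addresses 10.0.0.11, 10.1.0.12, 10.48.0.1 and 10.48.0.65 come from the talk, while the 10.48.0.64/26 block, the default gateway 10.0.0.1, and the interface names are assumptions made for the example.

```python
import ipaddress
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Route:
    prefix: ipaddress.IPv4Network
    dev: str                    # outgoing interface
    via: Optional[str] = None   # next-hop IP; None for a directly attached pod


# Hypothetical routing table on node 1 (10.0.0.11).
NODE1_ROUTES = [
    Route(ipaddress.ip_network("10.48.0.1/32"), dev="cali1234"),                # local pod A
    Route(ipaddress.ip_network("10.48.0.64/26"), dev="tunl0", via="10.1.0.12"),  # pods on node 2
    Route(ipaddress.ip_network("0.0.0.0/0"), dev="eth0", via="10.0.0.1"),        # assumed default
]


def lookup(routes, dst):
    """Longest-prefix match, the way a plain layer three router picks a route."""
    addr = ipaddress.ip_address(dst)
    matches = [r for r in routes if addr in r.prefix]
    return max(matches, key=lambda r: r.prefix.prefixlen)


def forward(routes, node_ip, src, dst):
    """Return (interface, packet); packets leaving via the tunnel get an outer header."""
    route = lookup(routes, dst)
    packet = {"src": src, "dst": dst}
    if route.dev == "tunl0":
        # Overlay case: the fabric only ever sees node IPs in the outer header.
        packet = {"outer_src": node_ip, "outer_dst": route.via, "inner": packet}
    return route.dev, packet


dev, pkt = forward(NODE1_ROUTES, "10.0.0.11", "10.48.0.1", "10.48.0.65")
print(dev, pkt)
```

Without the overlay, the same lookup would simply return the physical interface and the node 2 address as next hop, and the fabric would have to know how to route the pod addresses itself.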
And the fabric is pretty comfortable routing that traffic. So hopefully that gave you a very high-level picture of how things work. You can make your way around with simple commands like ip address and ip route and figure out how the traffic is going through the VXLAN or the tunnel interface. If you run into a problem because your configuration is incorrect, start with these basic things: ip address, ip route, tracing the traffic. Then you’ll figure out whether it’s getting dropped at the node or in the fabric. Now that we’ve looked at the traffic path for pod to pod, one thing that we’ve still left unresolved is how you configure things so that the traffic can flow through the fabric. We’ll cover that later.
That was east-west traffic. The other aspect is when the traffic is coming into the service. Almost all of us have used things like service, ingress, NodePort, load balancer, et cetera, so I’m not going to spend a lot of time on it. But I’ll try to put the big picture in place. Pods have ephemeral IP addresses, which keep changing. So you have the construct called service, which provides a static endpoint for the pods. The service is a ClusterIP service by default, and in that case it is reachable only inside the cluster. When you are sending traffic to a ClusterIP service, going from pod A to service A, the magic that happens is that kube-proxy gets involved, and we’ll see that in a moment. Go one step to the left, which is ingress. We will cover that as the last topic. One step further left is NodePort. Now, [inaudible 00:15:29] all of us have the requirement to access the service from outside the network. When that happens, you basically need a way to expose the service to the outside.
The standard construct in Kubernetes for that is NodePort. As soon as you declare a service to be a NodePort, it exposes that port on all the nodes of the cluster. So from outside you can access the service using a node IP and that port. When the traffic comes to that node and that port, kube-proxy basically intercepts it and redirects it to the appropriate service, or rather the appropriate endpoints, which are pods. We’ll talk about how that happens in just a moment. One step left is the load balancer. With NodePort, as I said just now, you access the service using the node IP address and the port. That gets very tricky because everybody has to remember which node and which port. Managing that becomes a nightmare. So the construct that Kubernetes has is the load balancer. As soon as you declare a service to be a LoadBalancer, magically the appropriate NodePorts are created and the appropriate load balancer is created. If it’s outside the cluster, so be it. If it’s inside, so be it. For example, if you’re using a cloud provider, it’ll be the cloud provider’s load balancer. If you are on-prem, it could be MetalLB. Or if you’re using a controller like F5, it could be an F5 load balancer. What happens is the load balancer sends the traffic to the NodePort, and the NodePort in turn sends it to the pods. Now, if you have hundreds of services, you are going to need hundreds of load balancers. This problem is very much solved in our world. What we do is create a reverse proxy, NGINX as an example. That reverse proxy can host multiple services with a [inaudible 00:17:55] backend behind the proxy. Exactly the same thing happens when you set up ingress. The load balancer actually points to the ingress, and the ingress knows how to distinguish between different layer seven services or layer seven endpoints.
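To make the reverse proxy idea concrete, here is a minimal Python sketch of how one entry point can fan out to several layer seven backends by host name and path. The host names, paths, and backend names are all made up for illustration, and the prefix matching is deliberately simplified compared with what a real ingress controller does.

```python
from typing import Optional

# Hypothetical rules: (host, path prefix, backend service).
INGRESS_RULES = [
    ("shop.example.com", "/cart", "cart-svc:8080"),
    ("shop.example.com", "/", "web-svc:80"),
    ("api.example.com", "/", "api-svc:9000"),
]


def route_request(host: str, path: str) -> Optional[str]:
    """Pick the backend whose host matches and whose path prefix is the longest match."""
    candidates = [
        (prefix, backend)
        for rule_host, prefix, backend in INGRESS_RULES
        if rule_host == host and path.startswith(prefix)  # simplified prefix match
    ]
    if not candidates:
        return None  # no rule: a real proxy would answer with its default backend
    return max(candidates, key=lambda c: len(c[0]))[1]


print(route_request("shop.example.com", "/cart/items"))
print(route_request("api.example.com", "/v1/users"))
```

All three backends sit behind a single listener here, which is the reason one load balancer in front of the ingress can replace one load balancer per service.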
So you can put a bunch of layer seven services behind the ingress. That way you significantly reduce how many load balancers you are going to need.
Let’s quickly look at how these things work under the hood. I talked about what happens when a pod sends traffic to a service, where everything is happening inside the cluster. In that case kube-proxy will not do any kind of source NATing and all of that. Because it sees everything happening inside the cluster, it just manipulates the destination IP so that the pod sends the traffic directly to the actual endpoint. If there are three endpoints, kube-proxy will do [inaudible] load balancing. So the pod actually sends the traffic to the endpoint. If you try an iptables trace, you’ll be able to see that. If you want the steps to reproduce that or experiment with it, I have posted them in a blog and we are happy to share that if you need it. Now, kube-proxy can work in iptables mode, which is typical iptables, or in IPVS mode, which is again a 15-year-old proven technology. It runs as a proxy and does very efficient load balancing. If you have thousands of services, then definitely consider going with IPVS. If you have anything below hundreds of services, don’t even worry about it. If you look at the link, we have done our testing and posted it here. Go with what works for you, meaning whatever is easy for you to understand, troubleshoot, and all that.
The next part is traffic coming from outside the network to the inside. Let me say one thing here first. When the traffic is inside the cluster, if pod A is trying to reach service A, then effectively pod A is sending traffic to another pod, pod B, which is behind that service. And pod B can return traffic directly to pod A. There’s no issue there. It goes through the same conntrack.
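The in-cluster service path can be modeled roughly as follows. This is a toy model, not kube-proxy code: the service address 10.96.0.10 and the endpoint list are invented, and picking an endpoint at random stands in for whatever load balancing the real data path applies. The point is that only the destination is rewritten, and the connection tracking entry lets the reply be translated back.

```python
import random

# Hypothetical ClusterIP service with three endpoints (pod IP, port).
SERVICES = {
    ("10.96.0.10", 80): [("10.48.0.65", 8080), ("10.48.1.7", 8080), ("10.48.2.9", 8080)],
}


class ServiceNat:
    """Toy model of destination NAT plus connection tracking for ClusterIP traffic."""

    def __init__(self, services, seed=0):
        self.services = services
        self.rng = random.Random(seed)
        self.conntrack = {}  # original flow -> chosen endpoint

    def outbound(self, src_ip, src_port, dst_ip, dst_port):
        flow = (src_ip, src_port, dst_ip, dst_port)
        if flow not in self.conntrack:
            # First packet of the flow: choose one endpoint and remember the choice.
            self.conntrack[flow] = self.rng.choice(self.services[(dst_ip, dst_port)])
        ep_ip, ep_port = self.conntrack[flow]
        return (src_ip, src_port, ep_ip, ep_port)  # source untouched, destination rewritten

    def inbound_reply(self, src_ip, src_port, dst_ip, dst_port):
        # Reverse the translation so the client sees the reply coming from the service.
        for (c_ip, c_port, svc_ip, svc_port), endpoint in self.conntrack.items():
            if endpoint == (src_ip, src_port) and (c_ip, c_port) == (dst_ip, dst_port):
                return (svc_ip, svc_port, dst_ip, dst_port)
        return None  # no matching connection: the packet would be dropped


nat = ServiceNat(SERVICES)
pkt = nat.outbound("10.48.0.1", 40000, "10.96.0.10", 80)
print(pkt)
print(nat.inbound_reply(pkt[2], pkt[3], "10.48.0.1", 40000))
```

A reply that does not match any tracked flow returns `None` in this model, which mirrors the "no established connection" drop that matters for the NodePort case discussed next.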
Conntrack is the connection tracking module that iptables relies on. When the traffic is coming from outside, though, NodePort gets involved. The underlying magic is that kube-proxy intercepts that traffic. When you set up the NodePort, kube-proxy knew about it, so it put the appropriate plumbing in place. It creates the DNAT structure so that the traffic that came in to, let’s say, host A port 30000 is DNATed to pod A. There is a destination NAT happening, which rewrites the destination IP address to the IP address of pod A. With that DNAT, if pod A is running on the same node, that’s fine. But if pod A is running on a different node, then the pod will not see the actual source IP address of the sender. Just think it over. As I said, it might get a little overwhelming, but if you think it over you realize that a pod on a different node would see the IP address of the node on which kube-proxy did the DNAT, because that became the sender IP. When that happens, the requirement is that the return traffic must flow through the same node that it came in on. If you’re wondering whether the pod on the different node can send the traffic back directly, that won’t happen, because conntrack will drop it: there is no existing established connection for it. That’s the thing you need to be aware of. The good news is you don’t need to do any kind of configuration; it just works. But if you come to a question like, will DSR, direct server return, work for this type of situation, you really need to be aware of this. If you are using iptables or IPVS, that won’t work. If you’re using the Calico eBPF data path, that will work, because under the hood there is no kube-proxy, or rather we have basically enhanced the kube-proxy.
The magic that happens is that we basically create a tunnel and deliver the incoming NodePort traffic inside that tunnel to the destination node. So when the destination pod sees that traffic, it appears as if the traffic has come from the actual sender. The advantage is that the pod sees the actual sender, and the return traffic goes back directly to the sender as opposed to going through all these hoops. So that’s kube-proxy when traffic is coming through the NodePort. If you believe you are already familiar with all this and you want me to go faster, just send a note. But of course, we won’t know, because there are a lot of attendees.
What happens with the load balancer is that the load balancer has a few target groups. Those target groups are nothing but the nodes that are exposing the NodePort. If you go through a load balancer configuration, you can see that the targets are the nodes and the NodePort. The traffic flows from the load balancer to the NodePort, and from the NodePort it flows the way I described for kube-proxy.
The last one is ingress. When you run ingress, the thing to be aware of is that it is nothing but a set of pods. Just run kubectl get pods in the ingress controller’s namespace: it’s nothing but a set of pods running in that namespace. If you go deeper and get into the pod, and you are running NGINX as an example (you can run anything), you will land in NGINX. If you’re familiar with that, you’ll be able to see the web server configuration and all that. It’s exactly the same reverse proxy. The magic that happens is that you declare your service with the ingress, and automatically all the plumbing is configured. You don’t have to take care of any of that.
I’m sorry, I should say that you create an ingress resource, which does the configuration so that the ingress knows how to direct the traffic that’s coming into it to your service. That’s the plumbing you need to do. The way it works is the load balancer sends traffic to the ingress, and the ingress sends it to the endpoint, which may be running on the same node or a different node. It’s basically sending it to a service, which is a logical construct, right? With ingress involved, there are a few things happening. One is that traffic is traversing different nodes, so there is a lot of east-west traffic involved. Then you have to worry about the scalability of ingress, being able to scale it up and down. And finally, by the time traffic reaches the pod, you don’t have visibility into which IP it came from, unless you do some settings in specific scenarios.
Okay. So we went through the architecture, and we went through the high-level traffic flow that happens in the cluster and how it happens under the hood. If you are in charge of architecting the network, the first thing you need to worry about is how you go about it. Maybe some of you have already done it and you probably know better than me. What I figured was to identify some of the key objectives that you need to achieve. By no means is this complete, but it covers most of the things you need to cover in your planning. And this is necessary up front because you are dealing with various teams: the dev team, network counterparts, the platform team, and security teams. They are all stakeholders. They are all going to approve the network design in some way. By making the right decisions and getting that alignment, you can move forward quickly. The first part is the CIDRs: your cluster CIDR, your service CIDR, your node network. Your objective should be to come up with the CIDR list and justify why you need it.
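One simple sanity check on such a CIDR list is that the ranges do not overlap each other. The sketch below uses Python’s standard ipaddress module; the three ranges in the plan are placeholder values chosen for the example, not recommendations from the talk.

```python
import ipaddress
from itertools import combinations


def find_overlaps(cidrs):
    """Return the pairs of named CIDRs that overlap each other."""
    nets = {name: ipaddress.ip_network(cidr) for name, cidr in cidrs.items()}
    return [(a, b) for a, b in combinations(nets, 2) if nets[a].overlaps(nets[b])]


# Hypothetical plan: node network, pod (cluster) CIDR, and service CIDR.
plan = {
    "nodes": "10.0.0.0/16",
    "pods": "10.48.0.0/16",
    "services": "10.96.0.0/12",
}
print(find_overlaps(plan))  # an empty list means the plan has no conflicts
```

The same check can be extended with the ranges already in use elsewhere in the data center, which is usually where conflicts with the network team’s allocations show up.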
The second part, I will say, is the most critical decision you will make. Are you going to opt for a [inaudible 00:28:05] network? Or are you going to opt for a network that is a native part of your data center? Remember that you may start with a 20-node cluster in one or two racks. Eventually, as you start migrating your applications, you are looking at maybe tens or hundreds of racks spread across multiple, let’s say, subnets. And different application teams will have different requirements. Somehow you cannot… The decision that you make lives through the life of the cluster. So I will say this is a very critical decision. The last part is high availability. What I’ve seen is that while you can set your own objective, it’s often driven by what you get. If you’re building a brand new network, you can dictate what’s right and what’s ideal. If you are operating in an existing network, often the high availability will be driven by what you get.
Let’s look at the factors you need to consider in the planning process so that you can come up with a document. The first one is your existing network. Obviously, if you understand your existing network, that’s great. Otherwise, ask: how is your current network designed? Is the cluster going to be in a single layer two segment, given multiple racks? Or is it going to be different layer three domains in different racks? I’ll say that most multi-rack deployments, which is what I see, are layer three. I should also say that with layer two it becomes less of a headache and much simpler. Next is BGP. Again, I have seen more of the data centers running BGP, but you’ll have to make sure whether you are running BGP in your data center. Collect information on what kind of top-of-rack switches you have, the high-level BGP configuration you have, and all of that.
If you are particularly planning to integrate your cluster with the data center, it’s a very, very important thing to know. The third one is high availability. What kind of high availability do you need? Whether you’re designing a new network or working with an existing network, there are broadly three types of blueprint. One is you have a single top-of-rack switch, so you opt for a bonded interface. Even if one NIC goes down, that bonded interface guarantees that the connectivity is there. The second one is dual ToR. There are two top-of-rack switches, and all the nodes are connected to both of them with an MLAG or vPC type of configuration, so that from the node it looks like a bonded interface. Even if a top-of-rack switch goes down, the link is still up. The next one is you go layer three all the way to the node. In that case, what you see from the node is different layer three links to the top-of-rack switches. And you’re going to do the magic so that you have layer three ECMP from the node to the top of the rack. That means you’re basically running your entire data center in layer three and you’re using BGP. As much as possible, your network team already has the blueprint and you want to stick to a similar type of architecture, unless you have the influence to change it. Now, I have been engaged in network designs where they’re building a brand new data center. In that case, obviously, they go with a proven blueprint. [inaudible 00:32:02] with a ballpark calculation of your CIDR space, and keep auto scaling in mind. I haven’t seen a good ratio between pods and services. I would love to know what you see or what you have done. I tend to pick between 1:5 and 1:10 as a reasonable ratio between pods and services. So the CIDR sizing will give you the cluster CIDR that you are going to need, and the service CIDR. You need IP addresses for both pods and services.
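A ballpark sizing along these lines can be written down as a small calculation. The sketch below is an illustration under stated assumptions: it reads the 1:5 to 1:10 ratio as one service per five to ten pods, and the node count, pods per node, and growth factor are hypothetical inputs rather than figures from the talk.

```python
import math


def prefix_for(addresses_needed: int) -> int:
    """Longest IPv4 prefix length whose range holds at least this many addresses."""
    bits = (addresses_needed - 1).bit_length()  # ceil(log2(n)) using integer math
    return 32 - bits


def size_cluster(nodes: int, pods_per_node: int, pods_per_service: int, growth: float = 2.0):
    """Ballpark pod and service CIDR sizes, with head room for auto scaling."""
    pods = math.ceil(nodes * pods_per_node * growth)
    services = math.ceil(pods / pods_per_service)
    return {
        "pods": pods,
        "pod_cidr_prefix": prefix_for(pods),
        "services": services,
        "service_cidr_prefix": prefix_for(services),
    }


# Hypothetical inputs: 100 nodes, 60 pods per node, one service per 5 pods, 2x growth.
print(size_cluster(nodes=100, pods_per_node=60, pods_per_service=5))
```

With these inputs the estimate is 12,000 pod addresses, which needs at least a /18, and 2,400 service addresses, which needs at least a /20. Because pod addresses are handed out to nodes in blocks, a real plan would leave more room than this raw count suggests.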
So you need to take care of that. Keep auto scaling in mind. Single cluster or multi-cluster is something you need to plan up front, and it’s going to be driven by the application design. Are the applications going to be distributed? Are they going to speak to each other? What if your dev services team comes to you and says, okay, I have service A in cluster A and service B in cluster B, and they need to talk to each other? Or what if they tell you, I have service A, which I’m going to deploy in cluster A, cluster B, and cluster C, in North America, Europe, and Asia, and if you give me a single IP address or single endpoint or URL to reach that service, my life would be simpler? If you run into those types of questions, definitely explore them and make them part of your planning process. I missed one thing, on MTU. I wanted to put the link here because some of you may be curious about MTU, eBPF, iptables, and all that, and that link describes it. Long story short, an MTU of 1,500 versus an MTU of 9,000 is going to make a significant difference when it comes to performance and CPU utilization. I’ll say 100%. So if you have a standardized MTU and it happens to be on the higher side, that’s a blessing. By all means, use that feature in Calico so that you get 2X more performance. Don’t discount it.
And tenancy is a big part, primarily because you have to start thinking of network design from the tenant standpoint. Why do I say that? In some places you have multiple services using the same cluster, and they belong to the same org. It’s a non-issue. You don’t have to worry. With some of your end users, you will see that the services belong to different organizations and different networks, and your network team has solid segmentation in place.
That you absolutely need to keep in mind, because you cannot go out of your way to make it a flat network for all the tenants in Kubernetes. You have to honor the segmentation because that’s how your business processes have been designed. Sometimes that might go deeper as well. I have seen this particularly for telco deployments, where they might say that they need network segregation: tenants need to belong to different virtual segments inside Kubernetes. That’s a different discussion that needs to happen on a one-off basis. Basically you’re looking at creating individual VRFs for individual tenants. But most of the scenarios I have seen are about segmentation, and you definitely have to explore that, document it, and plan your network design around it. Just quickly checking time here. We have 20 minutes to go, and John, keep me honest here. I’ll try to wrap up the slides so that we can spend 15 or 20 minutes on the demo.
So the next one is applications. The only thing to keep in mind is what kind of connectivity the applications require. Is it only HTTP? Is it VoIP, gaming traffic, UDP protocols? This comes into play especially if you’re thinking of ingress, because ingress only covers HTTP. If there are additional requirements and ingress is not enough, then you’ll think of a load balancer, or you may have to think of exposing the services directly using service advertisement in BGP. The last one is security requirements. I will say that this one stumps people quite a bit. You will often see that all of us have to go back and forth a lot of times, because we are dealing with lots of stakeholders and we are learning things as we go along. That’s a good thing. Nobody knows everything on day one. You talk to different people, you learn different things. Nobody has everything in their mind.
So what I’ve learned is that there are some things you should definitely ask about and probe deeper. One is whether IP address visibility of the pods is a requirement. That’s going to change everything. If your security team is mandating that they need IP address visibility for the pods, again ask why, right? They might be sticking to it because they have a tool that can only understand IP addresses in the network. Try to question that, because as part of your Kubernetes security and network design, you may be thinking of tools that give you actual native pod-level visibility. If you share that information, maybe they’ll be open to it. So don’t just stop at asking the question and taking the answer back into your design. Go deeper and try to convince them. The second is pod IP for identity. The reason this is important is that I’ve seen very subtle situations in which customers have existing inventory tools, and unfortunately those tools don’t understand pods. Building a Kubernetes cluster you can do in 10 minutes with an automated tool. But ultimately you have to operationalize it, and when you want to operationalize it, all of this comes into the picture. The operations team is going to use existing tools. Those tools only understand IP addresses. So unless, as part of your platform design and integration, you put pod integration into the existing tools, you have no choice but to build a network design in which the pod IP address is preserved in the network so that those tools can work. I’ve seen customers do this in stage one so that they can roll out, because ultimately you’re driven by your own sprints, right? You cannot wait one year to roll out Kubernetes. So in sprint one you quickly roll it out. It is still not production ready, and three or four months down the line you enhance the design.
But if you have got the answers to all of these and you put the network design in place accordingly, then you’ll have done the right thing. The last part of the planning is fabric restrictions. This is a big thing. If the fabric has a rule in place so that IP-in-IP is not allowed, then you are in big trouble if you have picked IP-in-IP as your encapsulation. Or if VXLAN is allowed only on specific nodes, again you are in trouble. So for any network design that you come up with, you need to make sure that the fabric can honor the traffic, and that there is no black hole that you’ll get stuck in later when you are implementing it. That’s why I said that the planning process is extremely significant. If you do the planning right, you save 60 or 70% of the effort by not going back and forth. You will save a lot of time.
Now let’s go into the individual features of Calico. In the process, I’ll try to stick to the why, so that you can keep that in mind and dig deeper into the configuration when you need it. So, IP address management, very broadly. As we discussed, pods need IP addresses, and Calico’s IP address management module takes care of that. By default, each node is given a CIDR block, because you need to manage the IP addresses across the nodes. In Kubernetes you may ballpark anything between 30 and 60 pods per node, so by default we go with 31 pods, which is a /26 CIDR block, and we allocate that so that we can manage it efficiently. Now what happens if a node crosses 31 pods? We allocate one more block to it, and now it has capacity for 62 pods. This allows us to aggregate the addresses as a CIDR block per node, so that we don’t have to create individual routes in the fabric. Even within Calico, for individual nodes, we can aggregate. Now what happens if you have exhausted the pool? We have a big problem there, right?
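To make the block arithmetic and the exhaustion question concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not Calico’s allocator: the pool CIDR `10.48.0.0/22` is hypothetical, and the model simply assumes that a node claims whole blocks as its pod count grows. Note that by plain CIDR arithmetic a /26 holds 64 addresses, so the per-block pod counts quoted in the talk are best read as approximate.

```python
import ipaddress


def block_capacity(pool_cidr: str, block_prefix: int = 26) -> tuple[int, int]:
    """Return (number of blocks in the pool, addresses per block)."""
    pool = ipaddress.ip_network(pool_cidr)
    if block_prefix < pool.prefixlen:
        raise ValueError("a block cannot be larger than the pool")
    block_count = 2 ** (block_prefix - pool.prefixlen)
    addresses_per_block = 2 ** (pool.max_prefixlen - block_prefix)
    return block_count, addresses_per_block


def blocks_needed(pods_on_node: int, addresses_per_block: int) -> int:
    """Whole blocks a node needs for its pods (ceiling division)."""
    return -(-pods_on_node // addresses_per_block)


def pool_exhausted(pods_per_node: list[int], pool_cidr: str,
                   block_prefix: int = 26) -> bool:
    """True if the nodes together would need more blocks than the pool has."""
    block_count, addresses_per_block = block_capacity(pool_cidr, block_prefix)
    needed = sum(blocks_needed(p, addresses_per_block) for p in pods_per_node)
    return needed > block_count


if __name__ == "__main__":
    # Hypothetical pool: a /22 carved into /26 blocks.
    print(block_capacity("10.48.0.0/22"))
    # Ten lightly loaded nodes plus four heavily loaded ones.
    print(pool_exhausted([10] * 10 + [70] * 4, "10.48.0.0/22"))
```

In this model a lightly loaded node still holds a whole block, which is why a pool can run out of blocks while many individual addresses remain unused.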
When that happens, we use something called IP stealing. I am not going to go deep into it, but it may so happen that some nodes are using 10 pods, some nodes are using 60 pods, and you don’t have any more IP blocks. So we use IP stealing so that your life can go on, and we do all the plumbing in iptables and IP routing so that there is no black hole. But what happens if there is a black hole in the fabric, because you’re not controlling that, right? For that case, we provide an option so that you can disable IP stealing. These are far more advanced, specific discussions that I see with 1 or 2% of customers. If you run into it, know that the plumbing exists for you and you can make your way around it.
The IP pool also allows you to dedicate IP addresses, like a static IP or a floating IP. These are fairly simple names that you can correlate with standard nomenclature. The other part is when you are forced to design the architecture so that your existing network or your existing tools understand the IP addresses, because they might have allocated IP addresses for certain business applications. All you’re doing in Calico is assigning an IP pool to the namespace instead of using the default. Now Calico allocates IPs for that namespace from that pool, irrespective of the node. That’s a big deal, because when the traffic goes into the network, the moment you see that IP address during troubleshooting, debugging, inventory management, or reporting, you know that the IP belongs to that business application.
I’ll run through these slides quickly in the interest of time so that we can move on to BGP. I didn’t talk about how you can allocate an IP pool on a per-node basis. Invariably, when you are doing a multi-rack architecture, most of the time it’s layer three. So how do you go about allocating different layer 3 blocks to different racks? It’s as simple as using the node selector in the IP pool.
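The per-rack and per-namespace pool assignment described above can be sketched as follows. This is a hedged illustration that builds manifests as Python dictionaries and prints them as JSON. The supernet, the rack label key `rack`, the pool names, and the namespace `payments` are all hypothetical; the `IPPool` fields (`cidr`, `blockSize`, `nodeSelector`) and the `cni.projectcalico.org/ipv4pools` annotation follow the Calico resource reference as best understood here, so check them against the documentation for your Calico version before applying anything.

```python
import ipaddress
import json


def rack_pools(supernet: str, racks: list[str], pool_prefix: int) -> list[dict]:
    """Carve one pool per rack out of a supernet and pin it with a node selector."""
    subnets = ipaddress.ip_network(supernet).subnets(new_prefix=pool_prefix)
    pools = []
    for rack, cidr in zip(racks, subnets):
        pools.append({
            "apiVersion": "projectcalico.org/v3",
            "kind": "IPPool",
            "metadata": {"name": f"pool-{rack}"},
            "spec": {
                "cidr": str(cidr),
                "blockSize": 26,
                # Only nodes carrying this (hypothetical) label use the pool.
                "nodeSelector": f'rack == "{rack}"',
            },
        })
    return pools


def namespace_with_pool(namespace: str, pool_name: str) -> dict:
    """Namespace whose pods draw addresses from a dedicated pool."""
    return {
        "apiVersion": "v1",
        "kind": "Namespace",
        "metadata": {
            "name": namespace,
            "annotations": {
                # The annotation value is a JSON list of pool names.
                "cni.projectcalico.org/ipv4pools": json.dumps([pool_name]),
            },
        },
    }


if __name__ == "__main__":
    for pool in rack_pools("10.48.0.0/16", ["rack-1", "rack-2"], 20):
        print(json.dumps(pool, indent=2))
    print(json.dumps(namespace_with_pool("payments", "pool-payments"), indent=2))
```

Generating the pools from one supernet keeps each rack’s range contiguous, which is what lets the fabric carry one aggregate route per rack.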
That’s all that this slide says. And hopefully you understood why you need it: because the network demands it. The next one I already spoke about, so I’m going to skip it. The next one is static IP or floating IP. There are a lot of scenarios in which you’re going to need a static IP, like VoIP or gaming applications. And you’re going to need a floating IP when, let’s say, you have a database pod and you want to tie the same IP to it even when it gets rescheduled. You use a floating IP, which is accessible from outside. Calico provides an annotation mechanism so that you can use both static IPs and floating IPs.
Now let’s talk about BGP to the node. It is the same BGP that you know. What’s happening here is that you have thousands of pod IP addresses and hundreds of service addresses, and you’ve decided that you are going to advertise them to the rest of the world, meaning your data center, so that it knows how to reach them. And so you have to do some kind of BGP design. Those of you who do it in the regular network know how complex it can get, right? I explained why you need it: at the bare minimum, when traffic is going from rack one to rack two, if the fabric doesn’t know the route, it’s going to drop that traffic. But there is more to it. Now that you are running BGP, and you are good at BGP, you’d like to do hundreds of things, right? One of the most basic things you’d like to do is this: you have all these services running. Instead of running a load balancer, ingress, and all that, can you just advertise the services? The good news is that with just one annotation, you can do that. I take it back. With just one spec, you can do that. And what if Calico doesn’t provide the configuration option, and you want to do some kind of custom configuration? You use something called a custom template. I spoke about confd and all that.
There’s a documented process in Calico Enterprise for how to do that. It is actually straightforward after you have done it for the first time. The first time, it’s going to trouble you a lot, so it’s always better to consult the documentation and try to make it work. After that, if you understand BIRD and you understand templating, it will be pretty easy. That said, I have seen fewer and fewer accounts where we need to do that.
Now, I mentioned one thing earlier when I said that you have multiple clusters. Your services team or your service owner comes to you and says, “I am going to run my service in all these clusters in Asia, North America, and Europe. Can you give me just one endpoint so that I don’t have to worry about all these different services in different clusters?” And the answer to that is BGP. You’re simply giving the service owner one service endpoint, or IP address, and that service IP is available in all the clusters. You are using BGP to advertise that service IP through your network. In your DNS, you’ve configured a DNS name, and that DNS name points to all those endpoints, meaning all those cluster IP services. Now anytime somebody tries to reach that service IP or that DNS name, the network itself will route the request to the nearest cluster. You don’t have to worry about it.
I may have mixed up a few things here. You can do that using the service cluster IP, or you can do it using a single static external IP. An external IP is a Kubernetes construct that allows kube-proxy to route traffic to the pods as soon as traffic arrives at that external IP. So you can use an external IP, advertise it using Calico, and map a service domain name to that external IP. Now the service domain name effectively becomes anycast to any endpoint running in any cluster. The network will take care of routing. So that’s very, very powerful.
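As a hedged sketch of the external IP approach just described, the following Python builds the two manifests that would be applied identically in every cluster: a Calico `BGPConfiguration` that advertises the external IP, and a Kubernetes `Service` that lists the same address under `externalIPs`. The IP `192.0.2.10`, the service name, the port, and the selector are hypothetical. The `serviceExternalIPs` field is taken from the Calico resource reference as best understood here; verify it against your Calico version.

```python
import ipaddress
import json


def advertise_external_ip(external_ip: str) -> dict:
    """Calico BGPConfiguration that advertises one external IP as a /32."""
    ip = ipaddress.ip_address(external_ip)  # raises ValueError on bad input
    return {
        "apiVersion": "projectcalico.org/v3",
        "kind": "BGPConfiguration",
        "metadata": {"name": "default"},
        "spec": {"serviceExternalIPs": [{"cidr": f"{ip}/32"}]},
    }


def anycast_service(name: str, external_ip: str, port: int,
                    selector: dict[str, str]) -> dict:
    """Service that accepts traffic arriving on the shared external IP."""
    return {
        "apiVersion": "v1",
        "kind": "Service",
        "metadata": {"name": name},
        "spec": {
            "selector": selector,
            "ports": [{"port": port, "targetPort": port}],
            "externalIPs": [external_ip],
        },
    }


if __name__ == "__main__":
    # The same pair of manifests is applied in each cluster, so every cluster
    # announces the same address and the network picks the nearest one.
    for cluster in ["asia", "north-america", "europe"]:
        print(f"# cluster: {cluster}")
        print(json.dumps(advertise_external_ip("192.0.2.10"), indent=2))
        print(json.dumps(
            anycast_service("checkout", "192.0.2.10", 443, {"app": "checkout"}),
            indent=2))
```

Because every cluster announces the identical /32, the DNS name only ever needs to resolve to that one address.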
If you think of trying to do that without BGP, you’re looking at trouble, meaning it’s a lot of work, and I don’t know a proper way of doing and maintaining it. The slides here go through it in some more detail, and I’m going to skip them for now because I want to spend some time on the demo.
There is one thing that I want to talk about, which is sender IP visibility on the traffic. I spoke about advertising the cluster IP through BGP, so your network knows that it can reach the endpoint through any of the nodes, because the cluster IP is being advertised. If you recall how kube-proxy works, by the time kube-proxy sends the traffic to a pod on another node, it has gone through one hop and the source IP has been lost. What the pod sees is the source IP of the kube-proxy node. This is definitely a problem with the default configuration. Now think about it: should you be doing this at all? I leave it to you, because it depends on how many services you have and how much east-west traffic is going on. Some customers have told me, and I know for a fact that some large providers do this because they have posted about it on YouTube, that they set a construct called externalTrafficPolicy to Local, because this inter-mesh traffic became overwhelming for them. There’s just too much going on. You can set the pod affinity so that pod scheduling happens in a uniform manner across nodes. I’ll just go to that slide so I can show it. I think I somehow missed it. Yeah, here. You can see that node three is not advertising the cluster IP. Only the nodes that host the pods advertise the IP to the external network. And so when a node receives the traffic, it will only send it to a pod on the same node.
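The effect of the Local traffic policy on which nodes advertise the service, and why pod distribution matters, can be modeled with a short Python sketch. This is a simplified model, not a description of any implementation: it assumes that under `Cluster` every node advertises the service IP, that under `Local` only nodes hosting at least one backing pod do, and that the fabric splits traffic evenly across advertising nodes (equal-cost multipath), which your fabric may or may not do. The node names and pod counts are hypothetical.

```python
from fractions import Fraction


def advertising_nodes(pods_by_node: dict[str, int], policy: str) -> list[str]:
    """Nodes that announce the service IP under the given traffic policy."""
    if policy == "Cluster":
        return sorted(pods_by_node)
    if policy == "Local":
        return sorted(n for n, count in pods_by_node.items() if count > 0)
    raise ValueError(f"unknown policy: {policy}")


def per_pod_share(pods_by_node: dict[str, int]) -> dict[str, Fraction]:
    """Fraction of total traffic each pod on a node receives under Local.

    Assumes an even split across advertising nodes, then an even split
    across the pods local to the receiving node.
    """
    nodes = advertising_nodes(pods_by_node, "Local")
    return {n: Fraction(1, len(nodes) * pods_by_node[n]) for n in nodes}


if __name__ == "__main__":
    skewed = {"node1": 1, "node2": 3, "node3": 0}
    print(advertising_nodes(skewed, "Cluster"))
    print(advertising_nodes(skewed, "Local"))
    print(per_pod_share(skewed))
    print(per_pod_share({"node1": 2, "node2": 2}))
```

In the skewed placement the single pod on `node1` carries three times the load of each pod on `node2`, which is the imbalance that uniform scheduling through affinity and anti-affinity rules is meant to avoid.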
You take care, by using affinity and anti-affinity, that the pods are uniformly distributed. If that works for you, by all means do it, because then your operations and troubleshooting will become a cakewalk. The pod will see the original sender IP. If you’re using eBPF, you don’t need to worry about any of that.
Okay, very good. Thank you, Bikram. That was an excellent presentation. We have a number of questions here. Some of these maybe lack a little bit of context. One of our questions is in reference to the Cilium in routed mode example, where you need to instruct the nodes in advance about the other nodes’ CIDRs. Does that mean that you don’t need to instruct the nodes about the other nodes’ routes outside of Calico?
Okay. Let me make sure I understood the question correctly. Is the question about using Cilium versus Calico, or about using Cilium for routing? Maybe I can see it more clearly on the... Let me just unshare so I can see it. Okay. If I exit, I hope it doesn’t... Yeah. Now that I’ve stopped sharing, I can see the questions here. Almost there. Okay. “Does it mean that you don’t have any need to instruct a node with the other nodes’ routes outside Calico?” That’s correct. Actually, let me qualify that. There should be a way for the Calico node to reach the fabric. So if you have the default gateway configured appropriately, so that traffic to any destination can go through the default gateway, then absolutely, you don’t need to instruct the nodes how to reach any other nodes outside Calico, because they’ll use the default gateway. And if the question was about traffic between the Calico nodes, then you really don’t need to configure anything.
Going to the next question: can you send the link to the iptables tracing blog? Yes, we can send that over as part of the slide deck. So John, there are two action items.
One would be the demo and how to replicate it. The second one is iptables debugging and tracing.
The third question that I see here is, what are the Calico containers in this? I don’t remember off the top of my head, but if you do a search on the Project Calico website, you’ll find it. Or you can just describe the Calico node pod. I believe there are two containers running in it, but I may be wrong.
The next question is, does it work on Raspberry Pi? I have not tested it, so I don’t know. If you are able to set up Kubernetes on Raspberry Pi, then I don’t see any reason why it would not work. On my setup, I have 12 GB of RAM and four CPUs, running VirtualBox. I just created Ubuntu VMs and did all the plumbing, and that worked. So if you have 12 GB of spare RAM and four CPUs on your laptop, you can do that. And with Raspberry Pi, if you can do the plumbing and you are able to run kubeadm, I don’t see why you cannot.
How do we debug when container networking doesn’t work? Good question. I have written a blog on that, so we’ll make sure to send the blog. Broadly, the idea is to start by making sure that the network is appropriately connected. Or, at a broader level, start by making sure that your Calico components are running, because if that’s broken, then everything is broken. Second, if you are new to the cluster, find out what type of encapsulation is configured, because IP-in-IP will be different from VXLAN, which will be different from native routing. The obvious thing is to quickly look at the output of the ip route command or the ip address command to see whether there’s a tunnel interface, a VXLAN interface, or no such interface at all. Then your objective is to quickly find out which path the traffic follows when it goes out from the node. You can do that using the ip route command. If all of that seems to be set correctly, then you’d like to go deeper into iptables and look at some of the basic things.
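The encapsulation check described above can be sketched as a small Python helper that classifies a node from the text output of `ip -o link show`. This is a heuristic under stated assumptions, not a diagnostic tool: it assumes Calico’s usual interface names (`tunl0` for IP-in-IP and `vxlan.calico` for VXLAN), treats an interface as in use only if it carries the `UP` flag, and is fed sample output written for illustration rather than captured from a real node.

```python
def up_interfaces(ip_link_output: str) -> set[str]:
    """Names of interfaces flagged UP in `ip -o link show` output."""
    names = set()
    for line in ip_link_output.splitlines():
        parts = line.split(":")
        if len(parts) < 3 or "<" not in line or ">" not in line:
            continue
        # "5: tunl0@NONE: <NOARP,UP,LOWER_UP> ..." -> "tunl0"
        name = parts[1].strip().split("@")[0]
        flags = line[line.find("<") + 1:line.find(">")].split(",")
        if "UP" in flags:
            names.add(name)
    return names


def detect_encapsulation(ip_link_output: str) -> str:
    """Guess the encapsulation from which tunnel interfaces are up."""
    names = up_interfaces(ip_link_output)
    if "vxlan.calico" in names:
        return "VXLAN"
    if "tunl0" in names:
        return "IP-in-IP"
    return "no tunnel interface found (possibly native routing)"


if __name__ == "__main__":
    # Illustrative sample only; real output has more fields per line.
    sample = (
        "1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 state UNKNOWN\n"
        "2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 state UP\n"
        "5: tunl0@NONE: <NOARP,UP,LOWER_UP> mtu 1480 state UNKNOWN\n"
    )
    print(detect_encapsulation(sample))
```

The result is only a first hint; the configured mode on the IP pool remains the authoritative answer, and the next step is still to trace the actual path with the ip route command.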
It doesn’t always come to that, but when it does, you have no choice but to go deeper. And so you would like to know how to map a container, or a pod, to its iptables chains. There are two or three explicit steps to do that. It’s standard copy and paste; I can show you that. That, combined with tracing, will give you enough ammunition to work through it.
In what ways is Calico better than Flannel? Flannel is something that you use to get up and running in a development setup. Because it’s so simple and uses an overlay encapsulation, you don’t need anything else, and it’s pretty good. But remember that it uses an overlay encapsulation. So what you get on the end host is an IP address that you cannot reach from your network. You have to go through a load balancer. What we have seen with customers is that as you start scaling, you run into performance bottlenecks, and then you start getting requests from the application teams that require you to rethink how you design your network. And so we offer a feature that lets you do a live migration from Flannel to Calico in the running cluster. You don’t have to rebuild the cluster.
We’re running out of time, and I’m sorry to interrupt, but we do need to wrap up before the platform closes. I’d like to thank everyone for joining us today. And Bikram, thank you so much for your excellent presentation. We have a number of upcoming online events at Tigera over the next few weeks. Please visit our website and the events page there to learn more. Once again, thank you to Bikram and to each one of you who has joined us today.