Istio – Enabling a Defense in Depth Network Security Posture


Are you looking into Istio? Many companies are investigating Istio to reduce the challenges of managing microservices, as it delivers a uniform way to connect, monitor, and secure environments – especially when used in concert with Kubernetes.

Did you know that Istio is part of the Tigera Secure solutions, that we play an active role in developing Istio, and that we co-chair the Istio security special interest group? We’d like to share our expertise to help you understand how Istio fits into a comprehensive network security model.

Watch this on-demand webinar and learn about implementing a defense in depth posture that enables zero-trust network security across L3-L7 and allows Application, DevOps, Platform Engineering, Network, and Security and Compliance teams to seamlessly work together.

Michael: Hello everyone, and welcome. Today’s webinar is “Istio – Enabling a Defense In Depth Network Security Posture.” We’re glad you all could make it. It’s a hot topic and there are a lot of you here, so we hope you enjoy this presentation and get a lot out of it.

Michael: First off, I would like to introduce today’s speaker, Christopher Liljenstolpe. He is the original architect behind Tigera’s Project Calico, probably the de facto network security solution for Kubernetes. He speaks at over 60 meetups yearly and at conferences, educating people on networking and network security for modern microservice-based container applications. He also consults with Tigera’s enterprise clients on security and compliance for their applications. So, without further ado, let me hand it over to Christopher.

Christopher: Alright, thanks a lot, Michael. So welcome, everyone. As you heard, we’re going to be talking mainly about what we call Application Layer Policy here at Tigera. It’s work mostly done with the Istio community to bring network policy up beyond Layer 3 and Layer 4. Let me spend a little bit of time on that before we go into some of the details.

Christopher: So if you start thinking about network policy, and we’ve talked a lot about network policy and security on previous webinars, most of what we’ve talked about to date has been around Layer 3 and Layer 4. That’s been: things labeled “Bob” can talk to things labeled “Alice” on port 443. That works, and there are lots of things that still rely on classical networking: applications that open up ports and make connections using standard networking concepts.

Christopher: There are some benefits to the way that we at Tigera enforce those policies: we enforce them outside of the pod, for example, and, as we’ve talked about before, at multiple points, such that even if a pod in Kubernetes becomes compromised, it can’t update the network policy controls, i.e., those L3/L4 policies.

Christopher: However, more and more applications are moving to WebSocket-y models, higher-level abstractions, where you’re not necessarily making a call to an IP address and a port, but instead making a call to a service identified by a URL.

Christopher: As that application-level networking becomes predominant, you need to do things other than just L3/L4 network policy. That’s because, frankly, at the end of the day, if everything in your application estate is using WebSockets, REST-type calls, and gRPC calls, pretty much all you have is 443 traffic running around, and most of it is going to be encrypted.

Christopher: Your network controls for these more modern applications start becoming fairly constrained, because the differentiation about what you are trying to connect to is really being defined at a higher level, the application layer, rather than at Layer 3/Layer 4.

Christopher: So, to date, what we’ve done in enterprises is deploy things like WAFs, Web Application Firewalls, et cetera, where you have a separate set of infrastructure and a separate set of policies that define web-tier firewalling: what URLs you’re allowed to access, what HTTP verbs, POST/GET, et cetera, you’re allowed to use. And that’s separate from the network firewall infrastructure, i.e., this host can talk to this host on port 443.

Christopher: I’m sure anyone here who has run these disparate environments, the WAF and the network firewall, will have run into cases where the WAF rules don’t quite match up to the network policy rules, and you get interesting failure modes where, for example, one policy allows traffic and another policy denies it.

Christopher: In similar ways, we have the same problem here: how do we identify what this thing is, and what things should be allowed to make an HTTP GET to a given URL or service, because applications are constantly changing. So we at Tigera decided that it made sense to take the network policy model that we had basically pioneered, and that’s now a key part of Kubernetes’ network policy strategy, and extend it to Layer 5 through 7. We call this Application Layer Policy.

Christopher: The next part of this, though, is that once we decided to do that, we had to pick the right tool to enforce these policies. About the same time we were thinking about this, the Istio project was kicking off. So we became early contributors to the Istio project, and we’ve been very active in that community; Istio’s security working group co-chair is a senior engineer here at Tigera. We’ve basically taken the concept that a network policy should be able to refer to things all the way up the stack, from L3 to L7, rather than, say, having separate policies for L5 through 7 and separate policies for L3/L4.

Christopher: So, we’re going to talk a little bit about that, and how we do it within the context of an Istio environment. If anyone wants to have a further conversation about why Istio, independent of network security policies, we’re going to be discussing that over time, and we’ve got some training and other things. We can also do an Istio overview at some point. If folks are interested, please just let us know.

Christopher: But we’re going to really focus now on using Istio as a network policy control enforcement point, rather than all of the other goodness that Istio brings, like visibility, load balancing, circuit breaking, et cetera, which are all great and wonderful things, and reasons you should deploy Istio independent of what we’re going to talk about here, which is network policy. So you sort of get a twofer, or actually about a five-fer, if you deploy Istio. You get a whole bunch of application-layer semantics that make developing WebSocket-y, REST, gRPC-based applications much easier, more rugged, and more resilient, and you get the ability to have a common network policy model all the way up the stack, from L3 to L7.

Christopher: Before we talk about Istio in particular, let’s do a quick reminder of what an L3/L4 network policy in Kubernetes looks like. What you see here is a network policy. Oops, let me turn off my phone; since this is live, you get that little “Chris forgot to put do-not-disturb on his phone.” For those of you who know my phone number, you can check whether I forget to do that again by sending me a text at the next webinar, and see if I repeat the same mistake.

Christopher: Anyway, so this is a Kubernetes L3/L4 network policy, a really simple one. It basically says that things that are PCI-compliant should be able to talk to things that are PCI-compliant, and we shouldn’t allow traffic from non-PCI-compliant workloads to PCI-compliant workloads.

Christopher: So, to set this up: this is a Project Calico API global network policy. We give it the name pci-isolation, and we can optionally confine it to a given namespace. But the key thing here is that this policy, the ingress and egress section, the action section if you want to think of it that way, will be applied to any workload that has a label where the key “pci” has the value “true.”

Christopher: So, basically, the two workloads, the red ones here, will have this policy applied to them, because they’re labeled PCI. The policy statements say that for ingress we will allow traffic from things that are labeled PCI, and the same for egress: we will only be allowed to send traffic to other workloads that have a PCI label. Now, we haven’t said what ports, et cetera; that can all be scoped. Basically, if a PCI node tries to talk to another PCI node, it will be allowed in this case. If not, unless there’s another policy that would allow said traffic, the global default deny that Tigera implies would block that traffic.
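
A minimal sketch of what a policy like that might look like in the Project Calico v3 API (the pci label key and the policy name come from the discussion; the exact manifest on the slide may differ):

```yaml
apiVersion: projectcalico.org/v3
kind: GlobalNetworkPolicy
metadata:
  name: pci-isolation
spec:
  # Apply the rules below to any workload labeled pci == "true"
  selector: pci == "true"
  types:
    - Ingress
    - Egress
  ingress:
    # Only accept traffic from other PCI-labeled workloads
    - action: Allow
      source:
        selector: pci == "true"
  egress:
    # Only send traffic to other PCI-labeled workloads
    - action: Allow
      destination:
        selector: pci == "true"
```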

Christopher: So, this is an L3/L4 policy. For most workloads today, especially in enterprises, this is the layer of the network where most applications reside. So this covers all your heritage applications, even fairly modern applications, and even, I would say, WebSocket, REST, or gRPC applications, primarily because of where we enforce this policy.

Christopher: So let’s talk about that on the next slide. If we [inaudible 00:12:21] how this is implemented within a given node in Calico, or in Tigera’s solution: we have on each node an agent called “Felix,” and it listens to a Calico policy repository that’s stored in the Kubernetes key-value store, or in etcd directly. If something exists on that node that attracts a specific policy, in this case, say, a workload labeled pci: true, then Felix will write, on that node, into the kernel network filtering infrastructure, a policy that will allow traffic from other pci: true labeled workloads, whether it’s traffic coming in or traffic going out. And if there are no other rules, everything else is blocked.

Christopher: This is happening at the underlying kernel layer in the node. So even if a pod becomes compromised, its traffic will still be filtered, because the pod, by definition, doesn’t have access to the root namespace on a given node in Kubernetes, or in any containerized environment.

Christopher: So basically this is an enforcement point outside the scope of the pod, and it operates at Layer 3/Layer 4. But let’s say we’ve now moved to an estate, or a cluster, of applications that are primarily gRPC- or HTTP-centric in their API calls. As we said before, if we take that to its logical conclusion, and every single interprocess communication between different pods or different microservices is done at the gRPC or HTTP layer, the L3/L4 policy statement becomes fairly trivial: it basically says I’m allowed to talk to anything on 443. That doesn’t really provide much in the way of security.

Christopher: So, let’s talk about some extensions that we have made. These have been in demonstration for almost a year, and they’re now GA in both open-source Calico and our commercial product, Tigera Secure Enterprise Edition.

Christopher: Let’s talk about what we can do. This example is the Bookinfo sample application. The idea here is that a request comes into an ingress, which we’re not really going to talk about here, and then it comes to the first front end of this application. That’s the Product page web front end, which basically lists all of the products we have in our bookstore.

Christopher: Then we have a couple of different customer back ends, where customers can get reviews of the book they’re interested in buying. The first version we released only had written reviews by other users, with no star ratings. Then at some point we deployed a star rating system; in version two of this the stars were black, and the back-end data store was a node.js application running ratings code that calculates the average star rating for a given book. And then later we did version three, because A/B user testing said that red stars might be better, so we want to try both black stars and red stars. Both of them, again, pull from the node.js ratings service.

Christopher: And then, finally, we want to get details about the book: how many pages, who the publisher is, et cetera. All that data comes from a Details Ruby script running in another pod.

Christopher: All of these are going to be controlled by these black bars in each pod, which are the Istio proxy, which happens to be Envoy. In almost all Istio deployments today, if not all, the proxy is Envoy, which came out of [inaudible 00:16:42]; it’s a lightweight HTTP, gRPC, HTTPS, HTTP/2 proxy.

Christopher: In all of these cases, these are installed as sidecars in a pod, meaning another container in the pod, within the same pod’s namespace. There are some advantages to this. First of all, it means that all traffic in and out of a given pod, even though it’s going through a proxy, still comes from that pod, not from a shared proxy IP address. So we still have full visibility, as Kubernetes originally intended, into which pods are talking to which pods.

Christopher: Also, since Istio is now handling the mTLS encryption and certificate checking for all of these HTTPS connections, it is doing so within the pod, which means that the cryptographic material, the keys, et cetera, is contained within the pod. You don’t have a shared proxy where all of the instances of an application behind it have to share their cryptographic material with that proxy, such that a compromise of the shared proxy compromises a lot of workloads.

Christopher: In this case, a compromise of a key within or from a proxy is only going to impact the thing that that proxy is directly strapped to, the single instance. So a compromise has a much lower blast radius; it constrains the cryptographic material within the security perimeter of a pod. So we again have this idea that a pod is an entirely self-contained thing in Kubernetes, and it doesn’t need to rely on some shared resource. In that regard, the Istio architecture is pretty well aligned with Kubernetes, which makes sense because it came out of Google and IBM. So this is the application.

Christopher: Now, let’s talk about how we can extend Kubernetes network policy to protect these. This becomes [inaudible 00:18:55] because all of these are on 443. So, from a network policy standpoint, we basically say that things labeled “Product page” can communicate on 443 to anything labeled “Reviews” or “Details.” Similarly, “Reviews” and “Details” will allow traffic in from anything labeled “Product page,” things labeled “Reviews-v2” or “Reviews-v3” can send 443 traffic to “Ratings,” and “Ratings” will allow traffic from things labeled “Reviews-v2” or “Reviews-v3.”
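
One of those coarse-grained rules, expressed as a standard Kubernetes NetworkPolicy, might look roughly like this (the label names are illustrative, following the Bookinfo example):

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: reviews-allow-productpage
  namespace: default
spec:
  # Select the Reviews pods
  podSelector:
    matchLabels:
      app: reviews
  ingress:
    # Allow only the Product page pods, and only on 443
    - from:
        - podSelector:
            matchLabels:
              app: productpage
      ports:
        - protocol: TCP
          port: 443
```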

Christopher: That’s sort of coarse-grained, so let’s try to make this a little more specific. The last thing I want to cover here is that if you turn on service mesh mTLS support in Istio, which is a global setting, then if we’ve attached serviceAccounts to these things, Reviews-v1, Reviews-v2, Reviews-v3, Product page, et cetera, Istio will automatically do mutual TLS authentication, or enable mutual TLS certificates, anyway, and we’ll talk about how we then drive the authentication.
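
In the Istio releases of this era, that global mTLS setting could be expressed as a mesh-wide authentication policy; a sketch, assuming the Istio 1.0-style authentication API:

```yaml
# Mesh-wide policy: require mTLS for service-to-service traffic
apiVersion: authentication.istio.io/v1alpha1
kind: MeshPolicy
metadata:
  name: default
spec:
  peers:
    - mtls: {}
```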

Christopher: So, let’s talk now about a policy, going to the next slide. This looks very similar to the network policy I showed before, and these two things can be mixed, but now we basically say that anything with the app label “details”… the apiVersion, by the way, is incorrect; I grabbed an old version of the slide, I apologize. This should say global network policy in Project Calico v3, the same preface as we saw in the earlier network policy.

Christopher: Anyway, this is basically saying that anything labeled app “details” will allow traffic in. But now, instead of from a pod labeled “productpage,” it’s going to allow traffic in from-

PART 1 OF 3 ENDS [00:21:04]

Christopher: … it’s going to allow traffic in from anything that has a service account attached to it named product page.

Christopher: Since we’re using service accounts and we turned on mTLS in Istio, this means this isn’t just going to allow traffic from things with a service account, which is another type of annotation, named product page. The Istio proxies in the instances of the details app will only allow traffic from things that have a TLS certificate that has been assigned to the service account named product page.

Christopher: So we now enforce mTLS authentication of this traffic. It’s not just the IP address; it’s actually an mTLS certificate that’s being used to grant or deny traffic here.
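
A sketch of that policy with the corrected projectcalico.org/v3 apiVersion (the policy name is illustrative; the app label and service account name follow the Bookinfo example):

```yaml
apiVersion: projectcalico.org/v3
kind: GlobalNetworkPolicy
metadata:
  name: details-allow-productpage
spec:
  # Apply to the Details pods
  selector: app == "details"
  ingress:
    # Allow traffic only from workloads running under the
    # "productpage" service account. With Istio mTLS on, this is
    # verified against the TLS certificate, not just the source IP.
    - action: Allow
      source:
        serviceAccounts:
          names:
            - productpage
```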

Christopher: Similarly, if I have an application, ratings… remember, that was the [inaudible 00:22:16], I think, application. Now I’m going to allow traffic from any service account that has a label on it, ratings reader. So if we go back a couple of slides, you would have a service account on the reviews instances. They might each have their own unique service account names, but you would attach a label to the service accounts you’ve added to reviews v2 and reviews v3, saying that it is a ratings reader, or ratings consumer. So now you can do the same thing we talked about in the first instance, except now we’re going to say that we don’t care what the service account name is; we just care that [inaudible 00:23:18] the service account has a label attached to it, ratings reader. So you can start attaching capabilities to service accounts, and those capabilities can be shared across multiple service accounts. Many things might be [inaudible 00:23:34] clients, but you want to have a single policy that says, “Oh, [inaudible 00:23:37] servers will allow traffic from [inaudible 00:23:39] clients once [they’re 00:23:41].”

Christopher: Here, there are many things that might need to read from the ratings service. So instead of having a bunch of rules that name each service account, reviews v2, reviews v3, and so on, we attach the label ratings reader to the reviews v2 service account and the reviews v3 service account, and you have a single policy to accomplish the same thing.
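
A sketch of that label-based variant (the ratings-reader label key and value are illustrative):

```yaml
apiVersion: projectcalico.org/v3
kind: GlobalNetworkPolicy
metadata:
  name: ratings-allow-readers
spec:
  selector: app == "ratings"
  ingress:
    # Match on a label attached to the service account, so any
    # service account labeled as a ratings reader is allowed,
    # regardless of its name.
    - action: Allow
      source:
        serviceAccounts:
          selector: role == "ratings-reader"
```

Adding another reviews version then only requires labeling its service account; the policy itself doesn’t change.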

Christopher: One last topic I would point out here: service accounts are an RBAC-controlled resource, well, a [inaudible 00:24:17]-controlled resource in Kubernetes. So unless Kubernetes has given you… and within the [inaudible 00:24:25] environment, it’s usually pinned and refers back to a corporate Active Directory system, or LDAP, or something along those lines… unless you have the permissions to manipulate that particular service account, you’re not going to be allowed to do things like attach labels to [inaudible 00:24:42]. So it’s something that is traceable and loggable back in RBAC. And if you looked at some of our previous… I forget how many weeks ago the compliance webinar was, Michael, but we did a whole discussion about how, in [Tigera’s 00:25:00] solution, we log all RBAC-controlled [inaudible 00:25:06] in Kubernetes. So if you’re running Tigera’s commercial solution, you would see that the user Bob tried, and was successful, to attach the ratings reader label to a given service account. And you can see that Alice also attempted to do that, but was denied because she doesn’t have permissions to make modifications to that particular service account.

Michael: Yeah, I think that was about a month ago. And if people want to go back and look at that webinar, it is on our BrightTalk channel. So you can always go back and review that one.

Michael: We have a question, though. You were talking about service accounts, and what they are. Is it just another label, though, in what we’re discussing here?

Christopher: So service accounts in Kubernetes are just an annotation; it’s another type of label. It is unique in that it is RBAC-controlled, i.e., you identify which users or groups of users are allowed to use a given service account, which is different from a standard Kubernetes label. Anyone with permission to affect changes to a pod in a given namespace can attach labels. This is an instance where, if you think about it, the label itself is RBAC-controlled. In Kubernetes, that’s the extent of its use: it’s basically an RBAC-controlled label.

Christopher: One thing to keep in mind: you can only attach one service account to a given pod. This is where attaching labels to service accounts makes them a little more usable, because you can only attach a single service account to a pod, and a pod can only be in a single namespace. [inaudible 00:26:51] then builds on that primitive of a service account, and if you turn on [inaudible 00:26:55] with Istio, then service accounts act as a stand-in identifier for the TLS certificate that will be generated for each pod. So you can refer to that TLS certificate by referencing the service account that’s been assigned to that particular pod. In this case, we’re using that service account, really, as a proxy for mTLS.
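
A sketch of how that attachment looks in practice, assuming a hypothetical reviews-v2 service account carrying the capability label:

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: reviews-v2
  labels:
    # Capability label that policies can select on
    role: ratings-reader
---
apiVersion: v1
kind: Pod
metadata:
  name: reviews-v2
  labels:
    app: reviews
    version: v2
spec:
  # Only one service account can be attached to a pod
  serviceAccountName: reviews-v2
  containers:
    - name: reviews
      image: example/reviews:v2   # illustrative image
```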

Christopher: If you haven’t turned on mTLS, we still wouldn’t allow the traffic, but it wouldn’t be an mTLS [inaudible 00:27:29]; it would just be a straight-up deny. So if you’re concerned about security, et cetera, I strongly recommend you use mTLS. It comes for free with Istio; you might as well use it. So, any other questions at this point?

Michael: No, nothing now.

Christopher: Okay. Mike, is there any way you can hear snoring from our [laughter] [crosstalk 00:27:50]?

Michael: Well, you know something? I’m in marketing and that was pretty clear to me.

Christopher: Okay. So now, pardon me, let’s put these two things together. Remember we talked earlier about pod selectors? Pod selectors are that L3/L4 piece: pods are labeled product page. So now we can say, for app reviews, we’re going to allow traffic if it’s coming from a service account with the label reviews reader and from a pod that is labeled product page. So in this case, unless those two components come together, you’re coming from a pod labeled product page and you have a service account, which is RBAC-controlled, with the label reviews reader, the traffic will not be allowed.
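
A sketch of that combined, two-factor-style rule, with both a pod selector and a service account selector in one rule (the labels are illustrative):

```yaml
apiVersion: projectcalico.org/v3
kind: GlobalNetworkPolicy
metadata:
  name: reviews-two-factor
spec:
  selector: app == "reviews"
  ingress:
    - action: Allow
      source:
        # Must come from a Product page pod; this part is
        # enforced at L3/L4 in the host kernel ...
        selector: app == "productpage"
        # ... AND present a certificate for a service account
        # labeled reviews-reader; this part is enforced by the
        # Istio proxy inside the receiving pod.
        serviceAccounts:
          selector: role == "reviews-reader"
```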

Christopher: There are two interesting things here. One, you now have to have two pieces of data; it’s sort of looking like two-factor authentication. You have to be coming from the right place and you have to have the right credentials. Two, the enforcement of the pod selector component is going to happen at L3/L4, in the underlying host [inaudible 00:29:25] for this pod labeled app reviews, while the enforcement of the service account mTLS authentication is going to happen within app reviews’ pod, in the Istio proxy. [inaudible 00:29:43] doing enforcement in two different places. So even if one of those two enforcement points has been compromised, this policy will still be effective, or mostly effective. And this is when we start talking about zero trust: you have to assume, if you remember me banging on about this a month or two ago, that any component of your infrastructure or your applications very well might be compromised. If it’s every component, well, then the game is up. But you have to assume that any given point may be hosting a compromise.

Christopher: So in this case, we’re going to do enforcement based on two credentials in two different locations. And we’ll show you what the [architecture 00:30:30] of this looks like in a minute. So let’s keep…

Michael: There’s a question that asks, “Is this policy just a Calico policy or [inaudible 00:30:37] policy?” So this is a Calico policy, but there’s a benefit to using a Calico slash Tigera policy, correct?

Christopher: Correct. So this is not an Istio policy. This is something we’ve done here at Tigera, so this is a Calico and Tigera policy. It’s driving behavior in some of the Istio components. There’s a bit of a difference between Istio policy and Tigera policy. The Tigera policy… well, let’s park that aside for a minute while we go over [inaudible 00:31:12] for the architecture, and I’ll show you how this is maybe a little different.

Christopher: So if you go to the next [inaudible 00:31:18]-

Michael: [crosstalk 00:31:18] probably for later, at the end.

Christopher: Okay. That’s fine. So if we go to the third one, we can now [inaudible 00:31:27] point and another piece of behavioral analysis, in this case. We can basically say, again, this is applied to anything labeled app ratings, and we’re only going to allow traffic from pods labeled app product page, and only from things that have a TLS [inaudible 00:31:55] from a service account with a label ratings reader. That’s [inaudible 00:32:01] in two places, like I said. But now we’re going to say that traffic from that app product page, ratings reader selected subset is only going to be able to do an HTTP GET, and only on a URL that starts with a path of reviews.

Christopher: So at this point, you have to be coming from the right pod or set of pods, you have to have the right TLS certificates, and you have to be behaving correctly. You have to just be doing [inaudible 00:32:34]. If you’re trying to do a POST to CGI [inaudible 00:32:38], or something along those lines, you’re going to get bounced. So now we’re not only doing enforcement in multiple places, we’re actually looking at that L5 through 7 behavior, and specifically, here, at the L7 session behavior, to ask: are you going for what you should be going for? So there are all sorts of interesting things here. For example, if you’re making a database [inaudible 00:33:05] in the [inaudible 00:33:06] database, and you’re trying to get customer record information when all you should really be able to get is their shipping address, you can bounce that. The shipping application shouldn’t be able to read the order history of a customer; it should only be able to read, for example, the shipping address for this particular order.

Christopher: So you can now constrain application behavior while still enforcing at a [coarser 00:33:36] level. If this thing’s coming from a [inaudible 00:33:38] instance of your shipping application, it shouldn’t even be able to connect to the production-side databases. So we now have a single policy that enforces network-centric controls all the way up to behavior, and it’s all in one single policy statement. If you remember my comment earlier about having [inaudible 00:34:01] rules not quite aligned with network policy rules, and then trying to troubleshoot that at three o’clock in the morning, you now have one artifact that describes the entire behavior I expect from one microservice talking to another microservice. What should it be able to connect to? What TLS certificates does it need to [inaudible 00:34:22] to be able to make that connection? And even if it is an HTTP, gRPC, etc. session, what actual action is it trying to do, and on what object?
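
A sketch of that full-stack rule, adding an HTTP match on top of the same policy shape (the path and labels are illustrative):

```yaml
apiVersion: projectcalico.org/v3
kind: GlobalNetworkPolicy
metadata:
  name: ratings-allow-get
spec:
  selector: app == "ratings"
  ingress:
    - action: Allow
      # L7: only HTTP GETs, only on paths under /reviews
      http:
        methods: ["GET"]
        paths:
          - prefix: "/reviews"
      source:
        # L3/L4: the right pods ...
        selector: app == "productpage"
        # ... L5: the right certificate
        serviceAccounts:
          selector: role == "ratings-reader"
```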

Christopher: So how does this magic happen? If we go back to the first architecture slide, you’ll remember… it would be the Calico architecture slide if you go back up… we’re slowly scrolling back up. There. That slide. So Felix was writing those L3/L4 policies: pods labeled Bob can talk to pods labeled Alice on 443. Those are being written to the host’s underlying iptables and ipsets infrastructure.

Christopher: Now, if we go further down and look at the architecture, we’ve added just a few additional components. In this model, Felix is still getting that policy from the Calico policy store, that key-value store. But now, that policy has statements that are L3/L4-based and statements that apply [inaudible 00:35:41]-based, i.e., service accounts, mTLS matching, HTTP verbs and targets, etc. So Felix then makes a decision for each of the rules, the components of any given policy: is it going to write it into the underlying host’s kernel, or is it going to write it into the Envoy proxy within [inaudible 00:36:06] that is part of the Istio infrastructure?

Christopher: So in this case, that L3/L4 match, pods labeled product page can talk to pods labeled ratings, would get put into the kernel [inaudible 00:36:28] filter that basically says pods with this label are only allowed to receive traffic from pods with this other label. But “they must have a service account with the ratings reader TLS certificate, and must only be able to do HTTP [GETs 00:36:46] toward a specific target,” those policies will be written into the Envoy proxy itself. So now we write into two different endpoints. And notice that we’re writing into the Envoy proxy locally. We’re not storing this centrally, so Envoy doesn’t have to go up and ask something. In straight-up Istio policy, for example, you eventually have to go back and consult something called the mixer, which is sort of a centralized controller for all the Envoy nodes. You might have thousands and thousands, or hundreds of thousands, of pods, and that’s scalable, but you’re still making a request off-board of the host. In our case, we proactively write the policies directly into the Envoy proxy, so the Envoy proxy can make decisions locally when it sees traffic. And Felix takes care of updating those policies, both in the network filters as well as in Envoy, [inaudible 00:37:46], based on changes coming from Kubernetes, and indirectly Istio, that we receive through the Calico policy environment.

Christopher: So in that case, when the workload sends traffic, or when traffic arrives at that pod for delivery to the workload, it first has to get through the kernel network filters to make sure the traffic is even allowed. If it is, in the case of a pod receiving traffic, the traffic gets delivered to the Envoy proxy first, which then makes a decision on all of the other characteristics: the service account mTLS authentication and the HTTP [inaudible 00:38:27], in this case.

Christopher: The mTLS authentication is driven by certificates which are granted by the Istio CA. The Istio Certificate Authority grants every pod running Istio a certificate, and that’s where the service accounts come in. When we tell Envoy that we’re interested in authenticating using service accounts to gate a transaction, Felix will tell Envoy to make sure that mTLS is enabled for connections between these two types of workloads. And then the standard checking against the Istio CA [inaudible 00:39:08] is done by Envoy. So it’s driving that behavior.

Christopher: Let’s go into a little more of how this works. What first happens is that when a pod that’s using Istio comes up, the Istio proxy, Envoy, gets installed as a sidecar. The developer doesn’t even have to do that; it can just be forced in as part of the deployment process. The first thing that happens is that Istio will assign that individual instance of a pod a client certificate. And similarly, a root certificate that validates the Istio CA is also installed in that pod.

Christopher: What happens, then, if you’re running Calico or Tigera’s solution for L5 through 7, is that we install another sidecar that talks to Envoy, plugging into a filter mechanism that we developed along with the Envoy team. That filter mechanism listens to Felix on the local host. So in this case, if policies come in or inventory changes, say we now have more ratings reader endpoints, et cetera, Felix will update that filter that’s now resident in Envoy. That’s transaction three.

Christopher: Now what happens when traffic comes in… the arrow’s going the wrong way on this slide, I just realized. So now, for example, if this workload needs to send traffic out, the traffic will first go through the Envoy proxy. The Envoy proxy will do the TLS certificate check, and will check against the filter to make sure that the activity is correct. If it is, it will then send the traffic out of the pod, at which point it has to go through the kernel network filters that we’ve installed, which ask: is this pod even allowed to open communications on this port to the destination pod? This is reversed for traffic coming back in. If it all matches, the traffic goes off the host, goes across the network, and then goes through this in reverse on the receiving side.

Christopher: So this is a little bit of a flow on how this works and how we’re doing this enforcement. So if the pod gets compromised, for example, and somebody takes out the Envoy Proxy on the pod, first of all, you’re no longer going to necessarily have the right certificate…

PART 2 OF 3 ENDS [00:42:04]

Christopher: … first of all, you’re no longer going to necessarily have the right certificate. You may or may not, depending on how you did that. So, [inaudible 00:42:09] certification on the other side may not work.

Christopher: Even if you manage to get past that, you would still get blocked, potentially, by the network filters on both sides. If the network filters on the underlying host get compromised, you still have mTLS authentication and the HTTP behavior checks within the Envoy proxy.

Christopher: So now you have to pop the certificate authority, the Envoy proxy on both pods, source and dest, and the underlying host kernel on both hosts supporting those two pods, source and dest, in order to get full, unvarnished access between these two pods.

Christopher: So, if you think about it, depending on how you count, four, five, or six different points have to be compromised in order to get this all to work. That becomes a very hard hill to climb for anyone other than the most determined aggressor, or the world’s worst set of mistakes as far as configuration and configuration management.

Christopher: Alright, well, now we’ve opened this up for some questions, so let’s get to them.

Christopher: The question is: is there a practical tested limit to the number of policies and pods possible? Does the number of pods impact performance, or the responsiveness of policy updates? And [inaudible 00:43:40] to share about large production deployments.

Christopher: So, the first thing to keep in mind here is that the Calico policy store holds all the possible policies in a Kubernetes cluster, and stores all the inventory, i.e., what pods exist, with what labels, what service accounts exist, etc.

Christopher: That’s on the same order of magnitude as what the rest of Kubernetes stores; actually, it’s a smaller subset than Kubernetes itself. Compared to the load on the API server, this is an incremental load. It’s not substantial.

Christopher: Felix, on [inaudible 00:44:21]… there is no central controller in Calico’s [inaudible 00:44:24] solution. Each host acts as its own local controller. So each Felix is responsible for tracking whether it has a new pod, et cetera, looking back at the Calico policy store, and asking: is there a new policy that I have to install? So, let’s say you have 10,000 policies in your infrastructure. This isn’t like a legacy firewall, where those 10,000 policies are rendered everywhere.

Christopher: Felix selects the subset of those policies, at layer 3 and layer 4, that are applicable to the pods it hosts. In reality, we see maybe tens to hundreds, and those get installed in the Linux kernel via the [inaudible 00:45:14] access mechanism. So there’s literally no performance impact based on the number of policies, or terms in a policy, at layer 3 and layer 4.

Christopher: Similarly, Felix will only install policies in a given Envoy filter that are relevant for that particular pod. So in this case, you might only have three or four policies that affect a given pod. If you’re writing your policies correctly for your microservices, you’re not going to have a hundred different policies; you’re going to say this thing provides the rating service, so it will receive traffic from things that are appropriately [inaudible 00:45:53] in the service account.

Christopher: The policies within a given pod are going to number maybe four or five, so the Envoy filter only has to search through a very small subset of policies to decide whether to allow the traffic. This is basically applying the same principles as loosely coupled, highly distributed environments; it’s the only way to make things work at scale.

Christopher: We test this infrastructure from three or four nodes out to [inaudible 00:46:26]. Calico is the underlying network policy environment for all of the major Kubernetes service offerings in public cloud. For some of those, we’ve tested up to 3,000 nodes, which was their max scaling at that point, and hundreds of thousands of containers, with lots of policies per container; there is no performance impact. Because this is all very distributed, and we’re borrowing the same concepts that Kubernetes uses to distribute its control plane, we expect to see similar scaling characteristics to Kubernetes in this regard. And other than us attempting to really seriously break things for our own amusement, we have not seen any issues with scale in practice.

Michael: There’s a question around mTLS: what are the recommended methods for pinning a cert for polling or monitoring of an endpoint, say in dev, from outside a cluster?

Christopher: From outside a cluster, there are going to be a couple of different ways of doing that. Right now, Istio does use the Istio CA; there’s been talk, and there’s work being done, on being able to import other CAs into this mechanism. What Istio actually uses is SPIFFE, or ACME, to request a certificate from the Istio CA when a new pod comes up. Then, usually what happens in an organization is that you delegate an intermediate CA cert from your corporate CA down to Istio, and Istio issues certs from underneath that intermediate CA.

Christopher: So you could start using certificate chains to say: I’m going to accept anything from the corporate CA. Or, if you don’t have a corporate CA, CA certs are reasonably priced… well, not reasonably, they’re only mildly rapacious, from the various certificate vendors out there. You could also use something like Let’s Encrypt, potentially, to generate a CA that you could then use to issue an intermediate CA to Istio.

Christopher: Similarly, you could just do a self-signed cert. That’s usually what you see, for example, in dev, or in enclosed environments. There are plenty of examples out on the web of how to create a self-signed certificate to act as a root CA, and then generate intermediate CAs off of that, one of which you would probably grant to Istio.

Michael: It looks like the next question is: can the Tigera policy and the mixer policy and/or telemetry be used simultaneously?

Christopher: Sure. On the mixer policy, I’m not quite sure which one takes precedence between the filter that’s installed in Envoy and the mixer policy; we’ll have to go back and get an answer for that. As far as getting the telemetry through mixer, et cetera, yes. We don’t interrupt any of the Envoy or Istio flows; we just add this local policy sidecar to the local Envoy.

Christopher: Everything will continue to work. Again, I’m just not sure whether the mixer-driven policies or the Tigera-derived local policies coming from Felix would take precedence. We’ll have to get an answer for that, and you can remind me.

Michael: Alright, this one [inaudible 00:50:37] we’ve already answered, but: how can these policies be used to secure communications with non-HTTP/gRPC services, like Postgres? I’m assuming that Postgres is the legacy app. Did we cover that?

Christopher: The way we would be doing [inaudible 00:50:55], whether that’s in the K8s cluster or not, is you would revert back to the L3/L4 policies, right? So we’d use L3/L4 policies to say: things labeled DB server [inaudible 00:51:09] will allow traffic in from DB server clients, and the reverse policy: DB server clients can send traffic to things that are labeled DB server, on the Postgres port.

Christopher: And then, using Tigera’s host endpoint mechanism, which again we talked about two weeks ago, and you can go back and look at the hybrid use case webinar, we can attach labels to, and even install Calico on, non-Kubernetes endpoints. So if you have a legacy database server in a VM, or on a bare metal host, or an Oracle RAC, whatever it might be, you can install Calico on that, attach labels to that non-Kubernetes workload, and it becomes part of the Calico environment from a policy standpoint. It doesn’t matter whether it’s a Kubernetes pod, a bare metal host, or a VM; the policies will still fire at layer 3/layer 4.
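
A sketch of such a host endpoint for a legacy database host (the node name, interface, and label are hypothetical):

```yaml
apiVersion: projectcalico.org/v3
kind: HostEndpoint
metadata:
  # A hypothetical bare metal or VM database host
  name: legacy-db-01-eth0
  labels:
    role: db-server
spec:
  node: legacy-db-01
  interfaceName: eth0
```

Once labeled, the same L3/L4 policies that allow DB server clients on the Postgres port select this host just as they would a pod.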

Christopher: Now, Envoy does support protocols other than HTTP and gRPC, and there are a couple of people adding other session mechanisms to Envoy. Right now our plugin handles HTTP, and gRPC is next, but if there’s substantial demand from the customer base for filters for other things, like Postgres, et cetera, [inaudible 00:52:47] we’d love to hear that as we do product development going forward, for other protocols that people would be interested in seeing.

Michael: That kind of answers the big question people have around why not just use Istio alone for security; what else do I need? And the answer really is that a solution like Tigera’s Calico not only gives you a single place to manage both layers 3/4 and up through the application layer, but also gives you the ability to go into non-K8s environments, bare metal and VM, and apply policy across all of those.

Christopher: Correct. In fact, we can also federate this across multiple clusters, which is not something you’re going to get in Istio today. Again, looking at the webinar from two weeks ago, where we started talking about federation a little bit, the policy federation will federate L5-7 as well, across infrastructure. With some caveats right now, but we’re working on those. It’s too deep in detail to go into here.

Christopher: The other thing to keep in mind, as I discussed earlier, is that we also do enforcement outside of the pod. All Istio enforcement happens within the pod, so if the pod becomes compromised, a number of Istio filters can be removed. We have this [inaudible 00:54:23] approach where, even if the pod becomes thoroughly compromised, the network policy, right, things labeled Bob can only talk to things labeled Alice on 443, would still be in place in the underlying host. So it would prevent the pod from talking to things that aren’t labeled Alice, even though the Istio policy you were using to enforce that has been removed.

Michael: What would happen if Felix crashes, or otherwise goes down?

Christopher: Felix never crashes; we don’t write any buggy software here. You should know that answer! In reality, Felix is not [inaudible 00:55:03] here. Felix writes updates as they come from Kubernetes, or other inputs, into the Calico policy store. So when new pods come up, or policies get changed, et cetera, that’s when Felix fires and updates the filters in Envoy or in the kernel. It is not consulted in real time, on a packet or session-flow basis. It writes the rules into Envoy and into the kernel, and those then act on the rules. So if Felix goes away on the host, the policies that existed at the point Felix went down will continue to operate. Traffic will still flow: existing flows will still flow, new flows that meet policy will still be allowed, and flows that don’t meet policy won’t be.

Christopher: If policy changes in the interim, those policies will not be updated on the host while Felix is down. However, Felix is very lightweight; Felix doesn’t have, frankly, a lot of state. So if Felix crashes, Kubernetes will restore it, usually within a second or two, because we deploy it as a deployment, as a [inaudible 00:56:27]. Kubernetes will note when Felix goes down and restart it, and the first thing Felix does when it comes up is look at the current state of its filters on the host and its pods, compare that to what it should be doing based on the Calico policy store, and then make changes.

Christopher: Traffic won’t be interrupted, and as soon as Felix comes back, it’s not like we’re going to drop all traffic and relearn everything. We’re just going to apply whatever policy changes happened while Felix was getting rebooted.

Michael: This is more of an Istio question, it looks like: what will it take to get [inaudible 00:57:10] headless services to work on an Istio mesh?

Christopher: That one, stateless, headless services, is probably going to take more time to answer than we have here. I don’t have a good answer in the time remaining, but maybe this speaks to the idea that we should have a more general Istio overview at some point, Michael.

Michael: Alright, let’s keep moving on.

Christopher: I think this anticipates the benefits; we already sort of covered all of the [inaudible 00:57:57] benefits. Labels are significantly more flexible than service account names. Labels can be attached to multiple service accounts, they can refer to many different things, and they can be scoped across namespaces. The other thing that we’ve discussed in Calico, in Tigera, and especially in the Tigera commercial product, is multi-tier hierarchical policy. The security team can have, say, compliance policies, like PCI policies, or geographical policies, and then the development teams have very fine-grained policies that reflect the application graph.

Christopher: All of those tiers, and we discussed this earlier as well, in the previous webinar, also work for this L5-7 policy as well as L3/L4. So you can have PCI policies that affect layer 5-7, as well as layer 3/layer 4. And I think everything else we’ve already discussed.

Michael: Actually, we’ve come to the top of the hour; we took the whole hour. There are some questions we didn’t get to; those have been noted, and we’ll get back to you with the answers. One last note for everybody: we have some upcoming events that I think are relevant. The next webinar, coming up in about two weeks, is about leveraging Kubernetes services and DNS.

Michael: And also, if you’re going to KubeCon in Seattle next week, please stop by the Tigera booth; you can see where all the information is. We are a platinum sponsor for KubeCon, and you can see work with [inaudible 00:59:34], come by and say hey, get a demonstration, and talk more in depth about any of the questions you might have had.

Christopher: And anyone here who’s going to KubeCon, please find me at the booth. I’d love to get feedback about the webinar series and what I could be doing better that would make things more useful.

Michael: Okay, well, thank you everyone for attending; that’s the end of our webinar. Like I said before, it will be made available right after we close on BrightTalk, and we will be pushing out copies of the PowerPoint slides if you prefer them. We thank you for attending, and we’ll see you next time!