Enforcing Compliance in Dynamic Kubernetes Environments

While the container/Kubernetes revolution is starting to deliver on its promise of making application development and delivery more agile and responsive, it does so by changing some of the traditional characteristics and behaviors of the development and delivery model. Control and compliance regimes have assumed those characteristics would remain constant, and that assumption is no longer entirely correct. Attend this webinar to learn what’s changed, how those changes weaken your compliance and control environment, and what you can do to not only adjust to the new reality but make your security team a key enabler of the new agile model.

Michael: Hello everyone, and welcome to today’s webinar, Enforcing Compliance in Dynamic Kubernetes Environments. I am pleased to introduce our speaker, Christopher Liljenstolpe. He is the original architect behind Tigera’s Project Calico, the open source version of our security software. He speaks at over 60 meetups yearly, educating on networking and network security for modern applications, and he also consults with Tigera’s enterprise clients on compliance for their modern applications.

Christopher: Thanks, Michael. Today I thought we’d talk a little bit about a very riveting and exciting topic, which is compliance, but more specifically, how does compliance change? How do you ensure that you’re compliant in an environment like Kubernetes, a very dynamic environment? As you might or might not expect, there are some things you need to take into consideration that you probably didn’t have to in previous compliance routines.

Christopher: If we go to the next slide: our graphic artist didn’t give me the Halloween look I was hoping for, with spiders and cobwebs over the servers, but imagine this is your heritage environment (we don’t use the term legacy). Since we started doing compliance in IT, we’ve basically been working with fairly static environments, right? We started off with racks of servers; each server ran an application, was wired into switches, and had static firewalls between it and its neighbors, and frankly the environment didn’t change very frequently.

Christopher: Even when we went to VMs, which were supposed to be more dynamic, VMs and servers were still things that existed in calendar time. Things changed on a scale measured in quarters and years, so you could do a compliance report. You could evaluate your environment, see what was plugged into what, see which firewall rules existed, and do a thorough audit. You’d spend some number of days or weeks collecting the data, and since things didn’t change all that frequently, you could be reasonably assured that the compliance report that came out the back end of that process indicated you were compliant, and that you would remain compliant until the next compliance report, which roughly mimicked IT drop dates. You might have six-month change-management cycles or something along those lines, so if you did a compliance check once during that window, you could be assured it was mostly accurate across that entire time. Again, measured on the calendar. That’s the way we’ve built compliance: auditing and reports, with the assumption that their longevity is sufficient to say you’re compliant over the course of your cycle.

Christopher: Now we go to the next evolution: from VMs we’re moving into a new model called cloud native, or microservice, or elastic cloud-based application deployment. There are a number of reasons we’ve done this. We’ve talked about them before on this webinar series, but to recap for folks who haven’t been listening: every business today, for the most part, is a software-based business. We use software to deliver our services, provide services for our customers, interact with our partners, et cetera. Software drives most of modern business. People expect software to be easily deployed and rapidly changed; we’ve gotten used to having fixes and new features added at a fairly high clip, so we’ve had to go back and revisit how we do application deployment and delivery.

Christopher: We’ve talked more about this in other webinars, but basically what we’ve done is decompose these applications into microservices. Instead of having one big monolith that handled, say, payroll or customer orders, we now have multiple microservices that together make up those capabilities: customer record lookup, inventory lookup, shipping, et cetera. All those different microservices get composed to deliver different applications.

Christopher: To package those microservices up, today we primarily stick them in containers. Each microservice is a container, or a set of containers for resilience or scale. To orchestrate them, we use dynamic orchestration tools like Kubernetes: you tell Kubernetes, or another dynamic orchestrator, that these are the microservices that compose a given application and this is how it should be deployed and managed, and the orchestration system takes care of the grunt work. It will deploy the application on any infrastructure you want, be it on-prem, a hosted PaaS environment, or a hosted containers-as-a-service environment, and that changes over time. As we all know, corporate life means those environments are going to change. Today you’re doing it in house; next year you acquire someone who is Amazon-based and now you’ve got workloads on Amazon, or you get a new CIO who decides to be cloud-centric and moves things to Azure. These things are going to move around.
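
For readers who haven’t seen it, here is a minimal sketch of what “telling Kubernetes how a microservice should be deployed” looks like in practice. The names, image, and replica count are hypothetical:

```yaml
# A minimal, hypothetical Kubernetes Deployment: you declare the desired
# state (image, replica count) and the orchestrator does the grunt work.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: customer-lookup
spec:
  replicas: 3                       # desired scale; Kubernetes maintains it
  selector:
    matchLabels:
      app: customer-lookup
  template:
    metadata:
      labels:
        app: customer-lookup        # labels like this drive policy later on
    spec:
      containers:
        - name: customer-lookup
          image: registry.example.com/customer-lookup:1.4.2   # hypothetical image
          ports:
            - containerPort: 8080
```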

Christopher: Kubernetes and containers make this somewhat easier, but it does mean things are dynamic. The overall view is that we have a lot more things, i.e., microservices, than we had monolithic applications; they’re much more dynamic, and where they reside is also dynamic. What does that mean in reality? Going to the next slide: it means that containers, being lightweight, have much faster start times and much shorter lifetimes. Whereas before we measured VMs and servers in calendar time, containers live on wall-clock time. A container might exist, literally, for only a couple of seconds, or minutes, or days before someone pushes a new version, or it scales up or down to meet demand. Say your company is an e-commerce company: you put out a special offer, you attract a lot more orders, everything autoscales up, and when that’s done it autoscales back down. So containers have much shorter lifetimes and start much faster; you’re looking at roughly 900 times faster startup on average for a container versus a VM, and you have much higher densities of workloads.

Christopher: So before you had monolithic VMs; now you’ve got microservices. What this means in reality is that where before we used to see eight, ten, twenty VMs on an average server, you now see people talking about 80, 100, 200, and I’ve even seen people talk about as many as 1,000 containers on a server. We can reasonably assume you’re going to have at least an order of magnitude more workloads, or endpoints, than you had before, maybe two orders of magnitude more.

Christopher: When you do the math, that comes with a couple of secondary effects. One, you have much greater churn in the network: your endpoints are changing much more frequently, which has all sorts of knock-on effects on networking and isolation, but it also raises a compliance question. If my environment is changing this frequently, how am I going to judge whether I’m compliant or not? It also means we have a much greater attack surface. We have a lot more workloads, each of potentially variable provenance, so we have to assume we’ve got a greater area in which to make compliance mistakes, intentional or unintentional.

Christopher: So we have this very dynamic, variable environment, but we still have a compliance model where we go through and audit every endpoint a couple of times a year and make sure it’s wired into the right static firewall rules and running on the right hosts, and that’s an interesting concept when we move to the new model. Think about what this means in reality: let’s say we ran our compliance report today, on the 30th of October. We’ve finished our audit, it’s all done, we’re all compliant. That’s great. Ten minutes after we run that audit, a developer in your company pushes a new version of the code, and that new version might have an accident with regard to, say, PCI data handling. So you were compliant, but, as I say here, Houston, we’ve got a problem with this model. I know I was compliant for the 10 minutes between the time I completed my audit and the time the next code version got pushed. After that, maybe I still am, maybe I’m not. It’s a little bit like a broken watch: I know I’m compliant twice a year, for the periods between when I do the audit and when the next version of code gets pushed. After that it’s a guess.

Christopher: So, how do we do this? If you’re managing compliance, or providing the tooling to manage compliance for your company, how can you go back to the board audit committee and say that you are compliant in this kind of environment? Are you going to do continual audits?

Christopher: Having auditors sit there constantly checking everything, checking where the containers are and the firewall rules and all that, is not really feasible. So you really do have a problem answering the question: are we compliant? We have to think about compliance differently. Let’s think about what you need to do to achieve compliance in this kind of environment. One, compliance needs to be enforced by policies whose enforcement is directly tied to the functions or personalities of the code. You can no longer have compliance enforced by firewall rules based on the IP addresses of workloads, because those IP addresses are changing. Even if you somehow figure out how to beat Kubernetes into submission and assign a given microservice the same IP address each time, what do you do when you need to scale it from five to 100?

Christopher: All the instances beyond your normal base load are going to have different addresses. So having compliance tied to something that doesn’t actually relate to what a thing is and what it does is going to be a problem in this kind of dynamic environment. We need to tie our compliance enforcement not to the endpoint itself but to its behaviors, its personalities, or its functions, and we’ll talk a little bit about that. The second thing is that you now need a sort of continual audit that shows what traffic was allowed in your infrastructure, what traffic was denied, and why it was allowed or denied, i.e., what the policy paths were. We’ll talk a little bit about that as well. If you put these two together, you can achieve something we’re calling continuous compliance.

Christopher: You can basically take a look at what traffic flows actually happened in your network and what allowed or denied them, and tie that back to enforcement that’s attached to the actual functions, personalities, or behaviors of the code itself. Then you can stand up and say, yes, we were compliant, because at 9:27 on the 30th of October all the traffic that handled PCI data was going through the PCI policies, we know that, and here’s how we can prove it. So let’s talk a little bit about how we do this magic. First, let’s look at a network policy, in this case in Kubernetes, though it doesn’t have to be. This is a very simple policy: we have things that are PCI compliant and things that are not. In this case we have three workloads; two are PCI, one is non-PCI.

Christopher: So we can create a network policy, and I’ll walk you through each part of it. It’s a global network policy, so it gets applied, or is available, everywhere, and its name is pci-isolation. A policy can also be put in a Kubernetes namespace, which potentially scopes it down further. Next slide. The big thing here is that we say this policy is applied to anything labeled PCI equals true. So what you end up doing is identifying that certain workloads are indeed PCI compliant and other workloads aren’t. How you do that is part of your development cycle; it could be done via code reviews or other processes, but once you’ve done it, you can say yes, this workload is PCI compliant. You can have other labels on the same workload saying that it’s located in Germany, or that it’s in stage or prod, or whatever.

Christopher: But right now we’re not focusing on whether the thing is a customer-facing application or a non-customer-facing application; what we’re interested in is that it has the characteristic of being PCI compliant. Anything that is PCI compliant will have this policy applied to it. So what does the policy say? It’s actually pretty simple: things labeled PCI compliant will only allow traffic from other things that are PCI compliant, i.e., PCI is true, and they will send traffic only to other things that are PCI compliant. There’s nothing here that says anything about traffic between PCI and non-PCI endpoints, so if this were the only policy applied in your network, the only communication paths allowed would be between things that are PCI compliant, and no traffic would be allowed between non-PCI and PCI workloads.
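
As a concrete sketch, the policy being described would look something like this in Calico’s v3 API. This is a reconstruction from the description on the slide, not the exact manifest shown (note that a Calico GlobalNetworkPolicy is cluster-scoped; the namespaced variant is a Calico NetworkPolicy):

```yaml
# A sketch of the pci-isolation policy described above: workloads labeled
# PCI == "true" may only exchange traffic with other PCI == "true" workloads.
apiVersion: projectcalico.org/v3
kind: GlobalNetworkPolicy
metadata:
  name: pci-isolation
spec:
  selector: PCI == "true"          # the policy attaches to every PCI workload
  types:
    - Ingress
    - Egress
  ingress:
    - action: Allow
      source:
        selector: PCI == "true"    # only accept traffic from PCI workloads
  egress:
    - action: Allow
      destination:
        selector: PCI == "true"    # only send traffic to PCI workloads
```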

Christopher: Now, in reality you do need traffic between non-PCI and PCI workloads, and you can handle all of that in policy as well, but this gives you an example. It doesn’t matter what this thing is: a front-end application, a database, or some settlement microservice. All we care about is that it has the characteristic of being PCI compliant or not. So you can say that if this policy is deployed and we have appropriately labeled our workloads, all PCI workloads will only be able to talk to other PCI workloads, unless other policies carve out exceptions. That is a way of attaching a policy not to an IP address or a specific server, but to a capability, behavior, personality, or characteristic of the workload.

Christopher: So we’ve now decoupled policy from the specific atomic workload; policies are attached to the characteristics and metadata of that workload. This is a really powerful concept in terms of compliance once you get your mind wrapped around it. You could label something as, for example, PII-contaminated and persons-Europe, and then you’d have the beginnings of a GDPR policy that says things handling PII-contaminated data of persons or entities from Europe can only be accessed by things that have been deemed GDPR compliant. So there are the beginnings, for example, of a PII and GDPR policy, and SOX and the other regimes can all be addressed similarly. When we talk about policy paths, these are the kinds of policy paths we’re talking about.
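
Sketched the same way as the PCI example above, the beginnings of such a GDPR policy might look like this; the label names (PII, persons-europe, GDPR-compliant) are illustrative, not a standard taxonomy:

```yaml
# A hypothetical sketch of the "beginnings of a GDPR policy" described above.
apiVersion: projectcalico.org/v3
kind: GlobalNetworkPolicy
metadata:
  name: gdpr-pii-isolation
spec:
  # Attaches to workloads handling PII of European persons
  selector: PII == "true" && persons-europe == "true"
  types:
    - Ingress
  ingress:
    # Only workloads deemed GDPR compliant may reach them
    - action: Allow
      source:
        selector: GDPR-compliant == "true"
```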

Christopher: So now I have a policy called pci-isolation. What I next want to do is see whether the appropriate traffic is indeed flowing through the PCI isolation path. I can look at my flow logs in a system like Tigera’s enterprise edition of Calico, and I can see exactly which workloads were communicating, whether traffic was allowed or denied, and which policy made the decision to allow or deny it, as well as other metadata. I don’t think we’ve got all of that turned on in this screenshot of the flow logs, but I basically have full details: what is this thing, what characteristics was it expressing, was the traffic allowed or denied, and what policy or policy path made that determination. So I can now say, for example, that at 9:27 on 30 October 2018, all the traffic from workloads carrying the PCI compliance label was allowed only where the other end also had the PCI compliance label attached.

Christopher: The traffic may have been allowed via the PCI compliance policy or some other policy, but the PCI compliance policy was involved. Or, if traffic was between a PCI and a non-PCI workload, I can see that the PCI compliance policy stopped it. I can also see policy failures: if something was misidentified as PCI compliant and was allowed when it shouldn’t have been, I can see that I had a policy violation, go fix the problem, log it as a violation, et cetera. So with logs that show the metadata and characteristics of the workloads and which policies affected them, you can assess at any given point in time whether you were in compliance with PCI, GDPR, et cetera.

Christopher: Now, suppose you find you had a policy violation; you might want to figure out how that happened. So the other part of the picture is who has been making changes. Did somebody label a workload as PCI compliant that previously hadn’t been? Maybe that change was made by the compliance officer, who finally decided the workload was PCI compliant and elevated its status so it can now communicate, or maybe it was somebody else who somehow got the right permissions and made that change. You can capture that in the audit logs, and of course you also capture the unsuccessful attempts people make to change things. This is all governed by RBAC credentials and the ability to log each and every activity.
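
As one illustration of capturing this, the Kubernetes API server can be given an audit policy that records who created or relabeled workloads. This is a minimal, hypothetical sketch, not the configuration used in the demo:

```yaml
# A minimal Kubernetes API-server audit policy sketch: record full request
# and response bodies for pod changes, so label changes (and who made them,
# successfully or not) show up in the audit log.
apiVersion: audit.k8s.io/v1
kind: Policy
rules:
  - level: RequestResponse
    verbs: ["create", "update", "patch", "delete"]
    resources:
      - group: ""            # core API group
        resources: ["pods"]
  # Log only metadata for everything else, to keep log volume manageable.
  - level: Metadata
```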

Christopher: So when I combine these three components: the audit logs that show me who’s been making changes, what changes were made, and whether they were allowed or denied; the flow logs that show me where traffic was actually allowed or denied and what policies did that; and the set of policies that enforce allows, denies, and other behaviors based on the characteristics of the workloads themselves, we end up in a situation where we can create almost a compliance dashboard. At any point in time, even in real time, you can go to the executives and say, “Hey, we’re PCI compliant, or we’re GDPR compliant, and here’s the dashboard showing you that.”

Christopher: So at any point in time, if somebody asks, “Are you compliant, or were you compliant at this point?”, you can go back to the logs and the policies installed at that time and answer with reasonable surety: yes, we were, or no, we weren’t, and this is why. That raises an interesting question about compliance overall. Is compliance now a quarterly or semiannual audit, or is it something you do continually, an evaluation performed and automated as part of daily operation?

Christopher: The actual audit might then happen a couple of times a year just to make sure the process is still working. Instead of auditing the infrastructure and the applications themselves, you’re auditing to make sure the infrastructure is indeed continuing to provide this data, and maybe producing a report that says we were out of compliance for 10 minutes in the last six months because we had a mislabeled workload or something along those lines. So this changes not only how you do compliance but how you think about compliance. Is it an audit checklist, or is it something you do continually in your organization? We believe it’s the latter, because that’s the only way you’re going to deal with this dynamic environment in which your application infrastructure changes on a minute-by-minute basis. With that, I’m going to open it up for questions and then turn it back to Michael, but hopefully this was a little glimpse into how you might want to think about compliance going forward in Kubernetes or other orchestrated environments.

Michael: One set of questions is around the complexity of having to comply with multiple compliance regimes. Can you talk a little bit about that? It seems like a complex problem.

Christopher: Sure. This is one area where we need to think about policy a little differently than we have before. If we think about policy as firewall rules, asking what rules we need to apply to this thing to enforce all our compliance regimes, we make each enforcement rule on each endpoint the composite of all the compliance and other regimes that apply to that thing, and that becomes a very complex, hairy mess. Instead, to make this simpler, you write fairly simple policies that reflect each compliance regime you need to meet. You write the policies you need for things that are PCI contaminated, things that have payment data, card data, et cetera, and how they should be accessible by things that are PCI compliant and things that aren’t.

Christopher: Same thing for GDPR: you again write simple policies that embody the spirit of the regulation, and those are all attracted, as you saw here, by selectors, by labels. Then it’s a simple matter of identifying the characteristics of each workload. This thing is PII contaminated, it is PCI contaminated, and it’s handling the data of European persons. Not to say that European persons contaminate things, but the workload is contaminated with European persons’ PII data.

Christopher: That’s three labels: European contaminated, PII contaminated, PCI contaminated. They are going to attract both your GDPR policies and your PCI policies, and similarly SOX, et cetera. So instead of trying to compose, for every workload, exactly which policies should apply, you write policies around the personalities you might have in your environment, and then you attach personalities to your workloads. That does mean your workloads may have multiple-personality disorder, but they have separate personalities, or characteristics, and the policies for those compliance regimes are attracted depending on whether the workload expresses that trait or personality. This lets you separate the regimes and treat each independently, and the system composes the correct enforcement on the fly from those simple per-regime sets of policies.
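
To make that concrete, here is a hypothetical pod spec carrying all three personalities at once; the label names are illustrative and would come from your own taxonomy:

```yaml
# A hypothetical workload with three compliance "personalities". Each label
# independently attracts the simple per-regime policies written for it.
apiVersion: v1
kind: Pod
metadata:
  name: settlement-service
  labels:
    PCI: "true"               # attracts the PCI isolation policies
    PII: "true"               # attracts the PII/GDPR policies
    persons-europe: "true"    # attracts the European data-handling policies
spec:
  containers:
    - name: settlement
      image: registry.example.com/settlement:2.1.0   # hypothetical image
```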

Michael: We have a question around how you ensure that containers get labeled correctly and consistently, so the right policies are applied. I would first say that we did a webinar on this exact topic, on taxonomy best practices, this past summer. If you go to Tigera.io and look at our webinars section, there is a full 30-minute webinar on this topic.

Christopher: I can do a quick synopsis. First of all, you need to decide what the labeling scheme, or taxonomy, and the policy scheme are going to be in your organization. It’s better to do that up front than to do it on the fly and realize in six months that you made a wrong turn. For a good chunk of our commercial customers, we actually run security workshops, and in fact I’m going off to do one next week, where we work with them to think through what their labeling taxonomy and policy scheme should be.

Christopher: A key point, however, is that the labels we’re talking about drive many behaviors in an orchestration system beyond network policy. They drive what volumes you can mount, what services you can expose, how you get load balanced; a huge number of things are driven by labels. That’s actually a benefit of doing the labeling correctly, because we all know that if you do security separately from everything else, developers will never give it due attention until the very end, and it will probably get done slapdash and wrong. Instead of making developers think about security labels versus all the other labels, you just say: here is the set of labels you have to apply, and that will drive everything for your application. Your application is not going to work unless it’s appropriately labeled in many dimensions: storage, exposure, et cetera.
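
As a small example of labels driving a non-security behavior, a Kubernetes Service selects and load-balances its backends purely by label; the names here are hypothetical:

```yaml
# The same label metadata that attracts policy also drives service exposure:
# this Service load-balances across every pod labeled app: customer-lookup.
apiVersion: v1
kind: Service
metadata:
  name: customer-lookup
spec:
  selector:
    app: customer-lookup
  ports:
    - port: 80          # port exposed by the service
      targetPort: 8080  # port the pods actually listen on
```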

Christopher: Once you’ve settled the labels, you need to make sure the right people are putting the right labels on things. Because labels affect so many things in Kubernetes, we’re basically seeing two approaches that customers take to control labeling across all of Kubernetes, not just in relation to network policy. One is enforcing it as part of their CI/CD pipeline: deciding who is allowed to assert which labels and automating that check in the pipeline. The other is to deploy something called an admission controller in Kubernetes: writing an admission controller that looks at who is submitting, for example, a pod or service, and whether they are allowed to assert the labels asserted on it.
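
For the second approach, the hook into Kubernetes is a validating admission webhook. A minimal, hypothetical registration might look like this; the service name, namespace, and path all refer to a label-validating webhook you would write and deploy yourself:

```yaml
# A hypothetical registration for a label-validating admission controller.
# Every pod create/update is sent to the label-validator service, which can
# reject requests whose submitter may not assert the labels they carry.
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingWebhookConfiguration
metadata:
  name: label-policy-check
webhooks:
  - name: labels.example.com          # hypothetical webhook name
    admissionReviewVersions: ["v1"]
    sideEffects: None
    rules:
      - apiGroups: [""]
        apiVersions: ["v1"]
        operations: ["CREATE", "UPDATE"]
        resources: ["pods"]
    clientConfig:
      service:
        namespace: compliance         # hypothetical namespace
        name: label-validator         # hypothetical validating service
        path: /validate
```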

Christopher: So it’s either an admission controller or the CI/CD pipeline; there are other mechanisms as well, but those are the two we see people use most commonly. Kubernetes also provides some level of control over labeling via RBAC: you can use namespaces and service accounts, with RBAC tied to them, to say that certain groups can only apply certain labels in certain namespaces or under certain service accounts. That’s another control point. This is something you need to work out as an enterprise, not just for network or compliance enforcement, but for everything from storage mounting and storage access to service exposure and load balancing; all of those are driven by the same set of metadata.

Michael: We have a question about the audit logs. Are those Kubernetes API server audit logs, or are they Calico/Tigera audit logs?

Christopher: Since Tigera and Calico use the Kube API server for most things we do, those are actually coming off of Kubernetes. We augment them with some of our own behaviors, but basically we’re catching everything that goes through the RBAC system in Kubernetes, and that includes the Tigera and Calico infrastructure.

Michael: Well, if there are no further questions, I’d like to thank everybody for attending, and let you know that our next webinar is coming up on Thursday, November 15, at the same time: 10am Pacific, 1pm Eastern.