Improving Security Forensics in Kubernetes Environments


The success of Kubernetes has made monitoring and alerting more difficult for traditional Security Information and Event Management (SIEM) tools. Attend this live webinar to learn how implementing the right network security and compliance solution will improve the accuracy and completeness of security forensic monitoring and alerting when using Kubernetes.

Michael: Hello, everyone, and welcome to today’s webinar, Improving Security Forensics in Kubernetes Environments. I am pleased to introduce today’s speaker, Christopher Liljenstrand. He is the original architect behind Tigera’s Project Calico. He speaks at over 60 meetups a year, so he’s educating lots of people on networking and network security for modern applications, microservices, containers, et cetera. He also consults with Tigera’s enterprise clients on security and compliance for their applications.

Michael: Okay. Well, Christopher, it’s all you.

Christopher: Welcome everyone. Today we’re gonna talk a little bit about forensics and how forensics change as we go into a containerized, specifically a Kubernetes environment. But a lot of what I’m talking about is gonna apply equally to any kind of containerized environment.

Christopher: And we’re gonna talk a little bit about not only how the way you do forensics might change, but some of the things you need to consider and some of the things you need to adapt. For example, how to make forensics data in your SIEM tools more useful, how you deal with doing actual forensics on suspect workloads, et cetera.

Christopher: This is a very deep topic, so we’re not gonna be able to cover everything in the allotted time. So, we’re gonna, sort of, give you some things to be thinking about, and potential areas where you might want to revisit your designs and maybe design differently going forward, especially as it relates to forensics in containerized environments.

Christopher: So, let’s go ahead and talk about the first case, which is how we used to do forensics on VM or even bare metal workloads. In this case, we have a VM or a bare metal host and it’s pretty much a black box. From a forensic standpoint, in the legacy days, you could take a look at what was coming in or out of that black box. Coming in or out of that VM. Coming in or out of that bare metal host. And Sally, the security engineer, could observe that and make a forensics decision about whether that black box is doing the right thing or doing something that we didn’t intend for it to do.

Christopher: So, that’s a pretty limited amount of visibility. Those were big monolithic apps. There were lots of things going on within that black box, but we didn’t have much visibility other than it goes in and it goes out. And that was mainly north-south traffic, too, by the way. There wasn’t a lot of east-west traffic, so mainly this is the request comes in and the response goes out. Or the request coming in turns into a request going to the next thing down in the stack, the database server. So, [inaudible 00:04:51] visibility.

Christopher: So, we came up as an industry with APM. And APM allowed us to instrument these black boxes and look at what’s going on inside the black box. And that might be catching system calls. That might be instrumenting libraries and API level calls within the application.

Christopher: So, that gave us visibility, at least at the module level or the library level, within this black box. Now it’s sort of a semi-transparent, a smokey transparent box. However, this had quite a few implications. If you wanted to deploy this kind of visibility, you had to compile it in or build it into your application. The developers all had to have this as part of their build process or deployment process, to push this capability into the code and actually make it visible.

Christopher: So, you could get greater visibility, but you were doing so at an additional cost and complexity to the way you built your application. And, for anyone who’s done it before: gee, I want to go deploy the latest version of an operating system image or module, and then you discover that your APM tool doesn’t support that yet. So, you have another dependency in your build cycle, waiting for APM vendors to catch up with the current state. But those are the tools we’ve had, and people have successfully used them, even with the complexity.

Christopher: But what’s happened is we’ve now gone through almost a big bang model of taking and exploding this big monolithic app into lots of microservices. This is the whole model behind Kubernetes and the other orchestrated environments. So now, what had been one big monolithic app with a bunch of libraries compiled into it, and the library calls between them, becomes individual entities in the infrastructure. So, instead of one VM, we now have lots of microservices and potentially multiple instances of each microservice.

Christopher: So, while that brings some complexity (and we’ve talked before about how Kubernetes orchestrates things and how Kubernetes corrals that complexity), it actually opens up an interesting avenue for forensics. And that is that what had been API calls or library calls within a given executable now become network calls. All of those things that had been libraries are now separate microservices, or can be. And now, the linkage between those different components is an API call, usually an HTTP REST-based or gRPC or equivalent call across the network. So, we’ve now exposed, on the network, the actual traffic between the components of our application.

Christopher: So, you can manage all of those API connections yourself and build in the mechanics to do back-offs and retries and failure detection and rerouting, all of that. Early on, everyone wrote all that code into each of these modules, these microservices, themselves. That brings with it its own set of issues: every developer basically having to write an API call framework, and all of the tooling that needs to go around that.

Christopher: So, IBM and Google rolled out a project called Istio, which is an instance of a service mesh. There are other service meshes out there, but take Istio as the example here. Now, Tigera and Red Hat and other folks are actively contributing to this. This is a fairly fast-moving, industry-wide approach to service meshing. But one of the things that you get out of these service meshes isn’t just retries and back-offs and rerouting due to failures, et cetera. You get a lot of visibility. One of the things these service meshes do is kick out a lot of data about each and every one of these API calls between the individual components of your application. And it’s doing this in a way that the developers don’t necessarily have to know much about the service mesh or implement the service mesh. Basically, the service mesh component inserts itself into each pod or container (pod, in Kubernetes’ case) and instruments this set of connectivity, this connectivity graph, all these arrows.

Christopher: So, we’re now collecting a lot of data about who’s calling whom, what the [inaudible 00:10:20] of that was, and all the characteristics of that call. So now, where we used to use a very intrusive APM infrastructure to tell us what’s going on, with a service mesh across a standard network you have very fine-grained visibility into all of the inter-module or inter-component traffic in your application.

Christopher: Instead of trying to collect your forensics data after the fact, after you’ve had an incident, if you can collect more of that data up front, before you have an incident, it allows you to better detect when there is an incident. You know, indicators of compromise. It also allows you to establish a baseline and see when changes happen. So, if you detect a compromise later and you have all of this API call data available, you can see when behavior started changing, maybe giving a better indicator of what was compromised and when.

Christopher: So, do your forensics in advance of an event, rather than after an event and collect this data and use it and normalize it. That becomes much easier in a Kubernetes environment using a service mesh. So, no questions so far, it looks like. So, we’ll continue on.
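
As a rough illustration of the baseline idea above, here is a minimal Python sketch that compares recent service-to-service calls against the caller/callee pairs seen during normal operation. The record shape and the service names are hypothetical; a real service mesh exports far richer telemetry, and this is not Istio's data format.

```python
from collections import Counter

# Hypothetical record shape: (caller service, callee service, request summary).
Call = tuple[str, str, str]


def build_baseline(calls: list[Call]) -> set[tuple[str, str]]:
    """Collect the caller -> callee edges seen during normal operation."""
    return {(src, dst) for src, dst, _ in calls}


def new_edges(baseline: set[tuple[str, str]], recent: list[Call]) -> Counter:
    """Count recent calls whose caller -> callee edge is absent from the baseline."""
    return Counter(
        (src, dst) for src, dst, _ in recent if (src, dst) not in baseline
    )


# Illustrative data only; service names are made up.
baseline_calls = [
    ("frontend", "cart", "GET /items"),
    ("cart", "inventory", "GET /stock"),
]
recent_calls = [
    ("frontend", "cart", "GET /items"),
    ("cart", "customer-db", "GET /records"),
    ("cart", "customer-db", "GET /records"),
]

baseline = build_baseline(baseline_calls)
suspicious = new_edges(baseline, recent_calls)
for (src, dst), count in suspicious.items():
    print(f"new edge {src} -> {dst}: {count} calls")
```

A new edge is not proof of compromise (a legitimate release can add one), but it gives a timestamped starting point for working out when behavior changed.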

Christopher: Next is, waiting for [inaudible 00:11:50]. Okay, so, now let’s say we have Eve, evil genius Eve, and she is going to attack your infrastructure. So first, let’s have her attack that heritage VM, that sort of smokey box. So, she attacks that VM and compromises it. It is now acting as command and control infrastructure for a botnet, or is going out and harvesting data out of your customer records database, or whatever other nefarious things she’s going to try and do. And we’ve gotten an indicator of compromise. So, some alarm was tripped and Sally realizes that smokey gray box has been compromised or believes that it might be compromised.

Christopher: So, what Sally has done, classically, in this environment is, if it’s a bare metal server, the server gets shut down and isolated and then moved into a forensics clean room, so to speak, where it can be analyzed. Similarly, if it’s a VM, you would pause or freeze that VM and, again, move it, logically, into a clean room where you can analyze and evaluate what happened.

Christopher: There are two different ways of responding to an attack, to a compromise. One is to get rid of the compromised thing. You’re not interested in what happened; you just want it off your network. So, you can basically shut down or kill that instance and restart, and hopefully you restart with a clean instance. So, that’s one approach. That’s just the “I want it gone” approach.

Christopher: The other is let’s try and figure out what Eve was trying to do. And that’s when we start getting into actual forensics of an attack.

Christopher: Now, as has been mentioned, I think, before in the series, I was a park ranger in the Park Service at one time. And as part of our law enforcement training, I remember very well our officer survival instructor said, “When you find a weapon on someone, don’t stop looking because people who carry one weapon usually carry more than one weapon.” So, if I take that to heart and I find Eve has compromised a machine, do I stop there and just assume that was the extent of the compromise, kill off the machine, reinstantiate, and I’m back in business? Or do I assume that what I’ve detected may not be Eve’s first or only thrust of attack in my infrastructure?

Christopher: So maybe what I want to do is, instead of just killing it off, I want to examine it and try and figure out what else she was really trying to do, what else she might have already done, so I can actually try and more completely contain the compromise. So, what was she going after? What was the data she was going after, or what was she attempting to do? Oh, this looks like it could be a command and control center for a botnet in my infrastructure; maybe I need to figure out what the control channel is so I can go find the nodes that might have already been infected and are acting as a botnet already. So there are advantages, if you have the skill set, to actually going and doing some forensics on these components when you find them.

Christopher: So Sally’s done that. Sally’s figured out what Eve was trying to do and countered that, and life is good. But now we’ve gone over to a Kubernetes environment, so Sally’s company is now deploying microservices in a Kubernetes environment, and Eve, keeping up with the times, attacks one of the Kubernetes pods for a given service.

Christopher: We can just freeze this pod and shift it over to a forensics clean room and start looking at it, right? Unfortunately, not quite. Containers aren’t really separate instances of operating systems. Containers are a memory space, or a namespace, within the underlying host’s operating system. There’s really not a good way of freezing a container and moving it. Technically, it might be possible, but that’s not something that the orchestrators today really do. I can’t really pick this up and move it somewhere else and reinstantiate it, [rehydrate 00:16:55] it, and examine what it’s doing.

Christopher: You might think the only thing you can really do is kill that pod that’s been infected and reinstantiate a new, hopefully clean one. But as we said, there’s some value to actually trying to figure out what’s happening. So instead of picking it up and moving it, why don’t we just isolate that pod? By using a combination of what we can do with the service mesh and what we can do with network policy controls in Kubernetes, like Tigera’s solution, you can actually put a label or a piece of metadata on that pod that says that pod should be quarantined, that pod is [suspect 00:17:38].

Christopher: At that point the service mesh, the underlying network infrastructure, the Kubernetes network infrastructure, can actually isolate that pod or set of pods that has been compromised and prevent them from communicating with anything else in the infrastructure.
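
A minimal sketch of the label-based quarantine described above, written as Python that builds a standard Kubernetes NetworkPolicy manifest. The label key `quarantine`, the policy name, and the namespace `shop` are assumptions for illustration, not names from the talk. One caveat: standard Kubernetes NetworkPolicy is additive, so this only isolates a pod if no other policy selecting it allows traffic; Calico policy adds ordering and explicit deny actions for that case.

```python
import json


def quarantine_policy(namespace: str,
                      label_key: str = "quarantine",
                      label_value: str = "true") -> dict:
    """Build a NetworkPolicy that selects labeled pods and allows no traffic."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": "quarantine-deny-all", "namespace": namespace},
        "spec": {
            # Applies only to pods carrying the (hypothetical) quarantine label.
            "podSelector": {"matchLabels": {label_key: label_value}},
            # Both directions are governed by this policy; because no
            # ingress or egress rules are listed, nothing is allowed.
            "policyTypes": ["Ingress", "Egress"],
        },
    }


policy = quarantine_policy("shop")
print(json.dumps(policy, indent=2))
```

With the policy in place, labeling a suspect pod (for example `kubectl label pod <pod-name> quarantine=true`) is what moves it into quarantine; the pod keeps running so it can still be examined.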

Christopher: Then you take the [inaudible 00:17:58] where you can say, instead of being diked off from everything, which might be a signal to the attacking code to self-destruct, maybe what you do instead is cut off its connections to the production environment and replace them with connections to a honeypot environment or an evaluation environment, so the attacking code may not actually notice that it’s been removed from production and it continues to function. And then you can evaluate what it’s doing.

Christopher: Again, maybe using service mesh: if you move it over and you point it at a honeypot infrastructure and you’re still using service mesh, then the service mesh is now pointing to this honeypot infrastructure, and you can see all the API calls it’s making, what it’s listening for, et cetera.

Christopher: In a containerized environment you may want to quarantine or change the service mesh topology, such that the suspect pod or pods can be evaluated without [killing 00:19:06] them or without shutting them down. Again, try and figure out what Eve’s actual end goal was, what she was trying to do, rather than just shutting off the thing that she infected.

Christopher: To reiterate: before, you would’ve frozen the workload and moved it into a clean room; here you sort of bring the clean room to the pods, rather than moving the pods to the clean room, by adjusting the service mesh and instituting network policies.

Christopher: Before I go on with the next topic that we want to talk about, let me pause here if there are any questions. If not, we will continue. Doesn’t appear that there are any questions right now so let’s move on a little bit.

Christopher: Now let’s talk about Eve and the way containers get deployed in something like a Kubernetes environment, the way microservices get deployed in Kubernetes. [inaudible 00:20:12] most people are going to deploy their applications via some kind of CI/CD pipeline, a continuous integration, continuous delivery pipeline, normally anchored to some sort of source code control system like [Git 00:20:25] in its various flavors. That could be GitHub, GitLab, your own implementation. But it’s some form of source code [inaudible 00:20:34] system and versioning system.

Christopher: You put your definitions and your code and your container definitions into your source code control system, and a CI/CD pipeline then deploys those instances into your infrastructure. The CI/CD pipeline is secured and checked to make sure what’s getting deployed is what’s in the repo itself, and it has some amount of security control surrounding it. The idea is that what gets deployed should reflect what’s in the actual repo.

Christopher: Let’s say Eve attempts … Sorry, next slide. Now, I wanna talk a little bit about what those pods could be doing to support forensics. We should be collecting all of the logging data coming off of those pods, and we should make sure that our individual components are generating sensible logging information. Just like the service mesh is generating logging information about the API calls between the pods, we really want the things on the pods, the actual microservices themselves, to be generating logs. And those logs should be going potentially to a couple of different pieces of infrastructure. One should be some sort of EFK or ELK stack, that’s Elasticsearch, Fluentd and Kibana, or Elasticsearch, Logstash and Kibana. You might be using Splunk or something else as well.

Christopher: But the idea is all of these logs should be dumping into a large searchable store that you can then use to start building and characterizing what you see of all the behaviors of a given type of pod. So you should be able to search through things, look for things in the logs in an easy, sensible manner. And that’s where the Kibana and Elasticsearch components come in.

Christopher: More importantly though, you should be making sure that those logs are sensible and have [inaudible 00:22:58] data. I’ve talked about this in a couple of webinars: a lot of logging systems use IP addresses, for example, as an indicator, as the identity of the endpoint.

Christopher: In an environment like Kubernetes, things like IP addresses or the nodes that something’s on are very ephemeral and may actually be reused multiple times in a day, let alone in weeks or months. If all of my logs are pinned to an IP address for a given pod, that’s sort of useless, because that IP address over time can refer to very different things in your infrastructure. So you really need to make sure those logs are annotated with data that comes from the orchestrator or etcd to uniquely identify what the thing is that is being logged, what generated that log, et cetera.

Christopher: And that allows you to make sure that you can actually have sensible log data if you go do a lookback to see, for example, when Eve got in and what she did.
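
To make the point about annotating logs concrete, here is a small Python sketch that stamps each log record with the pod's orchestrator identity rather than relying on its IP address. It assumes the pod spec exposes the pod name, namespace, and UID through the Kubernetes Downward API as environment variables; the variable names `POD_NAME`, `POD_NAMESPACE`, and `POD_UID` are chosen for this example, not fixed by Kubernetes.

```python
import json
import os
from datetime import datetime, timezone


def pod_identity(env=os.environ) -> dict:
    """Read pod identity from env vars (assumed to be set via the Downward API)."""
    return {
        "pod": env.get("POD_NAME", "unknown"),
        "namespace": env.get("POD_NAMESPACE", "unknown"),
        "pod_uid": env.get("POD_UID", "unknown"),
    }


def log_line(message: str, identity: dict) -> str:
    """Emit one JSON log record that carries the pod identity with the message."""
    record = {
        "ts": datetime.now(timezone.utc).isoformat(),
        "msg": message,
        **identity,
    }
    return json.dumps(record, sort_keys=True)


# Illustrative values only; in a real pod these come from the environment.
fake_env = {"POD_NAME": "cart-7d9f", "POD_NAMESPACE": "shop", "POD_UID": "1234-abcd"}
line = log_line("payment accepted", pod_identity(fake_env))
print(line)
```

Because the UID is unique to one pod instance, a lookback weeks later can still tell which workload wrote a line even after its IP address has been reused.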

Christopher: Make sure your logs are sensible, make sure that [inaudible 00:24:04] is searchable. The other thing: you might also wanna make sure that you’re logging to something that’s time-series-based, like Prometheus. [inaudible 00:24:12] is you definitely want to build up baseline data. You wanna be able to say that this kind of log shows up 20 times a minute across my infrastructure per X number of instances of this particular service or microservice. And then be able to set triggers and say, hey, if that log is showing up more than 20 times a minute for that number of microservices, then I want to take a look, because we’re now exceeding what was the normal baseline.

Christopher: From that perspective we probably all want to be using a time-series database as well as a searchable log store. So those are different things you want to think about in how you manage these logs.
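
The "20 times a minute" trigger can be sketched in a few lines of Python. This is only an illustration of the thresholding logic with made-up timestamps; in practice a time-series system such as Prometheus would hold the counts and evaluate the alert rule.

```python
from collections import Counter


def per_minute_counts(timestamps: list[float]) -> Counter:
    """Bucket event timestamps (in seconds) into whole minutes."""
    return Counter(int(ts // 60) for ts in timestamps)


def minutes_over_baseline(timestamps: list[float],
                          baseline_per_minute: int = 20) -> list[int]:
    """Return the minute buckets whose event count exceeds the baseline."""
    counts = per_minute_counts(timestamps)
    return sorted(m for m, n in counts.items() if n > baseline_per_minute)


# Minute 0 holds 20 events (at baseline); minute 1 holds 25 (over baseline).
events = [i * 3.0 for i in range(20)] + [60 + i * 2.0 for i in range(25)]
alerts = minutes_over_baseline(events)
print(alerts)
```

A fixed threshold is the simplest possible trigger; the baseline itself would normally be derived from historical data per service and per number of running instances.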

Christopher: Other things that are obvious but worth mentioning: make sure that you have accurate time across all your components so that logs can be correlated, et cetera, all the standard practices for an accurate logging system.

Christopher: We’ve now been logging all this data, and let’s say now Eve wants to attack one of our microservices again. She’s not giving up; she’s gonna make an attack. If we’ve done the right thing, we’ve made these pods, these microservices, immutable when they’re deployed. The idea here is that, in a containerized environment, troubleshooting should not mean going into a given microservice, [inaudible 00:26:04] or exec’ing into it, and making changes to that one microservice, for a couple of reasons. One, it opens a pathway of attack for Eve, which we’ll talk about in a minute. The other is that you now have a snowflake. So let’s say I have built a platform that is auto-scaling, and normally I have three instances of a microservice, and something’s gone wrong with that microservice; I haven’t gotten the configuration quite right.

Christopher: In a [inaudible 00:26:34] it’s very likely that you would’ve SSH’d into each of those VMs and changed the configuration. And that’s okay. A container [inaudible 00:26:42] because I will have changed the configuration on those three running instances but not back in the GitHub repo where the assets live, so when it’s five minutes later and Kubernetes says I need 100 more of these to be able to deal with the flash crowd, it’s gonna instantiate 100 more that won’t have that fix.

Christopher: So I now have three snowflakes that work correctly and 97 that don’t. It’s time to troubleshoot that one. Operationally, you should make sure everything is fixed in GitHub, in the code repo, and then you redeploy. So let’s say I’ve got a problem: I go fix it in the code repo, it gets pushed through the CI/CD chain, and the three bad instances get replaced with three new good ones. Five minutes later, when you get the crush of users, Kubernetes deploys 97 more instances that are correctly configured. So you really shouldn’t need to SSH or exec into workloads, except for maybe trying to debug a problem, as in what happened and what’s in the pod. But that’s something you shouldn’t be doing in a live system.

Christopher: However, people tend to do what they’ve always done. So, instead of just saying “we won’t do that,” you probably actually want to put controls on that and say: we’re going to block SSH into our pods, and we’re going to use RBAC controls such that most users or all users can’t exec into a pod. Those are Kubernetes RBAC controls.
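
Kubernetes RBAC is allow-only, so "blocking exec" in practice means never granting access to the `pods/exec` subresource. The Python sketch below builds a read-only Role manifest and includes a deliberately conservative helper that flags any role touching `pods/exec`. The role name, namespace, and helper function are hypothetical, and the helper is a simplification: it ignores API groups, ClusterRoles, and aggregated roles.

```python
def read_only_pod_role(namespace: str) -> dict:
    """Build a Role that can read pods and their logs but not exec into them."""
    return {
        "apiVersion": "rbac.authorization.k8s.io/v1",
        "kind": "Role",
        "metadata": {"name": "pod-reader", "namespace": namespace},
        "rules": [
            {
                "apiGroups": [""],
                "resources": ["pods", "pods/log"],
                "verbs": ["get", "list", "watch"],
            }
        ],
    }


def grants_exec(role: dict) -> bool:
    """Flag a role if any rule names pods/exec directly or through a wildcard.

    Conservative on purpose: any verb on pods/exec is treated as risky.
    """
    risky = {"pods/exec", "pods/*", "*"}
    for rule in role.get("rules", []):
        if rule.get("verbs") and risky.intersection(rule.get("resources", [])):
            return True
    return False


safe_role = read_only_pod_role("shop")
# Hypothetical role a developer might request for debugging.
debug_role = {
    "rules": [{"apiGroups": [""], "resources": ["pods/exec"], "verbs": ["create"]}]
}
print(grants_exec(safe_role), grants_exec(debug_role))
```

Running a check like this over every role in a cluster is one way to find out who can currently exec into pods before deciding how far to lock it down.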

Christopher: So if we have done that, then it becomes very, very difficult for Eve to actually infect running workloads, because those workloads have become basically immutable. Kubernetes isn’t gonna allow them to change, or allow some external factor to come in and change them.

Christopher: So if Eve launches an attack, her little [inaudible 00:28:51] is blocked by those controls, and her attack is unsuccessful. Those pods are still running. And they were deployed out of GitHub. And we’re assuming GitHub is known, has been evaluated, is trusted.

Christopher: Let’s say that that’s not quite a true statement. Actually, your developers got spear-phished, and somehow from that spear-phishing an attacker, say Eve, gets around and actually writes bad code into your stack for a given microservice.

Christopher: And that is in GitHub, and then when Kubernetes deploys its workloads through that secure CI/CD pipeline, that attack is now distributed within the pods. If I’ve been doing logging, et cetera, I’ll notice, for example, that the new version of those pods is actually behaving differently from the previous ones, and that might be an indicator of compromise.

Christopher: I know that those pods themselves can’t really be mutated, can’t be changed. So Sally, our security engineer, goes and looks at the underlying GitHub code and finds the anomalous code that got infected. Maybe it’s a new version of a Docker container; it’s a different NGINX Docker container than we normally deploy, and it happens to be a compromised NGINX container.

Christopher: So Sally figures that out and Sally removes that. She examines the code and removes that security issue from within the GitHub repo. That also might have been caught, by the way, in the GitHub repo via repo scanning and those kinds of things.

Christopher: But somewhere, now that we found out about it, we removed that from the GitHub repo and then we do a redeployment of those same services. And what happens then is we now end up with uninfected pods. So back in business again.

Christopher: So even here, from a forensic standpoint, we noticed a change in behavior, and we went and found the code. If we know that those pods can’t be mutated, then we know it’s gotta be in the origination, the source code repository up there. It’s much easier to look first in the repository for something that’s bad than to try and reverse engineer running code in a pod.

Christopher: So we find it in the GitHub repo, we clean it up, we push new versions, and we’re back in business. Let’s say, however, we have not blocked SSH, and we allow all our developers to exec into pods; we haven’t gotten that memo yet. So in this case, Eve’s attack on those pods is successful and those pods are compromised.

Christopher: So what do we do in this case? First of all, again, we have some indicator of compromise. As part of forensics we notice that [inaudible 00:32:03], that there are log changes, that these things are doing different things. Our service mesh is saying they’re making different API calls, trying to connect to different databases than the ones they usually do, et cetera.

Christopher: So I have noticed a change in the environment that looks suspicious. So I go look in the GitHub repo and nothing has changed there. There has been no change in the GitHub repo between before and after the indicator of compromise floated up. So I now have to assume that someone has actually been able to change my pods. So again, now I can go through what we talked about earlier.
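
The reasoning in the last few paragraphs boils down to a small decision table, sketched here in Python. The enum and function names are invented for illustration, and the two boolean inputs stand in for whatever behavioral alerting and repository history checks are actually in place.

```python
from enum import Enum


class Finding(Enum):
    NO_ACTION = "no indicator of compromise"
    INSPECT_REPO = "change came through the repo: review recent commits and images"
    ISOLATE_PODS = "repo unchanged: assume running pods were modified and isolate them"


def triage(behavior_changed: bool, repo_changed_since_baseline: bool) -> Finding:
    """Decide where to look first when pod behavior deviates from the baseline."""
    if not behavior_changed:
        return Finding.NO_ACTION
    if repo_changed_since_baseline:
        # Immutable pods can only differ if what was deployed differs.
        return Finding.INSPECT_REPO
    # Nothing new was deployed, so the running pods themselves were altered.
    return Finding.ISOLATE_PODS


print(triage(behavior_changed=True, repo_changed_since_baseline=False).value)
```

The second branch is only trustworthy when pods really are immutable; if exec or SSH is still open, a behavior change after a repo change could have either cause.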

Christopher: I can isolate those infected pods and start evaluating them. But at the same time I don’t want the service to go down, right? Those pods were supposed to be doing a useful service, a useful function in my application. So let’s say I now isolate those so I can evaluate what was done, but I need to get back to a known good working state.

Christopher: So the first thing I’m going to do is put in the rules I should have put in in the first place, which is to block the ability to SSH or [exec 00:33:15] or otherwise modify running pods in the infrastructure. So I put in those controls, and at the same time I redeploy with the known good set of instructions and code base in GitHub.

Christopher: So we redeploy, and we now have uninfected, un-Eve-tainted pods. And I’ve put in the filters in Kubernetes, with network filtering and network policy to block things like [inaudible 00:33:51], and I have very strong RBAC controls around exec. And now I know that Eve can’t reinfect those pods, or I don’t think she can. And my application is still running, while at the same time I have the isolated pods that were infected, which I’m now doing analysis on so I can maybe find what else Eve might have done to my infrastructure.

Christopher: So that gives you a little bit of a workflow on how you forward protect. How you can identify compromise, what you do when you’ve identified compromise in the Kubernetes environment. And where you might wanna start collecting data to develop baselines on what is normal versus abnormal behavior in your infrastructure.

Christopher: So I’m gonna go on and talk a little bit about some of the things that we can do to help here at Tigera. I don’t wanna turn this into a product presentation, [inaudible 00:34:42]. But I do wanna talk just a little bit about some of the things that Tigera has that can help in this regard.

Christopher: So I guess before I do that, though, let’s talk about best practices. So, configure your logs to go to the central, searchable space that I talked about: EFK, [inaudible 00:35:05] stack, [inaudible 00:35:07], or something along those lines. Also use time-series infrastructure, like Prometheus, to detect deltas and identify when behaviors change over time in your infrastructure.

Christopher: Make sure your logs are sensible. Make sure that you use Kubernetes metadata and annotations where possible, so you can actually know what that thing was that threw that particular log. Use service meshes to do a number of things, like making it easier to develop RESTful and gRPC API-based infrastructures, but also to collect API call data.

Christopher: Istio’s a wonderful example of a service mesh that can actually help you do that. Don’t shoot things that are interesting, ’cause if you do you lose any chance of actually trying to recreate or figure out what was done. You’re flying in the dark. Instead, isolate them using service mesh and network policies in Kubernetes. Isolate them such that you can examine them but they really can’t do any more harm.

Christopher: And the last thing is immutability. Ensure that the running pods are from a known source and you know the provenance of that code, and you’ve examined it and you’ve done static testing on it, all those things that you should do for code that you’re gonna deploy. And make sure that the pods are coming from a code base that you’ve done all that work on.

Christopher: And if they are, try and make them immutable once they’re launched, so that Eve can’t infect them. That leaves Eve’s only path to infection through your code base, which would be easier to detect. And if you must allow mutability on pods, make sure that everything is logged. Use Kubernetes [RBAC 00:37:01] controls to know exactly what changes were made to running pods and who made those changes. And when you see those, go back and ask why. Or have some kind of proactive logging facility where, if somebody’s gonna make a change to a pod, they have to log why they did it, so that in case something does happen you at least have traceability as to where it came from.

Christopher: So those are some basic best practices to support forensics in a Kubernetes environment. Now, some of the things that Tigera can help with in this space. We can provide very detailed, metadata-labeled flow logs from network data in the infrastructure. These not only indicate the standard IP addresses, five-tuple, and whether it was allowed, denied, or accepted, but also what namespaces the source and destination endpoints were in, what [inaudible 00:38:09] policies were used, like what tiered policies were used to allow or deny the traffic, and even what metadata labels, what Kubernetes labels, were attached to the source and dest.

Christopher: And the Kubernetes name, the unique name for that source and dest. So it gives you a very detailed log, so you can see exactly what things are involved and what the configuration was that caused traffic to be allowed or denied.

Christopher: We also give you the ability to dive in and explore your connectivity and your infrastructure. And we can scope this down to allowed or denied traffic, by namespaces, by pod types, or instances. So it allows you to visualize very clearly all the flows of traffic that are in either an application space or a given application.

Christopher: So it allows you to visualize the connectivity that your application components are performing. We can filter these by namespace, name, status, et cetera, as I said. And lastly, we can give you policies that allow you to control traffic all the way from layer 3 to layer 7, such that only endpoints labeled Bob can talk to endpoints labeled Alice on port 443, and may only do so if they’re using the right TLS certificate and they’re making the right kind of HTTP or gRPC query.

Christopher: So this gives you the ability to control those API calls all the way up to layer 7 now, as well as at layer 3 and layer 4. And this is how you would implement, for example, quarantine, as well as just controlling the service graph to begin with.
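
To illustrate what a layer 3 to layer 7 rule means, here is a toy Python evaluator for the "Bob may talk to Alice on port 443" example. It is a conceptual sketch only: the `Request` and `Rule` types are invented for this example, it is not Tigera's or Calico's policy model, and the TLS certificate check mentioned above is left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    source_labels: frozenset
    dest_labels: frozenset
    port: int
    method: str
    path: str


@dataclass(frozen=True)
class Rule:
    source_label: str
    dest_label: str
    port: int            # layer 4 condition
    methods: frozenset   # layer 7 condition: allowed HTTP methods
    path_prefix: str     # layer 7 condition: allowed URL prefix


def allowed(req: Request, rules: list) -> bool:
    """Default deny: a request passes only if one rule matches at every layer."""
    return any(
        r.source_label in req.source_labels
        and r.dest_label in req.dest_labels
        and r.port == req.port
        and req.method in r.methods
        and req.path.startswith(r.path_prefix)
        for r in rules
    )


# Hypothetical rule and requests; the path and methods are made up.
rules = [Rule("bob", "alice", 443, frozenset({"GET"}), "/api/")]
ok = Request(frozenset({"bob"}), frozenset({"alice"}), 443, "GET", "/api/orders")
wrong_method = Request(frozenset({"bob"}), frozenset({"alice"}), 443, "DELETE", "/api/orders")
wrong_source = Request(frozenset({"eve"}), frozenset({"alice"}), 443, "GET", "/api/orders")

for req in (ok, wrong_method, wrong_source):
    print(req.method, sorted(req.source_labels), allowed(req, rules))
```

Quarantine falls out of the same model: once a pod's labels change, it stops matching any allow rule and the default deny applies.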

Christopher: So those are some capabilities that we give you. There are others. There are other platforms out there as well, but I think the idea here is to get you to rethink a little bit how you do forensics in a Kubernetes environment, and take advantage of some of the capabilities that Kubernetes has that may not have been present in more heritage kinds of environments.

Michael: Once again, thank you guys for attending.