Microservices Security Summer Camp – Session 2

Microservices Security Summer Camp – Session 2: Best Practices for Securing Microservice Infrastructure

Microservice applications are packaged as containers and require orchestration by infrastructure management software like Kubernetes. Kubernetes will be the focus of this discussion on best practices to secure these dynamic environments. Join us to learn the tips and traps for securing the infrastructure for modern applications.

Complete Transcript

Speaker 1: Good morning, good afternoon everyone. Thank you for joining us today for the Microservices Security Summer Camp, session two. We’re going to be talking about best practices today for microservices security. We started with the introduction, and that is now available online on demand. If you happened to miss it, just go to www.tigera.io/webinars; you can also see it in the lower right-hand corner of the screen as a link. Today we’re going to talk about best practices and about how to secure your microservices environment. In a couple of weeks we’re going to talk about how, based on those best practices, you evaluate a solution to put in place.

Our camp counselor for today is Christopher Liljenstolpe. He was the original architect behind Tigera’s Project Calico, which many of you may know about. He speaks for Tigera at around 60 meetups per year. I think he’s doing around 14 or something this month, so you might have seen him around. He speaks about networking and network security for these modern applications. He is also our chief consultant here and helps our clients secure their modern applications and, as a matter of fact, he was actually a park ranger at one point in time.

Christopher: All right. Again, welcome everyone. As Andy said, I actually was a park ranger in the National Park Service at one time, and I think the most humorous line I got to say was when pulling up to someone who had a flat tire in the park or whatever and saying, “Hi, I’m from the government and I’m here to help you.” Everyone would usually get a chuckle out of that. So, I’m not from the government anymore. I’m from Tigera, but I’m here to help. Let’s have a little bit of a talk about microservices security.

As Andy pointed out, a couple of days ago we talked about an introduction: what’s changed in the microservices world and why microservices are going to change the way you think about security in your environment. We raised some challenges, some things that have changed that might make InfoSec at least reassess the way you go about solving these challenges. I’m just going to go over what some of these challenges are. One of the things we talked about in the earlier session was code provenance. Basically, it’s no longer just your vendor’s code or code you wrote in house.

Code could be coming from many different places. It could be that your developers are writing glue code that’s sticking open source projects together. You could still get vendor code or homegrown code. Your developers still could be writing that code using libraries and modules drawn from a huge number of sources and repositories. So how do you know what code you’re using? People think about scanning for known CVEs, or static code analysis to make sure that the code you’re bringing in from elsewhere follows reasonable practice, and you need to do all of that, of course. But is the repository where you’re getting that code from even trustworthy?

One thing I don’t think a lot of people think about is license contamination. If you have your developers going out and pulling code from all sorts of different sources because it allows them to get their job done quicker, how are you managing the fact that you might have code coming in with many different licenses? Maybe you have a corporate policy that certain types of code can’t use, say, GPL or Oracle Open Source or some other open source license or licensing construct. If you’re doing this automatically and all your developers are pulling code, how are you going to know you’re meeting that requirement? So you need to start thinking about not just the safety of the code, but the legal safety of the code.
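
As a rough illustration of checking that requirement automatically, the sketch below compares a dependency inventory against a license deny list. Everything in it is assumed for the example: the package names, the license identifiers on the deny list, and the function name `find_license_violations` are hypothetical, and a real pipeline would get the inventory from a scanner rather than a hand-written dictionary.

```python
# Hypothetical policy: license identifiers that may not appear in production code.
DENIED_LICENSES = {"GPL-2.0-only", "GPL-3.0-only"}


def find_license_violations(dependencies, denied=DENIED_LICENSES):
    """Return the names of dependencies whose declared license is on the deny list."""
    return sorted(name for name, lic in dependencies.items() if lic in denied)


# Hypothetical inventory: dependency name -> declared license.
deps = {
    "fastjson": "MIT",
    "netlib": "LGPL-3.0-only",
    "cryptokit": "GPL-3.0-only",
}

violations = find_license_violations(deps)
print(violations)
```

A check like this would typically fail the build when the returned list is not empty, so the policy is enforced on every commit rather than reviewed by hand.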

Automated deployment, for good or ill. As we said earlier, a manual waterfall process isn’t going to cut it. If you want to be responsive and agile, going through a waterfall process that might take you weeks to get code pushed is not really going to meet the business demand. So you start looking at automation, and automation is great, except you can also automate a disaster. One of the terms we use is, “Are you going to automate a flatline event for your business by pushing something really bad that tanks the entire service?” So you want to automate things, but automate them in such a way that there are circuit breakers and other things in place that will prevent you from doing a large amount of damage with an automated process. You might want to think about that as blast radius containment.
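
One way to picture the circuit breaker idea is a rollout that proceeds in small batches and stops itself when targets start failing. The sketch below is illustrative only: `staged_rollout`, its parameters, and the `deploy` and `healthy` callbacks are hypothetical names, not part of any particular CI/CD tool.

```python
def staged_rollout(targets, deploy, healthy, batch_size=2, max_failures=1):
    """Deploy in small batches and halt once too many targets are unhealthy."""
    failures = 0
    deployed = []
    for start in range(0, len(targets), batch_size):
        for target in targets[start:start + batch_size]:
            deploy(target)
            deployed.append(target)
            if not healthy(target):
                failures += 1
        if failures >= max_failures:
            # Circuit breaker trips: the remaining targets are never touched.
            return {"status": "halted", "deployed": deployed}
    return {"status": "complete", "deployed": deployed}


# Hypothetical fleet in which the third server fails its health check.
servers = ["s1", "s2", "s3", "s4", "s5", "s6"]
result = staged_rollout(servers, deploy=lambda s: None, healthy=lambda s: s != "s3")
print(result["status"], result["deployed"])
```

With a batch size of two, the bad change reaches four servers instead of all six; smaller batches shrink the blast radius further at the cost of a slower rollout.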

Tweaking or patching code. To date, most times when I discover I’ve got a problem in code or I need to address a CVE, I’ll log into the server and I’ll make the change to the server, to the running environment. That’s not immutable. One of the things we talked about in the previous session was that the cloud native pattern is immutability: what I push, what I commit, will always be the same. So if you’ve got people logging in to edit running servers or running services, et cetera, that’s not immutable. It’s not re-creatable. So you need to think differently about how you’re going to do this, to make sure things are always immutable, always repeatable.

One of the worst things you can do is have somebody log in and tweak one instance of, say, a container because it’s got a problem or they’re trying to troubleshoot something. Let’s say they do that and then you get a scaling event and the orchestrator fires up 499 more of those containers to handle it. Well, the orchestrator is not going to know about the fix you made to the one running container. It’s going to fork off 499 more of the thing that you committed into the repo, into GitHub or whatever. So guess what, you’re going to have one pet out of a herd of 500 cattle that’s doing something different. Trying to figure out that failure mode, why it fails one in every 500 times for a customer, is not going to be a particularly enjoyable experience.

So you really need to treat everything either as cattle or as pets. Don’t mix pets and cattle in the same field; you’re going to have problems. The other thing to keep in mind: if it doesn’t exist in Git, or whatever you’re using for source code control, it doesn’t exist. Do not push things that aren’t committed to whatever you’re using as your repository of record. It makes it almost impossible to trace where things came from, what things are running, et cetera. So if it doesn’t exist in Git, it just doesn’t exist.
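
A small sketch of how a hand-tweaked instance could be spotted: fingerprint the configuration that was committed, fingerprint what each running instance reports, and flag any mismatch. The instance names, the configuration fields, and the helpers `fingerprint` and `find_drift` are all hypothetical, and how the observed configuration is collected is left out.

```python
import hashlib
import json


def fingerprint(config):
    """Hash a configuration in a key-order-independent way."""
    canonical = json.dumps(config, sort_keys=True).encode("utf-8")
    return hashlib.sha256(canonical).hexdigest()


def find_drift(committed, observed):
    """Return the instances whose observed configuration differs from the committed one."""
    expected = fingerprint(committed)
    return sorted(name for name, cfg in observed.items() if fingerprint(cfg) != expected)


# Hypothetical data: what is in source control versus what three instances report.
committed = {"image": "web:1.4", "timeout": 30}
observed = {
    "web-1": {"image": "web:1.4", "timeout": 30},
    "web-2": {"timeout": 30, "image": "web:1.4"},
    "web-3": {"image": "web:1.4", "timeout": 90},  # someone edited this one by hand
}

print(find_drift(committed, observed))
```

The fix for a drifted instance is the one described above: replace it from what is committed, rather than copying the tweak around.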

Some more challenges. Changes in the anchor. As we said, these are very ephemeral and very rapidly changing environments. An IP address is not an identity anymore. What identifies the workload is its metadata and labels. We’ll talk about that in a bit. It’s no longer possible to use an IP address either as an anchor for policies or as an indicator that you are in compliance, because IP addresses are completely ephemeral.

Blast radius. We now have a very large infrastructure. It’s very dynamic. Perimeter security is necessary, but it’s no longer sufficient. If you elect to have just perimeter security, you have to assume that in this kind of dynamic environment the opposition will get past that perimeter security; there are myriad ways that can happen. Once they’re in, there’s no way to control lateral movement. They’re already in the front door and you have no locks on the interior doors. A common statement in the industry is that there are two types of organizations: those that have an advanced persistent threat, and those that don’t know that they have an advanced persistent threat.

If you rely on perimeter security, those advanced persistent threats, which you do have in your organization, will have free rein. That’s a bit of a problem. A corollary to that is an explosion in east-west traffic. Basically, what used to be function calls within a monolithic application are now API calls over the network. The network is now the interconnection bus for the components of your microservices. So where you used to have 95% of your traffic going north-south, that is, queries from your users and responses to those queries, with maybe 5% of traffic going east-west in the data center, that is now inverted. The vast majority of traffic within your cluster is now east-west traffic, not traffic that is responding directly to or receiving instructions from an end user.

As a general rule, that east-west traffic is not logged. That means that 95% of your traffic is opaque. That is also going to be a bit of a problem when it comes time for compliance, because you’re not actually recording most of what’s happening in your infrastructure.

Infrastructure as code. Your infrastructure can now be managed just like your application code base. Policies, service discovery or service exposure, storage mounting and storage setup, et cetera, even server configurations themselves, are all now code artifacts. The good thing is you can use that same CI/CD pipeline to manage your infrastructure just like you manage your applications. The negative is that conversation we had earlier about automating disaster. If you make a change to infrastructure and commit the code and it automatically deploys to 5,000 servers and it was wrong, that will be a bad day.

So, again, you need to apply the same care you use for managing how your applications get pushed, or more, to the way you push infrastructure. All of these things relate back to changes that can happen in seconds. Can you react in the same timeframe manually? Odds are, probably not. Even if you try to react manually to a problem, by the time you’ve figured out what’s going on, the event has already ended or, worse yet, has morphed into something else. So you’ll be playing a catch-up game with this automated system that is now off the rails. You need to take a step back and start thinking about how you’re going to automatically respond to issues, then go back and find what’s actually the problem, fix it at the root, and let automation take care of cleaning it up, rather than trying to keep ahead of an incident.

So let’s take a look at some of the best practices now. We’re going to talk a little bit about infrastructure hardening, registry and container scanning, intent-based models, segmentation and policies, multiple enforcement points in layers, TLS, RBAC and audit, and meaningful logging; the keyword here is meaningful. So, infrastructure hardening.

Section 1 of 3 [00:00:00 – 00:13:04]

Section 2 of 3 [00:13:00 – 00:26:04] (NOTE: speaker names may be different in each section)

Christopher: Containers bring their own things they need to execute. You do not need to have all code known to man, all capabilities known to man, in the servers that the containers are running on. So, if you don’t absolutely need it, don’t install it or don’t enable it. Basically, keep your underlying infrastructure stripped down as much as possible. The containers will bring with them what they need. Keep it simple. This is a corollary of the first one. Simple makes it easy to assess whether your hardening is correct, and it makes things operationally easier. Automate security patch application. You don’t want to go around and hand-patch every server when Patch Tuesday comes around, or whatever equivalent thereof you’re using. The folks at CoreOS, now Red Hat, have a really good mechanism for automating deployment of operating system updates. Other folks have explored the same area.

Investing in something like that is a real winner. You might think that having automated deployments of operating system updates would be problematic because your servers are going to be rebooting, but remember that the orchestrator you put in place assumes that servers are going to die. It will just treat a server rebooting because of an OS update as another server failure and will gracefully handle it. You won’t see an interruption in service. So, automate security patch application. Otherwise, you will end up like certain credit rating bureaus. Use mandatory access controls in the kernel. By this we mean things like AppArmor, SELinux, etc.

Root doesn’t need to be able to control everything on the server. So, the concept of least privilege: only allow entities that should have specific access to have that access to those capabilities within the kernel. This is good practice. Don’t allow direct CLI access to your infrastructure unless you’re in extremis. This also goes back to immutability and repeatability. If somebody logs in and makes a patch to a server to get it working again, they’re going to forget to check that back in or forget to apply it to other servers. Now you have one server that’s different from all the others. Things are going to start breaking and it’s going to be difficult to trace why.

So, don’t treat any piece of infrastructure as a pet. If you need to fix something, go back, fix it in the manifests, and then push the change. Don’t SSH into these things. You might want to consider an isolated management and control plane network for things like the control plane for the orchestrator, etc. That means putting it on your management network, your upper management network, or at least using some form of host isolation policies to control who can get access to that orchestration control plane. Ship your logs and manage your clocks. Logging on a server is not going to do you any good if that server goes south, either because of an attack or because of an outage. So, your logs have to go somewhere where you can look at them. A corollary of that: make sure everyone has the same clock, the same concept of time. It is going to be impossible in a fleet of thousands of servers and hundreds of thousands of endpoints to figure out what actually happened if everything has even slightly different concepts of time.

So, deploy NTP, etc., but make sure everyone is chiming to the same clock. That gives you the ability to correlate events across multiple endpoints. Container registry and scanning. Any containers in your registry of record should be scanned on every commit and fingerprinted. You could be using Docker Hub, you could be using [inaudible 00:17:08], you could be using any of the other registries, but make sure that things are scanned every time updates are made and that they are fingerprinted. Scan for CVEs. Scan for libraries or inclusions you don’t want. Maybe you don’t want a specific version of Java for whatever reason. Then scan to make sure that things checked in to that registry don’t include those libraries.

License leakage. I referred to this earlier. Maybe you have a policy for production code that has corporate intellectual property: you don’t want, say, GPL licenses. LGPL is fine, but GPL is not. So, you want to make sure that you’re scanning for GPL code in those modules. Containers and artifacts that don’t have a matching fingerprint should be scanned as part of CI/CD. So, if you’ve got a container or an artifact that is part of a manifest, and the CI/CD chain is building it, and it matches the fingerprint of something that’s already been scanned, you know it’s good. Let it go through. If it doesn’t have a matching fingerprint, then you need to scan it in CI/CD at integration time. Then, hopefully, check that back in and register the fingerprint, but you need to make sure that things have been scanned before they go out the door.
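
The fingerprint gate described above can be sketched in a few lines. This is an illustration under assumptions, not any registry’s actual mechanism: SHA-256 is assumed as the fingerprint, `admit` and `artifact_fingerprint` are hypothetical names, and `toy_scan` is a stand-in for a real CVE, library and license scanner.

```python
import hashlib


def artifact_fingerprint(data: bytes) -> str:
    """Fingerprint an artifact by its content (SHA-256 assumed here)."""
    return hashlib.sha256(data).hexdigest()


def admit(data, scanned_fingerprints, scan):
    """Let an artifact through only if it has been scanned, now or earlier."""
    fp = artifact_fingerprint(data)
    if fp in scanned_fingerprints:
        return True  # already scanned and registered: let it go through
    if scan(data):
        scanned_fingerprints.add(fp)  # register the fingerprint for next time
        return True
    return False  # failed the scan: it does not go out the door


scan_calls = []


def toy_scan(data):
    """Stand-in scanner: records each call and rejects anything containing b'bad'."""
    scan_calls.append(data)
    return b"bad" not in data


known = set()
print(admit(b"good image", known, toy_scan))  # scanned, then registered
print(admit(b"good image", known, toy_scan))  # fingerprint matches, no rescan
print(admit(b"bad image", known, toy_scan))   # scanned and rejected
```

Because the fingerprint is derived from the content, any change to the artifact produces a new fingerprint and forces a fresh scan.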

Let’s talk a little bit about intent models. The system should allow you to express your intent as to how it should behave rather than telling it what to render or what to do. “Bob should not be able to send [inaudible 00:18:46] traffic to Alice” should be the way you express your intent, not “a specific IP address allows all that traffic from another specific IP address,” because guess what? In a couple of seconds, those IP addresses will be different. So, you really want to have this more as an intent. I don’t care if there’s a Bob or an Alice, but if there’s a Bob or an Alice, this is the way they should behave.

Another example: EU backends should be able to connect to EU databases, but not US databases. Again, this is not based on exactly what network rules I want to put in place. This is a high-level intent. It makes it easier to understand later what you were trying to do, and it makes it easier for the orchestration system to render something that does what you want, versus trying to second-guess what you were trying to accomplish because you were using ephemeral data as an anchor for a policy. This goes for more than just network policy and security. We talk about network policy and security here at Tigera, but this applies across the board, from storage to service exposure, etc. Intent is the way you should be managing these environments.
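
To make the EU example concrete, here is a minimal sketch of a policy written against labels rather than addresses. The label keys (`role`, `region`), the rule format, and the function names are assumptions made for this illustration; they are not the syntax of Calico or Kubernetes network policy.

```python
def matches(selector, labels):
    """A selector matches when every key/value it names is present in the labels."""
    return all(labels.get(key) == value for key, value in selector.items())


# Hypothetical intent: EU backends may connect to EU databases. Nothing else is allowed.
POLICY = [
    {"from": {"role": "backend", "region": "eu"},
     "to": {"role": "database", "region": "eu"}},
]


def allowed(src_labels, dst_labels, policy=POLICY):
    """Default deny: a connection is allowed only if some rule matches both ends."""
    return any(
        matches(rule["from"], src_labels) and matches(rule["to"], dst_labels)
        for rule in policy
    )


eu_backend = {"role": "backend", "region": "eu"}
eu_db = {"role": "database", "region": "eu"}
us_db = {"role": "database", "region": "us"}

print(allowed(eu_backend, eu_db))  # True
print(allowed(eu_backend, us_db))  # False
```

No IP address appears anywhere in the rule, so it stays valid when workloads are rescheduled and their addresses change.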

You should not have multiple segmentation mechanisms, i.e. coarse-grained and fine-grained. Coke can’t talk to Pepsi. Coke can talk to Coke. Pepsi can talk to Pepsi. That’s a coarse-grained policy. Front ends in Europe can only talk to back ends in Germany. That’s more of a fine-grained policy. The problem is that the distinction isn’t binary. It’s an analog range between zero and one, which means you will not always make the same guess about whether to do this in the coarse model or the fine-grained model. Which means that at 3:00 in the morning, when it’s broken, you will be confused as to what you did and why.

Similarly, what’s coarse today (Coke can talk to Coke, Pepsi can talk to Pepsi) is going to be fine-grained tomorrow, because there are going to be other requirements that come on board. If you modeled that relationship in a coarse-grained model that can’t do fine-grained, you have to unmodel it on one side and model it on the other. Again, you’re introducing a chance for confusion and misconfiguration. Workloads should be able to belong to multiple disjunct topologies without interference or coupling. If you use, say, VLANs and say Coke can talk to Coke and Pepsi can talk to Pepsi, what about the thing in Coke that needs to talk to something in Pepsi? There are myriad cases where I have multiple relationships on the workload, not a single relationship, and I need to be able to support all those relationships in a policy model.

Policy should define a capability or a personality, not a specific workload. Don’t write your policies thinking this workload means these rules, because the next time somebody ships the next version of that workload, it might need a different set of rules. Instead, think in terms of personalities or characteristics. This thing is an LDAP server. This thing is an LDAP client. LDAP servers should be able to receive traffic from LDAP clients. This thing is a PCI-compliant front end and should be able to talk to things that are PCI-compliant back ends. Those are capabilities or personalities. Then any given workload might be an LDAP client and a PCI back end.
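
The capability idea can be sketched as two separate tables: one that says which capabilities may talk to which, and one that says which capabilities each workload currently has. The workload names and capability strings below are hypothetical, chosen to mirror the LDAP and PCI examples above.

```python
# Rules are written between capabilities, never between named workloads.
ALLOWED = {
    ("ldap-client", "ldap-server"),
    ("pci-frontend", "pci-backend"),
}

# Hypothetical mapping of workloads to the capabilities they currently have.
WORKLOADS = {
    "directory": {"ldap-server"},
    "checkout-api": {"pci-backend", "ldap-client"},
    "storefront": {"pci-frontend"},
}


def can_connect(src, dst, workloads=WORKLOADS, allowed=ALLOWED):
    """Allow the connection if any capability of src may talk to any capability of dst."""
    return any(
        (s, d) in allowed
        for s in workloads[src]
        for d in workloads[dst]
    )


print(can_connect("storefront", "checkout-api"))  # True, PCI front end to PCI back end
print(can_connect("checkout-api", "directory"))   # True, LDAP client to LDAP server
print(can_connect("storefront", "directory"))     # False, storefront is not an LDAP client
```

If a later version of the storefront starts making LDAP queries, only its entry in the workload mapping changes; the rules between capabilities stay as they are.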

The same capability might map to disjunct workloads. The same workload may map to multiple capabilities. This allows you to mix and match, and as code evolves over time, you can adjust those policy mappings just by saying this thing is now an LDAP client. It wasn’t before, but now I’m having my back end do an LDAP query, so now it’s an LDAP client as well. In a zero trust model, you probably shouldn’t put all your trust in just one point. What’s the point of saying I have no trust if I’m actually saying I have absolute trust in just one point? It’s not a zero trust model. It’s a one trust model. Multiple-point, bookended enforcement is very hard to subvert in an at-scale dynamic environment. I won’t necessarily go into the details on why, unless we get a question, but if the opposition can do something to subvert bookended (i.e. ingress and egress) policies at multiple layers in your infrastructure, say with TLS at layers five through seven and at layers three and four, then they already own your infrastructure at such a deep level that this is pretty much an academic discussion.

In order to do this, they already have all the keys to the kingdom they need to take over your infrastructure anyway. You have a bigger problem, in other words. Different layers have different metadata and other characteristics that have meaning. The relationships between those constellations are a very high quality signal to base policy on. If I say that Bobs should be able to get [cuss 00:23:55] records from things starting with a [inaudible 00:23:59] record [inaudible 00:23:59] URLs start with cuss record from Alice, that is an interesting data point. If I say that these TLS certs tied to this user or service account can make queries to things with this service account, that is another interesting and disjunct piece of signal. If I say Bobs can talk to Alices on 443, that’s a third signal. So, I now have a network layer, an encryption layer, and an application layer bit of signal.
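
A compact way to picture the three signals is a check that passes only when all of them agree. The sketch below is hypothetical throughout: the `Flow` fields, the service account name `svc-bob`, and the `/records/` path prefix are placeholders invented for the example, since the exact values are not recoverable from the recording.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Flow:
    src_label: str     # network layer: workload label, not an IP address
    dst_label: str
    dst_port: int
    tls_identity: str  # encryption layer: identity from the verified peer certificate
    method: str        # application layer: the request being made
    path: str


def permitted(flow):
    """Allow the flow only if all three independent signals agree."""
    network_ok = (flow.src_label, flow.dst_label, flow.dst_port) == ("bob", "alice", 443)
    tls_ok = flow.tls_identity == "svc-bob"
    app_ok = flow.method == "GET" and flow.path.startswith("/records/")
    return network_ok and tls_ok and app_ok


good = Flow("bob", "alice", 443, "svc-bob", "GET", "/records/42")
stolen_label = Flow("bob", "alice", 443, "svc-mallory", "GET", "/records/42")

print(permitted(good))          # True
print(permitted(stolen_label))  # False, the certificate identity does not match
```

An attacker who forges one signal, for example by landing on a workload with the right label, still fails the other two checks.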

I put those three together and I have a very clear and distinct fingerprint of what should be happening between Bob and Alice. I can now tie a policy to that which allows it to happen, versus just keying off one bit of metadata. I can now make these decisions based on multiple independent pieces of metadata from different layers in the stack. Love mTLS, don’t fear it. I know that a lot of folks out there are scared about what’s going to happen when all their applications are encrypted. They’re not going to be able to do any traffic analysis or DPI on their traffic. Everything is going to be encrypted with TLS. However, TLS is a wonderful identity layer. It gets you pretty good surety that the peers are the ones you think they are, and it’s transitive across multiple transports. I don’t care if you crossed five NAT gateways between A and B.

If someone hasn’t terminated the TLS, and you’re using mutual TLS with both client-side and server-side authentication, you still have pretty good assurance that the two endpoints that are talking are the two endpoints you think are talking. This is a very, very powerful construct. So, use mTLS. What that means is that non-repudiation, authentication and secrecy come along in the bargain. I know who sent it, I was able to authenticate them based on that, and I actually provided some privacy for the data flow as well.
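
For readers who want to see what the “mutual” part looks like in code, here is a minimal sketch using Python’s standard `ssl` module. It only builds the server-side configuration. The function name and its parameters are hypothetical, and the certificate, key and CA file paths are placeholders the caller would have to supply.

```python
import ssl


def mtls_server_context(cert_file=None, key_file=None, ca_file=None):
    """Build a server-side TLS context that refuses clients without a valid certificate."""
    ctx = ssl.create_default_context(ssl.Purpose.CLIENT_AUTH)
    ctx.minimum_version = ssl.TLSVersion.TLSv1_2
    # This is what makes the TLS mutual: the server also verifies the client.
    ctx.verify_mode = ssl.CERT_REQUIRED
    if cert_file:
        # The server's own identity (placeholder paths supplied by the caller).
        ctx.load_cert_chain(certfile=cert_file, keyfile=key_file)
    if ca_file:
        # The CA that issued the client certificates the server should trust.
        ctx.load_verify_locations(cafile=ca_file)
    return ctx


ctx = mtls_server_context()
print(ctx.verify_mode == ssl.CERT_REQUIRED)
```

In a service mesh this configuration is usually handled by a proxy next to the workload rather than by application code, but the principle is the same: both peers present certificates and both peers verify them.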

Section 3 of 3 [00:26:00 – 00:36:44] (NOTE: speaker names may be different in each section)

Christopher: It will play hell with your middle-box DPI, however. There’s a way of solving that, which is to give the DPI box all the TLS keys and have it decrypt and re-encrypt in the middle. What you’ve just done is give a ready-made man-in-the-middle attack to all those APTs you have in your infrastructure. You’ve given that middle box all the keys to your kingdom. You might secure it, but it’s a really juicy target.

The opposition’s smarter than you, smarter than all of us. They will get onto that, and remember, we talked about minimizing the blast radius? This is more like maximizing the blast radius. This is not a good idea, folks. Don’t do it.

So, what you probably want to do is do this inspection before or after TLS. Don’t break the chain. One of the nice things you can look at is Istio. Istio does this encryption directly adjacent to the workload, the keys are saved just within the workload, the blast radius is minimized, and Istio will expose a lot of these DPI capabilities natively to you before traffic gets encrypted. So you minimize the blast radius, you don’t share keys around, and you still get DPI while you get strong TLS security.

RBAC and audit: Role-Based Access Control and audit. Remember that blast radius? Please don’t give one set of individuals all the keys. If you have this concept of a superuser or, in router terms, a privilege level 15 user for your entire infrastructure, that’s going to end really, really poorly.

You probably do need a get-out-of-jail-free card for every component, for when things have gone horribly, horribly wrong and we need to get into the platform. Those credentials should be unique to each platform or even each instance of the platform, and they should be locked up physically in a safe and inspected, if you ask my opinion, or some electronic version thereof. Users should not have access to those keys unless it’s a break-glass problem. Normally people, applications and code should have access to do just what they need to do for their specific function.
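
A minimal sketch of that least-privilege arrangement, with a separate break-glass role whose use is always recorded. The role names, permission strings and the `authorize` helper are hypothetical; a real deployment would pull roles from the site-wide directory rather than a hard-coded table.

```python
# Hypothetical role table: each role gets only what its function needs.
ROLE_PERMISSIONS = {
    "deployer": {"deploy:app"},
    "auditor": {"read:audit-log"},
    "break-glass": {"deploy:app", "read:audit-log", "admin:platform"},
}


def is_allowed(roles, permission):
    """True if any of the caller's roles grants the permission."""
    return any(permission in ROLE_PERMISSIONS.get(role, set()) for role in roles)


def authorize(user, roles, permission, audit_log):
    """Check a permission and record the decision, flagging any use of break-glass."""
    decision = is_allowed(roles, permission)
    audit_log.append({
        "user": user,
        "permission": permission,
        "allowed": decision,
        "break_glass": "break-glass" in roles,
    })
    return decision


log = []
print(authorize("ci-bot", ["deployer"], "deploy:app", log))         # True
print(authorize("ci-bot", ["deployer"], "admin:platform", log))     # False
print(authorize("oncall", ["break-glass"], "admin:platform", log))  # True, and flagged
```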

Audit logs aren’t just for HR action. You’re not only going to want to use them in case somebody did something bad, to figure out who that was as part of an investigation. You also just want to know who’s doing things and why. It could be that the system changed, maybe for good, maybe for bad, and you don’t know why it was done. You want to find out who made that change and ask them. You may, as an operations team, want to see how people are interacting with the system and see if there are things that you can automate. This is a good signal for seeing how people actually use your system.

So, audit logs tied to users and therefore their roles, be it an automated user, i.e. a piece of software, or a real user, give you a lot of good understanding about how the platform is being used. It’s not just for prosecuting somebody; it also might be for making the system better, so these are all important characteristics.

Systems should tie into a site-wide RBAC. You don’t want a YAAS, yet another authentication system. If you have separate, disparate authentication systems, you will forget to update them after those HR actions, and again, it will all be over but the screaming. So you should have a common RBAC authentication model and repository, be that LDAP, RADIUS, Active Directory, or whatever you want to use, but it should be common, and you should have everything tie into it so there’s one point to administer who can do what in your system.

That’s a good place to put the guy with the gun, and lock it behind locked doors and motion sensors, et cetera, because that is a bit of a “place bomb here” thing, but in this case, the benefit of centralizing that data dramatically outweighs the security disadvantages of doing so.

Audit logs should be immutable and non-repudiable. I want to make sure that what’s in the log actually happened and was signed by the thing that injected the log. Preferably no one should be able to change those logs, but if they change them, I at least want to know they’ve been changed. I need to know that they’re no longer reliable, but really you want them to be immutable. If you don’t have those two things, you don’t really have an audit log. You have a nice, fictional piece of text that might make you feel good, but it’s not a log. Definitely not an audit log.

They need to capture both human and machine actions. Most of the actions in your infrastructure are going to be machine driven. They’re not going to be human driven. You need to capture what the machines are doing as well as what the humans are doing. Remember Skynet from the movies? You want to know what the machines are doing.

I’d also like to look at logging in general. Logs are not only good for forensics and break/fix input; they also show a trend of behaviors over time that can inform you about the health of the system and what you might want to optimize. Everything may still be going great, but in the logs you see that over time my APIs are degrading and taking longer and longer to respond. Maybe I need to look at fixing that before it becomes a break/fix problem. Or I never intended for that API to be called as frequently as it’s being called. I thought it was going to be an uncommon use, and it’s being used all the time. Why is that? Let me go find out why people are using that. Should I redesign that API to be more efficient because it’s being used more frequently, or should I go figure out why people are using this API? Maybe I need to go fix the thing that they’re trying to work around, and again, that could be a machine or a human.

So, this gives you lots of feedback on how your system is being used, not just on its overall health. These things are pretty useful for more than just break/fix. Again, broken record: non-repudiable and immutable. I want to make sure that what’s in there is what was put in there and that it hasn’t been changed after it hit the log.

Those two things are slightly different, folks. One is that I want to know that the message that went into the log came from the thing that’s authoritative for knowing what should be going in the log, i.e. the output of API call logging should be coming from the thing that should be able to generate that. Somebody can’t go in and just fake that. Secondly, once it’s in the log, I want to make sure that the log hasn’t been changed. So, the first is non-repudiation; the second is immutability. You want those two things to go together, but they are slightly different.
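
The difference between the two properties can be shown with a small hash-chained log. This is a sketch under stated assumptions rather than a production design: an HMAC with a per-writer key stands in for a real digital signature (true non-repudiation needs asymmetric signatures, because an HMAC key is shared with the verifier), and the chain makes tampering detectable rather than impossible. The writer names and keys are hypothetical.

```python
import hashlib
import hmac
import json

GENESIS = "0" * 64


def append_entry(log, actor, action, key):
    """Append an entry signed by its writer and chained to the previous entry."""
    prev = log[-1]["chain"] if log else GENESIS
    body = json.dumps({"actor": actor, "action": action}, sort_keys=True)
    signature = hmac.new(key, body.encode(), hashlib.sha256).hexdigest()
    chain = hashlib.sha256((prev + body + signature).encode()).hexdigest()
    log.append({"body": body, "signature": signature, "chain": chain})


def verify(log, keys):
    """Check who wrote each entry (signature) and that nothing changed since (chain)."""
    prev = GENESIS
    for entry in log:
        key = keys.get(json.loads(entry["body"])["actor"])
        if key is None:
            return False
        expected = hmac.new(key, entry["body"].encode(), hashlib.sha256).hexdigest()
        if not hmac.compare_digest(expected, entry["signature"]):
            return False  # origin check failed: not written by the claimed actor
        link = hashlib.sha256((prev + entry["body"] + entry["signature"]).encode()).hexdigest()
        if link != entry["chain"]:
            return False  # integrity check failed: an entry was altered or removed
        prev = entry["chain"]
    return True


keys = {"api-server": b"hypothetical-key-1", "alice": b"hypothetical-key-2"}
log = []
append_entry(log, "api-server", "scaled web to 500 replicas", keys["api-server"])
append_entry(log, "alice", "updated network policy", keys["alice"])
print(verify(log, keys))
```

Editing the text of an entry breaks its signature, and dropping or reordering entries breaks the chain, so both kinds of change are at least detected.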

Logs should be searchable and retain any meaning as they slide back in history. If you’re using an IP address as a key in a log, it’s going to be useless if those IP addresses are recycled every few hours or less. Yes, four hours ago 191.51.100.37 did something bad. I have no idea at this point in time who 191.51.100.37 is. Now that address is being used by something else. So, you need to start using something else to be able to make these logs be usable and meaningful.

In a microservice environment, that’s metadata, that’s labels, that’s commit IDs. There are all sorts of things that act as identifiers within a metadata system. They’re just not IPs, host names, et cetera. So, that metadata itself should be a key in the logs, not something ephemeral like an IP address or a host name.
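
A short sketch of why the key matters. The two hypothetical events below share one IP address because the address was recycled, yet a search by label still finds the right workload. The field names, label keys, commit IDs and the address itself (taken from a documentation-only range) are all made up for the example.

```python
# Two events from different workloads that happened to be given the same address.
events = [
    {"ip": "198.51.100.37",
     "labels": {"app": "billing", "commit": "a1b2c3d"},
     "msg": "denied connection"},
    {"ip": "198.51.100.37",
     "labels": {"app": "search", "commit": "9f8e7d6"},
     "msg": "started"},
]


def find_events(events, **labels):
    """Return the events whose labels include every requested key/value pair."""
    return [
        event for event in events
        if all(event["labels"].get(key) == value for key, value in labels.items())
    ]


by_ip = [event for event in events if event["ip"] == "198.51.100.37"]
by_label = find_events(events, app="billing")

print(len(by_ip))          # 2, the address alone is ambiguous
print(by_label[0]["msg"])  # denied connection
```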

A logging system should be able to trigger alerts based on thresholds. In a large system, you will always have a run rate or background noise level of alerting. If you don’t do something to mute those, you will go crazy responding. A good example: I was at a large bank very early on in the days of firewalls, and I got called in to help them with their firewall installation. The previous contractor had installed something called a DECtalk, which was a serial device that would read text over a serial line and speak it out. It was actually what was used by Hawking early on when he would talk. This guy wired up a DECtalk to the firewall. Every time the firewall got any policy hit, it would scream out, “Help, help. I’m under attack,” and I was in that data center, and there was nothing but this constant scream of, “Help, help. I’m under attack,” coming from the firewall.

Obviously, other than the five minutes of humor value, after which you want to take a hammer to the darn thing, the more important point was that it’s always going off, so no one’s paying attention to it anymore. So, you need to be able to set thresholds. Some events are more important than others. Some are only important when they hit a certain threshold. So, you should be able to mute and trigger alerts based on threshold events, and you should be able to integrate your logging system into larger logging systems, not create yet another logging system. There’s no use in having yet another pane of glass to look at yet another set of logs. You won’t be able to correlate the logs between different types of events anyway. So, the logging systems that you pick should all cooperate and dump into the same place where you can do correlation, et cetera.
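
Threshold-based alerting can be sketched as a counter over a sliding time window, so that background noise stays quiet and only a burst raises an alert. The class name, the threshold of three events and the 60-second window are assumptions for the example.

```python
from collections import deque


class ThresholdAlert:
    """Raise an alert only when enough events land inside a sliding time window."""

    def __init__(self, threshold, window_seconds):
        self.threshold = threshold
        self.window = window_seconds
        self.times = deque()

    def record(self, timestamp):
        """Record one event; return True if the windowed count has reached the threshold."""
        self.times.append(timestamp)
        # Drop events that have aged out of the window.
        while self.times and self.times[0] <= timestamp - self.window:
            self.times.popleft()
        return len(self.times) >= self.threshold


alert = ThresholdAlert(threshold=3, window_seconds=60)
burst = [alert.record(t) for t in (0, 10, 20)]     # three hits within a minute

quiet = ThresholdAlert(threshold=3, window_seconds=60)
noise = [quiet.record(t) for t in (0, 100, 200)]   # the same number of hits, spread out

print(burst)  # [False, False, True]
print(noise)  # [False, False, False]
```

The firewall in the story alerted on every single hit; a gate like this would have stayed silent on the background noise and spoken up only when hits clustered.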

So, with that, thank you for your attention. We’ve got a little bit of time for Q&A, but I’ll turn it back over to Andy here.