This is the third part of a four-part series discussing the microservice revolution and how it impacts the InfoSec community. This is not meant to be a definitive guide, but more of an overview that will start you on your own microservice journey. The first two parts covered an introduction to microservices and what’s changed. If you’re jumping into this in the middle, I suggest that you go back and read those two posts now. They’re not long, and I’ll just go grab a beer while I’m waiting.
You’re back – that’s good; now let’s talk about what this evolution just broke in your InfoSec infrastructure model. Sit down, take a deep breath, grab a beer, put your feet up and relax.
Declare your intentions and let the magic happen
The actual deployment of code, and the management of how it reacts to changes such as faults, configuration drift, scale events, etc., is now automated. Someone defines an intent-based model of how the system should behave, but it is the orchestration system that actually does the work. Think of this a bit like Captain Picard saying “Make it so.” He doesn’t really do anything: he just states what he wants to be done, and the crew does the work. In this case, Captain Picard is the author of the intent, and the crew is replaced by the orchestrator. That doesn’t really sound scary, does it? However, once you realize that a modern orchestration system re-evaluates the world in milliseconds to make sure it is “still so” and makes adjustments in the same timeframe if it’s not, it might sound scary. Especially when it comes to security.
Now, if that means that instances of the application code need to be replaced with a new version, scaled up or down, relocated to meet demand, or restarted after a hardware failure or even a power failure, it will happen, and it will happen literally before you can finish reading this paragraph. That is, unless part of the process is for the orchestrator to raise a ticket for a firewall reconfiguration and block while it waits for that request to complete. Not very likely. Not very pragmatic.
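If “make it so” still feels abstract, here is a minimal sketch of the idea in Python. It is not how any particular orchestrator is implemented; the names Intent, Instance and reconcile are made up for illustration. The point is only that the author states the desired end state, and a loop works out the steps by comparing that intent with what it observes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Intent:
    """Hypothetical declaration of what 'so' looks like for one service."""
    service: str
    version: str
    replicas: int


@dataclass(frozen=True)
class Instance:
    """Hypothetical snapshot of one running workload."""
    service: str
    version: str
    healthy: bool


def reconcile(intent, observed):
    """Return the actions needed to make the observed state match the intent."""
    actions = []
    good = 0
    for inst in observed:
        if inst.service != intent.service:
            continue  # not ours to manage
        if not inst.healthy or inst.version != intent.version:
            actions.append(("terminate", inst))  # faulty or wrong version
        elif good >= intent.replicas:
            actions.append(("terminate", inst))  # surplus: scale down
        else:
            good += 1
    # Start whatever is still missing to reach the declared replica count.
    for _ in range(intent.replicas - good):
        actions.append(("start", intent.version))
    return actions
```

A real orchestrator runs something like this continuously, and nowhere in that loop is there a step that waits for a human to approve a firewall change.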
The fallacy of reserving resources
Let’s say that I just spun up 10x the normal number of workloads to deal with a flash crowd responding to an offer. They came up in seconds and handled the sudden spike. Five minutes later, the “limited time offer” has expired, and the infrastructure has scaled back down to its normal size. This has all happened while you are still reading this article. Is this something your current InfoSec model handles? Now you may be saying that you will just assign a block of addresses to each application and pre-provision your firewalls to deal with this. Hmm.
Remember, the application architecture of today may not be the architecture of tomorrow. The flash crowd I just mentioned above may happen after a World Cup ad runs. However, for a different event, another fleet of microservices will need to spin up. Does that mean you ‘reserve’ resources for each possible combination of events to handle peak load? To do that you will need an obscene amount of resources (like IPv4 addresses or trust domains) reserved and unusable by other workloads for the majority of the time. This isn’t an efficient use of resources, and it certainly isn’t financially viable.
Let me give you the benefit of the doubt and say that you have cracked that code for your applications today. How about tomorrow, when your application architects decide that a different set of application policies is necessary? Then what? Now you need to redo your reservation planning all over again, and again, and again… ad nauseam. All the while making sure things are secure.
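To put a rough number on that waste, here is a back-of-the-envelope sketch in Python. The 10x spike and the five-minute duration come from the flash-crowd example above; the baseline of 100 addresses and the one-spike-per-day window are assumptions I picked purely for illustration, and average_utilization is a made-up helper name.

```python
def average_utilization(baseline, peak_multiplier, peak_minutes,
                        window_minutes=24 * 60):
    """Fraction of a peak-sized reservation that is actually in use,
    averaged over the window (one spike per window is assumed)."""
    reserved = baseline * peak_multiplier        # sized for the worst case
    busy = reserved * peak_minutes               # address-minutes during the spike
    quiet = baseline * (window_minutes - peak_minutes)  # the rest of the window
    return (busy + quiet) / (reserved * window_minutes)


# Assumed numbers: 100 addresses normally, a 10x spike lasting 5 minutes a day.
utilization = average_utilization(baseline=100, peak_multiplier=10, peak_minutes=5)
print(f"{utilization:.1%} of the reserved block is in use on average")
```

Under those assumptions, roughly nine out of ten reserved addresses sit idle, and that is for a single application with a single kind of event.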
Manual code reviews and unfounded trust in a secure perimeter
So, you’ve decided that all code deployed must undergo a thorough review. I salute your dedication to your task. However, you are now the tall poppy between the developers and deployment. This will probably not end well.
Now you may decide, rightfully, that some code (that which touches PII or PCI data, for example) needs a deeper level of analysis and more stringent controls. That’s probably reasonable. However, there are lots of bits of code that the developers will want to deploy and change regularly that do not rise to that level of scrutiny, and yet require stringent access controls as well. Guess what: they all inhabit the same infrastructure. How are you going to scrutinize everything manually?
Even if your automated code and container scanning systems catch all of the bad code that developers unintentionally bring in, there are still spear phishing and other social engineering attacks to contend with. You will have to assume that out of the tens or hundreds of thousands of containers you are hosting, there will be occasional bad ones. The question is, which ones? Since they are continually changing, answering that question is mostly a rhetorical exercise. Therefore, you have to assume that anything you have deployed may be compromised, or otherwise misbehaving. The wolves are already inside. This means that relying on a perimeter security model will not work. You have to adopt a zero trust model where you assume that no element in the infrastructure or the codebase is completely trustworthy. Trust must be something that you establish, and security must be enforced at multiple levels. Security must be embedded in your CI/CD DNA.
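As a rough illustration of what “trust must be something that you establish” can look like, here is a small Python sketch of a default-deny check between workloads. The service names, the port, and the verified flag (standing in for whatever identity check you use, such as a validated certificate) are all hypothetical. The shape is what matters: nothing is allowed because of where it sits in the network, only because an explicit rule says so.

```python
from typing import NamedTuple


class Rule(NamedTuple):
    """One explicitly allowed flow between two workload identities."""
    source: str
    destination: str
    port: int


# Hypothetical allow-list; anything not listed here is denied.
ALLOWED = frozenset({
    Rule("frontend", "orders", 8443),
    Rule("orders", "payments", 8443),
})


def is_allowed(source, destination, port, verified, rules=ALLOWED):
    """Default deny: the caller's identity must be verified AND the flow
    must match an explicit rule. Being 'inside' earns nothing."""
    if not verified:
        return False
    return Rule(source, destination, port) in rules
```

With these rules, a verified frontend can reach orders, but it cannot reach payments directly, even though both live on the same infrastructure.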
Immutability and ‘fixing’ things
Ok, so let’s say you get lucky and detect a suspect instance of a microservice. Your inclination might be to isolate it, diagnose it, and fix the issue. However, if you remember from the earlier parts of this blog series, everything assumes immutability. If you, a developer, or anyone else just ‘logs into’ the container to fix it, you will almost certainly make it worse. You will now have one ‘fixed’ instance and a number of ‘non-fixed’ instances that might not have been compromised. Now let’s say the fix also changes some other behavior. You have just introduced an ‘irreproducible’ fault into the infrastructure. You will not be popular with the DevOps team. They will probably take your ssh keys away.
If something has gone rogue, re-deploy a known-good version. You don’t fix it – you terminate it. In fact, you may just decide to upgrade all instances in the fleet for good measure. Go ahead, say “make it so” and let the system carry out your orders; let the magic happen. Just don’t try and fix it yourself. It will just go badly.
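Here is a minimal Python sketch of “terminate, don’t fix”. The Container type, the image tag in the usage note and the redeploy helper are hypothetical; in practice you would express this as a change of intent and let the orchestrator do the work. Note that nothing is patched in place: the suspect instance is dropped, and a fresh one built from the known-good image takes its place.

```python
from dataclasses import dataclass


@dataclass(frozen=True)  # frozen: an instance cannot be modified after creation
class Container:
    id: int
    image: str


def redeploy(fleet, suspect_ids, known_good_image, whole_fleet=False):
    """Return a new fleet in which suspect containers (or every container,
    if whole_fleet is True) are replaced by fresh known-good instances."""
    next_id = max((c.id for c in fleet), default=0) + 1
    result = []
    for c in fleet:
        if whole_fleet or c.id in suspect_ids:
            # Never repair c: discard it and start a clean replacement.
            result.append(Container(next_id, known_good_image))
            next_id += 1
        else:
            result.append(c)
    return result
```

Calling redeploy(fleet, {2}, "shop:1.4") swaps out only container 2, while passing whole_fleet=True rolls every instance for good measure.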
‘Well, this is great, what should I do?’
In the next and final installment, I will talk about the flip side: what security problems are fixed by this design pattern, and how you can actually leverage these characteristics to achieve a higher level of security than you have now, all while being a hero for ‘getting out of the way’ of the development teams. Trust me: this story ends on a happy note.