Micro Segmentation Container Ship 17 Aug 2017

Micro-segmentation in the Cloud Native World – Part 1

I spend a lot of my time talking with organizations that are starting their journey to cloud native, or container-based environments. Some are greenfield environments, where there is no legacy infrastructure or incumbent suppliers. Many, however, have legacy architectures and vendors they are comfortable with (or at least are in the ‘devil you know’ camp). From this latter camp, I will often hear “why can’t we just continue to use our existing network and firewalls in this new environment? It’s worked for virtual machines, why not containers?”

That was the question that I was going to address in this blog, but the more I thought about it, the more I realized there was a more fundamental underlying question, namely:

Why does anything need to change in this new cloud-native/container/micro-service world? Why can’t we just carry on with business as usual in the infrastructure?

I’ve decided I’m going to tackle this question in this blog by explaining why the existing model won’t work, and then follow it up over the next few weeks discussing what characteristics are necessary in the new infrastructure. I’ll use networking as the example, but the underlying concepts will be equally applicable across storage, scheduling, service discovery, etc.

The Container/Microservice Revolution is Fundamentally Different from the Virtualization Revolution

First, let’s consider some of the evolutions that have driven us to the new model of delivering applications, and why this is inherently different from the previous revolution of virtualization. Warning: this is quite a simplification, but it serves the purpose.

Virtualization was driven, primarily, top-down to achieve greater efficiencies in the utilization of servers. It’s an optimization, but the overall operational model is still the same: a server dedicated to a given application, operated as a complete, fully-functional platform.

In contrast, the container / micro-service revolution is being driven bottom-up by the developer community. Developers are responding to demand for more rapid development iterations and deployment ubiquity. They do so with artifacts that are self-contained but as light as possible (containers), and with a development model that is agile and continuously integrated and deployed (CI/CD). That model leverages re-use of not only internally developed code but also the wealth of external FOSS (Free & Open Source Software) components that now exist. Each bit of code is expected to perform a very constrained, well-defined function that is loosely coupled to other components in the infrastructure (micro-services).

In short, the days of monolithic apps developed in waterfall development cycles are, if not entirely disappearing, becoming the exception rather than the norm.

In order to make use of these features, and meet the demands of the business, the developers need an environment that can deploy and manage the life-cycle of those deployments. This environment must match the characteristics of the development cycle, or there will be a great disconnect between development and deployment/operations that, at best, will prevent the developers from meeting their goals, and, most likely, cause chaos, confusion, and contempt in the organization.

Open Source, Cloud, and Third Party Code Expose Security Concerns Leading to a Zero-Trust Computing Model

Another related concern is that, with the rise of open source, cloud, and SaaS, most of the code comprising a typical enterprise application is not going to be developed in-house and may not be deployed on secure on-premises infrastructure. In the case of open source, the organization will be somewhat dependent on the original author/project that developed that software to deliver secure code. While the organization could do code review on all code that is deployed, that will slow development down, and may still miss issues in the code due to the complexity of the code or the skill set of the developer. Banking on that kind of review may open the organization to importing code that is either vulnerable or potentially outright malicious.

To combat this, we are seeing more and more organizations going to a zero-trust and/or least privilege model. Instead of saying “blue can talk to blue”, they want to say things like: “only blue LDAP clients can talk to blue LDAP servers, and any other blue component should be prevented from opening an LDAP connection to a blue LDAP server (and alerts should be thrown when they do)”.
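To make that kind of rule concrete, here is a minimal sketch of it as a default-deny check. It is not tied to any product or orchestrator: the `Workload` and `Rule` types, the `color=`/`role=` label strings, and the workload names are all hypothetical, invented for illustration. The only outside fact used is that 389 is the standard LDAP port.

```python
from dataclasses import dataclass

LDAP_PORT = 389  # standard LDAP port


@dataclass(frozen=True)
class Workload:
    name: str
    labels: frozenset  # hypothetical labels, e.g. {"color=blue", "role=ldap-client"}


@dataclass(frozen=True)
class Rule:
    src: frozenset  # labels the source must carry
    dst: frozenset  # labels the destination must carry
    port: int


def allowed(rules, src, dst, port):
    """Default deny: permit only if some rule matches both ends and the port."""
    return any(
        r.src <= src.labels and r.dst <= dst.labels and r.port == port
        for r in rules
    )


# "Only blue LDAP clients can talk to blue LDAP servers."
rules = [
    Rule(
        src=frozenset({"color=blue", "role=ldap-client"}),
        dst=frozenset({"color=blue", "role=ldap-server"}),
        port=LDAP_PORT,
    )
]

ldap_client = Workload("auth-proxy", frozenset({"color=blue", "role=ldap-client"}))
other_blue = Workload("web", frozenset({"color=blue", "role=web"}))
ldap_server = Workload("ldap", frozenset({"color=blue", "role=ldap-server"}))

print(allowed(rules, ldap_client, ldap_server, LDAP_PORT))  # True
print(allowed(rules, other_blue, ldap_server, LDAP_PORT))   # False
```

Note that being blue is not enough here: the second check fails even though both ends are blue, which is exactly the difference between “blue can talk to blue” and least privilege. The alerting half of the requirement is left out of this sketch.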

An Illustrative Example

So, let’s take a hypothetical case and see how the historical model might (or might not) be applied.

Let’s say we have three applications, which we’ve called orderInput, geoLocation, and custRecord. We’ve ported their components over to a container orchestration system like Kubernetes and set them up for CI/CD deployment with auto scaling, auto failure recovery, etc. That means that any component may end up on any server, with any IP address.

The starting requirement is that all of the components in each of these silos need to communicate with one another, but not outside of their silo. So, we put each one in its own network (in this case a VLAN or VXLAN), and don’t allow traffic between the network segments. To make this work, we also need to tell the Kubernetes network plugin that there are three address ranges in use, one for each silo, and which components belong to which silo. Once we have done that, we’ve met the requirement, and all is (mostly) good. What we have had to do is take what is often a precious resource, IP addresses, and reserve them for the maximum size that each silo may grow to. Reserved resources do not lead to efficient use, but that shouldn’t be a huge deal if the number of silos is low, and the spread between lowest use and highest use in each silo is fairly tight (say 128 containers on the low end and 256 containers at the high end).
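The reservation cost can be sketched with Python’s standard ipaddress module. The parent block 10.0.0.0/16 is an assumption made up for this example, and for simplicity the sketch treats every address in a range as usable (a real network plugin may hold some back). The 128 and 256 figures are the ones from the paragraph above.

```python
import ipaddress

# Hypothetical parent block for the cluster; not from any real deployment.
PARENT = ipaddress.ip_network("10.0.0.0/16")
SILOS = ["orderInput", "geoLocation", "custRecord"]
MAX_CONTAINERS = 256  # high end from the example above
LOW_CONTAINERS = 128  # low end from the example above


def prefix_for(max_instances: int) -> int:
    """Smallest IPv4 prefix whose block holds at least max_instances addresses."""
    host_bits = (max_instances - 1).bit_length()
    return 32 - host_bits


# Carve one fixed range per silo, each sized for that silo's maximum.
prefix = prefix_for(MAX_CONTAINERS)
pools = dict(zip(SILOS, PARENT.subnets(new_prefix=prefix)))

for silo, net in pools.items():
    print(silo, net, net.num_addresses)
# orderInput 10.0.0.0/24 256
# geoLocation 10.0.1.0/24 256
# custRecord 10.0.2.0/24 256

# At the low end, half of each reserved range sits idle.
utilization_at_low = LOW_CONTAINERS / pools["orderInput"].num_addresses
print(utilization_at_low)  # 0.5
```

With three silos and a tight spread, the idle half is tolerable. The point to hold on to is that every range is sized for its peak, not its average.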

Some time later, there is a requirement for the orderInput application to check whether the customer account is in arrears before accepting a new order. So we now have a requirement that the creditCheck components of the orderInput application (let’s call these OI(cc)) be able to communicate with the creditStatus components of the custRecord application (CR(cs)). However, we don’t want any other orderInput components to be able to connect to any other custRecord components.

In the historical model, we would assign the OI(cc) and CR(cs) components specific IP address ranges and then set up one or more firewalls between the orderInput and custRecord network segments that only allow traffic between the OI(cc) range and the CR(cs) range. However, as we’ve said, cloud orchestration systems don’t normally support assigning specific addresses to specific workloads, and in fact, that concept was considered and specifically rejected when Kubernetes was originally designed. You can mimic the behavior if you really want to: Kubernetes (and similar orchestrators) have a concept of address pools or ‘networks’ that consist of a block of addresses, which are then used to assign addresses to workloads that are ‘assigned’ to that pool. In fact, this is the way we would have assigned the orderInput, geoLocation, and custRecord address ranges.

However, these systems were designed to have small numbers of large pools, not large numbers of small pools.

To support the OI(cc) to CR(cs) communication, we need to create two more pools or address ranges, one for the OI(cc) containers and one for the CR(cs) containers. We also need to deploy some firewalls between OI and CR. All traffic between OI(cc) and CR(cs) has to be directed through those firewalls, so we have to scale those to handle the traffic.

Now we have 5 address ranges.
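As a rough sketch of what that firewall rule amounts to, the check below permits traffic only when the source falls in the OI(cc) range and the destination falls in the CR(cs) range. The two /26 ranges and the sample addresses are hypothetical, picked only to make the example run, and the function stands in for a real firewall rule set.

```python
import ipaddress

# Hypothetical dedicated ranges for the two component groups.
OI_CC_RANGE = ipaddress.ip_network("10.0.3.0/26")   # creditCheck containers
CR_CS_RANGE = ipaddress.ip_network("10.0.3.64/26")  # creditStatus containers


def firewall_permits(src: str, dst: str) -> bool:
    """Address-based rule: OI(cc) range -> CR(cs) range only, everything else denied."""
    return (
        ipaddress.ip_address(src) in OI_CC_RANGE
        and ipaddress.ip_address(dst) in CR_CS_RANGE
    )


# A creditCheck container that received an address from its dedicated pool.
print(firewall_permits("10.0.3.10", "10.0.3.70"))  # True

# The same workload, if the orchestrator had handed it an address from the
# general orderInput range instead: the rule no longer recognizes it.
print(firewall_permits("10.0.0.10", "10.0.3.70"))  # False
```

The rule knows nothing about what a workload is, only where its address sits. That is why every new policy in this model drags a new address range along with it.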

Later, we get another requirement: we need to geolocate customers who are ordering certain content to ensure they are in a region where rights are available. This is going to require geoConfirm containers in the orderInput (OI(gc)) silo to talk to ipMap containers in the geoLocation (GL(im)) silo.

Now we need two more pools and firewalls between GL and OI. So, for three applications, we now have 7 address pools and two sets of firewalls. If we continue with this insanity, we end up needing pools to cover the complete cross-product of all potential policy interactions. For those of you keeping score at home, that’s N². For 100 functions, each of which might need to talk to any other function, we end up with 10,000 address ranges, keeping in mind that each of them needs to be sized for the maximum number of instances that may exist at some point in time.
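For anyone who wants to check the score, here is the arithmetic as a small sketch. The function names are made up for this example, and the 256-address pool size is an assumption borrowed from the earlier silo sizing; it is not a figure given for these per-function pools.

```python
def pools_needed(silos: int, cross_silo_links: int) -> int:
    """One pool per silo, plus two dedicated pools (one on each side)
    for every cross-silo interaction that has to be isolated."""
    return silos + 2 * cross_silo_links


def worst_case_pools(functions: int) -> int:
    """Upper bound used in the text: one range per cell of the N x N cross-product."""
    return functions ** 2


print(pools_needed(3, 0))  # 3: the three silos
print(pools_needed(3, 1))  # 5: after OI(cc) -> CR(cs)
print(pools_needed(3, 2))  # 7: after OI(gc) -> GL(im)

print(worst_case_pools(100))  # 10000

# Assumption: every pool is reserved at 256 addresses, as in the silo example.
ASSUMED_POOL_SIZE = 256
print(worst_case_pools(100) * ASSUMED_POOL_SIZE)  # 2560000
```

Under that sizing assumption, 100 functions would tie up more than two and a half million addresses, most of them idle at any given moment.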

If you need further convincing, let’s assume some subset of the OI(cc) nodes also need to make a call to another service in the GL cluster. Now we need to install firewalls between OI and GL and we need to divide the OI(cc) pool into two pools.

.    .    .

By now, I hope it’s obvious that the historical model just isn’t tenable in a large-scale micro-service environment where you want (or need) to implement zero-trust or least-privilege protections.

However, there are solutions out there that will work. We’ll cover those in the next blog in this series.

Please feel free to post questions here or on Twitter @liljenstolpe.