Michael Kopp: Welcome everyone to today's webinar, Kubernetes: Anatomy of the Shopify Attack and How to Defend Your Infrastructure. Our speaker today is Garwood Pang, a security researcher at Tigera. He previously spent around three years with GE Digital as a vulnerability researcher for industrial control systems, and around three years at Fortinet protecting IT networks, and that gives him the experience to talk about this subject today. Without further ado, I will hand it over to our speaker, Garwood.
Garwood Pang: All right, thank you Michael. Glad to be here. What are we doing today? We'll walk through a possible attack scenario that I've put together based on high-profile breaches, bug reports, and blog posts. We will look at each stage of the attack and see the different ways we can mitigate it or prevent it from happening again. We will also see why a zero trust approach would be beneficial in this situation. Looking at several high-profile attacks and breaches, the most common attack vector is something called a server-side request forgery attack, also known as an SSRF.
From here, the attacker can leverage the SSRF to access internal APIs, cloud resources that are connected to the instance, and the cloud provider's metadata service. With access to the metadata service, the attacker can retrieve the cloud identity and access management role credentials, and finally, with the stolen credentials, the attacker can impersonate the instance and gain access to the Kubernetes cluster.
What is a server-side request forgery attack? It is an abuse of the functionality of a server to read or update internal resources. Usually it involves modifying a URL or a form, which the server will then process, making a request on your behalf. This may sound a little confusing right now, but once I show it to you in the demo, it should make more sense.
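As a quick sketch of the pattern (this is my own hypothetical handler, not the demo apps' actual code), the vulnerable shape looks like this:

```python
import urllib.request

def screenshot_target(user_supplied_url):
    """Hypothetical vulnerable handler: fetches whatever URL the
    user submitted, with no validation of the destination."""
    # The request is made BY THE SERVER, so it originates from inside
    # the server's own network: internal-only services (or the cloud
    # metadata service) become reachable through it.
    with urllib.request.urlopen(user_supplied_url) as resp:
        return resp.read()
```

Because the server, not the user's browser, performs the fetch, "which URLs can I type in" turns into "which hosts can the server reach", and that is what the rest of the demo exploits.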
The metadata service is used by cloud operators to retrieve info about a running instance. It usually provides the hostname, IP address, the IAM role credentials, and other useful data to help manage and scale instances. Accessing the metadata service is actually quite simple: you just have to query, from within your instance, the 169.254.169.254 IP address on both AWS and GCP, or the domain metadata.google.internal on GCP.
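For reference, the addresses involved can be written down like this (a minimal sketch; the v1 path is the standard GCE one):

```python
# Well-known link-local address of the metadata service on both
# AWS and GCP; only reachable from inside an instance.
METADATA_IP = "169.254.169.254"

# GCP additionally exposes it via this internal DNS name.
METADATA_HOST = "metadata.google.internal"

def gce_metadata_url(path):
    """Build a URL under GCP's current (v1) metadata API."""
    return f"http://{METADATA_HOST}/computeMetadata/v1/{path}"
```

For example, `gce_metadata_url("instance/hostname")` points at the hostname entry we query later in the demo.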
In Kubernetes, the interesting part is that every single pod actually has access to this metadata service by default. This is because, for example, GKE, Google's managed Kubernetes service, is still running on GCP, and every instance in GCP is able to access the metadata service. The same applies to AWS and their EKS offering, because EKS runs on EC2, and EC2 has its own metadata service. It is also important to note that as the operator of your cluster or your cloud, you can add your own custom attributes and data into the metadata service.
The cloud identity and access management service is another service provided by the cloud providers. The IAM service manages access to cloud services and resources by mapping role and group allow and deny permissions to the different cloud provider offerings. This means that when you create a new GCE or EC2 instance, you can assign a role with specific permissions to that instance, and the instance can then access those other cloud resources. For example, you can allow your EC2 instance to talk to S3 buckets, load balancers, or even other EC2 or Compute Engine instances.
Well, if you have access, you can grab these credentials from the metadata service and someone can reuse them, but the catch is that these credentials are rotated really frequently, so you can't just keep them. Compare that with what some people usually do: they store a static AWS access key on the instance, or use it as an environment variable. As I said, the role-credential approach is better than using static keys or storing them in environment variables, because you can actively revoke, change, or log all the permissions on the fly.
Our setup today is a very basic GKE cluster running GKE 1.14.7. I've enabled the legacy metadata endpoint for this demo and I've also enabled the Intranode Visibility setting. We're going to install Tigera Secure 2.5.1 and we are going to use Calico as the CNI, and other than that everything is pretty much default.
In our demo we're going to have three pods with three services: a Shop-App, a View-App, and a Screenshot-App. They're all deployed in the default namespace. There's no network policy applied at the moment, so everything is accessible by everything else. All of them can talk to the internet, to each other, and to anything else.
Let's take a quick look at what our Shop-App actually looks like. Here is my Shop-App. It's very, very basic, as we can see. What we can do here is go through the different stores that I've already created. We go to store A and we can see it sells A stuff, very basic. Same for store B.
Now, there's this little form where we can add our own store. Let's try creating a store named C that's going to sell C stuff, and once you click submit it shows you the new shop that you have created. Now that we know this, we can start to poke around and see what kinds of issues this app actually has.
The first thing I'm going to do is open my terminal with the next commands that we're going to use. The first thing we're going to check is whether there's any sanitization on this form. The most basic thing you can always do is copy and paste in an HTML tag, say an H1 header. This tests whether they are sanitizing anything, or whether the H1 tag actually gets processed, and once I submit it we can see the text gets rendered bold. This site is actually pretty vulnerable: we can inject our own HTML tags into this website.
Now that we know that, we can start doing some redirections. By itself this might seem pretty useless, but it's actually kind of dangerous, because, for example, we can put in a redirect to a malicious webpage. Then if someone comes to your shop they will get redirected to the malicious webpage, and then they may get attacked.
Now, let's take a look at the View-App. This is the second site. Similar to before, it's very, very basic. All we can do is add a new site to our list of sites for sale, and we will then see it at the bottom. Let's try to add a new site. We're going to add back our original site A, and when I click submit, what happens is that the View-App talks to the Screenshot-App, which then goes to our shop A and takes a picture of it.
Here we can see it took a picture of our site. That's pretty interesting. We're going to add our site B and the bot will come and take a picture of site B. Now that we have this, which is kind of interesting, we might think, "Hey, let's see if the redirection works." If I remember correctly our redirection store is site F, and when I click submit, hey, look. Interesting, right?
Before we go further, since we know that we can control where the Screenshot-App goes, why don't we point it at, for example, Google? We can change this to www.google.com, but when we submit it, it actually blocks us from using this to access external URLs. We can't just skip the Shop-App and get the Screenshot-App to go and grab Google directly; there is some sort of mitigation to prevent you from accessing external resources.
Back to the slides. Let's review what happened. The View-App sent a request to the Screenshot-App to take a screenshot of the target's front page. The View-App can only access internal websites, but since we can control where the Screenshot-App goes via the shop redirect, our original client-side redirect becomes an SSRF.
Now that we have this, we can go for the next step. As I've said before, in a lot of recent breaches and reports, people like to go straight for the metadata service, because it has a lot of useful information in it. We're going to try to do that too. On the right, in the terminal, you can see the script that we're going to run next: we're going to redirect to the metadata service.
We're going to do the same thing as the redirect to Google, but instead we're going to go to metadata.google.internal/computeMetadata/v1 and try to hit the default service account token endpoint (service-accounts/default/token). Now we go back to our site, where we create a store called store G that redirects to the metadata service, and when I click submit we can see that we land on the site G store very briefly, and then once it hits the metadata service it shows this 404 page.
The reason why it's a 404 is because on my laptop there's no such thing as the metadata service; I am not on the GCP infrastructure or network, so hitting that address from my laptop will not do anything. But when we go to our View-App, it will tell the Screenshot-App to go to the store. Let's go again. We can see that when the Screenshot-App goes to this page, we actually see something different from what we just saw.
We see that there's actually a Google reply saying that we have a 403 error, telling us "Your client does not have permission to access this URL," because we're missing the Metadata-Flavor: Google header. This is what Google has done to prevent these kinds of attacks.
A lot of times when you have an SSRF you do not have header control, and when the service requires this header it blocks the attack from happening. But one way to bypass this is to use the previous version of the API. As you can see here on the right, instead of going to computeMetadata/v1 we can go back to v1beta1, and back when they had this v1beta1 version they didn't require this special header, so it works.
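The bypass comes down to the path and the header. A sketch of the two request shapes, using Python's urllib (building the requests only; actually sending them works only from inside GCP):

```python
import urllib.request

BASE = "http://metadata.google.internal/computeMetadata"
TOKEN_PATH = "instance/service-accounts/default/token"

# Current API: the service answers 403 unless this exact header
# is present, which an SSRF usually cannot set.
v1_req = urllib.request.Request(
    f"{BASE}/v1/{TOKEN_PATH}",
    headers={"Metadata-Flavor": "Google"},
)

# Legacy API: no special header required, which is exactly what
# makes it reachable through a header-less SSRF.
v1beta1_req = urllib.request.Request(f"{BASE}/v1beta1/{TOKEN_PATH}")
```

The only attacker-controlled difference is swapping `v1` for `v1beta1` in the path, which is why disabling the legacy endpoint matters.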
Let's go to site H. We're going to create a new store called site H, replacing our previous commands with the new one, and see what happens. When I click submit, of course, the redirect doesn't reach anywhere, because we're not part of the network. But since our Screenshot-App and View-App are, we can use our app, and when we click submit the Screenshot-App follows the redirect, goes to the metadata service, hits the v1beta1 endpoint of the API, and is able to access the data. Here you can see we have the access token. With that, the attacker has retrieved the cloud identity and access management role credential token.
Heading back to the slides, we can briefly go over what we just did. Oh! Sorry about that, we're actually going to continue on and step back later. Now that we have this token, what we can do is look around and see what other interesting stuff is inside this metadata service. To do this I'm going to replay the commands that I've prepared. We're going to do the first part and look around.
The first thing we're going to look at is the metadata service itself. The way I'm going to do this is to kubectl exec into my Screenshot-App and then curl the metadata service directly, instead of using our website where we would have to create a store, set a site for sale, and then wait for the reply. We're going to directly hit the endpoint and see what's there.
Here you can see that when we hit the metadata service, let me quickly check, yeah, when we hit the computeMetadata/v1 API we get various different folders. From there we can actually get the project name of the cluster. We can see we're in Tigera Secure Research, which is a project that we made specially for this demo.
Next, we're going to list out all the instance contents available on the metadata service, and when we hit it we can see a lot of interesting stuff: attributes, description, hostname, network interfaces. All of these are available to you once you can access the metadata service.
The first one we're going to look at is the hostname. That obviously tells us where we are once we got into the cluster: we're in GKE, my Garwood demo. Next, we're going to look at the network interfaces. Once we hit the network interfaces we can see some very interesting stuff, like the external IP. This is the external IP of the node that's running the Screenshot-App. Now we know the IP address of the Screenshot-App's node, and we can ping it or start doing more to try to access this node.
We're going to grab the service account token like we did before, but now that we're on the command line we can parse out exactly the parts that we want. Here you can see we get the same kind of reply as the one on the left, but the token is actually different; if you look at the last few digits, they've changed. As I said before, these credentials rotate really quickly to prevent someone from finding the token, grabbing it, and reusing it forever. Because they rotate, an attacker has to keep fetching fresh tokens to do anything malicious with them.
Now that we have the token and the project name, we can see what kind of permissions we have. First we're going to grab the token again along with the project ID. We're going to do the same thing, exec into the pod and then do a curl, but this time with a bit more scripting at the end to pull out just the relevant parts that we want. First we grab the token, and then we grab the project.
Now that we have the token, we can talk to Google's own API and use it to check what kind of permissions this token has. Here we can see, after hitting the Google APIs endpoint, that our scope is only userinfo.email and cloud-platform. If this credential had more permissions, you would see Compute Engine or Storage scopes, and with those permissions you could use this credential to spin up more instances, or access all the storage buckets and take a look at those. Unfortunately for us, this token doesn't have much. As I said before, you can always put data into the metadata service if you have the permissions.
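That scope check can be replayed offline. The JSON below is fabricated to mimic the shape of the reply from Google's tokeninfo endpoint for a token like ours (real values differ):

```python
import json

# Fabricated example response, shaped like what the tokeninfo
# endpoint returns for the token we stole in the demo.
sample = json.loads("""{
  "issued_to": "1234567890",
  "scope": "https://www.googleapis.com/auth/userinfo.email https://www.googleapis.com/auth/cloud-platform",
  "expires_in": 1732
}""")

# Scopes come back as one space-separated string.
scopes = sample["scope"].split()

# Scopes that would let an attacker spin up instances or read
# buckets are the ones to worry about.
dangerous = [s for s in scopes if "compute" in s or "devstorage" in s]
```

With only the two scopes above, `dangerous` stays empty, which matches what we saw in the demo: the token can identify itself but cannot pivot into Compute or Storage.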
Here, in this command, we POST to the Google APIs and try to inject a basic test value. The reason you would do this is that some people like to put their SSH key into their instance metadata so they can do passwordless SSH login. But here we can see that we actually have no permission; everything is getting a Forbidden response. This token is pretty useless. With that in mind, we're going to continue looking around, because the token we found is not very useful; we need to see what else we can get.
Here I'm going to hit the instance attributes part, and we can see there's a lot more very interesting information: the cluster location, the cluster name, and at the bottom there are very interesting entries called Kube-ENV, Kube-Labels, and Kubelet-Config. Of course, we're going to go and look into them.
First we're going to look at the Kube-Labels, which don't have much information in them. All of these files are basically what GKE uses to spin up the nodes; once a node is spun up, it uses these credentials and files to talk to the master node and join the cluster.
Here is the Kubelet-Config, which is, as I said, the Kubelet's configuration file. There's a lot of interesting stuff you can look at, like the memory and CPU settings. And lastly, the Kube-ENV file is the most important file we can find. You can see there's a lot of information spat out all over my terminal. Scrolling back up, we can see there are a lot of parameters to pick apart.
The first thing is the CA cert, the certificate authority certificate. We can see our cluster name. We can scroll down and see a Kubelet certificate, then the Kubelet key, and another big thing we want is the Kubernetes master name, which is the IP address of the master node. With that in mind, we now have enough information to talk to the Kubernetes API server.
We're going to head back to the slides and review what we just did. We queried the metadata service, but it was blocked by Google's mitigation, because the current implementation of the API requires the Metadata-Flavor: Google header or the request will be blocked. Luckily, in our case the old metadata API was still available, and it does not need this special header. With that, we were able to access the metadata service.
Google has since disabled this old API by default, which is why I had to manually enable it, and this old metadata service API will eventually be removed entirely. We were able to grab the token using this API, but looking at the permissions, we saw it was pretty useless. If there were more permissions, as I said before, the attacker could list all the buckets and sync them elsewhere, or provision new instances and install crypto-miners, for example.
If there are enough permissions, you can even give yourself more roles and grant yourself more privileged access. After that, we continued looking around and found several interesting Kubernetes files: the Kubelet-Config and the Kube-ENV file. Within the Kube-ENV file we got the Kubelet credentials and the IP address of our master node, which is the Kubernetes API server.
Now we're going to see what we can do with the credentials that we just got. The first thing we're going to do is pull the different sections we want out of the Kube-ENV file. The primary parts we want are the Kubelet certificate, the Kubelet key, and the certificate authority certificate. To do this, same as before, we're going to curl the metadata service, do some scripting to pull the different sections out, and then do a Base64 decode, because by default the Kube-ENV file stores everything in Base64. You can see in the top right that this is Base64.
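The pull-and-decode step can be sketched like this; the Kube-ENV snippet below is fabricated stand-in data (real certificate values are long PEM blobs), but the key names match what the demo extracts:

```python
import base64

# Fabricated stand-in for the real Kube-ENV file: the certificate
# entries are Base64-encoded, the master name is plain text.
kube_env = """\
KUBERNETES_MASTER_NAME: 203.0.113.10
CA_CERT: {ca}
KUBELET_CERT: {cert}
KUBELET_KEY: {key}
""".format(
    ca=base64.b64encode(b"---fake ca---").decode(),
    cert=base64.b64encode(b"---fake cert---").decode(),
    key=base64.b64encode(b"---fake key---").decode(),
)

def field(name):
    """Pull one 'NAME: value' entry out of the Kube-ENV text."""
    for line in kube_env.splitlines():
        if line.startswith(name + ": "):
            return line.split(": ", 1)[1]
    return None

# Decode the credential material; the master address is used as-is.
ca_cert = base64.b64decode(field("CA_CERT"))
kubelet_cert = base64.b64decode(field("KUBELET_CERT"))
kubelet_key = base64.b64decode(field("KUBELET_KEY"))
master = field("KUBERNETES_MASTER_NAME")
```

In the demo this same extraction is done with curl and shell scripting; the logic is identical.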
Next, we're going to take a look at the files and see what they look like. You can see the first one is the certificate authority certificate, then obviously the private key, which is the Kubelet key, and lastly the Kubelet certificate. We also want to grab the cluster master IP address so that we know who to talk to. Now that we have all three pieces of data, we can try to talk to the Kubernetes API server. But first I'm going to have to remove my current credentials, because my current credential has admin access, which an attacker is not supposed to have.
I'm going to remove my Kube-Config, so we can no longer talk to the cluster properly. If I do a kubectl get, we can see that we get the standard connection error. Now we know we don't have the credentials anymore, and we can do a kubectl config and put in our new set of credentials. To do this you have to pass in the specific flags: the client certificate is the Kubelet certificate, the client key is the Kubelet key, and the certificate authority is the CA cert. Then we try a kubectl get, and we can see that we actually get an error.
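Those flags amount to writing a kubeconfig roughly like the following; the file names, user name, and master IP here are placeholders for the material decoded out of Kube-ENV:

```yaml
apiVersion: v1
kind: Config
clusters:
  - name: stolen
    cluster:
      certificate-authority: ca.crt      # decoded CA_CERT
      server: https://203.0.113.10       # KUBERNETES_MASTER_NAME
users:
  - name: kubelet
    user:
      client-certificate: kubelet.crt    # decoded KUBELET_CERT
      client-key: kubelet.key            # decoded KUBELET_KEY
contexts:
  - name: stolen
    context:
      cluster: stolen
      user: kubelet
current-context: stolen
```

With this in place, kubectl talks to the cluster as the Kubelet identity rather than as my admin user.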
But you can see this error is different from the one we saw before. This one actually tells us that we are forbidden from accessing the pod resources, and that our user is Kubelet. What happened is that Google has since fixed these credentials that you can grab from the metadata service; they no longer have many permissions, and that's why we no longer see anything interesting with them.
Heading back to the slides, we can do a quick review of what just happened. We pulled apart the four different parameters that we need: the master node IP address, the Kubelet key, the Kubelet certificate, and the certificate authority certificate. This allows us to talk to the Kubernetes API server remotely. Previously these credentials let you exec into the pods and do more, but Google has since reduced the permissions.
Luckily for me, none of the credentials we stole was actually useful. Now, let's take a look at what we can actually do at each stage of the attack. For the SSRF, this type of vulnerability is actually quite hard to detect, because it is usually an implementation or design flaw in the application logic itself. It is only when it is exploited, or when the vulnerability is found, that you can really get IDS and IPS signatures, and when that happens you should usually fix your application logic.
Another thing we can do is use network policies to reduce the attack surface. There are actually different ways we can restrict communication between the pods. For the Screenshot-App we have three different ideas. We can make it so that the Screenshot-App can only talk to the Shop-App. This means that if our Shop-App has images sourced from elsewhere, the Screenshot-App will not be able to reach them under this policy, and we would then have to add a separate policy to allow it to talk to the storage server, or something like that.
For the View-App, I showed you that the View-App actually blocks external URLs, if you remember. This rule is enforced by the application logic only. This means that if the pod gets compromised, the attacker can still reach the internet and other pods, since nothing besides the application logic prevents it. In order to have defense in depth, we should add a network policy to the View-App to block egress to external resources.
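As a sketch, a standard Kubernetes NetworkPolicy for that idea could look like the following; the pod label is an assumption, since the demo's actual labels aren't shown:

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: view-app-internal-only
  namespace: default
spec:
  podSelector:
    matchLabels:
      app: view-app          # assumed label
  policyTypes:
    - Egress
  egress:
    # Only destinations that are pods in some namespace are allowed;
    # traffic to external IPs (including 169.254.169.254) matches no
    # rule and is dropped.
    - to:
        - namespaceSelector: {}
```

Even if the application-level URL check is bypassed, the network layer now refuses the external connection.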
The recommended way is to introduce network policies from the start: do a default deny-all first, and then slowly enable the different policies so that the things that need to talk to each other still work. This is the least-privilege model you want to follow. For the metadata API, we want to make sure that the legacy endpoint is disabled. Unfortunately, on AWS there's actually no special header, so you might have to look deeper and see what other ways you can prevent these kinds of attacks.
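The deny-all starting point described here is the standard default-deny policy, which in plain Kubernetes YAML looks like this:

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny
  namespace: default
spec:
  podSelector: {}            # selects every pod in the namespace
  policyTypes:
    - Ingress
    - Egress
  # No ingress or egress rules: nothing is allowed until further
  # policies explicitly open up the traffic that is needed.
```

From this baseline, each allow policy you add is a deliberate, auditable exception.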
On GKE, pretty reasonably, I think they're going to permanently remove this legacy endpoint. I actually got an email the other day, while setting up this demo, from Google telling me that it is going to get removed. Soon it will be permanently removed, so you'll be fine. We also want the token to have minimal permissions, like the one we saw in this demo; on GKE this is the service account.
Using metadata concealment on GKE is also recommended. What this does is hide the sensitive information, for example the Kube-ENV file. The way it works is by putting a proxy between the metadata service and your cluster that filters out anything sensitive. It is also recommended to use Workload Identity on GCP, and kube2iam or kiam on AWS. These allow finer-grained control, letting you set individual or group permissions for pod resources within your Kubernetes cluster.
This means that you can give specific pods access to certain resources, like storage buckets. Right now, the token we saw basically applies to the whole cluster, whereas if you use these tools, you shrink the permissions down so that individual pods or individual services have only certain permissions.
For the kubectl part of the attack, where I pulled all the different data and then tried to access the Kubernetes API server, it really depends on what kind of credentials the attacker was able to get. If they were able to grab the admin credentials, for example, then they have full control using kubectl, and you can't really do anything about it. But in our case the attacker, me, grabbed credentials that were pretty useless. What we want to do is audit the role-based access controls.
We want to make sure the names and bindings between the pods and services are correct, and we want to check all the Role Bindings for our Service Accounts, especially the bootstrap account. We also don't want to have our API server available to the internet. By default, when you set up a GKE cluster, you get a public IP, which means your API server is remotely reachable, like we just saw. We want to set it to private and then use something like a bastion host, or a jump host, to connect to the API server securely.
Lastly, we always want to monitor our Kubernetes API server and use the GKE audit logs to see if there's anything suspicious going on. Next, we're going to look into Tigera Secure and how we can use Tigera Secure or Calico to set up some basic network policies. Tigera Secure has a nice GUI that you can use to set up network policies. I can show you how it works.
Okay, before that, I can actually show you the flow logs that Tigera Secure also provides. From while I was doing my demo, we can already see the different flows that were logged. We can see, for example, the Screenshot-App talking to the Shop-App, the Screenshot-App talking to kube-dns, and the View-App talking to the Shop-App. And you can see these two flows where the Screenshot-App is talking to the public network, which is the external traffic from when we were reaching Google, for example. With that in mind, we can now create some network policies.
We're going to go to the Tigera Secure policy board, which is where we can create a new policy. The first policy we're going to add is a very basic one: we're going to try to block egress. To block egress, we first want to set our scope to namespace, and since our Screenshot-App's namespace is default, we're going to put it as default.
Next we want to make it apply specifically to the Screenshot-App, so we set the selector to app == screenshot-app. Then we want to set it to block egress: we set the type to egress and click add egress rule. First, we want to allow any connections with any protocol to any namespace. This will let the Screenshot-App talk to anything that's within our Kubernetes network.
Next we add another rule to deny everything else. With that, we click apply on the right, and now we have a new network policy. This GUI basically creates the YAML file for you if you're using Calico. Next, we go back to our command prompt to see if this actually works. We're going to do a kubectl exec into our Screenshot-App deployment and then curl Google, and we can see that we're not connecting; it just keeps loading.
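The Calico YAML the GUI generates for this policy would look roughly like the following (the policy name and pod label are my assumptions):

```yaml
apiVersion: projectcalico.org/v3
kind: NetworkPolicy
metadata:
  name: block-egress
  namespace: default
spec:
  selector: app == 'screenshot-app'   # assumed label
  types:
    - Egress
  egress:
    # Allow any protocol to endpoints in any namespace, i.e.
    # anything inside the cluster...
    - action: Allow
      destination:
        selector: all()
        namespaceSelector: all()
    # ...then deny everything else: external IPs, the internet,
    # and the metadata service.
    - action: Deny
```

Rule order matters in Calico policies: the first match wins, so in-cluster traffic hits the Allow before the catch-all Deny.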
To make sure this actually works, we can also try to hit our Shop-App by accessing the shop service and going to site A, and we can see we actually got a reply, which is our site A. So we know we blocked egress to external resources like Google, but we are still letting connections flow between the pods and services that are inside the Kubernetes cluster. It still works. Lastly, we of course want to make sure that we can block metadata.google.internal, and we can see it behaves just like google.com: it never loads and eventually hangs up. We blocked egress.
Next, we want to try a second, different method. First I'm going to delete the policy, and now what we want to do is create something called a network set. I've already set it up: as you can see here, we have a label, and our IP is 169.254.169.254, which is the metadata service IP address. Now that we have this network set, we can create a policy to block that specific network set. I click add policy, name it block metadata, and here we set our scope. We want to apply it to the Screenshot-App again, set it to egress only, and then add an egress policy.
The first rule we want to add is the deny to our network set. We set that to all protocols, and for our endpoint we pick the color label and choose red. Then we save, add another egress rule which allows everything else, apply it, and now we can take a look at it again.
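In YAML form, the network set plus the policy referencing it might look like this; the names are placeholders, and the color: red label mirrors the one picked in the GUI:

```yaml
apiVersion: projectcalico.org/v3
kind: GlobalNetworkSet
metadata:
  name: metadata-service
  labels:
    color: red               # the label the policy below selects on
spec:
  nets:
    - 169.254.169.254/32     # the metadata service address
---
apiVersion: projectcalico.org/v3
kind: NetworkPolicy
metadata:
  name: block-metadata
  namespace: default
spec:
  selector: app == 'screenshot-app'   # assumed label
  types:
    - Egress
  egress:
    # Deny any protocol to anything labelled color == 'red',
    # i.e. the network set above...
    - action: Deny
      destination:
        selector: color == 'red'
    # ...and allow everything else.
    - action: Allow
```

The advantage over the first policy is precision: you could later add more bad IPs or CIDR ranges to the set without touching the policy itself.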
We can go back and try to reach metadata.google.internal. I hit enter and it's loading, but we can see that it's not getting a reply. We can try the next one, where we hit the shop service, and we instantly get a response back, so our site still works. The last thing is to try hitting Google, and we can see all this text, which is the Google website loading. Now we have specifically blocked only the metadata service.
Back to the slides to recap what we did in Tigera Secure. We saw the network flows between the different pods and services using the Kibana service. We then used the network policy GUI to create a policy to block egress, and double-checked it by exec'ing into the Screenshot-App and trying to reach Google and the metadata service. We then went back to the policies and added a new policy to block the metadata service using a network set.
In conclusion, today I demonstrated a possible attack scenario. We walked through the different points of the attack and discussed different approaches to prevent them. I demonstrated how Tigera Secure and Calico can be used to create network policies to reduce the attack surface. Although we couldn't do much with the credentials we took today, in a future webinar I will show you what we can actually do to pivot and get more access.
Here are some references that I found and used to create this demo. You can always reach me at Garwood@Tigera.io, and that's basically it for me.
Michael Kopp: We have a couple questions and about four minutes left. Garwood? Do you want to take some of those questions?
Garwood Pang: Sure!
Michael Kopp: Okay. It seems we have about three or so. I guess the first question is referring to the Screenshot-App, and it asks why the XSS became an SSRF. This was a little while ago, but hopefully everyone can remember.
Garwood Pang: Yeah, basically, our Screenshot-App acts as a browser. In our case, when I tried to do the redirect to the metadata service, it didn't really work on my laptop, but since the Screenshot-App is deployed in the cluster, in a pod, it is actually able to reach the metadata service. Our original cross-site scripting redirect was client-side and didn't work, but since we were inside the cluster and the pods, the attacker is able to reach the internal APIs, and that means we have a server-side request forgery attack.
Michael Kopp: Okay. Let me see, I'm trying to distill this question down. I guess that what they're asking is about network sets, and can you explain a little more about what network sets are?
Garwood Pang: Sure, sure. Let me show you; I'm going to present my screen share again.
Michael Kopp: Yeah, yeah. Go ahead.
Garwood Pang: I'm going to go back to the Tigera website. What a network set does, essentially, if I create a new network set, is group different IP addresses that you find interesting; in our case, the metadata service. Sometimes there are malicious or suspicious IPs that you want to include. Once that is done, we have a set that we can label, say, suspicious, and then we can block it like we just did. We set up one specific IP here, but we can do a list of IPs and domains, label all of them, and say "Okay, these are all bad IPs, we're going to block them all."
Garwood Pang: Let me head back to the slide.
Michael Kopp: Okay, great. Well, I think that is all the time we have; that was a full hour. Garwood, thank you so much for your presentation, and we look forward to seeing more presentations on security vulnerabilities and how to protect yourself against them.
Michael Kopp: Once again, thank you all for attending. If you have any questions or would like a copy of the slides, please contact Contact@Tigera.io, and if you'd like to suggest any webinar topics, we'd love to hear from you at that same email address. Let us know what topics you'd like to hear about, especially, since we're going to have Garwood on more, what kinds of vulnerabilities you would like to see looked at.
Michael Kopp: Anyway, I thank you so much for attending and we will see you at our next webinar. Thank you and goodbye.
Garwood Pang: Thank you!