
Trust, TensorFlow and the Cloud

People are usually honest. In fact, the world keeps moving forward in spite of all the scandals, fraud and corruption because an overwhelming number of us believe in each other and trust each other. It's this same inherent trust that makes us trust our cloud providers. And for the most part, we are not going to be wrong. Trust is a good thing.

For our cloud providers, we trust that they aren't reading the memory pages of our containers. We trust that they aren't injecting untrusted arbitrary code during execution. We trust that they aren't snooping on our workloads just to launch a competing service on their platform. For most of us, it's not the cautious trust it should be by default, but rather complete, blind trust.

System Enforced Trust

Photo by Benny Meier on Unsplash

Most people understand and respect the lines on the road. They understand that crossing to the other side when you are not supposed to violates the trust that others place in you and in the system, and it can and does lead to very unpleasant consequences. However, history teaches us that greed and short-sightedness can paralyze even the most honest among us.

When a system relies on the discretion of its participants to keep it trustworthy, it's only a matter of time before someone breaches that trust.

Photo by Radio Slavonija

That's why we sometimes design systems that are inherently secure and don't depend on the participants to enforce the trust. Instead, the trust is enforced by the very foundation the system is built upon, just like a concrete divider that enforces a strict separation between opposing traffic.

Questions about the integrity of cloud providers stem from their ability to be 'root' on the systems where our workloads run. We don't have much say in how these underlying systems are provisioned. We traded control of the systems for ease of use. We fell in love with the trojan horse so much that we don't want to believe someone might be tempted to hide inside it.

Memory Pages and Encryption 

Upcoming technologies like Intel's TME and MKTME, IBM's Ultravisor, and AMD's memory encryption (SME/SEV) enable encrypting the memory pages used by virtual machines.

Image by IBM
The above image gives you a fair idea of how it's supposed to work with IBM's solution; others follow a similar route. To oversimplify for brevity: the memory pages used by the virtual machine are encrypted from the host system's point of view. So even being 'root' on such a system will not allow a malicious user to read what's executing inside that virtual machine.
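Whether a particular Linux host actually exposes such support can be probed from userspace. The paths and flag names below are my assumptions for AMD and Intel hosts (the post itself doesn't specify them), and the probes degrade gracefully when the feature is absent:

```shell
# Probe for memory-encryption support; these sysfs/procfs locations are
# kernel- and vendor-specific assumptions, not guarantees.
sev=$(cat /sys/module/kvm_amd/parameters/sev 2>/dev/null || echo "unknown")
echo "AMD SEV (kvm_amd) flag: $sev"

# CPU feature flags: 'sme'/'sev' on AMD, 'tme' on Intel parts that expose it.
flags=$(grep -m1 -oE '\b(sme|sev|tme)\b' /proc/cpuinfo 2>/dev/null || echo "none found")
echo "Memory-encryption CPU flags: $flags"
```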

How can cloud providers enable this?

Since they own and provide the infrastructure, they will have to procure systems that support these memory encryption technologies. They will also have to provision the software stack to make all of this work seamlessly with existing workflows like Kubernetes. Both the hardware and the software in this domain are at such an early stage that none of the cloud providers offer such a service (or such trust, for that matter).

Try it out 

But that doesn't mean we cannot get our hands dirty with this. Let's say we want to serve our state-of-the-art TensorFlow model from our cloud provider, and we want to make sure the model is not snooped on by anyone, including the cloud provider. Let's put on our 'cloud provider' hat and provision the infrastructure.

Now that we know we can encrypt the memory pages of a virtual machine, shouldn't we run our containers inside such virtual machines for those ultra-critical workloads? There is a container runtime that lets you run a container inside a virtual machine seamlessly: Kata Containers.

Kata Containers

Image by Kata

Kata Containers can interface with Kubernetes via containerd-cri. So essentially, with the magic of some annotations (more on that later), you can make sure that pods requiring extra workload isolation are launched inside their own virtual machines.
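For the containerd-cri of that era, the annotation in question looks like the fragment below; this is my reading of the generic untrusted-workload mechanism, and containerd then maps 'untrusted' pods to whichever alternate runtime (Kata here) it has been configured with:

```yaml
metadata:
  annotations:
    # Ask containerd-cri to run this pod with the configured
    # untrusted-workload runtime (Kata), i.e. inside its own VM.
    io.kubernetes.cri.untrusted-workload: "true"
```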

See how you can install Kata on x86 or IBM Power Systems.

Containerd 

Once installed, you will have to let containerd know about it. This is done by starting containerd with a custom configuration file.

Here is the containerd configuration that worked for me. You will have to start containerd using

containerd -c /path_to_file/containerd_config
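I don't have the author's exact file, but a minimal sketch of such a configuration for containerd's CRI plugin of that era might look like this (the runc and kata-runtime paths are assumptions; adjust them to your install):

```toml
# /path_to_file/containerd_config -- minimal CRI plugin sketch
[plugins.cri.containerd]
  # Default runtime for ordinary pods
  [plugins.cri.containerd.default_runtime]
    runtime_type = "io.containerd.runtime.v1.linux"
    runtime_engine = "/usr/bin/runc"
  # Runtime used for pods annotated io.kubernetes.cri.untrusted-workload=true
  [plugins.cri.containerd.untrusted_workload_runtime]
    runtime_type = "io.containerd.runtime.v1.linux"
    runtime_engine = "/usr/bin/kata-runtime"
```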

Kubernetes 

Now let's make sure that Kubernetes talks to the containerd instance we just started. This is done by:
git clone https://github.com/kubernetes/kubernetes/
cd kubernetes
make quick-binaries
CONTAINER_RUNTIME=remote CONTAINER_RUNTIME_ENDPOINT='unix:///run/containerd/containerd.sock'  ./hack/local-up-cluster.sh
That should be it. Now let's take off the cloud provider's hat and be users of the cloud services again :)

Encrypt your TensorFlow Model 

Why do we need to encrypt the model? Everything we've done so far protects the workload from being snooped on by the cloud provider during execution. But when the data is at rest in a Docker registry, the admin of that registry can still get their hands on the very data you were trying to hide. So we aim to encrypt everything we care about and decrypt it only inside a secure enclave where no unauthorized snooping can take place.

We can easily encrypt our model and package it with a docker image, but in order to decrypt it in a kubernetes workflow, we need to understand a few things about Kubernetes emptyDir volumes.

Kubernetes EmptyDir Volumes

Kubernetes pods can have a special type of storage attached to them: the emptyDir type (aka ephemeral storage). Its main purpose is to let the containers of the same pod communicate with each other through a filesystem. As the name suggests, ephemeral storage is not meant for persistent storage requirements.

The kubelet creates an empty directory (hence the name, emptyDir volume) on the worker node, optionally backed by tmpfs. This directory is allocated to the pod in question as a storage volume. As soon as the pod is destroyed or restarted, the contents of this directory are purged.
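As a concrete illustration (all names here are made up for the example), two containers sharing a memory-backed emptyDir look like this:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: emptydir-demo      # hypothetical pod name
spec:
  containers:
  - name: writer
    image: busybox
    command: ['sh', '-c', 'echo hello > /shared/msg && sleep 3600']
    volumeMounts:
    - name: shared
      mountPath: /shared
  - name: reader
    image: busybox
    # Reads what the writer dropped into the shared volume
    command: ['sh', '-c', 'sleep 2; cat /shared/msg && sleep 3600']
    volumeMounts:
    - name: shared
      mountPath: /shared
  volumes:
  - name: shared
    emptyDir:
      medium: "Memory"     # backs the volume with tmpfs on the node
```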

So what role does this ephemeral storage type play in our effort to encrypt our TensorFlow model? To understand that, let's see how Kata Containers handle ephemeral storage slightly differently than a regular container runtime.

The image on the left shows how ephemeral volume provisioning works in Kubernetes by default: the emptyDir provisioned by the kubelet resides on the host system and is bind-mounted into the containers of the pod.

Now recall that containers under the Kata runtime run inside a virtual machine. So when you use the Kata runtime with Kubernetes, Kata provisions that ephemeral storage inside the virtual machine, backed by the memory pages of that virtual machine (see the image on the right). This means that containers running under the Kata runtime on a system that supports memory encryption (such as IBM's Ultravisor) can write to ephemeral storage without exposing its contents to the very host they are running on. Isn't it kinda cool? :)


The first three containers you see above are the pod's init containers. Init containers in a pod execute sequentially, so we can use them to prepare our environment before the application runs. This is how the pod workflow looks in our case:
  1. Fetch the keys from the key server and copy them into the ephemeral storage
  2. Fetch the encrypted TensorFlow model and copy it into the ephemeral storage
  3. Read the encrypted TensorFlow model from the ephemeral storage, decrypt it and write it back to the ephemeral storage
  4. Serve the model using the TensorFlow Model Server.
I have put together a video where we serve an encrypted TensorFlow model from a Docker image. We are going to ask TensorFlow Serving to tell us whether this kitty is a 'Persian cat'. :)

Image by Wikipedia



*Note that although the contents of the ephemeral storage are not visible on the host when using Kata, without the host supporting memory encryption a malicious admin can still take a memory dump of the VM process to get at the contents.

Here is a sample Kubernetes pod YAML.
apiVersion: v1
kind: Pod
metadata:
  name: modelserving 
  # Uncomment the annotations below to let containerd know that
  # this pod has to be run in an alternate runtime, which in our
  # case would be the Kata container runtime
#  annotations:
#    io.kubernetes.cri.untrusted-workload: "true"
  labels:
    name: modelserving 
spec:
  containers:
  # App Container - This container holds the TF Serving binary
  - name: modelserving 
    image: pharshal/tfserving:latest
    ports:
    - containerPort: 8500 
    volumeMounts:
    - name: cache-volume
      mountPath: /models/inception 
    command: ["/usr/bin/tensorflow_model_server","--port=8500", "--model_name=inception","--model_base_path=/models/inception"]
  initContainers:
  # Fetch Container - Use this container if you want to fetch the decryption key from a key server
#  - name: keys-container
#    volumeMounts:
#    - mountPath: /cache
#      name: cache-volume
#    imagePullPolicy: IfNotPresent
#    command: ['sh', '-c', 'wget -O /cache/secret key-server-ip:/secret']
  # Data container - Your encrypted TensorFlow model should be in this image
  - name: data-container
    volumeMounts:
    - mountPath: /cache
      name: cache-volume
    imagePullPolicy: IfNotPresent
    image: pharshal/tfmodel:latest 
    command: ['sh', '-c', 'cp /model.tar.gpg /cache/']
  # Tools Container - Utils for decryption
  - name: init-decrypt
    volumeMounts:
    - mountPath: /cache
      name: cache-volume
    imagePullPolicy: IfNotPresent
    image: pharshal/tfutils:latest
    #command: ["/usr/bin/gpg","--passphrase-file","/cache/secret","--batch","--quiet","--yes","--no-use-agent","-o","/cache/index.html","-d","/cache/index.html.gpg"]
    command: ["/usr/bin/gpg","--passphrase","heyho","--batch","--quiet","--yes","--no-use-agent","-o","/cache/model.tar","-d","/cache/model.tar.gpg"]
  - name: init-decrypt2
    volumeMounts:
    - mountPath: /cache
      name: cache-volume
    imagePullPolicy: IfNotPresent
    image: pharshal/tfutils:latest
    command: ["sh","-c","tar -xf /cache/model.tar -C /cache"]
  volumes:
  - name: cache-volume
    emptyDir:
      medium: "Memory"
---
apiVersion: v1
kind: Service
metadata:
  name: modelserving 
  labels:
    name: modelserving 
spec:
  type: NodePort
  ports:
    - port: 8500 
      nodePort: 30080 
      name: modelserving 
  selector:
    name: modelserving

Train and encrypt your own model

Let's get a pre-trained TensorFlow Inception model. Only follow along until the end of Part 1 from that link. You should be able to see something like:
Successfully loaded model from inception-v3/model.ckpt-157585 at step=157585.
Exporting trained model to models/inception/1
Successfully exported model to models/inception
$ ls models/inception
1
Now we have to encrypt this model:
cd models/inception
tar -cf model.tar 1
gpg -c model.tar
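The interactive `gpg -c` above prompts for a passphrase. For scripting, a non-interactive round trip can sanity-check the encryption before you bake it into an image. The passphrase 'heyho' matches the one hard-coded in the pod spec earlier, and `--pinentry-mode loopback` is needed on GnuPG ≥ 2.1 (drop it on GnuPG 1.x):

```shell
# Non-interactive encrypt/decrypt round trip (demo passphrase only --
# never put a real passphrase on the command line).
set -e
mkdir -p demo-model
echo "stand-in for the exported model" > demo-model/saved_model.pb
tar -cf model.tar demo-model
gpg --batch --yes --pinentry-mode loopback --passphrase heyho -c model.tar
gpg --batch --yes --pinentry-mode loopback --passphrase heyho \
    -o model.check.tar -d model.tar.gpg
cmp model.tar model.check.tar && echo "round trip OK"
```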
This will create a file named 'model.tar.gpg'. Remember the passphrase you entered while encrypting the model; we will need it later when creating the Kubernetes pod YAML. Now let's package this encrypted, trained TensorFlow model in a Docker image:
cat > Dockerfile << EOF
FROM ubuntu:latest
COPY model.tar.gpg /
EOF
and to build, 
docker build . -t  docker_image_name:docker_image_tag 

Thanks.

That's all folks! :-) 




