Use Kubernetes to Speed Machine Learning Development

As industries shift to a microservices approach of deploying applications in containers, data scientists can reap the benefits. Data scientists use specific frameworks and operating systems that often conflict with the requirements of a production system. This has led to many clashes between IT and R&D departments: IT is not going to change the OS to accommodate a model that needs a specific framework that won’t run on RHEL 7.2.

Containers allow a data scientist to construct self-contained environments that package up necessary dependencies and logic. This also allows the data scientist a seat at the table as discussions move from DevOps to DataOps. As data arrives and is parsed for value, containers that perform specific tasks can be staged along the way, creating a machine learning workflow on new incoming data that was not possible just a few years ago.

Data scientists can deploy multiple containers to account for adjustments in the data or variations in their model. This allows an organization to run models in parallel, evaluate them on new real-time data rather than only on the historical data they were optimized for, and then choose the one that proves more valuable.
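
To make the idea of evaluating models in parallel concrete, here is a minimal sketch in plain Python. It is not part of the deployment that follows: the two "models" are hypothetical stand-ins (a last-value forecaster and a three-point moving average), and the function names are my own. Each candidate makes one-step-ahead forecasts on newly arrived points and is scored by mean absolute error.

```python
def mean_abs_error(predictions, actuals):
    """Average absolute difference between forecasts and observed values."""
    return sum(abs(p - a) for p, a in zip(predictions, actuals)) / len(actuals)


def pick_better_model(models, history, new_points):
    """Score every candidate on one-step-ahead forecasts of newly arrived data."""
    scores = {}
    for name, predict in models.items():
        seen = list(history)          # each model starts from the same history
        preds = []
        for actual in new_points:
            preds.append(predict(seen))   # forecast before the value is revealed
            seen.append(actual)           # then reveal it, as a live feed would
        scores[name] = mean_abs_error(preds, new_points)
    best = min(scores, key=scores.get)
    return best, scores


# Hypothetical candidates standing in for "Model A" and "Model B".
models = {
    "model_a": lambda seen: seen[-1],                           # repeat the last value
    "model_b": lambda seen: sum(seen[-3:]) / len(seen[-3:]),    # 3-point moving average
}

best, scores = pick_better_model(
    models, history=[1.0, 2.0, 3.0], new_points=[4.0, 5.0, 6.0]
)
print(best, scores)
```

In a real setting the lambdas would be replaced by calls to the deployed model containers, and the error metric would be whatever the business problem calls for.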

For this example, I installed Docker and Kubernetes using kubeadm on AWS EC2 instances to create a two-node Kubernetes cluster running CentOS 7.5:

[justin@ip-10-0-0-105 ~]$ rpm --query centos-release
centos-release-7-5.1804.el7.centos.2.x86_64
[justin@ip-10-0-0-105 ~]$ python -V
Python 2.7.5
[justin@ip-10-0-0-105 ~]$ kubectl cluster-info
Kubernetes master is running at http://10.0.0.105:6443
KubeDNS is running at http://10.0.0.105:6443/api/v1/namespaces/kube-system/services/kube-dns:dns/proxy
[justin@ip-10-0-0-105 ~]$ kubectl get nodes
NAME                          STATUS   ROLES    AGE   VERSION
ip-10-0-0-149.ec2.internal    Ready    <none>   5h    v1.11.3
ip-10-0-0-236.ec2.internal    Ready    master   5h    v1.11.3

Data scientists typically develop, train, test and optimize their models in an R&D environment that can be configured to meet their needs. Here is a TensorFlow model I wrote in a sandbox that applies a Recurrent Neural Network to simulated time series data.

import pandas as pd
import numpy as np
import os
import random
import shutil
import tensorflow as tf
import tensorflow.contrib.metrics as metrics
import tensorflow.contrib.rnn as rnn
 
def main():
    random.seed(111)
    np.random.seed(111)  #the series below is drawn with NumPy, so seed it as well
    rng = pd.date_range(start='1/01/2000', end='9/21/2018')
    ts = pd.Series(np.random.uniform(-10, 10, size=len(rng)), rng).cumsum()
    TS = np.array(ts)
    num_periods = 100
 
    f_horizon = 1  #forecast horizon, one period into the future
    x_data = TS[:(len(TS)-(len(TS) % num_periods))]
    x_batches = x_data.reshape(-1, 100, 1)
    y_data = TS[1:(len(TS)-(len(TS) % num_periods))+f_horizon]
    y_batches = y_data.reshape(-1, 100, 1)
 
    #number of periods per vector we are using to predict one period ahead
 
    num_periods = 100                 
 
    inputs = 1                #number of vectors submitted
    hidden = 100              #number of neurons we will recursively work through
    output = 1                #number of output vectors
 
    #create variable objects
 
    X = tf.placeholder(tf.float32, [None, num_periods, inputs])      
    y = tf.placeholder(tf.float32, [None, num_periods, output])
 
    
    #create our RNN
 
    model = tf.nn.rnn_cell.BasicRNNCell(num_units=hidden, activation=tf.nn.relu)   
 
    #choose dynamic over static
 
    rnn_output, states = tf.nn.dynamic_rnn(model, X, dtype=tf.float32)                     
    learning_rate = 0.001 
 
    #change the form into a tensor
 
    stacked_rnn_output = tf.reshape(rnn_output, [-1, hidden])          
    stacked_outputs = tf.layers.dense(stacked_rnn_output, output)   
    outputs = tf.reshape(stacked_outputs, [-1, num_periods, output]) 
 
    #define the cost function which evaluates the quality of our model
 
    loss = tf.reduce_sum(tf.square(outputs - y))            
 
    #gradient descent method
 
    optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate)              
 
    #train the result of the application of the cost_function
 
    training_op = optimizer.minimize(loss)                                   
    saver = tf.train.Saver()   #we are going to save the model
    DIR="/tmp/model/"
    init = tf.global_variables_initializer()
    epochs = 1000       
 
    with tf.Session() as sess:
        init.run()
        model_performance = []
        for ep in range(epochs):
            sess.run(training_op, feed_dict={X: x_batches, y: y_batches})
            #loss is the summed squared error over all windows
            mse = loss.eval(feed_dict={X: x_batches, y: y_batches})
            model_performance.append((ep, mse))
        #save the trained parameters once training has finished
        saver.save(sess, os.path.join(DIR, "model"), global_step=epochs)

if __name__ == "__main__":
    main()
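
The slicing and reshaping at the top of main() is easy to misread, so here is a small NumPy-only sketch of the same windowing idea. The helper name make_batches is my own, and the extra guard for a series whose length is an exact multiple of num_periods is an addition that the script above does not have. The inputs are cut into windows of num_periods points, and the targets are the same windows shifted f_horizon steps into the future.

```python
import numpy as np


def make_batches(series, num_periods=100, f_horizon=1):
    """Split a 1-D series into input windows and targets shifted f_horizon ahead."""
    usable = len(series) - (len(series) % num_periods)
    if usable + f_horizon > len(series):
        # No trailing points left to shift the targets into: drop one window.
        usable -= num_periods
    x = series[:usable].reshape(-1, num_periods, 1)
    y = series[f_horizon:usable + f_horizon].reshape(-1, num_periods, 1)
    return x, y


# A simple ramp with one value per day in the article's date range,
# so the shift is easy to see.
series = np.arange(6839, dtype=np.float32)
x_batches, y_batches = make_batches(series)
print(x_batches.shape, y_batches.shape)
print(x_batches[0, :3, 0], y_batches[0, :3, 0])
```

Each target value is simply the input value one step later, which is what lets the network learn to predict one period ahead.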

I built this in the R&D environment, but now I want to move it over to the production environment. I will use Docker to build an image that I can then put my model into and deploy using Kubernetes.

First I will create a Dockerfile that will allow me to construct an image with an Ubuntu OS and install the dependencies and packages my model needs to function.

[justin@ip-10-0-0-105 ~]$ mkdir dockbuild
[justin@ip-10-0-0-105 ~]$ cd dockbuild/
[justin@ip-10-0-0-105 dockbuild]$ vi Dockerfile
 
FROM ubuntu:16.04
RUN apt-get update && apt-get install -y \
 build-essential \
 curl \
 git \
 libfreetype6-dev \
 libpng12-dev \
 libzmq3-dev \
 mlocate \
 pkg-config \
 python-dev \
 python-numpy \
 python-pip \
 software-properties-common \
 swig \
 zip \
 zlib1g-dev \
 libcurl3-dev \
 openjdk-8-jdk \
 openjdk-8-jre-headless \
 wget \
 && \
 apt-get clean && \
 rm -rf /var/lib/apt/lists/*
 
RUN echo "deb [arch=amd64] http://storage.googleapis.com/tensorflow-serving-apt stable tensorflow-model-server tensorflow-model-server-universal" \
 | tee /etc/apt/sources.list.d/tensorflow-serving.list

RUN curl http://storage.googleapis.com/tensorflow-serving-apt/tensorflow-serving.release.pub.gpg \
 | apt-key add -

RUN apt-get update && apt-get install -y \
 tensorflow-model-server
 
RUN pip install --upgrade pip
RUN pip install pandas tensorflow tensorflow-serving-api
CMD ["/bin/bash"]

I am going to use the TensorFlow Serving API to execute and save my model within a Docker container. Next, build the image and then run it:

[justin@ip-10-0-0-105 dockbuild]$ sudo docker build -t justin-tf_serving .
[justin@ip-10-0-0-105 dockbuild]$ sudo docker run --name=rnn_model_1 -it justin-tf_serving
 
root@1c48c2df62f8:/# cd
root@1c48c2df62f8:~# vi rnn_model_1.py
root@1c48c2df62f8:~# python rnn_model_1.py

Copy the model script into the container and then run it with Python. The model parameters will be saved in the /tmp/model/ folder within the container. To exit the container and keep it running in the background, press Ctrl+P followed by Ctrl+Q. I need to persist the changes I made to the rnn_model_1 container so that my model data remains permanently. Retrieve the container ID and commit the changes into a new image called tf_kube1.

[justin@ip-10-0-0-105 dockbuild]$ sudo docker ps -a
[justin@ip-10-0-0-105 dockbuild]$ sudo docker commit 07845b9c7ec5 tf_kube1
[justin@ip-10-0-0-105 dockbuild]$ sudo docker stop rnn_model_1

Kubernetes allows you to pull images from a private or local image registry, but for the purpose of this example, we will push and then pull our new image from Docker Hub. Log in with your username and password.

[justin@ip-10-0-0-105 dockbuild]$ sudo docker login --username=jbrandenburg
Password:
Login Succeeded
[justin@ip-10-0-0-105 dockbuild]$ sudo docker tag tf_kube1 jbrandenburg/kube-example
[justin@ip-10-0-0-105 dockbuild]$ sudo docker push jbrandenburg/kube-example

Once our image is on Docker Hub, we need to specify how we want Kubernetes to use the image on our cluster. We do this via a .yaml file. We are setting up a deployment of containers that will also be running as a service.

[justin@ip-10-0-0-105 dockbuild]$ vi kube_example.yaml

# Deployments are not part of the core v1 API; the "deployment.extensions"
# output below indicates the extensions/v1beta1 group on this v1.11 cluster.
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: tfrnn-deployment
spec:
  replicas: 3
  template:
    metadata:
      labels:
        app: tfrnn-server
    spec:
      containers:
      - name: rnn-model-1
        image: jbrandenburg/kube-example
        command:
        - /bin/sh
        args:
        - -c
        - tensorflow_model_server --model_name=model --model_base_path=/tmp/model
        ports:
        - containerPort: 8500
---
apiVersion: v1
kind: Service
metadata:
  labels:
    run: tfrnn-service
  name: tfrnn-service
spec:
  ports:
  - port: 8500
    targetPort: 8500
  selector:
    app: tfrnn-server
  type: LoadBalancer

Create the Kubernetes objects:

[justin@ip-10-0-0-105 dockbuild]$ kubectl get nodes
[justin@ip-10-0-0-105 dockbuild]$ kubectl create -f kube_example.yaml
deployment.extensions/tfrnn-deployment created
service/tfrnn-service created
[justin@ip-10-0-0-105 dockbuild]$ kubectl get deployments
NAME               DESIRED   CURRENT   UP-TO-DATE   AVAILABLE   AGE
tfrnn-deployment   3         3         3            3           23s
[justin@ip-10-0-0-105 dockbuild]$ kubectl get pods
NAME                               READY   STATUS    RESTARTS   AGE
tfrnn-deployment-868f55dd5-7s4tw   1/1     Running   0          42s
tfrnn-deployment-868f55dd5-mnvlb   1/1     Running   0          42s
tfrnn-deployment-868f55dd5-qf5j6   1/1     Running   0          42s
[justin@ip-10-0-0-105 dockbuild]$ kubectl get services
NAME            TYPE           CLUSTER-IP     EXTERNAL-IP   PORT(S)          AGE
kubernetes      ClusterIP      10.96.0.1      <none>        443/TCP          1m
tfrnn-service   LoadBalancer   10.106.86.19   <pending>     8500:31723/TCP   1m
[justin@ip-10-0-0-105 dockbuild]$ kubectl describe service tfrnn-service
Name:                     tfrnn-service
Namespace:                default
Labels:                   run=tfrnn-service
Annotations:              <none>
Selector:                 app=tfrnn-server
Type:                     LoadBalancer
IP:                       10.106.86.19
Port:                     <unset>  8500/TCP
TargetPort:               8500/TCP
NodePort:                 <unset>  31723/TCP
Endpoints:
Session Affinity:         None
External Traffic Policy:  Cluster
Events:                   <none>

Log in to one of the three pods that we just instantiated:

[justin@ip-10-0-0-105 dockbuild]$ kubectl exec -it tfrnn-deployment-868f55dd5-7s4tw -- /bin/bash
root@tfrnn-deployment-868f55dd5-7s4tw:/# ls /tmp/model/
checkpoint  model-1000.data-00000-of-00001  model-1000.index  model-1000.meta

Kubernetes is now running our Docker image containing the trained TensorFlow model we created. Now we can push new data through the model, and the model can evaluate this data and give us our results. We could have created a second image with adjusted model hyperparameters, and our pods could be running Model A and Model B side by side to compare results.
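
One way to run Model A and Model B side by side is to generate one Deployment per image. The sketch below builds the manifests as Python dictionaries and prints them as JSON, which kubectl accepts in place of YAML. Several things here are assumptions rather than part of the setup above: the second image name jbrandenburg/kube-example-b is hypothetical, the deployment names are made up, and the manifests use the apps/v1 API (which requires an explicit selector) instead of the API group used earlier.

```python
import json


def deployment_manifest(name, image, model_name="model", replicas=3, port=8500):
    """Build a Deployment that runs tensorflow_model_server from the given image."""
    labels = {"app": name}
    serve_cmd = (
        "tensorflow_model_server --model_name={} "
        "--model_base_path=/tmp/model".format(model_name)
    )
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name},
        "spec": {
            "replicas": replicas,
            # apps/v1 requires a selector that matches the pod template labels.
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {
                    "containers": [{
                        "name": name,
                        "image": image,
                        "command": ["/bin/sh"],
                        "args": ["-c", serve_cmd],
                        "ports": [{"containerPort": port}],
                    }]
                },
            },
        },
    }


# Deployment name -> image. The second image is hypothetical.
variants = {
    "tfrnn-model-a": "jbrandenburg/kube-example",
    "tfrnn-model-b": "jbrandenburg/kube-example-b",
}

manifests = [deployment_manifest(n, i) for n, i in sorted(variants.items())]
manifest_list = {"apiVersion": "v1", "kind": "List", "items": manifests}
print(json.dumps(manifest_list, indent=2))
```

Saving the printed JSON to a file and passing it to kubectl create -f would create both deployments in one step. Each variant would still need its own Service, or a shared one with a broader selector, depending on how the results are to be compared.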

Our models have all they need to run in their containers. The containers are configured to run in the production environment. Kubernetes will let us specify resources to improve efficiency in the compute allocated to our models and will let us know if a container is not performing as it should.
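
As a rough illustration of what specifying resources and health checks looks like, the sketch below adds resource requests and limits and a TCP liveness probe to a container spec expressed as a Python dictionary. The field names (resources, requests, limits, livenessProbe, tcpSocket) are standard Kubernetes container fields, but every value here is a placeholder chosen for illustration, not a measurement from this model, and the helper name is my own.

```python
def with_resources_and_probe(container, cpu_request="500m", memory_request="1Gi",
                             cpu_limit="1", memory_limit="2Gi", port=8500):
    """Return a copy of a container spec with resource bounds and a liveness probe."""
    updated = dict(container)   # shallow copy so the caller's dict is not modified
    updated["resources"] = {
        # What the scheduler reserves for the container.
        "requests": {"cpu": cpu_request, "memory": memory_request},
        # The ceiling the container is not allowed to exceed.
        "limits": {"cpu": cpu_limit, "memory": memory_limit},
    }
    updated["livenessProbe"] = {
        # Kubernetes restarts the container if the serving port stops accepting connections.
        "tcpSocket": {"port": port},
        "initialDelaySeconds": 15,
        "periodSeconds": 10,
    }
    return updated


container = {
    "name": "rnn-model-1",
    "image": "jbrandenburg/kube-example",
    "ports": [{"containerPort": 8500}],
}
tuned = with_resources_and_probe(container)
print(tuned["resources"])
print(tuned["livenessProbe"])
```

Sensible numbers depend on the model's actual memory footprint and inference load, so they are best taken from observed usage rather than guessed.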

As recently as two years ago, once I had performed my analysis and gained insight from data, I was never able to take the next step and deploy this insight. I would write a report, send an email, or present some slides, but my value was limited to what decision makers would do with it. Transferring my workflow logic and model into a production-ready application required the approval of many people and the dedication of a software developer. In a dynamic industry, this lag could allow the data to change, which would cause the model results to be less meaningful.

With developments in containers and Kubernetes, this doesn’t need to be the case any longer. The value of data science is determined by the insight it gives into data. This value can only increase as the ability to solve challenges in real time becomes more available.

This article originated from http://thenewstack.io/use-kubernetes-to-speed-machine-learning-development/

Justin Brandenburg is a Tigera guest blogger. Justin Brandenburg is a Data Scientist in the MapR Professional Services group. Justin has experience in a number of data areas ranging from counter-narcotics to cyber intrusion analysis. In past projects, he has utilized machine learning, econometrics, graph analytics and agent-based modeling to fulfill the customer needs. He has an undergraduate degree in Economics from Va Tech, a Masters in Economics from Johns Hopkins University and a Masters in Computational Social Science from George Mason University.
