Tensorflow – Training and evaluation on separate machines
Tensorflow is an open-source framework for high-performance numerical computation. It is used for both research and production systems on a variety of platforms, from mobile and edge devices to desktops and clusters of servers. Its main use is machine learning, and especially deep learning.
In this post, we will show you how you can use your MooseFS cluster with Tensorflow to train deep neural networks!
Training and evaluation on separate machines
Imagine you have a machine with 1-4 GPUs and you want to use all of them for training, but you still want to see the evaluation results. If you don’t want to run the evaluation job on CPU only, you can run the evaluation on another GPU-enabled host. We will use a simple configuration with three servers with Tensorflow installed: one for model training, one for evaluation, and one for the Tensorboard graph visualization tool.
Hardware configuration
In this demo we are using three hosts with the following hardware/software:
- Training host
- GPU (optional, recommended)
- MooseFS Client (required)
- MooseFS Chunkserver (optional)
- Evaluation host
- GPU (optional, recommended)
- MooseFS Client (required)
- MooseFS Chunkserver (optional)
We assume the training and evaluation hosts have one or more GPUs. This is the recommended configuration, since GPUs speed up the training of deep neural networks many times over. The third machine is used only for visualization of the training/evaluation process. It runs Tensorboard, which reads logs from a specified directory.
Of course, you can also install Tensorflow on chunkservers, or install a MooseFS Chunkserver on your GPU-enabled Tensorflow hosts.
MooseFS test cluster in Google Cloud
For the purpose of this experiment, we will use a MooseFS cluster deployed on machines in Google Cloud. If you don’t have your cluster ready, see the post How to quickly set up the MooseFS cluster in Google Cloud.
Computation on separate nodes
In this case, we consider training Tensorflow models on separate machines. We assume fresh operating system installations on the new hosts and that all hosts are in the same local network. The Tensorflow machines have MooseFS Clients connected to the MooseFS Storage Cluster.
Installation
We will set up two identical machines for Tensorflow, each with 4 CPU cores, 16 GB of RAM, and one GPU (a Tesla K80). The machines will use standard 100 GB hard drives. The configuration is presented on the screenshot below. We will create one machine, install Tensorflow and MooseFS on it, and clone it afterward.
Machines with GPUs cost more! Check the pricing or go to the Pricing Calculator to estimate costs. You cannot create GPU-enabled machines using free trial credits; you will be charged.
You need to add port 6006 to the firewall rules for tensorflow0 to be able to see the training progress in Tensorboard. We will add the network tag tensorboard to the tensorflow0 machine.
We will create a firewall rule for the tensorboard tag.
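Both steps can also be done from the command line with gcloud. A sketch, with allow-tensorboard as our (arbitrary) rule name:

# tag the instance so the rule below applies to it
gcloud compute instances add-tags tensorflow0 --tags tensorboard
# allow incoming TCP traffic on port 6006 (Tensorboard) to instances tagged "tensorboard"
gcloud compute firewall-rules create allow-tensorboard --allow tcp:6006 --target-tags tensorboard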
Install python
Let us install Python 3 and Tensorflow on the tensorflow0 machine. We will start by updating Ubuntu.
sudo apt update
sudo apt upgrade -y
Python 3.5.2 should already be preinstalled, so let us check the version of Python 3 by typing:
python3 -V
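It should print the preinstalled version, for example:

Python 3.5.2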
Now we will install the Python package manager pip for both Python 2 and Python 3:
sudo apt install -y python-pip python3-pip
sudo pip2 install --upgrade pip
sudo pip3 install --upgrade pip
Install CUDA
Now we are ready to install Tensorflow. We want to install the GPU-enabled version of Tensorflow, so first we need to install CUDA. CUDA 9.0 is our recommendation for Ubuntu 16.04 and Tensorflow 1.10.0.
We will follow the CUDA installation instructions from https://developer.nvidia.com/cuda-90-download-archive with a small modification: we will explicitly install version 9.0, since newer versions exist and would be installed by default.
wget http://developer.download.nvidia.com/compute/cuda/repos/ubuntu1604/x86_64/cuda-repo-ubuntu1604_9.0.176-1_amd64.deb
sudo dpkg -i cuda-repo-ubuntu1604_9.0.176-1_amd64.deb
sudo apt-key adv --fetch-keys http://developer.download.nvidia.com/compute/cuda/repos/ubuntu1604/x86_64/7fa2af80.pub
sudo apt-get update
sudo apt install cuda-9-0 -y
To complete the installation, a reboot is required:
sudo reboot
Now we can check Nvidia devices using the following command:
nvidia-smi
You should get a result similar to the one below.
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 396.44                 Driver Version: 396.44                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           Off  | 00000000:00:04.0 Off |                    0 |
| N/A   32C    P0    55W / 149W |     16MiB / 11441MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      1731      G   /usr/lib/xorg/Xorg                            14MiB |
+-----------------------------------------------------------------------------+
Install cuDNN
cuDNN is required for Tensorflow with GPU support.
Go to developer.nvidia.com/rdp/cudnn-archive and download the latest cuDNN 7 for CUDA 9.0. You need to be registered to download it; registration is free of charge.
We recommend using the general Linux package (cuDNN v7.1.4 Library for Linux) instead of the packages prepared for Ubuntu.
After downloading the archive, we will copy it to our server using the Google Cloud command line tool gcloud.
On your local machine:
gcloud compute scp ~/Downloads/cudnn-9.0-linux-x64-v7.1.tgz tensorflow0:
Back on tensorflow0, we will unpack the archive and copy the libraries into place:
tar xvzf cudnn-9.0-linux-x64-v7.1.tgz
cd cuda
sudo cp -P include/cudnn.h /usr/include
sudo cp -P lib64/libcudnn* /usr/lib/x86_64-linux-gnu/
sudo chmod a+r /usr/lib/x86_64-linux-gnu/libcudnn*
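You can verify which cuDNN version is installed by reading the version macros from the header we just copied:

grep CUDNN_MAJOR -A 2 /usr/include/cudnn.h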
Install Tensorflow
You can install Tensorflow using just one command per Python version, thanks to pip.
sudo pip2 install tensorflow-gpu==1.10.0
sudo pip3 install tensorflow-gpu==1.10.0
And now everything should be ready to run Tensorflow code. We can quickly check available devices for Tensorflow using:
python2 -c "from tensorflow.python.client import device_lib; print(device_lib.list_local_devices())" 2>/dev/null
python3 -c "from tensorflow.python.client import device_lib; print(device_lib.list_local_devices())" 2>/dev/null
The sample result shows that we can run on the CPU and the Tesla K80 GPU:
[name: "/device:CPU:0"
device_type: "CPU"
memory_limit: 268435456
locality {
}
incarnation: 7924687012402941002
, name: "/device:GPU:0"
device_type: "GPU"
memory_limit: 11266549351
locality {
  bus_id: 1
  links {
  }
}
incarnation: 8645147392941575694
physical_device_desc: "device: 0, name: Tesla K80, pci bus id: 0000:00:04.0, compute capability: 3.7"
]
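If the import instead fails with an error about a missing libcudart.so.9.0 or libcudnn.so.7, the CUDA libraries are probably not on the dynamic loader path. A minimal fix, assuming the default deb install location of /usr/local/cuda-9.0:

# make the CUDA runtime visible to the dynamic linker (path assumes the deb install above)
echo 'export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda-9.0/lib64' >> ~/.bashrc
source ~/.bashrc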
Install MooseFS Client
Log in as the superuser:
sudo su
Add repository:
wget -O - https://ppa.moosefs.com/moosefs.key | apt-key add -
Add an appropriate repository entry in /etc/apt/sources.list.d/moosefs.list:
echo "deb http://ppa.moosefs.com/moosefs-3/apt/ubuntu/xenial xenial main" > /etc/apt/sources.list.d/moosefs.list
And run:
apt update
You can install MooseFS Client using the following command:
apt install moosefs-client
Now we will mount MooseFS at /mnt/moosefs:
mkdir -p /mnt/moosefs
mfsmount /mnt/moosefs
That’s it! MooseFS is now mounted at /mnt/moosefs. We should see the mount in the CGI monitoring interface (mfsmaster:9425).
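By default, mfsmount connects to a master resolvable under the hostname mfsmaster; if your master has a different name, pass it explicitly with the -H option. You can confirm the mount with df (the hostname below is just an example):

# only needed when the master host is not resolvable as "mfsmaster"
# mfsmount -H mfsmaster.example.internal /mnt/moosefs
df -h /mnt/moosefs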
Download Tensorflow and Tensorflow/models repositories
We will download the Tensorflow and Tensorflow models repositories from Github.
cd ~
sudo apt install git -y
git clone https://github.com/tensorflow/tensorflow.git
cd tensorflow
git clone https://github.com/tensorflow/models.git
Install Bazel
Install requirements
sudo apt-get install pkg-config zip g++ zlib1g-dev unzip python openjdk-9-jdk-headless openjdk-9-jre-headless -y
Download Bazel from Github: https://github.com/bazelbuild/bazel/releases. Choose the latest bazel-X.YY.Z-installer-linux-x86_64.sh package.
wget https://github.com/bazelbuild/bazel/releases/download/0.16.0/bazel-0.16.0-installer-linux-x86_64.sh
chmod +x bazel-0.16.0-installer-linux-x86_64.sh
./bazel-0.16.0-installer-linux-x86_64.sh --user
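The --user flag installs the bazel binary to $HOME/bin, which may not be on your PATH yet. Add it and verify the installation:

export PATH="$PATH:$HOME/bin"
bazel version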
Fine tuning Inception v3 in TensorFlow
Download flowers dataset
Now we will download the flowers dataset and go through the Inception v3 tutorial.
# location of where to place the flowers data
FLOWERS_DATA_DIR=/mnt/moosefs/flowers-data
# build the preprocessing script.
cd ~/tensorflow/models/research/inception
bazel build //inception:download_and_preprocess_flowers
sudo pip2 install numpy
bazel-bin/inception/download_and_preprocess_flowers "${FLOWERS_DATA_DIR}"
It should finish with a last line similar to:
2018-08-09 09:56:40.812567: Finished writing all 3170 images in data set.
Let’s see what is inside the directory:
ls -lah ${FLOWERS_DATA_DIR}
Which outputs:
total 447M
drwxrwxr-x 3 karolmajek karolmajek 2.0M Aug  9 09:56 .
drwxrwxrwx 3 root       root       2.0M Aug  9 09:49 ..
-rw-rw-r-- 1 karolmajek karolmajek 219M Aug  9 09:49 flower_photos.tgz
drwxrwxr-x 4 karolmajek karolmajek 2.0M Aug  9 09:56 raw-data
-rw-rw-r-- 1 karolmajek karolmajek  97M Aug  9 09:56 train-00000-of-00002
-rw-rw-r-- 1 karolmajek karolmajek  96M Aug  9 09:56 train-00001-of-00002
-rw-rw-r-- 1 karolmajek karolmajek  16M Aug  9 09:56 validation-00000-of-00002
-rw-rw-r-- 1 karolmajek karolmajek  16M Aug  9 09:56 validation-00001-of-00002
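Since the dataset now lives on MooseFS, you can also inspect and control its replication level with the standard MooseFS goal tools. A quick sketch (the goal value 2 is just an example and requires at least two chunkservers):

# check the current replication goal of the dataset directory
mfsgetgoal -r ${FLOWERS_DATA_DIR}
# keep two copies of every chunk
mfssetgoal -r 2 ${FLOWERS_DATA_DIR}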
Now we will download a pretrained Inception v3 model:
INCEPTION_MODEL_DIR=/mnt/moosefs/inception-v3-model
mkdir -p ${INCEPTION_MODEL_DIR}
cd ${INCEPTION_MODEL_DIR}
# download the Inception v3 model
curl -O http://download.tensorflow.org/models/image/imagenet/inception-v3-2016-03-01.tar.gz
tar xzf inception-v3-2016-03-01.tar.gz
This will create a directory called inception-v3, which contains the following files:
ls inception-v3
# README.txt  checkpoint  model.ckpt-157585
Now we will start training on the flowers dataset:
# Build the model. Note that we need to make sure TensorFlow is ready to
# use before this, as this command will not build TensorFlow.
cd ~/tensorflow/models/research/inception
bazel build //inception:flowers_train
bazel build //inception:flowers_eval
We will run everything in a screen session, because training takes a long time:
sudo apt install screen -y
Let us start fine-tuning the model!
screen -S flowers_train
cd ~/tensorflow/models/research/inception
FLOWERS_DATA_DIR=/mnt/moosefs/flowers-data
INCEPTION_MODEL_DIR=/mnt/moosefs/inception-v3-model
# Path to the downloaded Inception-v3 model.
MODEL_PATH="${INCEPTION_MODEL_DIR}/inception-v3/model.ckpt-157585"
# Directory where to save the checkpoint and events files.
TRAIN_DIR=/mnt/moosefs/flowers/train/
# Run the fine-tuning on the flowers data set starting from the pre-trained
# Imagenet-v3 model.
bazel-bin/inception/flowers_train \
  --train_dir="${TRAIN_DIR}" \
  --data_dir="${FLOWERS_DATA_DIR}" \
  --pretrained_model_checkpoint_path="${MODEL_PATH}" \
  --fine_tune=True \
  --initial_learning_rate=0.001 \
  --input_queue_memory_factor=1
The training should start and you should see a similar output:
2018-08-09 11:30:23.891668: Pre-trained model restored from /mnt/moosefs/inception-v3-model/inception-v3/model.ckpt-157585
2018-08-09 11:31:05.327975: step 0, loss = 2.78 (1.0 examples/sec; 31.969 sec/batch)
2018-08-09 11:31:38.805869: step 10, loss = 2.57 (23.4 examples/sec; 1.370 sec/batch)
2018-08-09 11:31:52.601970: step 20, loss = 2.45 (23.2 examples/sec; 1.376 sec/batch)
2018-08-09 11:32:06.383492: step 30, loss = 2.17 (22.0 examples/sec; 1.452 sec/batch)
2018-08-09 11:32:20.049899: step 40, loss = 2.02 (23.4 examples/sec; 1.367 sec/batch)
2018-08-09 11:32:33.762179: step 50, loss = 1.96 (23.4 examples/sec; 1.370 sec/batch)
2018-08-09 11:32:47.634371: step 60, loss = 1.78 (23.2 examples/sec; 1.382 sec/batch)
Now you can detach from the screen session using the key combination Ctrl+A, Ctrl+D, and start Tensorboard in a new screen session:
screen -S flowers_board
tensorboard --logdir=/mnt/moosefs/flowers --host 0.0.0.0
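Tensorboard listens on port 6006, the port we opened in the firewall rule earlier. To watch the training progress, open http://<external IP of tensorflow0>:6006 in your browser. You can look up the external IP with gcloud on your local machine (the format string is one way to extract just the address):

gcloud compute instances describe tensorflow0 --format='get(networkInterfaces[0].accessConfigs[0].natIP)'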
To reattach to a screen session after detaching with Ctrl+A, Ctrl+D, you can use:
screen -r flowers_train
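If you lose track of which sessions are running, you can list them:

screen -ls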
Clone the machine!
Since we have installed everything, we can clone the tensorflow0 machine. We will create a snapshot of its disk.
Now we can clone our instance tensorflow0 and use the snapshot as a disk.
Start the tensorflow1 machine and mount MooseFS:
sudo mfsmount /mnt/moosefs
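Since both clients mount the same MooseFS namespace, the dataset and the checkpoints written by tensorflow0 should already be visible here; a quick check:

ls /mnt/moosefs/flowers-data /mnt/moosefs/flowers/train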
Then run the following commands:
screen -S flowers_eval
cd ~/tensorflow/models/research/inception
FLOWERS_DATA_DIR=/mnt/moosefs/flowers-data
INCEPTION_MODEL_DIR=/mnt/moosefs/inception-v3-model
# Path to the downloaded Inception-v3 model.
MODEL_PATH="${INCEPTION_MODEL_DIR}/inception-v3/model.ckpt-157585"
# Directory where to save the checkpoint and events files.
TRAIN_DIR=/mnt/moosefs/flowers/train/
# Directory where to save the evaluation events files.
EVAL_DIR=/mnt/moosefs/flowers/eval/
# Evaluate the fine-tuned model on a hold-out of the flower data set.
bazel-bin/inception/flowers_eval \
  --eval_dir="${EVAL_DIR}" \
  --data_dir="${FLOWERS_DATA_DIR}" \
  --subset=validation \
  --num_examples=500 \
  --checkpoint_dir="${TRAIN_DIR}" \
  --input_queue_memory_factor=1
After a moment, the evaluation result should be ready in Tensorboard.
The model is saved every 5000 steps, and the evaluation code looks for a new checkpoint every 5 minutes. After 2 hours of training, your Tensorboard should look similar to this:
Summary
In this post, you have learned how to use MooseFS with Tensorflow to run evaluation on a separate host while the model is training. Tensorflow writes logs and saves models to MooseFS storage, so all of the files are available on both machines.
You can also run another training simultaneously and get more results with different parameters faster!