Blog · 2018-09-17 · Karol

TensorFlow – Training and Evaluation on Separate Machines

MooseFS acts as the shared checkpoint and log store that lets a TensorFlow Inception v3 training job and a continuous evaluation job run simultaneously on separate GPU machines. This step-by-step guide covers the full stack: CUDA, cuDNN, TensorFlow, Bazel, and MooseFS client on Google Cloud.

Introduction

TensorFlow is an open-source framework for high-performance numerical computation. It is used for both research and production systems on a variety of platforms from mobile and edge devices to desktops and clusters of servers.

The scenario involves three servers: one for model training, one for evaluation, and one running TensorBoard for visualization.

Hardware configuration

Training host: GPU (optional, recommended), MooseFS Client (required), MooseFS Chunkserver (optional)
Evaluation host: GPU (optional, recommended), MooseFS Client (required), MooseFS Chunkserver (optional)

MooseFS test cluster in Google Cloud

The experiment uses MooseFS deployed on Google Cloud machines. You can follow the guide for setting up a MooseFS cluster in Google Cloud to prepare the environment.

Installation Steps

Install Python

sudo apt update
sudo apt upgrade -y
python3 -V
sudo apt install -y python-pip python3-pip
sudo pip2 install --upgrade pip
sudo pip3 install --upgrade pip

Install CUDA

wget http://developer.download.nvidia.com/compute/cuda/repos/ubuntu1604/x86_64/cuda-repo-ubuntu1604_9.0.176-1_amd64.deb
sudo dpkg -i cuda-repo-ubuntu1604_9.0.176-1_amd64.deb
sudo apt-key adv --fetch-keys http://developer.download.nvidia.com/compute/cuda/repos/ubuntu1604/x86_64/7fa2af80.pub
sudo apt-get update
sudo apt install cuda-9.0 -y
sudo reboot
# After the instance is back up, reconnect and confirm that the driver sees the GPU:
nvidia-smi

Install cuDNN

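Download the cuDNN v7.1 archive for CUDA 9.0 from developer.nvidia.com/rdp/cudnn-archive to your local machine, copy it to the instance, then install it there:

# On your local machine:
gcloud compute scp ~/Downloads/cudnn-9.0-linux-x64-v7.1.tgz tensorflow0:
# On the tensorflow0 instance:
tar xvzf cudnn-9.0-linux-x64-v7.1.tgz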
cd cuda
sudo cp -P include/cudnn.h /usr/include
sudo cp -P lib64/libcudnn* /usr/lib/x86_64-linux-gnu/
sudo chmod a+r /usr/lib/x86_64-linux-gnu/libcudnn*

Install TensorFlow

sudo pip2 install tensorflow-gpu==1.10.0
sudo pip3 install tensorflow-gpu==1.10.0

Verify the installation:

python2 -c "from tensorflow.python.client import device_lib; print(device_lib.list_local_devices())"
python3 -c "from tensorflow.python.client import device_lib; print(device_lib.list_local_devices())"

Install MooseFS Client

sudo su
wget -O - https://ppa.moosefs.com/moosefs.key | apt-key add -
echo "deb http://ppa.moosefs.com/moosefs-3/apt/ubuntu/xenial xenial main" > /etc/apt/sources.list.d/moosefs.list
apt update
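apt install -y moosefs-client
mkdir -p /mnt/moosefs
# mfsmount contacts the master at the hostname "mfsmaster" unless -H <host> is given
mfsmount /mnt/moosefs
# Leave the root shell before continuing
exit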
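
Before writing any data it is worth confirming that the mount really succeeded. If it did not, /mnt/moosefs is just an ordinary local directory and the two machines would silently stop sharing files. The following is a minimal sketch, assuming the `mountpoint` utility from util-linux is installed; `check_shared_dir` is a hypothetical helper name and not part of MooseFS.

```shell
#!/bin/sh
# Hypothetical helper (not part of MooseFS): succeeds only if DIR is a
# mount point and a file can be created inside it.
check_shared_dir() {
    dir="$1"
    # mountpoint -q exits non-zero when DIR is a plain directory
    if ! mountpoint -q "$dir"; then
        echo "$dir is not a mount point" >&2
        return 1
    fi
    # Create and remove a throwaway file to prove the mount is writable
    probe="$dir/.write-probe.$$"
    if ! touch "$probe" 2>/dev/null; then
        echo "$dir is not writable" >&2
        return 2
    fi
    rm -f "$probe"
    echo "$dir is mounted and writable"
}
```

Running `check_shared_dir /mnt/moosefs` on each host after mounting gives a quick yes/no answer before any training data is written.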

Download repositories

cd ~
sudo apt install git -y
git clone https://github.com/tensorflow/tensorflow.git
cd tensorflow
git clone https://github.com/tensorflow/models.git

Install Bazel

sudo apt-get install pkg-config zip g++ zlib1g-dev unzip python openjdk-9-jdk-headless openjdk-9-jre-headless -y
wget https://github.com/bazelbuild/bazel/releases/download/0.16.0/bazel-0.16.0-installer-linux-x86_64.sh
chmod +x bazel-0.16.0-installer-linux-x86_64.sh
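./bazel-0.16.0-installer-linux-x86_64.sh --user
# The --user install places bazel in ~/bin, so make sure that directory is on PATH
export PATH="$PATH:$HOME/bin"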

Fine-tuning Inception v3 in TensorFlow

Download the flowers dataset

FLOWERS_DATA_DIR=/mnt/moosefs/flowers-data
cd ~/tensorflow/models/research/inception
bazel build //inception:download_and_preprocess_flowers
sudo pip2 install numpy
bazel-bin/inception/download_and_preprocess_flowers "${FLOWERS_DATA_DIR}"

Download the pretrained model

INCEPTION_MODEL_DIR=/mnt/moosefs/inception-v3-model
mkdir -p "${INCEPTION_MODEL_DIR}"
cd "${INCEPTION_MODEL_DIR}"
curl -O http://download.tensorflow.org/models/image/imagenet/inception-v3-2016-03-01.tar.gz
tar xzf inception-v3-2016-03-01.tar.gz

Build and run training

cd ~/tensorflow/models/research/inception
bazel build //inception:flowers_train
bazel build //inception:flowers_eval
sudo apt install screen -y
screen -S flowers_train

FLOWERS_DATA_DIR=/mnt/moosefs/flowers-data
INCEPTION_MODEL_DIR=/mnt/moosefs/inception-v3-model
MODEL_PATH="${INCEPTION_MODEL_DIR}/inception-v3/model.ckpt-157585"
TRAIN_DIR=/mnt/moosefs/flowers/train/

bazel-bin/inception/flowers_train \
  --train_dir="${TRAIN_DIR}" \
  --data_dir="${FLOWERS_DATA_DIR}" \
  --pretrained_model_checkpoint_path="${MODEL_PATH}" \
  --fine_tune=True \
  --initial_learning_rate=0.001 \
  --input_queue_memory_factor=1
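
The same path variables are needed again on the evaluation machine. Since both hosts mount the same file system, one option is to keep them in a single file on MooseFS and source it wherever it is needed. This is only a sketch: `write_flowers_env` is a hypothetical helper and the file location used in the usage note is an assumption, while the values themselves are the ones used in this guide.

```shell
#!/bin/sh
# Hypothetical helper: writes this guide's path variables to FILE so that
# the training and evaluation hosts can both source the same definitions.
write_flowers_env() {
    env_file="$1"
    # The quoted 'EOF' keeps ${INCEPTION_MODEL_DIR} unexpanded in the file;
    # it is expanded later, when the file is sourced.
    cat > "$env_file" <<'EOF'
FLOWERS_DATA_DIR=/mnt/moosefs/flowers-data
INCEPTION_MODEL_DIR=/mnt/moosefs/inception-v3-model
MODEL_PATH="${INCEPTION_MODEL_DIR}/inception-v3/model.ckpt-157585"
TRAIN_DIR=/mnt/moosefs/flowers/train/
EVAL_DIR=/mnt/moosefs/flowers/eval/
EOF
}
```

For example, run `write_flowers_env /mnt/moosefs/flowers.env` once, then `. /mnt/moosefs/flowers.env` in each screen session before starting a job.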

Run TensorBoard

screen -S flowers_board
tensorboard --logdir=/mnt/moosefs/flowers --host 0.0.0.0

Clone the machine and run evaluation

Create a snapshot of the tensorflow0 instance and clone it to get the evaluation machine. Then run the following on the clone:

sudo mfsmount /mnt/moosefs
screen -S flowers_eval

cd ~/tensorflow/models/research/inception
FLOWERS_DATA_DIR=/mnt/moosefs/flowers-data
INCEPTION_MODEL_DIR=/mnt/moosefs/inception-v3-model
MODEL_PATH="${INCEPTION_MODEL_DIR}/inception-v3/model.ckpt-157585"
TRAIN_DIR=/mnt/moosefs/flowers/train/
EVAL_DIR=/mnt/moosefs/flowers/eval/

bazel-bin/inception/flowers_eval \
  --eval_dir="${EVAL_DIR}" \
  --data_dir="${FLOWERS_DATA_DIR}" \
  --subset=validation \
  --num_examples=500 \
  --checkpoint_dir="${TRAIN_DIR}" \
  --input_queue_memory_factor=1
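
The evaluator reads its checkpoints from TRAIN_DIR. In TensorFlow 1.x the saver normally keeps a small text file named `checkpoint` in that directory, and its `model_checkpoint_path` line names the newest checkpoint. Reading that file on the evaluation host is a quick way to confirm that both machines see the same training state. The sketch below assumes that file layout; `latest_checkpoint` is a hypothetical helper, not a TensorFlow tool.

```shell
#!/bin/sh
# Hypothetical helper: prints the value of model_checkpoint_path from
# DIR/checkpoint, or fails if the training job has not written that file yet.
latest_checkpoint() {
    state_file="$1/checkpoint"
    [ -f "$state_file" ] || return 1
    # Assumed line format:  model_checkpoint_path: "model.ckpt-1234"
    sed -n 's/^model_checkpoint_path: "\(.*\)"$/\1/p' "$state_file"
}
```

Calling `latest_checkpoint /mnt/moosefs/flowers/train` on both hosts should print the same name. If the evaluation host prints nothing or an older name, check the mount first.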

Summary

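TensorFlow writes its checkpoints and event logs to MooseFS, so the same files are visible on both machines: the training host produces them while the evaluation host and TensorBoard read them. The same setup lets you run several training jobs with different parameters at the same time, each on its own machine, and compare their results sooner than you could by running them one after another.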
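
One way to run several jobs side by side is to give each its own subdirectory under /mnt/moosefs/flowers. Checkpoints then never collide, and TensorBoard, which treats subdirectories of its --logdir as separate runs, can plot the jobs together. The sketch below only prints the commands; `print_train_commands` and the `train-lr-<rate>` directory naming are hypothetical, and the flags are the ones used earlier in this guide.

```shell
#!/bin/sh
# Hypothetical helper: prints (does not run) one flowers_train command per
# learning rate, each with its own train_dir under BASE_DIR.
# FLOWERS_DATA_DIR and MODEL_PATH are read from the environment.
print_train_commands() {
    base_dir="$1"
    shift
    for lr in "$@"; do
        printf '%s\n' "bazel-bin/inception/flowers_train --train_dir=${base_dir}/train-lr-${lr}/ --data_dir=${FLOWERS_DATA_DIR} --pretrained_model_checkpoint_path=${MODEL_PATH} --fine_tune=True --initial_learning_rate=${lr} --input_queue_memory_factor=1"
    done
}
```

For instance, `print_train_commands /mnt/moosefs/flowers 0.001 0.01` prints two commands, one to start on each training machine.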