Blog · 2018-09-17 · Karol

TensorFlow – Training and Evaluation on Separate Machines

MooseFS acts as the shared checkpoint and log store that lets a TensorFlow Inception v3 training job and a continuous evaluation job run simultaneously on separate GPU machines. This step-by-step guide covers the full stack: CUDA, cuDNN, TensorFlow, Bazel, and MooseFS client on Google Cloud.

Introduction

TensorFlow is an open-source framework for high-performance numerical computation. It is used in both research and production systems on a variety of platforms, from mobile and edge devices to desktops and clusters of servers.

The scenario involves three servers: one for model training, one for evaluation, and one running TensorBoard for visualization.

Hardware configuration

  • Training host: GPU (optional, recommended), MooseFS Client (required), MooseFS Chunkserver (optional)
  • Evaluation host: GPU (optional, recommended), MooseFS Client (required), MooseFS Chunkserver (optional)

MooseFS test cluster in Google Cloud

The experiment uses MooseFS deployed on Google Cloud machines. You can follow the guide for setting up a MooseFS cluster in Google Cloud to prepare the environment.

Installation Steps

Install Python

sudo apt update
sudo apt upgrade -y
python3 -V
sudo apt install -y python-pip python3-pip
sudo pip2 install --upgrade pip
sudo pip3 install --upgrade pip

Install CUDA

wget http://developer.download.nvidia.com/compute/cuda/repos/ubuntu1604/x86_64/cuda-repo-ubuntu1604_9.0.176-1_amd64.deb
sudo dpkg -i cuda-repo-ubuntu1604_9.0.176-1_amd64.deb
sudo apt-key adv --fetch-keys http://developer.download.nvidia.com/compute/cuda/repos/ubuntu1604/x86_64/7fa2af80.pub
sudo apt-get update
sudo apt install cuda-9.0 -y
sudo reboot
# After the machine is back up, reconnect and confirm that the driver sees the GPU:
nvidia-smi

Install cuDNN

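Download the cuDNN archive from developer.nvidia.com/rdp/cudnn-archive to your local machine, copy it to the instance with the first command below, then run the remaining commands on the instance: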

gcloud compute scp ~/Downloads/cudnn-9.0-linux-x64-v7.1.tgz tensorflow0:
tar xvzf cudnn-9.0-linux-x64-v7.1.tgz
cd cuda
sudo cp -P include/cudnn.h /usr/include
sudo cp -P lib64/libcudnn* /usr/lib/x86_64-linux-gnu/
sudo chmod a+r /usr/lib/x86_64-linux-gnu/libcudnn*

Install TensorFlow

sudo pip2 install tensorflow-gpu==1.10.0
sudo pip3 install tensorflow-gpu==1.10.0

Verify the installation:

python2 -c "from tensorflow.python.client import device_lib; print(device_lib.list_local_devices())"
python3 -c "from tensorflow.python.client import device_lib; print(device_lib.list_local_devices())"

Install MooseFS Client

sudo su
wget -O - https://ppa.moosefs.com/moosefs.key | apt-key add -
echo "deb http://ppa.moosefs.com/moosefs-3/apt/ubuntu/xenial xenial main" > /etc/apt/sources.list.d/moosefs.list
apt update
apt install moosefs-client
mkdir -p /mnt/moosefs
mfsmount /mnt/moosefs
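
Both jobs depend on the shared directory being mounted and writable, so it can help to check that before starting anything long-running. The sketch below is only an illustration: `is_writable_dir` is a hypothetical helper, not part of MooseFS, and it only tests that a file can be created and removed in the given directory.

```shell
# Hypothetical helper (not part of MooseFS): returns 0 only if the directory
# exists and a file can be created and then removed inside it.
is_writable_dir() (
    dir="$1"
    [ -d "$dir" ] || exit 1
    probe="$(mktemp "$dir/.write-test.XXXXXX" 2>/dev/null)" || exit 1
    rm -f "$probe"
)

# Example use on a host from this guide:
#   is_writable_dir /mnt/moosefs || echo "shared storage is not ready" >&2
```

A writable directory is not proof that MooseFS is actually mounted there. On Linux, `mountpoint -q /mnt/moosefs` can additionally confirm that the path is a mount point rather than an empty local directory.
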
Download repositories

cd ~
sudo apt install git -y
git clone https://github.com/tensorflow/tensorflow.git
cd tensorflow
git clone https://github.com/tensorflow/models.git

Install Bazel

sudo apt-get install pkg-config zip g++ zlib1g-dev unzip python openjdk-9-jdk-headless openjdk-9-jre-headless -y
wget https://github.com/bazelbuild/bazel/releases/download/0.16.0/bazel-0.16.0-installer-linux-x86_64.sh
chmod +x bazel-0.16.0-installer-linux-x86_64.sh
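./bazel-0.16.0-installer-linux-x86_64.sh --user
# The --user flag installs Bazel into $HOME/bin, so make sure that directory is on PATH:
export PATH="$PATH:$HOME/bin"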
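
With every component installed, a quick check that the tools are reachable from the current shell can catch a missed step before the first build. This is a minimal sketch: `missing_cmds` is a hypothetical helper, and the command names in the example are the ones used in this guide.

```shell
# Hypothetical helper: prints the name of every listed command that is not
# found on PATH; prints nothing when all of them are available.
missing_cmds() {
    for cmd in "$@"; do
        command -v "$cmd" >/dev/null 2>&1 || printf '%s\n' "$cmd"
    done
}

# Example use with the tools installed in the steps above:
#   missing_cmds nvidia-smi python2 python3 mfsmount git bazel
```

An empty result means every listed tool was found. Any name that is printed points at an installation step to repeat.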

Fine-tuning Inception v3 in TensorFlow

Download the flowers dataset

FLOWERS_DATA_DIR=/mnt/moosefs/flowers-data
cd ~/tensorflow/models/research/inception
bazel build //inception:download_and_preprocess_flowers
sudo pip2 install numpy
bazel-bin/inception/download_and_preprocess_flowers "${FLOWERS_DATA_DIR}"

Download the pretrained model

INCEPTION_MODEL_DIR=/mnt/moosefs/inception-v3-model
mkdir -p "${INCEPTION_MODEL_DIR}"
cd "${INCEPTION_MODEL_DIR}"
curl -O http://download.tensorflow.org/models/image/imagenet/inception-v3-2016-03-01.tar.gz
tar xzf inception-v3-2016-03-01.tar.gz
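
The training step later refers to the extracted checkpoint by a path prefix, so a quick check that something with that prefix exists can catch a failed download early. The sketch below assumes nothing about the checkpoint format beyond that; `checkpoint_present` is a hypothetical helper, and depending on the format a prefix may correspond to one file or to several files sharing it.

```shell
# Hypothetical helper: succeeds if at least one existing file name starts
# with the given checkpoint prefix.
checkpoint_present() {
    for f in "$1"*; do
        [ -e "$f" ] && return 0
    done
    return 1
}

# Example use with the prefix used later in this guide:
#   checkpoint_present /mnt/moosefs/inception-v3-model/inception-v3/model.ckpt-157585 \
#     || echo "pretrained checkpoint not found" >&2
```
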
Build and run training

cd ~/tensorflow/models/research/inception
bazel build //inception:flowers_train
bazel build //inception:flowers_eval
sudo apt install screen -y
screen -S flowers_train

FLOWERS_DATA_DIR=/mnt/moosefs/flowers-data
INCEPTION_MODEL_DIR=/mnt/moosefs/inception-v3-model
MODEL_PATH="${INCEPTION_MODEL_DIR}/inception-v3/model.ckpt-157585"
TRAIN_DIR=/mnt/moosefs/flowers/train/

bazel-bin/inception/flowers_train \
  --train_dir="${TRAIN_DIR}" \
  --data_dir="${FLOWERS_DATA_DIR}" \
  --pretrained_model_checkpoint_path="${MODEL_PATH}" \
  --fine_tune=True \
  --initial_learning_rate=0.001 \
  --input_queue_memory_factor=1

Run TensorBoard

screen -S flowers_board
tensorboard --logdir=/mnt/moosefs/flowers --host 0.0.0.0

Clone the machine and run evaluation

Create a snapshot of the tensorflow0 instance and clone it, then run the following on the clone:

sudo mfsmount /mnt/moosefs
screen -S flowers_eval

cd ~/tensorflow/models/research/inception
FLOWERS_DATA_DIR=/mnt/moosefs/flowers-data
INCEPTION_MODEL_DIR=/mnt/moosefs/inception-v3-model
MODEL_PATH="${INCEPTION_MODEL_DIR}/inception-v3/model.ckpt-157585"
TRAIN_DIR=/mnt/moosefs/flowers/train/
EVAL_DIR=/mnt/moosefs/flowers/eval/

bazel-bin/inception/flowers_eval \
  --eval_dir="${EVAL_DIR}" \
  --data_dir="${FLOWERS_DATA_DIR}" \
  --subset=validation \
  --num_examples=500 \
  --checkpoint_dir="${TRAIN_DIR}" \
  --input_queue_memory_factor=1
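
Because the evaluation host reads checkpoints from the same directory the training host writes to, it can be useful to see which checkpoint is currently the newest. TensorFlow's saver records this in a small text file named `checkpoint` inside the training directory. The sketch below reads that file; `latest_checkpoint` is a hypothetical helper, not a TensorFlow tool.

```shell
# Hypothetical helper: prints the value of model_checkpoint_path from the
# "checkpoint" state file in the given training directory. Prints nothing
# if the file does not exist yet.
latest_checkpoint() {
    sed -n 's/^model_checkpoint_path: "\(.*\)"$/\1/p' "$1/checkpoint" 2>/dev/null
}

# Example use on the evaluation host:
#   latest_checkpoint /mnt/moosefs/flowers/train
```

If this prints nothing, training has most likely not saved its first checkpoint yet.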

Summary

TensorFlow writes logs and saves models to MooseFS storage, so all of the files are available on both machines. With this setup you can also run several training jobs with different parameters at the same time, each on its own machine, and compare their results without waiting for the jobs to finish one after another.
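
Running several training jobs with different parameters could look roughly like the sketch below, which only prints the commands rather than running them. The flags are the ones used earlier in this guide. The per-run directory naming (`train-lr-...`), the second learning rate, and the `train_cmd` helper are assumptions made for illustration.

```shell
FLOWERS_DATA_DIR=/mnt/moosefs/flowers-data
MODEL_PATH=/mnt/moosefs/inception-v3-model/inception-v3/model.ckpt-157585

# Hypothetical helper: prints (does not run) the training command for one
# learning rate, giving each run its own train_dir on the shared storage.
train_cmd() {
    lr="$1"
    printf '%s' "bazel-bin/inception/flowers_train"
    printf ' %s' \
        "--train_dir=/mnt/moosefs/flowers/train-lr-${lr}" \
        "--data_dir=${FLOWERS_DATA_DIR}" \
        "--pretrained_model_checkpoint_path=${MODEL_PATH}" \
        "--fine_tune=True" \
        "--initial_learning_rate=${lr}" \
        "--input_queue_memory_factor=1"
    printf '\n'
}

# Example values only; each printed command would be run on a separate machine.
for lr in 0.001 0.0005; do
    train_cmd "$lr"
done
```

Keeping every run in its own subdirectory of /mnt/moosefs/flowers should let the TensorBoard instance started earlier show the runs side by side, since it scans the subdirectories of its log directory.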