Deep Learning with Docker
Statically link all your dependencies
Written by Christopher Hesse, February 1st, 2017
Docker is a way to statically link everything short of the Linux kernel into your application. Because you can access GPUs while using a Docker container, it's also a great way to bundle Tensorflow and any other dependencies your machine learning code has so that anyone can run your work.
You can distribute a reproducible machine learning project that requires little to no setup on the part of the user, so instead of:
Terminal
# 6 hours of installing dependencies
python train.py
> ERROR: libobscure.so cannot open shared object
You can do something like:
Terminal
dockrun tensorflow/tensorflow:0.12.1-gpu python train.py
> TRAINING SUCCESSFUL
And run your train.py script with all dependencies, including GPU support.
In this setup, Docker containers are ephemeral: nothing written inside the container persists after it exits. You can imagine that the Docker container is a 1GB tensorflow.exe that has all the dependencies you need already compiled in.
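A quick way to see the ephemerality in action, using the stock ubuntu image (any file written during one run is gone by the next):
Terminal
sudo docker run --rm ubuntu touch /tmp/scratch
sudo docker run --rm ubuntu ls /tmp/scratch
> ls: cannot access /tmp/scratch: No such file or directory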
Why
Open source software often has a web of dependencies that is hard to reproduce: different compiler versions, missing header files, incorrect library paths, etc., all resulting in a lot of wasted time trying to get everything set up so that you can run the software.
With Docker, in theory, users only need to get Docker working correctly, and then all your code should run for them. Thankfully, Docker has raised 180 million dollars and has converted at least some of that cash into software that actually mostly works.
I'm going to cover using it on Linux, but using it on Mac should be the same, except that GPU support does not exist.
How to do the thing
For the case of machine learning, you probably want to distribute your code as a GitHub repo. Your dependencies are normally distributed as a series of Linux command lines that are supposed to be copied and pasted into a terminal.
Docker replaces that second part with a command that will fetch the correct Docker image required to run your code instead. You're statically linking all your dependencies together into a 3GB (compressed) image that the user can then download and use with no additional effort.
Let's look at the original Torch pix2pix implementation:
Terminal
git clone https://github.com/phillipi/pix2pix.git
cd pix2pix
bash datasets/download_dataset.sh facades
# install dependencies for some time
...
# train
env \
  DATA_ROOT=datasets/facades \
  name=facades \
  niter=200 \
  save_latest_freq=400 \
  which_direction=BtoA \
  display=0 \
  gpu=0 \
  cudnn=0 \
  th train.lua
While the training script has very few dependencies, which is great, the included tools have a number of poorly documented dependencies that can be annoying to assemble.
If you mess up the dependencies somehow, you end up with errors like this:
luajit: symbol lookup error:
/root/torch/install/lib/lua/5.1/libTHNN.so: undefined symbol: TH_CONVERT_ACCREAL_TO_REAL
Docker provides a way around this by distributing a binary artifact of all of your dependencies through Docker Hub.
Dockerized
On a Linux server you can install normal docker plus nvidia-docker and then your Docker containers get GPU access with no noticeable performance hit.
If you're on a Mac you can install Docker for Mac, which is pretty solid in my experience. You won't be able to run anything on the GPU, but then again, few Macs seem to support CUDA anyway. You can still test everything in CPU mode, which works well, if slowly.
On Linux, here's a script that installs docker on a fresh Ubuntu 16.04 LTS install, for use on cloud providers:
Terminal
curl -fsSL https://affinelayer.com/docker/setup-docker.py | sudo python3
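If you'd rather not pipe a script from this site into sudo python3, Docker's official convenience script does roughly the same thing:
Terminal
# Docker's official install script, assuming a supported Ubuntu release
curl -fsSL https://get.docker.com | sudo sh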
Once you have docker installed, running the pix2pix code looks like:
Terminal
sudo docker run --rm --volume /:/host --workdir /host$PWD affinelayer/pix2pix <command>
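The --volume /:/host flag mounts the host's entire filesystem at /host inside the container, and --workdir /host$PWD starts the container in the same directory you launched it from, so relative paths like datasets/facades resolve as expected. A quick sanity check of that plumbing, using the stock ubuntu image:
Terminal
# should print the same file listing as running `ls` on the host
sudo docker run --rm --volume /:/host --workdir /host$PWD ubuntu ls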
Here's the full training setup, on multiple lines for readability:
Terminal
git clone https://github.com/phillipi/pix2pix.git
cd pix2pix
bash datasets/download_dataset.sh facades
sudo docker run \
  --rm \
  --volume /:/host \
  --workdir /host$PWD \
  affinelayer/pix2pix \
  env \
    DATA_ROOT=datasets/facades \
    name=facades \
    niter=200 \
    save_latest_freq=400 \
    which_direction=BtoA \
    display=0 \
    gpu=0 \
    cudnn=0 \
    th train.lua
This will download the image I built (including Torch + nvidia-docker support) which is like 3GB of data, so AOL users may be out of luck.
When this runs it should print out training debug information. This is pretty good already, but running on the GPU is crucial to get sufficient speed for training with the architecture used in pix2pix.
GPU
Running it on the GPU is just a matter of replacing docker in the previous commands with nvidia-docker.
nvidia-docker is not yet included in standard Docker, so you may have to do some setup. Here's a script that works on the same fresh Ubuntu 16.04 LTS install:
Terminal
curl -fsSL https://affinelayer.com/docker/setup-nvidia-docker.py | sudo python3
This should take about 5 minutes to run. I tested this on Azure and AWS (getting access to the GPU instances took a few days and some support tickets). Both have NVIDIA K80 cards rated at 2.9 FP32 TFLOPS.
When you have nvidia-docker set up, this should print your current graphics card:
Terminal
sudo nvidia-docker run --rm nvidia/cuda nvidia-smi
Assuming that works, run the GPU training mode of pix2pix:
Terminal
sudo nvidia-docker run \
  --rm \
  --volume /:/host \
  --workdir /host$PWD \
  affinelayer/pix2pix \
  env \
    DATA_ROOT=datasets/facades \
    name=facades \
    niter=200 \
    save_latest_freq=400 \
    which_direction=BtoA \
    display=0 \
    th train.lua
This uses the same Docker image, but allows access to the GPU.
The docker commands will connect to the Docker image registry and download the specified image affinelayer/pix2pix, then run the command inside an ephemeral container made from that image.
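If you'd like to separate the (large) download from the run, you can fetch the image explicitly first:
Terminal
# pull the image ahead of time; later docker run commands use the cached copy
sudo docker pull affinelayer/pix2pix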
Protips
For Python with Tensorflow, there are a couple of command line options you'll probably want to use:
--env PYTHONUNBUFFERED=x
This makes it so that Python prints all output immediately, instead of buffering it because it doesn't realize a user is viewing the output.
--env CUDA_CACHE_PATH=/host/tmp/cuda-cache
This makes it so that you don't have a 1 minute delay every time you start Tensorflow while it recompiles its CUDA kernels from scratch. Since /host/tmp is on the mounted host filesystem, the compiled kernel cache survives even though the container itself is ephemeral.
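Here's a quick way to see the buffering difference, using the official python image for illustration. With the flag, one number appears per second; drop the --env and all three arrive in a burst at the end:
Terminal
sudo docker run --rm --env PYTHONUNBUFFERED=x python:2.7 python -c '
import time
for i in range(3):
    print(i)
    time.sleep(1)'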
Combined, the Docker command line looks like this:
Terminal
sudo nvidia-docker run \
  --rm \
  --volume /:/host \
  --workdir /host$PWD \
  --env PYTHONUNBUFFERED=x \
  --env CUDA_CACHE_PATH=/host/tmp/cuda-cache \
  <image> \
  <command>
This is pretty long, so you may want to define an alias:
Terminal
alias dockrun="sudo nvidia-docker run --rm --volume /:/host --workdir /host\$PWD --env PYTHONUNBUFFERED=x --env CUDA_CACHE_PATH=/host/tmp/cuda-cache"
Here's the alias being used to run pix2pix-tensorflow:
Terminal
git clone https://github.com/affinelayer/pix2pix-tensorflow.git
cd pix2pix-tensorflow
python tools/download-dataset.py facades
dockrun affinelayer/pix2pix-tensorflow python pix2pix.py \
  --mode train \
  --output_dir facades_train \
  --max_epochs 200 \
  --input_dir facades/train \
  --which_direction BtoA
pix2pix-tensorflow has no dependencies besides Tensorflow 0.12.1 (the latest release at the time), but even so, the very first GitHub issue filed was from a user running the wrong version of Tensorflow.
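Since the image pins the framework, users can always check which version they're actually running:
Terminal
# print the Tensorflow version baked into the image
dockrun affinelayer/pix2pix-tensorflow python -c "import tensorflow; print(tensorflow.__version__)"
> 0.12.1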
How to set this up for your project
Fortunately, setting this up so users can use your Docker image is pretty easy too.
Make a new directory containing a single file named Dockerfile; the exact one used above is fetched by the commands below.
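As a rough sketch of what such a Dockerfile can look like (this is an illustration built on the official Tensorflow image, not the actual pix2pix Dockerfile):
Dockerfile
# hypothetical example: start from the official Tensorflow GPU image
# and layer any extra dependencies your code needs on top
FROM tensorflow/tensorflow:0.12.1-gpu
RUN pip install scipy
Building the image looks like this: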
Terminal
mkdir docker-build
cd docker-build
curl -O https://affinelayer.com/docker/Dockerfile
sudo docker build --rm --no-cache --tag pix2pix .
Hours later when this finishes, you should be able to see the image:
Terminal
sudo docker images pix2pix
REPOSITORY          TAG                 IMAGE ID            CREATED             SIZE
pix2pix             latest              bf5bd6bb35f8        3 seconds ago       11.38 GB
Publishing the image is easy too, though the push command requires that you've set up a Docker Hub account and used docker login to set that account as active.
Terminal
sudo docker tag pix2pix <accountname>/pix2pix
sudo docker push <accountname>/pix2pix
Now users can use this image to run your software without a bunch of work. Nice!
You can also pass around Docker images without Docker Hub, but it seems to be a little clunky:
Terminal
# save image to disk, this took about 18 minutes
sudo docker save pix2pix | gzip > pix2pix.image.gz

# load image from disk, this took about 4 minutes
gunzip --stdout pix2pix.image.gz | sudo docker load
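To move the saved image to another machine without Docker Hub, plain scp works (host name hypothetical):
Terminal
scp pix2pix.image.gz other-host:
ssh other-host 'gunzip --stdout pix2pix.image.gz | sudo docker load'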
Reproducibility
While the Docker image is easy to copy around unmodified, the Dockerfile-to-image conversion is not necessarily reproducible. You can examine the history of commands used to create the image:
Terminal
sudo docker history --no-trunc pix2pix
But this doesn't show all the files that went into making the image. For instance, if your Dockerfile contains a git clone or an apt-get update, it's likely that running docker build on the same Dockerfile on two different days will produce two different images, if it works at all. In addition, if docker build ends up compiling code specifically for your CPU, it may not work on another machine.
As long as the Docker image itself is the thing that's distributed, the result is reproducible. If you want to reproduce the image from the Dockerfile, however, that will not work unless the Dockerfile writer is very careful.
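For example, one way to make a git clone step repeatable is to pin it to an exact commit (hash placeholder left hypothetical):
Dockerfile
# pin the clone to a fixed commit so rebuilding later fetches the same code
RUN git clone https://github.com/phillipi/pix2pix.git /pix2pix && \
    git -C /pix2pix checkout <commit-hash>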
It's unclear if the benefits are worth the effort, but if your Dockerfile is built FROM scratch and with --network none, it should be mostly reproducible.
While it would be cool if reproducible image generation was easy, Docker already makes it possible to get reproducible dependencies that actually work, which is a great step forward.