Distributed Deep Learning

Caffe on Spark on AWS

Vincent Van Steenbergen - @nsteenv


Vincent Van Steenbergen

Playing with Scala, Akka & Spark for about 3 years

Deeply interested in Artificial Intelligence and Data Analysis



Deep Learning

e.g. Convolutional Neural Networks (ConvNets)


Image analysis

DL cat

Image generation

DL cat



Training a model requires:

1. a lot of time (usually weeks/months)

2. a lot of computing power

Example: AlphaGo used 1,202 CPUs and 176 GPUs, with 6 weeks of training

So how can I do that...

from my laptop?

for a decent cost?

within a short timespan?

Yes you can!

Burning laptop

Technically possible on a (high-end) laptop, but very slow

Solution: distribute training over a cluster

Apache Spark


Pool resources from all the Spark slaves on the cluster
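
To make the idea of distributing training concrete, here is a minimal data-parallel sketch in plain Python. It is not Spark or Caffe code: the "workers" are simulated in a single process, and the toy model (y = w * x), the function names and the learning rate are all assumptions chosen for illustration.

```python
# Data-parallel training sketch: each simulated "worker" holds one shard of
# the data, computes a local gradient, and the gradients are averaged before
# a single shared update. All names here are hypothetical.

def local_gradient(w: float, shard: list) -> float:
    """Gradient of the mean squared error for y = w * x on one shard."""
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)


def train(shards: list, lr: float = 0.05, steps: int = 200) -> float:
    w = 0.0
    for _ in range(steps):
        # On a real cluster these gradients would be computed in parallel.
        grads = [local_gradient(w, shard) for shard in shards]
        w -= lr * sum(grads) / len(grads)  # average, then one gradient step
    return w


# Toy data generated from y = 3x, split across two simulated workers.
data = [(x, 3.0 * x) for x in [1.0, 2.0, 3.0, 4.0]]
shards = [data[:2], data[2:]]
w = train(shards)
print(round(w, 3))
```

Because the shards have equal size, averaging the per-shard gradients gives the same update as computing the gradient over the full dataset on one machine.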

Amazon Web Services (EC2)

GPU instances (g2.2xlarge, g2.8xlarge)

Spot instances (bid-based, generally 2-3 times cheaper than on-demand instances)

g2.8xlarge configuration

Four NVIDIA GRID GPUs, each with 1,536 CUDA cores and 4 GB of video memory

32 vCPUs

60 GiB of memory

240 GB (2 x 120) of SSD storage

Average price: $1.00 per hour
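
A quick back-of-the-envelope calculation shows what such a cluster adds up to. The per-instance figures below come from this slide; the cluster size and duration are hypothetical, and the function name is made up for this sketch. Real prices vary over time and by region.

```python
# Figures from the slide: a g2.8xlarge has 4 GPUs with 1,536 CUDA cores each,
# at an average price of about $1.00 per hour.
GPUS_PER_INSTANCE = 4
CUDA_CORES_PER_GPU = 1536
PRICE_PER_HOUR_USD = 1.00


def cluster_summary(instances: int, hours: float) -> dict:
    """Total GPUs, CUDA cores and estimated cost for a hypothetical run."""
    gpus = instances * GPUS_PER_INSTANCE
    return {
        "gpus": gpus,
        "cuda_cores": gpus * CUDA_CORES_PER_GPU,
        "cost_usd": round(instances * hours * PRICE_PER_HOUR_USD, 2),
    }


# Hypothetical example: 4 instances running for 10 hours.
summary = cluster_summary(instances=4, hours=10)
print(summary)
```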

Deep Learning Frameworks

TensorFlow (Google)

Caffe (Berkeley)

Torch (Facebook/DeepMind)

Caffe on Spark

Distribute Caffe on a Spark cluster

Developed/maintained by Yahoo (Flickr)


Can run alongside other Spark jobs on an existing cluster

Leverage existing Caffe models

Use SQL, DataFrames, existing LMDB files

Peer-to-peer communication with message passing
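
As an illustration of peer-to-peer message passing, the sketch below simulates a naive ring exchange in which every peer ends up with the average of all gradients, without a central parameter server. This is not CaffeOnSpark's actual implementation; the function name and the ring scheme are assumptions made for the example, and the peers are simulated in one process.

```python
def ring_allreduce_mean(peer_values: list) -> list:
    """Naive ring all-reduce: after n - 1 rounds every peer holds the mean."""
    n = len(peer_values)
    totals = [list(v) for v in peer_values]     # running sum held by each peer
    in_flight = [list(v) for v in peer_values]  # message each peer sends next
    for _ in range(n - 1):
        # Peer i receives the message sent by its left neighbour (i - 1).
        received = [in_flight[(i - 1) % n] for i in range(n)]
        for i in range(n):
            totals[i] = [a + b for a, b in zip(totals[i], received[i])]
        in_flight = received  # each peer forwards what it just received
    return [[t / n for t in total] for total in totals]


# Three simulated peers, each with its own local gradient vector.
grads = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
averaged = ring_allreduce_mean(grads)
print(averaged)
```

Each peer only ever talks to its neighbour. Production all-reduce implementations also split the vectors into chunks to reduce the bandwidth used per step, which this sketch leaves out.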

Not bad...

... let's give it a go!


MNIST Dataset

Classifying handwritten digits

MNIST Features

Under the hood

MNIST Layers
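
For readers without the diagram, the sketch below walks a 28x28 MNIST image through a LeNet-style stack and prints the shape after each layer. It assumes the layer sizes of the LeNet example that ships with Caffe (two 5x5 convolutions with 20 and 50 filters, each followed by 2x2 max pooling); the network shown on the slide may differ, and the helper name is made up.

```python
def output_size(size: int, kernel: int, stride: int = 1, pad: int = 0) -> int:
    """Spatial output size of a convolution or pooling layer."""
    return (size + 2 * pad - kernel) // stride + 1


# (name, kernel, stride, output channels or None to keep the channel count)
# Assumed values, taken from Caffe's LeNet MNIST example.
layers = [
    ("conv1", 5, 1, 20),
    ("pool1", 2, 2, None),
    ("conv2", 5, 1, 50),
    ("pool2", 2, 2, None),
]

size, channels = 28, 1  # MNIST images are 28x28 grayscale
shapes = []
for name, kernel, stride, out_channels in layers:
    size = output_size(size, kernel, stride)
    channels = out_channels or channels
    shapes.append((name, channels, size, size))
    print(name, (channels, size, size))

# Number of inputs to the first fully connected layer.
flat = channels * size * size
print("flattened", flat)
```

In that example network the flattened features then feed a 500-unit fully connected layer and a final 10-unit layer, one output per digit.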


Thank you!

Any questions?

My email: v.vansteenbergen@gmail.com


Sample launch scripts

Caffe on Spark

Yahoo blog
