Distributed Deep Learning

Caffe on Spark on AWS

Vincent Van Steenbergen - @nsteenv

whoami

Vincent Van Steenbergen

Playing with Scala, Akka & Spark for about 3 years

Deeply interested in Artificial Intelligence and Data Analysis

Disclaimer

Deep Learning

aka Convolutional Neural Networks (ConvNets)

Applications

Image analysis

Image generation

Games

Training a model requires:

1. a lot of time (usually weeks/months)

2. a lot of computing power

Ex: AlphaGo - 1,202 CPUs and 176 GPUs - 6 weeks of training

So how can I do that...

from my laptop?

for a decent cost?

within a short timespan?

Yes, you can!

Technically possible on a (high-end) laptop, but very slow

Solution: distribute training over a cluster

Apache Spark

Pool resources from all the Spark slaves on the cluster
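To make "distribute training over a cluster" concrete, here is a minimal sketch of synchronous data-parallel training, simulated in a single Python process. It is not how Spark or Caffe on Spark is implemented: the worker count, the toy linear model, and the function names (`shard`, `local_gradient`, `train`) are all assumptions chosen for illustration. Each "worker" stands in for one Spark slave holding a slice of the data.

```python
# Synchronous data-parallel SGD, simulated in one process (illustrative only).
# Each "worker" stands in for a cluster node that holds one shard of the data.

def shard(data, num_workers):
    """Split the dataset round-robin into one shard per worker."""
    return [data[i::num_workers] for i in range(num_workers)]

def local_gradient(w, b, shard_data):
    """Mean-squared-error gradient of y = w*x + b on one worker's shard."""
    n = len(shard_data)
    grad_w = sum(2 * (w * x + b - y) * x for x, y in shard_data) / n
    grad_b = sum(2 * (w * x + b - y) for x, y in shard_data) / n
    return grad_w, grad_b

def train(data, num_workers=4, steps=2000, lr=0.05):
    shards = shard(data, num_workers)
    w, b = 0.0, 0.0
    for _ in range(steps):
        # On a real cluster these gradients would be computed in parallel.
        grads = [local_gradient(w, b, s) for s in shards]
        # Average the workers' gradients, then apply one shared update.
        grad_w = sum(g[0] for g in grads) / num_workers
        grad_b = sum(g[1] for g in grads) / num_workers
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Toy dataset drawn from y = 3x + 1 (hypothetical, for illustration only).
data = [(x / 10.0, 3 * (x / 10.0) + 1) for x in range(40)]
w, b = train(data)
print(round(w, 3), round(b, 3))
```

Because every worker sees only its own shard, the per-step work shrinks as workers are added; the price is the gradient exchange at each step.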

Amazon Web Services (EC2)

GPU instances (g2.2xlarge, g2.8xlarge)

Spot instances (spare capacity, generally 2-3 times cheaper than regular on-demand instances)

g2.8xlarge configuration

Four NVIDIA GRID GPUs, each with 1,536 CUDA cores and 4 GB of video memory

32 vCPUs

60 GiB of memory

240 GB (2 x 120) of SSD storage

Average price: $1.00 per hour
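A quick back-of-the-envelope calculation shows what such a cluster costs. The sketch below uses the average price and GPU count quoted above; the cluster size (4 instances) and the run length (10 hours) are assumptions picked only for the example.

```python
# Rough cluster cost estimate (illustrative; cluster size and hours are assumed).

def cluster_cost(num_instances, hours, price_per_hour=1.00):
    """Total cost in dollars at the average hourly price quoted above."""
    return num_instances * hours * price_per_hour

def total_gpus(num_instances, gpus_per_instance=4):
    """A g2.8xlarge carries four GPUs."""
    return num_instances * gpus_per_instance

cost = cluster_cost(num_instances=4, hours=10)
gpus = total_gpus(num_instances=4)
print(f"{gpus} GPUs for 10 hours: ${cost:.2f}")
```

Actual prices fluctuate, so the result is an order-of-magnitude figure rather than a quote.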

Deep Learning Frameworks

TensorFlow (Google)

Caffe (Berkeley)

Torch (Facebook/DeepMind)

Caffe on Spark

Distribute Caffe on a Spark cluster

Developed/maintained by Yahoo (Flickr)

Advantages

Can run on an existing cluster alongside other Spark jobs

Leverage existing Caffe models

Use SQL, DataFrames, existing LMDB files

Peer-to-peer communication with message passing
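The peer-to-peer point can be illustrated with a toy exchange in which every node sends its gradient directly to every other node, so no central server sits in the middle. This is only one possible scheme, simulated in a single process; the `Peer` class and `exchange` function are hypothetical and do not reflect the actual Caffe on Spark protocol.

```python
# Toy all-to-all gradient exchange between peers (illustrative only).

class Peer:
    def __init__(self, peer_id, gradient):
        self.peer_id = peer_id
        self.gradient = gradient  # this peer's locally computed gradient
        self.inbox = []           # gradients received from other peers

    def send_to(self, other):
        """Deliver a copy of this peer's gradient to another peer's inbox."""
        other.inbox.append(list(self.gradient))

    def averaged_gradient(self):
        """Average the local gradient with everything received."""
        vectors = [self.gradient] + self.inbox
        return [sum(column) / len(vectors) for column in zip(*vectors)]

def exchange(peers):
    """Every peer messages every other peer; each then averages locally."""
    for sender in peers:
        for receiver in peers:
            if sender is not receiver:
                sender.send_to(receiver)
    return [peer.averaged_gradient() for peer in peers]

peers = [Peer(0, [1.0, 2.0]), Peer(1, [3.0, 4.0]), Peer(2, [5.0, 6.0])]
results = exchange(peers)
print(results)
```

After the exchange, every peer holds the same averaged gradient and can apply the same update independently.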

Not bad...

... let's give it a go!

MNIST

Classifying handwritten digits

Under the hood

Results

Thank you!

Any questions?

My email: v.vansteenbergen@gmail.com

Resources

Sample launch scripts

Caffe on Spark

Yahoo blog

SparkNet