Distributed Deep Learning

Caffe on Spark on AWS

Vincent Van Steenbergen - @nsteenv

whoami

Vincent Van Steenbergen

Playing with Scala, Akka & Spark for about 3 years

Deeply interested in Artificial Intelligence and Data Analysis

Disclaimer

Deep Learning

a.k.a. Convolutional Neural Networks (ConvNets)

Applications

Image analysis

Image generation

Games

Training a model requires:

1. a lot of time (usually weeks/months)

2. a lot of computing power

Ex: AlphaGo - 1,202 CPUs and 176 GPUs - 6 weeks of training

So how can I do that...

from my laptop?

for a decent cost?

within a short timespan?

Yes you can!

Burning laptop

Technically possible on a (high-end) laptop, but very slow

Solution: distribute training over a cluster

Apache Spark

Spark

Pool resources from all the Spark slaves on the cluster
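To make "distribute training over a cluster" concrete, here is a minimal sketch of data-parallel training in plain Python. It assumes a toy one-parameter model and simulates the workers with a list of data shards; the function names (`partition`, `local_gradient`, `train`) are hypothetical and no Spark or Caffe API is used.

```python
# Toy data-parallel SGD: fit y = w * x with the data split across "workers".
# Pure-Python stand-in for a cluster; this is an illustration, not Spark code.

def partition(data, num_workers):
    """Split the data round-robin into one shard per worker."""
    return [data[i::num_workers] for i in range(num_workers)]

def local_gradient(w, shard):
    """Gradient of the mean squared error for y = w * x on one shard."""
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)

def train(data, num_workers=4, lr=0.01, steps=200):
    shards = partition(data, num_workers)
    w = 0.0
    for _ in range(steps):
        # Each worker computes a gradient on its own shard
        # (this would run in parallel on a real cluster).
        grads = [local_gradient(w, shard) for shard in shards]
        # The gradients are averaged and the same update is applied everywhere.
        w -= lr * sum(grads) / len(grads)
    return w

data = [(x, 3.0 * x) for x in range(1, 9)]  # the true slope is 3.0
w = train(data)
print(round(w, 6))
```

Each worker only ever touches its own shard, which is why adding machines lets the same number of training steps cover more data per step.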

Amazon Web Services (EC2)

GPU instances (g2.2xlarge, g2.8xlarge)

Spot instances (bid-based pricing, generally 2-3 times cheaper than regular on-demand instances)

g2.8xlarge configuration

Four NVIDIA GRID GPUs, each with 1,536 CUDA cores and 4 GB of video memory

32 vCPUs

60 GiB of memory

240 GB (2 x 120) of SSD storage

Average price: $1.00 per hour
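As a rough back-of-the-envelope check, the figures above can be turned into a cost estimate. This sketch takes the $1.00 per hour average and the four GPUs per instance from the slide; the cluster size (5 instances) and duration (24 hours) are hypothetical values chosen for illustration, and actual spot prices fluctuate.

```python
def cluster_cost(num_instances, hours, price_per_hour=1.00):
    """Total cost in dollars for running the cluster for the given time."""
    return num_instances * hours * price_per_hour

def total_gpus(num_instances, gpus_per_instance=4):
    """Number of GPUs available across all g2.8xlarge instances."""
    return num_instances * gpus_per_instance

# Hypothetical cluster: 5 instances running for one day.
cost = cluster_cost(num_instances=5, hours=24)
gpus = total_gpus(5)
print(f"{gpus} GPUs for 24 hours: ${cost:.2f}")  # prints "20 GPUs for 24 hours: $120.00"
```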

Deep Learning Frameworks

TensorFlow (Google)

Caffe (Berkeley)

Torch (Facebook/DeepMind)

Caffe on Spark

Distribute Caffe on a Spark cluster

Developed/maintained by Yahoo (Flickr)

Advantages

Can run on an existing cluster alongside other Spark jobs

Leverage existing Caffe models

Use SQL, DataFrames, existing LMDB files

Peer-to-peer communication with message passing
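The peer-to-peer point can be illustrated with a small sketch in which workers exchange gradients directly instead of routing them through a central node. This is only a conceptual model, not the actual CaffeOnSpark protocol: the `Peer` class, the `exchange` function, and the mailbox representation are all hypothetical.

```python
from collections import deque

class Peer:
    """A worker holding a local gradient and a mailbox for incoming messages."""

    def __init__(self, name, gradient):
        self.name = name
        self.gradient = gradient
        self.inbox = deque()

    def send(self, other):
        # A message is a (sender name, gradient value) pair.
        other.inbox.append((self.name, self.gradient))

    def averaged_gradient(self):
        # Combine the local gradient with everything received from other peers.
        received = [g for _, g in self.inbox]
        return (self.gradient + sum(received)) / (1 + len(received))

def exchange(peers):
    """Every peer sends its gradient directly to every other peer."""
    for sender in peers:
        for receiver in peers:
            if sender is not receiver:
                sender.send(receiver)

peers = [Peer("worker-0", 0.5), Peer("worker-1", 1.5), Peer("worker-2", 2.5)]
exchange(peers)
results = [p.averaged_gradient() for p in peers]
print(results)  # prints [1.5, 1.5, 1.5]
```

After the exchange every peer holds the same averaged gradient, so all model replicas stay in sync without a single coordinator becoming a bottleneck.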

Not bad...

... let's give it a go!

MNIST

MNIST Dataset

Classifying handwritten digits

MNIST Features

Under the hood

MNIST Layers

Results

Thank you!

Any questions?

My email: v.vansteenbergen@gmail.com

Resources

Sample launch scripts

Caffe on Spark

Yahoo blog

SparkNet