
Xeon-CafPhi

Caffe deep learning framework - optimized for Xeon Phi

By: Rohith Jagannathan, Dhruv Saksena



Summary

We optimize the Caffe deep learning framework on the CPU and on Xeon Phi. We provide a highly parallelizable convolution implementation that can deliver over 4x speed-up over the off-the-shelf Caffe implementation on the CPU, provided backward convolution is also switched to our method. We achieved this by removing the memory bound of the Caffe implementation and thereby exposing another axis of parallelism. Optimization was evaluated on the MNIST data set. Caffe on Xeon Phi provided results comparable to the CPU for smaller images, but provides nearly 3x the performance for large images. We were unable to run Caffe entirely on Xeon Phi due to library loading issues, but we were able to offload part of the computation (forward convolution) to Xeon Phi for evaluation.



Project Report

What is Caffe?

Caffe is a pure C++/CUDA framework for deep learning. The library is written entirely in C++ and CUDA to attain the best performance.

Caffe is used every day by data scientists and researchers to train neural networks that classify data. It employs a supervised machine learning technique and therefore has a training phase and a testing phase.

Caffe uses a convolutional neural network: it takes images or other 2D data as input and assigns them labels. The input propagates through the network, and the weights of each layer are convolved with that layer's input to produce the output label probabilities. During training we know the true label, so we can compute the discrepancy (loss). We then propagate this loss (its derivative) back through the network, updating the weights.
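As a rough illustration of the weight update at the end of one training step, here is a plain gradient-descent sketch; the names and learning rate are hypothetical and only illustrate the idea, not Caffe's actual solver code.

#include <cstddef>
#include <vector>

// One gradient-descent update: move every weight against its loss gradient.
// 'gradients' holds dLoss/dWeight as computed by the backward pass.
void sgd_update(std::vector<float>& weights,
                const std::vector<float>& gradients,
                float learning_rate /* e.g. 0.01f */) {
    for (size_t i = 0; i < weights.size(); ++i)
        weights[i] -= learning_rate * gradients[i];
}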

Caffe operates in the following way:


Why do we wish to parallelize it?

Training a modern neural network model can take days. For a data scientist who is iterating on parameters, it would be helpful if each run completed a bit quicker.

Caffe is already parallelized using the CBLAS library on the CPU and the cuBLAS library on the GPU. The reference BLAS is written in Fortran and is optimized for arithmetic operations; moreover, optimized CBLAS implementations internally apply SIMD SSE/AVX intrinsics to exploit the data-parallel model of parallelism.


Operating environment:

CPU: There are 15 machines in the cluster; each machine has two six-core Xeon E5-2620 v3 processors running at 2.4 GHz, with a 15 MB L3 cache, hyper-threading, and AVX instruction support.

Xeon Phi: One Xeon Phi 5110P coprocessor board with 60 cores, each operating at about 1 GHz and providing 4 hardware threads. The board has 8 GB of RAM and supports 512-bit ("16-wide" single-precision) vector instructions.

Data set: All tests were done on the MNIST data set.


Doing better than off-the-shelf Caffe on CPU

To improve the parallelism offered by Caffe, we first took time measurements of the different layers. This exercise was aimed at exposing the layer we should attack, i.e., the one whose parallelization would most benefit Caffe as a whole.

We used the cycletimer.h header file to measure layer timings, measuring the forward and backward propagation of each layer in the network separately.

Our measurements clearly show the convolution layer taking the major chunk of the time per forward and backward iteration, so we focused our attention on the forward convolution function.
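A minimal sketch of how such per-layer timings can be collected, assuming cycletimer.h exposes a CycleTimer::currentSeconds() helper; the wrapper name and the callable it times are illustrative, not Caffe's actual code.

#include "cycletimer.h"   // assumed to provide CycleTimer::currentSeconds()
#include <cstdio>

// Bracket the code of interest (e.g. one layer's forward pass) with two
// timestamps and report the elapsed wall-clock time for that layer.
template <typename Work>
double time_section(const char* name, Work&& work) {
    double start = CycleTimer::currentSeconds();
    work();                                            // e.g. the layer's forward pass
    double elapsed = CycleTimer::currentSeconds() - start;
    std::printf("%s: %.3f ms\n", name, elapsed * 1000.0);
    return elapsed;
}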


Forward Convolution Function

Looking at the Caffe’s implementation of the convolution layer, we noticed the following:

The Caffe implementation first unrolls the convolution windows into the columns of a matrix (im2col); it then multiplies that column matrix by the weight matrix. The result is the convolved image.
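To make the shapes concrete, here is a minimal sketch of that multiplication expressed as a single CBLAS SGEMM; the buffer names, and the assumption of M output channels, C input channels, a KxK kernel and an H_out x W_out output, are ours rather than Caffe's.

#include <cblas.h>

// im2col produces a (C*K*K) x (H_out*W_out) column buffer; the weights form an
// M x (C*K*K) matrix. One GEMM then yields the M x (H_out*W_out) output maps.
void gemm_convolution(const float* weights,     // M x (C*K*K)
                      const float* col_buffer,  // (C*K*K) x (H_out*W_out)
                      float* output,            // M x (H_out*W_out)
                      int M, int C, int K, int H_out, int W_out) {
    int kdim = C * K * K;
    int npix = H_out * W_out;
    cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                M, npix, kdim,
                1.0f, weights, kdim,
                col_buffer, npix,
                0.0f, output, npix);
}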

We can clearly see two overheads of this method:

  1. First, we need to allocate a buffer to store the result of the im2col operation (the method is memory bound).
  2. Second, we incur the cost of computing the im2col operation itself.

The original author of Caffe saw the ability of the CBLAS/cuBLAS libraries to multiply matrices efficiently as an opportunity, and thereby sidestepped the issue of writing parallelized convolution code himself. In our implementation, we perform the convolution directly, as shown below; without an im2col operation, our method requires no buffer to store intermediate results.


for i in 1..Num_Input_Channels
  for o in 1..Num_Output_Channels
    for h in 1..Image_height
      for w in 1..Image_width
        for x in 1..Kernel_height
          for y in 1..Kernel_width
            output(o, h, w) += input(i, h+x, w+y) * filter(i, o, x, y)
          end
        end
      end
    end
  end
end

Performance results on CPU

The graph above compares the performance of our implementation with the off-the-shelf Caffe implementation at each stage. With basic convolution implemented, our code takes twice as long to complete; note that the off-the-shelf code is already optimized and runs in parallel through the CBLAS library. We then added parallelism using AVX intrinsics and Cilk and were able to beat the off-the-shelf Caffe implementation, so our implementation also exploits the data-parallel model of parallelism.

As seen in the last section of the graph, the time taken by the off-the-shelf code is reduced to 37 ms. This was achieved by modifying the off-the-shelf code to allocate a buffer large enough to hold all the images in a batch. As described by its author, the off-the-shelf code operates on one image at a time on the CPU (but on all images in a batch on the GPU) because of the additional memory required to store the intermediate matrix results. In our implementation there are no intermediate results and hence no additional memory is required, so we parallelized the code by performing batch processing in the convolution layer, as sketched below.
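As a concrete illustration of that batch processing, here is a minimal sketch of the direct convolution parallelized over the images of a batch with OpenMP; the array layouts, names and pragma placement are our illustrative choices rather than Caffe's actual code.

#include <omp.h>

// Batch-parallel direct convolution: one OpenMP thread works on one image,
// so a single image's data stays local to one core.
// input:  N x C x (H+K-1) x (W+K-1)   (assumed already padded)
// filter: M x C x K x K
// output: N x M x H x W
void conv_forward_batch(const float* input, const float* filter, float* output,
                        int N, int C, int M, int H, int W, int K) {
    const int in_h = H + K - 1, in_w = W + K - 1;
    #pragma omp parallel for
    for (int n = 0; n < N; ++n)
        for (int m = 0; m < M; ++m)
            for (int h = 0; h < H; ++h)
                for (int w = 0; w < W; ++w) {
                    float sum = 0.0f;
                    for (int c = 0; c < C; ++c)
                        for (int x = 0; x < K; ++x)
                            for (int y = 0; y < K; ++y)
                                sum += input[((n * C + c) * in_h + h + x) * in_w + (w + y)]
                                     * filter[((m * C + c) * K + x) * K + y];
                    output[((n * M + m) * H + h) * W + w] = sum;
                }
}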

This additional memory has significant implications, as shown in the graph below: with increasing kernel and image sizes, the amount of memory required grows rapidly, since the im2col buffer stores one K x K x C patch for every output pixel.
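As a rough worked example (our own illustrative numbers, not measured data), the per-image im2col buffer is roughly kernel-area times larger than the image itself:

#include <cstdio>

// Per-image im2col buffer size in floats: one K*K*C column per output pixel.
// E.g. a 3-channel 512x512 image with a 5x5 kernel and same-sized output needs
//   3 * 5 * 5 * 512 * 512 = 19,660,800 floats ~= 79 MB,
// i.e. about 25x (the kernel area) the ~3 MB taken by the input image itself.
size_t im2col_buffer_floats(int C, int K, int H_out, int W_out) {
    return static_cast<size_t>(C) * K * K * H_out * W_out;
}

int main() {
    size_t n = im2col_buffer_floats(3, 5, 512, 512);
    std::printf("im2col buffer: %zu floats (%.1f MB)\n", n, n * sizeof(float) / 1e6);
    return 0;
}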


Our conclusion from this performance analysis is that, with parallel batch processing on basic convolution (shared-memory model of parallelism) and the removal of the additional buffer in both forward and backward convolution, we can provide over 4x the performance of the native off-the-shelf implementation.


Performance results on Xeon-Phi:

On Xeon Phi, Caffe can take advantage of parallelism across cores and/or SIMD units to run computation-heavy jobs. There are two ways to do this:

  1. Run Caffe with Xeon-Phi as host: The complete Caffe code can be loaded onto the Xeon Phi and run there. This carries the overhead of loading all the associated libraries such as glog, gflags and protobuf. We had issues loading the necessary libraries onto the Xeon Phi; the course staff could not load them either, so we agreed to offload only part of the computation to evaluate performance.
  2. Offload part of the computation to Xeon-Phi: A portion of the computation can be offloaded to the Xeon Phi. Unfortunately, this also requires the necessary libraries to be available on the Xeon Phi. Hence, as concluded above, we ran a simulation of multiple batches of forward convolution and used it to evaluate the operation on the CPU and on the Xeon Phi.

The offloaded code performs forward convolution using our convolution code with both randomized pixels and images from MNIST. This was evaluated on the CPU and on the Xeon Phi with implicit, explicit and automatic offload. The performance is shown in the figure below:

Input 1 simulates an MNIST image and input 2 simulates a larger data set such as ImageNet.

The explicit offload technique is compiler assisted and gave us the best performance. This is because we have complete control over data movement. Based on the structure of the convolution, we launched one thread per image using OpenMP; this better utilized the cache and took advantage of locality by keeping a single image's data on one core.
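A minimal sketch of this explicit offload pattern using the Intel compiler's offload pragma with OpenMP inside the offloaded region; the dimensions, names and the assumed conv_one_image helper (the per-image loop nest shown earlier) are illustrative, not our exact evaluation code.

#include <omp.h>

// Assumed per-image helper, defined as the direct-convolution loop nest above;
// __attribute__((target(mic))) makes it available on the coprocessor.
__attribute__((target(mic)))
void conv_one_image(const float* in, const float* filt, float* out,
                    int C, int M, int H, int W, int K);

void conv_batch_explicit_offload(const float* input, const float* filter, float* output,
                                 int N, int C, int M, int H, int W, int K) {
    long in_len   = (long)N * C * (H + K - 1) * (W + K - 1);
    long filt_len = (long)M * C * K * K;
    long out_len  = (long)N * M * H * W;

    // Copy inputs to the coprocessor, run the loop there, copy the results back.
    #pragma offload target(mic:0) in(input : length(in_len)) \
                                  in(filter : length(filt_len)) \
                                  out(output : length(out_len))
    {
        #pragma omp parallel for   // one image per thread on the Phi
        for (int n = 0; n < N; ++n)
            conv_one_image(input  + (long)n * C * (H + K - 1) * (W + K - 1),
                           filter,
                           output + (long)n * M * H * W,
                           C, M, H, W, K);
    }
}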

In the implicit offload technique, the CPU and the Xeon Phi share (virtual) memory and the runtime manages synchronization. Since we are operating on large images, this technique performs the worst because of the overhead of keeping that data synchronized.
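For contrast, a minimal sketch of the implicit (virtual shared memory) model using the Intel compiler's _Cilk_shared/_Cilk_offload keywords; the sizes and the stand-in computation are illustrative, and it is the runtime's coherence traffic on the shared data that causes the overhead discussed above.

// Data and code marked _Cilk_shared live in memory kept coherent by the runtime
// between the host and the coprocessor.
#define IMG_PIX (28 * 28)

_Cilk_shared float shared_image[IMG_PIX];
_Cilk_shared float shared_result[IMG_PIX];

// Function compiled for both host and coprocessor.
_Cilk_shared void scale_image(float factor) {
    for (int i = 0; i < IMG_PIX; ++i)
        shared_result[i] = shared_image[i] * factor;   // stand-in for the real convolution
}

void run_implicit_offload() {
    // The runtime migrates shared_image to the Phi and shared_result back;
    // that synchronization is the overhead discussed above.
    _Cilk_offload scale_image(2.0f);
}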

The automatic offload technique performs computationally intensive calls through MKL (Math Kernel Library); MKL manages the details and divides the work between the host and the MIC. We used MKL to offload the CBLAS matrix operation as implemented in off-the-shelf Caffe and evaluated its performance. The performance of this technique is not as good as explicit offload. Good performance is seen for convolution on small images, because MKL offloads only after the problem size reaches an appropriate threshold, for example SGEMM(M, N, K) = {2048, 256, 2048}. Our input 1 image size is much smaller than this threshold, so the computation was done locally on the CPU, which explains the good performance.
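A sketch of how automatic offload can be enabled around a CBLAS SGEMM call; the matrix sizes are illustrative, and setting the MKL_MIC_ENABLE=1 environment variable achieves the same without code changes.

#include <mkl.h>
#include <vector>

void gemm_with_automatic_offload(int M, int N, int K) {
    std::vector<float> A((size_t)M * K, 1.0f), B((size_t)K * N, 1.0f), C((size_t)M * N, 0.0f);

    mkl_mic_enable();   // ask MKL to consider offloading large calls to the Phi

    // MKL decides whether this SGEMM runs on the host, on the Phi, or split
    // across both, based on the problem size; small problems stay on the CPU.
    cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                M, N, K, 1.0f, A.data(), K, B.data(), N, 0.0f, C.data(), N);
}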

Our conclusion from this performance analysis is that explicit control of parallelism in the hands of the programmer works best, since the programmer understands the underlying operation and can determine the best place to parallelize. For a small data set such as MNIST, the performance on Xeon Phi is comparable with the off-the-shelf Caffe implementation, but for larger image sizes, with more computation to exploit, the performance of Xeon explicit offload is expected to exceed CPU performance by 3x, as seen for input 2.

We were constrained by the 16 GB memory limit on the Xeon Phi setup and, due to limited time, were unable to run our algorithm on the CPU and Xeon Phi for a larger data set such as ImageNet.


What we could have done better


References





Project Checkpoint

We have completed the integration and analysis of Caffe on our local machines. We are on track with the plan but are behind on evaluating it on the Xeon Phi on Latedays. We recently got access and will prioritize this evaluation; the schedule has been modified slightly accordingly.

We have benchmarked Caffe with the out-of-the-box CPU implementation in the following configuration:

  - 64-bit Ubuntu 14.04 Linux virtual machine
  - RAM: 8192 MB
  - Processors: 4
  - Video memory: 48 MB

MNIST => 5.5 seconds/100 iterations

CIFAR-10 => 25.6 seconds/100 iterations

Analysis: Caffe Implementation of Convolution Layer

Caffe implements the convolution of different filters over images as matrix multiplication: it converts the convolutions into a set of matrix multiplications.

Convolution is usually done with a sliding-window approach: the filter is placed over each pixel and its neighbours, and the response at that pixel is obtained by summing the products of the corresponding elements.

Figure: Convolution in 1-D

Thus it is an iterative operation implemented with nested for loops:

for w in 1..W
  for h in 1..H
    for x in 1..K
      for y in 1..K
        output(w, h) += input(w+x, h+y) * filter(x, y)
      end
    end
  end
end

where the image is W x H in size and the filter is K x K.

The implementer of Caffe was not GPU savvy, yet he wished to have his convolution run on a GPU and leverage its many cores to train his neural networks faster. Since he could not easily parallelize the for loops, he instead used a utility much like Matlab's im2col, which unrolls an input 2D array into a set of columns. Each column holds one window instance of the image, the filter is likewise laid out as columns, and a matrix multiplication then produces the output.

Of course, he pads the image appropriately according to the mask size, as you will see if you look at the layers in Caffe (5x5 filters use a padding of 2 on the image).

Optimized CBLAS routines are already available for the matrix multiplication, and the multiplication result is then reshaped back into the convolved image.

The major drawback is having to keep these column images in RAM, which costs time to transfer to the GPU and also takes up scarce memory.

Our Design

We will parallelise the for loops, which should save space, and we wish to do this for the Xeon Phi processor on Latedays. We will implement the convolution directly and iteratively, which the author did not do.

We will first implement the convolution serially and then parallelize it. If the image is (Width x Height) with depth D (e.g. RGB), then each location can be represented as a (K x K) patch, which across channels is a (K x K x D) vector. M filters are then applied to each such patch. This can be accomplished with nested for loops in the following form:

for w in 1..Width
  for h in 1..Height
    for i in 1..K
      for j in 1..K
        for m in 1..M
          for d in 1..D
            output(w, h, m) += input(w+i, h+j, d) * filter(m, i, j, d)
          end
        end
      end
    end
  end
end

We intend to parallelize this implementation using ISPC, OpenMP and pthreads, and based on analysis determine which combination of choices at the appropriate loops gives the best performance for the MNIST and CIFAR-10 data sets. This will require detailed analysis and evaluation.

The following references helped us analyze and understand convolution:




Project Proposal

Background

A convolutional neural network is currently a very popular machine learning algorithm used to classify data. It is especially used in computer vision to classify images with high accuracy. A convolutional neural network applied to an image exposes multiple axes of parallelism at the different layers of propagation during both training and testing of the network.

Each hidden layer has multiple feature maps, which are produced by convolving the respective weights over all the pixels of the previous layer's feature maps and adding the corresponding results. This convolution task is highly parallelisable.

Max pooling is the task of subsampling the image by taking the maximum (or, for average pooling, the mean) over subregions of the image. This is done to gain some invariance to small shifts in the input. This task of down-sampling images is also highly parallelisable.
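A minimal sketch of 2x2 max pooling on a single-channel image; the sizes and names are our own, purely to make the operation concrete.

#include <algorithm>

// 2x2 max pooling with stride 2 over a single-channel H x W image (H, W even).
// The output is (H/2) x (W/2); each output pixel is the maximum of a 2x2 block.
void max_pool_2x2(const float* input, float* output, int H, int W) {
    int out_h = H / 2, out_w = W / 2;
    for (int i = 0; i < out_h; ++i)
        for (int j = 0; j < out_w; ++j) {
            float m = input[(2 * i) * W + (2 * j)];
            m = std::max(m, input[(2 * i) * W + (2 * j + 1)]);
            m = std::max(m, input[(2 * i + 1) * W + (2 * j)]);
            m = std::max(m, input[(2 * i + 1) * W + (2 * j + 1)]);
            output[i * out_w + j] = m;
        }
}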

Caffe is a framework written in C++ to implement such convolutional neural networks and to exploit these axes of parallelism. Caffe turns the convolutions into a set of matrix multiplications, uses the CBLAS library on the CPU, and uses cuDNN with Nvidia GPUs to attain faster training speeds.

Challenge

Optimizing Caffe for Xeon Phi is hard because it is already optimized for both generic CPUs and GPUs. We intend to find the specific parts of the Caffe implementation that can be optimized to better use the features of the Xeon Phi, which will be challenging. This requires us to understand the existing parallelism support and either find new parallelism or fine-tune the existing support for the Xeon Phi.

My partner and I have not worked with Caffe before, so we will need to spend time using and understanding the code. The convolution implementation transforms the input data into matrices and computes on those. We will need to understand and reimplement this feature to better utilize the parallelism of the Xeon Phi, which will be challenging.

Resources

Goals/Deliverables:

Platforms

We will be using the Linux POSIX threads framework and the ISPC framework to parallelise the Caffe implementation for the Xeon Phi processor of the Latedays cluster. We will also consider OpenMPI to leverage the other 14 machines in the Latedays cluster.

Schedule