Distributed Learning with Apache MXNet and Horovod

Translation of the article prepared in advance of the start of the course “Industrial ML on Big Data”

Distributed training on several high-performance computing instances can reduce the training time of modern deep neural networks on a large amount of data from several weeks to hours or even minutes, which makes this training technique prevailing in the practical use of deep learning. Users must understand how to share and synchronize data across multiple instances, which in turn has a big impact on scaling efficiency. In addition, users should also know how to deploy a training script that runs on a single instance across multiple instances.

In this article, we’ll talk about a quick and easy way to distributed learning using the open source deep learning library Apache MXNet and the Horovod distributed learning framework. We will demonstrate the performance advantages of the Horovod framework and demonstrate how to write the MXNet tutorial so that it works in a distributed manner with Horovod.

What is Apache MXNet

Apache MXNet – An open framework for deep learning, which is used to create, train and deploy deep neural networks. MXNet abstracts the complexities associated with the implementation of neural networks, has high performance and scalability, and also offers APIs for popular programming languages ​​such as Python, C ++, Clojure, Java, Julia, R, Scala and others.

MXNet distributed learning with parameter server

MXNet Standard Distributed Learning Module uses the parameter server approach. It uses a set of parameter servers to collect gradients from each worker, perform aggregation, and send updated gradients back during the next optimization iteration. Determining the correct ratio of servers to workers is the key to effective scaling. If there is only one parameter server, it may turn out to be a bottleneck in the calculation. Conversely, if too many servers are used, then many-to-many communications can clog all network connections.

What is Horovod

Horovod – An open distributed deep learning framework developed by Uber. It uses efficient technologies for interaction between multiple GPUs and nodes, such as the NVIDIA Collective Communications Library (NCCL) and the Message Passing Interface (MPI) to distribute and aggregate model parameters between Wrecks. It optimizes the use of network bandwidth and scales well when working with models of deep neural networks. It currently supports several popular machine learning frameworks, namely MXNet, Tensorflow, Keras, and PyTorch.

MXNet and Horovod Integration

MXNet integrates with Horovod through the distributed learning APIs defined in Horovod. Horovod Communication APIs horovod.broadcast (), horovod.allgather () and horovod.allreduce () implemented using asynchronous callbacks of the MXNet engine, as part of its task graph. Thus, the data dependencies between communication and computing are easily handled by the MXNet engine to avoid performance losses due to synchronization. Distributed optimizer object defined in Horovod horovod.DistributedOptimizer expands Optimizer on MXNet so that it invokes the appropriate Horovod APIs for distributed parameter updates. All of these implementation details are transparent to end users.

Fast start

You can quickly start training a small convolutional neural network on the MNIST dataset using MXNet and Horovod on your MacBook.
To get started, install mxnet and horovod from PyPI:

pip install mxnet
pip install horovod

Note: If you encounter an error during pip install horovodmaybe you need to add a variable MACOSX_DEPLOYMENT_TARGET = 10.vvwhere vv – this is the version of your MacOS version, for example, for MacOSX Sierra you will need to write MACOSX_DEPLOYMENT_TARGET = 10.12 pip install horovod

Then install OpenMPI from here.

Download the test script at the end. mxnet_mnist.py from here and run the following commands in a MacBook terminal in the working directory:

mpirun -np 2 -H localhost:2 -bind-to none -map-by slot python mxnet_mnist.py

So you start training on two cores of your processor. The output will be as follows:

INFO:root:Epoch[0] Batch [0-50] Speed: 2248.71 samples/sec      accuracy=0.583640
INFO:root:Epoch[0] Batch [50-100] Speed: 2273.89 samples/sec      accuracy=0.882812
INFO:root:Epoch[0] Batch [50-100] Speed: 2273.39 samples/sec      accuracy=0.870000

Performance demonstration

When training the ResNet50-v1 model on an eight-instance ImageNet 64-GPU dataset p3.16xlarge EC2, each of which contains 8 NVIDIA Tesla V100 GPUs on the AWS cloud, we achieved a training throughput of 45,000 images / sec (i.e. the number of trained samples per second). The training was completed in 44 minutes after 90 eras with the best accuracy of 75.7%.

We compared this with MXNet distributed learning with the approach of using parameter servers on 8, 16, 32 and 64 GPUs with a server with one parameter and the ratio of servers to workers 1 to 1 and 2 to 1, respectively. You can see the result in Figure 1 below. On the y-axis on the left, the columns show the number of images for training per second, the lines show the scaling efficiency (i.e. the ratio of the actual throughput to the ideal) on the y-axis on the right. As you can see, the choice of the number of servers affects the efficiency of scaling. If there is only one parameter server, scaling efficiency drops to 38% on 64 GPUs. To achieve the same scaling efficiency as with Horovod, you need to double the number of servers in relation to the number of workers.

Figure 1. Comparison of distributed learning using MXNet with Horovod and the parameter server

In Table 1 below, we compared the total cost of an instance when performing experiments on 64 GPUs. Using MXNet with Horovod provides the best throughput at the lowest cost.

Table 1. Cost comparison between Horovod and the parameter server with the ratio of servers to workers 2 to 1.

Steps to play

In the next steps, we’ll show you how to reproduce the result of distributed learning using MXNet and Horovod. To learn more about distributed learning with MXNet, read this post.

Step 1

Create a cluster of homogeneous instances with MXNet version 1.4.0 or higher and Horovod version 0.16.0 or higher to use distributed learning. You will also need to install libraries for learning on the GPU. For our instances, we chose Ubuntu 16.04 Linux, with the GPU Driver 396.44, CUDA 9.2, the cuDNN 7.2.1 library, the NCCL 2.2.13 communicator, and OpenMPI 3.1.1. You can also use Amazon Deep Learning AMIwhere these libraries are already preinstalled.

Step 2

Add the ability to work with the Horovod API to your MXNet training script. The following MXNet Gluon API-based script can be used as a simple template. Lines in bold are needed if you already have the appropriate training script. Here are some critical changes you need to make to learn with Horovod:

  • Set the context according to the local Horovod rank (line 8) to understand that the training is done on the correct graphics core.
  • Pass the initial parameters from one worker to all (line 18) to make sure that all workers start working with the same initial parameters.
  • Create Horovod DistributedOptimizer (line 25) to update the parameters distributed.

For a complete script, see Horovod-MXNet examples Mnist and ImageNet.

1  import mxnet as mx
2  import horovod.mxnet as hvd
4  # Horovod: initialize Horovod
5  hvd.init()
7  # Horovod: pin a GPU to be used to local rank
8  context = mx.gpu(hvd.local_rank())
10 # Build model
11 model = ...
13 # Initialize parameters
14 model.initialize(initializer, ctx=context)
15 params = model.collect_params()
17 # Horovod: broadcast parameters
18 hvd.broadcast_parameters(params, root_rank=0)
20 # Create optimizer
21 optimizer_params = ...
22 opt = mx.optimizer.create('sgd', **optimizer_params)
24 # Horovod: wrap optimizer with DistributedOptimizer
25 opt = hvd.DistributedOptimizer(opt)
27 # Create trainer and loss function
28 trainer = mx.gluon.Trainer(params, opt, kvstore=None)
29 loss_fn = ...
31 # Train model
32 for epoch in range(num_epoch):
33    ...

Step 3

Log in to one of the workers to start distributed learning using the MPI directive. In this example, distributed learning runs on four instances with 4 GPUs in each, and with 16 GPUs in the cluster in total. The stochastic gradient descent optimizer (SGD) will be used with the following hyperparameters:

  • mini-batch size: 256
  • learning rate: 0.1
  • momentum: 0.9
  • weight decay: 0.0001

As we scaled from one GPU to 64 GPUs, we linearly scaled the learning speed according to the number of GPUs (from 0.1 for 1 GPU to 6.4 for 64 GPUs), while maintaining the number of images per GPU equal to 256 ( from a package of 256 images for 1 GPU to 16,384 for 64 GPUs). The weight decay and momentum parameters changed as the number of GPUs increased. We used blended learning accuracy with the direct pass float16 data type and float32 for gradients to speed up the float16 calculations supported by the NVIDIA Tesla GPU.

$ mpirun -np 16 
    -H server1:4,server2:4,server3:4,server4:4 
    -bind-to none -map-by slot 
    -mca pml ob1 -mca btl ^openib 
    python mxnet_imagenet_resnet50.py


In this article, we looked at a scalable approach to distributed model training using Apache MXNet and Horovod. We have shown scaling efficiency and economic efficiency in comparison with the approach using the parameter server on the ImageNet dataset, on which the ResNet50-v1 model was trained. We also reflected the steps by which you can modify an existing script to start training on multiple instances using Horovod.

If you’re just starting out with MXNet and deep learning, go to the installation page MXNeto build MXNet first. We also strongly recommend reading the article. MXNet in 60 minutesto get started.

If you have already worked with MXNet and want to try distributed learning with Horovod, then check out Horovod installation pagebuild it with MXNet and follow the example Mnist or ImageNet.

* cost is calculated based on hourly rate AWS for EC2 Instances

Learn more about the course “Industrial ML on Big Data”

Similar Posts

Leave a Reply Cancel reply