FORCE enables extreme pruning of artificial neural networks at initialization. - Naver Labs Europe

A new method called FORCE achieves extreme sparsity in artificial neural networks by progressively removing up to 99.9% of parameters at initialization, making it a promising candidate for training networks on edge devices.

Force

Artificial neural networks (ANNs) are algorithms designed to process information in a way inspired by the human brain. They’re based on neurons, or nodes, that are organised into layers to make up a network. The connection between these nodes—whose strength depends on the weight (or parameter) of the connection—determines how information is processed. These parameters are usually defined during the training stage, the goal of which is to update weight values to decrease the error (or loss) of the predictions over the training set. The choice of dataset used to train ANNs depends on the particular use case. For example, image datasets are used to develop computer vision technology, whereas language datasets are implemented during the creation of natural language processing models.

As long as 30 years ago, researchers realized that ANN models were overparameterized. In other words, a large portion of parameters could be ‘pruned’ (set to 0) with little impact on the performance of the model. Pruning these parameters, which leads to network sparsity, has several benefits. First, it acts as a regularizer, meaning that redundant connections are eliminated and the model is constrained, which tends to improve generalization. Second, as sparse subnetworks require less storage space and use fewer FLOPS (floating point operations per second), pruning reduces computational cost. Finally, pruning makes deployment on local devices easier by reducing the memory footprint and computation time, which in turn reduces latency.

Traditionally, pruning methods have been used to remove weights after a network has been trained. A dense network is first fit to a training set and subsequently pruned according to a certain weight-selection criterion. This initial process is followed by a fine-tuning round, which enables the network to adapt to the pruning. Most optimization methods that implement pruning repeat this train → prune → fine-tune cycle several times, making them rather computationally demanding. Recent alternative approaches have instead incorporated dynamic pruning (1), which requires only one round of training. As a result, these new approaches are typically less resource intensive. Nevertheless, they generally benefit from a fine-tuning round to accommodate the weights after fixing the final (pruned) subnetwork, which makes training a bit longer. Moreover, since the topology of the subnetwork changes during training, it’s more difficult to fully leverage the benefits of sparsity during training.

Other recent work (2, 3) has shown that it’s also possible to prune networks at initialization, rather than during training. This means that the benefits of sparsity can be leveraged at both training and test time, and may make it possible to train models directly on edge devices such as drones or smartphones, as well as reducing the cost of exploratory data analysis.

A brief history of pruning at initialization

At ICLR (International Conference on Learning Representations) 2019, Frankle and Carbin presented the lottery ticket hypothesis (LTH) (4). Their hypothesis describes sparse subnetworks that can be trained from scratch (i.e. from weight values at initialization) to match or surpass the performance of their dense counterparts. This idea was an encouraging one, as it suggested that the benefits of sparsity could be utilized at both training and test time. However, despite clear theoretical interest, the algorithm designed to find these subnetworks required the dense network to be trained several times, making the overall process impractical. (For an in-depth review of the LTH, we recommend this post.) At the same conference, Lee and colleagues presented SNIP (2), which was the first method to perform pruning at initialization. Although SNIP wasn’t able to attain the same level of sparsity as the LTH, the work was nonetheless significant because it showed an effective way to find sparse subnetworks at initialization. More recently, at ICLR 2020, Wang and colleagues presented GRASP (3)—another method for pruning at initialization—that addressed some of the issues with SNIP’s weight-selection criteria. But the assumptions of this approach do not appear to hold at high sparsities. With our own work on FORCE (FOResight Connection sEnsitivity), we aim to overcome some of these issues.

A synopsis of concurrent work

SynFlow: Pruning neural networks without any data by iteratively conserving synaptic flow

SynFlow (5) presents a method for pruning a network without any data. The main intuition of the authors is that it isn’t possible to remove all connections from a layer without, as a result, blocking the signal propagation through the network. They therefore suggest a data-agnostic pruning criterion that, by construction, will never deplete a whole layer if it can remove weights from elsewhere. In our work, we argue that this assumption is untrue for modern architectures with skip connections (i.e. shortcuts to jump over layers in a neural network), such as ResNet. In fact, in our work we observed that our approach, FORCE, is able to reach good accuracies with a ResNet50 architecture despite pruning entire layers.

SNIP-it: Pruning via iterative ranking of sensitivity statistics

SNIP-it (6) follows an idea very similar to ours, and although our work is more theoretical and seeks to find a good approximation of an ‘ideal’ way of pruning, our final algorithm for optimizing FORCE is close to this proposition. However, SNIP-it work puts more focus on exploring tangential directions, such as filter pruning at initialization or progressively pruning during training. Conversely, we focus on unstructured, iterative pruning (i.e. progressively removing individual connections instead of entire filters) that can be leveraged during training.

Formulating the pruning-at-initialization problem

Given a network with randomly initialized parameters and a user-defined sparsity, our objective is to find the subnetwork that, when trained from scratch, exhibits maximum performance after training (e.g. accuracy for classification problems). Solving this problem is unfeasible, however, since it would require training all possible subnetworks to find a full solution for pruning at initialization. Instead, these algorithms use a hand-designed weight-selection criterion with the aim of predicting the impact that each weight will have later in training. Although this is a heuristic choice without theoretical guarantees, it has empirically proven to be surprisingly useful.

Comparing previous work with FORCE

For SNIP, the researchers adapted a weight-selection criterion first introduced by Mozer and Smolensky in 1988 (7), and named it ‘connection sensitivity’. Connection sensitivity measures the impact that each weight has on the loss (specifically, it computes the product of each weight with the gradient of the loss with respect to that weight in absolute value). However, the gradient of the loss function will differ before and after pruning, due to complex interactions between weights. Conversely, Wang and colleagues 2020 (3) suggested a different approach, GRASP, with the objective of maximizing the gradient norm (backward signal) after pruning. They treat pruning as a perturbation and apply Taylor’s approximation to compute the gradient. This assumption however appears to no longer hold when a large portion of the weights are removed.

Instead, we propose looking at the connection sensitivity after pruning, and keeping the subset of weights with a higher capacity to change the loss after pruning, hence the name FOResight Connection sEnsitivity. The full mathematical details of FORCE can be found in our paper (8).

Comparing FORCE, GRASP and SNIP:

  • SNIP can be seen as a particular case of FORCE which assumes that all parameters are removed one by one with replacement. It therefore fails to capture the effect of removing a group of parameters. In contrast, FORCE looks at the connection sensitivity after pruning, thus taking into account the effect of having removed all the weights as a group.
  • Similar to GRASP, FORCE is based on the gradients after pruning. However, GRASP assumes that pruning is a small perturbation on the gradient norm, which isn’t reasonable for high sparsity levels. As FORCE prunes iteratively, it always removes a small number of weights (instead of removing all weights at once), leading to a better approximation.
  • At high sparsity levels, SNIP and GRASP will be suboptimal, due to the large changes in gradients and the Taylor approximation, respectively. Empirically, we observe that SNIP and GRASP both fail in the case of extreme pruning, whereas FORCE succeeds.

Progressive skeletonization: removing weights iteratively

As mentioned above, we can think of the connection sensitivity introduced in SNIP as an approximation of FORCE, where we assume that the gradients before and after pruning remain unchanged. Although this is not true in general, it seems to yield good results for moderate sparsities. However, this gradient approximation is better suited when the number of weights to be pruned is much smaller than the number of remaining weights in the network. To achieve extreme pruning, and therefore obtain high sparsity levels, we propose pruning the network iteratively by removing a small number of weights at each step using the FORCE objective and the gradient approximation.

Sparsity schedules: linear and exponential

To better understand how iterative pruning affects the optimization of FORCE, we performed an experiment where we vary the number of iterative steps (T). We studied two sparsity schedules, shown in Figure 1: for the linear schedule, we begin with the dense network and then arrive at the desired sparsity in equally sized steps; for the exponential schedule, we exponentially decay sparsity (that is, the sparser the network, the fewer weights we remove).

Num iterative steps

With one iteration, FORCE is equivalent to SNIP. As we increase sparsity, however, pruning iteratively becomes crucial. The linear schedule requires more iterations than the exponential one to achieve a similar result, reinforcing our intuition that the portion of pruned versus retained weights must be small in order for the gradient approximation to hold. For the exponential schedule, this happens naturally, but for the linear schedule a greater number of iterations are required so that all steps become small enough.

 Network sparsity and batch size

We use more data when pruning iteratively than for one-shot methods. Therefore, to achieve a fair comparison between our method and alternative approaches, we introduced the variants SNIP-MB and GRASP-MB (i.e. using multiple batches, MB, as opposed to just one). We approximated their respective saliencies using the same amount of data as for FORCE (see Figure 2). When using more data to estimate SNIP or GRASP saliencies, we see a boost in performance, indicating the need for a better approximation of the saliency. Moreover, we found that FORCE outperforms other methods using the same amount of data.

Remaining weights
Figure 2: The accuracy/sparsity trade-off for different pruning methods (FORCE, GRASP and SNIP) for two artificial neural network models (ResNet50 and VGG19), measured on the CIFAR-10 test dataset. When we increase the number of batches (MB: multiple batches), we find that one-shot methods can prune to higher sparsity levels but that our gradual pruning method achieves even higher sparsity. The shaded areas denote the standard deviation of the runs (too small to be visible in some cases). Note that in (b), GRASP and GRASP-MB overlap. Green: FORCE. Brown: GRASP. Purple: GRASP-MB. Blue: SNIP. Orange: SNIP-MB. Red: Random pruning baseline. Grey: Dense baseline (i.e. the network before pruning).

Network structure after pruning

Another informative analysis can be carried out by looking at the structure of the pruned subnetworks. This allows us to understand which layers are more heavily pruned, and in which ratio. Figure 3 shows that, for ResNet50, all methods prune some layers completely. This is because skip connections allow the flow of forward and backward signals. Architectures without skip connections (such as VGG), on the other hand, require non-empty layers to keep the flow of information. We hypothesize that this is the reason that we’re able to prune a ResNet network to higher sparsity levels than VGG.

 

Figure 3: Visualization
Figure 3: Visualization of remaining weights, after pruning 99.9% of parameters, in terms of the reduction in density at different layers of the network for SNIP-MB, GRASP-MB and FORCE: (a) and (d) show the fraction of remaining weights for each prunable layer; (b) and (e) show the actual number of remaining weights; and (c) and (f) show zooms of the bottom areas of the plots in (b) and (e), respectively, with the y-axis cut to 100 weights per layer. Note that for (c) ResNet50, some layers have no weights left, whereas this is not the case for (f) VGG architecture.

Summarizing FORCE, and avenues for future work

FORCE, our new approach to pruning ANNs, implements iterative pruning at initialization. By removing a small number of weights at each step, using the FORCE objective and the gradient approximation, our approach achieves extreme sparsity in the network with a much better sparsity/accuracy trade-off than previous methods. More detail can be found in our paper (8), and our implementation can be found on GitHub.

To the best of our knowledge, no published results compare pruning at initialization methods with the random pruning baseline on ImageNet. We find that, in the case of ResNet50 and ImageNet, methods for pruning at initialization perform no better than random pruning for high sparsities. Despite this being a fairly negative result, we’re nonetheless confident that there are better ways we can prune networks at initialization. It appears obvious, however, that the behaviour of pruning methods is not uniform across networks and datasets, suggesting a direction in clear need of further exploration.

This work was done in collaboration with researchers from the University of Oxford, within the NAVER Global AI R&D Belt.

References

[1] Dynamic Model Pruning with Feedback. Tao Lin, Sebastian U. Stich, Luis Barba, Daniil Dmitriev and Martin Jaggi. International Conference on Learning Representations (ICLR), 26 April–1 May 2020.

[2] SNIP: Single-Shot Network Pruning Based on Connection Sensitivity. Namhoon Lee, Thalaiyasingam Ajanthan and Philip Torr. International Conference on Learning Representations (ICLR), New Orleans, LA, 6–9 May 2019.

[3] Picking Winning Tickets before Training by Preserving Gradient Flow. Chaoqi Wang, Guodong Zhang and Roger Grosse. International Conference on Learning Representations (ICLR), 26 April–1 May 2020.

[4] The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks. Jonathan Frankle and Michael Carbin. International Conference on Learning Representations (ICLR), New Orleans, LA, 6–9 May 2019.

[5] Pruning Neural Networks without Any Data by Iteratively Conserving Synaptic Flow. Hidenori Tanaka, Daniel Kunin, Daniel L. K. Yamins and Surya Ganguli. arXiv:2006.05467 [cs.LG].

[6] Pruning via Iterative Ranking of Sensitivity Statistics. Stijn Verdenius, Maarten Stol and Patrick Forré. arXiv:2006.00896 [cs.LG].

[7] Skeletonization: A Technique for Trimming the Fat from a Network via Relevance Assessment. Michael C. Mozer and Paul Smolensky. Advances in Neural Information Processing Systems 1 (NIPS), 1988, pp. 107–115.

[8] Progressive Skeletonization: Trimming More Fat from a Network at Initialization. Pau de Jorge, Amartya Sanyal, Harkirat S. Behl, Philip H. S. Torr, Gregory Rogez and Puneet K. Dokania. arXiv:2006.09081 [cs.CV].

Related Content