Pau De Jorge, Amartya Sanyal, Harkirat Behl, Philip Torr, Gregory Rogez, Puneet Dokania
2020
Artificial neural networks (ANNs) are algorithms designed to process information in a way inspired by the human brain. They’re based on neurons, or nodes, that are organised into layers to make up a network. The connections between these nodes, each with a strength given by its weight (or parameter), determine how information is processed. These parameters are usually learned during the training stage, the goal of which is to update weight values to decrease the error (or loss) of the predictions over the training set. The choice of dataset used to train ANNs depends on the particular use case. For example, image datasets are used to develop computer vision technology, whereas language datasets are used to build natural language processing models.
As long as 30 years ago, researchers realized that ANN models were overparameterized. In other words, a large portion of parameters could be ‘pruned’ (set to 0) with little impact on the performance of the model. Pruning these parameters, which leads to network sparsity, has several benefits. First, it acts as a regularizer: redundant connections are eliminated and the model is constrained, which tends to improve generalization. Second, as sparse subnetworks require less storage space and fewer FLOPs (floating-point operations), pruning reduces computational cost. Finally, pruning makes deployment on local devices easier by reducing the memory footprint and computation time, which in turn reduces latency.
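To make ‘pruning’ concrete, here is a minimal NumPy sketch of setting the lowest-scoring weights to 0 with a binary mask. The function name `prune_by_score` is hypothetical, and weight magnitude is used as the score purely for illustration; it is not the criterion discussed later in this post.

```python
import numpy as np

def prune_by_score(weights, scores, sparsity):
    """Zero out the fraction `sparsity` of weights with the lowest scores."""
    n_prune = int(round(sparsity * weights.size))
    mask = np.ones(weights.size, dtype=bool)
    if n_prune > 0:
        # Indices of the n_prune lowest-scoring weights.
        drop = np.argsort(scores.ravel(), kind="stable")[:n_prune]
        mask[drop] = False
    mask = mask.reshape(weights.shape)
    return weights * mask, mask

rng = np.random.default_rng(0)
w = rng.normal(size=(4, 5))
# Illustrative assumption: score each weight by its magnitude.
pruned, mask = prune_by_score(w, np.abs(w), sparsity=0.8)
print("kept", int(mask.sum()), "of", w.size, "weights")
```

With 20 weights and a sparsity of 0.8, 16 weights are set to 0 and 4 are kept.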
Traditionally, pruning methods have been used to remove weights after a network has been trained. A dense network is first fit to a training set and subsequently pruned according to a certain weight-selection criterion. This initial process is followed by a fine-tuning round, which enables the network to adapt to the pruning. Most optimization methods that implement pruning repeat this train → prune → fine-tune cycle several times, making them rather computationally demanding. Recent alternative approaches have instead incorporated dynamic pruning (1), which requires only one round of training. As a result, these new approaches are typically less resource intensive. Nevertheless, they generally benefit from a fine-tuning round to accommodate the weights after fixing the final (pruned) subnetwork, which makes training a bit longer. Moreover, since the topology of the subnetwork changes during training, it’s more difficult to fully leverage the benefits of sparsity during training.
Other recent work (2, 3) has shown that it’s also possible to prune networks at initialization, rather than during training. This means that the benefits of sparsity can be leveraged at both training and test time, and may make it possible to train models directly on edge devices such as drones or smartphones, as well as reducing the cost of exploratory data analysis.
At ICLR (International Conference on Learning Representations) 2019, Frankle and Carbin presented the lottery ticket hypothesis (LTH) (4). Their hypothesis describes sparse subnetworks that can be trained from scratch (i.e. from weight values at initialization) to match or surpass the performance of their dense counterparts. This idea was an encouraging one, as it suggested that the benefits of sparsity could be utilized at both training and test time. However, despite clear theoretical interest, the algorithm designed to find these subnetworks required the dense network to be trained several times, making the overall process impractical. (For an in-depth review of the LTH, we recommend this post.) At the same conference, Lee and colleagues presented SNIP (2), which was the first method to perform pruning at initialization. Although SNIP wasn’t able to attain the same level of sparsity as the LTH, the work was nonetheless significant because it showed an effective way to find sparse subnetworks at initialization. More recently, at ICLR 2020, Wang and colleagues presented GRASP (3)—another method for pruning at initialization—that addressed some of the issues with SNIP’s weight-selection criteria. But the assumptions of this approach do not appear to hold at high sparsities. With our own work on FORCE (FOResight Connection sEnsitivity), we aim to overcome some of these issues.
SynFlow: Pruning neural networks without any data by iteratively conserving synaptic flow
SynFlow (5) presents a method for pruning a network without any data. The main intuition of the authors is that it isn’t possible to remove all connections from a layer without, as a result, blocking the signal propagation through the network. They therefore suggest a data-agnostic pruning criterion that, by construction, will never deplete a whole layer if it can remove weights from elsewhere. In our work, we argue that this assumption does not hold for modern architectures with skip connections (i.e. shortcuts to jump over layers in a neural network), such as ResNet. In fact, we observed that our approach, FORCE, is able to reach good accuracies with a ResNet50 architecture despite pruning entire layers.
SNIP-it: Pruning via iterative ranking of sensitivity statistics
SNIP-it (6) follows an idea very similar to ours. Our work is more theoretical and seeks a good approximation of an ‘ideal’ way of pruning, but our final algorithm for optimizing FORCE is close to their proposal. However, the SNIP-it work puts more focus on exploring tangential directions, such as filter pruning at initialization or progressively pruning during training. In contrast, we focus on unstructured, iterative pruning (i.e. progressively removing individual connections instead of entire filters) that can be leveraged during training.
Given a network with randomly initialized parameters and a user-defined sparsity, our objective is to find the subnetwork that, when trained from scratch, exhibits maximum performance after training (e.g. accuracy for classification problems). Solving this problem exactly is infeasible, however, since it would require training all possible subnetworks. Instead, methods for pruning at initialization use a hand-designed weight-selection criterion with the aim of predicting the impact that each weight will have later in training. Although this is a heuristic choice without theoretical guarantees, it has empirically proven to be surprisingly useful.
For SNIP, the researchers adapted a weight-selection criterion first introduced by Mozer and Smolensky in 1988 (7), and named it ‘connection sensitivity’. Connection sensitivity measures the impact that each weight has on the loss (specifically, it computes the absolute value of the product of each weight with the gradient of the loss with respect to that weight). However, the gradient of the loss function will differ before and after pruning, due to complex interactions between weights. Wang and colleagues (3) suggested a different approach, GRASP, with the objective of maximizing the gradient norm (backward signal) after pruning. They treat pruning as a perturbation and apply a Taylor approximation to estimate the gradient after pruning. This approximation, however, appears to no longer hold when a large portion of the weights are removed.
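The connection sensitivity described above can be sketched in a few lines. This is an illustration under stated assumptions, not the SNIP implementation: the toy linear model, the mean squared error loss and the function names `mse_grad` and `connection_sensitivity` are all hypothetical choices made so that the gradient can be written by hand.

```python
import numpy as np

def mse_grad(w, X, y):
    """Gradient of 0.5 * mean((X @ w - y) ** 2) with respect to w."""
    return X.T @ (X @ w - y) / len(y)

def connection_sensitivity(w, grad):
    """Absolute value of each weight times the loss gradient for that weight."""
    return np.abs(w * grad)

# Tiny hand-checkable example (assumed data, two weights, two samples).
X = np.array([[1.0, 0.0],
              [0.0, 2.0]])
y = np.array([0.0, 0.0])
w = np.array([1.0, -1.0])

grad = mse_grad(w, X, y)                # [0.5, -2.0]
sens = connection_sensitivity(w, grad)  # [0.5, 2.0]
print(sens)
```

Here the second weight has the larger sensitivity, so under this criterion it would be the last of the two to be pruned.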
Instead, we propose looking at the connection sensitivity after pruning, and keeping the subset of weights with a higher capacity to change the loss after pruning, hence the name FOResight Connection sEnsitivity. The full mathematical details of FORCE can be found in our paper (8).
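As a rough guide to the difference, the two saliencies can be written side by side. The notation below (a binary mask c, pruned parameters written with a bar, m for the number of weights) is introduced here for illustration and is our reading of the description above; the paper (8) gives the exact formulation.

```latex
% SNIP: the gradient is evaluated at the dense parameters \theta
S_{\mathrm{SNIP}}(\theta_i) = \left| \theta_i \, \left[ \nabla L(\theta) \right]_i \right|

% FORCE: the gradient is evaluated at the pruned parameters \bar{\theta}
S_{\mathrm{FORCE}}(\theta_i) = \left| \theta_i \, \left[ \nabla L(\bar{\theta}) \right]_i \right|,
\qquad \bar{\theta} = \theta \odot c, \quad c \in \{0, 1\}^m
```

The only change is where the gradient is evaluated: before pruning for SNIP, after pruning for FORCE.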
As mentioned above, we can think of the connection sensitivity introduced in SNIP as an approximation of FORCE, where we assume that the gradients before and after pruning remain unchanged. Although this is not true in general, it seems to yield good results for moderate sparsities. However, this gradient approximation is better suited when the number of weights to be pruned is much smaller than the number of remaining weights in the network. To achieve extreme pruning, and therefore obtain high sparsity levels, we propose pruning the network iteratively by removing a small number of weights at each step using the FORCE objective and the gradient approximation.
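The iterative procedure can be sketched as a loop that recomputes the saliency at the current pruned network before each small pruning step. This is a simplified illustration rather than the released implementation: the toy linear model, the loss, the function names and the schedule values are assumptions, and in this sketch a weight that has been pruned is never brought back.

```python
import numpy as np

def mse_grad(w, X, y):
    """Gradient of 0.5 * mean((X @ w - y) ** 2) with respect to w."""
    return X.T @ (X @ w - y) / len(y)

def iterative_prune(w, X, y, keep_schedule):
    """Prune step by step; keep_schedule lists the (positive, decreasing)
    number of weights that remain after each step."""
    mask = np.ones(w.shape, dtype=bool)
    for k in keep_schedule:
        # Gradient evaluated at the currently pruned network.
        grad = mse_grad(w * mask, X, y)
        saliency = np.abs(w * grad)
        # Simplifying assumption: pruned weights are never revived.
        saliency[~mask] = -np.inf
        keep = np.argsort(saliency)[-k:]
        mask = np.zeros(w.shape, dtype=bool)
        mask[keep] = True
    return mask

rng = np.random.default_rng(0)
X = rng.normal(size=(32, 50))
y = rng.normal(size=32)
w = rng.normal(size=50)  # weights at initialization, never trained

mask = iterative_prune(w, X, y, keep_schedule=[25, 12, 6, 3])
one_shot = iterative_prune(w, X, y, keep_schedule=[3])
print("remaining weights:", int(mask.sum()))
```

Calling the function with a single-entry schedule uses the gradient of the dense network only, which corresponds to the one-shot case.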
To better understand how iterative pruning affects the optimization of FORCE, we performed an experiment where we varied the number of iterative steps (T). We studied two sparsity schedules, shown in Figure 1. For the linear schedule, we begin with the dense network and arrive at the desired sparsity in equally sized steps. For the exponential schedule, the number of remaining weights decays exponentially (that is, the sparser the network, the fewer weights we remove at each step).
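The two schedules can be sketched as functions that return the number of weights remaining after each step. The interpolation formulas and the names `linear_schedule` and `exponential_schedule` are our assumptions based on the description above (interpolating the weight count linearly, or linearly in log space), not code taken from the implementation.

```python
import numpy as np

def linear_schedule(total, target, steps):
    """Remaining weights after each step, removing equal amounts per step."""
    alphas = np.arange(1, steps + 1) / steps
    return np.rint(alphas * target + (1 - alphas) * total).astype(int)

def exponential_schedule(total, target, steps):
    """Remaining weights after each step, removing an equal fraction per step."""
    alphas = np.arange(1, steps + 1) / steps
    log_k = alphas * np.log(target) + (1 - alphas) * np.log(total)
    return np.rint(np.exp(log_k)).astype(int)

print(linear_schedule(1000, 10, 2).tolist())       # [505, 10]
print(exponential_schedule(1000, 10, 2).tolist())  # [100, 10]
```

In this example the last linear step removes 495 weights and keeps only 10, whereas every exponential step removes 90% of what is left, so the ratio of pruned to retained weights stays constant.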
With one iteration, FORCE is equivalent to SNIP. As we increase sparsity, however, pruning iteratively becomes crucial. The linear schedule requires more iterations than the exponential one to achieve a similar result, reinforcing our intuition that the ratio of pruned to retained weights at each step must be small for the gradient approximation to hold. For the exponential schedule, this happens naturally, but for the linear schedule a greater number of iterations is required so that all steps become small enough.
We use more data when pruning iteratively than for one-shot methods. Therefore, to achieve a fair comparison between our method and alternative approaches, we introduced the variants SNIP-MB and GRASP-MB (i.e. using multiple batches, MB, as opposed to just one). We approximated their respective saliencies using the same amount of data as for FORCE (see Figure 2). When using more data to estimate SNIP or GRASP saliencies, we see a boost in performance, indicating the need for a better approximation of the saliency. Moreover, we found that FORCE outperforms other methods using the same amount of data.
Another informative analysis can be carried out by looking at the structure of the pruned subnetworks. This allows us to understand which layers are more heavily pruned, and by how much. Figure 3 shows that, for ResNet50, all methods prune some layers completely. This is possible because skip connections still allow the flow of forward and backward signals. Architectures without skip connections (such as VGG), on the other hand, require non-empty layers to keep the flow of information. We hypothesize that this is the reason that we’re able to prune a ResNet network to higher sparsity levels than VGG.
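A per-layer analysis of this kind can be sketched by computing, for each layer, the fraction of weights that survive pruning. The layer names and masks below are made up for illustration, and `layer_density` is a hypothetical helper rather than part of any library.

```python
import numpy as np

def layer_density(masks):
    """Fraction of weights kept in each layer, given boolean keep-masks."""
    return {name: float(m.mean()) for name, m in masks.items()}

# Hypothetical masks for three layers (True = weight kept).
masks = {
    "conv1": np.array([[1, 0], [1, 1]], dtype=bool),
    "block1.conv": np.zeros((2, 2), dtype=bool),  # completely pruned layer
    "fc": np.array([1, 0, 0, 0], dtype=bool),
}

density = layer_density(masks)
empty_layers = [name for name, d in density.items() if d == 0.0]
print(density)
print("fully pruned layers:", empty_layers)
```

A layer with density 0 blocks all signal in a plain feed-forward stack, but not in an architecture where a skip connection bypasses it.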
FORCE, our new approach to pruning ANNs, implements iterative pruning at initialization. By removing a small number of weights at each step, using the FORCE objective and the gradient approximation, our approach achieves extreme sparsity in the network with a much better sparsity/accuracy trade-off than previous methods. More detail can be found in our paper (8), and our implementation can be found on GitHub.
To the best of our knowledge, no published results compare pruning at initialization methods with the random pruning baseline on ImageNet. We find that, in the case of ResNet50 and ImageNet, methods for pruning at initialization perform no better than random pruning for high sparsities. Despite this being a fairly negative result, we’re nonetheless confident that there are better ways we can prune networks at initialization. It appears obvious, however, that the behaviour of pruning methods is not uniform across networks and datasets, suggesting a direction in clear need of further exploration.
This work was done in collaboration with researchers from the University of Oxford, within the NAVER Global AI R&D Belt.
[1] Dynamic Model Pruning with Feedback. Tao Lin, Sebastian U. Stich, Luis Barba, Daniil Dmitriev and Martin Jaggi. International Conference on Learning Representations (ICLR), 26 April–1 May 2020.
[2] SNIP: Single-Shot Network Pruning Based on Connection Sensitivity. Namhoon Lee, Thalaiyasingam Ajanthan and Philip Torr. International Conference on Learning Representations (ICLR), New Orleans, LA, 6–9 May 2019.
[3] Picking Winning Tickets before Training by Preserving Gradient Flow. Chaoqi Wang, Guodong Zhang and Roger Grosse. International Conference on Learning Representations (ICLR), 26 April–1 May 2020.
[4] The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks. Jonathan Frankle and Michael Carbin. International Conference on Learning Representations (ICLR), New Orleans, LA, 6–9 May 2019.
[5] Pruning Neural Networks without Any Data by Iteratively Conserving Synaptic Flow. Hidenori Tanaka, Daniel Kunin, Daniel L. K. Yamins and Surya Ganguli. arXiv:2006.05467 [cs.LG].
[6] Pruning via Iterative Ranking of Sensitivity Statistics. Stijn Verdenius, Maarten Stol and Patrick Forré. arXiv:2006.00896 [cs.LG].
[7] Skeletonization: A Technique for Trimming the Fat from a Network via Relevance Assessment. Michael C. Mozer and Paul Smolensky. Advances in Neural Information Processing Systems 1 (NIPS), 1988, pp. 107–115.
[8] Progressive Skeletonization: Trimming More Fat from a Network at Initialization. Pau de Jorge, Amartya Sanyal, Harkirat S. Behl, Philip H. S. Torr, Gregory Rogez and Puneet K. Dokania. arXiv:2006.09081 [cs.CV].
This work was accepted to ICLR 2021: Progressive Skeletonization: Trimming More Fat from a Network at Initialization. Pau de Jorge, Amartya Sanyal, Harkirat S. Behl, Philip H. S. Torr, Gregory Rogez and Puneet K. Dokania. Available on openreview.net.