ICML 2019 Highlights - Naver Labs Europe

A bit of  background:

The 36th International Conference on Machine Learning (ICML) was held at Long Beach Convention Center in California. ICML continues to grow rapidly: in 2019 there were 774 papers accepted out of 3424 submissions, compared with 621 acceptances for 2473 submissions in 2018.

The conference had (up to) nine parallel streams, which made it hard to decide what to attend – but it may have been slightly easier than 2018 when there were 10 parallel streams. Each stream had three sessions (11am, 2pm and 4pm), and each session had one 20-minute presentation followed by nine 5-minute spotlight presentations (versus 10-minute spotlights in 2018). For presenters having only 5 minutes, this was often not enough time to get across their main idea.

The institution contributing the most articles (measuring the ratio of the number of authors from this institution to the total number of authors for each paper and using data from this article) was Google. Indeed, according to this measure, Google, Google Brain and DeepMind together produced slightly more articles than the top three academic institutions together (Stanford, Berkeley and MIT), who in turn produced more than twice as many articles as the top three European institutions together (University of Oxford, ETH Zurich and EPFL)!

Following is a recap and analysis of some of the sessions that particularly caught our interest and/or are relevant to what we’re working on in the lab.

Part 1: Chris Dance, Research Fellow

Multi-task RL with PEARL

In the paper “Efficient Off-Policy Meta-Reinforcement Learning via Probabilistic Context Variables”, Rakelly et al. present a multi-task RL method called PEARL (probabilistic embeddings for actor-critic RL). The authors are motivated by the fact that many of the problems we would like autonomous agents to solve share common structure: e.g. screwing a cap on a bottle and turning a doorknob both involve grasping an object in the hand and rotating the wrist.

This paper attracted my attention as the authors use posterior sampling (Strens et al. 2000; Osband et al., NIPS 2013), which is an elegant and powerful classical RL method, but they do so in a deep RL context.  Now, in classical RL, posterior sampling maintains a posterior over possible Markov Decision Processes (MDPs), samples an MDP from this posterior, then acts according to an optimal policy for this MDP for some period of time. By analogy, PEARL keeps track of the “posterior” using a trained “inference network” which outputs a distribution over some “probabilistic context variables” that capture the current uncertainty over the task. Also, while this probabilistic context model could readily be combined with on-policy RL methods (like PPO), the authors instead use the SOTA off-policy RL algorithm SAC (Haarnoja et al. 2018), thereby minimising the number of samples required both for meta-training and adaptation. Indeed, as shown in the following Figure, PEARL requires 20-100x fewer samples in the domains tested when compared with other multi-task RL methods like MAML, and it often attains substantially better final returns! The authors also describe PEARL in a rather good blog article just in case anyone finds the full paper hard to follow.


Figure. PEARL requires 20-100x fewer samples than other SOTA multi-task RL algorithms like MAML (Finn et al. 2017) and often obtains substantially better final returns. From Rakelly et al. (2019).


Efficient visually-guided RL with SOLAR
While model-free RL algorithms have been used in most of the highly-visible success stories of RL today, they are too sample inefficient to be practical in applications like vision-based robot control. For instance, the SOTA model-free RL algorithm SAC (Haarnoja et al., 2018) takes four hours of interaction to learn the simple visually-guided task of stacking one block on another, even when starting from just a single initial configuration. In contrast, the model-based RL method described in SOLAR: Deep Structured Representations for Model-Based Reinforcement Learning by Zhang et al. can learn this task in about one hour, when starting from a wide range of initial configurations.

SOLAR figure

Figure. SOLAR can learn simple real-world visually-guided tasks like stacking blocks and pushing a mug onto a coaster with no additional sensor information. From Zhang et al. (2019).

While previous model-based RL algorithms for vision-based control rely on the ability to accurately predict videos into the future, an ability which requires many samples to learn, the algorithm used in this work relies on structured representations which can be learned more efficiently. This algorithm is called stochastic optimal control with latent representations (SOLAR) and it relies on the linear quadratic regulator with fitted linear models (LQR-FLM) algorithm (Levine and Abbeel, 2014), which fits local linear Gaussian transition models and local quadratic cost models. Now LQR-FLM works well if the state is low-dimensional and linearity is a good approximation, but not if the state is high-dimensional and linearity is a poor approximation, as is the case if images are used for the state. Therefore, SOLAR treats image observations as if they were generated from a low-dimensional latent space by a deconvolutional neural network that is learned with amortized variational inference. The authors claim that this makes SOLAR the most efficient method to date for image-based robotic task learning.


Understanding how deep learning generalizes

In Stable Rank Normalization for Improved Generalization in Neural Networks”, by Sanyal, Torr, Dokania, which appeared in the Workshop on Understanding and Improving Generalization in Deep Learning, the authors are motivated by recent generalisation bounds for deep networks. Several such bounds (Neyshabur et al. 2018; Bartlett et al. 2017) involve a factor called the stable rank of a matrix  which is defined by furmula ICML figure

where  are the singular values of  and  is the rank of . The authors conduct experiments to understand the impact of directly controlling the stable rank on the generalization behaviour of NNs. In particular, they conduct gradient descent on stable-rank-normalised versions of the weight matrices. They show that this gives consistently smaller generalisation gaps on CIFAR-10/-100 than when using vanilla gradient descent or spectral normalisation. Of course, this lower generalisation gap comes at the expense of a poorer fit to the training set, so the authors do not get a lower test error than the vanilla algorithm, and the performance of their vanilla algorithm is in turn very far from the current SOTA as there is no data augmentation.

The authors also kindly pointed out the paper Li, Soltanolkotabi and Omyak, Gradient Descent with Early Stopping is Provably Robust to Label Noise for Overparameterized Neural Networks, arXiv, April 2019. This paper extends recent work that explains the good generalization of overparameterized networks by arguing that SGD results in trained weights that are close to randomly-initialized weights and this proximity implies a low Rademacher complexity, such as the work of Li and Liang (2018) and of Neyshabur et al. (2019). While that recent work assumed the training data was correctly labelled, Li et al. (2019) instead consider the case where a fraction of the training data has the wrong labels, and they prove two main results.

  • Theorem 2.2 shows that the solution found by gradient descent with early stopping degrades gracefully as the level of label corruption grows. In fact, if the fraction of labels corrupted is small relative to the distance between data clusters with different labels, then the network will correctly label the data points whose labels were corrupted.
  • Theorem 2.4 shows that to (over)fit to corrupted labels requires that the network’s weights stray far from their initialization. This explains why it takes longer to train a network to fit noisy labels and also emphasises the central role of the distance between initial and trained weights in understanding the generalisation properties of deep networks.


Part 2: Arnaud Sors, Researcher, Machine Learning & Optimization

Population-based training for deep neural nets and meta-learning (Tutorials)

As the machine learning community today needs learning methods that go beyond classical stochastic gradient descent on one loss function and one dataset and because of the relevance to robotics, Arnaud Sors selected the tutorials on population-based training and on meta learning.

A session on recent advances in Population-based training for deep neural nets was held by Jeff Clune, Joel Lehman and Kenneth Stanley. They introduced novelty search algorithms, quality-diversity algorithms, the challenges of open-ended learning and the automatic design of training environments so it was an extensive introduction to the field.

The tutorial revolved around the idea that, training towards a predefined cost can have severe limitations in the context of RL but also in standard supervised learning. Novelty search consists of encouraging diversity as a goal and follows from the idea that the success of a model in a seemingly unrelated, secondary objective can often help in the initial objective. This is related to the work we’re doing in curiosity-driven exploration in robotics, for example in active localization. In extending novelty search there are quality diversity algorithms such as MAP elites[arXiv], where a search archive of local champions over predefined parameter configurations is maintained and jointly evolved. Maintaining an archive of possible best solutions is also interesting for generalization and for adaptability. The tutorial continued with challenges and preliminary experiments related to open-ended learning, i.e. continuous learning over populations [video] and the problem of designing training environments while simultaneously training an agent to succeed in these environments [arXiv]. The tutorial finished with work on indirect encoding, i.e. a network encoding how to construct a larger network.

Another session on meta-learning with Chelsea Finn and Sergey Levine started with a general and classic overview of the field. The speakers then went over various possible optimization methods such as black box optimization, optimization-based inference, non-parametric methods and extensions to Bayesian meta-learning. A second part was on meta-RL, i.e. learning to adapt to new tasks very quickly, and possible implementation choices. A last part was on common challenges coming with the use of meta-RL, mostly how to avoid meta-overfitting and how to propose new tasks automatically.

If you’re especially interested in this topic I recommend you check the slides for the session as they’re a compact and efficient introduction to the field. A complete reading list on the topic by subtopics can also found at [tinyurl.com/meta-reading]

Following are some pieces of work that particularly caught my interest either because of their broad applicability or because they address a common recurring problem.


Cheap orthonormal constraints in neural networks

I somewhat stumbled upon the work of Mario Lezcano-Casado and David Martinez-Rubio [arXiv] on cheap orthonormal constraints for weight matrices and it struck me how interesting it is for many ML applications. Below is a high-level overview where I try to show the main ideas.


     The optimization problem

The authors consider optimization problems involving an orthogonality or unitary constraint on parameters: \(\min_{B \in G} f(B)\) where \(G\) can be \(SO(n)\) (i.e. special orthogonal group, the group of rotation matrices) or \(U(n)\) (i.e. the unitary group, similar notion for complex matrices). We describe the problem for SO(n). It’s almost identical for U(n).

To optimize over such a space, we would ideally like a method that:

  • can be used with general purpose optimizers (for this the optimization problem needs to be unconstrained
  • does not change the optimization landscape
  • is computationally cheap in the forward and backward pass
  • has unbiased gradients


     A very quick look at Riemanian geometry and how it is used to transform the problem

A Riemanian manifold is a smooth manifold that has an inner product on the tangent space. \(SO(n)\) is a Riemanian manifold of \(\mathcal R^n\) (and it’s also a compact and connected Lie group so this means that in this group, multiplication is smooth). The Lie algebra of a Lie group is its tangent space at the identity. A classic result of Riemanian geometry is that the Lie algebra \(so(n)\) of \(SO(n)\) is the group of skew-symmetric matrices:

\(so(n) = \{A \in \mathcal R^{n\times n} | A + A^T = 0\}\)

The exponential map is a function that (by definition) maps the tangent space (Lie algebra) to the manifold (Lie group) locally. Another classic result from Riemanian geometry is that the Lie exponential map is equivalent to the matrix exponential defined as the Taylor expansion \(exp(A) = I + A + \frac{1}{2} A^2 + …\)

Therefore, the constrained optimization problem initially introduced can be cast to:
\(\min_{A \in g} f(exp(A)\) where \(g\) is the group of skew symmetric matrices, canonically parameterized with a triangular parameter matrix. The optimization problem is now unconstrained

Riemanian geometry ICML figure


     Approximation of the exponential map and its gradient

Now, although the problem is unconstrained, the matrix exponential appears in it. The authors propose to approximate it in the forward pass using a combination of Padé approximants and a scaling squaring trick, up to machine precision. They also derive an expression for the gradient of \(f \circ exp\), including an approximation of the differential of the exponential of matrices using the exponential of a \(2n \times 2n\) matrix. Finally, they show that this reparameterization, along with approximations, is cheap in the context of minibatch stochastic optimization because the part involving exponential differentiation has to be computed only once per batch. They also provide Pytorch code to try out orthogonal constraints in your weight matrices – including a generalization to non-square matrices.



I was interested in this work because I believe it has a broad range of applications. Orthogonal constraints are useful in dimensionality reduction, clustering, feature selection or even dictionary learning. Using parameterized and differentiable orthogonal transformations, such techniques could be integrated into the forward pass of a neural net. Orthogonal constraint can also be useful to stabilize recurrent neural networks (with the unitary group), or Riemanian gradient descent (cf. Arjovsky et al.). Orthogonal constraints are also important in invertible neural networks and normalizing flows, and a paper using this way of constraining a transformation was actually presented in the workshop on normalizing flow.


Learning from noisy labels

I noticed at least 4 papers on how to deal with noisy labels which is a problem that almost every ML practitioner has to deal with at some point… Below is a brief overview of some of the approaches that were presented that I thought could be useful.

  • In SelectiveNet by Y. Geifman and R. El-Yaniv, a selection function \(g\) with binary output is added to the prediction function \(f\) of the neural network, and the output of the overall neural network to input \(x\) is defined to be either \(f(x)\) if \(g(x)=1\), or ‘don’t know’ if \(g(x)=0\) (the rejection option). The network is trained to optimize both classification (or regression) and rejection simultaneously. The network has a hyperparameter to specify desired coverage. The optimization problem (trained with given coverage) is cast as a two-loss problem (risk-coverage loss and auxiliary loss), where the auxiliary loss avoids early elimination of relevant samples.
  • In Combatting Label Noise in deep learning using Abstention, Thulasidasan et al. add one class to the output of the network for the model to ‘abstain’ and the cross-entropy loss is modified accordingly. The approach is very simple and is meant to address cases where label noise is structured, i.e. features in the data allow the prediction that a sample has a high likelihood to have a bad label. In this case, the approach is shown to be very effective.
  • In Learning with bad training data via iterative trimmed loss minimization, Y. Shen and S. Sanghavi note that in settings with a fraction of badly labeled data, the evolution of training accuracy is different for clean and bad samples. Based on this, they repeatedly filter out the samples with worst loss during the training, and keep training on the remaining samples. The authors show results on classification with random and with systematic label noise, as well as on Generative modeling where some of the training distribution comes from a different dataset.
  • Finally, in Understanding and utilizing deep neural networks trained with Noisy labels, by Chen et al., the authors use the same idea that bad samples are harder to classify, and split their training set into two subsets. They (repeatedly) train on one subset, and select samples from the other subset that are correctly classified.


Part 3: Quentin Grail, PhD student, Machine Learning & Optimization

 Never-Ending Learning (Tutorial)

As humans, we can learn from multiple and diverse experiences throughout our lives but current machine learning models tend to learn a single objective function from a unique set of data. Never-Ending Learning was the topic of the tutorial of Tom Mitchell, one of the co-founders of ICML with his colleague Partha Talukdar.

They presented the Never-Ending Learning paradigm in several case studies, including their Never-Ending Language Learner (NELL).

NELL has been designed to read the web 24 hours/day and extract relational predicates between entities. Since January 2010 it’s accumulated more than 50 million candidate beliefs such as the ones you can see below.

NELL fragment ICML figure

Figure. Fragment of the 80 million beliefs NELL read from the web. Each edge represents a belief triple (e.g. play(MapleLeafs, hockey)), with an associated confidence and provenance not shown here. This figure contains only correct beliefs from NELL’s which has many incorrect beliefs as NELL continues to learn.


Neural approaches to conversational AI (Tutorial)

Working on question-answering (QA) myself at NAVER, I was particularly interested in Michel Galley and Jianfeng Gao’s tutorial. They presented a large panel of the recent advances in Deep Learning for conversational AI. Starting with a description of the latest progress related to QA, they described several models and applications of QA from a single piece of text to Open Domain QA, which corresponds to the general case where a user wants to query a large dataset such as Wikipedia, Freebase…They also introduced the recent multi-step reasoning question-answering task that requires gathering multiple pieces of information spread across multiple documents.

The last part of the tutorial dealt with task-oriented dialogue models that aim to help users to accomplish specific tasks. Here they presented standard dialog system architectures and more recent End-to-End trainable neural architectures and Multi-Domain Dialogue State Tracking.They also reviewed several reinforcement learning approaches for conversational agents.

The session concluded with a brief description of conversation models and chatbots which mainly focused on social bots, like Amazon Alexa and Google Home. One of the major differences between chatbots and task-oriented dialog agents is their objective. Task-oriented dialog agents prioritize accomplishing specific tasks while social bots like Xiaolce, (one of the most popular Chinese chatbots from Microsoft) tend to maximize user engagement with personalized and enjoyable conversations.


Adversarial examples session

Adversarial Attacks on Node Embeddings via Graph Poisoning was about poisoning network representations. The main idea of network representation is to learn node embeddings, represented in a graph structure and which is useful for further tasks such as node classification, node clustering or edge prediction etc. The authors Bojchevski and Günnemann studied the robustness of such representations. They demonstrated that a small, well-chosen adversarial perturbation i.e. adding/removing an edge, can be sufficient to poison the graph and have significant consequences on the downstream task. They also showed that such an approach can be easily transferred across other graph structures.

Most of the work in adversarial attacks has focused on modifying the input example and in Adversarial camera stickers: A physical camera-based attack on deep learning systems the authors wonder if it’s possible to fool a deep classifier by physically manipulating the input device (in this case the camera). They modeled the effect of putting a dot sticker on the camera and jointly optimized the adversarial and physical ‘realizability’ of the solution and they were able to fool ImageNet Classifiers in approximately 50% of the cases.


More on Understanding and improving generalization in deep learning

Aleksander Madry, Associate Professor at MIT, gave a talk on Are All Features Created Equal? He’s interested in how to create meaningful and explainable representations of images in the face of how fragile current networks are when it comes to adversarial examples. It’s true that even small invisible to the human eye perturbations on input data can lead to a serious failure in the model.

Another well-known related problem of neural networks is misaligned representations which is when images that are very different from a human point of view are actually represented as similar by the network.

Madry argues that meaningless perturbations from a human point of view can be meaningful from the model point of view and that there’s no reason a model should restrict its learning to robust representations. They call robust features those that humans would use to classify images, and non-robust features all the others. Because these non-robust features are present in the dataset, learned models tend to use them because they are incapable of distinguishing them from the robust features regarding the distribution of the samples of the dataset.

In this work, the authors create an augmented dataset containing adversarial examples to force the classifier to use only the robust features to take a decision.They show that doing so allows the training of a more robust network with better example representations. It also makes it easier to manipulate the features of a trained model (i.e. change the color of the object, add stripes etc.) Although it slightly decreases the accuracy, removing non-robust features does really help to train a more robust neural network.

Conceptual diagram ICML figure

Figure. A conceptual diagram of the experiments of Section 3. In (a) we disentangle features into combinations of robust/non-robust features (Section 3.1). In (b) we construct a dataset which appears mislabeled to humans (via adversarial examples) but results in good accuracy on the original test set (Section 3.2) Ref: Adversarial Examples Are Not Bugs, They Are Features


And to finish – one (of the two) best paper awards!

The contribution of the paper Challenging Common Assumptions in the Unsupervised Learning of Disentangled Representations by Locatello et al. was twofold. First, they theoretically show that learning disentangled representations in an unsupervised way is impossible without bias on the model and the data. Secondly, they did a survey on most of the recent algorithms and evaluation metrics by training more than 12000 models on multiple datasets.

Their conclusions were that, while recent methods successfully perform on the proposed losses, no good disentanglement models could be identified without explicit supervision. They also showed that it’s not clear for the moment that unsupervised learning of such disentanglement decreases the learning complexity of the downstream tasks.

So, that’s it from us for this year. Reach out to us if one of the topics we’ve covered has piqued your interest and we hope to see you at SIGIR, ACL, ICCV, EMNLP or NeurIPS before the year is out!

Ceci correspond à une petite biographie d'environ 200 caractéres