Wednesday, December 19, 2018

Sanjeev Arora: Toward A Theoretical Understanding of Deep Learning

This is Prof. Arora's second Ahlfors Lecture at the Department of Mathematics of Harvard University, given earlier this fall, on September 12, 2018 (the lectures are in memory of Prof. Lars Ahlfors, a Fields Medalist who was on the Harvard Mathematics faculty for more than thirty years). I have not seen the video of the first lecture in the series, but going by the introduction to this second lecture, I imagine it contained much the same material as his earlier lecture at the Institute for Advanced Study, entitled 'Everything you wanted to know about Machine Learning but didn't know whom to ask!', which I have blogged separately, below.

Ever since roughly 2012, a number of issues about Deep Learning have bothered those with a longer history of familiarity with the subject. One might have wondered, for example, how it comes to be that gradient descent works so well in deep networks, given the well-known non-convexity of the loss function that the training algorithms attempt to minimize. Or one might have asked why, in the presence of such overwhelming overparametrization, where the number of parameters (which an earlier generation learnt to call 'weights') so vastly exceeds the number of training examples, networks nevertheless manage to generalize, setting at naught the old statistician's dictum about overfitting. Or one might have asked, more generally, what the role of depth actually is in 'deep' learning, given how many papers now so nonchalantly proclaim dozens if not hundreds of layers. And what about unsupervised learning, about which so little theoretical understanding currently obtains? In a humorous aside, Prof Arora cites Prof Yann LeCun of NYU as saying that 'the revolution will neither be televised nor supervised!'

A truly inspiring aspect of Prof Arora's current research program is precisely that he has chosen to build it around such questions. This lecture discusses his research in considerable detail, and begins with a general introduction to the main research issues, providing along the way definitions of some arcane insider lingo such as 'regularization' (which in this context means the set of techniques used to ensure that the network does not converge all the way to the 'true' minimum of the training loss, when one has reason to believe that doing so would not yield a robust solution to the problem at hand).
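To make the term concrete, here is a minimal sketch of the most familiar form of regularization, an L2 ('weight decay') penalty added to a least-squares loss. This is far simpler than anything in the lecture, and all names and constants are illustrative choices of mine:

```python
import numpy as np

# Toy example: least-squares loss with an L2 penalty ("weight decay").
# The penalty keeps gradient descent from driving the weights all the
# way to the unregularized minimum.

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))        # 20 training examples, 5 features
y = X @ np.array([1., -2., 0., 0.5, 3.]) + 0.1 * rng.normal(size=20)

lam = 0.1                            # regularization strength
lr = 0.01
w = np.zeros(5)
for _ in range(2000):
    # gradient of: mean squared error + (lam/2) * ||w||^2
    grad = X.T @ (X @ w - y) / len(y) + lam * w
    w -= lr * grad

# With lam > 0 the solution is shrunk toward zero relative to the
# unregularized least-squares solution.
w_ols = np.linalg.lstsq(X, y, rcond=None)[0]
print(np.linalg.norm(w), np.linalg.norm(w_ols))
```

The regularized weight vector comes out with a smaller norm than the unregularized one: a deliberately 'worse' fit to the training data, bought in exchange for robustness.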

He organizes the lecture into three main parts (much like the previous one, which I blogged here):

1. Why is Overparametrization (also called Overprovisioning) a good idea? That is, why should one have so many more parameters (which an earlier generation used to call 'weights') than training examples?

2. How do Deep Neural Networks actually go about their Optimization task?

3. What is the Theory of Generative Adversarial Networks (also called GANs)?

Most of the time in the lecture is spent on the first part, and this blog post will also mainly deal with issues he raised under that head. The other two parts are equally (if not more) interesting, but I will leave them for a possible subsequent post. They are slightly more advanced topics, and it will be challenging enough to convey the main message of the first part in a way that would appeal to the widest possible interested audience!

First, and extremely interestingly, there is the 'Acceleration Effect': the presence of 'extra' layers actually speeds up gradient descent. The intuition is that the extra layers provide alternative paths through the (now necessarily more complex and higher-dimensional) 'energy landscape'; moreover, there is reason to believe they also endow the gradient descent algorithm with 'a memory', so that the chances of 'getting stuck' in a false minimum of the loss function are far reduced. The following picture, taken from the video of his talk, illustrates the point. Of the three curves shown, the one corresponding to the network with the largest number of 'extra' layers has reached a lower value of the loss function than the others in the same amount of time, demonstrating the 'acceleration effect'.
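The flavour of this effect can be felt even in a one-dimensional toy problem (this is my own sketch, not Prof Arora's experiment): fit a scalar target by gradient descent, once with a single weight w, and once 'overparametrized' as a product of two weights w1 * w2, i.e. with one redundant linear layer. The step size, iteration budget, and target below are arbitrary choices:

```python
# Shallow vs. "deep" (factorized) parametrization of the same scalar fit.
# Both runs start from 1.0 with the same step size and step budget.

lr, steps, target = 0.05, 50, 2.0

# Shallow: minimize (w - target)^2
w = 1.0
for _ in range(steps):
    w -= lr * 2.0 * (w - target)
loss_shallow = (w - target) ** 2

# Deep, with one extra linear layer: minimize (w1*w2 - target)^2
w1 = w2 = 1.0
for _ in range(steps):
    r = w1 * w2 - target
    w1, w2 = w1 - lr * 2.0 * r * w2, w2 - lr * 2.0 * r * w1
loss_deep = (w1 * w2 - target) ** 2

print(loss_shallow, loss_deep)  # the factorized version ends up lower
```

Here the product parametrization converges markedly faster, because each weight's gradient is amplified by the other factor once the weights grow past 1; a cartoon, in one dimension, of extra depth acting as a preconditioner.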

Then there is the Noise Stability Property: adding Gaussian noise in any layer of a deep network (even noise of the same norm as the weight vector, which one would normally think would instantly muddy up any function it was about to calculate) has very little effect on the final function the network was trained to compute. Indeed, the induced error attenuates as it propagates up the layers, and even within a single layer it quickly begins to die out. The following picture, also taken from the video of his talk, illustrates the results of the Noise Stability Experiment:
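The measurement protocol itself is easy to sketch: inject noise of a prescribed norm at one layer and compare the perturbed output against the clean one. In the sketch below (my own construction) a small random ReLU network stands in for a trained one, and the noise is scaled to the norm of the layer's activation; the striking attenuation in the lecture is a property of trained networks, so a random net merely illustrates how the experiment is set up:

```python
import numpy as np

rng = np.random.default_rng(1)
relu = lambda z: np.maximum(z, 0.0)

# A small random ReLU net; layer sizes are arbitrary.
dims = [32, 64, 64, 64, 10]
Ws = [rng.normal(size=(m, n)) / np.sqrt(n) for n, m in zip(dims[:-1], dims[1:])]

def forward(x, noise_layer=None):
    """Run the net; optionally add Gaussian noise, rescaled to the
    activation's own norm, to the output of layer `noise_layer`."""
    h = x
    for i, W in enumerate(Ws):
        h = relu(W @ h) if i < len(Ws) - 1 else W @ h  # no ReLU on the last layer
        if i == noise_layer:
            eta = rng.normal(size=h.shape)
            h = h + eta * (np.linalg.norm(h) / np.linalg.norm(eta))
    return h

x = rng.normal(size=dims[0])
clean = forward(x)
noisy = forward(x, noise_layer=1)
rel_err = np.linalg.norm(noisy - clean) / np.linalg.norm(clean)
print(f"relative output error after 100%-norm noise at layer 1: {rel_err:.3f}")
```

For a trained network, the claim in the lecture is that this relative output error is far smaller than the 100%-norm perturbation injected; one can repeat the measurement at each layer to trace how the error decays as it propagates upward.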

But perhaps the most intriguing demonstration is that, in real life, deep networks do indeed manage to generalize, despite the overparametrization that classical statisticians would instinctively decry. One expects that the training error can be made very small with a reasonably representative training set, but one is naively led to believe that the test error should begin to rise once a critical number of parameters is crossed (or some similar quantitative measure of network complexity is reached). Counter-intuitively, the generalization error instead stabilizes, neither increasing nor decreasing much.

After a detailed discussion of more advanced theoretical ideas, such as the Layer Cushion (a measure of the redundancy in the network), activation contraction, black-box versus non-black-box analyses, the manifold assumption, and others, in the context of the three main parts of the lecture (as above), Prof Arora wraps up by discussing directions of research likely to prove fruitful in the near future. Some of these his group is already working on; others are earnest implorations to all those interested in contributing to the field:

1. Use ideas from Physics and the calculus of variations (such as Hamiltonian and Lagrangian methods), which he feels would shed more light on the dynamics of gradient descent in high-dimensional spaces.

2. Develop a good theoretical handle on Unsupervised Learning and Deep Reinforcement Learning.

3. Design models to understand and analyze interactive learning (such as learning languages or other skills).

Overall, it is a lecture well worth watching (and perhaps watching again), because on an initial pass, unless one is a subject-matter expert in this field, one cannot hope to absorb everything. This blog post is a modest attempt to aid that process for those interested, by providing a brief discussion of some important issues from the first part of the lecture.

One feels deeply grateful to Prof Arora, not only for having bravely done the research, but also for having prepared and given this lecture; and to the Harvard University Mathematics Department for having invited and hosted him and, most importantly, for having videotaped the lecture and shared it on YouTube.