Wednesday, December 26, 2018

Sanjeev Arora: Interview with John Urschel at Heidelberg Laureate Forum

The Heidelberg Laureate Forum (HLF) is an organization devoted to encouraging interaction between extremely distinguished senior scholars and researchers in Computer Science and Mathematics (the Laureates) and early-career academics (graduate students and postdocs). The Forum works with a number of other organizations in the events it hosts, including the Association for Computing Machinery (ACM), the International Mathematical Union, the Norwegian Academy of Science and Letters (which also awards the Abel Prize) and the Lindau Nobel Laureate Meetings.

This video shows Prof Sanjeev Arora in conversation with MIT graduate student John Urschel. Among the others who attended the Heidelberg Laureate Forum in 2018, to name just a few, were Michael Atiyah, Vinton Cerf, Jeff Dean, Richard Karp, Les Valiant and S. R. S. Varadhan.
I blog the video for the insight it provides into Prof Arora's working style and working philosophy (such as how he picks problems to work on). I hope this blog post will make the video more accessible to search engines, encouraging more people to watch it in full. In the post I will quote some of the answers near-verbatim, again hoping that this will whet people's appetite for the full video. There is some inevitable paraphrasing, so please do not rely on this post as a transcript! Occasionally I have also added emphasis in italics, but this reflects only my own editorial judgment. The video also contains some interjections from the video director behind the camera, made to clarify a point or to seek additional explanation. These are very useful indeed, but I have not explicitly referred to them.

In choosing to pair Urschel and Arora in conversation, the video director appears to have deliberately chosen two people who have reasonably similar academic interests, but also fairly contrasting personal styles. There is some initial awkwardness in the conversation, but overall it is an excellent choice of conversationalists.

The video begins with Urschel asking Arora what were some of the most interesting recent topics he had come across in the field of neural nets and deep learning.

Arora: (I have learned about) New architectures and algorithms that enable neural nets to learn and generalize better, such as batch normalization, residual nets, ReLU (rectified linear unit) networks, dense nets and similar topics...
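(For readers new to these terms, here is a minimal NumPy sketch, my own illustration rather than anything from the interview, of two of the ingredients Arora mentions: the ReLU activation and a bare-bones batch normalization, shown without the learned scale and shift parameters that real implementations add.)

```python
import numpy as np

def relu(x):
    # Rectified Linear Unit: passes positive values through, zeroes out negatives.
    return np.maximum(0.0, x)

def batch_norm(x, eps=1e-5):
    # Simplified batch normalization: normalize each feature (column)
    # to zero mean and unit variance over the batch, with eps for stability.
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    return (x - mean) / np.sqrt(var + eps)

batch = np.array([[1.0, -2.0],
                  [3.0,  4.0]])
activated = relu(batch)         # negatives become 0
normalized = batch_norm(batch)  # each column now has mean ~0, variance ~1
```

In practice, frameworks also learn a per-feature scale and shift after the normalization step; the sketch above shows only the normalizing core.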

The conversation moves along, and then Urschel asks Arora how he came to work in the field of neural nets, since Arora had earlier worked on computational complexity and optimization problems, which are more mainstream in Computer Science. I would also note that Prof Arora, through his work on machine learning and deep networks, has actually helped to mainstream such work within Computer Science (and, to some extent, within Mathematics).

Arora: I did not start with a preconceived idea of what to work on… but took up interesting issues as they came up – as opposed to colleagues who made a certain area their own for one or two decades…sometimes a newcomer can look at things in a new way… as happened with me and the Euclidean Traveling Salesman Problem (TSP)… if you do that often enough, you can get lucky… the main ingredient is not to be afraid of problems…in Computer Science most of our problems are very close to the source…and rather new… sometimes even the definition of the problem is not fixed… …and since there’s no end to learning, one can learn on the job (rather than have to study everything that was done before)… and most of our stuff is not very deep mathematics…

The conversation again rolls along, and then Urschel asks Arora what he had learned specifically at the Heidelberg Laureate Forum...
Arora: I’ve learnt about some new problems here at the HLF… e.g., did you know that just by observing the power consumption of a chip, in some cases you can break its cryptosystem? When that becomes difficult, for similar but more complicated tasks, you can throw a deep learning network at it… you track the input-output behavior (of the chip), measure the power consumption, and the deep learning net develops correlations that work well empirically (in the decrypting task...). Also, (at the HLF) I’ve had conversations with people on a personal level that I normally do not have outside my own sub-discipline.

In concluding this blog post, I will add only that Prof Arora's own career fully exemplifies the career philosophy he outlines here...and as a final quote (although from a different interview he did with the HLF) I leave you with this:
Arora: In Computer Science, things change so fast that if you don’t change your field (every few years) you can quickly get obsolete...

Wednesday, December 19, 2018

Sanjeev Arora: Toward A Theoretical Understanding of Deep Learning

This is Prof Arora's second Ahlfors Lecture at the Department of Mathematics of Harvard University, given earlier this fall, on September 12, 2018 (the lectures are in memory of Prof Lars Ahlfors, a Fields Medalist who was on the Harvard Mathematics faculty for more than thirty years). I have not seen the video of the first lecture in the series, but going by the introduction to this second lecture, I imagine it contained much of the same material as his earlier lecture at the Institute for Advanced Study, entitled 'Everything you wanted to know about Machine Learning but didn't know whom to ask!', which I have separately blogged, below.

Since roughly 2012, a number of issues about Deep Learning have bothered those with a longer history of familiarity with the subject. One might have wondered, for example, how it comes to be that gradient descent works so well in deep learning networks, given the well-known non-convexity of the loss function that the training algorithms attempt to minimize. Or one might have asked why, in the presence of such overwhelming overparametrization, where the number of parameters (which an earlier generation learnt to call 'weights') so vastly exceeds the number of training examples, networks nevertheless manage to generalize, setting at naught the old statistician's dictum about overfitting. Or one might have asked, more generally, what the role of depth actually is in 'deep' learning, given how many papers now so nonchalantly proclaim dozens if not hundreds of layers. And what about unsupervised learning, of which so little theoretical understanding currently exists? In a humorous aside, Prof Arora cites Prof Yann LeCun of NYU as saying that 'the revolution will neither be televised nor supervised!'.
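The non-convexity worry is easy to see even in one dimension. Here is a small sketch (a toy example of my own, not something from the lecture) of plain gradient descent on a non-convex loss: depending on where it starts, it settles into a different local minimum, which is exactly why its reliable success on deep-network losses is so puzzling.

```python
import numpy as np

def loss(w):
    # A simple non-convex loss with two local minima of different depths.
    return w**4 - 3*w**2 + w

def grad(w):
    # Derivative of the loss above.
    return 4*w**3 - 6*w + 1

def gradient_descent(w, lr=0.01, steps=5000):
    # Vanilla gradient descent from a given starting point.
    for _ in range(steps):
        w -= lr * grad(w)
    return w

w_left = gradient_descent(-2.0)   # settles near the left minimum (~ -1.30)
w_right = gradient_descent(2.0)   # settles near the right minimum (~ 1.13)
```

Both runs reach a point of (numerically) zero gradient, but only one of them is the global minimum; which one you get depends entirely on the initialization.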

A truly inspiring aspect of Prof Arora's current research program is precisely that he has chosen to build it around such questions. This lecture discusses his research in considerable detail, and begins with a general introduction to the main research issues, providing along the way definitions of some arcane insider lingo such as 'regularization' (which in this context means the set of techniques used to ensure that the network does not converge to the 'true' minimum when, for whatever reason, one has reason to believe that this would not yield a robust solution to the problem at hand).

He organizes the lecture into three main parts (as, indeed, he did the previous one, which I blogged here):

1. Why is Overparametrization (also called Overprovisioning) a good idea? That is, why should one have so many more parameters (which an earlier generation used to call 'weights') than training examples?

2. How do Deep Neural Networks actually go about their Optimization task?

3. What is the Theory of Generative Adversarial Networks (also called GANs)?

Most of the lecture is spent on the first part, and this blog post will also deal mainly with the issues he raised under that heading. The other two parts are equally (if not more) interesting, but I will leave them for a possible subsequent post. They are slightly more advanced topics, and it will be challenging enough to convey the main message of the first part in a way that appeals to the widest possible interested audience!

First, and extremely interestingly, there is the 'Acceleration Effect': the presence of 'extra' layers actually speeds up the gradient descent. The intuition is that the extra layers provide alternative paths in the (now necessarily more complex and higher-dimensional) 'energy landscape'; moreover, there is reason to believe that they also provide the gradient descent algorithm with 'a memory', so that the chances of 'getting stuck' in a false minimum of the loss function are far reduced. The following picture, taken from the video of his talk, illustrates the point. Of the three curves shown, the one corresponding to the network with the largest number of 'extra' layers has reached a lower value of the loss function than the others in the same amount of time, thus demonstrating the 'acceleration effect'.

Then there is the Noise Stability Property: adding Gaussian noise in any layer of the deep network (even noise of the same norm as the weight vector, which one would normally expect to instantly muddy up any function the network was about to compute) actually has very little effect on the final function the network was trained to compute. Indeed, the induced error attenuates as it propagates up the layers, and even within a single layer it quickly starts to vanish. The following picture, also taken from the video of his talk, illustrates the results of the Noise Stability Experiment:
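One way to get a feel for the experiment is with a small measurement harness like the following (a hypothetical NumPy sketch of my own, using a random untrained network purely to show the protocol; Arora's actual result concerns trained networks, which is where the strong attenuation shows up): inject Gaussian noise at one hidden layer and measure how much the final output moves, relative to its size.

```python
import numpy as np

rng = np.random.default_rng(0)

def forward(weights, x, noise_layer=None, noise=None):
    # Feed-forward ReLU network (last layer linear); optionally add a
    # noise vector to the activations of one chosen hidden layer.
    h = x
    for i, W in enumerate(weights[:-1]):
        h = np.maximum(0.0, W @ h)
        if noise_layer is not None and i == noise_layer:
            h = h + noise
    return weights[-1] @ h

# A small random 4-layer network (untrained; for illustration only).
dims = [20, 20, 20, 20, 10]
weights = [rng.standard_normal((dims[i + 1], dims[i])) / np.sqrt(dims[i])
           for i in range(len(dims) - 1)]

x = rng.standard_normal(20)
clean = forward(weights, x)

# Gaussian noise scaled to the same norm as the layer-1 activations.
h1 = np.maximum(0.0, weights[1] @ np.maximum(0.0, weights[0] @ x))
noise = rng.standard_normal(h1.shape)
noise *= np.linalg.norm(h1) / np.linalg.norm(noise)

noisy = forward(weights, x, noise_layer=1, noise=noise)
relative_change = np.linalg.norm(noisy - clean) / np.linalg.norm(clean)
```

Repeating this measurement at each layer of a trained network, and plotting the relative change against the layer at which the noise was injected, is essentially what the Noise Stability Experiment in the talk reports.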

But perhaps the most intriguing demonstration is that, in real life, deep networks are indeed able to generalize, despite the overparametrization that classical statisticians would instinctively decry. Whereas one expects that the training error can indeed be made very small with a reasonably representative training set, one is naively led to believe that the testing error would begin to rise once a critical number of parameters (or some similar quantitative measure of network complexity) is crossed. But, counter-intuitively, the generalization error stabilizes, neither increasing nor decreasing much.

After a detailed discussion of more advanced theoretical ideas such as the Layer Cushion (a measure of the redundancy in the network), activation contraction, black-box and non-black-box analyses, the manifold assumption, and others, in the context of the three main parts of the lecture (as above), Prof Arora wraps up by discussing directions of research likely to prove fruitful in the near future. Some of these his group is already working on, but others are earnest implorations to all those interested in contributing to the field:

1. Use ideas from Physics and the calculus of variations (such as Hamiltonian and Lagrangian methods), which he feels would shed more light on the dynamics of gradient descent in multi-dimensional spaces.

2. Develop a good theoretical handle on Unsupervised Learning and Deep Reinforcement Learning.

3. Design models to understand and analyze interactive learning (such as learning languages or other skills).

Overall, it is a lecture well worth watching (and perhaps watching again), because on an initial pass one cannot hope to absorb everything (unless one is a subject-matter expert in this field). This blog post is a modest attempt to aid that process for those interested, by providing a brief discussion of some important issues in the first part of the lecture.

One feels deeply grateful to Prof Arora not only for having bravely done the research, but also for having prepared and given this lecture; and to the Harvard University Mathematics Department for having invited and hosted him and, most importantly, for having videotaped the lecture and shared it on YouTube.

Saturday, December 8, 2018

Everything You Wanted to Know About Machine Learning....

...but didn't know whom to ask! 

Prof Sanjeev Arora, Princeton University, speaking at the Institute for Advanced Study

In a simple one-hour lecture, one which already has several thousand views on YouTube (and which does assume a certain amount of mathematical sophistication on the part of the audience, amply justified given that the audience consists mostly of mathematicians at the Institute for Advanced Study!), Professor Arora describes the three main categories of machine learning algorithms (Supervised, Unsupervised and Reinforcement Learning) and provides recent examples of 'successes' in each. Reinforcement Learning, which is taken up last, is nevertheless covered in sufficient detail not to leave the viewer disappointed, somewhat to one's pleasant surprise as the lecture unfolds!

One of the topics taken up in some detail is the prediction of ratings for books based on the text of existing reviews, an application under the broad rubric of Natural Language Processing, and a subject of Prof Arora's own recent work. The topic well illustrates many of the practical issues and methods involved in applications of machine learning, and also provides the intuition for developing theoretical ideas about how machine learning works. My own perspective is that of someone who has known much of the subject matter of the lecture for over three decades, since the mid-1980s, but I nevertheless found it an hour well spent. (One might also add that the lecture finishes almost exactly on the hour, showing excellent time management on Prof Arora's part as the lecture dynamically evolves!) What I also greatly appreciated was that Prof Arora reiterates several times during the talk the main reasons for the heightened interest in machine learning today. This gives further insight into why someone previously indifferent to the field, for whatever reason (but especially out of an informed skepticism regarding such methods, as Prof Arora hints was once the case with him), might now become interested, or even very interested, in it. Another aspect of the lecture worth mentioning, especially these days when it is such a rarity, is that Prof Arora uses blackboard-and-chalk throughout. This added to my own enjoyment of the lecture, and doubtless did the same for its live audience; a certain spontaneity is inevitably lost when lecturers use slides prepared in advance.

Two thumbs up, then!