Notions in Optimal Transport for Sigmoid Neural Networks

A beginner's analysis of "On the Global Convergence of Gradient Descent for Over-Parameterized Models using Optimal Transport" by Chizat and Bach, Jan 2023

To access the document, please click [PDF]

The presentation is at [this link]

An 80-page report written to prepare a presentation for the PhD course Real Analysis II, offered at Bocconi University, and expanded by devoting time to it through the Visiting Student Initiative.

Abstract

The following document is an exploration of the results of the paper in the subtitle, written to better understand the content of its claims. It is not an extension but rather an expansion of some of the elements needed by a less experienced reader. As it was produced in fulfillment of a semester exam for an Optimal Transport course, over roughly a month, it does not cover all of the paper's content. The focus is on two-layer sigmoid neural networks and on the theoretical results needed to understand them. I also took inspiration from a video presentation of the publication and from two blog posts by the authors. The works cited are in line with those of the authors, with some additional resources that I found helpful. Given the breadth of the subject, some of the content is left for future study, but nothing less than the original publication is presented. I personally see this as a depth project, going deep into theoretical results to see the potential of the theory of neural networks; it is by no means an exposition of skills that I have fully mastered.
Section 1 paves the way for the work in a broad sense, introducing parametric optimization and the problem that will be studied, together with an implementable version of it. Section 2 shows how the formalism of Wasserstein Gradient Flows is instrumental in connecting the two versions of the problem. Section 3 is the final theoretical contribution, characterizing the conditions under which the method considered attains a global optimum. Lastly, Section 4 shows that sigmoid neural networks can benefit from these results and be tuned to reach globally optimizing configurations, with satisfactory experimental results.
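For orientation, here is a minimal sketch of the two versions of the problem, in my own simplified notation rather than that of the report: \Phi(\theta) is the feature map of a single unit (for a sigmoid unit, \theta = (w, a, b) and \Phi(\theta) = w\,\sigma(a \cdot x + b), seen as a function of the input x), R is a smooth convex loss functional, and a possible regularization term is omitted.

\[
  \underbrace{\;\min_{\theta_1, \dots, \theta_m} R\Big( \tfrac{1}{m} \sum_{i=1}^{m} \Phi(\theta_i) \Big)\;}_{\text{implementable, particle version}}
  \qquad \longleftrightarrow \qquad
  \underbrace{\;\min_{\mu \in \mathcal{P}(\Theta)} R\Big( \int_{\Theta} \Phi(\theta) \, \mathrm{d}\mu(\theta) \Big)\;}_{\text{measure-valued version}}
\]

Gradient descent on the left-hand problem is the algorithm one actually runs; the Wasserstein gradient flow of the right-hand functional is its many-particle limit, and it is for this limit that the global convergence conditions of Section 3 are stated.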