A depiction of a complex error surface. Image from the Snapshot Ensembles paper.

Do smoother areas of the error surface lead to better generalization?

An experiment inspired by the first lecture of the fast.ai MOOC

In the first lecture of the outstanding Deep Learning Course (the link points to version 1, which is also superb; v2 is due to become available in early 2018), we learned how to train a state-of-the-art model using very recent techniques, such as the learning rate estimation method described in the 2015 paper Cyclical Learning Rates for Training Neural Networks.

The setup

I trained 100 simple neural networks, each consisting of a single hidden layer with 20 units, on the MNIST dataset. On average they achieved 95.4% accuracy on the test set. I then measured each network's ability to generalize by looking at the difference between its loss on the train set and its loss on the test set.
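To make the measurement concrete, here is a minimal sketch of how such a gap could be computed. It is not the code used in the experiment. The helper names (`forward`, `mean_loss`, `generalization_gap`, and so on), the ReLU activation, the sign convention (test loss minus train loss), and the random stand-in data are all assumptions made for illustration. The network is left untrained, and real MNIST splits would replace `random_split`.

```python
import math
import random

def forward(x, params):
    """One hidden layer (ReLU, an assumption) followed by a linear output; returns logits."""
    w1, b1, w2, b2 = params
    hidden = [max(0.0, sum(xi * w for xi, w in zip(x, row)) + b)
              for row, b in zip(w1, b1)]
    return [sum(h * w for h, w in zip(hidden, row)) + b
            for row, b in zip(w2, b2)]

def cross_entropy(logits, label):
    """Negative log-likelihood of the true class, via a stable log-sum-exp."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - logits[label]

def mean_loss(params, xs, ys):
    """Average cross-entropy loss over a dataset."""
    return sum(cross_entropy(forward(x, params), y) for x, y in zip(xs, ys)) / len(xs)

def generalization_gap(params, train, test):
    """Test loss minus train loss (the sign convention is an assumption)."""
    return mean_loss(params, *test) - mean_loss(params, *train)

def init_params(n_in, n_hidden, n_out, rng):
    """Small random weights and zero biases for a single-hidden-layer network."""
    w1 = [[rng.gauss(0, 0.1) for _ in range(n_in)] for _ in range(n_hidden)]
    b1 = [0.0] * n_hidden
    w2 = [[rng.gauss(0, 0.1) for _ in range(n_hidden)] for _ in range(n_out)]
    b2 = [0.0] * n_out
    return w1, b1, w2, b2

def random_split(n, n_in, n_out, rng):
    """Hypothetical stand-in for a dataset split: random inputs and random labels."""
    xs = [[rng.random() for _ in range(n_in)] for _ in range(n)]
    ys = [rng.randrange(n_out) for _ in range(n)]
    return xs, ys

rng = random.Random(0)
# 784 inputs and 10 classes match MNIST; 20 hidden units match the setup above.
params = init_params(n_in=784, n_hidden=20, n_out=10, rng=rng)
train = random_split(50, 784, 10, rng)
test = random_split(50, 784, 10, rng)

gap = generalization_gap(params, train, test)
print(f"generalization gap: {gap:.4f}")
```

With this convention, a positive gap means the network does worse on unseen data than on the data it was trained on. Repeating the measurement for each of the 100 trained networks would give one gap value per network.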


Can we safely conclude that the relationship we hypothesized does not hold? By no means; we cannot do that either!

I ❤️ ML / DL ideas — I tweet about them / write about them / implement them. Self-taught RoR developer by trade.