Introduction: Visualizing How Neural Nets Learn

About: Made this for the love of Robotics and sharing my knowledge

In this project, we treat neural networks as little math machines that learn to draw curves. We will see how to choose how many “layers” (stacks of math steps) and how many “neurons” (tiny calculators inside each layer) the network needs to draw any polynomial curve you want, like lines, parabolas, or wiggly higher-degree shapes.

You will also see what happens when the network is asked to make predictions in places it never saw during training (this is called “out of distribution”): sometimes it guesses sensibly, and sometimes it goes wild in ways that look nothing like your original polynomial. Finally, we will zoom in on overfitting: how a network can cling too tightly to noisy training points, making a curve that zigzags through every dot instead of following the smooth, simple polynomial underneath, and you’ll learn how to spot that behavior in your own experiments.

Supplies

Just a Computer!

Download these three files. The README.txt explains what you need to install to get the script working, and the Config.json lets you define the lines/polynomials you want to train your models on, as well as the sizes of those models. Then run MLP_Training_Visualizer.py. I wanted to make this open source so anyone can play around with the code and visualize how these mysterious black boxes truly learn functions.
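To give you a feel for the kind of settings Config.json captures, here is a rough sketch written as a small Python snippet. The field names below (polynomial coefficients, training range, noise level, model sizes) are just my guesses at a sensible layout, not the project's actual schema, so treat the README.txt as the source of truth.

    import json

    # Hypothetical sketch of the kind of settings Config.json captures -- the real
    # field names are defined by the project, so check README.txt for the schema.
    example_config = {
        "polynomial_coefficients": [0.5, -2.0, 1.0],   # 0.5*x**2 - 2*x + 1, highest degree first
        "train_x_range": [-2.0, 2.0],                  # where the training points are sampled
        "noise_std": 0.1,                              # noise added to the training targets
        "models": [
            {"hidden_layers": 1,  "neurons_per_layer": 1},
            {"hidden_layers": 4,  "neurons_per_layer": 4},
            {"hidden_layers": 10, "neurons_per_layer": 128},
        ],
    }

    print(json.dumps(example_config, indent=2))  # what this would look like as JSON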

Step 1: Predicting a Line

In this project, you’ll meet three “neural network brains”, known in the business as multilayer perceptrons (MLPs), that all try to copy the same line but behave very differently.

First, there is a super tiny brain: 1 layer with just 1 neuron.

  1. It is so simple that it basically learns only a flat line (just a bias), so it cannot bend to follow the curve at all.

Next, there is a medium brain: 4 layers with 4 neurons each.

  1. This one is smart enough to learn the shape of the curve where it has training points, but when you ask it to predict far away from the data, we can see its guesses (yellow) are very different from the actual data (green).

Finally, there is the huge brain: 10 layers with 128 neurons each.

  1. Now the brain is so powerful that it starts to “overthink” the data, drawing a jagged, wiggly line that goes exactly through the noisy points instead of following the nice smooth curve underneath. That wiggly behavior is called overfitting. It’s the equivalent of memorizing the training points. (A code sketch of all three model sizes follows below.)
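If you are curious what those three “brains” look like as actual code, here is a minimal sketch of how such models can be built in PyTorch, assuming plain fully connected layers with ReLU activations; the script may organize its model code differently.

    import torch.nn as nn

    def make_mlp(hidden_layers, neurons_per_layer):
        """Build a simple 1-input, 1-output MLP for fitting y = f(x)."""
        layers, in_features = [], 1
        for _ in range(hidden_layers):
            layers += [nn.Linear(in_features, neurons_per_layer), nn.ReLU()]
            in_features = neurons_per_layer
        layers.append(nn.Linear(in_features, 1))  # output layer: one predicted y value
        return nn.Sequential(*layers)

    tiny_brain   = make_mlp(hidden_layers=1,  neurons_per_layer=1)    # underfits
    medium_brain = make_mlp(hidden_layers=4,  neurons_per_layer=4)    # fits the training range
    huge_brain   = make_mlp(hidden_layers=10, neurons_per_layer=128)  # big enough to overfit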


Step 2: Predicting a Curved Line

Simple brain: 1 layer with 4 neurons.

  1. It is only strong enough to learn something close to a straight line, so it captures just the overall tilt of the data and misses the curves.

Medium brain: 4 layers with 4 neurons in each layer.

  1. This one is powerful enough to match the whole curve where it has training points, but when you ask it to predict far outside that region, its guesses become poor because it never really “understood” the function there; it just learned to interpolate between known points.

Finally, there is a huge brain: 10 layers with 128 neurons per layer.

  1. It becomes so expressive that it starts to overfit, doing extremely well on the training data while doing much worse on the test data.
  2. Although the data points are more spread out and it's harder to see the overfitting as a jagged line in the curve plot, we can look at the training vs. test loss plot instead (sketched below).
  3. The training loss keeps going down while the test loss stays roughly flat, showing that the network is memorizing instead of truly learning the underlying polynomial.
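If you want to make that check yourself, the plot only takes a few lines. Below is a minimal matplotlib sketch, assuming you have recorded one training-loss and one test-loss value per epoch; the visualizer script draws its own version of this plot.

    import matplotlib.pyplot as plt

    def plot_losses(train_losses, test_losses):
        """Plot per-epoch training and test loss to spot overfitting."""
        epochs = range(1, len(train_losses) + 1)
        plt.plot(epochs, train_losses, label="training loss")
        plt.plot(epochs, test_losses, label="test loss")
        plt.xlabel("epoch")
        plt.ylabel("mean squared error")
        plt.yscale("log")  # a log scale makes the gap between the curves easier to see
        plt.legend()
        plt.show()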


Step 3: Taking a Step Back

We learned that neural networks can all be trained on the same polynomial but behave very differently depending on their size and depth. A small 1‑layer, 4‑neuron network mostly behaves like a linear model and cannot capture the curved shape of the polynomial, so it underfits.

A medium 4‑layer, 4‑neuron network has enough capacity to match the polynomial well on the region where it sees training data, but its predictions become unreliable once we move outside that range, showing that good in‑distribution performance does not guarantee good extrapolation.
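An easy way to see this for yourself is to sample the training inputs from a narrow range and the test inputs from a wider one, then compare the network’s predictions against the true polynomial outside the training range. The snippet below is just an illustrative NumPy sketch; the actual ranges and coefficients come from your Config.json.

    import numpy as np

    rng = np.random.default_rng(0)
    coeffs = [0.5, -2.0, 1.0]   # example polynomial: 0.5*x**2 - 2*x + 1 (highest degree first)

    # In-distribution: the range the network actually trains on, with a little noise.
    x_train = rng.uniform(-2.0, 2.0, size=200)
    y_train = np.polyval(coeffs, x_train) + rng.normal(0.0, 0.1, size=x_train.shape)

    # Out of distribution: a wider range the network never sees during training.
    x_test = np.linspace(-5.0, 5.0, 500)
    y_true = np.polyval(coeffs, x_test)
    # After training, compare the model's predictions on x_test against y_true:
    # inside [-2, 2] they usually match well, outside they can drift badly.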

A large 10‑layer, 128‑neuron network has so much capacity that it begins to overfit: the training loss keeps dropping while the test loss stops improving or even gets worse, even though the function plots may still look smooth because the data points are far apart.

Just as a side note here: neural net size depends on the problem! As we saw, a 1-layer, 4-neuron network is perfect for straight lines, but too small for curves.

Overall, this shows that picking an architecture is a balancing act: too small a network and it will not learn the function; big enough and it learns the function on the range of the training data; too big and it memorizes the training set.