
Neha Kumawat

a year ago


In my previous article, **“Optimizers in Machine Learning and Deep Learning”**, I gave a brief introduction to the RMSprop optimizer. In this article, I will try to give an in-depth explanation of the optimizer’s algorithm.

If you have not read my previous articles on optimizers, I recommend going through them first and then coming back to this article for a better understanding.

Let’s start with the basic idea behind RMSprop.

RMSProp stands for Root Mean Square Propagation. It is an adaptive learning rate optimization algorithm proposed by **Geoff Hinton** in Lecture 6 of the online course “Neural Networks for Machine Learning”.

The core idea behind RMSprop is to keep a moving average of the squared gradients for each weight and then divide the gradient by the square root of that mean square. That’s why it’s called RMSprop (root mean square).

RMSProp tries to resolve Adagrad’s **radically diminishing learning rates** by using a moving average of the squared gradients. It uses the magnitude of the recent gradients to normalize the current gradient.

Adagrad accumulates all previous squared gradients, whereas RMSprop only keeps a decaying average of them. This alleviates the problem that the learning rate of the Adagrad algorithm drops too quickly.
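To make the contrast concrete, here is a minimal sketch that feeds the same sequence of gradients to both accumulators. The function name `accumulate` and the constant gradient of 1.0 are assumptions chosen for illustration, not part of either algorithm.

```python
def accumulate(grads, beta=0.9):
    """Track Adagrad's running sum and RMSprop's moving average of squared gradients."""
    adagrad_sum, rms_avg = 0.0, 0.0
    history = []
    for g in grads:
        adagrad_sum += g ** 2                            # keeps growing with every step
        rms_avg = beta * rms_avg + (1 - beta) * g ** 2   # gradually forgets old gradients
        history.append((adagrad_sum, rms_avg))
    return history

# Hypothetical input: 100 identical gradients of size 1.0
history = accumulate([1.0] * 100)
final_sum, final_avg = history[-1]
print(final_sum, final_avg)
```

With 100 identical gradients, Adagrad's sum reaches 100 while RMSprop's average stays close to 1, so dividing the learning rate by the square root shrinks Adagrad's step roughly ten times more.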

The main difference is that RMSProp computes an exponentially weighted average of the squared gradients rather than their running sum. Dividing by the root of this average damps the directions in which the gradient swings widely, so the swing amplitude in each dimension becomes smaller. It also helps the network converge faster.

In RMSProp, the learning rate is adjusted automatically, and a different effective learning rate is used for each parameter.

RMSProp divides the learning rate by the square root of an exponentially decaying average of squared gradients.
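The per-parameter behaviour can be sketched with NumPy as follows. This is only an illustration under stated assumptions: the helper `rmsprop_step`, the two fixed gradient values, and the hyperparameters are hypothetical and chosen to make the effect easy to see.

```python
import numpy as np

def rmsprop_step(params, grads, avg_sq, eta=0.01, beta=0.9, eps=1e-9):
    """One RMSprop update; every array holds one entry per parameter."""
    avg_sq = beta * avg_sq + (1 - beta) * grads ** 2      # decaying average of squared gradients
    params = params - eta / np.sqrt(avg_sq + eps) * grads # each parameter gets its own step scale
    return params, avg_sq

eta, eps = 0.01, 1e-9
params = np.zeros(2)
avg_sq = np.zeros(2)
grads = np.array([100.0, 0.01])   # hypothetical gradients on very different scales

for _ in range(200):
    params, avg_sq = rmsprop_step(params, grads, avg_sq, eta=eta, eps=eps)

per_param_rate = eta / np.sqrt(avg_sq + eps)
print("effective learning rates:", per_param_rate)
print("parameters after 200 steps:", params)
```

Although the two assumed gradients differ by a factor of 10,000, both parameters end up moving by almost the same amount, because each step is scaled by that parameter's own gradient history.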

Now that you have a brief idea of the RMSprop optimizer, let me explain the intuition behind it.

Let’s try to understand in a simple
way. We can say that the RMSprop optimizer is similar to the gradient descent
algorithm with momentum.

The RMSprop optimizer tries to restrict the oscillations in the vertical direction. This in turn lets us increase the learning rate, so that the algorithm can take larger steps in the horizontal direction and converge faster. The main difference between RMSprop and gradient descent with momentum is how the gradients are used to compute the update. From the update rules below, we can see how this is done for RMSprop and for gradient descent with momentum. Here, the momentum (or decay) coefficient is denoted by beta and is usually set to 0.9.

Gradient descent with momentum update rule
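One common way to write this rule, assuming the exponentially weighted average form and using $\eta$ for the learning rate, $\beta$ for the momentum coefficient, and $dw$, $db$ for the gradients of the loss with respect to the weight and bias, is sketched below. Some texts omit the $(1-\beta)$ factor, so treat this as one variant rather than the only definition.

$$
\begin{aligned}
v_w &= \beta\, v_w + (1-\beta)\, dw, & v_b &= \beta\, v_b + (1-\beta)\, db \\
w &= w - \eta\, v_w, & b &= b - \eta\, v_b
\end{aligned}
$$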

RMSprop optimizer update rule
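Written with the same symbols as the code later in this article ($\eta$ for the learning rate, $\beta$ for the decay rate, $\epsilon$ for a small constant that avoids division by zero, and $dw$, $db$ for the gradients), the rule can be sketched as:

$$
\begin{aligned}
v_w &= \beta\, v_w + (1-\beta)\, (dw)^2, & v_b &= \beta\, v_b + (1-\beta)\, (db)^2 \\
w &= w - \frac{\eta}{\sqrt{v_w + \epsilon}}\, dw, & b &= b - \frac{\eta}{\sqrt{v_b + \epsilon}}\, db
\end{aligned}
$$

Here $v_w$ and $v_b$ average the squared gradients, and the gradient itself is divided by the root of that average instead of being averaged directly.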

From the above equations, we can see that the two update rules are very similar. The only differences are in how the gradient is processed and how the weights and biases are updated.

Momentum (blue) and RMSprop (green) convergence. Observe that RMSprop converges faster.

We can simply define a function for RMSProp as shown
below:

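```
import numpy as np

def rmsprop():
    # init_w, init_b, max_epochs, X, Y, grad_w and grad_b are assumed to be defined elsewhere
    w, b, eta = init_w, init_b, 0.1
    vw, vb, beta, eps = 0, 0, 0.9, 1e-9
    for i in range(max_epochs):
        # accumulate the gradients over the whole dataset
        dw, db = 0, 0
        for x, y in zip(X, Y):
            dw += grad_w(w, b, x, y)
            db += grad_b(w, b, x, y)
        # moving average of the squared gradients
        vw = beta * vw + (1 - beta) * dw**2
        vb = beta * vb + (1 - beta) * db**2
        # scale each step by the root of its moving average
        w = w - (eta / np.sqrt(vw + eps)) * dw
        b = b - (eta / np.sqrt(vb + eps)) * db
    return w, b
```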
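The function above relies on data and gradient helpers that are defined elsewhere. As a minimal self-contained sketch, assume a single sigmoid neuron trained with squared-error loss on two toy points. The model, the data, the starting values, and the helpers `f`, `loss`, `grad_w`, and `grad_b` are all assumptions made for illustration and are not taken from this article.

```python
import numpy as np

# Toy data, assumed for illustration only
X = [0.5, 2.5]
Y = [0.2, 0.9]

def f(w, b, x):
    """Output of a single sigmoid neuron."""
    return 1.0 / (1.0 + np.exp(-(w * x + b)))

def loss(w, b):
    """Squared-error loss summed over the toy dataset."""
    return sum(0.5 * (f(w, b, x) - y) ** 2 for x, y in zip(X, Y))

def grad_w(w, b, x, y):
    fx = f(w, b, x)
    return (fx - y) * fx * (1 - fx) * x

def grad_b(w, b, x, y):
    fx = f(w, b, x)
    return (fx - y) * fx * (1 - fx)

def rmsprop(init_w=-2.0, init_b=-2.0, eta=0.1, beta=0.9, eps=1e-9, max_epochs=1000):
    w, b = init_w, init_b
    vw, vb = 0.0, 0.0
    for _ in range(max_epochs):
        # full-batch gradients
        dw = sum(grad_w(w, b, x, y) for x, y in zip(X, Y))
        db = sum(grad_b(w, b, x, y) for x, y in zip(X, Y))
        # moving averages of the squared gradients
        vw = beta * vw + (1 - beta) * dw ** 2
        vb = beta * vb + (1 - beta) * db ** 2
        # normalized parameter updates
        w -= eta / np.sqrt(vw + eps) * dw
        b -= eta / np.sqrt(vb + eps) * db
    return w, b

initial_loss = loss(-2.0, -2.0)
w, b = rmsprop()
final_loss = loss(w, b)
print("loss before:", initial_loss, "loss after:", final_loss)
```

Passing the hyperparameters as keyword arguments makes it easy to try other values of `eta` and `beta` and watch how the final loss changes.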

RMSProp (green) vs AdaGrad (white). The squares show each optimizer's sum of squared gradients.

To see the effect of the decay, let’s compare the two. AdaGrad (white) keeps up with RMSProp (green) initially, as expected with the tuned learning rate and decay rate. But we can see from the above animation that the **sums of squared gradients** for AdaGrad accumulate so fast that they soon become huge (shown by the sizes of the squares in the animation). These large sums shrink the step size, and eventually AdaGrad practically stops moving.

RMSProp, on the other hand, keeps the squares at a manageable size the whole time with the help of the decay rate. This makes RMSProp faster than AdaGrad.
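The stalling effect can be reproduced in a few lines. The sketch below is an illustration under stated assumptions, not the setup used for the animation: the one-dimensional loss `p**2`, the starting point, and the hyperparameters are all hypothetical.

```python
import math

def grad(p):
    """Gradient of the toy loss f(p) = p**2 (assumed for illustration)."""
    return 2.0 * p

def run(optimizer, p=5.0, eta=0.1, beta=0.9, eps=1e-9, steps=100):
    acc = 0.0
    for _ in range(steps):
        g = grad(p)
        if optimizer == "adagrad":
            acc += g ** 2                            # sum of all squared gradients so far
        else:
            acc = beta * acc + (1 - beta) * g ** 2   # decaying average of squared gradients
        p -= eta / math.sqrt(acc + eps) * g
    # return the final position and the final effective learning rate
    return p, eta / math.sqrt(acc + eps)

p_ada, rate_ada = run("adagrad")
p_rms, rate_rms = run("rmsprop")
print(f"AdaGrad: p={p_ada:.3f}, effective rate={rate_ada:.5f}")
print(f"RMSProp: p={p_rms:.3f}, effective rate={rate_rms:.5f}")
```

After the same number of steps, AdaGrad is still far from the minimum at 0 with a very small effective learning rate, while RMSProp has moved much closer because its accumulator does not keep growing.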

I hope that after reading this article you now know **what RMSprop is, how it works, and how it differs from the Adagrad optimizer**. In the next articles, I will give detailed explanations of some other types of optimizers. For more blogs and courses on data science, machine learning, artificial intelligence, and new technologies, do visit us at **InsideAIML**.

Thanks for reading…
