oneflow.optim.RMSprop

class oneflow.optim.RMSprop(params: Union[Iterator[oneflow.nn.Parameter], List[Dict]], lr: float = 0.001, alpha: float = 0.99, eps: float = 1e-08, weight_decay: float = 0, momentum: float = 0.0, centered: bool = False, contiguous_params: bool = False)

Implements RMSprop algorithm.

Root Mean Squared Propagation (RMSProp) is an unpublished, adaptive learning rate method. RMSProp was originally proposed on slide 29 of http://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf .

The original equation is as follows:

\[
\begin{aligned}
r(w, t) &= \alpha r(w, t-1) + (1 - \alpha)(\nabla Q_{i}(w))^2 \\
w &= w - \frac{\eta}{\sqrt{r(w,t) + \epsilon}} \nabla Q_{i}(w)
\end{aligned}
\]

The first equation computes a moving average of the squared gradient for each weight; the gradient is then divided by \(\sqrt{r(w,t)}\). In some cases, adding a momentum term \(\beta\) is beneficial. In our implementation, Nesterov momentum is used:

\[
\begin{aligned}
r(w, t) &= \alpha r(w, t-1) + (1 - \alpha)(\nabla Q_{i}(w))^2 \\
v(w, t) &= \beta v(w, t-1) + \frac{\eta}{\sqrt{r(w,t) + \epsilon}} \nabla Q_{i}(w) \\
w &= w - v(w, t)
\end{aligned}
\]

if centered is True:

\[
\begin{aligned}
r(w, t) &= \alpha r(w, t-1) + (1 - \alpha)(\nabla Q_{i}(w))^2 \\
g(w, t) &= \alpha g(w, t-1) + (1 - \alpha)\nabla Q_{i}(w) \\
v(w, t) &= \beta v(w, t-1) + \frac{\eta}{\sqrt{r(w,t) - (g(w, t))^2 + \epsilon}} \nabla Q_{i}(w) \\
w &= w - v(w, t)
\end{aligned}
\]

where \(\alpha\) is a hyperparameter with typical values such as 0.99 or 0.95, \(\beta\) is the momentum term, and \(\epsilon\) is a smoothing term that avoids division by zero, usually set somewhere in the range 1e-4 to 1e-8.
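
To make the update rule concrete, the following is a minimal NumPy sketch of a single non-centered RMSprop step, written directly from the equations above. It is only an illustration, not the OneFlow implementation; the names r, v, and grad mirror the symbols in the formulas.

import numpy as np

def rmsprop_step(w, grad, r, v, lr=1e-3, alpha=0.99, beta=0.0, eps=1e-8):
    # r(w, t): moving average of the squared gradient.
    r = alpha * r + (1 - alpha) * grad ** 2
    # v(w, t): (optionally momentum-accumulated) adaptive step.
    v = beta * v + lr / np.sqrt(r + eps) * grad
    # w = w - v(w, t): apply the update.
    return w - v, r, v

w = np.array([1.0, -2.0])
grad = np.array([0.1, -0.3])
r = np.zeros_like(w)
v = np.zeros_like(w)
w, r, v = rmsprop_step(w, grad, r, v)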

Parameters
  • params (iterable) – iterable of parameters to optimize or dicts defining parameter groups

  • lr (float, optional) – learning rate (default: 1e-3)

  • momentum (float, optional) – momentum factor (default: 0.0; OneFlow does not support momentum > 0 yet)

  • alpha (float, optional) – smoothing constant (default: 0.99)

  • eps (float, optional) – term added to the denominator to improve numerical stability (default: 1e-8)

  • centered (bool, optional) – if True, compute the centered RMSProp: the gradient is normalized by an estimate of its variance (default: False)

  • weight_decay (float, optional) – weight decay (L2 penalty) (default: 0)

  • contiguous_params (bool, optional) – whether to use a contiguous ParamGroup, which places all parameters of the same type, device, and group into a single tensor and updates them together (default: False; a short variant enabling this flag is shown after Example 1 below)

For example:

Example 1:

import oneflow as flow

# Assume net is a custom model.
rmsprop = flow.optim.RMSprop(net.parameters(), lr=1e-3)

for epoch in range(epochs):
    # Read data, Compute the loss and so on.
    # ...
    loss.backward()
    rmsprop.step()
    rmsprop.zero_grad()
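
To enable the contiguous_params option described in the parameter list above, only the constructor call changes; the training loop from Example 1 stays the same:

rmsprop = flow.optim.RMSprop(
    net.parameters(), lr=1e-3, contiguous_params=True
)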

Example 2:

# Assume net is a custom model and learning_rate is defined.
rmsprop = flow.optim.RMSprop(
    [
        {
            "params": net.parameters(),
            "lr": learning_rate,
            "clip_grad_max_norm": 0.5,
            "clip_grad_norm_type": 2.0,
        }
    ],
)

for epoch in range(epochs):
    # Read data, Compute the loss and so on.
    # ...
    loss.backward()
    rmsprop.clip_grad()
    rmsprop.step()
    rmsprop.zero_grad()

If you want to use clip_grad, you can refer to this example.

For more details on clip_grad_max_norm and clip_grad_norm_type, refer to oneflow.nn.utils.clip_grad_norm_().
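
Gradient clipping can also be applied outside the optimizer. The sketch below reuses net, loss, and rmsprop from the examples above and assumes oneflow.nn.utils.clip_grad_norm_ follows the usual (parameters, max_norm, norm_type) calling convention:

loss.backward()
# Clip the global gradient norm to 0.5 before stepping.
flow.nn.utils.clip_grad_norm_(net.parameters(), max_norm=0.5, norm_type=2.0)
rmsprop.step()
rmsprop.zero_grad()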

__init__(params: Union[Iterator[oneflow.nn.Parameter], List[Dict]], lr: float = 0.001, alpha: float = 0.99, eps: float = 1e-08, weight_decay: float = 0, momentum: float = 0.0, centered: bool = False, contiguous_params: bool = False)

Initialize self. See help(type(self)) for accurate signature.

Methods

__delattr__(name, /)

Implement delattr(self, name).

__dir__()

Default dir() implementation.

__eq__(value, /)

Return self==value.

__format__(format_spec, /)

Default object formatter.

__ge__(value, /)

Return self>=value.

__getattribute__(name, /)

Return getattr(self, name).

__gt__(value, /)

Return self>value.

__hash__()

Return hash(self).

__init__(params[, lr, alpha, eps, …])

Initialize self.

__init_subclass__

This method is called when a class is subclassed.

__le__(value, /)

Return self<=value.

__lt__(value, /)

Return self<value.

__ne__(value, /)

Return self!=value.

__new__(**kwargs)

Create and return a new object.

__reduce__()

Helper for pickle.

__reduce_ex__(protocol, /)

Helper for pickle.

__repr__()

Return repr(self).

__setattr__(name, value, /)

Implement setattr(self, name, value).

__sizeof__()

Size of object in memory, in bytes.

__str__()

Return str(self).

__subclasshook__

Abstract classes can override this to customize issubclass().

_check_variables_in_graph(vars_conf)

_check_variables_optimizer_bound(vars_conf)

_generate_conf_for_graph(train_conf, vars_conf)

_generate_grad_clip_conf_for_optim_conf(…)

_generate_indexed_slices_optimizer_conf(…)

_generate_lr_scale_for_optim_conf(…)

_parse_input_parameters(parameters)

Supports such parameters:

add_param_group(param_group)

Add a param group to the Optimizer's param_groups.

clip_grad([error_if_nonfinite])

Clips gradient norm of an iterable of parameters.

load_state_dict(state_dict)

Load the state of the optimizer that was created by the state_dict function.

state_dict()

Returns the state of the optimizer as a dict (a checkpointing sketch is given at the end of this page).

step([closure])

Performs a single optimization step.

zero_grad([set_to_none])

Sets the gradients of all optimized oneflow.Tensor objects to zero.

Attributes

support_sparse

Whether the Optimizer supports sparse updates.
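
As mentioned under state_dict() above, the optimizer state can be checkpointed alongside the model. A minimal sketch, assuming flow.save and flow.load are available for serialization and reusing net and rmsprop from the examples above:

# Save model and optimizer state.
flow.save(net.state_dict(), "net_state")
flow.save(rmsprop.state_dict(), "rmsprop_state")

# Later: rebuild the optimizer and restore its state.
rmsprop = flow.optim.RMSprop(net.parameters(), lr=1e-3)
rmsprop.load_state_dict(flow.load("rmsprop_state"))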