# oneflow.optim

## Optimizers

oneflow.experimental.optim.Adam(parameters: Union[Iterator[oneflow.python.nn.parameter.Parameter], List[Dict]], lr: float = 0.001, betas: Tuple[float, float] = (0.9, 0.999), eps: float = 1e-08, weight_decay: float = 0, amsgrad: bool = False, scale: float = 1.0)

Implements the Adam algorithm.

It was proposed in Adam: A Method for Stochastic Optimization. The implementation of the L2 penalty follows changes proposed in Decoupled Weight Decay Regularization.

This algorithm adjusts the learning rate of each parameter dynamically according to the first-moment and second-moment estimates of the gradient.

The parameter update equations are:

\begin{align}\begin{aligned}& V_t = \beta_1*V_{t-1} + (1-\beta_1)*grad\\& S_t = \beta_2*S_{t-1} + (1-\beta_2)*{grad} \odot {grad}\\& \hat{g} = learning\_rate*\frac{{V_t}}{\sqrt{{S_t}}+\epsilon}\\& param_{new} = param_{old} - \hat{g}\end{aligned}\end{align}
Parameters
• params (iterable) – iterable of parameters to optimize or dicts defining parameter groups

• lr (float, optional) – learning rate (default: 1e-3)

• betas (Tuple[float, float], optional) – coefficients used for computing running averages of gradient and its square (default: (0.9, 0.999))

• eps (float, optional) – term added to the denominator to improve numerical stability (default: 1e-8)

• weight_decay (float, optional) – weight decay (L2 penalty) (default: 0)

• scale (float, optional) – the scale factor of loss (default: 1.0)
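
The update equations above can be sketched in plain Python for a single scalar parameter. This is only an illustration of the formulas as written (which include no bias correction), not OneFlow's actual kernel; the function name `adam_step` and the toy objective are assumptions made for the example.

```python
import math


def adam_step(param, grad, v, s, lr=1e-3, betas=(0.9, 0.999), eps=1e-8):
    """One scalar Adam update following the documented equations (hypothetical helper)."""
    beta1, beta2 = betas
    v = beta1 * v + (1 - beta1) * grad          # first-moment estimate V_t
    s = beta2 * s + (1 - beta2) * grad * grad   # second-moment estimate S_t
    g_hat = lr * v / (math.sqrt(s) + eps)       # scaled update
    return param - g_hat, v, s


# Toy usage: a few steps on f(x) = x^2, whose gradient is 2x.
param, v, s = 1.0, 0.0, 0.0
for _ in range(3):
    param, v, s = adam_step(param, 2.0 * param, v, s)
```

The moment estimates `v` and `s` are carried between calls, which is the state a real optimizer keeps per parameter.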

oneflow.experimental.optim.AdamW(parameters: Union[Iterator[oneflow.python.nn.parameter.Parameter], List[Dict]], lr: float = 0.001, betas: Tuple[float, float] = (0.9, 0.999), eps: float = 1e-08, weight_decay: float = 0, amsgrad: bool = False, scale: float = 1.0)

Implements the AdamW algorithm.

The original Adam algorithm was proposed in Adam: A Method for Stochastic Optimization. The AdamW variant was proposed in Decoupled Weight Decay Regularization.

This optimizer implements the Adam-weight-decay algorithm, in which the weight decay term $$\lambda * param_{old}$$ is added to the update directly instead of being folded into the gradient (for more details, refer to Adam-weight-decay).

The parameter update equations are:

\begin{align}\begin{aligned}& V_t = \beta_1*V_{t-1} + (1-\beta_1)*grad\\& S_t = \beta_2*S_{t-1} + (1-\beta_2)*{grad} \odot {grad}\\& \hat{g} = learning\_rate*(\frac{{V_t}}{\sqrt{{S_t}}+\epsilon}+\lambda*param_{old})\\& param_{new} = param_{old} - \hat{g}\end{aligned}\end{align}
Parameters
• params (iterable) – iterable of parameters to optimize or dicts defining parameter groups

• lr (float, optional) – learning rate (default: 1e-3)

• betas (Tuple[float, float], optional) – coefficients used for computing running averages of gradient and its square (default: (0.9, 0.999))

• eps (float, optional) – term added to the denominator to improve numerical stability (default: 1e-8)

• weight_decay (float, optional) – weight decay (L2 penalty), denoted λ in the equation (default: 0)

• scale (float, optional) – the scale factor of loss (default: 1.0)
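
As a rough sketch of the AdamW equations above, the following plain-Python function applies one update to a scalar parameter. It mirrors the formulas as documented and is not taken from OneFlow's source; the name `adamw_step` is hypothetical.

```python
import math


def adamw_step(param, grad, v, s, lr=1e-3, betas=(0.9, 0.999),
               eps=1e-8, weight_decay=0.0):
    """One scalar AdamW update following the documented equations (hypothetical helper)."""
    beta1, beta2 = betas
    v = beta1 * v + (1 - beta1) * grad          # first-moment estimate V_t
    s = beta2 * s + (1 - beta2) * grad * grad   # second-moment estimate S_t
    # The decay term (lambda * param) is added next to the adaptive term,
    # so it is not rescaled by sqrt(S_t).
    g_hat = lr * (v / (math.sqrt(s) + eps) + weight_decay * param)
    return param - g_hat, v, s


# With a zero gradient, only the decay term moves the parameter.
decayed, _, _ = adamw_step(2.0, 0.0, 0.0, 0.0, lr=0.1, weight_decay=0.5)
```

Setting `weight_decay=0.0` reduces this to the Adam update shown earlier on this page.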

oneflow.experimental.optim.RMSprop(parameters: Union[Iterator[oneflow.python.nn.parameter.Parameter], List[Dict]], lr: float = 0.001, alpha: float = 0.99, eps: float = 1e-08, weight_decay: float = 0, momentum: float = 0.0, centered: bool = False, scale: float = 1.0)

Implements the RMSprop algorithm.

Root Mean Square Propagation (RMSProp) is an unpublished, adaptive learning rate method. It was originally proposed on slide 29 of http://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf .

The original equation is as follows:

\begin{align}\begin{aligned}& r(w, t) = \alpha r(w, t-1) + (1 - \alpha)(\nabla Q_{i}(w))^2\\& w = w - \frac{\eta}{\sqrt{r(w,t) + \epsilon}} \nabla Q_{i}(w)\end{aligned}\end{align}

The first equation calculates a moving average of the squared gradient for each weight. The gradient is then divided by $$\sqrt{r(w,t)}$$. In some cases, adding a momentum term $$\beta$$ is beneficial. In our implementation, Nesterov momentum is used:

\begin{align}\begin{aligned}& r(w, t) = \alpha r(w, t-1) + (1 - \alpha)(\nabla Q_{i}(w))^2\\& v(w, t) = \beta v(w, t-1) + \frac{\eta}{\sqrt{r(w,t) + \epsilon}} \nabla Q_{i}(w)\\& w = w - v(w, t)\end{aligned}\end{align}

If centered is True:

\begin{align}\begin{aligned}& r(w, t) = \alpha r(w, t-1) + (1 - \alpha)(\nabla Q_{i}(w))^2\\& g(w, t) = \alpha g(w, t-1) + (1 - \alpha)\nabla Q_{i}(w)\\& v(w, t) = \beta v(w, t-1) + \frac{\eta}{\sqrt{r(w,t) - (g(w, t))^2 + \epsilon}} \nabla Q_{i}(w)\\& w = w - v(w, t)\end{aligned}\end{align}

where $$\eta$$ is the learning rate, $$\alpha$$ is a smoothing hyperparameter with typical values such as 0.99 or 0.95, $$\beta$$ is the momentum term, and $$\epsilon$$ is a smoothing term to avoid division by zero, usually set somewhere in the range from 1e-4 to 1e-8.

Parameters
• params (iterable) – iterable of parameters to optimize or dicts defining parameter groups

• lr (float, optional) – learning rate (default: 1e-3)

• momentum (float, optional) – momentum factor (default: 0; OneFlow does not currently support momentum > 0)

• alpha (float, optional) – smoothing constant (default: 0.99)

• eps (float, optional) – term added to the denominator to improve numerical stability (default: 1e-8)

• centered (bool, optional) – if True, compute the centered RMSProp, in which the gradient is normalized by an estimate of its variance (default: False)

• weight_decay (float, optional) – weight decay (L2 penalty) (default: 0)
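
The RMSprop equations above, including the centered variant, can be illustrated with a small scalar sketch. It follows the formulas literally and keeps the momentum term for completeness, even though the parameter notes say momentum > 0 is not supported by the optimizer itself; `rmsprop_step` is a hypothetical name, not a OneFlow function.

```python
import math


def rmsprop_step(w, grad, r, g_avg, v, lr=1e-3, alpha=0.99, eps=1e-8,
                 momentum=0.0, centered=False):
    """One scalar RMSprop update following the documented equations (hypothetical helper)."""
    r = alpha * r + (1 - alpha) * grad * grad        # moving average of squared gradient
    if centered:
        g_avg = alpha * g_avg + (1 - alpha) * grad   # moving average of gradient
        denom = math.sqrt(r - g_avg * g_avg + eps)   # variance-based normalizer
    else:
        denom = math.sqrt(r + eps)
    v = momentum * v + lr / denom * grad             # velocity (equals the plain step if momentum == 0)
    return w - v, r, g_avg, v


# Same starting point, with and without centering.
w_plain, _, _, _ = rmsprop_step(1.0, 2.0, 0.0, 0.0, 0.0)
w_centered, _, _, _ = rmsprop_step(1.0, 2.0, 0.0, 0.0, 0.0, centered=True)
```

Because centering subtracts the squared mean gradient from the denominator, the centered step is never smaller than the plain one for the same inputs.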

oneflow.experimental.optim.SGD(parameters: Union[Iterator[oneflow.python.nn.parameter.Parameter], List[Dict]], lr: float = 0.001, momentum: float = 0.0, scale: float = 1.0)

Implements the SGD algorithm.

This algorithm takes the gradient of a random sample as an approximate estimate of the overall gradient in mini-batch gradient descent.

When momentum = 0, the parameter update equation is:

$param_{new} = param_{old} - learning\_rate * grad$

With momentum $$\beta$$, the parameter update equations are:

\begin{align}\begin{aligned}& V_t = \beta * V_{t-1} + learning\_rate * grad\\& param_{new} = param_{old} - V_t\end{aligned}\end{align}
Parameters
• params (iterable) – iterable of parameters to optimize or dicts defining parameter groups

• lr (float, optional) – learning rate (default: 1e-3)

• momentum (float, optional) – Momentum factor (default: 0.0)

• scale (float, optional) – the scale factor of loss (default: 1.0)
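
Both SGD cases above fit in one short function, since the momentum form reduces to the plain update when the momentum factor is 0. The sketch below works on a scalar and uses the hypothetical name `sgd_step`; it illustrates the equations only.

```python
def sgd_step(param, grad, v, lr=1e-3, momentum=0.0):
    """One scalar SGD update following the documented equations (hypothetical helper)."""
    v = momentum * v + lr * grad   # V_t; equals lr * grad when momentum == 0
    return param - v, v


# First step without momentum, second step reusing the velocity with momentum.
p1, v1 = sgd_step(1.0, 2.0, 0.0, lr=0.1)
p2, v2 = sgd_step(p1, 2.0, v1, lr=0.1, momentum=0.9)
```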

oneflow.experimental.optim.lr_scheduler.CosineAnnealingLR(optimizer, steps: int, alpha: float = 0.0, last_step=- 1, verbose=False)

Creates a cosine-decayed learning rate scheduler.

Until the number of steps specified by the user is reached, the learning rate is updated as:

\begin{align}\begin{aligned}& cos\_decay = 0.5*(1+cos(\pi*\frac{current\_step}{steps}))\\& decay\_factor = (1-\alpha)*cos\_decay+\alpha\\& learning\_rate = base\_learning\_rate*decay\_factor\end{aligned}\end{align}

After the specified number of steps, the learning rate is:

$learning\_rate = {base\_learning\_rate}*{\alpha}$

It was proposed in SGDR: Stochastic Gradient Descent with Warm Restarts. Note that this only implements the cosine annealing part of SGDR, and not the restarts.

Parameters
• optimizer (Optimizer) – Wrapped optimizer.

• steps (int) – The decay steps in the scheduler.

• alpha (float, optional) – The learning rate scale factor ($$\alpha$$). (default: 0.0)

• last_step (int, optional) – The index of last step. (default: -1)

• verbose (bool, optional) – If True, prints a message to stdout for each update. (default: False)

For example:

    import oneflow.experimental as flow

    ...
    cosine_annealing_lr = flow.optim.lr_scheduler.CosineAnnealingLR(optimizer, steps=100, alpha=0.0)
    for epoch in range(num_epoch):
        train(...)
        cosine_annealing_lr.step()
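
To make the schedule concrete, the decay formulas for CosineAnnealingLR can be evaluated directly. The function below is a standalone sketch of those formulas, not the scheduler class itself; `cosine_lr_at` is a hypothetical name.

```python
import math


def cosine_lr_at(base_lr, current_step, steps, alpha=0.0):
    """Learning rate at `current_step` per the documented cosine decay (hypothetical helper)."""
    if current_step < steps:
        cos_decay = 0.5 * (1 + math.cos(math.pi * current_step / steps))
        decay_factor = (1 - alpha) * cos_decay + alpha
    else:
        # After the decay steps, the rate stays at base_lr * alpha.
        decay_factor = alpha
    return base_lr * decay_factor


# Sample the schedule at the start, the midpoint, and past the end.
schedule = [cosine_lr_at(0.1, step, steps=100) for step in (0, 50, 150)]
```

With `alpha=0.0` the rate starts at `base_lr`, reaches half of it at the midpoint, and is 0 once `steps` is passed.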

oneflow.experimental.optim.lr_scheduler.StepLR(optimizer, step_size: int, gamma: float = 0.1, last_step=- 1, verbose=False)

Decays the learning rate of each parameter group by gamma every step_size steps. Note that such decay can happen simultaneously with other changes to the learning rate from outside this scheduler. When last_step=-1, sets the initial lr as lr.

Parameters
• optimizer (Optimizer) – Wrapped optimizer.

• step_size (int) – Period of learning rate decay.

• gamma (float, optional) – Multiplicative factor of learning rate decay. (default: 0.1)

• last_step (int, optional) – The index of last step. (default: -1)

• verbose (bool, optional) – If True, prints a message to stdout for each update. (default: False)

For example:

    import oneflow.experimental as flow

    ...
    step_lr = flow.optim.lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.1)
    for epoch in range(num_epoch):
        train(...)
        step_lr.step()
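
Assuming nothing else modifies the learning rate, the StepLR rule (multiply by gamma every step_size steps) has a simple closed form. The helper below is an illustrative sketch with a hypothetical name, `step_lr_at`, and is not part of the OneFlow API.

```python
def step_lr_at(base_lr, step, step_size, gamma=0.1):
    """Learning rate at `step` when decayed by `gamma` every `step_size` steps (hypothetical helper)."""
    return base_lr * gamma ** (step // step_size)


# Rates just before and after each decay boundary, with step_size=30.
rates = [step_lr_at(0.1, step, step_size=30) for step in (0, 29, 30, 60)]
```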

oneflow.experimental.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda, last_step=- 1, verbose=False)

Sets the learning rate of each parameter group to the initial lr times the value of a given function. When last_step=-1, sets the initial lr as lr.

$learning\_rate = base\_learning\_rate*lambda(last\_step)$
Parameters
• optimizer (Optimizer) – Wrapped optimizer.

• lr_lambda (function or list) – A function which computes a multiplicative factor given an integer step index, or a list of such functions, one for each group in optimizer.param_groups.

• last_step (int, optional) – The index of last step. (default: -1)

• verbose (bool, optional) – If True, prints a message to stdout for each update. (default: False)

For example:

    import oneflow.experimental as flow

    ...
    lambda1 = lambda step: step // 30
    lambda2 = lambda step: 0.95 * step
    lambda_lr = flow.optim.lr_scheduler.LambdaLR(optimizer, [lambda1, lambda2])
    for epoch in range(num_epoch):
        train(...)
        lambda_lr.step()
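
The LambdaLR rule, one multiplicative factor per parameter group applied to that group's base learning rate, can be sketched without the scheduler class. The function name `lambda_lr_at` and the two example factor functions are assumptions for illustration only.

```python
def lambda_lr_at(base_lrs, lr_lambdas, step):
    """Per-group learning rates: base_lr * lambda(step) (hypothetical helper)."""
    return [base_lr * fn(step) for base_lr, fn in zip(base_lrs, lr_lambdas)]


# Two parameter groups with different (illustrative) factor functions.
constant_half = lambda step: 0.5
exponential = lambda step: 0.95 ** step
lrs = lambda_lr_at([0.1, 0.01], [constant_half, exponential], step=2)
```

Each function receives the same step index, so groups can follow different schedules while staying in sync.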