oneflow.optim

Optimizers


class oneflow.optim.Adagrad(params: Union[Iterator[oneflow.nn.Parameter], List[Dict]], lr: float = 0.001, lr_decay: float = 0.0, weight_decay: float = 0, initial_accumulator_value: float = 0.0, eps: float = 1e-10)

Implements Adagrad Optimizer.

The formula is:

\[ \begin{align}\begin{aligned}& S_{t} = S_{t-1} + grad \odot grad\\& decay\_lr = \frac{learning\_rate}{(1 + (train\_step - 1) * lr\_decay)}\\& X_{t} = X_{t-1} - \frac{decay\_lr}{\sqrt{S_{t} + \epsilon}} \odot grad\end{aligned}\end{align} \]
Parameters
  • params (Union[Iterator[Parameter], List[Dict]]) – iterable of parameters to optimize or dicts defining parameter groups

  • lr (float, optional) – The learning rate. Defaults to 0.001.

  • lr_decay (float, optional) – The decay factor of learning rate. Defaults to 0.0.

  • weight_decay (float, optional) – The weight decay. Defaults to 0.

  • initial_accumulator_value (float, optional) – The initial value of S. Defaults to 0.0.

  • eps (float, optional) – A small constant term added to the denominator to improve numerical stability. Defaults to 1e-10.

For example:

Example 1:

# Assume net is a custom model.
adagrad = flow.optim.Adagrad(net.parameters(), lr=1e-3)

for epoch in range(epochs):
    # Read data, Compute the loss and so on.
    # ...
    loss.backward()
    adagrad.step()
    adagrad.zero_grad()

Example 2:

# Assume net is a custom model.
adagrad = flow.optim.Adagrad(
    [
        {
            "params": net.parameters(),
            "lr": learning_rate,
            "clip_grad_max_norm": 0.5,
            "clip_grad_norm_type": 2.0,
        }
    ],
)

for epoch in range(epochs):
    # Read data, Compute the loss and so on.
    # ...
    loss.backward()
    adagrad.clip_grad()
    adagrad.step()
    adagrad.zero_grad()

If you want to use clip_grad, you can refer to this example.

For more details of clip_grad_max_norm and clip_grad_norm_type, you can refer to oneflow.nn.utils.clip_grad_norm_().

step(closure: Optional[Callable] = None)

Performs a single optimization step.

Parameters

closure (callable, optional) – A closure that reevaluates the model and returns the loss.

class oneflow.optim.Adam(params: Union[Iterator[oneflow.nn.Parameter], List[Dict]], lr: float = 0.001, betas: Tuple[float, float] = (0.9, 0.999), eps: float = 1e-08, weight_decay: float = 0, amsgrad: bool = False, do_bias_correction: bool = True)

Implements Adam algorithm.

It has been proposed in Adam: A Method for Stochastic Optimization. The implementation of the L2 penalty follows changes proposed in Decoupled Weight Decay Regularization.

This algorithm can adjust the learning rate of each parameter dynamically according to the 1st-moment and 2nd-moment estimates of the gradient.

The equation of parameters updating is:

\[ \begin{align}\begin{aligned}& V_t = \beta_1*V_{t-1} + (1-\beta_1)*grad\\& S_t = \beta_2*S_{t-1} + (1-\beta_2)*{grad} \odot {grad}\\& \hat{g} = learning\_rate*\frac{{V_t}}{\sqrt{{S_t}}+\epsilon}\\& param_{new} = param_{old} - \hat{g}\end{aligned}\end{align} \]
Parameters
  • params (iterable) – iterable of parameters to optimize or dicts defining parameter groups

  • lr (float, optional) – learning rate (default: 1e-3)

  • betas (Tuple[float, float], optional) – coefficients used for computing running averages of gradient and its square (default: (0.9, 0.999))

  • eps (float, optional) – term added to the denominator to improve numerical stability (default: 1e-8)

  • weight_decay (float, optional) – weight decay (L2 penalty) (default: 0)

  • amsgrad (bool, optional) – whether to use the AMSGrad variant of this algorithm. (default: False)

  • do_bias_correction (bool, optional) – Whether to do bias correction (default: True)

For example:

Example 1:

# Assume net is a custom model.
adam = flow.optim.Adam(net.parameters(), lr=1e-3)

for epoch in range(epochs):
    # Read data, Compute the loss and so on.
    # ...
    loss.backward()
    adam.step()
    adam.zero_grad()

Example 2:

# Assume net is a custom model.
adam = flow.optim.Adam(
    [
        {
            "params": net.parameters(),
            "lr": learning_rate,
            "clip_grad_max_norm": 0.5,
            "clip_grad_norm_type": 2.0,
        }
    ],
)

for epoch in range(epochs):
    # Read data, Compute the loss and so on.
    # ...
    loss.backward()
    adam.clip_grad()
    adam.step()
    adam.zero_grad()

If you want to use clip_grad, you can refer to this example.

For more details of clip_grad_max_norm and clip_grad_norm_type, you can refer to oneflow.nn.utils.clip_grad_norm_().

step(closure: Optional[Callable] = None)

Performs a single optimization step.

Parameters

closure (callable, optional) – A closure that reevaluates the model and returns the loss.

property support_sparse

Whether the Optimizer supports sparse update.

class oneflow.optim.AdamW(params: Union[Iterator[oneflow.nn.Parameter], List[Dict]], lr: float = 0.001, betas: Tuple[float, float] = (0.9, 0.999), eps: float = 1e-08, weight_decay: float = 0, amsgrad: bool = False, do_bias_correction: bool = True)

Implements AdamW algorithm.

The original Adam algorithm was proposed in Adam: A Method for Stochastic Optimization. The AdamW variant was proposed in Decoupled Weight Decay Regularization.

The optimizer of the Adam-weight-decay algorithm.

In Adam, implementing weight decay as an L2 penalty couples the decay with the adaptive learning rate, which weakens its regularizing effect. The Adam-weight-decay (AdamW) algorithm solves this problem by decoupling the weight decay term \(\lambda * param_{old}\) from the gradient-based update and applying it directly to the parameters (for more details, refer to Decoupled Weight Decay Regularization).

The equation of parameters updating is:

\[ \begin{align}\begin{aligned}& V_t = \beta_1*V_{t-1} + (1-\beta_1)*grad\\& S_t = \beta_2*S_{t-1} + (1-\beta_2)*{grad} \odot {grad}\\& \hat{g} = learning\_rate*(\frac{{V_t}}{\sqrt{{S_t}}+\epsilon}+\lambda*param_{old})\\& param_{new} = param_{old} - \hat{g}\end{aligned}\end{align} \]
Parameters
  • params (iterable) – iterable of parameters to optimize or dicts defining parameter groups

  • lr (float, optional) – learning rate (default: 1e-3)

  • betas (Tuple[float, float], optional) – coefficients used for computing running averages of gradient and its square (default: (0.9, 0.999))

  • eps (float, optional) – term added to the denominator to improve numerical stability (default: 1e-8)

  • weight_decay (float, optional) – weight decay (L2 penalty) (In the equation is λ, default: 0)

  • amsgrad (bool, optional) – whether to use the AMSGrad variant of this algorithm. (default: False)

  • do_bias_correction (bool, optional) – Whether to do bias correction (default: True)

For example:

Example 1:

# Assume net is a custom model.
adamw = flow.optim.AdamW(net.parameters(), lr=1e-3)

for epoch in range(epochs):
    # Read data, Compute the loss and so on.
    # ...
    loss.backward()
    adamw.step()
    adamw.zero_grad()

Example 2:

# Assume net is a custom model.
adamw = flow.optim.AdamW(
    [
        {
            "params": net.parameters(),
            "lr": learning_rate,
            "clip_grad_max_norm": 0.5,
            "clip_grad_norm_type": 2.0,
        }
    ],
)

for epoch in range(epochs):
    # Read data, Compute the loss and so on.
    # ...
    loss.backward()
    adamw.clip_grad()
    adamw.step()
    adamw.zero_grad()

If you want to use clip_grad, you can refer to this example.

For more details of clip_grad_max_norm and clip_grad_norm_type, you can refer to oneflow.nn.utils.clip_grad_norm_().

step(closure: Optional[Callable] = None)

Performs a single optimization step.

Parameters

closure (callable, optional) – A closure that reevaluates the model and returns the loss.

property support_sparse

Whether the AdamW Optimizer supports sparse update.

class oneflow.optim.LAMB(params: Union[Iterator[oneflow.nn.Parameter], List[Dict]], lr: float = 0.001, betas: Tuple[float, float] = (0.9, 0.999), eps: float = 1e-08, weight_decay: float = 0, adam_w_mode: bool = True, do_bias_correction: bool = True, amsgrad: bool = False)

Implements LAMB algorithm.

LAMB was proposed in Large Batch Optimization for Deep Learning: Training BERT in 76 minutes.

The equation of parameters updating is:

\[ \begin{align}\begin{aligned}& V_t = \beta_1*V_{t-1} + (1-\beta_1)*grad\\& S_t = \beta_2*S_{t-1} + (1-\beta_2)*{grad} \odot {grad}\\& \hat{u} = \frac{{V_t}}{\sqrt{{S_t}}+\epsilon}\\& \hat{r} = learning\_rate * \frac{||param_{old}||_2}{||\hat{u}||_2}\\& param_{new} = param_{old} - \hat{r} * \hat{u}\end{aligned}\end{align} \]
Parameters
  • params (iterable) – iterable of parameters to optimize or dicts defining parameter groups

  • lr (float, optional) – learning rate (default: 1e-3)

  • betas (Tuple[float, float], optional) – coefficients used for computing running averages of gradient and its square (default: (0.9, 0.999))

  • eps (float, optional) – term added to the denominator to improve numerical stability (default: 1e-8)

  • weight_decay (float, optional) – weight decay (L2 penalty) (default: 0)

  • adam_w_mode (bool, optional) – whether to use decoupled weight decay (as in AdamW) instead of L2 regularization; True applies decoupled weight decay (default: True)

  • do_bias_correction (bool, optional) – whether to do bias correction (default: True)

  • amsgrad (bool, optional) – whether to use the AMSGrad variant of this algorithm (NOT supported now; default: False)

For example:

Example 1:

# Assume net is a custom model.
lamb = flow.optim.LAMB(net.parameters(), lr=1e-3)

for epoch in range(epochs):
    # Read data, Compute the loss and so on.
    # ...
    loss.backward()
    lamb.step()
    lamb.zero_grad()

Example 2:

# Assume net is a custom model.
lamb = flow.optim.LAMB(
    [
        {
            "params": net.parameters(),
            "lr": learning_rate,
            "clip_grad_max_norm": 0.5,
            "clip_grad_norm_type": 2.0,
        }
    ],
)

for epoch in range(epochs):
    # Read data, Compute the loss and so on.
    # ...
    loss.backward()
    lamb.clip_grad()
    lamb.step()
    lamb.zero_grad()

If you want to use clip_grad, you can refer to this example.

For more details of clip_grad_max_norm and clip_grad_norm_type, you can refer to oneflow.nn.utils.clip_grad_norm_().

step(closure: Optional[Callable] = None)

Performs a single optimization step.

Parameters

closure (callable, optional) – A closure that reevaluates the model and returns the loss.

class oneflow.optim.Optimizer(parameters, options)
add_param_group(param_group) → None

Add a param group to the Optimizer's param_groups. This can be useful when fine-tuning a pre-trained network, as frozen layers can be made trainable and added to the Optimizer as training progresses.

Parameters

param_group (dict) – Specifies what Tensors should be optimized along with group specific optimization options.

Example:

>>> import oneflow
>>> import oneflow.optim as optim
>>> w1 = oneflow.ones(3, 3)
>>> w1.requires_grad = True
>>> w2 = oneflow.ones(3, 3)
>>> w2.requires_grad = True
>>> o = optim.SGD([w1])
>>> o.param_groups[0]
{'lr': 0.001, 'momentum': 0.0, 'dampening': 0.0, 'weight_decay': 0.0, 'nesterov': False, 'maximize': False, 'params': [tensor([[1., 1., 1.],
        [1., 1., 1.],
        [1., 1., 1.]], dtype=oneflow.float32, requires_grad=True)]}
>>> o.add_param_group({'params': w2})
>>> o.param_groups[1]
{'lr': 0.001, 'momentum': 0.0, 'dampening': 0.0, 'weight_decay': 0.0, 'nesterov': False, 'maximize': False, 'params': [tensor([[1., 1., 1.],
        [1., 1., 1.],
        [1., 1., 1.]], dtype=oneflow.float32, requires_grad=True)]}
clip_grad()

Clips gradient norm of an iterable of parameters. The norm is computed over all gradients together, as if they were concatenated into a single vector.

You can set the max_norm and norm_type.

For more details, you can refer to the documentation of each optimizer (like Adam, SGD, and so on).

You can also refer to the code in oneflow.nn.utils.clip_grad_norm_().

load_state_dict(state_dict) → None

Load the state of the optimizer that was created by the state_dict() function.

It is mostly copied from: https://pytorch.org/docs/1.10/_modules/torch/optim/optimizer.html#Optimizer.load_state_dict.

state_dict()

Returns the state of the optimizer as a dict.

It contains two entries:

  • state - a dict holding current optimization state. Its content differs between optimizer classes.

  • param_groups - a dict containing all parameter groups.

It is mostly copied from: https://pytorch.org/docs/1.10/_modules/torch/optim/optimizer.html#Optimizer.state_dict.
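
For example, a minimal sketch of saving and restoring the optimizer state (the model, the training loop, and the flow.save / flow.load serialization calls are assumptions, not part of this API):

import oneflow as flow

# Assume model is a custom model that has already been trained for a while.
optimizer = flow.optim.SGD(model.parameters(), lr=1e-3)
# ... train ...
flow.save(optimizer.state_dict(), "optimizer_state")

# Later: rebuild an optimizer over the same parameters and restore its state.
optimizer = flow.optim.SGD(model.parameters(), lr=1e-3)
optimizer.load_state_dict(flow.load("optimizer_state"))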

step(closure: Optional[Callable] = None) → Optional[oneflow.Tensor]

Performs a single optimization step (parameter update).

Parameters

closure (Union[Callable, None], optional) – A closure that reevaluates the model and returns the loss. Optional for most optimizers.

Returns

The loss.

Return type

Union[Tensor, None]
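
For example, a minimal sketch of passing a closure to step() (net, loss_fn, input, and target are assumed to be defined elsewhere; the closure recomputes the loss, runs the backward pass, and returns the loss, which step() then returns):

import oneflow as flow

sgd = flow.optim.SGD(net.parameters(), lr=1e-3)

def closure():
    # Reevaluate the model and return the loss.
    sgd.zero_grad()
    output = net(input)
    loss = loss_fn(output, target)
    loss.backward()
    return loss

loss = sgd.step(closure)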

property support_sparse

Whether the Optimizer supports sparse update.

zero_grad(set_to_none: bool = False)

Sets the gradients of all optimized oneflow.Tensors to zero.

Parameters

set_to_none (bool) – instead of setting to zero, set the grads to None. This will in general have lower memory footprint, and can modestly improve performance. However, it changes certain behaviors.

For example:

1. When the user tries to access a gradient and perform manual ops on it, a None attribute or a Tensor full of 0s will behave differently.

2. If the user requests zero_grad(set_to_none=True) followed by a backward pass, grads are guaranteed to be None for params that did not receive a gradient.

3. Optimizers have a different behavior if the gradient is 0 or None (in one case it does the step with a gradient of 0 and in the other it skips the step altogether).
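
For example, a minimal sketch contrasting the two modes (the tensor w is only illustrative):

import oneflow as flow

w = flow.ones(2, 2)
w.requires_grad = True
sgd = flow.optim.SGD([w], lr=0.1)

w.sum().backward()
sgd.zero_grad()                  # w.grad is now a tensor of zeros
print(w.grad)

w.sum().backward()
sgd.zero_grad(set_to_none=True)  # w.grad is now None
print(w.grad)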

class oneflow.optim.RMSprop(params: Union[Iterator[oneflow.nn.Parameter], List[Dict]], lr: float = 0.001, alpha: float = 0.99, eps: float = 1e-08, weight_decay: float = 0, momentum: float = 0.0, centered: bool = False)

Implements RMSprop algorithm.

Root Mean Squared Propagation (RMSProp) is an unpublished, adaptive learning rate method. RMSProp was originally proposed on slide 29 of http://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf .

The original equation is as follows:

\[ \begin{align}\begin{aligned}& r(w, t) = \alpha r(w, t-1) + (1 - \alpha)(\nabla Q_{i}(w))^2\\& w = w - \frac{\eta}{\sqrt{r(w,t) + \epsilon}} \nabla Q_{i}(w)\end{aligned}\end{align} \]

The first equation calculates the moving average of the squared gradient for each weight, and the gradient is then divided by \(\sqrt{r(w,t)}\). In some cases, adding a momentum term \(\beta\) is beneficial. In our implementation, Nesterov momentum is used:

\[ \begin{align}\begin{aligned}& r(w, t) = \alpha r(w, t-1) + (1 - \alpha)(\nabla Q_{i}(w))^2\\& v(w, t) = \beta v(w, t-1) + \frac{\eta}{\sqrt{r(w,t) + \epsilon}} \nabla Q_{i}(w)\\& w = w - v(w, t)\end{aligned}\end{align} \]

if centered is True:

\[ \begin{align}\begin{aligned}& r(w, t) = \alpha r(w, t-1) + (1 - \alpha)(\nabla Q_{i}(w))^2\\& g(w, t) = \alpha g(w, t-1) + (1 - \alpha)\nabla Q_{i}(w)\\& v(w, t) = \beta v(w, t-1) + \frac{\eta}{\sqrt{r(w,t) - (g(w, t))^2 + \epsilon}} \nabla Q_{i}(w)\\& w = w - v(w, t)\end{aligned}\end{align} \]

where \(\alpha\) is a hyperparameter with typical values such as 0.99 or 0.95, \(\beta\) is the momentum term, and \(\epsilon\) is a smoothing term to avoid division by zero, usually set somewhere in the range from 1e-4 to 1e-8.

Parameters
  • params (iterable) – iterable of parameters to optimize or dicts defining parameter groups

  • lr (float, optional) – learning rate (default: 1e-3)

  • momentum (float, optional) – momentum factor (default: 0.0; OneFlow does not support momentum > 0 yet)

  • alpha (float, optional) – smoothing constant (default: 0.99)

  • eps (float, optional) – term added to the denominator to improve numerical stability (default: 1e-8)

  • centered (bool, optional) – if True, compute the centered RMSProp, in which the gradient is normalized by an estimation of its variance

  • weight_decay (float, optional) – weight decay (L2 penalty) (default: 0)

For example:

Example 1:

# Assume net is a custom model.
rmsprop = flow.optim.RMSprop(net.parameters(), lr=1e-3)

for epoch in range(epochs):
    # Read data, Compute the loss and so on.
    # ...
    loss.backward()
    rmsprop.step()
    rmsprop.zero_grad()

Example 2:

# Assume net is a custom model.
rmsprop = flow.optim.RMSprop(
    [
        {
            "params": net.parameters(),
            "lr": learning_rate,
            "clip_grad_max_norm": 0.5,
            "clip_grad_norm_type": 2.0,
        }
    ],
)

for epoch in range(epochs):
    # Read data, Compute the loss and so on.
    # ...
    loss.backward()
    rmsprop.clip_grad()
    rmsprop.step()
    rmsprop.zero_grad()

If you want to use clip_grad, you can refer to this example.

For more details of clip_grad_max_norm and clip_grad_norm_type, you can refer to oneflow.nn.utils.clip_grad_norm_().

step(closure: Optional[Callable] = None)

Performs a single optimization step.

Parameters

closure (callable, optional) – A closure that reevaluates the model and returns the loss.

class oneflow.optim.SGD(params: Union[Iterator[oneflow.nn.Parameter], List[Dict]], lr: float = 0.001, momentum: float = 0.0, dampening: float = 0.0, weight_decay: float = 0.0, nesterov: bool = False, maximize: bool = False)

Implements SGD algorithm.

This algorithm takes the gradient of a random sample as an approximate estimate of the overall gradient, as in mini-batch gradient descent.

When the momentum = 0, the equation of parameters updating is:

\[param_{new} = param_{old} - learning\_rate * grad\]

With momentum, the equation of parameters updating is:

\[ \begin{align}\begin{aligned}& V_t = \beta * V_{t-1} - learning\_rate * (g_t + param_{old} * weight\_decay)\\& param_{new} = param_{old} + V_t\end{aligned}\end{align} \]
Parameters
  • params (iterable) – iterable of parameters to optimize or dicts defining parameter groups

  • lr (float, optional) – learning rate (default: 1e-3)

  • momentum (float, optional) – Momentum factor (default: 0.0)

  • weight_decay (float, optional) – weight decay (L2 penalty) (default: 0.0)

For example:

Example 1:

# Assume net is a custom model.
sgd = flow.optim.SGD(net.parameters(), lr=1e-3)

for epoch in range(epochs):
    # Read data, Compute the loss and so on.
    # ...
    loss.backward()
    sgd.step()
    sgd.zero_grad()

Example 2:

# Assume net is a custom model.
sgd = flow.optim.SGD(
    [
        {
            "params": net.parameters(),
            "lr": learning_rate,
            "clip_grad_max_norm": 0.5,
            "clip_grad_norm_type": 2.0,
        }
    ],
)

for epoch in range(epochs):
    # Read data, Compute the loss and so on.
    # ...
    loss.backward()
    sgd.clip_grad()
    sgd.step()
    sgd.zero_grad()

If you want to use clip_grad, you can refer to this example.

For more details of clip_grad_max_norm and clip_grad_norm_type, you can refer to oneflow.nn.utils.clip_grad_norm_().

step(closure: Optional[Callable] = None)

Performs a single optimization step.

Parameters

closure (callable, optional) – A closure that reevaluates the model and returns the loss.

property support_sparse

Whether the SGD Optimizer supports sparse update.


class oneflow.optim.lr_scheduler.CosineAnnealingLR(optimizer: oneflow.nn.optimizer.optimizer.Optimizer, T_max: int, eta_min: float = 0.0, last_step: int = - 1, verbose: bool = False)

The documentation is referenced from: https://pytorch.org/docs/1.10/generated/torch.optim.lr_scheduler.CosineAnnealingLR.html.

Set the learning rate of each parameter group using a cosine annealing schedule, where \(\eta_{max}\) is set to the initial lr and \(T_{cur}\) is the number of epochs since the last restart in SGDR:

\[\begin{split}\begin{aligned} \eta_t & = \eta_{min} + \frac{1}{2}(\eta_{max} - \eta_{min})\left(1 + \cos\left(\frac{T_{cur}}{T_{max}}\pi\right)\right), & T_{cur} \neq (2k+1)T_{max}; \\ \eta_{t+1} & = \eta_{t} + \frac{1}{2}(\eta_{max} - \eta_{min}) \left(1 - \cos\left(\frac{1}{T_{max}}\pi\right)\right), & T_{cur} = (2k+1)T_{max}. \end{aligned}\end{split}\]

When last_step=-1, sets initial lr as lr. Notice that because the schedule is defined recursively, the learning rate can be simultaneously modified outside this scheduler by other operators. If the learning rate is set solely by this scheduler, the learning rate at each step becomes:

\[\eta_t = \eta_{min} + \frac{1}{2}(\eta_{max} - \eta_{min})\left(1 + \cos\left(\frac{T_{cur}}{T_{max}}\pi\right)\right)\]

It has been proposed in SGDR: Stochastic Gradient Descent with Warm Restarts. Note that this only implements the cosine annealing part of SGDR, and not the restarts.

Parameters
  • optimizer (Optimizer) – Wrapped optimizer.

  • T_max (int) – Maximum number of iterations.

  • eta_min (float) – Minimum learning rate. Default: 0.

  • last_step (int) – The index of the last step. Default: -1.

  • verbose (bool) – If True, prints a message to stdout for each update. Default: False.
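
For example (a minimal usage sketch in the same style as the other schedulers in this module; the wrapped optimizer, num_epoch, and train(...) are assumed to be defined elsewhere, and T_max=100 is an arbitrary value):

import oneflow as flow

...
cosine_annealing_lr = flow.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=100, eta_min=0.0)
for epoch in range(num_epoch):
    train(...)
    cosine_annealing_lr.step()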

get_lr(base_lr, step)

Compute learning rate using chainable form of the scheduler

class oneflow.optim.lr_scheduler.CosineDecayLR(optimizer: oneflow.nn.optimizer.optimizer.Optimizer, decay_steps: int, alpha: float = 0.0, last_step: int = - 1, verbose: bool = False)

This operator creates a Cosine decayed learning rate scheduler.

Before the current step reaches the decay_steps specified by the user, the learning rate will be updated as:

\[ \begin{align}\begin{aligned}& cos\_decay = 0.5*(1+cos(\pi*\frac{current\_step}{decay\_steps}))\\& decay\_factor = (1-\alpha)*cos\_decay+\alpha\\& learning\_rate = base\_learning\_rate*decay\_factor\end{aligned}\end{align} \]

After the decay_steps specified by the user, the learning rate will be:

\[learning\_rate = {base\_learning\_rate}*{\alpha}\]

It has been proposed in SGDR: Stochastic Gradient Descent with Warm Restarts. Note that this only implements the cosine annealing part of SGDR, and not the restarts.

Parameters
  • optimizer (Optimizer) – Wrapped optimizer.

  • decay_steps (int) – The decay steps in the scheduler.

  • alpha (float, optional) – The learning rate scale factor (\(\alpha\)). (default: 0.0)

  • last_step (int, optional) – The index of last step. (default: -1)

  • verbose (bool, optional) – If True, prints a message to stdout for each update. (default: False)

For example:

import oneflow as flow

...
cosine_decay_lr = flow.optim.lr_scheduler.CosineDecayLR(optimizer, decay_steps=100, alpha=0.0)
for epoch in range(num_epoch):
    train(...)
    cosine_decay_lr.step()
get_lr(base_lr, step)

Compute learning rate using chainable form of the scheduler

class oneflow.optim.lr_scheduler.ExponentialLR(optimizer: oneflow.nn.optimizer.optimizer.Optimizer, gamma: float, last_step: int = - 1, verbose: bool = False)

Decays the learning rate of each parameter group by gamma every epoch. When last_step=-1, sets initial lr as lr.

Parameters
  • optimizer (Optimizer) – Wrapped optimizer.

  • gamma (float) – Multiplicative factor of learning rate decay.

  • last_step (int) – The index of last step. Default: -1.

  • verbose (bool) – If True, prints a message to stdout for each update. Default: False.
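
For example (a minimal usage sketch following the pattern of the other schedulers in this module; optimizer, num_epoch, and train(...) are assumed to be defined elsewhere, and gamma=0.9 is an arbitrary value):

import oneflow as flow

...
exponential_lr = flow.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.9)
for epoch in range(num_epoch):
    train(...)
    exponential_lr.step()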

get_lr(base_lr, step)

Compute learning rate using chainable form of the scheduler

class oneflow.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda, last_step=- 1, verbose=False)

Sets the learning rate of each parameter group to the initial lr times a given function. When last_step=-1, sets initial lr as lr.

\[learning\_rate = base\_learning\_rate*lambda(last\_step)\]
Parameters
  • optimizer (Optimizer) – Wrapped optimizer.

  • lr_lambda (function or list) – A function which computes a multiplicative factor given an integer parameter epoch, or a list of such functions, one for each group in optimizer.param_groups.

  • last_step (int, optional) – The index of last step. (default: -1)

  • verbose (bool, optional) – If True, prints a message to stdout for each update. (default: False)

For example:

import oneflow as flow

...
lambda1 = lambda step: step // 30
lambda2 = lambda step: 0.95 * step
lambda_lr = flow.optim.lr_scheduler.LambdaLR(optimizer, [lambda1, lambda2])
for epoch in range(num_epoch):
    train(...)
    lambda_lr.step()
load_state_dict(state_dict)

Loads the scheduler's state.

Parameters

state_dict (dict) – scheduler state. Should be an object returned from a call to state_dict().

state_dict()

Returns the state of the scheduler as a dict.

It contains an entry for every variable in self.__dict__ which is not the optimizer. The learning rate lambda functions will only be saved if they are callable objects and not if they are functions or lambdas.
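
For example, a minimal sketch of saving and restoring the scheduler state (lambda_lr and optimizer are assumed from the example above; since plain lambdas are not saved, the scheduler must be recreated with the same lr_lambda before loading):

# Save the scheduler state together with other training state.
scheduler_state = lambda_lr.state_dict()

# ... later, after recreating the scheduler with the same lr_lambda ...
lambda_lr.load_state_dict(scheduler_state)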

step()

Performs a single learning rate schedule step.

class oneflow.optim.lr_scheduler.MultiStepLR(optimizer: oneflow.nn.optimizer.optimizer.Optimizer, milestones: list, gamma: float = 0.1, last_step: int = - 1, verbose: bool = False)

Decays the learning rate of each parameter group by gamma once the number of steps reaches one of the milestones. Notice that such decay can happen simultaneously with other changes to the learning rate from outside this scheduler. When last_step=-1, sets initial lr as lr.

Parameters
  • optimizer (Optimizer) – Wrapped optimizer.

  • milestones (list) – List of step indices. Must be increasing.

  • gamma (float, optional) – Multiplicative factor of learning rate decay. (default: 0.1)

  • last_step (int, optional) – The index of last step. (default: -1)

  • verbose (bool, optional) – If True, prints a message to stdout for each update. (default: False)

For example:

import oneflow as flow

...
multistep_lr = flow.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[30,80], gamma=0.1)
for epoch in range(num_epoch):
    train(...)
    multistep_lr.step()
get_lr(base_lr, step)

Compute learning rate using chainable form of the scheduler

class oneflow.optim.lr_scheduler.PolynomialLR(optimizer, decay_batch: int, end_learning_rate: float = 0.0001, power: float = 1.0, cycle: bool = False, last_step: int = - 1, verbose: bool = False)

This operator creates a polynomial decayed learning rate scheduler. The learning rate will be updated as follows:

If cycle is True, the equation is:

\[\begin{split}\begin{aligned} & decay\_batch = decay\_batch*ceil(\frac{current\_batch}{decay\_batch}) \\ & learning\_rate = (base\_lr-end\_lr)*(1-\frac{current\_batch}{decay\_batch})^{power}+end\_lr \end{aligned}\end{split}\]

If cycle is False, the equation is:

\[\begin{split}\begin{aligned} & current\_batch = min(decay\_batch, current\_batch) \\ & learning\_rate = (base\_lr-end\_lr)*(1-\frac{current\_batch}{decay\_batch})^{power}+end\_lr \end{aligned}\end{split}\]
Parameters
  • optimizer (Optimizer) – Wrapped optimizer.

  • decay_batch (int) – The decay steps.

  • end_learning_rate (float, optional) – The final learning rate. Defaults to 0.0001.

  • power (float, optional) – The power of polynomial. Defaults to 1.0.

  • cycle (bool, optional) – If True, decay_batch is extended in whole multiples of itself so that the learning rate keeps decaying beyond decay_batch steps instead of staying at end_learning_rate. Defaults to False.

For example:

import oneflow as flow

...
polynomial_scheduler = flow.optim.lr_scheduler.PolynomialLR(
    optimizer, decay_batch=5, end_learning_rate=0.00001, power=2
    )

for epoch in range(num_epoch):
    train(...)
    polynomial_scheduler.step()
get_lr(base_lr, step)

Compute learning rate using chainable form of the scheduler

class oneflow.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode='min', factor=0.1, patience=10, threshold=0.0001, threshold_mode='rel', cooldown=0, min_lr=0, eps=1e-08, verbose=False)

Reduce learning rate when a metric has stopped improving. Models often benefit from reducing the learning rate by a factor of 2-10 once learning stagnates. This scheduler reads a metrics quantity and if no improvement is seen for a ‘patience’ number of epochs, the learning rate is reduced.

Parameters
  • optimizer (Optimizer) – Wrapped optimizer.

  • mode (str) – One of min, max. In min mode, lr will be reduced when the quantity monitored has stopped decreasing; in max mode it will be reduced when the quantity monitored has stopped increasing. Default: ‘min’.

  • factor (float) – Factor by which the learning rate will be reduced. new_lr = lr * factor. Default: 0.1.

  • patience (int) – Number of epochs with no improvement after which learning rate will be reduced. For example, if patience = 2, then we will ignore the first 2 epochs with no improvement, and will only decrease the LR after the 3rd epoch if the loss still hasn’t improved then. Default: 10.

  • threshold (float) – Threshold for measuring the new optimum, to only focus on significant changes. Default: 1e-4.

  • threshold_mode (str) – One of rel, abs. In rel mode, dynamic_threshold = best * ( 1 + threshold ) in ‘max’ mode or best * ( 1 - threshold ) in min mode. In abs mode, dynamic_threshold = best + threshold in max mode or best - threshold in min mode. Default: ‘rel’.

  • cooldown (int) – Number of epochs to wait before resuming normal operation after lr has been reduced. Default: 0.

  • min_lr (float or list) – A scalar or a list of scalars. A lower bound on the learning rate of all param groups or each group respectively. Default: 0.

  • eps (float) – Minimal decay applied to lr. If the difference between new and old lr is smaller than eps, the update is ignored. Default: 1e-8.

  • verbose (bool) – If True, prints a message to stdout for each update. Default: False.

For example:

optimizer = flow.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
scheduler = flow.optim.lr_scheduler.ReduceLROnPlateau(optimizer, 'min')
for epoch in range(10):
    train(...)
    val_loss = validate(...)
    # Note that step should be called after validate()
    scheduler.step(val_loss)
property in_cooldown

Whether the learning rate scheduler is in the cooldown phase.

is_better(a, best)

Whether the metric a is an improvement over the current best.

load_state_dict(state_dict)

Loads the scheduler's state.

Parameters

state_dict (dict) – scheduler state. Should be an object returned from a call to state_dict().

state_dict()

Returns the state of the scheduler as a dict.

It contains an entry for every variable in self.__dict__ which is not the optimizer.

step(metrics)

Performs a single learning rate schedule step.

Parameters

metrics (float) – a metric quantity measuring the effect of model training (e.g. validation loss).

class oneflow.optim.lr_scheduler.StepLR(optimizer: oneflow.nn.optimizer.optimizer.Optimizer, step_size: int, gamma: float = 0.1, last_step: int = - 1, verbose: bool = False)

Decays the learning rate of each parameter group by gamma every step_size steps. Notice that such decay can happen simultaneously with other changes to the learning rate from outside this scheduler. When last_step=-1, sets initial lr as lr.

Parameters
  • optimizer (Optimizer) – Wrapped optimizer.

  • step_size (int) – Period of learning rate decay.

  • gamma (float, optional) – Multiplicative factor of learning rate decay. (default: 0.1)

  • last_step (int, optional) – The index of last step. (default: -1)

  • verbose (bool, optional) – If True, prints a message to stdout for each update. (default: False)

For example:

import oneflow as flow

...
step_lr = flow.optim.lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.1)
for epoch in range(num_epoch):
    train(...)
    step_lr.step()
get_lr(base_lr, step)

Compute learning rate using chainable form of the scheduler