oneflow.optim

Optimizers

Copyright 2020 The OneFlow Authors. All rights reserved.

Licensed under the Apache License, Version 2.0 (the “License”); you may not use this file except in compliance with the License. You may obtain a copy of the License at

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an “AS IS” BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

class oneflow.optim.Adam(parameters: Union[Iterator[oneflow._oneflow_internal.nn.Parameter], List[Dict]], lr: float = 0.001, betas: Tuple[float, float] = (0.9, 0.999), eps: float = 1e-08, weight_decay: float = 0, amsgrad: bool = False, do_bias_correction: bool = True)

Implements Adam algorithm.

It has been proposed in Adam: A Method for Stochastic Optimization. The implementation of the L2 penalty follows changes proposed in Decoupled Weight Decay Regularization.

This algorithm dynamically adjusts the learning rate of each parameter according to the 1st-moment and 2nd-moment estimates of the gradient.

The parameter update equations are:

\[ \begin{align}\begin{aligned}& V_t = \beta_1*V_{t-1} + (1-\beta_1)*grad\\& S_t = \beta_2*S_{t-1} + (1-\beta_2)*{grad} \odot {grad}\\& \hat{g} = learning\_rate*\frac{{V_t}}{\sqrt{{S_t}}+\epsilon}\\& param_{new} = param_{old} - \hat{g}\end{aligned}\end{align} \]
Parameters
  • parameters (iterable) – iterable of parameters to optimize or dicts defining parameter groups

  • lr (float, optional) – learning rate (default: 1e-3)

  • betas (Tuple[float, float], optional) – coefficients used for computing running averages of gradient and its square (default: (0.9, 0.999))

  • eps (float, optional) – term added to the denominator to improve numerical stability (default: 1e-8)

  • weight_decay (float, optional) – weight decay (L2 penalty) (default: 0)

  • amsgrad (bool, optional) – whether to use the AMSGrad variant of this algorithm. (default: False)

  • do_bias_correction (bool, optional) – whether to apply bias correction (default: True)
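The update equations above can be sketched for a single scalar parameter in plain Python. This is only an illustrative sketch, not OneFlow's implementation: the function name adam_step is hypothetical, and bias correction and AMSGrad are left out because the equations above do not include them.

```python
import math


def adam_step(param, grad, v, s, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """Apply one scalar update following the equations above.

    Hypothetical helper: bias correction and AMSGrad are not modelled.
    """
    v = beta1 * v + (1 - beta1) * grad         # 1st-moment estimate V_t
    s = beta2 * s + (1 - beta2) * grad * grad  # 2nd-moment estimate S_t
    g_hat = lr * v / (math.sqrt(s) + eps)      # scaled update
    return param - g_hat, v, s


# Minimize f(x) = x^2 (gradient 2x), starting from x = 1.0.
param, v, s = 1.0, 0.0, 0.0
for _ in range(3):
    param, v, s = adam_step(param, 2.0 * param, v, s)
```

Each call returns the updated parameter together with the new moment estimates, so the caller carries the optimizer state explicitly.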

For example:

Example 1:

# Assume net is a custom model.
adam = flow.optim.Adam(net.parameters(), lr=1e-3)

for epoch in range(epochs):
    # Read data, Compute the loss and so on.
    # ...
    loss.backward()
    adam.step()
    adam.zero_grad()

Example 2:

# Assume net is a custom model.
adam = flow.optim.Adam(
    [
        {
            "params": net.parameters(),
            "lr": learning_rate,
            "clip_grad_max_norm": 0.5,
            "clip_grad_norm_type": 2.0,
        }
    ],
)

for epoch in range(epochs):
    # Read data, Compute the loss and so on.
    # ...
    loss.backward()
    adam.clip_grad()
    adam.step()
    adam.zero_grad()

To use clip_grad, refer to this example.

For more details on clip_grad_max_norm and clip_grad_norm_type, refer to oneflow.nn.utils.clip_grad_norm_().

step(closure: Optional[Callable] = None)

Performs a single optimization step.

Parameters

closure (callable, optional) – A closure that reevaluates the model and returns the loss.

class oneflow.optim.AdamW(parameters: Union[Iterator[oneflow._oneflow_internal.nn.Parameter], List[Dict]], lr: float = 0.001, betas: Tuple[float, float] = (0.9, 0.999), eps: float = 1e-08, weight_decay: float = 0, amsgrad: bool = False, do_bias_correction: bool = True)

Implements AdamW algorithm.

The original Adam algorithm was proposed in Adam: A Method for Stochastic Optimization. The AdamW variant was proposed in Decoupled Weight Decay Regularization.

This optimizer implements the Adam-weight-decay algorithm (for more details, refer to Adam-weight-decay), which applies weight decay directly to the parameters instead of adding an L2 penalty to the gradient.

The parameter update equations are:

\[ \begin{align}\begin{aligned}& V_t = \beta_1*V_{t-1} + (1-\beta_1)*grad\\& S_t = \beta_2*S_{t-1} + (1-\beta_2)*{grad} \odot {grad}\\& \hat{g} = learning\_rate*(\frac{{V_t}}{\sqrt{{S_t}}+\epsilon}+\lambda*param_{old})\\& param_{new} = param_{old} - \hat{g}\end{aligned}\end{align} \]
Parameters
  • parameters (iterable) – iterable of parameters to optimize or dicts defining parameter groups

  • lr (float, optional) – learning rate (default: 1e-3)

  • betas (Tuple[float, float], optional) – coefficients used for computing running averages of gradient and its square (default: (0.9, 0.999))

  • eps (float, optional) – term added to the denominator to improve numerical stability (default: 1e-8)

  • weight_decay (float, optional) – weight decay coefficient, λ in the equation above (default: 0)

  • amsgrad (bool, optional) – whether to use the AMSGrad variant of this algorithm. (default: False)

  • do_bias_correction (bool, optional) – whether to apply bias correction (default: True)
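The AdamW equations above differ from Adam only in the \(\lambda * param_{old}\) term. A minimal scalar sketch in plain Python, assuming a hypothetical helper named adamw_step and omitting bias correction and AMSGrad, could look like this:

```python
import math


def adamw_step(param, grad, v, s, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.01):
    """One scalar update following the AdamW equations above (hypothetical helper)."""
    v = beta1 * v + (1 - beta1) * grad
    s = beta2 * s + (1 - beta2) * grad * grad
    # The decay term lambda * param is added outside the adaptive ratio.
    g_hat = lr * (v / (math.sqrt(s) + eps) + weight_decay * param)
    return param - g_hat, v, s


# With a zero gradient the parameter still shrinks because of weight decay.
decayed, _, _ = adamw_step(1.0, 0.0, 0.0, 0.0)
# With weight_decay=0 and a zero gradient the parameter is unchanged.
unchanged, _, _ = adamw_step(1.0, 0.0, 0.0, 0.0, weight_decay=0.0)
```

The zero-gradient case isolates the decay term: the parameter moves by lr * weight_decay * param and nothing else.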

For example:

Example 1:

# Assume net is a custom model.
adamw = flow.optim.AdamW(net.parameters(), lr=1e-3)

for epoch in range(epochs):
    # Read data, Compute the loss and so on.
    # ...
    loss.backward()
    adamw.step()
    adamw.zero_grad()

Example 2:

# Assume net is a custom model.
adamw = flow.optim.AdamW(
    [
        {
            "params": net.parameters(),
            "lr": learning_rate,
            "clip_grad_max_norm": 0.5,
            "clip_grad_norm_type": 2.0,
        }
    ],
)

for epoch in range(epochs):
    # Read data, Compute the loss and so on.
    # ...
    loss.backward()
    adamw.clip_grad()
    adamw.step()
    adamw.zero_grad()

To use clip_grad, refer to this example.

For more details on clip_grad_max_norm and clip_grad_norm_type, refer to oneflow.nn.utils.clip_grad_norm_().

step(closure: Optional[Callable] = None)

Performs a single optimization step.

Parameters

closure (callable, optional) – A closure that reevaluates the model and returns the loss.

class oneflow.optim.Optimizer(parameters, options)
add_param_group(param_group) → None
clip_grad()

Clips gradient norm of an iterable of parameters. The norm is computed over all gradients together, as if they were concatenated into a single vector.

The maximum norm and the norm type are set through the clip_grad_max_norm and clip_grad_norm_type entries of a parameter group.

For more details, refer to the documentation of each optimizer (such as Adam and SGD).

You can also refer to the code in oneflow.nn.utils.clip_grad_norm_().
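The idea of clipping by a norm computed over all gradients together can be sketched in plain Python. This is a simplified illustration and not OneFlow's code: the function name clip_grad_norm, the nested-list representation of gradients, and the 1e-6 stabilizer in the scale factor are all assumptions, and an infinite norm_type is not handled.

```python
def clip_grad_norm(grads, max_norm, norm_type=2.0):
    """Scale all gradients so that their joint norm does not exceed max_norm.

    grads is a list of lists of floats, one inner list per parameter
    (hypothetical representation for this sketch).
    """
    # Treat every gradient entry as part of one concatenated vector.
    flat = [abs(g) for group in grads for g in group]
    total_norm = sum(g ** norm_type for g in flat) ** (1.0 / norm_type)
    # Only ever scale down; the small constant is an assumed stabilizer.
    scale = min(1.0, max_norm / (total_norm + 1e-6))
    clipped = [[g * scale for g in group] for group in grads]
    return clipped, total_norm


clipped, total_norm = clip_grad_norm([[3.0], [4.0]], max_norm=0.5)
small, small_norm = clip_grad_norm([[0.1]], max_norm=1.0)
```

Because a single scale factor is applied to every gradient, the direction of the overall update is preserved and only its length changes.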

load_state_dict(state_dict) → None

Loads the optimizer state produced by state_dict().

Adapted from: https://pytorch.org/docs/stable/_modules/torch/optim/optimizer.html#Optimizer.load_state_dict

state_dict()

Returns the state of the optimizer as a dict.

It contains two entries:

  • state - a dict holding current optimization state. Its content differs between optimizer classes.

  • param_group - a dict containing all parameter groups.

Adapted from: https://pytorch.org/docs/stable/_modules/torch/optim/optimizer.html#Optimizer.state_dict

step(closure: Optional[Callable] = None) → Optional[oneflow._oneflow_internal.Tensor]
zero_grad(set_to_none: bool = False)

Sets the gradients of all optimized tensors to zero.

Parameters

set_to_none (bool) – instead of setting to zero, set the grads to None. This will in general have lower memory footprint, and can modestly improve performance. However, it changes certain behaviors.

For example:

1. When the user tries to access a gradient and perform manual ops on it, a None attribute or a Tensor full of 0s will behave differently.

2. If the user requests zero_grad(set_to_none=True) followed by a backward pass, grads are guaranteed to be None for params that did not receive a gradient.

3. Optimizers have a different behavior if the gradient is 0 or None (in one case it does the step with a gradient of 0 and in the other it skips the step altogether).
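The third point can be illustrated with a toy optimizer step in plain Python. This is a hypothetical sketch, not OneFlow's optimizer: it assumes an SGD-style update with weight decay, where a zero gradient still triggers the decay while a None gradient makes the step skip the parameter.

```python
def toy_sgd_step(params, grads, lr=0.1, weight_decay=0.5):
    """Toy SGD step with weight decay (hypothetical helper for illustration)."""
    updated = []
    for p, g in zip(params, grads):
        if g is None:
            # No gradient available: the parameter is skipped entirely.
            updated.append(p)
            continue
        # A zero gradient still applies the weight decay term.
        updated.append(p - lr * (g + weight_decay * p))
    return updated


with_zero = toy_sgd_step([1.0], [0.0])   # decayed by lr * weight_decay * p
with_none = toy_sgd_step([1.0], [None])  # left untouched
```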

class oneflow.optim.RMSprop(parameters: Union[Iterator[oneflow._oneflow_internal.nn.Parameter], List[Dict]], lr: float = 0.001, alpha: float = 0.99, eps: float = 1e-08, weight_decay: float = 0, momentum: float = 0.0, centered: bool = False)

Implements RMSprop algorithm.

Root Mean Squared Propagation (RMSProp) is an unpublished, adaptive learning rate method. It was originally proposed on slide 29 of http://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf .

The original equation is as follows:

\[ \begin{align}\begin{aligned}r(w, t) = \alpha r(w, t-1) + (1 - \alpha)(\nabla Q_{i}(w))^2\\\begin{split}w = w - \frac{\eta} {\sqrt{r(w,t) + \epsilon}} \nabla Q_{i}(w)\end{split}\end{aligned}\end{align} \]

The first equation calculates a moving average of the squared gradient for each weight. The gradient is then divided by \(\sqrt{r(w,t)}\). In some cases, adding a momentum term \(\beta\) is beneficial. In our implementation, Nesterov momentum is used:

\[ \begin{align}\begin{aligned}r(w, t) = \alpha r(w, t-1) + (1 - \alpha)(\nabla Q_{i}(w))^2\\\begin{split}v(w, t) = \beta v(w, t-1) + \frac{\eta} {\sqrt{r(w,t) + \epsilon}} \nabla Q_{i}(w)\end{split}\\w = w - v(w, t)\end{aligned}\end{align} \]

If centered is True:

\[ \begin{align}\begin{aligned}r(w, t) = \alpha r(w, t-1) + (1 - \alpha)(\nabla Q_{i}(w))^2\\g(w, t) = \alpha g(w, t-1) + (1 - \alpha)\nabla Q_{i}(w)\\\begin{split}v(w, t) = \beta v(w, t-1) + \frac{\eta} {\sqrt{r(w,t) - (g(w, t))^2 + \epsilon}} \nabla Q_{i}(w)\end{split}\\w = w - v(w, t)\end{aligned}\end{align} \]

where \(\alpha\) is a hyperparameter with typical values such as 0.99 or 0.95, \(\beta\) is the momentum term, and \(\epsilon\) is a smoothing term to avoid division by zero, usually set somewhere in the range from 1e-4 to 1e-8.
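The momentum and centered variants above can be sketched for a single scalar weight in plain Python. This is an illustrative sketch that follows the equations as written, not OneFlow's kernel; the function name rmsprop_step and its argument layout are hypothetical.

```python
import math


def rmsprop_step(w, grad, r, v, g_avg=0.0, lr=1e-3, alpha=0.99, beta=0.0,
                 eps=1e-8, centered=False):
    """One scalar RMSprop update following the equations above (hypothetical helper)."""
    r = alpha * r + (1 - alpha) * grad * grad  # moving average of squared gradient
    if centered:
        g_avg = alpha * g_avg + (1 - alpha) * grad  # moving average of gradient
        denom = math.sqrt(r - g_avg * g_avg + eps)
    else:
        denom = math.sqrt(r + eps)
    v = beta * v + lr / denom * grad  # momentum buffer (beta = 0 disables momentum)
    return w - v, r, v, g_avg


w_plain, r_plain, v_plain, _ = rmsprop_step(1.0, 2.0, 0.0, 0.0)
w_centered, _, _, _ = rmsprop_step(1.0, 2.0, 0.0, 0.0, centered=True)
```

In the centered case the denominator is smaller because the squared mean gradient is subtracted, so the first step is slightly larger than in the plain case.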

Parameters
  • parameters (iterable) – iterable of parameters to optimize or dicts defining parameter groups

  • lr (float, optional) – learning rate (default: 1e-3)

  • momentum (float, optional) – momentum factor (default: 0; OneFlow does not currently support momentum > 0)

  • alpha (float, optional) – smoothing constant (default: 0.99)

  • eps (float, optional) – term added to the denominator to improve numerical stability (default: 1e-8)

  • centered (bool, optional) – if True, compute the centered RMSProp, in which the gradient is normalized by an estimate of its variance (default: False)

  • weight_decay (float, optional) – weight decay (L2 penalty) (default: 0)

For example:

Example 1:

# Assume net is a custom model.
rmsprop = flow.optim.RMSprop(net.parameters(), lr=1e-3)

for epoch in range(epochs):
    # Read data, Compute the loss and so on.
    # ...
    loss.backward()
    rmsprop.step()
    rmsprop.zero_grad()

Example 2:

# Assume net is a custom model.
rmsprop = flow.optim.RMSprop(
    [
        {
            "params": net.parameters(),
            "lr": learning_rate,
            "clip_grad_max_norm": 0.5,
            "clip_grad_norm_type": 2.0,
        }
    ],
)

for epoch in range(epochs):
    # Read data, Compute the loss and so on.
    # ...
    loss.backward()
    rmsprop.clip_grad()
    rmsprop.step()
    rmsprop.zero_grad()

To use clip_grad, refer to this example.

For more details on clip_grad_max_norm and clip_grad_norm_type, refer to oneflow.nn.utils.clip_grad_norm_().

step(closure: Optional[Callable] = None)

Performs a single optimization step.

Parameters

closure (callable, optional) – A closure that reevaluates the model and returns the loss.

class oneflow.optim.SGD(parameters: Union[Iterator[oneflow._oneflow_internal.nn.Parameter], List[Dict]], lr: float = 0.001, momentum: float = 0.0, weight_decay: float = 0.0)

Implements SGD algorithm.

This algorithm uses the gradient of a randomly sampled mini-batch as an approximate estimate of the overall gradient.

When momentum = 0, the parameter update equation is:

\[param_{new} = param_{old} - learning\_rate * grad\]

With momentum, the parameter update equations are:

\[ \begin{align}\begin{aligned}& V_t = \beta * V_{t-1} - learning\_rate * (g_t + param_{old} * weight\_decay)\\& param_{new} = param_{old} + V_t\end{aligned}\end{align} \]
Parameters
  • parameters (iterable) – iterable of parameters to optimize or dicts defining parameter groups

  • lr (float, optional) – learning rate (default: 1e-3)

  • momentum (float, optional) – Momentum factor (default: 0.0)

  • weight_decay (float, optional) – weight decay (L2 penalty) (default: 0.0)
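The momentum form of the update above can be sketched for one scalar parameter in plain Python. The helper name sgd_momentum_step is hypothetical and the sketch only mirrors the equations, not OneFlow's implementation.

```python
def sgd_momentum_step(param, grad, v, lr=1e-3, beta=0.9, weight_decay=0.0):
    """One scalar SGD-with-momentum update following the equations above."""
    v = beta * v - lr * (grad + param * weight_decay)  # velocity V_t
    return param + v, v


# Two steps with a constant gradient of 2.0 and lr = 0.1.
param, v = 1.0, 0.0
param, v = sgd_momentum_step(param, 2.0, v, lr=0.1)
param, v = sgd_momentum_step(param, 2.0, v, lr=0.1)
```

With a constant gradient the second step is larger than the first, which is the accumulation effect that momentum is meant to provide.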

For example:

Example 1:

# Assume net is a custom model.
sgd = flow.optim.SGD(net.parameters(), lr=1e-3)

for epoch in range(epochs):
    # Read data, Compute the loss and so on.
    # ...
    loss.backward()
    sgd.step()
    sgd.zero_grad()

Example 2:

# Assume net is a custom model.
sgd = flow.optim.SGD(
    [
        {
            "params": net.parameters(),
            "lr": learning_rate,
            "clip_grad_max_norm": 0.5,
            "clip_grad_norm_type": 2.0,
        }
    ],
)

for epoch in range(epochs):
    # Read data, Compute the loss and so on.
    # ...
    loss.backward()
    sgd.clip_grad()
    sgd.step()
    sgd.zero_grad()

To use clip_grad, refer to this example.

For more details on clip_grad_max_norm and clip_grad_norm_type, refer to oneflow.nn.utils.clip_grad_norm_().

step(closure: Optional[Callable] = None)

class oneflow.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max: int, eta_min: float = 0.0, last_step=-1, verbose=False)

This documentation is adapted from: https://pytorch.org/docs/stable/generated/torch.optim.lr_scheduler.CosineAnnealingLR.html?highlight=cosine#torch.optim.lr_scheduler.CosineAnnealingLR

Set the learning rate of each parameter group using a cosine annealing schedule, where \(\eta_{max}\) is set to the initial lr and \(T_{cur}\) is the number of epochs since the last restart in SGDR:

\[\begin{split}\begin{aligned} \eta_t & = \eta_{min} + \frac{1}{2}(\eta_{max} - \eta_{min})\left(1 + \cos\left(\frac{T_{cur}}{T_{max}}\pi\right)\right), & T_{cur} \neq (2k+1)T_{max}; \\ \eta_{t+1} & = \eta_{t} + \frac{1}{2}(\eta_{max} - \eta_{min}) \left(1 - \cos\left(\frac{1}{T_{max}}\pi\right)\right), & T_{cur} = (2k+1)T_{max}. \end{aligned}\end{split}\]

When last_step=-1, sets initial lr as lr. Notice that because the schedule is defined recursively, the learning rate can be simultaneously modified outside this scheduler by other operators. If the learning rate is set solely by this scheduler, the learning rate at each step becomes:

\[\eta_t = \eta_{min} + \frac{1}{2}(\eta_{max} - \eta_{min})\left(1 + \cos\left(\frac{T_{cur}}{T_{max}}\pi\right)\right)\]

It has been proposed in SGDR: Stochastic Gradient Descent with Warm Restarts. Note that this only implements the cosine annealing part of SGDR, and not the restarts.
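The closed-form schedule above can be evaluated directly in plain Python. This is a small sketch with a hypothetical function name, cosine_annealing_lr, and it assumes the learning rate is set solely by this schedule.

```python
import math


def cosine_annealing_lr(base_lr, t_cur, t_max, eta_min=0.0):
    """Closed-form cosine annealing; base_lr plays the role of eta_max."""
    return eta_min + 0.5 * (base_lr - eta_min) * (1 + math.cos(math.pi * t_cur / t_max))


# Learning rate at the start, the midpoint, and the end of the schedule.
schedule = [cosine_annealing_lr(0.1, t, t_max=100, eta_min=0.001) for t in (0, 50, 100)]
```

The schedule starts at the initial learning rate, passes through the midpoint between the initial rate and eta_min halfway through, and ends at eta_min.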

Parameters
  • optimizer (Optimizer) – Wrapped optimizer.

  • T_max (int) – Maximum number of iterations.

  • eta_min (float) – Minimum learning rate. Default: 0.

  • last_step (int) – The index of the last step. Default: -1.

  • verbose (bool) – If True, prints a message to stdout for each update. Default: False.

get_lr()

Compute learning rate using chainable form of the scheduler

class oneflow.optim.lr_scheduler.CosineDecayLR(optimizer, decay_steps: int, alpha: float = 0.0, last_step=-1, verbose=False)

This operator creates a Cosine decayed learning rate scheduler.

While the current step is less than decay_steps, the learning rate is updated as:

\[ \begin{align}\begin{aligned}& cos\_decay = 0.5*(1+cos(\pi*\frac{current\_step}{decay\_steps}))\\& decay\_factor = (1-\alpha)*cos\_decay+\alpha\\& learning\_rate = base\_learning\_rate*decay\_factor\end{aligned}\end{align} \]

Once the current step reaches decay_steps, the learning rate is:

\[learning\_rate = {base\_learning\_rate}*{\alpha}\]

It has been proposed in SGDR: Stochastic Gradient Descent with Warm Restarts. Note that this only implements the cosine annealing part of SGDR, and not the restarts.
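The two cases above can be combined into one small function. The following plain-Python sketch uses a hypothetical name, cosine_decay_lr, and only mirrors the formulas; it is not the scheduler's actual code.

```python
import math


def cosine_decay_lr(base_lr, current_step, decay_steps, alpha=0.0):
    """Learning rate under cosine decay, following the formulas above."""
    if current_step < decay_steps:
        cos_decay = 0.5 * (1 + math.cos(math.pi * current_step / decay_steps))
        decay_factor = (1 - alpha) * cos_decay + alpha
    else:
        # After decay_steps the rate stays at base_lr * alpha.
        decay_factor = alpha
    return base_lr * decay_factor


lr_start = cosine_decay_lr(0.1, 0, decay_steps=100, alpha=0.2)
lr_mid = cosine_decay_lr(0.1, 50, decay_steps=100, alpha=0.2)
lr_after = cosine_decay_lr(0.1, 150, decay_steps=100, alpha=0.2)
```

With alpha = 0.2 the rate decays from the base value to 20% of it and then stays there.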

Parameters
  • optimizer (Optimizer) – Wrapped optimizer.

  • decay_steps (int) – The decay steps in the scheduler.

  • alpha (float, optional) – The learning rate scale factor (\(\alpha\)). (default: 0.0)

  • last_step (int, optional) – The index of last step. (default: -1)

  • verbose (bool, optional) – If True, prints a message to stdout for each update. (default: False)

For example:

import oneflow as flow

...
cosine_decay_lr = flow.optim.lr_scheduler.CosineDecayLR(optimizer, decay_steps=100, alpha=0.0)
for epoch in range(num_epoch):
    train(...)
    cosine_decay_lr.step()

get_lr()

Compute learning rate using chainable form of the scheduler

class oneflow.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda, last_step=-1, verbose=False)

Sets the learning rate of each parameter group to the initial lr times a given function. When last_step=-1, sets initial lr as lr.

\[learning\_rate = base\_learning\_rate*lambda(last\_step)\]
Parameters
  • optimizer (Optimizer) – Wrapped optimizer.

  • lr_lambda (function or list) – A function which computes a multiplicative factor given an integer parameter step, or a list of such functions, one for each group in optimizer.param_groups.

  • last_step (int, optional) – The index of last step. (default: -1)

  • verbose (bool, optional) – If True, prints a message to stdout for each update. (default: False)
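The formula above, applied with one function per parameter group, can be sketched in plain Python. The helper name lambda_lr and the list-based representation of parameter groups are assumptions made for this illustration.

```python
def lambda_lr(base_lrs, lr_lambdas, step):
    """Return one learning rate per group: base_lr * lambda(step)."""
    return [base_lr * fn(step) for base_lr, fn in zip(base_lrs, lr_lambdas)]


# Two hypothetical parameter groups with different schedules.
factors = [lambda step: step // 30, lambda step: 0.95 ** step]
lrs_at_60 = lambda_lr([0.1, 0.01], factors, step=60)
```

Note that a factor such as step // 30 evaluates to 0 for the first 30 steps, which sets the learning rate of that group to 0 during that period.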

For example:

import oneflow as flow

...
lambda1 = lambda step: step // 30
lambda2 = lambda step: 0.95 ** step
lambda_lr = flow.optim.lr_scheduler.LambdaLR(optimizer, [lambda1, lambda2])
for epoch in range(num_epoch):
    train(...)
    lambda_lr.step()

get_lr()

Compute learning rate using chainable form of the scheduler

load_state_dict(state_dict)

Loads the scheduler's state.

Parameters

state_dict (dict) – scheduler state. Should be an object returned from a call to state_dict().

state_dict()

Returns the state of the scheduler as a dict.

It contains an entry for every variable in self.__dict__ which is not the optimizer. The learning rate lambda functions will only be saved if they are callable objects and not if they are functions or lambdas.

class oneflow.optim.lr_scheduler.MultiStepLR(optimizer, milestones: list, gamma: float = 0.1, last_step=-1, verbose=False)

Decays the learning rate of each parameter group by gamma once the number of steps reaches one of the milestones. Notice that such decay can happen simultaneously with other changes to the learning rate from outside this scheduler. When last_step=-1, sets initial lr as lr.

Parameters
  • optimizer (Optimizer) – Wrapped optimizer.

  • milestones (list) – List of step indices. Must be increasing.

  • gamma (float, optional) – Multiplicative factor of learning rate decay. (default: 0.1)

  • last_step (int, optional) – The index of last step. (default: -1)

  • verbose (bool, optional) – If True, prints a message to stdout for each update. (default: False)
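The milestone-based decay described above can be written in closed form. The following plain-Python sketch uses a hypothetical helper, multistep_lr, and assumes that the decay takes effect as soon as the step index equals a milestone and that no other component modifies the learning rate.

```python
from bisect import bisect_right


def multistep_lr(base_lr, step, milestones, gamma=0.1):
    """Multiply base_lr by gamma once for every milestone already reached."""
    reached = bisect_right(milestones, step)  # milestones must be sorted
    return base_lr * gamma ** reached


lr_before = multistep_lr(0.1, 29, milestones=[30, 80])
lr_first = multistep_lr(0.1, 30, milestones=[30, 80])
lr_second = multistep_lr(0.1, 80, milestones=[30, 80])
```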

For example:

import oneflow as flow

...
multistep_lr = flow.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[30,80], gamma=0.1)
for epoch in range(num_epoch):
    train(...)
    multistep_lr.step()

get_lr()

Compute learning rate using chainable form of the scheduler

class oneflow.optim.lr_scheduler.StepLR(optimizer, step_size: int, gamma: float = 0.1, last_step=-1, verbose=False)

Decays the learning rate of each parameter group by gamma every step_size steps. Notice that such decay can happen simultaneously with other changes to the learning rate from outside this scheduler. When last_step=-1, sets initial lr as lr.

Parameters
  • optimizer (Optimizer) – Wrapped optimizer.

  • step_size (int) – Period of learning rate decay.

  • gamma (float, optional) – Multiplicative factor of learning rate decay. (default: 0.1)

  • last_step (int, optional) – The index of last step. (default: -1)

  • verbose (bool, optional) – If True, prints a message to stdout for each update. (default: False)
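The periodic decay described above can be expressed in closed form. This plain-Python sketch uses a hypothetical helper named step_lr and assumes the learning rate is modified only by this schedule.

```python
def step_lr(base_lr, step, step_size, gamma=0.1):
    """Multiply base_lr by gamma once for every completed period of step_size steps."""
    return base_lr * gamma ** (step // step_size)


lr_at_29 = step_lr(0.1, 29, step_size=30)
lr_at_30 = step_lr(0.1, 30, step_size=30)
lr_at_60 = step_lr(0.1, 60, step_size=30)
```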

For example:

import oneflow as flow

...
step_lr = flow.optim.lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.1)
for epoch in range(num_epoch):
    train(...)
    step_lr.step()

get_lr()

Compute learning rate using chainable form of the scheduler