oneflow.optimizer

Optimizers

class oneflow.optimizer.Adam(lr_scheduler: oneflow.python.ops.optimizer.LrScheduler, beta1=0.9, beta2=0.999, epsilon=1e-08, do_bias_correction=False, loss_scale_factor: Optional[float] = None, grad_clipping: Optional[oneflow.python.ops.optimizer.ClipGradientConf] = None, train_step_lbn: Optional[str] = None, loss_scale_policy: Optional[oneflow.python.ops.optimizer.LossScalePolicy] = None)

The optimizer of the Adam algorithm.

This algorithm dynamically adjusts the learning rate of each parameter according to the 1st-moment and 2nd-moment estimates of the gradient.

With bias correction, the parameter update equations are:

\[ \begin{align}\begin{aligned}& V_t = \beta_1*V_{t-1} + (1-\beta_1)*grad\\& S_t = \beta_2*S_{t-1} + (1-\beta_2)*{grad} \odot {grad}\\& \hat{V_t} = \frac{V_t}{1-\beta_1^t}\\& \hat{S_t} = \frac{S_t}{1-\beta_2^t}\\& \hat{g} = learning\_rate*\frac{\hat{V_t}}{\sqrt{\hat{S_t}}+\epsilon}\\& param_{new} = param_{old} - \hat{g}\end{aligned}\end{align} \]

Without bias correction, the parameter update equations are:

\[ \begin{align}\begin{aligned}& V_t = \beta_1*V_{t-1} + (1-\beta_1)*grad\\& S_t = \beta_2*S_{t-1} + (1-\beta_2)*{grad} \odot {grad}\\& \hat{g} = learning\_rate*\frac{{V_t}}{\sqrt{{S_t}}+\epsilon}\\& param_{new} = param_{old} - \hat{g}\end{aligned}\end{align} \]

For more details, please refer to Adam.
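
The two update rules above can be sketched in plain Python for a single scalar parameter. This is an illustrative sketch only, not OneFlow's implementation; the function name `adam_step`, its argument layout, and the toy loss are assumptions made for the example.

```python
import math


def adam_step(param, grad, v, s, t, lr=0.001, beta1=0.9, beta2=0.999,
              epsilon=1e-8, do_bias_correction=False):
    """Apply one Adam update to a scalar parameter.

    v and s are the running 1st- and 2nd-moment estimates and t is the
    1-based step count. Returns the updated (param, v, s).
    """
    v = beta1 * v + (1 - beta1) * grad
    s = beta2 * s + (1 - beta2) * grad * grad
    if do_bias_correction:
        v_hat = v / (1 - beta1 ** t)
        s_hat = s / (1 - beta2 ** t)
    else:
        v_hat, s_hat = v, s
    g_hat = lr * v_hat / (math.sqrt(s_hat) + epsilon)
    return param - g_hat, v, s


# Hypothetical toy problem: minimize param**2, whose gradient is 2 * param.
param, v, s = 1.0, 0.0, 0.0
for t in range(1, 4):
    param, v, s = adam_step(param, 2 * param, v, s, t, do_bias_correction=True)
print(param)
```

With bias correction the first step has a size close to `lr`, because both moment estimates are rescaled to compensate for their zero initialization.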

Parameters

lr_scheduler (LrScheduler) – The scheduler of learning rate.
beta1 (float, optional) – The exponential weighted average decay rate for the 1st-moment estimates (\(\beta_1\)). Defaults to 0.9.
beta2 (float, optional) – The exponential weighted average decay rate for the 2nd-moment estimates (\(\beta_2\)). Defaults to 0.999.
epsilon (float, optional) – A small float constant value for numerical stability (\(\epsilon\)). Defaults to 1e-8.
do_bias_correction (bool, optional) – Whether to do the bias correction. Defaults to False.
loss_scale_factor (Optional[float], optional) – The scale factor of loss. Defaults to None.
grad_clipping (Optional[ClipGradientConf], optional) – The gradient clipping strategy. Defaults to None.
train_step_lbn (Optional[Text], optional) – [description]. Defaults to None.
loss_scale_policy (Optional[LossScalePolicy]) – The policy of loss scale.

For example:

import oneflow as flow
import oneflow.typing as tp

@flow.global_function(type="train")
def train_job(
    images: tp.Numpy.Placeholder((BATCH_SIZE, 1, 28, 28), dtype=flow.float),
    labels: tp.Numpy.Placeholder((BATCH_SIZE,), dtype=flow.int32),
) -> tp.Numpy:
    with flow.scope.placement("gpu", "0:0"):
        logits = lenet(images, train=True)
        loss = flow.nn.sparse_softmax_cross_entropy_with_logits(
            labels, logits, name="softmax_loss"
        )

    # Set learning rate as 0.001
    lr_scheduler = flow.optimizer.PiecewiseConstantScheduler([], [0.001])
    # Set Adam optimizer
    flow.optimizer.Adam(lr_scheduler, do_bias_correction=False).minimize(loss)

    return loss

class oneflow.optimizer.AdamW(lr_scheduler: oneflow.python.ops.optimizer.LrScheduler, beta1=0.9, beta2=0.999, epsilon=1e-08, do_bias_correction=False, loss_scale_factor: Optional[float] = None, weight_decay: Optional[float] = None, weight_decay_includes: Union[Sequence[str], str, None] = None, weight_decay_excludes: Union[Sequence[str], str, None] = None, grad_clipping: Optional[oneflow.python.ops.optimizer.ClipGradientConf] = None, train_step_lbn: Optional[str] = None, loss_scale_policy: Optional[oneflow.python.ops.optimizer.LossScalePolicy] = None)

The optimizer of the Adam-weight-decay algorithm.

L2 regularization is not effective when combined with the adaptive learning rate of the Adam optimizer (for more details, please refer to Adam-weight-decay). The Adam-weight-decay algorithm addresses this problem by adding the weight decay term directly to the parameter update.

With bias correction, the parameter update equations are:

\[ \begin{align}\begin{aligned}& V_t = \beta_1*V_{t-1} + (1-\beta_1)*grad\\& S_t = \beta_2*S_{t-1} + (1-\beta_2)*{grad} \odot {grad}\\& \hat{V_t} = \frac{V_t}{1-\beta_1^t}\\& \hat{S_t} = \frac{S_t}{1-\beta_2^t}\\& \hat{g} = learning\_rate*(\frac{\hat{V_t}}{\sqrt{\hat{S_t}}+\epsilon}+\lambda*param_{old})\\& param_{new} = param_{old} - \hat{g}\end{aligned}\end{align} \]

Without bias correction, the parameter update equations are:

\[ \begin{align}\begin{aligned}& V_t = \beta_1*V_{t-1} + (1-\beta_1)*grad\\& S_t = \beta_2*S_{t-1} + (1-\beta_2)*{grad} \odot {grad}\\& \hat{g} = learning\_rate*(\frac{{V_t}}{\sqrt{{S_t}}+\epsilon}+\lambda*param_{old})\\& param_{new} = param_{old} - \hat{g}\end{aligned}\end{align} \]
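
As a minimal sketch of the update without bias correction, the snippet below applies the equations to one scalar parameter. It is not OneFlow's implementation; `adamw_step` is a hypothetical helper name. The point it illustrates is that the \(\lambda*param_{old}\) term sits outside the adaptive scaling.

```python
import math


def adamw_step(param, grad, v, s, lr=0.001, beta1=0.9, beta2=0.999,
               epsilon=1e-8, weight_decay=0.0):
    """One Adam-weight-decay update (no bias correction) for a scalar."""
    v = beta1 * v + (1 - beta1) * grad
    s = beta2 * s + (1 - beta2) * grad * grad
    # The decay term weight_decay * param is added after the adaptive
    # scaling, so it is not divided by sqrt(s).
    g_hat = lr * (v / (math.sqrt(s) + epsilon) + weight_decay * param)
    return param - g_hat, v, s


# With a zero gradient the parameter still shrinks by lr * weight_decay * param.
param, v, s = adamw_step(1.0, 0.0, 0.0, 0.0, lr=0.01, weight_decay=0.1)
print(param)
```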

Parameters

lr_scheduler (LrScheduler) – The scheduler of learning rate.
beta1 (float, optional) – The exponential weighted average decay rate for the 1st-moment estimates (\(\beta_1\)). Defaults to 0.9.
beta2 (float, optional) – The exponential weighted average decay rate for the 2nd-moment estimates (\(\beta_2\)). Defaults to 0.999.
epsilon (float, optional) – A small float constant value for numerical stability (\(\epsilon\)). Defaults to 1e-8.
do_bias_correction (bool, optional) – Whether to do the bias correction. Defaults to False.
loss_scale_factor (Optional[float], optional) – The scale factor of loss. Defaults to None.
weight_decay (Optional[float], optional) – The weight decay factor (\(\lambda\) in the equations). Defaults to None.
weight_decay_includes (Optional[Union[Sequence[Text], Text]], optional) – The names of the model parameters that use weight decay. Defaults to None.
weight_decay_excludes (Optional[Union[Sequence[Text], Text]], optional) – The names of the model parameters that do not use weight decay. Defaults to None.
grad_clipping (Optional[ClipGradientConf], optional) – The gradient clipping strategy. Defaults to None.
train_step_lbn (Optional[Text], optional) – [description]. Defaults to None.
loss_scale_policy (Optional[LossScalePolicy]) – The policy of loss scale.

Note

Only one of weight_decay_includes and weight_decay_excludes can be set. If both are None, all the model parameters will use weight decay.

For example:

import oneflow as flow
import oneflow.typing as tp

@flow.global_function(type="train")
def train_job(
    images: tp.Numpy.Placeholder((BATCH_SIZE, 1, 28, 28), dtype=flow.float),
    labels: tp.Numpy.Placeholder((BATCH_SIZE,), dtype=flow.int32),
) -> tp.Numpy:
    with flow.scope.placement("gpu", "0:0"):
        logits = lenet(images, train=True)
        loss = flow.nn.sparse_softmax_cross_entropy_with_logits(
            labels, logits, name="softmax_loss"
        )

    # Set learning rate as 0.001
    lr_scheduler = flow.optimizer.PiecewiseConstantScheduler([], [0.001])
    # Set AdamW optimizer, weight_decay factor is 0.00005
    flow.optimizer.AdamW(lr_scheduler,
            do_bias_correction=False, weight_decay=0.00005).minimize(loss)

    return loss

class oneflow.optimizer.CosineScheduler(base_lr: float, steps: int, alpha: float = 0.0, warmup: Optional[oneflow.python.ops.optimizer.WarmupConf] = None)

This operator creates a cosine-decayed learning rate scheduler.

Before the number of steps specified by the user is reached, the learning rate is updated as:

\[ \begin{align}\begin{aligned}& cos\_decay = 0.5*(1+cos(\pi*\frac{current\_batch}{decayed\_batch}))\\& decay\_factor = (1-\alpha)*cos\_decay+\alpha\\& learning\_rate = base\_learning\_rate*decay\_factor\end{aligned}\end{align} \]

After the number of steps specified by the user, the learning rate is:

\[learning\_rate = {base\_learning\_rate}*{\alpha}\]
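
The two regimes above can be evaluated with a few lines of plain Python. This is only a sketch of the formulas, with `steps` playing the role of \(decayed\_batch\); the helper name `cosine_lr` is hypothetical and warmup is ignored.

```python
import math


def cosine_lr(base_lr, steps, current_batch, alpha=0.0):
    """Learning rate at current_batch under the cosine decay equations."""
    if current_batch < steps:
        cos_decay = 0.5 * (1 + math.cos(math.pi * current_batch / steps))
        decay_factor = (1 - alpha) * cos_decay + alpha
        return base_lr * decay_factor
    # After the decay window the rate stays at base_lr * alpha.
    return base_lr * alpha


for batch in (0, 5, 10, 20):
    print(batch, cosine_lr(base_lr=0.01, steps=10, current_batch=batch, alpha=0.1))
```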

Parameters

base_lr (float) – The base learning rate (\(base\_learning\_rate\))
steps (int) – The decay steps in the scheduler (\(decayed\_batch\))
alpha (float, optional) – The learning rate scale factor (\(\alpha\)). Defaults to 0.0.
warmup (Optional[WarmupConf], optional) – The warmup strategy. Defaults to None.

For example:

import oneflow as flow
import oneflow.typing as tp

@flow.global_function(type="train")
def train_job(
    images: tp.Numpy.Placeholder((BATCH_SIZE, 1, 28, 28), dtype=flow.float),
    labels: tp.Numpy.Placeholder((BATCH_SIZE,), dtype=flow.int32),
) -> tp.Numpy:
    with flow.scope.placement("gpu", "0:0"):
        logits = lenet(images, train=True)
        loss = flow.nn.sparse_softmax_cross_entropy_with_logits(
            labels, logits, name="softmax_loss"
        )

    lr_scheduler = flow.optimizer.CosineScheduler(base_lr=0.01,
                                                  steps=10,
                                                  alpha=0.1)
    flow.optimizer.Adam(lr_scheduler).minimize(loss)

    return loss

property learning_rate_decay_conf

class oneflow.optimizer.CustomScheduler(lbn: str)

property learning_rate_decay_conf

class oneflow.optimizer.ExponentialScheduler(base_lr: float, steps: int, decay_rate: float, staircase=False, warmup: Optional[oneflow.python.ops.optimizer.WarmupConf] = None)

This operator creates an exponential decayed learning rate scheduler.

The learning rate will be updated as follows:

If staircase is set to False, the equation is:

\[ \begin{align}\begin{aligned}& pow = \frac{current\_batch}{decay\_batch}\\& learning\_rate = base\_learning\_rate*decay\_rate^{pow}\end{aligned}\end{align} \]

If staircase is set to True, the equation is:

\[ \begin{align}\begin{aligned}& pow = floor(\frac{current\_batch}{decay\_batch})\\& learning\_rate = base\_learning\_rate*decay\_rate^{pow}\end{aligned}\end{align} \]

Parameters

base_lr (float) – The base learning rate
steps (int) – The decay steps
decay_rate (float) – The decay rate
staircase (bool, optional) – If staircase is True, the scheduler decays the learning rate at discrete intervals. Defaults to False.
warmup (Optional[WarmupConf], optional) – The warmup strategy. Defaults to None.

For example:
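
The following plain-Python sketch evaluates the two equations of this scheduler. It is not a OneFlow training job; the helper name `exponential_lr` is hypothetical, `steps` plays the role of \(decay\_batch\), and warmup is ignored.

```python
import math


def exponential_lr(base_lr, steps, decay_rate, current_batch, staircase=False):
    """Learning rate at current_batch under exponential decay."""
    power = current_batch / steps
    if staircase:
        # Decay only at whole multiples of steps.
        power = math.floor(power)
    return base_lr * decay_rate ** power


for batch in (0, 5, 10, 15):
    smooth = exponential_lr(0.1, 10, 0.5, batch)
    stepped = exponential_lr(0.1, 10, 0.5, batch, staircase=True)
    print(batch, smooth, stepped)
```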

property learning_rate_decay_conf

class oneflow.optimizer.InverseTimeScheduler(base_lr: float, steps: int, decay_rate: float, staircase: bool = False, warmup: Optional[oneflow.python.ops.optimizer.WarmupConf] = None)

This operator creates an inverse time decayed learning rate scheduler.

The learning rate will be updated as follows:

If staircase is set to False, the equation is:

\[ \begin{align}\begin{aligned}& step\_ratio = \frac{current\_batch}{decay\_batch}\\& learning\_rate = \frac{base\_learning\_rate}{1+decay\_rate*step\_ratio}\end{aligned}\end{align} \]

If staircase is set to True, the equation is:

\[ \begin{align}\begin{aligned}& step\_ratio = \frac{current\_batch}{decay\_batch}\\& learning\_rate = \frac{base\_learning\_rate}{1+floor(decay\_rate*step\_ratio)}\end{aligned}\end{align} \]
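
A small sketch of both variants, following the equations exactly as written above (including where the floor is applied). The helper name `inverse_time_lr` is hypothetical, `steps` stands for \(decay\_batch\), and warmup is ignored.

```python
import math


def inverse_time_lr(base_lr, steps, decay_rate, current_batch, staircase=False):
    """Learning rate at current_batch under inverse time decay."""
    step_ratio = current_batch / steps
    scaled = decay_rate * step_ratio
    if staircase:
        # Floor placement mirrors the staircase equation in the text.
        scaled = math.floor(scaled)
    return base_lr / (1 + scaled)


for batch in (0, 5, 10):
    print(batch,
          inverse_time_lr(0.1, 5, 0.9, batch),
          inverse_time_lr(0.1, 5, 0.9, batch, staircase=True))
```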

Parameters

base_lr (float) – The base learning rate
steps (int) – The decay steps
decay_rate (float) – The decay rate
staircase (bool, optional) – If staircase is True, the scheduler decays the learning rate at discrete intervals. Defaults to False.
warmup (Optional[WarmupConf], optional) – The warmup strategy. Defaults to None.

For example:

import oneflow as flow
import oneflow.typing as tp

@flow.global_function(type="train")
def train_job(
        images: tp.Numpy.Placeholder((BATCH_SIZE, 1, 28, 28), dtype=flow.float),
        labels: tp.Numpy.Placeholder((BATCH_SIZE,), dtype=flow.int32),
) -> tp.Numpy:
    with flow.scope.placement("gpu", "0:0"):
        logits = lenet(images, train=True)
        loss = flow.nn.sparse_softmax_cross_entropy_with_logits(
            labels, logits, name="softmax_loss"
        )

    lr_scheduler = flow.optimizer.InverseTimeScheduler(base_lr=0.1,
                                                       steps=5,
                                                       decay_rate=0.9)
    flow.optimizer.SGD(lr_scheduler, momentum=0.9).minimize(loss)

    return loss

property learning_rate_decay_conf

class oneflow.optimizer.LAMB(lr_scheduler: oneflow.python.ops.optimizer.LrScheduler, beta1: float = 0.9, beta2: float = 0.999, epsilon: float = 1e-06, loss_scale_factor: Optional[float] = None, grad_clipping: Optional[oneflow.python.ops.optimizer.ClipGradientConf] = None, train_step_lbn: Optional[str] = None, loss_scale_policy: Optional[oneflow.python.ops.optimizer.LossScalePolicy] = None)

Parameters

lr_scheduler (LrScheduler) – The scheduler of learning rate.
beta1 (float, optional) – The exponential weighted average decay rate for the 1st-moment estimates (\(\beta_1\)). Defaults to 0.9.
beta2 (float, optional) – The exponential weighted average decay rate for the 2nd-moment estimates (\(\beta_2\)). Defaults to 0.999.
epsilon (float, optional) – A small float constant value for numerical stability (\(\epsilon\)). Defaults to 1e-6.
loss_scale_factor (Optional[float], optional) – The scale factor of loss. Defaults to None.
grad_clipping (Optional[ClipGradientConf], optional) – The gradient clipping strategy. Defaults to None.
train_step_lbn (Optional[Text], optional) – [description]. Defaults to None.
loss_scale_policy (Optional[LossScalePolicy]) – The policy of loss scale.

class oneflow.optimizer.LARS(lr_scheduler: oneflow.python.ops.optimizer.LrScheduler, momentum_beta: float = 0.9, epsilon: float = 1e-09, lars_coefficient: float = 0.0001, loss_scale_factor: Optional[float] = None, grad_clipping: Optional[oneflow.python.ops.optimizer.ClipGradientConf] = None, train_step_lbn: Optional[str] = None, loss_scale_policy: Optional[oneflow.python.ops.optimizer.LossScalePolicy] = None)

The optimizer of the LARS algorithm.

The parameter update equations are:

\[ \begin{align}\begin{aligned}& local\_learning\_rate = learning\_rate*lars\_coeff*\frac{\lVert{param_{old}}\rVert}{\epsilon+\lVert{grad}\rVert}\\& momentum_t = \beta*momentum_{t-1} + local\_learning\_rate*(grad)\\& param_{new} = param_{old} - momentum_t\end{aligned}\end{align} \]
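
The LARS update can be sketched for one layer whose weights are held in a flat Python list. This is an illustration of the equations only, under the assumption that the norms are Euclidean norms taken over the whole layer; `lars_step` is a hypothetical helper, not part of OneFlow.

```python
import math


def lars_step(params, grads, momentum, lr=0.1, momentum_beta=0.9,
              epsilon=1e-9, lars_coefficient=0.0001):
    """One LARS update for a single layer given as flat lists of floats."""
    param_norm = math.sqrt(sum(p * p for p in params))
    grad_norm = math.sqrt(sum(g * g for g in grads))
    # The layer-wise rate scales with the ratio of weight norm to gradient norm.
    local_lr = lr * lars_coefficient * param_norm / (epsilon + grad_norm)
    new_momentum = [momentum_beta * m + local_lr * g
                    for m, g in zip(momentum, grads)]
    new_params = [p - m for p, m in zip(params, new_momentum)]
    return new_params, new_momentum


params, momentum = lars_step([3.0, 4.0], [3.0, 4.0], [0.0, 0.0])
print(params, momentum)
```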

Parameters

lr_scheduler (LrScheduler) – The scheduler of learning rate.
momentum_beta (float, optional) – The momentum factor (\(\beta\)). Defaults to 0.9.
epsilon (float, optional) – A small float constant value for numerical stability (\(\epsilon\)). Defaults to 1e-9.
lars_coefficient (float, optional) – The coefficient factor, which defines how much we trust the layer to change its weights (\(lars\_coeff\)). Defaults to 0.0001.
loss_scale_factor (Optional[float], optional) – The scale factor of loss. Defaults to None.
grad_clipping (Optional[ClipGradientConf], optional) – The gradient clipping strategy. Defaults to None.
train_step_lbn (Optional[Text], optional) – [description]. Defaults to None.
loss_scale_policy (Optional[LossScalePolicy]) – The policy of loss scale.

For example:

import oneflow as flow
import oneflow.typing as tp

@flow.global_function(type="train")
def train_job(
        images: tp.Numpy.Placeholder((BATCH_SIZE, 1, 28, 28), dtype=flow.float),
        labels: tp.Numpy.Placeholder((BATCH_SIZE,), dtype=flow.int32),
) -> tp.Numpy:
    with flow.scope.placement("gpu", "0:0"):
        logits = lenet(images, train=True)
        loss = flow.nn.sparse_softmax_cross_entropy_with_logits(
            labels, logits, name="softmax_loss"
        )
    # Set learning rate as 0.1
    lr_scheduler = flow.optimizer.PiecewiseConstantScheduler([], [0.1])
    # Set LARS optimizer, momentum factor is 0.9
    flow.optimizer.LARS(lr_scheduler, momentum_beta=0.9).minimize(loss)

    return loss

class oneflow.optimizer.LazyAdam(lr_scheduler: oneflow.python.ops.optimizer.LrScheduler, beta1: float = 0.9, beta2: float = 0.999, epsilon: float = 1e-08, loss_scale_factor: Optional[float] = None, grad_clipping: Optional[oneflow.python.ops.optimizer.ClipGradientConf] = None, train_step_lbn: Optional[str] = None, loss_scale_policy: Optional[oneflow.python.ops.optimizer.LossScalePolicy] = None)

The optimizer of the LazyAdam algorithm.

This algorithm can adjust the learning rate of each parameter dynamically according to the 1st-moment estimates and the 2nd-moment estimates of the gradient.

The difference between the Adam optimizer and the LazyAdam optimizer is that LazyAdam only updates the elements that have a gradient in the current batch, which makes it faster than the Adam optimizer.

\[ \begin{align}\begin{aligned}& V_t = \beta_1*V_{t-1} + (1-\beta_1)*grad\\& S_t = \beta_2*S_{t-1} + (1-\beta_2)*{grad} \odot {grad}\\& \hat{g} = learning\_rate*\frac{{V_t}}{\sqrt{{S_t}}+\epsilon}\\& param_{new} = param_{old} - \hat{g}\end{aligned}\end{align} \]
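
The "lazy" behaviour can be illustrated with a sparse gradient stored as a dictionary from element index to gradient value. This representation and the helper name `lazy_adam_step` are assumptions made for the sketch; it is not how OneFlow stores gradients internally.

```python
import math


def lazy_adam_step(params, v, s, sparse_grads, lr=0.001, beta1=0.9,
                   beta2=0.999, epsilon=1e-8):
    """Update in place only the indices present in sparse_grads."""
    for i, grad in sparse_grads.items():
        v[i] = beta1 * v[i] + (1 - beta1) * grad
        s[i] = beta2 * s[i] + (1 - beta2) * grad * grad
        params[i] -= lr * v[i] / (math.sqrt(s[i]) + epsilon)


params = [1.0, 1.0, 1.0]
v = [0.0, 0.0, 0.0]
s = [0.0, 0.0, 0.0]
lazy_adam_step(params, v, s, {1: 2.0})  # only index 1 received a gradient
print(params)
```

Elements without a gradient keep both their value and their moment estimates, whereas a dense Adam step would still decay the moments of every element.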

Parameters

lr_scheduler (LrScheduler) – The scheduler of learning rate.
beta1 (float, optional) – The exponential weighted average decay rate for the 1st-moment estimates (\(\beta_1\)). Defaults to 0.9.
beta2 (float, optional) – The exponential weighted average decay rate for the 2nd-moment estimates (\(\beta_2\)). Defaults to 0.999.
epsilon (float, optional) – A small float constant value for numerical stability (\(\epsilon\)). Defaults to 1e-8.
loss_scale_factor (Optional[float], optional) – The scale factor of loss. Defaults to None.
grad_clipping (Optional[ClipGradientConf], optional) – The gradient clipping strategy. Defaults to None.
train_step_lbn (Optional[Text], optional) – [description]. Defaults to None.
loss_scale_policy (Optional[LossScalePolicy]) – The policy of loss scale.

For example:

import oneflow as flow
import oneflow.typing as tp

@flow.global_function(type="train")
def train_job(
    images: tp.Numpy.Placeholder((BATCH_SIZE, 1, 28, 28), dtype=flow.float),
    labels: tp.Numpy.Placeholder((BATCH_SIZE,), dtype=flow.int32),
) -> tp.Numpy:
    with flow.scope.placement("gpu", "0:0"):
        logits = lenet(images, train=True)
        loss = flow.nn.sparse_softmax_cross_entropy_with_logits(
            labels, logits, name="softmax_loss"
        )
    # Set learning rate as 0.001
    lr_scheduler = flow.optimizer.PiecewiseConstantScheduler([], [0.001])
    # Set LazyAdam optimizer
    flow.optimizer.LazyAdam(lr_scheduler).minimize(loss)

    return loss

class oneflow.optimizer.LinearCosineScheduler(base_lr: float, steps: int, num_periods: float = 0.5, alpha: float = 0.0, beta: float = 0.001, warmup: Optional[oneflow.python.ops.optimizer.WarmupConf] = None)

This operator creates a linear cosine decayed learning rate scheduler.

The learning rate will be updated as follows:

\[ \begin{align}\begin{aligned}& current\_batch = min(current\_batch, decay\_batch)\\& linear\_decay = \frac{(decay\_batch - current\_batch)}{decay\_batch}\\& cosine\_decay = 0.5*(1.0+cos(2*\pi*num\_periods*\frac{current\_batch}{decay\_batch}))\\& decay\_factor = (\alpha+linear\_decay)*cosine\_decay + \beta\\& learning\_rate = base\_learning\_rate*decay\_factor\end{aligned}\end{align} \]
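
A direct transcription of these equations into plain Python, as a sketch only: `linear_cosine_lr` is a hypothetical helper, `steps` stands for \(decay\_batch\), and warmup is ignored.

```python
import math


def linear_cosine_lr(base_lr, steps, current_batch, num_periods=0.5,
                     alpha=0.0, beta=0.001):
    """Learning rate at current_batch under linear cosine decay."""
    current_batch = min(current_batch, steps)
    linear_decay = (steps - current_batch) / steps
    cosine_decay = 0.5 * (
        1.0 + math.cos(2 * math.pi * num_periods * current_batch / steps)
    )
    decay_factor = (alpha + linear_decay) * cosine_decay + beta
    return base_lr * decay_factor


for batch in (0, 5, 10, 20):
    print(batch, linear_cosine_lr(base_lr=0.1, steps=10, current_batch=batch))
```

With the default `alpha` and `num_periods`, the rate settles at `base_lr * beta` once `current_batch` reaches `steps`.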

Parameters

base_lr (float) – The base learning rate
steps (int) – The decay steps
num_periods (float, optional) – The number of decay periods. Defaults to 0.5.
alpha (float, optional) – The \(\alpha\) in equation. Defaults to 0.0.
beta (float, optional) – The \(\beta\) in equation. Defaults to 0.001.
warmup (Optional[WarmupConf], optional) – The warmup strategy. Defaults to None.

For example:

import oneflow as flow
import oneflow.typing as tp

@flow.global_function(type="train")
def train_job(
        images: tp.Numpy.Placeholder((BATCH_SIZE, 1, 28, 28), dtype=flow.float),
        labels: tp.Numpy.Placeholder((BATCH_SIZE,), dtype=flow.int32),
) -> tp.Numpy:
    with flow.scope.placement("gpu", "0:0"):
        logits = lenet(images, train=True)
        loss = flow.nn.sparse_softmax_cross_entropy_with_logits(
            labels, logits, name="softmax_loss"
        )

    lr_scheduler = flow.optimizer.LinearCosineScheduler(base_lr=0.1,
                                                        steps=10)
    flow.optimizer.SGD(lr_scheduler, momentum=0.9).minimize(loss)

    return loss

property learning_rate_decay_conf

class oneflow.optimizer.NaturalExpScheduler(base_lr: float, steps: int, decay_rate: float, staircase: bool = False, warmup: Optional[oneflow.python.ops.optimizer.WarmupConf] = None)

This operator creates a natural exponential decayed learning rate scheduler.

The learning rate will be updated as follows:

If staircase is set to False, the equation is:

\[ \begin{align}\begin{aligned}& step\_ratio = \frac{current\_batch}{decay\_batch}\\& learning\_rate = {base\_learning\_rate}*e^{-decay\_rate*step\_ratio}\end{aligned}\end{align} \]

If staircase is set to True, the equation is:

\[ \begin{align}\begin{aligned}& step\_ratio = \frac{current\_batch}{decay\_batch}\\& learning\_rate = {base\_learning\_rate}*e^{-decay\_rate*floor(step\_ratio)}\end{aligned}\end{align} \]
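
Both variants can be checked numerically with a short plain-Python sketch. The helper name `natural_exp_lr` is hypothetical, `steps` stands for \(decay\_batch\), and warmup is ignored.

```python
import math


def natural_exp_lr(base_lr, steps, decay_rate, current_batch, staircase=False):
    """Learning rate at current_batch under natural exponential decay."""
    step_ratio = current_batch / steps
    if staircase:
        # Hold the rate constant between whole multiples of steps.
        step_ratio = math.floor(step_ratio)
    return base_lr * math.exp(-decay_rate * step_ratio)


for batch in (0, 10, 15):
    print(batch,
          natural_exp_lr(0.1, 10, 0.5, batch),
          natural_exp_lr(0.1, 10, 0.5, batch, staircase=True))
```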

Parameters

base_lr (float) – The base learning rate
steps (int) – The decay steps
decay_rate (float) – The decay rate
staircase (bool, optional) – If staircase is True, the scheduler decays the learning rate at discrete intervals. Defaults to False.
warmup (Optional[WarmupConf], optional) – The warmup strategy. Defaults to None.

For example:

import oneflow as flow
import oneflow.typing as tp

@flow.global_function(type="train")
def train_job(
        images: tp.Numpy.Placeholder((BATCH_SIZE, 1, 28, 28), dtype=flow.float),
        labels: tp.Numpy.Placeholder((BATCH_SIZE,), dtype=flow.int32),
) -> tp.Numpy:
    with flow.scope.placement("gpu", "0:0"):
        logits = lenet(images, train=True)
        loss = flow.nn.sparse_softmax_cross_entropy_with_logits(
            labels, logits, name="softmax_loss"
        )

    lr_scheduler = flow.optimizer.NaturalExpScheduler(base_lr=0.1,
                                                      steps=10,
                                                      decay_rate=0.5)
    flow.optimizer.SGD(lr_scheduler, momentum=0.9).minimize(loss)

    return loss

property learning_rate_decay_conf

class oneflow.optimizer.PiecewiseConstantScheduler(boundaries: Sequence[int], values: Sequence[float], warmup: Optional[oneflow.python.ops.optimizer.WarmupConf] = None)

This operator creates a piecewise constant learning rate scheduler.

The change in learning rate can be described as follows:

boundaries = [1000, 2000]
values = [0.1, 0.01, 0.001]

if current_step < 1000:
    learning_rate = 0.1
elif 1000 <= current_step < 2000:
    learning_rate = 0.01
else:
    learning_rate = 0.001
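
The same lookup can be written for arbitrary boundaries with the standard `bisect` module. This is a sketch under two assumptions: `values` has exactly one more entry than `boundaries`, and a step equal to a boundary belongs to the later interval. The helper name `piecewise_constant_lr` is hypothetical.

```python
from bisect import bisect_right


def piecewise_constant_lr(boundaries, values, current_step):
    """Return the learning rate of the interval containing current_step."""
    if len(values) != len(boundaries) + 1:
        raise ValueError("values must have one more entry than boundaries")
    # bisect_right counts how many boundaries are <= current_step.
    return values[bisect_right(boundaries, current_step)]


for step in (500, 1500, 2500):
    print(step, piecewise_constant_lr([1000, 2000], [0.1, 0.01, 0.001], step))
```

An empty `boundaries` list with a single value yields a constant learning rate.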

Parameters

boundaries (Sequence[int]) – A list of training steps at which the learning rate changes.
values (Sequence[float]) – A list of learning rate values, one for each interval delimited by the boundaries.
warmup (Optional[WarmupConf], optional) – The warmup strategy. Defaults to None.

For example:

import oneflow as flow
import oneflow.typing as tp

@flow.global_function(type="train")
def train_job(
        images: tp.Numpy.Placeholder((BATCH_SIZE, 1, 28, 28), dtype=flow.float),
        labels: tp.Numpy.Placeholder((BATCH_SIZE,), dtype=flow.int32),
) -> tp.Numpy:
    with flow.scope.placement("gpu", "0:0"):
        logits = lenet(images, train=True)
        loss = flow.nn.sparse_softmax_cross_entropy_with_logits(
            labels, logits, name="softmax_loss"
        )

    lr_scheduler = flow.optimizer.PiecewiseConstantScheduler(boundaries=[10, 20],
                                                             values=[0.1, 0.01, 0.001])
    flow.optimizer.Adam(lr_scheduler).minimize(loss)

    return loss

property learning_rate_decay_conf

class oneflow.optimizer.PiecewiseScalingScheduler(base_lr: float, boundaries: Sequence[int], scale: Union[float, Sequence[float]], warmup: Optional[oneflow.python.ops.optimizer.WarmupConf] = None)

This operator creates a piecewise-scaled learning rate scheduler.

The change in learning rate can be described as follows:

boundaries = [1000, 2000]
scale = [0.1, 0.01]
base_lr = 0.1

if current_step < 1000:
    learning_rate = base_lr
elif 1000 <= current_step < 2000:
    learning_rate = 0.1*base_lr
else:
    learning_rate = 0.01*base_lr
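
A generalized version of this lookup, as a sketch only. It assumes `scale` is a sequence with one factor per boundary (the case of a single float `scale` is not covered here) and that a step equal to a boundary belongs to the later interval. The helper name `piecewise_scaling_lr` is hypothetical.

```python
from bisect import bisect_right


def piecewise_scaling_lr(base_lr, boundaries, scale, current_step):
    """Return base_lr times the scale factor of the current interval."""
    if len(scale) != len(boundaries):
        raise ValueError("scale must have one entry per boundary")
    # Before the first boundary the base learning rate is used unscaled.
    factors = [1.0] + list(scale)
    return base_lr * factors[bisect_right(boundaries, current_step)]


for step in (500, 1500, 2500):
    print(step, piecewise_scaling_lr(0.1, [1000, 2000], [0.1, 0.01], step))
```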

Parameters

base_lr (float) – The base learning rate
boundaries (Sequence[int]) – A list of training steps at which the learning rate changes.
scale (Union[float, Sequence[float]]) – The learning rate scale factor, or a list of scale factors applied after each boundary.
warmup (Optional[WarmupConf], optional) – The warmup strategy. Defaults to None.

For example:

import oneflow as flow
import oneflow.typing as tp

@flow.global_function(type="train")
def train_job(
    images: tp.Numpy.Placeholder((BATCH_SIZE, 1, 28, 28), dtype=flow.float),
    labels: tp.Numpy.Placeholder((BATCH_SIZE,), dtype=flow.int32),
) -> tp.Numpy:
    with flow.scope.placement("gpu", "0:0"):
        logits = lenet(images, train=True)
        loss = flow.nn.sparse_softmax_cross_entropy_with_logits(
            labels, logits, name="softmax_loss"
        )

    lr_scheduler = flow.optimizer.PiecewiseScalingScheduler(base_lr=0.1,
                                                            boundaries=[5, 10],
                                                            scale=[0.5, 0.1])
    flow.optimizer.SGD(lr_scheduler, momentum=0).minimize(loss)

    return loss

property learning_rate_decay_conf

class oneflow.optimizer.PolynomialScheduler(base_lr: float, steps: int, end_learning_rate: float = 0.0001, power: float = 1.0, cycle: bool = False, warmup: Optional[oneflow.python.ops.optimizer.WarmupConf] = None)

This operator creates a polynomial decayed learning rate scheduler.

The learning rate will be updated as follows:

If cycle is True, the equation is:

\[ \begin{align}\begin{aligned}& decay\_batch = decay\_batch*ceil(\frac{current\_batch}{decay\_batch})\\& learning\_rate = (base\_lr-end\_lr)*(1-\frac{current\_batch}{decay\_batch})^{pow}+end\_lr\end{aligned}\end{align} \]

If cycle is False, the equation is:

\[ \begin{align}\begin{aligned}& current\_batch = min(decay\_batch, current\_batch)\\& learning\_rate = (base\_lr-end\_lr)*(1-\frac{current\_batch}{decay\_batch})^{pow}+end\_lr\end{aligned}\end{align} \]
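
Both modes can be sketched in plain Python. This is not OneFlow's implementation: `polynomial_lr` is a hypothetical helper, `steps` stands for \(decay\_batch\), and warmup is ignored. When cycle is False the sketch clamps the current step to the decay steps so that the rate stops at the final learning rate; treating step 0 as part of the first cycle is an assumption made to avoid a division by zero.

```python
import math


def polynomial_lr(base_lr, steps, current_batch, end_learning_rate=0.0001,
                  power=1.0, cycle=False):
    """Learning rate at current_batch under polynomial decay."""
    decay_batch = steps
    if cycle:
        # Stretch the decay window to the next multiple of steps.
        # max(1, ...) keeps step 0 in the first window (an assumption).
        decay_batch = steps * max(1, math.ceil(current_batch / steps))
    else:
        # Clamp so the rate stays at end_learning_rate after the window.
        current_batch = min(current_batch, steps)
    ratio = 1 - current_batch / decay_batch
    return (base_lr - end_learning_rate) * ratio ** power + end_learning_rate


for batch in (0, 3, 5, 8):
    print(batch,
          polynomial_lr(0.001, 5, batch, end_learning_rate=0.00001, power=2),
          polynomial_lr(0.001, 5, batch, end_learning_rate=0.00001, power=2,
                        cycle=True))
```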

Parameters

base_lr (float) – The base learning rate
steps (int) – The decay steps
end_learning_rate (float, optional) – The final learning rate. Defaults to 0.0001.
power (float, optional) – The power of polynomial. Defaults to 1.0.
cycle (bool, optional) – If cycle is true, the scheduler will decay the learning rate every decay steps. Defaults to False.
warmup (Optional[WarmupConf], optional) – The warmup strategy. Defaults to None.

For example:

import oneflow as flow
import oneflow.typing as tp

@flow.global_function(type="train")
def train_job(
        images: tp.Numpy.Placeholder((BATCH_SIZE, 1, 28, 28), dtype=flow.float),
        labels: tp.Numpy.Placeholder((BATCH_SIZE,), dtype=flow.int32),
) -> tp.Numpy:
    with flow.scope.placement("gpu", "0:0"):
        logits = lenet(images, train=True)
        loss = flow.nn.sparse_softmax_cross_entropy_with_logits(
            labels, logits, name="softmax_loss"
        )

    lr_scheduler = flow.optimizer.PolynomialScheduler(base_lr=0.001,
                                                     steps=5,
                                                     end_learning_rate=0.00001,
                                                     power=2)
    flow.optimizer.Adam(lr_scheduler).minimize(loss)

    return loss

property learning_rate_decay_conf

class oneflow.optimizer.RMSProp(lr_scheduler: oneflow.python.ops.optimizer.LrScheduler, decay_rate: float = 0.99, epsilon: float = 1e-08, centered: bool = False, loss_scale_factor: Optional[float] = None, grad_clipping: Optional[oneflow.python.ops.optimizer.ClipGradientConf] = None, train_step_lbn: Optional[str] = None, loss_scale_policy: Optional[oneflow.python.ops.optimizer.LossScalePolicy] = None)

The optimizer of the RMSProp algorithm.

This algorithm uses the running mean of the squared gradient, denoted \(S_t\) below, to adjust the learning rate.

The parameter update equations are:

if centered:

\[ \begin{align}\begin{aligned}& mg_t = mg_{t-1} * \beta_1 + (1 - \beta_1) * grad\\& denom_t = S_t - mg_t * mg_t\end{aligned}\end{align} \]

else:

\[denom_t = S_t\]

\[param_{new} = param_{old} - \frac{learning\_rate}{\sqrt{denom_t+\epsilon}} \odot grad\]
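
A scalar sketch of both branches follows. The equations above do not spell out how \(S_t\) is updated, so the sketch assumes the usual exponential moving average of the squared gradient with the same decay factor \(\beta_1\); this, and the helper name `rmsprop_step`, are assumptions rather than OneFlow's documented behaviour.

```python
import math


def rmsprop_step(param, grad, s, mg, lr=0.001, decay_rate=0.99,
                 epsilon=1e-8, centered=False):
    """One RMSProp update for a scalar. Returns (param, s, mg)."""
    # Assumed update of S_t: moving average of the squared gradient.
    s = decay_rate * s + (1 - decay_rate) * grad * grad
    if centered:
        # Moving average of the gradient, used to estimate the variance.
        mg = decay_rate * mg + (1 - decay_rate) * grad
        denom = s - mg * mg
    else:
        denom = s
    param = param - lr / math.sqrt(denom + epsilon) * grad
    return param, s, mg


print(rmsprop_step(1.0, 2.0, 0.0, 0.0))
print(rmsprop_step(1.0, 2.0, 0.0, 0.0, centered=True))
```

The centered variant subtracts the squared mean gradient from the denominator, so for the same inputs it takes a slightly larger step.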

Parameters

lr_scheduler (LrScheduler) – The scheduler of learning rate.
decay_rate (float, optional) – The decay factor (\(\beta_1\)). Defaults to 0.99.
epsilon (float, optional) – A small float constant value for numerical stability (\(\epsilon\)). Defaults to 1e-8.
centered (bool, optional) – If True, gradients are normalized by the estimated variance of the gradient; if False, by the uncentered second moment. Setting this to True may help with training, but is slightly more expensive in terms of computation and memory. Defaults to False.
loss_scale_factor (Optional[float], optional) – The scale factor of loss. Defaults to None.
grad_clipping (Optional[ClipGradientConf], optional) – The gradient clipping strategy. Defaults to None.
train_step_lbn (Optional[Text], optional) – [description]. Defaults to None.
loss_scale_policy (Optional[LossScalePolicy], optional) – The loss scaling policy. Defaults to None.

For example:

import oneflow as flow
import oneflow.typing as tp

@flow.global_function(type="train")
def train_job(
    images: tp.Numpy.Placeholder((BATCH_SIZE, 1, 28, 28), dtype=flow.float),
    labels: tp.Numpy.Placeholder((BATCH_SIZE,), dtype=flow.int32),
) -> tp.Numpy:
    with flow.scope.placement("gpu", "0:0"):
        logits = lenet(images, train=True)
        loss = flow.nn.sparse_softmax_cross_entropy_with_logits(
            labels, logits, name="softmax_loss"
        )
    # Set learning rate as 0.001
    lr_scheduler = flow.optimizer.PiecewiseConstantScheduler([], [0.001])
    # Set RMSProp optimizer
    flow.optimizer.RMSProp(lr_scheduler).minimize(loss)

    return loss

__init__(lr_scheduler: oneflow.python.ops.optimizer.LrScheduler, decay_rate: float = 0.99, epsilon: float = 1e-08, centered: bool = False, loss_scale_factor: Optional[float] = None, grad_clipping: Optional[oneflow.python.ops.optimizer.ClipGradientConf] = None, train_step_lbn: Optional[str] = None, loss_scale_policy: Optional[oneflow.python.ops.optimizer.LossScalePolicy] = None)¶: Initialize self. See help(type(self)) for accurate signature.

class oneflow.optimizer.SGD(lr_scheduler: oneflow.python.ops.optimizer.LrScheduler, loss_scale_factor: Optional[float] = None, momentum: float = 0.9, grad_clipping: Optional[oneflow.python.ops.optimizer.ClipGradientConf] = None, train_step_lbn: Optional[str] = None, loss_scale_policy: Optional[oneflow.python.ops.optimizer.LossScalePolicy] = None)¶

The optimizer of the stochastic gradient descent algorithm.

This algorithm uses the gradient of a randomly sampled mini-batch as an approximate estimate of the overall gradient.

When momentum = 0, the parameter update equation is:

\[param_{new} = param_{old} - learning\_rate*grad\]

With momentum, the parameter update equations are:

\[ \begin{align}\begin{aligned}& V_{t} = \beta*V_{t-1} + learning\_rate*g_t\\& param_{new} = param_{old} - V_{t}\end{aligned}\end{align} \]
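
The two update rules above can be sketched in NumPy as follows. This is an illustrative sketch only; the function name `sgd_step` and the explicit `velocity` state are hypothetical and not part of the OneFlow API.

```python
import numpy as np


def sgd_step(param, grad, velocity, lr=0.1, momentum=0.9):
    """Apply one SGD update and return the new (param, velocity)."""
    if momentum == 0:
        # param_new = param_old - learning_rate * grad
        return param - lr * grad, velocity
    # V_t = beta * V_{t-1} + learning_rate * g_t
    velocity = momentum * velocity + lr * grad
    # param_new = param_old - V_t
    return param - velocity, velocity


param = np.array([1.0])
velocity = np.zeros_like(param)
for _ in range(2):
    # A constant gradient of 2.0 is used purely for illustration.
    param, velocity = sgd_step(param, np.array([2.0]), velocity)
print(param, velocity)
```

With a zero initial velocity, the first momentum step coincides with the plain SGD step; the accumulated velocity only changes the result from the second step onward.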

Parameters

lr_scheduler (LrScheduler) – The scheduler of learning rate.
loss_scale_factor (Optional[float], optional) – The scale factor of loss. Defaults to None.
momentum (float, optional) – Momentum factor (\(\beta\)). Defaults to 0.9.
grad_clipping (Optional[ClipGradientConf], optional) – The gradient clipping strategy. Defaults to None.
train_step_lbn (Optional[Text], optional) – [description]. Defaults to None.
loss_scale_policy (Optional[LossScalePolicy], optional) – The loss scaling policy. Defaults to None.

For example:

import oneflow as flow
import oneflow.typing as tp

@flow.global_function(type="train")
def train_job(
    images: tp.Numpy.Placeholder((BATCH_SIZE, 1, 28, 28), dtype=flow.float),
    labels: tp.Numpy.Placeholder((BATCH_SIZE,), dtype=flow.int32),
) -> tp.Numpy:
    with flow.scope.placement("gpu", "0:0"):
        logits = lenet(images, train=True)
        loss = flow.nn.sparse_softmax_cross_entropy_with_logits(
            labels, logits, name="softmax_loss"
        )

    # Set Learning rate as 0.1
    lr_scheduler = flow.optimizer.PiecewiseConstantScheduler([], [0.1])
    # Set Momentum=0.9 SGD optimizer
    flow.optimizer.SGD(lr_scheduler, momentum=0.9).minimize(loss)

    return loss

__init__(lr_scheduler: oneflow.python.ops.optimizer.LrScheduler, loss_scale_factor: Optional[float] = None, momentum: float = 0.9, grad_clipping: Optional[oneflow.python.ops.optimizer.ClipGradientConf] = None, train_step_lbn: Optional[str] = None, loss_scale_policy: Optional[oneflow.python.ops.optimizer.LossScalePolicy] = None)¶: Initialize self. See help(type(self)) for accurate signature.

class oneflow.optimizer.SGDW(lr_scheduler: oneflow.python.ops.optimizer.LrScheduler, loss_scale_factor: Optional[float] = None, momentum: float = 0.9, weight_decay: Optional[float] = None, weight_decay_includes: Union[Sequence[str], str, None] = None, weight_decay_excludes: Union[Sequence[str], str, None] = None, grad_clipping: Optional[oneflow.python.ops.optimizer.ClipGradientConf] = None, train_step_lbn: Optional[str] = None, loss_scale_policy: Optional[oneflow.python.ops.optimizer.LossScalePolicy] = None)¶

The optimizer of the stochastic-gradient-descent-weight-decay algorithm.

For more details, refer to Decoupled Weight Decay Regularization.

When momentum = 0, the parameter update equation is:

\[param_{new} = param_{old} - learning\_rate*(grad + \lambda*param_{old})\]

With momentum, the parameter update equations are:

\[ \begin{align}\begin{aligned}& V_{t} = \beta*V_{t-1} - learning\_rate*g_t\\& param_{new} = param_{old} + V_{t} - learning\_rate * \lambda*param_{old}\end{aligned}\end{align} \]
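
As a rough NumPy sketch of the two SGDW update rules above (the function name `sgdw_step` and the explicit `velocity` state are hypothetical, chosen only for illustration):

```python
import numpy as np


def sgdw_step(param, grad, velocity, lr=0.1, momentum=0.9, weight_decay=0.00005):
    """Apply one SGDW update and return the new (param, velocity)."""
    if momentum == 0:
        # param_new = param_old - learning_rate * (grad + lambda * param_old)
        return param - lr * (grad + weight_decay * param), velocity
    # V_t = beta * V_{t-1} - learning_rate * g_t
    velocity = momentum * velocity - lr * grad
    # param_new = param_old + V_t - learning_rate * lambda * param_old
    return param + velocity - lr * weight_decay * param, velocity


param = np.array([1.0])
velocity = np.zeros_like(param)
# A large weight_decay is used here only to make its effect visible.
param, velocity = sgdw_step(param, np.array([2.0]), velocity, weight_decay=0.5)
print(param, velocity)
```

The weight decay term is applied to the parameter directly and is not accumulated into the velocity, which is what distinguishes this rule from adding an L2 penalty to the gradient.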

Parameters

lr_scheduler (LrScheduler) – The scheduler of learning rate.
loss_scale_factor (Optional[float], optional) – The scale factor of loss. Defaults to None.
momentum (float, optional) – Momentum factor (\(\beta\)). Defaults to 0.9.
weight_decay (Optional[float], optional) – The weight decay factor (\(\lambda\) in the equations). Defaults to None.
weight_decay_includes (Optional[Union[Sequence[Text], Text]], optional) – The names of the model parameters that use weight decay. Defaults to None.
weight_decay_excludes (Optional[Union[Sequence[Text], Text]], optional) – The names of the model parameters that do not use weight decay. Defaults to None.
grad_clipping (Optional[ClipGradientConf], optional) – The gradient clipping strategy. Defaults to None.
train_step_lbn (Optional[Text], optional) – [description]. Defaults to None.
loss_scale_policy (Optional[LossScalePolicy], optional) – The loss scaling policy. Defaults to None.

Note

Only one of weight_decay_includes and weight_decay_excludes can be set. If both are None, all the model parameters will use weight decay.

For example:

import oneflow as flow
import oneflow.typing as tp

@flow.global_function(type="train")
def train_job(
    images: tp.Numpy.Placeholder((BATCH_SIZE, 1, 28, 28), dtype=flow.float),
    labels: tp.Numpy.Placeholder((BATCH_SIZE,), dtype=flow.int32),
) -> tp.Numpy:
    with flow.scope.placement("gpu", "0:0"):
        logits = lenet(images, train=True)
        loss = flow.nn.sparse_softmax_cross_entropy_with_logits(
            labels, logits, name="softmax_loss"
        )

    # Set Learning rate as 0.1
    lr_scheduler = flow.optimizer.PiecewiseConstantScheduler([], [0.1])
    # Set Momentum=0.9 SGDW optimizer, weight_decay factor is 0.00005
    flow.optimizer.SGDW(lr_scheduler, momentum=0.9, weight_decay=0.00005).minimize(loss)

    return loss

__init__(lr_scheduler: oneflow.python.ops.optimizer.LrScheduler, loss_scale_factor: Optional[float] = None, momentum: float = 0.9, weight_decay: Optional[float] = None, weight_decay_includes: Union[Sequence[str], str, None] = None, weight_decay_excludes: Union[Sequence[str], str, None] = None, grad_clipping: Optional[oneflow.python.ops.optimizer.ClipGradientConf] = None, train_step_lbn: Optional[str] = None, loss_scale_policy: Optional[oneflow.python.ops.optimizer.LossScalePolicy] = None)¶: Initialize self. See help(type(self)) for accurate signature.

class oneflow.optimizer.warmup.constant(steps, multiplier)¶

This operator uses the constant warmup strategy to adjust the learning rate.

Before the warmup steps specified by the user have elapsed, the learning rate is:

\[learning\_rate = base\_learning\_rate*multiplier\]

After the warmup steps have elapsed, the learning rate is:

\[learning\_rate = base\_learning\_rate\]
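
The constant warmup rule can be written as a small pure-Python function. This is a sketch under assumptions: the name `constant_warmup_lr` is hypothetical, and treating the warmup phase as `train_step < steps` (with steps counted from 0) is an assumed boundary convention that the text above does not spell out.

```python
def constant_warmup_lr(train_step, base_lr, steps, multiplier):
    """Return the learning rate at train_step under constant warmup."""
    # ASSUMPTION: the warmup phase covers train_step = 0 .. steps - 1.
    if train_step < steps:
        # learning_rate = base_learning_rate * multiplier
        return base_lr * multiplier
    # learning_rate = base_learning_rate
    return base_lr


# Same settings as warmup.constant(10, 0.1) with a base rate of 0.01.
for step in (0, 9, 10):
    print(step, constant_warmup_lr(step, base_lr=0.01, steps=10, multiplier=0.1))
```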

Parameters

steps (int) – The warmup steps.
multiplier (float) – The scale factor (\(multiplier\)). It should be greater than 0 and less than 1.

For example:

import oneflow as flow
import oneflow.typing as tp

@flow.global_function(type="train")
def train_job(
    images: tp.Numpy.Placeholder((BATCH_SIZE, 1, 28, 28), dtype=flow.float),
    labels: tp.Numpy.Placeholder((BATCH_SIZE,), dtype=flow.int32),
) -> tp.Numpy:
    with flow.scope.placement("gpu", "0:0"):
        logits = lenet(images, train=True)
        loss = flow.nn.sparse_softmax_cross_entropy_with_logits(
            labels, logits, name="softmax_loss"
        )

    # For the first 10 steps, the learning rate is 0.001 (0.01 * 0.1)
    # After 10 steps, the learning rate is 0.01
    warmup_scheduler = flow.optimizer.warmup.constant(10, 0.1)
    lr_scheduler = flow.optimizer.PiecewiseConstantScheduler([], [0.01], warmup=warmup_scheduler)
    flow.optimizer.Adam(lr_scheduler).minimize(loss)

    return loss

__init__(steps, multiplier)¶: Initialize self. See help(type(self)) for accurate signature.

property warmup_conf¶

class oneflow.optimizer.warmup.linear(steps, start_multiplier)¶

This operator uses the linear warmup strategy to adjust the learning rate.

When the current training step is less than the warmup steps, the learning rate is updated as:

\[ \begin{align}\begin{aligned}& current\_multiplier = start\_multiplier + (1-start\_multiplier)*\frac{train\_step}{warmup\_step}\\& current\_learning\_rate = learning\_rate*current\_multiplier\end{aligned}\end{align} \]
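
A minimal pure-Python sketch of the linear warmup formula above follows. The name `linear_warmup_lr` is hypothetical, and returning the scheduler's learning rate unchanged once warmup is over is an assumption; the text above only describes the warmup phase.

```python
def linear_warmup_lr(train_step, learning_rate, steps, start_multiplier):
    """Return the learning rate at train_step under linear warmup."""
    if train_step < steps:
        # current_multiplier = start_multiplier
        #     + (1 - start_multiplier) * train_step / warmup_step
        current_multiplier = (
            start_multiplier + (1.0 - start_multiplier) * train_step / steps
        )
        return learning_rate * current_multiplier
    # ASSUMPTION: after warmup the scheduler's learning rate is used as-is.
    return learning_rate


# Same settings as warmup.linear(10, 0.1) with a base rate of 0.01.
for step in (0, 5, 10):
    print(step, linear_warmup_lr(step, learning_rate=0.01, steps=10, start_multiplier=0.1))
```

The multiplier starts at `start_multiplier` for step 0 and approaches 1 as `train_step` approaches the warmup steps, so the learning rate rises from 0.001 toward 0.01 in this setting.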

Parameters

steps (int) – The warmup steps.
start_multiplier (float) – The start multiplier (\(start\_multiplier\)). It should be greater than 0 and less than 1.

For example:

import oneflow as flow
import oneflow.typing as tp

@flow.global_function(type="train")
def train_job(
    images: tp.Numpy.Placeholder((BATCH_SIZE, 1, 28, 28), dtype=flow.float),
    labels: tp.Numpy.Placeholder((BATCH_SIZE,), dtype=flow.int32),
) -> tp.Numpy:
    with flow.scope.placement("gpu", "0:0"):
        logits = lenet(images, train=True)
        loss = flow.nn.sparse_softmax_cross_entropy_with_logits(
            labels, logits, name="softmax_loss"
        )

    # During the first 10 steps, the learning rate increases linearly from 0.001 to 0.01.
    warmup_scheduler = flow.optimizer.warmup.linear(10, 0.1)
    lr_scheduler = flow.optimizer.PiecewiseConstantScheduler([], [0.01], warmup=warmup_scheduler)
    flow.optimizer.Adam(lr_scheduler).minimize(loss)

    return loss

__init__(steps, start_multiplier)¶: Initialize self. See help(type(self)) for accurate signature.

property warmup_conf¶

class oneflow.optimizer.grad_clipping.by_global_norm(clip_norm)¶

This operator limits the norm of Input to clip_norm.

If the norm of Input is less than clip_norm, the Output will be the same as Input.

If the norm of Input is greater than clip_norm, the Output will be scaled.

The equation is:

\[Output = \frac{clip\_norm*Input}{norm(Input)}\]
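
The scaling rule can be sketched in NumPy as shown below. The function name `clip_by_global_norm` is hypothetical, and computing a single L2 norm over all gradient arrays together is an assumption suggested by the class name; the text above only refers to "the norm of Input".

```python
import numpy as np


def clip_by_global_norm(grads, clip_norm):
    """Scale a list of gradient arrays so their joint L2 norm is at most clip_norm."""
    # ASSUMPTION: norm(Input) is the L2 norm over all gradients taken together.
    global_norm = np.sqrt(sum(np.sum(np.square(g)) for g in grads))
    if global_norm <= clip_norm:
        # Norm is already within the limit: Output is the same as Input.
        return [g.copy() for g in grads]
    # Output = clip_norm * Input / norm(Input)
    scale = clip_norm / global_norm
    return [g * scale for g in grads]


grads = [np.array([3.0]), np.array([4.0])]  # joint L2 norm is 5.0
clipped = clip_by_global_norm(grads, clip_norm=1.0)
print(clipped)
```

Because every array is multiplied by the same factor, the direction of the overall gradient is preserved and only its length changes.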

Parameters

clip_norm (float) – The maximum norm value.

For example:

import oneflow as flow
import oneflow.typing as tp

@flow.global_function(type="train")
def train_job(
    images: tp.Numpy.Placeholder((BATCH_SIZE, 1, 28, 28), dtype=flow.float),
    labels: tp.Numpy.Placeholder((BATCH_SIZE,), dtype=flow.int32),
) -> tp.Numpy:
    with flow.scope.placement("gpu", "0:0"):
        logits = lenet(images, train=True)
        loss = flow.nn.sparse_softmax_cross_entropy_with_logits(
            labels, logits, name="softmax_loss"
        )
    # Set learning rate as 0.001
    lr_scheduler = flow.optimizer.PiecewiseConstantScheduler([], [0.001])
    # Set gradient_clip
    gradient_clip = flow.optimizer.grad_clipping.by_global_norm(1.0)
    # Set AdamW optimizer with gradient clip
    flow.optimizer.AdamW(lr_scheduler,
                do_bias_correction=False, weight_decay=0.00005,
                grad_clipping=gradient_clip).minimize(loss)

    return loss

__init__(clip_norm)¶: Initialize self. See help(type(self)) for accurate signature.

property clip_conf¶