# oneflow.optimizer¶

## Optimizers¶

class oneflow.optimizer.Adam(lr_scheduler: oneflow.python.ops.optimizer.LrScheduler, beta1=0.9, beta2=0.999, epsilon=1e-08, do_bias_correction=False, loss_scale_factor: Optional[float] = None, grad_clipping: Optional[oneflow.python.ops.optimizer.ClipGradientConf] = None, train_step_lbn: Optional[str] = None, loss_scale_policy: Optional[oneflow.python.ops.optimizer.LossScalePolicy] = None, variables: Union[Sequence[str], Callable[[], Sequence[str]], None] = <function GetVariablesForCurrentJob>)

The optimizer of the Adam algorithm.

This algorithm can adjust the learning rate of each parameter dynamically according to the 1st-moment estimates and the 2nd-moment estimates of the gradient.

With bias correction, the equation of parameters updating is:

\begin{align}\begin{aligned}& V_t = \beta_1*V_{t-1} + (1-\beta_1)*grad\\& S_t = \beta_2*S_{t-1} + (1-\beta_2)*{grad} \odot {grad}\\& \hat{V_t} = \frac{V_t}{1-\beta_1^t}\\& \hat{S_t} = \frac{S_t}{1-\beta_2^t}\\& \hat{g} = learning\_rate*\frac{\hat{V_t}}{\sqrt{\hat{S_t}}+\epsilon}\\& param_{new} = param_{old} - \hat{g}\end{aligned}\end{align}

Without bias correction, the equation of parameters updating is:

\begin{align}\begin{aligned}& V_t = \beta_1*V_{t-1} + (1-\beta_1)*grad\\& S_t = \beta_2*S_{t-1} + (1-\beta_2)*{grad} \odot {grad}\\& \hat{g} = learning\_rate*\frac{{V_t}}{\sqrt{{S_t}}+\epsilon}\\& param_{new} = param_{old} - \hat{g}\end{aligned}\end{align}
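The update equations above can be sketched in plain Python. This is a minimal, framework-free illustration of one Adam step for a single scalar parameter; the function name and the scalar simplification are chosen for this sketch and are not part of the OneFlow API:

```python
def adam_step(param, grad, v, s, t, lr=0.001, beta1=0.9, beta2=0.999,
              eps=1e-8, do_bias_correction=True):
    """One Adam update for a single scalar parameter."""
    v = beta1 * v + (1 - beta1) * grad          # 1st-moment estimate V_t
    s = beta2 * s + (1 - beta2) * grad * grad   # 2nd-moment estimate S_t
    if do_bias_correction:
        v_hat = v / (1 - beta1 ** t)            # bias-corrected V_t
        s_hat = s / (1 - beta2 ** t)            # bias-corrected S_t
    else:
        v_hat, s_hat = v, s
    return param - lr * v_hat / (s_hat ** 0.5 + eps), v, s

# At t=1 with bias correction, v_hat == grad and s_hat == grad**2,
# so the magnitude of the very first step is close to lr.
param, v, s = adam_step(1.0, 0.5, 0.0, 0.0, t=1)
```

Note how bias correction makes the early steps well-scaled even though V and S start at zero.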

Parameters
• lr_scheduler (LrScheduler) – The scheduler of learning rate.

• beta1 (float, optional) – The exponential weighted average decay rate for the 1st-moment estimates ($$\beta_1$$). Defaults to 0.9.

• beta2 (float, optional) – The exponential weighted average decay rate for the 2nd-moment estimates ($$\beta_2$$). Defaults to 0.999.

• epsilon (float, optional) – A small float constant value for numerical stability ($$\epsilon$$). Defaults to 1e-8.

• do_bias_correction (bool, optional) – Whether to do the bias correction. Defaults to False.

• loss_scale_factor (Optional[float], optional) – The scale factor of loss. Defaults to None.

• train_step_lbn (Optional[Text], optional) – The logical blob name of the train step counter. Defaults to None.

• loss_scale_policy (Optional[LossScalePolicy]) – The policy of loss scale.

• variables (Optional[Union[Sequence[Text], Callable[[], Sequence[Text]]]], optional) – The variables maintained by the optimizer. Defaults to the variables of the current job.

For example:

import oneflow as flow
import oneflow.typing as tp

@flow.global_function(type="train")
def train_job(
    images: tp.Numpy.Placeholder((BATCH_SIZE, 1, 28, 28), dtype=flow.float),
    labels: tp.Numpy.Placeholder((BATCH_SIZE,), dtype=flow.int32),
) -> tp.Numpy:
    with flow.scope.placement("gpu", "0:0"):
        logits = lenet(images, train=True)
        loss = flow.nn.sparse_softmax_cross_entropy_with_logits(
            labels, logits, name="softmax_loss"
        )

    # Set learning rate as 0.001
    lr_scheduler = flow.optimizer.PiecewiseConstantScheduler([], [0.001])
    # Set Adam optimizer
    flow.optimizer.Adam(lr_scheduler, do_bias_correction=False).minimize(loss)

    return loss

class oneflow.optimizer.AdamW(lr_scheduler: oneflow.python.ops.optimizer.LrScheduler, beta1=0.9, beta2=0.999, epsilon=1e-08, do_bias_correction=False, loss_scale_factor: Optional[float] = None, weight_decay: Optional[float] = None, weight_decay_includes: Union[Sequence[str], str, None] = None, weight_decay_excludes: Union[Sequence[str], str, None] = None, grad_clipping: Optional[oneflow.python.ops.optimizer.ClipGradientConf] = None, train_step_lbn: Optional[str] = None, loss_scale_policy: Optional[oneflow.python.ops.optimizer.LossScalePolicy] = None, variables: Union[Sequence[str], Callable[[], Sequence[str]], None] = <function GetVariablesForCurrentJob>)

The optimizer of the Adam-weight-decay algorithm.

With Adam, plain L2 regularization is rendered ineffective by the adaptive learning rate, so the Adam-weight-decay algorithm decouples the weight decay from the gradient update to solve this problem.

With bias correction, the equation of parameters updating is:

\begin{align}\begin{aligned}& V_t = \beta_1*V_{t-1} + (1-\beta_1)*grad\\& S_t = \beta_2*S_{t-1} + (1-\beta_2)*{grad} \odot {grad}\\& \hat{V_t} = \frac{V_t}{1-\beta_1^t}\\& \hat{S_t} = \frac{S_t}{1-\beta_2^t}\\& \hat{g} = learning\_rate*(\frac{\hat{V_t}}{\sqrt{\hat{S_t}}+\epsilon}+\lambda*param_{old})\\& param_{new} = param_{old} - \hat{g}\end{aligned}\end{align}

Without bias correction, the equation of parameters updating is:

\begin{align}\begin{aligned}& V_t = \beta_1*V_{t-1} + (1-\beta_1)*grad\\& S_t = \beta_2*S_{t-1} + (1-\beta_2)*{grad} \odot {grad}\\& \hat{g} = learning\_rate*(\frac{{V_t}}{\sqrt{{S_t}}+\epsilon}+\lambda*param_{old})\\& param_{new} = param_{old} - \hat{g}\end{aligned}\end{align}
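The difference from plain Adam is where the weight decay term enters: it is added to the scaled step rather than folded into the gradient. A minimal, framework-free sketch for a single scalar parameter (the function name and scalar simplification are illustrative, not part of the OneFlow API):

```python
def adamw_step(param, grad, v, s, t, lr=0.001, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.01, do_bias_correction=True):
    """One AdamW update for a single scalar parameter."""
    v = beta1 * v + (1 - beta1) * grad          # 1st-moment estimate V_t
    s = beta2 * s + (1 - beta2) * grad * grad   # 2nd-moment estimate S_t
    if do_bias_correction:
        v_hat = v / (1 - beta1 ** t)
        s_hat = s / (1 - beta2 ** t)
    else:
        v_hat, s_hat = v, s
    # lambda * param enters the step directly, decoupled from the gradient
    step = lr * (v_hat / (s_hat ** 0.5 + eps) + weight_decay * param)
    return param - step, v, s
```

With weight_decay=0 this reduces exactly to the Adam update.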
Parameters
• lr_scheduler (LrScheduler) – The scheduler of learning rate.

• beta1 (float, optional) – The exponential weighted average decay rate for the 1st-moment estimates ($$\beta_1$$). Defaults to 0.9.

• beta2 (float, optional) – The exponential weighted average decay rate for the 2nd-moment estimates ($$\beta_2$$). Defaults to 0.999.

• epsilon (float, optional) – A small float constant value for numerical stability ($$\epsilon$$). Defaults to 1e-8.

• do_bias_correction (bool, optional) – Whether to do the bias correction. Defaults to False.

• loss_scale_factor (Optional[float], optional) – The scale factor of loss. Defaults to None.

• weight_decay (Optional[float], optional) – The weight decay factor (In the equation is $$\lambda$$). Defaults to None.

• weight_decay_includes (Optional[Union[Sequence[Text], Text]], optional) – The name of the model parameters that use weight decay. Defaults to None.

• weight_decay_excludes (Optional[Union[Sequence[Text], Text]], optional) – The name of the model parameters that do not use weight decay. Defaults to None.

• train_step_lbn (Optional[Text], optional) – The logical blob name of the train step counter. Defaults to None.

• loss_scale_policy (Optional[LossScalePolicy]) – The policy of loss scale.

• variables (Optional[Union[Sequence[Text], Callable[[], Sequence[Text]]]], optional) – The variables maintained by the optimizer. Defaults to the variables of the current job.

Note

Only one of weight_decay_includes and weight_decay_excludes can be set. If both are None, all the model parameters will use weight decay.

For example:

import oneflow as flow
import oneflow.typing as tp

@flow.global_function(type="train")
def train_job(
    images: tp.Numpy.Placeholder((BATCH_SIZE, 1, 28, 28), dtype=flow.float),
    labels: tp.Numpy.Placeholder((BATCH_SIZE,), dtype=flow.int32),
) -> tp.Numpy:
    with flow.scope.placement("gpu", "0:0"):
        logits = lenet(images, train=True)
        loss = flow.nn.sparse_softmax_cross_entropy_with_logits(
            labels, logits, name="softmax_loss"
        )

    # Set learning rate as 0.001
    lr_scheduler = flow.optimizer.PiecewiseConstantScheduler([], [0.001])
    # Set AdamW optimizer, weight_decay factor is 0.00005
    flow.optimizer.AdamW(lr_scheduler,
                         do_bias_correction=False, weight_decay=0.00005).minimize(loss)

    return loss

class oneflow.optimizer.CombinedOptimizer(optimizers: Sequence[oneflow.python.ops.optimizer.Optimizer], loss_scale_factor: Optional[float] = None, train_step_lbn: Optional[str] = None, loss_scale_policy: Optional[oneflow.python.ops.optimizer.LossScalePolicy] = None)

Combined optimizer for the multi-optimizer case.

Parameters
• optimizers (Sequence[Optimizer]) – optimizers to work together

• loss_scale_factor (Optional[float], optional) – The scale factor of loss. Defaults to None.

• train_step_lbn (Optional[Text], optional) – The logical blob name of the train step counter. Defaults to None.

• loss_scale_policy (Optional[LossScalePolicy]) – The policy of loss scale.

• Example – see test_multi_optimizer.py

Variables() → List[str]
class oneflow.optimizer.CosineScheduler(base_lr: float, steps: int, alpha: float = 0.0, warmup: Optional[oneflow.python.ops.optimizer.WarmupConf] = None)

This operator creates a cosine-decayed learning rate scheduler.

Before the decay steps specified by the user have elapsed, the learning rate is updated as:

\begin{align}\begin{aligned}& cos\_decay = 0.5*(1+cos(\pi*\frac{current\_batch}{decayed\_batch}))\\& decay\_factor = (1-\alpha)*cos\_decay+\alpha\\& learning\_rate = base\_learning\_rate*decay\_factor\end{aligned}\end{align}

After the specified decay steps, the learning rate will be:

$learning\_rate = {base\_learning\_rate}*{\alpha}$
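The schedule can be sketched in plain Python. This is a framework-free illustration of the two formulas above; the function name is chosen for this sketch and is not part of the OneFlow API:

```python
import math

def cosine_lr(base_lr, current_batch, decay_batch, alpha=0.0):
    """Cosine-decayed learning rate following the equations above."""
    if current_batch < decay_batch:
        cos_decay = 0.5 * (1 + math.cos(math.pi * current_batch / decay_batch))
        decay_factor = (1 - alpha) * cos_decay + alpha
        return base_lr * decay_factor
    return base_lr * alpha  # after the specified steps
```

At batch 0 the learning rate equals base_lr; halfway through the decay it has fallen to half; afterwards it stays at base_lr * alpha.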
Parameters
• base_lr (float) – The base learning rate ($$base\_learning\_rate$$)

• steps (int) – The decay steps in the scheduler ($$decayed\_batch$$)

• alpha (float, optional) – The learning rate scale factor ($$\alpha$$). Defaults to 0.0.

• warmup (Optional[WarmupConf], optional) – The warmup strategy. Defaults to None.

For example:

import oneflow as flow
import oneflow.typing as tp

@flow.global_function(type="train")
def train_job(
    images: tp.Numpy.Placeholder((BATCH_SIZE, 1, 28, 28), dtype=flow.float),
    labels: tp.Numpy.Placeholder((BATCH_SIZE,), dtype=flow.int32),
) -> tp.Numpy:
    with flow.scope.placement("gpu", "0:0"):
        logits = lenet(images, train=True)
        loss = flow.nn.sparse_softmax_cross_entropy_with_logits(
            labels, logits, name="softmax_loss"
        )

    lr_scheduler = flow.optimizer.CosineScheduler(base_lr=0.01,
                                                  steps=10,
                                                  alpha=0.1)
    flow.optimizer.SGD(lr_scheduler, momentum=0.9).minimize(loss)

    return loss

property learning_rate_decay_conf
class oneflow.optimizer.CustomScheduler(lbn: str)
property learning_rate_decay_conf
class oneflow.optimizer.ExponentialScheduler(base_lr: float, steps: int, decay_rate: float, staircase=False, warmup: Optional[oneflow.python.ops.optimizer.WarmupConf] = None)

This operator creates an exponentially decayed learning rate scheduler.

The learning rate will be updated as follows:

If staircase is set to False, the equation is:

\begin{align}\begin{aligned}& pow = \frac{current\_batch}{decay\_batch}\\& learning\_rate = base\_learning\_rate*decay\_rate^{pow}\end{aligned}\end{align}

If staircase is set to True, the equation is:

\begin{align}\begin{aligned}& pow = floor(\frac{current\_batch}{decay\_batch})\\& learning\_rate = base\_learning\_rate*decay\_rate^{pow}\end{aligned}\end{align}
Parameters
• base_lr (float) – The base learning rate

• steps (int) – The decay steps

• decay_rate (float) – The decay rate

• staircase (bool, optional) – If staircase is True, the scheduler decays the learning rate at discrete intervals. Defaults to False.

• warmup (Optional[WarmupConf], optional) – The warmup strategy. Defaults to None.

For example:
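The example snippet appears to be missing here. As a stand-in, the decay rule itself can be sketched in plain Python (a framework-free illustration; the function name is chosen for this sketch and is not part of the OneFlow API):

```python
import math

def exponential_lr(base_lr, current_batch, decay_batch, decay_rate,
                   staircase=False):
    """Exponentially decayed learning rate following the equations above."""
    p = current_batch / decay_batch
    if staircase:
        p = math.floor(p)  # decay only at discrete intervals
    return base_lr * decay_rate ** p
```

With staircase=True the learning rate stays constant within each interval of decay_batch batches instead of decaying continuously.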

property learning_rate_decay_conf
class oneflow.optimizer.InverseTimeScheduler(base_lr: float, steps: int, decay_rate: float, staircase: bool = False, warmup: Optional[oneflow.python.ops.optimizer.WarmupConf] = None)

This operator creates an inverse-time decayed learning rate scheduler.

The learning rate will be updated as follows:

If staircase is set to False, the equation is:

\begin{align}\begin{aligned}& step\_ratio = \frac{current\_batch}{decay\_batch}\\& learning\_rate = \frac{base\_learning\_rate}{1+decay\_rate*step\_ratio}\end{aligned}\end{align}

If staircase is set to True, the equation is:

\begin{align}\begin{aligned}& step\_ratio = \frac{current\_batch}{decay\_batch}\\& learning\_rate = \frac{base\_learning\_rate}{1+floor(decay\_rate*step\_ratio)}\end{aligned}\end{align}
Parameters
• base_lr (float) – The base learning rate

• steps (int) – The decay steps

• decay_rate (float) – The decay rate

• staircase (bool, optional) – If staircase is True, the scheduler decays the learning rate at discrete intervals. Defaults to False.

• warmup (Optional[WarmupConf], optional) – The warmup strategy. Defaults to None.

For example:

import oneflow as flow
import oneflow.typing as tp

@flow.global_function(type="train")
def train_job(
    images: tp.Numpy.Placeholder((BATCH_SIZE, 1, 28, 28), dtype=flow.float),
    labels: tp.Numpy.Placeholder((BATCH_SIZE,), dtype=flow.int32),
) -> tp.Numpy:
    with flow.scope.placement("gpu", "0:0"):
        logits = lenet(images, train=True)
        loss = flow.nn.sparse_softmax_cross_entropy_with_logits(
            labels, logits, name="softmax_loss"
        )

    lr_scheduler = flow.optimizer.InverseTimeScheduler(base_lr=0.1,
                                                       steps=5,
                                                       decay_rate=0.9)
    flow.optimizer.SGD(lr_scheduler, momentum=0.9).minimize(loss)

    return loss

property learning_rate_decay_conf
class oneflow.optimizer.LAMB(lr_scheduler: oneflow.python.ops.optimizer.LrScheduler, beta1: float = 0.9, beta2: float = 0.999, epsilon: float = 1e-06, loss_scale_factor: Optional[float] = None, weight_decay: Optional[float] = None, weight_decay_includes: Union[Sequence[str], str, None] = None, weight_decay_excludes: Union[Sequence[str], str, None] = None, grad_clipping: Optional[oneflow.python.ops.optimizer.ClipGradientConf] = None, train_step_lbn: Optional[str] = None, loss_scale_policy: Optional[oneflow.python.ops.optimizer.LossScalePolicy] = None, variables: Union[Sequence[str], Callable[[], Sequence[str]], None] = <function GetVariablesForCurrentJob>)
Parameters
• lr_scheduler (LrScheduler) – The scheduler of learning rate.

• beta1 (float, optional) – The exponential weighted average decay rate for the 1st-moment estimates ($$\beta_1$$). Defaults to 0.9.

• beta2 (float, optional) – The exponential weighted average decay rate for the 2nd-moment estimates ($$\beta_2$$). Defaults to 0.999.

• epsilon (float, optional) – A small float constant value for numerical stability ($$\epsilon$$). Defaults to 1e-6.

• loss_scale_factor (Optional[float], optional) – The scale factor of loss. Defaults to None.

• weight_decay (Optional[float], optional) – The weight decay factor (In the equation is $$\lambda$$). Defaults to None.

• weight_decay_includes (Optional[Union[Sequence[Text], Text]], optional) – The name of the model parameters that use weight decay. Defaults to None.

• weight_decay_excludes (Optional[Union[Sequence[Text], Text]], optional) – The name of the model parameters that do not use weight decay. Defaults to None.

• train_step_lbn (Optional[Text], optional) – The logical blob name of the train step counter. Defaults to None.

• loss_scale_policy (Optional[LossScalePolicy]) – The policy of loss scale.

• variables (Optional[Union[Sequence[Text], Callable[[], Sequence[Text]]]], optional) – The variables maintained by the optimizer. Defaults to the variables of the current job.

Note

Only one of weight_decay_includes and weight_decay_excludes can be set. If both are None, all the model parameters will use weight decay.

class oneflow.optimizer.LARS(lr_scheduler: oneflow.python.ops.optimizer.LrScheduler, momentum_beta: float = 0.9, epsilon: float = 1e-09, lars_coefficient: float = 0.0001, loss_scale_factor: Optional[float] = None, weight_decay: Optional[float] = None, weight_decay_includes: Union[Sequence[str], str, None] = None, weight_decay_excludes: Union[Sequence[str], str, None] = None, grad_clipping: Optional[oneflow.python.ops.optimizer.ClipGradientConf] = None, train_step_lbn: Optional[str] = None, loss_scale_policy: Optional[oneflow.python.ops.optimizer.LossScalePolicy] = None, variables: Union[Sequence[str], Callable[[], Sequence[str]], None] = <function GetVariablesForCurrentJob>)

The optimizer of the LARS algorithm.

The equation of parameters updating is:

\begin{align}\begin{aligned}& local\_learning\_rate = learning\_rate*lars\_coeff*\frac{\lVert param_{old} \rVert}{\epsilon + \lVert grad \rVert + weight\_decay*\lVert param_{old} \rVert}\\& momentum_t = \beta*momentum_{t-1} + local\_learning\_rate*grad\\& param_{new} = param_{old} - momentum_t - local\_learning\_rate*weight\_decay*param_{old}\end{aligned}\end{align}
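The distinctive part of LARS is the layer-wise local learning rate, which scales the step by the ratio of parameter norm to gradient norm. A minimal sketch of just that term (a hypothetical helper for illustration, not part of the OneFlow API):

```python
def lars_local_lr(learning_rate, param_norm, grad_norm,
                  weight_decay=0.0, lars_coefficient=0.0001, epsilon=1e-9):
    """Layer-wise local learning rate from the LARS equation above.

    param_norm and grad_norm are the L2 norms of a layer's
    parameters and gradients, respectively.
    """
    return (learning_rate * lars_coefficient * param_norm
            / (epsilon + grad_norm + weight_decay * param_norm))
```

Layers with large parameters relative to their gradients take proportionally larger steps, which is what lets LARS train with very large batch sizes.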
Parameters
• lr_scheduler (LrScheduler) – The scheduler of learning rate.

• momentum_beta (float, optional) – The momentum factor ($$\beta$$). Defaults to 0.9.

• epsilon (float, optional) – A small float constant value for numerical stability ($$\epsilon$$). Defaults to 1e-9.

• lars_coefficient (float, optional) – The coefficient factor, it defines how much we trust the layer to change its weights ($$lars\_coeff$$). Defaults to 0.0001.

• loss_scale_factor (Optional[float], optional) – The scale factor of loss. Defaults to None.

• weight_decay (Optional[float], optional) – The weight decay factor (In the equation is $$\lambda$$). Defaults to None.

• weight_decay_includes (Optional[Union[Sequence[Text], Text]], optional) – The name of the model parameters that use weight decay. Defaults to None.

• weight_decay_excludes (Optional[Union[Sequence[Text], Text]], optional) – The name of the model parameters that do not use weight decay. Defaults to None.

• train_step_lbn (Optional[Text], optional) – The logical blob name of the train step counter. Defaults to None.

• loss_scale_policy (Optional[LossScalePolicy]) – The policy of loss scale.

Note

Only one of weight_decay_includes and weight_decay_excludes can be set. If both are None, all the model parameters will use weight decay.

For example:

import oneflow as flow
import oneflow.typing as tp

@flow.global_function(type="train")
def train_job(
    images: tp.Numpy.Placeholder((BATCH_SIZE, 1, 28, 28), dtype=flow.float),
    labels: tp.Numpy.Placeholder((BATCH_SIZE,), dtype=flow.int32),
) -> tp.Numpy:
    with flow.scope.placement("gpu", "0:0"):
        logits = lenet(images, train=True)
        loss = flow.nn.sparse_softmax_cross_entropy_with_logits(
            labels, logits, name="softmax_loss"
        )

    # Set learning rate as 0.1
    lr_scheduler = flow.optimizer.PiecewiseConstantScheduler([], [0.1])
    # Set LARS optimizer, momentum factor is 0.9
    flow.optimizer.LARS(lr_scheduler, momentum_beta=0.9).minimize(loss)

    return loss

class oneflow.optimizer.LazyAdam(lr_scheduler: oneflow.python.ops.optimizer.LrScheduler, beta1: float = 0.9, beta2: float = 0.999, epsilon: float = 1e-08, loss_scale_factor: Optional[float] = None, grad_clipping: Optional[oneflow.python.ops.optimizer.ClipGradientConf] = None, train_step_lbn: Optional[str] = None, loss_scale_policy: Optional[oneflow.python.ops.optimizer.LossScalePolicy] = None, variables: Union[Sequence[str], Callable[[], Sequence[str]], None] = <function GetVariablesForCurrentJob>)

The optimizer of the LazyAdam algorithm.

This algorithm can adjust the learning rate of each parameter dynamically according to the 1st-moment estimates and the 2nd-moment estimates of the gradient.

\begin{align}\begin{aligned}& V_t = \beta_1*V_{t-1} + (1-\beta_1)*grad\\& S_t = \beta_2*S_{t-1} + (1-\beta_2)*{grad} \odot {grad}\\& \hat{g} = learning\_rate*\frac{{V_t}}{\sqrt{{S_t}}+\epsilon}\\& param_{new} = param_{old} - \hat{g}\end{aligned}\end{align}
Parameters
• lr_scheduler (LrScheduler) – The scheduler of learning rate.

• beta1 (float, optional) – The exponential weighted average decay rate for the 1st-moment estimates ($$\beta_1$$). Defaults to 0.9.

• beta2 (float, optional) – The exponential weighted average decay rate for the 2nd-moment estimates ($$\beta_2$$). Defaults to 0.999.

• epsilon (float, optional) – A small float constant value for numerical stability ($$\epsilon$$). Defaults to 1e-8.

• loss_scale_factor (Optional[float], optional) – The scale factor of loss. Defaults to None.

• train_step_lbn (Optional[Text], optional) – The logical blob name of the train step counter. Defaults to None.

• loss_scale_policy (Optional[LossScalePolicy]) – The policy of loss scale.

• variables (Optional[Union[Sequence[Text], Callable[[], Sequence[Text]]]], optional) – The variables maintained by the optimizer. Defaults to the variables of the current job.

For example:

import oneflow as flow
import oneflow.typing as tp

@flow.global_function(type="train")
def train_job(
    images: tp.Numpy.Placeholder((BATCH_SIZE, 1, 28, 28), dtype=flow.float),
    labels: tp.Numpy.Placeholder((BATCH_SIZE,), dtype=flow.int32),
) -> tp.Numpy:
    with flow.scope.placement("gpu", "0:0"):
        logits = lenet(images, train=True)
        loss = flow.nn.sparse_softmax_cross_entropy_with_logits(
            labels, logits, name="softmax_loss"
        )

    # Set learning rate as 0.001
    lr_scheduler = flow.optimizer.PiecewiseConstantScheduler([], [0.001])
    # Set LazyAdam optimizer
    flow.optimizer.LazyAdam(lr_scheduler).minimize(loss)

    return loss

class oneflow.optimizer.LinearCosineScheduler(base_lr: float, steps: int, num_periods: float = 0.5, alpha: float = 0.0, beta: float = 0.001, warmup: Optional[oneflow.python.ops.optimizer.WarmupConf] = None)

This operator creates a linear cosine decayed learning rate scheduler.

The learning rate will be updated as follows:

\begin{align}\begin{aligned}& current\_batch = min(current\_batch, decay\_batch)\\& linear\_decay = \frac{(decay\_batch - current\_batch)}{decay\_batch}\\& cosine\_decay = 0.5*(1.0+cos(2*\pi*num\_periods*\frac{current\_batch}{decay\_batch}))\\& decay\_factor = (\alpha+linear\_decay)*cosine\_decay + \beta\\& learning\_rate = base\_learning\_rate*decay\_factor\end{aligned}\end{align}
Parameters
• base_lr (float) – The base learning rate

• steps (int) – The decay steps

• num_periods (float, optional) – The number of decay periods. Defaults to 0.5.

• alpha (float, optional) – The $$\alpha$$ in equation. Defaults to 0.0.

• beta (float, optional) – The $$\beta$$ in equation. Defaults to 0.001.

• warmup (Optional[WarmupConf], optional) – The warmup strategy. Defaults to None.

For example:

import oneflow as flow
import oneflow.typing as tp

@flow.global_function(type="train")
def train_job(
    images: tp.Numpy.Placeholder((BATCH_SIZE, 1, 28, 28), dtype=flow.float),
    labels: tp.Numpy.Placeholder((BATCH_SIZE,), dtype=flow.int32),
) -> tp.Numpy:
    with flow.scope.placement("gpu", "0:0"):
        logits = lenet(images, train=True)
        loss = flow.nn.sparse_softmax_cross_entropy_with_logits(
            labels, logits, name="softmax_loss"
        )

    lr_scheduler = flow.optimizer.LinearCosineScheduler(base_lr=0.1,
                                                        steps=10)
    flow.optimizer.SGD(lr_scheduler, momentum=0.9).minimize(loss)

    return loss

property learning_rate_decay_conf
class oneflow.optimizer.NaturalExpScheduler(base_lr: float, steps: int, decay_rate: float, staircase: bool = False, warmup: Optional[oneflow.python.ops.optimizer.WarmupConf] = None)

This operator creates a natural exponential decayed learning rate scheduler.

The learning rate will be updated as follows:

If staircase is set to False, the equation is:

\begin{align}\begin{aligned}& step\_ratio = \frac{current\_batch}{decay\_batch}\\& learning\_rate = {base\_learning\_rate}*e^{-decay\_rate*step\_ratio}\end{aligned}\end{align}

If staircase is set to True, the equation is:

\begin{align}\begin{aligned}& step\_ratio = \frac{current\_batch}{decay\_batch}\\& learning\_rate = {base\_learning\_rate}*e^{-decay\_rate*floor(step\_ratio)}\end{aligned}\end{align}
Parameters
• base_lr (float) – The base learning rate

• steps (int) – The decay steps

• decay_rate (float) – The decay rate

• staircase (bool, optional) – If staircase is True, the scheduler decays the learning rate at discrete intervals. Defaults to False.

• warmup (Optional[WarmupConf], optional) – The warmup strategy. Defaults to None.

For example:

import oneflow as flow
import oneflow.typing as tp

@flow.global_function(type="train")
def train_job(
    images: tp.Numpy.Placeholder((BATCH_SIZE, 1, 28, 28), dtype=flow.float),
    labels: tp.Numpy.Placeholder((BATCH_SIZE,), dtype=flow.int32),
) -> tp.Numpy:
    with flow.scope.placement("gpu", "0:0"):
        logits = lenet(images, train=True)
        loss = flow.nn.sparse_softmax_cross_entropy_with_logits(
            labels, logits, name="softmax_loss"
        )

    lr_scheduler = flow.optimizer.NaturalExpScheduler(base_lr=0.1,
                                                      steps=10,
                                                      decay_rate=0.5)
    flow.optimizer.SGD(lr_scheduler, momentum=0.9).minimize(loss)

    return loss

property learning_rate_decay_conf
class oneflow.optimizer.PiecewiseConstantScheduler(boundaries: Sequence[int], values: Sequence[float], warmup: Optional[oneflow.python.ops.optimizer.WarmupConf] = None)

This operator creates a piecewise constant learning rate scheduler.

The change in learning rate can be described as follows:

boundaries = [1000, 2000]
values = [0.1, 0.01, 0.001]

if current_step < 1000:
    learning_rate = 0.1
elif current_step < 2000:
    learning_rate = 0.01
else:
    learning_rate = 0.001
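The lookup generalizes to any number of boundaries. A minimal runnable version of the rule above (a framework-free sketch; the function name is chosen for this illustration and is not part of the OneFlow API):

```python
import bisect

def piecewise_constant_lr(current_step, boundaries, values):
    """Look up the learning rate for the interval containing current_step."""
    # values needs exactly one more entry than boundaries
    assert len(values) == len(boundaries) + 1
    return values[bisect.bisect_right(boundaries, current_step)]
```

bisect_right finds how many boundaries have been passed, which indexes directly into values.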

Parameters
• boundaries (Sequence[int]) – A list of train steps.

• values (Sequence[float]) – The learning rate values used in the intervals defined by boundaries; it must contain one more entry than boundaries.

• warmup (Optional[WarmupConf], optional) – The warmup strategy. Defaults to None.

For example:

import oneflow as flow
import oneflow.typing as tp

@flow.global_function(type="train")
def train_job(
    images: tp.Numpy.Placeholder((BATCH_SIZE, 1, 28, 28), dtype=flow.float),
    labels: tp.Numpy.Placeholder((BATCH_SIZE,), dtype=flow.int32),
) -> tp.Numpy:
    with flow.scope.placement("gpu", "0:0"):
        logits = lenet(images, train=True)
        loss = flow.nn.sparse_softmax_cross_entropy_with_logits(
            labels, logits, name="softmax_loss"
        )

    lr_scheduler = flow.optimizer.PiecewiseConstantScheduler(boundaries=[10, 20],
                                                             values=[0.1, 0.01, 0.001])
    flow.optimizer.SGD(lr_scheduler, momentum=0.9).minimize(loss)

    return loss

property learning_rate_decay_conf
class oneflow.optimizer.PiecewiseScalingScheduler(base_lr: float, boundaries: Sequence[int], scale: Union[float, Sequence[float]], warmup: Optional[oneflow.python.ops.optimizer.WarmupConf] = None)

This operator creates a piecewise scaled decayed learning rate scheduler.

The change in learning rate can be described as follows:

boundaries = [1000, 2000]
scale = [0.1, 0.01]
base_lr = 0.1

if current_step < 1000:
    learning_rate = base_lr
elif current_step < 2000:
    learning_rate = 0.1 * base_lr
else:
    learning_rate = 0.01 * base_lr

Parameters
• base_lr (float) – The base learning rate

• boundaries (Sequence[int]) – A list of train steps.

• scale (Union[float, Sequence[float]]) – The factors by which base_lr is scaled after each boundary is passed.

• warmup (Optional[WarmupConf], optional) – The warmup strategy. Defaults to None.

For example:

import oneflow as flow
import oneflow.typing as tp

@flow.global_function(type="train")
def train_job(
    images: tp.Numpy.Placeholder((BATCH_SIZE, 1, 28, 28), dtype=flow.float),
    labels: tp.Numpy.Placeholder((BATCH_SIZE,), dtype=flow.int32),
) -> tp.Numpy:
    with flow.scope.placement("gpu", "0:0"):
        logits = lenet(images, train=True)
        loss = flow.nn.sparse_softmax_cross_entropy_with_logits(
            labels, logits, name="softmax_loss"
        )

    lr_scheduler = flow.optimizer.PiecewiseScalingScheduler(base_lr=0.1,
                                                            boundaries=[5, 10],
                                                            scale=[0.5, 0.1])
    flow.optimizer.SGD(lr_scheduler, momentum=0).minimize(loss)

    return loss

property learning_rate_decay_conf
class oneflow.optimizer.PolynomialScheduler(base_lr: float, steps: int, end_learning_rate: float = 0.0001, power: float = 1.0, cycle: bool = False, warmup: Optional[oneflow.python.ops.optimizer.WarmupConf] = None)

This operator creates a polynomial decayed learning rate scheduler.

The learning rate will be updated as follows:

If cycle is True, the equation is:

\begin{align}\begin{aligned}& decay\_batch = decay\_batch*ceil(\frac{current\_batch}{decay\_batch})\\& learning\_rate = (base\_lr-end\_lr)*(1-\frac{current\_batch}{decay\_batch})^{pow}+end\_lr\end{aligned}\end{align}

If cycle is False, the equation is:

\begin{align}\begin{aligned}& decay\_batch = min(decay\_batch, current\_batch)\\& learning\_rate = (base\_lr-end\_lr)*(1-\frac{current\_batch}{decay\_batch})^{pow}+end\_lr\end{aligned}\end{align}
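Both cases of the polynomial schedule can be sketched in plain Python (a framework-free illustration; the function name and the max(1, …) guard against division by zero at batch 0 are choices of this sketch, not part of the OneFlow API):

```python
import math

def polynomial_lr(base_lr, current_batch, decay_batch,
                  end_lr=0.0001, power=1.0, cycle=False):
    """Polynomially decayed learning rate following the equations above."""
    if cycle:
        # restart the decay every decay_batch batches (guard the first batch)
        decay_batch = decay_batch * max(1, math.ceil(current_batch / decay_batch))
    else:
        # hold at end_lr once the decay steps have elapsed
        current_batch = min(current_batch, decay_batch)
    return (base_lr - end_lr) * (1 - current_batch / decay_batch) ** power + end_lr
```

With power=1.0 this is a straight linear decay from base_lr down to end_lr.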
Parameters
• base_lr (float) – The base learning rate

• steps (int) – The decay steps

• end_learning_rate (float, optional) – The final learning rate. Defaults to 0.0001.

• power (float, optional) – The power of polynomial. Defaults to 1.0.

• cycle (bool, optional) – If cycle is True, the scheduler restarts the decay every steps batches. Defaults to False.

• warmup (Optional[WarmupConf], optional) – The warmup strategy. Defaults to None.

For example:

import oneflow as flow
import oneflow.typing as tp

@flow.global_function(type="train")
def train_job(
    images: tp.Numpy.Placeholder((BATCH_SIZE, 1, 28, 28), dtype=flow.float),
    labels: tp.Numpy.Placeholder((BATCH_SIZE,), dtype=flow.int32),
) -> tp.Numpy:
    with flow.scope.placement("gpu", "0:0"):
        logits = lenet(images, train=True)
        loss = flow.nn.sparse_softmax_cross_entropy_with_logits(
            labels, logits, name="softmax_loss"
        )

    lr_scheduler = flow.optimizer.PolynomialScheduler(base_lr=0.001,
                                                      steps=5,
                                                      end_learning_rate=0.00001,
                                                      power=2)
    flow.optimizer.SGD(lr_scheduler, momentum=0.9).minimize(loss)

    return loss

property learning_rate_decay_conf
class oneflow.optimizer.RMSProp(lr_scheduler: oneflow.python.ops.optimizer.LrScheduler, decay_rate: float = 0.99, epsilon: float = 1e-08, centered: bool = False, loss_scale_factor: Optional[float] = None, grad_clipping: Optional[oneflow.python.ops.optimizer.ClipGradientConf] = None, train_step_lbn: Optional[str] = None, loss_scale_policy: Optional[oneflow.python.ops.optimizer.LossScalePolicy] = None, variables: Union[Sequence[str], Callable[[], Sequence[str]], None] = <function GetVariablesForCurrentJob>)

The optimizer of the RMSProp algorithm.

The equations of parameter updating are:

$S_t = \beta_1*S_{t-1} + (1-\beta_1)*{grad} \odot {grad}$

If centered:

\begin{align}\begin{aligned}& mg_t = \beta_1*mg_{t-1} + (1-\beta_1)*grad\\& denom_t = S_t - mg_t * mg_t\end{aligned}\end{align}

else:

$denom_t = S_t$

$param_{new} = param_{old} - \frac{learning\_rate}{\sqrt{denom_t+\epsilon}} \odot grad$
Parameters
• lr_scheduler (LrScheduler) – The scheduler of learning rate.

• decay_rate (float, optional) – The decay factor ($$\beta_1$$). Defaults to 0.99.

• epsilon (float, optional) – A small float constant value for numerical stability ($$\epsilon$$). Defaults to 1e-8.

• centered (bool, optional) – If True, gradients are normalized by the estimated variance of the gradient; if False, by the uncentered second moment. Setting this to True may help with training, but is slightly more expensive in terms of computation and memory. Defaults to False.

• loss_scale_factor (Optional[float], optional) – The scale factor of loss. Defaults to None.

• train_step_lbn (Optional[Text], optional) – The logical blob name of the train step counter. Defaults to None.

• loss_scale_policy (Optional[LossScalePolicy]) – The policy of loss scale.

• variables (Optional[Union[Sequence[Text], Callable[[], Sequence[Text]]]], optional) – The variables maintained by the optimizer. Defaults to the variables of the current job.

For example:

import oneflow as flow
import oneflow.typing as tp

@flow.global_function(type="train")
def train_job(
    images: tp.Numpy.Placeholder((BATCH_SIZE, 1, 28, 28), dtype=flow.float),
    labels: tp.Numpy.Placeholder((BATCH_SIZE,), dtype=flow.int32),
) -> tp.Numpy:
    with flow.scope.placement("gpu", "0:0"):
        logits = lenet(images, train=True)
        loss = flow.nn.sparse_softmax_cross_entropy_with_logits(
            labels, logits, name="softmax_loss"
        )

    # Set learning rate as 0.001
    lr_scheduler = flow.optimizer.PiecewiseConstantScheduler([], [0.001])
    # Set RMSProp optimizer
    flow.optimizer.RMSProp(lr_scheduler).minimize(loss)

    return loss

class oneflow.optimizer.SGD(lr_scheduler: oneflow.python.ops.optimizer.LrScheduler, loss_scale_factor: Optional[float] = None, momentum: float = 0.9, grad_clipping: Optional[oneflow.python.ops.optimizer.ClipGradientConf] = None, train_step_lbn: Optional[str] = None, loss_scale_policy: Optional[oneflow.python.ops.optimizer.LossScalePolicy] = None, variables: Union[Sequence[str], Callable[[], Sequence[str]], None] = <function GetVariablesForCurrentJob>)

The optimizer of the stochastic gradient descent algorithm.

In mini-batch gradient descent, this algorithm uses the gradient of a randomly sampled mini-batch as an approximate estimate of the overall gradient.

When the momentum = 0, the equation of parameters updating is:

$param_{new} = param_{old} - learning\_rate*grad$

With momentum, the equation of parameters updating is:

\begin{align}\begin{aligned}& V_{t} = \beta*V_{t-1} + learning\_rate*g_t\\& param_{new} = param_{old} - V_{t}\end{aligned}\end{align}
Parameters
• lr_scheduler (LrScheduler) – The scheduler of learning rate.

• loss_scale_factor (Optional[float], optional) – The scale factor of loss. Defaults to None.

• momentum (float, optional) – Momentum factor ($$\beta$$). Defaults to 0.9.

• train_step_lbn (Optional[Text], optional) – The logical blob name of the train step counter. Defaults to None.

• loss_scale_policy (Optional[LossScalePolicy]) – The policy of loss scale.

• variables (Optional[Union[Sequence[Text], Callable[[], Sequence[Text]]]], optional) – The variables maintained by the optimizer. Defaults to the variables of the current job.

For example:

import oneflow as flow
import oneflow.typing as tp

@flow.global_function(type="train")
def train_job(
    images: tp.Numpy.Placeholder((BATCH_SIZE, 1, 28, 28), dtype=flow.float),
    labels: tp.Numpy.Placeholder((BATCH_SIZE,), dtype=flow.int32),
) -> tp.Numpy:
    with flow.scope.placement("gpu", "0:0"):
        logits = lenet(images, train=True)
        loss = flow.nn.sparse_softmax_cross_entropy_with_logits(
            labels, logits, name="softmax_loss"
        )

    # Set learning rate as 0.1
    lr_scheduler = flow.optimizer.PiecewiseConstantScheduler([], [0.1])
    # Set Momentum=0.9 SGD optimizer
    flow.optimizer.SGD(lr_scheduler, momentum=0.9).minimize(loss)

    return loss

class oneflow.optimizer.SGDW(lr_scheduler: oneflow.python.ops.optimizer.LrScheduler, loss_scale_factor: Optional[float] = None, momentum: float = 0.9, weight_decay: Optional[float] = None, weight_decay_includes: Union[Sequence[str], str, None] = None, weight_decay_excludes: Union[Sequence[str], str, None] = None, grad_clipping: Optional[oneflow.python.ops.optimizer.ClipGradientConf] = None, train_step_lbn: Optional[str] = None, loss_scale_policy: Optional[oneflow.python.ops.optimizer.LossScalePolicy] = None, variables: Union[Sequence[str], Callable[[], Sequence[str]], None] = <function GetVariablesForCurrentJob>)

The optimizer of the stochastic gradient descent with weight decay (SGDW) algorithm.

(For more details, please refer to Decoupled Weight Decay Regularization.)

When momentum = 0, the parameter update equation is:

$param_{new} = param_{old} - learning\_rate*(grad + \lambda*param_{old})$

With momentum, the parameter update equation is:

\begin{align}\begin{aligned}& V_{t} = \beta*V_{t-1} - learning\_rate*g_t\\& param_{new} = param_{old} + V_{t} - learning\_rate * \lambda*param_{old}\end{aligned}\end{align}
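The decoupled weight decay above can be sketched in plain Python (not the OneFlow API; the hyperparameter values are illustrative). Note that the decay term is applied directly to the parameter, outside the momentum buffer:

```python
def sgdw_step(param, velocity, grad, lr=0.1, beta=0.9, weight_decay=0.01):
    # V_t = beta * V_{t-1} - learning_rate * g_t
    velocity = beta * velocity - lr * grad
    # param_new = param_old + V_t - learning_rate * lambda * param_old
    # (the weight decay is decoupled from the gradient-based update)
    param = param + velocity - lr * weight_decay * param
    return param, velocity

# One step from param=1.0 with a toy gradient of 0.5
param, velocity = sgdw_step(1.0, 0.0, 0.5)
```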
Parameters
• lr_scheduler (LrScheduler) – The scheduler of learning rate.

• loss_scale_factor (Optional[float], optional) – The scale factor of loss. Defaults to None.

• momentum (float, optional) – Momentum factor ($$\beta$$). Defaults to 0.9.

• weight_decay (Optional[float], optional) – The weight decay factor ($$\lambda$$ in the equations). Defaults to None.

• weight_decay_includes (Optional[Union[Sequence[Text], Text]], optional) – The name of the model parameters that use weight decay. Defaults to None.

• weight_decay_excludes (Optional[Union[Sequence[Text], Text]], optional) – The name of the model parameters that do not use weight decay. Defaults to None.

• train_step_lbn (Optional[Text], optional) – [description]. Defaults to None.

• loss_scale_policy (Optional[LossScalePolicy]) – The policy of loss scale.

• variables (Optional[Union[Sequence[Text], Callable[[], Sequence[Text]]]], optional) – The variables maintained by the optimizer.

Note

Only one of weight_decay_includes and weight_decay_excludes can be set. If both are None, all the model parameters will use weight decay.

For example:

import oneflow as flow
import oneflow.typing as tp

@flow.global_function(type="train")
def train_job(
    images: tp.Numpy.Placeholder((BATCH_SIZE, 1, 28, 28), dtype=flow.float),
    labels: tp.Numpy.Placeholder((BATCH_SIZE,), dtype=flow.int32),
) -> tp.Numpy:
    with flow.scope.placement("gpu", "0:0"):
        logits = lenet(images, train=True)
        loss = flow.nn.sparse_softmax_cross_entropy_with_logits(
            labels, logits, name="softmax_loss"
        )

    # Set the learning rate to 0.1
    lr_scheduler = flow.optimizer.PiecewiseConstantScheduler([], [0.1])
    # Set the SGDW optimizer with momentum=0.9 and weight_decay=0.00005
    flow.optimizer.SGDW(lr_scheduler, momentum=0.9, weight_decay=0.00005).minimize(loss)

    return loss

class oneflow.optimizer.warmup.constant(steps, multiplier)

This operator uses the constant warmup strategy to adjust the learning rate.

Before the user-specified warmup steps are reached, the learning rate is:

$learning\_rate = base\_learning\_rate*multiplier$

After the warmup steps are reached, the learning rate is:

$learning\_rate = base\_learning\_rate$
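The schedule can be expressed as a small helper function (a plain-Python sketch of the formulas above, not the OneFlow API):

```python
def constant_warmup_lr(base_lr, step, warmup_steps, multiplier):
    # Before warmup_steps: base_lr * multiplier; afterwards: base_lr
    if step < warmup_steps:
        return base_lr * multiplier
    return base_lr

# e.g. with base_lr=0.01 and multiplier=0.1, the learning rate is
# 0.001 during the first 10 steps and 0.01 from step 10 onwards
```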
Parameters
• steps (int) – The warmup steps.

• multiplier (float) – The scale factor $$multiplier$$. It should be greater than 0 and less than 1.

For example:

import oneflow as flow
import oneflow.typing as tp

@flow.global_function(type="train")
def train_job(
    images: tp.Numpy.Placeholder((BATCH_SIZE, 1, 28, 28), dtype=flow.float),
    labels: tp.Numpy.Placeholder((BATCH_SIZE,), dtype=flow.int32),
) -> tp.Numpy:
    with flow.scope.placement("gpu", "0:0"):
        logits = lenet(images, train=True)
        loss = flow.nn.sparse_softmax_cross_entropy_with_logits(
            labels, logits, name="softmax_loss"
        )

    # During the first 10 steps, the learning rate is 0.01 * 0.1 = 0.001
    # After 10 steps, the learning rate is 0.01
    warmup_scheduler = flow.optimizer.warmup.constant(10, 0.1)
    lr_scheduler = flow.optimizer.PiecewiseConstantScheduler([], [0.01], warmup=warmup_scheduler)
    flow.optimizer.SGD(lr_scheduler, momentum=0.9).minimize(loss)

    return loss

property warmup_conf
class oneflow.optimizer.warmup.linear(steps, start_multiplier)

This operator uses the linear warmup strategy to adjust the learning rate.

When the current train step is less than the warmup steps, the learning rate is updated as:

\begin{align}\begin{aligned}& current\_multiplier = start\_multiplier + (1-start\_multiplier)*\frac{train\_step}{warmup\_step}\\& current\_learning\_rate = learning\_rate*current\_multiplier\end{aligned}\end{align}
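A plain-Python sketch of this linear ramp (not the OneFlow API; the arguments mirror the formula above):

```python
def linear_warmup_lr(base_lr, step, warmup_steps, start_multiplier):
    # After warmup, the base learning rate is used unchanged
    if step >= warmup_steps:
        return base_lr
    # current_multiplier ramps linearly from start_multiplier towards 1
    multiplier = start_multiplier + (1 - start_multiplier) * step / warmup_steps
    return base_lr * multiplier
```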
Parameters
• steps (int) – The warmup steps.

• start_multiplier (float) – The start multiplier ($$start\_multiplier$$). It should be greater than 0 and less than 1.

For example:

import oneflow as flow
import oneflow.typing as tp

@flow.global_function(type="train")
def train_job(
    images: tp.Numpy.Placeholder((BATCH_SIZE, 1, 28, 28), dtype=flow.float),
    labels: tp.Numpy.Placeholder((BATCH_SIZE,), dtype=flow.int32),
) -> tp.Numpy:
    with flow.scope.placement("gpu", "0:0"):
        logits = lenet(images, train=True)
        loss = flow.nn.sparse_softmax_cross_entropy_with_logits(
            labels, logits, name="softmax_loss"
        )

    # During the first 10 steps, the learning rate increases linearly from 0.001 to 0.01
    warmup_scheduler = flow.optimizer.warmup.linear(10, 0.1)
    lr_scheduler = flow.optimizer.PiecewiseConstantScheduler([], [0.01], warmup=warmup_scheduler)
    flow.optimizer.SGD(lr_scheduler, momentum=0.9).minimize(loss)

    return loss

property warmup_conf
class oneflow.optimizer.grad_clipping.by_global_norm(clip_norm)

This operator limits the norm of Input with clip_norm.

If the norm of Input is less than the clip_norm, the Output will be the same as Input. If the norm of Input is greater than the clip_norm, the Output will be scaled so that its norm equals clip_norm.

The equation for the scaled case is:

$Output = \frac{clip\_norm*Input}{norm(Input)}$
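Clipping by global norm can be sketched in plain Python (not the OneFlow API; the gradients are represented as a flat list of floats for simplicity):

```python
import math

def clip_by_global_norm(grads, clip_norm):
    # Global norm over all gradient values
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= clip_norm:
        return grads
    # Scale so the clipped gradients have norm equal to clip_norm
    return [clip_norm * g / norm for g in grads]
```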
Parameters
• clip_norm (float) – The maximum norm value.

For example:

import oneflow as flow
import oneflow.typing as tp

@flow.global_function(type="train")
def train_job(
    images: tp.Numpy.Placeholder((BATCH_SIZE, 1, 28, 28), dtype=flow.float),
    labels: tp.Numpy.Placeholder((BATCH_SIZE,), dtype=flow.int32),
) -> tp.Numpy:
    with flow.scope.placement("gpu", "0:0"):
        logits = lenet(images, train=True)
        loss = flow.nn.sparse_softmax_cross_entropy_with_logits(
            labels, logits, name="softmax_loss"
        )
    # Set the learning rate to 0.001
    lr_scheduler = flow.optimizer.PiecewiseConstantScheduler([], [0.001])
    # Clip the gradients by their global norm, with clip_norm = 1.0
    flow.optimizer.SGD(
        lr_scheduler,
        momentum=0.9,
        grad_clipping=flow.optimizer.grad_clipping.by_global_norm(1.0),
    ).minimize(loss)

    return loss

property clip_conf