class oneflow.optim.AdamW(params: Union[Iterator[oneflow.nn.Parameter], List[Dict]], lr: float = 0.001, betas: Tuple[float, float] = (0.9, 0.999), eps: float = 1e-08, weight_decay: float = 0, amsgrad: bool = False, do_bias_correction: bool = True, contiguous_params: bool = False, fused: bool = False)

The original Adam algorithm was proposed in Adam: A Method for Stochastic Optimization. The AdamW variant was proposed in Decoupled Weight Decay Regularization.

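This optimizer implements the Adam-weight-decay (AdamW) algorithm. With Adam, adding an L2 penalty to the loss is not equivalent to weight decay, because the gradient of the penalty is rescaled by the adaptive term. AdamW solves this problem by applying the decay directly to the parameters.

The parameter update equation is: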

\begin{align}\begin{aligned}& V_t = \beta_1*V_{t-1} + (1-\beta_1)*grad\\& S_t = \beta_2*S_{t-1} + (1-\beta_2)*{grad} \odot {grad}\\& \hat{g} = learning\_rate*(\frac{{V_t}}{\sqrt{{S_t}}+\epsilon}+\lambda*param_{old})\\& param_{new} = param_{old} - \hat{g}\end{aligned}\end{align}
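
The update equation can be sketched in plain Python as follows. This is a minimal illustration on lists of floats, not OneFlow's kernel: the function name `adamw_step` and its argument layout are hypothetical, and bias correction and AMSGrad are left out so that the code matches the equation term by term.

```python
import math


def adamw_step(param, grad, v, s, lr=1e-3, betas=(0.9, 0.999),
               eps=1e-8, weight_decay=0.0):
    """Apply one AdamW update to lists of floats (hypothetical helper).

    Returns (new_param, new_v, new_s). Bias correction and AMSGrad are
    intentionally omitted to mirror the equation as written.
    """
    beta1, beta2 = betas
    new_param, new_v, new_s = [], [], []
    for p, g, v_prev, s_prev in zip(param, grad, v, s):
        v_t = beta1 * v_prev + (1 - beta1) * g        # V_t
        s_t = beta2 * s_prev + (1 - beta2) * g * g    # S_t
        # Adaptive step plus decoupled weight decay (lambda * param_old).
        g_hat = lr * (v_t / (math.sqrt(s_t) + eps) + weight_decay * p)
        new_param.append(p - g_hat)
        new_v.append(v_t)
        new_s.append(s_t)
    return new_param, new_v, new_s


# One step from zero-initialized moment estimates.
params, v, s = adamw_step([1.0], [0.5], [0.0], [0.0], lr=0.1, weight_decay=0.01)
```

Note that the weight-decay term sits outside the division by the square root of S_t, so the decay applied to a parameter does not depend on its gradient history.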
Parameters
• params (iterable) – iterable of parameters to optimize or dicts defining parameter groups

• lr (float, optional) – learning rate (default: 1e-3)

• betas (Tuple[float, float], optional) – coefficients used for computing running averages of gradient and its square (default: (0.9, 0.999))

• eps (float, optional) – term added to the denominator to improve numerical stability (default: 1e-8)

• weight_decay (float, optional) – weight decay coefficient, λ in the equation (default: 0)

• amsgrad (bool, optional) – whether to use the AMSGrad variant of this algorithm. (default: False)

• do_bias_correction (bool, optional) – whether to do bias correction (default: True)

• contiguous_params (bool, optional) – whether to use a contiguous ParamGroup, which places all parameters of the same type, device, and group into a single tensor and updates them together. (default: False)

• fused (bool, optional) – whether to divide all the parameters into several groups, then update each group of parameters with the fused kernel. (default: False)
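
The effect of `amsgrad` and `do_bias_correction` can be illustrated with a scalar sketch. It assumes the standard formulations from the Adam and AMSGrad papers (dividing the moment estimates by 1 - β^t, and keeping a running maximum of S_t); OneFlow's kernels may arrange the arithmetic differently, and the helper name `adam_direction` is hypothetical.

```python
import math


def adam_direction(grad, v, s, s_max, step, betas=(0.9, 0.999), eps=1e-8,
                   amsgrad=False, do_bias_correction=True):
    """Return (direction, v, s, s_max) for one scalar step (hypothetical helper).

    `step` is the 1-based iteration count. `direction` is the adaptive term
    that is later scaled by the learning rate.
    """
    beta1, beta2 = betas
    v = beta1 * v + (1 - beta1) * grad
    s = beta2 * s + (1 - beta2) * grad * grad
    if amsgrad:
        # AMSGrad: never let the second-moment estimate shrink.
        s_max = max(s_max, s)
        second_moment = s_max
    else:
        second_moment = s
    if do_bias_correction:
        # Compensate for the zero initialization of v and s.
        v_hat = v / (1 - beta1 ** step)
        second_moment = second_moment / (1 - beta2 ** step)
    else:
        v_hat = v
    direction = v_hat / (math.sqrt(second_moment) + eps)
    return direction, v, s, s_max


# At step 1 with bias correction, the direction is close to grad / |grad|.
corrected, _, _, _ = adam_direction(0.5, 0.0, 0.0, 0.0, step=1)
uncorrected, _, _, _ = adam_direction(0.5, 0.0, 0.0, 0.0, step=1,
                                      do_bias_correction=False)
```

In this sketch the first uncorrected step is roughly three times larger than the corrected one, which is why bias correction mainly matters early in training.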

For example:

Example 1:

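# Assume net is a custom model and oneflow has been imported as flow.
adamw = flow.optim.AdamW(net.parameters(), lr=1e-3)

for epoch in range(epochs):
    # Read data, compute the loss, and so on.
    # ...
    loss.backward()
    adamw.step()
    adamw.zero_grad()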


Example 2:

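# Assume net is a custom model and oneflow has been imported as flow.
adamw = flow.optim.AdamW(
    [
        {
            "params": net.parameters(),
            "lr": learning_rate,
            "clip_grad_max_norm": 0.5,
            "clip_grad_norm_type": 2.0,
        }
    ],
)

for epoch in range(epochs):
    # Read data, compute the loss, and so on.
    # ...
    loss.backward()
    adamw.clip_grad()
    adamw.step()
    adamw.zero_grad()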


To use clip_grad, follow Example 2.

For more details on clip_grad_max_norm and clip_grad_norm_type, see oneflow.nn.utils.clip_grad_norm_().
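
As a rough illustration of what norm-based gradient clipping does, here is a pure-Python sketch. It assumes the common formulation in which all gradients are rescaled by max_norm / (total_norm + 1e-6) whenever the total norm exceeds max_norm; the small constant and the helper name `clip_grad_norm_sketch` are assumptions for this example, not taken from OneFlow's source.

```python
def clip_grad_norm_sketch(grads, max_norm, norm_type=2.0):
    """Return (clipped_grads, total_norm) for a list of gradient lists.

    Hypothetical helper: each inner list stands in for one parameter's
    gradient tensor. The inputs are not modified.
    """
    flat = [abs(g) for tensor in grads for g in tensor]
    if norm_type == float("inf"):
        total_norm = max(flat, default=0.0)
    else:
        total_norm = sum(x ** norm_type for x in flat) ** (1.0 / norm_type)
    # Assumed stabilizer: avoids division by zero when all gradients vanish.
    clip_coef = max_norm / (total_norm + 1e-6)
    if clip_coef < 1.0:
        grads = [[g * clip_coef for g in tensor] for tensor in grads]
    return grads, total_norm


# Two "tensors" whose joint L2 norm is 5.0, clipped to a norm of about 0.5.
clipped, norm_before = clip_grad_norm_sketch([[3.0], [4.0]], max_norm=0.5)
```

Because every gradient is multiplied by the same coefficient, clipping changes the length of the overall gradient vector but not its direction.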

__init__(params: Union[Iterator[oneflow.nn.Parameter], List[Dict]], lr: float = 0.001, betas: Tuple[float, float] = (0.9, 0.999), eps: float = 1e-08, weight_decay: float = 0, amsgrad: bool = False, do_bias_correction: bool = True, contiguous_params: bool = False, fused: bool = False)

Initialize self. See help(type(self)) for accurate signature.

Methods

 __delattr__(name, /)  Implement delattr(self, name).
 __dir__()  Default dir() implementation.
 __eq__(value, /)  Return self==value.
 __format__(format_spec, /)  Default object formatter.
 __ge__(value, /)  Return self>=value.
 __getattribute__(name, /)  Return getattr(self, name).
 __gt__(value, /)  Return self>value.
 __hash__()  Return hash(self).
 __init__(params[, lr, betas, eps, …])  Initialize self.
 __init_subclass__  This method is called when a class is subclassed.
 __le__(value, /)  Return self<=value.
 __lt__(value, /)  Return self<value.

Attributes

 support_sparse  Whether the AdamW optimizer supports sparse updates.