oneflow.optim.AdamW

class oneflow.optim.AdamW(params: Union[Iterator[oneflow.nn.Parameter], List[Dict]], lr: float = 0.001, betas: Tuple[float, float] = (0.9, 0.999), eps: float = 1e-08, weight_decay: float = 0, amsgrad: bool = False, do_bias_correction: bool = True, contiguous_params: bool = False, fused: bool = False)

Implements the AdamW algorithm.

The original Adam algorithm was proposed in Adam: A Method for Stochastic Optimization. The AdamW variant was proposed in Decoupled Weight Decay Regularization.

This is the optimizer of the Adam-weight-decay algorithm.

With plain Adam, an L2 penalty added to the loss is rescaled by the adaptive learning rate, so it no longer acts as true weight decay. The Adam-weight-decay algorithm solves this problem by decoupling the weight decay term from the gradient-based update (for more details, please refer to Decoupled Weight Decay Regularization).

The parameter update equations are:

\[
\begin{aligned}
& V_t = \beta_1 * V_{t-1} + (1-\beta_1) * grad \\
& S_t = \beta_2 * S_{t-1} + (1-\beta_2) * grad \odot grad \\
& \hat{g} = learning\_rate * \left( \frac{V_t}{\sqrt{S_t}+\epsilon} + \lambda * param_{old} \right) \\
& param_{new} = param_{old} - \hat{g}
\end{aligned}
\]
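As a reading aid, here is a minimal NumPy sketch of one update step following the equations above (without bias correction); the names v, s and lam are illustrative only and are not part of the OneFlow API:

import numpy as np

def adamw_step(param, grad, v, s, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8, lam=0.0):
    # V_t and S_t: exponential moving averages of the gradient and its square.
    v = beta1 * v + (1 - beta1) * grad
    s = beta2 * s + (1 - beta2) * grad * grad
    # g_hat: learning-rate-scaled update, including the decoupled weight decay term lam * param.
    g_hat = lr * (v / (np.sqrt(s) + eps) + lam * param)
    # param_new = param_old - g_hat; return the new parameter and the updated moments.
    return param - g_hat, v, s

# Example: one step on a 3-element parameter vector.
p, v, s = adamw_step(np.ones(3), np.full(3, 0.1), np.zeros(3), np.zeros(3), lam=0.01)
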
Parameters
  • params (iterable) – iterable of parameters to optimize or dicts defining parameter groups

  • lr (float, optional) – learning rate (default: 1e-3)

  • betas (Tuple[float, float], optional) – coefficients used for computing running averages of gradient and its square (default: (0.9, 0.999))

  • eps (float, optional) – term added to the denominator to improve numerical stability (default: 1e-8)

  • weight_decay (float, optional) – weight decay coefficient, applied as decoupled weight decay (λ in the equation above) (default: 0)

  • amsgrad (bool, optional) – whether to use the AMSGrad variant of this algorithm. (default: False)

  • do_bias_correction (bool, optional) – whether to do bias correction (default: True); see the note after this parameter list

  • contiguous_params (bool, optional) – whether to use a contiguous ParamGroup, which puts all parameters of the same type, device and group into the same tensor and updates them together (default: False)

  • fused (bool, optional) – whether to divide all the parameters into several groups, then update each group of parameters with the fused kernel (default: False); a usage sketch for this and contiguous_params follows the examples below
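
When do_bias_correction=True, the moment estimates are bias-corrected before being used in the update, as in the original Adam paper. The standard correction, restated here for reference (it is not an additional OneFlow-specific term), is:

\[ \hat{V}_t = \frac{V_t}{1-\beta_1^t}, \qquad \hat{S}_t = \frac{S_t}{1-\beta_2^t} \]

and these corrected moments replace V_t and S_t in the computation of \hat{g} above.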

For example:

Example 1:

import oneflow as flow

# Assume net is a custom model.
adamw = flow.optim.AdamW(net.parameters(), lr=1e-3)

for epoch in range(epochs):
    # Read data, compute the loss and so on.
    # ...
    loss.backward()
    adamw.step()
    adamw.zero_grad()

Example 2:

import oneflow as flow

# Assume net is a custom model.
adamw = flow.optim.AdamW(
    [
        {
            "params": net.parameters(),
            "lr": learning_rate,
            "clip_grad_max_norm": 0.5,
            "clip_grad_norm_type": 2.0,
        }
    ],
)

for epoch in range(epochs):
    # Read data, compute the loss and so on.
    # ...
    loss.backward()
    adamw.clip_grad()
    adamw.step()
    adamw.zero_grad()

If you want to use clip_grad, you can refer to this example.

For more details of clip_grad_max_norm and clip_grad_norm_type, you can refer to oneflow.nn.utils.clip_grad_norm_().
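
The contiguous_params and fused flags from the parameter list above are ordinary constructor options; a minimal sketch (again assuming net is a custom model, with illustrative hyperparameter values):

import oneflow as flow

# Assume net is a custom model.
adamw = flow.optim.AdamW(
    net.parameters(),
    lr=1e-3,
    weight_decay=1e-2,
    contiguous_params=True,  # pack parameters of the same type, device and group into one tensor
    fused=True,              # update each parameter group with the fused kernel
)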

__init__(params: Union[Iterator[oneflow.nn.Parameter], List[Dict]], lr: float = 0.001, betas: Tuple[float, float] = (0.9, 0.999), eps: float = 1e-08, weight_decay: float = 0, amsgrad: bool = False, do_bias_correction: bool = True, contiguous_params: bool = False, fused: bool = False)

Initialize self. See help(type(self)) for accurate signature.

Methods

__init__(params[, lr, betas, eps, …])

Initialize self.

_check_variables_in_graph(vars_conf)

_check_variables_optimizer_bound(vars_conf)

_fused_update(param_group)

_generate_conf_for_graph(train_conf, vars_conf)

_generate_grad_clip_conf_for_optim_conf(…)

_generate_indexed_slices_optimizer_conf(…)

_generate_lr_scale_for_optim_conf(…)

_parse_input_parameters(parameters)

Supports such parameters:

_single_tensor_update(param_group)

add_param_group(param_group)

Add a param group to the Optimizer's param_groups.

clip_grad([error_if_nonfinite])

Clips gradient norm of an iterable of parameters.

load_state_dict(state_dict)

Load the state of the optimizer that was created by the state_dict function.

state_dict()

Returns the state of the optimizer as a dict; a checkpoint sketch follows this method list.

step([closure])

Performs a single optimization step.

zero_grad([set_to_none])

Sets the gradients of all optimized oneflow.Tensor objects to zero.
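
For checkpointing, state_dict() and load_state_dict() pair in the usual way. A minimal sketch, assuming flow.save and flow.load are used for serialization and that "adamw_state" is a hypothetical path:

import oneflow as flow

# Save the optimizer state alongside the model checkpoint ...
flow.save(adamw.state_dict(), "adamw_state")  # "adamw_state" is a hypothetical path

# ... and later restore it into a freshly constructed optimizer.
new_adamw = flow.optim.AdamW(net.parameters(), lr=1e-3)
new_adamw.load_state_dict(flow.load("adamw_state"))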

Attributes

support_sparse

Whether the AdamW optimizer supports sparse updates.