oneflow.optim.AdamW
class oneflow.optim.AdamW(params: Union[Iterator[oneflow.nn.Parameter], List[Dict]], lr: float = 0.001, betas: Tuple[float, float] = (0.9, 0.999), eps: float = 1e-08, weight_decay: float = 0, amsgrad: bool = False, do_bias_correction: bool = True)
Implements the AdamW algorithm.
The original Adam algorithm was proposed in Adam: A Method for Stochastic Optimization. The AdamW variant was proposed in Decoupled Weight Decay Regularization.
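This optimizer implements the Adam-weight-decay algorithm (for more details, refer to Adam-weight-decay).
With Adam's adaptive learning rate, L2 regularization is not equivalent to weight decay, so the Adam-weight-decay algorithm applies the decay to the parameters directly instead of adding it to the gradient.
The parameter update equations are: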
\[
\begin{aligned}
& V_t = \beta_1 * V_{t-1} + (1-\beta_1) * grad \\
& S_t = \beta_2 * S_{t-1} + (1-\beta_2) * {grad} \odot {grad} \\
& \hat{g} = learning\_rate * \left(\frac{V_t}{\sqrt{S_t}+\epsilon} + \lambda * param_{old}\right) \\
& param_{new} = param_{old} - \hat{g}
\end{aligned}
\]
Parameters
params (iterable) – iterable of parameters to optimize or dicts defining parameter groups
lr (float, optional) – learning rate (default: 1e-3)
betas (Tuple[float, float], optional) – coefficients used for computing running averages of gradient and its square (default: (0.9, 0.999))
eps (float, optional) – term added to the denominator to improve numerical stability (default: 1e-8)
weight_decay (float, optional) – decoupled weight decay coefficient, λ in the equation (default: 0)
amsgrad (bool, optional) – whether to use the AMSGrad variant of this algorithm. (default: False)
do_bias_correction (bool, optional) – whether to apply bias correction (default: True)
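The update equations and the parameters above can be sketched for a single scalar parameter in plain Python. This is a minimal illustration, not OneFlow's implementation: the function name adamw_step is hypothetical, the bias-correction branch assumes the standard Adam form (dividing V_t and S_t by 1 - β^t), which the equations do not show, and amsgrad is left out.

```python
import math


def adamw_step(param, grad, v, s, step, lr=1e-3, betas=(0.9, 0.999),
               eps=1e-8, weight_decay=0.0, do_bias_correction=True):
    """One AdamW update for a scalar parameter (hypothetical helper).

    `step` is the 1-based update count. Returns (new_param, v, s).
    """
    beta1, beta2 = betas
    # Running averages of the gradient and of its square.
    v = beta1 * v + (1 - beta1) * grad
    s = beta2 * s + (1 - beta2) * grad * grad

    v_hat, s_hat = v, s
    if do_bias_correction:
        # Assumption: standard Adam bias correction.
        v_hat = v / (1 - beta1 ** step)
        s_hat = s / (1 - beta2 ** step)

    # Decoupled weight decay: lambda * param is added outside the
    # adaptive term instead of being folded into the gradient.
    g_hat = lr * (v_hat / (math.sqrt(s_hat) + eps) + weight_decay * param)
    return param - g_hat, v, s


# Usage: a few steps on f(x) = x^2, whose gradient is 2x.
x, v, s = 1.0, 0.0, 0.0
for t in range(1, 4):
    x, v, s = adamw_step(x, 2.0 * x, v, s, t, lr=1e-3, weight_decay=0.01)
    print(t, x)
```

With grad equal to zero the adaptive term vanishes and the parameter still shrinks by lr * weight_decay * param, which is the decoupling that the equations describe.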
For example:
Example 1:
    import oneflow as flow

    # Assume net is a custom model.
    adamw = flow.optim.AdamW(net.parameters(), lr=1e-3)

    for epoch in range(epochs):
        # Read data, compute the loss, and so on.
        # ...
        loss.backward()
        adamw.step()
        adamw.zero_grad()
Example 2:
    import oneflow as flow

    # Assume net is a custom model.
    adamw = flow.optim.AdamW(
        [
            {
                "params": net.parameters(),
                "lr": learning_rate,
                "clip_grad_max_norm": 0.5,
                "clip_grad_norm_type": 2.0,
            }
        ],
    )

    for epoch in range(epochs):
        # Read data, compute the loss, and so on.
        # ...
        loss.backward()
        adamw.clip_grad()
        adamw.step()
        adamw.zero_grad()
To use clip_grad, follow Example 2.
For more details on clip_grad_max_norm and clip_grad_norm_type, refer to oneflow.nn.utils.clip_grad_norm_().
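The effect of the two clipping options can be sketched in plain Python on a list of scalar gradients. This is an assumption-laden illustration of common gradient-norm clipping, not OneFlow's code: the helper name clip_grad_norm_sketch and the eps value are hypothetical, and the real function operates on tensors.

```python
import math


def clip_grad_norm_sketch(grads, max_norm, norm_type=2.0, eps=1e-6):
    """Rescale `grads` in place so their norm is at most about `max_norm`.

    Hypothetical helper; returns the total norm measured before clipping.
    """
    if math.isinf(norm_type):
        total_norm = max(abs(g) for g in grads)
    else:
        total_norm = sum(abs(g) ** norm_type for g in grads) ** (1.0 / norm_type)

    # Assumption: a small eps guards against division by zero.
    clip_coef = max_norm / (total_norm + eps)
    if clip_coef < 1.0:
        for i in range(len(grads)):
            grads[i] *= clip_coef
    return total_norm


# Usage: same settings as Example 2 (max norm 0.5, L2 norm).
grads = [3.0, 4.0]
norm_before = clip_grad_norm_sketch(grads, max_norm=0.5, norm_type=2.0)
print(norm_before, grads)
```

Gradients whose norm is already below the maximum are left unchanged, so clipping only affects steps with unusually large gradients.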
__init__(params: Union[Iterator[oneflow.nn.Parameter], List[Dict]], lr: float = 0.001, betas: Tuple[float, float] = (0.9, 0.999), eps: float = 1e-08, weight_decay: float = 0, amsgrad: bool = False, do_bias_correction: bool = True)
Initialize self. See help(type(self)) for accurate signature.
Methods
__delattr__(name, /) – Implement delattr(self, name).
__dir__() – Default dir() implementation.
__eq__(value, /) – Return self==value.
__format__(format_spec, /) – Default object formatter.
__ge__(value, /) – Return self>=value.
__getattribute__(name, /) – Return getattr(self, name).
__gt__(value, /) – Return self>value.
__hash__() – Return hash(self).
__init__(params[, lr, betas, eps, …]) – Initialize self.
__init_subclass__ – This method is called when a class is subclassed.
__le__(value, /) – Return self<=value.
__lt__(value, /) – Return self<value.
__ne__(value, /) – Return self!=value.
__new__(**kwargs) – Create and return a new object.
__reduce__() – Helper for pickle.
__reduce_ex__(protocol, /) – Helper for pickle.
__repr__() – Return repr(self).
__setattr__(name, value, /) – Implement setattr(self, name, value).
__sizeof__() – Size of object in memory, in bytes.
__str__() – Return str(self).
__subclasshook__ – Abstract classes can override this to customize issubclass().
_check_variables_in_graph(vars_conf)
_check_variables_optimizer_bound(vars_conf)
_generate_conf_for_graph(train_conf, vars_conf)
_generate_grad_clip_conf_for_optim_conf(…)
_generate_indexed_slices_optimizer_conf(…)
_parse_input_parameters(parameters) – Supports such parameters:
add_param_group(param_group) – Add a param group to the Optimizer's param_groups.
clip_grad() – Clips gradient norm of an iterable of parameters.
load_state_dict(state_dict) – Load the state of the optimizer which is created by state_dict function.
state_dict() – Returns the state of the optimizer as a dict.
step([closure]) – Performs a single optimization step.
zero_grad([set_to_none]) – Sets the gradients of all optimized oneflow.Tensors to zero.
Attributes
support_sparse – Whether the AdamW optimizer supports sparse updates.