oneflow.optim.Adam

class oneflow.optim.Adam(params: Union[Iterator[oneflow.nn.Parameter], List[Dict]], lr: float = 0.001, betas: Tuple[float, float] = (0.9, 0.999), eps: float = 1e-08, weight_decay: float = 0, amsgrad: bool = False, do_bias_correction: bool = True, contiguous_params: bool = False, fused: bool = False)

Implements the Adam algorithm.

It was proposed in Adam: A Method for Stochastic Optimization. The implementation of the L2 penalty follows the changes proposed in Decoupled Weight Decay Regularization.

This algorithm can adjust the learning rate of each parameter dynamically according to the 1st-moment and 2nd-moment estimates of the gradient.

The parameter update equations are:

\[
\begin{aligned}
& V_t = \beta_1 V_{t-1} + (1-\beta_1) \cdot grad \\
& S_t = \beta_2 S_{t-1} + (1-\beta_2) \cdot grad \odot grad \\
& \hat{g} = learning\_rate \cdot \frac{V_t}{\sqrt{S_t}+\epsilon} \\
& param_{new} = param_{old} - \hat{g}
\end{aligned}
\]
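To make the update concrete, here is a minimal scalar walk-through of the equations above in plain Python (no bias correction, matching the formulas as written; the gradient values and hyper-parameters are made up):

beta1, beta2, lr, eps = 0.9, 0.999, 1e-3, 1e-8
v, s, param = 0.0, 0.0, 1.0                     # V_0, S_0, and the initial parameter

for grad in [0.5, -0.3, 0.2]:                   # a few made-up gradient values
    v = beta1 * v + (1 - beta1) * grad          # 1st-moment estimate V_t
    s = beta2 * s + (1 - beta2) * grad * grad   # 2nd-moment estimate S_t
    g_hat = lr * v / (s ** 0.5 + eps)           # scaled step \hat{g}
    param = param - g_hat                       # param_new = param_old - \hat{g}

print(param)                                    # the parameter after three toy updates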
Parameters
  • params (iterable) – iterable of parameters to optimize or dicts defining parameter groups

  • lr (float, optional) – learning rate (default: 1e-3)

  • betas (Tuple[float, float], optional) – coefficients used for computing running averages of gradient and its square (default: (0.9, 0.999))

  • eps (float, optional) – term added to the denominator to improve numerical stability (default: 1e-8)

  • weight_decay (float, optional) – weight decay (L2 penalty) (default: 0)

  • amsgrad (bool, optional) – whether to use the AMSGrad variant of this algorithm. (default: False)

  • do_bias_correction (bool, optional) – whether to do bias correction (default: True)

  • contiguous_params (bool, optional) – whether to use a contiguous ParamGroup, which puts all parameters of the same type, device, and group into one tensor and updates them together (default: False)

  • fused (bool, optional) – whether to divide all the parameters into several groups and update each group with a fused kernel (default: False)

For example:

Example 1:

import oneflow as flow

# Assume net is a custom model.
adam = flow.optim.Adam(net.parameters(), lr=1e-3)

for epoch in range(epochs):
    # Read data, compute the loss, and so on.
    # ...
    loss.backward()
    adam.step()
    adam.zero_grad()

Example 2:

import oneflow as flow

# Assume net is a custom model and learning_rate is a float.
adam = flow.optim.Adam(
    [
        {
            "params": net.parameters(),
            "lr": learning_rate,
            "clip_grad_max_norm": 0.5,
            "clip_grad_norm_type": 2.0,
        }
    ],
)

for epoch in range(epochs):
    # Read data, compute the loss, and so on.
    # ...
    loss.backward()
    adam.clip_grad()
    adam.step()
    adam.zero_grad()

If you want to use clip_grad, you can refer to this example.

For more details on clip_grad_max_norm and clip_grad_norm_type, see oneflow.nn.utils.clip_grad_norm_().
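The examples above omit the model and data. Below is a self-contained toy run of the same API; the linear model, random data, and hyper-parameter values are illustrative only and not part of the official example:

import oneflow as flow

model = flow.nn.Linear(4, 1)          # toy stand-in for "net"
loss_fn = flow.nn.MSELoss()
adam = flow.optim.Adam(
    model.parameters(),
    lr=1e-3,
    betas=(0.9, 0.999),
    weight_decay=1e-4,                # L2 penalty
    do_bias_correction=True,
)

x = flow.randn(8, 4)                  # random inputs
y = flow.randn(8, 1)                  # random targets

for step in range(5):
    loss = loss_fn(model(x), y)
    loss.backward()
    adam.step()
    adam.zero_grad()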

__init__(params: Union[Iterator[oneflow.nn.Parameter], List[Dict]], lr: float = 0.001, betas: Tuple[float, float] = (0.9, 0.999), eps: float = 1e-08, weight_decay: float = 0, amsgrad: bool = False, do_bias_correction: bool = True, contiguous_params: bool = False, fused: bool = False)

Initialize self. See help(type(self)) for accurate signature.

Methods


_check_variables_in_graph(vars_conf)

_check_variables_optimizer_bound(vars_conf)

_fused_update(param_group)

_generate_conf_for_graph(train_conf, vars_conf)

_generate_grad_clip_conf_for_optim_conf(…)

_generate_indexed_slices_optimizer_conf(…)

_generate_lr_scale_for_optim_conf(…)

_parse_input_parameters(parameters)

Supports such parameters:

_single_tensor_update(param_group)

add_param_group(param_group)

Add a param group to the Optimizer's param_groups.

clip_grad([error_if_nonfinite])

Clips gradient norm of an iterable of parameters.

load_state_dict(state_dict)

Load the state of the optimizer created by the state_dict function.

state_dict()

Returns the state of the optimizer as a dict (see the sketch after this listing).

step([closure])

Performs a single optimization step.

zero_grad([set_to_none])

Sets the gradients of all optimized oneflow.Tensor objects to zero.
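A brief usage sketch of state_dict and load_state_dict from the listing above, assuming adam and model already exist as in the earlier examples; the save path is illustrative, and path handling may vary across OneFlow versions:

import oneflow as flow

flow.save(adam.state_dict(), "adam_state")                # capture and persist the optimizer state

restored = flow.optim.Adam(model.parameters(), lr=1e-3)   # a fresh optimizer over the same parameters
restored.load_state_dict(flow.load("adam_state"))         # resume from the saved state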

Attributes

support_sparse

Whether the Optimizer supports sparse updates.