oneflow.optim.Adam

class oneflow.optim.Adam(params: Union[Iterator[oneflow.nn.Parameter], List[Dict]], lr: float = 0.001, betas: Tuple[float, float] = (0.9, 0.999), eps: float = 1e-08, weight_decay: float = 0, amsgrad: bool = False, do_bias_correction: bool = True, contiguous_params: bool = False, fused: bool = False)

Implements the Adam algorithm.

It was proposed in Adam: A Method for Stochastic Optimization. The implementation of the L2 penalty follows the changes proposed in Decoupled Weight Decay Regularization.

This algorithm can adjust the learning rate of each parameter dynamically according to the 1st-moment and 2nd-moment estimates of the gradient.

The parameter update equations are:

\[
\begin{aligned}
& V_t = \beta_1 V_{t-1} + (1-\beta_1) \cdot grad \\
& S_t = \beta_2 S_{t-1} + (1-\beta_2) \cdot grad \odot grad \\
& \hat{g} = learning\_rate \cdot \frac{V_t}{\sqrt{S_t}+\epsilon} \\
& param_{new} = param_{old} - \hat{g}
\end{aligned}
\]
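To make the update concrete, here is a minimal scalar walk-through of the equations above in plain Python (no bias correction, matching the formulas as written; the gradient values and hyper-parameters are made up):

beta1, beta2, lr, eps = 0.9, 0.999, 1e-3, 1e-8
v, s, param = 0.0, 0.0, 1.0                     # V_0, S_0, and the initial parameter

for grad in [0.5, -0.3, 0.2]:                   # a few made-up gradient values
    v = beta1 * v + (1 - beta1) * grad          # 1st-moment estimate V_t
    s = beta2 * s + (1 - beta2) * grad * grad   # 2nd-moment estimate S_t
    g_hat = lr * v / (s ** 0.5 + eps)           # scaled step \hat{g}
    param = param - g_hat                       # param_new = param_old - \hat{g}

print(param)                                    # the parameter after three toy updates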
Parameters
  • params (iterable) – iterable of parameters to optimize or dicts defining parameter groups

  • lr (float, optional) – learning rate (default: 1e-3)

  • betas (Tuple[float, float], optional) – coefficients used for computing running averages of gradient and its square (default: (0.9, 0.999))

  • eps (float, optional) – term added to the denominator to improve numerical stability (default: 1e-8)

  • weight_decay (float, optional) – weight decay (L2 penalty) (default: 0)

  • amsgrad (bool, optional) – whether to use the AMSGrad variant of this algorithm. (default: False)

  • do_bias_correction (bool, optional) – whether to do bias correction (default: True)

  • contiguous_params (bool, optional) – whether to use a contiguous ParamGroup, which puts all parameters of the same type, device, and group into one tensor and updates them together (default: False)

  • fused (bool, optional) – whether to divide all the parameters into several groups and update each group with a fused kernel (default: False)

For example:

Example 1:

import oneflow as flow

# Assume net is a custom model.
adam = flow.optim.Adam(net.parameters(), lr=1e-3)

for epoch in range(epochs):
    # Read data, compute the loss, and so on.
    # ...
    loss.backward()
    adam.step()
    adam.zero_grad()

Example 2:

import oneflow as flow

# Assume net is a custom model and learning_rate is a float.
adam = flow.optim.Adam(
    [
        {
            "params": net.parameters(),
            "lr": learning_rate,
            "clip_grad_max_norm": 0.5,
            "clip_grad_norm_type": 2.0,
        }
    ],
)

for epoch in range(epochs):
    # Read data, compute the loss, and so on.
    # ...
    loss.backward()
    adam.clip_grad()
    adam.step()
    adam.zero_grad()

If you want to use clip_grad, you can refer to this example.

For more details on clip_grad_max_norm and clip_grad_norm_type, see oneflow.nn.utils.clip_grad_norm_().
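The examples above omit the model and data. Below is a self-contained toy run of the same API; the linear model, random data, and hyper-parameter values are illustrative only and not part of the official example:

import oneflow as flow

model = flow.nn.Linear(4, 1)          # toy stand-in for "net"
loss_fn = flow.nn.MSELoss()
adam = flow.optim.Adam(
    model.parameters(),
    lr=1e-3,
    betas=(0.9, 0.999),
    weight_decay=1e-4,                # L2 penalty
    do_bias_correction=True,
)

x = flow.randn(8, 4)                  # random inputs
y = flow.randn(8, 1)                  # random targets

for step in range(5):
    loss = loss_fn(model(x), y)
    loss.backward()
    adam.step()
    adam.zero_grad()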

__init__(params: Union[Iterator[oneflow.nn.Parameter], List[Dict]], lr: float = 0.001, betas: Tuple[float, float] = (0.9, 0.999), eps: float = 1e-08, weight_decay: float = 0, amsgrad: bool = False, do_bias_correction: bool = True, contiguous_params: bool = False, fused: bool = False)

Initialize self. See help(type(self)) for accurate signature.

Methods


_check_variables_in_graph(vars_conf)

_check_variables_optimizer_bound(vars_conf)

_fused_update(param_group)

_generate_conf_for_graph(train_conf, vars_conf)

_generate_grad_clip_conf_for_optim_conf(…)

_generate_indexed_slices_optimizer_conf(…)

_generate_lr_scale_for_optim_conf(…)

_parse_input_parameters(parameters)

Supports such parameters:

_single_tensor_update(param_group)

add_param_group(param_group)

Add a param group to the Optimizer's param_groups.

clip_grad([error_if_nonfinite])

Clips gradient norm of an iterable of parameters.

load_state_dict(state_dict)

Load the state of the optimizer created by the state_dict function.

state_dict()

Returns the state of the optimizer as a dict (see the sketch after this listing).

step([closure])

Performs a single optimization step.

zero_grad([set_to_none])

Sets the gradients of all optimized oneflow.Tensor objects to zero.
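A brief usage sketch of state_dict and load_state_dict from the listing above, assuming adam and model already exist as in the earlier examples; the save path is illustrative, and path handling may vary across OneFlow versions:

import oneflow as flow

flow.save(adam.state_dict(), "adam_state")                # capture and persist the optimizer state

restored = flow.optim.Adam(model.parameters(), lr=1e-3)   # a fresh optimizer over the same parameters
restored.load_state_dict(flow.load("adam_state"))         # resume from the saved state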

Attributes

support_sparse

Whether the Optimizer supports sparse updates.