oneflow.optim

oneflow.optim is a package implementing various optimization algorithms. The most commonly used methods are already supported, and the interface is general enough that more sophisticated ones can also be integrated easily in the future.

How to use an optimizer

To use oneflow.optim you have to construct an optimizer object that will hold the current state and update the parameters based on the computed gradients.

Constructing it

To construct an Optimizer you have to give it an iterable containing the parameters (all of which should be oneflow.Tensor objects) to optimize. Then, you can specify optimizer-specific options such as the learning rate, weight decay, etc.

Note

If you need to move a model to GPU via .cuda(), please do so before constructing optimizers for it. Parameters of a model after .cuda() will be different objects from those before the call.

In general, you should make sure that optimized parameters live in consistent locations when optimizers are constructed and used.

Example:

import oneflow
import oneflow.nn as nn
import oneflow.optim as optim

model = nn.Linear(16, 3)
optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
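If the model is to be moved to the GPU, as described in the Note above, do the move before constructing the optimizer. A minimal sketch, assuming a CUDA device is available:

import oneflow.nn as nn
import oneflow.optim as optim

model = nn.Linear(16, 3)
model.cuda()  # move the model to the GPU first (see the Note above)
# construct the optimizer only after the move, so it holds references
# to the parameters in their final location
optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9)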

Per-parameter options

Optimizers also support specifying per-parameter options. To do this, instead of passing an iterable of Tensor objects, pass in an iterable of dicts. Each of them defines a separate parameter group and should contain a params key with a list of the parameters belonging to it. Other keys should match the keyword arguments accepted by the optimizer, and will be used as optimization options for that group.

Note

You can still pass options as keyword arguments. They will be used as defaults in the groups that didn't override them. This is useful when you only want to vary a single option while keeping all others consistent between parameter groups.

For example, this is very useful when one wants to specify per-layer learning rates:

import oneflow.nn as nn
import oneflow.optim as optim


class Model(nn.Module):
    def __init__(self):
        super(Model, self).__init__()
        self.base = nn.Linear(64, 32)
        self.classifier = nn.Linear(32, 10)

    def forward(self, x):
        out = self.base(x)
        out = self.classifier(out)
        return out


model = Model()
optim.SGD(
    [
        {"params": model.base.parameters()},
        {"params": model.classifier.parameters(), "lr": 1e-3},
    ],
    lr=1e-2,
    momentum=0.9,
)

This means that model.base’s parameters will use the default learning rate of 1e-2, model.classifier’s parameters will use a learning rate of 1e-3, and a momentum of 0.9 will be used for all parameters.
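As a quick check, the resolved options of each group can be inspected through the optimizer's param_groups. A minimal sketch, reusing the Model class above and assuming that group options can be read by key:

model = Model()
optimizer = optim.SGD(
    [
        {"params": model.base.parameters()},
        {"params": model.classifier.parameters(), "lr": 1e-3},
    ],
    lr=1e-2,
    momentum=0.9,
)

for i, group in enumerate(optimizer.param_groups):
    # group 0 falls back to the default lr of 1e-2; group 1 overrides it with 1e-3
    print(i, group["lr"])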

Taking an optimization step

All optimizers implement a step() method that updates the parameters. It can be used in two ways:

optimizer.step()

This is a simplified version supported by most optimizers. The function can be called once the gradients have been computed, e.g. by backward(). The second form, which passes a closure to step(), is sketched after the example below.

Example:

import oneflow
import oneflow.nn as nn
import oneflow.nn.functional as F
import oneflow.optim as optim
from oneflow.utils.data import Dataset, DataLoader


class CustomDataset(Dataset):
    def __init__(self, num):
        self.inputs = oneflow.randn(num, 1)
        self.targets = oneflow.sin(self.inputs)

    def __len__(self):
        return self.inputs.shape[0]

    def __getitem__(self, index):
        return self.inputs[index], self.targets[index]


class Model(nn.Module):
    def __init__(self, input_size):
        super(Model, self).__init__()
        self.linear1 = nn.Linear(input_size, 64)
        self.linear2 = nn.Linear(64, input_size)

    def forward(self, x):
        out = self.linear1(x)
        return self.linear2(F.relu(out))


dataset = CustomDataset(10000)
dataloader = DataLoader(dataset, batch_size=10)
model = Model(1)
loss_fn = nn.MSELoss()
optimizer = optim.SGD(model.parameters(), lr=1e-3)

for epoch in range(100):
    for input, target in dataloader:
        optimizer.zero_grad()
        output = model(input)
        loss = loss_fn(output, target)
        loss.backward()
        optimizer.step()
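
optimizer.step(closure)

Some optimization algorithms, such as LBFGS, need to re-evaluate the objective several times per step, so they take a closure that recomputes it. The closure should clear the gradients, compute the loss, call backward(), and return the loss. Below is a minimal sketch reusing the model, loss_fn, and dataloader from the example above; it assumes OneFlow's LBFGS follows this closure-based step() interface.

Example:

optimizer = optim.LBFGS(model.parameters(), lr=1e-2)

for input, target in dataloader:
    def closure():
        # called by the optimizer, possibly several times per step
        optimizer.zero_grad()
        output = model(input)
        loss = loss_fn(output, target)
        loss.backward()
        return loss

    optimizer.step(closure)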

Base class

class oneflow.optim.Optimizer(parameters, options)

Optimizer.add_param_group

Add a param group to the Optimizer's param_groups.

Optimizer.load_state_dict

Load the state of the optimizer created by the state_dict function.

Optimizer.state_dict

Returns the state of the optimizer as a dict.

Optimizer.step

Performs a single optimization step (parameter update).

Optimizer.zero_grad

Sets the gradients of all optimized oneflow.Tensor objects to zero.
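
As a brief illustration of how these methods fit together, here is a minimal sketch that adds a parameter group after construction and round-trips the optimizer state; the extra module and hyperparameter values are only examples:

import oneflow.nn as nn
import oneflow.optim as optim

model = nn.Linear(16, 3)
extra_head = nn.Linear(3, 3)  # e.g. a layer added or unfrozen later

optimizer = optim.SGD(model.parameters(), lr=1e-2, momentum=0.9)

# add_param_group: register additional parameters with their own options
optimizer.add_param_group({"params": extra_head.parameters(), "lr": 1e-3})

# state_dict / load_state_dict: the returned dict is what you would
# serialize alongside the model when checkpointing
state = optimizer.state_dict()
optimizer.load_state_dict(state)

# zero_grad: clear the gradients before the next backward pass
optimizer.zero_grad()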

Algorithms

Adagrad

Implements Adagrad Optimizer.

Adam

Implements Adam algorithm.

AdamW

Implements AdamW algorithm.

LAMB

Implements LAMB algorithm.

RMSprop

Implements RMSprop algorithm.

SGD

Implements SGD algorithm.

LBFGS

Implements the LBFGS algorithm.
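
Each of these optimizers is constructed in the same way as SGD above: pass the parameters (or parameter groups) to optimize, plus algorithm-specific options. A minimal sketch; the hyperparameter values are arbitrary and the keyword names are assumed to follow the usual conventions for these algorithms:

import oneflow.nn as nn
import oneflow.optim as optim

model = nn.Linear(16, 3)

optimizer = optim.Adam(model.parameters(), lr=1e-3, betas=(0.9, 0.999))
# or, for example:
# optimizer = optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-2)
# optimizer = optim.RMSprop(model.parameters(), lr=1e-3)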

Adjust Learning Rate

oneflow.optim.lr_scheduler provides several methods to adjust the learning rate based on the number of epochs. oneflow.optim.lr_scheduler.ReduceLROnPlateau allows dynamic learning rate reduction based on some validation measurements.

Learning rate scheduling should be applied after the optimizer's update; e.g., you should write your code this way:

Example:

import oneflow
import oneflow.nn as nn
import oneflow.nn.functional as F
import oneflow.optim as optim
from oneflow.utils.data import Dataset, DataLoader


class CustomDataset(Dataset):
    def __init__(self, num):
        self.inputs = oneflow.randn(num, 1)
        self.targets = oneflow.sin(self.inputs)

    def __len__(self):
        return self.inputs.shape[0]

    def __getitem__(self, index):
        return self.inputs[index], self.targets[index]


class Model(nn.Module):
    def __init__(self, input_size):
        super(Model, self).__init__()
        self.linear1 = nn.Linear(input_size, 64)
        self.linear2 = nn.Linear(64, input_size)

    def forward(self, x):
        out = self.linear1(x)
        return self.linear2(F.relu(out))


dataset = CustomDataset(10000)
dataloader = DataLoader(dataset, batch_size=10)
model = Model(1)
loss_fn = nn.MSELoss()
optimizer = optim.SGD(model.parameters(), lr=1e-3)
scheduler = optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.9)

for epoch in range(20):
    for input, target in dataloader:
        optimizer.zero_grad()
        output = model(input)
        loss = loss_fn(output, target)
        loss.backward()
        optimizer.step()
    scheduler.step()

Most learning rate schedulers can be chained (also referred to as chaining schedulers).

Example:

import oneflow
import oneflow.nn as nn
import oneflow.nn.functional as F
import oneflow.optim as optim
from oneflow.utils.data import Dataset, DataLoader


class CustomDataset(Dataset):
    def __init__(self, num):
        self.inputs = oneflow.randn(num, 1)
        self.targets = oneflow.sin(self.inputs)

    def __len__(self):
        return self.inputs.shape[0]

    def __getitem__(self, index):
        return self.inputs[index], self.targets[index]


class Model(nn.Module):
    def __init__(self, input_size):
        super(Model, self).__init__()
        self.linear1 = nn.Linear(input_size, 64)
        self.linear2 = nn.Linear(64, input_size)

    def forward(self, x):
        out = self.linear1(x)
        return self.linear2(F.relu(out))


dataset = CustomDataset(10000)
dataloader = DataLoader(dataset, batch_size=10)
model = Model(1)
loss_fn = nn.MSELoss()
optimizer = optim.SGD(model.parameters(), lr=1e-3)
scheduler1 = optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.9)
scheduler2 = optim.lr_scheduler.MultiStepLR(optimizer, milestones=[5, 10], gamma=0.1)

for epoch in range(20):
    for input, target in dataloader:
        optimizer.zero_grad()
        output = model(input)
        loss = loss_fn(output, target)
        loss.backward()
        optimizer.step()
    scheduler1.step()
    scheduler2.step()

In many places in the documentation, we will use the following template to refer to scheduler algorithms.

>>> scheduler = ...
>>> for epoch in range(100):
>>>     train(...)
>>>     validate(...)
>>>     scheduler.step()
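
oneflow.optim.lr_scheduler.ReduceLROnPlateau is the exception to this template: it is not stepped unconditionally, but is given the monitored metric in its step() call. A schematic sketch in the same style, assuming a validation loss is produced by the (placeholder) validate call and that the mode/factor/patience arguments follow the usual ReduceLROnPlateau interface:

>>> scheduler = optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode="min", factor=0.1, patience=5)
>>> for epoch in range(100):
>>>     train(...)
>>>     val_loss = validate(...)
>>>     scheduler.step(val_loss)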

Warning

If you use the learning rate scheduler (calling scheduler.step()) before the optimizer's update (calling optimizer.step()), the first value of the learning rate schedule will be skipped. If results look wrong, check whether you are calling scheduler.step() at the wrong time.

lr_scheduler.CosineAnnealingLR

Set the learning rate of each parameter group using a cosine annealing schedule, where \(\eta_{max}\) is set to the initial lr and \(T_{cur}\) is the number of epochs since the last restart in SGDR.

lr_scheduler.CosineDecayLR

Creates a cosine decayed learning rate scheduler.

lr_scheduler.ExponentialLR

Decays the learning rate of each parameter group by gamma every epoch.

lr_scheduler.LambdaLR

Sets the learning rate of each parameter group to the initial lr times a given function.

lr_scheduler.MultiStepLR

Decays the learning rate of each parameter group by gamma once the number of steps reaches one of the milestones.

lr_scheduler.PolynomialLR

Creates a polynomial decayed learning rate scheduler.

lr_scheduler.ReduceLROnPlateau

Reduce learning rate when a metric has stopped improving.

lr_scheduler.StepLR

Decays the learning rate of each parameter group by gamma every step_size steps.

lr_scheduler.ConstantLR

Decays the learning rate of each parameter group by a small constant factor until the number of steps reaches a pre-defined milestone: total_iters.

lr_scheduler.LinearLR

Decays the learning rate of each parameter group by a linearly changing small multiplicative factor until the number of steps reaches a pre-defined milestone: total_iters.

lr_scheduler.ChainedScheduler

Chains a list of learning rate schedulers.

lr_scheduler.SequentialLR

Receives a list of schedulers that are expected to be called sequentially during the optimization process, and milestone points that provide the exact intervals indicating which scheduler is supposed to be called at a given step.

lr_scheduler.CosineAnnealingWarmRestarts

Set the learning rate of each parameter group using a cosine annealing schedule, where \(\eta_{max}\) is set to the initial lr, \(T_{cur}\) is the number of steps since the last restart, and \(T_{i}\) is the number of steps between two warm restarts in SGDR.
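
For example, a warm-restart schedule could be set up as follows, in the same schematic style as the template above; the T_0 and T_mult argument names (the length of the first cycle and the cycle-length multiplier) are assumptions based on the SGDR description:

>>> scheduler = optim.lr_scheduler.CosineAnnealingWarmRestarts(optimizer, T_0=10, T_mult=2)
>>> for epoch in range(100):
>>>     train(...)
>>>     validate(...)
>>>     scheduler.step()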