OneFlow API Reference

Distributed performance, that is, high efficiency at scale, is the core technical challenge for deep learning frameworks.

Centered on performance improvement and heterogeneous distributed scaling, OneFlow upholds its core concept and architecture of static compilation and streaming parallelism, addressing the memory-wall challenge at cluster level with world-leading technology.

Troubleshooting

  • ‘libunwind.h’ not found

    • You might add the CMake argument -DWITH_UNWIND=OFF, or install libunwind on your system.

  • CUDNN_STATUS_NOT_INITIALIZED

    • You might see error messages like these:

      I0729 22:37:45.483937439   56788 ev_epoll_linux.c:82]        Use of signals is disabled. Epoll engine will not be used
      E0729 22:37:45.515343 56788 version.cpp:82] Failed to get cuda runtime version: CUDA driver version is insufficient for CUDA runtime version
      F0729 22:38:31.209002 56788 improver.cpp:535] Check failed: mem_size > 0 (-524288000 vs. 0)
      
      F0723 19:05:56.194067 40970 cuda_util.cpp:82] Check failed: error == CUDNN_STATUS_SUCCESS (1 vs. 0) CUDNN_STATUS_NOT_INITIALIZED
      
    • Please upgrade to a newer Nvidia Linux x86_64 driver; version >= 440.33 is recommended.

    • For more information, please refer to the CUDA compatibility documentation.

  • Failed to compile .cu files

  • How do I know what compilers and flags are used to compile OneFlow?

    • Run make clean && make VERBOSE=1 to get the exact compile commands, including the compiler path and flags.

  • How to compile OneFlow with RDMA support?

    • Add the CMake flag -DBUILD_RDMA when compiling OneFlow.

  • Which version of g++ is CMake using to build OneFlow?

    • You should find a line like this in CMake output:

      -- CMAKE_CXX_COMPILER_VERSION: [YOUR G++ VERSION NUMBER]
      
  • Failed to compile NCCL

    • Try using fewer threads when compiling OneFlow's third-party dependencies. For instance, use

      cmake -DTHIRD_PARTY=ON .. && make
      

      instead of

      cmake -DTHIRD_PARTY=ON .. && make -j$(nproc)
      
  • "CUDA_VERSION" "VERSION_GREATER_EQUAL" "10.0"

    • Please use a newer version of CMake

    • Make sure cmake is correctly included in PATH

  • CUBLAS not found

    • Usually it happens when using CUDA 10.1 or newer

    • You should see an error message from CMake like this:

      cuda lib not found: /usr/local/miniconda3/envs/dl/lib/libcublas_static.a or
      /usr/local/cuda/lib64/libcublas_static.a
      
    • Make sure libcublas_static.a is in one of the two directories.

  • When running OneFlow in gdb, there is no debug information for code location.

    • Add the CMake flag -DCMAKE_BUILD_TYPE=RELWITHDEBINFO or -DCMAKE_BUILD_TYPE=DEBUG and recompile.

  • libof_ccobj.a: File truncated

    • You might see an error message like this:

      /usr/bin/ar: libof_ccobj.a: File truncated
      make[2]: *** [libof_ccobj.a] Error 1
      make[2]: *** Deleting file `libof_ccobj.a'
      make[1]: *** [CMakeFiles/of_ccobj.dir/all] Error 2
      make: *** [all] Error 2
      
    • You should upgrade your GNU Binutils. Version 2.33.1 is recommended. If you are using conda, you could install it by running conda install -c conda-forge binutils

  • Failed to compile because C++ 17 is enabled

    • In some cases, the environment variable CXXFLAGS is not empty and contains --std c++17.

    • Check if it is empty by running echo $CXXFLAGS and clear it with unset CXXFLAGS.

    • If you are using conda, to make the changes on environment variables permanent, you can run:

      conda env config vars set CXXFLAGS="-fPIC"
      
  • cmake outputs the error No CMAKE_ASM_NASM_COMPILER could be found.

    • Install nasm. For instance, run sudo yum install nasm if you are on CentOS.

  • No module named 'google.protobuf'

    • You might see an error message like this:

      Scanning dependencies of target generate_api
      ...
          from google.protobuf import descriptor as _descriptor
      ModuleNotFoundError: No module named 'google.protobuf'
      CMakeFiles/generate_api.dir/build.make:57: recipe for target 'CMakeFiles/generate_api' failed
      make[2]: *** [CMakeFiles/generate_api] Error 1
      
    • Install development dependencies by running:

      pip3 install -r dev-requirements.txt
      
  • gdb warns ptrace: Operation not permitted and the gdb command bt prints no backtrace

    • You might get this warning when debugging OneFlow with gdb inside a docker container. Try adding these flags when launching your container:

      docker run --cap-add=SYS_PTRACE --security-opt seccomp=unconfined
      
    • Please refer to https://stackoverflow.com/questions/19215177/how-to-solve-ptrace-operation-not-permitted-when-trying-to-attach-gdb-to-a-pro

  • It takes too long to download python packages when running make

    • If you are in China, you could run this to have pip download packages from a domestic PyPI mirror:

      python3 -m pip config set global.index-url https://pypi.tuna.tsinghua.edu.cn/simple
      
    • For more information on this, please refer to the PyPI mirror usage guide.

oneflow

The oneflow package contains data structures for multi-dimensional tensors and defines mathematical operations over these tensors. Additionally, it provides many utilities for efficient serialization of Tensors and arbitrary types, and other useful utilities.

It has a CUDA counterpart that enables you to run tensor computations on an NVIDIA GPU with compute capability >= 3.0.

Tensor

BoolTensor

Creates a Tensor with dtype oneflow.bool on the CPU; it takes the same parameters as oneflow.Tensor().

ByteTensor

Creates a Tensor with dtype oneflow.uint8 on the CPU; it takes the same parameters as oneflow.Tensor().

CharTensor

Creates a Tensor with dtype oneflow.int8 on the CPU; it takes the same parameters as oneflow.Tensor().

DoubleTensor

Creates a Tensor with dtype oneflow.float64 on the CPU; it takes the same parameters as oneflow.Tensor().

FloatTensor

Creates a Tensor with dtype oneflow.float32 on the CPU; it takes the same parameters as oneflow.Tensor().

HalfTensor

Creates a Tensor with dtype oneflow.float16 on the CPU; it takes the same parameters as oneflow.Tensor().

IntTensor

Creates a Tensor with dtype oneflow.int32 on the CPU; it takes the same parameters as oneflow.Tensor().

LongTensor

Creates a Tensor with dtype oneflow.int64 on the CPU; it takes the same parameters as oneflow.Tensor().

is_tensor

Note that this function is simply doing isinstance(obj, Tensor).

is_floating_point

Returns True if the data type of input is a floating point data type i.e., one of oneflow.float64 , oneflow.float32 , oneflow.float16, and oneflow.bfloat16.

is_nonzero

Returns True if the input is a single element tensor which is not equal to zero after type conversions.

numel

Returns the total number of elements in the input tensor.

set_printoptions

Set options for printing.

get_default_dtype

Returns the default floating point dtype.

set_default_dtype

Sets the default floating point type for those source operators which create Tensor.

set_default_tensor_type

Sets the default floating point type for those source operators which create Tensor.

Creation Ops

Note

Random sampling creation ops are listed under Random sampling and include: oneflow.rand() oneflow.randn() oneflow.randint() oneflow.randperm()
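For instance, a minimal doctest-style sketch of these samplers (illustrative; the checks are written to be deterministic):

>>> import oneflow
>>> sorted(oneflow.randperm(5).tolist())
[0, 1, 2, 3, 4]
>>> tuple(oneflow.rand(2, 3).shape)
(2, 3)
>>> x = oneflow.randint(0, 10, (4,))
>>> bool(((x >= 0) & (x < 10)).all())
True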

tensor

Constructs a tensor from data; returns a global tensor if placement and sbp are given in kwargs.

as_tensor

Converts data into a tensor, sharing data and preserving autograd history if possible.

as_strided

Create a view of an existing oneflow.Tensor input with specified size, stride and storage_offset.

from_numpy

Creates a Tensor from a numpy.ndarray.
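A short illustrative sketch, assuming the returned tensor shares memory with the source ndarray as in PyTorch:

>>> import numpy as np
>>> import oneflow
>>> a = np.array([1, 2, 3])
>>> t = oneflow.from_numpy(a)
>>> a[0] = 9  # mutate the ndarray; the tensor sees the change
>>> t[0].item()
9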

zeros

Returns a tensor filled with the scalar value 0, with the shape defined by the variable argument size.

zeros_like

The interface is consistent with PyTorch.

ones

Returns a tensor filled with the scalar value 1, with the shape defined by the variable argument size.

ones_like

The interface is consistent with PyTorch.

randn_like

Returns a tensor with the same size as input that is filled with random numbers from a normal distribution with mean 0 and variance 1.

randint_like

Returns a tensor filled with random integers generated uniformly between low (inclusive) and high (exclusive).

masked_fill

Fills elements of self tensor with value where mask is True.

new_ones

The interface is consistent with PyTorch.

arange

Returns a 1-D tensor of size \(\left\lceil \frac{\text{end} - \text{start}}{\text{step}} \right\rceil\) with values from the interval [start, end), taken with common difference step beginning from start.

linspace

Creates a one-dimensional tensor of size steps whose values are evenly spaced from start to end, inclusive.
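For example (illustrative doctest; linspace includes both endpoints, while arange excludes end):

>>> import oneflow
>>> oneflow.arange(0, 5).tolist()
[0, 1, 2, 3, 4]
>>> oneflow.linspace(0, 1, 5).tolist()
[0.0, 0.25, 0.5, 0.75, 1.0]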

eye

This operator creates a 2-D Tensor with ones on the diagonal and zeros elsewhere.

empty

The interface is consistent with PyTorch.

empty_like

The interface is consistent with PyTorch.

full

Creates a tensor of size size filled with fill_value.

full_like

Returns a tensor with the same size as input filled with fill_value.

tensor_scatter_nd_update

This operation creates a new tensor by applying sparse updates to the input tensor.

logspace

This function is equivalent to PyTorch’s logspace function.

Indexing, Slicing, Joining, Mutating Ops

argwhere

This operator finds the indices of input Tensor input elements that are non-zero.

atleast_1d

Returns a 1-dimensional view of each input tensor with zero dimensions.

atleast_2d

Returns a 2-dimensional view of each input tensor with zero dimensions.

atleast_3d

Returns a 3-dimensional view of each input tensor with zero dimensions.

cat

Concatenates two or more Tensors along the specified dim.
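A small illustrative example:

>>> import oneflow
>>> x = oneflow.ones(2, 3)
>>> tuple(oneflow.cat([x, x], dim=0).shape)
(4, 3)
>>> tuple(oneflow.cat([x, x], dim=1).shape)
(2, 6)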

column_stack

Creates a new tensor by horizontally stacking the tensors in tensors.

concat

cat(tensors, dim=0) -> Tensor

chunk

Splits a tensor into a specific number of chunks.

dstack

Stacks tensors in tensors depthwise (along the third axis).

expand

This operator expands the input tensor to a larger size.

gather

Gathers values along an axis specified by dim.
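For instance, with dim=1 the result satisfies out[i][j] = input[i][index[i][j]] (illustrative sketch):

>>> import oneflow
>>> t = oneflow.tensor([[1, 2], [3, 4]])
>>> index = oneflow.tensor([[0, 0], [1, 0]])
>>> oneflow.gather(t, 1, index).tolist()
[[1, 1], [4, 3]]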

gather_nd

This operator is a high-dimensional extension of gather; index is a K-dimensional tensor, which is regarded as an index into the input Tensor input.

batch_gather

Gathers elements along the batch dimensions.

hsplit

The documentation is referenced from: https://pytorch.org/docs/1.10/generated/torch.hsplit.html.

hstack

Stacks tensors in tensors horizontally (column-wise).

vsplit

Splits input, a tensor with two or more dimensions, into multiple tensors vertically according to indices_or_sections.

vstack

Stacks tensors in tensors vertically (row-wise).

index_select

Selects values along an axis specified by dim.

index_add

See oneflow.Tensor.index_add_() for function description.

masked_select

Returns a new 1-D tensor which indexes the input tensor according to the boolean mask mask, which is a BoolTensor (in OneFlow, BoolTensor is replaced by Int8Tensor).

movedim

Moves the dimension(s) of input at the position(s) in source to the position(s) in destination.

narrow

Returns a new tensor that is a narrowed version of input tensor.

nonzero

Returns a tensor containing the indices of all non-zero elements of input.

permute

Returns a view of the original tensor with its dimensions permuted.

repeat

This operator repeats the input tensor to a larger size along the specified dimensions.

reshape

This operator reshapes a Tensor.

row_stack

Alias of oneflow.vstack().

select

Slices the self tensor along the selected dimension at the given index.

scatter

This operator writes the elements specified by index along with the axis dim from the src into the input.

scatter_add

This operator scatters src into input along dim according to index, accumulating with addition.

scatter_nd

This operator inserts the elements in update according to the index and creates a new Tensor.

slice

Extracts a slice from a tensor.

slice_update

Update a slice of tensor x.

split

Splits the tensor into chunks.

squeeze

This operator removes the specified dimension of size 1 from the input Tensor.

stack

Concatenates a sequence of tensors along a new dimension.

swapaxes

This function is equivalent to NumPy’s swapaxes function.

swapdims

This function is equivalent to torch’s swapdims function.

t

oneflow.t(input) → Tensor.

tile

Constructs a tensor by repeating the elements of input.

transpose

Returns a tensor that is a transposed version of input.

unbind

Removes a tensor dimension.

unsqueeze

Returns a new tensor with a dimension of size one inserted at the specified position.

where

Return a tensor of elements selected from either x or y, depending on condition.
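A small illustrative example:

>>> import oneflow
>>> cond = oneflow.tensor([True, False, True])
>>> oneflow.where(cond, oneflow.tensor([1, 2, 3]), oneflow.tensor([-1, -2, -3])).tolist()
[1, -2, 3]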

tensor_split

Splits a tensor into multiple sub-tensors, all of which are views of input, along dimension dim according to the indices or number of sections specified by indices_or_sections .

Random sampling

seed

Sets the seed for generating random numbers to a non-deterministic random number.

manual_seed

Sets the seed for generating random numbers.

initial_seed

Returns the initial seed for generating random numbers as a Python long.

get_rng_state

Returns the random number generator state as a oneflow.ByteTensor.

set_rng_state

Sets the random number generator state.

bernoulli

This operator returns a Tensor with binary random numbers (0 / 1) from a Bernoulli distribution.

normal

Returns a tensor of random numbers drawn from separate normal distributions whose mean and standard deviation are given.

rand

Returns a tensor filled with random numbers from a uniform distribution on the interval [0, 1)

randint

Returns a tensor filled with random integers generated uniformly between low (inclusive) and high (exclusive).

randn

Returns a tensor filled with random numbers from a normal distribution with mean 0 and variance 1 (also called the standard normal distribution).

randperm

Returns a random permutation of integers from 0 to n - 1.

multinomial

Returns a tensor where each row contains num_samples indices sampled from the multinomial probability distribution located in the corresponding row of tensor input.

In-place random sampling

There are a few more in-place random sampling functions defined on Tensors as well. Click through to refer to their documentation:

  • oneflow.Tensor.normal_() - in-place version of oneflow.normal()

  • oneflow.Tensor.uniform_() - numbers sampled from the continuous uniform distribution

Serialization

save

Saves an object to a directory.

load

Loads an object saved with oneflow.save() from a directory.
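A hypothetical round-trip sketch, assuming save and load take a directory path as described above (the temporary path is illustrative):

>>> import os, tempfile
>>> import oneflow
>>> t = oneflow.ones(2)
>>> path = os.path.join(tempfile.mkdtemp(), "t")  # illustrative save directory
>>> oneflow.save(t, path)
>>> oneflow.load(path).tolist()
[1.0, 1.0]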

Parallelism

set_num_threads

Sets the number of threads used for intraop parallelism on CPU.

Locally disabling gradient computation

The context managers oneflow.no_grad(), oneflow.enable_grad(), and oneflow.set_grad_enabled() are helpful for locally disabling and enabling gradient computation. These context managers are thread local, so they won’t work if you send work to another thread using the threading module, etc.

Examples:

>>> import oneflow
>>> x = oneflow.zeros(1, requires_grad=True)
>>> with oneflow.no_grad():
...     y = x * 2
>>> y.requires_grad
False

>>> with oneflow.set_grad_enabled(False):
...     y = x * 2
>>> y.requires_grad
False

>>> with oneflow.set_grad_enabled(True):
...     y = x * 2
>>> y.requires_grad
True

no_grad

Context-manager that disables gradient calculation.

set_grad_enabled

Context-manager that enables or disables gradient calculation.

enable_grad

Context-manager that enables gradient calculation.

is_grad_enabled

Returns True if grad mode is currently enabled.

inference_mode

Context-manager that enables or disables inference mode.

Math operations

Pointwise Ops

abs

Returns the absolute value of each element in the input tensor, \(y = |x|\), element-wise.

acos

Returns a new tensor with the inverse cosine of the elements of input.

acosh

Returns a new tensor with the inverse hyperbolic cosine of the elements of input.

arccos

Returns a new tensor with the inverse cosine of the elements of input.

arccosh

Returns a new tensor with the inverse hyperbolic cosine of the elements of input.

add

Adds other, scaled by alpha, to input.

addcdiv

This function is equivalent to PyTorch’s addcdiv function.

addcmul

Performs the element-wise multiplication of tensor1 by tensor2, multiplies the result by the scalar value, and adds it to input.

asin

Returns a new tensor with the arcsine of the elements of input.

asinh

Returns a new tensor with the inverse hyperbolic sine of the elements of input.

arcsin

Returns a new tensor with the arcsine of the elements of input.

arcsinh

Returns a new tensor with the inverse hyperbolic sine of the elements of input.

atan

Returns a new tensor with the arctangent of the elements of input.

atanh

Returns a new tensor with the inverse hyperbolic tangent of the elements of input.

arctan

Returns a new tensor with the arctangent of the elements of input.

arctanh

Returns a new tensor with the inverse hyperbolic tangent of the elements of input.

atan2

Element-wise arctangent of \(\text{input}_i / \text{other}_i\) with consideration of the quadrant.

ceil

Returns a new tensor with the ceil of the elements of input, the smallest integer greater than or equal to each element.

ceil_

In-place version of oneflow.ceil()

clamp

Clamp all elements in input into the range [ min, max ] and return a resulting tensor:

clamp_min

Clamp all elements in input which are less than min to min and return a resulting tensor:

clamp_max

Clamp all elements in input which are greater than max to max and return a resulting tensor:

clip

Alias for oneflow.clamp().
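For example:

>>> import oneflow
>>> x = oneflow.tensor([-2.0, 0.5, 3.0])
>>> oneflow.clamp(x, min=-1.0, max=1.0).tolist()
[-1.0, 0.5, 1.0]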

cos

Returns a new tensor with the cosine of the elements of input.

cosh

Returns a new tensor with the hyperbolic cosine of the elements of input.

div

Computes the division of input by other for each element; scalar and broadcast promotion are supported.

erf

Computes the error function of each element.

erfc

Computes the complementary error function of each element of input.

erfinv

Computes the inverse error function of input.

exp

This operator computes the exponential of Tensor.

expm1

Returns a new tensor with the exponential of the elements minus 1 of input.

floor

Returns a new tensor with the floor of the elements of input, the largest integer less than or equal to each element.

floor_

In-place version of oneflow.floor()

frac

frac(input) → Tensor

frac_

In-place version of oneflow.frac().

fmod

Computes the element-wise remainder of division.

gelu

Applies the Gaussian Error Linear Units function:

quick_gelu

Applies GELU approximation that is fast but somewhat inaccurate.

square_relu

Applies the relu^2 activation introduced in https://arxiv.org/abs/2109.08668v2

log

Returns a new tensor with the natural logarithm of the elements of input.

log1p

Returns a new tensor with the natural logarithm of (1 + input).

log2

Returns a new tensor with the logarithm to the base 2 of the elements of input.

log10

Returns a new tensor with the logarithm to the base 10 of the elements of input.

logical_and

Computes the element-wise logical AND of the given input tensors.

logical_not

Computes the element-wise logical NOT of the given input tensors.

logical_or

Computes the element-wise logical OR of the given input tensors.

logical_xor

Computes the element-wise logical XOR of the given input tensors.

bitwise_and

Computes the bitwise AND of input and other.

bitwise_or

Computes the bitwise OR of input and other.

bitwise_xor

Computes the bitwise XOR of input and other.

bitwise_not

Computes the bitwise NOT of input.
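A small illustrative example (12 & 10 = 8 and 12 ^ 10 = 6):

>>> import oneflow
>>> a, b = oneflow.tensor([12]), oneflow.tensor([10])
>>> oneflow.bitwise_and(a, b).tolist()
[8]
>>> oneflow.bitwise_xor(a, b).tolist()
[6]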

mish

Applies the element-wise function:

mul

Computes the multiplication of input by other for each element; scalar and broadcast promotion are supported.

neg

This operator computes the negative value of Tensor.

negative

This operator computes the negative value of Tensor.

pow

Takes the power of each element in input with exponent and returns a tensor with the result.

reciprocal

Computes the safe reciprocal of x.

round

This operator rounds the elements of input to the nearest integer.

round_

In-place version of oneflow.round().

rsqrt

Returns a new tensor with the reciprocal of the square-root of each of the elements of input.

selu

Applies element-wise function

softmax

Softmax is defined as:

softplus

Applies the element-wise function:

softsign

The formula is:

silu

The formula is:

sigmoid

Applies the element-wise function \(\text{Sigmoid}(x) = \frac{1}{1 + \exp(-x)}\)

sign

Computes the sign of Tensor.

sin

Returns a new tensor with the sine of the elements of input.

sinh

Returns a new tensor with the hyperbolic sine of the elements of input.

sin_

In-place version of oneflow.sin()

sqrt

Returns a new tensor with the square-root of the elements of input.

square

Returns a new tensor with the square of the elements of input.

sub

Computes the subtraction of input by other for each element; scalar and broadcast promotion are supported.

tan

Returns the tangent of the elements of input.

tanh

The equation is:

trunc

The interface is consistent with PyTorch.

floor_divide

Computes the element-wise division of input by other and floors the result.

lerp

The documentation is referenced from: https://pytorch.org/docs/stable/generated/torch.lerp.html.

lerp_

In-place version of oneflow.lerp()

quantile

The documentation is referenced from: https://pytorch.org/docs/stable/generated/torch.quantile.html.

Reduction Ops

argmax

The op computes the index with the largest value of a Tensor at specified axis.

argmin

The op computes the index with the smallest value of a Tensor at specified axis.

amax

Returns the maximum along a dimension.

amin

Returns the minimum value of each slice of the input tensor in the given dimension(s) dim.

any

For each row of input in the given dimension dim, returns True if any element in the row evaluates to True and False otherwise.

max

Computes the maximum value of all elements in the input tensor.

min

Computes the minimum value of all elements in the input tensor.

mean

Computes the mean of each row of the input tensor in the given dimension.

median

Returns the median of the values in input.

mode

Returns a namedtuple (values, indices) where values is the mode value of each row of the input tensor in the given dimension dim, i.e. a value which appears most often in that row, and indices is the index location of each mode value found.

prod

Computes the product of each row of the input tensor in the given dimension.

nansum

Returns the sum of each row of the input tensor in the given dimension dim, treating Not a Numbers (NaNs) as zero.

std

Returns the standard-deviation of each row of the input tensor in the dimension dim.

sum

Computes the sum of each row of the input tensor in the given dimension.

logsumexp

Returns the log of summed exponentials of each row of the input tensor in the given dimension dim.

var

Returns the variance of each row of the input tensor in the given dimension dim.

norm

Returns the matrix norm or vector norm of a given tensor.

all

For each row of input in the given dimension dim, returns True if all elements in the row evaluate to True and False otherwise.
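A few illustrative reductions (doctest-style sketch):

>>> import oneflow
>>> x = oneflow.tensor([[1.0, 2.0], [3.0, 4.0]])
>>> x.sum().item()
10.0
>>> x.mean(dim=1).tolist()
[1.5, 3.5]
>>> x.amax(dim=0).tolist()
[3.0, 4.0]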

Comparison Ops

argsort

This operator sorts the input Tensor at specified dim and returns the indices of the sorted Tensor.

eq

Computes element-wise equality.

equal

True if two tensors have the same size and elements, False otherwise.

gt

Returns the truth value of \(input > other\) element-wise.

isinf

This function is equivalent to PyTorch’s isinf function.

isnan

This function is equivalent to PyTorch’s isnan function.

le

Returns the truth value of \(input <= other\) element-wise.

lt

Returns the truth value of \(input < other\) element-wise.

ne

Computes element-wise inequality.

sort

Sorts the elements of the input tensor along a given dimension in ascending order by value.

topk

Finds the values and indices of the k largest entries at specified axis.
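An illustrative sketch, assuming the PyTorch-style (values, indices) return value:

>>> import oneflow
>>> values, indices = oneflow.topk(oneflow.tensor([1.0, 3.0, 2.0]), k=2)
>>> values.tolist()
[3.0, 2.0]
>>> indices.tolist()
[1, 2]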

ge

Returns the truth value of \(input >= other\) element-wise.

greater

Returns the truth value of \(input > other\) element-wise.

greater_equal

Returns the truth value of \(input >= other\) element-wise.

maximum

Computes the element-wise maximum of x and y.

minimum

Computes the element-wise minimum of x and y.

not_equal

ne(input, other) -> Tensor

isclose

The documentation is referenced from: https://pytorch.org/docs/1.10/generated/torch.isclose.html

allclose

The documentation is referenced from: https://pytorch.org/docs/1.10/generated/torch.allclose.html

Spectral Ops

hann_window

This function is equivalent to PyTorch’s hann_window function.

Other Ops

adaptive_avg_pool1d

Applies a 1D adaptive average pooling over an input signal composed of several input planes.

adaptive_avg_pool2d

Applies a 2D adaptive average pooling over an input signal composed of several input planes.

adaptive_avg_pool3d

Applies a 3D adaptive average pooling over an input signal composed of several input planes.

broadcast_like

This operator broadcasts tensor x to like_tensor according to broadcast_axes.

cast

The operation takes the input tensor x and casts it to the output with the given dtype.

cumprod

This operator computes the cumulative product of input elements in the given dimension.

cumsum

This operator computes the cumulative sum of input elements in the given dimension.

diag

If input is a vector (1-D tensor), then returns a 2-D square tensor with the elements of input as the diagonal.

diagonal

Returns a partial view of input with its diagonal elements with respect to dim1 and dim2 appended as a dimension at the end of the shape.

einsum

Sums the product of the elements of the input operands along dimensions specified using a notation based on the Einstein summation convention.
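For example, a plain matrix multiplication written in einsum notation:

>>> import oneflow
>>> a = oneflow.ones(2, 3)
>>> b = oneflow.ones(3, 4)
>>> tuple(oneflow.einsum("ij,jk->ik", a, b).shape)
(2, 4)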

flatten

Flattens a contiguous range of dims into a tensor.

flip

Reverses the order of an n-D tensor along the given axes in dims.

in_top_k

Says whether the targets are in the top K predictions.

meshgrid

Take \(N\) tensors, each of which can be either scalar or 1-dimensional vector, and create \(N\) N-dimensional grids, where the \(i\) th grid is defined by expanding the \(i\) th input over dimensions defined by other inputs.

nms

Performs non-maximum suppression (NMS) on the boxes according to their intersection-over-union (IoU).

roc_auc_score

roll

Roll the tensor along the given dimension(s).

searchsorted

Find the indices from the innermost dimension of sorted_sequence such that, if the corresponding values in values were inserted before the indices, the order of the corresponding innermost dimension within sorted_sequence would be preserved.

tensordot

Computes the tensor dot product along the given dimensions.

tril

Returns the lower triangular part of a matrix (2-D tensor) or batch of matrices input along the specified diagonal; the other elements of the result tensor out are set to 0.

repeat_interleave

Repeat elements of a tensor.

triu

Returns the upper triangular part of a matrix (2-D tensor) or batch of matrices input; the other elements of the result tensor out are set to 0.

cross

Returns the cross product of vectors in dimension dim of input and other.

bincount

oneflow.bincount(input, weights=None, minlength=0) → Tensor

broadcast_shapes

The interface is consistent with PyTorch.

broadcast_tensors

The interface is consistent with PyTorch.

broadcast_to

The interface is consistent with PyTorch.

unique

Returns the unique elements of the input tensor.

BLAS and LAPACK Operations

addmm

Performs a matrix multiplication of the matrices mat1 and mat2.

bmm

Performs a batch matrix-matrix product of matrices stored in input and mat2.

baddbmm

The documentation is referenced from: https://pytorch.org/docs/1.10/generated/torch.baddbmm.html.

dot

This operator computes the dot product of tensor input and other.

matmul

This operator applies matrix multiplication to two Tensors.

mm

Performs a matrix multiplication of the matrices input and mat2.

mv

Performs a matrix-vector product of the matrix input and the vector vec.
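A small illustrative example of matmul and mv:

>>> import oneflow
>>> m = oneflow.ones(2, 3)
>>> oneflow.mv(m, oneflow.ones(3)).tolist()
[3.0, 3.0]
>>> tuple(oneflow.matmul(m, oneflow.ones(3, 4)).shape)
(2, 4)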

oneflow.nn

These are the basic building blocks for graphs:

Containers

Module

Base class for all neural network modules.

Sequential

A sequential container.

ModuleList

Holds submodules in a list.

ModuleDict

Holds submodules in a dictionary.

ParameterList

Holds parameters in a list.

ParameterDict

Holds parameters in a dictionary.
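Together, Module and Sequential support the usual subclassing pattern; a minimal sketch (the TinyNet name and layer sizes are illustrative):

>>> import oneflow as flow
>>> import oneflow.nn as nn
>>> class TinyNet(nn.Module):
...     def __init__(self):
...         super().__init__()
...         self.body = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2))
...     def forward(self, x):
...         return self.body(x)
>>> net = TinyNet()
>>> tuple(net(flow.randn(3, 4)).shape)
(3, 2)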

nn.Module

add_module

Adds a child module to the current module.

apply

Applies fn recursively to every submodule (as returned by .children()) as well as self.

buffers

Returns an iterator over module buffers.

children

Returns an iterator over immediate children modules.

cpu

Moves all model parameters and buffers to the CPU.

cuda

Moves all model parameters and buffers to the GPU.

double

Casts all floating point parameters and buffers to double datatype.

train

Sets the module in training mode.

eval

Sets the module in evaluation mode.

extra_repr

Sets the extra representation of the module.

float

Casts all floating point parameters and buffers to float datatype.

forward

Defines the computation performed at every call.

load_state_dict

Copies parameters and buffers from state_dict into this module and its descendants.

modules

Returns an iterator over all modules in the network.

named_buffers

Returns an iterator over module buffers, yielding both the name of the buffer as well as the buffer itself.

named_children

Returns an iterator over immediate children modules, yielding both the name of the module as well as the module itself.

named_modules

Returns an iterator over all modules in the network, yielding both the name of the module as well as the module itself.

named_parameters

Returns an iterator over module parameters, yielding both the name of the parameter as well as the parameter itself.

parameters

Returns an iterator over module parameters.

register_buffer

Adds a buffer to the module.

register_forward_hook

Registers a forward hook on the module.

register_forward_pre_hook

Registers a forward pre-hook on the module.

register_backward_hook

Registers a backward hook on the module.

register_full_backward_hook

Registers a backward hook on the module.

register_state_dict_pre_hook

These hooks will be called with arguments: self, prefix, and keep_vars before calling state_dict on self.

register_parameter

Adds a parameter to the module.

requires_grad_

Change if autograd should record operations on parameters in this module.

state_dict

Returns a dictionary containing a whole state of the module.

to

Moves and/or casts the parameters and buffers.

zero_grad

Sets gradients of all model parameters to zero.
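A small illustrative example of inspecting a module's state:

>>> import oneflow.nn as nn
>>> m = nn.Linear(2, 3)
>>> sorted(m.state_dict().keys())
['bias', 'weight']
>>> sorted(name for name, _ in m.named_parameters())
['bias', 'weight']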

Convolution Layers

nn.Conv1d

Applies a 1D convolution over an input signal composed of several input planes.

nn.Conv2d

Applies a 2D convolution over an input signal composed of several input planes.

nn.Conv3d

Applies a 3D convolution over an input signal composed of several input planes.

nn.ConvTranspose1d

Applies a 1D transposed convolution operator over an input image composed of several input planes.

nn.ConvTranspose2d

Applies a 2D transposed convolution operator over an input image composed of several input planes.

nn.ConvTranspose3d

Applies a 3D transposed convolution operator over an input image composed of several input planes.

nn.Unfold

The documentation is referenced from: https://pytorch.org/docs/1.10/generated/torch.nn.Unfold.html.

nn.Fold

The documentation is referenced from: https://pytorch.org/docs/1.10/generated/torch.nn.Fold.html.
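For instance, with padding=1 a 3x3 convolution preserves the spatial size (illustrative sketch):

>>> import oneflow as flow
>>> import oneflow.nn as nn
>>> conv = nn.Conv2d(3, 16, kernel_size=3, padding=1)
>>> tuple(conv(flow.randn(1, 3, 32, 32)).shape)
(1, 16, 32, 32)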

Pooling Layers

nn.MaxPool1d

Applies a 1D max pooling over an input signal composed of several input planes.

nn.MaxPool2d

Applies a 2D max pooling over an input signal composed of several input planes.

nn.MaxPool3d

Applies a 3D max pooling over an input signal composed of several input planes.

nn.MaxUnpool1d

Computes a partial inverse of MaxPool1d.

nn.MaxUnpool2d

Computes a partial inverse of MaxPool2d.

nn.MaxUnpool3d

Computes a partial inverse of MaxPool3d.

nn.AdaptiveAvgPool1d

Applies a 1D adaptive average pooling over an input signal composed of several input planes.

nn.AdaptiveAvgPool2d

Applies a 2D adaptive average pooling over an input signal composed of several input planes.

nn.AdaptiveAvgPool3d

Applies a 3D adaptive average pooling over an input signal composed of several input planes.

nn.AdaptiveMaxPool1d

Applies a 1D adaptive max pooling over an input signal composed of several input planes.

nn.AdaptiveMaxPool2d

Applies a 2D adaptive max pooling over an input signal composed of several input planes.

nn.AdaptiveMaxPool3d

Applies a 3D adaptive max pooling over an input signal composed of several input planes.

nn.AvgPool1d

Applies a 1D average pooling over an input signal composed of several input planes.

nn.AvgPool2d

Applies a 2D average pooling over an input signal composed of several input planes.

nn.AvgPool3d

Applies a 3D average pooling over an input signal composed of several input planes.
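For example, a 2x2 max pooling (stride defaults to the kernel size) halves each spatial dimension:

>>> import oneflow as flow
>>> import oneflow.nn as nn
>>> pool = nn.MaxPool2d(kernel_size=2)
>>> tuple(pool(flow.randn(1, 1, 4, 4)).shape)
(1, 1, 2, 2)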

Padding Layers

nn.ConstantPad1d

Pads the input tensor boundaries with a constant value.

nn.ConstantPad2d

This operator pads the input with a constant value that the user specifies.

nn.ConstantPad3d

Pads the input tensor boundaries with a constant value.

nn.ReflectionPad1d

This operator pads the input tensor using the reflection of the input boundary.

nn.ReflectionPad2d

This operator pads the input tensor using the reflection of the input boundary.

nn.ReplicationPad1d

Pads the input tensor using replication of the input boundary.

nn.ReplicationPad2d

Pads the input tensor using the replication of the input boundary.

nn.ZeroPad2d

Pads the input tensor boundaries with zero.

Non-linear Activations (weighted sum, nonlinearity)

nn.ELU

Applies the element-wise function

nn.Hardshrink

The Hardshrink activation.

nn.Hardsigmoid

Applies the element-wise function:

nn.Hardswish

Applies the hardswish function, element-wise, as described in the paper Searching for MobileNetV3.

nn.Hardtanh

Applies the HardTanh function element-wise

nn.LeakyReLU

Applies the element-wise function:

nn.LogSigmoid

Applies the element-wise function:

nn.PReLU

Applies the element-wise function:

nn.ReLU

Applies the rectified linear unit function element-wise:

nn.ReLU6

Applies the element-wise function:

nn.SELU

Applies the element-wise function:

nn.CELU

Applies the element-wise function:

nn.GELU

The documentation is referenced from: https://pytorch.org/docs/1.10/generated/torch.nn.GELU.html.

nn.QuickGELU

Applies GELU approximation that is fast but somewhat inaccurate.

nn.SquareReLU

Applies the relu^2 activation introduced in https://arxiv.org/abs/2109.08668v2

nn.SiLU

SiLU(Swish) activation:

nn.Sigmoid

Applies the element-wise function:

nn.Mish

Applies the element-wise function:

nn.Softplus

Applies the element-wise function:

nn.Softshrink

The Softshrink activation.

nn.Softsign

The SoftSign activation.

nn.Tanh

This operator computes the hyperbolic tangent value of Tensor.

nn.Threshold

The Threshold Activation.

nn.GLU

The GLU activation.

Non-linear Activations (other)

nn.Softmax

Applies the Softmax function to an n-dimensional input Tensor rescaling them so that the elements of the n-dimensional output Tensor lie in the range [0,1] and sum to 1.

nn.LogSoftmax

Applies the LogSoftmax function to an n-dimensional input Tensor.
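A small illustrative check that Softmax outputs sum to 1 along the chosen dim:

>>> import oneflow as flow
>>> import oneflow.nn as nn
>>> probs = nn.Softmax(dim=1)(flow.randn(2, 5))
>>> [round(s, 4) for s in probs.sum(dim=1).tolist()]
[1.0, 1.0]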

Normalization Layers

nn.BatchNorm1d

Applies Batch Normalization over a 2D or 3D input (a mini-batch of 1D inputs with optional additional channel dimension) as described in the paper Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift .

nn.BatchNorm2d

Applies Batch Normalization over a 4D input (a mini-batch of 2D inputs with additional channel dimension) as described in the paper Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift .

nn.BatchNorm3d

Applies Batch Normalization over a 5D input (a mini-batch of 3D inputs with additional channel dimension) as described in the paper Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift .

nn.SyncBatchNorm

Applies Batch Normalization over a N-Dimensional input (a mini-batch of [N-2]D inputs with additional channel dimension) as described in the paper Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift .

nn.FusedBatchNorm1d

Applies Fused Batch Normalization over a 2D or 3D input, the formula is:

nn.FusedBatchNorm2d

Applies Fused Batch Normalization over a 4D input, the formula is:

nn.FusedBatchNorm3d

Applies Fused Batch Normalization over a 5D input, the formula is:

nn.GroupNorm

Applies Group Normalization over a mini-batch of inputs as described in the paper Group Normalization

nn.InstanceNorm1d

Applies Instance Normalization over a 3D input (a mini-batch of 1D inputs with optional additional channel dimension) as described in the paper Instance Normalization: The Missing Ingredient for Fast Stylization.

nn.InstanceNorm2d

Applies Instance Normalization over a 4D input (a mini-batch of 2D inputs with additional channel dimension) as described in the paper Instance Normalization: The Missing Ingredient for Fast Stylization.

nn.InstanceNorm3d

Applies Instance Normalization over a 5D input (a mini-batch of 3D inputs with additional channel dimension) as described in the paper Instance Normalization: The Missing Ingredient for Fast Stylization.

nn.LayerNorm

Applies Layer Normalization over a mini-batch of inputs as described in the paper Layer Normalization

nn.RMSLayerNorm

Construct a layernorm module in the T5 style.

nn.RMSNorm

Applies Root Mean Square Layer Normalization over a mini-batch of inputs as described in the paper Root Mean Square Layer Normalization
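An illustrative sketch: LayerNorm normalizes over the trailing dimensions given by normalized_shape and preserves the input shape:

>>> import oneflow as flow
>>> import oneflow.nn as nn
>>> ln = nn.LayerNorm(8)
>>> tuple(ln(flow.randn(4, 8)).shape)
(4, 8)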

Recurrent Layers

nn.RNN

Applies a multi-layer Elman RNN with \(\tanh\) or \(\text{ReLU}\) non-linearity to an input sequence.

nn.LSTM

Applies a multi-layer long short-term memory (LSTM) RNN to an input sequence.

nn.GRU

Applies a multi-layer gated recurrent unit (GRU) RNN to an input sequence.

nn.RNNCell

An Elman RNN cell with tanh or ReLU non-linearity.

nn.LSTMCell

A long short-term memory (LSTM) cell.

nn.GRUCell

A gated recurrent unit (GRU) cell.

Linear Layers

nn.Identity

A placeholder identity operator that is argument-insensitive.

nn.Linear

Applies a linear transformation to the incoming data: \(y = xA^T + b\)

Dropout Layers

nn.Dropout

During training, randomly zeroes some of the elements of the input tensor with probability p using samples from a Bernoulli distribution.

nn.Dropout1d

Randomly zero out entire channels (a channel is a 1D feature map, e.g., the \(j\)-th channel of the \(i\)-th sample in the batched input is a 1D tensor \(\text{input}[i, j]\)).

nn.Dropout2d

Randomly zero out entire channels (a channel is a 2D feature map, e.g., the \(j\)-th channel of the \(i\)-th sample in the batched input is a 2D tensor \(\text{input}[i, j]\)).

nn.Dropout3d

Randomly zero out entire channels (a channel is a 3D feature map, e.g., the \(j\)-th channel of the \(i\)-th sample in the batched input is a 3D tensor \(\text{input}[i, j]\)).

Sparse Layers

nn.Embedding

A simple lookup table that stores embeddings of a fixed dictionary and size.

Distance Functions

nn.CosineSimilarity

Returns cosine similarity between \(x_1\) and \(x_2\), computed along dim.

nn.PairwiseDistance

Computes the pairwise distance between vectors \(v_1\), \(v_2\) using the p-norm:

Loss Functions

nn.BCELoss

This operator computes the binary cross entropy loss.

nn.BCEWithLogitsLoss

This operator combines the Sigmoid and BCELoss together.

nn.CTCLoss

The Connectionist Temporal Classification loss.

nn.CombinedMarginLoss

The operation implements “margin_softmax” in InsightFace: https://github.com/deepinsight/insightface/blob/master/recognition/arcface_mxnet/train.py The implementation of margin_softmax in InsightFace is composed of multiple operators.

nn.CrossEntropyLoss

The documentation is referenced from: https://pytorch.org/docs/1.10/generated/torch.nn.CrossEntropyLoss.html.

nn.KLDivLoss

The Kullback-Leibler divergence loss measure

nn.L1Loss

This operator computes the L1 Loss between each element in input and target.

nn.MSELoss

Creates a criterion that measures the mean squared error (squared L2 norm) between each element in the input \(x\) and target \(y\).

nn.MarginRankingLoss

Creates a criterion that measures the loss given inputs \(x1\), \(x2\), two 1D mini-batch Tensors, and a label 1D mini-batch tensor \(y\) (containing 1 or -1).

nn.NLLLoss

The negative log likelihood loss.

nn.SmoothL1Loss

Creates a criterion that uses a squared term if the absolute element-wise error falls below beta and an L1 term otherwise.

nn.TripletMarginLoss

Creates a criterion that measures the triplet loss given an input tensors \(x1\), \(x2\), \(x3\) and a margin with a value greater than \(0\).
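A minimal training-style sketch with CrossEntropyLoss (shapes and class labels are illustrative):

>>> import oneflow as flow
>>> import oneflow.nn as nn
>>> logits = flow.randn(3, 5, requires_grad=True)
>>> loss = nn.CrossEntropyLoss()(logits, flow.tensor([0, 2, 4]))
>>> loss.backward()
>>> tuple(logits.grad.shape)
(3, 5)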

Vision Layers

nn.PixelShuffle

alias of oneflow.nn.modules.pixelshuffle.PixelShufflev2

nn.Upsample

Upsamples a given multi-channel 1D (temporal), 2D (spatial) or 3D (volumetric) data.

nn.UpsamplingBilinear2d

Applies a 2D bilinear upsampling to an input signal composed of several input channels.

nn.UpsamplingNearest2d

Applies a 2D nearest neighbor upsampling to an input signal composed of several input channels.

Data loading and preprocessing Layers

nn.COCOReader

nn.CoinFlip

Generates random boolean values following a Bernoulli distribution.

nn.CropMirrorNormalize

Performs fused cropping, normalization, format conversion (NHWC to NCHW) if desired, and type casting.

nn.OFRecordBytesDecoder

This operator reads a tensor as bytes.

nn.OFRecordImageDecoder

nn.OFRecordImageDecoderRandomCrop

nn.OFRecordRawDecoder

nn.OFRecordReader

Quantization Aware Training

nn.MinMaxObserver

Compute the quantization parameters of the input tensor.

nn.MovingAverageMinMaxObserver

Compute the quantization parameters based on the moving average of the input tensor’s min and max values.

nn.FakeQuantization

Simulate the quantize and dequantize operations in training time.

nn.QatConv1d

A Conv1d module attached with nn.MinMaxObserver, nn.MovingAverageMinMaxObserver and nn.FakeQuantization modules for weight and input, used for quantization aware training.

nn.QatConv2d

A Conv2d module attached with nn.MinMaxObserver, nn.MovingAverageMinMaxObserver and nn.FakeQuantization modules for weight and input, used for quantization aware training.

nn.QatConv3d

A Conv3d module attached with nn.MinMaxObserver, nn.MovingAverageMinMaxObserver and nn.FakeQuantization modules for weight and input, used for quantization aware training.

Utilities

From the oneflow.nn.utils module

clip_grad_norm_

Clips gradient norm of an iterable of parameters.

clip_grad_value_

Clips gradient of an iterable of parameters at specified value.

weight_norm

Applies weight normalization to a parameter in the given module.

remove_weight_norm

Removes the weight normalization reparameterization from a module.

Utility functions in other modules

nn.utils.rnn.PackedSequence

The interface is consistent with PyTorch.

nn.utils.rnn.pack_padded_sequence

The interface is consistent with PyTorch.

nn.utils.rnn.pad_packed_sequence

The interface is consistent with PyTorch.

nn.utils.rnn.pad_sequence

The interface is consistent with PyTorch.

nn.utils.rnn.pack_sequence

Packs a list of variable-length Tensors.

nn.Flatten

Flattens a contiguous range of dims into a tensor.

Quantized Functions

Quantization refers to techniques for performing computations and storing tensors at lower bitwidths than floating point precision.

nn.FakeQuantization

Simulate the quantize and dequantize operations in training time.

nn.MinMaxObserver

Compute the quantization parameters of the input tensor.

nn.MovingAverageMinMaxObserver

Compute the quantization parameters based on the moving average of the input tensor’s min and max values.

nn.Quantization

Simulate the quantize operation in inference time.

oneflow.nn.functional

Convolution functions

conv1d

Applies a 1D convolution over an input signal composed of several input planes.

conv2d

Applies a 2D convolution over an input image composed of several input planes.

conv3d

Applies a 3D convolution over an input image composed of several input planes.

conv_transpose1d

Applies a 1D transposed convolution operator over an input signal composed of several input planes, sometimes also called “deconvolution”.

conv_transpose2d

Applies a 2D transposed convolution operator over an input image composed of several input planes, sometimes also called “deconvolution”.

conv_transpose3d

Applies a 3D transposed convolution operator over an input image composed of several input planes, sometimes also called “deconvolution”.

fold

The documentation is referenced from: https://pytorch.org/docs/1.10/generated/torch.nn.functional.fold.html.

unfold

The documentation is referenced from: https://pytorch.org/docs/1.10/generated/torch.nn.functional.unfold.html.

BatchNorm functions

batch_norm

Applies Batch Normalization for each channel across a batch of data.

Pooling functions

avg_pool1d

Applies a 1D average pooling over an input signal composed of several input planes.

avg_pool2d

Applies 2D average-pooling operation in \(kH \times kW\) regions by step size \(sH \times sW\) steps.

avg_pool3d

Applies 3D average-pooling operation in \(kT \times kH \times kW\) regions by step size \(sT \times sH \times sW\) steps.

max_pool1d

Applies a 1D max pooling over an input signal composed of several input planes.

max_pool2d

Applies a 2D max pooling over an input signal composed of several input planes.

max_pool3d

Applies a 3D max pooling over an input signal composed of several input planes.

max_unpool1d

Computes a partial inverse of MaxPool1d.

max_unpool2d

Computes a partial inverse of MaxPool2d.

max_unpool3d

Computes a partial inverse of MaxPool3d.

adaptive_avg_pool1d

Applies a 1D adaptive average pooling over an input signal composed of several input planes.

adaptive_avg_pool2d

Applies a 2D adaptive average pooling over an input signal composed of several input planes.

adaptive_avg_pool3d

Applies a 3D adaptive average pooling over an input signal composed of several input planes.

adaptive_max_pool1d

Applies a 1D adaptive max pooling over an input signal composed of several input planes.

adaptive_max_pool2d

Applies a 2D adaptive max pooling over an input signal composed of several input planes.

adaptive_max_pool3d

Applies a 3D adaptive max pooling over an input signal composed of several input planes.

Non-linear activation functions

threshold

Thresholds each element of the input Tensor.

relu

Applies the rectified linear unit function element-wise.

hardtanh

Applies the HardTanh function element-wise.

hardswish

Applies the hardswish function, element-wise, as described in the paper:

relu6

Applies the element-wise function \(\text{ReLU6}(x) = \min(\max(0,x), 6)\).

elu

Applies element-wise,

selu

Applies element-wise function

celu

Applies the element-wise function:

leaky_relu

Applies element-wise, \(\text{LeakyReLU}(x) = \max(0, x) + \text{negative\_slope} * \min(0, x)\).

square_relu

Applies the relu^2 activation introduced in https://arxiv.org/abs/2109.08668v2

prelu

Applies the element-wise function:

glu

The equation is:

gelu

Applies the Gaussian Error Linear Units function:

quick_gelu

Applies GELU approximation that is fast but somewhat inaccurate.

logsigmoid

Applies the element-wise function:

hardshrink

Applies the hard shrinkage function in an element-wise manner.

softsign

The formula is:

softplus

Applies the element-wise function:

softmax

Applies a softmax function.

softshrink

Applies the soft shrinkage function in an element-wise manner.

log_softmax

LogSoftmax is defined as:

gumbel_softmax

Solves the problem that the output values of argmax do not reflect the probability distribution of the model's output.

tanh

The equation is:

sigmoid

Applies the element-wise function \(\text{Sigmoid}(x) = \frac{1}{1 + \exp(-x)}\)

hardsigmoid

Applies the element-wise function

silu

The formula is:

mish

Applies the element-wise function:

layer_norm

Applies Layer Normalization for last certain number of dimensions.

normalize

Performs \(L_p\) normalization of inputs over specified dimension
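A couple of illustrative calls:

>>> import oneflow as flow
>>> import oneflow.nn.functional as F
>>> F.relu(flow.tensor([-1.0, 2.0])).tolist()
[0.0, 2.0]
>>> F.softmax(flow.tensor([0.0, 0.0]), dim=0).tolist()
[0.5, 0.5]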

Linear functions

linear

Applies a linear transformation to the incoming data: \(y = xA^T + b\).

Dropout functions

dropout

During training, randomly zeroes some of the elements of the input tensor with probability p using samples from a Bernoulli distribution.

dropout1d

The documentation is referenced from: https://pytorch.org/docs/1.10/generated/torch.nn.functional.dropout1d.html.

dropout2d

dropout2d(x: Tensor, p: float = 0.5, training: bool = True) -> Tensor

dropout3d

dropout3d(x: Tensor, p: float = 0.5, training: bool = True) -> Tensor
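An illustrative property check: with p=0.5, elements that survive dropout are rescaled by 1/(1-p) = 2:

>>> import oneflow as flow
>>> import oneflow.nn.functional as F
>>> y = F.dropout(flow.ones(1000), p=0.5, training=True)
>>> bool(((y == 0.0) | (y == 2.0)).all())
True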

Sparse functions

embedding

A simple lookup table that looks up embeddings in a fixed dictionary and size.

one_hot

This operator generates a one-hot Tensor from the input Tensor.
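An illustrative sketch, assuming a PyTorch-style num_classes parameter:

>>> import oneflow as flow
>>> import oneflow.nn.functional as F
>>> F.one_hot(flow.tensor([0, 2]), num_classes=3).tolist()
[[1, 0, 0], [0, 0, 1]]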

Distance functions

cosine_similarity

Returns cosine similarity between x1 and x2, computed along dim.

pairwise_distance

Computes the pairwise distance between vectors \(v_1\), \(v_2\) using the p-norm:

Loss functions

sparse_softmax_cross_entropy

The interface is consistent with TensorFlow.

cross_entropy

See CrossEntropyLoss for details.

ctc_loss

The Connectionist Temporal Classification loss.

l1_loss

This operator computes the L1 loss between each element in input and target.

mse_loss

This operator computes the mean squared error (squared L2 norm) loss between each element in input and target.

smooth_l1_loss

Function that uses a squared term if the absolute element-wise error falls below beta and an L1 term otherwise.

triplet_margin_loss

Creates a criterion that measures the triplet loss given an input tensors \(x1\), \(x2\), \(x3\) and a margin with a value greater than \(0\).

binary_cross_entropy

The documentation is referenced from: https://pytorch.org/docs/1.10/generated/torch.nn.functional.binary_cross_entropy.html.

binary_cross_entropy_with_logits

The documentation is referenced from: https://pytorch.org/docs/1.10/generated/torch.nn.functional.binary_cross_entropy_with_logits.html.

Vision functions

deform_conv2d

Performs Deformable Convolution v2, described in Deformable ConvNets v2: More Deformable, Better Results if mask is not None and Performs Deformable Convolution, described in Deformable Convolutional Networks if mask is None.

pad

Pads tensor.

interpolate

The interface is consistent with PyTorch.

upsample

alias of oneflow.nn.modules.upsampling.Upsample

grid_sample

The interface is consistent with PyTorch.

affine_grid

The interface is consistent with PyTorch.

Greedy decoder

ctc_greedy_decoder

Performs greedy decoding on the logits given in input (best path).

oneflow.Tensor

A oneflow.Tensor is a multi-dimensional matrix containing elements of a single data type.

Data types

OneFlow defines 8 Tensor types with CPU and GPU variants which are as follows:

Data type                 | dtype                             | CPU tensor           | GPU tensor
--------------------------|-----------------------------------|----------------------|--------------------------
Boolean                   | oneflow.bool                      | oneflow.BoolTensor   | oneflow.cuda.BoolTensor
8-bit integer (unsigned)  | oneflow.uint8                     | oneflow.ByteTensor   | oneflow.cuda.ByteTensor
8-bit integer (signed)    | oneflow.int8                      | oneflow.CharTensor   | oneflow.cuda.CharTensor
64-bit floating point     | oneflow.float64 or oneflow.double | oneflow.DoubleTensor | oneflow.cuda.DoubleTensor
32-bit floating point     | oneflow.float32 or oneflow.float  | oneflow.FloatTensor  | oneflow.cuda.FloatTensor
16-bit floating point     | oneflow.float16 or oneflow.half   | oneflow.HalfTensor   | oneflow.cuda.HalfTensor
32-bit integer (signed)   | oneflow.int32 or oneflow.int      | oneflow.IntTensor    | oneflow.cuda.IntTensor
64-bit integer (signed)   | oneflow.int64 or oneflow.long     | oneflow.LongTensor   | oneflow.cuda.LongTensor

Initializing and basic operations

A tensor can be constructed from a Python list or sequence using the oneflow.tensor() constructor:

>>> import oneflow
>>> import numpy as np
>>> oneflow.tensor([[1., -1.], [1., -1.]])
tensor([[ 1., -1.],
        [ 1., -1.]], dtype=oneflow.float32)
>>> oneflow.tensor(np.array([[1, 2, 3], [4, 5, 6]]))
tensor([[ 1, 2, 3],
        [ 4, 5, 6]], dtype=oneflow.int64)

Warning

oneflow.tensor() always copies data. If you have a Tensor data and just want to change its requires_grad flag, use requires_grad_() or detach() to avoid a copy. If you have a numpy array and want to avoid a copy, use oneflow.as_tensor().
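An illustrative contrast between the copying and sharing behavior described above:

>>> import numpy as np
>>> import oneflow
>>> a = np.ones(3)
>>> copied = oneflow.tensor(a)     # always copies
>>> shared = oneflow.as_tensor(a)  # shares memory when possible
>>> a[0] = 5.0
>>> copied[0].item(), shared[0].item()
(1.0, 5.0)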

>>> import oneflow
>>> oneflow.zeros([2, 4], dtype=oneflow.int32)
tensor([[ 0, 0, 0, 0],
        [ 0, 0, 0, 0]], dtype=oneflow.int32)
>>> cuda0 = oneflow.device('cuda:0')
>>> oneflow.ones([2, 4], dtype=oneflow.float64, device=cuda0)
tensor([[ 1., 1., 1., 1.],
        [ 1., 1., 1., 1.]], device='cuda:0', dtype=oneflow.float64)

For more information about building tensors, see Creation Ops

The contents of a tensor can be accessed and modified using Python’s indexing and slicing notation:

>>> import oneflow
>>> x = oneflow.tensor([[1, 2, 3], [4, 5, 6]])
>>> print(x[1][2])
tensor(6, dtype=oneflow.int64)
>>> x[0][1] = 8
>>> print(x)
tensor([[1, 8, 3],
        [4, 5, 6]], dtype=oneflow.int64)

Use oneflow.Tensor.item() to get a Python number from a tensor containing a single value:

>>> import oneflow
>>> x = oneflow.tensor([[1]])
>>> x
tensor([[1]], dtype=oneflow.int64)
>>> x.item()
1
>>> x = oneflow.tensor(2.5)
>>> x
tensor(2.5000, dtype=oneflow.float32)
>>> x.item()
2.5

For more information about indexing, see Indexing, Slicing, Joining, Mutating Ops

A tensor can be created with requires_grad=True so that oneflow.autograd records operations on them for automatic differentiation.

>>> import oneflow
>>> x = oneflow.tensor([[1., -1.], [1., 1.]], requires_grad=True)
>>> out = x.pow(2).sum()
>>> out.backward()
>>> x.grad
tensor([[ 2., -2.],
        [ 2.,  2.]], dtype=oneflow.float32)

Note

For more information on the oneflow.dtype, oneflow.device, and oneflow.layout attributes of a oneflow.Tensor, see Tensor Attributes.

Note

Methods which mutate a tensor are marked with an underscore suffix. For example, oneflow.FloatTensor.add_() computes the addition in-place and returns the modified tensor, while oneflow.FloatTensor.add() computes the result in a new tensor.
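For example, using add_/add (both appear in the Tensor method list below):

>>> import oneflow
>>> t = oneflow.ones(2)
>>> _ = t.add_(1)  # in-place: t itself is modified
>>> t.tolist()
[2.0, 2.0]
>>> u = t.add(1)   # out-of-place: t is unchanged
>>> t.tolist(), u.tolist()
([2.0, 2.0], [3.0, 3.0])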

Note

To change an existing tensor’s oneflow.device and/or oneflow.dtype, consider using to() method of Tensor object.

Warning

Current implementation of oneflow.Tensor introduces memory overhead, so it might lead to unexpectedly high memory usage in applications with many tiny tensors. If this is your case, consider using one large structure.

Tensor class reference

class oneflow.Tensor

There are a few main ways to create a tensor, depending on your use case.

  • To create a tensor with pre-existing data, use oneflow.tensor().

  • To create a tensor with specific size, use oneflow.* tensor creation ops (see Creation Ops).

  • To create a tensor with the same size (and similar types) as another tensor, use oneflow.*_like tensor creation ops (see Creation Ops).

Tensor.new_empty

Returns a Tensor of size size filled with uninitialized data.

Tensor.new_ones

See oneflow.new_ones()

Tensor.new_zeros

Returns a Tensor of size size filled with 0.

Tensor.new_full

Returns a Tensor of size size filled with fill_value.

Tensor.new_tensor

Tensor.is_cuda

Is True if the Tensor is stored on the GPU, False otherwise.

Tensor.is_global

Return whether this Tensor is a global tensor.

Tensor.device

Is the oneflow.device on which this Tensor resides; it is invalid for a global tensor.

Tensor.grad

Return the gradient calculated by autograd functions.

Tensor.ndim

See oneflow.Tensor.dim()

Tensor.abs

See oneflow.abs()

Tensor.acos

See oneflow.acos()

Tensor.acosh

See oneflow.acosh()

Tensor.add

See oneflow.add()

Tensor.add_

In-place version of oneflow.Tensor.add().

Tensor.addcdiv

See oneflow.addcdiv()

Tensor.addcdiv_

In-place version of oneflow.Tensor.addcdiv()

Tensor.addcmul

See oneflow.addcmul()

Tensor.addcmul_

In-place version of oneflow.Tensor.addcmul().

Tensor.addmm

See oneflow.addmm()

Tensor.all

See oneflow.all()

Tensor.amin

See oneflow.amin()

Tensor.amax

See oneflow.amax()

Tensor.any

See oneflow.any()

Tensor.arccos

See oneflow.arccos()

Tensor.arccosh

See oneflow.arccosh()

Tensor.arcsin

See oneflow.arcsin()

Tensor.arcsinh

See oneflow.arcsinh()

Tensor.arctan

See oneflow.arctan()

Tensor.arctanh

See oneflow.arctanh()

Tensor.argmax

See oneflow.argmax()

Tensor.argmin

See oneflow.argmin()

Tensor.argsort

See oneflow.argsort()

Tensor.argwhere

See oneflow.argwhere()

Tensor.asin

See oneflow.asin()

Tensor.asinh

See oneflow.asinh()

Tensor.atan

See oneflow.atan()

Tensor.atan2

See oneflow.atan2()

Tensor.atanh

See oneflow.atanh()

Tensor.backward

Computes the gradient of current tensor w.r.t. graph leaves.

Tensor.bmm

See oneflow.bmm()

Tensor.bool

Tensor.bool() is equivalent to Tensor.to(oneflow.bool).

Tensor.byte

self.byte() is equivalent to self.to(oneflow.uint8).

Tensor.cast

See oneflow.cast()

Tensor.ceil

See oneflow.ceil()

Tensor.ceil_

See oneflow.ceil_()

Tensor.chunk

See oneflow.chunk()

Tensor.clamp

See oneflow.clamp().

Tensor.clamp_

In-place version of oneflow.Tensor.clamp().

Tensor.clip

Alias for oneflow.Tensor.clamp().

Tensor.clip_

Alias for oneflow.Tensor.clamp_().

Tensor.clone

See oneflow.clone()

Tensor.contiguous

Tensor.copy_

Copies the elements from src into self tensor and returns self.

Tensor.cos

See oneflow.cos()

Tensor.cosh

See oneflow.cosh()

Tensor.cpu

Returns a copy of this object in CPU memory.

Tensor.cuda

Returns a copy of this object in CUDA memory.

Tensor.cumprod

See oneflow.cumprod()

Tensor.cumsum

See oneflow.cumsum()

Tensor.data

Tensor.dot

See oneflow.dot()

Tensor.detach

Tensor.placement

Is the oneflow.placement where this Tensor is, which is invalid for a local tensor.

Tensor.sbp

Is the oneflow.sbp representing how the data of the global tensor is distributed, which is invalid for a local tensor.

Tensor.diag

See oneflow.diag()

Tensor.diagonal

See oneflow.diagonal()

Tensor.dim

Tensor.dim() → int

Tensor.div

See oneflow.div()

Tensor.div_

In-place version of oneflow.Tensor.div().

Tensor.double

Tensor.double() is equivalent to Tensor.to(flow.float64).

Tensor.dtype

Tensor.digamma

See oneflow.digamma()

Tensor.element_size

Tensor.element_size() → int

Tensor.eq

See oneflow.eq()

Tensor.equal

See oneflow.equal()

Tensor.erf

See oneflow.erf()

Tensor.erfc

See oneflow.erfc()

Tensor.erfinv

See oneflow.erfinv()

Tensor.erfinv_

In-place version of oneflow.erfinv()

Tensor.exp

See oneflow.exp()

Tensor.exp2

See oneflow.exp2()

Tensor.expand

See oneflow.expand()

Tensor.expand_as

Expand this tensor to the same size as other.

Tensor.expm1

See oneflow.expm1()

Tensor.fill_

Tensor.fill_(value) → Tensor

Tensor.flatten

See oneflow.flatten()

Tensor.flip

See oneflow.flip()

Tensor.float

Tensor.float() is equivalent to Tensor.to(flow.float32).

Tensor.floor

See oneflow.floor()

Tensor.floor_

See oneflow.floor_()

Tensor.floor_divide

Tensor.fmod

See oneflow.fmod()

Tensor.gather

See oneflow.gather()

Tensor.ge

See oneflow.ge()

Tensor.get_device

For CUDA tensors, this function returns the device ordinal of the GPU on which the tensor resides.

Tensor.grad_fn

Return the function that created this tensor if its requires_grad is True.

Tensor.gt

See oneflow.gt()

Tensor.gt_

In-place version of oneflow.Tensor.gt().

Tensor.half

self.half() is equivalent to self.to(dtype=oneflow.float16).

Tensor.in_top_k

See oneflow.in_top_k()

Tensor.index_select

See oneflow.index_select()

Tensor.index_add

Tensor.index_add_

The interface is consistent with PyTorch.

Tensor.int

Tensor.int() is equivalent to Tensor.to(flow.int32).

Tensor.is_contiguous

Returns True if self tensor is contiguous in memory.

Tensor.is_floating_point

See oneflow.is_floating_point()

Tensor.is_lazy

Return whether this Tensor is a lazy tensor.

Tensor.is_leaf

All Tensors that have requires_grad which is False will be leaf Tensors by convention.

Tensor.isinf

See oneflow.isinf()

Tensor.isnan

See oneflow.isnan()

Tensor.item

Returns the value of this tensor as a standard Python number.

Tensor.le

See oneflow.le()

Tensor.lerp

See oneflow.lerp()

Tensor.lerp_

See oneflow.lerp_()

Tensor.log

See oneflow.log()

Tensor.log1p

See oneflow.log1p()

Tensor.log2

See oneflow.log2()

Tensor.log10

See oneflow.log10()

Tensor.logical_and

See oneflow.logical_and()

Tensor.logical_or

See oneflow.logical_or()

Tensor.logical_not

See oneflow.logical_not()

Tensor.logical_xor

See oneflow.logical_xor()

Tensor.long

Tensor.long() is equivalent to Tensor.to(flow.int64).

Tensor.lt

See oneflow.lt()

Tensor.masked_fill

See oneflow.masked_fill()

Tensor.masked_fill_

In-place version of oneflow.Tensor.masked_fill().

Tensor.masked_select

See oneflow.masked_select()

Tensor.matmul

See oneflow.matmul()

Tensor.mm

See oneflow.mm()

Tensor.mv

See oneflow.mv()

Tensor.max

See oneflow.max()

Tensor.maximum

See oneflow.maximum()

Tensor.median

See oneflow.median()

Tensor.mean

See oneflow.mean()

Tensor.min

See oneflow.min()

Tensor.minimum

See oneflow.minimum()

Tensor.mish

See oneflow.mish()

Tensor.mode

See oneflow.mode()

Tensor.mul

See oneflow.mul()

Tensor.mul_

In-place version of oneflow.Tensor.mul().

Tensor.frac

See oneflow.frac().

Tensor.frac_

In-place version of oneflow.Tensor.frac().

Tensor.nansum

See oneflow.nansum()

Tensor.narrow

See oneflow.narrow()

Tensor.ndimension

Tensor.ne

See oneflow.ne()

Tensor.neg

See oneflow.neg()

Tensor.negative

See oneflow.negative()

Tensor.nelement

Tensor.nelement() → int

Tensor.nonzero

See oneflow.nonzero()

Tensor.norm

See oneflow.norm()

Tensor.normal_

Fills self tensor with elements sampled from the normal distribution parameterized by mean and std.

Tensor.numel

See oneflow.numel()

Tensor.numpy

Tensor.numpy() → numpy.ndarray

Tensor.offload

Transfer tensor data from GPU memory back to host (CPU) memory.

Tensor.load

Load tensor data stored on the host (CPU) back to GPU memory.

Tensor.is_offloaded

Determine whether the tensor has been moved to CPU memory and the CUDA device memory has been released.

Tensor.permute

See oneflow.permute()

Tensor.pow

See oneflow.pow()

Tensor.prod

See oneflow.prod()

Tensor.quantile

See oneflow.quantile()

Tensor.reciprocal

See oneflow.reciprocal()

Tensor.register_hook

Registers a backward hook.

Tensor.relu

See oneflow.relu()

Tensor.repeat

See oneflow.repeat()

Tensor.repeat_interleave

See oneflow.repeat_interleave()

Tensor.requires_grad

Is True if gradients need to be computed for this Tensor, False otherwise.

Tensor.requires_grad_

Sets this tensor’s requires_grad attribute in-place.

Tensor.reshape

See oneflow.reshape()

Tensor.reshape_as

Returns this tensor as the same shape as other.

Tensor.retain_grad

Enables this Tensor to have its grad populated during backward().

Tensor.roll

See oneflow.roll()

Tensor.round

See oneflow.round()

Tensor.round_

See oneflow.round_()

Tensor.rsqrt

See oneflow.rsqrt()

Tensor.selu

See oneflow.selu()

Tensor.shape

Tensor.sigmoid

See oneflow.sigmoid()

Tensor.sign

See oneflow.sign()

Tensor.silu

See oneflow.silu()

Tensor.sin

See oneflow.sin()

Tensor.sin_

See oneflow.sin_()

Tensor.sinh

See oneflow.sinh()

Tensor.size

Returns the size of the self tensor.

Tensor.softmax

See oneflow.softmax()

Tensor.softplus

See oneflow.softplus()

Tensor.softsign

See oneflow.softsign()

Tensor.sort

See oneflow.sort()

Tensor.split

See oneflow.split()

Tensor.sqrt

See oneflow.sqrt()

Tensor.square

See oneflow.square()

Tensor.squeeze

See oneflow.squeeze()

Tensor.squeeze_

In-place version of oneflow.Tensor.squeeze()

Tensor.std

See oneflow.std()

Tensor.storage_offset

Returns self tensor’s offset in the underlying storage in terms of number of storage elements (not bytes).

Tensor.stride

Tensor.logsumexp

See oneflow.logsumexp()

Tensor.sum

See oneflow.sum()

Tensor.swapaxes

See oneflow.swapaxes()

Tensor.swapdims

See oneflow.swapdims()

Tensor.sub

See oneflow.sub()

Tensor.sub_

In-place version of oneflow.Tensor.sub().

Tensor.tan

See oneflow.tan()

Tensor.tanh

See oneflow.tanh()

Tensor.tile

See oneflow.tile()

Tensor.to

Performs Tensor dtype and/or device conversion.

Tensor.local_to_global

Creates a global tensor from a local tensor.

Tensor.global_to_global

Performs Tensor placement and/or sbp conversion.

Tensor.to_global

Creates a global tensor if this tensor is a local tensor, otherwise performs Tensor placement and/or sbp conversion.

Tensor.to_local

Returns the local component of this global tensor in the current rank.

Tensor.to_consistent

This interface is no longer available, please use oneflow.Tensor.to_global() instead.

Tensor.tolist

Returns the tensor as a (nested) list.

Tensor.topk

See oneflow.topk()

Tensor.transpose

See oneflow.transpose()

Tensor.tril

See oneflow.tril()

Tensor.triu

See oneflow.triu()

Tensor.trunc

See oneflow.trunc()

Tensor.type_as

Returns this tensor cast to the type of the given tensor.

Tensor.type

Returns the type if dtype is not provided, else casts this object to the specified type.

Tensor.t

See oneflow.t()

Tensor.T

Returns a view of this Tensor with its dimensions reversed.

Tensor.unbind

See oneflow.unbind()

Tensor.unfold

Returns a view of the original tensor which contains all slices of size size from self tensor in the dimension dimension.

Tensor.uniform_

Tensor.uniform_(from=0, to=1) → Tensor

Tensor.unsqueeze

See oneflow.unsqueeze()

Tensor.unsqueeze_

In-place version of oneflow.Tensor.unsqueeze()

Tensor.as_strided

See oneflow.as_strided()

Tensor.as_strided_

In-place version of oneflow.Tensor.as_strided()

Tensor.var

See oneflow.var()

Tensor.view

Returns a new tensor with the same data as the self tensor but of a different shape.

Tensor.view_as

Returns a view of this tensor with the same size as other.

Tensor.where

See oneflow.where()

Tensor.zero_

Fills self tensor with zeros.

Tensor.nms

See oneflow.nms()

Tensor.pin_memory

Copies the tensor to pinned memory, if it’s not already pinned.

Tensor.is_pinned

Returns True if this tensor resides in pinned memory.

Tensor.inverse

See oneflow.linalg.inv()

Tensor.cross

See oneflow.cross()

Tensor.scatter

See oneflow.scatter()

Tensor.scatter_

In-place version of oneflow.Tensor.scatter()

Tensor.scatter_add

See oneflow.scatter_add()

Tensor.scatter_add_

In-place version of oneflow.Tensor.scatter_add()

Tensor.bernoulli

See oneflow.bernoulli()

Tensor.bernoulli_

In-place version of oneflow.Tensor.bernoulli().

Tensor.bincount

See oneflow.bincount()

Tensor.isclose

Tensor.allclose

Tensor.broadcast_to

See oneflow.broadcast_to()

Tensor.unique

See oneflow.unique()

Tensor.bitwise_and

See oneflow.bitwise_and()

Tensor.bitwise_or

See oneflow.bitwise_or()

Tensor.bitwise_xor

See oneflow.bitwise_xor()

Tensor.baddbmm

See oneflow.baddbmm()

Tensor Attributes

Each local oneflow.Tensor has a oneflow.dtype and a oneflow.device; each global oneflow.Tensor has a oneflow.dtype, a oneflow.placement, and a oneflow.sbp.

oneflow.dtype

class oneflow.dtype

A oneflow.dtype is an object that represents the data type of a oneflow.Tensor. Oneflow has eight different data types:

Data type                 | dtype                             | CPU tensor           | GPU tensor
--------------------------|-----------------------------------|----------------------|--------------------------
Boolean                   | oneflow.bool                      | oneflow.BoolTensor   | oneflow.cuda.BoolTensor
8-bit integer (unsigned)  | oneflow.uint8                     | oneflow.ByteTensor   | oneflow.cuda.ByteTensor
8-bit integer (signed)    | oneflow.int8                      | oneflow.CharTensor   | oneflow.cuda.CharTensor
64-bit floating point     | oneflow.float64 or oneflow.double | oneflow.DoubleTensor | oneflow.cuda.DoubleTensor
32-bit floating point     | oneflow.float32 or oneflow.float  | oneflow.FloatTensor  | oneflow.cuda.FloatTensor
16-bit floating point     | oneflow.float16 or oneflow.half   | oneflow.HalfTensor   | oneflow.cuda.HalfTensor
32-bit integer (signed)   | oneflow.int32 or oneflow.int      | oneflow.IntTensor    | oneflow.cuda.IntTensor
64-bit integer (signed)   | oneflow.int64 or oneflow.long     | oneflow.LongTensor   | oneflow.cuda.LongTensor

To find out if a oneflow.dtype is a floating point data type, the property is_floating_point can be used, which returns True if the data type is a floating point data type.
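
A quick sketch, assuming the dtype property mirrors the documented oneflow.is_floating_point() function:

>>> import oneflow
>>> oneflow.float32.is_floating_point
True
>>> oneflow.is_floating_point(oneflow.ones(1, dtype=oneflow.int32))
False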

When the dtypes of inputs to an arithmetic operation (add, sub, div, mul) differ, we promote by finding the minimum dtype that satisfies the following rules:

  • If the type of a scalar operand is of a higher category than tensor operands (where complex > floating > integral > boolean), we promote to a type with sufficient size to hold all scalar operands of that category.

  • If a zero-dimension tensor operand has a higher category than dimensioned operands, we promote to a type with sufficient size and category to hold all zero-dim tensor operands of that category.

  • If there are no higher-category zero-dim operands, we promote to a type with sufficient size and category to hold all dimensioned operands.

A floating point scalar operand has dtype oneflow.get_default_dtype() and an integral non-boolean scalar operand has dtype oneflow.int64. Unlike numpy, we do not inspect values when determining the minimum dtypes of an operand. Quantized and complex types are not yet supported.

Promotion Examples:

>>> float_tensor = oneflow.ones(1, dtype=oneflow.float)
>>> double_tensor = oneflow.ones(1, dtype=oneflow.double)
>>> int_tensor = oneflow.ones(1, dtype=oneflow.int)
>>> long_tensor = oneflow.ones(1, dtype=oneflow.long)
>>> uint_tensor = oneflow.ones(1, dtype=oneflow.uint8)
>>> bool_tensor = oneflow.ones(1, dtype=oneflow.bool)
# zero-dim tensors
>>> long_zerodim = oneflow.tensor(1, dtype=oneflow.long)
>>> int_zerodim = oneflow.tensor(1, dtype=oneflow.int)

>>> a, b = oneflow.tensor(5), oneflow.tensor(5)
>>> oneflow.add(a, b).dtype
oneflow.int64
# 5 is an int64, but does not have higher category than int_tensor so is not considered.
>>> (int_tensor + 5).dtype
oneflow.int32
>>> (int_tensor + long_zerodim).dtype
oneflow.int64
>>> (long_tensor + int_tensor).dtype
oneflow.int64
>>> (bool_tensor + long_tensor).dtype
oneflow.int64
>>> (bool_tensor + uint_tensor).dtype
oneflow.uint8
>>> (float_tensor + double_tensor).dtype
oneflow.float64
>>> (bool_tensor + int_tensor).dtype
oneflow.int32
# Since long is a different kind than float, result dtype only needs to be large enough
# to hold the float.
>>> oneflow.add(long_tensor, float_tensor).dtype
oneflow.float32

When the output tensor of an arithmetic operation is specified, we allow casting to its dtype except that:
  • An integral output tensor cannot accept a floating point tensor.

  • A boolean output tensor cannot accept a non-boolean tensor.

  • A non-complex output tensor cannot accept a complex tensor.

Casting Examples:

# allowed:
>>> float_tensor *= float_tensor
>>> float_tensor *= int_tensor
>>> float_tensor *= uint_tensor
>>> float_tensor *= bool_tensor
>>> int_tensor *= uint_tensor

# disallowed (RuntimeError: result type can't be cast to the desired output type):
>>> float_tensor *= double_tensor
>>> int_tensor *= float_tensor
>>> int_tensor *= long_tensor
>>> uint_tensor *= int_tensor
>>> bool_tensor *= int_tensor
>>> bool_tensor *= uint_tensor
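
When an in-place cast is disallowed, an explicit conversion with Tensor.to() (documented above) sidesteps the restriction; a small sketch reusing the tensors defined earlier:

>>> out = float_tensor.to(oneflow.double) * double_tensor           # cast up first, multiply out-of-place
>>> out = (int_tensor.to(oneflow.float) * float_tensor).to(oneflow.int32)  # compute in float, cast back explicitly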

oneflow.device

class oneflow.device

A oneflow.device is an object representing the device on which a oneflow.Tensor is or will be allocated.

The oneflow.device contains a device type ('cpu' or 'cuda') and optional device ordinal for the device type. If the device ordinal is not present, this object will always represent the current device for the device type, even after oneflow.cuda.set_device() is called; e.g., a oneflow.Tensor constructed with device 'cuda' is equivalent to 'cuda:X' where X is the result of oneflow.cuda.current_device().

A oneflow.Tensor’s device can be accessed via the Tensor.device property.

A oneflow.device can be constructed via a string or via a string and device ordinal.

Via a string:

>>> oneflow.device('cuda:0')
device(type='cuda', index=0)

>>> oneflow.device('cpu')
device(type='cpu', index=0)

>>> oneflow.device('cuda')  # current cuda device
device(type='cuda', index=0)

Via a string and device ordinal:

>>> oneflow.device('cuda', 0)
device(type='cuda', index=0)

>>> oneflow.device('cpu', 0)
device(type='cpu', index=0)

Note

The oneflow.device argument in functions can generally be substituted with a string. This allows for fast prototyping of code.

>>> # Example of a function that takes in a oneflow.device
>>> cuda1 = oneflow.device('cuda:1')
>>> oneflow.randn((2,3), device=cuda1)
>>> # You can substitute the oneflow.device with a string
>>> oneflow.randn((2,3), device='cuda:1')

Note

For legacy reasons, a device can be constructed via a single device ordinal, which is treated as a cuda device. This matches Tensor.get_device(), which returns an ordinal for cuda tensors and is not supported for cpu tensors.

>>> oneflow.device(1)
device(type='cuda', index=1)

Note

Methods which take a device will generally accept a (properly formatted) string or (legacy) integer device ordinal, i.e. the following are all equivalent:

>>> oneflow.randn((2,3), device=oneflow.device('cuda:1'))
>>> oneflow.randn((2,3), device='cuda:1')
>>> oneflow.randn((2,3), device=1)  # legacy

oneflow.placement

class oneflow.placement

A oneflow.placement is an object representing the device group on which a oneflow.Tensor is or will be allocated. The oneflow.placement contains a device type (‘cpu’ or ‘cuda’) and corresponding device sequence.

A oneflow.Tensor’s placement can be accessed via the Tensor.placement property.

A oneflow.placement can be constructed in several ways:

>>> import oneflow as flow

>>> p = flow.placement(type="cuda", ranks=[0, 1, 2, 3])
>>> p
oneflow.placement(type="cuda", ranks=[0, 1, 2, 3])
>>> p = flow.placement(type="cuda", ranks=[[0, 1], [2, 3]])
>>> p
oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3]])

oneflow.placement.all

oneflow.placement.all(device_type) → oneflow.placement

Returns a placement that contains all available devices.

Parameters

device_type (str) – cuda or cpu

For example:

# Runs on 4 ranks
import oneflow as flow

p = flow.placement.all("cuda") # oneflow.placement(type="cuda", ranks=[0, 1, 2, 3])
p = flow.placement.all("cpu") # oneflow.placement(type="cpu", ranks=[0, 1, 2, 3])

oneflow.env.all_device_placement

oneflow.env.all_device_placement(device_type) → oneflow.placement

Returns a placement that contains all available devices.

Note

It is recommended to use oneflow.placement.all instead of this function.

Parameters

device_type (str) – cuda or cpu

For example:

# Runs on 4 ranks
import oneflow as flow

p = flow.env.all_device_placement("cuda") # oneflow.placement(type="cuda", ranks=[0, 1, 2, 3])
p = flow.env.all_device_placement("cpu") # oneflow.placement(type="cpu", ranks=[0, 1, 2, 3])

oneflow.sbp.sbp

class oneflow.sbp.sbp

A oneflow.sbp is an object representing how the data of the global tensor is distributed across the ranks of the Tensor placement.

oneflow.sbp includes three types:

  • oneflow.sbp.split(dim)

    Indicates that the global tensor is evenly divided according to the dimension dim and distributed on each rank.

  • oneflow.sbp.broadcast()

    Indicates that the global tensor is replicated on each rank.

  • oneflow.sbp.partial_sum()

    Indicates that the value of the global tensor is element-wise sum of the local tensors distributed in each rank.

A oneflow.Tensor’s sbp can be accessed via the Tensor.sbp property.

A oneflow.sbp can be constructed in several ways:

>>> import oneflow as flow

>>> s = flow.sbp.split(0)
>>> s
oneflow.sbp.split(dim=0)
>>> b = flow.sbp.broadcast()
>>> b
oneflow.sbp.broadcast
>>> p = flow.sbp.partial_sum()
>>> p
oneflow.sbp.partial_sum

Type Info

The numerical properties of a oneflow.dtype can be accessed through either the oneflow.finfo or the oneflow.iinfo.

oneflow.finfo

class oneflow.finfo

A oneflow.finfo is an object that represents the numerical properties of a floating point oneflow.dtype (i.e. oneflow.float32, oneflow.float64 and oneflow.float16). This is similar to numpy.finfo.

A oneflow.finfo provides the following attributes:

Name       | Type  | Description
-----------|-------|------------------------------------------------------------------------
bits       | int   | The number of bits occupied by the type.
eps        | float | The smallest representable number such that 1.0 + eps != 1.0.
min        | float | The smallest representable number (typically -max).
max        | float | The largest representable number.
tiny       | float | The smallest positive normal number. See notes.
resolution | float | The approximate decimal resolution of this type, i.e., 10**-precision.

For example:

>>> import oneflow as flow
>>> flow.finfo()
finfo(resolution=1e-06, min=-3.40282e+38, max=3.40282e+38, eps=1.19209e-07, tiny=1.17549e-38, dtype=oneflow.float32, bits=32)
>>> flow.finfo(flow.float)
finfo(resolution=1e-06, min=-3.40282e+38, max=3.40282e+38, eps=1.19209e-07, tiny=1.17549e-38, dtype=oneflow.float32, bits=32)
>>> flow.finfo(flow.float16).bits
16
>>> flow.finfo(flow.float16).max
65504.0

oneflow.iinfo

class oneflow.iinfo

A oneflow.iinfo is an object that represents the numerical properties of an integer oneflow.dtype (i.e. oneflow.uint8, oneflow.int8, oneflow.int16, oneflow.int32, and oneflow.int64). This is similar to numpy.iinfo.

A oneflow.iinfo provides the following attributes:

Name | Type | Description
-----|------|------------------------------------------
bits | int  | The number of bits occupied by the type.
min  | int  | The smallest representable number.
max  | int  | The largest representable number.

For example:

>>> import oneflow as flow
>>> flow.iinfo(flow.int8)
iinfo(min=-128, max=127, dtype=oneflow.int8, bits=8)
>>> flow.iinfo(flow.int).max
2147483647
>>> flow.iinfo(flow.int).bits
32

oneflow.autograd

oneflow.autograd provides classes and functions implementing automatic differentiation of arbitrary scalar-valued functions. It requires minimal changes to the existing code - you only need to declare Tensors for which gradients should be computed with the requires_grad=True keyword. As of now, we only support autograd for floating point Tensor types (half, float, double and bfloat16).

backward

Computes the sum of gradients of given tensors with respect to graph leaves.

grad

Computes and returns the sum of gradients of outputs with respect to the inputs.
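
A minimal sketch of the two entry points on a scalar loss (the expected gradient 2*x = 6 is noted as a comment):

>>> import oneflow
>>> x = oneflow.tensor([3.0], requires_grad=True)
>>> loss = (x * x).sum()
>>> grads = oneflow.autograd.grad(loss, x)  # -> (tensor([6.], ...),)
>>> loss2 = (x * x).sum()
>>> loss2.backward()  # accumulates into x.grad instead of returning it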

Locally disabling gradient computation

no_grad

Context-manager that disables gradient calculation.

enable_grad

Context-manager that enables gradient calculation.

set_grad_enabled

Context-manager that sets gradient calculation on or off.

inference_mode

Context-manager that enables or disables inference mode.
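
A short sketch of locally disabling gradient tracking with oneflow.no_grad():

>>> import oneflow
>>> x = oneflow.ones(2, 2, requires_grad=True)
>>> with oneflow.no_grad():
...     y = x * 2
>>> y.requires_grad
False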

In-place operations on Tensors

Supporting in-place operations in autograd is a hard matter, and we discourage their use in most cases. Autograd’s aggressive buffer freeing and reuse makes it very efficient and there are very few occasions when in-place operations actually lower memory usage by any significant amount. Unless you’re operating under heavy memory pressure, you might never need to use them.

Tensor autograd functions

oneflow.Tensor.grad

Return the gradient calculated by autograd functions.

oneflow.Tensor.requires_grad

Is True if gradients need to be computed for this Tensor, False otherwise.

oneflow.Tensor.is_leaf

All Tensors that have requires_grad which is False will be leaf Tensors by convention.

oneflow.Tensor.backward([gradient, …])

Computes the gradient of current tensor w.r.t. graph leaves.

oneflow.Tensor.detach

oneflow.Tensor.register_hook(hook)

Registers a backward hook.

oneflow.Tensor.retain_grad

Enables this Tensor to have its grad populated during backward().

Function

class oneflow.autograd.Function(self)

Base class to create custom autograd.Function.

To create a custom autograd.Function, subclass this class and implement the forward() and backward() static methods. Then, to use your custom op in the forward pass, call the class method apply() or __call__(). Do not call forward() directly.

For example:

import oneflow
from oneflow.autograd import Function

class Exp(Function):
    @staticmethod
    def forward(ctx, i):
        result = i.exp()
        ctx.save_for_backward(result)
        return result

    @staticmethod
    def backward(ctx, grad_output):
        result, = ctx.saved_tensors
        return grad_output * result

# Use it by calling the apply method or __call__ method
output = Exp.apply(input)  # output = Exp()(input)

Function.forward

Override this function for custom forward calculation.

Function.backward

Override this function for custom backward calculation.

Function.apply

Calculate output tensors and build backward graph.

Context method mixins

When creating a new Function, the following methods are available to ctx.

oneflow.cuda

is_available

Returns a bool indicating if CUDA is currently available.

device_count

Returns the number of GPUs available.

current_device

Returns local rank as device index.

set_device

Sets the current device.

synchronize

Waits for all kernels in all streams on a CUDA device to complete.

get_device_properties

Gets the properties of a device.

get_device_capability

Gets the cuda capability of a device.

get_device_name

Gets the name of a device.

Note

The current_device returns the local rank as the device index. It is different from torch.cuda.current_device() in PyTorch.

Random Number Generator

manual_seed_all

Sets the seed for generating random numbers on all GPUs.

manual_seed

Sets the seed for generating random numbers for the current GPU.

get_rng_state

Returns the random number generator state of the specified GPU as a ByteTensor.

get_rng_state_all

Returns a list of ByteTensor representing the random number states of all devices.

set_rng_state

Sets the random number generator state of the specified GPU.

set_rng_state_all

Sets the random number generator state of all devices.
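
A brief sketch of seeding and checkpointing the CUDA RNG state with the functions above (assumes at least one GPU):

import oneflow as flow

flow.cuda.manual_seed_all(42)           # one seed for every GPU
states = flow.cuda.get_rng_state_all()  # snapshot the per-device states
# ... draw random numbers ...
flow.cuda.set_rng_state_all(states)     # restore to reproduce the same draws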

GPU tensor

HalfTensor

The tensor type oneflow.cuda.HalfTensor is not available.

FloatTensor

The tensor type oneflow.cuda.FloatTensor is not available.

DoubleTensor

The tensor type oneflow.cuda.DoubleTensor is not available.

BoolTensor

The tensor type oneflow.cuda.BoolTensor is not available.

ByteTensor

The tensor type oneflow.cuda.ByteTensor is not available.

CharTensor

The tensor type oneflow.cuda.CharTensor is not available.

IntTensor

The tensor type oneflow.cuda.IntTensor is not available.

LongTensor

The tensor type oneflow.cuda.LongTensor is not available.

Memory management

empty_cache

Releases all unoccupied cached memory currently held by the caching allocators of all OneFlow streams, so that it can be re-allocated in OneFlow streams or by other GPU applications and becomes visible in nvidia-smi.

oneflow.distributed

Note

Please refer to OneFlow Distributed Overview for a brief introduction to all features related to distributed training.

OneFlow provides two ways to accomplish Distributed Training:

  • The first, recommended way is to use OneFlow’s global Tensor for distributed training. Global Tensor regards the computing cluster as a supercomputing device, allowing users to write distributed training code just like in a single-machine environment.

  • OneFlow also provides a DDP (DistributedDataParallel) module aligned with PyTorch. DDP has been well-known and widely used in data parallelism by the majority of PyTorch users. Also see the PyTorch DDP introduction.

Basic

When you start distributed training in OneFlow, the following functions can be used.

get_world_size

Returns the number of processes in the current process group.

get_rank

Returns the rank of the current process group.

get_local_rank

Returns the local rank of the current machine.

get_node_size

Returns the number of machines in the current process group.

init_rdma

Initializes RDMA in the current environment.

rdma_is_initialized

Returns whether RDMA is initialized in the current environment or not.
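
A minimal sketch of these helpers, assuming they live under oneflow.env as in the DDP example below; run it under the launcher (see "Launching distributed training") so the process group exists:

# python3 -m oneflow.distributed.launch --nproc_per_node 2 env_demo.py
import oneflow as flow

print(flow.env.get_world_size())  # total number of processes
print(flow.env.get_rank())        # this process's global rank
print(flow.env.get_local_rank())  # this process's rank on its machine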

Global Tensor

Construct Global Tensor

A Global Tensor can be created with a placement and a sbp. The placement describes the physical devices on which the global tensor will be allocated, and the sbp describes how it is distributed among these devices.

>>> import oneflow as flow
>>> # Place a global tensor on the cuda devices of ranks (processes) 0 and 1
>>> placement = flow.placement(type="cuda", ranks=[0, 1])
>>> # Each rank's local data is one partition of the global data, split along dim 0
>>> sbp = flow.sbp.split(dim=0)
>>> # Create a global tensor by randn
>>> x = flow.randn(4, 5, placement=placement, sbp=sbp)
>>> x.shape
oneflow.Size([4, 5])

Convert Local Tensor to Global Tensor

With the Tensor.to_global interface, a Local Tensor can create a Global Tensor and use that Local Tensor as its local component on the current node.

In the following example, two local tensors with the shape (2, 5) are created separately on two devices; after the to_global call, a global tensor with the shape (4, 5) is obtained.

Code running on Node 0

import oneflow as flow

x = flow.randn(2,5)
placement = flow.placement("cuda", [0,1])
sbp = flow.sbp.split(0)
x_global = x.to_global(placement=placement, sbp=sbp)
x_global.shape

Code running on Node 1

import oneflow as flow

x = flow.randn(2,5)
placement = flow.placement("cuda", [0,1])
sbp = flow.sbp.split(0)
x_global = x.to_global(placement=placement, sbp=sbp)
x_global.shape

Redistribute Global Tensor

Redistributing a Global Tensor means moving its data to another device group (or placement), or changing its data distribution (or SBP) across the group, or both at the same time. The redistributed tensor is still a Global Tensor.

>>> import oneflow as flow
>>> x = flow.tensor([1.0, 2.0], placement=flow.placement("cuda", ranks=[0, 1]), sbp=flow.sbp.split(0))
>>> y = x.to_global(placement=flow.placement("cuda", ranks=[2, 3]), sbp=flow.sbp.broadcast)

According to the operator’s semantics, OneFlow defines a set of valid input and output SBP combinations for each built-in operator. So OneFlow can automatically redistribute the Global Tensor to satisfy the operator’s SBP requirements for its input Tensor. For example, the following code:

>>> import oneflow as flow
>>> x = flow.randn(4, 4,
        placement=flow.placement("cuda", ranks=[0, 1]),
        sbp=flow.sbp.split(0))
>>> y = flow.randn(4, 4,
        placement=flow.placement("cuda", ranks=[0, 1]),
        sbp=flow.sbp.split(1))
>>> z = x + y

When x + y is executed, x is split along dimension 0 while y is split along dimension 1, so their local components on each rank cannot be added directly; OneFlow therefore automatically redistributes one of x and y so that they share the same SBP, and the add operation completes successfully.

Note

  • Global Tensor can not be used in combination with DDP currently.

  • Global Tensor requires all devices to execute at the same pace, otherwise, it may cause multi-process deadlock.

Get Local Tensor from Global Tensor

With Tensor.to_local interface, the Global Tensor can return its local component at the current node.

>>> y = x.to_local()
>>> y.is_local
True
>>> y
tensor([[ 2.9186e-01, -3.9442e-01,  4.7072e-04, -3.2216e-01,  1.7788e-01],
        [-4.5284e-01,  1.2361e-01, -3.5962e-01,  2.6651e-01,  1.2951e+00]],
       device='cuda:0', dtype=oneflow.float32)

DistributedDataParallel

For more information about DistributedDataParallel, see nn.parallel.DistributedDataParallel

The following script shows the process of using oneflow.nn.parallel.DistributedDataParallel for data-parallel training:

import oneflow as flow
from oneflow.nn.parallel import DistributedDataParallel as ddp

train_x = [
    flow.tensor([[1, 2], [2, 3]], dtype=flow.float32),
    flow.tensor([[4, 6], [3, 1]], dtype=flow.float32),
]
train_y = [
    flow.tensor([[8], [13]], dtype=flow.float32),
    flow.tensor([[26], [9]], dtype=flow.float32),
]


class Model(flow.nn.Module):
    def __init__(self):
        super().__init__()
        self.lr = 0.01
        self.iter_count = 500
        self.w = flow.nn.Parameter(flow.tensor([[0], [0]], dtype=flow.float32))

    def forward(self, x):
        x = flow.matmul(x, self.w)
        return x


m = Model().to("cuda")
m = ddp(m)
loss = flow.nn.MSELoss(reduction="sum")
optimizer = flow.optim.SGD(m.parameters(), m.lr)

for i in range(0, m.iter_count):
    rank = flow.env.get_rank()
    x = train_x[rank].to("cuda")
    y = train_y[rank].to("cuda")

    y_pred = m(x)
    l = loss(y_pred, y)
    if (i + 1) % 50 == 0:
        print(f"{i+1}/{m.iter_count} loss:{l}")

    optimizer.zero_grad()
    l.backward()
    optimizer.step()

print(f"\nw:{m.w}")

There are only two differences between the data-parallel training code and the stand-alone single-card script:

  • Use DistributedDataParallel to wrap the module object (m = ddp(m))

  • Use get_rank to get the current device number and distribute the data to the device.

Then use the launcher to run the script and leave everything else to OneFlow, which makes distributed training as simple as stand-alone single-card training:

python3 -m oneflow.distributed.launch --nproc_per_node 2 ./ddp_train.py

Communication collectives

all_reduce

Reduces the tensor data across all machines in such a way that all get the final result.

all_gather

Gathers tensors from the whole group in a list.

all_gather_into_tensor

Gather tensors from all ranks and put them in a single output tensor.

all_to_all

Each process scatters a list of input tensors to all processes in the group and returns the gathered list of tensors in an output list.

broadcast

Broadcasts the tensor to the whole group.

barrier

Synchronizes all processes.

gather

Gathers a list of tensors in a single process.

reduce

Reduces the tensor data across all machines.

reduce_scatter

Reduces, then scatters a list of tensors to all processes in a group.

reduce_scatter_tensor

Reduces, then scatters a tensor to all ranks.

recv

Receives a tensor synchronously.

scatter

Scatters a list of tensors to all processes in a group.

send

Sends a tensor synchronously.

We also provide PyTorch-compatible APIs for communication collectives, for example, oneflow.distributed.all_reduce(tensor, op=ReduceOp.SUM, group=None, async_op=False). For more information, see PyTorch Distributed Communication. Note that we currently only support op=ReduceOp.SUM, group=None and async_op=False in these operations.
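
A hedged sketch of the PyTorch-compatible form described above, run under the launcher with two ranks (the file name is illustrative):

# python3 -m oneflow.distributed.launch --nproc_per_node 2 all_reduce_demo.py
import oneflow as flow
import oneflow.distributed as dist

t = flow.ones(2, device="cuda") * (flow.env.get_rank() + 1)  # rank 0: [1, 1]; rank 1: [2, 2]
dist.all_reduce(t)  # defaults: op=ReduceOp.SUM, group=None, async_op=False
print(t)            # both ranks now hold [3., 3.]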

Launching distributed training

Run the command below to see more about usage.

python3 -m oneflow.distributed.launch -h
usage: launch.py [-h] [--nnodes NNODES] [--node_rank NODE_RANK]
             [--nproc_per_node NPROC_PER_NODE] [--master_addr MASTER_ADDR]
             [--master_port MASTER_PORT] [-m] [--no_python]
             [--redirect_stdout_and_stderr] [--logdir LOGDIR]
             training_script ...

OneFlow distributed training launch helper utility that will spawn up multiple
distributed processes

positional arguments:
training_script       The full path to the single GPU training program/script to be
                        launched in parallel, followed by all the arguments for the
                        training script
training_script_args

optional arguments:
-h, --help            show this help message and exit
--nnodes NNODES       The number of nodes to use for distributed training
--node_rank NODE_RANK
                        The rank of the node for multi-node distributed training
--nproc_per_node NPROC_PER_NODE
                        The number of processes to launch on each node, for GPU
                        training, this is recommended to be set to the number of GPUs in
                        your system so that each process can be bound to a single GPU.
--master_addr MASTER_ADDR
                        Master node (rank 0)'s address, should be either the IP address
                        or the hostname of node 0, for single node multi-proc training,
                        the --master_addr can simply be 127.0.0.1
--master_port MASTER_PORT
                        Master node (rank 0)'s free port that needs to be used for
                        communication during distributed training
-m, --module          Changes each process to interpret the launch script as a python
                        module, executing with the same behavior as 'python -m'.
--no_python           Do not prepend the training script with "python" - just exec it
                        directly. Useful when the script is not a Python script.
--redirect_stdout_and_stderr
                        write the stdout and stderr to files 'stdout' and 'stderr'. Only
                        available when logdir is set
--logdir LOGDIR       Relative path to write subprocess logs to. Passing in a relative
                        path will create a directory if needed. Note that successive
                        runs with the same path to write logs to will overwrite existing
                        logs, so be sure to save logs as needed.

oneflow.distributions

Distribution

Distribution is the abstract base class for probability distributions.

Categorical

Creates a categorical distribution parameterized by either probs or logits (but not both).

oneflow.hub

Oneflow Hub is a pre-trained model repository designed to facilitate research reproducibility.

Publishing models

Oneflow Hub supports publishing pre-trained models (model definitions and pre-trained weights) to a GitHub repository by adding a simple hubconf.py file.

hubconf.py can have multiple entrypoints. Each entrypoint is defined as a python function (for example: a pre-trained model you want to publish).

def entrypoint_name(*args, **kwargs):
    # args & kwargs are optional, for models which take positional/keyword arguments.
    ...

How to implement an entrypoint?

Here is a code snippet that specifies an entrypoint for the resnet18 model, expanding the implementation in Oneflow-Inc/vision/hubconf.py. In most cases importing the right function in hubconf.py is sufficient; here we use the expanded version as an example to show how it works. You can see the full script in the Oneflow-Inc/vision repo.

dependencies = ['oneflow']
from flowvision.models.resnet import resnet18 as _resnet18

# resnet18 is the name of entrypoint
def resnet18(pretrained=False, **kwargs):
    """ # This docstring shows up in hub.help()
    Resnet18 model
    pretrained (bool): kwargs, load pretrained weights into the model
    """
    # Call the model, load pretrained weights
    model = _resnet18(pretrained=pretrained, **kwargs)
    return model

  • The dependencies variable is a list of package names required to load the model. Note this might be slightly different from the dependencies required to train a model.

  • args and kwargs are passed along to the real callable function.

  • Docstring of the function works as a help message. It explains what the model does and what the allowed positional/keyword arguments are. It’s highly recommended to add a few examples here.

  • Entrypoint functions can either return a model (nn.Module), or auxiliary tools to make the user workflow smoother, e.g. tokenizers.

  • Callables prefixed with an underscore are considered helper functions which won’t show up in oneflow.hub.list().

  • Pretrained weights can either be stored locally in the github repo, or loadable by oneflow.hub.load_state_dict_from_url(). If less than 2GB, it’s recommended to attach it to a project release and use the url from the release. In the example above flowvision.models.resnet.resnet18 handles pretrained; alternatively, you can put the following logic in the entrypoint definition.

if pretrained:
    # For checkpoint saved in local github repo, e.g. <RELATIVE_PATH_TO_CHECKPOINT>=weights/save.pth
    dirname = os.path.dirname(__file__)
    checkpoint = os.path.join(dirname, <RELATIVE_PATH_TO_CHECKPOINT>)
    state_dict = oneflow.load(checkpoint)
    model.load_state_dict(state_dict)

    # For checkpoint saved elsewhere
    checkpoint = 'https://oneflow-public.oss-cn-beijing.aliyuncs.com/model_zoo/flowvision/classification/ResNet/resnet18.zip'
    model.load_state_dict(oneflow.hub.load_state_dict_from_url(checkpoint, progress=False))

Important Notice

  • The published models should be at least in a branch/tag. It can’t be a random commit.

Loading models from Hub

OneFlow Hub provides convenient APIs to explore all available models in hub through oneflow.hub.list(), show docstring and examples through oneflow.hub.help() and load the pre-trained models using oneflow.hub.load().
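
Putting the three calls together, a minimal sketch (network access is required; pretrained is the entrypoint kwarg from the resnet18 example above):

import oneflow

entrypoints = oneflow.hub.list('Oneflow-Inc/vision')       # explore available entrypoints
print(oneflow.hub.help('Oneflow-Inc/vision', 'resnet18'))  # show the entrypoint docstring
model = oneflow.hub.load('Oneflow-Inc/vision', 'resnet18', pretrained=True)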


oneflow.hub.list(github, force_reload=False, skip_validation=False, trust_repo=None)

List all callable entrypoints available in the repo specified by github.

Parameters
  • github (str) – a string with format “repo_owner/repo_name[:ref]” with an optional ref (tag or branch). If ref is not specified, the default branch is assumed to be main if it exists, and otherwise master. Example: ‘Oneflow-Inc/vision:0.2.0’

  • force_reload (bool, optional) – whether to discard the existing cache and force a fresh download. Default is False.

  • skip_validation (bool, optional) – if False, oneflow.hub will check that the branch or commit specified by the github argument properly belongs to the repo owner. This will make requests to the GitHub API; you can specify a non-default GitHub token by setting the GITHUB_TOKEN environment variable. Default is False.

  • trust_repo (bool, str or None) –

    "check", True, False or None. This parameter was introduced in v1.12 and helps ensuring that users only run code from repos that they trust. - If False, a prompt will ask the user whether the repo should be trusted.

    • If True, the repo will be added to the trusted list and loaded without requiring explicit confirmation.

    • If "check", the repo will be checked against the list of trusted repos in the cache. If it is not present in that list, the behaviour will fall back onto the trust_repo=False option.

    • If None, this will raise a warning, inviting the user to set trust_repo to either False, True or "check". This is only present for backward compatibility and will be removed in v1.14.

    Default is None and will eventually change to "check" in v1.14.

Returns

The available callable entrypoints

Return type

list

For example:

>>> entrypoints = oneflow.hub.list('Oneflow-Inc/vision', force_reload=True)

oneflow.hub.help(github, model, force_reload=False, skip_validation=False, trust_repo=None)

Show the docstring of entrypoint model.

Parameters
  • github (str) – a string with format <repo_owner/repo_name[:ref]> with an optional ref (a tag or a branch). If ref is not specified, the default branch is assumed to be main if it exists, and otherwise master. Example: ‘Oneflow-Inc/vision:0.2.0’

  • model (str) – a string of entrypoint name defined in repo’s hubconf.py

  • force_reload (bool, optional) – whether to discard the existing cache and force a fresh download. Default is False.

  • skip_validation (bool, optional) – if False, oneflow.hub will check that the ref specified by the github argument properly belongs to the repo owner. This will make requests to the GitHub API; you can specify a non-default GitHub token by setting the GITHUB_TOKEN environment variable. Default is False.

  • trust_repo (bool, str or None) –

    "check", True, False or None. This parameter was introduced in v1.12 and helps ensuring that users only run code from repos that they trust.

    • If False, a prompt will ask the user whether the repo should be trusted.

    • If True, the repo will be added to the trusted list and loaded without requiring explicit confirmation.

    • If "check", the repo will be checked against the list of trusted repos in the cache. If it is not present in that list, the behaviour will fall back onto the trust_repo=False option.

    • If None: this will raise a warning, inviting the user to set trust_repo to either False, True or "check". This is only present for backward compatibility and will be removed in v1.14.

    Default is None and will eventually change to "check" in v1.14.

For example:

>>> print(oneflow.hub.help('Oneflow-Inc/vision', 'resnet18', force_reload=True))

oneflow.hub.load(repo_or_dir, model, *args, source='github', trust_repo=None, force_reload=False, verbose=True, skip_validation=False, **kwargs)

Load a model from a github repo or a local directory. Note: loading a model is the typical use case, but this can also be used for loading other objects such as tokenizers, loss functions, etc. If source is ‘github’, repo_or_dir is expected to be of the form repo_owner/repo_name[:ref] with an optional ref (a tag or a branch). If source is ‘local’, repo_or_dir is expected to be a path to a local directory.

Parameters
  • repo_or_dir (str) – If source is ‘github’, this should correspond to a github repo with format repo_owner/repo_name[:ref] with an optional ref (tag or branch), for example ‘Oneflow-Inc/vision:0.2.0’. If ref is not specified, the default branch is assumed to be main if it exists, and otherwise master. If source is ‘local’ then it should be a path to a local directory.

  • model (str) – the name of a callable (entrypoint) defined in the repo/dir’s hubconf.py.

  • *args (optional) – the corresponding args for callable model.

  • source (str, optional) – ‘github’ or ‘local’. Specifies how repo_or_dir is to be interpreted. Default is ‘github’.

  • trust_repo (bool, str or None) –

    "check", True, False or None. This parameter was introduced in v1.12 and helps ensuring that users only run code from repos that they trust.

    • If False, a prompt will ask the user whether the repo should be trusted.

    • If True, the repo will be added to the trusted list and loaded without requiring explicit confirmation.

    • If "check", the repo will be checked against the list of trusted repos in the cache. If it is not present in that list, the behaviour will fall back onto the trust_repo=False option.

    • If None: this will raise a warning, inviting the user to set trust_repo to either False, True or "check". This is only present for backward compatibility and will be removed in v1.14.

    Default is None and will eventually change to "check" in v1.14.

  • force_reload (bool, optional) – whether to force a fresh download of the github repo unconditionally. Does not have any effect if source = 'local'. Default is False.

  • verbose (bool, optional) – If False, mute messages about hitting local caches. Note that the message about first download cannot be muted. Does not have any effect if source = 'local'. Default is True.

  • skip_validation (bool, optional) – if False, oneflow.hub will check that the branch or commit specified by the github argument properly belongs to the repo owner. This will make requests to the GitHub API; you can specify a non-default GitHub token by setting the GITHUB_TOKEN environment variable. Default is False.

  • **kwargs (optional) – the corresponding kwargs for callable model.

Returns

The output of the model callable when called with the given *args and **kwargs.

For example:

>>> # from a github repo
>>> repo = 'Oneflow-Inc/vision'
>>> model = oneflow.hub.load(repo, 'resnet50', weights='ResNet50_Weights.IMAGENET1K_V1')
>>> # from a local directory
>>> path = '/some/local/path/oneflow/vision'
>>> model = oneflow.hub.load(path, 'resnet50', weights='ResNet50_Weights.DEFAULT')

oneflow.hub.download_url_to_file(url, dst, hash_prefix=None, progress=True)

Download object at the given URL to a local path.

Parameters
  • url (str) – URL of the object to download

  • dst (str) – Full path where object will be saved, e.g. /tmp/temporary_file

  • hash_prefix (str, optional) – If not None, the SHA256 downloaded file should start with hash_prefix. Default: None

  • progress (bool, optional) – whether or not to display a progress bar to stderr Default: True

For example:

>>> oneflow.hub.download_url_to_file('https://oneflow-public.oss-cn-beijing.aliyuncs.com/model_zoo/flowvision/classification/ResNet/resnet18.zip', '/tmp/temporary_file')

oneflow.hub.load_state_dict_from_url(url: str, model_dir: Optional[str] = None, map_location=None, progress: bool = True, check_hash: bool = False, file_name: Optional[str] = None) → Dict[str, Any]

Loads the OneFlow serialized object at the given URL. If downloaded file is a zip file, it will be automatically decompressed. If the object is already present in model_dir, it’s deserialized and returned. The default value of model_dir is <hub_dir>/checkpoints where hub_dir is the directory returned by get_dir().

Parameters
  • url (str) – URL of the object to download

  • model_dir (str, optional) – directory in which to save the object

  • map_location (optional) – a function or a dict specifying how to remap storage locations (see oneflow.load)

  • progress (bool, optional) – whether or not to display a progress bar to stderr. Default: True

  • check_hash (bool, optional) – If True, the filename part of the URL should follow the naming convention filename-<sha256>.ext where <sha256> is the first eight or more digits of the SHA256 hash of the contents of the file. The hash is used to ensure unique names and to verify the contents of the file. Default: False

  • file_name (str, optional) – name for the downloaded file. Filename from url will be used if not set.

For example:

>>> state_dict = oneflow.hub.load_state_dict_from_url('https://oneflow-public.oss-cn-beijing.aliyuncs.com/model_zoo/flowvision/classification/ResNet/resnet18.zip')

Running a loaded model:

Note that *args and **kwargs in oneflow.hub.load() are used to instantiate a model. After you have loaded a model, how can you find out what you can do with the model? A suggested workflow is

  • dir(model) to see all available methods of the model.

  • help(model.foo) to check what arguments model.foo takes to run.

To help users explore without referring to documentation back and forth, we strongly recommend repo owners make function help messages clear and succinct. It’s also helpful to include a minimal working example.

Where are my downloaded models saved?

The locations are used in the following order:

  • Calling hub.set_dir(<PATH_TO_HUB_DIR>)

  • $ONEFLOW_HOME/hub, if environment variable ONEFLOW_HOME is set.

  • $XDG_CACHE_HOME/oneflow/hub, if environment variable XDG_CACHE_HOME is set.

  • ~/.cache/oneflow/hub

oneflow.hub.get_dir()

Get the OneFlow Hub cache directory used for storing downloaded models & weights. If set_dir() is not called, default path is $ONEFLOW_HOME/hub where environment variable $ONEFLOW_HOME defaults to $XDG_CACHE_HOME/oneflow. $XDG_CACHE_HOME follows the X Design Group specification of the Linux filesystem layout, with a default value ~/.cache if the environment variable is not set.

oneflow.hub.set_dir(d)

Optionally set the OneFlow Hub directory used to save downloaded models & weights.

Parameters

d (str) – path to a local folder to save downloaded models & weights.
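
A small sketch of overriding and querying the cache directory (the path is hypothetical):

import oneflow

oneflow.hub.set_dir('/tmp/oneflow_hub')  # hypothetical cache location
print(oneflow.hub.get_dir())             # now prints /tmp/oneflow_hub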

Caching logic

By default, we don’t clean up files after loading them. Hub uses the cache by default if it already exists in the directory returned by get_dir().

Users can force a reload by calling hub.load(..., force_reload=True). This deletes the existing github folder and downloaded weights and reinitializes a fresh download. This is useful when updates are published to the same branch, so users can keep up with the latest release.

Known limitations:

Oneflow hub works by importing the package as if it was installed. There are some side effects introduced by importing in Python. For example, you can see new items in Python caches sys.modules and sys.path_importer_cache which is normal Python behavior. This also means that you may have import errors when importing different models from different repos, if the repos have the same sub-package names (typically, a model subpackage). A workaround for these kinds of import errors is to remove the offending sub-package from the sys.modules dict; more details can be found in this github issue.

A known limitation that is worth mentioning here: users CANNOT load two different branches of the same repo in the same python process. It’s just like installing two packages with the same name in Python, which is not good. Cache might join the party and give you surprises if you actually try that. Of course it’s totally fine to load them in separate processes.

oneflow.linalg

Common linear algebra operations.

Matrix Properties

norm

Returns the matrix norm or vector norm of a given tensor.

vector_norm

Computes a vector norm.

matrix_norm

Computes a matrix norm.

diagonal

Alias for oneflow.diagonal() with defaults dim1=-2, dim2=-1.

inv

Computes the inverse of a square matrix if it exists.

cross

Computes the cross product of two 3-dimensional vectors.

det

Computes the determinant of a square matrix.
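
A short sketch exercising the operations above on a small square matrix:

import oneflow as flow

A = flow.tensor([[1., 2.], [3., 4.]])
print(flow.linalg.det(A))   # determinant: 1*4 - 2*3 = -2
print(flow.linalg.inv(A))   # inverse of the square matrix
print(flow.linalg.norm(A))  # matrix norm of A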

oneflow.nn.init

oneflow.nn.init.calculate_gain(nonlinearity, param=None)

oneflow.nn.init.uniform_(tensor, a=0.0, b=1.0)

Fills the input Tensor with values drawn from the uniform distribution \(\mathcal{U}(a, b)\).

The interface is consistent with PyTorch. The documentation is referenced from: https://pytorch.org/docs/1.10/nn.init.html.

Parameters
  • tensor – an n-dimensional oneflow.Tensor

  • a – the lower bound of the uniform distribution

  • b – the upper bound of the uniform distribution

Examples

>>> w = flow.empty(3, 5)
>>> nn.init.uniform_(w)

oneflow.nn.init.normal_(tensor, mean=0.0, std=1.0)

Fills the input Tensor with values drawn from the normal distribution \(\mathcal{N}(\text{mean}, \text{std}^2)\).

The interface is consistent with PyTorch. The documentation is referenced from: https://pytorch.org/docs/1.10/nn.init.html.

Parameters
  • tensor – an n-dimensional oneflow.Tensor

  • mean – the mean of the normal distribution

  • std – the standard deviation of the normal distribution

Examples

>>> w = flow.empty(3, 5)
>>> nn.init.normal_(w)

oneflow.nn.init.constant_(tensor, val)

Fills the input Tensor with the value \(\text{val}\).

The interface is consistent with PyTorch. The documentation is referenced from: https://pytorch.org/docs/1.10/nn.init.html.

Parameters
  • tensor – an n-dimensional oneflow.Tensor

  • val – the value to fill the tensor with

Examples

>>> w = flow.empty(3, 5)
>>> nn.init.constant_(w, 0.3)

oneflow.nn.init.ones_(tensor)

Fills the input Tensor with the scalar value 1.

The interface is consistent with PyTorch. The documentation is referenced from: https://pytorch.org/docs/1.10/nn.init.html.

Parameters

tensor – an n-dimensional oneflow.Tensor

Examples

>>> w = flow.empty(3, 5)
>>> nn.init.ones_(w)

oneflow.nn.init.zeros_(tensor)

Fills the input Tensor with the scalar value 0.

The interface is consistent with PyTorch. The documentation is referenced from: https://pytorch.org/docs/1.10/nn.init.html.

Parameters

tensor – an n-dimensional oneflow.Tensor

Examples

>>> w = flow.empty(3, 5)
>>> nn.init.zeros_(w)

oneflow.nn.init.xavier_uniform_(tensor, gain=1.0, *, data_format='NCHW')

Fills the input Tensor with values according to the method described in Understanding the difficulty of training deep feedforward neural networks - Glorot, X. & Bengio, Y. (2010), using a uniform distribution. The resulting tensor will have values sampled from \(\mathcal{U}(-a, a)\) where

\[a = \text{gain} \times \sqrt{\frac{6}{\text{fan_in} + \text{fan_out}}}\]

The interface is consistent with PyTorch. The documentation is referenced from: https://pytorch.org/docs/1.10/nn.init.html.

Also known as Glorot initialization.

Parameters
  • tensor – an n-dimensional oneflow.Tensor

  • gain – an optional scaling factor

Examples

>>> w = flow.empty(3, 5)
>>> nn.init.xavier_uniform_(w, gain=nn.init.calculate_gain('relu'))

oneflow.nn.init.xavier_normal_(tensor, gain=1.0, *, data_format='NCHW')

Fills the input Tensor with values according to the method described in Understanding the difficulty of training deep feedforward neural networks - Glorot, X. & Bengio, Y. (2010), using a normal distribution. The resulting tensor will have values sampled from \(\mathcal{N}(0, \text{std}^2)\) where

\[\text{std} = \text{gain} \times \sqrt{\frac{2}{\text{fan_in} + \text{fan_out}}}\]

The interface is consistent with PyTorch. The documentation is referenced from: https://pytorch.org/docs/1.10/nn.init.html.

Also known as Glorot initialization.

Parameters
  • tensor – an n-dimensional oneflow.Tensor

  • gain – an optional scaling factor

Examples

>>> w = flow.empty(3, 5)
>>> nn.init.xavier_normal_(w)

oneflow.nn.init.kaiming_uniform_(tensor, a=0, mode='fan_in', nonlinearity='leaky_relu', *, data_format='NCHW')

Fills the input Tensor with values according to the method described in Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification - He, K. et al. (2015), using a uniform distribution. The resulting tensor will have values sampled from \(\mathcal{U}(-\text{bound}, \text{bound})\) where

\[\text{bound} = \text{gain} \times \sqrt{\frac{3}{\text{fan_mode}}}\]

The interface is consistent with PyTorch. The documentation is referenced from: https://pytorch.org/docs/1.10/nn.init.html.

Also known as He initialization.

Parameters
  • tensor – an n-dimensional oneflow.Tensor

  • a – the negative slope of the rectifier used after this layer (only used with 'leaky_relu')

  • mode – either 'fan_in' (default) or 'fan_out'. Choosing 'fan_in' preserves the magnitude of the variance of the weights in the forward pass. Choosing 'fan_out' preserves the magnitudes in the backwards pass.

  • nonlinearity – the non-linear function (nn.functional name), recommended to use only with 'relu' or 'leaky_relu' (default).

Examples

>>> w = flow.empty(3, 5)
>>> nn.init.kaiming_uniform_(w, mode='fan_in', nonlinearity='relu')
oneflow.nn.init.kaiming_normal_(tensor, a=0, mode='fan_in', nonlinearity='leaky_relu', *, data_format='NCHW')

Fills the input Tensor with values according to the method described in Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification - He, K. et al. (2015), using a normal distribution. The resulting tensor will have values sampled from \(\mathcal{N}(0, \text{std}^2)\) where

\[\text{std} = \frac{\text{gain}}{\sqrt{\text{fan_mode}}}\]

The interface is consistent with PyTorch. The documentation is referenced from: https://pytorch.org/docs/1.10/nn.init.html.

Also known as He initialization.

Parameters
  • tensor – an n-dimensional oneflow.Tensor

  • a – the negative slope of the rectifier used after this layer (only used with 'leaky_relu')

  • mode – either 'fan_in' (default) or 'fan_out'. Choosing 'fan_in' preserves the magnitude of the variance of the weights in the forward pass. Choosing 'fan_out' preserves the magnitudes in the backwards pass.

  • nonlinearity – the non-linear function (nn.functional name), recommended to use only with 'relu' or 'leaky_relu' (default).

Examples

>>> w = flow.empty(3, 5)
>>> nn.init.kaiming_normal_(w, mode='fan_out', nonlinearity='relu')
oneflow.nn.init.trunc_normal_(tensor, mean=0.0, std=1.0, a=-2.0, b=2.0)
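
Fills the input Tensor with values drawn from a truncated normal distribution. Mirroring the PyTorch function of the same name, the values are effectively sampled from \(\mathcal{N}(\text{mean}, \text{std}^2)\), with values outside \([a, b]\) redrawn until they lie within the bounds.

Parameters
  • tensor – an n-dimensional oneflow.Tensor

  • mean – the mean of the normal distribution

  • std – the standard deviation of the normal distribution

  • a – the minimum cutoff value

  • b – the maximum cutoff value

Examples

>>> w = flow.empty(3, 5)
>>> nn.init.trunc_normal_(w)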
oneflow.nn.init.orthogonal_(tensor, gain=1.0)

Fills the input Tensor with a (semi) orthogonal matrix, as described in Exact solutions to the nonlinear dynamics of learning in deep linear neural networks - Saxe, A. et al. (2013). The input tensor must have at least 2 dimensions, and for tensors with more than 2 dimensions the trailing dimensions are flattened.

The interface is consistent with PyTorch. The documentation is referenced from: https://pytorch.org/docs/1.10/nn.init.html.

Parameters
  • tensor – an n-dimensional oneflow.Tensor, where \(n \geq 2\)

  • gain – optional scaling factor

Examples

>>> w = flow.empty(3, 5)
>>> nn.init.orthogonal_(w)

oneflow.optim

oneflow.optim is a package implementing various optimization algorithms. Most commonly used methods are already supported, and the interface is general enough that more sophisticated ones can also be easily integrated in the future.

How to use an optimizer

To use oneflow.optim you have to construct an optimizer object that will hold the current state and update the parameters based on the computed gradients.

Constructing it

To construct an Optimizer you have to give it an iterable containing the parameters (all should be Variable s) to optimize. Then, you can specify optimizer-specific options such as the learning rate, weight decay, etc.

Note

If you need to move a model to GPU via .cuda(), please do so before constructing optimizers for it. Parameters of a model after .cuda() will be different objects from those before the call.

In general, you should make sure that optimized parameters live in consistent locations when optimizers are constructed and used.

Example:

import oneflow
import oneflow.nn as nn
import oneflow.optim as optim

model = nn.Linear(16, 3)
optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
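
Following the note above, a minimal sketch that moves the model to GPU before constructing its optimizer (assuming a CUDA device is available):

import oneflow.nn as nn
import oneflow.optim as optim

# Move the model first, so the optimizer holds the GPU parameter objects.
model = nn.Linear(16, 3).to("cuda")
optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9)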

Per-parameter options

Optimizers also support specifying per-parameter options. To do this, instead of passing an iterable of Variable, pass in an iterable of dict. Each of them will define a separate parameter group, and should contain a params key, containing a list of parameters belonging to it. Other keys should match the keyword arguments accepted by the optimizers, and will be used as optimization options for this group.

Note

You can still pass options as keyword arguments. They will be used as defaults, in the groups that didn’t override them. This is useful when you only want to vary a single option, while keeping all others consistent between parameter groups.

For example, this is very useful when one wants to specify per-layer learning rates:

import oneflow.nn as nn
import oneflow.optim as optim


class Model(nn.Module):
    def __init__(self):
        super(Model, self).__init__()
        self.base = nn.Linear(64, 32)
        self.classifier = nn.Linear(32, 10)

    def forward(self, x):
        out = self.base(x)
        out = self.classifier(out)
        return out


model = Model()
optim.SGD(
    [
        {"params": model.base.parameters()},
        {"params": model.classifier.parameters(), "lr": 1e-3},
    ],
    lr=1e-2,
    momentum=0.9,
)

This means that model.base’s parameters will use the default learning rate of 1e-2, model.classifier’s parameters will use a learning rate of 1e-3, and a momentum of 0.9 will be used for all parameters.

Taking an optimization step

All optimizers implement a step() method that updates the parameters. It can be used in two ways:

optimizer.step()

This is a simplified version supported by most optimizers. The function can be called once the gradients are computed using e.g. backward().

Example:

import oneflow
import oneflow.nn as nn
import oneflow.nn.functional as F
import oneflow.optim as optim
from oneflow.utils.data import Dataset, DataLoader


class CustomDataset(Dataset):
    def __init__(self, num):
        self.inputs = oneflow.randn(num, 1)
        self.targets = oneflow.sin(self.inputs)

    def __len__(self):
        return self.inputs.shape[0]

    def __getitem__(self, index):
        return self.inputs[index], self.targets[index]


class Model(nn.Module):
    def __init__(self, input_size):
        super(Model, self).__init__()
        self.linear1 = nn.Linear(input_size, 64)
        self.linear2 = nn.Linear(64, input_size)

    def forward(self, x):
        out = self.linear1(x)
        return self.linear2(F.relu(out))


dataset = CustomDataset(10000)
dataloader = DataLoader(dataset, batch_size=10)
model = Model(1)
loss_fn = nn.MSELoss()
optimizer = optim.SGD(model.parameters(), lr=1e-3)

for epoch in range(100):
    for input, target in dataloader:
        optimizer.zero_grad()
        output = model(input)
        loss = loss_fn(output, target)
        loss.backward()
        optimizer.step()

Base class

class oneflow.optim.Optimizer(parameters, options)

Optimizer.add_param_group

Add a param group to the Optimizer's param_groups.

Optimizer.load_state_dict

Load the state of the optimizer which is created by the state_dict function.

Optimizer.state_dict

Returns the state of the optimizer as a dict.

Optimizer.step

Performs a single optimization step (parameter update).

Optimizer.zero_grad

Sets the gradients of all optimized oneflow.Tensor s to zero.
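
A minimal sketch of how state_dict and load_state_dict pair up (the save path is illustrative; oneflow.save and oneflow.load are used for serialization):

import oneflow as flow
import oneflow.nn as nn
import oneflow.optim as optim

model = nn.Linear(4, 2)
optimizer = optim.SGD(model.parameters(), lr=0.1)

# Save the state returned by state_dict (the path is illustrative).
flow.save(optimizer.state_dict(), "optimizer_state")

# Later, restore the state created by the state_dict function.
optimizer.load_state_dict(flow.load("optimizer_state"))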

Algorithms

Adagrad

Implements Adagrad Optimizer.

Adam

Implements Adam algorithm.

AdamW

Implements AdamW algorithm.

LAMB

Implements LAMB algorithm.

RMSprop

Implements RMSprop algorithm.

SGD

Implements SGD algorithm.

LBFGS

Implements LBFGS algorithm.

Adjust Learning Rate

oneflow.optim.lr_scheduler provides several methods to adjust the learning rate based on the number of epochs. oneflow.optim.lr_scheduler.ReduceLROnPlateau allows dynamic learning rate reducing based on some validation measurements.

Learning rate scheduling should be applied after optimizer’s update; e.g., you should write your code this way:

Example:

import oneflow
import oneflow.nn as nn
import oneflow.nn.functional as F
import oneflow.optim as optim
from oneflow.utils.data import Dataset, DataLoader


class CustomDataset(Dataset):
    def __init__(self, num):
        self.inputs = oneflow.randn(num, 1)
        self.targets = oneflow.sin(self.inputs)

    def __len__(self):
        return self.inputs.shape[0]

    def __getitem__(self, index):
        return self.inputs[index], self.targets[index]


class Model(nn.Module):
    def __init__(self, input_size):
        super(Model, self).__init__()
        self.linear1 = nn.Linear(input_size, 64)
        self.linear2 = nn.Linear(64, input_size)

    def forward(self, x):
        out = self.linear1(x)
        return self.linear2(F.relu(out))


dataset = CustomDataset(10000)
dataloader = DataLoader(dataset, batch_size=10)
model = Model(1)
loss_fn = nn.MSELoss()
optimizer = optim.SGD(model.parameters(), lr=1e-3)
scheduler = optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.9)

for epoch in range(20):
    for input, target in dataloader:
        optimizer.zero_grad()
        output = model(input)
        loss = loss_fn(output, target)
        loss.backward()
        optimizer.step()
    scheduler.step()

Most learning rate schedulers can be chained (also referred to as chaining schedulers).

Example:

import oneflow
import oneflow.nn as nn
import oneflow.nn.functional as F
import oneflow.optim as optim
from oneflow.utils.data import Dataset, DataLoader


class CustomDataset(Dataset):
    def __init__(self, num):
        self.inputs = oneflow.randn(num, 1)
        self.targets = oneflow.sin(self.inputs)

    def __len__(self):
        return self.inputs.shape[0]

    def __getitem__(self, index):
        return self.inputs[index], self.targets[index]


class Model(nn.Module):
    def __init__(self, input_size):
        super(Model, self).__init__()
        self.linear1 = nn.Linear(input_size, 64)
        self.linear2 = nn.Linear(64, input_size)

    def forward(self, x):
        out = self.linear1(x)
        return self.linear2(F.relu(out))


dataset = CustomDataset(10000)
dataloader = DataLoader(dataset, batch_size=10)
model = Model(1)
loss_fn = nn.MSELoss()
optimizer = optim.SGD(model.parameters(), lr=1e-3)
scheduler1 = optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.9)
scheduler2 = optim.lr_scheduler.MultiStepLR(optimizer, milestones=[5, 10], gamma=0.1)

for epoch in range(20):
    for input, target in dataloader:
        optimizer.zero_grad()
        output = model(input)
        loss = loss_fn(output, target)
        loss.backward()
        optimizer.step()
    scheduler1.step()
    scheduler2.step()

In many places in the documentation, we will use the following template to refer to scheduler algorithms.

>>> scheduler = ...
>>> for epoch in range(100):
>>>     train(...)
>>>     validate(...)
>>>     scheduler.step()

Warning

If you use the learning rate scheduler (calling scheduler.step()) before the optimizer’s update (calling optimizer.step()), this will skip the first value of the learning rate schedule. Please check if you are calling scheduler.step() at the wrong time.

lr_scheduler.CosineAnnealingLR

Set the learning rate of each parameter group using a cosine annealing schedule, where \(\eta_{max}\) is set to the initial lr and \(T_{cur}\) is the number of epochs since the last restart in SGDR:

lr_scheduler.CosineDecayLR

This operator creates a Cosine decayed learning rate scheduler.

lr_scheduler.ExponentialLR

Decays the learning rate of each parameter group by gamma every epoch.

lr_scheduler.LambdaLR

Sets the learning rate of each parameter group to the initial lr times a given function.

lr_scheduler.MultiStepLR

Decays the learning rate of each parameter group by gamma once the number of steps reaches one of the milestones.

lr_scheduler.PolynomialLR

This operator creates a polynomial decayed learning rate scheduler.

lr_scheduler.ReduceLROnPlateau

Reduce learning rate when a metric has stopped improving (a usage sketch follows this list).

lr_scheduler.StepLR

Decays the learning rate of each parameter group by gamma every step_size steps.

lr_scheduler.ConstantLR

Decays the learning rate of each parameter group by a small constant factor until the number of steps reaches a pre-defined milestone: total_iters.

lr_scheduler.LinearLR

Decays the learning rate of each parameter group by a linearly changing small multiplicative factor until the number of steps reaches a pre-defined milestone: total_iters.

lr_scheduler.ChainedScheduler

Chains list of learning rate schedulers.

lr_scheduler.SequentialLR

Receives a list of schedulers that are expected to be called sequentially during the optimization process, together with milestone points that give the exact intervals indicating which scheduler is supposed to be called at a given step.

lr_scheduler.CosineAnnealingWarmRestarts

Set the learning rate of each parameter group using a cosine annealing schedule, where \(\eta_{max}\) is set to the initial lr, \(T_{cur}\) is the number of steps since the last restart and \(T_{i}\) is the number of steps between two warm restarts in SGDR:
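
Unlike the epoch-driven schedulers summarized above, ReduceLROnPlateau steps on a monitored metric. A minimal sketch following the template above (model, train and validate are placeholders, and the mode/patience values are illustrative):

>>> optimizer = optim.SGD(model.parameters(), lr=1e-3)
>>> scheduler = optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode="min", patience=3)
>>> for epoch in range(20):
>>>     train(...)
>>>     val_loss = validate(...)
>>>     scheduler.step(val_loss)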

oneflow.nn.Graph

Base class for running neural networks in Static Graph Mode.

Currently, there are two main ways to run models in deep learning frameworks, namely dynamic graphs and static graphs, which are conventionally referred to as Eager Mode and Static Graph Mode in OneFlow.

Both approaches have their advantages and disadvantages, and OneFlow provides support for both approaches, with Eager mode being the default.

Generally speaking, dynamic graphs are easier to use and static graphs have more performance advantages. oneflow.nn.Graph module is provided by OneFlow to allow users to build static graphs and train models with Eager-like programming conventions.

Eager Mode to Static Graph Mode

OneFlow runs in Eager mode by default.

OneFlow’s nn.Graph is programmed in a style very similar to Eager Mode, so it is possible to make small changes and get large performance gains.

The following script shows the process of building a neural network in eager mode using the interface under oneflow.nn :

import oneflow as flow
import oneflow.nn as nn

class ModuleMyLinear(nn.Module):
    def __init__(self, in_features, out_features):
        super().__init__()
        self.weight = nn.Parameter(flow.randn(in_features, out_features))
        self.bias = nn.Parameter(flow.randn(out_features))

    def forward(self, input):
        return flow.matmul(input, self.weight) + self.bias

linear_model = ModuleMyLinear(4, 3)

Eager nn.Module can be reused by nn.Graph. The above script for eager mode can be changed to static Graph mode by adding just a few lines of code, which consists of the following steps:

  • Define your customized graph as a subclass of nn.Graph

  • At the beginning of __init__, call super().__init__() to let OneFlow do the necessary initialization of the Graph

  • Reuse the nn.Module object in Eager mode in __init__ (self.model = model)

  • Describe the computation in the build method

  • Instantiate your graph then call it.

class GraphMyLinear(nn.Graph):
    def __init__(self):
        super().__init__()
        self.model = linear_model

    def build(self, input):
        return self.model(input)

graph_mylinear = GraphMyLinear()
input = flow.randn(1, 4)
out = graph_mylinear(input)
print(out)

tensor([[-0.3298, -3.7907,  0.1661]], dtype=oneflow.float32)

Static Graph Mode

Constructing a Graph

Base class for training or evaluating a neural network in static graph mode.

__init__

Initializes internal Graph states.

build

The build() method must be overridden to define neural network computation logic.

add_optimizer

Add an optimizer and a learning rate scheduler to the graph.

set_grad_scaler

Set the GradScaler for gradient and loss scaling.

Executing a Graph

Call a nn.Graph instance to run a customized graph.

__call__

Call nn.Graph subclass instance to run your customized graph.

Config options on a Graph

Optimization options of a nn.Graph.

enable_amp

If set to true, the graph will use mixed precision mode, which means both float16 and float32 are used during model training.

enable_zero

Enable ZeRO redundancy optimizer.

allow_fuse_model_update_ops

If set to true, try to fuse cast + scale + l1_l2_regularize_gradient + model_update to one op to improve performance.

allow_fuse_add_to_output

If set to true, try to fuse a binary element-wise add operator to one of the predecessors to improve performance.

allow_fuse_cast_scale

If set to true, try to fuse cast and scalar_mul_by_tensor to improve performance.

set_gradient_accumulation_steps

Set num of steps to accumulate gradient.

enable_cudnn_conv_heuristic_search_algo

Whether to enable the cuDNN conv operation to use the heuristic search algorithm.

enable_straighten_algorithm

Whether to enable the straighten algorithm.

enable_compress_memory

If true, then the graph will try its best to find the minimum memory allocation strategy.
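
Taken together, a minimal sketch of where these options are set, inside a Graph's __init__ (the particular option values are illustrative):

import oneflow as flow

class TrainGraph(flow.nn.Graph):
    def __init__(self, model):
        super().__init__()
        self.model = model
        # illustrative option values
        self.config.enable_amp(True)
        self.config.allow_fuse_add_to_output(True)
        self.config.set_gradient_accumulation_steps(4)

    def build(self, x):
        return self.model(x)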

Config options on a GraphModule

GraphModule is the graph representation of a nn.Module in a nn.Graph.

When an nn.Module is added into an nn.Graph, it is wrapped into a ProxyModule. The ProxyModule has a GraphModule inside it. You can get and set the GraphModule to enable graph optimization on the nn.Module.

set_stage

Set stage id and placement of nn.Module in pipeline parallelism.

activation_checkpointing

Set/Get whether to do activation checkpointing in this nn.Module.

Save & Load a Model

state_dict

Returns a dictionary containing a whole state of the graph.

load_state_dict

Copies module’s states and other graph states from state_dict into this graph.
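
A minimal sketch pairing the two, continuing the GraphMyLinear example above (the save path is illustrative):

import oneflow as flow

graph_mylinear = GraphMyLinear()
flow.save(graph_mylinear.state_dict(), "graph_state")  # illustrative path

new_graph = GraphMyLinear()
new_graph.load_state_dict(flow.load("graph_state"))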

Debug a Graph

__repr__

For printing the graph structure.

debug

Open or close debug mode of the graph.

name

Name auto-generated for this graph.

Auto Parallelism

As the scale of deep-learning models grows larger and larger, distributed training, or parallelism, is needed. Data parallelism and model parallelism have been designed to speed up training and solve memory issues.

In OneFlow, the SBP signature enables users to configure the parallelism policy easily. However, users still need to specify the SBP property for each operator, or at least most of them. Users might spend a couple of days digging into the details of parallelism and still get low throughput, just because of a slight mistake in the configuration of the SBP signature.

Note

It only works on oneflow.nn.Graph mode.

Our strength

To get rid of all those configurations for SBP signatures, we developed auto parallelism. Still, configurations of placement are necessary, and we have not supported auto placement yet. If you read this paragraph before you rush into any SBP stuff, then congratulations: you do not need to learn SBP. You can start writing your code as you did under CPU mode. Our auto parallelism will generate a fast strategy customized for your specific model, the size of its parameters, and the number of available GPUs.

How to use auto parallelism?

You simply need to enable the corresponding configuration setting in your oneflow.nn.Graph model.

Example:

import oneflow as flow
class SubclassGraph(flow.nn.Graph):
    def __init__(self):
        super().__init__() # MUST be called
        # auto parallelism configuration
        self.config.enable_auto_parallel(True)
        # other configurations about auto parallelism
        # ......

    def build(self):
        pass

Warning

If you enable auto parallelism, OneFlow will take care of the SBP configurations of operators except for explicit to_global functions.

Configuration API for auto parallelism

enable_auto_parallel

If true, then graph will use the auto parallel algorithm to select a parallelism strategy.

enable_auto_parallel_ignore_user_sbp_config

If true, it will ignore all user configurations of SBP.

set_auto_parallel_computation_cost_ratio

Set coefficient of computation cost in auto-parallel algorithm.

set_auto_parallel_wait_time

Set wait time for auto-parallel algorithm.

enable_auto_parallel_trunk_algo

Find the trunk of the SBP graph, then reduce the wait time for tributaries.

enable_auto_parallel_sbp_collector

Use “sbp collector” to create “sbp proxy” for nodes with multiple downstream operators.

enable_auto_memory

Whether to use a parallelism strategy with less memory consumption.

oneflow.nn.image

Image operations for neural networks

Resize

alias of oneflow.nn.modules.dataset.ImageResize

batch_align

alias of oneflow.nn.modules.dataset.ImageBatchAlign

decode

alias of oneflow.nn.modules.dataset.ImageDecode

flip

alias of oneflow.nn.modules.dataset.ImageFlip

normalize

alias of oneflow.nn.modules.dataset.ImageNormalize

oneflow.utils.data


At the heart of the OneFlow data loading utility is the oneflow.utils.data.DataLoader class. It represents a Python iterable over a dataset, with support for map-style and iterable-style datasets, customized data loading order, automatic batching, single- and multi-process data loading, and automatic memory pinning.

These options are configured by the constructor arguments of a DataLoader, which has signature:

DataLoader(dataset, batch_size=1, shuffle=False, sampler=None,
           batch_sampler=None, num_workers=0, collate_fn=None,
           pin_memory=False, drop_last=False, timeout=0,
           worker_init_fn=None, *, prefetch_factor=2,
           persistent_workers=False)

The sections below describe in detail the effects and usages of these options.

Dataset Types

The most important argument of the DataLoader constructor is dataset, which indicates a dataset object to load data from. OneFlow supports two different types of datasets:

Map-style datasets

A map-style dataset is one that implements the __getitem__() and __len__() protocols, and represents a map from (possibly non-integral) indices/keys to data samples.

For example, such a dataset, when accessed with dataset[idx], could read the idx-th image and its corresponding label from a folder on the disk.

See Dataset for more details.

Iterable-style datasets

An iterable-style dataset is an instance of a subclass of IterableDataset that implements the __iter__() protocol, and represents an iterable over data samples. This type of datasets is particularly suitable for cases where random reads are expensive or even improbable, and where the batch size depends on the fetched data.

For example, such a dataset, when called iter(dataset), could return a stream of data reading from a database, a remote server, or even logs generated in real time.

See IterableDataset for more details.

Note

When using an IterableDataset with multi-process data loading, the same dataset object is replicated on each worker process, and thus the replicas must be configured differently to avoid duplicated data. See the IterableDataset documentation for how to achieve this.

Data Loading Order and Sampler

For iterable-style datasets, data loading order is entirely controlled by the user-defined iterable. This allows easier implementations of chunk-reading and dynamic batch size (e.g., by yielding a batched sample at each time).

The rest of this section concerns the case with map-style datasets. oneflow.utils.data.Sampler classes are used to specify the sequence of indices/keys used in data loading. They represent iterable objects over the indices to datasets. E.g., in the common case with stochastic gradient descent (SGD), a Sampler could randomly permute a list of indices and yield each one at a time, or yield a small number of them for mini-batch SGD.

A sequential or shuffled sampler will be automatically constructed based on the shuffle argument to a DataLoader. Alternatively, users may use the sampler argument to specify a custom Sampler object that at each time yields the next index/key to fetch.

A custom Sampler that yields a list of batch indices at a time can be passed as the batch_sampler argument. Automatic batching can also be enabled via batch_size and drop_last arguments. See the next section for more details on this.

Note

Neither sampler nor batch_sampler is compatible with iterable-style datasets, since such datasets have no notion of a key or an index.

Loading Batched and Non-Batched Data

DataLoader supports automatically collating individual fetched data samples into batches via arguments batch_size, drop_last, batch_sampler, and collate_fn (which has a default function).

Automatic batching (default)

This is the most common case, and corresponds to fetching a minibatch of data and collating them into batched samples, i.e., containing Tensors with one dimension being the batch dimension (usually the first).

When batch_size (default 1) is not None, the data loader yields batched samples instead of individual samples. batch_size and drop_last arguments are used to specify how the data loader obtains batches of dataset keys. For map-style datasets, users can alternatively specify batch_sampler, which yields a list of keys at a time.

Note

The batch_size and drop_last arguments essentially are used to construct a batch_sampler from sampler. For map-style datasets, the sampler is either provided by user or constructed based on the shuffle argument. For iterable-style datasets, the sampler is a dummy infinite one. See this section on more details on samplers.

Note

When fetching from iterable-style datasets with multi-processing, the drop_last argument drops the last non-full batch of each worker’s dataset replica.

After fetching a list of samples using the indices from sampler, the function passed as the collate_fn argument is used to collate lists of samples into batches.

In this case, loading from a map-style dataset is roughly equivalent with:

for indices in batch_sampler:
    yield collate_fn([dataset[i] for i in indices])

and loading from an iterable-style dataset is roughly equivalent with:

dataset_iter = iter(dataset)
for indices in batch_sampler:
    yield collate_fn([next(dataset_iter) for _ in indices])

A custom collate_fn can be used to customize collation, e.g., padding sequential data to max length of a batch. See this section on more about collate_fn.

Disable automatic batching

In certain cases, users may want to handle batching manually in dataset code, or simply load individual samples. For example, it could be cheaper to directly load batched data (e.g., bulk reads from a database or reading continuous chunks of memory), or the batch size is data dependent, or the program is designed to work on individual samples. Under these scenarios, it’s likely better to not use automatic batching (where collate_fn is used to collate the samples), but let the data loader directly return each member of the dataset object.

When both batch_size and batch_sampler are None (default value for batch_sampler is already None), automatic batching is disabled. Each sample obtained from the dataset is processed with the function passed as the collate_fn argument.

When automatic batching is disabled, the default collate_fn simply converts NumPy arrays into OneFlow Tensors, and keeps everything else untouched.

In this case, loading from a map-style dataset is roughly equivalent with:

for index in sampler:
    yield collate_fn(dataset[index])

and loading from an iterable-style dataset is roughly equivalent with:

for data in iter(dataset):
    yield collate_fn(data)

See this section on more about collate_fn.

Working with collate_fn

The use of collate_fn is slightly different when automatic batching is enabled or disabled.

When automatic batching is disabled, collate_fn is called with each individual data sample, and the output is yielded from the data loader iterator. In this case, the default collate_fn simply converts NumPy arrays into OneFlow tensors.

When automatic batching is enabled, collate_fn is called with a list of data samples at each time. It is expected to collate the input samples into a batch for yielding from the data loader iterator. The rest of this section describes the behavior of the default collate_fn (default_collate()).

For instance, if each data sample consists of a 3-channel image and an integral class label, i.e., each element of the dataset returns a tuple (image, class_index), the default collate_fn collates a list of such tuples into a single tuple of a batched image tensor and a batched class label Tensor. In particular, the default collate_fn has the following properties:

  • It always prepends a new dimension as the batch dimension.

  • It automatically converts NumPy arrays and Python numerical values into OneFlow Tensors.

  • It preserves the data structure, e.g., if each sample is a dictionary, it outputs a dictionary with the same set of keys but batched Tensors as values (or lists if the values can not be converted into Tensors). Same for list s, tuple s, namedtuple s, etc.

Users may use customized collate_fn to achieve custom batching, e.g., collating along a dimension other than the first, padding sequences of various lengths, or adding support for custom data types.

If you run into a situation where the outputs of DataLoader have dimensions or types different from your expectations, you may want to check your collate_fn.
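
As a minimal sketch, a custom collate_fn that pads variable-length 1-D samples to the longest sequence in the batch (all names here are illustrative):

import oneflow as flow

def pad_collate(batch):
    # batch is a list of (sequence, label) pairs with varying sequence lengths
    max_len = max(seq.shape[0] for seq, _ in batch)
    padded = [flow.cat([seq, flow.zeros(max_len - seq.shape[0])]) for seq, _ in batch]
    xs = flow.stack(padded)  # shape: (batch_size, max_len)
    ys = flow.stack([label for _, label in batch])
    return xs, ys

Pass it to the loader as DataLoader(dataset, batch_size=..., collate_fn=pad_collate).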

Single- and Multi-process Data Loading

A DataLoader uses single-process data loading by default.

Within a Python process, the Global Interpreter Lock (GIL) prevents truly parallelizing Python code across threads. To avoid blocking computation code with data loading, OneFlow provides an easy switch to perform multi-process data loading by simply setting the argument num_workers to a positive integer.

Single-process data loading (default)

In this mode, data fetching is done in the same process in which the DataLoader is initialized. Therefore, data loading may block computing. However, this mode may be preferred when the resource(s) used for sharing data among processes (e.g., shared memory, file descriptors) are limited, or when the entire dataset is small and can be loaded entirely in memory. Additionally, single-process loading often shows more readable error traces and thus is useful for debugging.

Multi-process data loading

Setting the argument num_workers as a positive integer will turn on multi-process data loading with the specified number of loader worker processes.

Warning

After several iterations, the loader worker processes will consume the same amount of CPU memory as the parent process for all Python objects in the parent process which are accessed from the worker processes. This can be problematic if the Dataset contains a lot of data (e.g., you are loading a very large list of filenames at Dataset construction time) and/or you are using a lot of workers (overall memory usage is number of workers * size of parent process). The simplest workaround is to replace Python objects with non-refcounted representations such as Pandas, NumPy, or PyArrow objects.

In this mode, each time an iterator of a DataLoader is created (e.g., when you call enumerate(dataloader)), num_workers worker processes are created. At this point, the dataset, collate_fn, and worker_init_fn are passed to each worker, where they are used to initialize, and fetch data. This means that dataset access together with its internal IO, transforms (including collate_fn) runs in the worker process.

For map-style datasets, the main process generates the indices using sampler and sends them to the workers. So any shuffle randomization is done in the main process which guides loading by assigning indices to load.

For iterable-style datasets, since each worker process gets a replica of the dataset object, naive multi-process loading will often result in duplicated data. Using worker_init_fn, users may configure each replica independently. (See IterableDataset documentations for how to achieve this. ) For similar reasons, in multi-process loading, the drop_last argument drops the last non-full batch of each worker’s iterable-style dataset replica.

Workers are shut down once the end of the iteration is reached, or when the iterator becomes garbage collected.

Warning

It is generally not recommended to return CUDA tensors in multi-process loading because of many subtleties in using CUDA and sharing CUDA tensors in multiprocessing. Instead, we recommend using automatic memory pinning (i.e., setting pin_memory=True), which enables fast data transfer to CUDA-enabled GPUs.

Platform-specific behaviors

Since workers rely on Python multiprocessing, worker launch behavior is different on Windows compared to Unix.

  • On Unix, fork() is the default multiprocessing start method. Using fork(), child workers typically can access the dataset and Python argument functions directly through the cloned address space.

  • On Windows or MacOS, spawn() is the default multiprocessing start method. Using spawn(), another interpreter is launched which runs your main script, followed by the internal worker function that receives the dataset, collate_fn and other arguments through pickle serialization.

This separate serialization means that you should take two steps to ensure you are compatible with Windows while using multi-process data loading:

  • Wrap most of your main script’s code within an if __name__ == '__main__': block, to make sure it doesn’t run again (most likely generating errors) when each worker process is launched. You can place your dataset and DataLoader instance creation logic here, as it doesn’t need to be re-executed in workers. A sketch follows this list.

  • Make sure that any custom collate_fn, worker_init_fn or dataset code is declared as a top-level definition, outside of the __main__ check. This ensures that they are available in worker processes. (This is needed since functions are pickled as references only, not as bytecode.)
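
A minimal sketch of a spawn-compatible script layout (the dataset and collate_fn are illustrative):

import oneflow
from oneflow.utils.data import DataLoader, TensorDataset

# Top-level definition, so workers can pickle it by reference under spawn.
def my_collate(batch):
    return batch

if __name__ == "__main__":
    ds = TensorDataset(oneflow.arange(8, dtype=oneflow.float32).view(8, 1))
    loader = DataLoader(ds, batch_size=2, num_workers=2, collate_fn=my_collate)
    for batch in loader:
        print(batch)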

Randomness in multi-process data loading

By default, each worker will have its OneFlow seed set to base_seed + worker_id, where base_seed is a long generated by the main process using its RNG (thereby consuming an RNG state) or a specified generator. However, seeds for other libraries may be duplicated upon initializing workers, causing each worker to return identical random numbers.

In worker_init_fn, you may access the OneFlow seed set for each worker with oneflow.initial_seed(), and use it to seed other libraries before data loading.
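
A minimal sketch of such a worker_init_fn, seeding NumPy and Python's random module (the per-library seed derivation is illustrative):

import random

import numpy as np
import oneflow

def seed_worker(worker_id):
    # oneflow.initial_seed() returns this worker's seed; reuse it to seed
    # other libraries so workers do not produce identical random numbers.
    worker_seed = oneflow.initial_seed() % 2 ** 32
    np.random.seed(worker_seed)
    random.seed(worker_seed)

Pass it to the loader as DataLoader(dataset, num_workers=2, worker_init_fn=seed_worker).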

Memory Pinning

Host to GPU copies are much faster when they originate from pinned (page-locked) memory. See cuda-memory-pinning for more details on when and how to use pinned memory generally.

For data loading, passing pin_memory=True to a DataLoader will automatically put the fetched data Tensors in pinned memory, and thus enables faster data transfer to CUDA-enabled GPUs.

The default memory pinning logic only recognizes Tensors and maps and iterables containing Tensors. By default, if the pinning logic sees a batch that is a custom type (which will occur if you have a collate_fn that returns a custom batch type), or if each element of your batch is a custom type, the pinning logic will not recognize them, and it will return that batch (or those elements) without pinning the memory. To enable memory pinning for custom batch or data type(s), define a pin_memory() method on your custom type(s).

See the example below.

Example:

import oneflow
from oneflow.utils.data import DataLoader, TensorDataset

class SimpleCustomBatch:
    def __init__(self, data):
        transposed_data = list(zip(*data))
        self.inp = oneflow.stack(transposed_data[0], 0)
        self.tgt = oneflow.stack(transposed_data[1], 0)

    # custom memory pinning method on custom type
    def pin_memory(self):
        self.inp = self.inp.pin_memory()
        self.tgt = self.tgt.pin_memory()
        return self

def collate_wrapper(batch):
    return SimpleCustomBatch(batch)

inps = oneflow.arange(10 * 5, dtype=oneflow.float32).view(10, 5)
tgts = oneflow.arange(10 * 5, dtype=oneflow.float32).view(10, 5)
dataset = TensorDataset(inps, tgts)

loader = DataLoader(dataset, batch_size=2, collate_fn=collate_wrapper,
                    pin_memory=True)

for batch_ndx, sample in enumerate(loader):
    print(sample.inp.is_pinned())
    print(sample.tgt.is_pinned())
class oneflow.utils.data.DataLoader(dataset: oneflow.utils.data.dataset.Dataset[T_co], batch_size: Optional[int] = 1, shuffle: bool = False, sampler: Optional[oneflow.utils.data.sampler.Sampler[int]] = None, batch_sampler: Optional[oneflow.utils.data.sampler.Sampler[Sequence[int]]] = None, num_workers: int = 0, collate_fn: Optional[Callable[[List[T]], Any]] = None, pin_memory: bool = False, drop_last: bool = False, timeout: float = 0, worker_init_fn: Optional[Callable[[int], None]] = None, multiprocessing_context=None, generator=<oneflow._oneflow_internal.Generator object>, *, prefetch_factor: int = 2, persistent_workers: bool = False)

Data loader. Combines a dataset and a sampler, and provides an iterable over the given dataset.

The DataLoader supports both map-style and iterable-style datasets with single- or multi-process loading, customizing loading order and optional automatic batching (collation) and memory pinning.

See oneflow.utils.data documentation page for more details.

In consideration of compatibility, the design of our dataloader is consistent with PyTorch's, ref: https://github.com/pytorch/pytorch/tree/v1.7.0

Parameters
  • dataset (Dataset) – dataset from which to load the data.

  • batch_size (int, optional) – how many samples per batch to load (default: 1).

  • shuffle (bool, optional) – set to True to have the data reshuffled at every epoch (default: False).

  • sampler (Sampler or Iterable, optional) – defines the strategy to draw samples from the dataset. Can be any Iterable with __len__ implemented. If specified, shuffle must not be specified.

  • batch_sampler (Sampler or Iterable, optional) – like sampler, but returns a batch of indices at a time. Mutually exclusive with batch_size, shuffle, sampler, and drop_last.

  • num_workers (int, optional) – how many subprocesses to use for data loading (default: 0). 0 means that the data will be loaded in the main process.

  • collate_fn (callable, optional) – merges a list of samples to form a mini-batch of Tensor(s). Used when using batched loading from a map-style dataset.

  • pin_memory (bool, optional) – If True, the data loader will copy Tensors into CUDA pinned memory before returning them. If your data elements are a custom type, or your collate_fn returns a batch that is a custom type, see the example below. (default: False)

  • drop_last (bool, optional) – set to True to drop the last incomplete batch, if the dataset size is not divisible by the batch size. If False and the size of dataset is not divisible by the batch size, then the last batch will be smaller. (default: False)

  • timeout (numeric, optional) – if positive, the timeout value for collecting a batch from workers. Should always be non-negative. (default: 0)

  • worker_init_fn (callable, optional) – If not None, this will be called on each worker subprocess with the worker id (an int in [0, num_workers - 1]) as input, after seeding and before data loading. (default: None)

  • prefetch_factor (int, optional, keyword-only arg) – Number of samples loaded in advance by each worker. 2 means there will be a total of 2 * num_workers samples prefetched across all workers. (default: 2)

  • persistent_workers (bool, optional) – If True, the data loader will immediately initialize worker processes and will not shut them down after a dataset has been consumed once. This keeps the worker Dataset instances alive. If you are using oneflow with RDMA support in distributed training, persistent_workers must be True, otherwise you will encounter a segmentation fault. (default: False)

Warning

If the spawn start method is used, worker_init_fn cannot be an unpicklable object, e.g., a lambda function.

Warning

len(dataloader) heuristic is based on the length of the sampler used. When dataset is an IterableDataset, it instead returns an estimate based on len(dataset) / batch_size, with proper rounding depending on drop_last, regardless of multi-process loading configurations. This represents the best guess OneFlow can make because OneFlow trusts user dataset code in correctly handling multi-process loading to avoid duplicate data.

However, if sharding results in multiple workers having incomplete last batches, this estimate can still be inaccurate, because (1) an otherwise complete batch can be broken into multiple ones and (2) more than one batch worth of samples can be dropped when drop_last is set. Unfortunately, OneFlow can not detect such cases in general.

class oneflow.utils.data.Dataset(*args, **kwds)

An abstract class representing a Dataset.

All datasets that represent a map from keys to data samples should subclass it. All subclasses should overwrite __getitem__(), supporting fetching a data sample for a given key. Subclasses could also optionally overwrite __len__(), which is expected to return the size of the dataset by many Sampler implementations and the default options of DataLoader.

Note

DataLoader by default constructs an index sampler that yields integral indices. To make it work with a map-style dataset with non-integral indices/keys, a custom sampler must be provided.
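
A minimal map-style dataset sketch (the class and its contents are illustrative):

import oneflow as flow
from oneflow.utils.data import Dataset

class SquaresDataset(Dataset):
    # A map-style dataset implements __getitem__() and __len__().
    def __init__(self, n):
        self.n = n

    def __len__(self):
        return self.n

    def __getitem__(self, idx):
        x = flow.tensor([float(idx)])
        return x, x ** 2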

class oneflow.utils.data.IterableDataset(*args, **kwds)

An iterable Dataset.

All datasets that represent an iterable of data samples should subclass it. Such form of datasets is particularly useful when data come from a stream.

All subclasses should overwrite __iter__(), which would return an iterator of samples in this dataset.

When a subclass is used with DataLoader, each item in the dataset will be yielded from the DataLoader iterator. When num_workers > 0, each worker process will have a different copy of the dataset object, so it is often desired to configure each copy independently to avoid having duplicate data returned from the workers.

Example 1: splitting workload across all workers in __iter__():

>>> class MyIterableDataset(flow.utils.data.IterableDataset):
...     def __init__(self, start, end):
...         super().__init__()
...         assert end > start, "this example code only works with end > start"
...         self.start = start
...         self.end = end
...
...     def __iter__(self):
...         iter_start = self.start
...         iter_end = self.end
...         return iter(range(iter_start, iter_end))
...
>>> # should give same set of data as range(3, 7), i.e., [3, 4, 5, 6].
>>> ds = MyIterableDataset(start=3, end=7)

>>> # Single-process loading
>>> print(list(flow.utils.data.DataLoader(ds, num_workers=0)))
[3, 4, 5, 6]

Example 2: splitting workload across all workers using worker_init_fn:

>>> class MyIterableDataset(flow.utils.data.IterableDataset):
...     def __init__(self, start, end):
...         super().__init__()
...         assert end > start, "this example code only works with end > start"
...         self.start = start
...         self.end = end
...
...     def __iter__(self):
...         return iter(range(self.start, self.end))
...
>>> # should give same set of data as range(3, 7), i.e., [3, 4, 5, 6].
>>> ds = MyIterableDataset(start=3, end=7)

>>> # Single-process loading
>>> print(list(flow.utils.data.DataLoader(ds, num_workers=0)))
[3, 4, 5, 6]
class oneflow.utils.data.TensorDataset(*tensors: oneflow.Tensor)

Dataset wrapping tensors.

Each sample will be retrieved by indexing tensors along the first dimension.

Parameters

*tensors (Tensor) – tensors that have the same size in the first dimension.

class oneflow.utils.data.ConcatDataset(datasets: Iterable[oneflow.utils.data.dataset.Dataset])

Dataset as a concatenation of multiple datasets.

This class is useful to assemble different existing datasets.

Parameters

datasets (sequence) – List of datasets to be concatenated

class oneflow.utils.data.Subset(dataset: oneflow.utils.data.dataset.Dataset[T_co], indices: Sequence[int])

Subset of a dataset at specified indices.

Parameters
  • dataset (Dataset) – The whole Dataset

  • indices (sequence) – Indices in the whole set selected for subset
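
A minimal sketch combining these dataset wrappers (the index values are illustrative):

>>> full = flow.utils.data.TensorDataset(flow.arange(10))
>>> subset = flow.utils.data.Subset(full, [0, 2, 4])
>>> combined = flow.utils.data.ConcatDataset([full, full])
>>> len(subset), len(combined)
(3, 20)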

oneflow.utils.data.random_split(dataset: oneflow.utils.data.dataset.Dataset[T], lengths: Sequence[int], generator: Optional[object] = <built-in method default_generator of PyCapsule object>) → List[oneflow.utils.data.dataset.Subset[T]]

Randomly split a dataset into non-overlapping new datasets of given lengths. Optionally fix the generator for reproducible results, e.g.:

>>> random_split(range(10), [3, 7], generator=flow.Generator().manual_seed(42))
Parameters
  • dataset (Dataset) – Dataset to be split

  • lengths (sequence) – lengths of splits to be produced

  • generator (Generator) – Generator used for the random permutation.

class oneflow.utils.data.Sampler(data_source: Optional[Sized])

Base class for all Samplers.

Every Sampler subclass has to provide an __iter__() method, providing a way to iterate over indices of dataset elements, and a __len__() method that returns the length of the returned iterators.

Note

The __len__() method isn’t strictly required by DataLoader, but is expected in any calculation involving the length of a DataLoader.

class oneflow.utils.data.SequentialSampler(data_source)

Samples elements sequentially, always in the same order.

Parameters

data_source (Dataset) – dataset to sample from

class oneflow.utils.data.RandomSampler(data_source: Sized, replacement: bool = False, num_samples: Optional[int] = None, generator=None)

Samples elements randomly. If without replacement, then sample from a shuffled dataset. If with replacement, then user can specify num_samples to draw.

Parameters
  • data_source (Dataset) – dataset to sample from

  • replacement (bool) – samples are drawn on-demand with replacement if True, default=False

  • num_samples (int) – number of samples to draw, default=len(dataset). This argument is supposed to be specified only when replacement is True.

  • generator (Generator) – Generator used in sampling.

class oneflow.utils.data.SubsetRandomSampler(indices: Sequence[int], generator=None)

Samples elements randomly from a given list of indices, without replacement.

Parameters
  • indices (sequence) – a sequence of indices

  • generator (Generator) – Generator used in sampling.

class oneflow.utils.data.BatchSampler(sampler: oneflow.utils.data.sampler.Sampler[int], batch_size: int, drop_last: bool)

Wraps another sampler to yield a mini-batch of indices.

Parameters
  • sampler (Sampler or Iterable) – Base sampler. Can be any iterable object

  • batch_size (int) – Size of mini-batch.

  • drop_last (bool) – If True, the sampler will drop the last batch if its size would be less than batch_size

Example

>>> list(BatchSampler(SequentialSampler(range(10)), batch_size=3, drop_last=False))
[[0, 1, 2], [3, 4, 5], [6, 7, 8], [9]]
>>> list(BatchSampler(SequentialSampler(range(10)), batch_size=3, drop_last=True))
[[0, 1, 2], [3, 4, 5], [6, 7, 8]]
class oneflow.utils.data.distributed.DistributedSampler(dataset: oneflow.utils.data.dataset.Dataset, num_replicas: Optional[int] = None, rank: Optional[int] = None, shuffle: bool = True, seed: int = 0, drop_last: bool = False)

Sampler that restricts data loading to a subset of the dataset.

It is especially useful in conjunction with flow.nn.parallel.DistributedDataParallel. In such a case, each process can pass a DistributedSampler instance as a DataLoader sampler, and load a subset of the original dataset that is exclusive to it.

Note

Dataset is assumed to be of constant size.

Parameters
  • dataset – Dataset used for sampling.

  • num_replicas (int, optional) – Number of processes participating in distributed training. By default, world_size is retrieved from the current distributed group.

  • rank (int, optional) – Rank of the current process within num_replicas. By default, rank is retrieved from the current distributed group.

  • shuffle (bool, optional) – If True (default), sampler will shuffle the indices.

  • seed (int, optional) – random seed used to shuffle the sampler if shuffle=True. This number should be identical across all processes in the distributed group. Default: 0.

  • drop_last (bool, optional) – if True, then the sampler will drop the tail of the data to make it evenly divisible across the number of replicas. If False, the sampler will add extra indices to make the data evenly divisible across the replicas. Default: False.

Warning

In distributed mode, calling the set_epoch() method at the beginning of each epoch before creating the DataLoader iterator is necessary to make shuffling work properly across multiple epochs. Otherwise, the same ordering will be always used.

For example:

>>> sampler = DistributedSampler(dataset) if is_distributed else None
>>> loader = DataLoader(dataset, shuffle=(sampler is None), sampler=sampler)
>>> for epoch in range(start_epoch, n_epochs):
...     if is_distributed:
...         sampler.set_epoch(epoch)
...     train(loader)

oneflow.utils.global_view

Some global view Ops

to_global

Converts the input tensor or input tensor(s) in list/tuple/dict to global tensor(s).

to_local

Returns the local part of the input.

global_mode

Create a scope to provide global information for the computation process within it.

current_global_mode

Get the current global mode information.
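
A minimal single-rank sketch of to_global and to_local (the placement and sbp values are illustrative, and the keyword names are assumed to mirror oneflow.Tensor.to_global):

import oneflow as flow

x = flow.randn(2, 3)
placement = flow.placement("cpu", ranks=[0])
x_global = flow.utils.global_view.to_global(x, placement=placement, sbp=flow.sbp.broadcast)
x_local = flow.utils.global_view.to_local(x_global)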

oneflow.utils.tensor

oneflow.one_embedding

Embedding is an important component of recommender systems, and it has also spread to many fields beyond them. Each framework provides basic operators for Embedding, for example, flow.nn.Embedding in OneFlow:

import numpy as np
import oneflow as flow
indices = flow.tensor([[1, 2, 4, 5], [4, 3, 2, 9]], dtype=flow.int)
embedding = flow.nn.Embedding(10, 3)
y = embedding(indices)

OneEmbedding is the large-scale Embedding solution that OneFlow provides to solve the problem of large-scale deep recommender systems. OneEmbedding has the following advantages compared to ordinary operators:

  • With flexible hierarchical storage, OneEmbedding can place the Embedding table on GPU memory, CPU memory or SSD, and allows high-speed devices to be used as caches for low-speed devices to achieve both speed and capacity.

  • OneEmbedding supports dynamic expansion.

Note

Please refer to Large-Scale Embedding Solution: OneEmbedding for a brief introduction to all features related to OneEmbedding.

Configure Embedding Table

OneEmbedding supports the simultaneous creation of multiple Embedding tables. The following code configures three Embedding tables.

import oneflow as flow
import oneflow.nn as nn
import numpy as np

tables = [
    flow.one_embedding.make_table_options(
        flow.one_embedding.make_uniform_initializer(low=-0.1, high=0.1)
    ),
    flow.one_embedding.make_table_options(
        flow.one_embedding.make_uniform_initializer(low=-0.05, high=0.05)
    ),
    flow.one_embedding.make_table_options(
        flow.one_embedding.make_uniform_initializer(low=-0.15, high=0.15)
    ),
]

When configuring an Embedding table, you need to specify its initialization method. The above Embedding tables are initialized with the uniform method. The result of configuring the Embedding tables is stored in the tables variable.

oneflow.one_embedding.make_table_options(param)

make table param of Embedding tables

Parameters

param (dict or list) – param can be an initializer or a list of column_options. An initializer can be made by make_uniform_initializer, make_normal_initializer, or make_constant_initializer; column options can be made by make_column_options

Returns

table param of Embedding tables

Return type

dict

For example:

>>> import oneflow as flow
>>> scale = 0.1
>>> initializer = flow.one_embedding.make_uniform_initializer(low=-scale, high=scale)
>>> table1 = flow.one_embedding.make_table_options(initializer)
>>> table2 = flow.one_embedding.make_table_options(initializer)
>>> tables = [table1, table2]
>>> # pass the tables to the "tables" param of flow.one_embedding.MultiTableEmbedding or flow.one_embedding.MultiTableMultiColumnEmbedding
>>> # ...
oneflow.one_embedding.make_table(param)

alias of oneflow.one_embedding.make_table_options

See also oneflow.one_embedding.make_table_options()

initialization method

make_uniform_initializer

make uniform initializer param of make_table_options

make_normal_initializer

make normal initializer param of make_table_options
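
For example (the parameter values are illustrative):

>>> import oneflow as flow
>>> uniform = flow.one_embedding.make_uniform_initializer(low=-0.1, high=0.1)
>>> normal = flow.one_embedding.make_normal_initializer(mean=0.0, std=0.05)
>>> table = flow.one_embedding.make_table_options(uniform)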

Configure the Storage Attribute of the Embedding Table

Then run the following code to configure the storage attribute of the Embedding table:

store_options = flow.one_embedding.make_cached_ssd_store_options(
    cache_budget_mb=8142,
    persistent_path="/your_path_to_ssd",
    capacity=40000000,
    size_factor=1,
    physical_block_size=4096,
)

Storage Method

make_device_mem_store_options

make the GPU-only store_options param of MultiTableEmbedding

make_cached_ssd_store_options

make the store_options param of MultiTableEmbedding that uses SSD as storage, with GPU and host memory as cache.

make_cached_host_mem_store_options

make the store_options param of MultiTableEmbedding that uses host memory as storage, with GPU memory as cache

Note

Please refer to Large-Scale Embedding Solution: OneEmbedding for a brief introduction to learn about How to Choose the Proper Storage Configuration

Instantiate Embedding

After the above configuration is completed, you can use MultiTableEmbedding to get the instantiated Embedding layer.

embedding_size = 128
embedding = flow.one_embedding.MultiTableEmbedding(
    name="my_embedding",
    embedding_dim=embedding_size,
    dtype=flow.float,
    key_type=flow.int64,
    tables=tables,
    store_options=store_options,
)

embedding.to("cuda")

Note

Please refer to Large-Scale Embedding Solution: OneEmbedding for a brief introduction to learn about Feature ID and Multi-Table Query.

MultiTableEmbedding

oneflow.one_embedding.MultiTableEmbedding(name, embedding_dim, dtype, key_type, tables, store_options, default_initializer=None, padding_idx=None, seed=0)

MultiTableEmbedding represents multiple Embedding tables with the same embedding_dim, dtype, and key_type.

Parameters
  • name (str) – The name of Embedding

  • embedding_dim (int) – the size of each embedding vector

  • dtype (flow.dtype) – the data type of embeddings

  • key_type (flow.dtype) – the data type of feature ids

  • tables (list) – list of table param which can be made by flow.one_embedding.make_table_options

  • store_options (dict) – store option of Embedding

  • default_initializer (dict, optional) – if tables param is None, use default_initializer to initialize table. Defaults to None.

  • padding_idx (int, optional) – If specified, the entries at padding_idx do not contribute to the gradient; therefore, the embedding vector at padding_idx is not updated during training, the embedding vector at padding_idx will default to all zeros.

For example:

>>> import oneflow as flow
>>> import numpy as np
>>> import oneflow.nn as nn
>>> # a simple example with 3 tables
>>> table_size_array = [39884407, 39043, 17289]
>>> vocab_size = sum(table_size_array)
>>> num_tables = len(table_size_array)
>>> embedding_size = 128
>>> scales = np.sqrt(1 / np.array(table_size_array))
>>> tables = [
>>>     flow.one_embedding.make_table_options(
>>>         flow.one_embedding.make_uniform_initializer(low=-scale, high=scale)
>>>     )
>>>     for scale in scales
>>> ]
>>> store_options = flow.one_embedding.make_cached_ssd_store_options(
>>>     cache_budget_mb=8192, persistent_path="/your_path_to_ssd", capacity=vocab_size,
>>> )
>>> embedding = flow.one_embedding.MultiTableEmbedding(
>>>     name="my_embedding",
>>>     embedding_dim=embedding_size,
>>>     dtype=flow.float,
>>>     key_type=flow.int64,
>>>     tables=tables,
>>>     store_options=store_options,
>>> )
>>> embedding.to("cuda")
>>> mlp = flow.nn.FusedMLP(
>>>     in_features=embedding_size * num_tables,
>>>     hidden_features=[512, 256, 128],
>>>     out_features=1,
>>>     skip_final_activation=True,
>>> )
>>> mlp.to("cuda")
>>>
>>> class TrainGraph(flow.nn.Graph):
>>>     def __init__(self,):
>>>         super().__init__()
>>>         self.embedding_lookup = embedding
>>>         self.mlp = mlp
>>>         self.add_optimizer(
>>>             flow.optim.SGD(self.embedding_lookup.parameters(), lr=0.1, momentum=0.0)
>>>         )
>>>         self.add_optimizer(
>>>             flow.optim.SGD(self.mlp.parameters(), lr=0.1, momentum=0.0)
>>>         )
>>>     def build(self, ids):
>>>         embedding = self.embedding_lookup(ids)
>>>         loss = self.mlp(flow.reshape(embedding, (-1, num_tables * embedding_size)))
>>>         loss = loss.sum()
>>>         loss.backward()
>>>         return loss
>>> ids = np.random.randint(0, 1000, (100, num_tables), dtype=np.int64)
>>> ids_tensor = flow.tensor(ids, requires_grad=False).to("cuda")
>>> graph = TrainGraph()
>>> loss = graph(ids_tensor)
>>> print(loss)

forward

Embedding lookup operation.

save_snapshot

Save a snapshot.

load_snapshot

Load a snapshot.

MultiTableMultiColumnEmbedding

oneflow.one_embedding.MultiTableMultiColumnEmbedding(name, embedding_dim, dtype, key_type, tables, store_options, default_initializer=None, padding_idx=None, seed=0)

MultiTableMultiColumnEmbedding represents multiple embedding tables whose columns can have different embedding_dim values, while sharing the same dtype and key_type.

Parameters
  • name (str) – The name of Embedding

  • embedding_dim (list) – list of the size of each embedding vector

  • dtype (flow.dtype) – the data type of embeddings

  • key_type (flow.dtype) – the data type of feature ids

  • tables (list) – list of table options, each of which can be made by flow.one_embedding.make_table_options

  • store_options (dict) – the store options of the Embedding

  • default_initializer (dict, optional) – if the tables param is None, use default_initializer to initialize the table. Defaults to None.

  • padding_idx (int, optional) – If specified, the entries at padding_idx do not contribute to the gradient; therefore, the embedding vector at padding_idx is not updated during training and defaults to all zeros.

For example:

>>> import oneflow as flow
>>> import numpy as np
>>> import oneflow.nn as nn
>>> # a simple example with 3 tables; every table has two columns, the first column's embedding_size is 10 and the second's is 1
>>> # every table's first column is initialized with uniform(-1/sqrt(table_size), 1/sqrt(table_size)); the second column is initialized with normal(0, 1/sqrt(table_size))
>>> table_size_array = [39884407, 39043, 17289]
>>> vocab_size = sum(table_size_array)
>>> num_tables = len(table_size_array)
>>> embedding_size_list = [10, 1]
>>> scales = np.sqrt(1 / np.array(table_size_array))
>>> tables = [
>>>     flow.one_embedding.make_table_options(
>>>       [flow.one_embedding.make_column_options(
>>>         flow.one_embedding.make_uniform_initializer(low=-scale, high=scale)),
>>>        flow.one_embedding.make_column_options(
>>>         flow.one_embedding.make_normal_initializer(mean=0, std=scale))]
>>>     )
>>>     for scale in scales
>>> ]
>>> store_options = flow.one_embedding.make_cached_ssd_store_options(
>>>     cache_budget_mb=8192, persistent_path="/your_path_to_ssd", capacity=vocab_size,
>>> )
>>> embedding = flow.one_embedding.MultiTableMultiColumnEmbedding(
>>>     name="my_embedding",
>>>     embedding_dim=embedding_size_list,
>>>     dtype=flow.float,
>>>     key_type=flow.int64,
>>>     tables=tables,
>>>     store_options=store_options,
>>> )
>>> embedding.to("cuda")
>>> mlp = flow.nn.FusedMLP(
>>>     in_features=sum(embedding_size_list) * num_tables,
>>>     hidden_features=[512, 256, 128],
>>>     out_features=1,
>>>     skip_final_activation=True,
>>> )
>>> mlp.to("cuda")
>>>
>>> class TrainGraph(flow.nn.Graph):
>>>     def __init__(self,):
>>>         super().__init__()
>>>         self.embedding_lookup = embedding
>>>         self.mlp = mlp
>>>         self.add_optimizer(
>>>             flow.optim.SGD(self.embedding_lookup.parameters(), lr=0.1, momentum=0.0)
>>>         )
>>>         self.add_optimizer(
>>>             flow.optim.SGD(self.mlp.parameters(), lr=0.1, momentum=0.0)
>>>         )
>>>     def build(self, ids):
>>>         embedding = self.embedding_lookup(ids)
>>>         loss = self.mlp(flow.reshape(embedding, (-1, num_tables * sum(embedding_size_list))))
>>>         loss = loss.sum()
>>>         loss.backward()
>>>         return loss
>>> ids = np.random.randint(0, 1000, (100, num_tables), dtype=np.int64)
>>> ids_tensor = flow.tensor(ids, requires_grad=False).to("cuda")
>>> graph = TrainGraph()
>>> loss = graph(ids_tensor)
>>> print(loss)

forward

Embedding lookup operation.

save_snapshot

Save a snapshot.

load_snapshot

Load a snapshot.

Construct Graph for Training

OneEmbedding is only supported in Graph mode.

num_tables = 3
mlp = flow.nn.FusedMLP(
    in_features=embedding_size * num_tables,
    hidden_features=[512, 256, 128],
    out_features=1,
    skip_final_activation=True,
)
mlp.to("cuda")

class TrainGraph(flow.nn.Graph):
    def __init__(self,):
        super().__init__()
        self.embedding_lookup = embedding
        self.mlp = mlp
        self.add_optimizer(
            flow.optim.SGD(self.embedding_lookup.parameters(), lr=0.1, momentum=0.0)
        )
        self.add_optimizer(
            flow.optim.SGD(self.mlp.parameters(), lr=0.1, momentum=0.0)
        )
    def build(self, ids):
        embedding = self.embedding_lookup(ids)
        loss = self.mlp(flow.reshape(embedding, (-1, num_tables * embedding_size)))
        loss = loss.sum()
        loss.backward()
        return loss
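Once the graph class is defined, training runs by instantiating it once and calling it like a function; the first call triggers compilation. A minimal sketch reusing the embedding, mlp, and TrainGraph defined above (the random ids are placeholders):

import numpy as np

ids = np.random.randint(0, 1000, (100, num_tables), dtype=np.int64)
ids_tensor = flow.tensor(ids, requires_grad=False).to("cuda")
graph = TrainGraph()
loss = graph(ids_tensor)  # one training step: forward, backward, parameter update
print(loss)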

Note

Please refer to Distributed Training: OneEmbedding for a brief introduction to Graph for Training.

Persistent Read & Write

make_persistent_table_reader

Creates a reader for reading a persistent table.

make_persistent_table_writer

Creates a writer for writing a persistent table.
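A hedged sketch of reading a table back: we assume here that make_persistent_table_reader takes paths, snapshot_name, key_type, value_type, and storage_dim, returns a context manager, and yields (keys, values) batches; the exact signature may differ across versions, so check help(flow.one_embedding.make_persistent_table_reader) in your installation.

import oneflow as flow

# Assumed signature and behavior; the path and snapshot name are placeholders.
with flow.one_embedding.make_persistent_table_reader(
    paths=["/your_path_to_ssd/0-1"],
    snapshot_name="my_snapshot",
    key_type=flow.int64,
    value_type=flow.float,
    storage_dim=128,
) as reader:
    for keys, values in reader:  # iterate the table in batches
        print(keys.shape, values.shape)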


class oneflow.one_embedding.Ftrl(params: Union[Iterator[oneflow.nn.Parameter], List[Dict]], lr: float = 0.001, weight_decay: float = 0.0, lr_power: float = -0.5, initial_accumulator_value: float = 0.1, lambda1: float = 0.0, lambda2: float = 0.0, beta: float = 0.0)

FTRL Optimizer.

The formula is:

\[\begin{split}\begin{align}
& accumulator_{i+1} = accumulator_{i} + grad * grad \\
& sigma = (accumulator_{i+1}^{lr\_power} - accumulator_{i}^{lr\_power}) / learning\_rate \\
& z_{i+1} = z_{i} + grad - sigma * param_{i} \\
& param_{i+1} = \begin{cases}
    0 & \text{ if } |z_{i+1}| < \lambda_1 \\
    -(\frac{\beta + accumulator_{i+1}^{lr\_power}}{learning\_rate} + \lambda_2) * (z_{i+1} - sign(z_{i+1}) * \lambda_1) & \text{ otherwise }
  \end{cases}
\end{align}\end{split}\]

Example 1:

# Assume net is a custom model.
ftrl = flow.one_embedding.Ftrl(net.parameters(), lr=1e-3)

for epoch in range(epochs):
    # Read data, Compute the loss and so on.
    # ...
    loss.backward()
    ftrl.step()
    ftrl.zero_grad()
Parameters
  • params (iterable) – iterable of parameters to optimize or dicts defining parameter groups

  • lr (float, optional) – learning rate. Defaults to 1e-3.

  • weight_decay (float, optional) – weight decay (L2 penalty). Defaults to 0.0.

  • lr_power (float, optional) – learning rate decrease factor. Defaults to -0.5.

  • initial_accumulator_value (float, optional) – The initial value of the accumulator. Defaults to 0.1.

  • lambda1 (float, optional) – L1 regularization strength. Defaults to 0.0.

  • lambda2 (float, optional) – L2 regularization strength. Defaults to 0.0.

  • beta (float, optional) – The value of beta. Defaults to 0.0.

step(closure: Optional[Callable] = None)

Performs a single optimization step.

Parameters

closure (callable, optional) – A closure that reevaluates the model and returns the loss.

property support_sparse

Whether the optimizer supports sparse update.
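To make the update rule concrete, here is a minimal NumPy sketch of a single FTRL step on one tensor, transcribed from the formula above. The function name and signature are illustrative, not part of the OneFlow API; note that the canonical FTRL-Proximal update divides by the bracketed shrink term rather than multiplying, so verify against your OneFlow version.

import numpy as np

def ftrl_step(param, z, accum, grad,
              lr=1e-3, lr_power=-0.5, lambda1=0.0, lambda2=0.0, beta=0.0):
    # accumulator_{i+1} = accumulator_i + grad * grad
    accum_new = accum + grad * grad
    # sigma = (accum_{i+1}^{lr_power} - accum_i^{lr_power}) / lr
    sigma = (accum_new ** lr_power - accum ** lr_power) / lr
    # z_{i+1} = z_i + grad - sigma * param_i
    z_new = z + grad - sigma * param
    shrink = (beta + accum_new ** lr_power) / lr + lambda2
    # param_{i+1} is 0 where |z| < lambda1, the shrunken update elsewhere
    param_new = np.where(np.abs(z_new) < lambda1, 0.0,
                         -shrink * (z_new - np.sign(z_new) * lambda1))
    return param_new, z_new, accum_new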

Environment Variables

OneFlow has an extensive set of environment variables to tune for specific usage.

ONEFLOW_COMM_NET_IB_HCA

When there are multiple IB NICs (which can be listed with ibstatus on the server), the system uses the first IB NIC for comm_net communication by default.

When this environment variable is set, the system checks all IB NICs and uses the one with the matching name. #5626

Values accepted

The default value is empty. Values take the form name:port, such as mlx5_0:1 or mlx5_1:1; when the port is given as 0, it is treated as 1, i.e. the first port.

ONEFLOW_COMM_NET_IB_GID_INDEX

The GID index used for the ibv_query_gid query (the query returns 0 on success). It is often used together with ONEFLOW_COMM_NET_IB_HCA. GID means Global ID; a QP under a RoCE network must be built with this value, instead of just using the LID as in an InfiniBand network. #5626

Values accepted

The default value is 0, representing the GID index of the port

ONEFLOW_COMM_NET_IB_QUEUE_DEPTH

The depth of the work queue used for jobs in the IB network.

This lets you control the queue depth explicitly instead of using IB's default, similar to ONEFLOW_COMM_NET_IB_MEM_BLOCK_SIZE.

Values accepted

The default value is 1024, parsed as an int64_t. The system compares it with max_qp_wr (the maximum number of outstanding WRs on any work queue) and takes the smaller one.

ONEFLOW_COMM_NET_IB_MEM_BLOCK_SIZE

The size of each memory block read when communicating.

The number of blocks is derived from this value, and data is encapsulated and transmitted block by block.

Values accepted

The default value is 8388608 (8 MB)

ONEFLOW_STREAM_CUDA_EVENT_FLAG_BLOCKING_SYNC

Marks the CUDA events used by streams for blocking synchronization (cudaEventBlockingSync). For details, see #5612, #5837.

Values accepted

The default value is false; it is treated as true only when the value is 1, true, yes, on, or y.
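Most boolean variables in this section follow the same parsing rule. Many of them are read when oneflow is initialized, so it is safest to set them before importing oneflow; a minimal sketch from Python (the variable chosen here is just an example):

import os

# Treated as true only for the values 1, true, yes, on, y.
os.environ["ONEFLOW_STREAM_CUDA_EVENT_FLAG_BLOCKING_SYNC"] = "1"

import oneflow as flow  # picks up the variable at import/initialization time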

ONEFLOW_LIBIBVERBS_PATH

The path used to load libibverbs as a DynamicLibrary via dlopen at runtime, so that ibverbs symbols are found without linking at compile time, for better compatibility. #4852

If loading fails, it outputs libibverbs not available, ibv_fork_init skipped; if it succeeds, import oneflow outputs something like loaded library: /usr/lib/x86_64-linux-gnu/libibverbs.so.1

Values accepted

The default value is empty, in which case libibverbs.so.1 and then libibverbs.so are tried.

ONEFLOW_DEBUG_MODE

Enables debug mode; setting ONEFLOW_DEBUG has the same effect.

If debug mode is on, more INFO-level logs are emitted, and different prototxt and dot files are dumped. Under eager global mode, the automatically inserted boxing information is printed to the log file.

Values accepted

The default value is empty; any string is accepted.

ONEFLOW_DRY_RUN

For dry runs only: it generates log files such as dot files.

It exits once the dry run succeeds, without attempting real training.

Values accepted

The default value is empty; any string is accepted.

ONEFLOW_DEBUG_KERNEL_SYNC_CHECK_NUMERICS

Only use this when debugging, because performance is affected; it detects which op in the network produces nan or inf.

It creates a CpuCheckNumericsKernelObserver on cpu and a CudaCheckNumericsKernelObserver on cuda. #6052

Values accepted

The default value is false; it is treated as true only when the value is 1, true, yes, on, or y.

ONEFLOW_DEBUG_KERNEL_SYNC_CHECK

Only use this when debugging, because performance is affected.

It creates a SyncCheckKernelObserver that syncs after each kernel.

It can be used to debug cuda errors. #6052

Values accepted

The default value is false; it is treated as true only when the value is 1, true, yes, on, or y.

ONEFLOW_PROFILER_KERNEL_PROFILE_CUDA_MEMORY_BANDWIDTH

Used when generating profiles with nsys.

The profiler is currently only available in lazy mode.

It estimates the memory bandwidth reached by a kernel from the GPU kernel execution time and the size of its input and output memory, helping to find kernels with optimization potential.

Values accepted

The default value is false. To use it, the package must be compiled with BUILD_PROFILER enabled.
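The estimate itself is simple arithmetic over the profiled quantities; illustratively (all numbers below are made up):

in_bytes, out_bytes = 64 * 2**20, 64 * 2**20  # e.g. 64 MiB read and 64 MiB written
elapsed_s = 1.6e-3                            # e.g. measured kernel time
bandwidth_gb_s = (in_bytes + out_bytes) / elapsed_s / 1e9
print(f"{bandwidth_gb_s:.1f} GB/s")           # ~83.9 GB/s for these numbers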

ONEFLOW_PROFILER_KERNEL_PROFILE_KERNEL_FORWARD_RANGE

The same as above; additionally collects op names.

Values accepted

The default value is false. To use it, the package must be compiled with BUILD_PROFILER enabled.

ONEFLOW_KERNEL_DISABLE_BLOB_ACCESS_CHECKER

Disables the blob_access_checker. The blob_access_checker exists to assure correctness; disabling it can reduce per-kernel overhead in some cases. #5728

Values accepted

The default value is false; it is treated as true only when the value is 1, true, yes, on, or y.

ONEFLOW_KERNEL_ENABLE_CUDA_GRAPH

Enables CUDA Graphs for kernels. Takes effect only when OneFlow is built with WITH_CUDA_GRAPHS, and the default value is false.

Turning on CUDA Graphs uses more memory, so a job whose memory only just fits may fail to run with it enabled. See CUDA Graphs support, #5868.

Values accepted

The default value is false; it is treated as true only when the value is 1, true, yes, on, or y.

ONEFLOW_ACTOR_ENABLE_LIGHT_ACTOR

Enables LightActor, a new type of Actor that only handles NormalForward and similar tasks where every regst_num is 1, or tasks with only one kernel. #5868

It is typically enabled together with:

export ONEFLOW_KERNEL_ENABLE_CUDA_GRAPH=1 (uses more memory)
export ONEFLOW_THREAD_ENABLE_LOCAL_MESSAGE_QUEUE=1
export ONEFLOW_KERNEL_DISABLE_BLOB_ACCESS_CHECKER=1
export ONEFLOW_ACTOR_ENABLE_LIGHT_ACTOR=1
export ONEFLOW_STREAM_REUSE_CUDA_EVENT=1

Values accepted

The default value is false; it is treated as true only when the value is 1, true, yes, on, or y.

ONEFLOW_THREAD_ENABLE_LOCAL_MESSAGE_QUEUE

Enables the local message queue; it replaces oneflow.config.thread_enable_local_message_queue(True), which is no longer used. #5720

Values accepted

The default value is false; it is treated as true only when the value is 1, true, yes, on, or y.

ONEFLOW_PERSISTENT_IN_STREAM_BUFFER_SIZE_BYTES

Represents the size of each read from disk. #5162

Values accepted

The default value is empty. If an invalid string or a negative number is entered, 32 * 1024 (32 KB) is used.

ONEFLOW_DECODER_ENABLE_NVJPEG_HARDWARE_ACCELERATION

Enables nvjpeg hardware acceleration and warms up the jpeg decoder and the hw_jpeg decoder; requires NVJPEG_VER_MAJOR to be 11 or higher. #5851

See Hardware JPEG decoder and NVIDIA nvJPEG library on NVIDIA A100 GPUs.

Values accepted

The default value is true; an explicit value is treated as true only when it is 1, true, yes, on, or y.

ONEFLOW_SERVING_DEBUG

Prints debug information for OneFlow Serving.

Values accepted

The default value is false

ONEFLOW_DISABLE_VIEW

Disables the view mechanism; ops related to view will stop running.

Values accepted

The default value is false

ONEFLOW_BOXING_DISABLE_MIDDLE_NODE_AND_CHECK

Whether to disable the Middle Node. When it is false, all inter-SBP communication is supported.

Values accepted

The default value is false

ONEFLOW_ONE_EMBEDDING_DISABLE_NUMA_AWARE_ALLOCATION

Whether to disable NUMA-aware memory allocation when the OneEmbedding module allocates GPU memory.

NUMA-aware memory allocation means that when pinned host memory is allocated, the CPU close to the GPU is taken into account (for example, for GPU 0 and GPU 1, memory is allocated on CPU 0).

Values accepted

The default value is false

ONEFLOW_EP_CUDA_ENABLE_TF32_EXECUTION

Whether to allow CUDA to use TF32 numeric types for computation

Values accepted

The default value is true

ONEFLOW_FUNCTOR_DISABLE_FUSED_MLP

Whether to disable the fused_mlp operator implemented with cublasLt in FusedMLPFunctor. If disabled, it degenerates into multiple separate matrix multiplications, as illustrated after this entry.

Values accepted

The default value is false
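To illustrate the non-fused fallback, a minimal NumPy sketch: one matrix multiplication plus bias and activation per layer instead of a single fused cublasLt op. The shapes and the skip-final-activation behavior mirror the FusedMLP example earlier; the helper name is ours, not OneFlow's.

import numpy as np

def mlp_unfused(x, weights, biases):
    # Fallback path: one GEMM + bias add (+ ReLU except on the last layer) per layer.
    for i, (w, b) in enumerate(zip(weights, biases)):
        x = x @ w + b
        if i < len(weights) - 1:
            x = np.maximum(x, 0.0)  # ReLU
    return x

x = np.random.rand(8, 384).astype(np.float32)
dims = [384, 512, 256, 128, 1]
ws = [np.random.rand(m, n).astype(np.float32) * 0.01 for m, n in zip(dims, dims[1:])]
bs = [np.zeros(n, dtype=np.float32) for n in dims[1:]]
print(mlp_unfused(x, ws, bs).shape)  # (8, 1)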

ONEFLOW_ONE_EMBEDDING_EMBEDDING_SHUFFLE_INDEPENTENT_STREAM

Whether to put the EmbeddingShuffle of the OneEmbedding module on a separate stream for overlapping execution.

Values accepted

The default value is false

ONEFLOW_ONE_EMBEDDING_GRADIENT_SHUFFLE_USE_FP16

Whether to allow the EmbeddingGradientShuffle operator of the OneEmbedding module to use the FP16 data type in the AMP case.

Values accepted

The default value is true

ONEFLOW_ONE_EMBEDDING_NOT_FUSE_CAST_TO_UPDATE

Whether to disable fusing the cast (type conversion) and the parameter update of OneEmbedding parameters into one operator under AMP.

Values accepted

The default value is false

ONEFLOW_DEBUG_KERNEL_SYNC_CHECK_NUMERICS_DUMP

When a NaN or Inf value appears, save a dump of the data.

Values accepted

The default value is false

ONEFLOW_MLIR_ENABLE_IR_PRINTING

Controls whether to print the IR when running each pass, for debugging.

Values accepted

The default value is false

ONEFLOW_MLIR_STDOUT

Controls whether MLIR outputs log information to the console.

Values accepted

The default value is false

ONEFLOW_MLIR_DUMP_IR

Controls whether to dump IR files.

Values accepted

The default value is false

ONEFLOW_MLIR_ENABLE_ROUND_TRIP

Controls whether the OneFlow Job is round-tripped through MLIR.

Values accepted

The default value is false

ONEFLOW_KERNEL_REDUCE_SUM_USE_MATMUL

Whether to use matrix multiplication to implement reduce_sum.

Values accepted

The default value is false

ONEFLOW_ONE_EMBEDDING_ENABLE_QUANTIZED_COMM

Whether to quantize the shuffle communication in the multi-device OneEmbedding case.

Values accepted

The default value is false

ONEFLOW_TENSOR_BUFFER_ALIGNED_SIZE

The alignment size used when allocating TensorBuffer memory.

Values accepted

The default value is 1024
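The alignment is presumably the usual round-up to a multiple of the configured size; sketched below (the helper name is ours, not OneFlow's, and the allocator's exact behavior is OneFlow-internal):

def align_up(size_bytes, alignment=1024):
    # Round the requested size up to the next multiple of the alignment.
    return (size_bytes + alignment - 1) // alignment * alignment

print(align_up(1), align_up(1024), align_up(1025))  # 1024 1024 2048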

ONEFLOW_TENSOR_BUFFER_POOL_THREAD_LOCAL_CACHE_SIZE

Controls the size of the thread_local_cache in TensorBufferPool.

Values accepted

The default value is 64

ONEFLOW_GRPC_MAX_MESSAGE_BYTE_SIZE

Sets the maximum size of a gRPC transport message.

Values accepted

The default value is -1

ONEFLOW_ONE_EMBEDDING_PERSISTENT_TABLE_CAPACITY_HINT

Controls the initial capacity of the PersistentTable of OneEmbedding, to avoid frequent expansion.

Values accepted

By default, OneEmbedding calculates a capacity from the actual workload; users can also configure a larger capacity.

ONEFLOW_ONE_EMBEDDING_PERSISTENT_TABLE_NUM_WORKERS

The number of threads used for reading and writing the PersistentTable of OneEmbedding

Values accepted

The default value is 4

ONEFLOW_EP_CUDA_CONST_BUFFER_ELEMENT_COUNT

Specifies the element count of the all-zero and all-one buffers on the CUDA device.

These buffers can be combined with matrix multiplication to implement operations such as reduce_sum, as illustrated after this entry.

Values accepted

The default value is 1024x1024
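Why an all-one buffer enables reduce_sum: multiplying by a vector of ones sums along an axis, so the reduction can reuse the GEMM kernel. A NumPy illustration:

import numpy as np

x = np.random.rand(4, 1024).astype(np.float32)
ones = np.ones((1024, 1), dtype=np.float32)  # stand-in for the all-one buffer
row_sum = (x @ ones).squeeze(-1)             # reduce_sum over the last axis as a GEMM
assert np.allclose(row_sum, x.sum(axis=1), atol=1e-3)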

OMP_NUM_THREADS

Sets the number of threads used by OpenMP.

Values accepted

If unset, the default is computed by OneFlow's internal logic.

SBP_INFER_RULE_TAG

Specifies the SBP derivation rule.

Values accepted

When the value is 1 (the default), select the SBP that satisfies the producer, or the SBP with the smallest cost, as much as possible.

When the value is 2, select the SBP that matches best.

When the value is 3, select the SBP with the smallest cost.

ONEFLOW_TENSOR_BUFFER_GROWTH_FACTOR

Controls the growth factor of TensorBuffer.

Values accepted

The default value is 1.0

ONEFLOW_TENSOR_BUFFER_SHRINK_FACTOR

Controls the shrink factor of TensorBuffer

Values accepted

The default value is 0.7

ONEFLOW_TENSOR_BUFFER_POOL_SIZE_FACTOR

Controls the size factor of TensorBuffer

Values accepted

The default value is 2.0

AUTO_PARALLEL_TRANSFER_COST

Controls the magnitude of the auto-parallel transfer cost.

Values accepted

The default value is 1.65e8

ONEFLOW_DEBUG_PASS

Set it to a pass name to print the job before and after that pass, e.g. export ONEFLOW_DEBUG_PASS="FuseAddToOutputPass".

Or set it to ALL to print the job before and after every pass, e.g. export ONEFLOW_DEBUG_PASS="ALL".

Values accepted

The default value is empty

ONEFLOW_PROFILER_HOST_THREAD_NAME_PREFIX

Adds a prefix to the names of named host threads in the profiling context, to make sorting easier in the visualization tool (Nsight).

Values accepted

The default value is empty

oneflow.special

The oneflow.special module, modeled after SciPy’s special module.

digamma

Alias for oneflow.digamma().

erf

Alias for oneflow.erf().

erfc

Alias for oneflow.erfc().

erfinv

Alias for oneflow.erfinv().

exp2

Alias for oneflow.exp2().

expm1

Alias for oneflow.expm1().

log1p

Alias for oneflow.log1p().

log_softmax

Alias for oneflow.nn.functional.log_softmax().

logsumexp

Alias for oneflow.logsumexp().

round

Alias for oneflow.round().

softmax

Alias for oneflow.softmax().

zeta

Computes the Hurwitz zeta function, elementwise.
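As with scipy.special, these aliases are called directly on tensors. A minimal sketch (zeta is assumed here to take two tensor arguments, input and other, like its SciPy and PyTorch counterparts):

import oneflow as flow

x = flow.tensor([0.1, 0.5, 0.9])
print(flow.special.erf(x))    # alias for oneflow.erf()
print(flow.special.log1p(x))  # alias for oneflow.log1p()
print(flow.special.zeta(flow.tensor([2.0]), flow.tensor([1.0])))  # Hurwitz zeta(2, 1) = pi^2 / 6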
