OneFlow API Reference¶
Distributed performance, i.e. efficiency at scale, is a core technical challenge for deep learning frameworks. To improve performance and enable heterogeneous distributed scaling, OneFlow is built around static compilation and streaming parallelism, addressing the cluster-level memory wall with world-leading technology.
Troubleshooting¶
‘libunwind.h’ not found
Add the CMake argument -DWITH_UNWIND=OFF, or install libunwind on your system.
CUDNN_STATUS_NOT_INITIALIZED
You might see error messages like these:
I0729 22:37:45.483937439 56788 ev_epoll_linux.c:82] Use of signals is disabled. Epoll engine will not be used E0729 22:37:45.515343 56788 version.cpp:82] Failed to get cuda runtime version: CUDA driver version is insufficient for CUDA runtime version F0729 22:38:31.209002 56788 improver.cpp:535] Check failed: mem_size > 0 (-524288000 vs. 0)
F0723 19:05:56.194067 40970 cuda_util.cpp:82] Check failed: error == CUDNN_STATUS_SUCCESS (1 vs. 0) CUDNN_STATUS_NOT_INITIALIZED
Please upgrade your NVIDIA Linux x86_64 driver. Version >= 440.33 is recommended.
For more information, please refer to CUDA compatibility documentation.
Failed to compile .cu files
Please refer to CUDA System Requirements. Make sure your Linux distribution and the libraries shipped with it meet the requirements.
If you are using tools like conda, make sure the libraries you install don't shadow the proper installation that comes with your Linux distribution or a package manager like apt-get.
Please build OneFlow with a newer version of CMake. You could download version 3.14 from here: https://github.com/Kitware/CMake/releases/download/v3.14.0/cmake-3.14.0-Linux-x86_64.tar.gz
How do I know what compilers and flags are used to compile OneFlow?
Run
make clean && make VERBOSE=1
to get the exact compile commands with compiler paths and flags.
How to compile OneFlow with RDMA support?
Add the CMake flag
-DBUILD_RDMA
when building OneFlow.
Which version of g++ is CMake using to build OneFlow?
You should find a line like this in CMake output:
-- CMAKE_CXX_COMPILER_VERSION: [YOUR G++ VERSION NUMBER]
Failed to compile NCCL
Try using fewer threads when compiling OneFlow's third-party dependencies. For instance, use
cmake -DTHIRD_PARTY=ON .. && make
instead of
cmake -DTHIRD_PARTY=ON .. && make -j$(nproc)
"CUDA_VERSION" "VERSION_GREATER_EQUAL" "10.0"
Please use a newer version of CMake, and make sure cmake is correctly included in PATH.
CUBLAS not found
This usually happens when using CUDA 10.1 or newer. You should see an error message from CMake like this:
cuda lib not found: /usr/local/miniconda3/envs/dl/lib/libcublas_static.a or /usr/local/cuda/lib64/libcublas_static.a
Make sure libcublas_static.a is in one of the two directories.
When running OneFlow in gdb, there is no debug information for code locations.
Add the CMake flag -DCMAKE_BUILD_TYPE=RELWITHDEBINFO or -DCMAKE_BUILD_TYPE=DEBUG and recompile.
libof_ccobj.a: File truncated
You might see error message like this:
/usr/bin/ar: libof_ccobj.a: File truncated make[2]: *** [libof_ccobj.a] Error 1 make[2]: *** Deleting file `libof_ccobj.a' make[1]: *** [CMakeFiles/of_ccobj.dir/all] Error 2 make: *** [all] Error 2
You should upgrade your GNU Binutils. Version 2.33.1 is recommended. If you are using conda, you could install it by running
conda install -c conda-forge binutils
Failed to compile because C++ 17 is enabled
In some cases, the environment variable CXXFLAGS is not empty and contains --std c++17. Check whether it is empty by running echo $CXXFLAGS and clear it with unset CXXFLAGS. If you are using conda, you can make the change to the environment variable permanent by running:
conda env config vars set CXXFLAGS="-fPIC"
cmake outputs the error
No CMAKE_ASM_NASM_COMPILER could be found.
Install nasm. For instance, run sudo yum install nasm if you are on CentOS.
No module named 'google.protobuf'
You might see error message like this:
Scanning dependencies of target generate_api ... from google.protobuf import descriptor as _descriptor ModuleNotFoundError: No module named 'google.protobuf' CMakeFiles/generate_api.dir/build.make:57: recipe for target 'CMakeFiles/generate_api' failed make[2]: *** [CMakeFiles/generate_api] Error 1
Install development dependencies by running:
pip3 install -r dev-requirements.txt
Gdb warns ptrace: Operation not permitted. and the gdb command bt prints no backtrace
You might get this warning when debugging OneFlow with gdb inside a Docker container. Try adding these flags when launching your container:
docker run --cap-add=SYS_PTRACE --security-opt seccomp=unconfined
Please refer to https://stackoverflow.com/questions/19215177/how-to-solve-ptrace-operation-not-permitted-when-trying-to-attach-gdb-to-a-pro
It takes too long to download Python packages when running make
If you are in China, you can run this to have pip download packages from a domestic mirror of PyPI:
python3 -m pip config set global.index-url https://pypi.tuna.tsinghua.edu.cn/simple
For more information, please refer to the PyPI mirror usage guide (pypi 镜像使用帮助).
oneflow¶
The oneflow package contains data structures for multi-dimensional tensors and defines mathematical operations over these tensors. Additionally, it provides many utilities for efficient serialization of tensors and arbitrary types, and other useful utilities.
It has a CUDA counterpart that enables you to run your tensor computations on an NVIDIA GPU with compute capability >= 3.0.
Tensor¶
- Creates a Tensor with the dtype of oneflow.bool and the device on CPU; it has the same parameters as oneflow.Tensor.
- Creates a Tensor with the dtype of oneflow.uint8 and the device on CPU; it has the same parameters as oneflow.Tensor.
- Creates a Tensor with the dtype of oneflow.int8 and the device on CPU; it has the same parameters as oneflow.Tensor.
- Creates a Tensor with the dtype of oneflow.float64 and the device on CPU; it has the same parameters as oneflow.Tensor.
- Creates a Tensor with the dtype of oneflow.float32 and the device on CPU; it has the same parameters as oneflow.Tensor.
- Creates a Tensor with the dtype of oneflow.float16 and the device on CPU; it has the same parameters as oneflow.Tensor.
- Creates a Tensor with the dtype of oneflow.int32 and the device on CPU; it has the same parameters as oneflow.Tensor.
- Creates a Tensor with the dtype of oneflow.int64 and the device on CPU; it has the same parameters as oneflow.Tensor.
- Note that this function is simply doing isinstance(obj, Tensor).
- Returns True if the data type of input is a floating point data type, i.e., one of oneflow.float64, oneflow.float32, oneflow.float16, and oneflow.bfloat16.
- Returns True if the …
- Returns the total number of elements in the input tensor.
- Set options for printing.
- Returns the default floating point dtype.
- Sets the default floating point type for those source operators which create Tensors.
- Sets the default floating point type for those source operators which create Tensors.
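For instance, the default floating point dtype can be inspected and changed as below (a minimal sketch; it assumes the two functions above are oneflow.get_default_dtype() and oneflow.set_default_dtype(), mirroring PyTorch, and the reprs shown are indicative):
>>> import oneflow
>>> oneflow.get_default_dtype()
oneflow.float32
>>> oneflow.set_default_dtype(oneflow.float64)  # float-creating source ops now produce float64
>>> oneflow.get_default_dtype()
oneflow.float64
>>> oneflow.set_default_dtype(oneflow.float32)  # restore the default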
Creation Ops¶
Note
Random sampling creation ops are listed under Random sampling and
include:
oneflow.rand()
oneflow.randn()
oneflow.randint()
oneflow.randperm()
- Constructs a tensor with data; returns a global tensor if placement and sbp are in kwargs, otherwise returns a local tensor.
- Converts data into a tensor, sharing data and preserving autograd history if possible.
- Create a view of an existing oneflow.Tensor input with specified size, stride and storage_offset.
- Creates a …
- Returns a tensor filled with the scalar value 0, with the shape defined by the variable argument size.
- The interface is consistent with PyTorch.
- Returns a tensor filled with the scalar value 1, with the shape defined by the variable argument size.
- The interface is consistent with PyTorch.
- Returns a tensor with the same size as input that is filled with random numbers from a normal distribution with mean 0 and variance 1.
- Returns a tensor filled with random integers generated uniformly between low (inclusive) and high (exclusive).
- Fills elements of …
- The interface is consistent with PyTorch.
- Returns a 1-D tensor of size \(\left\lfloor \frac{\text{end} - \text{start}}{\text{step}} \right\rfloor + 1\) with values from start to end with step step.
- Creates a one-dimensional tensor of size steps whose values are evenly spaced from start to end, inclusive.
- This operator creates a 2-D Tensor with ones on the diagonal and zeros elsewhere.
- The interface is consistent with PyTorch.
- The interface is consistent with PyTorch.
- Creates a tensor of size size filled with fill_value.
- Returns a tensor with the same size as input filled with fill_value.
- This operation creates a new tensor by applying sparse updates to the input tensor.
- This function is equivalent to PyTorch's logspace function.
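A few of the creation ops above in action (a minimal sketch; the exact reprs may vary by version):
>>> import oneflow
>>> oneflow.zeros(2, 3).shape
oneflow.Size([2, 3])
>>> oneflow.full((2, 2), 3.14)     # every element set to fill_value
tensor([[3.1400, 3.1400],
        [3.1400, 3.1400]], dtype=oneflow.float32)
>>> oneflow.arange(0, 10, 2)       # 1-D tensor of values spaced by step
tensor([0, 2, 4, 6, 8], dtype=oneflow.int64)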
Indexing, Slicing, Joining, Mutating Ops¶
- This operator finds the indices of input Tensor input elements that are non-zero.
- Returns a 1-dimensional view of each input tensor with zero dimensions.
- Returns a 2-dimensional view of each input tensor with zero dimensions.
- Returns a 3-dimensional view of each input tensor with zero dimensions.
- Concatenates two or more Tensors at the specified dim.
- Creates a new tensor by horizontally stacking the tensors in …
- cat(tensors, dim=0) -> Tensor
- Splits a tensor into a specific number of chunks.
- Stacks tensors in …
- This operator expands the input tensor to a larger size.
- Gathers values along an axis specified by dim.
- This operator is a high-dimensional extension of gather; index is a K-dimensional tensor, which is regarded as an index into the input Tensor input.
- Gathers the elements in batch dims.
- The documentation is referenced from: https://pytorch.org/docs/1.10/generated/torch.hsplit.html.
- Stacks tensors in …
- Splits input, a tensor with two or more dimensions, into multiple tensors vertically according to indices_or_sections.
- Stacks tensors in …
- Selects values along an axis specified by dim.
- Returns a new 1-D tensor which indexes the input tensor according to the boolean mask mask, which is a BoolTensor (in OneFlow, BoolTensor is replaced by Int8Tensor).
- Moves the dimension(s) of input at the position(s) in source to the position(s) in destination.
- Returns a new tensor that is a narrowed version of the input tensor.
- Returns a view of the original tensor with its dimensions permuted.
- This operator repeats the input tensor to a larger size along the specified dimensions.
- This operator reshapes a Tensor.
- Alias of …
- Slices the self tensor along the selected dimension at the given index.
- This operator writes the elements specified by index along the axis dim from src into the input.
- This operator scatters src, with an addition operation, according to index along dim into the input.
- This operator inserts the elements in update according to the index and creates a new Tensor.
- Extracts a slice from a tensor.
- Updates a slice of tensor x.
- Splits the tensor into chunks.
- This operator removes the specified dimension of size 1 from the input Tensor.
- Concatenates a sequence of tensors along a new dimension.
- This function is equivalent to NumPy's swapaxes function.
- This function is equivalent to torch's swapdims function.
- oneflow.t(input) → Tensor.
- Constructs a tensor by repeating the elements of input.
- Returns a tensor that is a transposed version of input.
- Removes a tensor dimension.
- Returns a new tensor with a dimension of size one inserted at the specified position.
- Returns a tensor of elements selected from either x or y, depending on condition.
- Splits a tensor into multiple sub-tensors, all of which are views of input, along dimension dim according to the indices or number of sections specified by indices_or_sections.
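A quick tour of a few of these ops (a minimal sketch; the reprs are indicative):
>>> import oneflow
>>> x = oneflow.tensor([[1, 2, 3], [4, 5, 6]])
>>> oneflow.cat([x, x], dim=0).shape               # join along an existing dim
oneflow.Size([4, 3])
>>> oneflow.reshape(x, (3, 2)).shape
oneflow.Size([3, 2])
>>> oneflow.squeeze(oneflow.ones(1, 3, 1)).shape   # drop the size-1 dims
oneflow.Size([3])
>>> oneflow.unsqueeze(x, 0).shape                  # insert a new size-1 dim at position 0
oneflow.Size([1, 2, 3])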
Random sampling¶
- Sets the seed for generating random numbers to a non-deterministic random number.
- Sets the seed for generating random numbers.
- Returns the initial seed for generating random numbers as a Python long.
- Sets the random number generator state.
- Returns the random number generator state as an oneflow.ByteTensor.
- This operator returns a Tensor with binary random numbers (0 / 1) drawn from a Bernoulli distribution.
- Returns a tensor of random numbers drawn from separate normal distributions whose mean and standard deviation are given.
- Returns a tensor filled with random numbers from a uniform distribution on the interval [0, 1).
- Returns a tensor filled with random integers generated uniformly between low (inclusive) and high (exclusive).
- Returns a tensor filled with random numbers from a normal distribution with mean 0 and variance 1 (also called the standard normal distribution).
- Returns a random permutation of integers from 0 to n - 1.
- Returns a tensor where each row contains num_samples indices sampled from the multinomial probability distribution located in the corresponding row of the tensor input.
In-place random sampling¶
There are a few more in-place random sampling functions defined on Tensors as well. Click through to refer to their documentation:
- oneflow.Tensor.normal_(): in-place version of oneflow.normal()
- oneflow.Tensor.uniform_(): numbers sampled from the continuous uniform distribution
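For example (a minimal sketch; the argument names mirror PyTorch and are assumptions here):
>>> import oneflow
>>> t = oneflow.ones(2, 3)
>>> _ = t.normal_(mean=0.0, std=1.0)   # overwrite t in-place with N(0, 1) samples
>>> _ = t.uniform_(0, 1)               # overwrite t in-place with U[0, 1) samples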
Serialization¶
Parallelism¶
Sets the number of threads used for intraop parallelism on CPU. |
Locally disabling gradient computation¶
The context managers oneflow.no_grad()
, oneflow.enable_grad()
, and
oneflow.set_grad_enabled()
are helpful for locally disabling and enabling
gradient computation. These context managers are thread local, so they won’t
work if you send work to another thread using the threading
module, etc.
Examples:
>>> import oneflow
>>> x = oneflow.zeros(1, requires_grad=True)
>>> with oneflow.no_grad():
... y = x * 2
>>> y.requires_grad
False
>>> with oneflow.set_grad_enabled(False):
... y = x * 2
>>> y.requires_grad
False
>>> with oneflow.set_grad_enabled(True):
... y = x * 2
>>> y.requires_grad
True
- Context-manager that disables gradient calculation.
- Context-manager that enables gradient calculation.
- Context-manager that sets gradient calculation on or off.
- Returns True if grad mode is currently enabled.
- Context-manager that enables or disables inference mode.
Math operations¶
Pointwise Ops¶
- Returns the absolute value of each element in input: \(y = |x|\), element-wise.
- Returns a new tensor with the inverse cosine of the elements of input.
- Returns a new tensor with the inverse hyperbolic cosine of the elements of input.
- Returns a new tensor with the inverse cosine of the elements of input.
- Returns a new tensor with the inverse hyperbolic cosine of the elements of input.
- Adds other, scaled by alpha, to input.
- This function is equivalent to PyTorch's addcdiv function.
- Performs the element-wise multiplication of tensor1 by tensor2, multiplies the result by the scalar value, and adds it to input.
- Returns a new tensor with the arcsine of the elements of input.
- Returns a new tensor with the inverse hyperbolic sine of the elements of input.
- Returns a new tensor with the arcsine of the elements of input.
- Returns a new tensor with the inverse hyperbolic sine of the elements of input.
- Returns a new tensor with the arctangent of the elements of input.
- Returns a new tensor with the inverse hyperbolic tangent of the elements of input.
- Returns a new tensor with the arctangent of the elements of input.
- Returns a new tensor with the inverse hyperbolic tangent of the elements of input.
- Element-wise arctangent of input{i}/other{i} with consideration of the quadrant.
- Returns a new tensor with the ceil of the elements of input.
- In-place version of the preceding function.
- Clamps all elements in input into the range [min, max].
- Clamps all elements in input into the range [min, max].
- Clamps all elements in input into the range [min, max].
- Alias for …
- Returns a new tensor with the cosine of the elements of input.
- Returns a new tensor with the hyperbolic cosine of the elements of input.
- Computes the division of input by other for each element; scalar and broadcast promotion are supported.
- Computes the error function of each element.
- Computes the complementary error function of each element of input.
- Computes the inverse error function of input.
- This operator computes the exponential of the Tensor.
- Returns a new tensor with the exponential of the elements minus 1 of input.
- Returns a new tensor with the arcsine of the elements of input.
- In-place version of the preceding function.
- frac(input) → Tensor
- In-place version of the preceding function.
- Computes the element-wise remainder of division.
- Applies the Gaussian Error Linear Units function.
- Applies a GELU approximation that is fast but somewhat inaccurate.
- Applies the relu^2 activation introduced in https://arxiv.org/abs/2109.08668v2
- Returns a new tensor with the natural logarithm of the elements of input.
- Returns a new tensor with the natural logarithm of (1 + input).
- Returns a new tensor with the logarithm to the base 2 of the elements of input.
- Returns a new tensor with the logarithm to the base 10 of the elements of input.
- Computes the element-wise logical AND of the given input tensors.
- Computes the element-wise logical NOT of the given input tensor.
- Computes the element-wise logical OR of the given input tensors.
- Computes the element-wise logical XOR of the given input tensors.
- Computes the bitwise AND of input and other.
- Computes the bitwise OR of input and other.
- Computes the bitwise XOR of input and other.
- Computes the bitwise NOT of input.
- Applies the element-wise function …
- Computes the multiplication of input by other for each element; scalar and broadcast promotion are supported.
- This operator computes the negative value of the Tensor.
- This operator computes the negative value of the Tensor.
- Takes the power of each element in input with exponent and returns a tensor with the result.
- Computes the safe reciprocal of x.
- This operator rounds each element of the input to the nearest integer.
- In-place version of the preceding function.
- Returns a new tensor with the reciprocal of the square root of each of the elements of input.
- Applies the element-wise function …
- Softmax is defined as …
- Applies the element-wise function …
- The formula is …
- The formula is …
- Applies the element-wise function \(\text{Sigmoid}(x) = \frac{1}{1 + \exp(-x)}\).
- Computes the sign of the Tensor.
- Returns a new tensor with the sine of the elements of input.
- Returns a new tensor with the hyperbolic sine of the elements of input.
- In-place version of the preceding function.
- Returns a new tensor with the square root of the elements of input.
- Returns a new tensor with the square of the elements of input.
- Computes the subtraction of input by other for each element; scalar and broadcast promotion are supported.
- Returns the tangent of the elements of input.
- The equation is …
- The interface is consistent with PyTorch.
- The documentation is referenced from: https://pytorch.org/docs/stable/generated/torch.lerp.html.
- In-place version of the preceding function.
- The documentation is referenced from: https://pytorch.org/docs/stable/generated/torch.quantile.html.
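A few pointwise ops from the list above (a minimal sketch; the output formatting is indicative):
>>> import oneflow
>>> x = oneflow.tensor([-1.5, 0.0, 2.0])
>>> oneflow.abs(x)
tensor([1.5000, 0.0000, 2.0000], dtype=oneflow.float32)
>>> oneflow.add(x, x, alpha=2)           # x + 2 * x
tensor([-4.5000,  0.0000,  6.0000], dtype=oneflow.float32)
>>> oneflow.clamp(x, min=-1.0, max=1.0)  # limit every element to [-1, 1]
tensor([-1.0000,  0.0000,  1.0000], dtype=oneflow.float32)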
Reduction Ops¶
- The op computes the index with the largest value of a Tensor at the specified axis.
- The op computes the index with the largest value of a Tensor at the specified axis.
- Returns the maximum along a dimension.
- Returns the minimum value of each slice of the input tensor in the given dimension(s) dim.
- For each row of input in the given dimension dim, returns True if any element in the row evaluates to True, and False otherwise.
- Computes the maximum value of all elements in the input tensor.
- Computes the minimum value of all elements in the input tensor.
- Computes the mean of a row of elements in a tensor in the given dimension.
- Returns the median of the values in input.
- Returns a namedtuple (values, indices) where values is the mode value of each row of the input tensor in the given dimension dim, i.e. a value which appears most often in that row, and indices is the index location of each mode value found.
- Computes the product of a row of elements in a tensor in the given dimension.
- Returns the sum of each row of the input tensor in the given dimension dim.
- Returns the standard deviation of each row of the input tensor in the given dimension dim.
- Computes the sum of a row of elements in a tensor in the given dimension.
- Returns the log of summed exponentials of each row of the input tensor in the given dimension dim.
- Returns the variance of each row of the input tensor in the given dimension dim.
- Returns the matrix norm or vector norm of a given tensor.
- For each row of input in the given dimension dim, returns True if all elements in the row evaluate to True, and False otherwise.
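For example (a minimal sketch; the reprs are indicative):
>>> import oneflow
>>> x = oneflow.tensor([[1., 2.], [3., 4.]])
>>> oneflow.sum(x)               # reduce over all elements
tensor(10., dtype=oneflow.float32)
>>> oneflow.sum(x, dim=1)        # reduce each row
tensor([3., 7.], dtype=oneflow.float32)
>>> oneflow.mean(x, dim=0)       # column means
tensor([2., 3.], dtype=oneflow.float32)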
Comparison Ops¶
- This operator sorts the input Tensor at the specified dim and returns the indices of the sorted Tensor.
- Computes element-wise equality.
- True if two tensors have the same size and elements, False otherwise.
- Returns the truth value of \(input > other\) element-wise.
- This function is equivalent to PyTorch's isinf function.
- This function is equivalent to PyTorch's isnan function.
- Returns the truth value of \(input <= other\) element-wise.
- Returns the truth value of \(input < other\) element-wise.
- Computes element-wise inequality.
- Sorts the elements of the input tensor along a given dimension in ascending order by value.
- Finds the values and indices of the k largest entries at the specified axis.
- Returns the truth value of \(input >= other\) element-wise.
- Returns the truth value of \(input > other\) element-wise.
- Returns the truth value of \(input >= other\) element-wise.
- Computes the element-wise maximum of x and y.
- Computes the element-wise minimum of x and y.
- ne(input, other) -> Tensor
- The documentation is referenced from: https://pytorch.org/docs/1.10/generated/torch.isclose.html
- The documentation is referenced from: https://pytorch.org/docs/1.10/generated/torch.allclose.html
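For example (a minimal sketch; the result dtypes and reprs are indicative):
>>> import oneflow
>>> a = oneflow.tensor([1, 2, 3])
>>> b = oneflow.tensor([3, 2, 1])
>>> oneflow.eq(a, b)        # element-wise equality mask: [False, True, False]
>>> oneflow.gt(a, b)        # element-wise a > b: [False, False, True]
>>> oneflow.topk(a, k=2)    # the two largest values and their indices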
Spectral Ops¶
This function is equivalent to PyTorch's hann_window function.
Other Ops¶
- Applies a 1D adaptive average pooling over an input signal composed of several input planes.
- Applies a 2D adaptive average pooling over an input signal composed of several input planes.
- Applies a 3D adaptive average pooling over an input signal composed of several input planes.
- This operator broadcasts the tensor x to like_tensor according to broadcast_axes.
- The operation takes input tensor x and casts it to the output with dtype.
- This operator computes the cumulative product of input elements in the given dimension.
- This operator computes the cumulative sum of input elements in the given dimension.
- If input is a vector (1-D tensor), then returns a 2-D square tensor with the elements of input as the diagonal.
- Returns a partial view of input with its diagonal elements with respect to dim1 and dim2 appended as a dimension at the end of the shape.
- Sums the product of the elements of the input operands along dimensions specified using a notation based on the Einstein summation convention.
- Flattens a contiguous range of dims into a tensor.
- Reverses the order of an n-D tensor along the given axes in dims.
- Says whether the targets are in the top K predictions.
- Takes \(N\) tensors, each of which can be either a scalar or a 1-dimensional vector, and creates \(N\) N-dimensional grids, where the \(i\) th grid is defined by expanding the \(i\) th input over dimensions defined by the other inputs.
- Performs non-maximum suppression (NMS) on the boxes according to their intersection-over-union (IoU).
- Rolls the tensor along the given dimension(s).
- Finds the indices from the innermost dimension of sorted_sequence such that, if the corresponding values in values were inserted before the indices, the order of the corresponding innermost dimension within sorted_sequence would be preserved.
- Computes the tensor dot product along the given dimensions.
- Returns the lower triangular part of a matrix (2-D tensor) or batch of matrices input along the specified diagonal; the other elements of the result tensor out are set to 0.
- Repeats elements of a tensor.
- Returns the upper triangular part of a matrix (2-D tensor) or batch of matrices input; the other elements of the result tensor out are set to 0.
- Returns the cross product of vectors in dimension dim of input and other.
- oneflow.bincount(input, weights=None, minlength=0) → Tensor
- The interface is consistent with PyTorch.
- The interface is consistent with PyTorch.
- The interface is consistent with PyTorch.
- Returns the unique elements of the input tensor.
BLAS and LAPACK Operations¶
- Performs a matrix multiplication of the matrices input and mat2.
- Performs a batch matrix-matrix product of matrices stored in input and mat2.
- The documentation is referenced from: https://pytorch.org/docs/1.10/generated/torch.baddbmm.html.
- This operator computes the dot product of tensors input and other.
- This operator applies matrix multiplication to two Tensors.
- Performs a matrix multiplication of the matrices input and mat2.
- Performs a matrix-vector product of the matrix input and the vector vec.
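For example (a minimal sketch; the repr shown is indicative):
>>> import oneflow
>>> m = oneflow.ones(2, 3)
>>> v = oneflow.ones(3)
>>> oneflow.matmul(m, m.t()).shape   # (2, 3) x (3, 2) -> (2, 2)
oneflow.Size([2, 2])
>>> oneflow.mv(m, v)                 # matrix-vector product, shape (2,)
>>> oneflow.dot(v, v)                # inner product of two 1-D tensors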
oneflow.nn¶
These are the basic building blocks for graphs:
Containers¶
- Base class for all neural network modules.
- A sequential container.
- Holds submodules in a list.
- Holds submodules in a dictionary.
- Holds parameters in a list.
- Holds parameters in a dictionary.
nn.Module¶
- Adds a child module to the current module.
- Applies fn recursively to every submodule (as returned by .children()) as well as self.
- Returns an iterator over module buffers.
- Returns an iterator over immediate children modules.
- Moves all model parameters and buffers to the CPU.
- Moves all model parameters and buffers to the GPU.
- Casts all floating point parameters and buffers to …
- Sets the module in training mode.
- Sets the module in evaluation mode.
- Sets the extra representation of the module.
- Casts all floating point parameters and buffers to …
- Copies parameters and buffers from state_dict into this module and its descendants.
- Returns an iterator over all modules in the network.
- Returns an iterator over module buffers, yielding both the name of the buffer as well as the buffer itself.
- Returns an iterator over immediate children modules, yielding both the name of the module as well as the module itself.
- Returns an iterator over all modules in the network, yielding both the name of the module as well as the module itself.
- Returns an iterator over module parameters, yielding both the name of the parameter as well as the parameter itself.
- Returns an iterator over module parameters.
- Adds a buffer to the module.
- Registers a forward hook on the module.
- Registers a forward pre-hook on the module.
- Registers a backward hook on the module.
- Registers a backward hook on the module.
- These hooks will be called with arguments …
- Adds a parameter to the module.
- Changes whether autograd should record operations on parameters in this module.
- Returns a dictionary containing the whole state of the module.
- Moves and/or casts the parameters and buffers.
- Sets the gradients of all model parameters to zero.
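Putting a few of these together (a minimal sketch; TinyNet is a hypothetical module, and the API mirrors PyTorch's nn.Module):
>>> import oneflow
>>> import oneflow.nn as nn
>>> class TinyNet(nn.Module):
...     def __init__(self):
...         super().__init__()
...         self.fc = nn.Linear(4, 2)     # registered as a child module
...     def forward(self, x):
...         return self.fc(x)
...
>>> net = TinyNet()
>>> _ = net.eval()                        # eval() returns the module itself
>>> sorted(name for name, _ in net.named_parameters())
['fc.bias', 'fc.weight']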
Convolution Layers¶
- Applies a 1D convolution over an input signal composed of several input planes.
- Applies a 2D convolution over an input signal composed of several input planes.
- Applies a 3D convolution over an input signal composed of several input planes.
- Applies a 1D transposed convolution operator over an input image composed of several input planes.
- Applies a 2D transposed convolution operator over an input image composed of several input planes.
- Applies a 3D transposed convolution operator over an input image composed of several input planes.
- The documentation is referenced from: https://pytorch.org/docs/1.10/generated/torch.nn.Unfold.html.
- The documentation is referenced from: https://pytorch.org/docs/1.10/generated/torch.nn.Fold.html.
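For example (a minimal sketch; shapes follow the usual NCHW convention):
>>> import oneflow
>>> import oneflow.nn as nn
>>> conv = nn.Conv2d(in_channels=3, out_channels=8, kernel_size=3, padding=1)
>>> x = oneflow.randn(1, 3, 32, 32)    # one RGB image, 32x32
>>> conv(x).shape                      # padding=1 keeps the spatial size
oneflow.Size([1, 8, 32, 32])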
Pooling Layers¶
- Applies a 1D max pooling over an input signal composed of several input planes.
- Applies a 2D max pooling over an input signal composed of several input planes.
- Applies a 3D max pooling over an input signal composed of several input planes.
- Computes a partial inverse of MaxPool1d.
- Computes a partial inverse of MaxPool2d.
- Computes a partial inverse of MaxPool3d.
- Applies a 1D adaptive average pooling over an input signal composed of several input planes.
- Applies a 2D adaptive average pooling over an input signal composed of several input planes.
- Applies a 3D adaptive average pooling over an input signal composed of several input planes.
- Applies a 1D adaptive max pooling over an input signal composed of several input planes.
- Applies a 2D adaptive max pooling over an input signal composed of several input planes.
- Applies a 3D adaptive max pooling over an input signal composed of several input planes.
- Applies a 1D average pooling over an input signal composed of several input planes.
- Performs 2D average pooling on the input.
- Applies a 3D average pooling over an input signal composed of several input planes.
Padding Layers¶
- Pads the input tensor boundaries with a constant value.
- This operator pads the input with a constant value that the user specifies.
- Pads the input tensor boundaries with a constant value.
- This operator pads the input tensor using the reflection of the input boundary.
- This operator pads the input tensor using the reflection of the input boundary.
- Pads the input tensor using replication of the input boundary.
- Pads the input tensor using replication of the input boundary.
- Pads the input tensor boundaries with zero.
Non-linear Activations (weighted sum, nonlinearity)¶
- Applies the element-wise function …
- The Hardshrink activation.
- Applies the element-wise function …
- Applies the hardswish function, element-wise, as described in the paper Searching for MobileNetV3.
- Applies the HardTanh function element-wise.
- Applies the element-wise function …
- Applies the element-wise function …
- Applies the element-wise function …
- Applies the rectified linear unit function element-wise.
- Applies the element-wise function …
- Applies the element-wise function …
- Applies the element-wise function …
- The documentation is referenced from: https://pytorch.org/docs/1.10/generated/torch.nn.GELU.html.
- Applies a GELU approximation that is fast but somewhat inaccurate.
- Applies the relu^2 activation introduced in https://arxiv.org/abs/2109.08668v2
- The SiLU (Swish) activation.
- Applies the element-wise function …
- Applies the element-wise function …
- Applies the element-wise function …
- The Softshrink activation.
- The SoftSign activation.
- This operator computes the hyperbolic tangent of the Tensor.
- The Threshold activation.
- The GLU activation.
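For example (a minimal sketch; the repr shown is indicative):
>>> import oneflow
>>> import oneflow.nn as nn
>>> act = nn.ReLU()
>>> act(oneflow.tensor([-1.0, 0.0, 0.5]))   # negative values are zeroed
tensor([0.0000, 0.0000, 0.5000], dtype=oneflow.float32)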
Non-linear Activations (other)¶
- Applies the Softmax function to an n-dimensional input Tensor, rescaling it so that the elements of the n-dimensional output Tensor lie in the range [0,1] and sum to 1.
- Applies the LogSoftmax function to an n-dimensional input Tensor.
Normalization Layers¶
- Applies Batch Normalization over a 2D or 3D input (a mini-batch of 1D inputs with optional additional channel dimension) as described in the paper Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift.
- Applies Batch Normalization over a 4D input (a mini-batch of 2D inputs with additional channel dimension) as described in the paper Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift.
- Applies Batch Normalization over a 5D input (a mini-batch of 3D inputs with additional channel dimension) as described in the paper Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift.
- Applies Batch Normalization over an N-dimensional input (a mini-batch of [N-2]D inputs with additional channel dimension) as described in the paper Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift.
- Applies Fused Batch Normalization over a 2D or 3D input; the formula is …
- Applies Fused Batch Normalization over a 4D input; the formula is …
- Applies Fused Batch Normalization over a 5D input; the formula is …
- Applies Group Normalization over a mini-batch of inputs as described in the paper Group Normalization.
- Applies Instance Normalization over a 3D input (a mini-batch of 1D inputs with optional additional channel dimension) as described in the paper Instance Normalization: The Missing Ingredient for Fast Stylization.
- Applies Instance Normalization over a 4D input (a mini-batch of 2D inputs with additional channel dimension) as described in the paper Instance Normalization: The Missing Ingredient for Fast Stylization.
- Applies Instance Normalization over a 5D input (a mini-batch of 3D inputs with additional channel dimension) as described in the paper Instance Normalization: The Missing Ingredient for Fast Stylization.
- Applies Layer Normalization over a mini-batch of inputs as described in the paper Layer Normalization.
- Construct a layernorm module in the T5 style.
- Applies Root Mean Square Layer Normalization over a mini-batch of inputs as described in the paper Root Mean Square Layer Normalization.
Recurrent Layers¶
- Applies a multi-layer Elman RNN with tanh or ReLU non-linearity to an input sequence.
- Applies a multi-layer long short-term memory (LSTM) RNN to an input sequence.
- Applies a multi-layer gated recurrent unit (GRU) RNN to an input sequence.
- An Elman RNN cell with tanh or ReLU non-linearity.
- A long short-term memory (LSTM) cell.
- A gated recurrent unit (GRU) cell.
Linear Layers¶
- A placeholder identity operator that is argument-insensitive.
- Applies a linear transformation to the incoming data: \(y = xA^T + b\).
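For example (a minimal sketch):
>>> import oneflow
>>> import oneflow.nn as nn
>>> lin = nn.Linear(20, 30)            # computes y = x A^T + b
>>> x = oneflow.randn(128, 20)
>>> lin(x).shape
oneflow.Size([128, 30])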
Dropout Layers¶
- During training, randomly zeroes some of the elements of the input tensor with probability p.
- Randomly zeroes out entire channels (a channel is a 1D feature map, e.g., the \(j\)-th channel of the \(i\)-th sample in the batched input is a 1D tensor \(\text{input}[i, j]\)).
- Randomly zeroes out entire channels (a channel is a 2D feature map, e.g., the \(j\)-th channel of the \(i\)-th sample in the batched input is a 2D tensor \(\text{input}[i, j]\)).
- Randomly zeroes out entire channels (a channel is a 3D feature map, e.g., the \(j\)-th channel of the \(i\)-th sample in the batched input is a 3D tensor \(\text{input}[i, j]\)).
Sparse Layers¶
A simple lookup table that stores embeddings of a fixed dictionary and size.
Distance Functions¶
- Returns cosine similarity between \(x_1\) and \(x_2\), computed along dim.
- Computes the pairwise distance between vectors \(v_1\), \(v_2\) using the p-norm.
Loss Functions¶
- This operator computes the binary cross entropy loss.
- This operator combines Sigmoid and BCELoss together.
- The Connectionist Temporal Classification loss.
- The operation implements "margin_softmax" in InsightFace: https://github.com/deepinsight/insightface/blob/master/recognition/arcface_mxnet/train.py The implementation of margin_softmax in InsightFace is composed of multiple operators.
- The documentation is referenced from: https://pytorch.org/docs/1.10/generated/torch.nn.CrossEntropyLoss.html.
- The Kullback-Leibler divergence loss measure.
- This operator computes the L1 loss between each element in input and target.
- Creates a criterion that measures the mean squared error (squared L2 norm) between each element in the input \(x\) and target \(y\).
- Creates a criterion that measures the loss given inputs \(x1\), \(x2\), two 1D mini-batch Tensors, and a label 1D mini-batch tensor \(y\) (containing 1 or -1).
- The negative log likelihood loss.
- Creates a criterion that uses a squared term if the absolute element-wise error falls below beta and an L1 term otherwise.
- Creates a criterion that measures the triplet loss given input tensors \(x1\), \(x2\), \(x3\) and a margin with a value greater than \(0\).
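For example, cross entropy with integer class targets (a minimal sketch mirroring the PyTorch interface referenced above):
>>> import oneflow
>>> import oneflow.nn as nn
>>> loss_fn = nn.CrossEntropyLoss()
>>> logits = oneflow.randn(3, 5, requires_grad=True)   # 3 samples, 5 classes
>>> target = oneflow.tensor([1, 0, 4])                 # one class index per sample
>>> loss = loss_fn(logits, target)
>>> loss.backward()                                    # gradients flow back to logits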
Vision Layers¶
- alias of …
- Upsamples a given multi-channel 1D (temporal), 2D (spatial) or 3D (volumetric) data.
- Applies a 2D bilinear upsampling to an input signal composed of several input channels.
- Applies a 2D nearest neighbor upsampling to an input signal composed of several input channels.
Data loading and preprocessing Layers¶
- Generates random boolean values following a bernoulli distribution.
- Performs fused cropping, normalization, format conversion (NHWC to NCHW) if desired, and type casting.
- This operator reads a tensor as bytes.
Quantization Aware Training¶
- Computes the quantization parameters of the input tensor.
- Computes the quantization parameters based on the moving average of the input tensor's min and max values.
- Simulates the quantize and dequantize operations at training time.
- A Conv1d module attached with nn.MinMaxObserver, nn.MovingAverageMinMaxObserver and nn.FakeQuantization modules for weight and input, used for quantization aware training.
- A Conv2d module attached with nn.MinMaxObserver, nn.MovingAverageMinMaxObserver and nn.FakeQuantization modules for weight and input, used for quantization aware training.
- A Conv3d module attached with nn.MinMaxObserver, nn.MovingAverageMinMaxObserver and nn.FakeQuantization modules for weight and input, used for quantization aware training.
Utilities¶
From the oneflow.nn.utils module:
- Clips the gradient norm of an iterable of parameters.
- Clips the gradient of an iterable of parameters at a specified value.
- Applies weight normalization to a parameter in the given module.
- Removes the weight normalization reparameterization from a module.
Utility functions in other modules:
- The interface is consistent with PyTorch.
- The interface is consistent with PyTorch.
- The interface is consistent with PyTorch.
- The interface is consistent with PyTorch.
- Packs a list of variable length Tensors.
- Flattens a contiguous range of dims into a tensor.
Quantized Functions¶
Quantization refers to techniques for performing computations and storing tensors at lower bitwidths than floating point precision.
- Simulates the quantize and dequantize operations at training time.
- Computes the quantization parameters of the input tensor.
- Computes the quantization parameters based on the moving average of the input tensor's min and max values.
- Simulates the quantize operation at inference time.
oneflow.nn.functional¶
Convolution functions¶
- Applies a 1D convolution over an input signal composed of several input planes.
- Applies a 2D convolution over an input image composed of several input planes.
- Applies a 3D convolution over an input image composed of several input planes.
- Applies a 1D transposed convolution operator over an input signal composed of several input planes, sometimes also called "deconvolution".
- Applies a 2D transposed convolution operator over an input image composed of several input planes, sometimes also called "deconvolution".
- Applies a 3D transposed convolution operator over an input image composed of several input planes, sometimes also called "deconvolution".
- The documentation is referenced from: https://pytorch.org/docs/1.10/generated/torch.nn.functional.fold.html.
- The documentation is referenced from: https://pytorch.org/docs/1.10/generated/torch.nn.functional.unfold.html.
BatchNorm functions¶
Applies Batch Normalization for each channel across a batch of data.
Pooling functions¶
- Applies a 1D average pooling over an input signal composed of several input planes.
- Applies 2D average-pooling operation in \(kH \times kW\) regions by step size \(sH \times sW\) steps.
- Applies 3D average-pooling operation in \(kT \times kH \times kW\) regions by step size \(sT \times sH \times sW\) steps.
- Applies a 1D max pooling over an input signal composed of several input planes.
- Applies a 2D max pooling over an input signal composed of several input planes.
- Applies a 3D max pooling over an input signal composed of several input planes.
- Computes a partial inverse of the 1D max pooling op above.
- Computes a partial inverse of the 2D max pooling op above.
- Computes a partial inverse of the 3D max pooling op above.
- Applies a 1D adaptive average pooling over an input signal composed of several input planes.
- Applies a 2D adaptive average pooling over an input signal composed of several input planes.
- Applies a 3D adaptive average pooling over an input signal composed of several input planes.
- Applies a 1D adaptive max pooling over an input signal composed of several input planes.
- Applies a 2D adaptive max pooling over an input signal composed of several input planes.
- Applies a 3D adaptive max pooling over an input signal composed of several input planes.
Non-linear activation functions¶
- Thresholds each element of the input Tensor.
- Applies the rectified linear unit function element-wise.
- Applies the HardTanh function element-wise.
- Applies the hardswish function, element-wise, as described in the paper Searching for MobileNetV3.
- Applies the element-wise function \(\text{ReLU6}(x) = \min(\max(0,x), 6)\).
- Applies element-wise …
- Applies the element-wise function …
- Applies the element-wise function …
- Applies element-wise \(\text{LeakyReLU}(x) = \max(0, x) + \text{negative\_slope} \cdot \min(0, x)\).
- Applies the relu^2 activation introduced in https://arxiv.org/abs/2109.08668v2
- Applies the element-wise function …
- The equation is …
- Applies the Gaussian Error Linear Units function.
- Applies a GELU approximation that is fast but somewhat inaccurate.
- Applies the element-wise function …
- Applies the hard shrinkage function in an element-wise manner.
- The formula is …
- Applies the element-wise function …
- Applies a softmax function.
- Applies the soft shrinkage function in an element-wise manner.
- LogSoftmax is defined as …
- Addresses the problem that the output values of argmax do not reflect the probability distribution of the model's output.
- The equation is …
- Applies the element-wise function \(\text{Sigmoid}(x) = \frac{1}{1 + \exp(-x)}\).
- Applies the element-wise function …
- The formula is …
- Applies the element-wise function …
- Applies Layer Normalization over the last certain number of dimensions.
- Performs \(L_p\) normalization of inputs over the specified dimension.
Linear functions¶
Applies a linear transformation to the incoming data: \(y = xA^T + b\).
Dropout functions¶
- During training, randomly zeroes some of the elements of the input tensor with probability p.
- The documentation is referenced from: https://pytorch.org/docs/1.10/generated/torch.nn.functional.dropout1d.html.
- dropout1d(x: Tensor, p: float = 0.5, training: bool = True) -> Tensor
- dropout1d(x: Tensor, p: float = 0.5, training: bool = True) -> Tensor
Distance functions¶
- Returns cosine similarity between …
- Computes the pairwise distance between vectors \(v_1\), \(v_2\) using the p-norm.
Loss functions¶
- The interface is consistent with TensorFlow.
- The Connectionist Temporal Classification loss.
- This operator computes the L1 loss between each element in input and target.
- This operator computes the mean squared error (squared L2 norm) loss between each element in input and target.
- Function that uses a squared term if the absolute element-wise error falls below beta and an L1 term otherwise.
- Creates a criterion that measures the triplet loss given input tensors \(x1\), \(x2\), \(x3\) and a margin with a value greater than \(0\).
- The documentation is referenced from: https://pytorch.org/docs/1.10/generated/torch.nn.functional.binary_cross_entropy.html.
- The documentation is referenced from: https://pytorch.org/docs/1.10/generated/torch.nn.functional.binary_cross_entropy_with_logits.html.
Vision functions¶
- Performs Deformable Convolution v2, described in Deformable ConvNets v2: More Deformable, Better Results, if …
- Pads tensor.
- The interface is consistent with PyTorch.
- alias of …
- The interface is consistent with PyTorch.
- The interface is consistent with PyTorch.
Greedy decoder¶
Performs greedy decoding on the logits given in input (best path).
oneflow.Tensor¶
A oneflow.Tensor is a multi-dimensional matrix containing elements of a single data type.
Data types¶
OneFlow defines 8 Tensor types with CPU and GPU variants, which are as follows:
Data type | dtype
---|---
Boolean | oneflow.bool
8-bit integer (unsigned) | oneflow.uint8
8-bit integer (signed) | oneflow.int8
64-bit floating point | oneflow.float64
32-bit floating point | oneflow.float32
16-bit floating point | oneflow.float16
32-bit integer (signed) | oneflow.int32
64-bit integer (signed) | oneflow.int64
Initializing and basic operations¶
A tensor can be constructed from a Python list
or sequence using the
oneflow.tensor()
constructor:
>>> import oneflow
>>> import numpy as np
>>> oneflow.tensor([[1., -1.], [1., -1.]])
tensor([[ 1., -1.],
[ 1., -1.]], dtype=oneflow.float32)
>>> oneflow.tensor(np.array([[1, 2, 3], [4, 5, 6]]))
tensor([[ 1, 2, 3],
[ 4, 5, 6]], dtype=oneflow.int64)
Warning
oneflow.tensor() always copies data. If you have a Tensor data and just want to change its requires_grad flag, use requires_grad_() or detach() to avoid a copy. If you have a numpy array and want to avoid a copy, use oneflow.as_tensor().
>>> import oneflow
>>> oneflow.zeros([2, 4], dtype=oneflow.int32)
tensor([[ 0, 0, 0, 0],
[ 0, 0, 0, 0]], dtype=oneflow.int32)
>>> cuda0 = oneflow.device('cuda:0')
>>> oneflow.ones([2, 4], dtype=oneflow.float64, device=cuda0)
tensor([[ 1., 1., 1., 1.],
[ 1., 1., 1., 1.]], device='cuda:0', dtype=oneflow.float64)
For more information about building tensors, see Creation Ops
The contents of a tensor can be accessed and modified using Python’s indexing and slicing notation:
>>> import oneflow
>>> x = oneflow.tensor([[1, 2, 3], [4, 5, 6]])
>>> print(x[1][2])
tensor(6, dtype=oneflow.int64)
>>> x[0][1] = 8
>>> print(x)
tensor([[1, 8, 3],
[4, 5, 6]], dtype=oneflow.int64)
Use oneflow.Tensor.item()
to get a Python number from a tensor containing a
single value:
>>> import oneflow
>>> x = oneflow.tensor([[1]])
>>> x
tensor([[1]], dtype=oneflow.int64)
>>> x.item()
1
>>> x = oneflow.tensor(2.5)
>>> x
tensor(2.5000, dtype=oneflow.float32)
>>> x.item()
2.5
For more information about indexing, see Indexing, Slicing, Joining, Mutating Ops
A tensor can be created with requires_grad=True so that oneflow.autograd records operations on it for automatic differentiation.
>>> import oneflow
>>> x = oneflow.tensor([[1., -1.], [1., 1.]], requires_grad=True)
>>> out = x.pow(2).sum()
>>> out.backward()
>>> x.grad
tensor([[ 2., -2.],
[ 2., 2.]], dtype=oneflow.float32)
Note
For more information on the oneflow.dtype
, oneflow.device
, and
oneflow.layout
attributes of a oneflow.Tensor
, see
Tensor Attributes.
Note
Methods which mutate a tensor are marked with an underscore suffix.
For example, oneflow.FloatTensor.abs_() computes the absolute value in-place and returns the modified tensor, while oneflow.FloatTensor.abs() computes the result in a new tensor.
Note
To change an existing tensor's oneflow.device and/or oneflow.dtype, consider using the to() method of the Tensor object.
Warning
The current implementation of oneflow.Tensor introduces memory overhead, so it might lead to unexpectedly high memory usage in applications with many tiny tensors. If this is your case, consider using one large structure.
Tensor class reference¶
class oneflow.Tensor¶
There are a few main ways to create a tensor, depending on your use case.
- To create a tensor with pre-existing data, use oneflow.tensor().
- To create a tensor with a specific size, use oneflow.* tensor creation ops (see Creation Ops).
- To create a tensor with the same size (and similar types) as another tensor, use oneflow.*_like tensor creation ops (see Creation Ops).
Most Tensor method entries in the original table simply cross-reference the corresponding oneflow.* functions documented above; the remaining summaries are:
- Returns a Tensor of size …
- Returns a Tensor of size size filled with 0.
- Returns a Tensor of size size filled with fill_value.
- Is True if the Tensor is stored on the GPU, False otherwise.
- Return whether this Tensor is a global tensor.
- Return the gradient calculated by autograd functions.
- Computes the gradient of the current tensor w.r.t. graph leaves.
- self.byte() is equivalent to self.to(oneflow.uint8).
- Copies the elements from src into the self tensor and returns self.
- Returns a copy of this object in CPU memory.
- Returns a copy of this object in CUDA memory.
- Tensor.dim() → int
- Tensor.element_size() → int
- Expand this tensor to the same size as …
- Tensor.fill_(value) → Tensor
- For CUDA tensors, this function returns the device ordinal of the GPU on which the tensor resides.
- Return the function that created this tensor if it's …
- self.half() is equivalent to self.to(dtype=oneflow.float16).
- The interface is consistent with PyTorch.
- Returns True if the self tensor is contiguous in memory.
- Return whether this Tensor is a lazy tensor.
- All Tensors that have …
- Returns the value of this tensor as a standard Python number.
- Tensor.nelement() → int
- Fills …
- Tensor.numpy() → numpy.ndarray
- Transfers tensor data from GPU memory back to host (CPU) memory.
- Loads tensor data stored on the host (CPU) back to GPU memory.
- Determines whether the tensor has been moved to CPU memory and the CUDA device memory has been released.
- Registers a backward hook.
- Sets this tensor's requires_grad attribute in-place.
- Returns this tensor as the same shape as other.
- Enables this Tensor to have their …
- Returns the size of the self tensor.
- Returns the self tensor's offset in the underlying storage in terms of the number of storage elements (not bytes).
- Performs Tensor dtype and/or device conversion.
- Creates a global tensor from a local tensor.
- Performs Tensor placement and/or sbp conversion.
- Creates a global tensor if this tensor is a local tensor, otherwise performs Tensor placement and/or sbp conversion.
- Returns the local component of this global tensor in the current rank.
- This interface is no longer available; please use …
- Returns the tensor as a (nested) list.
- Returns this tensor cast to the type of the given tensor.
- Returns the type if dtype is not provided, else casts this object to the specified type.
- Is this Tensor with its dimensions reversed.
- Returns a view of the original tensor which contains all slices of size size from the self tensor in the dimension dimension.
- Tensor.uniform_(from=0, to=1) → Tensor
- Returns a new tensor with the same data as the …
- Expand this tensor to the same size as …
- Fills the self tensor with zeros.
- Copies the tensor to pinned memory, if it's not already pinned.
- Returns True if this tensor resides in pinned memory.
Tensor Attributes¶
Each local oneflow.Tensor has a oneflow.dtype and a oneflow.device, and each global oneflow.Tensor has a oneflow.dtype, a oneflow.placement and a oneflow.sbp.
oneflow.dtype¶
class oneflow.dtype¶
A oneflow.dtype is an object that represents the data type of a oneflow.Tensor. OneFlow has eight different data types:
Data type | dtype
---|---
Boolean | oneflow.bool
8-bit integer (unsigned) | oneflow.uint8
8-bit integer (signed) | oneflow.int8
64-bit floating point | oneflow.float64
32-bit floating point | oneflow.float32
16-bit floating point | oneflow.float16
32-bit integer (signed) | oneflow.int32
64-bit integer (signed) | oneflow.int64
To find out if a oneflow.dtype
is a floating point data type, the property is_floating_point
can be used, which returns True
if the data type is a floating point data type.
When the dtypes of inputs to an arithmetic operation (add, sub, div, mul) differ, we promote by finding the minimum dtype that satisfies the following rules:
If the type of a scalar operand is of a higher category than tensor operands (where complex > floating > integral > boolean), we promote to a type with sufficient size to hold all scalar operands of that category.
If a zero-dimension tensor operand has a higher category than dimensioned operands, we promote to a type with sufficient size and category to hold all zero-dim tensor operands of that category.
If there are no higher-category zero-dim operands, we promote to a type with sufficient size and category to hold all dimensioned operands.
A floating point scalar operand has dtype oneflow.get_default_dtype() and an integral non-boolean scalar operand has dtype oneflow.int64. Unlike numpy, we do not inspect values when determining the minimum dtypes of an operand. Quantized and complex types are not yet supported.
Promotion Examples:
>>> float_tensor = oneflow.ones(1, dtype=oneflow.float)
>>> double_tensor = oneflow.ones(1, dtype=oneflow.double)
>>> int_tensor = oneflow.ones(1, dtype=oneflow.int)
>>> long_tensor = oneflow.ones(1, dtype=oneflow.long)
>>> uint_tensor = oneflow.ones(1, dtype=oneflow.uint8)
>>> bool_tensor = oneflow.ones(1, dtype=oneflow.bool)
# zero-dim tensors
>>> long_zerodim = oneflow.tensor(1, dtype=oneflow.long)
>>> int_zerodim = oneflow.tensor(1, dtype=oneflow.int)
>>> a,b=oneflow.tensor(5),oneflow.tensor(5)
>>> oneflow.add(a, b).dtype
oneflow.int64
# 5 is an int64, but does not have higher category than int_tensor so is not considered.
>>> (int_tensor + 5).dtype
oneflow.int32
>>> (int_tensor + long_zerodim).dtype
oneflow.int64
>>> (long_tensor + int_tensor).dtype
oneflow.int64
>>> (bool_tensor + long_tensor).dtype
oneflow.int64
>>> (bool_tensor + uint_tensor).dtype
oneflow.uint8
>>> (float_tensor + double_tensor).dtype
oneflow.float64
>>> (bool_tensor + int_tensor).dtype
oneflow.int32
# Since long is a different kind than float, result dtype only needs to be large enough
# to hold the float.
>>> oneflow.add(long_tensor, float_tensor).dtype
oneflow.float32
- When the output tensor of an arithmetic operation is specified, we allow casting to its dtype except that:
An integral output tensor cannot accept a floating point tensor.
A boolean output tensor cannot accept a non-boolean tensor.
A non-complex output tensor cannot accept a complex tensor.
Casting Examples:
# allowed:
>>> float_tensor *= float_tensor
>>> float_tensor *= int_tensor
>>> float_tensor *= uint_tensor
>>> float_tensor *= bool_tensor
>>> int_tensor *= uint_tensor
# disallowed (RuntimeError: result type can't be cast to the desired output type):
>>> float_tensor *= double_tensor
>>> int_tensor *= float_tensor
>>> int_tensor *= long_tensor
>>> uint_tensor *= int_tensor
>>> bool_tensor *= int_tensor
>>> bool_tensor *= uint_tensor
oneflow.device¶
-
class
oneflow.
device
¶
A oneflow.device
is an object representing the device on which a oneflow.Tensor
is
or will be allocated.
The oneflow.device
contains a device type ('cpu'
or 'cuda'
) and optional device
ordinal for the device type. If the device ordinal is not present, this object will always represent
the current device for the device type, even after oneflow.cuda.set_device()
is called; e.g.,
a oneflow.Tensor
constructed with device 'cuda'
is equivalent to 'cuda:X'
where X is
the result of oneflow.cuda.current_device()
.
A oneflow.Tensor
’s device can be accessed via the Tensor.device
property.
A oneflow.device
can be constructed via a string or via a string and device ordinal
Via a string:
>>> oneflow.device('cuda:0')
device(type='cuda', index=0)
>>> oneflow.device('cpu')
device(type='cpu', index=0)
>>> oneflow.device('cuda') # current cuda device
device(type='cuda', index=0)
Via a string and device ordinal:
>>> oneflow.device('cuda', 0)
device(type='cuda', index=0)
>>> oneflow.device('cpu', 0)
device(type='cpu', index=0)
Note
The oneflow.device
argument in functions can generally be substituted with a string.
This allows for fast prototyping of code.
>>> # Example of a function that takes in a oneflow.device
>>> cuda1 = oneflow.device('cuda:1')
>>> oneflow.randn((2,3), device=cuda1)
>>> # You can substitute the oneflow.device with a string
>>> oneflow.randn((2,3), device='cuda:1')
Note
For legacy reasons, a device can be constructed via a single device ordinal, which is treated as a cuda device. This matches Tensor.get_device(), which returns an ordinal for cuda tensors and is not supported for cpu tensors.
>>> oneflow.device(1)
device(type='cuda', index=1)
Note
Methods which take a device will generally accept a (properly formatted) string or (legacy) integer device ordinal, i.e. the following are all equivalent:
>>> oneflow.randn((2,3), device=oneflow.device('cuda:1'))
>>> oneflow.randn((2,3), device='cuda:1')
>>> oneflow.randn((2,3), device=1) # legacy
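A common device-agnostic pattern is to pick the device once and pass it everywhere, falling back to CPU when no GPU is visible; a minimal sketch using oneflow.cuda.is_available() (documented in the oneflow.cuda section below):
>>> device = oneflow.device('cuda' if oneflow.cuda.is_available() else 'cpu')
>>> x = oneflow.randn(2, 3, device=device)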
oneflow.placement¶
class oneflow.placement¶
A oneflow.placement is an object representing the device group on which a oneflow.Tensor is or will be allocated. The oneflow.placement contains a device type ('cpu' or 'cuda') and the corresponding device sequence.
A oneflow.Tensor’s placement can be accessed via the Tensor.placement property.
A oneflow.placement can be constructed in several ways:
>>> import oneflow as flow
>>> p = flow.placement(type="cuda", ranks=[0, 1, 2, 3])
>>> p
oneflow.placement(type="cuda", ranks=[0, 1, 2, 3])
>>> p = flow.placement(type="cuda", ranks=[[0, 1], [2, 3]])
>>> p
oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3]])
oneflow.placement.all¶
oneflow.placement.all(device_type) → oneflow.placement¶
Returns a placement that contains all available devices.
- Parameters
  device_type (str) – cuda or cpu
For example:
# Runs on 4 ranks
import oneflow as flow
p = flow.placement.all("cuda")  # oneflow.placement(type="cuda", ranks=[0, 1, 2, 3])
p = flow.placement.all("cpu")   # oneflow.placement(type="cpu", ranks=[0, 1, 2, 3])
oneflow.env.all_device_placement¶
oneflow.env.all_device_placement(device_type) → oneflow.placement¶
Returns a placement that contains all available devices.
Note
It is recommended to use oneflow.placement.all instead of this function.
- Parameters
  device_type (str) – cuda or cpu
For example:
# Runs on 4 ranks
import oneflow as flow
p = flow.env.all_device_placement("cuda")  # oneflow.placement(type="cuda", ranks=[0, 1, 2, 3])
p = flow.env.all_device_placement("cpu")   # oneflow.placement(type="cpu", ranks=[0, 1, 2, 3])
oneflow.sbp.sbp¶
class oneflow.sbp.sbp¶
A oneflow.sbp is an object representing how the data of a global tensor is distributed across the ranks of the Tensor placement.
oneflow.sbp includes three types:
- oneflow.sbp.split(dim): indicates that the global tensor is evenly divided along dimension dim and distributed on each rank.
- oneflow.sbp.broadcast(): indicates that the global tensor is replicated on each rank.
- oneflow.sbp.partial_sum(): indicates that the value of the global tensor is the element-wise sum of the local tensors distributed on each rank.
A oneflow.Tensor’s sbp can be accessed via the Tensor.sbp property.
A oneflow.sbp can be constructed in several ways:
>>> import oneflow as flow
>>> s = flow.sbp.split(0)
>>> s
oneflow.sbp.split(dim=0)
>>> b = flow.sbp.broadcast()
>>> b
oneflow.sbp.broadcast
>>> p = flow.sbp.partial_sum()
>>> p
oneflow.sbp.partial_sum
Type Info¶
The numerical properties of a oneflow.dtype can be accessed through either oneflow.finfo or oneflow.iinfo.
oneflow.finfo¶
class oneflow.finfo¶
A oneflow.finfo is an object that represents the numerical properties of a floating point oneflow.dtype (i.e. oneflow.float32, oneflow.float64 and oneflow.float16). This is similar to numpy.finfo.
A oneflow.finfo provides the following attributes:
Name | Type | Description
---|---|---
bits | int | The number of bits occupied by the type.
eps | float | The smallest representable number such that 1.0 + eps != 1.0.
min | float | The smallest representable number (typically -max).
max | float | The largest representable number.
tiny | float | The smallest positive normal number.
resolution | float | The approximate decimal resolution of this type, i.e., 10**-precision.
For example:
>>> import oneflow as flow
>>> flow.finfo()
finfo(resolution=1e-06, min=-3.40282e+38, max=3.40282e+38, eps=1.19209e-07, tiny=1.17549e-38, dtype=oneflow.float32, bits=32)
>>> flow.finfo(flow.float)
finfo(resolution=1e-06, min=-3.40282e+38, max=3.40282e+38, eps=1.19209e-07, tiny=1.17549e-38, dtype=oneflow.float32, bits=32)
>>> flow.finfo(flow.float16).bits
16
>>> flow.finfo(flow.float16).max
65504.0
oneflow.iinfo¶
class oneflow.iinfo¶
A oneflow.iinfo is an object that represents the numerical properties of an integer oneflow.dtype (i.e. oneflow.uint8, oneflow.int8, oneflow.int16, oneflow.int32, and oneflow.int64). This is similar to numpy.iinfo.
A oneflow.iinfo provides the following attributes:
Name | Type | Description
---|---|---
bits | int | The number of bits occupied by the type.
min | int | The smallest representable number.
max | int | The largest representable number.
For example:
>>> import oneflow as flow
>>> flow.iinfo(flow.int8)
iinfo(min=-128, max=127, dtype=oneflow.int8, bits=8)
>>> flow.iinfo(flow.int).max
2147483647
>>> flow.iinfo(flow.int).bits
32
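One practical use of these objects is avoiding hard-coded numerical constants. A minimal sketch that guards a division with the machine epsilon of the input's dtype:
>>> x = flow.randn(4)
>>> eps = flow.finfo(x.dtype).eps
>>> y = x / (x.abs() + eps)  # denominator stays strictly positive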
oneflow.autograd¶
oneflow.autograd provides classes and functions implementing automatic differentiation of arbitrary scalar-valued functions. It requires minimal changes to the existing code - you only need to declare Tensors for which gradients should be computed with the requires_grad=True keyword. As of now, we only support autograd for floating point Tensor types (half, float, double and bfloat16).
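For example, a minimal end-to-end use of autograd marks a leaf tensor with requires_grad=True, runs a computation, and calls backward() to populate the leaf's grad field:
import oneflow as flow

x = flow.ones(2, 2, requires_grad=True)  # leaf tensor tracked by autograd
y = (x * x).sum()
y.backward()   # populates x.grad with dy/dx = 2 * x
print(x.grad)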
- backward: Computes the sum of gradients of given tensors with respect to graph leaves.
- grad: Computes and returns the sum of gradients of outputs with respect to the inputs.
Locally disabling gradient computation¶
- no_grad: Context-manager that disables gradient calculation.
- enable_grad: Context-manager that enables gradient calculation.
- set_grad_enabled: Context-manager that sets gradient calculation on or off.
- inference_mode: Context-manager that enables or disables inference mode.
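As a short sketch of the context managers listed above (assuming the PyTorch-aligned name oneflow.no_grad), gradient tracking can be switched off for a block of inference code:
import oneflow as flow

x = flow.ones(2, requires_grad=True)
with flow.no_grad():
    y = x * 2           # not recorded by autograd
print(y.requires_grad)  # False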
In-place operations on Tensors¶
Supporting in-place operations in autograd is a hard matter, and we discourage their use in most cases. Autograd’s aggressive buffer freeing and reuse makes it very efficient and there are very few occasions when in-place operations actually lower memory usage by any significant amount. Unless you’re operating under heavy memory pressure, you might never need to use them.
Tensor autograd functions¶
- Tensor.grad: Returns the gradient calculated by autograd functions.
- Tensor.requires_grad: Is True if gradients need to be computed for this Tensor, False otherwise.
- Tensor.is_leaf: All Tensors that have requires_grad set to False are leaf Tensors by convention.
- Tensor.backward: Computes the gradient of the current tensor w.r.t. graph leaves.
- Tensor.register_hook: Registers a backward hook.
- Tensor.retain_grad: Enables this Tensor to have its grad populated during backward().
Function¶
class oneflow.autograd.Function(self)¶
Base class to create custom autograd.Function.
To create a custom autograd.Function, subclass this class and implement the forward() and backward() static methods. Then, to use your custom op in the forward pass, call the class method apply() or __call__(). Do not call forward() directly.
For example:
class Exp(Function):
    @staticmethod
    def forward(ctx, i):
        result = i.exp()
        ctx.save_for_backward(result)
        return result

    @staticmethod
    def backward(ctx, grad_output):
        result, = ctx.saved_tensors
        return grad_output * result

# Use it by calling the apply method or __call__ method
output = Exp.apply(input)  # output = Exp()(input)
- forward: Override this function for custom forward calculation.
- backward: Override this function for custom backward calculation.
- apply: Calculate output tensors and build the backward graph.
oneflow.cuda¶
- is_available: Returns a bool indicating if CUDA is currently available.
- device_count: Returns the number of GPUs available.
- current_device: Returns the local rank as device index.
- set_device: Sets the current device.
- synchronize: Waits for all kernels in all streams on a CUDA device to complete.
- get_device_properties: Gets the properties of a device.
- get_device_capability: Gets the cuda capability of a device.
- get_device_name: Gets the name of a device.
Note
The current_device returns the local rank as device index. This differs from torch.cuda.current_device() in PyTorch.
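A short sketch combining the queries above to inspect the CUDA environment before placing tensors (function names follow the descriptions in the table above):
import oneflow as flow

if flow.cuda.is_available():
    print(flow.cuda.device_count())    # number of GPUs available
    print(flow.cuda.current_device())  # local rank as device index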
Random Number Generator¶
- Sets the seed for generating random numbers on all GPUs.
- Sets the seed for generating random numbers for the current GPU.
- Returns the random number generator state of the specified GPU as a ByteTensor.
- Returns a list of ByteTensors representing the random number states of all devices.
- Sets the random number generator state of the specified GPU.
- Sets the random number generator state of all devices.
GPU tensor¶
- The tensor type oneflow.cuda.HalfTensor is not available.
- The tensor type oneflow.cuda.FloatTensor is not available.
- The tensor type oneflow.cuda.DoubleTensor is not available.
- The tensor type oneflow.cuda.BoolTensor is not available.
- The tensor type oneflow.cuda.ByteTensor is not available.
- The tensor type oneflow.cuda.CharTensor is not available.
- The tensor type oneflow.cuda.IntTensor is not available.
- The tensor type oneflow.cuda.LongTensor is not available.
Memory management¶
- empty_cache: Releases all unoccupied cached memory currently held by the caching allocators of all OneFlow streams so it can be re-allocated in OneFlow streams or other GPU applications and become visible in nvidia-smi.
oneflow.distributed¶
Note
Please refer to OneFlow Distributed Overview for a brief introduction to all features related to distributed training.
OneFlow provides two ways to accomplish Distributed Training:
The first and recommended way is to use OneFlow’s global Tensor for distributed training. Global Tensor regards the computing cluster as a supercomputing device, allowing users to write distributed training code just as they would in a single-machine environment.
OneFlow also provides a DDP (DistributedDataParallel) module aligned with PyTorch. DDP is well known and widely used for data parallelism by the majority of PyTorch users. Also see the PyTorch DDP introduction.
Basic¶
When you start distributed training in OneFlow, the following functions can be used.
- Returns the number of processes in the current process group.
- Returns the rank of the current process group.
- Returns the local rank of the current machine.
- Returns the number of machines in the current process group.
- Initializes RDMA in the current environment.
- Returns whether RDMA is initialized in the current environment or not.
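For example, a script launched with oneflow.distributed.launch can query its position in the cluster. This sketch assumes the query functions live under oneflow.env, matching the descriptions above (flow.env.get_rank() also appears in the DDP example later in this section):
# Run with: python3 -m oneflow.distributed.launch --nproc_per_node 2 train.py
import oneflow as flow

print(flow.env.get_rank())        # rank of this process in the process group
print(flow.env.get_world_size())  # total number of processes
print(flow.env.get_local_rank())  # rank within the current machine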
Global Tensor¶
Construct Global Tensor¶
A Global Tensor can be created with a placement and a sbp. The placement describes the physical devices on which the global tensor will be allocated, and the sbp describes its distribution among these devices.
>>> import oneflow as flow
>>> # Place a global tensor on cuda device of rank(process) 0 and 1
>>> placement = flow.placement(type="cuda", ranks=[0, 1])
>>> # Each rank's local data is a part of the global data, obtained by splitting the global data on dim 0
>>> sbp = flow.sbp.split(dim=0)
>>> # Create a global tensor by randn
>>> x = flow.randn(4, 5, placement=placement, sbp=sbp)
>>> x.shape
oneflow.Size([4, 5])
Convert Local Tensor to Global Tensor¶
With the Tensor.to_global interface, a Local Tensor can create a Global Tensor and provide that Local Tensor as its local component at the current node.
In the example below, two local tensors with shape (2, 5) are created separately on two devices; after the to_global call, a global tensor with shape (4, 5) is obtained.
Code running on Node 0
import oneflow as flow
x = flow.randn(2,5)
placement = flow.placement("cuda", [0,1])
sbp = flow.sbp.split(0)
x_global = x.to_global(placement=placement, sbp=sbp)
x_global.shape
Code running on Node 1
import oneflow as flow
x = flow.randn(2,5)
placement = flow.placement("cuda", [0,1])
sbp = flow.sbp.split(0)
x_global = x.to_global(placement=placement, sbp=sbp)
x_global.shape
Redistribute Global Tensor¶
Redistributing a Global Tensor means moving its data to another device group (or placement), or changing its data distribution (or SBP) across the group, or both at the same time. The redistributed tensor is still a Global Tensor.
>>> import oneflow as flow
>>> x = flow.tensor([1.0, 2.0], placement=flow.placement("cuda", ranks=[0, 1]), sbp=flow.sbp.split(0))
>>> y = x.to_global(placement=flow.placement("cuda", ranks=[2, 3]), sbp=flow.sbp.broadcast)
According to the operator’s semantics, OneFlow defines a set of valid input and output SBP combinations for each built-in operator, so OneFlow can automatically redistribute the Global Tensor to satisfy the operator’s SBP requirements for its input Tensor. For example, in the following code:
>>> import oneflow as flow
>>> x = flow.randn(4, 4,
placement=flow.placement("cuda", ranks=[0, 1]),
sbp=flow.sbp.split(0))
>>> y = flow.randn(4, 4,
placement=flow.placement("cuda", ranks=[0, 1]),
sbp=flow.sbp.split(1))
>>> z = x + y
When x + y is executed, since x is split along dimension 0 and y is split along dimension 1, their local components at each node cannot be added directly; OneFlow automatically redistributes one of x and y so that they have the same SBP, and completes the add operation successfully.
Note
Global Tensor can not be used in combination with DDP currently.
Global Tensor requires all devices to execute at the same pace, otherwise, it may cause multi-process deadlock.
Get Local Tensor from Global Tensor¶
With the Tensor.to_local interface, the Global Tensor can return its local component at the current node.
>>> y = x.to_local()
>>> y.is_local
True
>>> y
tensor([[ 2.9186e-01, -3.9442e-01, 4.7072e-04, -3.2216e-01, 1.7788e-01],
        [-4.5284e-01, 1.2361e-01, -3.5962e-01, 2.6651e-01, 1.2951e+00]],
       device='cuda:0', dtype=oneflow.float32)
DistributedDataParallel¶
For more information about DistributedDataParallel, see nn.parallel.DistributedDataParallel
The following script shows the process of using oneflow.nn.parallel.DistributedDataParallel for data-parallel training:
import oneflow as flow
from oneflow.nn.parallel import DistributedDataParallel as ddp
train_x = [
flow.tensor([[1, 2], [2, 3]], dtype=flow.float32),
flow.tensor([[4, 6], [3, 1]], dtype=flow.float32),
]
train_y = [
flow.tensor([[8], [13]], dtype=flow.float32),
flow.tensor([[26], [9]], dtype=flow.float32),
]
class Model(flow.nn.Module):
def __init__(self):
super().__init__()
self.lr = 0.01
self.iter_count = 500
self.w = flow.nn.Parameter(flow.tensor([[0], [0]], dtype=flow.float32))
def forward(self, x):
x = flow.matmul(x, self.w)
return x
m = Model().to("cuda")
m = ddp(m)
loss = flow.nn.MSELoss(reduction="sum")
optimizer = flow.optim.SGD(m.parameters(), m.lr)
for i in range(0, m.iter_count):
rank = flow.env.get_rank()
x = train_x[rank].to("cuda")
y = train_y[rank].to("cuda")
y_pred = m(x)
l = loss(y_pred, y)
if (i + 1) % 50 == 0:
print(f"{i+1}/{m.iter_count} loss:{l}")
optimizer.zero_grad()
l.backward()
optimizer.step()
print(f"\nw:{m.w}")
There are only two differences between the data parallelism training code and the stand-alone single-card script:
Use DistributedDataParallel to wrap the module object (m = ddp(m))
Use get_rank to get the current device number and distribute the data to the device.
Then use the launcher to run the script and leave everything else to OneFlow, which makes distributed training as simple as stand-alone single-card training:
python3 -m oneflow.distributed.launch --nproc_per_node 2 ./ddp_train.py
Communication collectives¶
- Reduces the tensor data across all machines in such a way that all get the final result.
- Gathers tensors from the whole group into a list.
- Gathers tensors from all ranks and puts them in a single output tensor.
- Each process scatters a list of input tensors to all processes in a group and returns the gathered list of tensors in an output list.
- Broadcasts the tensor to the whole group.
- Synchronizes all processes.
- Gathers a list of tensors in a single process.
- Reduces the tensor data across all machines.
- Reduces, then scatters a list of tensors to all processes in a group.
- Reduces, then scatters a tensor to all ranks.
- Receives a tensor synchronously.
- Scatters a list of tensors to all processes in a group.
- Sends a tensor synchronously.
We also provide PyTorch-compatible APIs for communication collectives, for example, oneflow.distributed.all_reduce(tensor, op=ReduceOp.SUM, group=None, async_op=False). For more information, see PyTorch Distributed Communication. Note that we currently only support op=ReduceOp.SUM, group=None and async_op=False in these operations.
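A minimal sketch of the PyTorch-compatible API mentioned above, assuming ReduceOp is importable from oneflow.distributed as it is in PyTorch; each rank contributes its local tensor and receives the element-wise sum in place:
import oneflow as flow
import oneflow.distributed as dist

t = flow.ones(2) * flow.env.get_rank()    # rank-dependent local tensor
dist.all_reduce(t, op=dist.ReduceOp.SUM)  # t now holds the sum over all ranks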
Launching distributed training¶
Run the commands below to see more about usage.
python3 -m oneflow.distributed.launch -h
usage: launch.py [-h] [--nnodes NNODES] [--node_rank NODE_RANK]
[--nproc_per_node NPROC_PER_NODE] [--master_addr MASTER_ADDR]
[--master_port MASTER_PORT] [-m] [--no_python]
[--redirect_stdout_and_stderr] [--logdir LOGDIR]
training_script ...
OneFlow distributed training launch helper utility that will spawn up multiple
distributed processes
positional arguments:
training_script The full path to the single GPU training program/script to be
launched in parallel, followed by all the arguments for the
training script
training_script_args
optional arguments:
-h, --help show this help message and exit
--nnodes NNODES The number of nodes to use for distributed training
--node_rank NODE_RANK
The rank of the node for multi-node distributed training
--nproc_per_node NPROC_PER_NODE
The number of processes to launch on each node, for GPU
training, this is recommended to be set to the number of GPUs in
your system so that each process can be bound to a single GPU.
--master_addr MASTER_ADDR
Master node (rank 0)'s address, should be either the IP address
or the hostname of node 0, for single node multi-proc training,
the --master_addr can simply be 127.0.0.1
--master_port MASTER_PORT
Master node (rank 0)'s free port that needs to be used for
communication during distributed training
-m, --module Changes each process to interpret the launch script as a python
module, executing with the same behavior as 'python -m'.
--no_python Do not prepend the training script with "python" - just exec it
directly. Useful when the script is not a Python script.
--redirect_stdout_and_stderr
write the stdout and stderr to files 'stdout' and 'stderr'. Only
available when logdir is set
--logdir LOGDIR Relative path to write subprocess logs to. Passing in a relative
path will create a directory if needed. Note that successive
runs with the same path to write logs to will overwrite existing
logs, so be sure to save logs as needed.
oneflow.distributions¶
- Distribution: The abstract base class for probability distributions.
- Categorical: Creates a categorical distribution parameterized by either probs or logits (but not both).
oneflow.hub¶
Oneflow Hub is a pre-trained model repository designed to facilitate research reproducibility.
Publishing models¶
Oneflow Hub supports publishing pre-trained models (model definitions and pre-trained weights) to a GitHub repository by adding a simple hubconf.py file.
hubconf.py can have multiple entrypoints. Each entrypoint is defined as a python function (for example: a pre-trained model you want to publish).
def entrypoint_name(*args, **kwargs):
# args & kwargs are optional, for models which take positional/keyword arguments.
...
How to implement an entrypoint?¶
Here is a code snippet that specifies an entrypoint for the resnet18 model, expanding the implementation in Oneflow-Inc/vision/hubconf.py.
In most cases, importing the right function in hubconf.py is sufficient. Here we just use the expanded version as an example to show how it works.
You can see the full script in the Oneflow-Inc/vision repo.
dependencies = ['oneflow']
from flowvision.models.resnet import resnet18 as _resnet18
# resnet18 is the name of entrypoint
def resnet18(pretrained=False, **kwargs):
""" # This docstring shows up in hub.help()
Resnet18 model
pretrained (bool): kwargs, load pretrained weights into the model
"""
# Call the model, load pretrained weights
model = _resnet18(pretrained=pretrained, **kwargs)
return model
- The dependencies variable is a list of package names required to load the model. Note this might be slightly different from the dependencies required for training a model.
- args and kwargs are passed along to the real callable function.
- The docstring of the function works as a help message. It explains what the model does and what the allowed positional/keyword arguments are. It’s highly recommended to add a few examples here.
- An entrypoint function can either return a model (nn.Module), or auxiliary tools to make the user workflow smoother, e.g. tokenizers.
- Callables prefixed with an underscore are considered helper functions which won’t show up in oneflow.hub.list().
- Pretrained weights can either be stored locally in the github repo, or be loadable by oneflow.hub.load_state_dict_from_url(). If less than 2GB, it’s recommended to attach them to a project release and use the url from the release. In the example above flowvision.models.resnet.resnet18 handles pretrained; alternatively, you can put the following logic in the entrypoint definition.
if pretrained:
# For checkpoint saved in local github repo, e.g. <RELATIVE_PATH_TO_CHECKPOINT>=weights/save.pth
dirname = os.path.dirname(__file__)
checkpoint = os.path.join(dirname, <RELATIVE_PATH_TO_CHECKPOINT>)
state_dict = oneflow.load(checkpoint)
model.load_state_dict(state_dict)
# For checkpoint saved elsewhere
checkpoint = 'https://oneflow-public.oss-cn-beijing.aliyuncs.com/model_zoo/flowvision/classification/ResNet/resnet18.zip'
model.load_state_dict(oneflow.hub.load_state_dict_from_url(checkpoint, progress=False))
Important Notice¶
The published models should be at least in a branch or tag; they can’t be a random commit.
Loading models from Hub¶
OneFlow Hub provides convenient APIs to explore all available models in hub through oneflow.hub.list(), show docstrings and examples through oneflow.hub.help(), and load the pre-trained models using oneflow.hub.load().
oneflow.hub.list(github, force_reload=False, skip_validation=False, trust_repo=None)¶
List all callable entrypoints available in the repo specified by github.
- Parameters
  github (str) – a string with format “repo_owner/repo_name[:ref]” with an optional ref (tag or branch). If ref is not specified, the default branch is assumed to be main if it exists, and otherwise master. Example: ‘Oneflow-Inc/vision:0.2.0’
  force_reload (bool, optional) – whether to discard the existing cache and force a fresh download. Default is False.
  skip_validation (bool, optional) – if False, OneFlow Hub will check that the branch or commit specified by the github argument properly belongs to the repo owner. This will make requests to the GitHub API; you can specify a non-default GitHub token by setting the GITHUB_TOKEN environment variable. Default is False.
  trust_repo (bool, str or None) – "check", True, False or None. This parameter was introduced in v1.12 and helps ensure that users only run code from repos that they trust.
    If False, a prompt will ask the user whether the repo should be trusted.
    If True, the repo will be added to the trusted list and loaded without requiring explicit confirmation.
    If "check", the repo will be checked against the list of trusted repos in the cache. If it is not present in that list, the behaviour will fall back onto the trust_repo=False option.
    If None, this will raise a warning, inviting the user to set trust_repo to either False, True or "check". This is only present for backward compatibility and will be removed in v1.14.
    Default is None and will eventually change to "check" in v1.14.
- Returns
  The available callable entrypoints.
- Return type
  list
For example:
>>> entrypoints = oneflow.hub.list('Oneflow-Inc/vision', force_reload=True)
oneflow.hub.help(github, model, force_reload=False, skip_validation=False, trust_repo=None)¶
Show the docstring of entrypoint model.
- Parameters
  github (str) – a string with format <repo_owner/repo_name[:ref]> with an optional ref (a tag or a branch). If ref is not specified, the default branch is assumed to be main if it exists, and otherwise master. Example: ‘Oneflow-Inc/vision:0.2.0’
  model (str) – a string of entrypoint name defined in the repo’s hubconf.py
  force_reload (bool, optional) – whether to discard the existing cache and force a fresh download. Default is False.
  skip_validation (bool, optional) – if False, OneFlow Hub will check that the ref specified by the github argument properly belongs to the repo owner. This will make requests to the GitHub API; you can specify a non-default GitHub token by setting the GITHUB_TOKEN environment variable. Default is False.
  trust_repo (bool, str or None) – "check", True, False or None. This parameter was introduced in v1.12 and helps ensure that users only run code from repos that they trust.
    If False, a prompt will ask the user whether the repo should be trusted.
    If True, the repo will be added to the trusted list and loaded without requiring explicit confirmation.
    If "check", the repo will be checked against the list of trusted repos in the cache. If it is not present in that list, the behaviour will fall back onto the trust_repo=False option.
    If None, this will raise a warning, inviting the user to set trust_repo to either False, True or "check". This is only present for backward compatibility and will be removed in v1.14.
    Default is None and will eventually change to "check" in v1.14.
For example:
>>> print(oneflow.hub.help('Oneflow-Inc/vision', 'resnet18', force_reload=True))
oneflow.hub.load(repo_or_dir, model, *args, source='github', trust_repo=None, force_reload=False, verbose=True, skip_validation=False, **kwargs)¶
Load a model from a github repo or a local directory.
Note: Loading a model is the typical use case, but this can also be used to load other objects such as tokenizers, loss functions, etc.
If source is ‘github’, repo_or_dir is expected to be of the form repo_owner/repo_name[:ref] with an optional ref (a tag or a branch). If source is ‘local’, repo_or_dir is expected to be a path to a local directory.
- Parameters
  repo_or_dir (str) – If source is ‘github’, this should correspond to a github repo with format repo_owner/repo_name[:ref] with an optional ref (tag or branch), for example ‘Oneflow-Inc/vision:0.2.0’. If ref is not specified, the default branch is assumed to be main if it exists, and otherwise master. If source is ‘local’ then it should be a path to a local directory.
  model (str) – the name of a callable (entrypoint) defined in the repo/dir’s hubconf.py.
  *args (optional) – the corresponding args for callable model.
  source (str, optional) – ‘github’ or ‘local’. Specifies how repo_or_dir is to be interpreted. Default is ‘github’.
  trust_repo (bool, str or None) – "check", True, False or None. This parameter was introduced in v1.12 and helps ensure that users only run code from repos that they trust.
    If False, a prompt will ask the user whether the repo should be trusted.
    If True, the repo will be added to the trusted list and loaded without requiring explicit confirmation.
    If "check", the repo will be checked against the list of trusted repos in the cache. If it is not present in that list, the behaviour will fall back onto the trust_repo=False option.
    If None, this will raise a warning, inviting the user to set trust_repo to either False, True or "check". This is only present for backward compatibility and will be removed in v1.14.
    Default is None and will eventually change to "check" in v1.14.
  force_reload (bool, optional) – whether to force a fresh download of the github repo unconditionally. Does not have any effect if source = 'local'. Default is False.
  verbose (bool, optional) – If False, mute messages about hitting local caches. Note that the message about first download cannot be muted. Does not have any effect if source = 'local'. Default is True.
  skip_validation (bool, optional) – if False, OneFlow Hub will check that the branch or commit specified by the github argument properly belongs to the repo owner. This will make requests to the GitHub API; you can specify a non-default GitHub token by setting the GITHUB_TOKEN environment variable. Default is False.
  **kwargs (optional) – the corresponding kwargs for callable model.
- Returns
  The output of the model callable when called with the given *args and **kwargs.
For example:
>>> # from a github repo
>>> repo = 'Oneflow-Inc/vision'
>>> model = oneflow.hub.load(repo, 'resnet50', weights='ResNet50_Weights.IMAGENET1K_V1')
>>> # from a local directory
>>> path = '/some/local/path/oneflow/vision'
>>> # xdoctest: +SKIP
>>> model = oneflow.hub.load(path, 'resnet50', weights='ResNet50_Weights.DEFAULT')
oneflow.hub.download_url_to_file(url, dst, hash_prefix=None, progress=True)¶
Download the object at the given URL to a local path.
- Parameters
  url (str) – URL of the object to download
  dst (str) – Full path where the object will be saved, e.g. /tmp/temporary_file
  hash_prefix (str, optional) – If not None, the SHA256 of the downloaded file should start with hash_prefix. Default: None
  progress (bool, optional) – whether or not to display a progress bar to stderr. Default: True
For example:
>>> # xdoctest: +REQUIRES(POSIX)
>>> oneflow.hub.download_url_to_file('https://oneflow-public.oss-cn-beijing.aliyuncs.com/model_zoo/flowvision/classification/ResNet/resnet18.zip', '/tmp/temporary_file')
oneflow.hub.load_state_dict_from_url(url: str, model_dir: Optional[str] = None, map_location=None, progress: bool = True, check_hash: bool = False, file_name: Optional[str] = None) → Dict[str, Any]¶
Loads the OneFlow serialized object at the given URL.
If the downloaded file is a zip file, it will be automatically decompressed.
If the object is already present in model_dir, it is deserialized and returned. The default value of model_dir is <hub_dir>/checkpoints where hub_dir is the directory returned by get_dir().
- Parameters
  url (str) – URL of the object to download
  model_dir (str, optional) – directory in which to save the object
  map_location (optional) – a function or a dict specifying how to remap storage locations (see oneflow.load)
  progress (bool, optional) – whether or not to display a progress bar to stderr. Default: True
  check_hash (bool, optional) – If True, the filename part of the URL should follow the naming convention filename-<sha256>.ext where <sha256> is the first eight or more digits of the SHA256 hash of the contents of the file. The hash is used to ensure unique names and to verify the contents of the file. Default: False
  file_name (str, optional) – name for the downloaded file. The filename from url will be used if not set.
For example:
>>> state_dict = oneflow.hub.load_state_dict_from_url('https://oneflow-public.oss-cn-beijing.aliyuncs.com/model_zoo/flowvision/classification/ResNet/resnet18.zip')
Running a loaded model:¶
Note that *args and **kwargs in oneflow.hub.load() are used to instantiate a model. After you have loaded a model, how can you find out what you can do with it? A suggested workflow is:
- dir(model) to see all available methods of the model.
- help(model.foo) to check what arguments model.foo takes to run.
To help users explore without referring to documentation back and forth, we strongly recommend repo owners make function help messages clear and succinct. It’s also helpful to include a minimal working example.
Where are my downloaded models saved?¶
The locations are used in the following order:
- the directory set by calling hub.set_dir(<PATH_TO_HUB_DIR>)
- $ONEFLOW_HOME/hub, if environment variable ONEFLOW_HOME is set.
- $XDG_CACHE_HOME/oneflow/hub, if environment variable XDG_CACHE_HOME is set.
- ~/.cache/oneflow/hub
oneflow.hub.get_dir()¶
Get the OneFlow Hub cache directory used for storing downloaded models & weights.
If set_dir() is not called, the default path is $ONEFLOW_HOME/hub where the environment variable $ONEFLOW_HOME defaults to $XDG_CACHE_HOME/oneflow. $XDG_CACHE_HOME follows the X Desktop Group specification of the Linux filesystem layout, with a default value of ~/.cache if the environment variable is not set.
oneflow.hub.set_dir(d)¶
Optionally set the OneFlow Hub directory used to save downloaded models & weights.
- Parameters
  d (str) – path to a local folder to save downloaded models & weights.
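For example (the path below is purely illustrative), redirecting the hub cache and confirming the change:
>>> oneflow.hub.set_dir('/data/oneflow_hub_cache')
>>> oneflow.hub.get_dir()
'/data/oneflow_hub_cache'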
Caching logic¶
By default, we don’t clean up files after loading them. Hub uses the cache by default if it already exists in the directory returned by get_dir().
Users can force a reload by calling hub.load(..., force_reload=True). This will delete the existing github folder and downloaded weights, and reinitialize a fresh download. This is useful when updates are published to the same branch, so users can keep up with the latest release.
Known limitations:¶
Oneflow hub works by importing the package as if it were installed. There are some side effects introduced by importing in Python. For example, you can see new items in the Python caches sys.modules and sys.path_importer_cache, which is normal Python behavior.
This also means that you may run into import errors when importing different models from different repos, if the repos have the same sub-package names (typically, a model subpackage). A workaround for these kinds of import errors is to remove the offending sub-package from the sys.modules dict; more details can be found in this github issue.
A known limitation that is worth mentioning here: users CANNOT load two different branches of the same repo in the same python process. It’s just like installing two packages with the same name in Python, which is not good. The cache might join the party and give you surprises if you actually try that. Of course, it’s totally fine to load them in separate processes.
oneflow.linalg¶
Common linear algebra operations.
Matrix Properties¶
- norm: Returns the matrix norm or vector norm of a given tensor.
- vector_norm: Computes a vector norm.
- matrix_norm: Computes a matrix norm.
- inv: Computes the inverse of a square matrix if it exists.
- cross: Computes the cross product of two 3-dimensional vectors.
- det: Computes the determinant of a square matrix.
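For example, a short sketch using the norm and determinant routines listed above:
>>> import oneflow as flow
>>> x = flow.tensor([[1.0, 2.0], [3.0, 4.0]])
>>> n = flow.linalg.norm(x)  # Frobenius norm by default: sqrt(30)
>>> d = flow.linalg.det(x)   # 1*4 - 2*3 = -2.0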
oneflow.nn.init¶
oneflow.nn.init.calculate_gain(nonlinearity, param=None)¶
oneflow.nn.init.uniform_(tensor, a=0.0, b=1.0)¶
Fills the input Tensor with values drawn from the uniform distribution \(\mathcal{U}(a, b)\).
The interface is consistent with PyTorch. The documentation is referenced from: https://pytorch.org/docs/1.10/nn.init.html.
- Parameters
tensor – an n-dimensional oneflow.Tensor
a – the lower bound of the uniform distribution
b – the upper bound of the uniform distribution
Examples
>>> w = flow.empty(3, 5)
>>> nn.init.uniform_(w)
oneflow.nn.init.normal_(tensor, mean=0.0, std=1.0)¶
Fills the input Tensor with values drawn from the normal distribution \(\mathcal{N}(\text{mean}, \text{std}^2)\).
The interface is consistent with PyTorch. The documentation is referenced from: https://pytorch.org/docs/1.10/nn.init.html.
- Parameters
tensor – an n-dimensional oneflow.Tensor
mean – the mean of the normal distribution
std – the standard deviation of the normal distribution
Examples
>>> w = flow.empty(3, 5)
>>> nn.init.normal_(w)
oneflow.nn.init.constant_(tensor, val)¶
Fills the input Tensor with the value \(\text{val}\).
The interface is consistent with PyTorch. The documentation is referenced from: https://pytorch.org/docs/1.10/nn.init.html.
- Parameters
tensor – an n-dimensional oneflow.Tensor
val – the value to fill the tensor with
Examples
>>> w = flow.empty(3, 5)
>>> nn.init.constant_(w, 0.3)
oneflow.nn.init.ones_(tensor)¶
Fills the input Tensor with the scalar value 1.
The interface is consistent with PyTorch. The documentation is referenced from: https://pytorch.org/docs/1.10/nn.init.html.
- Parameters
tensor – an n-dimensional oneflow.Tensor
Examples
>>> w = flow.empty(3, 5)
>>> nn.init.ones_(w)
oneflow.nn.init.zeros_(tensor)¶
Fills the input Tensor with the scalar value 0.
The interface is consistent with PyTorch. The documentation is referenced from: https://pytorch.org/docs/1.10/nn.init.html.
- Parameters
tensor – an n-dimensional oneflow.Tensor
Examples
>>> w = flow.empty(3, 5)
>>> nn.init.zeros_(w)
oneflow.nn.init.xavier_uniform_(tensor, gain=1.0, *, data_format='NCHW')¶
Fills the input Tensor with values according to the method described in Understanding the difficulty of training deep feedforward neural networks - Glorot, X. & Bengio, Y. (2010), using a uniform distribution. The resulting tensor will have values sampled from \(\mathcal{U}(-a, a)\) where
\[a = \text{gain} \times \sqrt{\frac{6}{\text{fan_in} + \text{fan_out}}}\]
The interface is consistent with PyTorch. The documentation is referenced from: https://pytorch.org/docs/1.10/nn.init.html.
Also known as Glorot initialization.
- Parameters
tensor – an n-dimensional oneflow.Tensor
gain – an optional scaling factor
Examples
>>> w = flow.empty(3, 5)
>>> nn.init.xavier_uniform_(w, gain=nn.init.calculate_gain('relu'))
oneflow.nn.init.xavier_normal_(tensor, gain=1.0, *, data_format='NCHW')¶
Fills the input Tensor with values according to the method described in Understanding the difficulty of training deep feedforward neural networks - Glorot, X. & Bengio, Y. (2010), using a normal distribution. The resulting tensor will have values sampled from \(\mathcal{N}(0, \text{std}^2)\) where
\[\text{std} = \text{gain} \times \sqrt{\frac{2}{\text{fan_in} + \text{fan_out}}}\]
The interface is consistent with PyTorch. The documentation is referenced from: https://pytorch.org/docs/1.10/nn.init.html.
Also known as Glorot initialization.
- Parameters
tensor – an n-dimensional oneflow.Tensor
gain – an optional scaling factor
Examples
>>> w = flow.empty(3, 5)
>>> nn.init.xavier_normal_(w)
oneflow.nn.init.kaiming_uniform_(tensor, a=0, mode='fan_in', nonlinearity='leaky_relu', *, data_format='NCHW')¶
Fills the input Tensor with values according to the method described in Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification - He, K. et al. (2015), using a uniform distribution. The resulting tensor will have values sampled from \(\mathcal{U}(-\text{bound}, \text{bound})\) where
\[\text{bound} = \text{gain} \times \sqrt{\frac{3}{\text{fan_mode}}}\]
The interface is consistent with PyTorch. The documentation is referenced from: https://pytorch.org/docs/1.10/nn.init.html.
Also known as He initialization.
- Parameters
  tensor – an n-dimensional oneflow.Tensor
  a – the negative slope of the rectifier used after this layer (only used with 'leaky_relu')
  mode – either 'fan_in' (default) or 'fan_out'. Choosing 'fan_in' preserves the magnitude of the variance of the weights in the forward pass. Choosing 'fan_out' preserves the magnitudes in the backwards pass.
  nonlinearity – the non-linear function (nn.functional name), recommended to use only with 'relu' or 'leaky_relu' (default).
Examples
>>> w = flow.empty(3, 5)
>>> nn.init.kaiming_uniform_(w, mode='fan_in', nonlinearity='relu')
oneflow.nn.init.kaiming_normal_(tensor, a=0, mode='fan_in', nonlinearity='leaky_relu', *, data_format='NCHW')¶
Fills the input Tensor with values according to the method described in Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification - He, K. et al. (2015), using a normal distribution. The resulting tensor will have values sampled from \(\mathcal{N}(0, \text{std}^2)\) where
\[\text{std} = \frac{\text{gain}}{\sqrt{\text{fan_mode}}}\]
The interface is consistent with PyTorch. The documentation is referenced from: https://pytorch.org/docs/1.10/nn.init.html.
Also known as He initialization.
- Parameters
  tensor – an n-dimensional oneflow.Tensor
  a – the negative slope of the rectifier used after this layer (only used with 'leaky_relu')
  mode – either 'fan_in' (default) or 'fan_out'. Choosing 'fan_in' preserves the magnitude of the variance of the weights in the forward pass. Choosing 'fan_out' preserves the magnitudes in the backwards pass.
  nonlinearity – the non-linear function (nn.functional name), recommended to use only with 'relu' or 'leaky_relu' (default).
Examples
>>> w = flow.empty(3, 5)
>>> nn.init.kaiming_normal_(w, mode='fan_out', nonlinearity='relu')
oneflow.nn.init.trunc_normal_(tensor, mean=0.0, std=1.0, a=-2.0, b=2.0)¶
oneflow.nn.init.orthogonal_(tensor, gain=1.0)¶
Fills the input Tensor with a (semi) orthogonal matrix, as described in Exact solutions to the nonlinear dynamics of learning in deep linear neural networks - Saxe, A. et al. (2013). The input tensor must have at least 2 dimensions, and for tensors with more than 2 dimensions the trailing dimensions are flattened.
The interface is consistent with PyTorch. The documentation is referenced from: https://pytorch.org/docs/1.10/nn.init.html.
- Parameters
tensor – an n-dimensional oneflow.Tensor, where \(n \geq 2\)
gain – optional scaling factor
Examples
>>> w = flow.empty(3, 5)
>>> nn.init.orthogonal_(w)
oneflow.optim¶
oneflow.optim is a package implementing various optimization algorithms. The most commonly used methods are already supported, and the interface is general enough that more sophisticated ones can also be easily integrated in the future.
How to use an optimizer¶
To use oneflow.optim you have to construct an optimizer object that will hold the current state and will update the parameters based on the computed gradients.
Constructing it¶
To construct an Optimizer you have to give it an iterable containing the parameters (all should be Variables) to optimize. Then, you can specify optimizer-specific options such as the learning rate, weight decay, etc.
Note
If you need to move a model to GPU via .cuda(), please do so before constructing optimizers for it. Parameters of a model after .cuda() will be different objects from those before the call.
In general, you should make sure that optimized parameters live in consistent locations when optimizers are constructed and used.
Example:
import oneflow
import oneflow.nn as nn
import oneflow.optim as optim
model = nn.Linear(16, 3)
optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
Per-parameter options¶
Optimizer also supports specifying per-parameter options. To do this, instead of passing an iterable of Variables, pass in an iterable of dicts. Each of them will define a separate parameter group, and should contain a params key, containing a list of parameters belonging to it. Other keys should match the keyword arguments accepted by the optimizers, and will be used as optimization options for this group.
Note
You can still pass options as keyword arguments. They will be used as defaults, in the groups that didn’t override them. This is useful when you only want to vary a single option, while keeping all others consistent between parameter groups.
For example, this is very useful when one wants to specify per-layer learning rates:
import oneflow.nn as nn
import oneflow.optim as optim
class Model(nn.Module):
def __init__(self):
super(Model, self).__init__()
self.base = nn.Linear(64, 32)
self.classifier = nn.Linear(32, 10)
def forward(self, x):
out = self.base(x)
out = self.classifier(out)
return out
model = Model()
optim.SGD(
[
{"params": model.base.parameters()},
{"params": model.classifier.parameters(), "lr": 1e-3},
],
lr=1e-2,
momentum=0.9,
)
This means that model.base’s parameters will use the default learning rate of 1e-2, model.classifier’s parameters will use a learning rate of 1e-3, and a momentum of 0.9 will be used for all parameters.
Taking an optimization step¶
All optimizers implement a step() method that updates the parameters. It can be used in two ways:
optimizer.step()¶
This is a simplified version supported by most optimizers. The function can be called once the gradients are computed, e.g. using backward().
Example:
import oneflow
import oneflow.nn as nn
import oneflow.nn.functional as F
import oneflow.optim as optim
from oneflow.utils.data import Dataset, DataLoader
class CustomDataset(Dataset):
def __init__(self, num):
self.inputs = oneflow.randn(num, 1)
self.targets = oneflow.sin(self.inputs)
def __len__(self):
return self.inputs.shape[0]
def __getitem__(self, index):
return self.inputs[index], self.targets[index]
class Model(nn.Module):
def __init__(self, input_size):
super(Model, self).__init__()
self.linear1 = nn.Linear(input_size, 64)
self.linear2 = nn.Linear(64, input_size)
def forward(self, x):
out = self.linear1(x)
return self.linear2(F.relu(out))
dataset = CustomDataset(10000)
dataloader = DataLoader(dataset, batch_size=10)
model = Model(1)
loss_fn = nn.MSELoss()
optimizer = optim.SGD(model.parameters(), lr=1e-3)
for epoch in range(100):
for input, target in dataloader:
optimizer.zero_grad()
output = model(input)
loss = loss_fn(output, target)
loss.backward()
optimizer.step()
Base class¶
class oneflow.optim.Optimizer(parameters, options)¶
- add_param_group: Add a param group to the Optimizer’s param_groups.
- load_state_dict: Load the state of the optimizer which is created by the state_dict function.
- state_dict: Returns the state of the optimizer as a dict.
- step: Performs a single optimization step (parameter update).
- zero_grad: Sets the gradients of all optimized Tensors to zero.
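The state_dict/load_state_dict pair supports the usual checkpoint round trip; a minimal sketch:
import oneflow as flow

model = flow.nn.Linear(4, 2)
optimizer = flow.optim.SGD(model.parameters(), lr=0.01)

state = optimizer.state_dict()        # capture the optimizer state
new_optimizer = flow.optim.SGD(model.parameters(), lr=0.01)
new_optimizer.load_state_dict(state)  # restore it into a fresh optimizer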
Algorithms¶
Adjust Learning Rate¶
oneflow.optim.lr_scheduler provides several methods to adjust the learning rate based on the number of epochs. oneflow.optim.lr_scheduler.ReduceLROnPlateau allows dynamic learning rate reduction based on some validation measurements.
Learning rate scheduling should be applied after the optimizer’s update; e.g., you should write your code this way:
Example:
import oneflow
import oneflow.nn as nn
import oneflow.nn.functional as F
import oneflow.optim as optim
from oneflow.utils.data import Dataset, DataLoader
class CustomDataset(Dataset):
def __init__(self, num):
self.inputs = oneflow.randn(num, 1)
self.targets = oneflow.sin(self.inputs)
def __len__(self):
return self.inputs.shape[0]
def __getitem__(self, index):
return self.inputs[index], self.targets[index]
class Model(nn.Module):
def __init__(self, input_size):
super(Model, self).__init__()
self.linear1 = nn.Linear(input_size, 64)
self.linear2 = nn.Linear(64, input_size)
def forward(self, x):
out = self.linear1(x)
return self.linear2(F.relu(out))
dataset = CustomDataset(10000)
dataloader = DataLoader(dataset, batch_size=10)
model = Model(1)
loss_fn = nn.MSELoss()
optimizer = optim.SGD(model.parameters(), lr=1e-3)
scheduler = optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.9)
for epoch in range(20):
for input, target in dataloader:
optimizer.zero_grad()
output = model(input)
loss = loss_fn(output, target)
loss.backward()
optimizer.step()
scheduler.step()
Most learning rate schedulers can be chained (also referred to as chaining schedulers).
Example:
import oneflow
import oneflow.nn as nn
import oneflow.nn.functional as F
import oneflow.optim as optim
from oneflow.utils.data import Dataset, DataLoader
class CustomDataset(Dataset):
def __init__(self, num):
self.inputs = oneflow.randn(num, 1)
self.targets = oneflow.sin(self.inputs)
def __len__(self):
return self.inputs.shape[0]
def __getitem__(self, index):
return self.inputs[index], self.targets[index]
class Model(nn.Module):
def __init__(self, input_size):
super(Model, self).__init__()
self.linear1 = nn.Linear(input_size, 64)
self.linear2 = nn.Linear(64, input_size)
def forward(self, x):
out = self.linear1(x)
return self.linear2(F.relu(out))
dataset = CustomDataset(10000)
dataloader = DataLoader(dataset, batch_size=10)
model = Model(1)
loss_fn = nn.MSELoss()
optimizer = optim.SGD(model.parameters(), lr=1e-3)
scheduler1 = optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.9)
scheduler2 = optim.lr_scheduler.MultiStepLR(optimizer, milestones=[5, 10], gamma=0.1)
for epoch in range(20):
for input, target in dataloader:
optimizer.zero_grad()
output = model(input)
loss = loss_fn(output, target)
loss.backward()
optimizer.step()
scheduler1.step()
scheduler2.step()
In many places in the documentation, we will use the following template to refer to scheduler algorithms.
>>> scheduler = ...
>>> for epoch in range(100):
>>> train(...)
>>> validate(...)
>>> scheduler.step()
Warning
If you use the learning rate scheduler (calling scheduler.step()) before the optimizer’s update (calling optimizer.step()), this will skip the first value of the learning rate schedule. Please check if you are calling scheduler.step() at the wrong time.
- Set the learning rate of each parameter group using a cosine annealing schedule, where \(\eta_{max}\) is set to the initial lr and \(T_{cur}\) is the number of epochs since the last restart in SGDR.
- This operator creates a Cosine decayed learning rate scheduler.
- Decays the learning rate of each parameter group by gamma every epoch.
- Sets the learning rate of each parameter group to the initial lr times a given function.
- Decays the learning rate of each parameter group by gamma once the number of steps reaches one of the milestones.
- This operator creates a polynomial decayed learning rate scheduler.
- Reduces the learning rate when a metric has stopped improving.
- Decays the learning rate of each parameter group by gamma every step_size steps.
- Decays the learning rate of each parameter group by a small constant factor until the number of steps reaches a pre-defined milestone: total_iters.
- Decays the learning rate of each parameter group by linearly changing a small multiplicative factor until the number of steps reaches a pre-defined milestone: total_iters.
- Chains a list of learning rate schedulers.
- Receives the list of schedulers that is expected to be called sequentially during the optimization process and milestone points that provide exact intervals to reflect which scheduler is supposed to be called at a given step.
- Set the learning rate of each parameter group using a cosine annealing schedule, where \(\eta_{max}\) is set to the initial lr, \(T_{cur}\) is the number of steps since the last restart and \(T_{i}\) is the number of steps between two warm restarts in SGDR.
oneflow.nn.Graph¶
Base class for running neural networks in Static Graph Mode.
Currently, there are two main ways to run models in deep learning frameworks, namely dynamic graphs and static graphs, which are conventionally referred to as Eager Mode and Static Graph Mode in OneFlow.
Both approaches have their advantages and disadvantages, and OneFlow provides support for both, with Eager Mode being the default.
Generally speaking, dynamic graphs are easier to use and static graphs have more performance advantages. The oneflow.nn.Graph module is provided by OneFlow to allow users to build static graphs and train models with Eager-like programming conventions.
Eager Mode to Static Graph Mode¶
OneFlow runs in Eager mode by default.
OneFlow’s nn.Graph is programmed in a style very similar to Eager Mode, so it is possible to make small changes and get large performance gains.
The following script shows the process of building a neural network in eager mode using the interface under oneflow.nn:
import oneflow as flow
import oneflow.nn as nn
class ModuleMyLinear(nn.Module):
def __init__(self, in_features, out_features):
super().__init__()
self.weight = nn.Parameter(flow.randn(in_features, out_features))
self.bias = nn.Parameter(flow.randn(out_features))
def forward(self, input):
return flow.matmul(input, self.weight) + self.bias
linear_model = ModuleMyLinear(4, 3)
An eager nn.Module can be reused by nn.Graph. The above script for eager mode can be changed to static Graph mode by adding just a few lines of code, which consists of the following steps:
- Define your customized graph as a subclass of nn.Graph.
- At the beginning of __init__, call super().__init__() to let OneFlow do the necessary initialization of the Graph.
- Reuse the nn.Module object from Eager mode in __init__ (self.model = model).
- Describe the computation in the build method.
- Instantiate your graph, then call it.
class GraphMyLinear(nn.Graph):
def __init__(self):
super().__init__()
self.model = linear_model
def build(self, input):
return self.model(input)
graph_mylinear = GraphMyLinear()
input = flow.randn(1, 4)
out = graph_mylinear(input)
print(out)
tensor([[-0.3298, -3.7907, 0.1661]], dtype=oneflow.float32)
Static Graph Mode¶
Constructing a Graph¶
Base class for training or evaluating a neural network in static graph mode.
- __init__: Initializes internal Graph states.
- build: Describes the computation of the customized graph; it must be overridden by subclasses.
- add_optimizer: Add an optimizer and a learning rate scheduler to the graph.
- set_grad_scaler: Set the GradScaler for gradient and loss scaling.
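A minimal training-graph sketch combining these pieces: the optimizer is registered with add_optimizer in __init__, and build describes both the forward computation and the backward call:
import oneflow as flow

model = flow.nn.Linear(4, 3)

class TrainGraph(flow.nn.Graph):
    def __init__(self):
        super().__init__()
        self.model = model
        self.loss_fn = flow.nn.MSELoss()
        # Register the optimizer so the graph also performs the update step.
        self.add_optimizer(flow.optim.SGD(model.parameters(), lr=0.01))

    def build(self, x, y):
        loss = self.loss_fn(self.model(x), y)
        loss.backward()  # in a Graph, backward is described here, not called eagerly
        return loss

graph = TrainGraph()
loss = graph(flow.randn(8, 4), flow.randn(8, 3))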
Executing a Graph¶
Call a nn.Graph instance to run a customized graph.
- __call__: Call a nn.Graph subclass instance to run your customized graph.
Config options on a Graph¶
Optimization options of a nn.Graph.
- If set to true, the graph will use mixed precision mode, which means using both float16 and float32 during model training.
- Enable ZeRO redundancy optimizer.
- If set to true, try to fuse cast + scale + l1_l2_regularize_gradient + model_update into one op to improve performance.
- If set to true, try to fuse a binary element-wise add operator into one of its predecessors to improve performance.
- If set to true, try to fuse cast and scalar_mul_by_tensor to improve performance.
- Set the number of steps to accumulate the gradient.
- Whether to enable cudnn conv operation to use the heuristic search algorithm.
- Whether to enable the straighten algorithm.
- If true, the graph will try its best to find the minimum memory allocation strategy.
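As a sketch of how these options are set, they are configured on self.config inside __init__; the method names used below (enable_amp, set_gradient_accumulation_steps) are assumptions matching the descriptions in the list above:
import oneflow as flow

class AmpGraph(flow.nn.Graph):
    def __init__(self):
        super().__init__()
        self.model = flow.nn.Linear(4, 3)
        # Option names assumed; they correspond to the mixed precision and
        # gradient accumulation rows above.
        self.config.enable_amp(True)
        self.config.set_gradient_accumulation_steps(2)

    def build(self, x):
        return self.model(x)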
Config options on a GraphModule¶
GraphModule is the graph representation of a nn.Module in a nn.Graph.
When an nn.Module is added into an nn.Graph, it is wrapped into a ProxyModule. The ProxyModule has a GraphModule inside it. You can get and set the GraphModule to enable graph optimization on the nn.Module.
- Set the stage id and placement of an nn.Module in pipeline parallelism.
- Set/Get whether to do activation checkpointing in this nn.Module.
Save & Load a Model¶
- state_dict: Returns a dictionary containing the whole state of the graph.
- load_state_dict: Copies the module’s states and other graph states from state_dict into this graph.
Auto Parallelism¶
As the scale of deep-learning models grows larger and larger, distributed training, or parallelism, is needed. Data parallelism and model parallelism have been designed to speed up training and solve memory issues.
In OneFlow, the SBP signature enables users to configure parallelism policy easily. However, users still need to specify the SBP property for each operator, or most of them. Users might spend a couple of days digging into the details of parallelism and get a low throughput just because of a slight mistake in the configuration of the SBP signature.
Note
It only works on oneflow.nn.Graph mode.
Our strength¶
To get rid of all those configurations for SBP signatures, we developed auto parallelism. Still, configurations of placement are necessary, and we have not supported auto placement yet. If you read this paragraph before you rush into any SBP stuff, then congratulations, you do not need to learn SBP. You can start writing your code as you did under CPU mode. Our auto parallelism will generate a fast strategy customized for your specific model, the size of its parameters, and the number of available GPUs.
How to use auto parallelism?¶
You just need to enable the configuration settings in the model of oneflow.nn.Graph.
Example:
import oneflow as flow
class SubclassGraph(flow.nn.Graph):
def __init__(self):
super().__init__() # MUST be called
# auto parallelism configuration
self.config.enable_auto_parallel(True)
# other configurations about auto parallelism
# ......
def build(self):
pass
Warning
If you enable auto parallelism, OneFlow will take care of the SBP configurations of operators except for explicit to_global functions.
Configuration API for auto parallelism¶
- If true, the graph will use the auto parallel algorithm to select a parallelism strategy.
- If true, it will ignore all user configurations of SBP.
- Set the coefficient of computation cost in the auto-parallel algorithm.
- Set the wait time for the auto-parallel algorithm.
- Find the trunk of the SBP graph, then reduce the wait time for tributaries.
- Use “sbp collector” to create “sbp proxy” for nodes with multiple downstream operators.
- Whether to use a parallelism strategy with less memory.
oneflow.nn.image¶
Image operations for neural networks¶
oneflow.utils.data¶
At the heart of the Oneflow data loading utility is the oneflow.utils.data.DataLoader class. It represents a Python iterable over a dataset, with support for:
- map-style and iterable-style datasets,
- customizing data loading order,
- automatic batching,
- single- and multi-process data loading.
These options are configured by the constructor arguments of a
DataLoader
, which has signature:
DataLoader(dataset, batch_size=1, shuffle=False, sampler=None,
batch_sampler=None, num_workers=0, collate_fn=None,
pin_memory=False, drop_last=False, timeout=0,
worker_init_fn=None, *, prefetch_factor=2,
persistent_workers=False)
The sections below describe in detail the effects and usages of these options.
Dataset Types¶
The most important argument of the DataLoader constructor is dataset, which indicates a dataset object to load data from. OneFlow supports two different types of datasets:
Map-style datasets¶
A map-style dataset is one that implements the __getitem__()
and
__len__()
protocols, and represents a map from (possibly non-integral)
indices/keys to data samples.
For example, such a dataset, when accessed with dataset[idx]
, could read
the idx
-th image and its corresponding label from a folder on the disk.
See Dataset
for more details.
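For instance, a minimal sketch of a map-style dataset (SquaresDataset and its contents are illustrative, not part of the API):
import oneflow as flow
from oneflow.utils.data import Dataset, DataLoader

class SquaresDataset(Dataset):
    def __init__(self, n):
        self.n = n

    def __len__(self):
        return self.n

    def __getitem__(self, idx):
        # map an integer key to a (sample, label) pair
        return flow.tensor([float(idx)]), idx * idx

loader = DataLoader(SquaresDataset(8), batch_size=4, shuffle=True)
for x, y in loader:
    print(x.shape, y)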
Iterable-style datasets¶
An iterable-style dataset is an instance of a subclass of IterableDataset
that implements the __iter__()
protocol, and represents an iterable over
data samples. This type of datasets is particularly suitable for cases where
random reads are expensive or even improbable, and where the batch size depends
on the fetched data.
For example, such a dataset, when called iter(dataset)
, could return a
stream of data reading from a database, a remote server, or even logs generated
in real time.
See IterableDataset
for more details.
Note
When using an IterableDataset with multi-process data loading, the same dataset object is replicated on each worker process, and thus the replicas must be configured differently to avoid duplicated data. See the IterableDataset documentation for how to achieve this.
Data Loading Order and Sampler¶
For iterable-style datasets, data loading order is entirely controlled by the user-defined iterable. This allows easier implementations of chunk-reading and dynamic batch size (e.g., by yielding a batched sample at each time).
The rest of this section concerns the case with map-style datasets. oneflow.utils.data.Sampler classes are used to specify the sequence of indices/keys used in data loading. They represent iterable objects over the indices to datasets. E.g., in the common case with stochastic gradient descent (SGD), a Sampler could randomly permute a list of indices and yield each one at a time, or yield a small number of them for mini-batch SGD.
A sequential or shuffled sampler will be automatically constructed based on the shuffle
argument to a DataLoader
.
Alternatively, users may use the sampler
argument to specify a
custom Sampler
object that at each time yields
the next index/key to fetch.
A custom Sampler
that yields a list of batch
indices at a time can be passed as the batch_sampler
argument.
Automatic batching can also be enabled via batch_size
and
drop_last
arguments. See
the next section for more details
on this.
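As a hedged illustration of the sampler and batch_sampler routes (the index lists are arbitrary; DataLoader accepts any iterable of indices for sampler and any iterable of index batches for batch_sampler, per its documentation below):
import oneflow as flow
from oneflow.utils.data import DataLoader, TensorDataset

data = flow.arange(10, dtype=flow.float32).view(10, 1)
dataset = TensorDataset(data)

# custom fetch order through the sampler argument
loader = DataLoader(dataset, sampler=list(reversed(range(10))), batch_size=5)

# full control over batch composition through batch_sampler
loader = DataLoader(dataset, batch_sampler=[[0, 1, 2], [3, 4], [9, 8]])
for (batch,) in loader:
    print(batch.squeeze(1))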
Note
Neither sampler
nor batch_sampler
is compatible with
iterable-style datasets, since such datasets have no notion of a key or an
index.
Loading Batched and Non-Batched Data¶
DataLoader
supports automatically collating
individual fetched data samples into batches via arguments
batch_size
, drop_last
, batch_sampler
, and
collate_fn
(which has a default function).
Automatic batching (default)¶
This is the most common case, and corresponds to fetching a minibatch of data and collating them into batched samples, i.e., containing Tensors with one dimension being the batch dimension (usually the first).
When batch_size
(default 1
) is not None
, the data loader yields
batched samples instead of individual samples. batch_size
and
drop_last
arguments are used to specify how the data loader obtains
batches of dataset keys. For map-style datasets, users can alternatively
specify batch_sampler
, which yields a list of keys at a time.
Note
The batch_size
and drop_last
arguments essentially are used
to construct a batch_sampler
from sampler
. For map-style
datasets, the sampler
is either provided by user or constructed
based on the shuffle
argument. For iterable-style datasets, the
sampler
is a dummy infinite one. See
this section on more details on
samplers.
Note
When fetching from
iterable-style datasets with
multi-processing, the drop_last
argument drops the last non-full batch of each worker’s dataset replica.
After fetching a list of samples using the indices from sampler, the function
passed as the collate_fn
argument is used to collate lists of samples
into batches.
In this case, loading from a map-style dataset is roughly equivalent with:
for indices in batch_sampler:
yield collate_fn([dataset[i] for i in indices])
and loading from an iterable-style dataset is roughly equivalent with:
dataset_iter = iter(dataset)
for indices in batch_sampler:
yield collate_fn([next(dataset_iter) for _ in indices])
A custom collate_fn
can be used to customize collation, e.g., padding
sequential data to max length of a batch. See
this section on more about collate_fn
.
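For example, a minimal sketch of such a padding collate_fn (a plain list of lists is itself a valid map-style dataset, so one is used here for brevity):
import oneflow as flow
from oneflow.utils.data import DataLoader

samples = [[1, 2], [3, 4, 5], [6]]  # variable-length sequences

def pad_collate(batch):
    # pad every sequence to the length of the longest one in the batch
    max_len = max(len(seq) for seq in batch)
    padded = [seq + [0] * (max_len - len(seq)) for seq in batch]
    return flow.tensor(padded)

loader = DataLoader(samples, batch_size=3, collate_fn=pad_collate)
print(next(iter(loader)))  # a (3, 3) tensor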
Disable automatic batching¶
In certain cases, users may want to handle batching manually in dataset code,
or simply load individual samples. For example, it could be cheaper to directly
load batched data (e.g., bulk reads from a database or reading continuous
chunks of memory), or the batch size is data dependent, or the program is
designed to work on individual samples. Under these scenarios, it’s likely
better to not use automatic batching (where collate_fn
is used to
collate the samples), but let the data loader directly return each member of
the dataset
object.
When both batch_size
and batch_sampler
are None
(default
value for batch_sampler
is already None
), automatic batching is
disabled. Each sample obtained from the dataset
is processed with the
function passed as the collate_fn
argument.
When automatic batching is disabled, the default collate_fn
simply
converts NumPy arrays into OneFlow Tensors, and keeps everything else untouched.
In this case, loading from a map-style dataset is roughly equivalent with:
for index in sampler:
yield collate_fn(dataset[index])
and loading from an iterable-style dataset is roughly equivalent with:
for data in iter(dataset):
yield collate_fn(data)
See this section on more about collate_fn
.
Working with collate_fn¶
The use of collate_fn
is slightly different when automatic batching is
enabled or disabled.
When automatic batching is disabled, collate_fn
is called with
each individual data sample, and the output is yielded from the data loader
iterator. In this case, the default collate_fn simply converts NumPy arrays into OneFlow tensors.
When automatic batching is enabled, collate_fn
is called with a list
of data samples at each time. It is expected to collate the input samples into
a batch for yielding from the data loader iterator. The rest of this section
describes the behavior of the default collate_fn
(default_collate()
).
For instance, if each data sample consists of a 3-channel image and an integral
class label, i.e., each element of the dataset returns a tuple
(image, class_index)
, the default collate_fn
collates a list of
such tuples into a single tuple of a batched image tensor and a batched class
label Tensor. In particular, the default collate_fn
has the following
properties:
It always prepends a new dimension as the batch dimension.
It automatically converts NumPy arrays and Python numerical values into OneFlow Tensors.
It preserves the data structure, e.g., if each sample is a dictionary, it outputs a dictionary with the same set of keys but batched Tensors as values (or lists if the values can not be converted into Tensors). The same applies to lists, tuples, namedtuples, etc.
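A short sketch of these properties, assuming the default collate_fn mirrors PyTorch's (as the DataLoader notes below state):
import oneflow as flow
from oneflow.utils.data import DataLoader

# dict-structured samples: each key is batched separately
samples = [{"img": flow.ones(3, 2, 2) * i, "label": i} for i in range(4)]
loader = DataLoader(samples, batch_size=2)
batch = next(iter(loader))
print(batch["img"].shape)  # (2, 3, 2, 2): a new batch dimension is prepended
print(batch["label"])      # integer labels collected into a tensor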
Users may use customized collate_fn
to achieve custom batching, e.g.,
collating along a dimension other than the first, padding sequences of
various lengths, or adding support for custom data types.
If you run into a situation where the outputs of DataLoader
have dimensions or type that is different from your expectation, you may
want to check your collate_fn
.
Single- and Multi-process Data Loading¶
A DataLoader
uses single-process data loading by
default.
Within a Python process, the
Global Interpreter Lock (GIL)
prevents truly parallelizing Python code across threads. To avoid blocking
computation code with data loading, OneFlow provides an easy switch to perform
multi-process data loading by simply setting the argument num_workers
to a positive integer.
Single-process data loading (default)¶
In this mode, data fetching is done in the same process in which a
DataLoader
is initialized. Therefore, data loading
may block computing. However, this mode may be preferred when resource(s) used
for sharing data among processes (e.g., shared memory, file descriptors) is
limited, or when the entire dataset is small and can be loaded entirely in
memory. Additionally, single-process loading often shows more readable error
traces and thus is useful for debugging.
Multi-process data loading¶
Setting the argument num_workers
as a positive integer will
turn on multi-process data loading with the specified number of loader worker
processes.
Warning
After several iterations, the loader worker processes will consume
the same amount of CPU memory as the parent process for all Python
objects in the parent process which are accessed from the worker
processes. This can be problematic if the Dataset contains a lot of
data (e.g., you are loading a very large list of filenames at Dataset
construction time) and/or you are using a lot of workers (overall
memory usage is number of workers * size of parent process
). The
simplest workaround is to replace Python objects with non-refcounted
representations such as Pandas, Numpy or PyArrow objects.
In this mode, each time an iterator of a DataLoader
is created (e.g., when you call enumerate(dataloader)
), num_workers
worker processes are created. At this point, the dataset
,
collate_fn
, and worker_init_fn
are passed to each
worker, where they are used to initialize, and fetch data. This means that
dataset access together with its internal IO, transforms
(including collate_fn
) runs in the worker process.
For map-style datasets, the main process generates the indices using
sampler
and sends them to the workers. So any shuffle randomization is
done in the main process which guides loading by assigning indices to load.
For iterable-style datasets, since each worker process gets a replica of the
dataset
object, naive multi-process loading will often result in
duplicated data. Using worker_init_fn
, users may configure each replica independently. (See
IterableDataset
documentations for how to achieve
this. ) For similar reasons, in multi-process loading, the drop_last
argument drops the last non-full batch of each worker’s iterable-style dataset
replica.
Workers are shut down once the end of the iteration is reached, or when the iterator becomes garbage collected.
Warning
It is generally not recommended to return CUDA tensors in multi-process
loading because of many subtleties in using CUDA and sharing CUDA tensors in
multiprocessing. Instead, we recommend
using automatic memory pinning (i.e., setting
pin_memory=True
), which enables fast data transfer to CUDA-enabled
GPUs.
Platform-specific behaviors¶
Since workers rely on Python multiprocessing
, worker launch behavior is
different on Windows compared to Unix.
On Unix, fork() is the default multiprocessing start method. Using fork(), child workers typically can access the dataset and Python argument functions directly through the cloned address space.
On Windows or MacOS, spawn() is the default multiprocessing start method. Using spawn(), another interpreter is launched which runs your main script, followed by the internal worker function that receives the dataset, collate_fn and other arguments through pickle serialization.
This separate serialization means that you should take two steps to ensure you are compatible with Windows while using multi-process data loading:
Wrap most of your main script’s code within an if __name__ == '__main__': block, to make sure it doesn’t run again (most likely generating an error) when each worker process is launched. You can place your dataset and DataLoader instance creation logic here, as it doesn’t need to be re-executed in workers. A minimal skeleton is sketched after this list.
Make sure that any custom collate_fn, worker_init_fn or dataset code is declared as a top level definition, outside of the __main__ check. This ensures that they are available in worker processes. (This is needed since functions are pickled as references only, not bytecode.)
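A hedged sketch of such a script layout (RangeDataset and the worker count are illustrative):
import oneflow as flow
from oneflow.utils.data import Dataset, DataLoader

# top level definitions: picklable by reference in spawned workers
class RangeDataset(Dataset):
    def __len__(self):
        return 8

    def __getitem__(self, idx):
        return flow.tensor([float(idx)])

def worker_init_fn(worker_id):
    print(f"worker {worker_id} started")

if __name__ == "__main__":
    # guarded so it does not re-run when each worker is launched
    loader = DataLoader(RangeDataset(), batch_size=2, num_workers=2,
                        worker_init_fn=worker_init_fn)
    for batch in loader:
        print(batch)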
Randomness in multi-process data loading¶
By default, each worker will have its OneFlow seed set to base_seed + worker_id, where base_seed is a long generated by the main process using its RNG (thereby, consuming an RNG state mandatorily) or a specified generator. However, seeds for other libraries may be duplicated upon initializing workers, causing each worker to return identical random numbers.
In worker_init_fn, you may access the OneFlow seed set for each worker with oneflow.initial_seed(), and use it to seed other libraries before data loading.
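For example, a minimal sketch that seeds NumPy from the per-worker seed:
import numpy as np
import oneflow as flow

def worker_init_fn(worker_id):
    # derive a 32-bit NumPy seed from the per-worker OneFlow seed
    worker_seed = flow.initial_seed() % 2**32
    np.random.seed(worker_seed)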
Memory Pinning¶
Host to GPU copies are much faster when they originate from pinned (page-locked) memory. See cuda-memory-pinning for more details on when and how to use pinned memory generally.
For data loading, passing pin_memory=True
to a
DataLoader
will automatically put the fetched data
Tensors in pinned memory, and thus enables faster data transfer to CUDA-enabled
GPUs.
The default memory pinning logic only recognizes Tensors and maps and iterables
containing Tensors. By default, if the pinning logic sees a batch that is a
custom type (which will occur if you have a collate_fn
that returns a
custom batch type), or if each element of your batch is a custom type, the
pinning logic will not recognize them, and it will return that batch (or those
elements) without pinning the memory. To enable memory pinning for custom
batch or data type(s), define a pin_memory()
method on your custom
type(s).
See the example below.
Example:
import oneflow
from oneflow.utils.data import DataLoader, TensorDataset

class SimpleCustomBatch:
def __init__(self, data):
transposed_data = list(zip(*data))
self.inp = oneflow.stack(transposed_data[0], 0)
self.tgt = oneflow.stack(transposed_data[1], 0)
# custom memory pinning method on custom type
def pin_memory(self):
self.inp = self.inp.pin_memory()
self.tgt = self.tgt.pin_memory()
return self
def collate_wrapper(batch):
return SimpleCustomBatch(batch)
inps = oneflow.arange(10 * 5, dtype=oneflow.float32).view(10, 5)
tgts = oneflow.arange(10 * 5, dtype=oneflow.float32).view(10, 5)
dataset = TensorDataset(inps, tgts)
loader = DataLoader(dataset, batch_size=2, collate_fn=collate_wrapper,
pin_memory=True)
for batch_ndx, sample in enumerate(loader):
print(sample.inp.is_pinned())
print(sample.tgt.is_pinned())
class oneflow.utils.data.DataLoader
(dataset: oneflow.utils.data.dataset.Dataset[T_co], batch_size: Optional[int] = 1, shuffle: bool = False, sampler: Optional[oneflow.utils.data.sampler.Sampler[int]] = None, batch_sampler: Optional[oneflow.utils.data.sampler.Sampler[Sequence[int]]] = None, num_workers: int = 0, collate_fn: Optional[Callable[[List[T]], Any]] = None, pin_memory: bool = False, drop_last: bool = False, timeout: float = 0, worker_init_fn: Optional[Callable[[int], None]] = None, multiprocessing_context=None, generator=<oneflow._oneflow_internal.Generator object>, *, prefetch_factor: int = 2, persistent_workers: bool = False)¶ Data loader. Combines a dataset and a sampler, and provides an iterable over the given dataset.
The
DataLoader
supports both map-style and iterable-style datasets with single- or multi-process loading, customizing loading order and optional automatic batching (collation) and memory pinning.
See
oneflow.utils.data
documentation page for more details.
In consideration of compatibility, the design of our dataloader is consistent with pytorch, ref: https://github.com/pytorch/pytorch/tree/v1.7.0
- Parameters
dataset (Dataset) – dataset from which to load the data.
batch_size (int, optional) – how many samples per batch to load (default: 1).
shuffle (bool, optional) – set to True to have the data reshuffled at every epoch (default: False).
sampler (Sampler or Iterable, optional) – defines the strategy to draw samples from the dataset. Can be any Iterable with __len__ implemented. If specified, shuffle must not be specified.
batch_sampler (Sampler or Iterable, optional) – like sampler, but returns a batch of indices at a time. Mutually exclusive with batch_size, shuffle, sampler, and drop_last.
num_workers (int, optional) – how many subprocesses to use for data loading (default: 0). 0 means that the data will be loaded in the main process.
collate_fn (callable, optional) – merges a list of samples to form a mini-batch of Tensor(s). Used when using batched loading from a map-style dataset.
pin_memory (bool, optional) – If True, the data loader will copy Tensors into CUDA pinned memory before returning them. If your data elements are a custom type, or your collate_fn returns a batch that is a custom type, see the example below. (default: False)
drop_last (bool, optional) – set to True to drop the last incomplete batch, if the dataset size is not divisible by the batch size. If False and the size of the dataset is not divisible by the batch size, then the last batch will be smaller. (default: False)
timeout (numeric, optional) – if positive, the timeout value for collecting a batch from workers. Should always be non-negative. (default: 0)
worker_init_fn (callable, optional) – If not None, this will be called on each worker subprocess with the worker id (an int in [0, num_workers - 1]) as input, after seeding and before data loading. (default: None)
prefetch_factor (int, optional, keyword-only arg) – Number of samples loaded in advance by each worker. 2 means there will be a total of 2 * num_workers samples prefetched across all workers. (default: 2)
persistent_workers (bool, optional) – If True, the data loader will immediately initialize the worker processes and will not shut them down after a dataset has been consumed once. This keeps the worker Dataset instances alive. If you are using OneFlow with RDMA support in distributed training, persistent_workers must be True, otherwise you will encounter a segmentation fault. (default: False)
Warning
If the spawn start method is used, worker_init_fn cannot be an unpicklable object, e.g., a lambda function.
Warning
len(dataloader) heuristic is based on the length of the sampler used. When dataset is an IterableDataset, it instead returns an estimate based on len(dataset) / batch_size, with proper rounding depending on drop_last, regardless of multi-process loading configurations. This represents the best guess OneFlow can make because OneFlow trusts user dataset code in correctly handling multi-process loading to avoid duplicate data.
However, if sharding results in multiple workers having incomplete last batches, this estimate can still be inaccurate, because (1) an otherwise complete batch can be broken into multiple ones and (2) more than one batch worth of samples can be dropped when drop_last is set. Unfortunately, OneFlow can not detect such cases in general.
class oneflow.utils.data.Dataset
(*args, **kwds)¶ An abstract class representing a Dataset.
All datasets that represent a map from keys to data samples should subclass it. All subclasses should overwrite
__getitem__()
, supporting fetching a data sample for a given key. Subclasses could also optionally overwrite__len__()
, which is expected to return the size of the dataset by manySampler
implementations and the default options ofDataLoader
.Note
DataLoader by default constructs an index sampler that yields integral indices. To make it work with a map-style dataset with non-integral indices/keys, a custom sampler must be provided.
class oneflow.utils.data.IterableDataset
(*args, **kwds)¶ An iterable Dataset.
All datasets that represent an iterable of data samples should subclass it. Such form of datasets is particularly useful when data come from a stream.
All subclasses should overwrite
__iter__()
, which would return an iterator of samples in this dataset.
When a subclass is used with
DataLoader
, each item in the dataset will be yielded from the DataLoader
iterator. When num_workers > 0
, each worker process will have a different copy of the dataset object, so it is often desired to configure each copy independently to avoid having duplicate data returned from the workers.
Example 1: splitting workload across all workers in
__iter__()
:>>> class MyIterableDataset(flow.utils.data.IterableDataset): ... def __init__(self, start, end): ... super(MyIterableDataset).__init__() ... assert end > start, "this example code only works with end >= start" ... self.start = start ... self.end = end ... ... def __iter__(self): ... iter_start = self.start ... iter_end = self.end ... return iter(range(iter_start, iter_end)) ... >>> # should give same set of data as range(3, 7), i.e., [3, 4, 5, 6]. >>> ds = MyIterableDataset(start=3, end=7) >>> # Single-process loading >>> print(list(flow.utils.data.DataLoader(ds, num_workers=0))) [3, 4, 5, 6]
Example 2: splitting workload across all workers using
worker_init_fn
:>>> class MyIterableDataset(flow.utils.data.IterableDataset): ... def __init__(self, start, end): ... super(MyIterableDataset).__init__() ... assert end > start, "this example code only works with end >= start" ... self.start = start ... self.end = end ... ... def __iter__(self): ... return iter(range(self.start, self.end)) ... >>> # should give same set of data as range(3, 7), i.e., [3, 4, 5, 6]. >>> ds = MyIterableDataset(start=3, end=7) >>> # Single-process loading >>> print(list(flow.utils.data.DataLoader(ds, num_workers=0))) [3, 4, 5, 6]
class oneflow.utils.data.TensorDataset
(*tensors: oneflow.Tensor)¶ Dataset wrapping tensors.
Each sample will be retrieved by indexing tensors along the first dimension.
- Parameters
*tensors (Tensor) – tensors that have the same size of the first dimension.
class oneflow.utils.data.ConcatDataset
(datasets: Iterable[oneflow.utils.data.dataset.Dataset])¶ Dataset as a concatenation of multiple datasets.
This class is useful to assemble different existing datasets.
- Parameters
datasets (sequence) – List of datasets to be concatenated
class oneflow.utils.data.Subset
(dataset: oneflow.utils.data.dataset.Dataset[T_co], indices: Sequence[int])¶ Subset of a dataset at specified indices.
- Parameters
dataset (Dataset) – The whole Dataset
indices (sequence) – Indices in the whole set selected for subset
oneflow.utils.data.random_split
(dataset: oneflow.utils.data.dataset.Dataset[T], lengths: Sequence[int], generator: Optional[object] = <built-in method default_generator of PyCapsule object>) → List[oneflow.utils.data.dataset.Subset[T]]¶ Randomly split a dataset into non-overlapping new datasets of given lengths. Optionally fix the generator for reproducible results, e.g.:
>>> random_split(range(10), [3, 7], generator=flow.Generator().manual_seed(42))
- Parameters
dataset (Dataset) – Dataset to be split
lengths (sequence) – lengths of splits to be produced
generator (Generator) – Generator used for the random permutation.
class oneflow.utils.data.Sampler
(data_source: Optional[Sized])¶ Base class for all Samplers.
Every Sampler subclass has to provide an
__iter__()
method, providing a way to iterate over indices of dataset elements, and a __len__()
method that returns the length of the returned iterators.
Note
The
__len__()
method isn’t strictly required by DataLoader
, but is expected in any calculation involving the length of a DataLoader
.
class oneflow.utils.data.SequentialSampler
(data_source)¶ Samples elements sequentially, always in the same order.
- Parameters
data_source (Dataset) – dataset to sample from
class oneflow.utils.data.RandomSampler
(data_source: Sized, replacement: bool = False, num_samples: Optional[int] = None, generator=None)¶ Samples elements randomly. If without replacement, then sample from a shuffled dataset. If with replacement, then user can specify
num_samples
to draw.- Parameters
data_source (Dataset) – dataset to sample from
replacement (bool) – samples are drawn on-demand with replacement if True (default: False).
num_samples (int) – number of samples to draw (default: len(dataset)). This argument is supposed to be specified only when replacement is True.
generator (Generator) – Generator used in sampling.
class oneflow.utils.data.SubsetRandomSampler
(indices: Sequence[int], generator=None)¶ Samples elements randomly from a given list of indices, without replacement.
- Parameters
indices (sequence) – a sequence of indices
generator (Generator) – Generator used in sampling.
class oneflow.utils.data.BatchSampler
(sampler: oneflow.utils.data.sampler.Sampler[int], batch_size: int, drop_last: bool)¶ Wraps another sampler to yield a mini-batch of indices.
- Parameters
sampler (Sampler or Iterable) – Base sampler. Can be any iterable object
batch_size (int) – Size of mini-batch.
drop_last (bool) – If
True
, the sampler will drop the last batch if its size would be less thanbatch_size
Example
>>> list(BatchSampler(SequentialSampler(range(10)), batch_size=3, drop_last=False)) [[0, 1, 2], [3, 4, 5], [6, 7, 8], [9]] >>> list(BatchSampler(SequentialSampler(range(10)), batch_size=3, drop_last=True)) [[0, 1, 2], [3, 4, 5], [6, 7, 8]]
class oneflow.utils.data.distributed.DistributedSampler
(dataset: oneflow.utils.data.dataset.Dataset, num_replicas: Optional[int] = None, rank: Optional[int] = None, shuffle: bool = True, seed: int = 0, drop_last: bool = False)¶ Sampler that restricts data loading to a subset of the dataset.
It is especially useful in conjunction with
flow.nn.parallel.DistributedDataParallel
. In such a case, each process can pass a DistributedSampler
instance as a DataLoader
sampler, and load a subset of the original dataset that is exclusive to it.
Note
Dataset is assumed to be of constant size.
- Parameters
dataset – Dataset used for sampling.
num_replicas (int, optional) – Number of processes participating in distributed training. By default,
world_size
is retrieved from the current distributed group.
rank (int, optional) – Rank of the current process within
num_replicas
. By default, rank
is retrieved from the current distributed group.
shuffle (bool, optional) – If
True
(default), sampler will shuffle the indices.
seed (int, optional) – random seed used to shuffle the sampler if
shuffle=True
. This number should be identical across all processes in the distributed group. Default:0
.drop_last (bool, optional) – if
True
, then the sampler will drop the tail of the data to make it evenly divisible across the number of replicas. IfFalse
, the sampler will add extra indices to make the data evenly divisible across the replicas. Default:False
.
Warning
In distributed mode, calling the
set_epoch()
method at the beginning of each epoch before creating the DataLoader
iterator is necessary to make shuffling work properly across multiple epochs. Otherwise, the same ordering will always be used.
For example:
>>> sampler = DistributedSampler(dataset) if is_distributed else None >>> loader = DataLoader(dataset, shuffle=(sampler is None), sampler=sampler) >>> for epoch in range(start_epoch, n_epochs): ... if is_distributed: ... sampler.set_epoch(epoch) ... train(loader)
oneflow.utils.global_view¶
Some global view Ops¶
Converts the input tensor or input tensor(s) in list/tuple/dict to global tensor(s).
Returns the local part of the input.
Create a scope to provide global information for the computation process within it.
Get the current global mode information.
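A hedged sketch of round-tripping through these helpers, assuming global_view.to_global accepts placement and sbp keywords like Tensor.to_global does (the single-rank CPU placement is illustrative):
import oneflow as flow
from oneflow.utils import global_view

placement = flow.placement("cpu", ranks=[0])
sbp = flow.sbp.broadcast

# converts every tensor in the dict to a global tensor
local = {"w": flow.ones(2, 2), "b": flow.zeros(2)}
global_tensors = global_view.to_global(local, placement=placement, sbp=sbp)

# returns the local part of each global tensor
local_again = global_view.to_local(global_tensors)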
oneflow.utils.tensor¶
oneflow.one_embedding¶
Embedding is an important component of recommender systems, and it has also spread to many fields beyond them. Each framework provides basic operators for Embedding, for example, flow.nn.Embedding in OneFlow:
import numpy as np
import oneflow as flow
indices = flow.tensor([[1, 2, 4, 5], [4, 3, 2, 9]], dtype=flow.int)
embedding = flow.nn.Embedding(10, 3)
y = embedding(indices)
OneEmbedding is the large-scale Embedding solution that OneFlow provides for large-scale deep recommender systems. Compared to ordinary operators, OneEmbedding has the following advantages:
With flexible hierarchical storage, OneEmbedding can place the Embedding table on GPU memory, CPU memory or SSD, and allows high-speed devices to be used as caches for low-speed devices, achieving both speed and capacity.
OneEmbedding supports dynamic expansion.
Note
Please refer to Large-Scale Embedding Solution: OneEmbedding for a brief introduction to all features related to OneEmbedding.
Configure Embedding Table¶
OneEmbedding supports the simultaneous creation of multiple Embedding tables. The following code configures three Embedding tables.
import oneflow as flow
import oneflow.nn as nn
import numpy as np
tables = [
flow.one_embedding.make_table_options(
flow.one_embedding.make_uniform_initializer(low=-0.1, high=0.1)
),
flow.one_embedding.make_table_options(
flow.one_embedding.make_uniform_initializer(low=-0.05, high=0.05)
),
flow.one_embedding.make_table_options(
flow.one_embedding.make_uniform_initializer(low=-0.15, high=0.15)
),
]
When configuring an Embedding table, you need to specify the initialization method. The above Embedding tables are initialized with the uniform method. The result of configuring the Embedding tables is stored in the tables variable.
oneflow.one_embedding.make_table_options
(param)¶ make table param of Embedding tables
- Parameters
param (dict or list) – param can be initializer or list of column_option. initializer can be made by make_uniform_initializer or make_normal_initializer or make_constant_initializer, column options can be made by make_column_options
- Returns
table param of Embedding tables
- Return type
dict
For example:
>>> import oneflow as flow
>>> scale = 0.1  # example bound for the uniform initializer
>>> initializer = flow.one_embedding.make_uniform_initializer(low=-scale, high=scale)
>>> table1 = flow.one_embedding.make_table_options(initializer)
>>> table2 = flow.one_embedding.make_table_options(initializer)
>>> tables = [table1, table2]
>>> # pass the tables to the "tables" param of flow.one_embedding.MultiTableEmbedding or flow.one_embedding.MultiTableMultiColumnEmbedding
>>> # ...
oneflow.one_embedding.make_table
(param)¶ alias of oneflow.one_embedding.make_table_options
Initialization Method¶
make uniform initializer param of make_table_options
make normal initializer param of make_table_options
Configure the Storage Attribute of the Embedding Table¶
Then run the following code to configure the storage attribute of the Embedding table:
store_options = flow.one_embedding.make_cached_ssd_store_options(
cache_budget_mb=8142,
persistent_path="/your_path_to_ssd",
capacity=40000000,
size_factor=1,
physical_block_size=4096
)
Storage Method¶
make GPU-only store_options param of MultiTableEmbedding
make store_options param of MultiTableEmbedding that uses SSD, with GPU and host as cache
make store_options param of MultiTableEmbedding that uses host, with GPU as cache
Note
Please refer to Large-Scale Embedding Solution: OneEmbedding to learn about How to Choose the Proper Storage Configuration.
Instantiate Embedding¶
After the above configuration is completed, you can use MultiTableEmbedding to get the instantiated Embedding layer.
embedding_size = 128
embedding = flow.one_embedding.MultiTableEmbedding(
name="my_embedding",
embedding_dim=embedding_size,
dtype=flow.float,
key_type=flow.int64,
tables=tables,
store_options=store_options,
)
embedding.to("cuda")
Note
Please refer to Large-Scale Embedding Solution: OneEmbedding to learn about Feature ID and Multi-Table Query.
MultiTableEmbedding¶
oneflow.one_embedding.MultiTableEmbedding
(name, embedding_dim, dtype, key_type, tables, store_options, default_initializer=None, padding_idx=None, seed=0)¶ MultiTableEmbedding represents multiple Embedding tables with the same embedding_dim, dtype, and key_type.
- Parameters
name (str) – The name of Embedding
embedding_dim (int) – the size of each embedding vector
dtype (flow.dtype) – the data type of embeddings
key_type (flow.dtype) – the data type of feature ids
tables (list) – list of table param which can be made by flow.one_embedding.make_table_options
store_options (dict) – store option of Embedding
default_initializer (dict, optional) – if tables param is None, use default_initializer to initialize table. Defaults to None.
padding_idx (int, optional) – If specified, the entries at padding_idx do not contribute to the gradient; the embedding vector at padding_idx is not updated during training and defaults to all zeros.
For example:
>>> import oneflow as flow >>> import numpy as np >>> import oneflow.nn as nn >>> # a simple example with 3 table >>> table_size_array = [39884407, 39043, 17289] >>> vocab_size = sum(table_size_array) >>> num_tables = len(table_size_array) >>> embedding_size = 128 >>> scales = np.sqrt(1 / np.array(table_size_array)) >>> tables = [ >>> flow.one_embedding.make_table_options( >>> flow.one_embedding.make_uniform_initializer(low=-scale, high=scale) >>> ) >>> for scale in scales >>> ] >>> store_options = flow.one_embedding.make_cached_ssd_store_options( >>> cache_budget_mb=8192, persistent_path="/your_path_to_ssd", capacity=vocab_size, >>> ) >>> embedding = flow.one_embedding.MultiTableEmbedding( >>> name="my_embedding", >>> embedding_dim=embedding_size, >>> dtype=flow.float, >>> key_type=flow.int64, >>> tables=tables, >>> store_options=store_options, >>> ) >>> embedding.to("cuda") >>> mlp = flow.nn.FusedMLP( >>> in_features=embedding_size * num_tables, >>> hidden_features=[512, 256, 128], >>> out_features=1, >>> skip_final_activation=True, >>> ) >>> mlp.to("cuda") >>> >>> class TrainGraph(flow.nn.Graph): >>> def __init__(self,): >>> super().__init__() >>> self.embedding_lookup = embedding >>> self.mlp = mlp >>> self.add_optimizer( >>> flow.optim.SGD(self.embedding_lookup.parameters(), lr=0.1, momentum=0.0) >>> ) >>> self.add_optimizer( >>> flow.optim.SGD(self.mlp.parameters(), lr=0.1, momentum=0.0) >>> ) >>> def build(self, ids): >>> embedding = self.embedding_lookup(ids) >>> loss = self.mlp(flow.reshape(embedding, (-1, num_tables * embedding_size))) >>> loss = loss.sum() >>> loss.backward() >>> return loss >>> ids = np.random.randint(0, 1000, (100, num_tables), dtype=np.int64) >>> ids_tensor = flow.tensor(ids, requires_grad=False).to("cuda") >>> graph = TrainGraph() >>> loss = graph(ids_tensor) >>> print(loss)
Embedding lookup operation
save snapshot
load snapshot
MultiTableMultiColumnEmbedding¶
oneflow.one_embedding.MultiTableMultiColumnEmbedding
(name, embedding_dim, dtype, key_type, tables, store_options, default_initializer=None, padding_idx=None, seed=0)¶ MultiTableMultiColumnEmbedding represents multiple Embedding tables with multiple embedding_dim values and the same dtype and key_type.
- Parameters
name (str) – The name of Embedding
embedding_dim (list) – list of the size of each embedding vector
dtype (flow.dtype) – the data type of embeddings
key_type (flow.dtype) – the data type of feature ids
tables (list) – list of table param which can be made by flow.one_embedding.make_table_options
store_options (dict) – store option of Embedding
default_initializer (dict, optional) – if tables param is None, use default_initializer to initialize table. Defaults to None.
padding_idx (int, optional) – If specified, the entries at padding_idx do not contribute to the gradient; the embedding vector at padding_idx is not updated during training and defaults to all zeros.
For example:
>>> import oneflow as flow >>> import numpy as np >>> import oneflow.nn as nn >>> # a simple example with 3 table, every table has two column, the first column embedding_size is 10 and the second is 1. >>> # every table's first column initialize with uniform(-1/sqrt(table_size), 1/sqrt(table_size)), second column initialize with normal(0, 1/sqrt(table_size)) >>> table_size_array = [39884407, 39043, 17289] >>> vocab_size = sum(table_size_array) >>> num_tables = len(table_size_array) >>> embedding_size_list = [10, 1] >>> scales = np.sqrt(1 / np.array(table_size_array)) >>> tables = [ >>> flow.one_embedding.make_table_options( >>> [flow.one_embedding.make_column_options( >>> flow.one_embedding.make_uniform_initializer(low=-scale, high=scale)), >>> flow.one_embedding.make_column_options( >>> flow.one_embedding.make_normal_initializer(mean=0, std=scale))] >>> ) >>> for scale in scales >>> ] >>> store_options = flow.one_embedding.make_cached_ssd_store_options( >>> cache_budget_mb=8192, persistent_path="/your_path_to_ssd", capacity=vocab_size, >>> ) >>> embedding = flow.one_embedding.MultiTableMultiColumnEmbedding( >>> name="my_embedding", >>> embedding_dim=embedding_size_list, >>> dtype=flow.float, >>> key_type=flow.int64, >>> tables=tables, >>> store_options=store_options, >>> ) >>> embedding.to("cuda") >>> mlp = flow.nn.FusedMLP( >>> in_features=sum(embedding_size_list) * num_tables, >>> hidden_features=[512, 256, 128], >>> out_features=1, >>> skip_final_activation=True, >>> ) >>> mlp.to("cuda") >>> >>> class TrainGraph(flow.nn.Graph): >>> def __init__(self,): >>> super().__init__() >>> self.embedding_lookup = embedding >>> self.mlp = mlp >>> self.add_optimizer( >>> flow.optim.SGD(self.embedding_lookup.parameters(), lr=0.1, momentum=0.0) >>> ) >>> self.add_optimizer( >>> flow.optim.SGD(self.mlp.parameters(), lr=0.1, momentum=0.0) >>> ) >>> def build(self, ids): >>> embedding = self.embedding_lookup(ids) >>> loss = self.mlp(flow.reshape(embedding, (-1, num_tables * sum(embedding_size_list)))) >>> loss = loss.sum() >>> loss.backward() >>> return loss >>> ids = np.random.randint(0, 1000, (100, num_tables), dtype=np.int64) >>> ids_tensor = flow.tensor(ids, requires_grad=False).to("cuda") >>> graph = TrainGraph() >>> loss = graph(ids_tensor) >>> print(loss)
Embedding lookup operation
save snapshot
load snapshot
Construct Graph for Training¶
OneEmbedding is only supported in Graph mode.
num_tables = 3
mlp = flow.nn.FusedMLP(
in_features=embedding_size * num_tables,
hidden_features=[512, 256, 128],
out_features=1,
skip_final_activation=True,
)
mlp.to("cuda")
class TrainGraph(flow.nn.Graph):
def __init__(self,):
super().__init__()
self.embedding_lookup = embedding
self.mlp = mlp
self.add_optimizer(
flow.optim.SGD(self.embedding_lookup.parameters(), lr=0.1, momentum=0.0)
)
self.add_optimizer(
flow.optim.SGD(self.mlp.parameters(), lr=0.1, momentum=0.0)
)
def build(self, ids):
embedding = self.embedding_lookup(ids)
loss = self.mlp(flow.reshape(embedding, (-1, num_tables * embedding_size)))
loss = loss.sum()
loss.backward()
return loss
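Continuing the snippets above, one training step then looks like the following sketch (the random ids are illustrative, mirroring the MultiTableEmbedding example):
import numpy as np

ids = np.random.randint(0, 1000, (100, num_tables), dtype=np.int64)
ids_tensor = flow.tensor(ids, requires_grad=False).to("cuda")

graph = TrainGraph()
loss = graph(ids_tensor)
print(loss)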
Note
Please refer to Distributed Training: OneEmbedding to learn about constructing a Graph for training.
Persistent Read & Write¶
Creates a reader for reading persistent table.
Creates a writer for writing persistent table.
class oneflow.one_embedding.Ftrl
(params: Union[Iterator[oneflow.nn.Parameter], List[Dict]], lr: float = 0.001, weight_decay: float = 0.0, lr_power: float = - 0.5, initial_accumulator_value: float = 0.1, lambda1: float = 0.0, lambda2: float = 0.0, beta: float = 0.0)¶ FTRL Optimizer.
The formula is:
\[\begin{split}\begin{align} accumulator_{i+1} = accumulator_{i} + grad * grad \\ sigma = (accumulator_{i+1}^{lr\_power} - accumulator_{i}^{lr\_power}) / learning\_rate \\ z_{i+1} = z_{i} + grad - sigma * param_{i} \\ param_{i+1} = \begin{cases} 0 & \text{ if } |z_{i+1}| < \lambda_1 \\ -(\frac{\beta+accumulator_{i+1}^{lr\_power}}{learning\_rate} + \lambda_2)*(z_{i+1} - sign(z_{i+1})*\lambda_1) & \text{ otherwise } \\ \end{cases} \end{align}\end{split}\]
Example 1:
# Assume net is a custom model.
ftrl = flow.one_embedding.Ftrl(net.parameters(), lr=1e-3)
for epoch in range(epochs):
    # Read data, compute the loss and so on.
    # ...
    loss.backward()
    ftrl.step()
    ftrl.zero_grad()
- Parameters
params (iterable) – iterable of parameters to optimize or dicts defining parameter groups
lr (float, optional) – learning rate. Defaults to 1e-3.
weight_decay (float, optional) – weight decay (L2 penalty). Defaults to 0.0.
lr_power (float, optional) – learning rate decrease factor. Defaults to -0.5.
initial_accumulator_value (float, optional) – The initial value of the accumulator. Defaults to 0.1.
lambda1 (float, optional) – L1 regularization strength. Defaults to 0.0.
lambda2 (float, optional) – L2 regularization strength. Defaults to 0.0.
beta (float, optional) – The value of beta. Defaults to 0.0.
step(closure: Optional[Callable] = None)¶ Performs a single optimization step.
- Parameters
closure (callable, optional) – A closure that reevaluates the model and returns the loss.
property support_sparse¶ Whether the Optimizer supports sparse update.
Environment Variables¶
OneFlow has an extensive set of environment variables to tune for specific usage.
ONEFLOW_COMM_NET_IB_HCA¶
When there are multiple IB NICs (which can be checked with ibstatus on the server), the system uses the first IB NIC for comm_net communication by default.
When this environment variable is set, the system will check all IB NICs and find the NIC with the corresponding name. #5626
Values accepted¶
The default value is empty. Example values: mlx5_0:1, mlx5_1:1. When the port is 0, it is treated as 1, representing the first port.
ONEFLOW_COMM_NET_IB_GID_INDEX¶
Used for the ibv_query_gid query, where 0 represents success. It is often used with ONEFLOW_COMM_NET_IB_HCA. GID means the Global ID; a QP under a RoCE network must be built with this value, instead of just using the LID as in the IB network. #5626
Values accepted¶
The default value is 0, representing the port index value.
ONEFLOW_COMM_NET_IB_QUEUE_DEPTH¶
Queue length of jobs in the IB network.
This value effectively controls the queue size instead of using IB’s default; see also ONEFLOW_COMM_NET_IB_MEM_BLOCK_SIZE.
Values accepted¶
The default value is 1024, received as int64_t. The system will compare it with max_qp_wr (the maximum number of outstanding WRs on any work queue) and take the smaller one.
ONEFLOW_COMM_NET_IB_MEM_BLOCK_SIZE¶
The size of each memory block read when communicating.
The value is used to calculate the number of blocks, which are transmitted after encapsulation.
Values accepted¶
The default value is 8388608
(8M)
ONEFLOW_STREAM_CUDA_EVENT_FLAG_BLOCKING_SYNC¶
Marks blocking synchronization for CUDA stream events. For detailed information, see #5612, #5837.
Values accepted¶
Defaults to false; treated as true only when the value is 1, true, yes, on or y.
ONEFLOW_LIBIBVERBS_PATH¶
Loads the dynamic library with dlopen at runtime, finding the symbols of ibverbs functions without linking them at compile time, for better compatibility. #4852.
If loading fails, it will output libibverbs not available, ibv_fork_init skipped; if it works, import oneflow will output something like loaded library: /usr/lib/x86_64-linux-gnu/libibverbs.so.1
Values accepted¶
The default value is empty; by default the loader tries libibverbs.so.1, then libibverbs.so.
ONEFLOW_DEBUG_MODE¶
Enables debug mode (ONEFLOW_DEBUG works as well).
If debug mode is on, it will output more INFO-level logs and dump different prototxt and dot files. The automatically inserted boxing information will be printed to the log file under eager global mode.
Values accepted¶
The default value is empty, but will receive any string.
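For instance, a minimal sketch of enabling it from Python (setting it in the shell with export works equally well); the variable must be set before oneflow is imported:
import os

os.environ["ONEFLOW_DEBUG_MODE"] = "1"  # must precede the import below

import oneflow as flow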
ONEFLOW_DRY_RUN¶
Only for test runs; it can generate log files such as dot files.
The process exits once the dry run succeeds, without attempting real training.
Values accepted¶
The default value is empty, but will receive any string.
ONEFLOW_DEBUG_KERNEL_SYNC_CHECK_NUMERICS¶
Only used when debugging, because performance is affected; it can detect which op in the network produces nan or inf.
It will create CpuCheckNumericsKernelObserver
under cpu
, and CudaCheckNumericsKernelObserver
under cuda
#6052 .
Values accepted¶
Defaults to false; treated as true only when the value is 1, true, yes, on or y.
ONEFLOW_DEBUG_KERNEL_SYNC_CHECK¶
Only used when debugging because the performance would be affected.
It will create SyncCheckKernelObserver
and will be synced after each kernel.
It could be used to debug cuda errors. #6052
Values accepted¶
Defaults to false; treated as true only when the value is 1, true, yes, on or y.
ONEFLOW_PROFILER_KERNEL_PROFILE_CUDA_MEMORY_BANDWIDTH¶
Used when generating profiler files with nsys.
For now, the profiler is only valid in lazy mode.
It can estimate the memory bandwidth reached by a kernel by measuring the execution time of the GPU kernel and the size of the input and output memory, helping to find potential kernels that can be optimized. Details
Values accepted¶
Defaults to false. To use it, the package must be compiled with BUILD_PROFILER enabled.
ONEFLOW_PROFILER_KERNEL_PROFILE_KERNEL_FORWARD_RANGE¶
The same as above; additionally collects op names.
Values accepted¶
Defaults to false. To use it, the package must be compiled with BUILD_PROFILER enabled.
ONEFLOW_KERNEL_DISABLE_BLOB_ACCESS_CHECKER¶
Disables the blob_access_checker when enabled. The blob_access_checker is for correctness assurance, and skipping it can in some cases reduce kernel overhead. #5728
Values accepted¶
Defaults to false; treated as true only when the value is 1, true, yes, on or y.
ONEFLOW_KERNEL_ENABLE_CUDA_GRAPH¶
Takes effect when built with WITH_CUDA_GRAPHS; the default value is false. CUDA Graphs use more memory, so when there is only just enough memory, it won’t run.
Turning on CUDA_GRAPH will use up more memory; see CUDA Graphs support, #5868.
Values accepted¶
Defaults to false; treated as true only when the value is 1, true, yes, on or y.
ONEFLOW_ACTOR_ENABLE_LIGHT_ACTOR¶
LightActor is a new type of Actor that only handles NormalForward and similar tasks where every regst_num is 1, or tasks with only one kernel. #5868. export ONEFLOW_KERNEL_ENABLE_CUDA_GRAPH=1 (which uses more memory), export ONEFLOW_THREAD_ENABLE_LOCAL_MESSAGE_QUEUE=1, export ONEFLOW_KERNEL_DISABLE_BLOB_ACCESS_CHECKER=1, export ONEFLOW_ACTOR_ENABLE_LIGHT_ACTOR=1 and export ONEFLOW_STREAM_REUSE_CUDA_EVENT=1 can be used together.
Values accepted¶
Defaults to false; treated as true only when the value is 1, true, yes, on or y.
ONEFLOW_THREAD_ENABLE_LOCAL_MESSAGE_QUEUE¶
Enables the local message queue; oneflow.config.thread_enable_local_message_queue(True) is no longer used. #5720
Values accepted¶
Defaults to false; treated as true only when the value is 1, true, yes, on or y.
ONEFLOW_PERSISTENT_IN_STREAM_BUFFER_SIZE_BYTES¶
Represents the size of each read from disk. #5162
Values accepted¶
The default value is empty. If an invalid string or a negative number is entered, the value falls back to 32 * 1024 (32KB).
ONEFLOW_DECODER_ENABLE_NVJPEG_HARDWARE_ACCELERATION¶
Requires NVJPEG_VER_MAJOR to be bigger than 11. It enables nvjpeg hardware acceleration and warms up the jpeg decoder and hw_jpeg decoder. #5851.
See Hardware JPEG decoder and NVIDIA nvJPEG library on NVIDIA A100 GPUs.
Values accepted¶
Defaults to true; treated as true only when the value is 1, true, yes, on or y.
ONEFLOW_SERVING_DEBUG¶
Prints debug information for OneFlow Serving.
Values accepted¶
The default value is false
ONEFLOW_DISABLE_VIEW¶
Disables the view mechanism; ops related to view will stop running.
Values accepted¶
The default value is false
ONEFLOW_BOXING_DISABLE_MIDDLE_NODE_AND_CHECK¶
Whether to disable the Middle Node. When it is false, all inter-SBP communication is supported.
Values accepted¶
The default value is false
ONEFLOW_ONE_EMBEDDING_DISABLE_NUMA_AWARE_ALLOCATION¶
Whether to disable NUMA_AWARE memory allocation when the OneEmbedding module allocates memory.
NUMA_AWARE memory allocation means that when allocating pinned host memory, the CPU close to the GPU is taken into account (for example, for GPU 0 and GPU 1, memory will be allocated on CPU 0).
Values accepted¶
The default value is false
ONEFLOW_EP_CUDA_ENABLE_TF32_EXECUTION¶
Whether to allow CUDA to use TF32 numeric types for computation
Values accepted¶
The default value is true
ONEFLOW_FUNCTOR_DISABLE_FUSED_MLP¶
Whether to disable the fused_mlp operator implemented with cublasLt in FusedMLPFunctor; if disabled, it degenerates into multiple matrix multiplications.
Values accepted¶
The default value is false
ONEFLOW_ONE_EMBEDDING_EMBEDDING_SHUFFLE_INDEPENTENT_STREAM¶
Whether to put the EmbeddingShuffle of the OneEmbedding module on a separate stream for overlapping execution.
Values accepted¶
The default value is false
ONEFLOW_ONE_EMBEDDING_GRADIENT_SHUFFLE_USE_FP16¶
Whether to allow the EmbeddingGradientShuffle operator of the OneEmbedding module to use the FP16 data type in the AMP case.
Values accepted¶
The default value is true
ONEFLOW_ONE_EMBEDDING_NOT_FUSE_CAST_TO_UPDATE¶
Whether to disable the fusion of cast type conversion and parameter update of OneEmbedding parameters into one operator in the case of AMP
Values accepted¶
The default value is false
ONEFLOW_DEBUG_KERNEL_SYNC_CHECK_NUMERICS_DUMP¶
When NaN or Inf appears in the values, dump and save the data.
Values accepted¶
The default value is false
ONEFLOW_MLIR_ENABLE_IR_PRINTING¶
Controls whether to print the IR when running each pass during debugging.
Values accepted¶
The default value is false
ONEFLOW_MLIR_STDOUT¶
Controls whether MLIR outputs log information to the console.
Values accepted¶
The default value is false
ONEFLOW_MLIR_ENABLE_ROUND_TRIP¶
Controls whether the OneFlow Job goes through MLIR.
Values accepted¶
The default value is false
ONEFLOW_KERNEL_REDUCE_SUM_USE_MATMUL¶
Whether to use matrix multiplication for reduce_sum.
Values accepted¶
The default value is false
ONEFLOW_ONE_EMBEDDING_ENABLE_QUANTIZED_COMM¶
Whether to quantize the shuffle communication in the multi-GPU OneEmbedding case.
Values accepted¶
The default value is false
ONEFLOW_TENSOR_BUFFER_ALIGNED_SIZE¶
Align size when allocating TensorBuffer memory
Values accepted¶
The default value is 1024
ONEFLOW_TENSOR_BUFFER_POOL_THREAD_LOCAL_CACHE_SIZE¶
Control the size of thread_local_cache
in TensorBufferPool
Values accepted¶
The default value is 64
ONEFLOW_GRPC_MAX_MESSAGE_BYTE_SIZE¶
Set the maximum size of the gRPC transport message
Values accepted¶
The default value is -1
ONEFLOW_ONE_EMBEDDING_PERSISTENT_TABLE_CAPACITY_HINT¶
Control the initial capacity of the PersistentTable of OneEmbedding to avoid frequent expansion
Values accepted¶
OneEmbedding will calculate according to the actual situation, and users can also choose to configure a larger capacity.
ONEFLOW_ONE_EMBEDDING_PERSISTENT_TABLE_NUM_WORKERS¶
The number of threads used for reading and writing the PersistentTable of OneEmbedding
Values accepted¶
The default value is 4
ONEFLOW_EP_CUDA_CONST_BUFFER_ELEMENT_COUNT¶
Specify the size of the all zero and all one buffers on the CUDA device.
This buffer can be used with matrix multiplication to implement operations such as reduce_sum
Values accepted¶
The default value is 1024x1024
OMP_NUM_THREADS¶
Set the number of threads used by OMP
Values accepted¶
The default value will be generated by specific computational logic.
SBP_INFER_RULE_TAG¶
Specify SBP derivation rules
Values accepted¶
When the value is 1 (the default), select the SBP that satisfies the producer, or the SBP with the smallest cost, as much as possible.
When the value is 2, select the SBP that matches the most.
When the value is 3, select the SBP with the smallest cost.
ONEFLOW_TENSOR_BUFFER_GROWTH_FACTOR¶
Control the growth factor of TensorBuffer
Values accepted¶
The default value is 1.0
ONEFLOW_TENSOR_BUFFER_SHRINK_FACTOR¶
Controls the shrink factor of TensorBuffer
Values accepted¶
The default value is 0.7
ONEFLOW_TENSOR_BUFFER_POOL_SIZE_FACTOR¶
Controls the size factor of TensorBuffer
Values accepted¶
The default value is 2.0
AUTO_PARALLEL_TRANSFER_COST¶
Control the size of the automatic parallel transfer cost
Values accepted¶
The default value is 1.65e8
ONEFLOW_DEBUG_PASS¶
Takes a pass name and prints the job before and after that specific pass, e.g. export ONEFLOW_DEBUG_PASS="FuseAddToOutputPass".
Or ALL, which prints the job before and after every pass, e.g. export ONEFLOW_DEBUG_PASS="ALL".
Values accepted¶
The default value is empty
ONEFLOW_PROFILER_HOST_THREAD_NAME_PREFIX¶
Add a prefix to the name of the named host thread in the profiling context to facilitate sorting in the visualization tool (nsight)
Values accepted¶
The default value is empty
oneflow.special¶
The oneflow.special module, modeled after SciPy’s special module.¶
Computes the Hurwitz zeta function, elementwise.
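For example, a hedged sketch of the zeta entry (assuming flow.special.zeta takes (x, q) elementwise, like scipy.special.zeta):
import oneflow as flow

x = flow.tensor([2.0, 4.0])
q = flow.tensor([1.0, 2.0])
print(flow.special.zeta(x, q))  # Hurwitz zeta evaluated elementwise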