Environment Variables

OneFlow provides an extensive set of environment variables for tuning its behavior to specific use cases.


When a server has multiple IB NICs (which can be checked with ibstatus), the system uses the first IB NIC for comm_net communication by default.

When this environment variable is set, the system checks all IB NICs and selects the one with the matching name. #5626

Values accepted

The default value is empty. Example values are mlx5_0:1 and mlx5_1:1. When the port is 0, it defaults to 1, representing the first port.
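As an illustration of the value format described above, the following Python sketch splits a value such as mlx5_0:1 into a device name and a port. The function name parse_ib_hca is hypothetical, and the handling of a missing port (treated like port 0) is an assumption; this is not OneFlow's actual parser.

```python
def parse_ib_hca(value):
    """Split 'name:port' into (name, port). Hypothetical helper, not OneFlow's parser."""
    name, sep, port_text = value.partition(":")
    # Assumption: a value without a port is treated like port 0.
    port = int(port_text) if sep and port_text else 0
    # Per the text above, port 0 falls back to 1, the first port.
    if port == 0:
        port = 1
    return name, port


print(parse_ib_hca("mlx5_0:1"))
print(parse_ib_hca("mlx5_1:0"))
```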


Used in the ibv_query_gid query, for which a return value of 0 indicates success. It is often used together with ONEFLOW_COMM_NET_IB_HCA. GID stands for Global ID; under a RoCE network, a QP must be established with this value rather than with the LID alone, as in an IB network. #5626

Values accepted

The default value is 0, which represents the port index value.


The queue length for jobs in the IB network.

This value effectively controls the size of the module instead of relying on IB's default size, in the same way as ONEFLOW_COMM_NET_IB_MEM_BLOCK_SIZE.

Values accepted

The default value is 1024, and the value is read as an int64_t. The system compares it with max_qp_wr (the maximum number of outstanding WRs on any work queue) and takes the smaller of the two.
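The comparison with max_qp_wr described above can be sketched in Python as follows. The function name effective_queue_depth is hypothetical and the max_qp_wr values are made-up inputs; in practice the limit comes from the device attributes.

```python
DEFAULT_QUEUE_DEPTH = 1024  # default stated above


def effective_queue_depth(max_qp_wr, requested=DEFAULT_QUEUE_DEPTH):
    """Return the smaller of the requested depth and the device limit max_qp_wr."""
    return min(requested, max_qp_wr)


# Illustrative inputs only; real limits are reported by the device.
print(effective_queue_depth(max_qp_wr=512))
print(effective_queue_depth(max_qp_wr=32768, requested=2048))
```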


The size of the module read during communication.

This value is used to calculate the number of modules, which are transmitted after encapsulation.

Values accepted

The default value is 8388608 (8 MiB).
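One plausible reading of "calculating the number of modules" is a ceiling division of the payload size by this value. The sketch below shows that arithmetic; the function name module_count and the ceiling-division rule are assumptions, not taken from the OneFlow source.

```python
DEFAULT_MODULE_SIZE = 8388608  # default stated above (8 * 1024 * 1024 bytes)


def module_count(total_bytes, module_size=DEFAULT_MODULE_SIZE):
    """Number of modules needed to cover total_bytes (ceiling division, an assumption)."""
    if module_size <= 0:
        raise ValueError("module_size must be positive")
    return (total_bytes + module_size - 1) // module_size


# A 20 MiB payload does not divide evenly by 8 MiB, so one extra module is needed.
print(module_count(20 * 1024 * 1024))
```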


Applies to streams and marks blocking synchronization in CUDA. For details, see #5612 and #5837.

Values accepted

Defaults to false; it is treated as true only when the value is 1, true, yes, on, or y.
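Many variables on this page follow the same boolean rule. A minimal Python sketch of that rule is shown below; the helper parse_bool_env and the variable name EXAMPLE_FLAG are hypothetical, and ignoring case and surrounding whitespace is an assumption rather than documented behavior.

```python
import os

TRUE_VALUES = {"1", "true", "yes", "on", "y"}  # values listed above


def parse_bool_env(name, default=False, environ=None):
    """Return True only for the listed values; fall back to default when unset."""
    env = os.environ if environ is None else environ
    raw = env.get(name)
    if raw is None:
        return default
    # Assumption: comparison ignores case and surrounding whitespace.
    return raw.strip().lower() in TRUE_VALUES


# EXAMPLE_FLAG is a placeholder name, not a real OneFlow variable.
print(parse_bool_env("EXAMPLE_FLAG", environ={"EXAMPLE_FLAG": "yes"}))
print(parse_bool_env("EXAMPLE_FLAG", environ={"EXAMPLE_FLAG": "0"}))
```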


Loads the dynamic library with dlopen at runtime and resolves the symbols of the ibverbs functions through it, without linking at compile time, for better compatibility. #4852

If loading fails, it outputs "libibverbs not available, ibv_fork_init skipped". If it succeeds, import oneflow outputs a message such as "loaded library: /usr/lib/x86_64-linux-gnu/libibverbs.so.1".

Values accepted

The default value is empty, in which case the system loads libibverbs.so.1 or libibverbs.so.
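The runtime-loading idea can be tried from Python with ctypes, which wraps dlopen on Linux. This is only a sketch of the technique: the helper load_first_available is hypothetical, and trying libibverbs.so.1 before libibverbs.so is an assumed order, not OneFlow's confirmed behavior.

```python
import ctypes

CANDIDATES = ["libibverbs.so.1", "libibverbs.so"]  # names stated above


def load_first_available(names=CANDIDATES):
    """Return (name, handle) for the first library that loads, or None if none does."""
    for name in names:
        try:
            return name, ctypes.CDLL(name)  # raises OSError when the library is missing
        except OSError:
            continue
    return None


result = load_first_available()
if result is None:
    print("libibverbs not available")
else:
    print("loaded library:", result[0])
```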


Enables debug mode. ONEFLOW_DEBUG can be used as well.

When debug mode is on, OneFlow outputs more INFO-level logs and writes various prototxt and dot files. In eager global mode, the automatically inserted boxing information is printed to the log file.

Values accepted

The default value is empty; any string is accepted.


For test runs only. It can generate log files such as dot files.

OneFlow exits once the test succeeds and does not attempt real training.

Values accepted

The default value is empty; any string is accepted.


Use only when debugging, because it affects performance. It detects which op in the network produces NaN or Inf.

It creates a CpuCheckNumericsKernelObserver on CPU and a CudaCheckNumericsKernelObserver on CUDA. #6052

Values accepted

Defaults to false; it is treated as true only when the value is 1, true, yes, on, or y.


Use only when debugging, because it affects performance.

It creates a SyncCheckKernelObserver and synchronizes after each kernel.

It can be used to debug CUDA errors. #6052

Values accepted

Defaults to false; it is treated as true only when the value is 1, true, yes, on, or y.


Used when generating profiler files with nsys.

The profiler is currently valid only in lazy mode.

It estimates the memory bandwidth achieved by a kernel from the execution time of the GPU kernel and the size of its input and output memory, which helps find kernels that could potentially be optimized.
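The bandwidth estimate described above amounts to dividing the bytes moved by the kernel's execution time. The Python sketch below shows that arithmetic under the assumption that every input and output byte is transferred exactly once; the function name and the sample numbers are illustrative only.

```python
def estimate_bandwidth_gb_per_s(input_bytes, output_bytes, elapsed_seconds):
    """Estimate achieved memory bandwidth in GB/s (1 GB = 1e9 bytes)."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    # Assumption: each input and output byte is moved exactly once.
    return (input_bytes + output_bytes) / elapsed_seconds / 1e9


# Made-up numbers: 256 MiB read and 256 MiB written in 1 ms.
size = 256 * 1024 * 1024
print(estimate_bandwidth_gb_per_s(size, size, 1e-3))
```

Comparing such an estimate with the device's peak memory bandwidth gives a rough sense of how much headroom a memory-bound kernel has.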

Values accepted

Defaults to false. To use it, the package must be compiled with BUILD_PROFILER enabled.


Same as above; collects op names.

Values accepted

Defaults to false. To use it, the package must be compiled with BUILD_PROFILER enabled.


Set this variable only when needed: blob_access_checker exists to ensure correctness, and turning it off can reduce kernel overhead in some cases. #5728

Values accepted

Defaults to false; it is treated as true only when the value is 1, true, yes, on, or y.


Takes effect only under WITH_CUDA_GRAPHS; the default value is false. It uses more memory, so it will not run when memory is only just sufficient.

Turning on CUDA_GRAPH uses more memory. See CUDA Graphs support, #5868.

Values accepted

Defaults to false; it is treated as true only when the value is 1, true, yes, on, or y.


LightActor is a new type of Actor that handles only NormalForward and similar tasks in which every regst_num is 1, or tasks with only one kernel. #5868

The following settings can be used together:

export ONEFLOW_KERNEL_ENABLE_CUDA_GRAPH=1 (uses more memory)

export ONEFLOW_THREAD_ENABLE_LOCAL_MESSAGE_QUEUE=1

export ONEFLOW_KERNEL_DISABLE_BLOB_ACCESS_CHECKER=1

export ONEFLOW_ACTOR_ENABLE_LIGHT_ACTOR=1

export ONEFLOW_STREAM_REUSE_CUDA_EVENT=1
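The combination listed above can also be applied from a Python launcher script instead of the shell. The sketch below only sets the variables; the helper apply_flags is hypothetical, and the idea that the variables should be set before oneflow is imported is an assumption, not something this page states.

```python
import os

# Variable names and values are taken from the export commands above.
FLAGS = {
    "ONEFLOW_KERNEL_ENABLE_CUDA_GRAPH": "1",  # uses more memory
    "ONEFLOW_THREAD_ENABLE_LOCAL_MESSAGE_QUEUE": "1",
    "ONEFLOW_KERNEL_DISABLE_BLOB_ACCESS_CHECKER": "1",
    "ONEFLOW_ACTOR_ENABLE_LIGHT_ACTOR": "1",
    "ONEFLOW_STREAM_REUSE_CUDA_EVENT": "1",
}


def apply_flags(flags=FLAGS, environ=None):
    """Copy the flags into the given mapping (os.environ by default) and return it."""
    env = os.environ if environ is None else environ
    env.update(flags)
    return env


apply_flags()
# Assumption: a real script would import oneflow only after this point.
```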

Values accepted

Defaults to false; it is treated as true only when the value is 1, true, yes, on, or y.


Enables the local message queue; oneflow.config.thread_enable_local_message_queue(True) is no longer used. #5720

Values accepted

Defaults to false; it is treated as true only when the value is 1, true, yes, on, or y.


Represents the size of each read from disk. #5162

Values accepted

The default value is empty. If an invalid string or a negative number is given, the value falls back to 32 * 1024 (32 KB).
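The fallback rule described above can be sketched in Python as follows. The function name parse_read_size is hypothetical; treating an unset or empty value like an invalid string, and passing zero through unchanged, are assumptions.

```python
DEFAULT_READ_SIZE = 32 * 1024  # 32 KB fallback stated above


def parse_read_size(raw):
    """Return the read size in bytes, falling back on invalid or negative input."""
    try:
        value = int(raw)
    except (TypeError, ValueError):
        # Invalid string; unset (None) and empty values are treated the same (assumption).
        return DEFAULT_READ_SIZE
    if value < 0:
        return DEFAULT_READ_SIZE
    return value


print(parse_read_size("65536"))
print(parse_read_size("not-a-number"))
print(parse_read_size("-1"))
```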


NVJPEG_VER_MAJOR needs to be greater than 11. It can use nvjpeg hardware acceleration and warms up the jpeg decoder and the hw_jpeg decoder. #5851

Hardware JPEG decoder and NVIDIA nvJPEG library on NVIDIA A100 GPUs

Values accepted

Defaults to true; when set, it is treated as true only when the value is 1, true, yes, on, or y.


Prints OneFlow Serving debug information.

Values accepted

The default value is false


Disables the view mechanism, which means that view-related ops stop running.

Values accepted

The default value is false


Whether to disable the Middle Node. When it is false, all inter-SBP communication is supported.

Values accepted

The default value is false


Whether to disable NUMA_AWARE memory allocation when the OneEmbedding module allocates video memory.

NUMA_AWARE memory allocation means that, when pinned host memory is allocated, the CPU closest to the GPU is taken into account (for example, for GPU 0 and GPU 1, memory is allocated on CPU 0).

Values accepted

The default value is false


Whether to allow CUDA to use TF32 numeric types for computation

Values accepted

The default value is true


Whether to disable the fused_mlp operator implemented with cublasLt in FusedMLPFunctor. If disabled, it degenerates into multiple matrix multiplication operations.

Values accepted

The default value is false


Whether to put the EmbeddingShuffle of the OneEmbedding module on a separate stream for overlapping execution.

Values accepted

The default value is false


Whether to allow the EmbeddingGradientShuffle operator of the OneEmbedding module to use the FP16 data type in the AMP case.

Values accepted

The default value is true


Whether to disable the fusion of cast type conversion and parameter update of OneEmbedding parameters into one operator in the case of AMP

Values accepted

The default value is false


When a value becomes NaN or Inf, save a dump of the data.

Values accepted

The default value is false


Controls whether to print the IR when running each pass during debugging.

Values accepted

The default value is false


Control whether MLIR outputs log information in the console

Values accepted

The default value is false


Control whether to dump ir files

Values accepted

The default value is false


Controls whether the OneFlow Job goes into MLIR.

Values accepted

The default value is false


Whether to use matrix multiplication for reduce_sum.

Values accepted

The default value is false


Whether to quantize the shuffle communication in the OneEmbedding multi-card case.

Values accepted

The default value is false


Align size when allocating TensorBuffer memory

Values accepted

The default value is 1024


Control the size of thread_local_cache in TensorBufferPool

Values accepted

The default value is 64


Set the maximum size of the gRPC transport message

Values accepted

The default value is -1


Control the initial capacity of the PersistentTable of OneEmbedding to avoid frequent expansion

Values accepted

OneEmbedding will calculate according to the actual situation, and users can also choose to configure a larger capacity.


The number of threads used for reading and writing the PersistentTable of OneEmbedding

Values accepted

The default value is 4


Specify the size of the all zero and all one buffers on the CUDA device.

This buffer can be used with matrix multiplication to implement operations such as reduce_sum

Values accepted

The default value is 1024x1024


Set the number of threads used by OMP

Values accepted

The default value will be generated by specific computational logic.


Specify SBP derivation rules

Values accepted

When the value is 1, prefer the SBP that satisfies the producer or the SBP with the smallest cost.

When the value is 2, select the SBP that matches the most.

When the value is 3, select the SBP with the smallest cost.


Control the growth factor of TensorBuffer

Values accepted

The default value is 1.0


Controls the shrink factor of TensorBuffer

Values accepted

The default value is 0.7


Controls the size factor of TensorBuffer

Values accepted

The default value is 2.0


Control the size of the automatic parallel transfer cost

Values accepted

The default value is 1.65e8


Accepts pass names and prints the job before and after a specific pass, for example export ONEFLOW_DEBUG_PASS="FuseAddToOutputPass".

Alternatively, set it to ALL to print the job before and after every pass, for example export ONEFLOW_DEBUG_PASS="ALL".

Values accepted

The default value is empty


Adds a prefix to the names of named host threads in the profiling context to make sorting easier in the visualization tool (Nsight).

Values accepted

The default value is empty