Environment Variables

OneFlow provides an extensive set of environment variables for tuning its behavior to specific use cases.

ONEFLOW_COMM_NET_IB_HCA

When a server has multiple IB NICs (which can be checked with ibstatus), the system uses the first IB NIC for comm_net communication by default.

When this environment variable is set, the system checks all IB NICs and selects the one with the matching name. #5626

Values accepted

The default value is empty. Example values are mlx5_0:1 and mlx5_1:1. When the port is 0, it defaults to 1, which represents the first port.
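The device:port format above can be illustrated with a small parser. This is only a sketch of the rule described here, not OneFlow's actual parsing code: the function name parse_ib_hca is hypothetical, and treating a missing port like port 0 is an assumption.

```python
def parse_ib_hca(value):
    """Split a 'device:port' string into (device, port).

    Assumption: a missing port is handled like port 0, and port 0
    falls back to port 1 (the first port), as the text describes.
    """
    device, _, port_text = value.strip().partition(":")
    port = int(port_text) if port_text else 0
    if port == 0:
        port = 1  # port 0 means "use the first port"
    return device, port


print(parse_ib_hca("mlx5_0:1"))  # ('mlx5_0', 1)
print(parse_ib_hca("mlx5_1:0"))  # ('mlx5_1', 1)
```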

ONEFLOW_COMM_NET_IB_GID_INDEX

The GID index used when querying with ibv_query_gid, which returns 0 on success. It is often used together with ONEFLOW_COMM_NET_IB_HCA. GID stands for Global ID. On a RoCE network, a QP must be set up with this value instead of just the LID, as is done on an IB network. #5626

Values accepted

The default value is 0, representing the port index value

ONEFLOW_COMM_NET_IB_QUEUE_DEPTH

The queue length for jobs on the IB network.

As with ONEFLOW_COMM_NET_IB_MEM_BLOCK_SIZE, this value controls the size explicitly instead of relying on IB's default size.

Values accepted

The default value is 1024, read as an int64_t. The system compares it with max_qp_wr (the maximum number of outstanding WRs on any work queue) and uses the smaller of the two.
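The clamping rule above amounts to taking a minimum. A minimal sketch, assuming max_qp_wr has already been obtained from the device (the helper name effective_queue_depth and the env parameter are hypothetical and exist only to make the example testable):

```python
import os

DEFAULT_QUEUE_DEPTH = 1024  # default stated in the text


def effective_queue_depth(max_qp_wr, env=None):
    """Return the smaller of the requested depth and the device limit."""
    env = os.environ if env is None else env
    requested = int(env.get("ONEFLOW_COMM_NET_IB_QUEUE_DEPTH", DEFAULT_QUEUE_DEPTH))
    return min(requested, max_qp_wr)


# A device that allows fewer outstanding WRs than requested wins.
print(effective_queue_depth(512, env={}))  # 512
```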

ONEFLOW_COMM_NET_IB_MEM_BLOCK_SIZE

The size of the memory block read during communication.

The value is used to calculate the number of blocks, which are encapsulated and then transmitted.

Values accepted

The default value is 8388608 (8 MiB).
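One way to read "calculating the number of blocks" is a ceiling division of the payload size by the block size. The sketch below shows that arithmetic only. The function name num_blocks is hypothetical, and whether OneFlow splits messages exactly this way is an assumption.

```python
DEFAULT_MEM_BLOCK_SIZE = 8388608  # 8 * 1024 * 1024 bytes, the stated default


def num_blocks(message_bytes, block_size=DEFAULT_MEM_BLOCK_SIZE):
    """Number of fixed-size blocks needed to cover message_bytes (ceiling division)."""
    return (message_bytes + block_size - 1) // block_size


print(num_blocks(20 * 1024 * 1024))  # 3
```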

ONEFLOW_STREAM_CUDA_EVENT_FLAG_BLOCKING_SYNC

Marks the CUDA events of a stream for blocking synchronization. For details, see #5612 and #5837.

Values accepted

Defined and set to false by default. It is treated as true only when the value is 1, true, yes, on, or y.
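Many boolean variables on this page share the rule above. A minimal sketch of that rule follows. The helper name parse_bool_env is hypothetical, and case-insensitive matching is an assumption not stated in the text.

```python
import os

TRUE_VALUES = {"1", "true", "yes", "on", "y"}  # the values listed in the text


def parse_bool_env(name, default=False, env=None):
    """Return True only for the listed truthy strings; unset yields the default."""
    env = os.environ if env is None else env
    raw = env.get(name)
    if raw is None:
        return default
    # Assumption: comparison ignores case and surrounding whitespace.
    return raw.strip().lower() in TRUE_VALUES


flags = {"ONEFLOW_STREAM_CUDA_EVENT_FLAG_BLOCKING_SYNC": "on"}
print(parse_bool_env("ONEFLOW_STREAM_CUDA_EVENT_FLAG_BLOCKING_SYNC", env=flags))  # True
```

Any other string, such as 0 or maybe, is read as false under this rule.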

ONEFLOW_LIBIBVERBS_PATH

The path used to load the ibverbs dynamic library with dlopen at runtime. The symbols of the ibverbs functions are resolved through dlopen rather than linked at compile time, which improves compatibility. #4852

If loading fails, OneFlow prints "libibverbs not available, ibv_fork_init skipped". If it succeeds, import oneflow prints a message such as "loaded library: /usr/lib/x86_64-linux-gnu/libibverbs.so.1".

Values accepted

The default value is empty, in which case libibverbs.so.1 or libibverbs.so is loaded.
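The runtime loading described above can be approximated from Python with ctypes, which wraps dlopen on Linux. This is a sketch, not OneFlow's loader: the function name load_ibverbs and its loader parameter are hypothetical, and using only the custom path when the variable is set is an assumption.

```python
import ctypes
import os

DEFAULT_LIB_NAMES = ["libibverbs.so.1", "libibverbs.so"]  # names given in the text


def load_ibverbs(env=None, loader=ctypes.CDLL):
    """Try each candidate library name and return (name, handle), or (None, None)."""
    env = os.environ if env is None else env
    custom = env.get("ONEFLOW_LIBIBVERBS_PATH", "")
    # Assumption: a custom path replaces the default names instead of extending them.
    candidates = [custom] if custom else DEFAULT_LIB_NAMES
    for name in candidates:
        try:
            return name, loader(name)  # ctypes.CDLL raises OSError if dlopen fails
        except OSError:
            continue
    return None, None
```

A caller could print "loaded library: ..." when a handle is returned and the "not available" message otherwise, mirroring the output described above.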

ONEFLOW_DEBUG_MODE

Enables debug mode. ONEFLOW_DEBUG has the same effect.

When debug mode is on, OneFlow outputs more INFO-level logs and writes various prototxt and dot files. In eager global mode, information about automatically inserted boxing is printed to the log file.

Values accepted

The default value is empty. Any string is accepted.

ONEFLOW_DRY_RUN

For test runs only. It can generate log files such as dot files.

The process exits once the test run succeeds and does not attempt real training.

Values accepted

The default value is empty. Any string is accepted.

ONEFLOW_DEBUG_KERNEL_SYNC_CHECK_NUMERICS

Use only when debugging, because it affects performance. It detects which op in the network produces NaN or Inf.

It creates a CpuCheckNumericsKernelObserver on CPU and a CudaCheckNumericsKernelObserver on CUDA. #6052

Values accepted

Defined and set to false by default. It is treated as true only when the value is 1, true, yes, on, or y.

ONEFLOW_DEBUG_KERNEL_SYNC_CHECK

Use only when debugging, because it affects performance.

It creates a SyncCheckKernelObserver and synchronizes after each kernel.

It can be used to debug CUDA errors. #6052

Values accepted

Defined and set to false by default. It is treated as true only when the value is 1, true, yes, on, or y.

ONEFLOW_PROFILER_KERNEL_PROFILE_CUDA_MEMORY_BANDWIDTH

Used when generating profiler files with nsys.

For now, the profiler works only in lazy mode.

It estimates the memory bandwidth achieved by a kernel from the execution time of the GPU kernel and the size of its input and output memory, which helps find kernels that could be optimized.

Values accepted

Defined and set to false by default. To use it, the package must be compiled with BUILD_PROFILER enabled.
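The bandwidth estimate described for this variable is bytes moved divided by kernel time. The sketch below shows only that arithmetic. The function name and the choice of GB/s (10^9 bytes per second) as the unit are assumptions, not the profiler's actual output format.

```python
def estimate_bandwidth_gb_per_s(input_bytes, output_bytes, kernel_time_s):
    """Estimate achieved memory bandwidth as (bytes read + bytes written) / time."""
    if kernel_time_s <= 0:
        raise ValueError("kernel_time_s must be positive")
    return (input_bytes + output_bytes) / kernel_time_s / 1e9


# Example: a kernel that reads 4 GB and writes 4 GB in 10 ms.
bandwidth = estimate_bandwidth_gb_per_s(4e9, 4e9, 0.01)
print(round(bandwidth, 1))  # 800.0
```

Comparing such an estimate with the peak bandwidth of the device indicates how close a memory-bound kernel is to the hardware limit.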

ONEFLOW_PROFILER_KERNEL_PROFILE_KERNEL_FORWARD_RANGE

Same as above. It collects op names.

Values accepted

Defined and set to false by default. To use it, the package must be compiled with BUILD_PROFILER enabled.

ONEFLOW_KERNEL_DISABLE_BLOB_ACCESS_CHECKER

Disables blob_access_checker when set to true. blob_access_checker exists for correctness assurance, and turning it off can reduce kernel overhead in some cases. #5728

Values accepted

Defined and set to false by default. It is treated as true only when the value is 1, true, yes, on, or y.

ONEFLOW_KERNEL_ENABLE_CUDA_GRAPH

Takes effect only when OneFlow is built with WITH_CUDA_GRAPHS. The default value is false. Turning it on uses more memory, so a job that has only just enough memory may fail to run.

See CUDA Graphs support, #5868.

Values accepted

Defined and set to false by default. It is treated as true only when the value is 1, true, yes, on, or y.

ONEFLOW_ACTOR_ENABLE_LIGHT_ACTOR

LightActor is a new type of Actor that handles only NormalForward and similar tasks in which every regst_num is 1, or tasks with only one kernel. #5868

The following settings can be used together: export ONEFLOW_KERNEL_ENABLE_CUDA_GRAPH=1 (uses more memory), export ONEFLOW_THREAD_ENABLE_LOCAL_MESSAGE_QUEUE=1, export ONEFLOW_KERNEL_DISABLE_BLOB_ACCESS_CHECKER=1, export ONEFLOW_ACTOR_ENABLE_LIGHT_ACTOR=1, and export ONEFLOW_STREAM_REUSE_CUDA_EVENT=1.

Values accepted

Defined and set to false by default. It is treated as true only when the value is 1, true, yes, on, or y.
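The group of flags listed for ONEFLOW_ACTOR_ENABLE_LIGHT_ACTOR can also be set from Python instead of with shell export statements. The sketch below assumes the variables must be in the environment before import oneflow runs. That timing is an assumption, and the names COMBINED_FLAGS and apply_flags are hypothetical.

```python
import os

# The five settings the text says can be used together.
COMBINED_FLAGS = {
    "ONEFLOW_KERNEL_ENABLE_CUDA_GRAPH": "1",  # uses more memory
    "ONEFLOW_THREAD_ENABLE_LOCAL_MESSAGE_QUEUE": "1",
    "ONEFLOW_KERNEL_DISABLE_BLOB_ACCESS_CHECKER": "1",
    "ONEFLOW_ACTOR_ENABLE_LIGHT_ACTOR": "1",
    "ONEFLOW_STREAM_REUSE_CUDA_EVENT": "1",
}


def apply_flags(flags, env=None):
    """Write each flag into the environment mapping (os.environ by default)."""
    env = os.environ if env is None else env
    for name, value in flags.items():
        env[name] = value
    return env


# Typical use: call apply_flags(COMBINED_FLAGS) first, then `import oneflow`.
```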

ONEFLOW_THREAD_ENABLE_LOCAL_MESSAGE_QUEUE

Enables the local message queue. oneflow.config.thread_enable_local_message_queue(True) is no longer used. #5720

Values accepted

Defined and set to false by default. It is treated as true only when the value is 1, true, yes, on, or y.

ONEFLOW_PERSISTENT_IN_STREAM_BUFFER_SIZE_BYTES

Represents the size of each read from disk. #5162

Values accepted

The default value is empty. If the value is an invalid string or a negative number, 32 * 1024 bytes (32 KB) is used.
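The fallback rule above can be sketched as follows. The helper name in_stream_buffer_size is hypothetical. Treating an unset variable the same as an invalid one, and passing zero through unchanged, are assumptions that the text does not settle.

```python
import os

FALLBACK_BUFFER_SIZE = 32 * 1024  # 32 KB, as stated in the text


def in_stream_buffer_size(env=None):
    """Return the configured read size, or the fallback for invalid or negative input."""
    env = os.environ if env is None else env
    raw = env.get("ONEFLOW_PERSISTENT_IN_STREAM_BUFFER_SIZE_BYTES")
    try:
        value = int(raw)  # raises TypeError if unset, ValueError if not a number
    except (TypeError, ValueError):
        return FALLBACK_BUFFER_SIZE
    return FALLBACK_BUFFER_SIZE if value < 0 else value


print(in_stream_buffer_size(env={"ONEFLOW_PERSISTENT_IN_STREAM_BUFFER_SIZE_BYTES": "-1"}))  # 32768
```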

ONEFLOW_DECODER_ENABLE_NVJPEG_HARDWARE_ACCELERATION

Requires NVJPEG_VER_MAJOR to be greater than 11. It enables nvJPEG hardware acceleration and warms up the jpeg decoder and the hw_jpeg decoder. #5851

See also: Hardware JPEG decoder and NVIDIA nvJPEG library on NVIDIA A100 GPUs.

Values accepted

Defined and set to true by default. It is treated as true only when the value is 1, true, yes, on, or y.

ONEFLOW_SERVING_DEBUG

Prints debug information for OneFlow Serving.

Values accepted

The default value is false

ONEFLOW_DISABLE_VIEW

Disables the view mechanism, which means that ops related to view stop running.

Values accepted

The default value is false

ONEFLOW_BOXING_DISABLE_MIDDLE_NODE_AND_CHECK

Whether to disable Middle Node. When it is false, all inter-SBP communication is supported

Values accepted

The default value is false

ONEFLOW_ONE_EMBEDDING_DISABLE_NUMA_AWARE_ALLOCATION

Whether to disable NUMA_AWARE memory allocation when the OneEmbedding module allocates video memory.

NUMA_AWARE memory allocation means that, when allocating pinned host memory, the CPU closest to the GPU is preferred (for example, for GPU 0 and GPU 1, memory is allocated on cpu0).

Values accepted

The default value is false

ONEFLOW_EP_CUDA_ENABLE_TF32_EXECUTION

Whether to allow CUDA to use TF32 numeric types for computation

Values accepted

The default value is true

ONEFLOW_FUNCTOR_DISABLE_FUSED_MLP

Whether to disable the fused_mlp operator implemented with cublasLt in FusedMLPFunctor. If disabled, it falls back to multiple matrix multiplication operations.

Values accepted

The default value is false

ONEFLOW_ONE_EMBEDDING_EMBEDDING_SHUFFLE_INDEPENTENT_STREAM

Whether to put the EmbeddingShuffle of the OneEmbedding module on a separate stream for overlapping execution.

Values accepted

The default value is false

ONEFLOW_ONE_EMBEDDING_GRADIENT_SHUFFLE_USE_FP16

Whether to allow the EmbeddingGradientShuffle operator of the OneEmbedding module to use the FP16 data type in the AMP case.

Values accepted

The default value is true

ONEFLOW_ONE_EMBEDDING_NOT_FUSE_CAST_TO_UPDATE

Whether to disable the fusion of cast type conversion and parameter update of OneEmbedding parameters into one operator in the case of AMP

Values accepted

The default value is false

ONEFLOW_DEBUG_KERNEL_SYNC_CHECK_NUMERICS_DUMP

When a value is NaN or Inf, save a dump of the data.

Values accepted

The default value is false

ONEFLOW_MLIR_ENABLE_IR_PRINTING

Controls whether to print the IR when running each pass during debugging.

Values accepted

The default value is false

ONEFLOW_MLIR_STDOUT

Control whether MLIR outputs log information in the console

Values accepted

The default value is false

ONEFLOW_MLIR_DUMP_IR

Control whether to dump ir files

Values accepted

The default value is false

ONEFLOW_MLIR_ENABLE_ROUND_TRIP

Controls whether the OneFlow Job goes into MLIR.

Values accepted

The default value is false

ONEFLOW_KERNEL_REDUCE_SUM_USE_MATMUL

Whether to use matrix multiplication for reduce_sum.

Values accepted

The default value is false

ONEFLOW_ONE_EMBEDDING_ENABLE_QUANTIZED_COMM

Whether to quantize the shuffle communication when OneEmbedding runs on multiple GPUs.

Values accepted

The default value is false

ONEFLOW_TENSOR_BUFFER_ALIGNED_SIZE

The alignment size used when allocating TensorBuffer memory.

Values accepted

The default value is 1024
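Aligned allocation usually means rounding a requested size up to the next multiple of the alignment. The sketch below shows that rounding with the default of 1024. The function name aligned_size is hypothetical, and whether TensorBuffer rounds in exactly this way is an assumption.

```python
DEFAULT_ALIGNED_SIZE = 1024  # default stated in the text


def aligned_size(nbytes, alignment=DEFAULT_ALIGNED_SIZE):
    """Round nbytes up to the nearest multiple of alignment."""
    return (nbytes + alignment - 1) // alignment * alignment


print(aligned_size(1025))  # 2048
```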

ONEFLOW_TENSOR_BUFFER_POOL_THREAD_LOCAL_CACHE_SIZE

Control the size of thread_local_cache in TensorBufferPool

Values accepted

The default value is 64

ONEFLOW_GRPC_MAX_MESSAGE_BYTE_SIZE

Set the maximum size of the gRPC transport message

Values accepted

The default value is -1

ONEFLOW_ONE_EMBEDDING_PERSISTENT_TABLE_CAPACITY_HINT

Control the initial capacity of the PersistentTable of OneEmbedding to avoid frequent expansion

Values accepted

By default, OneEmbedding calculates the capacity according to the actual situation. Users can also configure a larger capacity.

ONEFLOW_ONE_EMBEDDING_PERSISTENT_TABLE_NUM_WORKERS

The number of threads used for reading and writing the PersistentTable of OneEmbedding

Values accepted

The default value is 4

ONEFLOW_EP_CUDA_CONST_BUFFER_ELEMENT_COUNT

Specify the size of the all zero and all one buffers on the CUDA device.

This buffer can be used with matrix multiplication to implement operations such as reduce_sum

Values accepted

The default value is 1024x1024
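The idea of implementing reduce_sum with matrix multiplication can be shown in plain Python: multiplying a matrix by an all-one column vector sums each row, and multiplying an all-one row vector by the matrix sums each column. This is only an illustration of the technique. The function names are hypothetical, and the real kernels run on the GPU against the preallocated constant buffer.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def reduce_sum_rows(matrix):
    """Sum each row by multiplying with an all-one column vector."""
    ones = [[1] for _ in range(len(matrix[0]))]
    return [row[0] for row in matmul(matrix, ones)]


def reduce_sum_cols(matrix):
    """Sum each column by multiplying an all-one row vector with the matrix."""
    ones = [[1] * len(matrix)]
    return matmul(ones, matrix)[0]


data = [[1, 2, 3], [4, 5, 6]]
print(reduce_sum_rows(data))  # [6, 15]
print(reduce_sum_cols(data))  # [5, 7, 9]
```

Because the all-one vector never changes, it can be allocated once and reused, which is the role of the constant buffer sized by this variable.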

OMP_NUM_THREADS

Set the number of threads used by OMP

Values accepted

The default value is computed by internal logic.

SBP_INFER_RULE_TAG

Specify SBP derivation rules

Values accepted

When the value is 1, select the SBP that satisfies the producer, or the SBP with the smallest cost, as far as possible.

When the value is 2, select the SBP that matches the most.

When the value is 3, select the SBP with the smallest cost.

ONEFLOW_TENSOR_BUFFER_GROWTH_FACTOR

Control the growth factor of TensorBuffer

Values accepted

The default value is 1.0

ONEFLOW_TENSOR_BUFFER_SHRINK_FACTOR

Controls the shrink factor of TensorBuffer

Values accepted

The default value is 0.7

ONEFLOW_TENSOR_BUFFER_POOL_SIZE_FACTOR

Controls the size factor of TensorBuffer

Values accepted

The default value is 2.0

AUTO_PARALLEL_TRANSFER_COST

Control the size of the automatic parallel transfer cost

Values accepted

The default value is 1.65e8

ONEFLOW_DEBUG_PASS

Set to a pass name to print the job before and after that specific pass, for example export ONEFLOW_DEBUG_PASS="FuseAddToOutputPass".

Set to ALL to print the job before and after every pass, for example export ONEFLOW_DEBUG_PASS="ALL".

Values accepted

The default value is empty

ONEFLOW_PROFILER_HOST_THREAD_NAME_PREFIX

Add a prefix to the name of the named host thread in the profiling context to facilitate sorting in the visualization tool (nsight)

Values accepted

The default value is empty