oneflow.distributed
Run the command below to see detailed usage information:
python3 -m oneflow.distributed.launch -h
usage: launch.py [-h] [--nnodes NNODES] [--node_rank NODE_RANK]
                 [--nproc_per_node NPROC_PER_NODE] [--master_addr MASTER_ADDR]
                 [--master_port MASTER_PORT] [-m] [--no_python]
                 [--redirect_stdout_and_stderr] [--logdir LOGDIR]
                 training_script ...
OneFlow distributed training launch helper utility that will spawn up multiple
distributed processes
positional arguments:
  training_script       The full path to the single GPU training program/script to be
                        launched in parallel, followed by all the arguments for the
                        training script
  training_script_args
optional arguments:
  -h, --help            show this help message and exit
  --nnodes NNODES       The number of nodes to use for distributed training
  --node_rank NODE_RANK
                        The rank of the node for multi-node distributed training
  --nproc_per_node NPROC_PER_NODE
                        The number of processes to launch on each node, for GPU
                        training, this is recommended to be set to the number of GPUs in
                        your system so that each process can be bound to a single GPU.
  --master_addr MASTER_ADDR
                        Master node (rank 0)'s address, should be either the IP address
                        or the hostname of node 0, for single node multi-proc training,
                        the --master_addr can simply be 127.0.0.1
  --master_port MASTER_PORT
                        Master node (rank 0)'s free port that needs to be used for
                        communication during distributed training
  -m, --module          Changes each process to interpret the launch script as a python
                        module, executing with the same behavior as 'python -m'.
  --no_python           Do not prepend the training script with "python" - just exec it
                        directly. Useful when the script is not a Python script.
  --redirect_stdout_and_stderr
                        Write the stdout and stderr to files 'stdout' and 'stderr'. Only
                        available when logdir is set
  --logdir LOGDIR       Relative path to write subprocess logs to. Passing in a relative
                        path will create a directory if needed. Note that successive
                        runs with the same path to write logs to will overwrite existing
                        logs, so be sure to save logs as needed.
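For illustration, the commands below sketch a typical single-node launch and a two-node launch using only the options listed above. The script name train.py, its --epochs argument, the address 192.168.1.1, and port 7788 are placeholders for your own training program, arguments, and cluster settings.

# Single node, 4 GPUs (train.py and --epochs are placeholders):
python3 -m oneflow.distributed.launch --nproc_per_node 4 train.py --epochs 10

# Two nodes, 8 GPUs each; run once on each node with the matching --node_rank.
# 192.168.1.1 stands in for the rank-0 node's address:
python3 -m oneflow.distributed.launch --nnodes 2 --node_rank 0 --nproc_per_node 8 \
    --master_addr 192.168.1.1 --master_port 7788 train.py --epochs 10
python3 -m oneflow.distributed.launch --nnodes 2 --node_rank 1 --nproc_per_node 8 \
    --master_addr 192.168.1.1 --master_port 7788 train.py --epochs 10

Inside the launched script, each spawned process is expected to bind itself to one GPU, typically chosen from its local rank (for example via oneflow.env.get_local_rank() in recent OneFlow versions), which is why the help text recommends setting --nproc_per_node to the number of GPUs per node.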