Issue
When I use npkit_trace_generator.py to convert the trace files generated by NPKit into a JSON file, I get the following error:
Traceback (most recent call last):
File "/home/zhangshizhuo/msccl/tools/npkit_trace_generator.py", line 232, in <module>
convert_npkit_dump_to_trace(args.input_dir, args.output_dir, npkit_event_def)
File "/home/zhangshizhuo/msccl/tools/npkit_trace_generator.py", line 211, in convert_npkit_dump_to_trace
gpu_events = parse_gpu_event_file(npkit_dump_dir, npkit_event_def, rank, buf_idx, gpu_clock_scale, cpu_clock_scale)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/zhangshizhuo/msccl/tools/npkit_trace_generator.py", line 95, in parse_gpu_event_file
'ts': curr_cpu_base_time + parsed_gpu_event['timestamp'] / gpu_clock_scale - curr_gpu_base_time,
~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
TypeError: unsupported operand type(s) for +: 'NoneType' and 'float'
Specifically, I used msccl-tools/examples/mscclang/allgather_recursive_doubling.py to generate the XML file and ran the communication test on the cluster. The same error occurs when testing reduce scatter, but not with allreduce or alltoall. Can you help me with this error? Looking forward to your reply.
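For reference, the expression on line 95 of the traceback fails whenever curr_cpu_base_time is still None when a GPU event is converted. The snippet below is a minimal sketch that reproduces the same TypeError outside the tool. The function names to_trace_ts and to_trace_ts_guarded are hypothetical, and the guard is only an illustration of a possible workaround, not the tool's actual code or an official fix.

```python
def to_trace_ts(cpu_base_time, gpu_timestamp, gpu_clock_scale, gpu_base_time):
    # Same arithmetic as the failing expression on line 95 of the traceback.
    return cpu_base_time + gpu_timestamp / gpu_clock_scale - gpu_base_time


def to_trace_ts_guarded(cpu_base_time, gpu_timestamp, gpu_clock_scale, gpu_base_time):
    # Hypothetical guard: return None instead of raising when either base
    # time has not been set yet (assumed to be the cause of the error).
    if cpu_base_time is None or gpu_base_time is None:
        return None
    return to_trace_ts(cpu_base_time, gpu_timestamp, gpu_clock_scale, gpu_base_time)


# Reproduce the error with an unset CPU base time.
try:
    to_trace_ts(None, 1000, 1.0, None)
except TypeError as exc:
    error_message = str(exc)
    print(error_message)

# The guarded variant skips the event instead of crashing.
print(to_trace_ts_guarded(None, 1000, 1.0, None))
```

If this assumption holds, the open question is why the allgather and reduce scatter dumps contain GPU events before any base time is set, while the allreduce and alltoall dumps do not.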
Details
Generate xml file:
python /home/zhangshizhuo/msccl-tools/examples/mscclang/allgather_recursive_doubling.py 4 1 --protocol='Simple' > /home/zhangshizhuo/xml2/Allgather_test.xml
mpirun test:
mpirun --prefix /usr/local/openmpi \
-np 4 \
-H gpu1:4 \
-map-by slot \
-mca btl_tcp_if_include 10.1.1.0/24 \
-x NCCL_SOCKET_IFNAME=ens16f0,enp75s0f0np0,ens6f0 \
-x LD_LIBRARY_PATH=/home/zhangshizhuo/msccl/build/lib/:$LD_LIBRARY_PATH \
-x NCCL_NET_SHARED_BUFFERS=0 \
-x NCCL_IGNORE_DISABLED_P2P=1 \
-x NCCL_SHM_Disable=1 \
-x NCCL_DEBUG=INFO \
-x NCCL_ALGO=MSCCL,RING \
-x MSCCL_XML_FILES=/home/zhangshizhuo/xml2/Allgather_test.xml \
-x NPKIT_DUMP_DIR=/home/zhangshizhuo/trace/trace_allgather/ \
-x CUDA_VISIBLE_DEVICES=0,1,2,3 \
bash -c ' cd /home/zhangshizhuo/nccl-tests/build/; \
./all_gather_perf -b 32M -e 32M -f 2 -g 1 -n 5 -w 3 -c 0 -z 1 '