Skip to content

Throughput Failed On Mutiple GPUS #287

@westfly

Description

@westfly

I ran example throughput.cu and it failed on 4XGPU,

Command: 'cudaMemsetAsync(m_l2_buffer, 0, static_cast<std::size_t>(m_l2_size), stream)'
Run:  [5/8] throughput_bench [Device=0]
Fail: Unexpected error: /data/github/build/cache/nvbench/b2fc/nvbench/detail/l2flush.cuh:55: Cuda API call returned error: cudaErrorInvalidValue: invalid argument
Command: 'cudaMemsetAsync(m_l2_buffer, 0, static_cast<std::size_t>(m_l2_size), stream)'
Run:  [6/8] throughput_bench [Device=1]
Fail: Unexpected error: /data/github/build/cache/nvbench/b2fc/nvbench/detail/l2flush.cuh:55: Cuda API call returned error: cudaErrorInvalidValue: invalid argument
Command: 'cudaMemsetAsync(m_l2_buffer, 0, static_cast<std::size_t>(m_l2_size), stream)'
Run:  [7/8] throughput_bench [Device=2]
Fail: Unexpected error: /data/github/build/cache/nvbench/b2fc/nvbench/detail/l2flush.cuh:55: Cuda API call returned error: cudaErrorInvalidValue: invalid argument
Command: 'cudaMemsetAsync(m_l2_buffer, 0, static_cast<std::size_t>(m_l2_size), stream)'
Run:  [8/8] throughput_bench [Device=3]
Pass: Cold: 0.007061ms GPU, 0.016156ms CPU, 0.50s total GPU, 6.81s total wall, 70816x
Pass: Batch: 0.002299ms GPU, 0.50s total GPU, 0.50s to

I noticed examples/stream.cu that can set_cuda_stream

  state.set_cuda_stream(nvbench::make_cuda_stream_view(default_stream));

so I added it to throughput.cu which works fine

# Log



Run:  [1/4] throughput_bench [Device=0]
Pass: Cold: 0.663276ms GPU, 0.672594ms CPU, 0.51s total GPU, 0.54s total wall, 768x
Pass: Batch: 0.659212ms GPU, 0.53s total GPU, 0.53s total wall, 800x
Run:  [2/4] throughput_bench [Device=1]
Pass: Cold: 0.665058ms GPU, 0.674441ms CPU, 0.50s total GPU, 0.53s total wall, 752x
Pass: Batch: 0.660540ms GPU, 0.54s total GPU, 0.54s total wall, 815x
Run:  [3/4] throughput_bench [Device=2]
Pass: Cold: 0.664827ms GPU, 0.674139ms CPU, 0.51s total GPU, 0.55s total wall, 768x
Pass: Batch: 0.660413ms GPU, 0.53s total GPU, 0.53s total wall, 809x
Run:  [4/4] throughput_bench [Device=3]
Pass: Cold: 0.665416ms GPU, 0.674786ms CPU, 0.50s total GPU, 0.53s total wall, 752x
Pass: Batch: 0.660745ms GPU, 0.53s total GPU, 0.53s total wall, 807x

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions