
perf(cuda/elementwise) #140

Merged

kilinchange merged 1 commit into master from perf/cuda_elementwise_kernel on Apr 8, 2026
Conversation

@kilinchange (Collaborator) commented Apr 7, 2026

Pass broadcast strides by value to eliminate the per-call cudaMallocAsync.

@kilinchange changed the title from "perf(cuda/elementwise): pass broadcast strides by value to kill per-call cudaMallocAsync" to "perf(cuda/elementwise):" on Apr 7, 2026
@kilinchange changed the title from "perf(cuda/elementwise):" to "perf(cuda/elementwise)" on Apr 7, 2026
@kilinchange (Collaborator, Author) commented Apr 7, 2026

Performance improvement:

[screenshot: benchmark results]

@kilinchange (Collaborator, Author) commented Apr 7, 2026

Accuracy:

  • The accuracy fluctuation on the gpt2_1_bfloat16 test case is within the acceptable range relative to torch (torch reference: step 6/10 | train loss 5.063675, step 7/10 | train loss 4.845804, step 10/10 | train loss 5.221752)
  • lora has a known accuracy fluctuation

[screenshots: accuracy comparison]

@kilinchange kilinchange requested a review from chen2021673 April 7, 2026 10:58
// Maximum number of dimensions supported by the broadcast metadata.
// Real-world tensors in this codebase top out at 4-5 dims, so 8 leaves comfortable headroom
// while keeping the struct under the 4 KB CUDA kernel parameter limit.
constexpr int kMaxBroadcastDims = 8;
Contributor:

Agreed, 8 is generally enough in practice. If a case with more than 8 dims shows up later, it can still fall back to the memcpy version.

@chen2021673 (Contributor) left a review comment:

LGTM

@kilinchange kilinchange merged commit cfe7bf8 into master Apr 8, 2026
2 checks passed
@kilinchange kilinchange deleted the perf/cuda_elementwise_kernel branch April 8, 2026 03:11

2 participants