
Commit 13cf84a

zhangyue207 and zhangyue authored
feat(ascend): op-simple group — Add, Mul, Cast, Cat, Matmul, Gemm, Linear (#65)
* feat(ascend): op-simple group — Add, Mul, Cast, Cat, Matmul, Gemm, Linear

Seven foundational Ascend operators:

| op     | impl |
|--------|------|
| Add    | aclnnAdd |
| Mul    | aclnnMul |
| Cast   | aclnnCast |
| Cat    | aclnnCat |
| Matmul | aclnnMatmul |
| Gemm   | aclnnAddmm / aclnnBaddbmm (also carries the cached-executor / workspace-pool rework) |
| Linear | aclnnMatmul + optional bias |

Also ships:
- `src/base/<op>.h` for the 5 new ops (cast/cat/linear/matmul/mul); `add.h` and `gemm.h` existed on master and are updated in-place
- `src/cpu/<op>/<op>.h` reference impls for cast/cat/linear/mul (add/gemm/matmul already had CPU refs on master)
- `tests/test_<op>.py` for each operator (add and gemm have MODIFY diffs; the others are new)

* fix(ascend): Add/Cat destructor — use `release()` for executor-owned caches

- `add/kernel.h`: swap destroy() → release() on in_cache_/oth_cache_/out_cache_ and drop aclDestroyAclOpExecutor (both are referenced by the Repeatable executor; destroying them causes a double-free at shutdown, per the pattern documented in common.h and commit 64c367c).
- `cat/kernel.h`: release all in_caches_[i] in the destructor; without this, ~AclTensorCache() double-frees descriptors held by tensor_list_ / executor_ when the vector tears down.
- Also groups the alpha_* storage members with blank lines to match the file's convention.

* test: generate `implementation_index` dynamically from `active_implementation_indices`

Replaces the hardcoded `(0, 1)` / `(0, 1, 2)` tuples in test_add, test_gemm, test_rms_norm, test_swiglu with a union over the locally available devices' active implementation indices. The new helper `tests.utils.all_active_implementation_indices(op_cls)` iterates only `get_available_devices()` to avoid `DispatchFunc::std::abort` on device types outside the build's `ActiveDevices` set.

Effect on Ascend CI: the skipped-test count drops from 3246 to 1686 — impl=1 (`cuBLASLt`) is no longer parametrized when no CUDA device is visible, and RmsNorm/Swiglu's custom-kernel slot drops out of the matrix on op-simple, where the framework layer hasn't merged the AscendC impl yet.

* test(conftest): joint `(device, implementation_index)` parametrize

Replaces the per-test `@pytest.mark.parametrize("implementation_index", ...)` + runtime `if impl not in active_indices: skip` pattern with a single hook in `conftest.pytest_generate_tests` that emits only the (device, impl) pairs actually active on each device.

Rationale: kernel dispatch is per-device, so the cross-device union (the previous `all_active_implementation_indices` helper) polluted the matrix with impls that the selected device can't run — runtime-skipped noise. Joint generation keeps the matrix to its semantic cell: "this device has this impl, so run it".

- `tests/conftest.py`: when both `device` and `implementation_index` are in fixturenames, emit pairs via `op_cls.active_implementation_indices(dev)`; fall back to a skipped placeholder (`id="skip"`) when no device has an active impl, avoiding `[NOTSET-...]` test IDs.
- `tests/{test_add,test_gemm,test_rms_norm,test_swiglu}.py`: drop the hardcoded `implementation_index` parametrize decorator and the runtime `active_indices` guard — conftest now handles both.
- `tests/utils.py`: remove the `all_active_implementation_indices` helper (superseded by per-device generation in conftest).

Same test outcome on Ascend CI (1935 passed / 1686 skipped), but the remaining skips are now semantically mandatory (uint dtypes unsupported by `torch_npu`, Gemm impl=2 SFINAE-only workaround, op missing an ascend impl on op-simple pending PR #66) rather than mechanism artifacts.

* refactor(conftest): dedupe `_op_class_from_module`, short-circuit redundant fixture

Post-review cleanup of the joint-parametrize refactor (1dd288f):

- Extract `_op_class_from_module` as a shared helper; the `skip_op_without_platform_impl` fixture now calls it instead of re-deriving the snake→pascal class name inline.
- Short-circuit the fixture when `implementation_index` is already in the callspec — `pytest_generate_tests` has already pruned empty-impl pairs, so per-case `active_implementation_indices` calls are wasted work.
- Drop the `try/except ImportError` inside the helper — collection has already imported `infini.ops` via the test modules; masking a real import failure only turns it into a cryptic NOTSET fixture.
- Drop the `devices[0] if devices else "cpu"` fallback — `get_available_devices()` always includes `"cpu"`, making the `else` arm unreachable.

* refactor(cpu): flatten nested `DispatchFunc` in Cast; snake_case variables in Linear

Per PR #65 review:

- `src/cpu/cast/cast.h`: replace the nested `DispatchFunc(in_dtype, ...)` inside `DispatchFunc(out_dtype, ...)` with a single multi-dispatch call `DispatchFunc<kCpu, AllTypes, AllTypes>({in, out}, [](in_tag, out_tag) {...})`, per the multi-dispatch idiom documented in `CONTRIBUTING.md`.
- `src/cpu/linear/linear.h`: rename PascalCase locals to snake_case: `A/B/Out/Bias` → `a_ptr/b_ptr/out_ptr/bias_ptr`, `A_batch/B_batch/Out_batch` → `a_batch/b_batch/out_batch`, `M/N/K` → `m/n/k` (matching master's `src/cpu/gemm/gemm.h`, which already uses the lowercase dim names `m_/n_/k_`).

* refactor(cpu/linear): drop redundant `&& bias` guard + narrating comment

- `if (bias_ptr && bias)` → `if (bias_ptr)` (line 75). `bias_ptr` is `nullptr` iff `!bias` by construction at line 38, so `&& bias` is dead.
- Remove the comment `// Determine m, n, k from shapes and transpose flags.` — the three lines below it do exactly that, and they are self-describing now that the names are snake_case.

---------

Co-authored-by: zhangyue <zhangyue@example.com>
1 parent a05713b commit 13cf84a

27 files changed

Lines changed: 1584 additions & 107 deletions
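
Reading aid (not part of the commit): the kernels below lean on two Ascend-side helpers declared in `ascend/common.h` and `ascend/workspace_pool_.h`, which are not included in this diff. The following is a rough interface sketch inferred purely from the call sites; the class and function names appear in the diffs, but the exact signatures, namespace nesting, and members are assumptions, not the actual headers.

// Sketch only: inferred from how the kernels in this commit use these helpers.
namespace infini::ops::ascend {

class AclTensorCache {
 public:
  explicit AclTensorCache(const Tensor& t);          // capture shape/stride/dtype
  AclTensorCache(const Tensor& t, bool transposed);  // Gemm passes trans_a_ / trans_b_

  // Bind `data` to the cached aclTensor descriptor and return it. The first
  // call builds the descriptor; later calls only swap its data pointer.
  aclTensor* get(void* data);

  // Forget the descriptor without calling aclDestroyTensor; used when a
  // Repeatable aclOpExecutor still references it, so destroying it from the
  // kernel's destructor would double-free (see commit 64c367c).
  void release();

  ~AclTensorCache();  // destroys the descriptor unless release() was called
};

struct WorkspaceArena {
  void* buf = nullptr;  // device workspace handed to the aclnn launch call
  uint64_t size = 0;
};

class WorkspacePool {
 public:
  // Return the arena bound to `stream`, growing it to at least `min_size`
  // bytes, so repeated launches reuse a single workspace allocation.
  WorkspaceArena& Ensure(aclrtStream stream, uint64_t min_size);
};

WorkspacePool& GetWorkspacePool();

}  // namespace infini::ops::ascend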

src/ascend/add/kernel.h

Lines changed: 93 additions & 0 deletions
@@ -0,0 +1,93 @@
#ifndef INFINI_OPS_ASCEND_ADD_KERNEL_H_
#define INFINI_OPS_ASCEND_ADD_KERNEL_H_

#include "acl/acl.h"
#include "aclnn/aclnn_base.h"
#include "aclnn_add.h"
#include "ascend/common.h"
#include "ascend/workspace_pool_.h"
#include "base/add.h"
#include "data_type.h"
#include "operator.h"

namespace infini::ops {

template <>
class Operator<Add, Device::Type::kAscend> : public Add {
 public:
  Operator(const Tensor input, const Tensor other, Tensor out)
      : Add(input, other, out),
        in_cache_(input),
        oth_cache_(other),
        out_cache_(out) {
    // `aclCreateScalar` stores the pointer rather than copying the value, so
    // `alpha_storage_*` must remain alive for the lifetime of `alpha_`.
    // The alpha scalar type must match the tensor dtype: use int64 for integer
    // dtypes and float for floating-point dtypes.
    if (ascend::IsIntegerDtype(input.dtype())) {
      alpha_ = aclCreateScalar(&alpha_int_storage_, ACL_INT64);
    } else {
      alpha_ = aclCreateScalar(&alpha_float_storage_, ACL_FLOAT);
    }
  }

  ~Operator() {
    if (!ascend::IsAclRuntimeAlive()) return;

    // Null cached descriptors — see `AclTensorCache::release()`. The
    // descriptors are still referenced by the Repeatable `executor_`, so
    // skipping `aclDestroyTensor` (and leaking the executor at shutdown)
    // avoids a double-free; see `64c367c`.
    in_cache_.release();
    oth_cache_.release();
    out_cache_.release();

    if (alpha_) aclDestroyScalar(alpha_);
  }

  void operator()(const Tensor input, const Tensor other,
                  Tensor out) const override {
    auto stream = static_cast<aclrtStream>(stream_);
    auto t_in = in_cache_.get(const_cast<void*>(input.data()));
    auto t_oth = oth_cache_.get(const_cast<void*>(other.data()));
    auto t_out = out_cache_.get(out.data());

    if (!executor_) {
      aclnnAddGetWorkspaceSize(t_in, t_oth, alpha_, t_out, &ws_size_,
                               &executor_);
      aclSetAclOpExecutorRepeatable(executor_);
    } else {
      aclSetInputTensorAddr(executor_, 0, t_in,
                            const_cast<void*>(input.data()));
      aclSetInputTensorAddr(executor_, 1, t_oth,
                            const_cast<void*>(other.data()));
      aclSetOutputTensorAddr(executor_, 0, t_out, out.data());
    }

    auto& arena = ascend::GetWorkspacePool().Ensure(stream, ws_size_);
    aclnnAdd(arena.buf, ws_size_, executor_, stream);
  }

 private:
  mutable ascend::AclTensorCache in_cache_;

  mutable ascend::AclTensorCache oth_cache_;

  mutable ascend::AclTensorCache out_cache_;

  mutable aclOpExecutor* executor_ = nullptr;

  mutable uint64_t ws_size_ = 0;

  // Stable address for `aclCreateScalar` (float).
  float alpha_float_storage_ = 1.0f;

  // Stable address for `aclCreateScalar` (int).
  int64_t alpha_int_storage_ = 1;

  aclScalar* alpha_ = nullptr;
};

}  // namespace infini::ops

#endif
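
For context, the constructor/first-call/repeat-call split above assumes roughly the following call pattern. This is a minimal sketch, not code from this PR; the dispatch layer that actually constructs and reuses the operator is not shown in this diff.

namespace infini::ops {

// Minimal sketch of the assumed lifecycle: one Operator per bound argument
// signature, invoked many times. Shapes, strides, and dtypes are frozen into
// the cached descriptors and the Repeatable executor; only the data addresses
// may change between calls.
inline void RunAddTwice(const Tensor a, const Tensor b, Tensor out) {
  Operator<Add, Device::Type::kAscend> add(a, b, out);  // builds caches, no executor yet

  add(a, b, out);  // 1st call: aclnnAddGetWorkspaceSize + aclSetAclOpExecutorRepeatable
  add(a, b, out);  // later calls: rebind tensor addresses on the cached executor only
}

}  // namespace infini::ops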

src/ascend/cast/kernel.h

Lines changed: 64 additions & 0 deletions
@@ -0,0 +1,64 @@
#ifndef INFINI_OPS_ASCEND_CAST_KERNEL_H_
#define INFINI_OPS_ASCEND_CAST_KERNEL_H_

#include "acl/acl.h"
#include "aclnn/aclnn_base.h"
#include "aclnnop/aclnn_cast.h"
#include "ascend/common.h"
#include "ascend/workspace_pool_.h"
#include "base/cast.h"
#include "operator.h"

namespace infini::ops {

template <>
class Operator<Cast, Device::Type::kAscend> : public Cast {
 public:
  Operator(const Tensor input, Tensor out)
      : Cast(input, out),
        in_cache_(input),
        out_cache_(out),
        acl_out_dtype_(ascend::ToAclDtype(out.dtype())) {}

  ~Operator() {
    if (!ascend::IsAclRuntimeAlive()) return;

    // Null cached descriptors — see `AclTensorCache::release()`.
    in_cache_.release();
    out_cache_.release();
  }

  void operator()(const Tensor input, Tensor out) const override {
    auto stream = static_cast<aclrtStream>(stream_);
    auto t_in = in_cache_.get(const_cast<void*>(input.data()));
    auto t_out = out_cache_.get(out.data());

    if (!executor_) {
      aclnnCastGetWorkspaceSize(t_in, acl_out_dtype_, t_out, &ws_size_,
                                &executor_);
      aclSetAclOpExecutorRepeatable(executor_);
    } else {
      aclSetInputTensorAddr(executor_, 0, t_in,
                            const_cast<void*>(input.data()));
      aclSetOutputTensorAddr(executor_, 0, t_out, out.data());
    }

    auto& arena = ascend::GetWorkspacePool().Ensure(stream, ws_size_);
    aclnnCast(arena.buf, ws_size_, executor_, stream);
  }

 private:
  mutable ascend::AclTensorCache in_cache_;

  mutable ascend::AclTensorCache out_cache_;

  aclDataType acl_out_dtype_;

  mutable aclOpExecutor* executor_ = nullptr;

  mutable uint64_t ws_size_ = 0;
};

}  // namespace infini::ops

#endif

src/ascend/cat/kernel.h

Lines changed: 105 additions & 0 deletions
@@ -0,0 +1,105 @@
#ifndef INFINI_OPS_ASCEND_CAT_KERNEL_H_
#define INFINI_OPS_ASCEND_CAT_KERNEL_H_

#include <vector>

#include "acl/acl.h"
#include "aclnn/acl_meta.h"
#include "aclnn/aclnn_base.h"
#include "aclnnop/aclnn_cat.h"
#include "ascend/common.h"
#include "ascend/workspace_pool_.h"
#include "base/cat.h"
#include "operator.h"

namespace infini::ops {

template <>
class Operator<Cat, Device::Type::kAscend> : public Cat {
 public:
  Operator(const Tensor first_input, std::vector<Tensor> rest_inputs,
           int64_t dim, Tensor out)
      : Cat(first_input, rest_inputs, dim, out), out_cache_(out) {
    // Build `AclTensorCache` for each input tensor.
    in_caches_.reserve(input_count_);
    in_caches_.emplace_back(first_input);
    for (const auto& t : rest_inputs) {
      in_caches_.emplace_back(t);
    }
  }

  ~Operator() {
    if (!ascend::IsAclRuntimeAlive()) return;

    // Null cached descriptors — see `AclTensorCache::release()`. The input
    // descriptors are referenced by the Repeatable `executor_` via
    // `tensor_list_`, so every `in_caches_[i]` must be released alongside
    // `out_cache_`; otherwise `~AclTensorCache()` double-frees them when the
    // vector destructs.
    for (auto& c : in_caches_) {
      c.release();
    }
    out_cache_.release();

    if (tensor_list_) aclDestroyTensorList(tensor_list_);
  }

  void operator()(const Tensor first_input, std::vector<Tensor> rest_inputs,
                  int64_t /*dim*/, Tensor out) const override {
    auto stream = static_cast<aclrtStream>(stream_);

    // Collect all input tensors in order.
    std::vector<const Tensor*> inputs;
    inputs.reserve(input_count_);
    inputs.push_back(&first_input);
    for (const auto& t : rest_inputs) {
      inputs.push_back(&t);
    }

    auto t_out = out_cache_.get(out.data());

    if (!executor_) {
      // First call: create descriptors, tensor list, and executor.
      std::vector<aclTensor*> acl_tensors(input_count_);
      for (size_t i = 0; i < input_count_; ++i) {
        acl_tensors[i] =
            in_caches_[i].get(const_cast<void*>(inputs[i]->data()));
      }

      tensor_list_ = aclCreateTensorList(
          const_cast<const aclTensor**>(acl_tensors.data()),
          static_cast<uint64_t>(input_count_));

      aclnnCatGetWorkspaceSize(tensor_list_, dim_, t_out, &ws_size_,
                               &executor_);
      aclSetAclOpExecutorRepeatable(executor_);
    } else {
      // Subsequent calls: update data pointers on cached descriptors via
      // `aclSetRawTensorAddr`. The executor holds references to the same
      // `aclTensor*` objects inside `tensor_list_`, so updating their data
      // pointers is sufficient — no `aclSetInputTensorAddr` needed.
      for (size_t i = 0; i < input_count_; ++i) {
        in_caches_[i].get(const_cast<void*>(inputs[i]->data()));
      }
      aclSetOutputTensorAddr(executor_, 0, t_out, out.data());
    }

    auto& arena = ascend::GetWorkspacePool().Ensure(stream, ws_size_);
    aclnnCat(arena.buf, ws_size_, executor_, stream);
  }

 private:
  mutable std::vector<ascend::AclTensorCache> in_caches_;

  mutable ascend::AclTensorCache out_cache_;

  mutable aclTensorList* tensor_list_ = nullptr;

  mutable aclOpExecutor* executor_ = nullptr;

  mutable uint64_t ws_size_ = 0;
};

}  // namespace infini::ops

#endif

src/ascend/gemm/kernel.h

Lines changed: 50 additions & 25 deletions
@@ -21,50 +21,63 @@ class Operator<Gemm, Device::Type::kAscend> : public Gemm {
       : Gemm(a, b, alpha, beta, trans_a, trans_b, c),
         batched_{batch_count_ > 1},
         alpha_val_{alpha.value_or(1.0f)},
-        beta_val_{beta.value_or(1.0f)} {
+        beta_val_{beta.value_or(1.0f)},
+        self_cache_(c),
+        a_cache_(a, trans_a_),
+        b_cache_(b, trans_b_),
+        out_cache_(c) {
     alpha_scalar_ = aclCreateScalar(&alpha_val_, ACL_FLOAT);
     beta_scalar_ = aclCreateScalar(&beta_val_, ACL_FLOAT);
   }
 
   ~Operator() {
-    aclDestroyScalar(alpha_scalar_);
-    aclDestroyScalar(beta_scalar_);
+    if (!ascend::IsAclRuntimeAlive()) return;
+
+    // Null cached descriptors — see `AclTensorCache::release()`.
+    self_cache_.release();
+    a_cache_.release();
+    b_cache_.release();
+    out_cache_.release();
+
+    if (alpha_scalar_) aclDestroyScalar(alpha_scalar_);
+    if (beta_scalar_) aclDestroyScalar(beta_scalar_);
   }
 
   void operator()(const Tensor a, const Tensor b, std::optional<float> alpha,
                   std::optional<float> beta, std::optional<int> trans_a,
                   std::optional<int> trans_b, Tensor c) const override {
     auto stream = static_cast<aclrtStream>(stream_);
 
-    auto t_self = ascend::BuildAclTensor(c);
-    auto t_a = ascend::BuildAclTensor(a, trans_a_);
-    auto t_b = ascend::BuildAclTensor(b, trans_b_);
-    auto t_out = ascend::BuildAclTensor(c);
-
-    uint64_t ws_needed = 0;
-    aclOpExecutor* executor = nullptr;
-
-    if (batched_) {
-      aclnnBaddbmmGetWorkspaceSize(t_self, t_a, t_b, beta_scalar_,
-                                   alpha_scalar_, t_out, 0, &ws_needed,
-                                   &executor);
+    auto t_self = self_cache_.get(c.data());
+    auto t_a = a_cache_.get(const_cast<void*>(a.data()));
+    auto t_b = b_cache_.get(const_cast<void*>(b.data()));
+    auto t_out = out_cache_.get(c.data());
+
+    if (!executor_) {
+      if (batched_) {
+        aclnnBaddbmmGetWorkspaceSize(t_self, t_a, t_b, beta_scalar_,
+                                     alpha_scalar_, t_out, 0, &ws_size_,
+                                     &executor_);
+      } else {
+        aclnnAddmmGetWorkspaceSize(t_self, t_a, t_b, beta_scalar_,
+                                   alpha_scalar_, t_out, 0, &ws_size_,
+                                   &executor_);
+      }
+      aclSetAclOpExecutorRepeatable(executor_);
     } else {
-      aclnnAddmmGetWorkspaceSize(t_self, t_a, t_b, beta_scalar_, alpha_scalar_,
-                                 t_out, 0, &ws_needed, &executor);
+      aclSetInputTensorAddr(executor_, 0, t_self, c.data());
+      aclSetInputTensorAddr(executor_, 1, t_a, const_cast<void*>(a.data()));
+      aclSetInputTensorAddr(executor_, 2, t_b, const_cast<void*>(b.data()));
+      aclSetOutputTensorAddr(executor_, 0, t_out, c.data());
     }
 
-    auto& arena = ascend::GetWorkspacePool().Ensure(stream, ws_needed);
+    auto& arena = ascend::GetWorkspacePool().Ensure(stream, ws_size_);
 
     if (batched_) {
-      aclnnBaddbmm(arena.buf, ws_needed, executor, stream);
+      aclnnBaddbmm(arena.buf, ws_size_, executor_, stream);
     } else {
-      aclnnAddmm(arena.buf, ws_needed, executor, stream);
+      aclnnAddmm(arena.buf, ws_size_, executor_, stream);
     }
-
-    aclDestroyTensor(t_self);
-    aclDestroyTensor(t_a);
-    aclDestroyTensor(t_b);
-    aclDestroyTensor(t_out);
   }
 
  private:
@@ -77,6 +90,18 @@ class Operator<Gemm, Device::Type::kAscend> : public Gemm {
   aclScalar* alpha_scalar_ = nullptr;
 
   aclScalar* beta_scalar_ = nullptr;
+
+  mutable ascend::AclTensorCache self_cache_;
+
+  mutable ascend::AclTensorCache a_cache_;
+
+  mutable ascend::AclTensorCache b_cache_;
+
+  mutable ascend::AclTensorCache out_cache_;
+
+  mutable aclOpExecutor* executor_ = nullptr;
+
+  mutable uint64_t ws_size_ = 0;
 };
 
 }  // namespace infini::ops
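
One reading note on the hunk above, not stated in the diff itself: `self` and `out` both bind `c`, so the launches update `c` in place. The sketch below spells out the assumed mapping.

// Sketch of the Gemm-to-aclnn mapping, assuming the aclnnAddmm / aclnnBaddbmm
// ops follow the usual addmm/baddbmm semantics (assumption; verify against the
// CANN docs rather than treating this as the library's specification):
//   non-batched:  c        = beta_val_ * c        + alpha_val_ * (a        x b)
//   batched:      c[i,:,:] = beta_val_ * c[i,:,:] + alpha_val_ * (a[i,:,:] x b[i,:,:])
// with a / b read as transposed when trans_a_ / trans_b_ were set at
// construction time (the flags are baked into a_cache_ / b_cache_).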
