3 changes: 3 additions & 0 deletions .gitignore
@@ -22,6 +22,9 @@ __pycache__/
# Cache
cache/

# Humanize / RLCR loop state
.humanize/

# JSON
*.json

81 changes: 81 additions & 0 deletions draft.md
@@ -0,0 +1,81 @@
# Operator Development Plan (diff, digamma, dist, logdet, pad)

## Goal Description
Fix, optimize, and successfully execute the 5 currently broken operators (`diff`, `digamma`, `dist`, `logdet`, `pad`) on a local NVIDIA RTX 5060Ti GPU. The objective is to make the codebase compile cleanly, pass all official benchmark tests without modifying any built-in test cases, and push the final working changes to the target remote repository and branch (`2025-autumn-LaiQuan-conquer-T1-1-37`).

## Acceptance Criteria

Following TDD philosophy, each criterion includes positive and negative tests for deterministic verification.

- AC-1: Successful Library and Operator Compilation
  - Positive Tests (expected to PASS):
    - Executing `XMAKE_ROOT=y python scripts/install.py --omp=y --cpu=y --nv-gpu=y` completes successfully with no syntax errors, undefined references, or fatal aborts in the terminal.
  - Negative Tests (expected to FAIL):
    - Compilation halts due to C++/CUDA syntax errors, missing headers, or type mismatches in any of the 5 targeted operator files.
- AC-2: Official Benchmark Tests Execution
  - Positive Tests:
    - Executing `python test/infinicore/run.py --ops diff,digamma,dist,logdet,pad --nv-gpu --bench` runs successfully, printing "PASS" and the benchmark performance metrics for all 5 operators.
  - Negative Tests:
    - The test script crashes due to runtime errors (e.g., CUDA out-of-bounds memory access, segmentation fault, illegal memory access) or fails the official assertions due to incorrect mathematical logic.
- AC-3: Strict Preservation of Official Test Cases
  - Positive Tests:
    - `git status` and `git diff` show zero modifications, deletions, or bypasses of the official test cases located in the `test/infinicore/` directory.
  - Negative Tests:
    - Built-in test cases or the official test scripts are found to have been modified to produce a false-positive pass.
- AC-4: Code Submission and Remote Push
  - Positive Tests:
    - Committing and running `git push` successfully uploads all local changes to the `2025-autumn-LaiQuan-conquer-T1-1-37` branch of the `git@github.com:LaiQuan-conquer/InfiniCore.git` repository.
  - Negative Tests:
    - The push is rejected by the remote due to an incorrect branch name, missing permissions, or a non-fast-forward error.
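AC-3 lends itself to a mechanical check. The sketch below is one possible shape of such a check, not part of the official tooling: the helper names and the `HEAD` baseline are assumptions, and the protected path prefix is taken from the criterion above.

```python
import subprocess

PROTECTED_PREFIX = "test/infinicore/"  # official test cases must stay untouched


def find_protected_changes(changed_paths):
    """Return any changed paths that fall under the protected test directory."""
    return [p for p in changed_paths if p.startswith(PROTECTED_PREFIX)]


def changed_files(repo_root="."):
    """List files modified relative to HEAD (assumes a git work tree)."""
    out = subprocess.run(
        ["git", "diff", "--name-only", "HEAD"],
        capture_output=True, text=True, check=True, cwd=repo_root,
    )
    return [line for line in out.stdout.splitlines() if line]
```

An empty result from `find_protected_changes(changed_files())` before committing is the machine-readable version of the AC-3 positive test.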

## Path Boundaries

Path boundaries define the acceptable range of implementation quality and choices.

### Upper Bound (Maximum Acceptable Scope)
A highly optimized CUDA implementation for all five operators that fully utilizes the shared memory and parallel computing capabilities of the local RTX 5060Ti. The code gracefully handles complex index calculations and memory boundaries (especially for `pad` and `diff`), achieves optimal computational performance in the benchmark tests, and features clean formatting with proper grid/block dimension tuning.

### Lower Bound (Minimum Acceptable Scope)
A fundamentally sound algorithmic implementation that resolves all existing syntax and compilation bugs, correctly computes the required mathematical outputs, and successfully passes the target test commands on the local GPU, satisfying the minimum requirements for the competition without over-engineering.

### Allowed Choices
- Can use: Standard CUDA C/C++ programming paradigms, existing mathematical helper functions/macros within the InfiniCore framework, and local profiling/debugging commands (e.g., `nvidia-smi`).
- Cannot use: Any modifications to the official test scripts (including `run.py` and its dependencies), alterations to the built-in test cases, or unauthorized closed-source third-party acceleration libraries.

## Feasibility Hints and Suggestions

> **Note**: This section is for reference and understanding only. These are conceptual suggestions, not prescriptive requirements.

### Conceptual Approach
1. **Compilation Troubleshooting**: Address the immediate "cannot compile" issue by inspecting the terminal logs from `install.py`. Fix fundamental C++ issues such as missing header includes, uninitialized pointers, or kernel parameter mismatches.
2. **Operator-by-Operator Execution**:
   - `diff`: Ensure correct stride and boundary checks when computing differences along specified dimensions.
   - `digamma`: Implement or correctly call stable numerical approximations for the logarithmic derivative of the gamma function to avoid NaN results.
   - `dist`: Focus on accurate norm calculations (e.g., p-norm) across vectors/matrices and ensure correct reduction implementation to prevent race conditions.
   - `logdet`: This may require a stable approach for determinant calculation (such as leveraging LU or Cholesky decomposition equivalents available in the framework, or robust custom kernels) to prevent underflow/overflow.
   - `pad`: Pay close attention to index mapping between the padded output tensor and the original input tensor, handling various padding modes (e.g., constant, reflect, replicate).
3. **Iterative Testing**: Isolate the operators using the provided test script (e.g., test individually via `--ops pad`). Debug logic errors sequentially before proceeding to the combined full benchmark validation.
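The `digamma` point above can be illustrated with a minimal CPU reference sketch in Python. The shift cutoff of 6 and the asymptotic-series coefficients are standard textbook choices, not values taken from the framework; a CUDA kernel would apply the same per-element formula.

```python
import math


def digamma(x: float) -> float:
    """Reference psi(x) for x > 0: recurrence shift plus asymptotic series."""
    # Recurrence psi(x) = psi(x + 1) - 1/x lifts the argument into the
    # region where the asymptotic expansion is accurate (x >= 6 here).
    result = 0.0
    while x < 6.0:
        result -= 1.0 / x
        x += 1.0
    # Asymptotic series: psi(x) ~ ln x - 1/(2x) - 1/(12x^2) + 1/(120x^4) - 1/(252x^6)
    inv = 1.0 / x
    inv2 = inv * inv
    result += math.log(x) - 0.5 * inv
    result -= inv2 * (1.0 / 12.0 - inv2 * (1.0 / 120.0 - inv2 / 252.0))
    return result
```

Spot checks against known values catch sign and coefficient mistakes early: `digamma(1.0)` should be close to `-0.5772156649` (the negative Euler–Mascheroni constant) and `digamma(2.0)` to `1 - 0.5772156649`.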

### Relevant References
- The source code directory of the kernel implementations to locate and refactor the currently non-functional logic.
- Framework-level common header files to utilize established memory access patterns.

## Dependencies and Sequence

### Milestones
1. Environment Configuration and Compilation Fixes
   - Phase A: Run the installation script and collect the initial compilation error logs for the 5 operators.
   - Phase B: Systematically patch syntax, template, and type errors until `install.py` executes successfully on the local environment.
2. Logic Correction and Individual Operator Verification
   - Phase A: Run the test command for each operator individually to debug and correct the mathematical kernels.
   - Phase B: Strictly verify via Git that the official built-in test case files remain untouched.
3. Benchmark Validation and Remote Submission
   - Phase A: Execute the full benchmark test command to confirm that the performance and outputs of all 5 operators pass.
   - Phase B: Commit the finalized code and push it to the designated Git repository and `2025-autumn-LaiQuan-conquer-T1-1-37` branch.

## Implementation Notes

### Code Style Requirements
- Implementation code and comments must NOT contain plan-specific terminology such as "AC-", "Milestone", "Step", "Phase", or similar workflow markers.
- These terms are strictly for plan documentation only.
- Use descriptive, mathematical, and domain-appropriate naming conventions within the actual C++/CUDA codebase.
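As a concrete anchor for the `pad` index-mapping concern raised in this plan, here is a 1-D constant-mode reference sketch. The semantics are assumed to mirror `torch.nn.functional.pad`; reflect and replicate modes differ only in how out-of-range indices are remapped back into the input, and the function name is illustrative, not a framework API.

```python
def pad1d_constant(x, pad_left, pad_right, value=0.0):
    """Reference 1-D constant padding: output index j reads input index j - pad_left."""
    n = len(x)
    out = []
    for j in range(pad_left + n + pad_right):
        i = j - pad_left  # map the output position back into the input
        out.append(x[i] if 0 <= i < n else value)
    return out
```

For example, `pad1d_constant([1, 2, 3], 2, 1)` yields `[0.0, 0.0, 1, 2, 3, 0.0]`; an N-D kernel applies the same mapping independently per padded dimension.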
15 changes: 15 additions & 0 deletions include/infinicore/ops/diff.hpp
@@ -0,0 +1,15 @@
#pragma once

#include "../device.hpp"
#include "../graph/graph.hpp"
#include "common/op.hpp"

namespace infinicore::op {

INFINICORE_GRAPH_OP_CLASS(Diff, Tensor, const Tensor &, int, int);

Tensor diff(const Tensor &x, int n = 1, int dim = -1);
void diff_(Tensor y, const Tensor &x, int n = 1, int dim = -1);

} // namespace infinicore::op

15 changes: 15 additions & 0 deletions include/infinicore/ops/digamma.hpp
@@ -0,0 +1,15 @@
#pragma once

#include "../device.hpp"
#include "../graph/graph.hpp"
#include "common/op.hpp"

namespace infinicore::op {

INFINICORE_GRAPH_OP_CLASS(Digamma, Tensor, const Tensor &);

Tensor digamma(const Tensor &x);
void digamma_(Tensor y, const Tensor &x);

} // namespace infinicore::op

15 changes: 15 additions & 0 deletions include/infinicore/ops/dist.hpp
@@ -0,0 +1,15 @@
#pragma once

#include "../device.hpp"
#include "../graph/graph.hpp"
#include "common/op.hpp"

namespace infinicore::op {

INFINICORE_GRAPH_OP_CLASS(Dist, Tensor, const Tensor &, const Tensor &, double);

Tensor dist(const Tensor &x1, const Tensor &x2, double p = 2.0);
void dist_(Tensor y, const Tensor &x1, const Tensor &x2, double p = 2.0);

} // namespace infinicore::op

15 changes: 15 additions & 0 deletions include/infinicore/ops/logdet.hpp
@@ -0,0 +1,15 @@
#pragma once

#include "../device.hpp"
#include "../graph/graph.hpp"
#include "common/op.hpp"

namespace infinicore::op {

INFINICORE_GRAPH_OP_CLASS(Logdet, Tensor, const Tensor &);

Tensor logdet(const Tensor &x);
void logdet_(Tensor y, const Tensor &x);

} // namespace infinicore::op

26 changes: 26 additions & 0 deletions include/infinicore/ops/pad.hpp
@@ -0,0 +1,26 @@
#pragma once

#include "../device.hpp"
#include "../graph/graph.hpp"
#include "common/op.hpp"

#include <string>
#include <vector>

namespace infinicore::op {

INFINICORE_GRAPH_OP_CLASS(Pad, Tensor, const Tensor &, const std::vector<int> &, const std::string &, double);

Tensor pad(const Tensor &x,
           const std::vector<int> &pad,
           const std::string &mode = "constant",
           double value = 0.0);

void pad_(Tensor y,
          const Tensor &x,
          const std::vector<int> &pad,
          const std::string &mode = "constant",
          double value = 0.0);

} // namespace infinicore::op

5 changes: 5 additions & 0 deletions include/infiniop.h
@@ -9,17 +9,22 @@
#include "infiniop/ops/clip.h"
#include "infiniop/ops/conv.h"
#include "infiniop/ops/dequantize_awq.h"
#include "infiniop/ops/diff.h"
#include "infiniop/ops/digamma.h"
#include "infiniop/ops/dist.h"
#include "infiniop/ops/embedding.h"
#include "infiniop/ops/flash_attention.h"
#include "infiniop/ops/gelu.h"
#include "infiniop/ops/gemm.h"
#include "infiniop/ops/int8_gemm.h"
#include "infiniop/ops/kv_caching.h"
#include "infiniop/ops/layer_norm.h"
#include "infiniop/ops/logdet.h"
#include "infiniop/ops/logsoftmax.h"
#include "infiniop/ops/lp_norm.h"
#include "infiniop/ops/mul.h"
#include "infiniop/ops/ones.h"
#include "infiniop/ops/pad.h"
#include "infiniop/ops/paged_attention.h"
#include "infiniop/ops/paged_attention_prefill.h"
#include "infiniop/ops/paged_caching.h"
26 changes: 26 additions & 0 deletions include/infiniop/ops/diff.h
@@ -0,0 +1,26 @@
#ifndef __INFINIOP_DIFF_API_H__
#define __INFINIOP_DIFF_API_H__

#include "../operator_descriptor.h"

typedef struct InfiniopDescriptor *infiniopDiffDescriptor_t;

__C __export infiniStatus_t infiniopCreateDiffDescriptor(infiniopHandle_t handle,
                                                         infiniopDiffDescriptor_t *desc_ptr,
                                                         infiniopTensorDescriptor_t y,
                                                         infiniopTensorDescriptor_t x,
                                                         int dim,
                                                         int n);

__C __export infiniStatus_t infiniopGetDiffWorkspaceSize(infiniopDiffDescriptor_t desc, size_t *size);

__C __export infiniStatus_t infiniopDiff(infiniopDiffDescriptor_t desc,
                                         void *workspace,
                                         size_t workspace_size,
                                         void *y,
                                         const void *x,
                                         void *stream);

__C __export infiniStatus_t infiniopDestroyDiffDescriptor(infiniopDiffDescriptor_t desc);

#endif
24 changes: 24 additions & 0 deletions include/infiniop/ops/digamma.h
@@ -0,0 +1,24 @@
#ifndef __INFINIOP_DIGAMMA_API_H__
#define __INFINIOP_DIGAMMA_API_H__

#include "../operator_descriptor.h"

typedef struct InfiniopDescriptor *infiniopDigammaDescriptor_t;

__C __export infiniStatus_t infiniopCreateDigammaDescriptor(infiniopHandle_t handle,
                                                            infiniopDigammaDescriptor_t *desc_ptr,
                                                            infiniopTensorDescriptor_t y,
                                                            infiniopTensorDescriptor_t x);

__C __export infiniStatus_t infiniopGetDigammaWorkspaceSize(infiniopDigammaDescriptor_t desc, size_t *size);

__C __export infiniStatus_t infiniopDigamma(infiniopDigammaDescriptor_t desc,
                                            void *workspace,
                                            size_t workspace_size,
                                            void *y,
                                            const void *x,
                                            void *stream);

__C __export infiniStatus_t infiniopDestroyDigammaDescriptor(infiniopDigammaDescriptor_t desc);

#endif
27 changes: 27 additions & 0 deletions include/infiniop/ops/dist.h
@@ -0,0 +1,27 @@
#ifndef __INFINIOP_DIST_API_H__
#define __INFINIOP_DIST_API_H__

#include "../operator_descriptor.h"

typedef struct InfiniopDescriptor *infiniopDistDescriptor_t;

__C __export infiniStatus_t infiniopCreateDistDescriptor(infiniopHandle_t handle,
                                                         infiniopDistDescriptor_t *desc_ptr,
                                                         infiniopTensorDescriptor_t y,
                                                         infiniopTensorDescriptor_t x1,
                                                         infiniopTensorDescriptor_t x2,
                                                         double p);

__C __export infiniStatus_t infiniopGetDistWorkspaceSize(infiniopDistDescriptor_t desc, size_t *size);

__C __export infiniStatus_t infiniopDist(infiniopDistDescriptor_t desc,
                                         void *workspace,
                                         size_t workspace_size,
                                         void *y,
                                         const void *x1,
                                         const void *x2,
                                         void *stream);

__C __export infiniStatus_t infiniopDestroyDistDescriptor(infiniopDistDescriptor_t desc);

#endif
24 changes: 24 additions & 0 deletions include/infiniop/ops/logdet.h
@@ -0,0 +1,24 @@
#ifndef __INFINIOP_LOGDET_API_H__
#define __INFINIOP_LOGDET_API_H__

#include "../operator_descriptor.h"

typedef struct InfiniopDescriptor *infiniopLogdetDescriptor_t;

__C __export infiniStatus_t infiniopCreateLogdetDescriptor(infiniopHandle_t handle,
                                                           infiniopLogdetDescriptor_t *desc_ptr,
                                                           infiniopTensorDescriptor_t y,
                                                           infiniopTensorDescriptor_t x);

__C __export infiniStatus_t infiniopGetLogdetWorkspaceSize(infiniopLogdetDescriptor_t desc, size_t *size);

__C __export infiniStatus_t infiniopLogdet(infiniopLogdetDescriptor_t desc,
                                           void *workspace,
                                           size_t workspace_size,
                                           void *y,
                                           const void *x,
                                           void *stream);

__C __export infiniStatus_t infiniopDestroyLogdetDescriptor(infiniopLogdetDescriptor_t desc);

#endif
28 changes: 28 additions & 0 deletions include/infiniop/ops/pad.h
@@ -0,0 +1,28 @@
#ifndef __INFINIOP_PAD_API_H__
#define __INFINIOP_PAD_API_H__

#include "../operator_descriptor.h"

typedef struct InfiniopDescriptor *infiniopPadDescriptor_t;

__C __export infiniStatus_t infiniopCreatePadDescriptor(infiniopHandle_t handle,
                                                        infiniopPadDescriptor_t *desc_ptr,
                                                        infiniopTensorDescriptor_t y,
                                                        infiniopTensorDescriptor_t x,
                                                        void *pad,
                                                        size_t pad_size,
                                                        const char *mode,
                                                        double value);

__C __export infiniStatus_t infiniopGetPadWorkspaceSize(infiniopPadDescriptor_t desc, size_t *size);

__C __export infiniStatus_t infiniopPad(infiniopPadDescriptor_t desc,
                                        void *workspace,
                                        size_t workspace_size,
                                        void *y,
                                        const void *x,
                                        void *stream);

__C __export infiniStatus_t infiniopDestroyPadDescriptor(infiniopPadDescriptor_t desc);

#endif