Commit e909c27

Merge pull request #1643 from codeflash-ai/cf-redesign-duplicate-detector
rework duplicate-code-detector a bit
2 parents f91278c + 0c78684 commit e909c27

4 files changed

Lines changed: 83 additions & 79 deletions

.claude/rules/architecture.md

Lines changed: 1 addition & 2 deletions
@@ -5,7 +5,6 @@ codeflash/
 ├── main.py # CLI entry point
 ├── cli_cmds/ # Command handling, console output (Rich)
 ├── discovery/ # Find optimizable functions
-├── context/ # Extract code dependencies and imports
 ├── optimization/ # Generate optimized code via AI
 │ ├── optimizer.py # Main optimization orchestration
 │ └── function_optimizer.py # Per-function optimization logic
@@ -35,7 +34,7 @@ codeflash/
 | Optimization orchestration | `optimization/optimizer.py``run()` |
 | Per-function optimization | `optimization/function_optimizer.py` |
 | Function discovery | `discovery/functions_to_optimize.py` |
-| Context extraction | `context/code_context_extractor.py` |
+| Context extraction | `languages/<lang>/context/code_context_extractor.py` |
 | Test execution | `verification/test_runner.py`, `verification/pytest_plugin.py` |
 | Performance ranking | `benchmarking/function_ranker.py` |
 | Domain types | `models/models.py`, `models/function_types.py` |

.github/workflows/claude.yml

Lines changed: 2 additions & 0 deletions
@@ -98,6 +98,8 @@ jobs:
 2. Security vulnerabilities
 3. Breaking API changes
 4. Test failures (methods with typos that wont run)
+5. Stale documentation — if files or directories were moved, renamed, or deleted, check that `.claude/rules/`, `CLAUDE.md`, and `AGENTS.md` don't reference paths that no longer exist
+6. New language support — if new language modules are added under `languages/`, check that `.github/workflows/duplicate-code-detector.yml` includes the new language in its file filters, search patterns, and cross-module checks
 
 IMPORTANT:
 - First check existing review comments using `gh api repos/${{ github.repository }}/pulls/${{ github.event.pull_request.number }}/comments`. For each existing comment, check if the issue still exists in the current code.
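
Review item 5 above asks the reviewer to confirm that documentation does not reference paths that no longer exist. A minimal sketch of such a check in Python, assuming paths appear in backticks and are relative to a single base directory. `find_stale_references` and its heuristics are hypothetical and not part of this commit:

```python
import re
from pathlib import Path

# Assumption: documented paths are wrapped in backticks and contain no whitespace.
BACKTICKED = re.compile(r"`([^`\s]+)`")
# Assumption: a concrete path uses only word characters, dots, dashes, and slashes.
# Templated paths such as languages/<lang>/context/ fail this test and are skipped.
PATH_LIKE = re.compile(r"^[\w.\-/]+$")


def find_stale_references(doc_text, base_dir):
    """Return backticked path-like tokens in doc_text that do not exist under base_dir."""
    base = Path(base_dir)
    stale = []
    for token in BACKTICKED.findall(doc_text):
        # Skip identifiers and calls like run(): only tokens with a slash count as paths.
        if "/" not in token or not PATH_LIKE.match(token):
            continue
        if not (base / token).exists():
            stale.append(token)
    return stale
```

Calling it with the text of `CLAUDE.md` and the package directory as `base_dir` would list references such as the old `context/` location once that directory has moved. A real check would also need to decide which base directory each document's paths are relative to.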

.github/workflows/duplicate-code-detector.yml

Lines changed: 78 additions & 75 deletions
@@ -21,96 +21,99 @@ jobs:
           fetch-depth: 0
           ref: ${{ github.event.pull_request.head.ref || github.ref }}
 
-      - name: Start Serena MCP server
-        run: |
-          docker pull ghcr.io/github/serena-mcp-server:latest
-          docker run -d --name serena \
-            --network host \
-            -v "${{ github.workspace }}:${{ github.workspace }}:rw" \
-            ghcr.io/github/serena-mcp-server:latest \
-            serena start-mcp-server --context codex --project "${{ github.workspace }}"
-
-          mkdir -p /tmp/mcp-config
-          cat > /tmp/mcp-config/mcp-servers.json << 'EOF'
-          {
-            "mcpServers": {
-              "serena": {
-                "command": "docker",
-                "args": ["exec", "-i", "serena", "serena", "start-mcp-server", "--context", "codex", "--project", "${{ github.workspace }}"]
-              }
-            }
-          }
-          EOF
-
       - name: Configure AWS Credentials
         uses: aws-actions/configure-aws-credentials@v4
         with:
           role-to-assume: ${{ secrets.AWS_ROLE_TO_ASSUME }}
           aws-region: ${{ secrets.AWS_REGION }}
 
+      - name: Get changed source files
+        id: changed-files
+        run: |
+          FILES=$(git diff --name-only origin/main...HEAD -- '*.py' '*.js' '*.ts' '*.java' \
+            | grep -v -E '(test_|_test\.(py|js|ts)|\.test\.(js|ts)|\.spec\.(js|ts)|conftest\.py|/tests/|/test/|/__tests__/)' \
+            | grep -v -E '^(\.github/|code_to_optimize/|\.tessl/|node_modules/)' \
+            || true)
+          if [ -z "$FILES" ]; then
+            echo "files=" >> "$GITHUB_OUTPUT"
+            echo "No changed source files to analyze."
+          else
+            echo "files<<EOF" >> "$GITHUB_OUTPUT"
+            echo "$FILES" >> "$GITHUB_OUTPUT"
+            echo "EOF" >> "$GITHUB_OUTPUT"
+            echo "Changed files:"
+            echo "$FILES"
+          fi
+
       - name: Run Claude Code
+        if: steps.changed-files.outputs.files != ''
         uses: anthropics/claude-code-action@v1
         with:
           use_bedrock: "true"
           use_sticky_comment: true
           allowed_bots: "claude[bot],codeflash-ai[bot]"
-          claude_args: '--mcp-config /tmp/mcp-config/mcp-servers.json --allowedTools "Read,Glob,Grep,Bash(git diff:*),Bash(git log:*),Bash(git show:*),Bash(wc *),Bash(find *),mcp__serena__*"'
+          claude_args: '--allowedTools "Read,Glob,Grep,Bash(git diff:*),Bash(git log:*),Bash(git show:*),Bash(wc *),Bash(gh pr comment:*)"'
           prompt: |
-            You are a duplicate code detector with access to Serena semantic code analysis.
+            REPO: ${{ github.repository }}
+            PR NUMBER: ${{ github.event.pull_request.number }}
+
+            You are a duplicate code detector for a multi-language codebase (Python, JavaScript, TypeScript, Java). Check whether this PR introduces code that duplicates logic already present elsewhere in the repository — including across languages. Focus on finding true duplicates, not just similar-looking code.
 
-            ## Setup
+            ## Changed files
 
-            First activate the project in Serena:
-            - Use `mcp__serena__activate_project` with the workspace path `${{ github.workspace }}`
+            ```
+            ${{ steps.changed-files.outputs.files }}
+            ```
 
             ## Steps
 
-            1. Get the list of changed .py files (excluding tests):
-               `git diff --name-only origin/main...HEAD -- '*.py' | grep -v -E '(test_|_test\.py|/tests/|/test/)'`
-
-            2. Use Serena's semantic analysis on changed files:
-               - `mcp__serena__get_symbols_overview` to understand file structure
-               - `mcp__serena__find_symbol` to search for similarly named symbols across the codebase
-               - `mcp__serena__find_referencing_symbols` to understand usage patterns
-               - `mcp__serena__search_for_pattern` to find similar code patterns
-
-            3. For each changed file, look for:
-               - **Exact Duplication**: Identical code blocks (>10 lines) in multiple locations
-               - **Structural Duplication**: Same logic with minor variations (different variable names)
-               - **Functional Duplication**: Different implementations of the same functionality
-               - **Copy-Paste Programming**: Similar blocks that could be extracted into shared utilities
-
-            4. Cross-reference against the rest of the codebase using Serena:
-               - Search for similar function signatures and logic patterns
-               - Check if new code duplicates existing utilities or helpers
-               - Look for repeated patterns across modules
-
-            ## What to Report
-
-            - Identical or nearly identical functions in different files
-            - Repeated code blocks that could be extracted to utilities
-            - Similar classes or modules with overlapping functionality
-            - Copy-pasted code with minor modifications
-            - Duplicated business logic across components
-
-            ## What to Skip
-
-            - Standard boilerplate (imports, __init__, etc.)
-            - Test setup/teardown code
-            - Configuration with similar structure
-            - Language-specific patterns (constructors, getters/setters)
-            - Small snippets (<5 lines) unless highly repetitive
-            - Workflow files under .github/
-
-            ## Output
-
-            Post a single PR comment with your findings. For each pattern found:
-            - Severity (High/Medium/Low)
-            - File locations with line numbers
-            - Code samples showing the duplication
-            - Concrete refactoring suggestion
-
-            If no significant duplication is found, say so briefly. Do not create issues — just comment on the PR.
-      - name: Stop Serena
-        if: always()
-        run: docker stop serena && docker rm serena || true
+            1. **Read changed files.** For each file above, read it and identify functions or methods that were added or substantially modified (longer than 5 lines).
+
+            2. **Search for duplicates.** For each function, use Grep to search the codebase for:
+               - The same function name defined elsewhere (`def function_name` for Python, `function function_name` / `const function_name` / `module.exports` for the JS files under `packages/`)
+               - 2-3 distinctive operations from the body (specific API calls, algorithm patterns, string literals, exception types) — this catches duplicates that have different names but implement the same logic
+
+            3. **Cross-module check.** This codebase has parallel Python modules under `languages/python/`, `languages/javascript/`, and `languages/java/` that handle the same concerns (parsing, code replacement, test running, etc.) for different target languages. It also has a JS runtime under `packages/codeflash/runtime/` and a Java runtime under `codeflash-java-runtime/`. When a changed file is under one of these areas, also search the others for equivalent logic. For example:
+               - `languages/javascript/code_replacer.py` and `languages/python/static_analysis/code_replacer.py` both handle code replacement — shared logic should be extracted
+               - Shared concepts (AST traversal, scope analysis, import resolution, test running) are prime candidates for duplication across these modules
+
+            4. **Compare candidates.** When a Grep hit looks promising (not just a shared import or call site), read the full function and compare semantics. Flag it only if it matches one of these patterns:
+               - **Same function in two modules** — a function with the same or very similar body exists in another module. One should import from the other instead (within the same language).
+               - **Shared logic across sibling files** — the same helper logic repeated in files within the same package. Should be extracted to a common module.
+               - **Repeated pattern across classes** — multiple classes implement the same logic inline (e.g., identical traversal, identical validation). Should be a mixin or shared helper.
+               - **Cross-module reimplementation** — the same algorithm or utility implemented in both `languages/python/` and `languages/javascript/` (both are Python) or between Python orchestration code and JS runtime code in `packages/`. Note: some duplication is unavoidable (each target language needs its own parser, for example). Only flag cases where the logic is genuinely shared or where one module could import from the other.
+
+            5. **Report findings.** Post a single PR comment. Report at most 5 findings.
+
+               **If duplicates found**, for each one:
+               - **Confidence**: HIGH (identical or near-identical logic) / MEDIUM (same intent, minor differences worth reviewing)
+               - **Locations**: `file_path:line_number` for both the new and existing code
+               - **What's duplicated**: One sentence describing the shared logic
+               - **Suggestion**: How to consolidate — import from canonical location, extract to shared module, create a mixin. For cross-module duplicates (between language directories or Python↔JS runtime), just flag it for a tech lead to review rather than prescribing a specific fix.
+
+               **If no duplicates found**, post a comment that just says "No duplicates detected." so the sticky comment gets updated.
+
+            ## Examples (illustrative — these are past cases, some already resolved)
+
+            **IS a duplicate (HIGH):** A 12-line `is_build_output_dir()` function was defined identically in two modules (`setup/detector.py` and `code_utils/config_js.py`). Fix: delete one, import from the other.
+
+            **IS a duplicate (MEDIUM):** `is_assignment_used()` was implemented separately in two context files with the same logic. Fix: move to a shared module, import from both call sites.
+
+            **IS a duplicate (MEDIUM, cross-module):** `normalize_path()` implemented in both `languages/python/support.py` and `languages/javascript/support.py` with identical logic. Flagging for tech lead review — should likely be extracted to `languages/base.py` or a shared utility.
+
+            **NOT a duplicate:** Two classes each define a `visit()` method that traverses an AST, but they handle different node types and produce different outputs. This is intentional polymorphism.
+
+            **NOT a duplicate (cross-module):** `languages/python/static_analysis/code_extractor.py` and `languages/javascript/parse.py` both extract functions from source code, but they use fundamentally different parsing strategies (Python AST vs tree-sitter). The logic is necessarily different.
+
+            ## DO NOT report
+
+            - Standard boilerplate (`__init__`, `__repr__`, `__str__`, `__eq__`, simple property accessors, constructors)
+            - Functions under 5 lines
+            - Config/setup code that naturally has similar structure
+            - Intentional polymorphism (same method name, genuinely different behavior)
+            - Test files, conftest files, spec files
+            - Import statements and logging setup
+            - Files under `.github/`, `code_to_optimize/`, `.tessl/`
+            - Code across language modules that must differ due to target-language semantics (parsers, AST node types, runtime-specific APIs)
+
+            Do NOT create issues or edit any files. Only post a PR comment.
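
The `Get changed source files` step above narrows the diff with two `grep -v -E` exclusions. The following Python sketch mirrors those two patterns to show which paths survive. `filter_changed_files` is a hypothetical helper that is not in the commit, and `str.endswith` only approximates the git pathspecs `'*.py' '*.js' '*.ts' '*.java'`:

```python
import re

# Same alternation as the first `grep -v -E` in the workflow step (test-like paths).
TEST_PATTERN = re.compile(
    r"(test_|_test\.(py|js|ts)|\.test\.(js|ts)|\.spec\.(js|ts)"
    r"|conftest\.py|/tests/|/test/|/__tests__/)"
)
# Same alternation as the second `grep -v -E` (excluded top-level directories).
EXCLUDED_PREFIX = re.compile(r"^(\.github/|code_to_optimize/|\.tessl/|node_modules/)")
# Assumption: a suffix check stands in for the git pathspec globs.
SOURCE_SUFFIXES = (".py", ".js", ".ts", ".java")


def filter_changed_files(paths):
    """Return the paths that would remain after both exclusions."""
    kept = []
    for path in paths:
        if not path.endswith(SOURCE_SUFFIXES):
            continue  # not one of the four source extensions
        if TEST_PATTERN.search(path):
            continue  # unanchored match, like grep: any occurrence excludes the path
        if EXCLUDED_PREFIX.search(path):
            continue  # anchored at the start of the path
        kept.append(path)
    return kept


changed = [
    "languages/python/support.py",
    "tests/test_support.py",
    ".github/scripts/check.py",
    "packages/codeflash/runtime/index.js",
    "verification/test_runner.py",
    "README.md",
]
print(filter_changed_files(changed))
```

Because the `test_` alternative is unanchored, any path containing that substring is dropped, including a non-test module such as `verification/test_runner.py`.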

CLAUDE.md

Lines changed: 2 additions & 2 deletions
@@ -2,7 +2,7 @@
 
 ## Project Overview
 
-CodeFlash is an AI-powered Python code optimizer that automatically improves code performance while maintaining correctness. It uses LLMs to generate optimization candidates, verifies correctness through test execution, and benchmarks performance improvements.
+CodeFlash is an AI-powered code optimizer that automatically improves performance while maintaining correctness. It supports Python, JavaScript, and TypeScript, with more languages planned. It uses LLMs to generate optimization candidates, verifies correctness through test execution, and benchmarks performance improvements.
 
 ## Optimization Pipeline
 
@@ -12,7 +12,7 @@ Discovery → Ranking → Context Extraction → Test Gen + Optimization → Bas
 
 1. **Discovery** (`discovery/`): Find optimizable functions across the codebase
 2. **Ranking** (`benchmarking/function_ranker.py`): Rank functions by addressable time using trace data
-3. **Context** (`context/`): Extract code dependencies (read-writable code + read-only imports)
+3. **Context** (`languages/<lang>/context/`): Extract code dependencies (read-writable code + read-only imports)
 4. **Optimization** (`optimization/`, `api/`): Generate candidates via AI service, run in parallel with test generation
 5. **Verification** (`verification/`): Run candidates against tests, compare outputs via custom pytest plugin
 6. **Benchmarking** (`benchmarking/`): Measure performance, select best candidate by speedup
