Support nested GitLab groups in repository name resolution#60
Support nested GitLab groups in repository name resolution#60Anthony-Bible wants to merge 3 commits into
Conversation
Fixes two bugs that caused silent chunk-resolution failures for GitLab nested-group repos (e.g. gitlab.com/org/subgroup/repo): 1. job_processor.go used FullName() for the zoekt repo name, which only returns the first two path segments. For nested groups this truncated the path (e.g. "gitlab.com/org/sub" instead of "gitlab.com/org/sub/repo"). Fixed by adding FullPath() to RepositoryURL and using it instead. 2. zoektRepoNameCandidates only generated 2 candidates (original + strip-host). For a 4-segment zoekt name the third "sub/repo" variant (matching ExtractRepositoryNameFromURL's last-two-segments format) was never tried. Replaced the two-candidate logic with a progressive segment-stripping algorithm that stops before single-segment results. Closes #56 https://claude.ai/code/session_018DGnjmwnT5WZPB9MHPYSvH
There was a problem hiding this comment.
Pull request overview
This PR improves repository name handling for deep repository paths (e.g., nested GitLab groups) by introducing a RepositoryURL.FullPath() helper, updating Zoekt indexing to use the full path, and enhancing hybrid search chunk resolution by trying multiple plausible repository-name candidates.
Changes:
- Added
RepositoryURL.FullPath()plus unit tests. - Updated Zoekt repo naming during indexing to use
Host()/FullPath()instead ofHost()/FullName(). - Replaced host-stripping logic with
zoektRepoNameCandidates()and added tests (including a nested GitLab integration-style resolution test).
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| internal/domain/valueobject/repository_url.go | Adds FullPath() for preserving all path segments. |
| internal/domain/valueobject/repository_url_test.go | Adds unit tests for FullPath(). |
| internal/application/worker/job_processor.go | Uses FullPath() when constructing the Zoekt repository name. |
| internal/application/service/search_service.go | Adds zoektRepoNameCandidates() and uses it for fallback chunk resolution. |
| internal/application/service/search_service_test.go | Adds unit tests for candidates and a nested GitLab resolution test. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
GitHub and Bitbucket repos always have exactly owner/repo (2 path segments). FullPath() now truncates any extra segments (e.g. from a /tree/<branch> UI URL) to return a clean repo identifier. GitLab and other hosts are left unchanged so nested group paths (org/sub/repo) are preserved correctly. Addresses Copilot review comment on PR #60. https://claude.ai/code/session_018DGnjmwnT5WZPB9MHPYSvH
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| }, | ||
| { | ||
| name: "GitHub URL with UI tree suffix is truncated to repo root", | ||
| rawURL: "https://github.com/owner/repo", |
There was a problem hiding this comment.
The test case named "GitHub URL with UI tree suffix is truncated to repo root" uses the same rawURL as the "standard GitHub repo" case (no /tree/... suffix), so it doesn’t actually verify the truncation behavior. Update the rawURL to include a UI suffix (e.g., /tree/ or /blob//file) so this test would fail if FullPath stops stripping extra segments for GitHub.
| rawURL: "https://github.com/owner/repo", | |
| rawURL: "https://github.com/owner/repo/tree/main", |
| func (r RepositoryURL) FullPath() string { | ||
| parsedURL, _ := url.Parse(r.normalized) | ||
| trimmedPath := strings.Trim(parsedURL.Path, "/") | ||
| if trimmedPath == "" { | ||
| return "" | ||
| } | ||
|
|
||
| parts := strings.Split(trimmedPath, "/") | ||
|
|
||
| // GitHub and Bitbucket repos are always owner/repo (exactly 2 path segments). | ||
| // Trim any extra segments that indicate a UI page rather than the repo root. | ||
| switch strings.ToLower(parsedURL.Hostname()) { | ||
| case "github.com", "www.github.com", "bitbucket.org", "www.bitbucket.org": | ||
| if len(parts) > MinPathPartsForRepoURL { | ||
| return strings.Join(parts[:MinPathPartsForRepoURL], "/") | ||
| } | ||
| } | ||
|
|
||
| return trimmedPath |
There was a problem hiding this comment.
FullPath() relies on the normalized URL path. For GitLab (and any host not in the GitHub/Bitbucket allowlist), it returns the entire path as-is, which will include common UI suffixes like /-/tree/<branch> (and note the normalizer currently rewrites /- to -, turning that into part of the repo path). Since FullPath() is now used to construct the Zoekt repo name, storing a non-repo-root GitLab URL can produce a zoektRepoName that will never match the actual indexed repository. Consider stripping GitLab UI suffixes (e.g., anything after a /-/ segment) and/or applying the same UI-suffix trimming logic to configured extra hosts that follow the owner/repo pattern (e.g., GitHub Enterprise).
Addresses Copilot review comments on PR #60: - Fix 'GitHub URL with UI tree suffix' test case to actually use a URL with a /tree/main suffix (previously used same URL as standard case) - Add GitLab nested group test with direct /tree/main sentinel - Add GitLab nested group test with /-/ UI separator (normalised to -/) https://claude.ai/code/session_018DGnjmwnT5WZPB9MHPYSvH
Summary
This PR fixes repository name resolution for nested GitLab groups and other hosting services with deep path structures. Previously, the system only handled two-segment repository names (e.g., "owner/repo"), causing failures when resolving chunks for repositories with nested groups (e.g., "org/sub/repo").
Key Changes
Added
FullPath()method toRepositoryURL: Returns the complete URL path without the host, preserving all path segments. This correctly handles nested GitLab groups and other deep repository structures, unlike the existingFullName()which only returns the last two segments.Replaced
stripZoektRepoHost()withzoektRepoNameCandidates(): The new function generates all plausible database repository name variants by progressively stripping leading path segments. For example:"gitlab.com/org/sub/repo"→["gitlab.com/org/sub/repo", "org/sub/repo", "sub/repo"]"github.com/owner/repo"→["github.com/owner/repo", "owner/repo"]Updated job processor: Changed to use
FullPath()instead ofFullName()when constructing the zoekt repository name, ensuring nested groups are properly indexed.Added comprehensive test coverage:
zoektRepoNameCandidates()covering various repository path structuresFullPath()methodImplementation Details
The resolution strategy now tries repository name candidates in order from most to least specific, stopping when a match is found. This allows the system to correctly resolve chunks even when the database stores repository names with fewer path segments than the zoekt index uses.
https://claude.ai/code/session_018DGnjmwnT5WZPB9MHPYSvH