
Commit b7c19eb

Merge remote-tracking branch 'private/main' into codex/gcp-terraform

Signed-off-by: betterclever <paliwal.pranjal83@gmail.com>

# Conflicts:
#  .prettierignore
#  bun.lock
#  frontend/vite.config.ts
#  worker/package.json
#  worker/src/components/security/subfinder.ts
#  worker/src/components/security/supabase-scanner.ts
#  worker/src/temporal/activities/mcp.activity.ts

2 parents df2fc2b + 8759da6

263 files changed

Lines changed: 18716 additions & 4406 deletions

Lines changed: 198 additions & 0 deletions
@@ -0,0 +1,198 @@
# Analytics Output Port Design

## Status: Approved

## Date: 2025-01-21

## Problem Statement

When connecting a component's `rawOutput` (which contains complex nested JSON) to the Analytics Sink, OpenSearch hits the default field limit of 1000 fields. This is because:

1. **Dynamic mapping explosion**: Elasticsearch/OpenSearch creates a field for every unique JSON path
2. **Nested structures**: Arrays with objects like `issues[0].metadata.schema` create many paths
3. **Varying schemas**: Different scanner outputs accumulate unique field paths over time

Example error:

```
illegal_argument_exception: Limit of total fields [1000] has been exceeded
```

## Solution

### Design Decisions

1. **Each component owns its analytics schema**
   - Components output structured `list<json>` through dedicated ports (`findings`, `results`, `secrets`, `issues`)
   - Component authors define the structure appropriate for their tool
   - No generic "one schema fits all" approach

2. **Analytics Sink accepts `list<json>`**
   - Input type: `z.array(z.record(z.string(), z.unknown()))`
   - Each item in the array is indexed as a separate document
   - Rejects arbitrary nested objects (must be an array)

3. **Same timestamp for all findings in a batch**
   - All findings from one component execution share the same `@timestamp`
   - Captured once at the start of indexing, applied to all documents

4. **Nested `shipsec` context**
   - Workflow context stored under the `shipsec.*` namespace
   - Prevents field name collisions with component data
   - Clear separation: component fields at root, system fields under `shipsec`

5. **Nested objects serialized before indexing**
   - Any nested object or array within a finding is JSON-stringified (see the sketch after this list)
   - Prevents field explosion from dynamic mapping
   - Trade-off: can't query inside serialized fields directly, but prevents index corruption

6. **No `data` wrapper**
   - The original PRD design wrapped component output in a `data` field
   - New design: finding fields are at the top level for easier querying
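
To make decisions 2 and 5 concrete, here is a minimal sketch of how the sink could validate its input and flatten findings before indexing. The helper name `serializeNestedFields` is illustrative, not the actual symbol in the codebase:

```typescript
import { z } from 'zod';

// Decision 2: the sink only accepts a list of JSON objects.
const analyticsInputSchema = z.array(z.record(z.string(), z.unknown()));

// Decision 5: JSON-stringify nested objects/arrays so dynamic mapping
// creates one field per top-level key instead of one per JSON path.
// (Illustrative helper; the real implementation may differ.)
function serializeNestedFields(
  finding: Record<string, unknown>,
): Record<string, unknown> {
  const flat: Record<string, unknown> = {};
  for (const [key, value] of Object.entries(finding)) {
    flat[key] =
      value !== null && typeof value === 'object' ? JSON.stringify(value) : value;
  }
  return flat;
}
```
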
### Document Structure

**Before (PRD design):**

```json
{
  "workflow_id": "...",
  "workflow_name": "...",
  "run_id": "...",
  "node_ref": "...",
  "component_id": "...",
  "@timestamp": "...",
  "asset_key": "...",
  "data": {
    "check_id": "DB_RLS_DISABLED",
    "severity": "CRITICAL",
    "metadata": { "schema": "public", "table": "users" }
  }
}
```

**After (new design):**

```json
{
  "check_id": "DB_RLS_DISABLED",
  "severity": "CRITICAL",
  "title": "RLS Disabled on Table: users",
  "resource": "public.users",
  "metadata": "{\"schema\":\"public\",\"table\":\"users\"}",
  "scanner": "supabase-scanner",
  "asset_key": "abcdefghij1234567890",
  "finding_hash": "a1b2c3d4e5f67890",

  "shipsec": {
    "organization_id": "org_123",
    "run_id": "shipsec-run-xxx",
    "workflow_id": "d1d33161-929f-4af4-9a64-xxx",
    "workflow_name": "Supabase Security Audit",
    "component_id": "core.analytics.sink",
    "node_ref": "analytics-sink-1"
  },

  "@timestamp": "2025-01-21T10:30:00.000Z"
}
```

### Component Output Ports

Components should use their existing structured list outputs:

| Component        | Port      | Type                                         | Notes                     |
| ---------------- | --------- | -------------------------------------------- | ------------------------- |
| Nuclei           | `results` | `z.array(z.record(z.string(), z.unknown()))` | Scanner + asset_key added |
| TruffleHog       | `results` | `z.array(z.record(z.string(), z.unknown()))` | Scanner + asset_key added |
| Supabase Scanner | `results` | `z.array(z.record(z.string(), z.unknown()))` | Scanner + asset_key added |

All `results` ports include:

- `scanner`: Scanner identifier (e.g., `'nuclei'`, `'trufflehog'`, `'supabase-scanner'`)
- `asset_key`: Primary asset identifier from the finding
- `finding_hash`: Stable hash for deduplication (16-char hex from SHA-256)
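
As an illustration of what such a port schema could look like (a hypothetical sketch, not the actual component code), a Supabase Scanner `results` port might combine its tool-specific fields with the three common ones:

```typescript
import { z } from 'zod';

// Hypothetical finding shape; real components define their own structure.
const supabaseFindingSchema = z.object({
  check_id: z.string(),
  severity: z.string(),
  resource: z.string(),
  // Common fields shared by all `results` ports:
  scanner: z.literal('supabase-scanner'),
  asset_key: z.string(),
  finding_hash: z.string().length(16), // 16-char hex from SHA-256
});

// The port itself is a list of findings.
const resultsPortSchema = z.array(supabaseFindingSchema);
```
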
### Finding Hash for Deduplication

The `finding_hash` enables tracking findings across workflow runs:

**Generation:**

```typescript
import { createHash } from 'crypto';

function generateFindingHash(...fields: (string | undefined | null)[]): string {
  const normalized = fields.map((f) => (f ?? '').toLowerCase().trim()).join('|');
  return createHash('sha256').update(normalized).digest('hex').slice(0, 16);
}
```
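
For example, hashing the key fields of the Supabase finding shown earlier (values are illustrative):

```typescript
// check_id + projectRef + resource, per the table below
const hash = generateFindingHash('DB_RLS_DISABLED', 'abcdefghij1234567890', 'public.users');
// Same finding → same 16-char hash on every run, regardless of casing/whitespace
```
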
**Key fields per scanner:**

| Scanner          | Hash Fields                          |
| ---------------- | ------------------------------------ |
| Nuclei           | `templateId + host + matchedAt`      |
| TruffleHog       | `DetectorType + Redacted + filePath` |
| Supabase Scanner | `check_id + projectRef + resource`   |

**Use cases:**

- **New vs recurring**: Is this finding appearing for the first time?
- **First-seen / last-seen**: When did we first detect this? Is it still present?
- **Resolution tracking**: Findings that stop appearing may be resolved
- **Deduplication**: Remove duplicates in dashboards across runs

### `shipsec` Context Fields

The indexer automatically adds these fields under `shipsec`:

| Field             | Description                                   |
| ----------------- | --------------------------------------------- |
| `organization_id` | Organization that owns the workflow           |
| `run_id`          | Unique identifier for this workflow execution |
| `workflow_id`     | ID of the workflow definition                 |
| `workflow_name`   | Human-readable workflow name                  |
| `component_id`    | Component type (e.g., `core.analytics.sink`)  |
| `node_ref`        | Node reference in the workflow graph          |
| `asset_key`       | Auto-detected or specified asset identifier   |

### Querying in OpenSearch

With this structure, users can:

- Filter by organization: `shipsec.organization_id: "org_123"`
- Filter by workflow: `shipsec.workflow_id: "xxx"`
- Filter by run: `shipsec.run_id: "xxx"`
- Filter by asset: `asset_key: "api.example.com"`
- Filter by scanner: `scanner: "nuclei"`
- Filter by component-specific fields: `severity: "CRITICAL"`
- Aggregate by severity: `terms` aggregation on the `severity` field
- Track finding history: `finding_hash: "a1b2c3d4" | sort @timestamp`
- Find recurring findings: group by `finding_hash`, count occurrences
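
For instance, a query counting critical findings per scanner for one organization could look like the following (assuming default dynamic mapping, where string fields get a `.keyword` subfield):

```json
{
  "query": {
    "bool": {
      "filter": [
        { "term": { "shipsec.organization_id.keyword": "org_123" } },
        { "term": { "severity.keyword": "CRITICAL" } }
      ]
    }
  },
  "aggs": {
    "by_scanner": {
      "terms": { "field": "scanner.keyword" }
    }
  },
  "size": 0
}
```
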
### Trade-offs

| Decision                 | Pro                       | Con                                        |
| ------------------------ | ------------------------- | ------------------------------------------ |
| Serialize nested objects | Prevents field explosion  | Can't query inside serialized fields       |
| `shipsec` namespace      | No field collision        | Slightly more verbose queries              |
| No generic schema        | Better fit per component  | Less consistency across components         |
| Same timestamp per batch | Accurate (same scan time) | Can't distinguish individual finding times |

### Implementation Files

1. `/worker/src/utils/opensearch-indexer.ts` - Add `shipsec` context, serialize nested objects
2. `/worker/src/components/core/analytics-sink.ts` - Accept `list<json>`, consistent timestamp
3. Component files - Ensure structured output, add `results` port where missing

### Backward Compatibility

- Existing workflows connecting `rawOutput` to the Analytics Sink will still work
- The Analytics Sink continues to accept any data type for backward compatibility
- New `list<json>` processing only triggers when the input is an array (see the sketch below)
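
Here is what that dispatch could look like, reusing the illustrative `serializeNestedFields` helper from the Design Decisions section (names are assumptions, not the actual code in `analytics-sink.ts`):

```typescript
type ShipSecContext = Record<string, string>;

// Assumed indexing primitive for this sketch; the real indexer lives in
// /worker/src/utils/opensearch-indexer.ts.
declare function indexDocument(doc: Record<string, unknown>): Promise<void>;

async function handleSinkInput(input: unknown, shipsec: ShipSecContext): Promise<void> {
  const timestamp = new Date().toISOString(); // one timestamp per batch
  if (Array.isArray(input)) {
    // New path: each list item becomes its own OpenSearch document.
    for (const finding of input) {
      await indexDocument({
        ...serializeNestedFields(finding as Record<string, unknown>),
        shipsec,
        '@timestamp': timestamp,
      });
    }
  } else {
    // Legacy path: index the payload as a single document, as before.
    await indexDocument({ input, shipsec, '@timestamp': timestamp });
  }
}
```
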
### Future Considerations

1. **Index templates**: Create an OpenSearch index template with explicit mappings for `shipsec.*` fields (sketch below)
2. **Field discovery**: Build UI to show available fields from indexed data
3. **Schema validation**: Optional strict mode to validate findings against an expected schema
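
For the index-template idea in item 1, a starting point might be explicit `keyword` mappings for the system fields (the index pattern name is an assumption):

```json
{
  "index_patterns": ["shipsec-findings-*"],
  "template": {
    "mappings": {
      "properties": {
        "@timestamp": { "type": "date" },
        "scanner": { "type": "keyword" },
        "asset_key": { "type": "keyword" },
        "finding_hash": { "type": "keyword" },
        "shipsec": {
          "properties": {
            "organization_id": { "type": "keyword" },
            "run_id": { "type": "keyword" },
            "workflow_id": { "type": "keyword" },
            "workflow_name": { "type": "keyword" },
            "component_id": { "type": "keyword" },
            "node_ref": { "type": "keyword" }
          }
        }
      }
    }
  }
}
```
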

.github/workflows/release.yml

Lines changed: 77 additions & 9 deletions
@@ -200,16 +200,32 @@ jobs:
           cat CHANGELOG.md >> $GITHUB_OUTPUT
           echo "EOF" >> $GITHUB_OUTPUT

-      - name: Create GitHub Release
-        uses: softprops/action-gh-release@v1
-        with:
-          tag_name: ${{ steps.version.outputs.version }}
-          name: Release ${{ steps.version.outputs.version }}
-          body_path: CHANGELOG.md
-          draft: false
-          prerelease: ${{ contains(steps.version.outputs.version, '-') }}
+      - name: Create or update GitHub Release
         env:
-          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
+          GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
+        run: |
+          VERSION="${{ steps.version.outputs.version }}"
+          IS_PRERELEASE="${{ contains(steps.version.outputs.version, '-') }}"
+          PRERELEASE_FLAG=""
+          if [ "$IS_PRERELEASE" = "true" ]; then
+            PRERELEASE_FLAG="--prerelease"
+          fi
+
+          # Check if release already exists
+          if gh release view "$VERSION" --repo "${{ github.repository }}" > /dev/null 2>&1; then
+            echo "Release $VERSION already exists, updating..."
+            gh release edit "$VERSION" \
+              --repo "${{ github.repository }}" \
+              --notes-file CHANGELOG.md \
+              $PRERELEASE_FLAG
+          else
+            echo "Creating new release $VERSION..."
+            gh release create "$VERSION" \
+              --repo "${{ github.repository }}" \
+              --title "Release $VERSION" \
+              --notes-file CHANGELOG.md \
+              $PRERELEASE_FLAG
+          fi

       - name: Update version check service
         env:
@@ -262,11 +278,63 @@ jobs:
           echo "  This might be expected if GitHub release fetching is enabled"
         fi

+      - name: Bump package.json version via PR
+        if: steps.is_latest.outputs.is_latest == 'true'
+        continue-on-error: true
+        env:
+          GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
+        run: |
+          VERSION_CLEAN="${{ steps.version.outputs.version_clean }}"
+          RUN_ID="${{ github.run_id }}"
+          BRANCH="chore/bump-version-${VERSION_CLEAN}-${RUN_ID}"
+          echo "Bumping root package.json version to ${VERSION_CLEAN}..."
+
+          git config user.name "github-actions[bot]"
+          git config user.email "41898282+github-actions[bot]@users.noreply.github.com"
+
+          git fetch origin main
+
+          # Clean up any existing bump branch for this version
+          git push origin --delete "chore/bump-version-${VERSION_CLEAN}" 2>/dev/null || true
+
+          git checkout -b "$BRANCH" origin/main
+
+          # Update the root package.json version
+          jq --arg v "$VERSION_CLEAN" '.version = $v' package.json > package.json.tmp && mv package.json.tmp package.json
+
+          # Only create PR if there's actually a change
+          if git diff --quiet package.json; then
+            echo "package.json already at version ${VERSION_CLEAN}, skipping."
+          else
+            git add package.json
+            git commit -m "chore: bump version to ${VERSION_CLEAN} [skip ci]"
+            git push origin "$BRANCH"
+
+            # Create PR and enable auto-merge
+            PR_URL=$(gh pr create \
+              --title "chore: bump version to ${VERSION_CLEAN}" \
+              --body "Automated version bump from release workflow (${VERSION_CLEAN})." \
+              --base main \
+              --head "$BRANCH")
+
+            echo "✅ Created PR: $PR_URL"
+
+            # Attempt to enable auto-merge (requires repo setting enabled)
+            gh pr merge "$PR_URL" --auto --squash 2>/dev/null \
+              && echo "✅ Auto-merge enabled" \
+              || echo "⚠️ Auto-merge not available — merge the PR manually"
+          fi

       - name: Release summary
         run: |
           echo "📋 Release Summary:"
           echo "  ✅ Docker images built and pushed"
           echo "  ✅ GitHub Release created: ${{ steps.version.outputs.version }}"
           echo "  ✅ Version check service updated"
+          if [ "${{ steps.is_latest.outputs.is_latest }}" = "true" ]; then
+            echo "  ✅ package.json version bump PR created"
+          else
+            echo "  ⏭️ package.json NOT updated (not the latest version)"
+          fi
           echo ""
           echo "🎉 Users will now receive update notifications for version ${{ steps.version.outputs.version_clean }}"
Lines changed: 83 additions & 0 deletions
@@ -0,0 +1,83 @@
name: Sync upstream main

on:
  schedule:
    - cron: "0 9 * * *" # Once daily at 9am UTC
  workflow_dispatch: # Manual trigger

permissions:
  contents: write
  pull-requests: write

jobs:
  sync:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout
        uses: actions/checkout@v4
        with:
          fetch-depth: 0

      - name: Add upstream remote
        run: |
          git remote add upstream https://github.com/ShipSecAI/studio.git || true
          git fetch upstream main

      - name: Check for divergence
        id: check
        run: |
          UPSTREAM_SHA=$(git rev-parse upstream/main)
          # Check if upstream-sync branch exists on origin
          if git ls-remote --exit-code origin upstream-sync &>/dev/null; then
            CURRENT_SHA=$(git rev-parse origin/upstream-sync)
          else
            CURRENT_SHA=""
          fi

          if [ "$UPSTREAM_SHA" = "$CURRENT_SHA" ]; then
            echo "skip=true" >> "$GITHUB_OUTPUT"
            echo "No new upstream commits"
          else
            echo "skip=false" >> "$GITHUB_OUTPUT"
            AHEAD=$(git rev-list --count origin/main..upstream/main)
            echo "ahead=$AHEAD" >> "$GITHUB_OUTPUT"
            echo "Upstream is $AHEAD commits ahead"
          fi

      - name: Push upstream-sync branch
        if: steps.check.outputs.skip == 'false'
        run: |
          git checkout -B upstream-sync upstream/main
          git push origin upstream-sync --force

      - name: Create or update PR
        if: steps.check.outputs.skip == 'false'
        env:
          GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
        run: |
          EXISTING_PR=$(gh pr list --head upstream-sync --base main --state open --json number --jq '.[0].number' 2>/dev/null || echo "")

          if [ -n "$EXISTING_PR" ]; then
            echo "PR #$EXISTING_PR already exists, updated sync branch"
            gh pr comment "$EXISTING_PR" --body "Sync branch updated. Upstream is now ${{ steps.check.outputs.ahead }} commits ahead of main."
          else
            gh pr create \
              --head upstream-sync \
              --base main \
              --title "sync: merge upstream main" \
              --body "$(cat <<'EOF'
          Automated sync from [ShipSecAI/studio](https://github.com/ShipSecAI/studio) main.

          **${{ steps.check.outputs.ahead }} new upstream commits.**

          Review the changes and merge when ready. If there are conflicts, resolve them locally:
          ```bash
          git fetch origin upstream-sync main
          git checkout main
          git merge origin/upstream-sync
          # resolve conflicts
          git push origin main
          ```
          EOF
          )"
          fi