# Analytics Output Port Design

## Status: Approved

## Date: 2025-01-21

## Problem Statement

When connecting a component's `rawOutput` (which contains complex nested JSON) to the Analytics Sink, OpenSearch hits the default field limit of 1000 fields. This is because:

1. **Dynamic mapping explosion**: Elasticsearch/OpenSearch creates a field for every unique JSON path
2. **Nested structures**: Arrays with objects like `issues[0].metadata.schema` create many paths
3. **Varying schemas**: Different scanner outputs accumulate unique field paths over time

Example error:

```
illegal_argument_exception: Limit of total fields [1000] has been exceeded
```
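
The explosion is easy to demonstrate by counting the unique mapping paths that dynamic mapping would create for a sample `rawOutput` fragment. The helper and the sample document below are invented for illustration only:

```typescript
// Count the unique field paths OpenSearch dynamic mapping would create
// for a JSON document. Array elements are mapped under the same path,
// so arrays contribute the union of their elements' paths.
function countFieldPaths(value: unknown, prefix = ''): Set<string> {
  const paths = new Set<string>();
  if (Array.isArray(value)) {
    for (const item of value) {
      countFieldPaths(item, prefix).forEach((p) => paths.add(p));
    }
  } else if (value !== null && typeof value === 'object') {
    for (const [key, child] of Object.entries(value as Record<string, unknown>)) {
      const path = prefix ? `${prefix}.${key}` : key;
      paths.add(path);
      countFieldPaths(child, path).forEach((p) => paths.add(p));
    }
  }
  return paths;
}

// Hypothetical rawOutput fragment: every distinct metadata key in any
// array element adds another mapping path to the index.
const rawOutput = {
  issues: [
    { metadata: { schema: 'public', table: 'users' } },
    { metadata: { rule: 'DB_RLS_DISABLED', severity: 'CRITICAL' } },
  ],
};

console.log(countFieldPaths(rawOutput).size); // → 6
```

Two findings with partially different metadata keys already produce six paths; across many scanners and runs this is how the 1000-field limit is reached.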

## Solution

### Design Decisions

1. **Each component owns its analytics schema**
   - Components output structured `list<json>` through dedicated ports (`findings`, `results`, `secrets`, `issues`)
   - Component authors define the structure appropriate for their tool
   - No generic "one schema fits all" approach

2. **Analytics Sink accepts `list<json>`**
   - Input type: `z.array(z.record(z.string(), z.unknown()))`
   - Each item in the array is indexed as a separate document
   - Rejects arbitrary nested objects (the input must be an array)

3. **Same timestamp for all findings in a batch**
   - All findings from one component execution share the same `@timestamp`
   - Captured once at the start of indexing, applied to all documents

4. **Nested `shipsec` context**
   - Workflow context stored under the `shipsec.*` namespace
   - Prevents field name collisions with component data
   - Clear separation: component fields at the root, system fields under `shipsec`

5. **Nested objects serialized before indexing**
   - Any nested object or array within a finding is JSON-stringified
   - Prevents field explosion from dynamic mapping
   - Trade-off: can't query inside serialized fields directly, but indexing no longer fails on the field limit

6. **No `data` wrapper**
   - The original PRD design wrapped component output in a `data` field
   - New design: finding fields are at the top level for easier querying
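
A minimal sketch of decisions 2 and 5 together. Function names here are illustrative, not the actual sink implementation; the zod schema from decision 2 is shown as a comment so the sketch stays dependency-free:

```typescript
// Decision 2: the sink's input contract — a list of JSON objects
// (expressed in the sink as z.array(z.record(z.string(), z.unknown()))).
function assertListOfJson(input: unknown): Record<string, unknown>[] {
  if (!Array.isArray(input)) {
    throw new Error('Analytics Sink expects list<json>');
  }
  return input as Record<string, unknown>[];
}

// Decision 5: JSON-stringify nested values so each finding reaches the
// index as a flat document of scalars, keeping dynamic mapping bounded.
function flattenFinding(
  finding: Record<string, unknown>,
): Record<string, unknown> {
  const flat: Record<string, unknown> = {};
  for (const [key, value] of Object.entries(finding)) {
    flat[key] =
      value !== null && typeof value === 'object' ? JSON.stringify(value) : value;
  }
  return flat;
}

const findings = assertListOfJson([
  {
    check_id: 'DB_RLS_DISABLED',
    severity: 'CRITICAL',
    metadata: { schema: 'public', table: 'users' },
  },
]);

console.log(flattenFinding(findings[0]).metadata);
// → '{"schema":"public","table":"users"}'
```

Scalars pass through untouched, so `check_id` and `severity` remain directly queryable while `metadata` collapses into a single string field.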

### Document Structure

**Before (PRD design):**

```json
{
  "workflow_id": "...",
  "workflow_name": "...",
  "run_id": "...",
  "node_ref": "...",
  "component_id": "...",
  "@timestamp": "...",
  "asset_key": "...",
  "data": {
    "check_id": "DB_RLS_DISABLED",
    "severity": "CRITICAL",
    "metadata": { "schema": "public", "table": "users" }
  }
}
```

**After (new design):**

```json
{
  "check_id": "DB_RLS_DISABLED",
  "severity": "CRITICAL",
  "title": "RLS Disabled on Table: users",
  "resource": "public.users",
  "metadata": "{\"schema\":\"public\",\"table\":\"users\"}",
  "scanner": "supabase-scanner",
  "asset_key": "abcdefghij1234567890",
  "finding_hash": "a1b2c3d4e5f67890",

  "shipsec": {
    "organization_id": "org_123",
    "run_id": "shipsec-run-xxx",
    "workflow_id": "d1d33161-929f-4af4-9a64-xxx",
    "workflow_name": "Supabase Security Audit",
    "component_id": "core.analytics.sink",
    "node_ref": "analytics-sink-1"
  },

  "@timestamp": "2025-01-21T10:30:00.000Z"
}
```

### Component Output Ports

Components should use their existing structured list outputs:

| Component        | Port      | Type                                         | Notes                     |
| ---------------- | --------- | -------------------------------------------- | ------------------------- |
| Nuclei           | `results` | `z.array(z.record(z.string(), z.unknown()))` | Scanner + asset_key added |
| TruffleHog       | `results` | `z.array(z.record(z.string(), z.unknown()))` | Scanner + asset_key added |
| Supabase Scanner | `results` | `z.array(z.record(z.string(), z.unknown()))` | Scanner + asset_key added |

All `results` ports include:

- `scanner`: Scanner identifier (e.g., `'nuclei'`, `'trufflehog'`, `'supabase-scanner'`)
- `asset_key`: Primary asset identifier from the finding
- `finding_hash`: Stable hash for deduplication (16-char hex from SHA-256)

### Finding Hash for Deduplication

The `finding_hash` enables tracking findings across workflow runs:

**Generation:**

```typescript
import { createHash } from 'crypto';

function generateFindingHash(...fields: (string | undefined | null)[]): string {
  const normalized = fields.map((f) => (f ?? '').toLowerCase().trim()).join('|');
  return createHash('sha256').update(normalized).digest('hex').slice(0, 16);
}
```

**Key fields per scanner:**

| Scanner          | Hash Fields                          |
| ---------------- | ------------------------------------ |
| Nuclei           | `templateId + host + matchedAt`      |
| TruffleHog       | `DetectorType + Redacted + filePath` |
| Supabase Scanner | `check_id + projectRef + resource`   |
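
For example, Nuclei's hash combines `templateId`, `host`, and `matchedAt`; because the generator lowercases and trims each field, the hash is stable across runs even if a scanner's casing or whitespace varies. The finding values below are made up, and the generator is repeated so the sketch is self-contained:

```typescript
import { createHash } from 'crypto';

// Same generator as above, repeated so this example runs standalone.
function generateFindingHash(...fields: (string | undefined | null)[]): string {
  const normalized = fields.map((f) => (f ?? '').toLowerCase().trim()).join('|');
  return createHash('sha256').update(normalized).digest('hex').slice(0, 16);
}

// Hypothetical Nuclei finding: templateId + host + matchedAt.
const hash = generateFindingHash(
  'http-missing-security-headers',
  'API.example.com ', // casing and trailing whitespace are normalized away
  'https://api.example.com/login',
);

console.log(hash.length); // → 16
```

The same three fields always produce the same 16-character hex string, which is what makes cross-run grouping by `finding_hash` possible.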

**Use cases:**

- **New vs recurring**: Is this finding appearing for the first time?
- **First-seen / last-seen**: When did we first detect this? Is it still present?
- **Resolution tracking**: Findings that stop appearing may be resolved
- **Deduplication**: Remove duplicates in dashboards across runs

### `shipsec` Context Fields

The indexer automatically adds these fields under `shipsec`:

| Field             | Description                                   |
| ----------------- | --------------------------------------------- |
| `organization_id` | Organization that owns the workflow           |
| `run_id`          | Unique identifier for this workflow execution |
| `workflow_id`     | ID of the workflow definition                 |
| `workflow_name`   | Human-readable workflow name                  |
| `component_id`    | Component type (e.g., `core.analytics.sink`)  |
| `node_ref`        | Node reference in the workflow graph          |
| `asset_key`       | Auto-detected or specified asset identifier   |
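
A sketch of how the indexer might attach this context, combining decisions 3 and 4: one timestamp captured per batch, system fields namespaced under `shipsec`. The `WorkflowContext` shape mirrors the table above, but the type and function names are illustrative, not the actual indexer code:

```typescript
interface WorkflowContext {
  organization_id: string;
  run_id: string;
  workflow_id: string;
  workflow_name: string;
  component_id: string;
  node_ref: string;
}

// Decision 3: one @timestamp per batch. Decision 4: context under `shipsec`.
function toDocuments(
  findings: Record<string, unknown>[],
  ctx: WorkflowContext,
): Record<string, unknown>[] {
  const timestamp = new Date().toISOString(); // captured once for the whole batch
  return findings.map((finding) => ({
    ...finding,        // component fields stay at the root
    shipsec: ctx,      // system fields namespaced to avoid collisions
    '@timestamp': timestamp,
  }));
}

const docs = toDocuments(
  [{ check_id: 'DB_RLS_DISABLED' }, { check_id: 'DB_NO_BACKUPS' }],
  {
    organization_id: 'org_123',
    run_id: 'shipsec-run-xxx',
    workflow_id: 'wf-1',
    workflow_name: 'Supabase Security Audit',
    component_id: 'core.analytics.sink',
    node_ref: 'analytics-sink-1',
  },
);

console.log(docs[0]['@timestamp'] === docs[1]['@timestamp']); // → true
```

Because `shipsec` and `@timestamp` are spread in after the finding's own fields, a component field named `shipsec` would be overwritten rather than colliding with a system field.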

### Querying in OpenSearch

With this structure, users can:

- Filter by organization: `shipsec.organization_id: "org_123"`
- Filter by workflow: `shipsec.workflow_id: "xxx"`
- Filter by run: `shipsec.run_id: "xxx"`
- Filter by asset: `asset_key: "api.example.com"`
- Filter by scanner: `scanner: "nuclei"`
- Filter by component-specific fields: `severity: "CRITICAL"`
- Aggregate by severity: `terms` aggregation on the `severity` field
- Track finding history: filter by `finding_hash`, sort by `@timestamp`
- Find recurring findings: group by `finding_hash`, count occurrences
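
As an illustration, the severity breakdown for a single run could be issued as a `terms` aggregation. The request body below is sketched as a plain object; the field suffix is an assumption — with default dynamic mapping, strings are indexed as `text` with a `.keyword` subfield, so the aggregation targets `severity.keyword` (an explicit index template, as noted under Future Considerations, could map `severity` as `keyword` directly):

```typescript
// Severity breakdown for one run, expressed as an OpenSearch
// request body. Run ID and field names follow the examples above.
const severityBreakdown = {
  size: 0, // aggregation only, no hits needed
  query: {
    bool: {
      filter: [{ term: { 'shipsec.run_id': 'shipsec-run-xxx' } }],
    },
  },
  aggs: {
    by_severity: {
      terms: { field: 'severity.keyword' },
    },
  },
};

console.log(Object.keys(severityBreakdown.aggs)); // → [ 'by_severity' ]
```

The same shape works for the recurring-findings query: swap the `terms` field for `finding_hash.keyword` and read the bucket doc counts.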

### Trade-offs

| Decision                 | Pro                       | Con                                        |
| ------------------------ | ------------------------- | ------------------------------------------ |
| Serialize nested objects | Prevents field explosion  | Can't query inside serialized fields       |
| `shipsec` namespace      | No field collision        | Slightly more verbose queries              |
| No generic schema        | Better fit per component  | Less consistency across components         |
| Same timestamp per batch | Accurate (same scan time) | Can't distinguish individual finding times |

### Implementation Files

1. `/worker/src/utils/opensearch-indexer.ts` - Add `shipsec` context, serialize nested objects
2. `/worker/src/components/core/analytics-sink.ts` - Accept `list<json>`, consistent timestamp
3. Component files - Ensure structured output, add `results` port where missing

### Backward Compatibility

- Existing workflows connecting `rawOutput` to the Analytics Sink will still work
- The Analytics Sink continues to accept any data type for backward compatibility
- New `list<json>` processing only triggers when the input is an array

### Future Considerations

1. **Index templates**: Create an OpenSearch index template with explicit mappings for `shipsec.*` fields
2. **Field discovery**: Build UI to show available fields from indexed data
3. **Schema validation**: Optional strict mode to validate findings against an expected schema