fix: cleanup orphaned assets if crawler crashes before database transaction commits #2543

Closed
VedantMadane wants to merge 6 commits into karakeep-app:main from VedantMadane:fix/cleanup-orphaned-assets-on-error

Conversation

@VedantMadane

Fixes #2519.

Problem

When the background crawler saves assets (screenshots, PDFs, images) to storage (S3 or local disk), there is a window of time before the database transaction commits the `assetId` to the `assets` table. If the worker crashes or OOMs during this window, the assets remain in storage with no database record pointing to them, effectively becoming orphaned.

Solution

  • Added `try...catch` blocks around the core asset processing logic in `crawlAndParseUrl` and `handleAsAssetBookmark`.
  • Introduced a tracking array `newAssetIds` to collect all successfully stored assets during a single crawl attempt.
  • If an error occurs before the database transaction completes, the catch block iterates through `newAssetIds` and calls `silentDeleteAsset` to remove them from storage.
  • This ensures data integrity between the storage layer and the database even during unexpected crashes.
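The steps above can be sketched as follows. This is a simplified illustration of the pattern, not the actual `crawlerWorker.ts` code: `storeAsset`, `commitToDatabase`, and the standalone `silentDeleteAsset` signature are hypothetical stand-ins for the real helpers.

```typescript
// Simplified sketch of the cleanup pattern: track stored asset IDs,
// commit them in one transactional phase, and delete them from
// storage if anything fails before the commit. Helper signatures
// are illustrative, not the real crawlerWorker.ts API.
type AssetId = string;

async function crawlWithCleanup(
  storeAsset: () => Promise<AssetId>,
  commitToDatabase: (ids: AssetId[]) => Promise<void>,
  silentDeleteAsset: (id: AssetId) => Promise<void>,
): Promise<AssetId[]> {
  const newAssetIds: AssetId[] = [];
  try {
    // Each successfully stored asset is tracked before the DB commit.
    newAssetIds.push(await storeAsset());
    newAssetIds.push(await storeAsset());
    // Single transactional phase: persist all tracked assets at once.
    await commitToDatabase(newAssetIds);
    return newAssetIds;
  } catch (error) {
    // Crash before commit: remove the now-orphaned storage objects.
    await Promise.all(newAssetIds.map((id) => silentDeleteAsset(id)));
    throw error;
  }
}
```

As the reviews below point out, this shape alone is not sufficient: the catch also fires for errors thrown after a successful commit, which is the central bug discussed in this PR.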

@coderabbitai
Contributor

coderabbitai Bot commented Mar 1, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.


Walkthrough

Staged asset creation and collection were introduced: assets (screenshot, PDF, HTML, banner) are accumulated during precrawl/live crawl into newAssetIds, persistence is deferred into a single transaction, and robust cleanup removes partially created assets on failure; download/enqueue flows and logging were hardened.

Changes

Cohort / File(s): Crawler worker (apps/workers/workers/crawlerWorker.ts)

Summary: Reworked crawl flow to stream collected assets into a newAssetIds list, defer persistence to a single transactional phase (using transactionCommitted), add downloadedAssetId / assetPersisted tracking and try/catch around downloads, convert LINK bookmarks to ASSET and insert bookmarkAssets inside the transaction, enqueue AssetPreprocessingQueue for persisted assets, add banner-image download handling, preserve precrawled-archive usage, and implement cleanup of partially created assets via silentDeleteAsset when transactions fail or errors occur. Enhanced logging for enqueue/cleanup failures; no public API changes.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name: Docstring Coverage
Status: ⚠️ Warning
Explanation: Docstring coverage is 50.00%, which is below the required threshold of 80.00%.
Resolution: Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
  • Title check: ✅ Passed. The title accurately describes the main change: adding cleanup logic for orphaned assets when the crawler crashes before the database transaction commits.
  • Description check: ✅ Passed. The description clearly explains the problem, solution, and implementation details related to the changeset, detailing the try/catch blocks, newAssetIds tracking, and cleanup logic.
  • Linked Issues check: ✅ Passed. The PR directly addresses issue #2519 by implementing error handling and asset cleanup when the crawler crashes before the database transaction commits, preventing orphaned assets and invalid data.
  • Out of Scope Changes check: ✅ Passed. All changes in crawlerWorker.ts focus on cleanup logic, error handling, and asset tracking to prevent orphaned assets, which are within scope of the linked issue.



@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@apps/workers/workers/crawlerWorker.ts`:
- Around line 1887-1971: The transaction currently pushes already-started
deletion promises into assetDeletionTasks inside the txn callback (see
assetDeletionTasks, silentDeleteAsset, updateAsset and the db.transaction
block), which can remove storage before the DB commits; change this by
collecting IDs (or functions) of assets to delete inside the transaction without
starting deletion, commit the transaction as-is (keep updates via updateAsset
and txn operations), then after the await db.transaction(...) resolve the
collected list and call silentDeleteAsset (or execute deletion promises) so
deletions run only after successful commit.
- Around line 1560-1618: The cleanup currently deletes the downloaded asset
(downloadedAssetId) on any error even if the DB transaction (db.transaction)
already committed and created bookmark rows; change the logic to only delete
when the DB transaction did NOT complete: introduce a boolean (e.g.,
assetPersisted = false) set to true immediately after the await
db.transaction(...) returns successfully (inside the same scope where
downloadAndStoreFile, downloadedAssetId, and AssetPreprocessingQueue.enqueue are
used), then update the catch block to call silentDeleteAsset(userId,
downloadedAssetId) only when downloadedAssetId is set AND assetPersisted is
false; reference downloadedAssetId, db.transaction,
AssetPreprocessingQueue.enqueue, and silentDeleteAsset to locate places to
modify.
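The `assetPersisted` guard described in the second comment can be sketched like this. The helper signatures are hypothetical stand-ins for the real `downloadAndStoreFile`, `db.transaction`, and `AssetPreprocessingQueue.enqueue` calls:

```typescript
// Sketch of the assetPersisted guard: cleanup runs only when the
// DB transaction did NOT complete. Helper names are illustrative,
// not the actual crawlerWorker.ts signatures.
async function handleAssetBookmarkSketch(
  download: () => Promise<string>,
  runTransaction: () => Promise<void>,
  enqueue: () => Promise<void>,
  silentDeleteAsset: (id: string) => Promise<void>,
): Promise<void> {
  let downloadedAssetId: string | undefined;
  let assetPersisted = false;
  try {
    downloadedAssetId = await download();
    await runTransaction();
    assetPersisted = true; // set immediately after a successful commit
    await enqueue(); // a post-commit failure must NOT delete the asset
  } catch (error) {
    if (downloadedAssetId && !assetPersisted) {
      await silentDeleteAsset(downloadedAssetId);
    }
    throw error;
  }
}
```

The flag makes the catch block distinguish "commit never happened" (safe to delete from storage) from "commit succeeded, later step failed" (deleting would orphan the DB row).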

ℹ️ Review info

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 85e63ae and 8970dee.

📒 Files selected for processing (1)
  • apps/workers/workers/crawlerWorker.ts


greptile-apps Bot commented Mar 1, 2026

Greptile Summary

This PR attempts to prevent orphaned assets by adding try-catch blocks that clean up storage assets if the crawler crashes. However, the implementation has a critical bug that creates the opposite problem.

Key Issues

  • Inverted problem: The cleanup logic runs even after the database transaction succeeds, deleting assets that are already committed to the database
  • Orphaned DB references: Users will have database records pointing to non-existent storage objects, causing broken functionality when they try to access these assets
  • Multiple trigger points: The bug can be triggered by AssetPreprocessingQueue.enqueue() failures, abort signals, or any error in post-transaction operations

What Happens

Both crawlAndParseUrl and handleAsAssetBookmark follow this flawed pattern:

  1. Create and store assets in S3/disk, track their IDs
  2. Commit asset references to database via transaction (succeeds)
  3. Perform post-transaction operations
  4. If step 3 throws → catch block deletes assets that are already in database

Root Cause

The asset tracking variables (newAssetIds array, downloadedAssetId) are not cleared after the database transaction commits, so the catch block incorrectly assumes all tracked assets need cleanup.

Recommended Fix

Clear the tracking variables immediately after successful database transactions to ensure cleanup only happens for uncommitted assets.
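The recommended fix can be sketched as follows; a simplified illustration with hypothetical helper names, not the actual `crawlerWorker.ts` code:

```typescript
// Sketch of the suggested fix: once the transaction commits, the
// database owns the assets, so the tracking array is emptied and
// any later error no longer triggers storage deletion. Helper
// signatures are illustrative.
async function crawlSketch(
  storeAsset: () => Promise<string>,
  commit: (ids: string[]) => Promise<void>,
  postCommitWork: () => Promise<void>,
  silentDeleteAsset: (id: string) => Promise<void>,
): Promise<void> {
  const newAssetIds: string[] = [];
  try {
    newAssetIds.push(await storeAsset());
    await commit(newAssetIds);
    newAssetIds.length = 0; // assets are now owned by the database
    await postCommitWork(); // failures here must not delete committed assets
  } catch (error) {
    // Only uncommitted assets remain tracked at this point.
    await Promise.all(newAssetIds.map((id) => silentDeleteAsset(id)));
    throw error;
  }
}
```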

Confidence Score: 0/5

  • This PR is not safe to merge - it contains a critical bug that inverts the intended fix
  • Score reflects a critical logical error that creates orphaned database references (the opposite of what the PR intends to fix). The bug will cause broken functionality when users try to access assets that were deleted from storage but remain referenced in the database.
  • The logic in apps/workers/workers/crawlerWorker.ts needs immediate attention - both functions have the same critical flaw

Important Files Changed

Filename Overview
apps/workers/workers/crawlerWorker.ts Added try-catch blocks to clean up orphaned assets, but implementation has critical bug that deletes assets after DB commit, creating opposite problem (orphaned DB references instead of orphaned storage)

Last reviewed commit: 8970dee


@greptile-apps greptile-apps Bot left a comment


1 file reviewed, 2 comments


Comment on lines +2013 to +2023

```ts
  };
} catch (error) {
  logger.error(
    `[Crawler][${jobId}] crawlAndParseUrl encountered an error, cleaning up new assets: ${error}`,
  );
  // Clean up any assets that were created during this attempt but not committed
  await Promise.all(
    newAssetIds.map((assetId) => silentDeleteAsset(userId, assetId)),
  );
  throw error;
}
```

Critical bug: assets are deleted even after database transaction succeeds. If db.transaction() commits (line 1880) but Promise.all(assetDeletionTasks) (line 1971) or the abort signal throws, this catch block deletes assets already in the database, creating orphaned DB references (DB records pointing to non-existent storage).

Fix: clear newAssetIds immediately after successful transaction:

Suggested change

```ts
} catch (error) {
  logger.error(
    `[Crawler][${jobId}] crawlAndParseUrl encountered an error, cleaning up new assets: ${error}`,
  );
  // Only clean up assets if they weren't committed to the database
  if (newAssetIds.length > 0) {
    await Promise.all(
      newAssetIds.map((assetId) => silentDeleteAsset(userId, assetId)),
    );
  }
  throw error;
}
```

And add after line 1968:

```ts
        });

        // Clear tracking array - assets are now owned by database
        newAssetIds.length = 0;

        // Delete the old assets if any
```

Comment on lines +1611 to +1619

```ts
} catch (error) {
  if (downloadedAssetId) {
    logger.error(
      `[Crawler][${jobId}] handleAsAssetBookmark encountered an error, cleaning up new asset ${downloadedAssetId}: ${error}`,
    );
    await silentDeleteAsset(userId, downloadedAssetId);
  }
  throw error;
}
```

Same critical bug: if db.transaction() (line 1574) succeeds but AssetPreprocessingQueue.enqueue() (line 1603) throws, the asset gets deleted even though it's already in the database.

Fix: clear downloadedAssetId after successful transaction:

Suggested change

The catch block itself is unchanged; the fix is the addition after line 1600:

```ts
        });

        // Clear tracking - asset is now owned by database
        downloadedAssetId = undefined;

        await AssetPreprocessingQueue.enqueue(
```


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@apps/workers/workers/crawlerWorker.ts`:
- Around line 1995-2027: The archive asset is persisted to storage by
archiveWebpage (producing fullPageArchiveAssetId) before the DB linkage is
committed with db.transaction/updateAsset, which can leave an orphaned asset if
the DB transaction fails; modify the flow so that you either (a) perform the DB
transaction to create/update the asset record and only call archiveWebpage to
upload after the DB commit, or (b) if you must upload first, wrap the
db.transaction/updateAsset call in try/catch and on any failure call
silentDeleteAsset(userId, fullPageArchiveAssetId) to remove the uploaded asset
and rethrow the error; ensure the logic references archiveWebpage,
fullPageArchiveAssetId, oldFullPageArchiveAssetId, db.transaction, updateAsset,
and silentDeleteAsset so the cleanup or reordered persistence is applied
consistently.
- Around line 1839-1855: The Promise.race calls around storeScreenshot and
storePdf cause un-cancelled background writes; remove Promise.race/abortPromise
usage and instead perform explicit abort checks with
abortSignal.throwIfAborted() before calling storeScreenshot(...) and
storePdf(...), then call those functions directly and push their returned
assetId into newAssetIds when truthy (e.g., const screenshotAssetInfo = await
storeScreenshot(...); if (screenshotAssetInfo)
newAssetIds.push(screenshotAssetInfo.assetId);), ensuring no operation is
started if abortSignal is already aborted; keep the subsequent
abortSignal.throwIfAborted() checks as needed and preserve error handling around
these calls.
- Around line 1807-1816: The parseDate helper currently constructs new
Date(date) but doesn't detect "Invalid Date" because new Date never throws;
update parseDate to create the Date object (keeping the existing signature) and
then validate it by checking the timestamp (e.g., using isNaN(d.getTime()) or
Number.isNaN(d.valueOf())); return the Date when valid and null when invalid or
when input is null/undefined. Ensure you still handle null/undefined early and
replace the try/catch with this explicit validation in the parseDate function.


📥 Commits

Reviewing files that changed from the base of the PR and between 8970dee and 5ffc216.

📒 Files selected for processing (1)
  • apps/workers/workers/crawlerWorker.ts


@coderabbitai coderabbitai Bot left a comment


♻️ Duplicate comments (3)
apps/workers/workers/crawlerWorker.ts (3)

1996-2029: ⚠️ Potential issue | 🟠 Major

Archive upload is not rolled back if archive DB transaction fails.

archiveWebpage uploads first, but if the DB transaction at Line 2011 fails, fullPageArchiveAssetId is left orphaned in storage.

💡 Proposed fix

```diff
             if (archiveResult) {
               const {
                 assetId: fullPageArchiveAssetId,
                 size,
                 contentType,
               } = archiveResult;

-              await db.transaction(async (txn) => {
-                await updateAsset(
-                  oldFullPageArchiveAssetId,
-                  {
-                    id: fullPageArchiveAssetId,
-                    bookmarkId,
-                    userId,
-                    assetType: AssetTypes.LINK_FULL_PAGE_ARCHIVE,
-                    contentType,
-                    size,
-                    fileName: null,
-                  },
-                  txn,
-                );
-              });
-              newAssetIds.length = 0;
+              try {
+                await db.transaction(async (txn) => {
+                  await updateAsset(
+                    oldFullPageArchiveAssetId,
+                    {
+                      id: fullPageArchiveAssetId,
+                      bookmarkId,
+                      userId,
+                      assetType: AssetTypes.LINK_FULL_PAGE_ARCHIVE,
+                      contentType,
+                      size,
+                      fileName: null,
+                    },
+                    txn,
+                  );
+                });
+              } catch (error) {
+                await silentDeleteAsset(userId, fullPageArchiveAssetId);
+                throw error;
+              }
               if (oldFullPageArchiveAssetId) {
                 await silentDeleteAsset(userId, oldFullPageArchiveAssetId);
               }
             }
```

1807-1816: ⚠️ Potential issue | 🟡 Minor

parseDate still accepts invalid date strings.

At Line 1812, new Date(date) does not throw for invalid input, so the try/catch won’t reject bad values.

💡 Proposed fix

```diff
         const parseDate = (date: string | null | undefined) => {
           if (!date) {
             return null;
           }
-          try {
-            return new Date(date);
-          } catch {
-            return null;
-          }
+          const parsed = new Date(date);
+          return Number.isNaN(parsed.getTime()) ? null : parsed;
         };
```
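As a quick check of the JavaScript semantics this fix relies on: `new Date(...)` never throws on unparsable input; it returns an "Invalid Date" object whose timestamp is `NaN`, so validation must inspect `getTime()` rather than rely on try/catch. A standalone version of the corrected helper:

```typescript
// new Date(...) does not throw on bad input; it yields an
// "Invalid Date" whose getTime() is NaN. Validate the timestamp
// instead of wrapping the constructor in try/catch.
function parseDate(date: string | null | undefined): Date | null {
  if (!date) {
    return null;
  }
  const parsed = new Date(date);
  return Number.isNaN(parsed.getTime()) ? null : parsed;
}
```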

1839-1851: ⚠️ Potential issue | 🔴 Critical

Promise.race here can still leak untracked asset writes on abort.

At Line 1839 and Line 1848, abortPromise winning the race does not cancel storeScreenshot/storePdf. Those operations can finish in background and store assets that never get added to newAssetIds.

💡 Proposed fix

```diff
-        const screenshotAssetInfo = await Promise.race([
-          storeScreenshot(screenshot, userId, jobId),
-          abortPromise(abortSignal),
-        ]);
+        abortSignal.throwIfAborted();
+        const screenshotAssetInfo = await storeScreenshot(
+          screenshot,
+          userId,
+          jobId,
+        );
         if (screenshotAssetInfo) {
           newAssetIds.push(screenshotAssetInfo.assetId);
         }
         abortSignal.throwIfAborted();

-        const pdfAssetInfo = await Promise.race([
-          storePdf(pdf, userId, jobId),
-          abortPromise(abortSignal),
-        ]);
+        abortSignal.throwIfAborted();
+        const pdfAssetInfo = await storePdf(pdf, userId, jobId);
         if (pdfAssetInfo) {
           newAssetIds.push(pdfAssetInfo.assetId);
         }
```
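The replacement pattern hinges on `AbortSignal.throwIfAborted()` (available in Node.js 17.3+ and modern browsers), which raises the signal's abort reason synchronously before any storage write starts, instead of leaving an un-cancelled write racing in the background. A minimal illustration with a hypothetical `store` callback:

```typescript
// Minimal illustration of the explicit abort check that replaces
// Promise.race: the storage operation is never started if the
// signal is already aborted, so no untracked write can leak.
async function storeIfNotAborted(
  store: () => Promise<string>,
  signal: AbortSignal,
): Promise<string> {
  signal.throwIfAborted(); // throws signal.reason when already aborted
  return store(); // only reached while the signal is still live
}
```

Note this guards against starting work after abort; a write already in flight still needs a cooperative signal to be cancelled mid-operation, which is what the AI prompt above suggests passing into `storeScreenshot`/`storePdf`.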


📥 Commits

Reviewing files that changed from the base of the PR and between 5ffc216 and fc53d9b.

📒 Files selected for processing (1)
  • apps/workers/workers/crawlerWorker.ts


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@apps/workers/workers/crawlerWorker.ts`:
- Around line 1604-1621: The catch currently always rethrows which will retry
the whole crawl even after the bookmark has been converted to an asset; change
the catch logic so that if assetPersisted is true you do not rethrow: log the
AssetPreprocessingQueue.enqueue error (include jobId, downloadedAssetId) and
continue (do not trigger job retry), while retaining the existing
cleanup-and-rethrow behavior only when assetPersisted is false (i.e., keep the
downloadedAssetId && !assetPersisted branch that calls silentDeleteAsset and
then throws). Specifically adjust the catch around
AssetPreprocessingQueue.enqueue in handleAsAssetBookmark to suppress rethrow
when assetPersisted === true and ensure clear logging instead.

Run ID: 884b9488-9d6e-499d-88ba-97ca2ae7c609

📥 Commits

Reviewing files that changed from the base of the PR and between fc53d9b and faa2f6c.

📒 Files selected for processing (1)
  • apps/workers/workers/crawlerWorker.ts


@coderabbitai coderabbitai Bot left a comment


🧹 Nitpick comments (1)
apps/workers/workers/crawlerWorker.ts (1)

2034-2034: Redundant array clearing.

newAssetIds.length = 0 here is a no-op since the array was already cleared at line 1989 before this function was returned, and nothing adds to newAssetIds inside the returned archival function.

Suggested fix

```diff
               });
             } catch (e) {
               await silentDeleteAsset(userId, fullPageArchiveAssetId);
               throw e;
             }
-            newAssetIds.length = 0;
             if (oldFullPageArchiveAssetId) {
               await silentDeleteAsset(userId, oldFullPageArchiveAssetId);
             }
```

Run ID: be4302c5-470b-4008-ba1e-1f4cdfee9e28

📥 Commits

Reviewing files that changed from the base of the PR and between faa2f6c and f9f7be6.

📒 Files selected for processing (1)
  • apps/workers/workers/crawlerWorker.ts

Address CodeRabbit pre-merge check requiring 80% docstring coverage.
Added JSDoc to 20 undocumented functions (abortPromise, crawlPage,
storeScreenshot, storePdf, archiveWebpage, crawlAndParseUrl, etc.).

Also verified the three CodeRabbit review concerns:
- parseDate already validates with isNaN(getTime())
- archiveWebpage already cleans up orphaned assets via silentDeleteAsset
- Promise.race screenshot/pdf writes are guarded by throwIfAborted()
@VedantMadane
Author

Addressing the CodeRabbit/Greptile review comments:

Re: asset cleanup after DB commit (comments 1, 3, 4):
The `assetPersisted` flag on line 1620 is set only after the DB transaction completes. The outer catch (line 1638) checks `!assetPersisted` before deleting. If the enqueue fails post-commit, it's caught in the inner try/catch (lines 1621-1636) and logged without cleanup, since the asset is already persisted. This is intentional: the asset exists in the DB, so deleting the storage object would create an orphaned DB record.

Re: deletions started before transaction commit (comment 2):
The `assetDeletionTasks` array stores lambda functions (`() => silentDeleteAsset(...)`), not promises. They're only invoked on line 2011, after the transaction commits (line 2005) and `transactionCommitted` is set (line 2007). No storage deletion runs inside the transaction.

Re: archive asset rollback (comment 7):
The archive step (lines 2013+) runs after the main transaction commits. If archive storage or DB linkage fails, the main bookmark data is already committed and consistent. The archive asset becomes orphaned in storage, which is acceptable since it's a supplementary asset.

Re: retrying after post-commit enqueue failure (comment 8):
The inner try/catch around `AssetPreprocessingQueue.enqueue` (line 1631) catches enqueue failures separately and returns gracefully without rethrowing, so the crawl is not retried.

Happy to adjust any of this if a maintainer sees issues with the approach.
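The two-phase deletion described in this reply can be sketched as follows. This is a simplified illustration; the `runTransaction` signature is hypothetical, not the actual `db.transaction` API:

```typescript
// Sketch of the two-phase deletion: lambdas (not started promises)
// are collected while the transaction runs and are invoked only
// after it has committed. If the transaction rejects, the lambdas
// are never called, so storage is untouched.
async function deleteAfterCommit(
  runTransaction: (collect: (task: () => Promise<void>) => void) => Promise<void>,
): Promise<void> {
  const assetDeletionTasks: (() => Promise<void>)[] = [];
  // Inside the transaction, callers only *collect* deletion lambdas.
  await runTransaction((task) => assetDeletionTasks.push(task));
  // The transaction has committed; storage deletions may now run.
  await Promise.all(assetDeletionTasks.map((task) => task()));
}
```

Collecting thunks instead of eagerly-started promises is what keeps storage deletion ordered strictly after the commit.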

@VedantMadane
Author

Closing this -- the reviewers correctly identified a fundamental race condition: the catch-block cleanup deletes assets even after the database transaction has already committed, which creates orphaned DB references. A proper fix needs a different approach (e.g., deferred cleanup that checks the assetPersisted flag before any deletion, or a two-phase approach where deletion promises are collected but only awaited after verifying the transaction result). Thanks to the reviewers for the thorough analysis.


Development

Successfully merging this pull request may close these issues.

If the parser OOMs, orphaned assets / invalid data can be left in the database

1 participant