fix: cleanup orphaned assets if crawler crashes before database transaction commits #2543
Conversation
Note: Reviews paused. It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior.
Walkthrough: Staged asset creation and collection were introduced: assets (screenshot, PDF, HTML, banner) are accumulated during precrawl/live crawl into newAssetIds, persistence is deferred into a single transaction, and robust cleanup removes partially created assets on failure; download/enqueue flows and logging were hardened.
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
Pre-merge checks: ❌ Failed checks (1 warning)
✅ Passed checks (4)
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@apps/workers/workers/crawlerWorker.ts`:
- Around line 1887-1971: The transaction currently pushes already-started
deletion promises into assetDeletionTasks inside the txn callback (see
assetDeletionTasks, silentDeleteAsset, updateAsset and the db.transaction
block), which can remove storage before the DB commits; change this by
collecting IDs (or functions) of assets to delete inside the transaction without
starting deletion, commit the transaction as-is (keep updates via updateAsset
and txn operations), then after the await db.transaction(...) resolve the
collected list and call silentDeleteAsset (or execute deletion promises) so
deletions run only after successful commit.
- Around line 1560-1618: The cleanup currently deletes the downloaded asset
(downloadedAssetId) on any error even if the DB transaction (db.transaction)
already committed and created bookmark rows; change the logic to only delete
when the DB transaction did NOT complete: introduce a boolean (e.g.,
assetPersisted = false) set to true immediately after the await
db.transaction(...) returns successfully (inside the same scope where
downloadAndStoreFile, downloadedAssetId, and AssetPreprocessingQueue.enqueue are
used), then update the catch block to call silentDeleteAsset(userId,
downloadedAssetId) only when downloadedAssetId is set AND assetPersisted is
false; reference downloadedAssetId, db.transaction,
AssetPreprocessingQueue.enqueue, and silentDeleteAsset to locate places to
modify.
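The common pattern behind both findings — defer destructive side effects until after the transaction has committed — can be sketched like this. The helper names (`runTransaction`, `deleteFromStorage`) are illustrative stand-ins, not the project's actual API:

```typescript
// Sketch: collect deletions during the transaction, run them only after
// commit. If the transaction rolls back, storage is left untouched.
type AssetId = string;

async function replaceAssets(
  runTransaction: (fn: () => Promise<void>) => Promise<void>,
  deleteFromStorage: (id: AssetId) => Promise<void>,
  oldIds: AssetId[],
): Promise<void> {
  const toDelete: AssetId[] = [];
  await runTransaction(async () => {
    // Inside the txn: only record what must go away; start nothing destructive.
    toDelete.push(...oldIds);
  });
  // Only after a successful commit do we touch storage.
  await Promise.all(toDelete.map((id) => deleteFromStorage(id)));
}
```

If `runTransaction` throws, the function rethrows before any `deleteFromStorage` call runs, which is exactly the ordering the review asks for.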
Greptile Summary: This PR attempts to prevent orphaned assets by adding try-catch blocks that clean up storage assets if the crawler crashes. However, the implementation has a critical bug that creates the opposite problem.
Key Issues / What Happens: both cleanup paths delete assets even after the database transaction has committed.
Root Cause: the asset tracking variables are never cleared after a successful commit.
Recommended Fix: clear the tracking variables immediately after successful database transactions to ensure cleanup only happens for uncommitted assets.
Confidence Score: 0/5
Important Files Changed
Last reviewed commit: 8970dee
```ts
  };
} catch (error) {
  logger.error(
    `[Crawler][${jobId}] crawlAndParseUrl encountered an error, cleaning up new assets: ${error}`,
  );
  // Clean up any assets that were created during this attempt but not committed
  await Promise.all(
    newAssetIds.map((assetId) => silentDeleteAsset(userId, assetId)),
  );
  throw error;
}
```
Critical bug: assets are deleted even after database transaction succeeds. If db.transaction() commits (line 1880) but Promise.all(assetDeletionTasks) (line 1971) or the abort signal throws, this catch block deletes assets already in the database, creating orphaned DB references (DB records pointing to non-existent storage).
Fix: clear newAssetIds immediately after successful transaction:
```suggestion
} catch (error) {
  logger.error(
    `[Crawler][${jobId}] crawlAndParseUrl encountered an error, cleaning up new assets: ${error}`,
  );
  // Only clean up assets if they weren't committed to the database
  if (newAssetIds.length > 0) {
    await Promise.all(
      newAssetIds.map((assetId) => silentDeleteAsset(userId, assetId)),
    );
  }
  throw error;
}
```
And add after line 1968:
```ts
});
// Clear tracking array - assets are now owned by database
newAssetIds.length = 0;
// Delete the old assets if any
```
Prompt To Fix With AI
This is a comment left during a code review.
Path: apps/workers/workers/crawlerWorker.ts
Line: 2013-2023
How can I resolve this? If you propose a fix, please make it concise.

```ts
} catch (error) {
  if (downloadedAssetId) {
    logger.error(
      `[Crawler][${jobId}] handleAsAssetBookmark encountered an error, cleaning up new asset ${downloadedAssetId}: ${error}`,
    );
    await silentDeleteAsset(userId, downloadedAssetId);
  }
  throw error;
}
```
Same critical bug: if db.transaction() (line 1574) succeeds but AssetPreprocessingQueue.enqueue() (line 1603) throws, the asset gets deleted even though it's already in the database.
Fix: clear downloadedAssetId after successful transaction:
```suggestion
} catch (error) {
  if (downloadedAssetId) {
    logger.error(
      `[Crawler][${jobId}] handleAsAssetBookmark encountered an error, cleaning up new asset ${downloadedAssetId}: ${error}`,
    );
    await silentDeleteAsset(userId, downloadedAssetId);
  }
  throw error;
}
```
And add after line 1600:
```ts
});
// Clear tracking - asset is now owned by database
downloadedAssetId = undefined;
await AssetPreprocessingQueue.enqueue(
```
Prompt To Fix With AI
This is a comment left during a code review.
Path: apps/workers/workers/crawlerWorker.ts
Line: 1611-1619
How can I resolve this? If you propose a fix, please make it concise.

… deletion until after commit
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@apps/workers/workers/crawlerWorker.ts`:
- Around line 1995-2027: The archive asset is persisted to storage by
archiveWebpage (producing fullPageArchiveAssetId) before the DB linkage is
committed with db.transaction/updateAsset, which can leave an orphaned asset if
the DB transaction fails; modify the flow so that you either (a) perform the DB
transaction to create/update the asset record and only call archiveWebpage to
upload after the DB commit, or (b) if you must upload first, wrap the
db.transaction/updateAsset call in try/catch and on any failure call
silentDeleteAsset(userId, fullPageArchiveAssetId) to remove the uploaded asset
and rethrow the error; ensure the logic references archiveWebpage,
fullPageArchiveAssetId, oldFullPageArchiveAssetId, db.transaction, updateAsset,
and silentDeleteAsset so the cleanup or reordered persistence is applied
consistently.
- Around line 1839-1855: The Promise.race calls around storeScreenshot and
storePdf cause un-cancelled background writes; remove Promise.race/abortPromise
usage and instead perform explicit abort checks with
abortSignal.throwIfAborted() before calling storeScreenshot(...) and
storePdf(...), then call those functions directly and push their returned
assetId into newAssetIds when truthy (e.g., const screenshotAssetInfo = await
storeScreenshot(...); if (screenshotAssetInfo)
newAssetIds.push(screenshotAssetInfo.assetId);), ensuring no operation is
started if abortSignal is already aborted; keep the subsequent
abortSignal.throwIfAborted() checks as needed and preserve error handling around
these calls.
- Around line 1807-1816: The parseDate helper currently constructs new
Date(date) but doesn't detect "Invalid Date" because new Date never throws;
update parseDate to create the Date object (keeping the existing signature) and
then validate it by checking the timestamp (e.g., using isNaN(d.getTime()) or
Number.isNaN(d.valueOf())); return the Date when valid and null when invalid or
when input is null/undefined. Ensure you still handle null/undefined early and
replace the try/catch with this explicit validation in the parseDate function.
♻️ Duplicate comments (3)
apps/workers/workers/crawlerWorker.ts (3)
1996-2029: ⚠️ Potential issue | 🟠 Major — Archive upload is not rolled back if archive DB transaction fails.

`archiveWebpage` uploads first, but if the DB transaction at Line 2011 fails, `fullPageArchiveAssetId` is left orphaned in storage.

💡 Proposed fix

```diff
 if (archiveResult) {
   const {
     assetId: fullPageArchiveAssetId,
     size,
     contentType,
   } = archiveResult;
-  await db.transaction(async (txn) => {
-    await updateAsset(
-      oldFullPageArchiveAssetId,
-      {
-        id: fullPageArchiveAssetId,
-        bookmarkId,
-        userId,
-        assetType: AssetTypes.LINK_FULL_PAGE_ARCHIVE,
-        contentType,
-        size,
-        fileName: null,
-      },
-      txn,
-    );
-  });
-  newAssetIds.length = 0;
+  try {
+    await db.transaction(async (txn) => {
+      await updateAsset(
+        oldFullPageArchiveAssetId,
+        {
+          id: fullPageArchiveAssetId,
+          bookmarkId,
+          userId,
+          assetType: AssetTypes.LINK_FULL_PAGE_ARCHIVE,
+          contentType,
+          size,
+          fileName: null,
+        },
+        txn,
+      );
+    });
+  } catch (error) {
+    await silentDeleteAsset(userId, fullPageArchiveAssetId);
+    throw error;
+  }
   if (oldFullPageArchiveAssetId) {
     await silentDeleteAsset(userId, oldFullPageArchiveAssetId);
   }
 }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@apps/workers/workers/crawlerWorker.ts` around lines 1996 - 2029, archiveWebpage uploads the archive before the db.transaction that calls updateAsset, so if updateAsset inside db.transaction fails the uploaded asset (fullPageArchiveAssetId) is left orphaned; wrap the db.transaction call in a try/catch and on failure call silentDeleteAsset(userId, fullPageArchiveAssetId) (or queue it for cleanup) before rethrowing, or alternatively move the upload into the same transactional flow so no upload occurs unless the transaction succeeds; reference archiveWebpage, fullPageArchiveAssetId, oldFullPageArchiveAssetId, updateAsset, db.transaction, and silentDeleteAsset to locate and implement the cleanup.
1807-1816: ⚠️ Potential issue | 🟡 Minor — `parseDate` still accepts invalid date strings.

At Line 1812, `new Date(date)` does not throw for invalid input, so the `try/catch` won't reject bad values.

💡 Proposed fix

```diff
 const parseDate = (date: string | null | undefined) => {
   if (!date) {
     return null;
   }
-  try {
-    return new Date(date);
-  } catch {
-    return null;
-  }
+  const parsed = new Date(date);
+  return Number.isNaN(parsed.getTime()) ? null : parsed;
 };
```

Does JavaScript `new Date("invalid-date")` throw an exception, or return an `Invalid Date` object?

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@apps/workers/workers/crawlerWorker.ts` around lines 1807 - 1816, The parseDate helper currently uses new Date(date) inside a try/catch but new Date(...) does not throw on invalid strings; change parseDate to construct the Date (e.g., const d = new Date(date)), then verify it is valid by checking Number.isNaN(d.getTime()) or isNaN(d.getTime()) and return null for invalid dates, otherwise return the Date object; update the function named parseDate to perform this validation so callers get null for invalid/Unparsable inputs.
1839-1851: ⚠️ Potential issue | 🔴 Critical — `Promise.race` here can still leak untracked asset writes on abort.

At Line 1839 and Line 1848, `abortPromise` winning the race does not cancel `storeScreenshot`/`storePdf`. Those operations can finish in the background and store assets that never get added to `newAssetIds`.

💡 Proposed fix

```diff
-      const screenshotAssetInfo = await Promise.race([
-        storeScreenshot(screenshot, userId, jobId),
-        abortPromise(abortSignal),
-      ]);
+      abortSignal.throwIfAborted();
+      const screenshotAssetInfo = await storeScreenshot(
+        screenshot,
+        userId,
+        jobId,
+      );
       if (screenshotAssetInfo) {
         newAssetIds.push(screenshotAssetInfo.assetId);
       }
       abortSignal.throwIfAborted();
-      const pdfAssetInfo = await Promise.race([
-        storePdf(pdf, userId, jobId),
-        abortPromise(abortSignal),
-      ]);
+      const pdfAssetInfo = await storePdf(pdf, userId, jobId);
       if (pdfAssetInfo) {
         newAssetIds.push(pdfAssetInfo.assetId);
       }
```

```bash
#!/bin/bash
# Verify non-cancelable race pattern and function signatures in crawler worker.
rg -nP 'async function storeScreenshot|async function storePdf|Promise\.race\(\[' apps/workers/workers/crawlerWorker.ts -A6 -B3
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@apps/workers/workers/crawlerWorker.ts` around lines 1839 - 1851, The Promise.race usage lets storeScreenshot/storePdf continue writing even if abortPromise wins, causing leaked assets; modify storeScreenshot and storePdf to accept an AbortSignal (or options object) and make them cooperative (throw/stop and cleanup when abortSignal.aborted), then call them with the existing abortSignal instead of racing with abortPromise; update the calls in crawlerWorker.ts (the await lines that call storeScreenshot(...) and storePdf(...), and the code that pushes into newAssetIds) to pass the abortSignal so the storage operations cancel and never produce orphaned asset writes when abortSignal triggers.
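The cooperative-cancellation alternative the prompt recommends — passing the `AbortSignal` into the store function instead of racing against it — might look roughly like this. The function and its parameters are illustrative stand-ins, not the worker's actual signatures:

```typescript
// Sketch: a store operation that checks the signal before and after the
// expensive write, so an abort can never leave an untracked asset behind.
// `write` and `rollback` are hypothetical stand-ins for storage helpers.
async function storeWithSignal(
  write: () => Promise<string>, // returns the new asset id
  rollback: (id: string) => Promise<void>,
  signal: AbortSignal,
): Promise<string> {
  signal.throwIfAborted(); // don't even start if already aborted
  const assetId = await write();
  if (signal.aborted) {
    // Aborted mid-write: undo our own side effect before propagating.
    await rollback(assetId);
    signal.throwIfAborted();
  }
  return assetId;
}
```

The key difference from `Promise.race` is that the write itself participates in cancellation, so no storage write can complete unobserved.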
…alidation

Made-with: Cursor
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@apps/workers/workers/crawlerWorker.ts`:
- Around line 1604-1621: The catch currently always rethrows which will retry
the whole crawl even after the bookmark has been converted to an asset; change
the catch logic so that if assetPersisted is true you do not rethrow: log the
AssetPreprocessingQueue.enqueue error (include jobId, downloadedAssetId) and
continue (do not trigger job retry), while retaining the existing
cleanup-and-rethrow behavior only when assetPersisted is false (i.e., keep the
downloadedAssetId && !assetPersisted branch that calls silentDeleteAsset and
then throws). Specifically adjust the catch around
AssetPreprocessingQueue.enqueue in handleAsAssetBookmark to suppress rethrow
when assetPersisted === true and ensure clear logging instead.
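The control flow this prompt asks for — rethrow (and clean up) only when the asset was never persisted, but swallow post-commit enqueue failures — can be sketched as follows. All function names here are illustrative stand-ins for the worker's real helpers:

```typescript
// Sketch of the assetPersisted pattern: storage cleanup and job retry apply
// only to work that never reached the database. All params are stand-ins.
async function persistThenEnqueue(
  commitToDb: () => Promise<void>,
  enqueue: () => Promise<void>,
  deleteAsset: () => Promise<void>,
  log: (msg: string) => void,
): Promise<void> {
  let assetPersisted = false;
  try {
    await commitToDb();
    assetPersisted = true; // from here on, the DB owns the asset
    await enqueue();
  } catch (error) {
    if (!assetPersisted) {
      // Pre-commit failure: remove the orphaned storage object and retry.
      await deleteAsset();
      throw error;
    }
    // Post-commit failure: log and continue; rethrowing would redo the
    // whole crawl for a bookmark that already exists.
    log(`enqueue failed after commit: ${error}`);
  }
}
```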
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 884b9488-9d6e-499d-88ba-97ca2ae7c609
📒 Files selected for processing (1)
apps/workers/workers/crawlerWorker.ts
Made-with: Cursor
🧹 Nitpick comments (1)
apps/workers/workers/crawlerWorker.ts (1)
2034-2034: Redundant array clearing.

`newAssetIds.length = 0` here is a no-op since the array was already cleared at line 1989 before this function was returned, and nothing adds to `newAssetIds` inside the returned archival function.

Suggested fix

```diff
         });
       } catch (e) {
         await silentDeleteAsset(userId, fullPageArchiveAssetId);
         throw e;
       }
-      newAssetIds.length = 0;
       if (oldFullPageArchiveAssetId) {
         await silentDeleteAsset(userId, oldFullPageArchiveAssetId);
       }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@apps/workers/workers/crawlerWorker.ts` at line 2034, Remove the redundant clearing of the newAssetIds array inside the archival closure: locate the archival function that references newAssetIds (the function returned near where newAssetIds was previously cleared at line ~1989) and delete the statement "newAssetIds.length = 0" since newAssetIds is already reset before the function returns and nothing in the closure adds to it; ensure no other logic depends on that final clearing and run tests to confirm behavior remains unchanged.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: be4302c5-470b-4008-ba1e-1f4cdfee9e28
📒 Files selected for processing (1)
apps/workers/workers/crawlerWorker.ts
Address CodeRabbit pre-merge check requiring 80% docstring coverage. Added JSDoc to 20 undocumented functions (abortPromise, crawlPage, storeScreenshot, storePdf, archiveWebpage, crawlAndParseUrl, etc.).

Also verified the three CodeRabbit review concerns:
- parseDate already validates with isNaN(getTime())
- archiveWebpage already cleans up orphaned assets via silentDeleteAsset
- Promise.race screenshot/pdf writes are guarded by throwIfAborted()
Addressing the CodeRabbit/Greptile review comments:
- Re: asset cleanup after DB commit (comments 1, 3, 4)
- Re: deletions started before transaction commit (comment 2)
- Re: archive asset rollback (comment 7)
- Re: retrying after post-commit enqueue failure (comment 8)

Happy to adjust any of this if a maintainer sees issues with the approach.
Closing this -- the reviewers correctly identified a fundamental race condition: the catch-block cleanup deletes assets even after the database transaction has already committed, which creates orphaned DB references. A proper fix needs a different approach (e.g., deferred cleanup that checks the assetPersisted flag before any deletion, or a two-phase approach where deletion promises are collected but only awaited after verifying the transaction result). Thanks to the reviewers for the thorough analysis.
Fixes #2519.
Problem
When the background crawler saves assets (screenshots, PDFs, images) to storage (S3 or local disk), there is a window of time before the database transaction commits the `assetId` to the `assets` table. If the worker crashes or OOMs during this window, the assets remain in storage with no database record pointing to them, effectively becoming orphaned.
Solution
- A `newAssetIds` array collects all successfully stored assets during a single crawl attempt.
- On failure, the cleanup path iterates `newAssetIds` and calls `silentDeleteAsset` to remove them from storage.
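The overall shape of the approach can be sketched as follows; the helper functions are stand-ins for the crawler's real code, and the `newAssetIds.length = 0` reset reflects the fix the reviewers requested:

```typescript
// Sketch of the PR's approach: track every stored asset id during one crawl
// attempt, and delete them all from storage if the attempt fails before the
// DB transaction commits. All parameters are illustrative stand-ins.
async function crawlAttempt(
  storeAssets: () => Promise<string[]>, // returns stored asset ids
  commit: (ids: string[]) => Promise<void>,
  silentDeleteAsset: (id: string) => Promise<void>,
): Promise<void> {
  const newAssetIds: string[] = [];
  try {
    newAssetIds.push(...(await storeAssets()));
    await commit(newAssetIds);
    // Commit succeeded: the DB now owns these assets, so stop tracking them
    // and let any later failure skip the storage cleanup.
    newAssetIds.length = 0;
  } catch (error) {
    await Promise.all(newAssetIds.map((id) => silentDeleteAsset(id)));
    throw error;
  }
}
```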