Skip to content

feat: Implement storage aliases#3636

Open
janbuchar wants to merge 24 commits into
simplify-storage-subclientsfrom
implement-storage-aliases
Open

feat: Implement storage aliases#3636
janbuchar wants to merge 24 commits into
simplify-storage-subclientsfrom
implement-storage-aliases

Conversation

@janbuchar
Copy link
Copy Markdown
Contributor

@janbuchar janbuchar added the t-tooling Issues with this label are in the ownership of the tooling team. label May 7, 2026
@janbuchar janbuchar requested review from B4nan and barjin May 7, 2026 15:39
@janbuchar janbuchar linked an issue May 7, 2026 that may be closed by this pull request
Copy link
Copy Markdown
Member

@barjin barjin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @janbuchar ! I have a few questions below ⬇️

Comment on lines +1302 to +1303
// The default queue is an unnamed storage (opened with no identifier), so it has no name.
const isDefaultQueue = this.requestQueue?.name === undefined;
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure I agree with this, e.g., in WCC, we create unnamed queues to pass to the subcrawlers, see:

const fileQueue = await RequestQueue.open();

const mainCrawler = new Crawler(({ req, res }) => {
    if(res is file) await fileQueue.addRequest(req);
});

const fileCrawler = new Crawler({ rq: fileQueue, ... })

In this use-case, the RQ is supposed to, e.g., survive migration and keep the enqueued file links until these are processed by fileCrawler.

I guess my point is unnamed !== exclusively used by one crawler. Perhaps we could have a similar flag as with SessionPool?

/**
* Indicates whether the crawler owns the session pool (it was not passed from the outside using the `sessionPool` constructor option).
*/
private ownsSessionPool: boolean;

Comment on lines +122 to +125
/**
* Ensure that the same string is not used as both a name and an alias for the same
* storage class + backend combination. Mirrors crawlee-python's `_check_name_alias_conflict`.
*/
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Forgive me for being dense, but why isn't this supported? Two supporting arguments I can think of:

A) The legacy Dataset.open(idOrName: string) signature doesn't support alias, so it's unambiguous in that way. To use alias, the user has to be explicit, e.g., Dataset.open({ name: 'x' }) or Dataset.open({ alias: 'x' })

B) If this is because of the memory-storage implementation (both alias and name become folder names), then it's imo memory-storage implementation leaking into the abstraction, and we should change this. Perhaps add an indirection layer for aliases?

Is there another reason I'm not seeing?

Comment on lines +125 to +127
store.id === entryNameOrId ||
store.name?.toLowerCase() === entryNameOrId.toLowerCase() ||
store.directoryName.toLowerCase() === entryNameOrId.toLowerCase(),
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Given that directoryName is always populated, doesn't the last clause (directoryName) always contain the first two (id / name)?

directoryNameMatches => idMatches \/ nameMatches

@janbuchar janbuchar force-pushed the simplify-storage-subclients branch from d998bc7 to 1e349f5 Compare May 12, 2026 11:20
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

t-tooling Issues with this label are in the ownership of the tooling team.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add support for non-default unnamed storages

3 participants