
[ENHANCEMENT] Sub-optimal SDK implementation for uploading collections #77

@gauravagerwala

Description

Confirmation of Request Source

  • I confirm that this feature request is for the xAI SDK Python library (e.g., client features, SDK methods, or documentation) and not for the xAI API itself (e.g., model capabilities or API changes).

Describe the feature you'd like

We built a design review tool for the X AI hackathon.
The tool found scaling bottlenecks in the Python SDK's collections management. Its analysis relies on artifacts it generated (attached):

workflows.json
design-workflow-3-collections-management.md

Scaling Bottlenecks in the Third Workflow: Collections Management

The third workflow, as documented in .exp/design-workflow-3-collections-management.md and workflows.json, focuses on managing vector collections for document storage, embedding, indexing, searching, and retrieval. It involves operations like creating collections, uploading/indexing documents, and searching via gRPC calls to the management and documents services. Exploration of the codebase (src/xai_sdk/collections.py, src/xai_sdk/sync/collections.py, src/xai_sdk/files.py, src/xai_sdk/poll_timer.py, the proto files, and examples/sync/collection.py) reveals several client-side scaling bottlenecks, particularly for high-volume scenarios (e.g., ingesting thousands of documents or handling large files). These limit efficiency and increase latency, memory usage, and API-call overhead:

1. Lack of Batch Operations for Document Addition (Major Bottleneck)

  • The underlying protobuf API (in proto/v5/collections_pb2_grpc.py and v6) supports BatchAddDocumentToCollection(BatchAddDocumentToCollectionRequest) RPC for adding multiple documents in a single call.
  • However, the SDK client (sync/collections.py) only exposes single-document methods: add_existing_document (single AddDocumentToCollection RPC) and upload_document (which internally calls single add after file upload).
  • Impact for Scaling: For bulk ingestion (common in RAG/knowledge base workflows), users must loop over individual calls, resulting in N gRPC RPCs for N documents. This amplifies network latency (each RPC has overhead), risks server-side rate limiting/throttling, and serializes processing. Examples show sequential small uploads, but real-world scale (e.g., 10k+ docs) would be prohibitively slow without user-implemented parallelism (threads/asyncio).
  • Evidence: No batch_add_documents method in the client; a grep of the protos confirms the RPC exists but is not exposed. The design doc mentions "batch operations," but the client only implements batch_get_documents.
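The N-RPC cost of the current loop vs. a batched wrapper can be sketched as follows. Both functions are illustrative stand-ins, not real SDK methods; they count simulated RPCs to show the N → N/batch_size reduction a BatchAddDocumentToCollection wrapper would give:

```python
# Hypothetical stand-ins for SDK calls; names are illustrative, not real SDK API.
rpc_calls = 0

def add_document_to_collection(collection_id: str, file_id: str) -> None:
    """Stand-in for one single-document AddDocumentToCollection RPC."""
    global rpc_calls
    rpc_calls += 1

def batch_add_documents(collection_id: str, file_ids: list[str], batch_size: int = 100) -> None:
    """Sketch of a client wrapper over the BatchAddDocumentToCollection RPC:
    one RPC per batch of file IDs instead of one per document."""
    global rpc_calls
    for i in range(0, len(file_ids), batch_size):
        _ = file_ids[i:i + batch_size]  # would become one BatchAddDocumentToCollectionRequest
        rpc_calls += 1

file_ids = [f"file-{i}" for i in range(1000)]

# Today: one RPC per document.
for fid in file_ids:
    add_document_to_collection("col-1", fid)
single_calls = rpc_calls  # 1000

# With batching: 10 RPCs for the same 1000 documents.
rpc_calls = 0
batch_add_documents("col-1", file_ids, batch_size=100)
batch_calls = rpc_calls  # 10
```

Beyond the raw RPC count, batching also amortizes per-call latency and reduces exposure to per-request rate limits.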

2. Memory Inefficiency in Document Upload for Large Files

  • upload_document(..., data: bytes) in sync/collections.py requires loading the entire document into memory to compute len(data) and slice chunks via _chunk_file_data (in files.py:93-129), which yields fixed-size chunks (_CHUNK_SIZE, likely ~4 MB) for the streaming gRPC upload.
  • While the upload itself streams (good), the initial bytes load is a client-side burden.
  • Impact for Scaling: Large documents (e.g., PDFs >100 MB, datasets) cause high memory spikes per upload, risking OOM in batch loops or low-memory environments. There is no overload for path: str or file objects in upload_document; users must detour via client.files.upload(path=...) (which streams from disk via open("rb").read(_CHUNK_SIZE) in files.py:130-164), get the file_id, then call add_existing_document. This adds complexity and still requires single adds for bulk ingestion.
  • Evidence: Examples use small b"""...""" strings; files.upload(path) handles streaming well, but collections workflow doesn't integrate it seamlessly.
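A path-based chunked reader keeps peak memory at roughly one chunk rather than the whole file. This is a minimal sketch of the pattern, not SDK code; the helper name and the 4 MB constant are assumptions based on the _CHUNK_SIZE behavior described above:

```python
import os
import tempfile
from typing import Iterator

_CHUNK_SIZE = 4 * 1024 * 1024  # assumed to match the SDK's ~4 MB chunk size

def iter_file_chunks(path: str, chunk_size: int = _CHUNK_SIZE) -> Iterator[bytes]:
    """Yield fixed-size chunks straight from disk, so peak memory stays
    near chunk_size instead of the full file loaded as one bytes object."""
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            yield chunk

# Demo with a small temp file and a small chunk size.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"x" * 10_000)
    path = tmp.name

chunks = list(iter_file_chunks(path, chunk_size=4096))
total_bytes = sum(len(c) for c in chunks)  # 10_000, in 3 chunks
os.unlink(path)
```

An upload_document(path=...) overload built on such a generator would remove the detour through client.files.upload for large documents.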

3. Inefficient Per-Document Polling for Async Indexing Status

  • Document indexing (chunking, embedding, HNSW indexing) is server-side async after AddDocumentToCollection.
  • upload_document(wait_for_indexing=True) or manual get_document loops poll status via PollTimer (poll_timer.py), checking DocumentStatus (PROCESSING → PROCESSED/FAILED) with default 10s intervals (DEFAULT_INDEXING_POLL_INTERVAL).
  • No batch status method exists; polling multiple documents via batch_get_documents is possible, but it is manual and still loops per batch.
  • Impact for Scaling: In bulk workflows, waiting for many docs requires many polls (e.g., 10s intervals × N docs = excessive gRPC traffic, even with configurable intervals). Busy-waiting wastes resources; there are no pub/sub or webhook alternatives. Timeouts (default 2 min) may trip under load if the server has a backlog.
  • Evidence: _wait_for_indexing_to_complete in sync/collections.py:319-356 is a single-doc loop with time.sleep; the design doc notes "optional polling," but there is no optimized batch/multi-wait.
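A single shared polling loop over many documents issues one (batched) status call per interval instead of one per document. In this sketch, batch_get_statuses is a hypothetical stand-in for one batch_get_documents RPC; the in-memory status dict simulates server-side indexing progress:

```python
import time

# Simulated server state: five documents still indexing.
_status = {f"file-{i}": "PROCESSING" for i in range(5)}

def batch_get_statuses(file_ids: list[str]) -> dict[str, str]:
    """Stand-in for one batch_get_documents RPC returning all statuses.
    Here the simulated server finishes everything after the first poll."""
    for fid in file_ids:
        _status[fid] = "PROCESSED"
    return {fid: _status[fid] for fid in file_ids}

def wait_for_all(file_ids: list[str], interval: float = 0.01, timeout: float = 1.0) -> dict[str, str]:
    """One polling loop for N documents: one batched RPC per tick, instead
    of the N RPCs per tick that per-document polling would issue."""
    deadline = time.monotonic() + timeout
    while True:
        statuses = batch_get_statuses(file_ids)
        if all(s != "PROCESSING" for s in statuses.values()):
            return statuses
        if time.monotonic() >= deadline:
            raise TimeoutError("indexing did not complete in time")
        time.sleep(interval)

results = wait_for_all(list(_status))
```

A batch_wait_for_indexing helper of roughly this shape would cap polling traffic at one RPC per interval regardless of document count.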

4. Synchronous Blocking and Lack of Built-in Concurrency

  • Sync client (sync/collections.py) blocks on each RPC (upload, add, poll, search), making loops inherently sequential.
  • No high-level batch/parallel methods (e.g., upload_documents(paths: list[str], parallel=True) with progress or threading).
  • Async client (aio/collections.py) exists for concurrency but mirrors issues (single ops, no batch).
  • Impact for Scaling: High-throughput ingestion (e.g., processing directories of files) requires user-side concurrency (e.g., concurrent.futures), increasing code complexity and the potential for race conditions (e.g., adding a document before its upload completes). Telemetry/interceptors add minor per-call overhead under load.
  • Evidence: Examples (examples/sync/collection.py) run ops sequentially; BaseClient reuses channels (good), but no parallelism abstraction.
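Until the SDK ships a concurrency helper, the usual user-side workaround is a thread pool over the blocking calls. In this sketch, upload_one is a hypothetical stand-in for the real blocking upload method; the assumption that a shared client/channel is safe across threads would need confirming against the SDK:

```python
import concurrent.futures
import threading

lock = threading.Lock()
uploaded: list[str] = []

def upload_one(doc_id: str) -> str:
    """Stand-in for a blocking per-document upload RPC."""
    with lock:
        uploaded.append(doc_id)
    return doc_id

def upload_many(doc_ids: list[str], max_workers: int = 8) -> list[str]:
    """Fan blocking uploads out across a thread pool; pool.map preserves
    input order in the returned results."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(upload_one, doc_ids))

doc_ids = [f"doc-{i}" for i in range(20)]
results = upload_many(doc_ids)
```

This is exactly the boilerplate the issue argues every user currently has to write; an upload_documents(..., parallel=True) method would absorb it into the SDK.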

5. Minor Overhead in Repeated Validations and Conversions

  • Every create/update involves Pydantic TypeAdapter validations (FieldDefinitionValidator, ChunkConfigurationValidator) and dict-to-pb conversions (e.g., _field_definition_to_pb, _chunk_configuration_to_pb).
  • Impact for Scaling: Negligible for a few ops, but in loops over many collections/fields it adds CPU cycles. Proto versioning (v5/v6 stubs) requires correct channel selection, a potential misconfiguration under scale.
  • Evidence: The collections.py converters are called per call; good type safety, but at a runtime cost.
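When the same configuration is validated and converted on every call, memoizing on a hashable key amortizes the cost across a loop. This is an illustrative sketch only, not SDK code; the function name mirrors the converter mentioned above but is hypothetical:

```python
from functools import lru_cache

convert_calls = 0  # counts how often the expensive conversion actually runs

@lru_cache(maxsize=None)
def field_definition_to_pb(name: str, dtype: str) -> tuple:
    """Stand-in for a validate-then-convert step (e.g., dict to protobuf);
    lru_cache makes repeated identical inputs hit the cache."""
    global convert_calls
    convert_calls += 1
    return (name, dtype)  # a real version would build the protobuf message

# The same field configuration reused across a 1000-iteration loop
# triggers exactly one conversion.
for _ in range(1000):
    field_definition_to_pb("title", "string")
```

The trade-off is that cached inputs must be hashable, so dict-shaped configs would need a canonical tuple form first.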

Recommendations for Mitigation (Not Implemented Changes)

  • Expose batch_add_documents(file_ids: list[str], fields: list[dict]) wrapping proto RPC.
  • Add upload_documents overloads for paths/files with optional concurrency and batch add.
  • Implement batch_wait_for_indexing(file_ids: list[str]) using batch_get_documents loops or proto batch status if available.
  • Integrate files streaming directly in collections for path-based uploads.
  • Add optional async wrappers or concurrency helpers in docs/examples for scale.

These issues primarily affect client-side efficiency for large-scale document management, aligning with the workflow's emphasis on "document storage, embedding, indexing" at volume. The server side (HNSW search, embedding) scales by design (e.g., approximate NN), but the SDK gaps force suboptimal usage. For the current small scale (as in the examples), this is fine; for production RAG pipelines, these are significant hurdles.

Additional Context

No response
