
[ENHANCEMENT] Sub-optimal SDK implementation for uploading collections #77

@gauravagerwala

Description

Confirmation of Request Source

  • I confirm that this feature request is for the xAI SDK Python library (e.g., client features, SDK methods, or documentation) and not for the xAI API itself (e.g., model capabilities or API changes).

Describe the feature you'd like

We built a design review tool for the X AI hackathon.
The tool found scaling bottlenecks in the Python SDK's collections management. Its analysis relies on artifacts it generated (attached):

workflows.json
design-workflow-3-collections-management.md

Scaling Bottlenecks in the Third Workflow: Collections Management

The third workflow, as documented in .exp/design-workflow-3-collections-management.md and workflows.json, focuses on managing vector collections for document storage, embedding, indexing, searching, and retrieval. It involves operations like creating collections, uploading/indexing documents, and searching via gRPC calls to the management and documents services. Exploration of the codebase (src/xai_sdk/collections.py, src/xai_sdk/sync/collections.py, src/xai_sdk/files.py, src/xai_sdk/poll_timer.py, the proto files, and examples/sync/collection.py) reveals several client-side scaling bottlenecks, particularly for high-volume scenarios (e.g., ingesting thousands of documents or handling large files). These limit efficiency and increase latency, memory usage, and API-call overhead:

1. Lack of Batch Operations for Document Addition (Major Bottleneck)

  • The underlying protobuf API (in proto/v5/collections_pb2_grpc.py and v6) supports BatchAddDocumentToCollection(BatchAddDocumentToCollectionRequest) RPC for adding multiple documents in a single call.
  • However, the SDK client (sync/collections.py) only exposes single-document methods: add_existing_document (single AddDocumentToCollection RPC) and upload_document (which internally calls single add after file upload).
  • Impact for Scaling: For bulk ingestion (common in RAG/knowledge base workflows), users must loop over individual calls, resulting in N gRPC RPCs for N documents. This amplifies network latency (each RPC has overhead), risks server-side rate limiting/throttling, and serializes processing. Examples show sequential small uploads, but real-world scale (e.g., 10k+ docs) would be prohibitively slow without user-implemented parallelism (threads/asyncio).
  • Evidence: No batch_add_documents method in the client; a grep of the protos confirms the RPC exists but is not exposed. The design doc mentions "batch operations," but the client only implements batch_get_documents.
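The N-RPC cost of the current loop vs. a batched wrapper can be sketched as follows. Both functions are illustrative stand-ins, not real SDK methods; they count simulated RPCs to show the N → N/batch_size reduction a BatchAddDocumentToCollection wrapper would give:

```python
# Hypothetical stand-ins for SDK calls; names are illustrative, not real SDK API.
rpc_calls = 0

def add_document_to_collection(collection_id: str, file_id: str) -> None:
    """Stand-in for one single-document AddDocumentToCollection RPC."""
    global rpc_calls
    rpc_calls += 1

def batch_add_documents(collection_id: str, file_ids: list[str], batch_size: int = 100) -> None:
    """Sketch of a client wrapper over the BatchAddDocumentToCollection RPC:
    one RPC per batch of file IDs instead of one per document."""
    global rpc_calls
    for i in range(0, len(file_ids), batch_size):
        _ = file_ids[i:i + batch_size]  # would become one BatchAddDocumentToCollectionRequest
        rpc_calls += 1

file_ids = [f"file-{i}" for i in range(1000)]

# Today: one RPC per document.
for fid in file_ids:
    add_document_to_collection("col-1", fid)
single_calls = rpc_calls  # 1000

# With batching: 10 RPCs for the same 1000 documents.
rpc_calls = 0
batch_add_documents("col-1", file_ids, batch_size=100)
batch_calls = rpc_calls  # 10
```

Beyond the raw RPC count, batching also amortizes per-call latency and reduces exposure to per-request rate limits.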

2. Memory Inefficiency in Document Upload for Large Files

  • upload_document(..., data: bytes) in sync/collections.py requires loading the entire document into memory to compute len(data) and slice chunks via _chunk_file_data (in files.py:93-129), which yields fixed-size chunks (_CHUNK_SIZE, likely ~4 MB) for the streaming gRPC upload.
  • While the upload itself streams (good), the initial bytes load is a client-side burden.
  • Impact for Scaling: Large documents (e.g., PDFs >100 MB, datasets) cause high memory spikes per upload, risking OOM in batch loops or low-memory environments. There is no overload for path: str or file objects in upload_document; users must detour via client.files.upload(path=...) (which streams from disk via open("rb").read(_CHUNK_SIZE) in files.py:130-164), get the file_id, then call add_existing_document. This adds complexity and still requires single adds for bulk ingestion.
  • Evidence: Examples use small b"""...""" strings; files.upload(path) handles streaming well, but collections workflow doesn't integrate it seamlessly.
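A path-based chunked reader keeps peak memory at roughly one chunk rather than the whole file. This is a minimal sketch of the pattern, not SDK code; the helper name and the 4 MB constant are assumptions based on the _CHUNK_SIZE behavior described above:

```python
import os
import tempfile
from typing import Iterator

_CHUNK_SIZE = 4 * 1024 * 1024  # assumed to match the SDK's ~4 MB chunk size

def iter_file_chunks(path: str, chunk_size: int = _CHUNK_SIZE) -> Iterator[bytes]:
    """Yield fixed-size chunks straight from disk, so peak memory stays
    near chunk_size instead of the full file loaded as one bytes object."""
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            yield chunk

# Demo with a small temp file and a small chunk size.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"x" * 10_000)
    path = tmp.name

chunks = list(iter_file_chunks(path, chunk_size=4096))
total_bytes = sum(len(c) for c in chunks)  # 10_000, in 3 chunks
os.unlink(path)
```

An upload_document(path=...) overload built on such a generator would remove the detour through client.files.upload for large documents.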

3. Inefficient Per-Document Polling for Async Indexing Status

  • Document indexing (chunking, embedding, HNSW indexing) is server-side async after AddDocumentToCollection.
  • upload_document(wait_for_indexing=True) or manual get_document loops poll status via PollTimer (poll_timer.py), checking DocumentStatus (PROCESSING → PROCESSED/FAILED) with default 10s intervals (DEFAULT_INDEXING_POLL_INTERVAL).
  • No batch status method exists; polling multiple documents via batch_get_documents is possible, but it is manual and still loops per batch.
  • Impact for Scaling: In bulk workflows, waiting for many docs requires many polls (e.g., 10s intervals × N docs = excessive gRPC traffic, even with configurable intervals). Busy-waiting wastes resources; there are no pub/sub or webhook alternatives. Timeouts (default 2 min) may trip under load if the server has a backlog.
  • Evidence: _wait_for_indexing_to_complete in sync/collections.py:319-356 is a single-doc loop with time.sleep; the design doc notes "optional polling," but there is no optimized batch/multi-wait.
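A single shared polling loop over many documents issues one (batched) status call per interval instead of one per document. In this sketch, batch_get_statuses is a hypothetical stand-in for one batch_get_documents RPC; the in-memory status dict simulates server-side indexing progress:

```python
import time

# Simulated server state: five documents still indexing.
_status = {f"file-{i}": "PROCESSING" for i in range(5)}

def batch_get_statuses(file_ids: list[str]) -> dict[str, str]:
    """Stand-in for one batch_get_documents RPC returning all statuses.
    Here the simulated server finishes everything after the first poll."""
    for fid in file_ids:
        _status[fid] = "PROCESSED"
    return {fid: _status[fid] for fid in file_ids}

def wait_for_all(file_ids: list[str], interval: float = 0.01, timeout: float = 1.0) -> dict[str, str]:
    """One polling loop for N documents: one batched RPC per tick, instead
    of the N RPCs per tick that per-document polling would issue."""
    deadline = time.monotonic() + timeout
    while True:
        statuses = batch_get_statuses(file_ids)
        if all(s != "PROCESSING" for s in statuses.values()):
            return statuses
        if time.monotonic() >= deadline:
            raise TimeoutError("indexing did not complete in time")
        time.sleep(interval)

results = wait_for_all(list(_status))
```

A batch_wait_for_indexing helper of roughly this shape would cap polling traffic at one RPC per interval regardless of document count.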

4. Synchronous Blocking and Lack of Built-in Concurrency

  • Sync client (sync/collections.py) blocks on each RPC (upload, add, poll, search), making loops inherently sequential.
  • No high-level batch/parallel methods (e.g., upload_documents(paths: list[str], parallel=True) with progress or threading).
  • Async client (aio/collections.py) exists for concurrency but mirrors issues (single ops, no batch).
  • Impact for Scaling: High-throughput ingestion (e.g., processing directories of files) requires user-side concurrency (e.g., concurrent.futures), increasing code complexity and the potential for race conditions (e.g., adding a document before its upload completes). Telemetry/interceptors add minor per-call overhead under load.
  • Evidence: Examples (examples/sync/collection.py) run ops sequentially; BaseClient reuses channels (good), but no parallelism abstraction.
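Until the SDK ships a concurrency helper, the usual user-side workaround is a thread pool over the blocking calls. In this sketch, upload_one is a hypothetical stand-in for the real blocking upload method; the assumption that a shared client/channel is safe across threads would need confirming against the SDK:

```python
import concurrent.futures
import threading

lock = threading.Lock()
uploaded: list[str] = []

def upload_one(doc_id: str) -> str:
    """Stand-in for a blocking per-document upload RPC."""
    with lock:
        uploaded.append(doc_id)
    return doc_id

def upload_many(doc_ids: list[str], max_workers: int = 8) -> list[str]:
    """Fan blocking uploads out across a thread pool; pool.map preserves
    input order in the returned results."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(upload_one, doc_ids))

doc_ids = [f"doc-{i}" for i in range(20)]
results = upload_many(doc_ids)
```

This is exactly the boilerplate the issue argues every user currently has to write; an upload_documents(..., parallel=True) method would absorb it into the SDK.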

5. Minor Overhead in Repeated Validations and Conversions

  • Every create/update involves Pydantic TypeAdapter validations (FieldDefinitionValidator, ChunkConfigurationValidator) and dict-to-pb conversions (e.g., _field_definition_to_pb, _chunk_configuration_to_pb).
  • Impact for Scaling: Negligible for a few ops, but in loops over many collections/fields it adds CPU cycles. Proto versioning (v5/v6 stubs) requires correct channel selection, a potential misconfiguration under scale.
  • Evidence: The collections.py converters are called per call; good type safety, but at a runtime cost.
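When the same configuration is validated and converted on every call, memoizing on a hashable key amortizes the cost across a loop. This is an illustrative sketch only, not SDK code; the function name mirrors the converter mentioned above but is hypothetical:

```python
from functools import lru_cache

convert_calls = 0  # counts how often the expensive conversion actually runs

@lru_cache(maxsize=None)
def field_definition_to_pb(name: str, dtype: str) -> tuple:
    """Stand-in for a validate-then-convert step (e.g., dict to protobuf);
    lru_cache makes repeated identical inputs hit the cache."""
    global convert_calls
    convert_calls += 1
    return (name, dtype)  # a real version would build the protobuf message

# The same field configuration reused across a 1000-iteration loop
# triggers exactly one conversion.
for _ in range(1000):
    field_definition_to_pb("title", "string")
```

The trade-off is that cached inputs must be hashable, so dict-shaped configs would need a canonical tuple form first.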

Recommendations for Mitigation (Not Implemented Changes)

  • Expose batch_add_documents(file_ids: list[str], fields: list[dict]) wrapping proto RPC.
  • Add upload_documents overloads for paths/files with optional concurrency and batch add.
  • Implement batch_wait_for_indexing(file_ids: list[str]) using batch_get_documents loops or proto batch status if available.
  • Integrate files streaming directly in collections for path-based uploads.
  • Add optional async wrappers or concurrency helpers in docs/examples for scale.

These issues primarily affect client-side efficiency for large-scale document management, aligning with the workflow's emphasis on "document storage, embedding, indexing" at volume. The server side (HNSW search, embedding) scales by design (e.g., approximate NN), but the SDK gaps force suboptimal usage. For the current small scale (as in the examples), this is fine; for production RAG pipelines, these are significant hurdles.

Additional Context

No response
