## Environment
- TPU: v5litepod-32 (32 chips, 8 hosts)
- JAX: 0.9.0
- jaxlib: 0.9.0
- orbax-checkpoint: 0.11.32
- Python: 3.12
- OS: Ubuntu 22.04 (tpu-ubuntu2204-base)
## Problem
When converting a HuggingFace Llama 8B checkpoint to MaxText format using `llama_or_mistral_ckpt.py`, the checkpoint save fails partway through. The layer conversion completes successfully (32/32 layers), but the Orbax save operation fails during an atomic file rename, leaving an incomplete, unloadable checkpoint.
## Error Message
```
ValueError: NOT_FOUND: Error writing "params.params.decoder.decoder_norm.scale/c/0" in OCDBT database at local file "/home/user/maxtext-checkpoints/r1-distill-llama-8b/0.orbax-checkpoint-tmp/items.orbax-checkpoint-tmp/ocdbt.process_0/": Failed to rename fd: 14 "/home/user/maxtext-checkpoints/r1-distill-llama-8b/0.orbax-checkpoint-tmp/items.orbax-checkpoint-tmp/ocdbt.process_0/d/8f39efd1e47e6a0ada381d99719b652c.__lock" to: "/home/user/maxtext-checkpoints/r1-distill-llama-8b/0.orbax-checkpoint-tmp/items.orbax-checkpoint-tmp/ocdbt.process_0/d/8f39efd1e47e6a0ada381d99719b652c" [OS error 2: ENOENT No such file or directory] [source locations='tensorstore/internal/os/file_util_posix.cc:406\ntensorstore/kvstore/file/file_key_value_store.cc:662\ntensorstore/kvstore/transaction.cc:133']
```
## Full Stack Trace
```
  File "/home/user/venv312/lib/python3.12/site-packages/orbax/checkpoint/_src/futures/future.py", line 309, in _target_setting_result
    self._result = target()
  File "/home/user/venv312/lib/python3.12/site-packages/orbax/checkpoint/_src/futures/future.py", line 426, in <lambda>
    target=lambda: asyncio_utils.run_sync(coro),
  File "/home/user/venv312/lib/python3.12/site-packages/orbax/checkpoint/_src/asyncio_utils.py", line 36, in run_sync
    return asyncio.run(coro)
  File "/home/user/venv312/lib/python3.12/site-packages/orbax/checkpoint/_src/serialization/jax_array_handlers.py", line 373, in _serialize_without_dispatcher
    await async_serialize_replica_slices_batch(
  File "/home/user/venv312/lib/python3.12/site-packages/orbax/checkpoint/_src/serialization/jax_array_handlers.py", line 618, in _async_serialize_replica_slices
    await asyncio.gather(*write_coros)
  File "/home/user/venv312/lib/python3.12/site-packages/orbax/checkpoint/_src/serialization/serialization.py", line 265, in async_serialize_from_host
    await asyncio.gather(*write_coros)
  File "/home/user/venv312/lib/python3.12/site-packages/orbax/checkpoint/_src/serialization/serialization.py", line 253, in write_fragment
    await t[fragment.index].write(
ValueError: NOT_FOUND: Error writing "params.params.decoder.decoder_norm.scale/c/0" in OCDBT database...
```
## Steps to Reproduce
1. Create a v5litepod-32 TPU
2. Install MaxText from https://github.com/AI-Hypercomputer/maxtext (`pip install -e .` in the repo root)
3. Download a HuggingFace Llama model (e.g., deepseek-ai/DeepSeek-R1-Distill-Llama-8B)
4. Run the checkpoint conversion:
```shell
JAX_PLATFORMS=cpu python3 llama_or_mistral_ckpt.py \
  --base-model-path=/path/to/hf-model \
  --maxtext-model-path=/path/to/output \
  --model-size=llama3-8b \
  --huggingface-checkpoint=True
```
## Observed Behavior
- Layer conversion completes: `layers: 100%|██████████| 32/32`
- Sharding conversion completes
- The Orbax save starts but fails during the atomic rename
- The checkpoint directory contains ~11 GB of data but is incomplete (missing tree-structure metadata)
- Subsequent attempts to load the checkpoint fail with "No structure could be identified"
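The incomplete state can be detected from the directory layout alone. The sketch below is a hypothetical helper (not part of MaxText or Orbax) that flags step entries still carrying the `.orbax-checkpoint-tmp` suffix visible in the error paths above, i.e. saves that crashed before being finalized:

```python
import tempfile
from pathlib import Path

# Suffix Orbax uses for in-progress saves, as seen in the error paths above.
TMP_SUFFIX = ".orbax-checkpoint-tmp"

def unfinalized_steps(ckpt_root: Path) -> list[str]:
    """Return entries under the checkpoint root that still carry the
    temporary suffix, meaning the save died before the final rename."""
    return sorted(p.name for p in ckpt_root.iterdir()
                  if p.name.endswith(TMP_SUFFIX))

# Example: a root with one committed step ("1") and one crashed save.
root = Path(tempfile.mkdtemp())
(root / ("0" + TMP_SUFFIX)).mkdir()
(root / "1").mkdir()
print(unfinalized_steps(root))  # -> ['0.orbax-checkpoint-tmp']
```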
## What We Tried
- Saving to local disk instead of GCS - same issue
- Running with `JAX_PLATFORMS=cpu` - same issue
- Disabling OCDBT/Zarr3 via the legacy-format flags - same issue (the save appears to succeed but 0 bytes are written)
- Different JAX versions (0.5.0, 0.8.1, 0.9.0) - same issue
## Analysis
The error appears to be a race condition or file-descriptor issue in the tensorstore OCDBT layer:
- Orbax (via tensorstore) writes each chunk to a `.__lock` file
- It then tries to atomically rename the `.__lock` file to the final file
- The rename fails with ENOENT, despite the lock file existing moments before

This may be related to async coordination between multiple write operations.
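The failing pattern can be illustrated with a stdlib-only sketch (a hypothetical stand-in for tensorstore's writer, not its actual code): write data to a lock file, fsync it, then rename it into place. An ENOENT (OS error 2) from the rename means the lock file vanished between the write and the commit:

```python
import errno
import os
import tempfile

def atomic_write(final_path: str, data: bytes) -> None:
    """Write-then-rename commit, mimicking the pattern described above."""
    lock_path = final_path + ".__lock"
    with open(lock_path, "wb") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())
    os.rename(lock_path, final_path)  # atomic commit on POSIX filesystems

d = tempfile.mkdtemp()
atomic_write(os.path.join(d, "chunk"), b"payload")
assert os.path.exists(os.path.join(d, "chunk"))

# If anything removes the lock file first (e.g. a concurrent cleanup of the
# *.orbax-checkpoint-tmp directory), the rename fails with the same
# "OS error 2" (ENOENT) seen in the report:
try:
    os.rename(os.path.join(d, "gone.__lock"), os.path.join(d, "chunk2"))
except OSError as e:
    assert e.errno == errno.ENOENT
```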
## Expected Behavior
Checkpoint conversion should complete successfully and produce a loadable MaxText checkpoint.
## Impact
This completely blocks the ability to convert HuggingFace models to MaxText format on multi-host TPUs, which is a primary use case for the MaxText + JetStream serving stack.