## Environment
- TPU: v5litepod-32 (32 chips, 8 hosts)
- JAX: 0.9.0
- jaxlib: 0.9.0
- orbax-checkpoint: 0.11.32
- Python: 3.12
- OS: Ubuntu 22.04 (tpu-ubuntu2204-base)
## Problem
When converting a HuggingFace Llama 8B checkpoint to MaxText format using `llama_or_mistral_ckpt.py`, the checkpoint save fails partway through. The layer conversion completes successfully (32/32 layers), but the Orbax save operation fails during an atomic file rename, leaving an incomplete, unloadable checkpoint.
## Error Message
```
ValueError: NOT_FOUND: Error writing "params.params.decoder.decoder_norm.scale/c/0" in OCDBT database at local file "/home/user/maxtext-checkpoints/r1-distill-llama-8b/0.orbax-checkpoint-tmp/items.orbax-checkpoint-tmp/ocdbt.process_0/": Failed to rename fd: 14 "/home/user/maxtext-checkpoints/r1-distill-llama-8b/0.orbax-checkpoint-tmp/items.orbax-checkpoint-tmp/ocdbt.process_0/d/8f39efd1e47e6a0ada381d99719b652c.__lock" to: "/home/user/maxtext-checkpoints/r1-distill-llama-8b/0.orbax-checkpoint-tmp/items.orbax-checkpoint-tmp/ocdbt.process_0/d/8f39efd1e47e6a0ada381d99719b652c" [OS error 2: ENOENT No such file or directory] [source locations='tensorstore/internal/os/file_util_posix.cc:406\ntensorstore/kvstore/file/file_key_value_store.cc:662\ntensorstore/kvstore/transaction.cc:133']
```
## Full Stack Trace
```
  File "/home/user/venv312/lib/python3.12/site-packages/orbax/checkpoint/_src/futures/future.py", line 309, in _target_setting_result
    self._result = target()
  File "/home/user/venv312/lib/python3.12/site-packages/orbax/checkpoint/_src/futures/future.py", line 426, in <lambda>
    target=lambda: asyncio_utils.run_sync(coro),
  File "/home/user/venv312/lib/python3.12/site-packages/orbax/checkpoint/_src/asyncio_utils.py", line 36, in run_sync
    return asyncio.run(coro)
  File "/home/user/venv312/lib/python3.12/site-packages/orbax/checkpoint/_src/serialization/jax_array_handlers.py", line 373, in _serialize_without_dispatcher
    await async_serialize_replica_slices_batch(
  File "/home/user/venv312/lib/python3.12/site-packages/orbax/checkpoint/_src/serialization/jax_array_handlers.py", line 618, in _async_serialize_replica_slices
    await asyncio.gather(*write_coros)
  File "/home/user/venv312/lib/python3.12/site-packages/orbax/checkpoint/_src/serialization/serialization.py", line 265, in async_serialize_from_host
    await asyncio.gather(*write_coros)
  File "/home/user/venv312/lib/python3.12/site-packages/orbax/checkpoint/_src/serialization/serialization.py", line 253, in write_fragment
    await t[fragment.index].write(
ValueError: NOT_FOUND: Error writing "params.params.decoder.decoder_norm.scale/c/0" in OCDBT database...
```
## Steps to Reproduce
1. Create a v5litepod-32 TPU
2. Install MaxText from https://github.com/AI-Hypercomputer/maxtext (`pip install -e .` in the repo root)
3. Download a HuggingFace Llama model (e.g., deepseek-ai/DeepSeek-R1-Distill-Llama-8B)
4. Run the checkpoint conversion:
```shell
JAX_PLATFORMS=cpu python3 llama_or_mistral_ckpt.py \
  --base-model-path=/path/to/hf-model \
  --maxtext-model-path=/path/to/output \
  --model-size=llama3-8b \
  --huggingface-checkpoint=True
```
## Observed Behavior
- Layer conversion completes: `layers: 100%|██████████| 32/32`
- Sharding conversion completes
- The Orbax save starts but fails during the atomic rename
- The checkpoint directory contains ~11 GB of data but is incomplete (missing tree-structure metadata)
- Subsequent attempts to load the checkpoint fail with "No structure could be identified"
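The incomplete state can be detected from the directory layout alone. The sketch below is a hypothetical helper (not part of MaxText or Orbax) that flags step entries still carrying the `.orbax-checkpoint-tmp` suffix visible in the error paths above, i.e. saves that crashed before being finalized:

```python
import tempfile
from pathlib import Path

# Suffix Orbax uses for in-progress saves, as seen in the error paths above.
TMP_SUFFIX = ".orbax-checkpoint-tmp"

def unfinalized_steps(ckpt_root: Path) -> list[str]:
    """Return entries under the checkpoint root that still carry the
    temporary suffix, meaning the save died before the final rename."""
    return sorted(p.name for p in ckpt_root.iterdir()
                  if p.name.endswith(TMP_SUFFIX))

# Example: a root with one committed step ("1") and one crashed save.
root = Path(tempfile.mkdtemp())
(root / ("0" + TMP_SUFFIX)).mkdir()
(root / "1").mkdir()
print(unfinalized_steps(root))  # -> ['0.orbax-checkpoint-tmp']
```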
## What We Tried
- Saving to local disk instead of GCS - same issue
- Running with `JAX_PLATFORMS=cpu` - same issue
- Disabling OCDBT/Zarr3 via the legacy-format flags - same issue (the save appears to succeed but 0 bytes are written)
- Different JAX versions (0.5.0, 0.8.1, 0.9.0) - same issue
## Analysis
The error appears to be a race condition or file-descriptor issue in the tensorstore OCDBT layer:
- Orbax (via tensorstore) writes each chunk to a `.__lock` file
- It then tries to atomically rename the `.__lock` file to the final file
- The rename fails with ENOENT, despite the lock file existing moments before

This may be related to async coordination between multiple write operations.
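The failing pattern can be illustrated with a stdlib-only sketch (a hypothetical stand-in for tensorstore's writer, not its actual code): write data to a lock file, fsync it, then rename it into place. An ENOENT (OS error 2) from the rename means the lock file vanished between the write and the commit:

```python
import errno
import os
import tempfile

def atomic_write(final_path: str, data: bytes) -> None:
    """Write-then-rename commit, mimicking the pattern described above."""
    lock_path = final_path + ".__lock"
    with open(lock_path, "wb") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())
    os.rename(lock_path, final_path)  # atomic commit on POSIX filesystems

d = tempfile.mkdtemp()
atomic_write(os.path.join(d, "chunk"), b"payload")
assert os.path.exists(os.path.join(d, "chunk"))

# If anything removes the lock file first (e.g. a concurrent cleanup of the
# *.orbax-checkpoint-tmp directory), the rename fails with the same
# "OS error 2" (ENOENT) seen in the report:
try:
    os.rename(os.path.join(d, "gone.__lock"), os.path.join(d, "chunk2"))
except OSError as e:
    assert e.errno == errno.ENOENT
```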
## Expected Behavior
Checkpoint conversion should complete successfully and produce a loadable MaxText checkpoint.
## Impact
This completely blocks the ability to convert HuggingFace models to MaxText format on multi-host TPUs, which is a primary use case for the MaxText + JetStream serving stack.