translation_pipeline

Datatrove-based translation pipeline for benchmark datasets (LongBench and RULER), with vLLM inference, retry handling, and optional Hugging Face upload. It is also designed for other long-context datasets where chunk-first translation helps prevent models from answering tasks instead of translating long reasoning traces or long-form text.

Requirements

Python >=3.12
uv
Reachable vLLM OpenAI-compatible endpoint (default: http://localhost:8000/v1)

Optional workspace setting:

export UV_PYTHON_INSTALL_DIR=/mnt/vast/workspaces/jackal_ai/.uv/python

Install dependencies:

uv sync

How It Works

Input loading:
- LongBench: downloads/extracts data files and processes subject-wise JSONL.
- RULER: loads configured split(s) from the HF dataset and writes JSONL inputs.
Datatrove pipeline setup:
- Reads rows as Documents, preserving row metadata.
- Processes in chunks (records-per-chunk) with configurable concurrency.
- Chunking is intentional for long contexts: many translation models tend to switch into answer-generation mode instead of faithful translation when prompts become very long.
Field translation:
- LongBench fields: input, context, answers
- RULER fields: input, outputs
- Uses specialized profiles for RULER QA/NIAH layouts to reduce malformed generations.
- The same chunk-first strategy is useful beyond LongBench/RULER for any dataset with long reasoning traces or other long-form text that must be translated without changing task semantics.
Failure handling:
- Retries translation attempts per row.
- Marks unrecoverable rows with _translation_failed=True.
- Persists failed attempt artifacts to per-config logs.
Output and publishing:
- Writes JSONL outputs per benchmark config.
- Optionally uploads to Hugging Face and/or saves dataset artifacts to disk.

Usage

Main entrypoint:

python translate_datatrove.py --help

LongBench example

python translate_datatrove.py \
  --benchmark longbench \
  --language de \
  --vllm-url http://localhost:8000/v1 \
  --output-dir ./datatrove_outputs \
  --checkpoints-dir ./datatrove_checkpoints \
  --dataset-filter hotpotqa,musique

RULER example

python translate_datatrove.py \
  --benchmark ruler \
  --language de \
  --vllm-url http://localhost:8000/v1 \
  --ruler-splits qa_1,qa_2 \
  --output-dir ./datatrove_outputs \
  --checkpoints-dir ./datatrove_checkpoints

Retry previously failed rows

python translate_datatrove.py \
  --benchmark longbench \
  --language de \
  --retry-failed-from ./datatrove_outputs/longbench/hotpotqa_de \
  --output-dir ./datatrove_outputs \
  --checkpoints-dir ./datatrove_checkpoints

Hugging Face upload

Set HF_TOKEN or pass --hf-token. Upload is enabled by default unless --no-upload is set.

Validation

Use validate_datatrove_output.py to check missing/empty fields, failure flags, and length ratios against English:

python validate_datatrove_output.py \
  --output-dir ./datatrove_outputs/longbench/hotpotqa_de \
  --language de \
  --fields input,context,answers \
  --no-suffix \
  --compare-dir ./datatrove_outputs/longbench/hotpotqa_en

Repository Layout

translate_datatrove.py: compatibility wrapper and CLI entrypoint.
translation_pipeline/cli.py: argument parsing and benchmark orchestration.
translation_pipeline/general_logic.py: shared translation and validation logic.
translation_pipeline/ruler_logic.py: RULER-specific parsing/profiles.
translation_pipeline/longbench_logic.py: LongBench-specific dataset helpers.
translation_pipeline/pipeline_ops.py: Datatrove runtime, IO, retry merge, and upload utilities.
validate_datatrove_output.py: post-run output validator.

Name		Name	Last commit message	Last commit date
Latest commit History 6 Commits
translation_pipeline		translation_pipeline
.gitignore		.gitignore
README.md		README.md
merge_translation_outputs.py		merge_translation_outputs.py
pyproject.toml		pyproject.toml
translate_datatrove.py		translate_datatrove.py
uv.lock		uv.lock
validate_datatrove_output.py		validate_datatrove_output.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

translation_pipeline

Requirements

How It Works

Usage

LongBench example

RULER example

Retry previously failed rows

Hugging Face upload

Validation

Repository Layout

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

translation_pipeline

Requirements

How It Works

Usage

LongBench example

RULER example

Retry previously failed rows

Hugging Face upload

Validation

Repository Layout

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages