A tool for generating translation Q&A datasets from Wikipedia articles using semantic sentence alignment.
- Semantic Alignment: Uses sentence-transformers (LaBSE) to semantically align sentences between original and translated articles
- Bidirectional Pairs: Creates translation pairs in both directions (original→translated and translated→original)
- Multiple Output Formats:
  - dataset.json: Complete QnA dataset with single and batch sentence pairs
  - leftover_sentences.json: Sentences that didn't meet the similarity threshold
  - rag_aligned_pairs.csv: Aligned sentence pairs for RAG systems
- Multi-language Support: Supports English, German, French, Spanish, Italian, Portuguese, and Dutch
- Batch Processing: Generates both single-sentence and 10-sentence batch Q&A pairs
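To make the bidirectional and batch features concrete, here is a minimal sketch of how aligned sentence pairs could be turned into Q&A entries. The function name `build_qna_pairs`, the `"batch"` type label, and the running `index` are assumptions made for illustration; only the prompt wording, the `single_sentence` type, and the `direction` format come from the sample output documented in this README.

```python
from typing import Dict, List, Tuple


def build_qna_pairs(
    aligned: List[Tuple[str, str]],
    source_lang: str,
    target_lang: str,
    batch_size: int = 10,
) -> List[Dict[str, object]]:
    """Build single-sentence and batched Q&A pairs in both directions.

    Hypothetical helper: the real script's internals may differ.
    """
    pairs: List[Dict[str, object]] = []

    def add(question_text: str, answer_text: str, src: str, tgt: str, kind: str) -> None:
        pairs.append({
            "question": f"Translate this article from {src} to {tgt}: {question_text}",
            "answer": answer_text,
            "type": kind,
            "direction": f"{src}_to_{tgt}",
            "index": len(pairs),  # assumption: a simple running index
        })

    # One pair per aligned sentence, in each direction.
    for original, translated in aligned:
        add(original, translated, source_lang, target_lang, "single_sentence")
        add(translated, original, target_lang, source_lang, "single_sentence")

    # One pair per group of `batch_size` consecutive sentences, in each direction.
    for start in range(0, len(aligned), batch_size):
        chunk = aligned[start:start + batch_size]
        original_text = " ".join(o for o, _ in chunk)
        translated_text = " ".join(t for _, t in chunk)
        add(original_text, translated_text, source_lang, target_lang, "batch")
        add(translated_text, original_text, target_lang, source_lang, "batch")

    return pairs
```

With 12 aligned sentences and the default batch size, this sketch yields 24 single-sentence pairs plus 4 batch pairs (two groups, two directions each).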
sentence-transformers
scikit-learn
spacy
Install spaCy language models for your target languages:
python -m spacy download en_core_web_sm
python -m spacy download es_core_news_sm
# etc.
- Prepare your files:
  - Place your original article text in a file (e.g., article_en.txt)
  - Place your translated article text in a file (e.g., article_es.txt)
- Configure the script: Edit dataset_gen.py and update the configuration section in main():
    original_file = "article_en.txt"    # Original article
    translated_file = "article_es.txt"  # Translated article
    source_lang = "en"                  # Source language code
    target_lang = "es"                  # Target language code
    sim_threshold = 0.75                # Similarity threshold (0.0-1.0)
- Run the generator:
    python dataset_gen.py
dataset.json contains Q&A pairs with metadata:
{
"metadata": {
"source_language": "en",
"target_language": "es",
"total_qna_pairs": 490
},
"qna_pairs": [
{
"question": "Translate this article from en to es: A door is...",
"answer": "Una puerta es...",
"type": "single_sentence",
"direction": "en_to_es",
"index": 0
}
]
}
leftover_sentences.json lists sentences that didn't pass the similarity threshold, categorized as:
- Missing from translation (original content not translated)
- Extra in translation (added content not in original)
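As a rough illustration of consuming the generated dataset, the sketch below parses a tiny stand-in for dataset.json and counts pairs per translation direction. The helper name `pairs_by_direction` and the second sample entry are made up for this example; the field names follow the dataset.json structure documented in this README.

```python
import json
from collections import Counter

# Tiny stand-in for dataset.json; the second entry is illustrative only.
SAMPLE = """
{
  "metadata": {"source_language": "en", "target_language": "es", "total_qna_pairs": 2},
  "qna_pairs": [
    {"question": "Translate this article from en to es: A door is...",
     "answer": "Una puerta es...", "type": "single_sentence",
     "direction": "en_to_es", "index": 0},
    {"question": "Translate this article from es to en: Una puerta es...",
     "answer": "A door is...", "type": "single_sentence",
     "direction": "es_to_en", "index": 1}
  ]
}
"""


def pairs_by_direction(dataset: dict) -> Counter:
    """Count Q&A pairs for each value of the "direction" field."""
    return Counter(pair["direction"] for pair in dataset["qna_pairs"])


# For a real run, read the file instead:
#   with open("dataset.json", encoding="utf-8") as fh: dataset = json.load(fh)
dataset = json.loads(SAMPLE)
counts = pairs_by_direction(dataset)
print(counts["en_to_es"], counts["es_to_en"])
```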
rag_aligned_pairs.csv is a CSV with aligned sentence pairs for RAG applications:
original_text_en,translated_text_es
"A door is...","Una puerta es..."
Main function to generate a QnA dataset from two text files.
Parameters:
- original_file: Path to original text file
- translated_file: Path to translated text file
- source_lang: Source language code (e.g., "en")
- target_lang: Target language code (e.g., "es")
- sim_threshold: Semantic similarity threshold (default: 0.75)
semantic_compare(original_blob, translated_blob, source_language, target_language, sim_threshold=0.75, model_name=None)
Performs semantic comparison between articles using sentence embeddings.
- Sentence Segmentation: Uses spaCy models to split text into sentences
- Semantic Encoding: Encodes sentences using LaBSE multilingual embeddings
- Alignment: Compares sentence embeddings to find semantically aligned pairs
- Filtering: Filters out sentences below the similarity threshold
- Q&A Generation: Creates bidirectional translation pairs for both single and batched sentences
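The alignment and filtering steps above can be sketched in plain Python. This is an illustration under stated assumptions, not the script's actual code: the toy vectors stand in for LaBSE sentence embeddings, the greedy one-to-one matching strategy is assumed (this README does not say how matches are chosen), and the names `cosine` and `align` are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def align(
    original_vecs: List[Sequence[float]],
    translated_vecs: List[Sequence[float]],
    sim_threshold: float = 0.75,
) -> Tuple[List[Tuple[int, int, float]], List[int], List[int]]:
    """Greedily match each original sentence to its most similar unused translation.

    Returns (aligned, missing, extra): aligned holds (orig_idx, trans_idx, score);
    missing and extra hold the indices left over on each side.
    """
    aligned: List[Tuple[int, int, float]] = []
    used = set()
    for i, u in enumerate(original_vecs):
        best_j, best_sim = -1, -1.0
        for j, v in enumerate(translated_vecs):
            if j in used:
                continue
            sim = cosine(u, v)
            if sim > best_sim:
                best_j, best_sim = j, sim
        # Keep the match only if it clears the similarity threshold.
        if best_j >= 0 and best_sim >= sim_threshold:
            aligned.append((i, best_j, best_sim))
            used.add(best_j)
    matched = {i for i, _, _ in aligned}
    missing = [i for i in range(len(original_vecs)) if i not in matched]
    extra = [j for j in range(len(translated_vecs)) if j not in used]
    return aligned, missing, extra
```

In this sketch, `missing` corresponds to original sentences with no translation above the threshold, and `extra` to translated sentences with no original counterpart, mirroring the two leftover categories.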
- English (en)
- German (de)
- French (fr)
- Spanish (es)
- Italian (it)
- Portuguese (pt)
- Dutch (nl)
Additional languages can be added by installing corresponding spaCy models and updating the language map in refine_compare.py.
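As a hedged sketch of what such a language map might look like, the snippet below maps language codes to spaCy model package names. The names `SPACY_MODELS` and `model_for` are hypothetical, and the choice of the small (`_sm`) models for every language is an assumption; only en_core_web_sm and es_core_news_sm appear in the install instructions here.

```python
# Hypothetical language map; the real one in the project may be shaped differently.
SPACY_MODELS = {
    "en": "en_core_web_sm",
    "de": "de_core_news_sm",
    "fr": "fr_core_news_sm",
    "es": "es_core_news_sm",
    "it": "it_core_news_sm",
    "pt": "pt_core_news_sm",
    "nl": "nl_core_news_sm",
}


def model_for(lang_code: str) -> str:
    """Return the spaCy model name for a language code, or raise ValueError."""
    try:
        return SPACY_MODELS[lang_code]
    except KeyError:
        raise ValueError(f"Unsupported language code: {lang_code!r}") from None


# Adding a language: install the model first
# (python -m spacy download pl_core_news_sm), then register it.
SPACY_MODELS["pl"] = "pl_core_news_sm"
```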