grey-box/symmetry-q-n-a-datasets

Wikipedia Q&A Dataset Generator

A tool for generating translation Q&A datasets from Wikipedia articles using semantic sentence alignment.

Features

  • Semantic Alignment: Uses sentence-transformers (LaBSE) to align sentences between the original and translated articles by meaning
  • Bidirectional Pairs: Creates translation pairs in both directions (original→translated and translated→original)
  • Multiple Output Formats:
    • dataset.json: Complete Q&A dataset with single-sentence and batch pairs
    • leftover_sentences.json: Sentences that did not meet the similarity threshold
    • rag_aligned_pairs.csv: Aligned sentence pairs for RAG systems
  • Multi-language Support: Supports English, German, French, Spanish, Italian, Portuguese, and Dutch
  • Batch Processing: Generates both single-sentence and 10-sentence batch Q&A pairs

Requirements

sentence-transformers
scikit-learn
spacy

Install spaCy language models for your target languages:

python -m spacy download en_core_web_sm
python -m spacy download es_core_news_sm
# etc.

Usage

  1. Prepare your files:

    • Place your original article text in a file (e.g., article_en.txt)
    • Place your translated article text in a file (e.g., article_es.txt)
  2. Configure the script: Edit dataset_gen.py and update the configuration section in main():

    original_file = "article_en.txt"  # Original article
    translated_file = "article_es.txt"  # Translated article
    source_lang = "en"  # Source language code
    target_lang = "es"  # Target language code
    sim_threshold = 0.75  # Similarity threshold (0.0-1.0)
  3. Run the generator:

    python dataset_gen.py

Output Files

dataset.json

Contains Q&A pairs with metadata:

{
  "metadata": {
    "source_language": "en",
    "target_language": "es",
    "total_qna_pairs": 490
  },
  "qna_pairs": [
    {
      "question": "Translate this article from en to es: A door is...",
      "answer": "Una puerta es...",
      "type": "single_sentence",
      "direction": "en_to_es",
      "index": 0
    }
  ]
}
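
As a rough illustration of consuming this file, the sketch below parses a small inline sample with the same structure and selects the pairs for one direction. The reverse-direction record (its "es_to_en" value and question wording) and the helper name pairs_for_direction are assumptions made for the example, not taken from the generator.

```python
import json

# Inline sample mirroring the structure shown above. With a real run you would
# instead load the generated file, e.g. json.load(open("dataset.json", encoding="utf-8")).
# The second record (es_to_en) is an assumed shape for the reverse direction.
sample = """
{
  "metadata": {"source_language": "en", "target_language": "es", "total_qna_pairs": 2},
  "qna_pairs": [
    {"question": "Translate this article from en to es: A door is...",
     "answer": "Una puerta es...", "type": "single_sentence",
     "direction": "en_to_es", "index": 0},
    {"question": "Translate this article from es to en: Una puerta es...",
     "answer": "A door is...", "type": "single_sentence",
     "direction": "es_to_en", "index": 0}
  ]
}
"""


def pairs_for_direction(dataset, direction):
    """Return the Q&A pairs whose "direction" field equals `direction`."""
    return [p for p in dataset["qna_pairs"] if p["direction"] == direction]


dataset = json.loads(sample)
en_to_es = pairs_for_direction(dataset, "en_to_es")
for pair in en_to_es:
    print(pair["question"], "->", pair["answer"])
```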

leftover_sentences.json

Lists sentences that didn't pass the similarity threshold, categorized as:

  • Missing from translation (original content not translated)
  • Extra in translation (added content not in original)

rag_aligned_pairs.csv

CSV with aligned sentence pairs for RAG applications:

original_text_en,translated_text_es
"A door is...","Una puerta es..."
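
A minimal sketch of reading this CSV with Python's standard csv module, for example before indexing the pairs in a RAG pipeline. It takes the first two columns as the original and translated text, so it does not depend on the language suffixes in the header; the helper name read_aligned_pairs is hypothetical.

```python
import csv
import io

# Inline sample matching the format above. For a real file, pass
# open("rag_aligned_pairs.csv", newline="", encoding="utf-8") instead.
sample_csv = 'original_text_en,translated_text_es\n"A door is...","Una puerta es..."\n'


def read_aligned_pairs(handle):
    """Return (original, translated) tuples from an aligned-pairs CSV."""
    reader = csv.DictReader(handle)
    # Use the first two header columns, whatever language codes they carry.
    src_col, tgt_col = reader.fieldnames[:2]
    return [(row[src_col], row[tgt_col]) for row in reader]


pairs = read_aligned_pairs(io.StringIO(sample_csv))
print(pairs)
```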

API Reference

generate_dataset(original_file, translated_file, source_lang, target_lang, sim_threshold=0.75, ...)

Main entry point. Generates a Q&A dataset from two text files.

Parameters:

  • original_file: Path to original text file
  • translated_file: Path to translated text file
  • source_lang: Source language code (e.g., "en")
  • target_lang: Target language code (e.g., "es")
  • sim_threshold: Semantic similarity threshold, from 0.0 to 1.0 (default: 0.75)

semantic_compare(original_blob, translated_blob, source_language, target_language, sim_threshold=0.75, model_name=None)

Performs semantic comparison between articles using sentence embeddings.
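
To make the idea concrete, here is a simplified sketch of threshold-based alignment over sentence embeddings. It is not the repository's implementation: it uses tiny hand-written vectors in place of LaBSE embeddings, and the greedy best-match strategy and the function name align are assumptions chosen for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def align(orig_vecs, trans_vecs, sim_threshold=0.75):
    """Greedily pair each original sentence with its most similar unused translation.

    Returns (pairs, missing, extra): matched (orig_idx, trans_idx, similarity)
    triples, unmatched original indices, and unmatched translation indices.
    """
    pairs, used = [], set()
    for i, u in enumerate(orig_vecs):
        best_j, best_sim = None, -1.0
        for j, v in enumerate(trans_vecs):
            if j in used:
                continue
            sim = cosine(u, v)
            if sim > best_sim:
                best_j, best_sim = j, sim
        # Keep the match only if it clears the similarity threshold.
        if best_j is not None and best_sim >= sim_threshold:
            pairs.append((i, best_j, best_sim))
            used.add(best_j)
    matched = {i for i, _, _ in pairs}
    missing = [i for i in range(len(orig_vecs)) if i not in matched]
    extra = [j for j in range(len(trans_vecs)) if j not in used]
    return pairs, missing, extra


# Toy 2-D "embeddings"; real LaBSE vectors have many more dimensions.
orig_vecs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
trans_vecs = [[0.9, 0.1], [-1.0, 0.2]]
pairs, missing, extra = align(orig_vecs, trans_vecs)
```

In this toy run only the first original sentence finds a match above 0.75. The remaining indices correspond to the two leftover categories: original sentences missing from the translation, and translated sentences with no counterpart in the original.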

How It Works

  1. Sentence Segmentation: Uses spaCy models to split text into sentences
  2. Semantic Encoding: Encodes sentences using LaBSE multilingual embeddings
  3. Alignment: Compares sentence embeddings to find semantically aligned pairs
  4. Filtering: Sets aside sentences that fall below the similarity threshold; these are written to leftover_sentences.json
  5. Q&A Generation: Creates bidirectional translation pairs for both single and batched sentences
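
The final step above might be sketched as follows, assuming the aligned pairs are already available as (original, translated) tuples. The record fields follow the dataset.json example, but the "batch" type label, the decision to emit a final partial batch, and the function names are assumptions rather than the script's confirmed behavior.

```python
def make_record(src_text, tgt_text, src, tgt, kind, index):
    """Build one Q&A record shaped like the dataset.json example."""
    return {
        "question": f"Translate this article from {src} to {tgt}: {src_text}",
        "answer": tgt_text,
        "type": kind,
        "direction": f"{src}_to_{tgt}",
        "index": index,
    }


def build_qna(pairs, source_lang, target_lang, batch_size=10):
    """Create single-sentence and batched records in both directions."""
    records = []
    for src, tgt, flip in ((source_lang, target_lang, False),
                           (target_lang, source_lang, True)):
        # For the reverse direction, swap each (original, translated) tuple.
        ordered = [(b, a) if flip else (a, b) for a, b in pairs]
        for i, (q, a) in enumerate(ordered):
            records.append(make_record(q, a, src, tgt, "single_sentence", i))
        # Assumed behavior: a trailing partial batch is still emitted.
        for start in range(0, len(ordered), batch_size):
            chunk = ordered[start:start + batch_size]
            question = " ".join(q for q, _ in chunk)
            answer = " ".join(a for _, a in chunk)
            records.append(
                make_record(question, answer, src, tgt, "batch", start // batch_size)
            )
    return records


# Placeholder sentences standing in for 12 aligned pairs.
aligned = [(f"s{i}", f"t{i}") for i in range(12)]
records = build_qna(aligned, "en", "es")
```

With 12 aligned pairs and a batch size of 10, each direction yields 12 single-sentence records and 2 batch records.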

Supported Languages

  • English (en)
  • German (de)
  • French (fr)
  • Spanish (es)
  • Italian (it)
  • Portuguese (pt)
  • Dutch (nl)

Additional languages can be added by installing corresponding spaCy models and updating the language map in refine_compare.py.
