    Repositories list

    • Step-by-step schematic description of data processing in HPLT
      Python
      MIT License
      1120 Updated Apr 19, 2026
    • HPLT-WP4
      Public
      Information and pipelines on WP4: language models training
      Jupyter Notebook
      Creative Commons Zero v1.0 Universal
      3350 Updated Mar 30, 2026
    • Scripts for running bitextor jobs
      Shell
      1010 Updated Mar 15, 2026
    • OpenLID-v3
      HTML
      0100 Updated Mar 9, 2026
    • Set of scripts to run a monotextor-like pipeline under Slurm on HPC clusters
      Rust
      GNU General Public License v3.0
      1500 Updated Feb 26, 2026
    • HPLT Analytics
      JavaScript
      GNU General Public License v3.0
      41501 Updated Feb 23, 2026
    • openlid
      Public
      OpenLID-v3
      Python
      GNU General Public License v3.0
      5100 Updated Feb 20, 2026
    • Monolingual or Multilingual Instruction Tuning: Which Makes a Better Alpaca
      Python
      11000 Updated Feb 16, 2026
    • hplt-e
      Public
      Multilingual evaluation framework
      Python
      MIT License
      0300 Updated Feb 11, 2026
    • Python port of Moses tokenizer, truecaser and normalizer
      Python
      MIT License
      59494284 Updated Feb 6, 2026
    • clianer
      Public
      A lightweight command-line frontend to OpusCleaner
      Python
      MIT License
      1000 Updated Feb 5, 2026
    • OpusCleaner is a web interface that helps you select, clean and schedule your data for training machine translation models.
      Python
      1658561 Updated Feb 3, 2026
    • Scripts for parallelized extraction of plain text from WARC archives, aiming at a common and reproducible extraction approach.
      Jupyter Notebook
      MIT License
      1400 Updated Jan 31, 2026
    • Ongoing research training transformer models at scale
      Python
      Other
      3.9k000 Updated Jan 28, 2026
    • OpusPocus
      Public
      Marian machine translation training pipeline for thousands of models
      Python
      MIT License
      04160 Updated Jan 27, 2026
    • Jupyter Notebook
      MIT License
      7110 Updated Jan 26, 2026
    • Jupyter Notebook
      MIT License
      9000 Updated Jan 26, 2026
    • tf/idf-based document aligner from Bitextor
      C++
      Apache License 2.0
      0001 Updated Jan 22, 2026
    • Shell
      0130 Updated Jan 22, 2026
    • Internet archive downloader
      Jupyter Notebook
      MIT License
      0220 Updated Jan 22, 2026
    • Find your documents in the HPLT datasets
      Python
      MIT License
      1020 Updated Jan 22, 2026
    • Fast and secure translation on your local machine, powered by Marian and Bergamot.
      C++
      MIT License
      40000 Updated Jan 21, 2026
    • web interface for exploring OPUS data
      PHP
      1000 Updated Jan 21, 2026
    • OPUS-API
      Public
      API for searching corpora from OPUS
      Python
      2000 Updated Jan 21, 2026
    • Makefile
      3000 Updated Jan 21, 2026
    • OpusTools
      Public
      Python
      22000 Updated Jan 21, 2026
    • OPUS
      Public
      The Open Parallel Corpus
      Makefile
      14000 Updated Jan 21, 2026
    • PHP
      MIT License
      1000 Updated Jan 21, 2026
    • OpusFilter - Parallel corpus processing toolkit
      Python
      MIT License
      26000 Updated Jan 21, 2026
    • Training pipelines for Firefox Translations neural machine translation models (adapted for OPUS-MT and integrating GreenNLP metrics)
      Python
      Mozilla Public License 2.0
      46000 Updated Jan 8, 2026