SilkRoadNLP 2026 — Proceedings

First Workshop on NLP and LLMs for the Iranian Language Family
Co-located with EACL 2026 · Rabat, Morocco · March 2026

This repository hosts the accepted papers of the SilkRoadNLP 2026 workshop, dedicated to advancing NLP research across the Iranian language family — including Persian, Kurdish, Pashto, Balochi, Luri, Ossetian, Tajik, Shughni, and related languages.

For more information, visit silkroadnlp.org

Accepted Papers

#	PDF	Paper	Authors	Resources	Poster
1	📄	Unmasking the Factual-Conceptual Gap in Persian Language Models	Alireza Sakhaeirad, Ali Ma'manpoosh, Arshia Hemmat		—
2	📄	Benchmarking Offensive Language Detection in Persian and Pashto	Zahra Bokaei, Bonnie Webber, Walid Magdy	—	—
3	📄	Do Large Language Models Understand Double Mismatches? Evidence from Farsi	Maryam Mohammadi	—	—
4	📄	TajPersLexon: A Tajik–Persian Lexical Resource and Hybrid Model for Cross-Script Low-Resource NLP	Mullosharaf Kurbonovich Arabov	—	—
5	📄	A Computational Approach to Language Contact — A Case Study of Persian	Ali Basirat, Danial Namazifard, Navid Baradaran Hemmati	—	—
6	📄	Online Polarization Detection in Persian (Farsi) Social Media	Saeedeh Davoudi, Nazli Goharian		—
7	📄	ParsCORE: The Persian Corpus of Online Registers	Alireza Razzaghi, Erik Henriksson, Veronika Laippala	`Dataset & Code on GitHub (link TBD)`	—
8	📄	PMWP: A Benchmark for Math Word Problem Solving in Persian	Marzieh Abdolmaleki, Mehrnoush Shamsfard, Veronique Hoste, Els Lefever		—
9	📄	APARSIN: A Multi-Variety Sentiment and Translation Benchmark for Iranic Languages	Sadegh Jafari, Tara Azin, Farhad Roodi, Zahra Dehghani Tafti, Mehrdad Ghadrdan, Elham Vatankhahan Esfahani, Aylin Naebzadeh, Mohammadhadi Shahhosseini, Ghafoor Khan, Kazem Forghani, Danial Namazi, Seyed Mohammad Hossein Hashemi, Farhan Farsi, Mohammad Osoolian, Maede Mohammadi, Mohammad Erfan Zare, Muhammad Hasnain Khan, Muhammad Hussain, Nooreen Zaki, Joma Mohammadi, Shayan Bali, Mohammad Javad Ranjbar, Els Lefever, Veronique Hoste		—
10	📄	One Language, Three of Its Voices: Evaluating Multilingual LLMs Across Persian, Dari, and Tajiki on Translation and Understanding Tasks	Noor Mairukh Khan Arnob, Abu Bakar Siddique Mahi	—	—
11	📄	PersianPunc: A Large-Scale Dataset and BERT-Based Approach for Persian Punctuation Restoration	Mohammad Javad Ranjbar Kalahroodi, Heshaam Faili, Azadeh Shakery	`Dataset & Model publicly available (link TBD)`	—
12	📄	Shughni Machine Translation Enhanced by Donor Languages	Dmitry Novokshanov, Innokentiy S. Humonen, Ilya Makarov		—
13	📄	Segmentation Strategy Matters: Benchmarking Whisper on Persian YouTube Content	Reihaneh Iranmanesh, Rojin Ziaei, Joe Garman		—
14	📄	Multi-modal Neural Machine Translation for Low-Resource Classical Persian Poetry: A Culture-Aware Evaluation	Soheila Ansari, Mounir Boukadoum, Fatiha Sadat		—

Paper Abstracts

Paper 2 — Unmasking the Factual-Conceptual Gap in Persian Language Models

We introduce DivanBench, a manually curated benchmark of 315 questions across three task types — factual retrieval, paired scenario verification, and situational reasoning — designed to probe cultural and conceptual knowledge in Persian LLMs. Our evaluation reveals a consistent "acquiescence trap" where models default to agreement, and highlights gaps between factual recall and deeper conceptual reasoning.

📦 Dataset: huggingface.co/datasets/divanbench/divanbench

Paper 3 — Benchmarking Offensive Language Detection in Persian and Pashto

This paper provides a systematic benchmark of offensive language detection across Persian and Pashto, evaluating a range of transformer-based models including multilingual and language-specific variants. It examines cross-lingual transfer, the impact of script similarity, and limitations of existing datasets.

Paper 4 — Do Large Language Models Understand Double Mismatches? Evidence from Farsi

This paper investigates whether LLMs can handle double mismatch constructions in Farsi — syntactic configurations where two grammatical features simultaneously deviate from their canonical agreement patterns. Results reveal systematic failures in current LLMs, pointing to gaps in morphosyntactic reasoning for morphologically rich languages.

Paper 5 — TajPersLexon: A Tajik–Persian Lexical Resource and Hybrid Model for Cross-Script Low-Resource NLP

We present TajPersLexon, a bilingual lexical resource of 40,112 Tajik–Persian word/phrase pairs bridging Cyrillic and Perso-Arabic scripts. A hybrid transliteration and alignment model is introduced to support cross-script NLP tasks for these closely related but orthographically distant language varieties.

Paper 6 — A Computational Approach to Language Contact — A Case Study of Persian

This paper proposes a computational framework for detecting and quantifying language contact effects in Persian, modeling lexical borrowing and phonological adaptation through diachronic corpus analysis. The methodology is validated against historical linguistic accounts of Arabic, French, and English influence on Persian.

Paper 7 — Online Polarization Detection in Persian (Farsi) Social Media

We investigate political polarization on Persian-language social media using NLP techniques. Fine-tuned transformer models are evaluated on the POLAR dataset, and the paper analyzes the impact of pre-training language specificity on polarization classification performance.

💻 Code: github.com/dsaeedeh/Polarization_Detection

Paper 8 — ParsCORE: The Persian Corpus of Online Registers

ParsCORE v0.1 is a corpus of 2,000 human-annotated Persian web documents spanning diverse online registers, developed within the Universal Register framework. Initial experiments on automatic register identification show performance comparable to high-resource languages, establishing a foundation for Persian web-language research.

💻 Dataset & Code: GitHub

Paper 10 — PMWP: A Benchmark for Math Word Problem Solving in Persian

We introduce PMWP, the first dataset of 15,000 elementary-level Persian math word problems for training and evaluating mathematical reasoning in LLMs. Systematic evaluation shows Gemini-2.5-Flash achieves the highest accuracy (72.02%), while LoRA fine-tuning of open-weight models (LLaMA-3-8B, Qwen-2.5-7B) reaches over 91% exact equation match.

💻 Dataset: github.com/marzieh-abdolmaleki/PMWP

Paper 11 — APARSIN: A Multi-Variety Sentiment and Translation Benchmark for Iranic Languages

APARSIN is a large-scale benchmark covering 14 Iranic languages and dialects for sentiment analysis and machine translation. It provides standardized evaluation protocols and baselines using state-of-the-art LLMs, addressing the critical lack of multi-variety resources for the Iranian language family.

💻 Benchmark: github.com/SilkRoadAparsin

Paper 12 — One Language, Three of Its Voices: Evaluating Multilingual LLMs Across Persian, Dari, and Tajiki on Translation and Understanding Tasks

This paper evaluates multilingual LLMs on translation and understanding tasks across the three major varieties of Persian (Persian/Farsi, Dari, Tajiki), using a dataset of over 240,000 processed samples. Results highlight variety-specific performance gaps and the challenges of treating these as a single language.

Paper 13 — PersianPunc: A Large-Scale Dataset and BERT-Based Approach for Persian Punctuation Restoration

PersianPunc is a dataset of 17 million samples for Persian punctuation restoration, constructed by aggregating and filtering diverse textual resources. A fine-tuned ParsBERT model achieves 91.33% macro-F1, outperforming large generative models in efficiency and accuracy. The full dataset and fine-tuned model are publicly released.
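
Macro-F1, the metric quoted above, is the unweighted mean of per-class F1 scores, so rare punctuation marks count as much as frequent ones. The sketch below illustrates that definition in plain Python. The function name `macro_f1` and the label names (`O`, `COMMA`, `PERIOD`) are hypothetical and chosen for illustration; they are not taken from the paper, whose label set and evaluation script may differ.

```python
from typing import Sequence


def macro_f1(gold: Sequence[str], predicted: Sequence[str]) -> float:
    """Unweighted mean of per-label F1 over every label seen in gold or predicted."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    labels = sorted(set(gold) | set(predicted))
    if not labels:
        raise ValueError("at least one label is required")

    scores = []
    for label in labels:
        pairs = list(zip(gold, predicted))
        tp = sum(g == label and p == label for g, p in pairs)
        fp = sum(g != label and p == label for g, p in pairs)
        fn = sum(g == label and p != label for g, p in pairs)
        # F1 = 2*TP / (2*TP + FP + FN); the denominator is never zero here
        # because each label occurs at least once in gold or predicted.
        scores.append(2 * tp / (2 * tp + fp + fn))
    return sum(scores) / len(scores)


# Hypothetical per-token labels: the tag names are assumptions for illustration.
gold_tags = ["O", "COMMA", "O", "PERIOD"]
pred_tags = ["O", "O", "O", "PERIOD"]
print(round(macro_f1(gold_tags, pred_tags), 3))
```

In this toy case the missed comma gives the `COMMA` class an F1 of zero, which pulls the macro average down even though three of the four tokens are tagged correctly.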

Paper 15 — Shughni Machine Translation Enhanced by Donor Languages

This paper presents a machine translation system for Shughni, an endangered Iranian language of the Pamirs with fewer than 100,000 speakers and limited digital resources. By leveraging Russian and English as pivot/donor languages within an NLLB-200 framework, the system achieves significant improvements over baseline MT for this extremely low-resource language.

🤗 Demo: huggingface.co/spaces/Novokshanov/Shughni-Translator

Paper 16 — Segmentation Strategy Matters: Benchmarking Whisper on Persian YouTube Content

We benchmark OpenAI's Whisper on 10 hours of Persian YouTube audio with gold-standard transcripts, systematically evaluating the impact of audio segmentation strategies on ASR performance. Results show that segmentation choices have a substantial effect on WER, with practical implications for Persian speech processing pipelines.

💻 Dataset & Code: github.com/ri164-bolleit/persian-youtube-whisper-benchmark
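
Word error rate (WER), the metric reported above, is conventionally the word-level edit distance between the reference transcript and the ASR output, divided by the number of reference words. The sketch below illustrates that definition in plain Python. The function name `word_error_rate` and the whitespace tokenization are assumptions made for illustration; they are not taken from the paper, which may normalize Persian text differently before scoring.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the reference words processed so far
    # (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # matching i reference words to an empty hypothesis costs i deletions
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# One substitution ("b" -> "x") and one deletion ("d") over four reference words.
print(word_error_rate("a b c d", "a x c"))
```

Because the denominator is the reference length, the same recording can yield different WER values when it is cut into segments differently and the hypotheses change at segment boundaries, which is the effect the paper measures.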

Paper 18 — Multi-modal Neural Machine Translation for Low-Resource Classical Persian Poetry: A Culture-Aware Evaluation

We introduce the first multi-modal NMT system for translating classical Persian poetry (Masnavi-ye-Ma'navi), combining text with audio recitations. A new parallel Persian–English corpus of 26,571 aligned verse pairs with recitations is released, alongside a culture-specific evaluation framework for idiomatic and poetic translation quality.

💻 Corpus: github.com/amnghd/Persian_poems_corpus

Workshop Information


Full Name	First Workshop on NLP and LLMs for the Iranian Language Family (SilkRoadNLP)
Venue	Co-located with EACL 2026, Rabat, Morocco
Workshop Date	March 28–29, 2026
Website	silkroadnlp.org
Languages Covered	Persian · Dari · Tajiki · Kurdish · Pashto · Balochi · Luri · Ossetian · Shughni · and more

Important Dates (2025–2026)

Milestone	Date
Call for Papers	October 20, 2025
Direct Submission Deadline	January 8, 2026
Notification of Acceptance	January 26, 2026
Camera-ready Papers Due	February 3, 2026
Workshop Date	March 28–29, 2026

Citation

If you use these proceedings in your research, please cite the workshop:

@proceedings{silkroadnlp2026,
  title     = {Proceedings of the First Workshop on NLP and LLMs for the Iranian Language Family (SilkRoadNLP)},
  year      = {2026},
  address   = {Rabat, Morocco},
  publisher = {Association for Computational Linguistics},
  url       = {https://silkroadnlp.org}
}

SilkRoadNLP 2026 — Bridging languages along the Silk Road
