Public dataset and resources for Hebrew coreference resolution. This repository provides gold-standard CoNLL-U data and accompanying inputs used by neural and LLM-based systems.
- Total Documents: 351 (301 train, 26 dev, 24 test)
- Total Sentences: 6,151
- Total Tokens: 159,975
- Total Mentions (no singleton): 19,483
- Total Mentions (with singleton): 45,689
- Singleton Mentions: 26,206 (57.4%)
- Train: 16,907 mentions (no singleton)
- Dev: 1,181 mentions (no singleton)
- Test: 1,395 mentions (no singleton)
- Coreference Agreement: CoNLL score 0.811, mention score 0.850
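The figures above can be cross-checked with a few lines of arithmetic. The sketch below copies the numbers from the list and assumes, based on the fact that they sum to the no-singleton total, that the per-split mention counts exclude singletons.

```python
# Counts copied from the statistics list above.
documents = {"train": 301, "dev": 26, "test": 24}
# Assumption: per-split mention counts exclude singletons.
mentions_no_singleton = {"train": 16_907, "dev": 1_181, "test": 1_395}
total_with_singleton = 45_689

total_documents = sum(documents.values())
total_no_singleton = sum(mentions_no_singleton.values())
singletons = total_with_singleton - total_no_singleton
singleton_share = singletons / total_with_singleton

print(total_documents)            # 351
print(total_no_singleton)         # 19483
print(singletons)                 # 26206
print(f"{singleton_share:.1%}")   # 57.4%
```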
data/
├── conllu/                 # Gold CoNLL-U splits
│   ├── no_singleton/       # train/dev/test without singletons
│   └── with_singleton/     # train/dev/test with singletons
├── neural_input/           # Inputs for neural models
│   ├── wl/                 # WL format (train/dev/test + head variants)
│   └── lingmess/           # LingMess format (train/dev/test; SOTA tokenized)
└── llm_input/              # Inputs and outputs used with LLM pipelines
    ├── mentions_by_llm_from_raw/
    ├── mentions_by_model_danit_parse/
    ├── mentions_by_model_gold_parse/
    ├── raw_documents/
    ├── tokenized_documents/
    └── tokenized_documents_danit_tokenization/
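As a rough illustration of how the `data/conllu/` layout might be navigated, the sketch below collects the files for each split under one variant. The helper `find_split_files` is hypothetical, and it assumes that file names begin with the split name (`train`, `dev`, `test`); the actual file names in the repository may differ.

```python
from pathlib import Path

SPLITS = ("train", "dev", "test")


def find_split_files(root, variant):
    """Map each split to the files under data/conllu/<variant>/ whose name starts with it.

    Hypothetical helper: the naming convention is an assumption, not taken from the repository.
    """
    folder = Path(root) / "data" / "conllu" / variant
    found = {}
    for split in SPLITS:
        # Keep regular files only, sorted for a stable order.
        found[split] = sorted(p for p in folder.glob(f"{split}*") if p.is_file())
    return found
```

For example, `find_split_files(".", "no_singleton")` run from the repository root would return a dictionary keyed by split name, with an empty list for any split whose files follow a different naming scheme.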
hebrew_coreference_data/
├── data/                     # Main data directory
│   ├── conllu/               # Gold standard CoNLL-U annotations
│   ├── neural_input/         # Neural model inputs
│   └── llm_input/            # LLM pipeline data
├── guidelines/               # Annotation guidelines
├── agreement_calculation/    # Agreement calculation tools
└── annotation/               # Original annotation data
    ├── final_coref/          # Final consolidated annotations
    ├── coref_pairwise/       # Pairwise agreement annotations
    └── mention_annotation/   # Individual annotator data
Notes
- The content of `data/conllu/` is sourced from the project's gold data ("conllu_gold") and is organized into `with_singleton` and `no_singleton` variants per split.
- `neural_input/wl` and `neural_input/lingmess` mirror the common Hebrew splits used in prior work.
- `llm_input` contains raw and tokenized documents as well as model- and LLM-derived mentions.
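Since the gold data is in CoNLL-U, sentence and token counts can be recomputed with a small reader. The sketch below relies only on the general CoNLL-U conventions (blank line between sentences, `#` comment lines, tab-separated columns, `1-2` style multiword ranges, `1.1` style empty nodes). It does not interpret the coreference annotation itself, and whether the repository's token total counts syntactic words (as here) or surface tokens is an assumption.

```python
def count_sentences_and_tokens(conllu_text):
    """Count sentences and syntactic words in a CoNLL-U string."""
    sentences = 0
    tokens = 0
    in_sentence = False
    for line in conllu_text.splitlines():
        line = line.strip()
        if not line:
            in_sentence = False  # a blank line ends the current sentence
            continue
        if line.startswith("#"):
            continue  # sentence-level comment or metadata
        token_id = line.split("\t", 1)[0]
        if "-" in token_id or "." in token_id:
            continue  # skip multiword ranges (1-2) and empty nodes (1.1)
        if not in_sentence:
            sentences += 1
            in_sentence = True
        tokens += 1
    return sentences, tokens


def row(token_id, form):
    """Build a 10-column CoNLL-U line with placeholder values (illustrative only)."""
    return "\t".join([token_id, form] + ["_"] * 8)


sample = "\n".join([
    "# sent_id = 1",
    row("1-2", "ab"),
    row("1", "a"),
    row("2", "b"),
    "",
    "# sent_id = 2",
    row("1", "c"),
    "",
])

print(count_sentences_and_tokens(sample))  # (2, 3)
```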
The English guidelines for the annotation scheme are available at `guidelines/Hebrew_Coreference_Guidelines_English.pdf`.
If you use this repository, please cite it and acknowledge the Hebrew coreference annotation effort.
Please open an issue for questions or clarifications.