This package implements the Probabilistic Structured Queries (PSQ) algorithm for cross-language information retrieval. It uses alignment tables from statistical machine translation to translate each document's bag of tokens into the query language.
Raw translation tables are available on Hugging Face Models at hltcoe/psq_translation_tables.
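To make the idea concrete, here is a minimal sketch of how a bag of document tokens could be mapped into expected query-language term counts with an alignment table. It is not the code used in fast_psq: the function name `translate_bag_of_tokens`, the toy table, and its probabilities are all made up for illustration, and the sketch assumes that the source side of the table is the document language.

```python
from collections import Counter, defaultdict


def translate_bag_of_tokens(doc_tokens, table):
    """Map document-language term counts to expected query-language counts.

    `table` maps each source token to a dict of target tokens and
    their alignment probabilities (hypothetical toy structure).
    """
    source_counts = Counter(doc_tokens)
    translated = defaultdict(float)
    for src, count in source_counts.items():
        # Tokens missing from the table contribute nothing in this sketch.
        for tgt, prob in table.get(src, {}).items():
            translated[tgt] += count * prob
    return dict(translated)


# Toy table with made-up probabilities (illustrative only).
toy_table = {
    "猫": {"cat": 0.9, "kitty": 0.1},
    "狗": {"dog": 1.0},
}

expected_counts = translate_bag_of_tokens(["猫", "狗", "猫"], toy_table)
print(expected_counts)
```

Each source token spreads its count over its translation alternatives in proportion to the alignment probabilities, so the translated document can be matched directly against query-language terms.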
fast_psq is available on PyPI.
pip install fast_psq
Alternatively, you can install directly from the GitHub main branch with the following command.
pip install git+https://github.com/hltcoe/PSQ
fast_psq works well with ir_datasets and ir_measures for accessing IR evaluation collections
and evaluating results. You can install both packages with the following command.
pip install ir_datasets ir_measures
The indexing script takes a translation table (i.e., alignment matrix) and a document jsonl file.
We release a number of translation tables on Hugging Face Models, which the script downloads automatically
when you pass the location to the --psq_file flag in the format {repo_id}:{file_path}.
Alternatively, you can pass in a local .json.gz file that contains a dictionary of dictionaries, mapping
source tokens (string) to target tokens (string) to alignment probabilities.
Note that the default tokenizer in the script is mosestokenizer, which may not match the one used to build your own
alignment matrix. You should either use mosestokenizer when aligning the bitext or replace the tokenizer with yours.
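As an illustration of the local table format described above, the following sketch writes a toy dictionary of dictionaries to a gzip-compressed JSON file and reads it back. The file name, tokens, and probabilities are made up, and the script may expect details that are not covered here, so treat this as a starting point rather than a specification.

```python
import gzip
import json
import os
import tempfile

# Toy table: source token -> {target token -> alignment probability}.
# All entries are made up for illustration.
toy_table = {
    "maison": {"house": 0.8, "home": 0.2},
    "chat": {"cat": 1.0},
}

# Hypothetical file name inside a temporary directory.
table_path = os.path.join(tempfile.mkdtemp(), "toy_table.json.gz")

# Write the nested dictionary as gzip-compressed JSON text.
with gzip.open(table_path, "wt", encoding="utf-8") as f:
    json.dump(toy_table, f, ensure_ascii=False)

# Read it back to confirm the round trip preserves the structure.
with gzip.open(table_path, "rt", encoding="utf-8") as f:
    loaded_table = json.load(f)

print(loaded_table["maison"])
```

A file produced this way could then be passed to --psq_file as a local path, provided its source tokens were produced with the same tokenizer the script uses.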
The document file should be a jsonl file with one document per line.
You can specify the fields that hold the document ID, title, and body text by passing their names
in the file through --docid, --title, and --body, respectively.
Alternatively, you can use a dataset from ir_datasets by prefixing its ID with irds:, as the --doc_file argument does in the example below.
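For a local collection, a document file of the shape described above can be produced with a few lines of Python. This is only a sketch: the field names doc_id, title, and text are chosen to match the values passed to --docid, --title, and --body in the example command, and the documents and file name are made up.

```python
import json
import os
import tempfile

# Made-up documents; field names mirror the --docid/--title/--body values
# used in the example indexing command.
docs = [
    {"doc_id": "d1", "title": "示例标题", "text": "这是第一篇文档。"},
    {"doc_id": "d2", "title": "另一个标题", "text": "这是第二篇文档。"},
]

# Hypothetical file name inside a temporary directory.
doc_path = os.path.join(tempfile.mkdtemp(), "docs.jsonl")

# One JSON object per line; ensure_ascii=False keeps the text readable.
with open(doc_path, "w", encoding="utf-8") as f:
    for doc in docs:
        f.write(json.dumps(doc, ensure_ascii=False) + "\n")

# Read the file back line by line to check that each row parses on its own.
with open(doc_path, encoding="utf-8") as f:
    rows = [json.loads(line) for line in f]

print(len(rows), rows[0]["doc_id"])
```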
The following is an example indexing command.
python -m fast_psq.index \
--doc_file irds:neuclir/1/zh/trec-2022 \
--lang zh \
--psq_file hltcoe/psq_translation_tables:zh.table.dict.gz \
--min_translation_prob 0.00010 \
--max_translation_alternatives 64 \
--max_translation_cdf 0.99 \
--docid doc_id \
--title title \
--body text \
--output_dir ./indexes/neuclir-zh.f32/ \
--compression \
--nworkers 64
Please use python -m fast_psq.index --help for more information about the arguments.
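The three thresholds in the example command suggest how the translation alternatives of each source token are trimmed before indexing. The sketch below shows one plausible reading, inferred from the flag names alone: alternatives are visited from most to least probable, and the loop stops at the first one below the minimum probability, once the maximum number of alternatives is kept, or once the kept probability mass reaches the CDF cutoff. The function `prune_translations`, the order in which the checks are applied, and the absence of renormalization are assumptions, not a description of the actual script.

```python
def prune_translations(targets, min_prob=1e-4, max_alternatives=64, max_cdf=0.99):
    """Keep the most probable alternatives under three stopping rules.

    `targets` maps target tokens to alignment probabilities for a single
    source token. Probabilities are kept as-is (no renormalization).
    """
    kept = {}
    cumulative = 0.0
    # Visit alternatives from most to least probable.
    ranked = sorted(targets.items(), key=lambda kv: kv[1], reverse=True)
    for tgt, prob in ranked:
        if prob < min_prob:
            break  # every remaining alternative is even less probable
        if len(kept) >= max_alternatives:
            break  # alternative budget exhausted
        if cumulative >= max_cdf:
            break  # enough probability mass already covered
        kept[tgt] = prob
        cumulative += prob
    return kept


# Made-up alternatives for a single source token.
toy_targets = {"house": 0.6, "home": 0.3, "building": 0.095, "hut": 0.005}

pruned = prune_translations(toy_targets, min_prob=0.01, max_alternatives=3, max_cdf=0.95)
print(sorted(pruned))
```

Tighter thresholds keep fewer alternatives per token, which shrinks the index at the cost of dropping low-probability translations.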
The search script takes the index and a tsv query file and outputs a TREC-style result file.
As with indexing, ir_datasets is supported through the irds: prefix in both the --query_source and --qrels arguments.
The following command is an example.
python -m fast_psq.search \
--query_source irds:neuclir/1/zh/trec-2022 \
--query_field title \
--index_dir ./indexes/neuclir-zh.f32/ \
--qrels irds:neuclir/1/zh/trec-2022 \
--query_lang en \
--output_file ./neuclir-zh.en.title.f32.trec
Please use python -m fast_psq.search --help for more information about the arguments.
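To inspect the inputs and outputs of the search step without extra dependencies, the sketch below parses a tab-separated query file and a TREC-style result file. The result layout is the standard six-column TREC run format (query ID, the literal Q0, document ID, rank, score, run tag). The query layout of one ID and one query text per line is an assumption about what the script expects, and the queries, scores, and run tag are made up.

```python
from collections import defaultdict

# Assumed query layout: "<query id>\t<query text>" per line (made-up content).
query_tsv = "q1\tclimate change policy\nq2\trenewable energy\n"

queries = {}
for line in query_tsv.splitlines():
    qid, text = line.split("\t", 1)
    queries[qid] = text

# Standard TREC run lines: qid Q0 docid rank score tag (made-up content).
run_text = (
    "q1 Q0 d7 1 12.5 example_run\n"
    "q1 Q0 d3 2 11.0 example_run\n"
    "q2 Q0 d9 1 8.25 example_run\n"
)


def parse_trec_run(lines):
    """Return {query id: {document id: score}} from TREC-style run lines."""
    run = defaultdict(dict)
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        qid, _q0, docid, _rank, score, _tag = line.split()
        run[qid][docid] = float(score)
    return dict(run)


run = parse_trec_run(run_text.splitlines())

# Show the top-scoring document for each query.
for qid, scores in run.items():
    best_doc = max(scores, key=scores.get)
    print(qid, queries[qid], best_doc)
```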
@article{psq-repro,
title = {Efficiency-Effectiveness Tradeoff of Probabilistic Structured Queries for Cross-Language Information Retrieval},
author = {Eugene Yang and Suraj Nair and Dawn Lawrie and James Mayfield and Douglas W. Oard and Kevin Duh},
journal = {arXiv preprint arXiv},
year = {2024}
}