ml_filter/documentation/evaluation.md at master · Modalities/ml_filter

This documentation gives an overview of the typical steps that have to be done in order to evaluate prompt based annotations against the ground truth and add the results to a latex file.

First put the annotations of different LLMs on the test data in a common folder /path/to/annotations . This folder should contain all LLMs for which you want to report the evaluation metrics together in one table in our overleaf document:

python3.11 ml_filter evaluate_prompt_based_annotations --input_directory /path/to/annotations --output_directory /path/to/annotations/comparison --gt_data /raid/s3/opengptx/user/richard-rutmann/data/eurolingua/prompt_based_annotations/test_data/gt/annotations__educational_content__en__gt.jsonl --aggregation majority --labels 0,1,2,3,4,5 --thresholds 2,3

Currently we use this path on DGX2 to collect and evaluate annotations: /raid/s3/opengptx/user/richard-rutmann/data/eurolingua/experiments/multilinguality/experiments

Next, we will extract the metrics of interest from the files produced in step 1 and write them into a latex-file:

python3.11 ml_filter collect_ir_metrics --input_directory /path/to/annotations/comparison --output_directory /path/to/annotations/comparison/ir_summary --min_metrics Invalid,MSE,MAE

The content of

/path/to/annotations/comparison/ir_summary/ir_summary_gt.tex

can then be copied partly or fully to the overleaf document.

In addition, you can find confusion matrices and histograms for each annotator and language in

/path/to/annotations/comparison

as well as aggregated across languages in

/path/to/annotations/comparison/ir_summary

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

FilesExpand file tree

evaluation.md

Latest commit

History

evaluation.md

File metadata and controls