Commit a5b777e

Merge pull request #240 from Modalities/embedding_annotation_pipeline_at_scale
2 parents 6a43648 + 71024b3, commit a5b777e

21 files changed: 2540 additions & 24 deletions

README.md

Lines changed: 25 additions & 4 deletions
@@ -18,12 +18,16 @@ We use this repository to filter out low-quality documents from the Common Crawl
 3. The classifier is used to filter out low-quality documents from the entire CC dataset. The filtered dataset is then used to train the model(s).

+## Documentation Map
+- [Pipelines: Embedding & Annotation](documentation/pipelines.md) – generate embeddings and run annotation heads at scale.
+- [Aggregation](documentation/aggregation.md) – how scores are combined (mean, max, min, majority, etc.).
+- [Data Format](documentation/data_format.md) – expected JSONL schema & label structure.
+- [Evaluation](documentation/evaluation.md) – metrics and evaluation utilities.

 ## Installation and Development

 Please see [CONTRIBUTING.md](CONTRIBUTING.md)
-
 ## Usage
 Once you have [set up the TGI container](#setting-up-the-tgi-container-with-hugging-face-models), you can proceed to score the documents and train the classifier:
@@ -32,12 +36,28 @@ Once you have [set up the TGI container](#setting-up-the-tgi-container-with-hugg
 python cli.py score_documents --config_file_path path/to/your/config.yaml
 ```

-### 2. How to Train a Classifier
-If you already have the score, you can train a classifier by running
+### 2. Create Embeddings at Scale
+Generate HDF5 embedding files from raw JSONL (see `documentation/pipelines.md` for the full schema):
+```bash
+python cli.py run_embedding_pipeline --config_file_path configs/embedding_job.yaml
+```
+Outputs: one `.h5` file per input file (embeddings + optional labels) under the configured embedding directory.
+
+### 3. How to Train a Classifier
+If you already have scores (e.g. LLM annotations), you can train a classifier by running
 ```script
 python cli.py train_classifier --config_file_path path/to/your/training_config.yaml
 ```
-### 3. Measure Interrater Reliability
+The trained model (and tokenizer) are saved under the `final` subdirectory of the configured output directory.
+
+### 4. Run Annotation Heads on Embeddings
+Apply one or more trained regression / classification heads to previously generated embeddings:
+```bash
+python cli.py run_annotation_pipeline --config_file_path configs/annotation_job.yaml
+```
+Outputs: `${source_filename}.jsonl` files with predicted scores in `annotated_data/`.
+
+### 5. Measure Interrater Reliability
 If you have a dataset with scores annotated by multiple annotators, you can compute interrater-reliability metrics with the `interrater_reliability` command. To compare the scores within a single file (e.g. the human-annotated ground-truth data), run:
 ```script
 python cli.py interrater_reliability data_annotated.jsonl --output_file_path output.json
@@ -50,6 +70,7 @@ You can create plots for the distribution of annotations and the differences bet
 ```script
 python cli.py plot_scores data_annotated_by_model_1.jsonl data_annotated_by_model_2.jsonl --aggregation majority --output_dir outputs
 ```
+
 ## TGI

 This service relies on **TGI containers** (Text Generation Inference), which can be downloaded from [Hugging Face](https://huggingface.co). Follow the steps below to download and run the TGI container.
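The annotation pipeline above emits one JSONL object per document, keyed by the configured `output_keys` (e.g. `document_id` plus one `score_<head>` field per head). A minimal sketch of consuming such output; the record shape and the `score_` prefix are assumptions based on the configs in this commit, not code from the repository:

```python
import json

# Hypothetical sample lines mirroring the assumed annotation-output schema.
sample = [
    '{"document_id": "doc-0001", "score_Gemma_Snowflake": 2.7, "score_Llama_Snowflake": 3.1}',
    '{"document_id": "doc-0002", "score_Gemma_Snowflake": 0.4, "score_Llama_Snowflake": 0.9}',
]

def read_scores(lines, score_prefix="score_"):
    """Parse annotation JSONL lines into (document_id, {head_name: score}) pairs."""
    for line in lines:
        record = json.loads(line)
        scores = {
            key[len(score_prefix):]: value
            for key, value in record.items()
            if key.startswith(score_prefix)
        }
        yield record["document_id"], scores

for doc_id, scores in read_scores(sample):
    print(doc_id, scores)
```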
Lines changed: 25 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,25 @@
1+
params:
2+
embeddings_directory: /raid/s3/opengptx/jude/repos/ml_filter/data/throughput_analysis/output/validation_embeddings
3+
output_dir: /raid/s3/opengptx/jude/repos/ml_filter/data/throughput_analysis/output/annotations
4+
5+
regression_head_checkpoints:
6+
Gemma_Snowflake: /raid/s3/opengptx/jude/repos/ml_filter/embedding_ablations/training/final #/raid/s3/opengptx/jude/repos/ml_filter/hessanAI/checkpoints/checkpoints/edu-gemma-snowflake-balanced.ckpt
7+
Llama_Snowflake: /raid/s3/opengptx/jude/repos/ml_filter/embedding_ablations/training/final #/raid/s3/opengptx/jude/repos/ml_filter/hessanAI/checkpoints/checkpoints/edu-llama-snowflake-balanced.ckpt
8+
Mistral_Snowflake: /raid/s3/opengptx/jude/repos/ml_filter/embedding_ablations/training/final #/raid/s3/opengptx/jude/repos/ml_filter/hessanAI/checkpoints/checkpoints/edu-mistral-snowflake-balanced.ckpt
9+
batch_size: 1000
10+
hdf5_dataset_name: train
11+
output_keys: ["document_id", "score_Gemma_Snowflake", "score_Llama_Snowflake", "score_Mistral_Snowflake"]
12+
model_dtype: bfloat16
13+
embedding_dtype: bfloat16
14+
label_dtype: bfloat16
15+
compression: gzip
16+
running_on_slurm: false
17+
18+
local_settings:
19+
tasks: 1
20+
workers: 1
21+
local_tasks: 1
22+
local_rank_offset: 0
23+
logging_dir: ${params.output_dir}/logs
24+
25+
slurm_settings: null
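The config above toggles between local and Slurm execution via `running_on_slurm`, with exactly one of `local_settings` / `slurm_settings` non-null. A minimal sketch of that selection logic, operating on a plain dict standing in for the parsed YAML (an illustrative assumption, not code from the repository):

```python
def select_settings(cfg: dict) -> dict:
    """Return the active settings block based on the `running_on_slurm` flag.

    Raises if the flag points at a block that the config left as null.
    Key names follow the config above; the check itself is illustrative.
    """
    if cfg["running_on_slurm"]:
        settings = cfg.get("slurm_settings")
        if settings is None:
            raise ValueError("running_on_slurm is true but slurm_settings is null")
    else:
        settings = cfg.get("local_settings")
        if settings is None:
            raise ValueError("running_on_slurm is false but local_settings is null")
    return settings

local_cfg = {
    "running_on_slurm": False,
    "local_settings": {"tasks": 1, "workers": 1},
    "slurm_settings": None,
}
print(select_settings(local_cfg))  # -> {'tasks': 1, 'workers': 1}
```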
Lines changed: 31 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,31 @@
1+
params:
2+
embeddings_directory: /raid/s3/opengptx/jude/repos/ml_filter/data/input_dir_hierarchy
3+
output_dir: /raid/s3/opengptx/jude/repos/ml_filter/data/embedding_output_dir/annotations_new/
4+
5+
regression_head_checkpoints:
6+
Gemma_Snowflake: /raid/s3/opengptx/jude/repos/ml_filter/hessanAI/checkpoints/checkpoints/edu-gemma-snowflake-balanced.ckpt
7+
Llama_Snowflake: /raid/s3/opengptx/jude/repos/ml_filter/hessanAI/checkpoints/checkpoints/edu-llama-snowflake-balanced.ckpt
8+
batch_size: 1000
9+
10+
running_on_slurm: true
11+
12+
local_settings: null
13+
14+
slurm_settings:
15+
sbatch_args:
16+
account: "p_gptx"
17+
nodes: 1
18+
ntasks: 1
19+
gres: gpu:1
20+
partition: "capella"
21+
time: "04:00:00"
22+
cpus_per_task: 8
23+
mem_per_cpu_gb: 2
24+
gpus_per_task: 1
25+
job_name: "MMbert_embedder"
26+
output: /data/cat/ws/alju972f-regression_heads/dataset/mmbet_embeddings/mmber_logs/%j.out
27+
error: /data/cat/ws/alju972f-regression_heads/dataset/mmbet_embeddings/mmber_logs/%j.err
28+
qos: "normal"
29+
venv_path: /data/cat/ws/alju972f-regression_heads/envs/env_regression_heads/bin/activate
30+
tasks: 10
31+
workers: 1001
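The `sbatch_args` mapping above presumably gets rendered into `sbatch` command-line flags. A sketch of one plausible rendering; the underscore-to-dash mapping and the `mem_per_cpu_gb` → `--mem-per-cpu=<n>G` handling are assumptions for illustration, not the repository's actual launcher:

```python
def sbatch_flags(sbatch_args: dict) -> list[str]:
    """Render an sbatch_args mapping into sbatch command-line flags.

    Assumed convention: YAML keys with underscores map to dashed long
    options (nodes -> --nodes, job_name -> --job-name); mem_per_cpu_gb is
    special-cased to Slurm's --mem-per-cpu with a G suffix.
    """
    flags = []
    for key, value in sbatch_args.items():
        if key == "mem_per_cpu_gb":
            flags.append(f"--mem-per-cpu={value}G")
        else:
            flags.append(f"--{key.replace('_', '-')}={value}")
    return flags

print(sbatch_flags({"nodes": 1, "gres": "gpu:1", "mem_per_cpu_gb": 2, "time": "04:00:00"}))
# -> ['--nodes=1', '--gres=gpu:1', '--mem-per-cpu=2G', '--time=04:00:00']
```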
Lines changed: 38 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,38 @@
1+
dataset_name: validation
2+
3+
params:
4+
# File selection
5+
glob_pattern: "**/*.jsonl"
6+
input_dir: /raid/s3/opengptx/jude/repos/ml_filter/data/throughput_analysis/input #/raid/s3/opengptx/abbas/processed_data_natural/${dataset_name}_set
7+
8+
# Output
9+
output_dir: /raid/s3/opengptx/jude/repos/ml_filter/data/throughput_analysis/output
10+
text_field: text
11+
keys_to_index: ["id", "aggregation_type"]
12+
embedding_dir: ${dataset_name}_embeddings
13+
compression: gzip
14+
15+
# Precision
16+
embedding_dtype: float32
17+
label_dtype: int8
18+
model_dtype: bfloat16
19+
20+
# Model and embedding parameters
21+
embedding_model: jhu-clsp/mmBERT-base
22+
batch_size: 128
23+
writer_batch_size: 1000
24+
hdf5_dataset_name: train
25+
save_labels: false
26+
max_length: 8192
27+
padding: true
28+
truncation: true
29+
30+
running_on_slurm: false
31+
32+
local_settings:
33+
tasks: 2
34+
workers: 2
35+
local_tasks: 2
36+
local_rank_offset: 0
37+
38+
slurm_settings: null
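Values like `embedding_dir: ${dataset_name}_embeddings` and `${params.output_dir}/logs` rely on OmegaConf-style variable interpolation. A minimal stdlib stand-in for that resolution, to make the mechanism concrete (the repository presumably uses a full-featured config library instead):

```python
import re

def resolve(value: str, root: dict) -> str:
    """Resolve ${dotted.path} references in a string against a config dict.

    A toy version of OmegaConf-style interpolation: each ${...} is looked
    up as a dot-separated path from the config root and substituted in.
    """
    def lookup(match: re.Match) -> str:
        node = root
        for part in match.group(1).split("."):
            node = node[part]
        return str(node)
    return re.sub(r"\$\{([^}]+)\}", lookup, value)

cfg = {"dataset_name": "validation", "params": {"output_dir": "/data/out"}}
print(resolve("${dataset_name}_embeddings", cfg))  # -> validation_embeddings
print(resolve("${params.output_dir}/logs", cfg))   # -> /data/out/logs
```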
Lines changed: 52 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,52 @@
1+
dataset_name: training
2+
3+
params:
4+
# File selection
5+
glob_pattern: "**/*.jsonl"
6+
input_dir: /data/cat/ws/alju972f-regression_heads/repos/data/embedding_creation/input #/data/cat/ws/alju972f-regression_heads/dataset/Regression_head/abbas/processed_data_natural/${dataset_name}_set
7+
8+
# Output
9+
output_dir: /data/cat/ws/alju972f-regression_heads/repos/data/embedding_creation/output
10+
keys_to_index: ["id", "aggregation_type"]
11+
text_field: text
12+
embedding_dir: ${dataset_name}_embeddings
13+
compression: gzip
14+
15+
# Precision
16+
embedding_dtype: float32
17+
label_dtype: int8
18+
model_dtype: bfloat16
19+
20+
# Model and embedding parameters
21+
embedding_model: jhu-clsp/mmBERT-base
22+
batch_size: 512
23+
writer_batch_size: 1000
24+
hdf5_dataset_name: train
25+
save_labels: false
26+
max_length: 8192
27+
padding: true
28+
truncation: true
29+
30+
running_on_slurm: true
31+
32+
local_settings: null
33+
34+
slurm_settings:
35+
sbatch_args:
36+
account: "p_gptx"
37+
nodes: 1
38+
ntasks: 1
39+
gres: gpu:4
40+
exclusive: user
41+
partition: "capella"
42+
time: "04:00:00"
43+
cpus_per_task: 32
44+
mem_per_cpu_gb: 8
45+
gpus_per_task: 4
46+
job_name: "MMbert_embedder"
47+
output: ${params.output_dir}/logs/%j.out
48+
error: ${params.output_dir}/logs/%j.err
49+
qos: "normal"
50+
venv_path: /data/cat/ws/alju972f-regression_heads/repos/env/jql_pipeline/bin/activate
51+
tasks: 1
52+
workers: 1
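The `embedding_dtype` choice in these configs directly drives the size of the HDF5 output. A back-of-the-envelope sketch; the 768-dimensional embedding width below is purely an assumed example value, and gzip compression (as configured) will shrink the on-disk size further:

```python
def embedding_bytes(num_docs: int, hidden_dim: int, dtype_bytes: int) -> int:
    """Rough uncompressed size of a (num_docs, hidden_dim) embedding dataset."""
    return num_docs * hidden_dim * dtype_bytes

# 1M documents, 768-dim embeddings at float32 (4 bytes) vs bfloat16 (2 bytes):
fp32 = embedding_bytes(1_000_000, 768, 4)
bf16 = embedding_bytes(1_000_000, 768, 2)
print(f"float32:  {fp32 / 2**30:.2f} GiB")  # ~2.86 GiB
print(f"bfloat16: {bf16 / 2**30:.2f} GiB")  # ~1.43 GiB
```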
