
Commit 71024b3

Merge branch 'master' into embedding_annotation_pipeline_at_scale
2 parents: 2a1bc14 + 6a43648

20 files changed

Lines changed: 2380 additions & 23 deletions

CONTRIBUTING.md

Lines changed: 2 additions & 2 deletions
@@ -20,13 +20,13 @@ This will create a venv with Python3.11 and pip under `.venv`.
 Install the project via
 ```shell
 # ensure that you already created and activated a virtual environment before
-pip install .
+uv pip install .
 ```
 
 For developers, use
 ```shell
 # ensure that you already created and activated a virtual environment before
-pip install -e .[tests,linting]
+uv pip install -e .[tests,linting]
 pre-commit install --install-hooks
 ```
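For reference, a minimal sketch of creating that `.venv` before the install step, assuming `uv` is already installed (the hunk context above says the repo provides its own way to create the venv, so this is only an illustration):

```bash
# Create a Python 3.11 virtual environment under .venv, activate it,
# then install the project through uv's pip interface.
uv venv --python 3.11 .venv
source .venv/bin/activate
uv pip install .
```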

README.md

Lines changed: 13 additions & 0 deletions
@@ -164,6 +164,19 @@ Alternatively, just execute `bash scripts/host_vllm_model.sh $CONTAINER_NAME
 bash scripts/host_vllm_model.sh my_vllm_container 9123 meta-llama/Llama-3.1-8B-Instruct
 ```
 
+##### Mistral
+For Mistral models, make sure to manually set the correct chat template file in `tokenizer_config.json`.
+We tried hosting the model as described by MistralAI,
+```bash
+vllm serve mistralai/Mistral-Small-3.1-24B-Instruct-2503 --tensor-parallel-size 4 --port 8003 --tokenizer_mode mistral --config_format mistral --load_format mistral
+```
+
+but still ran into:
+```bash
+ValueError: Cannot use chat template functions because tokenizer.chat_template is not set and no template argument was passed! For information about writing templates and setting the tokenizer.chat_template attribute, please see the documentation at
+```
+
+
 #### Test the hosted model
 ```bash
 curl http://localhost:port_number/v1/completions \
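One possible workaround for the Mistral chat-template error above is to pass a template explicitly at serve time. `--chat-template` is a standard vLLM flag; the Jinja file path below is a placeholder you would have to supply yourself:

```bash
# Sketch only: serve with an explicit Jinja chat template instead of relying
# on tokenizer_config.json. The .jinja path is hypothetical.
vllm serve mistralai/Mistral-Small-3.1-24B-Instruct-2503 \
  --tensor-parallel-size 4 --port 8003 \
  --chat-template ./mistral_chat_template.jinja
```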

configs/score_documents/lorem_ipsum.yaml

Lines changed: 4 additions & 2 deletions
@@ -8,7 +8,8 @@ settings:
 
   paths:
     raw_data_file_paths:
-      - data/test_fineweb2_dump.jsonl
+      - data/fineweb_2_500k_both/split/spa_Latn_sampled_500k_0_to_600.jsonl
+      - data/fineweb_2_500k_both/split/srp_Cyrl_sampled_500k_0_to_600.jsonl
     output_directory_path: data/output
     prompt_template_file_path: data/prompts/fineweb_edu/educational_prompt.yaml
     start_indexes:
@@ -48,4 +49,5 @@ document_processor:
   num_processes: 1
   score_metric_name: educational_score
   strings_to_remove: []
-  jq_language_pattern: .metadata.language
+  jq_language_pattern: .language
+  document_id_column: id
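The switch from `.metadata.language` to `.language` implies the language code now sits at the top level of each JSONL record, alongside the new `id` column. A quick sanity check with `jq` (the record below is a made-up illustration, not a line from the shipped data files):

```bash
# Hypothetical record shape assumed by jq_language_pattern and document_id_column:
echo '{"id": "doc-0001", "language": "spa_Latn", "text": "..."}' | jq -r '.language, .id'
# prints:
# spa_Latn
# doc-0001
```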
Lines changed: 42 additions & 0 deletions
@@ -0,0 +1,42 @@
+settings:
+  model_name: google/gemma-2-27b-it
+  num_gpus: 1
+  tokenizer_name_or_path: ${settings.model_name}
+  paths:
+    raw_data_file_paths:
+      - data/debug/deu_Latn_sampled_500k_first_200.jsonl
+    start_indexes:
+      - 150
+    output_directory_path: data/output
+    prompt_template_file_path: data/prompts/pii/pii_prompt.yaml
+llm_rest_client:
+  model_name: ${settings.model_name}
+  max_tokens: 8192
+  sampling_params:
+    max_tokens: 500
+    temperature: 0.7
+    n: 3
+    top_p: 0.9
+  max_pool_connections: 1000
+  max_pool_maxsize: 1000
+  max_retries: 2
+  backoff_factor: 0.4
+  timeout: 100
+  verbose: false
+  num_gpus: ${settings.num_gpus}
+  max_new_tokens: 500
+tokenizer:
+  pretrained_model_name_or_path: ${settings.tokenizer_name_or_path}
+  special_tokens: null
+  add_generation_prompt: true
+prompt_builder:
+  prompt_template_file_path: ${settings.paths.prompt_template_file_path}
+  max_prompt_length: 7690
+document_processor:
+  output_directory_path: ${settings.paths.output_directory_path}
+  queue_size: 1000
+  num_processes: 10
+  score_metric_name: pii_score
+  strings_to_remove: []
+  jq_language_pattern: .language
+  document_id_column: id
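This config (its filename is not shown in this diff view) would presumably be fed to the scoring entry point. A hedged sketch, where the `score_documents` command name, config path, and endpoint URL are all placeholders inferred from the repo layout:

```bash
# Sketch only: actual command name and paths may differ.
python -m ml_filter score_documents \
  --config_file_path configs/score_documents/<pii_config>.yaml \
  --rest_endpoint http://localhost:8003
```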

data/debug/deu_Latn_sampled_500k_first200_id.jsonl

Lines changed: 200 additions & 0 deletions
Large diffs are not rendered by default.

data/debug/deu_Latn_sampled_500k_first_200.jsonl

Lines changed: 200 additions & 0 deletions
Large diffs are not rendered by default.

data/fineweb_2_500k_both/split/spa_Latn_sampled_500k_0_to_600.jsonl

Lines changed: 600 additions & 0 deletions
Large diffs are not rendered by default.

data/fineweb_2_500k_both/split/srp_Cyrl_sampled_500k_0_to_600.jsonl

Lines changed: 600 additions & 0 deletions
Large diffs are not rendered by default.

data/prompts/pii/pii_prompt.yaml

Lines changed: 16 additions & 0 deletions
@@ -0,0 +1,16 @@
+prompt: |
+  Below is an extract from a web page. Evaluate the level of Personally Identifiable Information (PII) present in the extract using the additive 5-point scoring system described below. Points are accumulated based on the presence and sensitivity of PII:
+
+  - Add 1 point if the extract contains any potential PII like a name, username, or other identifiers that could be linked to an individual, even if not uniquely identifying on their own.
+  - Add another point if the extract includes direct identifiers that can uniquely identify an individual, such as a full name, email address, phone number, or physical address.
+  - Award a third point if the extract contains sensitive PII like government ID numbers (e.g., SSN, passport), or financial information (e.g., credit card, bank account numbers).
+  - Grant a fourth point if the extract includes highly sensitive PII, such as medical records, biometric data, genetic information, or private communications.
+  - Bestow a fifth point if the extract is dominated by PII, such as a list or database of personal information for multiple individuals.
+  Note: If the extract contains no PII, the score is 0. Information about public figures that is already widely known, or data belonging to fictional characters, should also be scored 0. General discussions about data privacy that do not contain specific individuals' data should be scored based only on any actual PII examples present.
+  The extract:
+  {placeholder}
+  After examining the extract:
+  - Briefly justify your score, up to 100 words.
+  - Conclude with the score using the format: 'PII score: <total points>'
+
+prompt_name: pii_content_filter
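Downstream parsing of the model reply is not shown in this commit; a minimal sketch, assuming the reply ends with the mandated 'PII score: <total points>' line and that `response.txt` is a hypothetical file holding one raw completion:

```bash
# Extract the integer score from the last "PII score: <n>" occurrence.
sed -nE 's/.*PII score:[[:space:]]*([0-9]+).*/\1/p' response.txt | tail -n 1
```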

src/ml_filter/__main__.py

Lines changed: 77 additions & 2 deletions
@@ -21,6 +21,7 @@
 from ml_filter.training.embedding_training_pipeline import run_embedding_head_training_pipeline
 from ml_filter.translate import TranslationServiceType, TranslatorFactory
 from ml_filter.utils.chunk_data import chunk_jsonl
+from ml_filter.utils.get_costs_of_openai_batched_requests import find_and_process_files
 from ml_filter.utils.manipulate_datasets import apply_score_transforms, convert_hf_dataset_to_jsonl, split_dataset
 from ml_filter.utils.manipulate_documents import merge_and_sort_jsonl_files
 from ml_filter.utils.manipulate_prompt import add_target_language_to_prompt
@@ -128,7 +129,18 @@ def main() -> None:
     required=True,
     help="The endpoint for the LLM service.",
 )
-def entry_point_score_documents(config_file_path: Path, rest_endpoint: str, experiment_id: Optional[str] = None):
+@click.option(
+    "--use_llm_rest_client_request_collector",
+    type=bool,
+    default=False,
+    help="Whether to use the LLM REST client request collector to run requests with OpenAI batched API.",
+)
+def entry_point_score_documents(
+    config_file_path: Path,
+    rest_endpoint: str,
+    use_llm_rest_client_request_collector,
+    experiment_id: Optional[str] = None,
+):
     with open(config_file_path, "rb") as f:
         hash_value = hashlib.sha256(f.read()).hexdigest()[:8]
     experiment_id_postfix = datetime.now().strftime("%Y-%m-%d__%H-%M-%S") + f"__{hash_value}"
@@ -137,7 +149,12 @@ def entry_point_score_documents(config_file_path: Path, rest_endpoint: str, experiment_id: Optional[str] = None):
         experiment_id = experiment_id_postfix
     else:
         experiment_id = experiment_id + f"/{experiment_id_postfix}"
-    llm_service = LLMClient(config_file_path=config_file_path, experiment_id=experiment_id, rest_endpoint=rest_endpoint)
+    llm_service = LLMClient(
+        config_file_path=config_file_path,
+        experiment_id=experiment_id,
+        rest_endpoint=rest_endpoint,
+        use_llm_rest_client_request_collector=use_llm_rest_client_request_collector,
+    )
     llm_service.run()
 
 
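The new flag is toggled at the same call site; a hedged sketch (command name as assumed earlier for the PII config, flag per the click definition above):

```bash
# Collect requests for later OpenAI Batch submission instead of scoring live.
python -m ml_filter score_documents \
  --config_file_path configs/score_documents/lorem_ipsum.yaml \
  --rest_endpoint http://localhost:8003 \
  --use_llm_rest_client_request_collector true
```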

@@ -751,6 +768,64 @@ def _get_target_language_codes_list_helper(target_language_codes: str) -> list[str]:
     return [lang_code.strip().lower() for lang_code in target_language_codes.split(",")]
 
 
+@main.command(name="submit_batch_requests")
+@click.argument("input_files", nargs=-1, type=click.Path(exists=True, path_type=Path))
+@click.option(
+    "--model_name",
+    type=str,
+    required=True,
+    help="Name of the OpenAI model to use.",
+)
+@click.option(
+    "--max_requests_per_file",
+    type=int,
+    default=None,
+    help="Maximum number of requests to send per input file.",
+)
+@click.option(
+    "--check_status_only",
+    type=bool,
+    default=False,
+    help="Whether to check the status of existing batch requests only.",
+)
+def submit_collected_requests_to_batched_openai_api_cli(
+    input_files: tuple[Path], model_name: str, max_requests_per_file: int | None, check_status_only: bool
+):
+    """
+    CLI command to submit collected requests to the batched OpenAI API.
+    """
+    from ml_filter.llm_api.openai_batch_request_collector import OpenAIBatchAPIRequestSubmitter
+
+    input_files = [Path(p) for p in input_files]
+    # Not all models are supported: https://community.openai.com/t/error-on-tryng-to-use-batches/935474/7
+    collector = OpenAIBatchAPIRequestSubmitter(
+        input_files=input_files, model_name=model_name, max_requests_per_file=max_requests_per_file
+    )
+    if not check_status_only:
+        collector.submit()
+    else:
+        collector.check_status_maybe_get_results()
+
+
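A hedged usage sketch for the new command, assuming `python -m ml_filter` dispatches to this click group; the input file names are placeholders, and `gpt-4o-mini` is just one example of a Batch-API-eligible model:

```bash
# Submit collected request files as OpenAI batch jobs.
python -m ml_filter submit_batch_requests \
  data/output/requests_part_0.jsonl data/output/requests_part_1.jsonl \
  --model_name gpt-4o-mini \
  --max_requests_per_file 1000

# Later: only poll status (and fetch results when finished).
python -m ml_filter submit_batch_requests \
  data/output/requests_part_0.jsonl data/output/requests_part_1.jsonl \
  --model_name gpt-4o-mini \
  --check_status_only true
```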
+@main.command(name="get_costs_of_openai_batched_requests")
+@click.option(
+    "--root_directory",
+    type=str,
+    required=True,
+    help="The root directory to search recursively.",
+)
+@click.option(
+    "-o",
+    "--output_file",
+    type=str,
+    default="report.md",
+    show_default=True,
+    help="Path to save the markdown report (default: report.md under the root dir).",
+)
+def get_costs_of_openai_batched_requests_cli(root_directory: str, output_file: str):
+    find_and_process_files(root_directory, output_file)
+
+
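And a matching sketch for the cost-report command (the directory and output path are placeholders; option names are per the click definitions above):

```bash
# Walk data/output recursively and write a markdown cost report.
python -m ml_filter get_costs_of_openai_batched_requests \
  --root_directory data/output \
  -o data/output/report.md
```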
 @main.command(name="train_with_embeddings")
 @click.option(
     "--config_file_path",
