Implement FineWebSecurity module for cybersecurity data filtering #14
naufalso wants to merge 3 commits into
naufalso commented on May 12, 2026
- Added core functionalities for filtering web data using BERT/ModernBERT classifiers.
- Introduced command line interfaces for checking progress and downloading dataset subsets.
- Developed dataset handling capabilities, including loading, iterating, and saving parquet files.
- Implemented progress tracking for filtering operations.
- Created utility functions for managing file persistence and JSON document handling.
- Added comprehensive tests for CLI, dataset operations, filtering logic, and progress management.
- Updated README with module details and usage instructions.
Pull request overview
Adds a new data/FineWebSecurity module that filters the Hugging Face FineWeb dataset with a cybersecurity BERT/ModernBERT classifier, including CLIs, persistence/progress utilities, and a Docker + tmux-based parallel runner.
Changes:
- Introduces `fineweb_security` package code for dataset access, BERT inference, persistence, progress tracking, and optional Hugging Face Hub upload.
- Adds CLI entrypoints for filtering, downloading subsets, and checking progress (plus compatibility wrapper scripts).
- Adds a Dockerfile, shell scripts, config lists, and pytest coverage for core behaviors.
Reviewed changes
Copilot reviewed 30 out of 31 changed files in this pull request and generated 10 comments.
| File | Description |
|---|---|
| README.md | Marks the cybersecurity-filtering code as released. |
| data/README.md | Documents data modules and links FineWebSecurity docs/dataset. |
| data/FineWebSecurity/README.md | FineWebSecurity installation + usage documentation. |
| data/FineWebSecurity/README_Docker.md | Docker-specific usage instructions. |
| data/FineWebSecurity/requirements.txt | Python dependencies for FineWebSecurity tooling. |
| data/FineWebSecurity/Dockerfile | Container build for running filtering pipeline. |
| data/FineWebSecurity/.gitignore | Ignores FineWebSecurity local artifacts (outputs/cache/etc.). |
| data/FineWebSecurity/config/fineweb_config.txt | List of FineWeb subset configs used by tmux runner. |
| data/FineWebSecurity/accelerate/default_config.yaml | Accelerate config scaffold for local runs. |
| data/FineWebSecurity/scripts/parallel_lib.sh | tmux-based parallel execution helper library. |
| data/FineWebSecurity/scripts/filter_fineweb_bert.sh | Multi-GPU/multi-subset tmux launcher for filtering. |
| data/FineWebSecurity/src/__init__.py | Marks src/ as a Python package root (empty). |
| data/FineWebSecurity/src/filter_fineweb_bert_map.py | Compatibility wrapper to invoke filter CLI. |
| data/FineWebSecurity/src/download_subset.py | Compatibility wrapper to invoke download CLI. |
| data/FineWebSecurity/src/check_fineweb_progress.py | Compatibility wrapper to invoke progress-check CLI. |
| data/FineWebSecurity/src/fineweb_security/__init__.py | Package initializer. |
| data/FineWebSecurity/src/fineweb_security/bert.py | Model loading, warmup, and batch prediction utilities. |
| data/FineWebSecurity/src/fineweb_security/datasets.py | FineWeb parquet listing, iteration, and downloading helpers. |
| data/FineWebSecurity/src/fineweb_security/persistence.py | JSON doc persistence and parquet writing utilities. |
| data/FineWebSecurity/src/fineweb_security/progress.py | Progress file read/write helpers. |
| data/FineWebSecurity/src/fineweb_security/hub.py | Hugging Face Hub token/branch management + upload helper. |
| data/FineWebSecurity/src/fineweb_security/cli/__init__.py | CLI package marker. |
| data/FineWebSecurity/src/fineweb_security/cli/filter_bert.py | Main filtering CLI pipeline implementation. |
| data/FineWebSecurity/src/fineweb_security/cli/download_subset.py | CLI for downloading FineWeb subset shards. |
| data/FineWebSecurity/src/fineweb_security/cli/check_progress.py | CLI for reporting filtering progress. |
| data/FineWebSecurity/tests/conftest.py | Test path setup for src/ layout. |
| data/FineWebSecurity/tests/test_cli.py | CLI parser help smoke tests. |
| data/FineWebSecurity/tests/test_datasets.py | Tests remaining-work detection logic. |
| data/FineWebSecurity/tests/test_filter_pipeline.py | Tests batch filtering logic at threshold. |
| data/FineWebSecurity/tests/test_persistence.py | Tests JSON persistence + parquet output behavior. |
| data/FineWebSecurity/tests/test_progress.py | Tests progress read/write round-trip behavior. |

    return Progress(
        parquet_idx=int(content.get("parquet_idx", 0) or 0),
        parquet_sample_idx=int(content.get("parquet_sample_idx", 0) or 0),
    )

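The follow-up at the end of this thread mentions "robust progress parsing". A minimal sketch of what that could look like is below. It is an illustration, not the code from the PR: the `Progress` dataclass, `_to_index`, and `read_progress` are hypothetical stand-ins, and only the two field names are taken from the snippet above. The idea is that a missing file, invalid JSON, a non-dict payload, or a non-numeric value all fall back to zero instead of raising.

```python
import json
from dataclasses import dataclass
from pathlib import Path
from typing import Any


@dataclass(frozen=True)
class Progress:
    # Hypothetical stand-in for the PR's Progress type; only the two fields
    # visible in the quoted snippet are modeled here.
    parquet_idx: int = 0
    parquet_sample_idx: int = 0


def _to_index(value: Any) -> int:
    """Coerce a stored value to a non-negative int, falling back to 0."""
    if isinstance(value, bool):
        # bool is a subclass of int; treat it as invalid rather than 0/1.
        return 0
    try:
        return max(int(value), 0)
    except (TypeError, ValueError, OverflowError):
        return 0


def read_progress(path: Path) -> Progress:
    """Return saved progress, or a zeroed Progress if the file is unusable."""
    try:
        content = json.loads(path.read_text(encoding="utf-8"))
    except (OSError, ValueError):
        # Covers a missing file, undecodable bytes, and malformed JSON.
        return Progress()
    if not isinstance(content, dict):
        return Progress()
    return Progress(
        parquet_idx=_to_index(content.get("parquet_idx")),
        parquet_sample_idx=_to_index(content.get("parquet_sample_idx")),
    )
```

Falling back to zero restarts the subset from the beginning, which is the conservative choice when the progress file cannot be trusted.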

    if compile_model and hasattr(torch, "compile") and torch.__version__ >= "2.0.0" and device == "cuda":
        try:
            logger.info("Compiling model with torch.compile().")
            model = torch.compile(model)
        except Exception as exc:
            logger.warning("torch.compile failed (%s). Continuing without compilation.", exc)


    def warmup_model(model: torch.nn.Module, tokenizer: Any, batch_size: int, max_length: int, device: str) -> None:
        dummy_input_ids = torch.randint(0, tokenizer.vocab_size, (batch_size, max_length))
        dummy_attention_mask = torch.ones((batch_size, max_length))
        predict_batch(
            {
                "input_ids": dummy_input_ids.tolist(),
                "attention_mask": dummy_attention_mask.tolist(),
            },

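The check `torch.__version__ >= "2.0.0"` in the snippet above compares strings lexicographically, so a version such as `"10.0.0"` would compare as lower than `"2.0.0"`. A small sketch of a numeric comparison follows. It uses only the standard library; `parse_release` and `version_at_least` are hypothetical helper names, and the sketch ignores pre-release suffixes, which is an assumption that may not suit every case. The `packaging.version` module is a commonly used alternative if that dependency is available.

```python
import re
from typing import Tuple


def parse_release(version: str) -> Tuple[int, ...]:
    """Extract the leading numeric release, e.g. '2.1.0+cu121' -> (2, 1, 0)."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError(f"Unrecognized version string: {version!r}")
    return tuple(int(part) for part in match.group(0).split("."))


def version_at_least(version: str, minimum: Tuple[int, ...]) -> bool:
    """Compare numerically, padding the shorter tuple with zeros."""
    release = parse_release(version)
    width = max(len(release), len(minimum))
    padded_release = release + (0,) * (width - len(release))
    padded_minimum = minimum + (0,) * (width - len(minimum))
    return padded_release >= padded_minimum


# The string comparison gets this case wrong; the numeric one does not.
print("10.0.0" >= "2.0.0")                  # False
print(version_at_least("10.0.0", (2, 0)))   # True
```

In the compile guard, a call like `version_at_least(torch.__version__, (2, 0))` could stand in for the string comparison, though `hasattr(torch, "compile")` already covers most of the intent.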

    )
    if offset == 0 and start_sample_idx > 0:
        logger.info("Skipping %d examples in %s.", start_sample_idx, parquet_file)
        dataset = dataset.skip(start_sample_idx)

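The follow-up in this thread mentions "non-streaming resume fixes". One way to make the skip step work for both map-style and streaming datasets is sketched below, under stated assumptions: `skip_examples` is a hypothetical helper, the map-style branch relies on `select()` accepting a range of indices (as Hugging Face `Dataset.select` does), and the streaming branch relies on `IterableDataset.skip()`. Behavior when the whole dataset is skipped has not been checked against the real library here.

```python
from typing import Any


def skip_examples(dataset: Any, count: int) -> Any:
    """Drop the first `count` examples from a map-style or iterable dataset."""
    if count <= 0:
        return dataset
    if hasattr(dataset, "__len__") and hasattr(dataset, "select"):
        # Map-style dataset: keep the remaining index range explicitly.
        total = len(dataset)
        return dataset.select(range(min(count, total), total))
    # Streaming dataset: rely on the lazy skip() of the iterable dataset.
    return dataset.skip(count)
```

Dispatching on `__len__` keeps the call site identical for both modes, so the resume logic does not need to know whether streaming is enabled.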

    logger.info("Running inference for %s.", parquet_file)
    try:
        tokenized_dataset.map(
            lambda batch, batch_indices, rank_idx: process_batch(
                batch,
                models,
                batch_indices,
                rank_idx,
                args.threshold,
                current_output_dir,
                save_queue,
                args.save_frequency,
                progress_file,
                parquet_idx,
            ),
            batched=True,
            batch_size=args.batch_size,
            with_indices=True,
            with_rank=True,
            desc=f"Processing {parquet_file}",
            num_proc=args.parallel_worker,
        )


    pip install torch && \
    pip install --no-cache-dir -r requirements.txt


    current_batch_number = last_processed_index // max(len(batch_indices), 1)
    if current_batch_number % save_frequency == 0:
        save_progress(progress_file, parquet_idx, last_processed_index)

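In the snippet above, the batch number is derived by dividing by `len(batch_indices)`, which shrinks on the final, shorter batch and so changes the result. A sketch of a steadier calculation is below. `should_save_progress` is a hypothetical helper, and it assumes batch indices are contiguous and aligned to multiples of the configured `batch_size` (which may not hold when `num_proc` shards the dataset).

```python
from typing import Sequence


def should_save_progress(
    batch_indices: Sequence[int], batch_size: int, save_frequency: int
) -> bool:
    """Decide whether to persist progress after the current batch."""
    if not batch_indices or batch_size <= 0 or save_frequency <= 0:
        return False
    # Use the configured batch size, not the actual batch length, so the
    # final partial batch maps to the same numbering as full batches.
    batch_number = batch_indices[0] // batch_size
    return batch_number % save_frequency == 0


# With batch_size=4 and 11 examples, the last batch holds indices 8..10.
print(should_save_progress([8, 9, 10], batch_size=4, save_frequency=2))  # True
print((10 // max(len([8, 9, 10]), 1)) % 2 == 0)  # False: length-based formula
```

Because the result depends only on the first index and fixed configuration, re-running the same batch after a resume gives the same answer.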

    parser.add_argument("--dataset_subset", type=str, default="CC-MAIN-2024-18")
    parser.add_argument("--dataset_size", type=int, default=-1)
    parser.add_argument("--batch_size", type=int, default=640)
    parser.add_argument("--start_idx", type=int, default=0)


    import numpy as np
    import torch
    from multiprocess import Manager, set_start_method
    from tqdm import tqdm


    import json

@copilot apply changes based on the comments in this thread
Agent-Logs-Url: https://github.com/RISys-Lab/RedSage/sessions/c31d0317-5451-449f-969e-2afe0b13ae2c Co-authored-by: naufalso <51362638+naufalso@users.noreply.github.com>
Implemented the requested fixes from the review thread in commits 27e0a50 and 8ae3b51. This includes robust progress parsing, safer warmup/compile handling, non-streaming resume fixes, idempotent progress indexing, wiring |