This repository compares baseline retrieval vs SAC (Summary-Augmented Chunking) for legal/regulatory RAG, and serves an API for jurisdiction-filtered queries.
- Ingest/clean PDFs
  - Baseline: 01_baseline_pipeline.py -> cleaned/
  - SAC: 01_sac_pipeline.py -> cleaned_sac/
- Build embeddings + FAISS
  - Baseline: 02_baseline_embed.py
  - SAC/main: 02_embed_build_openai.py -> indexes/faiss.index, indexes/meta.jsonl
- Retrieval + QA
  - 03_retrieve_and_qa.py
- API
  - 06_api_server.py (GET /health, GET /jurisdictions, POST /query)
- Evaluation
  - 04_baseline_benchmark.py, 04_benchmark.py
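The difference between the baseline and SAC cleaning steps can be illustrated with a small sketch. It assumes SAC means prepending a document-level summary to every chunk before embedding; the `Chunk` type, the `[Summary]`/`[Chunk]` markers, the character-window splitter, and the default sizes are all hypothetical and are not taken from 01_sac_pipeline.py.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    doc_id: str
    jurisdiction: str
    text: str


def split_into_chunks(text: str, size: int = 200, overlap: int = 40) -> list[str]:
    """Split text into overlapping character windows (a deliberately naive splitter)."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than size")
    if not text:
        return []
    step = size - overlap
    # Stop once the remaining tail is already covered by the previous window.
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]


def summary_augment(doc_id: str, jurisdiction: str, text: str, summary: str,
                    size: int = 200, overlap: int = 40) -> list[Chunk]:
    """Prefix each chunk with the same document summary (assumed SAC behaviour)."""
    return [
        Chunk(doc_id, jurisdiction, f"[Summary] {summary}\n[Chunk] {piece}")
        for piece in split_into_chunks(text, size, overlap)
    ]
```

A baseline pipeline in this sketch would embed the output of `split_into_chunks` directly, while the SAC variant embeds the summary-prefixed text, so each vector carries some document-level context.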
- raw/ - Source PDFs by jurisdiction (ch, hk, jp, sg, kr, uk, uae, us, eu)
- indexes/ - Active FAISS index + metadata (generated artifact)
- tools/ - Runtime scripts (e.g., API smoke regression)
- tests/
  - tests/manual/: active hand-curated sets (current core benchmark)
- baseline_data/ - Baseline indexes/artifacts for A/B comparison
- dev_support/ - Development support / process materials (manifests, legacy batch tests, helper scripts, historical reports)
- If you only run the current production path, focus on:
raw/ -> 01_sac_pipeline.py -> 02_embed_build_openai.py -> 03_retrieve_and_qa.py -> 06_api_server.py
- SAC benchmark:
python3 04_benchmark.py --test-set tests/manual/test_set_core_hk_sg_uk_uae.json
- Baseline benchmark:
python3 04_baseline_benchmark.py --test-set tests/manual/test_set_core_hk_sg_uk_uae.json
- Export regression CSV (per-case + summary):
python3 04_benchmark.py --test-set dev_support/tests/batches_legacy/test_set_B01_fixed.json --out-csv dev_support/reports/benchmark_B01_cases.csv --out-summary-csv dev_support/reports/benchmark_B01_summary.csv
- Core in-corpus benchmark (HK/SG/UK/UAE):
python3 04_benchmark.py --test-set tests/manual/test_set_core_hk_sg_uk_uae.json --out-csv reports/core_hk_sg_uk_uae_cases.csv --out-summary-csv reports/core_hk_sg_uk_uae_summary.csv
- Incrementally process new PDFs and append only new chunks to FAISS:
python3 dev_support/scripts/refresh_sac_index_from_raw.py --jurisdictions sg,uk,uae
- If JSONL already exists and you only want append-embedding:
python3 dev_support/scripts/refresh_sac_index_from_raw.py --jurisdictions sg,uk,uae --skip-clean
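One way to read "append only new chunks" is to compare candidate chunk IDs against those already recorded in the metadata file. The sketch below shows that idea only; the `chunk_id` field name and both helper functions are hypothetical, and the actual logic in refresh_sac_index_from_raw.py may differ.

```python
import json
from typing import Iterable


def load_known_ids(meta_lines: Iterable[str], id_field: str = "chunk_id") -> set[str]:
    """Collect the IDs already present in a JSONL metadata stream."""
    ids: set[str] = set()
    for line in meta_lines:
        line = line.strip()
        if line:  # tolerate blank lines
            ids.add(json.loads(line)[id_field])
    return ids


def select_new_chunks(candidates: Iterable[dict], known_ids: set[str],
                      id_field: str = "chunk_id") -> list[dict]:
    """Return candidates whose ID is not yet indexed, preserving input order."""
    seen = set(known_ids)  # copy so the caller's set is not mutated
    fresh = []
    for chunk in candidates:
        cid = chunk[id_field]
        if cid not in seen:
            seen.add(cid)  # also drops duplicates within the candidate batch
            fresh.append(chunk)
    return fresh
```

Only the chunks returned by `select_new_chunks` would then need to be embedded and added to the index, with their rows appended to the metadata file in the same order.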
- Run 4 key API checks (/jurisdictions, HK only, HK+SG mixed, invalid XX=400):
bash tools/api_smoke_regression.sh
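The four checks suggest that the server validates jurisdiction codes and answers an unknown code with HTTP 400. A minimal sketch of such validation follows. It assumes the accepted codes are the ones listed for raw/ and that matching is case-insensitive; the function and exception names are hypothetical and are not taken from 06_api_server.py.

```python
from typing import Iterable

# Codes taken from the raw/ directory listing; the live server may accept a different set.
KNOWN_JURISDICTIONS = ("ch", "hk", "jp", "sg", "kr", "uk", "uae", "us", "eu")


class InvalidJurisdiction(ValueError):
    """Raised when a requested jurisdiction code is not recognised."""


def normalize_jurisdictions(requested: Iterable[str]) -> list[str]:
    """Lower-case, trim, and de-duplicate codes; reject unknown ones."""
    result: list[str] = []
    for code in requested:
        key = code.strip().lower()
        if key not in KNOWN_JURISDICTIONS:
            raise InvalidJurisdiction(f"unknown jurisdiction: {code!r}")
        if key not in result:
            result.append(key)
    return result


def status_for(requested: Iterable[str]) -> int:
    """Map validation outcome to the HTTP status the smoke test expects."""
    try:
        normalize_jurisdictions(requested)
    except InvalidJurisdiction:
        return 400
    return 200
```

Under these assumptions, `status_for(["hk"])` and `status_for(["hk", "sg"])` correspond to the HK-only and mixed checks, and `status_for(["XX"])` corresponds to the invalid-code check.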
- Active Python interpreter (current working env):
/Users/chenzheyang/anaconda3/envs/crypto_reg/bin/python
- Note:
  - This env has faiss and openai available.
  - System python3 may not have faiss.
- Required:
  - GPTSAPI_API_KEY
- Optional:
  - GPTS_BASE_URL (default: https://api.gptsapi.net/v1)
  - PROXY_URL (for local proxy, if needed)
  - HOST (default: 127.0.0.1)
  - PORT (default: 8000)
  - EMBED_MODEL (default: text-embedding-3-large)
  - QA_MODEL (default: gpt-4o-mini)
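The variables above could be read into a single settings object along the following lines. The variable names and defaults come from the list; the `Settings` class and `load_settings` function are hypothetical and only sketch how the scripts might consume them.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    api_key: str
    base_url: str
    proxy_url: Optional[str]
    host: str
    port: int
    embed_model: str
    qa_model: str


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Build Settings from an environment mapping, applying the documented defaults."""
    api_key = env.get("GPTSAPI_API_KEY", "").strip()
    if not api_key:
        # The only required variable; fail early with a clear message.
        raise RuntimeError("GPTSAPI_API_KEY is required")
    return Settings(
        api_key=api_key,
        base_url=env.get("GPTS_BASE_URL", "https://api.gptsapi.net/v1"),
        proxy_url=env.get("PROXY_URL") or None,  # empty string counts as unset
        host=env.get("HOST", "127.0.0.1"),
        port=int(env.get("PORT", "8000")),
        embed_model=env.get("EMBED_MODEL", "text-embedding-3-large"),
        qa_model=env.get("QA_MODEL", "gpt-4o-mini"),
    )
```

Passing the mapping as a parameter keeps the function easy to exercise with a plain dict instead of the real process environment.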
- Copy env template:
cp .env.example .env
- Load env:
set -a; source .env; set +a
- Start API server:
/Users/chenzheyang/anaconda3/envs/crypto_reg/bin/python 06_api_server.py
- Run SAC benchmark:
/Users/chenzheyang/anaconda3/envs/crypto_reg/bin/python 04_benchmark.py --test-set tests/manual/test_set_core_hk_sg_uk_uae.json
/Users/chenzheyang/anaconda3/envs/crypto_reg/bin/python 04_benchmark.py --test-set dev_support/tests/batches_legacy/test_set_B01_fixed.json
- Run baseline benchmark:
/Users/chenzheyang/anaconda3/envs/crypto_reg/bin/python 04_baseline_benchmark.py --test-set tests/manual/test_set_core_hk_sg_uk_uae.json
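Once the API server is running, a jurisdiction-filtered query can be sent to POST /query. The sketch below uses only the standard library. The JSON field names `question` and `jurisdictions` are assumptions, since the request schema is not documented here; the host and port are the documented defaults.

```python
import json
import urllib.request


def build_query_request(question: str, jurisdictions: list[str],
                        host: str = "127.0.0.1", port: int = 8000) -> urllib.request.Request:
    """Prepare a POST /query request; the payload field names are hypothetical."""
    payload = {"question": question, "jurisdictions": jurisdictions}
    return urllib.request.Request(
        url=f"http://{host}:{port}/query",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request, timeout: float = 30.0) -> dict:
    """Send the request and decode the JSON response body."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

For example, `send(build_query_request("What licence does an exchange need?", ["hk", "sg"]))` would issue a mixed HK+SG query against a locally running server, provided the server accepts that payload shape.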