Crypto Regulation RAG Workspace

This repository compares baseline retrieval with SAC (Summary-Augmented Chunking) for legal/regulatory RAG and serves an API for jurisdiction-filtered queries.

Core Pipeline

Ingest/clean PDFs
- Baseline: 01_baseline_pipeline.py -> cleaned/
- SAC: 01_sac_pipeline.py -> cleaned_sac/
Build embeddings + FAISS
- Baseline: 02_baseline_embed.py
- SAC/main: 02_embed_build_openai.py -> indexes/faiss.index, indexes/meta.jsonl
Retrieval + QA
- 03_retrieve_and_qa.py
API
- 06_api_server.py (GET /health, GET /jurisdictions, POST /query)
Evaluation
- 04_baseline_benchmark.py, 04_benchmark.py
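
To make the baseline/SAC distinction concrete: summary-augmented chunking is usually described as prepending a short document-level summary to every chunk before embedding, so each chunk carries context about the document it came from. The sketch below illustrates that idea only. It is not taken from 01_sac_pipeline.py; the function names, the fixed-size splitter, and the chunk size are all hypothetical.

```python
from typing import List


def split_into_chunks(text: str, max_chars: int = 500) -> List[str]:
    """Naive fixed-size splitter (hypothetical; the real pipeline may split differently)."""
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]


def baseline_chunks(text: str, max_chars: int = 500) -> List[str]:
    """Baseline path: chunks are embedded as-is."""
    return split_into_chunks(text, max_chars)


def sac_chunks(text: str, summary: str, max_chars: int = 500) -> List[str]:
    """SAC path: the same chunks, each prefixed with a document-level summary."""
    return [f"{summary}\n\n{chunk}" for chunk in split_into_chunks(text, max_chars)]
```

In this sketch the summary is passed in as a plain string; how the summary is produced is outside its scope.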

Directory Guide

raw/
- Source PDFs by jurisdiction (ch, hk, jp, sg, kr, uk, uae, us, eu)
indexes/
- Active FAISS index + metadata (generated artifact)
tools/
- Runtime scripts (e.g., API smoke regression)
tests/
- tests/manual/: active hand-curated sets (current core benchmark)
baseline_data/
- Baseline indexes/artifacts for A/B comparison
dev_support/
- Development support / process materials (manifests, legacy batch tests, helper scripts, historical reports)

Practical Rule

If you only need the current production path, focus on:
- raw/ -> 01_sac_pipeline.py -> 02_embed_build_openai.py -> 03_retrieve_and_qa.py -> 06_api_server.py

Benchmark Quick Usage

SAC benchmark:
- python3 04_benchmark.py --test-set tests/manual/test_set_core_hk_sg_uk_uae.json
Baseline benchmark:
- python3 04_baseline_benchmark.py --test-set tests/manual/test_set_core_hk_sg_uk_uae.json
Export regression CSV (per-case + summary):
- python3 04_benchmark.py --test-set dev_support/tests/batches_legacy/test_set_B01_fixed.json --out-csv dev_support/reports/benchmark_B01_cases.csv --out-summary-csv dev_support/reports/benchmark_B01_summary.csv
Core in-corpus benchmark (HK/SG/UK/UAE):
- python3 04_benchmark.py --test-set tests/manual/test_set_core_hk_sg_uk_uae.json --out-csv reports/core_hk_sg_uk_uae_cases.csv --out-summary-csv reports/core_hk_sg_uk_uae_summary.csv

Incremental Refresh

Incrementally process new PDFs and append only new chunks to FAISS:
- python3 dev_support/scripts/refresh_sac_index_from_raw.py --jurisdictions sg,uk,uae
If the JSONL already exists and you only need to embed and append the new chunks:
- python3 dev_support/scripts/refresh_sac_index_from_raw.py --jurisdictions sg,uk,uae --skip-clean
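
Appending only new chunks comes down to skipping anything already recorded in the index metadata. The sketch below shows that filtering step in isolation. It is not the logic of refresh_sac_index_from_raw.py, and the chunk_id key is an assumed field name that may not match what indexes/meta.jsonl actually stores.

```python
import json
from typing import Dict, Iterable, List, Set


def load_known_ids(meta_lines: Iterable[str], key: str = "chunk_id") -> Set[str]:
    """Collect ids from meta.jsonl-style lines (one JSON object per line)."""
    known: Set[str] = set()
    for line in meta_lines:
        line = line.strip()
        if line:  # tolerate blank lines
            known.add(json.loads(line)[key])
    return known


def select_new_chunks(chunks: List[Dict], known: Set[str], key: str = "chunk_id") -> List[Dict]:
    """Keep only chunks whose id is not yet indexed, preserving input order."""
    return [chunk for chunk in chunks if chunk[key] not in known]
```

Only the chunks returned by select_new_chunks would then need to be embedded and added to the index.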

API Smoke Regression

Run four key API checks (/jurisdictions, HK only, HK+SG mixed, and an invalid jurisdiction XX that must return HTTP 400):
- bash tools/api_smoke_regression.sh
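
For readers who prefer Python over the shell script, the POST /query checks could be sketched roughly as follows. This is not a port of tools/api_smoke_regression.sh. The request body field names (question, jurisdictions), the lower-case jurisdiction codes, the sample question, and the expectation of HTTP 200 for valid requests are assumptions; only the 400 for the invalid XX case comes from the description above.

```python
import json
import urllib.error
import urllib.request
from typing import List, Tuple


def build_query_request(base_url: str, question: str, jurisdictions: List[str]) -> urllib.request.Request:
    """Build a POST /query request. The payload field names are assumed, not confirmed."""
    payload = {"question": question, "jurisdictions": jurisdictions}
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/query",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def smoke_cases(base_url: str) -> List[Tuple[str, urllib.request.Request, int]]:
    """Return (label, request, expected HTTP status) for the three query checks."""
    question = "What licence does a crypto exchange need?"  # placeholder question
    return [
        ("HK only", build_query_request(base_url, question, ["hk"]), 200),
        ("HK+SG mixed", build_query_request(base_url, question, ["hk", "sg"]), 200),
        ("invalid XX", build_query_request(base_url, question, ["xx"]), 400),
    ]


def status_of(req: urllib.request.Request, timeout: float = 10.0) -> int:
    """Send the request and return the HTTP status, including 4xx/5xx codes."""
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code
```

With the server running on the default host and port, each case can be checked by comparing status_of(request) with the expected status for that case.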

Runtime Environment

Active Python interpreter (current working env):
- /Users/chenzheyang/anaconda3/envs/crypto_reg/bin/python
Note:
- This env has faiss and openai available.
- The system python3 may not have faiss installed.
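
A quick way to confirm that a given interpreter can see both packages, without importing them, is a check like the one below. The helper name is hypothetical; it assumes the packages are importable under the module names faiss and openai.

```python
import importlib.util
from typing import List, Sequence


def missing_modules(names: Sequence[str] = ("faiss", "openai")) -> List[str]:
    """Return the top-level modules that this interpreter cannot locate."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules()
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("faiss and openai are available")
```

Running it with the interpreter you intend to use shows whether that environment is suitable before starting the pipeline.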

API Environment Variables

Required:
- GPTSAPI_API_KEY
Optional:
- GPTS_BASE_URL (default: https://api.gptsapi.net/v1)
- PROXY_URL (for local proxy, if needed)
- HOST (default: 127.0.0.1)
- PORT (default: 8000)
- EMBED_MODEL (default: text-embedding-3-large)
- QA_MODEL (default: gpt-4o-mini)
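
The variables and defaults above could be read in Python along these lines. This is a minimal sketch, not the code in 06_api_server.py: the load_settings name and the returned dictionary shape are hypothetical, and the server may validate or parse these values differently.

```python
import os
from typing import Dict, Mapping, Optional

# Defaults as listed above.
DEFAULTS = {
    "GPTS_BASE_URL": "https://api.gptsapi.net/v1",
    "HOST": "127.0.0.1",
    "PORT": "8000",
    "EMBED_MODEL": "text-embedding-3-large",
    "QA_MODEL": "gpt-4o-mini",
}


def load_settings(env: Optional[Mapping[str, str]] = None) -> Dict[str, object]:
    """Read settings from env (os.environ by default), applying the listed defaults."""
    env = os.environ if env is None else env
    api_key = env.get("GPTSAPI_API_KEY")
    if not api_key:
        raise RuntimeError("GPTSAPI_API_KEY is required")
    settings: Dict[str, object] = {name: env.get(name, default) for name, default in DEFAULTS.items()}
    settings["PORT"] = int(env.get("PORT", DEFAULTS["PORT"]))  # PORT as an integer
    settings["GPTSAPI_API_KEY"] = api_key
    settings["PROXY_URL"] = env.get("PROXY_URL")  # None when no proxy is configured
    return settings
```

Passing an explicit mapping instead of relying on os.environ makes the function easy to exercise without touching the real environment.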

Handoff Quick Start

Copy env template:
- cp .env.example .env
Load env:
- set -a; source .env; set +a
Start API server:
- /Users/chenzheyang/anaconda3/envs/crypto_reg/bin/python 06_api_server.py
Run SAC benchmark:
- /Users/chenzheyang/anaconda3/envs/crypto_reg/bin/python 04_benchmark.py --test-set tests/manual/test_set_core_hk_sg_uk_uae.json
- /Users/chenzheyang/anaconda3/envs/crypto_reg/bin/python 04_benchmark.py --test-set dev_support/tests/batches_legacy/test_set_B01_fixed.json
Run baseline benchmark:
- /Users/chenzheyang/anaconda3/envs/crypto_reg/bin/python 04_baseline_benchmark.py --test-set tests/manual/test_set_core_hk_sg_uk_uae.json
