Graph-Based Fraud Intelligence Platform

Detecting money laundering at scale using Knowledge Graphs, Graph Analytics, and LLM Investigation

DATA 298A · San José State University · Team 2 · Spring 2026

Why This Exists

An estimated $3.1 trillion in illicit funds moves through the global financial system every year. Traditional fraud detection systems evaluate every transaction in isolation: one row, one decision. They cannot see the coordinated patterns that sophisticated laundering operations depend on:

Layering rings — A sends to B, B sends to C, C sends back to A. Clean.
Smurfing — 50 accounts all draining into one mule account below reporting thresholds.
Fan-out distribution — One source account rapidly distributing to dozens of recipients.

These patterns are invisible to tabular models. They are immediately obvious in a graph.
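
As a rough illustration of why these shapes are easy to find once the data is an edge list, the sketch below finds a three-account ring and a fan-in mule account in plain Python. The account names, the `three_cycles` and `fan_in_accounts` helpers, and the `min_senders` cut-off are made up for this example and are not taken from the project code.

```python
from collections import defaultdict

# Hypothetical (sender, receiver) pairs; account names are invented.
edges = [
    ("A", "B"), ("B", "C"), ("C", "A"),      # layering ring
    ("S1", "M"), ("S2", "M"), ("S3", "M"),   # smurfing into mule account M
    ("X", "Y"),                              # unrelated transfer
]

out_neighbors = defaultdict(set)
in_neighbors = defaultdict(set)
for src, dst in edges:
    out_neighbors[src].add(dst)
    in_neighbors[dst].add(src)


def three_cycles(out_neighbors):
    """Return each directed 3-cycle once, as a sorted tuple of accounts."""
    found = set()
    for a in list(out_neighbors):
        for b in out_neighbors[a]:
            for c in out_neighbors.get(b, ()):
                closes_ring = a in out_neighbors.get(c, ())
                if len({a, b, c}) == 3 and closes_ring:
                    found.add(tuple(sorted((a, b, c))))
    return found


def fan_in_accounts(in_neighbors, min_senders):
    """Accounts that receive from at least `min_senders` distinct senders."""
    return {
        account
        for account, senders in in_neighbors.items()
        if len(senders) >= min_senders
    }


rings = three_cycles(out_neighbors)
mules = fan_in_accounts(in_neighbors, min_senders=3)
print(rings, mules)
```

A row-by-row model sees seven unremarkable transfers here; the ring and the mule only appear once sender and receiver are treated as connected nodes.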

Inspired by how Palantir Gotham models the world as entities and relationships — not rows and columns — we built a Knowledge Graph on top of 31.9 million IBM AML transactions and layered four detection models on top of it.

The Core Thesis

Fraud is relational, not independent.

A transaction processed in isolation looks normal. The same transaction, viewed inside a network of 47 connected accounts moving $97,000 between 3 AM and 5 AM, looks like exactly what it is.

This platform proves that thesis in numbers:

Model Layer	PR-AUC	vs. Baseline
Tabular XGBoost (baseline)	0.3043	—
+ Degree centrality features	0.4590	+51%
+ Louvain community features	0.5599	+84%
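
The "vs. Baseline" column reads as a relative improvement over the tabular PR-AUC. A small check of that arithmetic, with the scores copied from the table and a hypothetical `relative_lift` helper:

```python
# PR-AUC values copied from the table above.
pr_auc = {
    "tabular_baseline": 0.3043,
    "plus_degree": 0.4590,
    "plus_louvain": 0.5599,
}


def relative_lift(score, baseline):
    """Relative improvement over the baseline, in percent."""
    return (score / baseline - 1.0) * 100.0


baseline = pr_auc["tabular_baseline"]
lifts = {name: round(relative_lift(value, baseline)) for name, value in pr_auc.items()}
print(lifts)  # {'tabular_baseline': 0, 'plus_degree': 51, 'plus_louvain': 84}
```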

The Dataset

Property	Detail
Source	IBM Anti-Money Laundering Dataset on Kaggle
Author	Erik Altman, IBM Research
Pattern Set	`HI-Medium` — High Illicit ratio, Medium complexity laundering patterns
Total Transactions	31,898,238
Total Accounts	1,757,942
Fraud Transactions	35,158
Fraud Rate	0.1102% (904 legitimate transactions per 1 fraudulent)
Payment Formats	ACH, Cheque, Credit Card, Wire, Bitcoin, Reinvestment
Date Range	September 2022 (27 days)

4-Layer Architecture

The platform implements four detection layers in sequence. Each layer feeds signal into the next.

┌─────────────────────────────────────────────────────────────────────┐
│                     4-LAYER FRAUD DETECTION SYSTEM                   │
│                                                                       │
│  ┌──────────────────┐                                                 │
│  │  DATA PIPELINE   │  31.9M rows → cleaned → feature-engineered     │
│  │  notebooks/01-03 │  → time-aware 70/15/15 splits (parquet)        │
│  │                  │  → ontology-shattered → Neo4j bulk load        │
│  └────────┬─────────┘                                                 │
│           │                                                           │
│           ▼                                                           │
│  ┌──────────────────┐                                                 │
│  │   LAYER 1        │  XGBoost on tabular features                   │
│  │   XGBoost        │  scale_pos_weight=904 | PR-AUC: 0.3043         │
│  │   Baseline       │  11 features: log_amount, is_ACH, hour, etc.   │
│  └────────┬─────────┘                                                 │
│           │                                                           │
│           ▼                                                           │
│  ┌──────────────────┐                                                 │
│  │   LAYER 2        │  Neo4j GDS: Degree Centrality                  │
│  │   Graph          │  in_degree, out_degree, degree_centrality       │
│  │   Features       │  per account → joined to transaction splits    │
│  └────────┬─────────┘                                                 │
│           │                                                           │
│           ▼                                                           │
│  ┌──────────────────┐                                                 │
│  │   LAYER 3        │  Neo4j GDS: Louvain Community Detection        │
│  │   Community      │  community_id, community_size,                 │
│  │   Detection      │  community_fraud_rate → PR-AUC: 0.5599 (+84%) │
│  └────────┬─────────┘                                                 │
│           │                                                           │
│           ▼                                                           │
│  ┌──────────────────┐                                                 │
│  │   LAYER 4        │  Claude LLM Investigator                       │
│  │   LLM            │  Subgraph JSON → structured report             │
│  │   Investigator   │  pattern + evidence + risk + actions           │
│  └──────────────────┘                                                 │
└─────────────────────────────────────────────────────────────────────┘

Layer 1 — XGBoost Baseline (Trainable)

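Trains on 11 tabular features extracted during data cleaning. Handles the 904:1 class imbalance via scale_pos_weight. The primary metric is PR-AUC: at a 0.11% fraud rate, a model that labels every transaction legitimate would be about 99.89% accurate while catching no fraud, so accuracy says nothing useful.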

Key features by SHAP importance: is_ACH > log_amount > is_Wire > hour > is_cross_currency
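
To make the imbalance handling and the metric choice concrete, here is a minimal pure-Python sketch on toy data. It assumes scale_pos_weight is set to the usual negatives-per-positive ratio and computes PR-AUC as step-wise average precision; the helper names and the toy scores are illustrative, not the project's training code.

```python
def scale_pos_weight(labels):
    """Negatives per positive, a common choice for XGBoost's scale_pos_weight."""
    positives = sum(labels)
    negatives = len(labels) - positives
    return negatives / positives


def average_precision(labels, scores):
    """Step-wise area under the precision-recall curve."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    total_positives = sum(labels)
    hits, area = 0, 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            area += hits / rank  # precision at the moment each positive is recalled
    return area / total_positives


# Toy data: 2 fraud rows out of 10, with hypothetical model scores.
y_true = [0, 0, 1, 0, 0, 0, 0, 1, 0, 0]
y_score = [0.1, 0.2, 0.9, 0.3, 0.8, 0.1, 0.2, 0.7, 0.1, 0.3]

weight = scale_pos_weight(y_true)
always_legit_accuracy = y_true.count(0) / len(y_true)
ap = average_precision(y_true, y_score)
print(weight, always_legit_accuracy, round(ap, 4))  # 4.0 0.8 0.8333
```

Predicting "legitimate" for every row already scores 0.8 accuracy on this toy set without finding a single fraud case, whereas average precision drops as soon as a legitimate row outranks a fraudulent one.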

Layer 2 — Graph Features (Non-Trainable)

Runs Neo4j GDS degree centrality on the full Account graph. Extracts in_degree, out_degree, total_degree, and degree_centrality per account, then joins them onto every transaction row. A hub account with 500 outgoing transactions looks completely different from a normal account with 3.
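
The join described above can be pictured with a small in-memory stand-in. This sketch counts degrees with `collections.Counter` rather than Neo4j GDS, and the `sender_` column prefix, the `degree_features` helper, and the sample transactions are assumptions for illustration. The normalized degree_centrality score is left out because its exact definition is not given here.

```python
from collections import Counter

# Hypothetical transactions: (txn_id, sender, receiver).
transactions = [
    ("t1", "hub", "a"),
    ("t2", "hub", "b"),
    ("t3", "hub", "c"),
    ("t4", "a", "b"),
]

out_degree = Counter(sender for _, sender, _ in transactions)
in_degree = Counter(receiver for _, _, receiver in transactions)


def degree_features(account):
    """Per-account degree counts; Counter yields 0 for accounts it has not seen."""
    in_d, out_d = in_degree[account], out_degree[account]
    return {"in_degree": in_d, "out_degree": out_d, "total_degree": in_d + out_d}


# Join sender-side features onto each transaction row.
rows = [
    {"txn_id": txn_id,
     **{f"sender_{name}": value for name, value in degree_features(sender).items()}}
    for txn_id, sender, _ in transactions
]
print(rows[0])
```

Every transaction sent by `hub` now carries `sender_out_degree = 3`, which is the kind of signal a purely tabular row does not contain.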

Layer 3 — Louvain Community Detection (Non-Trainable)

Runs Louvain community detection on the Account graph. Assigns every account a community_id and computes community_fraud_rate — what fraction of that community's transactions are fraudulent. An account in a community with 42% fraud rate is not the same as one in a community with 0.1% fraud rate.
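
Because community_fraud_rate is derived from the label itself, one careful way to build it is from training rows only, then look it up for validation and test rows. The README does not say which rows the project uses, so treat the sketch below as an assumption; `community_of`, `train_rows`, and both helpers are hypothetical.

```python
from collections import defaultdict

# Hypothetical inputs: account -> community_id, plus labeled training rows.
community_of = {"a": 1, "b": 1, "c": 2, "d": 2}
train_rows = [
    {"sender": "a", "is_laundering": 1},
    {"sender": "b", "is_laundering": 0},
    {"sender": "c", "is_laundering": 0},
    {"sender": "d", "is_laundering": 0},
]


def community_fraud_rates(rows, community_of):
    """Fraction of fraudulent transactions per sender community."""
    totals, frauds = defaultdict(int), defaultdict(int)
    for row in rows:
        community = community_of[row["sender"]]
        totals[community] += 1
        frauds[community] += row["is_laundering"]
    return {community: frauds[community] / totals[community] for community in totals}


rates = community_fraud_rates(train_rows, community_of)


def lookup_rate(account, default=0.0):
    """Rate for an account's community; `default` covers unseen accounts."""
    return rates.get(community_of.get(account), default)


print(rates, lookup_rate("a"), lookup_rate("never_seen"))
```

Computing the rates on the full dataset instead would let test labels leak into a training feature, which can inflate PR-AUC.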

Layer 4 — LLM Investigator (Trainable via Prompt)

Takes a pre-computed subgraph JSON (flagged by Layers 1–3) and produces a structured investigation report. Four variants across two Claude models are compared on consistency, faithfulness, and schema compliance.

LLM Evaluation Results:

Variant	Model	Schema	Faithfulness	Consistency	Meets Target
v1	claude-haiku-4-5	1.00	0.89	0.95	✓
v2	claude-sonnet-4-6	1.00	1.00	0.89	✓
v3	claude-sonnet-4-6 + enhanced prompt	1.00	1.00	0.87	✓
v4	claude-sonnet-4-6 + chain-of-thought	1.00	1.00	0.84	✓

Consistency target: Jaccard ≥ 0.8 across 3 runs. Recommended variant: v2 (composite score 0.967).
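
One plausible reading of the consistency metric is the mean pairwise Jaccard similarity between the item sets produced by repeated runs. The sketch below assumes that reading; which items the project actually compares is not stated here, so the evidence labels are invented.

```python
from itertools import combinations


def jaccard(a, b):
    """|A ∩ B| / |A ∪ B|; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def consistency(runs):
    """Mean pairwise Jaccard similarity across repeated runs."""
    pairs = list(combinations(runs, 2))
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


# Hypothetical evidence items extracted from 3 runs on the same subgraph.
runs = [
    {"fan_in", "below_threshold", "night_activity"},
    {"fan_in", "below_threshold", "night_activity"},
    {"fan_in", "below_threshold"},
]
score = consistency(runs)
meets_target = score >= 0.8
print(round(score, 4), meets_target)  # 0.7778 False
```

In this toy case a single dropped evidence item in one of three runs is enough to fall below the 0.8 target.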

Graph Ontology

Inspired by the entity-relationship modeling philosophy of Palantir Gotham — where the world is modeled as objects and connections, not tables and rows — we defined a strict ontology to represent the financial network:

                  [:SENT]                    [:RECEIVED_BY]
(Account) ─────────────────▶ (Transaction) ─────────────────▶ (Account)
    │                               │
    │ Properties:                   │ Properties:
    │  • account_id (unique)        │  • txn_id (unique)
    │  • bank_id                    │  • amount (raw USD)
    │  • degree_centrality          │  • log_amount (normalized)
    │  • louvain_community_id       │  • timestamp (datetime)
                                    │  • payment_format
                                    │  • is_laundering (ground truth)
                                    │  • hour, is_weekend
                                    │  • amount_bucket

Ontology Shattering — the flat CSV is split into exactly 4 files before loading:

File	Role	Rows
`nodes_accounts.csv`	De-duplicated Account node definitions	1,758,573
`nodes_transactions.csv`	Transaction nodes with all feature properties	29,336,378
`edges_sent.csv`	Account → Transaction pointers	29,336,378
`edges_received.csv`	Transaction → Account pointers	29,336,378

Each account exists exactly once. Loading a flat CSV directly creates duplicate nodes and corrupts cycle detection.
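
The split can be sketched in a few lines. The column names (`from_account`, `to_account`) and the `shatter` helper are assumptions for illustration; the real export reads parquet and writes CSV files, while this version keeps everything in memory to show the shape of the four outputs.

```python
# Hypothetical flat rows; column names are assumptions for illustration.
flat_rows = [
    {"txn_id": "t1", "from_account": "A", "to_account": "B", "amount": 120.0},
    {"txn_id": "t2", "from_account": "B", "to_account": "C", "amount": 80.0},
    {"txn_id": "t3", "from_account": "A", "to_account": "C", "amount": 45.5},
]


def shatter(rows):
    """Split flat transaction rows into two node sets and two edge sets."""
    accounts = {}  # keyed by account_id, so each account is stored once
    transactions, sent, received = [], [], []
    for row in rows:
        for account_id in (row["from_account"], row["to_account"]):
            accounts.setdefault(account_id, {"account_id": account_id})
        transactions.append({"txn_id": row["txn_id"], "amount": row["amount"]})
        sent.append({"account_id": row["from_account"], "txn_id": row["txn_id"]})
        received.append({"txn_id": row["txn_id"], "account_id": row["to_account"]})
    return list(accounts.values()), transactions, sent, received


nodes_accounts, nodes_transactions, edges_sent, edges_received = shatter(flat_rows)
print(len(nodes_accounts), len(nodes_transactions), len(edges_sent), len(edges_received))
```

Three rows mention accounts six times, but only three Account nodes come out, while the transaction and edge sets keep one entry per row. That matches the row counts in the table, where the three transaction-level files share the same size.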

EDA Key Findings

Signal	Finding	Implication
Class Imbalance	904:1 ratio (0.11% fraud)	Accuracy is useless — use PR-AUC
Payment Format	ACH = 87.3% of all fraud	`is_ACH` is the strongest single feature
Amount Distribution	Fraud transactions 6x larger on average	`log_amount` suppresses outlier noise
Cross-Currency	0% fraud rate on cross-currency transactions	Strong negative predictor
Temporal Patterns	Fraud spikes at 3–5 AM	`hour` and `is_weekend` are key signals

Project Structure

.
├── README.md
├── requirements.txt
├── docker-compose.yml              # Neo4j with GDS + APOC plugins
├── .env.example                    # Environment variable template
├── .gitignore
│
├── notebooks/
│   ├── 01_data_extraction.py       # Kaggle API download + validation
│   ├── 02_data_cleaning.py         # Cleaning, feature engineering, splits
│   └── 03_eda_visualizations.py    # 16-chart EDA suite
│
├── src/
│   ├── graph/
│   │   ├── export_bulk_neo4j.py    # Shatter parquet → 4 ontology CSVs
│   │   ├── bulk_load_cypher.py     # 31M node micro-batch Cypher loader (~2 hrs)
│   │   ├── run_analytics.py        # GDS: degree centrality + Louvain
│   │   └── visualize_graph.py      # PyVis interactive graph explorer
│   │
│   ├── models/
│   │   ├── 04_extract_graph_features.py   # Pull degree + Louvain from Neo4j → CSV
│   │   ├── 05_build_feature_store.py      # Join graph features onto transaction splits
│   │   ├── 06_train_xgboost_baseline.py   # Layer 1: tabular XGBoost
│   │   ├── 07_train_graph_enhanced_model.py # Layer 1+2+3: XGBoost + graph features
│   │   └── tune_threshold.py              # PR curve threshold optimisation
│   │
│   └── llm/
│       ├── investigator.py         # Layer 4: 4-variant Claude LLM investigator
│       └── evaluate.py             # Consistency + faithfulness + schema evaluation
│
├── artifacts/
│   ├── sample_subgraph.json        # Smurfing test case
│   ├── layering_ring_subgraph.json # Layering Ring test case
│   ├── fan_out_subgraph.json       # Fan-Out test case
│   ├── llm_outputs/                # Saved LLM responses per variant
│   └── metrics/                    # llm_eval.json + comparison table
│
├── data/models/                    # Gitignored — trained model artifacts
│   ├── xgboost_baseline.json
│   ├── xgboost_baseline_metrics.json
│   ├── xgboost_graph_enhanced.json
│   └── model_comparison.json
│
├── dashboard_app.py                # Streamlit intelligence dashboard
└── data/                           # Gitignored — 5.4GB dataset lives locally
    ├── processed/                  # Parquet splits
    └── neo4j_bulk_import/          # 4 ontology CSVs

How to Run

Step 1 — Environment setup

cp .env.example .env
# Edit .env and set NEO4J_PASSWORD and ANTHROPIC_API_KEY
pip install -r requirements.txt

Step 2 — Start Neo4j

docker-compose up -d
# Verify at http://localhost:7474 — login: neo4j / fraud2026

Step 3 — Data pipeline

python3 notebooks/01_data_extraction.py    # Download 7.6GB from Kaggle
python3 notebooks/02_data_cleaning.py      # Clean + engineer features + split
python3 notebooks/03_eda_visualizations.py # Generate 16 EDA charts

Step 4 — Load Knowledge Graph (~2 hours)

python3 src/graph/export_bulk_neo4j.py     # Shatter parquet → 4 CSVs
python3 src/graph/bulk_load_cypher.py      # Load 31M records into Neo4j

Step 5 — Train models

# Layer 1: XGBoost baseline
python3 src/models/06_train_xgboost_baseline.py

# Layers 2+3: Extract graph features and train enhanced model
python3 src/models/04_extract_graph_features.py
python3 src/models/05_build_feature_store.py
python3 src/models/07_train_graph_enhanced_model.py

Step 6 — Run LLM Investigator

# Run all 4 variants against a subgraph
python3 src/llm/investigator.py --variant all --input artifacts/sample_subgraph.json --runs 3

# Evaluate
python3 src/llm/evaluate.py --subgraph sample_subgraph

Step 7 — Launch Dashboard

python3 -m streamlit run dashboard_app.py
# Open http://localhost:8501

Tech Stack

Layer	Technology	Purpose
Data Sourcing	Kaggle Hub API	Authenticated dataset download
Data Lake	Apache Parquet (PyArrow)	Compressed columnar storage
Processing	Pandas 2.0, NumPy	Vectorized transformations
Graph Database	Neo4j 5.x	Knowledge graph
Graph ML	Neo4j GDS 2.27.0	Louvain + Degree Centrality
ML Layer	XGBoost 2.0 + SHAP	Fraud classification
LLM Layer	Anthropic Claude (claude-sonnet-4-6)	Structured investigation reports
Visualization	Matplotlib, PyVis	EDA + graph explorer
Dashboard	Streamlit 1.30	Intelligence interface
Container	Docker + Compose	Reproducible Neo4j environment

Current Progress

Phase 1 — Data Engineering ✅

Raw dataset extraction from Kaggle (31.9M rows, 5.45GB)
Schema cleaning, self-loop removal, null validation
Feature engineering (13 columns: log_amount, is_ACH, hour, dow, is_cross_currency, etc.)
Time-aware chronological splits (70/15/15) — zero temporal leakage
Ontology mapping — flat CSV → 4-component graph structure
31M node bulk load into Neo4j via micro-batching (~2 hours)
16-chart EDA suite
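
A time-aware split sorts by timestamp and cuts at fixed positions, so every training row precedes every validation and test row. A minimal sketch, assuming a `timestamp` field and a hypothetical `chronological_split` helper; integer percentages avoid floating-point rounding at the cut points.

```python
def chronological_split(rows, train_pct=70, val_pct=15):
    """Sort by timestamp, then cut into train/validation/test without shuffling."""
    ordered = sorted(rows, key=lambda row: row["timestamp"])
    train_end = len(ordered) * train_pct // 100
    val_end = len(ordered) * (train_pct + val_pct) // 100
    return ordered[:train_end], ordered[train_end:val_end], ordered[val_end:]


# Hypothetical rows with ISO-8601 timestamps, which sort correctly as strings.
days = [5, 1, 20, 12, 3, 27, 9, 15, 18, 7, 2, 22, 11, 25, 4, 14, 6, 17, 10, 24]
rows = [
    {"txn_id": i, "timestamp": f"2022-09-{day:02d}T00:00:00"}
    for i, day in enumerate(days)
]

train, val, test = chronological_split(rows)
print(len(train), len(val), len(test))  # 14 3 3
```

A random split would instead mix later transactions into training, letting the model see patterns from the period it is later evaluated on.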

Phase 2 — Graph Analytics ✅

Degree centrality — identify high-volume hub accounts
Louvain community detection — map fraud rings
Community fraud rate computation
Graph features joined onto transaction splits (feature store)

Phase 3 — Model Development ✅

XGBoost baseline — PR-AUC 0.3043 on tabular features
Graph-enhanced XGBoost — PR-AUC 0.4590 with degree features (+51%)
Community-enhanced XGBoost — PR-AUC 0.5599 with Louvain features (+84%)
SHAP feature importance analysis
Threshold tuning via PR curve
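
Threshold tuning on the PR curve amounts to scanning candidate cut-offs and keeping the one that best trades precision against recall. The sketch below assumes F1 as the objective, which the README does not confirm; `best_f1_threshold` and the toy scores are illustrative only.

```python
def best_f1_threshold(labels, scores):
    """Try each distinct score as a cut-off and keep the one with the highest F1."""
    best_f1, best_threshold = 0.0, None
    for threshold in sorted(set(scores)):
        predicted = [1 if score >= threshold else 0 for score in scores]
        tp = sum(1 for p, y in zip(predicted, labels) if p == 1 and y == 1)
        fp = sum(1 for p, y in zip(predicted, labels) if p == 1 and y == 0)
        fn = sum(1 for p, y in zip(predicted, labels) if p == 0 and y == 1)
        if tp == 0:
            continue  # F1 is zero when nothing is caught
        precision, recall = tp / (tp + fp), tp / (tp + fn)
        f1 = 2 * precision * recall / (precision + recall)
        if f1 > best_f1:
            best_f1, best_threshold = f1, threshold
    return best_f1, best_threshold


# Toy validation labels and hypothetical model scores.
y_true = [0, 0, 1, 0, 1, 0, 0, 1]
y_score = [0.05, 0.10, 0.90, 0.40, 0.60, 0.20, 0.55, 0.35]

f1, threshold = best_f1_threshold(y_true, y_score)
print(round(f1, 2), threshold)  # 0.8 0.6
```

The threshold would normally be chosen on the validation split and then applied unchanged to the test split.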

Phase 4 — LLM Investigator ✅

4-variant Claude investigator (Haiku, Sonnet, Sonnet+prompt, Sonnet+CoT)
Structured output schema: pattern, evidence, risk_level, actions
Evaluation framework: consistency (Jaccard), faithfulness, schema compliance
Tested on 3 subgraph types: Smurfing, Layering Ring, Fan-Out
All 4 variants meet consistency target (≥ 0.8 Jaccard)
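
Schema compliance can be checked mechanically against the four fields listed above. A minimal sketch, assuming each LLM response is a JSON string; the `schema_compliant` and `schema_score` helpers and the sample responses are hypothetical, and the real evaluator may also validate field types or allowed values.

```python
import json

REQUIRED_FIELDS = ("pattern", "evidence", "risk_level", "actions")


def schema_compliant(raw_response):
    """True if the response parses as a JSON object with every required field."""
    try:
        report = json.loads(raw_response)
    except json.JSONDecodeError:
        return False
    return isinstance(report, dict) and all(field in report for field in REQUIRED_FIELDS)


def schema_score(responses):
    """Share of responses that pass the check."""
    return sum(schema_compliant(response) for response in responses) / len(responses)


# Hypothetical responses: one complete, one missing a field, one not JSON at all.
complete = json.dumps({
    "pattern": "smurfing",
    "evidence": ["many small inbound transfers to one account"],
    "risk_level": "high",
    "actions": ["escalate for review"],
})
missing_actions = json.dumps({"pattern": "fan-out", "evidence": [], "risk_level": "medium"})
responses = [complete, missing_actions, "not json"]

score = schema_score(responses)
print(round(score, 2))  # 0.33
```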

Team

Team 2 · DATA 298A Applied Data Science Capstone · San José State University · Spring 2026

Name	Contributions
Arya Mehta	Data engineering, graph architecture, pipeline orchestration, graph-enhanced model
Aishwarya	XGBoost baseline training, LLM investigator layer, evaluation framework
Prajwal Dambalkar	Graph features, Louvain community detection, model comparison
Keith Gonsalves	Dashboard, model metrics visualisation
Om Dankhara	Supporting infrastructure

Academic Context

This project satisfies the following DATA 298A capstone sections:

3.1–3.6 — Data pipeline: collection, cleaning, transformation, splits, EDA
4.1 — XGBoost baseline model with PR-AUC evaluation
4.2 — Graph feature extraction (degree centrality, Louvain)
4.3 — Graph-enhanced model — baseline vs. graph comparison
4.4 — LLM investigator layer with structured output and multi-variant evaluation
4.5 — Pipeline demo via Streamlit dashboard

Built with Python, Neo4j, and a lot of coffee.
