TradeGuard AI Testing Framework

Overview

The TradeGuard AI Testing Framework is an AI-powered quality engineering framework for validating Large Language Model (LLM) applications in the Capital Markets domain. It uses Python, Pytest, and DeepEval to evaluate the accuracy, consistency, and regulatory compliance of AI-generated responses about market surveillance and trade monitoring.

The primary objective of the framework is to ensure that AI systems can correctly identify and explain suspicious trading behaviours while maintaining alignment with financial regulations and market abuse guidelines.

Business Context

Financial institutions, investment banks, and regulatory technology teams increasingly use AI-powered systems to assist analysts in detecting potential market abuse and suspicious trading activities.

Regulators and regulations such as the Monetary Authority of Singapore (MAS) and the EU's Markets in Financial Instruments Directive II (MiFID II) require firms to implement effective surveillance mechanisms capable of identifying manipulative trading practices and maintaining market integrity.

This framework provides a structured approach to validating AI-driven surveillance use cases against realistic trading scenarios.

Key Detection Scenarios

The framework evaluates AI responses against various market manipulation patterns, including:

1. Wash Trades

Detection of transactions where there is no genuine change in beneficial ownership despite apparent trading activity.
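
As an illustration of the definition above, here is a minimal rule-based sketch of a wash trade check. The `Trade` record and its field names (`buyer_owner`, `seller_owner`, and so on) are assumptions made for this example, not the schema of `synthetic_trades.json`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Trade:
    """Hypothetical trade record; field names are illustrative only."""
    trade_id: str
    instrument: str
    buyer_owner: str   # beneficial owner on the buy side
    seller_owner: str  # beneficial owner on the sell side
    quantity: int
    price: float


def is_wash_trade(trade: Trade) -> bool:
    """Flag a trade whose buyer and seller resolve to the same beneficial owner."""
    return trade.buyer_owner == trade.seller_owner


wash = Trade("T1", "XYZ", "OWNER-A", "OWNER-A", 1_000, 35.20)
normal = Trade("T2", "XYZ", "OWNER-A", "OWNER-B", 1_000, 35.20)
print(is_wash_trade(wash), is_wash_trade(normal))  # True False
```

A deterministic check like this can serve as the expected answer that the AI response is compared against.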

2. Layering

Identification of multiple deceptive orders placed at different price levels to create a false impression of market demand or supply.
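
One way to approximate this pattern in a rule-based check is to look for a trader who cancelled orders at several distinct price levels on the same side of the book. The sketch below assumes hypothetical order keys (`trader`, `side`, `price`, `status`) and an arbitrary threshold of three levels; real surveillance rules would be considerably richer.

```python
from collections import defaultdict


def layering_candidates(orders, min_levels=3):
    """Return (trader, side) pairs with cancelled orders at >= min_levels distinct prices.

    `orders` is a list of dicts with hypothetical keys:
    trader, side ("BUY"/"SELL"), price, status ("CANCELLED"/"FILLED"/"OPEN").
    """
    levels = defaultdict(set)
    for order in orders:
        if order["status"] == "CANCELLED":
            levels[(order["trader"], order["side"])].add(order["price"])
    return sorted(key for key, prices in levels.items() if len(prices) >= min_levels)


orders = [
    {"trader": "TR-1", "side": "SELL", "price": 10.10, "status": "CANCELLED"},
    {"trader": "TR-1", "side": "SELL", "price": 10.15, "status": "CANCELLED"},
    {"trader": "TR-1", "side": "SELL", "price": 10.20, "status": "CANCELLED"},
    {"trader": "TR-1", "side": "BUY", "price": 10.00, "status": "FILLED"},
    {"trader": "TR-2", "side": "BUY", "price": 9.95, "status": "CANCELLED"},
]
print(layering_candidates(orders))  # [('TR-1', 'SELL')]
```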

3. Spoofing

Detection of large non-bona fide orders intended to manipulate market perception and subsequently cancelled before execution.
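
A simple heuristic for this scenario might treat an order as a spoofing candidate when it is much larger than typical, receives no fills, and is cancelled shortly after placement. The order keys, the size multiple, and the lifetime limit below are all assumptions chosen for illustration.

```python
def is_spoof_candidate(order, typical_size, size_multiple=10.0, max_lifetime_s=5.0):
    """Heuristic: a large order, cancelled with no fills, shortly after placement.

    `order` is a dict with hypothetical keys: quantity, filled_quantity,
    status, placed_at_s and cancelled_at_s (seconds since session start).
    """
    if order["status"] != "CANCELLED" or order["filled_quantity"] > 0:
        return False
    is_large = order["quantity"] >= size_multiple * typical_size
    is_short_lived = (order["cancelled_at_s"] - order["placed_at_s"]) <= max_lifetime_s
    return is_large and is_short_lived


large_cancel = {
    "quantity": 50_000, "filled_quantity": 0, "status": "CANCELLED",
    "placed_at_s": 0.0, "cancelled_at_s": 1.2,
}
genuine = {
    "quantity": 50_000, "filled_quantity": 50_000, "status": "FILLED",
    "placed_at_s": 0.0, "cancelled_at_s": 0.0,
}
print(is_spoof_candidate(large_cancel, typical_size=1_000))  # True
print(is_spoof_candidate(genuine, typical_size=1_000))       # False
```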

4. Normal Trades

Validation that legitimate trades are not incorrectly flagged, i.e. a guard against false positives.
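
To make the false positive idea concrete, the following sketch computes the share of legitimate trades that were wrongly flagged. It assumes that expected and predicted decisions are available as parallel lists of booleans (`True` meaning flagged); this representation is an assumption of the example.

```python
def false_positive_rate(expected_flags, predicted_flags):
    """Share of legitimate trades (expected False) that were flagged (predicted True)."""
    if len(expected_flags) != len(predicted_flags):
        raise ValueError("expected and predicted must have the same length")
    # Keep only the predictions made for trades that should NOT be flagged.
    negatives = [pred for exp, pred in zip(expected_flags, predicted_flags) if not exp]
    if not negatives:
        return 0.0
    return sum(negatives) / len(negatives)


expected = [True, False, False, False, True, False]
predicted = [True, False, True, False, True, False]
print(false_positive_rate(expected, predicted))  # 0.25
```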

Architecture

Trade Dataset (JSON)
        ↓
Target LLM (Ollama / OpenAI)
        ↓
Generated Trade Analysis
        ↓
DeepEval Evaluation Framework
        ↓
LLM Judge Assessment (OpenAI GPT-4o)
        ↓
Pass / Fail Decision + Metrics Score
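
The flow above can be sketched end to end with stand-in functions. In this example `fake_target_llm` and `fake_judge` are hypothetical placeholders for the Ollama/OpenAI target model and the GPT-4o judge, the `trade_id` field is an assumed dataset key, and the 0.7 threshold is an arbitrary illustrative value.

```python
import json
from typing import Callable


def run_pipeline(
    dataset_json: str,
    target_llm: Callable[[str], str],
    judge: Callable[[str, str], float],
    threshold: float = 0.7,
):
    """Send each trade to the target model, score the answer, and decide pass/fail."""
    results = []
    for trade in json.loads(dataset_json):
        prompt = (
            "Analyse this trade and answer FLAG: YES or FLAG: NO.\n"
            + json.dumps(trade)
        )
        analysis = target_llm(prompt)
        score = judge(prompt, analysis)
        results.append(
            {"trade_id": trade["trade_id"], "score": score, "passed": score >= threshold}
        )
    return results


# Stand-ins so the sketch runs offline; a real run would call the models here.
def fake_target_llm(prompt: str) -> str:
    return "FLAG: YES - buyer and seller share a beneficial owner."


def fake_judge(prompt: str, analysis: str) -> float:
    return 1.0 if "FLAG:" in analysis else 0.0


dataset = '[{"trade_id": "T1", "buyer_owner": "OWNER-A", "seller_owner": "OWNER-A"}]'
print(run_pipeline(dataset, fake_target_llm, fake_judge))
# [{'trade_id': 'T1', 'score': 1.0, 'passed': True}]
```

Passing the model and the judge in as callables keeps the pipeline testable without network access.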

Evaluation Approach — LLM-as-a-Judge

The framework adopts the LLM-as-a-Judge evaluation methodology.

For each surveillance scenario:

1. Trade data is supplied as structured JSON input.
2. The target AI model analyses the trading activity.
3. DeepEval compares the generated response against predefined expectations.
4. An independent evaluator LLM scores the response based on:
   - Answer Relevancy: Is the response relevant to the trade description?
   - Faithfulness: Is the reasoning faithful to the source data?
   - Detection Accuracy: Was the correct FLAG: YES / FLAG: NO decision made?

This approach enables scalable testing of AI systems beyond traditional rule-based assertions.
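
The FLAG: YES / FLAG: NO decision can also be checked deterministically alongside the judge's score. The sketch below is one possible approach using only the standard library, not the framework's actual metric implementation; the function names are hypothetical, and responses without a recognisable flag are counted as incorrect.

```python
import re

_FLAG_PATTERN = re.compile(r"FLAG:\s*(YES|NO)\b", re.IGNORECASE)


def parse_flag(response: str):
    """Return True for FLAG: YES, False for FLAG: NO, or None if no flag is present."""
    match = _FLAG_PATTERN.search(response)
    if match is None:
        return None
    return match.group(1).upper() == "YES"


def detection_accuracy(responses, expected_flags):
    """Fraction of responses whose parsed flag equals the expected flag."""
    if not responses or len(responses) != len(expected_flags):
        raise ValueError("responses and expected_flags must be non-empty and equal in length")
    # An unparseable response yields None, which never equals True or False.
    correct = sum(parse_flag(r) == e for r, e in zip(responses, expected_flags))
    return correct / len(responses)


responses = [
    "FLAG: YES. Buyer and seller share a beneficial owner.",
    "flag: no, this looks like an ordinary client order.",
    "The trade looks suspicious.",  # no explicit flag
]
print(detection_accuracy(responses, [True, False, True]))
```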

Technology Stack

Layer	Technology
Language	Python 3.11+
Test Runner	Pytest
LLM Evaluation	DeepEval
Local LLM	Ollama (TinyLlama / Mistral)
Evaluation Judge	OpenAI GPT-4o
Data Format	JSON
CI/CD	GitHub Actions
Dependency Management	pip + requirements.txt

Project Structure

TradeGuardAI/
├── data/
│   └── synthetic_trades.json     ← Trade test dataset
├── tests/
│   ├── test_trade_evaluation.py  ← DeepEval test cases
│   └── validate_trades.py        ← Rule-based assertions
├── utils/
│   ├── read_data.py              ← Trade data loader
│   └── llm_client.py             ← LLM API client
├── evaluators/
│   └── ollama_evaluator.py       ← Custom DeepEval evaluator
├── prompts/                      ← LLM system prompts
├── reports/                      ← Test execution reports
├── conftest.py                   ← Pytest fixtures and setup
├── pytest.ini                    ← Pytest configuration
├── requirements.txt              ← Project dependencies
├── .env                          ← Local environment variables (not committed)
└── .github/
    └── workflows/
        └── ci.yml                ← GitHub Actions CI pipeline

Getting Started

Prerequisites

- Python 3.11+
- Ollama installed locally (ollama.com)
- OpenAI API key (platform.openai.com)
- Git

Installation

# Clone the repository
git clone https://github.com/subhlabh610/TradeGuardAI.git
cd TradeGuardAI

# Create and activate a virtual environment
python -m venv venv
venv\Scripts\activate        # Windows
source venv/bin/activate     # macOS/Linux

# Install dependencies
pip install -r requirements.txt

Configuration

Create a .env file in the project root:

OPENAI_API_KEY=your_openai_key_here
BASE_URL=http://localhost:11434
MODEL_NAME=tinyllama
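
A minimal sketch of how these three variables might be read in Python is shown below. It assumes the values are already present in the process environment (for example exported in the shell or loaded by a dotenv-style loader); the `Settings` class and `load_settings` function are hypothetical names, not part of the repository.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    openai_api_key: str
    base_url: str
    model_name: str


def load_settings(env=None) -> Settings:
    """Read the three variables from a mapping (defaults to os.environ)."""
    env = os.environ if env is None else env
    api_key = env.get("OPENAI_API_KEY", "")
    if not api_key:
        raise RuntimeError("OPENAI_API_KEY is not set")
    return Settings(
        openai_api_key=api_key,
        # Defaults mirror the example values above; strip any trailing slash.
        base_url=env.get("BASE_URL", "http://localhost:11434").rstrip("/"),
        model_name=env.get("MODEL_NAME", "tinyllama"),
    )


settings = load_settings({"OPENAI_API_KEY": "sk-test", "BASE_URL": "http://localhost:11434/"})
print(settings.base_url, settings.model_name)  # http://localhost:11434 tinyllama
```

Failing fast on a missing API key surfaces configuration problems before any test calls the judge model.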

Pull Ollama Model

ollama pull tinyllama
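
Once the model is pulled, it can be queried over Ollama's local HTTP API. The sketch below builds a non-streaming request to the `/api/generate` endpoint using only the standard library; the helper names are hypothetical and this is not necessarily how `utils/llm_client.py` is written. Calling `generate` requires a running Ollama server, so the example only prints the request URL.

```python
import json
import urllib.request


def build_generate_request(base_url: str, model: str, prompt: str) -> urllib.request.Request:
    """Build a non-streaming POST to Ollama's /api/generate endpoint."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(base_url: str, model: str, prompt: str, timeout_s: float = 120.0) -> str:
    """Send the request and return the generated text (needs a running Ollama server)."""
    request = build_generate_request(base_url, model, prompt)
    with urllib.request.urlopen(request, timeout=timeout_s) as response:
        return json.loads(response.read().decode("utf-8"))["response"]


request = build_generate_request("http://localhost:11434", "tinyllama", "Is this a wash trade?")
print(request.full_url)  # http://localhost:11434/api/generate
```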

Run Tests

# Run full test suite
python -m pytest

# Run with verbose output
python -m pytest -v

# Run specific test file
python -m pytest tests/test_trade_evaluation.py -v

CI/CD Pipeline

The project includes a GitHub Actions CI pipeline that:

- Triggers on every push and pull request to main
- Installs Python 3.11 and the project dependencies
- Installs Ollama and pulls the TinyLlama model
- Runs the full test suite
- Reports pass/fail results

Pipeline configuration: .github/workflows/ci.yml

Framework Benefits

- Automated validation of AI-powered surveillance systems
- Support for realistic Capital Markets trade scenarios
- Regulatory-focused testing strategy aligned with MAS and MiFID II
- Reusable Pytest test suites with fixture-based architecture
- Explainable evaluation results using DeepEval metrics
- Scalable approach for AI model benchmarking
- CI/CD integration for continuous quality assurance

Why This Matters

Banks and financial institutions must ensure that AI-driven surveillance systems can reliably identify market abuse patterns while supporting regulatory compliance, reducing operational risk, and protecting market integrity.

A single undetected wash trade or spoofing pattern can result in significant regulatory penalties, reputational damage, and market instability. This framework provides the quality assurance layer that bridges AI development and production deployment in regulated financial environments.

Author

Sulabh Gupta — Senior SDET | AI Quality Engineering | Capital Markets

LinkedIn | GitHub
