Semantic Shields

Overview

This toolkit provides a set of Python scripts for Natural Language Processing (NLP) analysis of ransomware notes and Indicators of Compromise (IOCs). Using a pre-trained language model (BERT), it lets researchers profile threat actors based on linguistic patterns and cluster similar ransomware families to support attribution.

It clusters and compares ransom letters originally collected by RansomLook.

Created for RECon 2026 by students and faculty at the Ohio State University.

Methodologies

This project utilizes the following NLP and Machine Learning methodologies:

Vector Embeddings: Uses a pre-trained BERT model (bert-base-uncased) to convert text documents (ransom notes) into high-dimensional vectors (768 dimensions), capturing semantic meaning beyond simple keyword matching.
Unsupervised Clustering: Implements DBSCAN (Density-Based Spatial Clustering of Applications with Noise) to identify groups of ransom notes that share high semantic similarity, potentially indicating shared authorship or lineage between different APT groups.
Sequence Matching: Utilizes difflib for token-level comparison to visualize exact word overlaps and divergences between two specific documents.
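
The embedding step could look roughly like the sketch below. It is not the project's actual code: the function name `embed_cls`, the 512-token truncation, and the use of the Hugging Face `transformers` package are assumptions made for illustration. The heavy imports sit inside the function so the module can be imported before the model weights are downloaded.

```python
def embed_cls(texts, model_name="bert-base-uncased", max_length=512):
    """Return one CLS-token vector per input text (hypothetical helper)."""
    # Imported lazily: loading torch/transformers is slow and the first call
    # downloads the model weights.
    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)
    model.eval()  # inference only, disables dropout

    # Assumption: notes longer than max_length tokens are simply truncated.
    batch = tokenizer(
        list(texts),
        padding=True,
        truncation=True,
        max_length=max_length,
        return_tensors="pt",
    )
    with torch.no_grad():
        output = model(**batch)

    # Position 0 of the last hidden layer is the [CLS] token.
    # For bert-base-uncased each vector has 768 dimensions.
    return output.last_hidden_state[:, 0, :].cpu().numpy()
```

Calling `embed_cls(["first note", "second note"])` would return an array with one row per note, which can then be passed to a clustering step.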

Prerequisites

Python 3.x
A CUDA-enabled GPU or Apple Silicon (MPS) is recommended for BERT inference, though the code falls back to the CPU if neither is available.
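
A device fallback of this kind is commonly written as below. This is a sketch assuming PyTorch is the backend; `pick_device` is a hypothetical name and may not match how the scripts select a device.

```python
def pick_device():
    """Return "cuda", "mps", or "cpu", preferring an accelerator if present."""
    try:
        import torch
    except ImportError:
        # Assumption for this sketch only: without PyTorch, report "cpu".
        return "cpu"

    if torch.cuda.is_available():
        return "cuda"
    # torch.backends.mps exists only in newer PyTorch builds.
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"
    return "cpu"
```

The returned string can be passed to `model.to(...)` and to the input tensors before inference.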

Setup

Clone the repository.
Create a virtual environment and activate it in your shell.
Install the dependencies:

pip install -r requirements.txt

Usage Guide

1. Semantic Clustering (`cluster.py`)

Runs the clustering and generates a visualization plot.

Method: Analyzes all .txt files in data/ransoms/. It loads bert-base-uncased, computes an embedding for each file (CLS token), and clusters the embeddings with DBSCAN (eps=3.7, min_samples=2) to find semantic overlaps. It then reduces the embeddings to 2D with PCA to generate a scatter plot.
Command:

./cluster.py

Output: Prints the lists of grouped letters for each cluster and opens a matplotlib window displaying the 2D PCA visualization.
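
The clustering and projection steps can be illustrated on synthetic vectors, without loading BERT. The sketch below assumes scikit-learn and NumPy; the function name `cluster_and_project` and the file names are hypothetical, and the vectors are hand-made stand-ins for real 768-dimensional embeddings.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.decomposition import PCA


def cluster_and_project(embeddings, names, eps=3.7, min_samples=2):
    """Group names by DBSCAN label and project the vectors to 2D with PCA."""
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(embeddings)
    groups = {}
    for name, label in zip(names, labels):
        # Label -1 is DBSCAN's marker for noise (no cluster).
        groups.setdefault(int(label), []).append(name)
    coords = PCA(n_components=2).fit_transform(embeddings)
    return groups, coords


# Two tight pairs and one isolated point, all in 768 dimensions.
a0 = np.zeros(768)
a1 = a0.copy()
a1[0] = 1.0                      # distance 1 from a0, well inside eps
b0 = np.ones(768)
b1 = b0.copy()
b1[0] = 2.0                      # distance 1 from b0
outlier = np.full(768, -1.0)     # far from both pairs

vectors = np.stack([a0, a1, b0, b1, outlier])
names = ["a-0.txt", "a-1.txt", "b-0.txt", "b-1.txt", "c-0.txt"]

groups, coords = cluster_and_project(vectors, names)
for label, members in sorted(groups.items()):
    print(label, members)
```

With min_samples=2, any two notes closer than eps form a cluster, so a single near-duplicate is enough to link two files. The `coords` array holds one (x, y) row per note and can be passed to a matplotlib scatter plot.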

Example output:

2. Visual Comparison (`compare.py`)

Performs a direct comparison between two text files to visualize similarity.

Method: Uses difflib.SequenceMatcher to compute per-word similarity scores and generates an HTML heatmap.
Command:

./compare.py <letter-A> <letter-B>

Then open highlight.html in your browser.

Arguments:
<letter-A>, <letter-B>: Paths to the text files.
--out: Set output HTML filename (default: highlight.html).
--max-tokens: Trims files to a specific token count for performance (default: 2000).
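
A word-level comparison with difflib.SequenceMatcher can be sketched as follows. This is a simplified illustration, not the code in compare.py: it marks shared runs of words in the first text with `<mark>` tags instead of producing a graded heatmap, and the function name `highlight_overlap` is hypothetical.

```python
import difflib
import html


def highlight_overlap(text_a, text_b, max_tokens=2000):
    """Return (similarity ratio, HTML for text A with shared words marked)."""
    # Whitespace tokenization, trimmed to max_tokens to bound run time.
    tokens_a = text_a.split()[:max_tokens]
    tokens_b = text_b.split()[:max_tokens]

    # autojunk=False stops difflib from ignoring very frequent tokens.
    matcher = difflib.SequenceMatcher(a=tokens_a, b=tokens_b, autojunk=False)

    parts = []
    for tag, i1, i2, _j1, _j2 in matcher.get_opcodes():
        chunk = html.escape(" ".join(tokens_a[i1:i2]))
        if not chunk:
            continue  # pure insertions contribute no words from text A
        parts.append(f"<mark>{chunk}</mark>" if tag == "equal" else chunk)

    return matcher.ratio(), " ".join(parts)


ratio, markup = highlight_overlap(
    "your files are encrypted",
    "your documents are encrypted",
)
print(ratio)   # 0.75
print(markup)  # <mark>your</mark> files <mark>are encrypted</mark>
```

The ratio is twice the number of matching words divided by the total word count of both texts, so three shared words out of eight give 0.75.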

Example output (from the command below):

./compare.py data/ransoms/blacklock-0.txt data/ransoms/eldorado-1.txt

3. Data Collection (`fetch.py`)

Note: This step is not required to reproduce our experiment.

Fetches ransomware notes from ransomlook.io. It scrapes the top-level directory for group links, then iterates through them to download the individual ransom notes. To use our structure, move the newly downloaded files into data/ransoms/, the directory cluster.py reads, or compare them directly from the letters/ directory.

Output: Creates a letters/ directory containing the newly-downloaded text files.
Command:

./fetch.py
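
Moving the downloaded notes into the clustering directory can be scripted. The helper below is a hypothetical convenience, not part of the repository. It assumes fetch.py writes flat .txt files into letters/ and that cluster.py reads from data/ransoms/. It copies files instead of moving them, and skips any name that already exists at the destination.

```python
import shutil
from pathlib import Path


def stage_letters(src="letters", dst="data/ransoms"):
    """Copy new .txt notes from src into dst; return the copied file names."""
    src_dir, dst_dir = Path(src), Path(dst)
    dst_dir.mkdir(parents=True, exist_ok=True)

    copied = []
    for path in sorted(src_dir.glob("*.txt")):
        target = dst_dir / path.name
        if target.exists():
            continue  # never overwrite a note that is already staged
        shutil.copy2(path, target)
        copied.append(target.name)
    return copied
```

Calling `stage_letters()` from the repository root after running fetch.py would return the names of the files added to data/ransoms/.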

License

AGPLv3
