This toolkit is a set of Python scripts for Natural Language Processing (NLP) analysis of ransomware notes and Indicators of Compromise (IOCs). Using a Large Language Model (LLM), specifically BERT, it lets researchers profile threat actors by their linguistic patterns and cluster similar ransomware families to support attribution.
Clusters and compares ransom letters originally collected by RansomLook.
Created for RECon 2026 by students and faculty at the Ohio State University.
This project utilizes the following NLP and Machine Learning methodologies:
- Vector Embeddings: Uses a pre-trained BERT model (bert-base-uncased) to convert text documents (ransom notes) into high-dimensional vectors (768 dimensions), capturing semantic meaning beyond simple keyword matching.
- Unsupervised Clustering: Implements DBSCAN (Density-Based Spatial Clustering of Applications with Noise) to identify groups of ransom notes that share high semantic similarity, potentially indicating shared authorship or lineage between different APT groups.
- Sequence Matching: Utilizes difflib for token-level comparison to visualize exact word overlaps and divergences between two specific documents.
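
As a rough illustration of the sequence-matching idea, the sketch below lists where two short texts overlap and where they diverge at the word level. It is not taken from compare.py: the `token_diff` helper, the whitespace tokenization, and both sample notes are assumptions made up for this example.

```python
import difflib


def token_diff(text_a: str, text_b: str):
    """Return (tag, tokens_a, tokens_b) triples describing how A differs from B."""
    a = text_a.lower().split()  # naive whitespace tokenization (an assumption)
    b = text_b.lower().split()
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    # Each opcode is (tag, i1, i2, j1, j2); tag is one of
    # 'equal', 'replace', 'delete', 'insert'.
    return [(tag, a[i1:i2], b[j1:j2])
            for tag, i1, i2, j1, j2 in matcher.get_opcodes()]


# Hypothetical snippets, not real ransom notes from the dataset.
note_a = "your files are encrypted contact us to recover them"
note_b = "your data are encrypted contact our team to recover them"

for tag, left, right in token_diff(note_a, note_b):
    print(f"{tag:8} {' '.join(left)!r} -> {' '.join(right)!r}")
```

Runs tagged `equal` are the exact overlaps; `replace`, `insert`, and `delete` mark the divergences.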
- Python 3.x
- A CUDA-enabled GPU or Apple Silicon (MPS) is recommended for BERT inference, though the code falls back to the CPU.
- Clone the repository.
- Create a virtual environment for your OS, then activate it in your shell.
- Install the dependencies:
pip install -r requirements.txt

1. Semantic Clustering (cluster.py)
Runs the clustering and generates a visualization plot.
Method: Analyzes all .txt files located in data/ransoms/. It generates a BERT embedding for each file and clusters the embeddings with DBSCAN to find semantic overlaps. It then uses PCA to project the embeddings down to 2D for a scatter plot.
- Method: Loads bert-base-uncased, computes embeddings (CLS token), and applies DBSCAN (eps=3.7, min_samples=2).
- Command: ./cluster.py
- Output: Prints the lists of grouped letters for each cluster and opens a matplotlib window displaying the 2D PCA visualization.
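
To show what the eps and min_samples parameters control, here is a from-scratch DBSCAN run on toy 2D points. This is only a sketch and not the code in cluster.py, which operates on 768-dimensional BERT embeddings and may well call a library implementation. The Euclidean metric, the toy points, and the small eps used here are all assumptions.

```python
import math

NOISE = -1


def dbscan(points, eps, min_samples):
    """Label each point with a cluster id (0, 1, ...) or NOISE (-1)."""

    def neighbors(i):
        # Indices of all points within eps of point i (including i itself).
        return [j for j, q in enumerate(points)
                if math.dist(points[i], q) <= eps]

    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_samples:
            labels[i] = NOISE  # may still be claimed later as a border point
            continue
        cluster += 1  # point i is a core point: start a new cluster
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins but does not expand
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nbrs = neighbors(j)
            if len(nbrs) >= min_samples:  # j is also a core point: expand
                queue.extend(nbrs)
    return labels


# Toy data: two tight groups and one isolated point.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (50, 50)]
print(dbscan(points, eps=1.5, min_samples=2))  # [0, 0, 0, 1, 1, -1]
```

With min_samples=2, any two notes closer than eps form a cluster, and a note with no neighbor within eps is reported as noise (-1) rather than forced into a group.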
2. Visual Comparison (compare.py)
Performs a direct comparison between two text files to visualize similarity.
- Method: Uses difflib.SequenceMatcher to compute per-word similarity scores and generates an HTML heatmap.
- Command: ./compare.py <letter-A> <letter-B>
  Then open highlight.html in your browser.
- Arguments:
  - <letter-A>, <letter-B>: Paths to the text files.
  - --out: Set output HTML filename (default: highlight.html).
  - --max-tokens: Trims files to a specific token count for performance (default: 2000).
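
The following is a simplified sketch of how such a comparison could be rendered as HTML. It is not the code in compare.py: the `highlight_html` helper, the whitespace tokenization, the colors, and the binary shared/not-shared marking (instead of graded per-word scores) are all assumptions. The `max_tokens` parameter mirrors the trimming behavior of --max-tokens.

```python
import difflib
import html


def highlight_html(text_a, text_b, max_tokens=2000):
    """Return (html_page, similarity_ratio) for the tokens of text_a."""
    # Trim both inputs first, since SequenceMatcher slows down on long inputs.
    a = text_a.split()[:max_tokens]
    b = text_b.split()[:max_tokens]
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)

    # Mark every token of A that lies inside a block also found in B.
    shared = [False] * len(a)
    for block in matcher.get_matching_blocks():
        for k in range(block.a, block.a + block.size):
            shared[k] = True

    spans = []
    for token, is_shared in zip(a, shared):
        color = "#b6f2b6" if is_shared else "#f2b6b6"  # green = shared
        spans.append(
            f'<span style="background:{color}">{html.escape(token)}</span>'
        )
    page = "<!doctype html><meta charset='utf-8'><p>" + " ".join(spans) + "</p>"
    return page, matcher.ratio()


# Hypothetical inputs, not real ransom notes.
page, ratio = highlight_html("pay now <b>", "pay later")
print(f"similarity: {ratio:.2f}")
```

Writing `page` to a file and opening it in a browser gives a view comparable in spirit to highlight.html. Tokens are escaped with html.escape so that markup inside a ransom note is displayed as text, not interpreted by the browser.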
./compare.py data/ransoms/blacklock-0.txt data/ransoms/eldorado-1.txt

3. Data Collection (fetch.py)
Note: This step is not needed to reproduce our experiment.
Fetches ransomware notes from ransomlook.io. It scrapes the top-level directory for group links and iterates through them to download individual ransom notes. To use our structure, you could move newly downloaded files into data/ (or compare directly from the letters/ directory).
- Output: Creates a letters/ directory containing the newly downloaded text files.
- Command: ./fetch.py
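
The link-gathering step of a directory scrape can be sketched with the standard library alone. This is not the code in fetch.py, and it makes no network requests: the `LinkCollector` class, the `collect_links` helper, the placeholder base URL, and the sample listing are all hypothetical, and the real page layout on ransomlook.io may differ.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect absolute URLs from the <a href="..."> tags of one page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        # Skip sort links, anchors, and the parent-directory entry.
        if href and not href.startswith(("?", "#", "../")):
            self.links.append(urljoin(self.base_url, href))


def collect_links(page_html, base_url):
    parser = LinkCollector(base_url)
    parser.feed(page_html)
    return parser.links


# Hypothetical directory listing; a real run would download this page first.
base = "https://example.invalid/notes/"
listing = (
    '<a href="../">Parent</a>'
    '<a href="groupA/">groupA/</a>'
    '<a href="groupB/">groupB/</a>'
)
for url in collect_links(listing, base):
    print(url)
```

A scraper along these lines would call `collect_links` once on the top-level page to find the group directories, then once per group page to find the individual note files to download.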
