Commit 803193a

1 parent 8580d82 commit 803193a

7 files changed

Lines changed: 1564 additions & 0 deletions

File tree

README.md

Lines changed: 146 additions & 0 deletions
# VulnScan Documentation

VulnScan is designed to detect sensitive data across various file formats.
It offers a modular framework to train models using diverse algorithms,
from traditional ML classifiers to advanced Neural Networks.

This document outlines the system's naming conventions, lifecycle, and model configuration.

> [!NOTE]
> Ported in update 3.5.0 of Logicytics; the latest update from there was 3.4.2.
>
> You can find the main repo and generated files [here](https://github.com/DefinetlyNotAI/Logicytics/tree/main/CODE/vulnScan)

> [!IMPORTANT]
> Old documentation is available in the `Archived Models` directory of this [repository](https://github.com/DefinetlyNotAI/VulnScan_Data).
> That documentation covers test data, metrics, and niche features.

---

## Naming Conventions

### Model Naming Format

`Model {Type of model} .{Version}`

- **Type of Model**: Describes the training data configuration.
  - `Sense`: Sensitive data set with 50k files, each 50KB in size.
  - `SenseNano`: Test set with 5-10 files, each 5KB, used for error-checking.
  - `SenseMacro`: Large dataset with 1M files, each 10KB. This is computationally intensive, so some corners were cut in training.
  - `SenseMini`: Dataset with 10K files, each between 10-200KB. Balanced size for effective training and resource efficiency.
- **Version Format**: `{Version#}{c}{Repeat#}`
  - **Version#**: Incremented for major code updates.
  - **c**: Model identifier (e.g., NeuralNetwork, BERT). See below for codes.
  - **Repeat#**: Number of times the same model was trained without significant code changes, used to improve consistency.
  - **-F**: Suffix denoting a failed or corrupted model.
### Model Identifiers

| Code | Model Type                |
|------|---------------------------|
| `b`  | BERT                      |
| `dt` | DecisionTree              |
| `et` | ExtraTrees                |
| `g`  | GBM                       |
| `l`  | LSTM                      |
| `n`  | NeuralNetwork (preferred) |
| `nb` | NaiveBayes                |
| `r`  | RandomForestClassifier    |
| `lr` | Logistic Regression       |
| `v`  | SupportVectorMachine      |
| `x`  | XGBoost                   |
### Example

`Model Sense .1n2`:
- Dataset: `Sense` (50k files, 50KB each).
- Version: 1 (first major version).
- Model: `NeuralNetwork` (`n`).
- Repeat Count: 2 (second training run with no major code changes).
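
The naming scheme above can be parsed mechanically. A minimal sketch in Python (the regex, function name, and code table below are illustrative helpers written for this document, not part of VulnScan):

```python
import re

MODEL_CODES = {"b": "BERT", "dt": "DecisionTree", "et": "ExtraTrees",
               "g": "GBM", "l": "LSTM", "n": "NeuralNetwork",
               "nb": "NaiveBayes", "r": "RandomForestClassifier",
               "lr": "LogisticRegression", "v": "SupportVectorMachine",
               "x": "XGBoost"}

def parse_model_name(name: str) -> dict:
    """Split e.g. 'Model Sense .1n2' into dataset, version, model code, repeat."""
    # Two-letter codes come first in the alternation so "lr" is not read as "l".
    m = re.fullmatch(
        r"Model (?P<dataset>\w+) \.(?P<version>\d+)"
        r"(?P<code>lr|nb|dt|et|[bglnrvx])(?P<repeat>\d+)(?P<failed>-F)?",
        name,
    )
    if m is None:
        raise ValueError(f"Unrecognised model name: {name!r}")
    return {
        "dataset": m["dataset"],
        "version": int(m["version"]),
        "model": MODEL_CODES[m["code"]],
        "repeat": int(m["repeat"]),
        "failed": m["failed"] is not None,
    }

print(parse_model_name("Model Sense .1n2"))
# {'dataset': 'Sense', 'version': 1, 'model': 'NeuralNetwork', 'repeat': 2, 'failed': False}
```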

---

## Life Cycle Phases

### Version 1 (Deprecated)
- **Removed**: Small and weak codebase, replaced by `v3`.

1. Generate data.
2. Index paths.
3. Read paths.
4. Train models and iterate through epochs.
5. Produce outputs: data, graphs, and `.pkl` files.

---
### Version 2 (Deprecated)
- **Deprecation Reason**: Outdated methods for splitting and vectorizing data.

1. Load Data.
2. Split Data.
3. Vectorize Text.
4. Initialize Model.
5. Train Model.
6. Evaluate Model.
7. Save Model.
8. Track Progress.

---
### Version 3 (Current)
1. **Read Config**: Load model and training parameters.
2. **Load Data**: Collect and preprocess sensitive data.
3. **Split Data**: Separate into training and validation sets.
4. **Vectorize Text**: Transform textual data using `TfidfVectorizer`.
5. **Initialize Model**: Define traditional ML or Neural Network models.
6. **Train Model**: Perform iterative training using epochs.
7. **Validate Model**: Evaluate with metrics and generate classification reports.
8. **Save Model**: Persist trained models and vectorizers for reuse.
9. **Track Progress**: Log and visualize accuracy and loss trends over epochs.
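
The steps above can be sketched for a traditional ML model as follows. This is a minimal illustration only: the inline data, labels, file names (`model.pkl`, `vectorizer.pkl`), and hyperparameters are placeholders, not VulnScan's actual configuration.

```python
import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# 2. Load Data -- tiny inline stand-in for the real sensitive/benign corpus
texts = ["password=hunter2", "api_key=sk-abc123", "ssn: 078-05-1120",
         "meeting notes for tuesday", "grocery list: eggs, milk", "holiday photos"]
labels = [1, 1, 1, 0, 0, 0]  # 1 = sensitive, 0 = benign

# 3. Split Data into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(
    texts, labels, test_size=0.33, stratify=labels, random_state=0)

# 4. Vectorize Text -- fit on training data only, then reuse on validation data
vectorizer = TfidfVectorizer()
X_train_vec = vectorizer.fit_transform(X_train)
X_val_vec = vectorizer.transform(X_val)

# 5-6. Initialize and train the model
model = LogisticRegression(max_iter=1000).fit(X_train_vec, y_train)

# 7. Validate Model -- generate a classification report
print(classification_report(y_val, model.predict(X_val_vec)))

# 8. Save Model -- persist both the model and the vectorizer for reuse
joblib.dump(model, "model.pkl")
joblib.dump(vectorizer, "vectorizer.pkl")
```

Persisting the vectorizer alongside the model matters: a `.pkl` classifier is unusable without the exact vectorizer vocabulary it was trained against.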

---

## Preferred Model
**NeuralNetwork (`n`)**
- Proven to be the most effective for detecting sensitive data in the project.

---

## Notes
- **Naming System**: Helps track model versions, datasets, and training iterations for transparency and reproducibility.
- **Current Focus**: Transition to `v3` for improved accuracy, flexibility, and robust performance.

---
## Additional Features

- **Progress Tracking**: Visualizes accuracy and loss per epoch with graphs.
- **Error Handling**: Logs errors for missing files, attribute issues, or unexpected conditions.
- **Extensibility**: Supports plug-and-play integration for new algorithms or datasets.
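
As a sketch of what the progress-tracking graphs involve (the per-epoch history values here are invented, and `training_progress.png` is a placeholder name, not a file VulnScan emits):

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen, no display needed
import matplotlib.pyplot as plt

# Hypothetical per-epoch history; real values come from the training loop
history = {"accuracy": [0.71, 0.82, 0.88, 0.91, 0.93],
           "loss":     [0.64, 0.41, 0.29, 0.22, 0.18]}
epochs = range(1, len(history["accuracy"]) + 1)

# Side-by-side accuracy and loss curves, one point per epoch
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.plot(epochs, history["accuracy"], marker="o")
ax1.set(title="Model Accuracy Over Epochs", xlabel="Epoch", ylabel="Accuracy")
ax2.plot(epochs, history["loss"], marker="o", color="tab:red")
ax2.set(title="Model Loss Over Epochs", xlabel="Epoch", ylabel="Loss")
fig.tight_layout()
fig.savefig("training_progress.png")
```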

# More files

There is a repository that archives all the data used to make the model,
as well as previously trained models for you to test out
(loading scripts and vectorizers are not included).

The repository is located [here](https://github.com/DefinetlyNotAI/VulnScan_Data).

The repository contains the following directories:
- `Archived Models`: Contains the previously trained models, organized by model type, then version.
- `NN features`: Contains information about the model `.3n3` and the vectorizer used. Information includes:
  - `Documentation_Study_Network.md`: A markdown file that contains more info.
  - `Neural Network Nodes Graph.gexf`: A Gephi file that contains the model nodes and edges.
  - `Nodes and edges (GEPHI).csv`: A CSV file that contains the model nodes and edges.
  - `Statistics`: Directories made by Gephi, containing the statistics of the model nodes and edges.
  - `Feature_Importance.svg`: An SVG file that contains the feature importance of the model.
  - `Loss_Landscape_3D.html`: An HTML file that contains the 3D loss landscape of the model.
  - `Model Accuracy Over Epochs.png` and `Model Loss Over Epochs.png`: PNG files that contain the model accuracy and loss over epochs.
  - `Model state dictionary.txt`: A text file that contains the model state dictionary.
  - `Model Summary.txt`: A text file that contains the model summary.
  - `Model Visualization.png`: A PNG file that contains the model visualization.
  - `Top_90_Features.svg`: An SVG file that contains the top 90 features of the model.
  - `Vectorizer features.txt`: A text file that contains the vectorizer features.
  - `Visualize Activation.png`: A PNG file that contains the visualization of the model activation.
  - `Visualize t-SNE.png`: A PNG file that contains the model's t-SNE visualization.
  - `Weight Distribution.png`: A PNG file that contains the weight distribution of the model.
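
Since loading scripts are not included, note that models persisted with `joblib` can typically be reloaded as shown below. This sketch first trains and saves a tiny stand-in model/vectorizer pair so it is self-contained; the archived models' actual file names and formats may differ:

```python
import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Stand-in artifacts: train and save a tiny pair so the reload step is runnable.
vec = TfidfVectorizer()
X = vec.fit_transform(["password=hunter2", "grocery list"])
clf = MultinomialNB().fit(X, [1, 0])  # 1 = sensitive, 0 = benign
joblib.dump(clf, "archived_model.pkl")
joblib.dump(vec, "archived_vectorizer.pkl")

# Reload the model together with its vectorizer (placeholder file names)
model = joblib.load("archived_model.pkl")
vectorizer = joblib.load("archived_vectorizer.pkl")
print(model.predict(vectorizer.transform(["password=hunter2"])))
```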

requirements.txt

Lines changed: 13 additions & 0 deletions
joblib~=1.3.2
matplotlib~=3.10.1
torch~=2.5.1+cu124
xgboost~=2.1.4
configparser~=7.1.0
scikit-learn~=1.6.1
Faker~=36.1.1
networkx~=3.2.1
numpy~=2.2.3
plotly~=6.0.0
seaborn~=0.13.2
torchviz~=0.0.3
tqdm~=4.66.6
