11# VulnScan Documentation
22
3- VulnScan is designed to detect sensitive data across various file formats.
4- It offers a modular framework to train models using diverse algorithms,
5- from traditional ML classifiers to advanced Neural Networks.
3+ VulnScan is designed to detect sensitive data across various file formats.
4+ It offers a modular framework to train models using diverse algorithms,
5+ from traditional ML classifiers to advanced Neural Networks.
66
77This document outlines the system's naming conventions, lifecycle, and model configuration.
88
99> [!NOTE]
1010> Ported in update 3.5.0 of Logicytics - Latest update from there was 3.4.2
11- >
11+ >
1212> You can find the main repo and generated files [here](https://github.com/DefinetlyNotAI/Logicytics/tree/main/CODE/vulnScan)
1313
1414> [!IMPORTANT]
1515> Old documentation is available in the `Archived Models` directory of this [repository](https://github.com/DefinetlyNotAI/VulnScan_Data)
16- >
16+ >
1717> This documentation covers test data, metrics, and niche features.
1818
1919---
@@ -24,16 +24,16 @@ This document outlines the system's naming conventions, lifecycle, and model con
2424` Model {Type of model} .{Version} `
2525
2626- ** Type of Model** : Describes the training data configuration.
27- - ` Sense ` : Sensitive data set with 50k files, each 50KB in size.
28- - ` SenseNano ` : Test set with 5-10 files, each 5KB, used for error-checking.
29- - ` SenseMacro ` : Large dataset with 1M files, each 10KB. This is computationally intensive, so some corners were cut in training.
30- - ` SenseMini ` : Dataset with 10K files, each between 10-200KB. Balanced size for effective training and resource efficiency.
27+ - ` Sense ` : Sensitive data set with 50k files, each 50KB in size.
28+ - ` SenseNano ` : Test set with 5-10 files, each 5KB, used for error-checking.
29+ - ` SenseMacro ` : Large dataset with 1M files, each 10KB. This is computationally intensive, so some corners were cut in training.
30+ - ` SenseMini ` : Dataset with 10K files, each between 10-200KB. Balanced size for effective training and resource efficiency.
3131
3232- ** Version Format** : ` {Version#}{c}{Repeat#} `
33- - ** Version#** : Increment for major code updates.
34- - ** c** : Model identifier (e.g., NeuralNetwork, BERT, etc.). See below for codes.
35- - ** Repeat#** : Number of times the same model was trained without significant code changes, used to improve consistency.
36- - ** -F** : Denotes a failed model or a corrupted model.
33+ - ** Version#** : Increment for major code updates.
34+ - ** c** : Model identifier (e.g., NeuralNetwork, BERT, etc.). See below for codes.
35+ - ** Repeat#** : Number of times the same model was trained without significant code changes, used to improve consistency.
36+ - ** -F** : Denotes a failed model or a corrupted model.
3737
3838### Model Identifiers
3939
@@ -52,7 +52,7 @@ This document outlines the system's naming conventions, lifecycle, and model con
5252| ` x ` | XGBoost |
5353
5454### Example
55- ` Model Sense .1n2 ` :
55+ ` Model Sense .1n2 ` :
5656- Dataset: ` Sense ` (50k files, 50KB each).
5757- Version: 1 (first major version).
5858- Model: ` NeuralNetwork ` (` n ` ).
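
The naming scheme above can be checked mechanically. The following is a minimal sketch, not part of VulnScan itself: `parse_model_name`, `ModelName`, and `MODEL_CODES` are hypothetical names introduced for illustration, the mapping holds only the two identifiers visible in this document (`n` and `x`), and treating `-F` as a trailing suffix is an assumption about where the failure marker is written.

```python
import re
from dataclasses import dataclass

# Only the identifiers shown in this document; the full table has more entries.
MODEL_CODES = {"n": "NeuralNetwork", "x": "XGBoost"}

# Assumed layout: "Model <Dataset> .<Version#><c><Repeat#>" with an optional "-F" suffix.
_NAME_RE = re.compile(
    r"^Model\s+(?P<dataset>\w+)\s*\."
    r"(?P<version>\d+)(?P<code>[A-Za-z]+)(?P<repeat>\d+)"
    r"(?P<failed>-F)?$"
)


@dataclass(frozen=True)
class ModelName:
    dataset: str   # e.g. "Sense", "SenseMini"
    version: int   # incremented on major code updates
    code: str      # model identifier, e.g. "n"
    repeat: int    # retraining count without significant code changes
    failed: bool   # True when the name carries the "-F" marker

    @property
    def model(self) -> str:
        """Full model name, or "unknown" if the code is not in MODEL_CODES."""
        return MODEL_CODES.get(self.code, "unknown")


def parse_model_name(name: str) -> ModelName:
    """Split a model name into its parts; raise ValueError if it does not match."""
    match = _NAME_RE.match(name.strip())
    if match is None:
        raise ValueError(f"not a valid model name: {name!r}")
    return ModelName(
        dataset=match["dataset"],
        version=int(match["version"]),
        code=match["code"],
        repeat=int(match["repeat"]),
        failed=match["failed"] is not None,
    )


if __name__ == "__main__":
    print(parse_model_name("Model Sense .1n2"))
```

Unknown identifiers are reported as `"unknown"` rather than rejected, so the sketch keeps working for codes that are not listed here.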
@@ -101,7 +101,7 @@ This document outlines the system's naming conventions, lifecycle, and model con
101101---
102102
103103## Preferred Model
104- ** NeuralNetwork (` n ` )**
104+ ** NeuralNetwork (` n ` )**
105105- Proven to be the most effective for detecting sensitive data in the project.
106106
107107---
@@ -121,27 +121,27 @@ This document outlines the system's naming conventions, lifecycle, and model con
121121
122122# More files
123123
124- There is a repository that archived all the data used to make the model,
125- as well as previously trained models for you to test out
126- (loading scripts and vectorizers are not included).
124+ There is a repository that archived all the data used to make the model,
125+ as well as previously trained models for you to test out
126+ (loading scripts and vectorizers are not included).
127127
128128The repository is located [here](https://github.com/DefinetlyNotAI/VulnScan_Data).
129129
130130The repository contains the following directories:
131131- ` Archived Models ` : Contains the previously trained models, organized by model type, then version.
132132- ` NN features ` : Contains information about the model ` .3n3 ` and the vectorizer used. This information includes:
133- - ` Documentation_Study_Network.md ` : A markdown file that contains more info.
134- - ` Neural Network Nodes Graph.gexf ` : A Gephi file that contains the model nodes and edges.
135- - ` Nodes and edges (GEPHI).csv ` : A CSV file that contains the model nodes and edges.
136- - ` Statistics ` : Directories made by Gephi, containing the statistics of the model nodes and edges.
137- - ` Feature_Importance.svg ` : A SVG file that contains the feature importance of the model.
138- - ` Loss_Landscape_3D.html ` : A HTML file that contains the 3D loss landscape of the model.
139- - ` Model Accuracy Over Epochs.png ` and ` Model Loss Over Epochs.png ` : PNG files that contain the model accuracy and loss over epochs.
140- - ` Model state dictionary.txt ` : A text file that contains the model state dictionary.
141- - ` Model Summary.txt ` : A text file that contains the model summary.
142- - ` Model Visualization.png ` : A PNG file that contains the model visualization.
143- - ` Top_90_Features.svg ` : A SVG file that contains the top 90 features of the model.
144- - ` Vectorizer features.txt ` : A text file that contains the vectorizer features.
145- - ` Visualize Activation.png ` : A PNG file that contains the visualization of the model activation.
146- - ` Visualize t-SNE.png ` : A PNG file that contains the visualization of the model t-SNE.
147- - ` Weight Distribution.png ` : A PNG file that contains the weight distribution of the model.
133+ - ` Documentation_Study_Network.md ` : A markdown file that contains more info.
134+ - ` Neural Network Nodes Graph.gexf ` : A Gephi file that contains the model nodes and edges.
135+ - ` Nodes and edges (GEPHI).csv ` : A CSV file that contains the model nodes and edges.
136+ - ` Statistics ` : Directories made by Gephi, containing the statistics of the model nodes and edges.
137+ - ` Feature_Importance.svg ` : A SVG file that contains the feature importance of the model.
138+ - ` Loss_Landscape_3D.html ` : A HTML file that contains the 3D loss landscape of the model.
139+ - ` Model Accuracy Over Epochs.png ` and ` Model Loss Over Epochs.png ` : PNG files that contain the model accuracy and loss over epochs.
140+ - ` Model state dictionary.txt ` : A text file that contains the model state dictionary.
141+ - ` Model Summary.txt ` : A text file that contains the model summary.
142+ - ` Model Visualization.png ` : A PNG file that contains the model visualization.
143+ - ` Top_90_Features.svg ` : A SVG file that contains the top 90 features of the model.
144+ - ` Vectorizer features.txt ` : A text file that contains the vectorizer features.
145+ - ` Visualize Activation.png ` : A PNG file that contains the visualization of the model activation.
146+ - ` Visualize t-SNE.png ` : A PNG file that contains the visualization of the model t-SNE.
147+ - ` Weight Distribution.png ` : A PNG file that contains the weight distribution of the model.