# VulnScan Documentation

VulnScan is designed to detect sensitive data across various file formats.
It offers a modular framework for training models with diverse algorithms,
from traditional ML classifiers to advanced neural networks.

This document outlines the system's naming conventions, lifecycle, and model configuration.

> [!NOTE]
> Ported in update 3.5.0 of Logicytics; the latest update there was 3.4.2.
>
> You can find the main repo and generated files [here](https://github.com/DefinetlyNotAI/Logicytics/tree/main/CODE/vulnScan)

> [!IMPORTANT]
> Old documentation is available in the `Archived Models` directory of this [repository](https://github.com/DefinetlyNotAI/VulnScan_Data).
> That documentation covers test data, metrics, and niche features.

---
## Naming Conventions

### Model Naming Format
`Model {Type of model} .{Version}`

- **Type of Model**: Describes the training data configuration.
  - `Sense`: Sensitive data set with 50k files, each 50KB in size.
  - `SenseNano`: Test set with 5-10 files, each 5KB, used for error-checking.
  - `SenseMacro`: Large dataset with 1M files, each 10KB. This is computationally intensive, so some corners were cut in training.
  - `SenseMini`: Dataset with 10K files, each between 10-200KB. Balanced size for effective training and resource efficiency.

- **Version Format**: `{Version#}{c}{Repeat#}`
  - **Version#**: Incremented for major code updates.
  - **c**: Model identifier (e.g., NeuralNetwork, BERT). See the codes below.
  - **Repeat#**: Number of times the same model was trained without significant code changes, used to improve consistency.
  - **-F**: Suffix denoting a failed or corrupted model.

### Model Identifiers

| Code | Model Type                |
|------|---------------------------|
| `b`  | BERT                      |
| `dt` | DecisionTree              |
| `et` | ExtraTrees                |
| `g`  | GBM                       |
| `l`  | LSTM                      |
| `n`  | NeuralNetwork (preferred) |
| `nb` | NaiveBayes                |
| `r`  | RandomForestClassifier    |
| `lr` | LogisticRegression        |
| `v`  | SupportVectorMachine      |
| `x`  | XGBoost                   |
### Example
`Model Sense .1n2`:
- Dataset: `Sense` (50k files, 50KB each).
- Version: 1 (first major version).
- Model: `NeuralNetwork` (`n`).
- Repeat Count: 2 (second training run with no major code changes).
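The naming scheme above is regular enough to parse mechanically. A minimal sketch of such a parser (the function and its field names are illustrative, not part of VulnScan itself):

```python
import re

# Matches names like "Model Sense .1n2" or "Model SenseMini .2nb1-F":
# dataset word, then ".{Version#}{code}{Repeat#}" with an optional -F failure suffix.
NAME_RE = re.compile(
    r"Model (?P<dataset>\w+) \."
    r"(?P<version>\d+)"
    r"(?P<code>[a-z]+?)"
    r"(?P<repeat>\d+)"
    r"(?P<failed>-F)?$"
)

def parse_model_name(name: str) -> dict:
    """Split a VulnScan model name into its naming-convention fields."""
    m = NAME_RE.match(name)
    if m is None:
        raise ValueError(f"not a valid model name: {name!r}")
    fields = m.groupdict()
    fields["failed"] = fields["failed"] is not None
    return fields

print(parse_model_name("Model Sense .1n2"))
# {'dataset': 'Sense', 'version': '1', 'code': 'n', 'repeat': '2', 'failed': False}
```

The lazy `[a-z]+?` lets two-letter codes such as `nb` or `lr` resolve unambiguously against the trailing repeat digits.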
---

## Life Cycle Phases

### Version 1 (Deprecated)
- **Removed**: Small and weak codebase, replaced by `v3`.

1. Generate data.
2. Index paths.
3. Read paths.
4. Train models and iterate through epochs.
5. Produce outputs: data, graphs, and `.pkl` files.

---

### Version 2 (Deprecated)
- **Deprecation Reason**: Outdated methods for splitting and vectorizing data.

1. Load Data.
2. Split Data.
3. Vectorize Text.
4. Initialize Model.
5. Train Model.
6. Evaluate Model.
7. Save Model.
8. Track Progress.

---
### Version 3 (Current)
1. **Read Config**: Load model and training parameters.
2. **Load Data**: Collect and preprocess sensitive data.
3. **Split Data**: Separate into training and validation sets.
4. **Vectorize Text**: Transform textual data using `TfidfVectorizer`.
5. **Initialize Model**: Define traditional ML or Neural Network models.
6. **Train Model**: Perform iterative training using epochs.
7. **Validate Model**: Evaluate with metrics and generate classification reports.
8. **Save Model**: Persist trained models and vectorizers for reuse.
9. **Track Progress**: Log and visualize accuracy and loss trends over epochs.
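The v3 flow maps naturally onto scikit-learn. A condensed sketch of steps 2-8, assuming toy data and a logistic-regression stand-in (VulnScan's preferred model is a neural network, and its real corpus is far larger):

```python
import pickle
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# Toy stand-in data: label 1 = sensitive, 0 = benign.
texts = [
    "password=hunter2 api_key=abc123", "ssn 123-45-6789 dob 1990-01-01",
    "meeting notes for tuesday", "grocery list milk eggs bread",
] * 10
labels = [1, 1, 0, 0] * 10

# Split, vectorize, train, then validate with a classification report.
X_train, X_val, y_train, y_val = train_test_split(
    texts, labels, test_size=0.25, random_state=0, stratify=labels)
vectorizer = TfidfVectorizer()
model = LogisticRegression()
model.fit(vectorizer.fit_transform(X_train), y_train)
print(classification_report(y_val, model.predict(vectorizer.transform(X_val))))

# Persist BOTH the model and the vectorizer, since scanning later
# requires the exact same vocabulary mapping.
with open("model.pkl", "wb") as f:
    pickle.dump({"model": model, "vectorizer": vectorizer}, f)
```

Saving the vectorizer alongside the model is what makes the `.pkl` outputs reusable: transforming new text with a freshly fitted vectorizer would produce incompatible feature columns.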
---

## Preferred Model
**NeuralNetwork (`n`)**
- Proven to be the most effective for detecting sensitive data in this project.

---

## Notes
- **Naming System**: Helps track model versions, datasets, and training iterations for transparency and reproducibility.
- **Current Focus**: Transition to `v3` for improved accuracy, flexibility, and robust performance.

---
## Additional Features

- **Progress Tracking**: Visualizes accuracy and loss per epoch with graphs.
- **Error Handling**: Logs errors for missing files, attribute issues, or unexpected conditions.
- **Extensibility**: Supports plug-and-play integration for new algorithms or datasets.
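Per-epoch progress tracking needs little more than two lists and a plot call. A minimal sketch with illustrative, hard-coded metric values (in VulnScan the real values come from the training loop, producing charts like `Model Accuracy Over Epochs.png` and `Model Loss Over Epochs.png`):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so this runs without a display
import matplotlib.pyplot as plt

# Illustrative per-epoch metrics; real values come from the training loop.
epochs = range(1, 6)
accuracy = [0.71, 0.80, 0.86, 0.90, 0.92]
loss = [0.65, 0.48, 0.36, 0.29, 0.25]

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.plot(epochs, accuracy, marker="o")
ax1.set(title="Model Accuracy Over Epochs", xlabel="Epoch", ylabel="Accuracy")
ax2.plot(epochs, loss, marker="o", color="tab:red")
ax2.set(title="Model Loss Over Epochs", xlabel="Epoch", ylabel="Loss")
fig.savefig("training_progress.png")
```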
## More Files

A companion repository archives all the data used to build the models,
as well as previously trained models for you to test out
(loading scripts and vectorizers are not included).

The repository is located [here](https://github.com/DefinetlyNotAI/VulnScan_Data).
The repository contains the following directories:
- `Archived Models`: Contains the previously trained models, organized by model type and then version.
- `NN features`: Contains information about the model `.3n3` and the vectorizer used, including:
  - `Documentation_Study_Network.md`: A markdown file with further details.
  - `Neural Network Nodes Graph.gexf`: A Gephi file containing the model's nodes and edges.
  - `Nodes and edges (GEPHI).csv`: A CSV file containing the model's nodes and edges.
  - `Statistics`: Directories generated by Gephi, containing statistics on the model's nodes and edges.
  - `Feature_Importance.svg`: An SVG file showing the model's feature importance.
  - `Loss_Landscape_3D.html`: An HTML file with the model's 3D loss landscape.
  - `Model Accuracy Over Epochs.png` and `Model Loss Over Epochs.png`: PNG files charting model accuracy and loss over epochs.
  - `Model state dictionary.txt`: A text file containing the model's state dictionary.
  - `Model Summary.txt`: A text file containing the model summary.
  - `Model Visualization.png`: A PNG visualization of the model.
  - `Top_90_Features.svg`: An SVG file showing the model's top 90 features.
  - `Vectorizer features.txt`: A text file listing the vectorizer features.
  - `Visualize Activation.png`: A PNG visualization of the model's activations.
  - `Visualize t-SNE.png`: A PNG t-SNE visualization of the model.
  - `Weight Distribution.png`: A PNG showing the model's weight distribution.