# Network Security Classification Project

[DVC](https://dvc.org/) · [MLflow](https://mlflow.org/) · [DAGsHub](https://dagshub.com/)

## 📋 Project Overview

This project implements an end-to-end machine learning pipeline for detecting and classifying network security threats. The pipeline is built with reproducibility, versioning, and tracking in mind, leveraging modern MLOps tools.

### 🔍 Key Features

- **End-to-End ML Pipeline**: Automated data ingestion, validation, transformation, and model training
- **Data Version Control**: Track and version datasets using DVC
- **Experiment Tracking**: Monitor model metrics and parameters with MLflow
- **Reproducibility**: Ensure consistent results across different environments
- **CI/CD Integration**: Automated testing and deployment workflows
- **Containerization**: Docker support for consistent deployment
- **REST API**: FastAPI-based API for real-time predictions
- **Text Classification**: Support for text-based cyber threat intelligence data
- **Multiple Training Approaches**: Support for both MongoDB-based and direct file-based training

## 🛠️ Technology Stack

- **Python**: Core programming language
- **MongoDB**: Database for storing network security data
- **Scikit-learn & XGBoost**: ML algorithms for classification
- **DVC**: Data version control
- **MLflow**: Experiment tracking and model registry
- **DAGsHub**: Collaborative MLOps platform
- **Docker**: Containerization
- **Pytest**: Testing framework
- **FastAPI**: High-performance API framework
- **Uvicorn**: ASGI server for FastAPI

## 🚀 Getting Started

### Prerequisites

- Python 3.8+ (Python 3.10 or 3.11 recommended for best compatibility)
- Git
- Docker (optional)
- MongoDB connection string
- DAGsHub account (for MLflow tracking)

### Installation

1. Clone the repository:
   ```bash
   git clone https://github.com/austinLorenzMccoy/networkSecurity_project.git
   cd networkSecurity_project
   ```

2. Create and activate a virtual environment:
   ```bash
   python -m venv venv
   source venv/bin/activate  # On Windows: venv\Scripts\activate
   ```

3. Install dependencies:
   ```bash
   pip install -r requirements.txt
   pip install -e .
   ```

4. Set up environment variables:
   ```bash
   # Create a .env file with your MongoDB connection string and DAGsHub credentials
   cp .env.template .env
   # Edit the .env file with your credentials
   ```

5. Initialize DVC:
   ```bash
   dvc init
   ```

6. Connect to DAGsHub (optional):
   ```bash
   # Set up DAGsHub as a remote
   dvc remote add origin https://dagshub.com/austinLorenzMccoy/networkSecurity_project.dvc
   ```
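
The `.env` file itself is never committed. As a sketch of what it contains after step 4 (`MONGODB_URI` and the two MLflow variables appear elsewhere in this README; the exact set of keys is defined by `.env.template`, so treat anything beyond these as an assumption):

```ini
# MongoDB connection string used by the data ingestion components and the API
MONGODB_URI=mongodb+srv://<user>:<password>@<cluster>/<database>

# DAGsHub credentials for MLflow experiment tracking
MLFLOW_TRACKING_USERNAME=your_dagshub_username
MLFLOW_TRACKING_PASSWORD=your_dagshub_token
```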

## 📊 DVC Pipeline

The project uses DVC to define and run the ML pipeline stages:

```bash
# Run the entire pipeline
dvc repro

# Run a specific stage
dvc repro -s data_ingestion
dvc repro -s data_validation
dvc repro -s data_transformation
dvc repro -s model_training

# Run the direct training pipeline (using cyber threat intelligence data)
dvc repro -s direct_training

# View pipeline visualization
dvc dag
```
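
The stage names above correspond to entries in `dvc.yaml`. As a trimmed, illustrative sketch of how one such stage is wired up (the actual commands, dependency paths, and outputs in this repository's `dvc.yaml` may differ):

```yaml
stages:
  data_ingestion:
    # hypothetical command; the real stage may invoke main.py differently
    cmd: python main.py --stage data_ingestion
    deps:
      - networksecurity/components
      - Network_Data
    outs:
      - artifact/data_ingestion
  model_training:
    cmd: python main.py --stage model_training
    deps:
      - artifact/data_transformation
    outs:
      - artifact/model_training
    metrics:
      - reports/metrics.json:
          cache: false
```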

## 📈 MLflow Tracking

MLflow is used to track experiments, including parameters, metrics, and artifacts:

```bash
# Start the MLflow UI locally
mlflow ui

# Or view experiments on DAGsHub
# Visit: https://dagshub.com/austinLorenzMccoy/networkSecurity_project.mlflow
```

### DAGsHub Integration

To enable MLflow tracking with DAGsHub:

1. Set your DAGsHub credentials in the `.env` file:
   ```
   MLFLOW_TRACKING_USERNAME=your_dagshub_username
   MLFLOW_TRACKING_PASSWORD=your_dagshub_token
   ```

2. Run the training pipeline with MLflow tracking:
   ```bash
   dvc repro direct_training
   ```

3. View your experiments on DAGsHub's MLflow interface.
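
MLflow picks up `MLFLOW_TRACKING_USERNAME` and `MLFLOW_TRACKING_PASSWORD` directly from the process environment, so the pipeline only needs to load `.env` before training starts. A minimal stdlib sketch of that loading step (the project may well use `python-dotenv`'s `load_dotenv` instead; this helper is illustrative):

```python
import os


def load_dotenv_minimal(path: str = ".env") -> None:
    """Minimal .env loader: a stand-in for python-dotenv's load_dotenv."""
    with open(path) as fh:
        for raw in fh:
            line = raw.strip()
            # Skip blank lines, comments, and anything without KEY=VALUE shape
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            # setdefault so variables already set in the real environment win
            os.environ.setdefault(key.strip(), value.strip())
```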

## 🧪 Testing

The project includes unit tests using pytest:

```bash
# Run all tests
pytest

# Run tests with coverage report
pytest --cov=networksecurity
```

## 🐳 Docker

Build and run the project using Docker:

```bash
# Build the Docker image
docker build -t network-security-project .

# Run the container
docker run -p 8000:8000 -e MONGODB_URI=your_mongodb_connection_string network-security-project
```
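
The repository's own `Dockerfile` drives this build. As a representative sketch of such a file for this stack (the real one may pin different base images, copy different paths, or use a different entry point):

```dockerfile
FROM python:3.10-slim

WORKDIR /app

# Install dependencies first so this layer is cached across code changes
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy the package and install it in editable mode
COPY . .
RUN pip install -e .

EXPOSE 8000

# MONGODB_URI is supplied at run time via `docker run -e`
CMD ["python", "app.py"]
```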

## 🌐 FastAPI Application

The project includes a FastAPI application for serving predictions:

```bash
# Run the FastAPI application
python app.py

# Or use the convenience script
bash run_api.sh
```

### API Endpoints

- **GET /health**: Check if the model is loaded and ready
- **GET /model-info**: Get information about the trained model
- **POST /predict**: Make predictions using feature vectors
- **POST /predict/text**: Make predictions using raw text input

### Example Usage

```bash
# Check health status
curl -X GET "http://localhost:8000/health"

# Get model information
curl -X GET "http://localhost:8000/model-info"

# Make a prediction with text
curl -X POST "http://localhost:8000/predict/text" \
  -H "Content-Type: application/json" \
  -d '{"text": "A new ransomware attack has been detected that encrypts files."}'
```
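
The same call can be made programmatically with nothing but the Python standard library. A small client sketch for the `/predict/text` endpoint listed above (the response schema is whatever `app.py` returns, so it is parsed generically as JSON here):

```python
import json
import urllib.request


def build_predict_request(
    text: str, base_url: str = "http://localhost:8000"
) -> urllib.request.Request:
    """Build the POST request for the /predict/text endpoint."""
    payload = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/predict/text",
        data=payload,
        headers={"Content-Type": "application/json"},
    )


def predict_text(text: str, base_url: str = "http://localhost:8000") -> dict:
    """Send raw text to a running API instance and return the parsed JSON prediction."""
    with urllib.request.urlopen(build_predict_request(text, base_url)) as resp:
        return json.load(resp)
```

`predict_text("A new ransomware attack has been detected...")` then mirrors the curl example, assuming the API is running locally on port 8000.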

## 📁 Project Structure

```
.
├── .dvc/                      # DVC configuration
├── .dagshub/                  # DAGsHub configuration
├── artifact/                  # Generated artifacts from pipeline
│   └── direct_training/       # Artifacts from direct training approach
├── data_schema/               # Data schema definitions
├── logs/                      # Application logs
├── Network_Data/              # Raw data (tracked by DVC)
├── networksecurity/           # Main package
│   ├── components/            # Pipeline components
│   ├── constants/             # Constants and configurations
│   ├── entity/                # Data entities and models
│   ├── exception/             # Custom exceptions
│   ├── logging/               # Logging utilities
│   ├── pipeline/              # Pipeline orchestration
│   └── utils/                 # Utility functions
├── notebooks/                 # Jupyter notebooks for exploration
├── reports/                   # Generated reports and metrics
├── tests/                     # Test cases
├── .env                       # Environment variables
├── .env.template              # Template for environment variables
├── .gitignore                 # Git ignore file
├── app.py                     # FastAPI application
├── custom_model_trainer.py    # Custom model trainer implementation
├── dvc.yaml                   # DVC pipeline definition
├── Dockerfile                 # Docker configuration
├── main.py                    # Main entry point
├── pytest.ini                 # Pytest configuration
├── README.md                  # Project documentation
├── requirements.txt           # Python dependencies
├── run_api.sh                 # Script to run the FastAPI application
├── setup.py                   # Package setup file
└── train_with_components.py   # Direct training script using components
```

## 🔄 CI/CD Integration

The project is set up with GitHub Actions for CI/CD:

- **Continuous Integration**: Automated testing on pull requests
- **Continuous Deployment**: Automatic model training and evaluation
- **DVC and MLflow Integration**: Track experiments and data versions

## 📝 License

This project is licensed under the MIT License - see the LICENSE file for details.

## 👥 Contributors

- Augustine Chibueze - [GitHub](https://github.com/austinLorenzMccoy)

## 🙏 Acknowledgements

- [DVC](https://dvc.org/) for data version control
- [MLflow](https://mlflow.org/) for experiment tracking
- [DAGsHub](https://dagshub.com/) for MLOps collaboration