Commit cb2451f (parent 754f5f9)

Add DVC pipeline for direct training with MLflow integration and FastAPI application for model serving

8 files changed: 975 additions, 0 deletions

.env.template (3 additions, 0 deletions)

```
MONGODB_URI=mongodb+srv://username:password@cluster.mongodb.net/?retryWrites=true&w=majority
DAGSHUB_USERNAME=your_username
DAGSHUB_TOKEN=your_token
```

README.md (250 additions, 0 deletions)
# Network Security Classification Project

[![DVC](https://img.shields.io/badge/DVC-Data%20Version%20Control-945DD6?logo=dvc)](https://dvc.org/)
[![MLflow](https://img.shields.io/badge/MLflow-Platform%20for%20ML%20Lifecycle-0194E2?logo=mlflow)](https://mlflow.org/)
[![DAGsHub](https://img.shields.io/badge/DAGsHub-Hosted%20MLOps%20Platform-FF69B4)](https://dagshub.com/)

## 📋 Project Overview

This project implements an end-to-end machine learning pipeline for detecting and classifying network security threats. It is built with reproducibility, versioning, and experiment tracking in mind, leveraging modern MLOps tools.

### 🔍 Key Features

- **End-to-End ML Pipeline**: Automated data ingestion, validation, transformation, and model training
- **Data Version Control**: Track and version datasets using DVC
- **Experiment Tracking**: Monitor model metrics and parameters with MLflow
- **Reproducibility**: Ensure consistent results across different environments
- **CI/CD Integration**: Automated testing and deployment workflows
- **Containerization**: Docker support for consistent deployment
- **REST API**: FastAPI-based API for real-time predictions
- **Text Classification**: Support for text-based cyber threat intelligence data
- **Multiple Training Approaches**: Support for both MongoDB-based and direct file-based training
## 🛠️ Technology Stack

- **Python**: Core programming language
- **MongoDB**: Database for storing network security data
- **Scikit-learn & XGBoost**: ML algorithms for classification
- **DVC**: Data version control
- **MLflow**: Experiment tracking and model registry
- **DAGsHub**: Collaborative MLOps platform
- **Docker**: Containerization
- **Pytest**: Testing framework
- **FastAPI**: High-performance API framework
- **Uvicorn**: ASGI server for FastAPI
## 🚀 Getting Started

### Prerequisites

- Python 3.8+ (Python 3.10 or 3.11 recommended for best compatibility)
- Git
- Docker (optional)
- MongoDB connection string
- DAGsHub account (for MLflow tracking)

### Installation

1. Clone the repository:
```bash
git clone https://github.com/austinLorenzMccoy/networkSecurity_project.git
cd networkSecurity_project
```

2. Create and activate a virtual environment:
```bash
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
```

3. Install dependencies:
```bash
pip install -r requirements.txt
pip install -e .
```

4. Set up environment variables:
```bash
# Create a .env file from the template
cp .env.template .env
# Edit the .env file with your MongoDB connection string and DAGsHub credentials
```

5. Initialize DVC:
```bash
dvc init
```

6. Connect to DAGsHub (optional):
```bash
# Set up DAGsHub as a DVC remote
dvc remote add origin https://dagshub.com/austinLorenzMccoy/networkSecurity_project.dvc
```
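
The `.env` file created in step 4 can be loaded in several ways; the project presumably relies on a library such as python-dotenv, but a dependency-free sketch of the same idea (the `load_env` helper name is illustrative, not part of the codebase) looks like this:

```python
# Minimal .env loader sketch: parses KEY=VALUE lines into os.environ.
# Illustrative only; the project itself likely uses python-dotenv.
import os


def load_env(path=".env"):
    """Parse KEY=VALUE lines from a .env-style file and export them."""
    loaded = {}
    with open(path) as fh:
        for raw in fh:
            line = raw.strip()
            # Skip blank lines and comments
            if not line or line.startswith("#"):
                continue
            # Split on the first '=' only, so values may contain '='
            key, _, value = line.partition("=")
            if key:
                loaded[key.strip()] = value.strip()
                os.environ.setdefault(key.strip(), value.strip())
    return loaded
```

Splitting on the first `=` only matters here because values like the `MONGODB_URI` in `.env.template` contain `=` themselves.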
## 📊 DVC Pipeline

The project uses DVC to define and run the ML pipeline stages:

```bash
# Run the entire pipeline
dvc repro

# Run a specific stage
dvc repro -s data_ingestion
dvc repro -s data_validation
dvc repro -s data_transformation
dvc repro -s model_training

# Run the direct training pipeline (using cyber threat intelligence data)
dvc repro -s direct_training

# View a visualization of the pipeline
dvc dag
```
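
The stage names above are defined in `dvc.yaml`. A hypothetical excerpt is shown below; the stage names match the commands above, but the `cmd`, `deps`, and `outs` paths are illustrative assumptions, not the repository's actual definitions:

```yaml
# Hypothetical dvc.yaml excerpt (paths and commands are assumptions)
stages:
  data_ingestion:
    cmd: python main.py --stage data_ingestion
    deps:
      - Network_Data/
    outs:
      - artifact/data_ingestion/
  direct_training:
    cmd: python train_with_components.py
    deps:
      - artifact/data_transformation/
    outs:
      - artifact/direct_training/
    metrics:
      - reports/metrics.json:
          cache: false
```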
## 📈 MLflow Tracking

MLflow is used to track experiments, including parameters, metrics, and artifacts:

```bash
# Start the MLflow UI locally
mlflow ui

# Or view experiments on DAGsHub
# Visit: https://dagshub.com/austinLorenzMccoy/networkSecurity_project.mlflow
```

### DAGsHub Integration

To enable MLflow tracking with DAGsHub:

1. Set your DAGsHub credentials in the `.env` file:
```
MLFLOW_TRACKING_USERNAME=your_dagshub_username
MLFLOW_TRACKING_PASSWORD=your_dagshub_token
```

2. Run the training pipeline with MLflow tracking:
```bash
dvc repro direct_training
```

3. View your experiments on DAGsHub's MLflow interface.
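
MLflow picks these settings up from environment variables. A small sketch of how the DAGsHub values map onto them (the tracking-URI pattern follows the DAGsHub link above; the `dagshub_tracking_env` helper is illustrative and not part of the codebase):

```python
# Sketch: derive the environment variables MLflow expects for DAGsHub
# tracking from the repository coordinates and a DAGsHub token.
# The call to mlflow.set_tracking_uri() itself would live in the
# training pipeline (assumption).
import os


def dagshub_tracking_env(username, repo, token):
    """Return the MLflow tracking settings for a DAGsHub-hosted repo."""
    return {
        "MLFLOW_TRACKING_URI": f"https://dagshub.com/{username}/{repo}.mlflow",
        "MLFLOW_TRACKING_USERNAME": username,
        "MLFLOW_TRACKING_PASSWORD": token,  # the DAGsHub token acts as the password
    }


settings = dagshub_tracking_env(
    "austinLorenzMccoy",
    "networkSecurity_project",
    os.environ.get("DAGSHUB_TOKEN", ""),
)
```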
## 🧪 Testing
135+
136+
The project includes unit tests using pytest:
137+
138+
```bash
139+
# Run all tests
140+
pytest
141+
142+
# Run tests with coverage report
143+
pytest --cov=networksecurity
144+
```
145+
146+
## 🐳 Docker
147+
148+
Build and run the project using Docker:
149+
150+
```bash
151+
# Build the Docker image
152+
docker build -t network-security-project .
153+
154+
# Run the container
155+
docker run -p 8000:8000 -e MONGODB_URI=your_mongodb_connection_string network-security-project
156+
```
157+
158+
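
For reference, a minimal Dockerfile consistent with the commands above might look like the following; the base image, the `app:app` module path, and the layer order are assumptions, and the repository ships its own Dockerfile:

```dockerfile
# Illustrative Dockerfile sketch (assumptions noted above)
FROM python:3.11-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
RUN pip install --no-cache-dir -e .
EXPOSE 8000
CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "8000"]
```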
## 🌐 FastAPI Application

The project includes a FastAPI application for serving predictions:

```bash
# Run the FastAPI application
python app.py

# Or use the convenience script
bash run_api.sh
```

### API Endpoints

- **GET /health**: Check whether the model is loaded and ready
- **GET /model-info**: Get information about the trained model
- **POST /predict**: Make predictions from feature vectors
- **POST /predict/text**: Make predictions from raw text input
### Example Usage
178+
179+
```bash
180+
# Check health status
181+
curl -X GET "http://localhost:8000/health"
182+
183+
# Get model information
184+
curl -X GET "http://localhost:8000/model-info"
185+
186+
# Make a prediction with text
187+
curl -X POST "http://localhost:8000/predict/text" \
188+
-H "Content-Type: application/json" \
189+
-d '{"text": "A new ransomware attack has been detected that encrypts files."}'
190+
```
191+
192+
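
The same text-prediction call can be made from Python with only the standard library. In this sketch `API_URL` and the JSON response shape are assumptions based on the curl commands above, and no request is sent unless you call `predict_text()` yourself:

```python
# Standard-library client sketch for the POST /predict/text endpoint.
import json
import urllib.request

API_URL = "http://localhost:8000"  # assumed default from the run commands above


def build_request(text):
    """Build the POST /predict/text request without sending it."""
    body = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        f"{API_URL}/predict/text",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def predict_text(text):
    """Send the request and decode the JSON response (requires a running server)."""
    with urllib.request.urlopen(build_request(text)) as resp:
        return json.load(resp)
```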
## 📁 Project Structure

```
.
├── .dvc/                      # DVC configuration
├── .dagshub/                  # DAGsHub configuration
├── artifact/                  # Generated artifacts from pipeline
│   └── direct_training/       # Artifacts from direct training approach
├── data_schema/               # Data schema definitions
├── logs/                      # Application logs
├── Network_Data/              # Raw data (tracked by DVC)
├── networksecurity/           # Main package
│   ├── components/            # Pipeline components
│   ├── constants/             # Constants and configurations
│   ├── entity/                # Data entities and models
│   ├── exception/             # Custom exceptions
│   ├── logging/               # Logging utilities
│   ├── pipeline/              # Pipeline orchestration
│   └── utils/                 # Utility functions
├── notebooks/                 # Jupyter notebooks for exploration
├── reports/                   # Generated reports and metrics
├── tests/                     # Test cases
├── .env                       # Environment variables
├── .env.template              # Template for environment variables
├── .gitignore                 # Git ignore file
├── app.py                     # FastAPI application
├── custom_model_trainer.py    # Custom model trainer implementation
├── dvc.yaml                   # DVC pipeline definition
├── Dockerfile                 # Docker configuration
├── main.py                    # Main entry point
├── pytest.ini                 # Pytest configuration
├── README.md                  # Project documentation
├── requirements.txt           # Python dependencies
├── run_api.sh                 # Script to run the FastAPI application
├── setup.py                   # Package setup file
└── train_with_components.py   # Direct training script using components
```
## 🔄 CI/CD Integration
231+
232+
The project is set up with GitHub Actions for CI/CD:
233+
234+
- **Continuous Integration**: Automated testing on pull requests
235+
- **Continuous Deployment**: Automatic model training and evaluation
236+
- **DVC and MLflow Integration**: Track experiments and data versions
237+
238+
## 📝 License
239+
240+
This project is licensed under the MIT License - see the LICENSE file for details.
241+
242+
## 👥 Contributors
243+
244+
- Augustine Chibueze - [GitHub](https://github.com/austinLorenzMccoy)
245+
246+
## 🙏 Acknowledgements
247+
248+
- [DVC](https://dvc.org/) for data version control
249+
- [MLflow](https://mlflow.org/) for experiment tracking
250+
- [DAGsHub](https://dagshub.com/) for MLOps collaboration
