This document explains everything about this project - from basic networking concepts to the complete code architecture. After reading this, you should understand exactly how packets flow through the system without needing to read the code.
- What is DPI?
- Networking Background
- Project Overview
- File Structure
- The Journey of a Packet (Simple Version)
- The Journey of a Packet (Multi-processed Version)
- Deep Dive: Each Component
- How SNI Extraction Works
- How Blocking Works
- Building and Running
- Understanding the Output
Deep Packet Inspection (DPI) is a technology used to examine the contents of network packets as they pass through a checkpoint. Unlike simple firewalls that only look at packet headers (source/destination IP), DPI looks inside the packet payload.
- ISPs: Throttle or block certain applications (e.g., BitTorrent)
- Enterprises: Block social media on office networks
- Parental Controls: Block inappropriate websites
- Security: Detect malware or intrusion attempts
User Traffic (PCAP) → [DPI Engine] → Filtered Traffic (PCAP)
↓
- Identifies apps (YouTube, Facebook, etc.)
- Blocks based on rules
- Generates reports
When you visit a website, data travels through multiple "layers":
┌─────────────────────────────────────────────────────────┐
│ Layer 7: Application │ HTTP, TLS, DNS │
├─────────────────────────────────────────────────────────┤
│ Layer 4: Transport │ TCP (reliable), UDP (fast) │
├─────────────────────────────────────────────────────────┤
│ Layer 3: Network │ IP addresses (routing) │
├─────────────────────────────────────────────────────────┤
│ Layer 2: Data Link │ MAC addresses (local network)│
└─────────────────────────────────────────────────────────┘
Every network packet is like a Russian nesting doll - headers wrapped inside headers:
┌──────────────────────────────────────────────────────────────────┐
│ Ethernet Header (14 bytes) │
│ ┌──────────────────────────────────────────────────────────────┐ │
│ │ IP Header (20 bytes) │ │
│ │ ┌──────────────────────────────────────────────────────────┐ │ │
│ │ │ TCP Header (20 bytes) │ │ │
│ │ │ ┌──────────────────────────────────────────────────────┐ │ │ │
│ │ │ │ Payload (Application Data) │ │ │ │
│ │ │ │ e.g., TLS Client Hello with SNI │ │ │ │
│ │ │ └──────────────────────────────────────────────────────┘ │ │ │
│ │ └──────────────────────────────────────────────────────────┘ │ │
│ └──────────────────────────────────────────────────────────────┘ │
└──────────────────────────────────────────────────────────────────┘
A connection (or "flow") is uniquely identified by 5 values:
| Field | Example | Purpose |
|---|---|---|
| Source IP | 192.168.1.100 | Who is sending |
| Destination IP | 172.217.14.206 | Where it's going |
| Source Port | 54321 | Sender's application identifier |
| Destination Port | 443 | Service being accessed (443 = HTTPS) |
| Protocol | TCP (6) | TCP or UDP |
Why is this important?
- All packets with the same 5-tuple belong to the same connection
- If we block one packet of a connection, we should block all of them
- This is how we "track" conversations between computers
Server Name Indication (SNI) is part of the TLS/HTTPS handshake. When you visit https://www.youtube.com:
- Your browser sends a "Client Hello" message
- This message includes the domain name in plaintext (not encrypted yet!)
- The server uses this to know which certificate to send
TLS Client Hello:
├── Version: TLS 1.2
├── Random: [32 bytes]
├── Cipher Suites: [list]
└── Extensions:
└── SNI Extension:
└── Server Name: "www.youtube.com" ← We extract THIS!
This is the key to DPI: Even though HTTPS is encrypted, the domain name is visible in the first packet!
┌─────────────┐ ┌─────────────┐ ┌─────────────┐
│ Wireshark │ │ DPI Engine │ │ Output │
│ Capture │ ──► │ │ ──► │ PCAP │
│ (input.pcap)│ │ - Parse │ │ (filtered) │
└─────────────┘ │ - Classify │ └─────────────┘
│ - Block │
│ - Report │
└─────────────┘
| Version | File | Use Case |
|---|---|---|
| Simple (Single-threaded) | src/dpi_simple.py | Learning, small captures |
| Multi-processed | src/dpi_mt.py | Production, leveraging multiple CPU cores |
packet_analyzer/
├── src/ # Implementation files (Python)
│ ├── pcap_reader.py # PCAP file handling
│ ├── packet_parser.py # Protocol parsing
│ ├── sni_extractor.py # SNI/Host extraction
│ ├── types.py # Data structures (FiveTuple, AppType, etc.)
│ ├── dpi_simple.py # ★ SIMPLE VERSION ★
│ ├── dpi_mt.py # ★ MULTI-PROCESSED VERSION ★
│ └── __init__.py # Python package marker
│
├── generate_test_pcap.py # Creates test data
├── test_dpi.pcap # Sample capture with various traffic
├── WINDOWS_SETUP.md # Setup guide for Windows
└── README.md # This file!
Let's trace a single packet through dpi_simple.py:
reader = PcapReader()
reader.open("capture.pcap")

What happens:
- Open the file in binary mode
- Read the 24-byte global header (magic number, version, etc.)
- Verify it's a valid PCAP file
PCAP File Format:
┌────────────────────────────┐
│ Global Header (24 bytes) │ ← Read once at start
├────────────────────────────┤
│ Packet Header (16 bytes) │ ← Timestamp, length
│ Packet Data (variable) │ ← Actual network bytes
├────────────────────────────┤
│ Packet Header (16 bytes) │
│ Packet Data (variable) │
├────────────────────────────┤
│ ... more packets ... │
└────────────────────────────┘
while True:
    raw = reader.read_next_packet()
    if not raw:
        break
    # raw.data contains the packet bytes
    # raw.header contains timestamp and length

What happens:
- Read the 16-byte packet header
- Read N bytes of packet data (N = header.incl_len)
- Return a falsy value when no packets remain, which ends the loop
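The read loop above can be sketched with nothing but `struct`. This is a minimal illustration, not the project's `PcapReader`: the function name `iter_pcap_packets` and the demo helper are hypothetical, and it assumes a little-endian capture with microsecond timestamps.

```python
import io
import struct

GLOBAL_HEADER = struct.Struct("<IHHiIII")  # 24 bytes
PACKET_HEADER = struct.Struct("<IIII")     # 16 bytes
PCAP_MAGIC_LE = 0xA1B2C3D4                 # little-endian, microsecond timestamps


def iter_pcap_packets(stream):
    """Yield (ts_sec, ts_usec, orig_len, data) for each record in the stream."""
    header = stream.read(GLOBAL_HEADER.size)
    if len(header) < GLOBAL_HEADER.size:
        raise ValueError("truncated global header")
    if GLOBAL_HEADER.unpack(header)[0] != PCAP_MAGIC_LE:
        raise ValueError("not a little-endian microsecond PCAP file")
    while True:
        rec = stream.read(PACKET_HEADER.size)
        if len(rec) < PACKET_HEADER.size:
            return  # end of file (or a truncated trailing record)
        ts_sec, ts_usec, incl_len, orig_len = PACKET_HEADER.unpack(rec)
        data = stream.read(incl_len)
        if len(data) < incl_len:
            return  # packet data cut short
        yield ts_sec, ts_usec, orig_len, data


def build_demo_capture():
    """Build a tiny in-memory capture with two fake packets (demo data only)."""
    buf = GLOBAL_HEADER.pack(PCAP_MAGIC_LE, 2, 4, 0, 0, 65535, 1)
    for i, payload in enumerate([b"\x01\x02\x03", b"\xaa\xbb"]):
        buf += PACKET_HEADER.pack(1000 + i, 0, len(payload), len(payload)) + payload
    return buf


packets = list(iter_pcap_packets(io.BytesIO(build_demo_capture())))
```

Using a generator keeps memory use flat for large captures, since only one packet is held at a time.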
parsed = PacketParser.parse(raw)

What happens (in packet_parser.py):
raw.data bytes:
[0-13] Ethernet Header
[14-33] IP Header
[34-53] TCP Header
[54+] Payload
After parsing:
parsed.src_mac = "00:11:22:33:44:55"
parsed.dest_mac = "aa:bb:cc:dd:ee:ff"
parsed.src_ip = "192.168.1.100"
parsed.dest_ip = "172.217.14.206"
parsed.src_port = 54321
parsed.dest_port = 443
parsed.protocol = 6 (TCP)
parsed.has_tcp = True
Parsing the Ethernet Header (14 bytes):
Bytes 0-5: Destination MAC
Bytes 6-11: Source MAC
Bytes 12-13: EtherType (0x0800 = IPv4)
Parsing the IP Header (20+ bytes):
Byte 0: Version (4 bits) + Header Length (4 bits)
Byte 8: TTL (Time To Live)
Byte 9: Protocol (6=TCP, 17=UDP)
Bytes 12-15: Source IP
Bytes 16-19: Destination IP
Parsing the TCP Header (20+ bytes):
Bytes 0-1: Source Port
Bytes 2-3: Destination Port
Bytes 4-7: Sequence Number
Bytes 8-11: Acknowledgment Number
Byte 12: Data Offset (header length)
Byte 13: Flags (SYN, ACK, FIN, etc.)
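The three header layouts above translate almost directly into `struct` calls. The sketch below is an illustration under assumptions: `parse_headers` and its returned dictionary are hypothetical names (the project's `PacketParser` may differ), and it handles only untagged Ethernet carrying IPv4 and TCP.

```python
import socket
import struct


def parse_headers(frame):
    """Parse Ethernet + IPv4 + TCP headers; return a dict, or None if not IPv4/TCP."""
    if len(frame) < 14:
        return None
    dst_mac, src_mac, ethertype = struct.unpack("!6s6sH", frame[:14])
    if ethertype != 0x0800:          # not IPv4
        return None
    ip = frame[14:]
    if len(ip) < 20:
        return None
    ihl = (ip[0] & 0x0F) * 4         # IP header length in bytes
    if ihl < 20 or ip[9] != 6 or len(ip) < ihl + 20:
        return None                  # malformed, not TCP, or truncated
    tcp = ip[ihl:]
    src_port, dst_port, seq, ack = struct.unpack("!HHII", tcp[:12])
    data_offset = (tcp[12] >> 4) * 4  # TCP header length in bytes
    return {
        "src_mac": src_mac.hex(":"),
        "dest_mac": dst_mac.hex(":"),
        "src_ip": socket.inet_ntoa(ip[12:16]),
        "dest_ip": socket.inet_ntoa(ip[16:20]),
        "ttl": ip[8],
        "src_port": src_port,
        "dest_port": dst_port,
        "flags": tcp[13],
        "payload": tcp[data_offset:],
    }


# Hand-built demo frame (values chosen to match the examples in this document)
eth = bytes.fromhex("aabbccddeeff") + bytes.fromhex("001122334455") + b"\x08\x00"
ip_hdr = struct.pack("!BBHHHBBH4s4s", 0x45, 0, 45, 0, 0, 64, 6, 0,
                     socket.inet_aton("192.168.1.100"),
                     socket.inet_aton("172.217.14.206"))
tcp_hdr = struct.pack("!HHIIBBHHH", 54321, 443, 1, 0, 5 << 4, 0x18, 65535, 0, 0)
parsed = parse_headers(eth + ip_hdr + tcp_hdr + b"hello")
```

Reading the header lengths from the packet (IHL and Data Offset) rather than assuming 20 bytes keeps the payload offset correct when IP or TCP options are present.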
tuple_key = FiveTuple(
    src_ip=parsed.src_ip_int,
    dst_ip=parsed.dest_ip_int,
    src_port=parsed.src_port,
    dst_port=parsed.dest_port,
    protocol=parsed.protocol,
)
if tuple_key not in flows:
    flows[tuple_key] = Flow(tuple_key)
flow = flows[tuple_key]

What happens:
- The flow table is a hash map: FiveTuple → Flow
- If this 5-tuple exists, we get the existing flow
- If not, a new flow is created
- All packets with the same 5-tuple share the same flow
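To see the flow table behave as described, here is a small self-contained sketch. The `Flow` fields (`packets`, `byte_count`) and the helper `get_flow` are assumptions made for illustration; the project's real `Flow` class may track different state.

```python
from dataclasses import dataclass


@dataclass(frozen=True)  # frozen makes instances hashable, so they can be dict keys
class FiveTuple:
    src_ip: int
    dst_ip: int
    src_port: int
    dst_port: int
    protocol: int


@dataclass
class Flow:
    key: FiveTuple
    packets: int = 0      # hypothetical counters, for illustration only
    byte_count: int = 0


def get_flow(flows, key):
    """Return the existing flow for this 5-tuple, creating it on first sight."""
    flow = flows.get(key)
    if flow is None:
        flow = flows[key] = Flow(key)
    return flow


flows = {}
a = FiveTuple(0xC0A80164, 0xACD90ECE, 54321, 443, 6)
b = FiveTuple(0xC0A80164, 0xACD90ECE, 54322, 443, 6)  # different source port
for key, size in [(a, 60), (b, 60), (a, 517)]:
    flow = get_flow(flows, key)
    flow.packets += 1
    flow.byte_count += size
```

Two equal `FiveTuple` values hash and compare the same, which is what lets the third packet land in the flow created by the first.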
# For HTTPS traffic (port 443)
if flow.app_type == AppType.UNKNOWN and parsed.has_tcp and parsed.dest_port == 443:
    if parsed.payload_data:
        sni = SNIExtractor.extract(parsed.payload_data)
        if sni:
            flow.sni = sni
            flow.app_type = sni_to_app_type(sni)

What happens (in sni_extractor.py):
- Check if it's a TLS Client Hello:
    Byte 0: Content Type = 0x16 (Handshake) ✓
    Byte 5: Handshake Type = 0x01 (Client Hello) ✓
- Navigate to Extensions:
    Skip: Version, Random, Session ID, Cipher Suites, Compression
- Find SNI Extension (type 0x0000):
    Extension Type: 0x0000 (SNI)
    Extension Length: N
    SNI List Length: M
    SNI Type: 0x00 (hostname)
    SNI Length: L
    SNI Value: "www.youtube.com" ← FOUND!
- Map SNI to App Type:
    # In src/types.py
    if "youtube" in lower_sni:
        return AppType.YOUTUBE
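The first three steps above can be sketched end to end as follows. This is an illustrative version, not the project's `SNIExtractor`: the function names are hypothetical, it assumes the whole Client Hello arrives in a single TCP segment, and `build_client_hello` exists only to produce test input.

```python
import struct


def extract_sni(payload):
    """Return the SNI hostname from a TLS Client Hello, or None if absent."""
    try:
        if payload[0] != 0x16 or payload[5] != 0x01:
            return None                      # not a handshake / not a Client Hello
        offset = 43                          # record(5) + handshake(4) + version(2) + random(32)
        offset += 1 + payload[offset]        # skip session ID
        offset += 2 + int.from_bytes(payload[offset:offset + 2], "big")  # skip cipher suites
        offset += 1 + payload[offset]        # skip compression methods
        ext_end = offset + 2 + int.from_bytes(payload[offset:offset + 2], "big")
        offset += 2
        while offset + 4 <= min(ext_end, len(payload)):
            ext_type, ext_len = struct.unpack("!HH", payload[offset:offset + 4])
            offset += 4
            if ext_type == 0x0000:           # server_name extension
                # layout: list length (2), name type (1), name length (2), name
                name_type = payload[offset + 2]
                name_len = int.from_bytes(payload[offset + 3:offset + 5], "big")
                name = payload[offset + 5:offset + 5 + name_len]
                if name_type != 0 or len(name) != name_len:
                    return None
                return name.decode("ascii", errors="replace")
            offset += ext_len                # skip any other extension
    except IndexError:
        return None                          # payload shorter than its headers claim
    return None


def build_client_hello(hostname):
    """Build a minimal Client Hello carrying only an SNI extension (test data)."""
    name = hostname.encode("ascii")
    entry = b"\x00" + struct.pack("!H", len(name)) + name
    ext_data = struct.pack("!H", len(entry)) + entry
    extensions = struct.pack("!HH", 0x0000, len(ext_data)) + ext_data
    body = (b"\x03\x03" + bytes(32) + b"\x00"        # version, random, empty session ID
            + struct.pack("!H", 2) + b"\x13\x01"     # one cipher suite
            + b"\x01\x00"                            # one compression method (null)
            + struct.pack("!H", len(extensions)) + extensions)
    handshake = b"\x01" + len(body).to_bytes(3, "big") + body
    return b"\x16\x03\x01" + struct.pack("!H", len(handshake)) + handshake


sni = extract_sni(build_client_hello("www.youtube.com"))
```

Every length field comes from the packet itself, so the bounds checks matter: a truncated or malicious payload should yield `None` rather than an exception.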
if not flow.blocked:
    flow.blocked = rules.is_blocked(tuple_key.src_ip, flow.app_type, flow.sni)

What happens:
# Check IP blacklist
if src_ip in self.blocked_ips:
    return True
# Check app blacklist
if app in self.blocked_apps:
    return True
# Check domain blacklist (substring match; skipped when no SNI was found)
for dom in self.blocked_domains:
    if sni and dom in sni:
        return True
return False

if flow.blocked:
    dropped += 1
else:
    forwarded += 1
    # Write packet to output file
    output.write(packet_header)
    output.write(packet_data)

After processing all packets:
# Count apps (one increment per flow)
for five_tuple, flow in flows.items():
    app_stats[flow.app_type] += 1
# Print report
print("YouTube: 150 packets (15%)")
print("Facebook: 80 packets (8%)")
...

The multi-processed version (src/dpi_mt.py) adds parallelism for high performance:
┌─────────────────┐
│ Reader Process │
│ (reads PCAP) │
└────────┬────────┘
│
┌──────────────┴──────────────┐
│ hash(5-tuple) % 2 │
▼ ▼
┌─────────────────┐ ┌─────────────────┐
│ LB0 Process │ │ LB1 Process │
│ (Load Balancer)│ │ (Load Balancer)│
└────────┬────────┘ └────────┬────────┘
│ │
┌──────┴──────┐ ┌──────┴──────┐
│hash % 2 │ │hash % 2 │
▼ ▼ ▼ ▼
┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐
│FP0 Proc │ │FP1 Proc │ │FP2 Proc │ │FP3 Proc │
│(Fast Path)│ │(Fast Path)│ │(Fast Path)│ │(Fast Path)│
└─────┬────┘ └─────┬────┘ └─────┬────┘ └─────┬────┘
│ │ │ │
└────────────┴──────────────┴────────────┘
│
▼
┌───────────────────────┐
│ Output Queue │
└───────────┬───────────┘
│
▼
┌───────────────────────┐
│ Output Writer Proc │
│ (writes to PCAP) │
└───────────────────────┘
- Load Balancers (LBs): Distribute work across FPs
- Fast Paths (FPs): Do the actual DPI processing
- Consistent Hashing: Same 5-tuple always goes to same FP
Why consistent hashing matters:
Connection: 192.168.1.100:54321 → 142.250.185.206:443
Packet 1 (SYN): hash → FP2
Packet 2 (SYN-ACK): hash → FP2 (same process, provided the reply direction maps to the same hash)
Packet 3 (Client Hello): hash → FP2 (same process!)
Packet 4 (Data): hash → FP2 (same process!)
All packets of this connection go to FP2.
FP2 can track the flow state correctly in its own memory.
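Two details are worth checking in any two-level dispatch like this one. First, the SYN-ACK travels in the reverse direction, so its raw 5-tuple has source and destination swapped; the hash only lands on the same FP if it ignores direction. Second, if both levels reduce the same hash value with `% 2`, as the diagram's labels suggest, each load balancer would only ever pick one of its two fast paths. The sketch below shows one possible way to address both. It is an assumption-laden illustration, not the project's code: `flow_hash` and `dispatch` are hypothetical names, and `zlib.crc32` is swapped in here as a stable hash.

```python
import struct
import zlib


def flow_hash(src_ip, dst_ip, src_port, dst_port, protocol):
    """Direction-independent hash: both directions of a connection match."""
    a, b = (src_ip, src_port), (dst_ip, dst_port)
    lo, hi = (a, b) if a <= b else (b, a)   # canonical endpoint order
    packed = struct.pack("!IHIHB", lo[0], lo[1], hi[0], hi[1], protocol)
    return zlib.crc32(packed)


def dispatch(h, num_lbs=2, fps_per_lb=2):
    """Return (lb_index, global_fp_index) using independent parts of the hash."""
    lb_idx = h % num_lbs
    fp_local = (h // num_lbs) % fps_per_lb  # remaining bits pick the FP within the LB
    return lb_idx, lb_idx * fps_per_lb + fp_local


client, server = 0xC0A80164, 0x8EFAB9CE     # 192.168.1.100, 142.250.185.206
forward = flow_hash(client, server, 54321, 443, 6)
reverse = flow_hash(server, client, 443, 54321, 6)

# Reusing h % 2 at both levels only ever produces two (LB, local FP) pairs
naive_pairs = {(h % 2, h % 2) for h in range(100)}
# Using separate parts of the hash reaches all four fast paths
used_fps = {dispatch(h)[1] for h in range(100)}
```

Dividing by `num_lbs` before the second modulo means the FP choice no longer depends on the bits that already selected the load balancer.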
# Main loop in Reader process
while True:
    raw = reader.read_next_packet()
    if not raw:
        break
    pkt = create_packet_object(raw)
    lb_idx = hash(pkt.tuple) % len(lb_queues)
    # Send to Load Balancer via Queue
    lb_queues[lb_idx].put(pkt)

def lb_worker(input_queue, fp_queues):
    while True:
        pkt = input_queue.get()
        if pkt is None:
            break  # Shutdown signal
        fp_idx = hash(pkt.tuple) % len(fp_queues)
        fp_queues[fp_idx].put(pkt)

def fp_worker(input_queue, output_queue, rules):
    flows = {}  # Local flow table for this process
    while True:
        pkt = input_queue.get()
        if pkt is None:
            break
        flow = flows.setdefault(pkt.tuple, Flow(pkt.tuple))
        classify_flow(pkt, flow)
        if not rules.is_blocked(pkt.tuple.src_ip, flow.app_type, flow.sni):
            output_queue.put(pkt)

def writer_worker(output_queue, output_file):
    with PcapWriter(output_file) as writer:
        while True:
            pkt = output_queue.get()
            if pkt is None:
                break
            writer.write_packet(pkt)

Purpose: Read network captures saved by Wireshark
Key Logic (using struct):
# Read Global Header
# <IHHiIII = magic, version_major, version_minor, thiszone, sigfigs, snaplen, network
# (thiszone is a signed 32-bit value, hence the lowercase i)
data = file.read(24)
global_header = struct.unpack('<IHHiIII', data)
# Read Packet Header
# <IIII = ts_sec, ts_usec, incl_len, orig_len
header_data = file.read(16)
p_header = struct.unpack('<IIII', header_data)

Purpose: Extract protocol fields from raw bytes
Key function:
def parse(raw_packet):
    # 1. Parse Ethernet
    # 2. Parse IPv4 (extract protocol, IPs)
    # 3. Parse TCP or UDP (extract ports)
    # 4. Extract remaining payload
    return ParsedPacket(...)

Important concepts:
Network Byte Order: Network protocols use big-endian. Python's struct uses > for big-endian:
# Unpack 2 bytes (16-bit port) at offset 34
src_port = struct.unpack('>H', data[34:36])[0]
# Unpack 4 bytes (32-bit sequence number) at offset 38
seq_num = struct.unpack('>I', data[38:42])[0]

Purpose: Extract domain names from TLS and HTTP
For TLS (HTTPS):
- Verify TLS record header (0x16)
- Verify Client Hello handshake (0x01)
- Skip variable-length fields (Session ID, Cipher Suites)
- Find SNI extension (type 0x0000)
For HTTP:
- Search for "Host: " string in the payload
- Extract value until the next newline
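For the plaintext HTTP case, a sketch along these lines would work. It is a slight variation on the substring search described above: it parses header lines and matches the header name case-insensitively. The function name `extract_http_host` is hypothetical.

```python
def extract_http_host(payload):
    """Return the Host header value from a plaintext HTTP request, or None."""
    lines = payload.split(b"\r\n")
    for line in lines[1:]:            # skip the request line
        if not line:
            break                     # blank line: end of headers, body follows
        name, sep, value = line.partition(b":")
        if sep and name.strip().lower() == b"host":
            return value.strip().decode("ascii", errors="replace")
    return None


request = b"GET /watch HTTP/1.1\r\nHost: www.youtube.com\r\nUser-Agent: demo\r\n\r\n"
host = extract_http_host(request)
```

Stopping at the blank line avoids matching the text "Host:" if it happens to appear inside a request body.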
Purpose: Define data structures used throughout
@dataclass(frozen=True)
class FiveTuple:
    src_ip: int
    dst_ip: int
    src_port: int
    dst_port: int
    protocol: int

class AppType(Enum):
    UNKNOWN = 0
    HTTP = 1
    HTTPS = 2
    YOUTUBE = 3
    # ...

When you visit https://www.youtube.com:
┌──────────┐ ┌──────────┐
│ Browser │ │ Server │
└────┬─────┘ └────┬─────┘
│ │
│ ──── Client Hello ─────────────────────►│
│ (includes SNI: www.youtube.com) │
│ │
│ ◄─── Server Hello ───────────────────── │
│ (includes certificate) │
│ │
│ ──── Key Exchange ─────────────────────►│
│ │
│ ◄═══ Encrypted Data ══════════════════► │
│ (from here on, everything is │
│ encrypted - we can't see it) │
We can only extract SNI from the Client Hello!
def extract_sni(payload):
    # Need at least 44 bytes to reach the session ID length at offset 43
    if len(payload) < 44 or payload[0] != 0x16 or payload[5] != 0x01:
        return None  # Not a Client Hello
    offset = 43  # Skip to session ID
    session_len = payload[offset]
    offset += 1 + session_len
    cipher_len = int.from_bytes(payload[offset:offset+2], 'big')
    offset += 2 + cipher_len
    comp_len = payload[offset]
    offset += 1 + comp_len
    # Extensions...
    ext_total_len = int.from_bytes(payload[offset:offset+2], 'big')
    offset += 2
    # Loop through extensions to find type 0x0000 (SNI)
    # ...

| Rule Type | Example | What it Blocks |
|---|---|---|
| IP | 192.168.1.50 | All traffic from this source |
| App | YouTube | All YouTube connections |
| Domain | tiktok | Any SNI containing "tiktok" |
Packet arrives
│
▼
┌─────────────────────────────────┐
│ Is source IP in blocked list? │──Yes──► DROP
└───────────────┬─────────────────┘
│No
▼
┌─────────────────────────────────┐
│ Is app type in blocked list? │──Yes──► DROP
└───────────────┬─────────────────┘
│No
▼
┌─────────────────────────────────┐
│ Does SNI match blocked domain? │──Yes──► DROP
└───────────────┬─────────────────┘
│No
▼
FORWARD
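The decision flow above can be sketched as a small rule-set class. This is an illustration under assumptions: the class name `BlockRules` is hypothetical, and it uses string IPs and app names for readability, whereas the project passes the integer source IP from the 5-tuple and `AppType` values.

```python
from dataclasses import dataclass, field


@dataclass
class BlockRules:
    blocked_ips: set = field(default_factory=set)
    blocked_apps: set = field(default_factory=set)
    blocked_domains: set = field(default_factory=set)

    def is_blocked(self, src_ip, app, sni):
        """Apply the three checks in order: source IP, app type, domain substring."""
        if src_ip in self.blocked_ips:
            return True
        if app in self.blocked_apps:
            return True
        if not sni:
            return False              # no SNI seen yet, nothing to match against
        sni_lower = sni.lower()
        return any(dom.lower() in sni_lower for dom in self.blocked_domains)


rules = BlockRules(
    blocked_ips={"192.168.1.50"},
    blocked_apps={"YouTube"},
    blocked_domains={"tiktok"},
)
verdicts = [
    rules.is_blocked("192.168.1.50", "GitHub", "github.com"),    # blocked by IP
    rules.is_blocked("10.0.0.1", "YouTube", "www.youtube.com"),  # blocked by app
    rules.is_blocked("10.0.0.1", "Unknown", "www.TikTok.com"),   # blocked by domain
    rules.is_blocked("10.0.0.1", "GitHub", "github.com"),        # forwarded
]
```

Because the domain rule is a substring match, a short pattern such as "google" also matches unrelated hostnames that merely contain it, so narrower patterns are safer.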
- Open run_analyzer.bat in a text editor (like Notepad).
- Change the INPUT_PCAP variable if you want to analyze your own capture:
  set INPUT_PCAP=your_file.pcap
- Save, then double-click the file to run the multi-processed analyzer automatically.

- Python 3.8+
- No external Python libraries needed (multiprocessing and struct are built-in)!
If you prefer to use the Command Prompt or PowerShell directly, you must first generate the test data (if you don't already have a .pcap file):
Step 1. Go to your project folder:
cd path/to/Packet_analyzer

Step 2. Generate test traffic (run once):
python generate_test_pcap.py

Step 3. Run the multi-processed version (high performance):
python -m src.dpi_mt test_dpi.pcap output.pcap

(Optional) Run the simple version (for learning/debugging):
python -m src.dpi_simple test_dpi.pcap output.pcap

You can tell the analyzer to block specific traffic by adding "flags" to your command.
What you can block:
- Applications (--block-app): YouTube, Facebook, TikTok, WhatsApp, Instagram, Twitter, Netflix, Amazon, Microsoft, Apple, Telegram, Spotify, Zoom, Discord, GitHub, Cloudflare, Google.
- IP Addresses (--block-ip): Any valid IPv4 address (e.g., 192.168.1.50).
- Domains (--block-domain): Any substring (e.g., tiktok, porn, google). If the SNI of a connection contains this substring, its packets are dropped.
1. Block an Application:
python -m src.dpi_mt test_dpi.pcap output.pcap --block-app YouTube

2. Block a specific IP address:
python -m src.dpi_mt test_dpi.pcap output.pcap --block-ip 192.168.1.50

3. Block a Domain name:
python -m src.dpi_mt test_dpi.pcap output.pcap --block-domain tiktok

4. Combine multiple rules:
python -m src.dpi_mt test_dpi.pcap output.pcap --block-app YouTube --block-ip 1.1.1.1 --block-domain porn

If you want a dedicated script where you can just "set and forget" your rules, use run_advanced.bat:
- Right-click run_advanced.bat and select Edit.
- Change the variables at the top of the file:
  set BLOCK_APP=YouTube
  set BLOCK_IP=192.168.1.50
  set BLOCK_DOMAIN=tiktok
- Save, then double-click the file. It will build the command for you and run the analysis with your rules.

If you prefer to keep using the standard script:
- Right-click run_analyzer.bat and select Edit.
- Find the line that starts with python -m src.dpi_mt.
- Add your blocking flags at the end of that line. For example:
  python -m src.dpi_mt %INPUT_PCAP% %OUTPUT_PCAP% --fps %WORKERS% --block-app YouTube
- Save and run!
python generate_test_pcap.py
# Creates test_dpi.pcap with sample traffic

╔══════════════════════════════════════════════════════════════╗
║ DPI ENGINE v2.0 (Multi-processed) ║
╠══════════════════════════════════════════════════════════════╣
║ Load Balancers: 2 FPs per LB: 2 Total Processes: 5 ║
╚══════════════════════════════════════════════════════════════╝
[Rules] Blocked app: YouTube
[Rules] Blocked IP: 192.168.1.50
[Reader] Processing packets...
[Reader] Done reading 77 packets
╔══════════════════════════════════════════════════════════════╗
║ PROCESSING REPORT ║
╠══════════════════════════════════════════════════════════════╣
║ Total Packets: 77 ║
║ Total Bytes: 5738 ║
║ Forwarded: 69 ║
║ Dropped: 8 ║
╠══════════════════════════════════════════════════════════════╣
║ PROCESS STATISTICS ║
║ LB0 dispatched: 53 ║
║ LB1 dispatched: 24 ║
║ FP0 processed: 53 ║
║ ... ║
╚══════════════════════════════════════════════════════════════╝
- Add More App Signatures
  # In src/types.py
  if "twitch" in lower_sni:
      return AppType.TWITCH
- Add Bandwidth Throttling
  # Instead of DROP, delay packets slightly (concept)
  if should_throttle(flow):
      time.sleep(0.01)
- Add QUIC/HTTP3 Support
  - QUIC uses UDP on port 443
  - SNI is in the Initial packet
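For the first idea above (more app signatures), one option is to make the mapping table-driven so that a new app is a one-line change. The sketch below is hypothetical: the `TWITCH` value, the `SIGNATURES` list, and the `HTTPS` fallback for unmatched names are assumptions, and the real `sni_to_app_type` in src/types.py may behave differently.

```python
from enum import Enum


class AppType(Enum):
    UNKNOWN = 0
    HTTP = 1
    HTTPS = 2
    YOUTUBE = 3
    TWITCH = 100   # hypothetical value for the new entry


# (substring, app) pairs, checked in order; first match wins
SIGNATURES = [
    ("youtube", AppType.YOUTUBE),
    ("twitch", AppType.TWITCH),
]


def sni_to_app_type(sni):
    """Map an SNI hostname to an AppType via case-insensitive substring match."""
    lower_sni = sni.lower()
    for needle, app in SIGNATURES:
        if needle in lower_sni:
            return app
    return AppType.HTTPS   # assumed fallback: TLS traffic to an unrecognized host


app = sni_to_app_type("www.Twitch.tv")
```

Keeping signatures in a list also makes the match order explicit, which matters when one pattern is a substring of another.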
This DPI engine demonstrates:
- Network Protocol Parsing - Understanding packet structure
- Deep Packet Inspection - Reading the unencrypted handshake metadata of encrypted connections
- Flow Tracking - Managing stateful connections
- Multi-processed Architecture - Scaling with Python's multiprocessing
- Producer-Consumer Pattern - Synchronized queues
The key insight is that even HTTPS traffic leaks the destination domain in the TLS handshake, allowing network operators to identify and control application usage.
If you have questions about any part of this project, the code is well-commented and follows the same flow described in this document. Start with the simple version (src/dpi_simple.py) to understand the concepts, then move to the multi-processed version (src/dpi_mt.py) to see how parallelism is added.
Happy learning! 🚀