A multi-source log analysis and threat detection tool built in Python, mirroring the internal architecture of enterprise SIEM platforms such as Splunk and IBM QRadar.
- Overview
- Why LogSentry?
- Architecture
- Project Structure
- Features
- Parsers
- Detectors
- Correlator
- External Integrations
- Report Generation
- Sample Log Scenarios
- Installation
- Usage
- Configuration
- Design Decisions
- What This Project Demonstrates
- Documentation
LogSentry is a Python-based security log analyzer that processes raw log files from multiple sources, applies threat detection rules, correlates alerts across sources, enriches findings with real-world threat intelligence, and produces a timestamped HTML incident report.
It supports three log formats out of the box:
- SSH: /var/log/auth.log and /var/log/secure
- Apache/nginx: Combined Log Format (access.log)
- Windows Security: binary .evtx event log files
The tool is designed around the same fundamental pipeline that powers enterprise SIEM platforms — ingest, normalise, detect, correlate, report — rebuilt from scratch in Python to demonstrate deep understanding of the internals.
Most security students use pre-built SIEM tools. LogSentry was built from scratch to prove understanding of what those tools actually do internally:
| What commercial SIEMs do | What LogSentry does |
|---|---|
| Universal forwarders / log collectors | Custom parsers per log format |
| Common Information Model (CIM) | Uniform event dictionary schema |
| SPL/AQL correlation rules | Python detector modules |
| Notable Events / Offenses | correlator.py timeline engine |
| Threat intelligence feeds (TAXII) | AbuseIPDB API integration |
| MITRE ATT&CK app for Splunk | mitre.py with STIX 2.0 dataset |
| Dashboards, PDF exports | Jinja2 HTML report |
LogSentry is built around a strict four-layer pipeline with complete separation of concerns:
Raw Log Files
│
▼
┌─────────────┐
│ PARSERS │ ssh_parser.py / apache_parser.py / windows_parser.py
└─────────────┘
│ List of normalised event dicts
▼
┌─────────────┐
│ DETECTORS │ brute_force.py / off_hours.py / multi_target.py / web_attacks.py
└─────────────┘
│ List of alert dicts
▼
┌─────────────┐
│ CORRELATOR │ correlator.py
└─────────────┘
│ Grouped attack timelines
▼
┌─────────────┐
│ REPORTER │ report_generator.py + report.html (Jinja2)
└─────────────┘
│
▼
Timestamped HTML Report
The core design rule: A detector never reads a file. A parser never makes a judgment. A reporter never analyses data. Each module has exactly one job.
logsentry/
├── main.py # Orchestration entry point
├── correlator.py # Cross-source timeline builder
├── mitre.py # MITRE ATT&CK technique enrichment
├── enrichment.py # AbuseIPDB API integration
├── report_generator.py # Jinja2 HTML report builder
├── requirements.txt
├── .env # API keys (never commit this)
│
├── parsers/
│ ├── __init__.py
│ ├── ssh_parser.py # auth.log / secure parser
│ ├── apache_parser.py # access.log parser
│ └── windows_parser.py # .evtx binary parser
│
├── detectors/
│ ├── __init__.py
│ ├── brute_force.py # Threshold + CRITICAL escalation
│ ├── off_hours.py # Temporal anomaly detection
│ ├── multi_target.py # Credential stuffing detection
│ └── web_attacks.py # SQL/traversal/command/scan detection
│
├── templates/
│ └── report.html # Jinja2 HTML template
│
├── sample_logs/
│ ├── auth.log # 4-scenario SSH test data
│ └── access.log # 3-scenario Apache test data
│
└── output/ # Generated reports saved here
- Multi-format parsing — SSH, Apache CLF, Windows EVTX binary
- Brute force detection with CRITICAL escalation on confirmed account compromise
- Off-hours & weekend access detection for insider threat modelling
- Credential stuffing detection via multi-target username enumeration
- Web attack detection — SQL injection, directory traversal, command injection, vulnerability scanning
- Cross-source correlation — links alerts from different log types to the same attacker IP
- MITRE ATT&CK mapping — every alert mapped to a real technique ID (STIX 2.0)
- AbuseIPDB enrichment — live threat intelligence context for every attacker IP
- HTML report generation — professional timestamped incident reports via Jinja2
All parsers output a uniform event dictionary so that every detector can consume events from any log source without modification:
{
"timestamp": "Mar 29 02:14:33", # raw string from log
"ip": "192.168.1.105", # always normalised string
"username": "root", # N/A when not available
"event_type": "failed", # "failed" | "success" | "invalid_user"
"source": "ssh", # "ssh" | "apache" | "windows"
# Apache-only
"method": "GET",
"url": "/index.php",
"status": 200, # int — needed for numeric comparison
"size": "1234", # str — display only
# Windows-only
"event_id": 4625, # int
}
Processes /var/log/auth.log and /var/log/secure using three compiled regex patterns:
FAILED_PATTERN = r'(\w+\s+\d+\s+\d+:\d+:\d+).*Failed password for (\S+) from (\S+) port'
SUCCESS_PATTERN = r'(\w+\s+\d+\s+\d+:\d+:\d+).*Accepted password for (\S+) from (\S+) port'
INVALID_USER_PATTERN = r'(\w+\s+\d+\s+\d+:\d+:\d+).*Invalid user (\S+) from (\S+) port'
| Event type | Meaning | Attack implication |
|---|---|---|
| failed | Username exists, password wrong | Password spraying / brute force |
| invalid_user | Username does not exist | Account enumeration / credential stuffing |
| success | Successful authentication | Legitimate or post-compromise access |
Files are opened with encoding='utf-8', errors='ignore' to handle non-UTF-8 bytes that occasionally appear in production log files from special usernames or log aggregators.
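A minimal sketch of how the three patterns could feed the uniform event dictionary. The function names parse_ssh_line and parse_ssh_log are illustrative assumptions, not necessarily what ssh_parser.py uses:

```python
import re

# Patterns copied from the section above, compiled once at import time.
FAILED = re.compile(r'(\w+\s+\d+\s+\d+:\d+:\d+).*Failed password for (\S+) from (\S+) port')
SUCCESS = re.compile(r'(\w+\s+\d+\s+\d+:\d+:\d+).*Accepted password for (\S+) from (\S+) port')
INVALID = re.compile(r'(\w+\s+\d+\s+\d+:\d+:\d+).*Invalid user (\S+) from (\S+) port')

# The first pattern that matches decides the event type.
PATTERNS = [("failed", FAILED), ("success", SUCCESS), ("invalid_user", INVALID)]


def parse_ssh_line(line):
    """Return a normalised event dict, or None if the line is not an auth event."""
    for event_type, pattern in PATTERNS:
        match = pattern.search(line)
        if match:
            timestamp, username, ip = match.groups()
            return {"timestamp": timestamp, "ip": ip, "username": username,
                    "event_type": event_type, "source": "ssh"}
    return None


def parse_ssh_log(path):
    """Hypothetical file-level wrapper; errors='ignore' drops non-UTF-8 bytes."""
    with open(path, encoding="utf-8", errors="ignore") as handle:
        return [event for event in map(parse_ssh_line, handle) if event]
```

Lines that match none of the patterns (cron entries, session open/close messages) simply produce no event, so detectors only ever see authentication outcomes.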
Processes Combined Log Format (CLF) — the default format for both Apache and nginx:
203.0.113.47 - - [29/Mar/2026:00:01:15 +0000] "GET /wp-admin HTTP/1.1" 404 512
The regex captures: client IP, timestamp, HTTP method, URL path, status code, response size. The status code is immediately converted to int at parse time for numeric comparison in detectors. All URLs are lowercased at parse time to eliminate case-based evasion techniques (SeLeCt → select).
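The parse step might look roughly like the sketch below. ACCESS_PATTERN and parse_access_line are hypothetical stand-ins for whatever apache_parser.py actually defines; the pattern reads only the leading fields and ignores anything after the response size:

```python
import re

# Hypothetical pattern: IP, timestamp, method, URL, status, size.
ACCESS_PATTERN = re.compile(
    r'(\S+) \S+ \S+ \[([^\]]+)\] "(\S+) (.*?) HTTP/[^"]*" (\d{3}) (\S+)'
)


def parse_access_line(line):
    """Return a normalised event dict, or None if the line does not match."""
    match = ACCESS_PATTERN.match(line)
    if not match:
        return None
    ip, timestamp, method, url, status, size = match.groups()
    return {
        "timestamp": timestamp,
        "ip": ip,
        "username": "N/A",     # access logs carry no username in this sketch
        "source": "apache",
        "method": method,
        "url": url.lower(),    # lowercase once so detectors can match case-insensitively
        "status": int(status), # int for numeric comparison in detectors
        "size": size,          # kept as str, display only
    }
```

Lowercasing in the parser rather than in each detector means every pattern list can be written in lowercase only.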
Windows .evtx files are binary. The parser uses python-evtx to decode each record into an XML string, then navigates the DOM using xml.etree.ElementTree.
Every tag requires the Microsoft event schema namespace prefix:
{http://schemas.microsoft.com/win/2004/08/events/event}
| Event ID | Event Name | Has IP? |
|---|---|---|
| 4625 | Failed Logon | Yes |
| 4624 | Successful Logon | Yes |
| 4740 | Account Lockout | Yes |
| 4720 | User Account Created | No (local event) |
| 4672 | Special Privileges Assigned | No (local event) |
| 4732 | Member Added to Security Group | No (local event) |
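The namespace handling can be sketched on a single decoded record. This example skips python-evtx and starts from the XML string it would produce; parse_event_xml and the sample record are illustrative assumptions, while TargetUserName and IpAddress are the standard Data field names on logon events:

```python
import xml.etree.ElementTree as ET

# Every tag lookup must carry the Microsoft event schema namespace.
NS = "{http://schemas.microsoft.com/win/2004/08/events/event}"

SAMPLE_XML = """<Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
  <System>
    <EventID>4625</EventID>
    <TimeCreated SystemTime="2026-03-29T02:14:33.000000Z"/>
  </System>
  <EventData>
    <Data Name="TargetUserName">administrator</Data>
    <Data Name="IpAddress">203.0.113.47</Data>
  </EventData>
</Event>"""


def parse_event_xml(xml_string):
    """Turn one decoded event record into a normalised event dict."""
    root = ET.fromstring(xml_string)
    event_id = int(root.find(f"{NS}System/{NS}EventID").text)
    timestamp = root.find(f"{NS}System/{NS}TimeCreated").get("SystemTime")
    # EventData children look like <Data Name="...">value</Data>.
    data = {d.get("Name"): d.text for d in root.iter(f"{NS}Data")}
    return {
        "timestamp": timestamp,
        "ip": data.get("IpAddress") or "N/A",        # local events have no IP
        "username": data.get("TargetUserName") or "N/A",
        "source": "windows",
        "event_id": event_id,
    }
```

Without the NS prefix, root.find("System/EventID") returns None, which is the usual symptom of forgetting the namespace.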
File: detectors/brute_force.py
Detects repeated failed login attempts from a single IP, and escalates to CRITICAL if the same IP later succeeds.
Algorithm:
- Separate failed_events and success_events from input
- failure_counts = Counter(e["ip"] for e in failed_events)
- For each IP where count >= BRUTE_FORCE_THRESHOLD (5):
  - Collect targeted usernames, first/last timestamps → HIGH alert
  - If same IP appears in success_events → CRITICAL alert (account compromised)
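The steps above can be sketched as follows. The alert field names (type, severity, first_seen and so on) are assumptions made for illustration; the real alert dict in brute_force.py may differ:

```python
from collections import Counter

BRUTE_FORCE_THRESHOLD = 5


def detect_brute_force(events):
    """Return HIGH alerts for noisy IPs and CRITICAL alerts if they also logged in."""
    failed_events = [e for e in events if e["event_type"] == "failed"]
    success_ips = {e["ip"] for e in events if e["event_type"] == "success"}
    failure_counts = Counter(e["ip"] for e in failed_events)

    alerts = []
    for ip, count in failure_counts.items():
        if count < BRUTE_FORCE_THRESHOLD:
            continue
        attempts = [e for e in failed_events if e["ip"] == ip]
        alerts.append({
            "type": "brute_force", "severity": "HIGH", "ip": ip, "count": count,
            "usernames": sorted({e["username"] for e in attempts}),
            "first_seen": attempts[0]["timestamp"],   # events are in log order
            "last_seen": attempts[-1]["timestamp"],
        })
        # Same IP also has a successful login: treat the account as compromised.
        if ip in success_ips:
            alerts.append({"type": "account_compromise", "severity": "CRITICAL", "ip": ip})
    return alerts
```

As in the list above, this version only checks that the IP appears among successful logins; it does not verify that the success came after the failures.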
Why threshold = 5?
| Behaviour | Typical count | Result |
|---|---|---|
| Legitimate user forgot password | 1–3 | No alert |
| Legitimate user, caps lock | 2–4 | No alert |
| Manual attacker testing | 5–20 | Alert fires |
| Automated tool (hydra, medusa) | 100–10,000 | Alert fires |
5 is a common middle ground: it matches Fail2Ban's default maxretry value and the thresholds used in many commercial SIEM rulesets.
File: detectors/off_hours.py
Analyses successful logins only (failed attempts at odd hours are already caught by the brute force detector — no double alerting). Detects access outside business hours and on weekends.
WORK_START = 8 # 08:00
WORK_END = 18 # 18:00
WEEKEND_DAYS = [5, 6] # Saturday, Sunday
| Condition | Severity | Rationale |
|---|---|---|
| Weekend login | CRITICAL | Almost no legitimate reason; highest insider-threat signal |
| Weekday off-hours login | HIGH | Suspicious — requires investigation |
| Weekday in-hours login | No alert | Normal |
Why this matters: Insider threats use legitimate credentials. A disgruntled employee accessing the database server at 2 AM on a Sunday is as dangerous as an external attacker. The tool flags the anomaly — a human analyst with organisational context makes the final call.
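The severity table can be expressed as a small classification function. This is a sketch under two assumptions: classify_login is a hypothetical name, and WORK_END is treated as exclusive, so a login at exactly 18:00 counts as off-hours:

```python
from datetime import datetime

WORK_START = 8         # 08:00
WORK_END = 18          # 18:00
WEEKEND_DAYS = [5, 6]  # datetime.weekday(): Saturday = 5, Sunday = 6


def classify_login(when):
    """Return the alert severity for a successful login, or None for no alert."""
    if when.weekday() in WEEKEND_DAYS:
        return "CRITICAL"   # weekend overrides time of day
    if not (WORK_START <= when.hour < WORK_END):
        return "HIGH"       # weekday, outside business hours
    return None             # weekday, within business hours
```

For example, classify_login(datetime(2026, 3, 29, 2, 17)) falls on a Sunday and is therefore classified by the weekend rule before the hour is even checked.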
Timestamp year injection: Syslog format (RFC 3164) does not include the year. The parser injects the current year:
full_ts = f"{datetime.now().year} {ts_str}"
datetime.strptime(full_ts, "%Y %b %d %H:%M:%S")
File: detectors/multi_target.py
Detects credential stuffing and account enumeration — one IP probing many different usernames.
ip_to_usernames = defaultdict(set) # set deduplicates automatically
Uses defaultdict(set) instead of defaultdict(list) so that each username is counted once per IP regardless of how many times it was attempted. An IP with 100 attempts at root counts as 1 unique target, not 100.
Combines both failed and invalid_user events — a real credential stuffing attack generates both (some usernames exist on the system, some don't). Looking at only one type gives an incomplete picture.
Threshold: 3 unique usernames from one IP = credential stuffing alert (HIGH).
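A minimal sketch of the detector, assuming a hypothetical detect_multi_target function and illustrative alert field names:

```python
from collections import defaultdict

MULTI_TARGET_THRESHOLD = 3


def detect_multi_target(events):
    """Alert on IPs that tried at least MULTI_TARGET_THRESHOLD distinct usernames."""
    ip_to_usernames = defaultdict(set)
    for event in events:
        # Both outcomes count: stuffing lists mix real and non-existent accounts.
        if event["event_type"] in ("failed", "invalid_user"):
            ip_to_usernames[event["ip"]].add(event["username"])

    return [
        {"type": "credential_stuffing", "severity": "HIGH", "ip": ip,
         "usernames": sorted(usernames)}
        for ip, usernames in ip_to_usernames.items()
        if len(usernames) >= MULTI_TARGET_THRESHOLD
    ]
```

Because the set discards repeats, a brute force against a single account never trips this detector no matter how many attempts it makes.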
File: detectors/web_attacks.py
Analyses Apache access log events across four attack categories:
SQL Injection — 13 keyword patterns including union select, or 1=1, drop table, exec(, cast(, admin'--. Uses break after first match to generate one alert per request (prevents triple-alerting on a single URL that matches multiple patterns).
Directory Traversal — 7 patterns including ../, ..\, /etc/passwd, /etc/shadow, boot.ini, /proc/self. A status 200 response on a traversal URL is flagged specially — it means the server actually returned the file.
Command Injection — 12 patterns including ; cat, ; whoami, | cat, && ls, `whoami`, $(whoami). Receives CRITICAL severity — successful command injection is full remote code execution (RCE).
Vulnerability Scanner Detection — counts 404 responses per IP across the entire log. Fires after the main loop (a two-pass algorithm) because the threshold decision requires knowing the total count, not a running total.
| Attack Type | Severity | MITRE Technique |
|---|---|---|
| Command injection | CRITICAL | T1059 |
| SQL injection | HIGH | T1190 |
| Directory traversal | HIGH | T1083 |
| Vulnerability scanning | MEDIUM | T1595 |
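The break-on-first-match rule and the two-pass scanner check can be sketched together. Only the SQL injection category is shown, with three of the patterns named above; detect_web_attacks and the alert fields are illustrative assumptions, and SCAN_THRESHOLD uses the default of 10 listed in the configuration table:

```python
from collections import Counter

SQLI_PATTERNS = ["union select", "or 1=1", "drop table"]  # subset, for brevity
SCAN_THRESHOLD = 10


def detect_web_attacks(events):
    alerts = []
    not_found = Counter()

    # Pass 1: per-request pattern checks, plus 404 bookkeeping.
    for event in events:
        for pattern in SQLI_PATTERNS:
            if pattern in event["url"]:
                alerts.append({"type": "sql_injection", "severity": "HIGH",
                               "ip": event["ip"], "url": event["url"]})
                break  # one alert per request, even if several patterns match
        if event["status"] == 404:
            not_found[event["ip"]] += 1

    # Pass 2: scanner decision needs the final per-IP totals.
    for ip, count in not_found.items():
        if count >= SCAN_THRESHOLD:
            alerts.append({"type": "vulnerability_scan", "severity": "MEDIUM",
                           "ip": ip, "count": count})
    return alerts
```

Deciding inside the first loop would either fire repeatedly once the running total crosses the threshold or report a stale count, which is why the scanner alert waits until the totals are complete.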
File: correlator.py
Individual alerts are noise. A correlated timeline is the attack story.
The correlator does not detect new threats. It links existing alerts from different detectors into a unified attack narrative grouped by source IP — mirroring the work of a Tier 2 SOC analyst.
Algorithm:
- defaultdict(list) groups all alerts by IP address
- Count unique alert types per IP using set()
- Skip IPs with only 1 unique type (not a coordinated attack)
- For IPs with 2+ unique types: collect sources involved, build chronological timeline, inherit highest severity
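The steps above can be sketched as follows. The alert keys (type, source, severity, timestamp) and the SEVERITY_RANK table are assumptions for illustration, and timestamps are assumed to sort correctly as given:

```python
from collections import defaultdict

SEVERITY_RANK = {"MEDIUM": 1, "HIGH": 2, "CRITICAL": 3}


def correlate(alerts):
    """Group alerts by IP and keep only IPs showing 2+ distinct alert types."""
    by_ip = defaultdict(list)
    for alert in alerts:
        by_ip[alert["ip"]].append(alert)

    incidents = []
    for ip, group in by_ip.items():
        if len({a["type"] for a in group}) < 2:
            continue  # a single technique is not treated as coordinated
        worst = max(group, key=lambda a: SEVERITY_RANK[a["severity"]])
        incidents.append({
            "ip": ip,
            "sources": sorted({a["source"] for a in group}),
            "timeline": sorted(group, key=lambda a: a["timestamp"]),
            "severity": worst["severity"],  # incident inherits the highest severity
        })
    return incidents
```

Note that the correlator creates no alerts of its own; every entry in a timeline is an alert some detector already produced.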
Why 2 unique types as the threshold?
- 1 type → could be an automated scanner with no human behind it
- 2 types → attacker pivoted from one technique to another — definitionally a multi-stage attack
- 3+ types → full kill chain — the highest-value intelligence in the report
Example full kill chain (single IP):
00:01 Apache → Vulnerability scan (18 × 404)
00:02 Apache → SQL injection (UNION SELECT)
00:03 Apache → Directory traversal (../../../etc/passwd → 200)
00:03 SSH → Brute force (45 failed attempts on root)
00:03 SSH → Root account COMPROMISED
02:17 SSH → Off-hours login as dbadmin (lateral movement)
Without correlation: 6 separate low-context alerts. With correlation: one CRITICAL incident with a 6-event timeline.
File: mitre.py
Maps every LogSentry alert to a real MITRE ATT&CK technique ID using the official mitreattack-python library and the enterprise-attack.json STIX 2.0 dataset (~75MB, downloaded once at setup).
from mitreattack.stix20 import MitreAttackData
mitre_data = MitreAttackData("enterprise-attack.json")
technique = mitre_data.get_object_by_attack_id(
technique_id,
"attack-pattern" # MUST be "attack-pattern", NOT "technique"
)
| Detector | Event Type | MITRE ID | Technique Name |
|---|---|---|---|
| brute_force | SSH brute force | T1110.001 | Password Guessing |
| brute_force | Account compromise | T1078 | Valid Accounts |
| off_hours | Off-hours/weekend login | T1078.003 | Local Accounts |
| multi_target | Credential stuffing | T1110.004 | Credential Stuffing |
| web_attacks | SQL injection | T1190 | Exploit Public-Facing Application |
| web_attacks | Directory traversal | T1083 | File and Directory Discovery |
| web_attacks | Command injection | T1059 | Command and Scripting Interpreter |
| web_attacks | Vulnerability scan | T1595 | Active Scanning |
The integration includes graceful fallback — if the MITRE library fails for any reason, the main program never crashes. It returns a minimal dict with the technique ID and URL only.
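One way the fallback could be structured is sketched below. lookup_technique, technique_url and the returned keys are hypothetical names; the dataset is reloaded on every call purely to keep the example short, whereas a real module would load it once:

```python
def technique_url(technique_id):
    """Build the ATT&CK page URL; sub-technique T1110.001 maps to T1110/001."""
    return "https://attack.mitre.org/techniques/" + technique_id.replace(".", "/") + "/"


def lookup_technique(technique_id, stix_path="enterprise-attack.json"):
    """Return technique details, or a minimal dict if the lookup fails for any reason."""
    fallback = {"technique_id": technique_id, "url": technique_url(technique_id)}
    try:
        # Imported lazily so a missing library also lands in the fallback path.
        from mitreattack.stix20 import MitreAttackData

        mitre_data = MitreAttackData(stix_path)
        technique = mitre_data.get_object_by_attack_id(technique_id, "attack-pattern")
        return {**fallback, "name": technique.name}
    except Exception:
        # Missing library, missing dataset, unknown ID: the report still gets a link.
        return fallback
```

The broad except is deliberate in this sketch: enrichment is optional context, so no failure in it should be able to stop the analysis run.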
File: enrichment.py
Queries the AbuseIPDB API for every attacker IP found in the logs, adding real-world threat intelligence context.
- Endpoint: https://api.abuseipdb.com/api/v2/check
- Method: GET
- Free tier: 1,000 checks/day
- Lookback window: 90 days (maxAgeInDays: 90)
Response fields used:
- abuseConfidenceScore (0–100): 0–24 = low, 25–74 = moderate, 75–100 = high
- totalReports: how many times reported globally
- lastReportedAt: timestamp of most recent report
- countryCode: origin country
- isp: internet service provider
Important: AbuseIPDB enriches existing alerts. It does not create new alerts. An IP with a score of 0 that appears in a brute force event is still a brute force alert.
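A sketch of the lookup, using only the standard library so it runs without extra packages (the project itself lists requests as a dependency). The function names are hypothetical; the endpoint, the Key header, the query parameters and the response fields follow the AbuseIPDB v2 check API described above:

```python
import json
import urllib.parse
import urllib.request

API_URL = "https://api.abuseipdb.com/api/v2/check"


def build_check_request(ip, api_key, max_age_days=90):
    """Prepare the GET request without sending it."""
    query = urllib.parse.urlencode({"ipAddress": ip, "maxAgeInDays": max_age_days})
    headers = {"Key": api_key, "Accept": "application/json"}
    return urllib.request.Request(f"{API_URL}?{query}", headers=headers)


def confidence_band(score):
    """Map abuseConfidenceScore (0-100) onto the bands listed above."""
    if score >= 75:
        return "high"
    if score >= 25:
        return "moderate"
    return "low"


def check_ip(ip, api_key):
    """Send the request and keep only the fields the report uses."""
    with urllib.request.urlopen(build_check_request(ip, api_key), timeout=10) as resp:
        data = json.load(resp)["data"]
    return {
        "score": data["abuseConfidenceScore"],
        "band": confidence_band(data["abuseConfidenceScore"]),
        "total_reports": data["totalReports"],
        "last_reported": data["lastReportedAt"],
        "country": data["countryCode"],
        "isp": data["isp"],
    }
```

The result is attached to an existing alert as extra context; nothing in it changes the alert's severity.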
File: report_generator.py + templates/report.html
Generates a professional timestamped HTML report using Jinja2:
env = Environment(loader=FileSystemLoader("./templates"))
template = env.get_template("report.html")
html_output = template.render(**report_data)
Output filename format: logsentry_report_YYYYMMDD_HHMMSS.html
Each run produces a uniquely named file — previous reports are never overwritten, preserving history for compliance and incident post-mortems.
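The filename scheme is straightforward to reproduce with strftime-style formatting. report_path is a hypothetical helper name, and the output directory default mirrors the project layout:

```python
from datetime import datetime
from pathlib import Path


def report_path(output_dir="output", now=None):
    """Build output/logsentry_report_YYYYMMDD_HHMMSS.html for the given moment."""
    now = now or datetime.now()
    return Path(output_dir) / f"logsentry_report_{now:%Y%m%d_%H%M%S}.html"
```

Second-level resolution is what keeps reports from overwriting each other: two runs would have to start within the same second to collide.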
Report sections:
- Executive summary (total alerts, severity breakdown)
- Correlated attack timelines (grouped by IP, chronological)
- SSH alerts table
- Apache alerts table
- Windows alerts table
- Per-alert MITRE ATT&CK technique links
- AbuseIPDB enrichment data per attacker IP
The sample_logs/ directory contains carefully crafted test data covering 7 scenarios.
| IP | Scenario | Detectors triggered | Max severity |
|---|---|---|---|
| 203.0.113.47 | Full kill chain | Multi-target + Brute force + Off-hours | CRITICAL |
| 198.51.100.22 | Credential stuffing, no success | Multi-target + Brute force | HIGH |
| 45.33.32.156 | Simple brute force | Brute force | HIGH |
| 192.168.x.x | Legitimate weekend logins | Off-hours (false positive demo) | CRITICAL |
203.0.113.47 full timeline:
00:02 → Username enumeration (admin, ubuntu, deploy, git) HIGH
00:02 → 45 failed root login attempts HIGH
00:03 → ROOT ACCOUNT COMPROMISED CRITICAL
02:17 → Returns as dbadmin (lateral movement, off-hours) CRITICAL
| IP | Scenario | Detectors triggered | Max severity |
|---|---|---|---|
| 203.0.113.47 | Full web attack chain | Scan + SQLi + Traversal + CMDi | CRITICAL |
| 198.51.100.22 | SQL + traversal combo | SQLi + Traversal | HIGH |
| 45.33.32.156 | Pure vulnerability scanner | Scanner detection | MEDIUM |
Correlated: 203.0.113.47 appears in both logs → the correlator produces a single CRITICAL incident spanning: web recon → SQL injection → directory traversal (passwd file leaked) → SSH brute force → root compromise.
Requirements: Python 3.10+
# 1. Clone the repository
git clone https://github.com/Doumit04/LogSentry.git
cd LogSentry
# 2. Create and activate virtual environment
python -m venv .venv
.venv\Scripts\activate # Windows
source .venv/bin/activate # Linux / macOS
# 3. Install dependencies
pip install -r requirements.txt
# 4. Download MITRE ATT&CK dataset (one time, ~75MB)
curl -L -o enterprise-attack.json https://raw.githubusercontent.com/mitre/cti/master/enterprise-attack/enterprise-attack.json
# 5. Configure API key
cp .env.example .env
# Edit .env and add your AbuseIPDB API key
Dependencies:
requests
python-dotenv
jinja2
colorama
python-evtx
mitreattack-python
# Analyse SSH logs
python main.py --log sample_logs/auth.log --type ssh
# Analyse Apache logs
python main.py --log sample_logs/access.log --type apache
# Analyse Windows Event Logs
python main.py --log sample_logs/Security.evtx --type windows
# Analyse all sources at once
python main.py --log sample_logs/ --type all
Reports are saved to the output/ directory with a timestamp in the filename.
Edit .env to configure API keys:
ABUSEIPDB_API_KEY=your_api_key_here
Edit detection thresholds in the relevant detector files:
| Constant | Default | File | Rationale |
|---|---|---|---|
| BRUTE_FORCE_THRESHOLD | 5 | brute_force.py | Fail2Ban industry standard |
| MULTI_TARGET_THRESHOLD | 3 | multi_target.py | 2 = shared workstation; 3 = enumeration |
| SCAN_THRESHOLD | 10 | web_attacks.py | OSSEC default |
| WORK_START | 8 | off_hours.py | Standard 8-hour workday |
| WORK_END | 18 | off_hours.py | ISO/IEC 27001 |
False positive philosophy: LogSentry is designed to surface everything suspicious and let a human decide. This mirrors real SIEM behaviour. Tuning thresholds so high that only obvious attacks trigger results in missed detections. In security, a false positive wastes an analyst's time — a false negative lets an attacker through.
Parsing vs detection: The most important distinction in this codebase. A parser converts a raw log line into a normalised dictionary — it makes no judgments. A detector receives normalised events and asks one question: does this data match a specific attack pattern? This separation means adding a new log source requires writing one new parser with zero changes to any detector, and adding a new detection rule requires writing one new detector with zero changes to any parser.
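The separation can be illustrated with a tiny orchestration sketch. run_pipeline, toy_parser and toy_detector are made up for this example and are not taken from main.py; the point is only that the parser and the detectors meet at the event list and nowhere else:

```python
def run_pipeline(lines, parse_line, detectors):
    """Parse raw lines into events, then hand the same events to every detector."""
    events = [event for event in map(parse_line, lines) if event is not None]
    alerts = []
    for detect in detectors:
        alerts.extend(detect(events))
    return alerts


def toy_parser(line):
    """Toy format 'IP EVENT_TYPE': normalises a line and makes no judgment about it."""
    ip, _, event_type = line.partition(" ")
    return {"ip": ip, "event_type": event_type, "source": "toy"} if event_type else None


def toy_detector(events):
    """Judges events only; never touches a file or a raw line."""
    return [{"type": "toy_failure", "ip": e["ip"]} for e in events
            if e["event_type"] == "failed"]
```

Supporting a new log source in this arrangement means passing a different parse_line; the detectors list stays untouched, and the reverse holds for new detection rules.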
Why correlation is the highest-value feature: A 404 scanning alert could be a broken link aggregator. A brute force alert could be a misconfigured CI/CD system. An off-hours login could be a developer in a different timezone. The same three events correlated to a single IP — in order, over 45 minutes — are unambiguous: a human attacker performed reconnaissance, validated credentials via brute force, and then used the compromised account during off-hours to avoid detection. The correlator transforms noise into a decision-ready incident report.
| Skill Area | Specific Knowledge |
|---|---|
| SIEM architecture | Ingest → Normalise → Detect → Correlate → Alert pipeline |
| Log format parsing | Syslog, Combined Log Format, Windows EVTX binary |
| Regex engineering | Greedy vs lazy quantifiers, anchor strings, evasion-resistant normalisation |
| Python standard library | Counter, defaultdict, re, datetime.strptime, generator expressions |
| Threat detection logic | Threshold-based, behavioural anomaly, multi-source correlation |
| MITRE ATT&CK | STIX 2.0, technique taxonomy, kill chain phases, sub-technique URLs |
| Threat intelligence APIs | REST integration, authentication headers, confidence scores |
| Windows event logs | EVTX binary format, XML namespace handling, Event ID taxonomy |
| False positive management | Threshold tuning rationale, SOC analyst workflow, insider threat modelling |
| Software design | Single responsibility principle applied to security tooling |
A 41-page deep technical reference document is included in the docs/ directory, covering every algorithmic decision, regex pattern, threshold rationale, XML schema, API integration, and architectural trade-off in the project.
Topics covered: SSH Parser · Apache Parser · Windows EVTX Parser · Brute Force Detector · Off-Hours Detector · Multi-Target Detector · Web Attack Detector · Correlator · Reporter · MITRE ATT&CK Integration · AbuseIPDB Integration · Jinja2 Reporting
Built by Tony Doumit — March 2026