
Doumit04/LogSentry


LogSentry — SIEM Log Analyzer & Threat Detection Engine

A multi-source log analysis and threat detection tool built in Python, mirroring the internal architecture of enterprise SIEM platforms such as Splunk and IBM QRadar.



Overview

LogSentry is a Python-based security log analyzer that processes raw log files from multiple sources, applies threat detection rules, correlates alerts across sources, enriches findings with real-world threat intelligence, and produces a timestamped HTML incident report.

It supports three log formats out of the box:

  • SSH: /var/log/auth.log and /var/log/secure
  • Apache/nginx: Combined Log Format (access.log)
  • Windows Security: binary .evtx event log files

The tool is designed around the same fundamental pipeline that powers enterprise SIEM platforms — ingest, normalise, detect, correlate, report — rebuilt from scratch in Python to demonstrate deep understanding of the internals.


Why LogSentry?

Most security students use pre-built SIEM tools. LogSentry was built from scratch to prove understanding of what those tools actually do internally:

| What commercial SIEMs do | What LogSentry does |
|---|---|
| Universal forwarders / log collectors | Custom parsers per log format |
| Common Information Model (CIM) | Uniform event dictionary schema |
| SPL/AQL correlation rules | Python detector modules |
| Notable Events / Offenses | correlator.py timeline engine |
| Threat intelligence feeds (TAXII) | AbuseIPDB API integration |
| MITRE ATT&CK app for Splunk | mitre.py with STIX 2.0 dataset |
| Dashboards, PDF exports | Jinja2 HTML report |

Architecture

LogSentry is built around a strict four-layer pipeline with complete separation of concerns:

Raw Log Files
      │
      ▼
┌─────────────┐
│   PARSERS   │  ssh_parser.py / apache_parser.py / windows_parser.py
└─────────────┘
      │  List of normalised event dicts
      ▼
┌─────────────┐
│  DETECTORS  │  brute_force.py / off_hours.py / multi_target.py / web_attacks.py
└─────────────┘
      │  List of alert dicts
      ▼
┌─────────────┐
│ CORRELATOR  │  correlator.py
└─────────────┘
      │  Grouped attack timelines
      ▼
┌─────────────┐
│  REPORTER   │  report_generator.py + report.html (Jinja2)
└─────────────┘
      │
      ▼
Timestamped HTML Report

The core design rule: A detector never reads a file. A parser never makes a judgment. A reporter never analyses data. Each module has exactly one job.
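As a rough illustration of that rule, the orchestration step can be sketched as a function that only wires the layers together. The function name and signature below are hypothetical and are not taken from main.py; each stage is passed in as a plain callable so that no layer needs to know how the others work.

```python
def run_pipeline(lines, parse_line, detectors, correlate):
    """Hypothetical orchestration sketch: parse -> detect -> correlate."""
    # Parsers only normalise: one raw line in, one event dict (or None) out.
    events = [e for e in (parse_line(line) for line in lines) if e is not None]
    # Detectors only judge: they receive events and never touch files.
    alerts = [alert for detect in detectors for alert in detect(events)]
    # The correlator only groups alerts that already exist.
    return correlate(alerts)
```

Swapping in a new parser or detector then means passing a different callable, with no change to the other layers.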


Project Structure

logsentry/
├── main.py                    # Orchestration entry point
├── correlator.py              # Cross-source timeline builder
├── mitre.py                   # MITRE ATT&CK technique enrichment
├── enrichment.py              # AbuseIPDB API integration
├── report_generator.py        # Jinja2 HTML report builder
├── requirements.txt
├── .env                       # API keys (never commit this)
│
├── parsers/
│   ├── __init__.py
│   ├── ssh_parser.py          # auth.log / secure parser
│   ├── apache_parser.py       # access.log parser
│   └── windows_parser.py      # .evtx binary parser
│
├── detectors/
│   ├── __init__.py
│   ├── brute_force.py         # Threshold + CRITICAL escalation
│   ├── off_hours.py           # Temporal anomaly detection
│   ├── multi_target.py        # Credential stuffing detection
│   └── web_attacks.py         # SQL/traversal/command/scan detection
│
├── templates/
│   └── report.html            # Jinja2 HTML template
│
├── sample_logs/
│   ├── auth.log               # 4-scenario SSH test data
│   └── access.log             # 3-scenario Apache test data
│
└── output/                    # Generated reports saved here

Features

  • Multi-format parsing — SSH, Apache CLF, Windows EVTX binary
  • Brute force detection with CRITICAL escalation on confirmed account compromise
  • Off-hours & weekend access detection for insider threat modelling
  • Credential stuffing detection via multi-target username enumeration
  • Web attack detection — SQL injection, directory traversal, command injection, vulnerability scanning
  • Cross-source correlation — links alerts from different log types to the same attacker IP
  • MITRE ATT&CK mapping — every alert mapped to a real technique ID (STIX 2.0)
  • AbuseIPDB enrichment — live threat intelligence context for every attacker IP
  • HTML report generation — professional timestamped incident reports via Jinja2

Parsers

All parsers output a uniform event dictionary so that every detector can consume events from any log source without modification:

{
    "timestamp":  "Mar 29 02:14:33",   # raw string from log
    "ip":         "192.168.1.105",     # always normalised string
    "username":   "root",              # N/A when not available
    "event_type": "failed",            # "failed" | "success" | "invalid_user"
    "source":     "ssh",               # "ssh" | "apache" | "windows"

    # Apache-only
    "method":     "GET",
    "url":        "/index.php",
    "status":     200,                 # int — needed for numeric comparison
    "size":       "1234",              # str — display only

    # Windows-only
    "event_id":   4625,                # int
}

SSH Parser

Processes /var/log/auth.log and /var/log/secure using three compiled regex patterns:

import re

FAILED_PATTERN       = re.compile(r'(\w+\s+\d+\s+\d+:\d+:\d+).*Failed password for (\S+) from (\S+) port')
SUCCESS_PATTERN      = re.compile(r'(\w+\s+\d+\s+\d+:\d+:\d+).*Accepted password for (\S+) from (\S+) port')
INVALID_USER_PATTERN = re.compile(r'(\w+\s+\d+\s+\d+:\d+:\d+).*Invalid user (\S+) from (\S+) port')

| Event type | Meaning | Attack implication |
|---|---|---|
| failed | Username exists, password wrong | Password spraying / brute force |
| invalid_user | Username does not exist | Account enumeration / credential stuffing |
| success | Successful authentication | Legitimate or post-compromise access |

Files are opened with encoding='utf-8', errors='ignore' to handle non-UTF-8 bytes that occasionally appear in production log files from special usernames or log aggregators.
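A minimal sketch of how one log line could be turned into the uniform event dictionary, using the three patterns shown above. The helper name parse_ssh_line, the PATTERNS mapping and the sample log line are illustrative assumptions, not code from ssh_parser.py.

```python
import re

# The three patterns from this section, keyed by the event_type they produce.
PATTERNS = {
    "failed": re.compile(
        r'(\w+\s+\d+\s+\d+:\d+:\d+).*Failed password for (\S+) from (\S+) port'),
    "success": re.compile(
        r'(\w+\s+\d+\s+\d+:\d+:\d+).*Accepted password for (\S+) from (\S+) port'),
    "invalid_user": re.compile(
        r'(\w+\s+\d+\s+\d+:\d+:\d+).*Invalid user (\S+) from (\S+) port'),
}

def parse_ssh_line(line):
    """Return a normalised event dict, or None if the line is not of interest."""
    for event_type, pattern in PATTERNS.items():
        match = pattern.search(line)
        if match:
            timestamp, username, ip = match.groups()
            return {
                "timestamp": timestamp,
                "ip": ip,
                "username": username,
                "event_type": event_type,
                "source": "ssh",
            }
    return None

# Made-up sample line in the usual sshd syslog layout.
sample = ("Mar 29 02:14:33 host sshd[1234]: Failed password for root "
          "from 192.168.1.105 port 52144 ssh2")
event = parse_ssh_line(sample)
```

Lines that match none of the patterns return None, so the caller can simply skip them.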

Apache Parser

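Processes access logs in Combined Log Format, the default for nginx and for many Apache distributions. Combined extends Common Log Format (CLF) with referer and user-agent fields. The example line below contains only the leading fields that both formats share: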

203.0.113.47 - - [29/Mar/2026:00:01:15 +0000] "GET /wp-admin HTTP/1.1" 404 512

The regex captures: client IP, timestamp, HTTP method, URL path, status code, response size. The status code is immediately converted to int at parse time for numeric comparison in detectors. All URLs are lowercased at parse time to eliminate case-based evasion techniques (SeLeCt → select).
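The parsing rules described above might look like the following sketch. The regex and the helper name parse_access_line are assumptions written for this illustration; the actual pattern in apache_parser.py may differ. The event_type value for web events is not specified in this section, so the sketch leaves it out.

```python
import re

# Hypothetical pattern for the leading fields of an access-log line:
# ip, ident, user, [timestamp], "METHOD URL PROTOCOL", status, size
ACCESS_PATTERN = re.compile(
    r'(\S+) \S+ \S+ \[([^\]]+)\] "(\S+) (\S+)[^"]*" (\d{3}) (\S+)'
)

def parse_access_line(line):
    """Return a normalised event dict, or None if the line does not match."""
    match = ACCESS_PATTERN.match(line)
    if match is None:
        return None
    ip, timestamp, method, url, status, size = match.groups()
    return {
        "timestamp": timestamp,
        "ip": ip,
        "username": "N/A",
        "source": "apache",
        "method": method,
        "url": url.lower(),     # lowercased once, so detectors compare lowercase only
        "status": int(status),  # int, so detectors can compare numerically
        "size": size,           # kept as str, display only
    }

sample = '203.0.113.47 - - [29/Mar/2026:00:01:15 +0000] "GET /WP-Admin HTTP/1.1" 404 512'
event = parse_access_line(sample)
```

Doing the lowercasing and the int conversion in the parser keeps every detector free of normalisation code.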

Windows Event Log Parser

Windows .evtx files are binary. The parser uses python-evtx to decode each record into an XML string, then navigates the DOM using xml.etree.ElementTree.

Every tag requires the Microsoft event schema namespace prefix:

{http://schemas.microsoft.com/win/2004/08/events/event}

| Event ID | Event Name | Has IP? |
|---|---|---|
| 4625 | Failed Logon | Yes |
| 4624 | Successful Logon | Yes |
| 4740 | Account Lockout | Yes |
| 4720 | User Account Created | No (local event) |
| 4672 | Special Privileges Assigned | No (local event) |
| 4732 | Member Added to Security Group | No (local event) |
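To show what the namespace prefix means in practice, here is a small sketch that parses one already-decoded event record with xml.etree.ElementTree. It starts from a hand-written XML string rather than a real .evtx file, so python-evtx is not needed to run it. The function name, the "other" fallback for event_type and the sample record are assumptions made for this illustration.

```python
import xml.etree.ElementTree as ET

NS = "{http://schemas.microsoft.com/win/2004/08/events/event}"

# Hand-written sample resembling a decoded 4625 record (not real log data).
SAMPLE_XML = """<Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
  <System>
    <EventID>4625</EventID>
    <TimeCreated SystemTime="2026-03-29T02:14:33.000000Z"/>
  </System>
  <EventData>
    <Data Name="TargetUserName">administrator</Data>
    <Data Name="IpAddress">203.0.113.47</Data>
  </EventData>
</Event>"""

# Assumed mapping; IDs outside it fall back to "other" in this sketch.
EVENT_TYPES = {4625: "failed", 4624: "success"}

def parse_event_xml(xml_string):
    root = ET.fromstring(xml_string)
    # Every tag lookup needs the namespace prefix, including nested ones.
    event_id = int(root.find(f"{NS}System/{NS}EventID").text)
    time_created = root.find(f"{NS}System/{NS}TimeCreated").get("SystemTime")
    data = {d.get("Name"): d.text for d in root.iter(f"{NS}Data")}
    return {
        "timestamp": time_created,
        "ip": data.get("IpAddress") or "N/A",
        "username": data.get("TargetUserName") or "N/A",
        "event_type": EVENT_TYPES.get(event_id, "other"),
        "source": "windows",
        "event_id": event_id,
    }

event = parse_event_xml(SAMPLE_XML)
```

Without the prefix, root.find("System/EventID") returns None, which is the usual first stumbling block with these files.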

Detectors

Brute Force Detector

File: detectors/brute_force.py

Detects repeated failed login attempts from a single IP, and escalates to CRITICAL if the same IP later succeeds.

Algorithm:

  1. Separate failed_events and success_events from input
  2. failure_counts = Counter(e["ip"] for e in failed_events)
  3. For each IP where count >= BRUTE_FORCE_THRESHOLD (5):
    • Collect targeted usernames, first/last timestamps → HIGH alert
  4. If the same IP appears in success_events → CRITICAL alert (account compromised)
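The four steps can be sketched as follows. This is an illustration under stated assumptions, not the code in brute_force.py: the alert field names and the "account_compromise" type are made up, events are assumed to arrive in chronological order, and the escalation follows step 4 literally (any success from the same IP) without comparing timestamps.

```python
from collections import Counter

BRUTE_FORCE_THRESHOLD = 5

def detect_brute_force(events):
    # Step 1: split the input by event type.
    failed_events = [e for e in events if e["event_type"] == "failed"]
    success_ips = {e["ip"] for e in events if e["event_type"] == "success"}
    # Step 2: count failures per source IP.
    failure_counts = Counter(e["ip"] for e in failed_events)

    alerts = []
    for ip, count in failure_counts.items():
        # Step 3: only IPs at or above the threshold produce an alert.
        if count < BRUTE_FORCE_THRESHOLD:
            continue
        attempts = [e for e in failed_events if e["ip"] == ip]
        alerts.append({
            "type": "brute_force",
            "ip": ip,
            "count": count,
            "usernames": sorted({e["username"] for e in attempts}),
            "first_seen": attempts[0]["timestamp"],
            "last_seen": attempts[-1]["timestamp"],
            "severity": "HIGH",
        })
        # Step 4: a success from the same IP escalates to CRITICAL.
        if ip in success_ips:
            alerts.append({
                "type": "account_compromise",
                "ip": ip,
                "severity": "CRITICAL",
            })
    return alerts
```

A stricter variant would compare the success timestamp with last_seen, so that a login that happened before the failures does not count as a compromise.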

Why threshold = 5?

| Behaviour | Typical count | Result |
|---|---|---|
| Legitimate user forgot password | 1–3 | No alert |
| Legitimate user, caps lock | 2–4 | No alert |
| Manual attacker testing | 5–20 | Alert fires |
| Automated tool (hydra, medusa) | 100–10,000 | Alert fires |

5 is a widely used middle ground: it matches Fail2Ban's default maxretry value, and many commercial SIEM rulesets use a similar threshold.


Off-Hours Detector

File: detectors/off_hours.py

Analyses successful logins only (failed attempts at odd hours are already caught by the brute force detector — no double alerting). Detects access outside business hours and on weekends.

WORK_START   = 8     # 08:00
WORK_END     = 18    # 18:00
WEEKEND_DAYS = [5, 6] # Saturday, Sunday

| Condition | Severity | Rationale |
|---|---|---|
| Weekend login | CRITICAL | Almost no legitimate reason; highest insider-threat signal |
| Weekday off-hours login | HIGH | Suspicious; requires investigation |
| Weekday in-hours login | No alert | Normal |

Why this matters: Insider threats use legitimate credentials. A disgruntled employee accessing the database server at 2 AM on a Sunday is as dangerous as an external attacker. The tool flags the anomaly — a human analyst with organisational context makes the final call.

Timestamp year injection: Syslog format (RFC 3164) does not include the year. The parser injects the current year:

from datetime import datetime

full_ts = f"{datetime.now().year} {ts_str}"   # ts_str: raw syslog timestamp, e.g. "Mar 29 02:14:33"
event_time = datetime.strptime(full_ts, "%Y %b %d %H:%M:%S")
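Putting the constants, the severity rules and the year injection together, the classification could be sketched like this. The function name and the optional year parameter are assumptions added so the example is reproducible. Treating business hours as the half-open range 08:00 to 17:59:59 is also an assumption, since the section does not say whether 18:00 itself counts as in-hours.

```python
from datetime import datetime

WORK_START = 8
WORK_END = 18
WEEKEND_DAYS = [5, 6]  # datetime.weekday(): Monday is 0, so 5 = Saturday, 6 = Sunday

def classify_login(ts_str, year=None):
    """Return "CRITICAL", "HIGH", or None for a successful login timestamp."""
    # Syslog timestamps carry no year, so one is injected before parsing.
    if year is None:
        year = datetime.now().year
    when = datetime.strptime(f"{year} {ts_str}", "%Y %b %d %H:%M:%S")
    if when.weekday() in WEEKEND_DAYS:
        return "CRITICAL"
    if not (WORK_START <= when.hour < WORK_END):
        return "HIGH"
    return None
```

For example, 29 March 2026 is a Sunday, so classify_login("Mar 29 02:14:33", year=2026) returns "CRITICAL", while the same time on Monday 30 March returns "HIGH". Injecting the current year can misdate entries when a log written in December is analysed in January, which is worth keeping in mind for logs that span a year boundary.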

Multi-Target Detector

File: detectors/multi_target.py

Detects credential stuffing and account enumeration — one IP probing many different usernames.

ip_to_usernames = defaultdict(set)  # set deduplicates automatically

Uses defaultdict(set) instead of defaultdict(list) to ensure each username is counted once per IP regardless of how many times it was attempted. An IP with 100 attempts at root counts as 1 unique target, not 100.

Combines both failed and invalid_user events — a real credential stuffing attack generates both (some usernames exist on the system, some don't). Looking at only one type gives an incomplete picture.

Threshold: 3 unique usernames from one IP = credential stuffing alert (HIGH).
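A minimal sketch of this detector, assuming the uniform event dictionary from the Parsers section. The function name and alert fields are hypothetical.

```python
from collections import defaultdict

MULTI_TARGET_THRESHOLD = 3

def detect_multi_target(events):
    ip_to_usernames = defaultdict(set)  # set: each username counts once per IP
    for event in events:
        # Both event types count, since an attacker hits real and unknown usernames.
        if event["event_type"] in ("failed", "invalid_user"):
            ip_to_usernames[event["ip"]].add(event["username"])
    return [
        {
            "type": "multi_target",
            "ip": ip,
            "usernames": sorted(usernames),
            "severity": "HIGH",
        }
        for ip, usernames in ip_to_usernames.items()
        if len(usernames) >= MULTI_TARGET_THRESHOLD
    ]
```

With this logic, 100 failed attempts against root from one IP produce no multi-target alert, while single attempts against admin, ubuntu and deploy from another IP do.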


Web Attacks Detector

File: detectors/web_attacks.py

Analyses Apache access log events across four attack categories:

SQL Injection — 13 keyword patterns including union select, or 1=1, drop table, exec(, cast(, admin'--. Uses break after first match to generate one alert per request (prevents triple-alerting on a single URL that matches multiple patterns).

Directory Traversal — 7 patterns including ../, ..\, /etc/passwd, /etc/shadow, boot.ini, /proc/self. A status 200 response on a traversal URL is flagged specially — it means the server actually returned the file.

Command Injection — 12 patterns including ; cat, ; whoami, | cat, && ls, `whoami`, $(whoami). Receives CRITICAL severity — successful command injection is full remote code execution (RCE).

Vulnerability Scanner Detection — counts 404 responses per IP across the entire log. Fires after the main loop (a two-pass algorithm) because the threshold decision requires knowing the total count, not a running total.

| Attack Type | Severity | MITRE Technique |
|---|---|---|
| Command injection | CRITICAL | T1059 |
| SQL injection | HIGH | T1190 |
| Directory traversal | HIGH | T1083 |
| Vulnerability scanning | MEDIUM | T1595 |
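The break-after-first-match rule and the two-pass scanner check can be sketched together. The pattern lists below are small illustrative subsets, not the full lists in web_attacks.py, and the function name, alert fields and the use of >= for the threshold comparison are assumptions.

```python
from collections import Counter

SCAN_THRESHOLD = 10

# (alert type, substring patterns, severity); subsets for illustration only.
RULES = [
    ("command_injection", ["; cat", "; whoami", "$(whoami)"], "CRITICAL"),
    ("sql_injection", ["union select", "or 1=1", "drop table"], "HIGH"),
    ("directory_traversal", ["../", "/etc/passwd"], "HIGH"),
]

def detect_web_attacks(events):
    alerts = []
    not_found = Counter()

    # Pass 1: per-request signature checks, plus 404 counting per IP.
    for event in events:
        if event["status"] == 404:
            not_found[event["ip"]] += 1
        for alert_type, patterns, severity in RULES:
            for pattern in patterns:
                if pattern in event["url"]:  # URLs are already lowercased
                    alerts.append({
                        "type": alert_type,
                        "ip": event["ip"],
                        "url": event["url"],
                        "pattern": pattern,
                        "severity": severity,
                    })
                    break  # one alert per category per request

    # Pass 2: the scanner decision needs each IP's total 404 count.
    for ip, count in not_found.items():
        if count >= SCAN_THRESHOLD:
            alerts.append({
                "type": "vulnerability_scan",
                "ip": ip,
                "count": count,
                "severity": "MEDIUM",
            })
    return alerts
```

A URL that contains both union select and or 1=1 yields a single SQL injection alert, because the inner loop stops at the first matching pattern.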

Correlator

File: correlator.py

Individual alerts are noise. A correlated timeline is the attack story.

The correlator does not detect new threats. It links existing alerts from different detectors into a unified attack narrative grouped by source IP — mirroring the work of a Tier 2 SOC analyst.

Algorithm:

  1. defaultdict(list) groups all alerts by IP address
  2. Count unique alert types per IP using set()
  3. Skip IPs with only 1 unique type — not a coordinated attack
  4. For IPs with 2+ unique types: collect sources involved, build chronological timeline, inherit highest severity
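A sketch of these four steps, assuming each alert dict carries ip, type, source, severity and a sortable time value. Those field names, the function name and the severity ranking table are assumptions for illustration; correlator.py may structure its data differently.

```python
from collections import defaultdict

SEVERITY_RANK = {"MEDIUM": 1, "HIGH": 2, "CRITICAL": 3}

def correlate(alerts):
    # Step 1: group all alerts by source IP.
    by_ip = defaultdict(list)
    for alert in alerts:
        by_ip[alert["ip"]].append(alert)

    incidents = []
    for ip, group in by_ip.items():
        # Step 2: count unique alert types per IP.
        alert_types = {a["type"] for a in group}
        # Step 3: a single type is not treated as a coordinated attack.
        if len(alert_types) < 2:
            continue
        # Step 4: sources, chronological timeline, highest severity.
        incidents.append({
            "ip": ip,
            "alert_types": sorted(alert_types),
            "sources": sorted({a["source"] for a in group}),
            "timeline": sorted(group, key=lambda a: a["time"]),
            "severity": max(
                (a["severity"] for a in group),
                key=lambda s: SEVERITY_RANK.get(s, 0),
            ),
        })
    return incidents
```

The correlator in this sketch creates no new detections; it only regroups alerts that the detectors already produced.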

Why 2 unique types as the threshold?

  • 1 type → could be an automated scanner with no human behind it
  • 2 types → attacker pivoted from one technique to another — definitionally a multi-stage attack
  • 3+ types → full kill chain — the highest-value intelligence in the report

Example full kill chain (single IP):

00:01  Apache  → Vulnerability scan (18 × 404)
00:02  Apache  → SQL injection (UNION SELECT)
00:03  Apache  → Directory traversal (../../../etc/passwd → 200)
00:03  SSH     → Brute force (45 failed attempts on root)
00:03  SSH     → Root account COMPROMISED
02:17  SSH     → Off-hours login as dbadmin (lateral movement)

Without correlation: 6 separate low-context alerts. With correlation: one CRITICAL incident with a 6-event timeline.


External Integrations

MITRE ATT&CK

File: mitre.py

Maps every LogSentry alert to a real MITRE ATT&CK technique ID using the official mitreattack-python library and the enterprise-attack.json STIX 2.0 dataset (~75MB, downloaded once at setup).

from mitreattack.stix20 import MitreAttackData
mitre_data = MitreAttackData("enterprise-attack.json")

technique = mitre_data.get_object_by_attack_id(
    technique_id,
    "attack-pattern"   # MUST be "attack-pattern", NOT "technique"
)

| Detector | Event Type | MITRE ID | Technique Name |
|---|---|---|---|
| brute_force | SSH brute force | T1110.001 | Password Guessing |
| brute_force | Account compromise | T1078 | Valid Accounts |
| off_hours | Off-hours/weekend login | T1078.003 | Local Accounts |
| multi_target | Credential stuffing | T1110.004 | Credential Stuffing |
| web_attacks | SQL injection | T1190 | Exploit Public-Facing App |
| web_attacks | Directory traversal | T1083 | File and Directory Discovery |
| web_attacks | Command injection | T1059 | Command and Scripting Interpreter |
| web_attacks | Vulnerability scan | T1595 | Active Scanning |

The integration includes graceful fallback — if the MITRE library fails for any reason, the main program never crashes. It returns a minimal dict with the technique ID and URL only.
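One way to get that behaviour is to build the fallback dict first and only add library data if the lookup succeeds. In the sketch below the lookup is passed in as a callable so the example runs without the dataset; in real use it could wrap mitre_data.get_object_by_attack_id(technique_id, "attack-pattern"). The function names and the returned field names are assumptions.

```python
ATTACK_BASE_URL = "https://attack.mitre.org/techniques/"

def technique_url(technique_id):
    # Sub-techniques use a nested path: T1110.001 -> .../T1110/001/
    return ATTACK_BASE_URL + technique_id.replace(".", "/") + "/"

def enrich_technique(technique_id, lookup):
    """Return technique details, or a minimal dict if the lookup fails."""
    fallback = {"id": technique_id, "url": technique_url(technique_id)}
    try:
        technique = lookup(technique_id)
        # A lookup that returns None also lands in the except branch below.
        return {**fallback, "name": technique["name"]}
    except Exception:
        # Enrichment is optional: never let it take down the main program.
        return fallback
```

Catching Exception broadly is deliberate here, because a missing dataset file, a changed library API and an unknown technique ID should all degrade to the same minimal result.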

AbuseIPDB

File: enrichment.py

Queries the AbuseIPDB API for every attacker IP found in the logs, adding real-world threat intelligence context.

  • Endpoint: https://api.abuseipdb.com/api/v2/check
  • Method: GET
  • Free tier: 1,000 checks/day
  • Lookback window: 90 days (maxAgeInDays: 90)

Response fields used:

  • abuseConfidenceScore (0–100): 0–24 = low, 25–74 = moderate, 75–100 = high
  • totalReports: how many times reported globally
  • lastReportedAt: timestamp of most recent report
  • countryCode: origin country
  • isp: internet service provider

Important: AbuseIPDB enriches existing alerts. It does not create new alerts. An IP with a score of 0 that appears in a brute force event is still a brute force alert.
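The request and the score bands above can be sketched as follows. The project lists requests as a dependency; this example uses the standard library's urllib instead so it runs without extra packages, and it only builds the request rather than sending it. The function names are hypothetical. The Key and Accept headers follow the AbuseIPDB v2 API convention.

```python
import json
import urllib.parse
import urllib.request

CHECK_URL = "https://api.abuseipdb.com/api/v2/check"

def build_check_request(ip, api_key, max_age_days=90):
    """Build (but do not send) a GET request for one IP."""
    query = urllib.parse.urlencode({"ipAddress": ip, "maxAgeInDays": max_age_days})
    return urllib.request.Request(
        f"{CHECK_URL}?{query}",
        headers={"Key": api_key, "Accept": "application/json"},
        method="GET",
    )

def confidence_band(score):
    """Map abuseConfidenceScore (0-100) to the bands used in this section."""
    if score >= 75:
        return "high"
    if score >= 25:
        return "moderate"
    return "low"

def summarise(response_body):
    """Pick the fields used in the report out of a JSON response body."""
    data = json.loads(response_body)["data"]
    return {
        "score": data["abuseConfidenceScore"],
        "band": confidence_band(data["abuseConfidenceScore"]),
        "total_reports": data["totalReports"],
        "last_reported": data["lastReportedAt"],
        "country": data["countryCode"],
        "isp": data["isp"],
    }
```

To send the request, pass it to urllib.request.urlopen and feed the response body to summarise. The result is attached to an existing alert and never decides whether the alert exists.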


Report Generation

File: report_generator.py + templates/report.html

Generates a professional timestamped HTML report using Jinja2:

env      = Environment(loader=FileSystemLoader("./templates"))
template = env.get_template("report.html")
html_output = template.render(**report_data)

Output filename format: logsentry_report_YYYYMMDD_HHMMSS.html

Each run produces a uniquely named file, so previous reports are never overwritten, preserving history for compliance and incident post-mortems.
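The filename format can be produced with a datetime format spec. This is a small sketch; the helper name report_path and its parameters are assumptions, not code from report_generator.py.

```python
from datetime import datetime
from pathlib import Path

def report_path(output_dir="output", now=None):
    """Build output/logsentry_report_YYYYMMDD_HHMMSS.html for the given time."""
    if now is None:
        now = datetime.now()
    return Path(output_dir) / f"logsentry_report_{now:%Y%m%d_%H%M%S}.html"
```

For a run at 02:14:33 on 29 March 2026 this gives logsentry_report_20260329_021433.html. With one-second resolution, two runs started within the same second would share a name.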

Report sections:

  • Executive summary (total alerts, severity breakdown)
  • Correlated attack timelines (grouped by IP, chronological)
  • SSH alerts table
  • Apache alerts table
  • Windows alerts table
  • Per-alert MITRE ATT&CK technique links
  • AbuseIPDB enrichment data per attacker IP

Sample Log Scenarios

The sample_logs/ directory contains carefully crafted test data covering 7 scenarios.

auth.log — 4 scenarios

| IP | Scenario | Detectors triggered | Max severity |
|---|---|---|---|
| 203.0.113.47 | Full kill chain | Multi-target + Brute force + Off-hours | CRITICAL |
| 198.51.100.22 | Credential stuffing, no success | Multi-target + Brute force | HIGH |
| 45.33.32.156 | Simple brute force | Brute force | HIGH |
| 192.168.x.x | Legitimate weekend logins | Off-hours (false positive demo) | CRITICAL |

203.0.113.47 full timeline:

00:02  → Username enumeration (admin, ubuntu, deploy, git)   HIGH
00:02  → 45 failed root login attempts                        HIGH
00:03  → ROOT ACCOUNT COMPROMISED                             CRITICAL
02:17  → Returns as dbadmin (lateral movement, off-hours)     CRITICAL

access.log — 3 scenarios

| IP | Scenario | Detectors triggered | Max severity |
|---|---|---|---|
| 203.0.113.47 | Full web attack chain | Scan + SQLi + Traversal + CMDi | CRITICAL |
| 198.51.100.22 | SQL + traversal combo | SQLi + Traversal | HIGH |
| 45.33.32.156 | Pure vulnerability scanner | Scanner detection | MEDIUM |

Correlated: 203.0.113.47 appears in both logs → the correlator produces a single CRITICAL incident spanning: web recon → SQL injection → directory traversal (passwd file leaked) → SSH brute force → root compromise.


Installation

Requirements: Python 3.10+

# 1. Clone the repository
git clone https://github.com/Doumit04/LogSentry.git
cd LogSentry

# 2. Create and activate virtual environment
python -m venv .venv
.venv\Scripts\activate        # Windows
source .venv/bin/activate     # Linux / macOS

# 3. Install dependencies
pip install -r requirements.txt

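# 4. Download MITRE ATT&CK dataset (one time, ~75MB)
python -c "from mitreattack.stix20 import MitreAttackData; MitreAttackData.download('enterprise-attack.json')"
# If the download helper above is not available in your installed mitreattack-python version,
# the same file can be fetched directly from the MITRE CTI repository:
# curl -L -o enterprise-attack.json https://raw.githubusercontent.com/mitre/cti/master/enterprise-attack/enterprise-attack.json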

# 5. Configure API key
cp .env.example .env
# Edit .env and add your AbuseIPDB API key

Dependencies:

requests
python-dotenv
jinja2
colorama
python-evtx
mitreattack-python

Usage

# Analyse SSH logs
python main.py --log sample_logs/auth.log --type ssh

# Analyse Apache logs
python main.py --log sample_logs/access.log --type apache

# Analyse Windows Event Logs
python main.py --log sample_logs/Security.evtx --type windows

# Analyse all sources at once
python main.py --log sample_logs/ --type all

Reports are saved to the output/ directory with a timestamp in the filename.


Configuration

Edit .env to configure API keys:

ABUSEIPDB_API_KEY=your_api_key_here

Edit detection thresholds in the relevant detector files:

| Constant | Default | File | Rationale |
|---|---|---|---|
| BRUTE_FORCE_THRESHOLD | 5 | brute_force.py | Fail2Ban industry standard |
| MULTI_TARGET_THRESHOLD | 3 | multi_target.py | 2 = shared workstation; 3 = enumeration |
| SCAN_THRESHOLD | 10 | web_attacks.py | OSSEC default |
| WORK_START | 8 | off_hours.py | Standard 8-hour workday |
| WORK_END | 18 | off_hours.py | ISO/IEC 27001 |

Design Decisions

False positive philosophy: LogSentry is designed to surface everything suspicious and let a human decide. This mirrors real SIEM behaviour. Raising thresholds until only obvious attacks trigger an alert leads to missed detections. In security, a false positive wastes an analyst's time, while a false negative lets an attacker through.

Parsing vs detection: The most important distinction in this codebase. A parser converts a raw log line into a normalised dictionary — it makes no judgments. A detector receives normalised events and asks one question: does this data match a specific attack pattern? This separation means adding a new log source requires writing one new parser with zero changes to any detector, and adding a new detection rule requires writing one new detector with zero changes to any parser.

Why correlation is the highest-value feature: A 404 scanning alert could be a broken link aggregator. A brute force alert could be a misconfigured CI/CD system. An off-hours login could be a developer in a different timezone. The same three events correlated to a single IP — in order, over 45 minutes — are unambiguous: a human attacker performed reconnaissance, validated credentials via brute force, and then used the compromised account during off-hours to avoid detection. The correlator transforms noise into a decision-ready incident report.


What This Project Demonstrates

| Skill Area | Specific Knowledge |
|---|---|
| SIEM architecture | Ingest → Normalise → Detect → Correlate → Alert pipeline |
| Log format parsing | Syslog, Combined Log Format, Windows EVTX binary |
| Regex engineering | Greedy vs lazy quantifiers, anchor strings, evasion-resistant normalisation |
| Python standard library | Counter, defaultdict, re, datetime.strptime, generator expressions |
| Threat detection logic | Threshold-based, behavioural anomaly, multi-source correlation |
| MITRE ATT&CK | STIX 2.0, technique taxonomy, kill chain phases, sub-technique URLs |
| Threat intelligence APIs | REST integration, authentication headers, confidence scores |
| Windows event logs | EVTX binary format, XML namespace handling, Event ID taxonomy |
| False positive management | Threshold tuning rationale, SOC analyst workflow, insider threat modelling |
| Software design | Single responsibility principle applied to security tooling |

Documentation

A 41-page deep technical reference document is included in the docs/ directory, covering every algorithmic decision, regex pattern, threshold rationale, XML schema, API integration, and architectural trade-off in the project.

Topics covered: SSH Parser · Apache Parser · Windows EVTX Parser · Brute Force Detector · Off-Hours Detector · Multi-Target Detector · Web Attack Detector · Correlator · Reporter · MITRE ATT&CK Integration · AbuseIPDB Integration · Jinja2 Reporting


Built by Tony Doumit — March 2026

About

Python SIEM with 4-layer detection pipeline; correlates multi-stage attacks across log sources, mapped to MITRE ATT&CK, enriched with live threat intelligence via AbuseIPDB
