Skip to content

romular21/agent-session-extractor

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

2 Commits
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

agent-session-extractor

Claude Code / Codex / OpenClaw session logs → clean, dated Markdown.

A single Python file, zero dependencies, no LLM, no API key, no network. It reads the JSONL transcripts your coding agent already writes to disk and turns them into one tidy YYYY-MM-DD.md per day — so the clumsy raw logs your agent leaves behind become months of history you can read, grep, diff, and mine for knowledge.

python3 scripts/extract_session.py --all --source codex --since 2025-09-01 --until 2025-09-30

Zero install. It's one stdlib-only file (Python 3.9+). curl it or clone it and run — no pip install, no requirements.txt, nothing to vendor.


Why?

Your agent's session logs are a goldmine — every decision, dead end, and fix you've worked through — but they're locked in large, machine-shaped JSONL files you'll never read by hand. And they don't last: Claude Code prunes old sessions, so the deep history is quietly disappearing. (Codex keeps a long archive; OpenClaw keeps rotation checkpoints.)

This tool is the free, deterministic, dependency-free first stage: it pulls the user/assistant text turns out of the raw logs and writes clean, human-readable per-day Markdown. What you do next — read it, search it, or distill it into a knowledge base — is up to you (see What to do with the output).

It's also language-agnostic: it extracts the text as-is, so sessions in any language (English, Russian, anything) come through unchanged.

Features

  • Three sources, one tool — Claude Code, Codex, and OpenClaw JSONL, with --source auto format sniffing for single files.
  • Zero dependencies — pure Python standard library, one file, Python 3.9+.
  • Deterministic & free — no LLM, no API calls, no cost; the same logs always produce the same Markdown.
  • Any language — pure text extraction, no language assumptions.
  • Date-bucketed — one YYYY-MM-DD.md per local day, in your timezone.
  • Repo-aware for Codex--project filters the global Codex store to one repo; --list prints a coverage manifest so you never silently skip a day.
  • Incremental — content-hash state skips unchanged files on re-run; safe to schedule.
  • Private by default — the default scratch dir is 0700, files 0600, and refuses to follow a symlinked target. Session logs contain your prompts (and sometimes secrets).
  • Robust — one corrupt / non-UTF-8 file degrades gracefully instead of aborting the run.

Quick Start

# Look at one session (auto-detects the format), printed to stdout:
python3 scripts/extract_session.py path/to/one-session.jsonl

# Extract a whole source for a date range, one Markdown file per day:
python3 scripts/extract_session.py --all --source codex --since 2025-09-01 --until 2025-09-30 --out-dir ~/codex-extract

# See what's there before extracting (Codex is global across repos):
python3 scripts/extract_session.py --all --source codex --list

By default --all writes to a private scratch dir under your system temp; pass --out-dir to choose where.

Usage

Claude Code sessions are stored per-project, in a directory derived from the path you run Claude from: ~/.claude/projects/<encoded-cwd>/, where the cwd is encoded by replacing / with - (e.g. /home/you/myproj-home-you-myproj). So for Claude, either run the extractor from your project directory (with an absolute path to the script), or point it at that sessions dir explicitly with --sessions-dir / CLAUDE_SESSIONS_DIR:

# Claude — run from YOUR project dir, using an absolute path to the script:
python3 /path/to/agent-session-extractor/scripts/extract_session.py \
    --all --source claude --out-dir ~/claude-extract

# Claude — or point at the encoded sessions dir explicitly (works from anywhere):
python3 scripts/extract_session.py --all --source claude \
    --sessions-dir "$HOME/.claude/projects/-home-you-myproj" --out-dir ~/claude-extract

# ...including subagents / workflows: add --recursive
# Keep only one repo out of the global Codex store:
python3 scripts/extract_session.py --all --source codex --project myrepo --out-dir ~/codex-extract

# OpenClaw sessions:
python3 scripts/extract_session.py --all --source openclaw --out-dir ~/openclaw-extract

Not sure where your logs live? Codex is global + date-nested under ~/.codex/sessions; OpenClaw is under ~/.openclaw/agents/main/sessions. Point anywhere with --sessions-dir or the matching *_SESSIONS_DIR env var (see Options). Writing extracts outside the repo clone (as above) also keeps your transcripts out of git.

Command-line options

Run python3 scripts/extract_session.py --help for the authoritative list. The essentials:

Option Meaning
<file> Extract a single session file to stdout (use with --source auto).
--all Batch mode: group a source's sessions by date into YYYY-MM-DD.md files.
--source {claude,codex,openclaw,auto} Which log format. auto is single-file only.
--since / --until Inclusive date range (YYYY-MM-DD).
--out-dir DIR Where to write (default: a private scratch dir under your temp).
--sessions-dir DIR Read sessions from this directory instead of the source's default (e.g. point at a specific Claude ~/.claude/projects/<encoded-cwd>).
--recursive Claude: include subagent/workflow sessions, not just top-level.
--project SLUG Codex: keep only sessions belonging to this repo (the store is global).
--list Print a coverage manifest (date · turns · project · extracted?); writes nothing.
--checkpoints OpenClaw + --all only: also recover deep history from .checkpoint.* rotation snapshots (default off; deduped, per-turn dated, non-incremental).
--timezone TZ local (default), UTC, or an offset like +08:00 / -0500.
--max-chars N Truncate very long turns to keep the output readable (min 2).
--force Ignore the incremental cache and re-extract everything.
--version Print the version.

Configuration (environment)

Variable Effect
SESSION_EXTRACT_TZ Default timezone label (overrides the local default).
SESSION_EXTRACT_DIR Override the source directory for any source.
CLAUDE_SESSIONS_DIR Override the auto-detected Claude Code project path.
CODEX_SESSIONS_DIR Override the default ~/.codex/sessions.
OPENCLAW_SESSIONS_DIR Override the default ~/.openclaw/agents/main/sessions.

Output

One Markdown file per local day. Each keeps the user/assistant turns in order and drops the machine noise — tool calls, reasoning traces, injected context, acknowledgements — so what's left reads like the conversation you actually had.

How it works / Design & trade-offs

This section is the "kitchen": the decisions behind the tool and why they're made the way they are. Most are deliberate trade-offs with an escape hatch.

  • Deterministic, no-LLM extraction. Stage one is pure parsing — free, reproducible, and auditable. No model decides what to keep; simple rules do. The value of an LLM comes later, in the optional distillation stage, where you control the cost.

  • A session is dated by its first turn. Simple and predictable, but a session you resumed across midnight (or across several days) dumps all its later turns into the start day's file. (In my own ~134 Codex rollouts, about 1 in 5 crossed a local-day boundary — your mileage will vary.) Escape hatch: --split-by-date (roadmap) buckets each turn by its own timestamp.

  • Default timezone is local, not a baked-in offset. Day boundaries follow your machine's zone, with each timestamp converted using its own (DST-correct) offset. Override: --timezone or SESSION_EXTRACT_TZ — use UTC for reproducible cross-machine output.

  • Codex's store is global across repos. ~/.codex/sessions mixes every project you've used Codex on into one date-nested tree. A naive --all would blend them. Escape hatch: --project filters to one repo (matched on the rollout header's git/cwd), and --list shows you every day and whether it's been extracted — so you never silently mix repos or skip a day. (Claude Code is already per-project on disk, so this doesn't apply there.)

  • Claude prunes; Codex and OpenClaw checkpoints are the deep archive. Claude Code deletes old sessions, so its local logs only go back weeks. If you want long history, run this on a schedule before pruning, and lean on Codex / OpenClaw for the deep past. Note that recovered old history is point-in-time — it can reference files, flags, or PR numbers that no longer match your current repo. Treat it as a dated record, not current truth.

  • Private by default. Session logs carry your prompts and can carry secrets, so the default scratch directory is created 0700, its files 0600, and the tool refuses to write through a pre-existing symlink (anti-tampering). Point --out-dir somewhere only if you've thought about who can read it.

  • Incremental by content hash. Re-running the same command skips source files whose content hash is unchanged, keyed by source + sessions-dir + recursive + timezone + project + max-chars — so scheduled mining doesn't redo work. Changing any of those keys against an existing --out-dir requires --force (or a fresh out-dir); --force rebuilds from scratch.

  • Graceful on bad input. Reads use errors="replace" for bad bytes, and the parsers skip a structurally-malformed line (invalid JSON, or a record of an unexpected shape) rather than raising — so one corrupt or unexpected file can't take down a whole --all sweep.

  • Sidecars & checkpoint deep-history recovery. OpenClaw .trajectory.* files (no conversational turns) are always skipped. .checkpoint.* rotation snapshots are skipped by default; opt in with --checkpoints (OpenClaw + --all only — it warns if used with another source) to recover deep history a rotated session lost from its live <id>.jsonl but kept in its checkpoints. How it works:

    • Union all of a parent's checkpoints (newest→oldest), de-duplicated against the live base and each other by a raw-text signature (timestamp, role, sha256(raw, pre-truncation)). Union-all — not just the newest — because "the newest snapshot is a superset" is an observation, not a guarantee: a size-capped rotation could trim the oldest turns. Dedup makes the union idempotent, so unioning all is free on correctness and a checkpoint that only duplicates the base contributes nothing (no inflation). The signature is the raw, pre-truncation text, so a turn truncated in one snapshot but not another can't slip the dedup.
    • Each recovered turn is dated by its own timestamp (snapshots span months) and sorted chronologically; recovery honors --recursive (nested checkpoints, resolved against the base in their own directory) and --project.
    • Stale-output reconciliation: if a previously-recovered day's checkpoint later disappears, a recovered-only day's file is removed, and a day that still has a live session is rewritten without the now-gone recovered section.
    • Trade-off: it does not reconstruct rewrite-forks via a branch walk — compaction severs the parent chain, so a walk risks dropping whole segments. The cost is a small, bounded number of superseded-fork turns (in the one real store I measured, ~4 out of 22,000). It's also non-incremental (re-scans every checkpoint each run, so it's for occasional deep mining), and turning the flag off later leaves already-recovered files in place.

Limits

  • Per-day attribution for multi-day sessions until --split-by-date lands (above).
  • --checkpoints is non-incremental — it re-scans all of a session's checkpoints each run and carries a small documented set of superseded-fork turns (see How it works above).
  • --checkpoints with --since/--until can't perfectly reconcile a multi-day session whose first turn falls outside the window: because the live path dates a whole session by its first turn, some in-window turns may be omitted. The tool warns when you combine the two; widen or drop the window to recover everything (the proper fix is per-turn dating, --split-by-date).
  • POSIX-oriented. Linux and WSL are first-class (and what CI exercises); macOS should work (it's POSIX) but isn't covered by CI. On Windows the private-dir permissions degrade to best-effort, and the test suite uses time.tzset() (Unix-only).
  • Reads, never fetches. It only works on logs already on your disk.

Validation

Built test-first — a 100+-test stdlib suite (no services, no network) covering every source parser, the incremental state machine, and the full checkpoint-recovery reconcile matrix. The published version was also put through an adversarial multi-agent review (seven reviewers attacking correctness, honesty, privacy, and onboarding) whose findings were fixed before release, and exercised on real session logs across two machines (x86-64 and ARM64 Linux) and multiple Python versions:

  • All three sources on real data — Claude Code, Codex, and OpenClaw, each with correct --source auto detection.
  • Deep history — Codex rollouts back to a 2025-08 legacy format (the oldest, least- exercised path) extracted real turns; preamble-only aborted sessions correctly produced no empty day files.
  • Large checkpoint stores — thousands of OpenClaw turns recovered via --checkpoints, with default-off output byte-identical to a clean run, and no inflation.
  • Project triage--list surfaced a real global-Codex store mixing unrelated projects, exactly the case --project exists for.

No crashes, mis-detection, or data loss across those runs. That said — session-log formats evolve and vary between setups, so your store may well hit an edge case we haven't seen yet. If the output looks off, please open an issue with a sanitized sample; the tool only reads files, so reproductions are usually easy.

What to do with the output (optional)

Extraction is stage one. A natural next step is distillation: read each rich day, write a short durable note, and index those notes into whatever knowledge base you run. That stage is deliberately not in this tool — it's backend-agnostic by design, and has been driven into both a vector brain (gbrain) and a local-ONNX index (memsearch) from this same Markdown. Keep the raw extracts as scratch; commit and index only the small curated notes.

Security

Session transcripts can contain prompts, file contents, and credentials. This tool never sends anything anywhere — it only reads local files and writes local Markdown. The default output directory is private (0700/0600) and symlink-guarded. If you redirect --out-dir, you own the permissions of the target.

Python & platform support

  • Python 3.9+ (standard library only — no third-party packages).
  • Linux / WSL first-class (CI-tested on 3.9–3.12); macOS expected to work (POSIX, not in CI); Windows best-effort.

Troubleshooting

--source auto rejected with --all. auto only works on a single file; in batch mode name the source explicitly (--source claude|codex|openclaw).

Empty output / "no sessions found". Check where your logs actually live and point the tool at them with the matching *_SESSIONS_DIR env var (or SESSION_EXTRACT_DIR). Codex is date-nested and global; Claude is per-project; OpenClaw is under ~/.openclaw/....

A Codex run mixes unrelated projects. That store is global — add --project <repo>, and use --list first to see what's there.

Dates look off by one. You're seeing local-timezone bucketing. Pass --timezone UTC (or a fixed offset) for stable, machine-independent dates.

Contributing

Issues and PRs welcome. Keep it single-file and dependency-free — that's the whole point. The tests are stdlib but use pytest as the runner (pip install pytest); run python3 -m pytest tests/ -q. (The test layout — scripts/ + tests/ — is load-bearing: the suite locates the script relative to itself, so keep both directories.)

License

MIT © 2026 Roman Rusavskii

Acknowledgments

Built and battle-tested mining ~10 months of real agent history across Claude Code and Codex. OpenClaw support was added in v1.0.0 and validated against real OpenClaw sessions (including its checkpoint deep-history).

About

Claude Code / Codex / OpenClaw session logs → clean, dated Markdown. Zero-dependency single-file Python (3.9+).

Topics

Resources

License

Stars

Watchers

Forks

Packages

 
 
 

Contributors

Languages