
Commit 64f49dd

Initial commit: llmwiki knowledge compiler

55 files changed

Lines changed: 7728 additions & 0 deletions


.gitignore

Lines changed: 10 additions & 0 deletions
```
.env
node_modules/
dist/
.llmwiki/
.llm-wiki/
/wiki/
/sources/
*.tmp
.DS_Store
localdocs/
```

AGENTS.md

Lines changed: 38 additions & 0 deletions
# llmwiki

A knowledge compiler CLI. Raw sources in, interlinked wiki out.

## Development Guidelines

### Code Style & Standards

- Files must be smaller than 400 lines, excluding comments. Once 400 is exceeded, initiate a refactor.
- Functions must be smaller than 40 lines, excluding comments and the catch/finally blocks of try/catch sections. If a function exceeds that, refactor it.

### Clean Code Rules

- Meaningful Names: Name variables and functions to reveal their purpose, not just their value.
- One Function, One Responsibility: Functions should do one thing.
- Avoid Magic Numbers: Replace hard-coded values with named constants to give them meaning.
- Use Descriptive Booleans: Boolean names should state a condition, not just its value.
- Keep Code DRY: Duplicate code means duplicate bugs. Reuse logic where it makes sense.
- Avoid Deep Nesting: Flatten your code flow to improve clarity and reduce cognitive load.
- Comment Why, Not What: Explain the intention behind your code, not the obvious mechanics.
- Limit Function Arguments: Too many parameters confuse readers. Group related data into objects.
- Code Should Be Self-Explanatory: Well-written code needs fewer comments because it reads like a story.

### Comments and Documentation

- Include a substantial JSDoc comment at the top of each file. For Python files, use Google-style docstrings.
- Write clear comments for complex logic.
- Document public APIs and functions.
- Use JSDoc comments for functions.
- Keep comments up to date with code changes.
- Document any non-obvious behavior.

## General Rules

- First, think through the problem and read the codebase for relevant files.
- Make every task and code change as simple as possible. Avoid massive or complex changes; every change should impact as little code as possible. Everything is about simplicity.
- Never speculate about code you have not opened. If the user references a specific file, you MUST read the file before answering. Investigate and read relevant files BEFORE answering questions about the codebase. Never make claims about code before investigating unless you are certain of the correct answer; give grounded, hallucination-free answers.

CLAUDE.md

Lines changed: 38 additions & 0 deletions
# llmwiki

A knowledge compiler CLI. Raw sources in, interlinked wiki out.

## Development Guidelines

### Code Style & Standards

- Files must be smaller than 400 lines, excluding comments. Once 400 is exceeded, initiate a refactor.
- Functions must be smaller than 40 lines, excluding comments and the catch/finally blocks of try/catch sections. If a function exceeds that, refactor it.

### Clean Code Rules

- Meaningful Names: Name variables and functions to reveal their purpose, not just their value.
- One Function, One Responsibility: Functions should do one thing.
- Avoid Magic Numbers: Replace hard-coded values with named constants to give them meaning.
- Use Descriptive Booleans: Boolean names should state a condition, not just its value.
- Keep Code DRY: Duplicate code means duplicate bugs. Reuse logic where it makes sense.
- Avoid Deep Nesting: Flatten your code flow to improve clarity and reduce cognitive load.
- Comment Why, Not What: Explain the intention behind your code, not the obvious mechanics.
- Limit Function Arguments: Too many parameters confuse readers. Group related data into objects.
- Code Should Be Self-Explanatory: Well-written code needs fewer comments because it reads like a story.

### Comments and Documentation

- Include a substantial JSDoc comment at the top of each file. For Python files, use Google-style docstrings.
- Write clear comments for complex logic.
- Document public APIs and functions.
- Use JSDoc comments for functions.
- Keep comments up to date with code changes.
- Document any non-obvious behavior.

## General Rules

- First, think through the problem and read the codebase for relevant files.
- Make every task and code change as simple as possible. Avoid massive or complex changes; every change should impact as little code as possible. Everything is about simplicity.
- Never speculate about code you have not opened. If the user references a specific file, you MUST read the file before answering. Investigate and read relevant files BEFORE answering questions about the codebase. Never make claims about code before investigating unless you are certain of the correct answer; give grounded, hallucination-free answers.

LICENSE

Lines changed: 21 additions & 0 deletions
MIT License

Copyright (c) 2026 atomicmemory

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

README.md

Lines changed: 150 additions & 0 deletions
# llmwiki

Compile raw sources into an interlinked markdown wiki.

Inspired by Karpathy's [LLM Wiki](https://gist.github.com/karpathy/442a6bf555914893e9891c11519de94f) pattern: instead of re-discovering knowledge at query time, compile it once into a persistent, browsable artifact that compounds over time.

## Who this is for

- **AI researchers and engineers** building persistent knowledge from papers, docs, and notes
- **Technical writers** compiling scattered sources into a structured, interlinked reference
- **Anyone with too many bookmarks** who wants a wiki instead of a graveyard of tabs

## Quick start

```bash
npm install -g llm-wiki-compiler
export ANTHROPIC_API_KEY=sk-...

llmwiki ingest https://some-article.com
llmwiki compile
llmwiki query "what is X?"
```

## Why not just RAG?

RAG retrieves chunks at query time. Every question re-discovers the same relationships from scratch. Nothing accumulates.

llmwiki **compiles** your sources into a wiki. Concepts get their own pages. Pages link to each other. When you ask a question with `--save`, the answer becomes a new page, and future queries use it as context. Your explorations compound.

This is complementary to RAG, not a replacement. RAG is great for ad-hoc retrieval over large corpora. llmwiki gives you a persistent, structured artifact to retrieve from.

```
RAG:     query → search chunks → answer → forget
llmwiki: sources → compile → wiki → query → save → richer wiki → better answers
```
## How it works

```
sources/ → SHA-256 hash check → LLM concept extraction → wiki page generation → [[wikilink]] resolution → index.md
```

**Two-phase pipeline.** Phase 1 extracts all concepts from all sources. Phase 2 generates pages. This eliminates order-dependence, catches failures before writing anything, and merges concepts shared across multiple sources into single pages.

**Incremental.** Only changed sources go through the LLM. Everything else is skipped via hash-based change detection.
**Compounding queries.** `llmwiki query --save` writes the answer as a wiki page and immediately rebuilds the index. Saved answers show up in future queries as context.
### What it produces

A raw source like a Wikipedia article on knowledge compilation becomes a structured wiki page:

```yaml
---
title: Knowledge Compilation
summary: Techniques for converting knowledge representations into forms that support efficient reasoning.
sources:
  - knowledge-compilation.md
createdAt: "2026-04-05T12:00:00Z"
updatedAt: "2026-04-05T12:00:00Z"
---
```

```markdown
Knowledge compilation refers to a family of techniques for pre-processing
a knowledge base into a target language that supports efficient queries.

Related concepts: [[Propositional Logic]], [[Model Counting]]
```

Pages include source attribution in frontmatter. Provenance is page-level today, not claim-level.

## Commands

| Command | What it does |
|---------|-------------|
| `llmwiki ingest <url\|file>` | Fetch a URL or copy a local file into `sources/` |
| `llmwiki compile` | Incremental compile: extract concepts, generate wiki pages |
| `llmwiki query "question"` | Ask questions against your compiled wiki |
| `llmwiki query "question" --save` | Answer and save the result as a wiki page |
| `llmwiki watch` | Auto-recompile when `sources/` changes |

## Output

```
wiki/
  concepts/   one .md file per concept, with YAML frontmatter
  queries/    saved query answers, included in index and retrieval
  index.md    auto-generated table of contents
```

Obsidian-compatible. `[[wikilinks]]` resolve to concept titles.
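Building `index.md` from page frontmatter can be sketched as below. The parsing is deliberately naive (a regex rather than a YAML parser) and the exact index layout is an assumption:

```javascript
// Read each page's frontmatter and emit a table of contents.
// Assumes `title:` and optional `summary:` keys, as in the example above.
function buildIndex(pages) {
  const lines = ["# Index", ""];
  for (const markdown of pages) {
    const fm = markdown.match(/^---\n([\s\S]*?)\n---/);
    if (!fm) continue; // no frontmatter, skip the page
    const title = (fm[1].match(/^title:\s*(.+)$/m) || [])[1];
    const summary = (fm[1].match(/^summary:\s*(.+)$/m) || [])[1];
    if (title) lines.push(`- [[${title}]]${summary ? ": " + summary : ""}`);
  }
  return lines.join("\n") + "\n";
}
```

Because the index entries are `[[wikilinks]]`, the generated table of contents is itself navigable in Obsidian.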
## Demo

Try it on any article or document:

```bash
mkdir my-wiki && cd my-wiki
llmwiki ingest https://en.wikipedia.org/wiki/Knowledge_compilation
llmwiki compile
llmwiki query "how does knowledge compilation work?"
```

See `examples/basic/` in the repo for pre-generated output you can browse without an API key.

## Limitations

Early software. Best for small, high-signal corpora (a few dozen sources). Query routing is index-based. Provenance is page-level, not claim-level. Anthropic-only for now.

**Honest about truncation.** Sources that exceed the character limit are truncated on ingest with `truncated: true` and the original character count recorded in frontmatter, so downstream consumers know they're working with partial content.
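A sketch of that truncation behavior, assuming a hypothetical limit constant and frontmatter key names (`truncated`, `originalChars` are illustrative; only `truncated: true` is named above):

```javascript
const MAX_CHARS = 100000; // assumed limit, not llmwiki's actual constant

// Truncate oversized content on ingest, but record that it happened.
function ingestContent(title, content) {
  const truncated = content.length > MAX_CHARS;
  const body = truncated ? content.slice(0, MAX_CHARS) : content;
  const frontmatter = [
    "---",
    `title: ${title}`,
    ...(truncated ? ["truncated: true", `originalChars: ${content.length}`] : []),
    "---",
  ].join("\n");
  return `${frontmatter}\n\n${body}`;
}
```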
## Karpathy's LLM Wiki pattern vs this compiler

Karpathy describes an abstract pattern for turning raw data into compiled knowledge. Here's how llmwiki maps to it:

| Karpathy's concept | llmwiki | Status |
|---|---|---|
| Data ingest | `llmwiki ingest` | Implemented |
| Compile wiki | `llmwiki compile` | Implemented |
| Q&A | `llmwiki query` | Implemented |
| Output filing (save answers back) | `llmwiki query --save` | Implemented |
| Auto-recompile | `llmwiki watch` | Implemented |
| Linting / health-check pass | | Not yet implemented (`watch` is auto-recompile, not lint) |
| Image support | | Not yet implemented |
| Marp slides | | Not yet implemented |
| Fine-tuning | | Not yet implemented |

## Roadmap

- Multi-provider support (OpenAI, local models)
- Better provenance (claim-level source attribution)
- Larger-corpus query strategy (semantic search, embeddings)
- Deeper Obsidian integration
- Linting pass for wiki quality checks

If you want to contribute, these are the highest-leverage areas right now. Issues and PRs are welcome.

## Requirements

Node.js >= 18 and an Anthropic API key.

## License

MIT

## Disclaimer

No LLMs were harmed in the making of this repo.

examples/basic/README.md

Lines changed: 34 additions & 0 deletions
# Basic Example

Real output from running llmwiki on a single source document about knowledge compilation.

## What's here

```
sources/
  knowledge-compilation.md     ← the input (one markdown file)

wiki/
  concepts/                    ← 7 concept pages extracted by the LLM
    change-detection.md
    compilation-pipeline.md
    concept-extraction.md
    cross-source-concepts.md
    incremental-compilation.md
    knowledge-compilation.md
    wikilinks.md
  index.md                     ← auto-generated table of contents
```

One source in, seven interlinked concept pages out. Browse the `wiki/` directory to see the compiled output, or open it in Obsidian for navigable `[[wikilinks]]`.

## Reproduce it yourself

```bash
# run from the repo root
llmwiki ingest examples/basic/sources/knowledge-compilation.md
llmwiki compile
llmwiki query "what is knowledge compilation?"
```

Output will vary since it's LLM-generated, but the structure will match.

examples/basic/sources/knowledge-compilation.md

Lines changed: 44 additions & 0 deletions

---
title: knowledge compilation
source: examples/basic/sources/knowledge-compilation.md
ingestedAt: "2026-04-06T07:53:26.851Z"
---

# Knowledge Compilation

The idea of "knowledge compilation" is that LLMs can take messy, unstructured information and compile it into clean, structured, interlinked reference material. Think of it like a compiler for knowledge: raw sources in, organized wiki out.

## Why This Matters

Most knowledge lives in scattered documents, articles, notes, and conversations. Finding what you need means searching through all of it. A knowledge compiler processes these sources and produces a wiki where every concept has its own page, linked to related concepts.

## How It Works

The compilation pipeline has several stages:

1. **Ingestion**: Raw sources (URLs, files, documents) are collected into a sources directory.
2. **Change Detection**: SHA-256 hashes identify which sources have changed since the last compile.
3. **Concept Extraction**: An LLM reads each changed source and extracts the key concepts.
4. **Page Generation**: For each concept, the LLM generates a wiki page with proper structure.
5. **Interlink Resolution**: Concept mentions across pages are wrapped in [[wikilinks]].
6. **Index Generation**: A table of contents is built from all concept pages.
## Incremental Compilation

Like a code compiler, only changed sources need reprocessing. This saves both time and API costs. The system tracks source hashes in a state file and skips unchanged sources entirely.

## Cross-Source Concepts

When multiple sources discuss the same concept, the compiler detects this overlap through semantic dependency tracking. Changes to one source trigger recompilation of shared concepts using content from all contributing sources.

## Obsidian Compatibility

The output format uses YAML frontmatter and [[wikilinks]], making it directly compatible with Obsidian and similar tools. Each concept page includes metadata like title, summary, source attribution, and timestamps.
examples/basic/wiki/concepts/change-detection.md

Lines changed: 41 additions & 0 deletions

---
title: Change Detection
summary: Using SHA-256 hashes to identify which sources have been modified since the last compilation run
sources:
  - knowledge-compilation.md
createdAt: "2026-04-06T07:53:56.456Z"
updatedAt: "2026-04-06T07:53:56.456Z"
---

# Change Detection

Change Detection is a crucial component of the [[knowledge compilation]] pipeline that determines which sources need reprocessing during [[Incremental Compilation]]. It uses cryptographic hashing to efficiently identify modified content and avoid unnecessary work.

## Purpose

Change Detection serves as an optimization mechanism for [[knowledge compilation]] systems. Rather than reprocessing all sources on every compilation run, it identifies only the sources that have actually changed since the last compilation. This approach saves both processing time and API costs when working with large knowledge bases.

## How It Works

The system uses **SHA-256 hashes** to create unique fingerprints for each source document. These hashes are stored in a state file that tracks the compilation history. During each compilation run:

1. The system calculates SHA-256 hashes for all current sources
2. These new hashes are compared against the stored hashes from the previous compilation
3. Sources with different hashes are marked as changed and queued for reprocessing
4. Unchanged sources are skipped entirely
## Integration with [[Compilation Pipeline]]

Change Detection operates as the second stage of the [[knowledge compilation]] pipeline, immediately after **Ingestion**. It acts as a filter that determines which sources proceed to the **Concept Extraction** stage, ensuring that only modified content triggers expensive LLM processing.

## State Management

The system maintains compilation state through a dedicated state file that preserves hash information between runs. This persistent tracking enables true [[Incremental Compilation]], where the system can resume work efficiently after interruptions or when processing sources that are updated at different intervals.

## Cross-Source Dependencies

When sources share concepts, Change Detection must account for semantic dependencies. If one source changes and affects a concept that appears in multiple sources, the system may need to recompile pages for that concept using content from all contributing sources, not just the changed one.

## Sources

- knowledge-compilation.md
