24 commits
5504541
feat(compiler): _read_entity_briefs for entity plan context
KylinMountain May 30, 2026
7181d57
test(compiler): parity tests for _read_entity_briefs
KylinMountain May 30, 2026
efacb6f
feat(compiler): _write_entity with type/aliases frontmatter
KylinMountain May 30, 2026
71a4a14
test(compiler): assert source ordering in _write_entity; count=1 in _…
KylinMountain May 30, 2026
97f1c51
feat(lint): include entities/ in wikilink whitelist
KylinMountain May 30, 2026
3c8aa93
feat(compiler): summary<->entity backlinks
KylinMountain May 30, 2026
ff1345e
test(compiler): restore assertion erroneously deleted in 3c8aa93
KylinMountain May 30, 2026
385defd
feat(compiler): index.md Entities section
KylinMountain May 30, 2026
41cda0f
feat(compiler): remove_doc_from_entity_pages + index cleanup
KylinMountain May 30, 2026
04d2bc9
feat(compiler): plan prompt + parser for entities group
KylinMountain May 30, 2026
ad45439
fix(compiler): related entities must not downgrade index labels
KylinMountain May 30, 2026
5008a14
feat(schema): declare entities/ page type and taxonomy
KylinMountain May 30, 2026
1e82214
feat(query): point who/what questions at entities/
KylinMountain May 30, 2026
3242844
docs(readme): document entities/ page type
KylinMountain May 30, 2026
ff3fafb
feat(cli): scaffold entities/ in init and count it in status
KylinMountain May 30, 2026
a7a06ed
fix(compiler): resolve entity-page review findings (dangling links + …
claude May 30, 2026
b882ee9
fix(compiler): add [[entities/X]] whitelist rule + restore concept-to…
KylinMountain May 31, 2026
d1dc637
feat(entities): shared page-dir constants + surface entities in list/…
KylinMountain May 31, 2026
bd81f7e
feat(entities): remove preview lists entity-page actions (#1)
KylinMountain May 31, 2026
3d7c842
docs(entities): document entity pages in shipped openkb skill (#8)
KylinMountain May 31, 2026
022aad4
fix(compiler): don't write raw JSON body on empty LLM content
KylinMountain May 31, 2026
1e2d5e0
fix(compiler): graceful scalar plan + rebuild malformed entity frontm…
KylinMountain May 31, 2026
b245128
fix(compiler): keep ## Entities before ## Explorations; drop dead par…
KylinMountain May 31, 2026
2f09fad
test(compiler): cover empty-content skip, scalar plan, malformed enti…
KylinMountain May 31, 2026
6 changes: 4 additions & 2 deletions README.md
@@ -109,6 +109,7 @@ wiki/ │ ← the foundation
├── sources/ Full-text conversions
├── summaries/ Per-document summaries
├── concepts/ Cross-document synthesis ← the good stuff
├── entities/ Specific named things (people, orgs, places, products)
├── explorations/ Saved query results
└── reports/ Lint reports
@@ -136,9 +137,10 @@ Short docs are read in full by the LLM. Long PDFs are indexed by PageIndex into
When you add a document, the LLM:

1. Generates a **summary** page
2. Reads existing **concept** pages
2. Reads existing **concept** and **entity** pages
3. Creates or updates concepts with cross-document synthesis
4. Updates the **index** and **log**
4. Creates or updates **entity** pages (people, orgs, places, products)
5. Updates the **index** and **log**

A single source might touch 10-15 wiki pages. Knowledge accumulates: each document enriches the existing wiki rather than sitting in isolation.

705 changes: 618 additions & 87 deletions openkb/agent/compiler.py

Large diffs are not rendered by default.

12 changes: 8 additions & 4 deletions openkb/agent/linter.py
@@ -24,12 +24,16 @@
4. **Redundancy** — Are there multiple pages that cover the same content and
could be merged?
5. **Concept coverage** — Are important themes in the summaries missing concept pages?
6. **Entity coverage** — Are important named things (people, organizations, places,
products, works, events) in the summaries missing entity pages, or are existing
entity pages contradictory, redundant, or orphaned (unlinked from any source)?

## Process
1. Start with index.md to understand scope.
2. Read summary pages to understand document content.
3. Read concept pages to check for contradictions and gaps.
4. Produce a structured Markdown report listing issues found with references
4. Read entity pages to check for contradictions, redundancy, coverage, and orphans.
5. Produce a structured Markdown report listing issues found with references
to the specific pages where each issue occurs.

Be thorough but concise. If the wiki is small or sparse, say so.
@@ -99,9 +103,9 @@ async def run_knowledge_lint(kb_dir: Path, model: str) -> str:

prompt = (
"Please audit this knowledge base wiki for semantic quality issues: "
"contradictions, gaps, staleness, redundancy, and missing concept pages. "
"Start with index.md, then read summaries and concepts as needed. "
"Produce a structured Markdown report."
"contradictions, gaps, staleness, redundancy, and missing concept and "
"entity pages. Start with index.md, then read summaries, concepts, and "
"entities as needed. Produce a structured Markdown report."
)

result = await Runner.run(agent, prompt, max_turns=MAX_TURNS)
8 changes: 5 additions & 3 deletions openkb/agent/query.py
@@ -28,15 +28,17 @@
Summaries may omit details — if you need more, follow the summary's
`full_text` frontmatter field to the source (see step 4).
3. Read concept pages (concepts/) for cross-document synthesis.
4. When you need detailed source document content, each summary page has a
4. For "who/what is X" questions about a specific named person, organization,
place, or product, read the matching page in entities/ first.
5. When you need detailed source document content, each summary page has a
`full_text` frontmatter field with the path to the original document content:
- Short documents (doc_type: short): read_file with that path.
- PageIndex documents (doc_type: pageindex): use get_page_content(doc_name, pages)
with tight page ranges. The summary shows document tree structure with page
ranges to help you target. Never fetch the whole document.
5. Source content may reference images (e.g. ![image](sources/images/doc/file.png)).
6. Source content may reference images (e.g. ![image](sources/images/doc/file.png)).
Use the get_image tool to view them when needed.
6. Synthesize a clear, concise, well-cited answer grounded in wiki content.
7. Synthesize a clear, concise, well-cited answer grounded in wiki content.

Answer based only on wiki content. Be concise.
Before each tool call, output one short sentence explaining the reason.
103 changes: 83 additions & 20 deletions openkb/cli.py
@@ -43,7 +43,7 @@ def filter(self, record: logging.LogRecord) -> bool:
from openkb.config import DEFAULT_CONFIG, load_config, save_config, load_global_config, register_kb
from openkb.converter import convert_document
from openkb.log import append_log
from openkb.schema import AGENTS_MD
from openkb.schema import AGENTS_MD, INDEX_SEED, PAGE_CONTENT_DIRS

# Suppress warnings after all imports — markitdown overrides filters at import time
import warnings
@@ -217,7 +217,7 @@ def _preflight_skill_new(kb_dir: Path, name: str) -> str | None:
Checks (in order):
* skill name is a valid kebab-case slug
* ``<kb>/wiki`` exists
* ``<kb>/wiki/concepts`` or ``<kb>/wiki/summaries`` has at least
* any of ``<kb>/wiki/{summaries,concepts,entities}`` has at least
one file (i.e. some document has been ingested + compiled)

Returns ``None`` if all gates pass, else a single-line error message
@@ -239,7 +239,7 @@ def _preflight_skill_new(kb_dir: Path, name: str) -> str | None:

has_content = any(
(wiki / sub).is_dir() and any((wiki / sub).iterdir())
for sub in ("concepts", "summaries")
for sub in PAGE_CONTENT_DIRS
)
if not has_content:
return (
@@ -538,13 +538,11 @@ def init(model, language):
Path("wiki/sources/images").mkdir(parents=True, exist_ok=True)
Path("wiki/summaries").mkdir(parents=True, exist_ok=True)
Path("wiki/concepts").mkdir(parents=True, exist_ok=True)
Path("wiki/entities").mkdir(parents=True, exist_ok=True)

# Write wiki files
Path("wiki/AGENTS.md").write_text(AGENTS_MD, encoding="utf-8")
Path("wiki/index.md").write_text(
"# Knowledge Base Index\n\n## Documents\n\n## Concepts\n\n## Explorations\n",
encoding="utf-8",
)
Path("wiki/index.md").write_text(INDEX_SEED, encoding="utf-8")
Path("wiki/log.md").write_text("# Operations Log\n\n", encoding="utf-8")

# Create .openkb/ state directory
@@ -800,6 +798,7 @@ def remove(ctx, identifier, keep_raw, keep_empty_concepts, dry_run, yes):
"""
from openkb.agent.compiler import (
remove_doc_from_concept_pages,
remove_doc_from_entity_pages,
remove_doc_from_index,
)
from openkb.lint import fix_broken_links
@@ -895,6 +894,42 @@ def remove(ctx, identifier, keep_raw, keep_empty_concepts, dry_run, yes):
for slug in concept_edits:
actions.append(("MODIFY", f"wiki/concepts/{slug}.md (drop this doc from sources)"))

# Scan entity pages with the same frontmatter logic as concepts. The
# executor calls ``remove_doc_from_entity_pages``; this only makes the
# preview/summary truthful about what it will delete vs. edit.
affected_entities: list[tuple[str, int]] = [] # (slug, remaining_sources)
entities_dir = wiki_dir / "entities"
if entities_dir.is_dir():
for path in sorted(entities_dir.glob("*.md")):
text = path.read_text(encoding="utf-8")
if not text.startswith("---"):
continue
fm_end = text.find("---", 3)
if fm_end == -1:
continue
sources_count = 0
source_in_frontmatter = False
for line in text[:fm_end].split("\n"):
if line.lstrip().startswith("sources:"):
lb = line.find("[")
rb = line.rfind("]")
if lb != -1 and rb != -1 and rb > lb:
items = [s.strip() for s in line[lb + 1:rb].split(",") if s.strip()]
sources_count = len(items)
source_in_frontmatter = source_file_marker in items
break
if not source_in_frontmatter:
continue
remaining = max(sources_count - 1, 0)
affected_entities.append((path.stem, remaining))

entity_deletes = [s for s, r in affected_entities if r == 0 and not keep_empty_concepts]
entity_edits = [s for s, r in affected_entities if r > 0 or keep_empty_concepts]
for slug in entity_deletes:
actions.append(("DELETE", f"wiki/entities/{slug}.md (only source: this doc)"))
for slug in entity_edits:
actions.append(("MODIFY", f"wiki/entities/{slug}.md (drop this doc from sources)"))

if (wiki_dir / "index.md").exists():
actions.append(("MODIFY", "wiki/index.md (remove Documents entry)"))
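
The preview scan in the hunk above can be read as a small pure function. The following is a minimal sketch under stated assumptions, not the PR's code: `count_sources` and `classify` are hypothetical names, the marker string format is an assumption (the diff does not show how `source_file_marker` is built), and only the single-line `sources: [a, b]` form that the scan looks for is handled.

```python
from typing import Dict, List, Optional, Tuple


def count_sources(text: str, marker: str) -> Optional[Tuple[int, bool]]:
    """Return (source_count, marker_present), or None if there is no frontmatter.

    Hypothetical helper: mirrors the inline-list scan shown in the diff.
    """
    if not text.startswith("---"):
        return None
    fm_end = text.find("---", 3)
    if fm_end == -1:
        return None
    for line in text[:fm_end].split("\n"):
        if line.lstrip().startswith("sources:"):
            lb, rb = line.find("["), line.rfind("]")
            if lb != -1 and rb > lb:
                items = [s.strip() for s in line[lb + 1:rb].split(",") if s.strip()]
                return len(items), marker in items
            break  # a sources: line without an inline list counts as empty
    return 0, False


def classify(
    pages: Dict[str, str], marker: str, keep_empty: bool = False
) -> Tuple[List[str], List[str]]:
    """Split affected page slugs into (deletes, edits) for the preview."""
    deletes: List[str] = []
    edits: List[str] = []
    for slug in sorted(pages):
        parsed = count_sources(pages[slug], marker)
        if parsed is None or not parsed[1]:
            continue  # page does not cite the document being removed
        remaining = max(parsed[0] - 1, 0)
        if remaining == 0 and not keep_empty:
            deletes.append(slug)
        else:
            edits.append(slug)
    return deletes, edits


# Illustrative data only; the marker format is assumed.
pages = {
    "ada": "---\nsources: [summaries/a.md]\n---\nbody",
    "acme": "---\nsources: [summaries/a.md, summaries/b.md]\n---\n",
    "zed": "---\nsources: [summaries/b.md]\n---\n",
    "raw": "no frontmatter",
}
print(classify(pages, "summaries/a.md"))
```

Keeping the classification separate from file I/O makes the DELETE versus MODIFY decision easy to test, and it shows why one `keep_empty` flag moves every sole-source page from the delete list to the edit list.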

@@ -936,6 +971,12 @@ def remove(ctx, identifier, keep_raw, keep_empty_concepts, dry_run, yes):
f" {len(concept_deletes)} concept(s) will be DELETED because this is their only source."
)
click.echo(" Pass --keep-empty-concepts to retain them instead.")
if entity_deletes:
click.echo("")
click.echo(
f" {len(entity_deletes)} entity(s) will be DELETED because this is their only source."
)
click.echo(" Pass --keep-empty-concepts to retain them instead.")
click.echo("")

if dry_run:
@@ -967,22 +1008,31 @@ def remove(ctx, identifier, keep_raw, keep_empty_concepts, dry_run, yes):
wiki_dir, doc_name, keep_empty=keep_empty_concepts,
)

remove_doc_from_index(wiki_dir, doc_name, concept_result["deleted"])
entity_result = remove_doc_from_entity_pages(
wiki_dir, doc_name, keep_empty=keep_empty_concepts,
)

remove_doc_from_index(wiki_dir, doc_name, concept_result["deleted"],
entity_slugs_deleted=entity_result["deleted"])

# Strip dangling wikilinks now so a retry (after a PageIndex
# failure below) finds a clean wiki — no point in re-running this
# on every attempt.
#
# Scope: only the pages this remove actually touched (modified
# concept pages ∪ index.md). Previously this swept the whole wiki
# via ``fix_broken_links(wiki_dir)``, which silently stripped
# concept + entity pages ∪ index.md). Previously this swept the whole
# wiki via ``fix_broken_links(wiki_dir)``, which silently stripped
# pre-existing dangling links in unrelated pages — see issue #58
# (Bug 2). Users who want a wiki-wide sweep can still run
# ``openkb lint --fix`` explicitly.
lint_scope: list[Path] = [
wiki_dir / "concepts" / f"{slug}.md"
for slug in concept_result["modified"]
]
lint_scope += [
wiki_dir / "entities" / f"{slug}.md"
for slug in entity_result["modified"]
]
index_md = wiki_dir / "index.md"
if index_md.exists():
lint_scope.append(index_md)
@@ -1277,6 +1327,15 @@ def print_list(kb_dir: Path) -> None:
for c in concepts:
click.echo(f" - {c}")

# Display entities
entities_dir = kb_dir / "wiki" / "entities"
if entities_dir.exists():
entities = sorted(p.stem for p in entities_dir.glob("*.md"))
if entities:
click.echo(f"\nEntities ({len(entities)}):")
for e in entities:
click.echo(f" - {e}")

# Display reports
reports_dir = kb_dir / "wiki" / "reports"
if reports_dir.exists():
@@ -1301,7 +1360,7 @@ def list_cmd(ctx):
def print_status(kb_dir: Path) -> None:
"""Print knowledge base status. Usable from CLI and chat REPL."""
wiki_dir = kb_dir / "wiki"
subdirs = ["sources", "summaries", "concepts", "reports"]
subdirs = ["sources", "summaries", "concepts", "entities", "reports"]

# Print the active KB path as the first line. Agents and scripts
# parse this to locate the wiki without assuming cwd == KB root.
@@ -1332,15 +1391,19 @@ def print_status(kb_dir: Path) -> None:
hashes = json.loads(hashes_file.read_text(encoding="utf-8"))
click.echo(f"\n Total indexed: {len(hashes)} document(s)")

# Last compile time: newest file in wiki/summaries/
summaries_dir = wiki_dir / "summaries"
if summaries_dir.exists():
summaries = list(summaries_dir.glob("*.md"))
if summaries:
newest_summary = max(summaries, key=lambda p: p.stat().st_mtime)
import datetime
mtime = datetime.datetime.fromtimestamp(newest_summary.stat().st_mtime)
click.echo(f" Last compile: {mtime.strftime('%Y-%m-%d %H:%M:%S')}")
# Last compile time: newest compiled page across summaries/, concepts/,
# and entities/ (an entity-only compile must still bump the shown time).
compiled_pages = [
p
for sub in PAGE_CONTENT_DIRS
for p in (wiki_dir / sub).glob("*.md")
if (wiki_dir / sub).exists()
]
if compiled_pages:
newest_page = max(compiled_pages, key=lambda p: p.stat().st_mtime)
import datetime
mtime = datetime.datetime.fromtimestamp(newest_page.stat().st_mtime)
click.echo(f" Last compile: {mtime.strftime('%Y-%m-%d %H:%M:%S')}")

# Last lint time: newest file in wiki/reports/
reports_dir = wiki_dir / "reports"
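
The "last compile" computation in this hunk picks the newest page across all compiled page directories. The following is a minimal sketch of that idea, not the PR's code: `last_compile_time` is a hypothetical name, the tuple below mirrors `PAGE_CONTENT_DIRS` from the diff, and the directory check is placed before the glob here as a stylistic assumption.

```python
import datetime
from pathlib import Path
from typing import Optional

# Mirrors the constant declared in openkb/schema.py in this diff.
PAGE_CONTENT_DIRS = ("summaries", "concepts", "entities")


def last_compile_time(wiki_dir: Path) -> Optional[datetime.datetime]:
    """Return the mtime of the newest compiled page, or None if there are none."""
    pages = [
        p
        for sub in PAGE_CONTENT_DIRS
        if (wiki_dir / sub).is_dir()  # skip page directories that do not exist
        for p in (wiki_dir / sub).glob("*.md")
    ]
    if not pages:
        return None
    newest = max(pages, key=lambda p: p.stat().st_mtime)
    return datetime.datetime.fromtimestamp(newest.stat().st_mtime)
```

Because `entities/` is included in the scan, a compile that only touches entity pages still moves the reported time forward.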
11 changes: 8 additions & 3 deletions openkb/lint.py
@@ -15,6 +15,8 @@

import yaml

from openkb.schema import PAGE_CONTENT_DIRS

# Matches [[wikilink]] or [[subdir/link]]
_WIKILINK_RE = re.compile(r"\[\[([^\]]+)\]\]")

@@ -171,6 +173,9 @@ def list_existing_wiki_targets(wiki_dir: Path) -> set[str]:
targets.update(f"concepts/{p.stem}" for p in concepts_dir.glob("*.md"))
if summaries_dir.is_dir():
targets.update(f"summaries/{p.stem}" for p in summaries_dir.glob("*.md"))
entities_dir = wiki_dir / "entities"
if entities_dir.is_dir():
targets.update(f"entities/{p.stem}" for p in entities_dir.glob("*.md"))
if (wiki_dir / "index.md").exists():
targets.add("index")
return targets
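
To illustrate what adding `entities/` to the whitelist changes, here is a minimal sketch of a target set and a dangling-link check. It is not the project's implementation: `existing_targets` and `dangling_links` are hypothetical names, only the three page directories and `index` are covered (the real function may whitelist more), and aliased links such as `[[page|label]]` are deliberately not handled.

```python
import re
from pathlib import Path
from typing import List, Set

# Mirrors the constant and regex that appear in this diff.
PAGE_CONTENT_DIRS = ("summaries", "concepts", "entities")
_WIKILINK_RE = re.compile(r"\[\[([^\]]+)\]\]")


def existing_targets(wiki_dir: Path) -> Set[str]:
    """Collect 'subdir/stem' link targets for every compiled page on disk."""
    targets: Set[str] = set()
    for sub in PAGE_CONTENT_DIRS:
        page_dir = wiki_dir / sub
        if page_dir.is_dir():
            targets.update(f"{sub}/{p.stem}" for p in page_dir.glob("*.md"))
    if (wiki_dir / "index.md").exists():
        targets.add("index")
    return targets


def dangling_links(text: str, targets: Set[str]) -> List[str]:
    """Return wikilink targets in text that are not in the whitelist."""
    return [m for m in _WIKILINK_RE.findall(text) if m.strip() not in targets]
```

Without the `entities/` entries in the target set, every `[[entities/X]]` link would be reported as dangling even when the page exists.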
@@ -365,7 +370,7 @@ def check_index_sync(wiki: Path) -> list[str]:

Returns issues for:
- Links in index.md pointing to non-existent pages
- Pages in summaries/ or concepts/ not mentioned in index.md
- Pages in summaries/, concepts/, or entities/ not mentioned in index.md

Args:
wiki: Path to the wiki root directory.
@@ -389,11 +394,11 @@ def check_index_sync(wiki: Path) -> list[str]:
if lnk_norm not in pages:
issues.append(f"index.md links to missing page: [[{lnk}]]")

# Check that summaries and concepts pages are mentioned in index
# Check that summaries, concepts, and entities pages are mentioned in index
index_stems = {Path(lnk.strip()).stem for lnk in index_links}
index_text_lower = index_text.lower()

for subdir in ("summaries", "concepts"):
for subdir in PAGE_CONTENT_DIRS:
subdir_path = wiki / subdir
if not subdir_path.exists():
continue
13 changes: 12 additions & 1 deletion openkb/schema.py
@@ -2,6 +2,14 @@

from pathlib import Path

# The compiled page-type subdirectories under wiki/. Shared source of truth
# for surfaces that enumerate page content (list, lint, status, skill gate).
PAGE_CONTENT_DIRS = ("summaries", "concepts", "entities")

# Canonical empty index.md seed. Used by `openkb init` and the compiler's
# lazy-create path so they never drift.
INDEX_SEED = "# Knowledge Base Index\n\n## Documents\n\n## Concepts\n\n## Entities\n\n## Explorations\n"

AGENTS_MD = """\
# Wiki Schema

@@ -10,6 +18,7 @@
- sources/images/ — Extracted images from documents, referenced by sources.
- summaries/ — One per source document. Summary of key content.
- concepts/ — Cross-document topic synthesis. Created when a theme spans multiple documents.
- entities/ — Specific named things: people, organizations, places, products, named works, events. One page per entity, accumulated across documents.
- explorations/ — Saved query results, analyses, and comparisons worth keeping.
- reports/ — Lint health check reports. Auto-generated.

@@ -20,13 +29,15 @@
## Page Types
- **Summary Page** (summaries/): Key content of a single source document.
- **Concept Page** (concepts/): Cross-document topic synthesis with [[wikilinks]].
- **Entity Page** (entities/): A specific named thing (proper noun). Frontmatter `type:` is one of: person, organization, place, product, work, event, other. An entity differs from a concept: a concept is an abstract recurring idea; an entity is a specific named thing. Create an entity page only when the entity is central to a document or recurs across sources — do not page passing mentions.
- **Exploration Page** (explorations/): Saved query results — analyses, comparisons, syntheses.
- **Index Page** (index.md): One-liner summary of every page in the wiki. Auto-maintained.

## Index Page Format
index.md lists all documents, concepts, and explorations with metadata:
index.md lists all documents, concepts, entities, and explorations with metadata:
- Documents: name, one-liner description, type (short|pageindex), detail access path
- Concepts: name, one-liner description
- Entities: name, type, one-liner description
- Explorations: name, one-liner description

## Log Format
13 changes: 10 additions & 3 deletions skills/openkb/SKILL.md
@@ -14,12 +14,17 @@ description: |

The user has compiled their documents into a Markdown wiki at `wiki/`.

The wiki holds three kinds of pages:
The wiki holds these kinds of pages:

- **Concept pages** at `wiki/concepts/*.md` — cross-document synthesis
on specific topics. This is where OpenKB's value compounds: a
concept with multiple sources represents knowledge merged across
documents the user has ingested.
- **Entity pages** at `wiki/entities/*.md` — one per specific named
thing (people, organizations, places, products, named works,
events), accumulated across documents. Each has a `type:`
frontmatter field. For "who is X" / "what is X" questions about a
named thing, read the matching `entities/` page first.
- **Summary pages** at `wiki/summaries/*.md` — one per ingested
document, linking to the concepts that document touches.
- **Source files** at `wiki/sources/*.{md,json}` — full text for short
@@ -76,8 +81,9 @@ After capturing the KB path from `openkb status`, drill in via:

- `openkb list` — table of ingested documents (name, type, page count)
plus the concept list.
- Read `<kb>/wiki/index.md` — the compiled table of contents. Every
document and concept has a one-line `brief`. Scan this and pick the
- Read `<kb>/wiki/index.md` — the compiled table of contents. It has
`## Documents`, `## Concepts`, `## Entities`, and `## Explorations`
sections; every entry has a one-line `brief`. Scan this and pick the
slugs that semantically match the user's question.

## Read content
@@ -90,6 +96,7 @@ calls these `Read` / `Grep` / `Bash`; Gemini CLI uses `read_file` /
| Goal | Action |
|---|---|
| Read a concept page | read the file at `<kb>/wiki/concepts/<slug>.md` |
| Answer "who/what is X" about a named thing | read `<kb>/wiki/entities/<slug>.md` |
| Read a document's summary | read `<kb>/wiki/summaries/<doc>.md` |
| Read a short doc's full text | read `<kb>/wiki/sources/<doc>.md` |
| Read a long doc's specific page | shell: `jq '.[N-1]' <kb>/wiki/sources/<doc>.json` (N = 1-indexed PDF page; `.[0]` is page 1) |