Commit 829f44f

feat(cli): support 'openkb add <URL>' with content-type sniffing
`openkb add` now accepts http(s) URLs in addition to local paths. URLs
are fetched into raw/ first, then handed to the existing add_single_file
pipeline, so PageIndex / markitdown / hash dedup / remove all work
unchanged.

Routing is decided by the HTTP Content-Type header, validated against
magic bytes (so a CDN that mislabels a PDF as 'application/octet-stream'
still routes correctly):

  PDF response  → urllib chunked download → raw/<sanitized>.pdf
                  → existing convert_document
                  → PageIndex (≥20 pages) or markitdown (short)
  HTML response → trafilatura.fetch_url + extract main content
                  → raw/<title-slug>.md
                  → existing markitdown short-doc path
  Anything else → clear error, returns None

Filename derivation: Content-Disposition header (RFC 5987 / quoted /
unquoted forms) → URL basename → URL host fallback. Sanitization
preserves identifier dots like arxiv's "2509.11420" (only strips an
extension when it matches the target extension), replaces shell-unsafe
chars (space / parens / etc.) with '-', collapses repeats, and caps the
stem at 80 chars.

Failure modes covered with stderr messages, no crashes:
- HTTPError (404, 403, etc.) → "[ERROR] HTTP <code>"
- URLError (DNS, refused) → "[ERROR] Network error"
- trafilatura returns None → "[ERROR] No main content extracted"
- <300 chars extracted → "[WARN] only N chars — page may be JS-rendered"
  (still saved so user can inspect)
- Unsupported content type → "[ERROR] Unsupported content type"

The logic lives in openkb/url_ingest.py, a peer to openkb/converter.py
and openkb/indexer.py, rather than bloating cli.py. cli.py only adds a
4-line intercept at the top of `add`.

Live-tested against four URL shapes:
- arxiv.org/abs/<id> → trafilatura → "Trading-R1-...md" (3 KB)
- arxiv.org/pdf/<id> → urllib → "2509.11420v1.pdf" (2.1 MB)
  (server's Content-Disposition gives the version suffix for free; no
  special arxiv handling needed)
- claude.com/blog/... → trafilatura → ".md" (1.5 KB)
- CDN .pdf with %20 → urllib → sanitized filename (465 KB)

31 new tests cover all routing branches, magic-byte override paths,
Content-Disposition parsing (3 forms), filename sanitization edge cases
(arxiv ID, spaces, long names, empty), and graceful failure on
HTTP/network errors. 360 total tests pass (329 prior + 31 new).
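
The filename rules described in the message can be tried in isolation. The sketch below copies the sanitization and Content-Disposition regexes from openkb/url_ingest.py in this commit into standalone functions; the public-looking names (`sanitize_filename`, `content_disposition_filename`, `MAX_STEM`) and the sample inputs are illustrative assumptions, not part of the commit.

```python
import re
from urllib.parse import unquote

MAX_STEM = 80  # mirrors _MAX_FILENAME_STEM in the commit


def sanitize_filename(name: str, ext: str) -> str:
    """Standalone copy of the commit's sanitization rules."""
    decoded = unquote(name)
    # Strip the extension only when it matches the target one, so the dot in
    # an identifier such as "2509.11420" survives.
    if ext and decoded.lower().endswith(ext.lower()):
        stem = decoded[: -len(ext)]
    else:
        stem = decoded
    stem = re.sub(r"[^a-zA-Z0-9._-]+", "-", stem)
    stem = re.sub(r"-+", "-", stem).strip("-._")
    stem = stem[:MAX_STEM].rstrip("-._")
    return f"{stem}{ext}" if stem else f"document{ext}"


def content_disposition_filename(header):
    """Try the RFC 5987, quoted, then unquoted forms, in that order."""
    if not header:
        return None
    m = re.search(r"filename\*=(?:[\w-]+'[\w-]*')?([^;]+)", header)
    if m:
        return unquote(m.group(1).strip())
    m = re.search(r'filename="([^"]+)"', header)
    if m:
        return m.group(1)
    m = re.search(r"filename=([^\s;]+)", header)
    return m.group(1) if m else None


# Hypothetical inputs chosen to hit each rule once.
arxiv_name = sanitize_filename("2509.11420", ".pdf")
spaced_name = sanitize_filename("My%20Report%20(final).pdf", ".pdf")
empty_name = sanitize_filename("???", ".md")
long_name = sanitize_filename("a" * 200, ".pdf")

rfc5987 = content_disposition_filename("attachment; filename*=UTF-8''my%20paper.pdf")
quoted = content_disposition_filename('attachment; filename="quoted name.pdf"')
unquoted = content_disposition_filename("attachment; filename=plain.pdf")

print(arxiv_name, spaced_name, empty_name, len(long_name))
print(rfc5987, quoted, unquoted)
```

Because the character class is ASCII-only, any non-ASCII character in a title or filename is also replaced with `-`; whether that is desirable for non-English page titles is a judgment call the commit does not discuss.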
1 parent 50d5b1a commit 829f44f

5 files changed

Lines changed: 631 additions & 3 deletions

README.md

Lines changed: 3 additions & 2 deletions
@@ -72,7 +72,8 @@ openkb init
 
 # 3. Add documents
 openkb add paper.pdf
-openkb add ~/papers/ # Add a whole directory
+openkb add ~/papers/ # Add a whole directory
+openkb add https://arxiv.org/pdf/2509.11420 # Or fetch from a URL
 
 # 4. Ask a question
 openkb query "What are the main findings?"
@@ -148,7 +149,7 @@ A single source might touch 10-15 wiki pages. Knowledge accumulates: each docume
 | Command | Description |
 |---|---|
 | `openkb init` | Initialize a new knowledge base (interactive) |
-| <code>openkb&nbsp;add&nbsp;&lt;file_or_dir&gt;</code> | Add documents and compile to wiki |
+| <code>openkb&nbsp;add&nbsp;&lt;file_or_dir_or_URL&gt;</code> | Add documents and compile to wiki. URL ingest auto-detects PDF (saved as `.pdf` → PageIndex / markitdown) vs HTML (trafilatura main-content extract → `.md`) |
 | <code>openkb&nbsp;remove&nbsp;&lt;doc&gt;</code> | Remove a document and clean up its wiki pages, images, registry, and PageIndex state (use `--dry-run` to preview, `--keep-raw` / `--keep-empty-concepts` to retain artifacts) |
 | <code>openkb&nbsp;query&nbsp;"question"</code> | Ask a question over the knowledge base (use `--save` to save the answer to `wiki/explorations/`) |
 | `openkb chat` | Start an interactive multi-turn chat (use `--resume`, `--list`, `--delete` to manage sessions) |

openkb/cli.py

Lines changed: 18 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -395,12 +395,29 @@ def init(language):
395395
@click.argument("path")
396396
@click.pass_context
397397
def add(ctx, path):
398-
"""Add a document or directory of documents at PATH to the knowledge base."""
398+
"""Add a document or directory of documents at PATH to the knowledge base.
399+
400+
PATH may be a local file, a local directory (which is walked
401+
recursively for supported extensions), or an http(s) URL. URLs are
402+
fetched into ``raw/`` first: PDF responses (by Content-Type and
403+
magic-byte sniff) are saved as ``.pdf``; HTML responses are run
404+
through trafilatura's main-content extractor and saved as ``.md``.
405+
"""
399406
kb_dir = _find_kb_dir(ctx.obj.get("kb_dir_override"))
400407
if kb_dir is None:
401408
click.echo("No knowledge base found. Run `openkb init` first.")
402409
return
403410

411+
# URL ingest: download into raw/ first, then fall through to the
412+
# normal file-path branch below. fetch_url_to_raw returns the
413+
# local path it wrote, or None on failure.
414+
from openkb.url_ingest import looks_like_url, fetch_url_to_raw
415+
if looks_like_url(path):
416+
fetched = fetch_url_to_raw(path, kb_dir)
417+
if fetched is None:
418+
return
419+
path = str(fetched)
420+
404421
target = Path(path)
405422
if not target.exists():
406423
click.echo(f"Path does not exist: {path}")
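
The intercept in this hunk amounts to "swap a URL for the local path it was saved to, or stop". A minimal sketch of that control flow, pulled out into a function so it can be run without a knowledge base, is shown here. `resolve_add_target` and the two stand-in fetchers are hypothetical names introduced for illustration; in the commit the logic is inline in `add` and the fetcher is `fetch_url_to_raw`.

```python
from pathlib import Path
from typing import Callable, Optional


def looks_like_url(s: str) -> bool:
    # Same prefix check as looks_like_url in openkb/url_ingest.py.
    return s.startswith(("http://", "https://"))


def resolve_add_target(
    path: str,
    kb_dir: Path,
    fetch: Callable[[str, Path], Optional[Path]],
) -> Optional[str]:
    """Hypothetical helper: the path `add` should ingest, or None to abort."""
    if looks_like_url(path):
        fetched = fetch(path, kb_dir)
        if fetched is None:
            return None  # the fetcher has already reported the error
        return str(fetched)
    return path  # local paths pass through untouched


# Stand-in fetchers for illustration only; neither touches the network.
def fake_fetch_ok(url: str, kb_dir: Path) -> Optional[Path]:
    return kb_dir / "raw" / "2509.11420v1.pdf"


def fake_fetch_fail(url: str, kb_dir: Path) -> Optional[Path]:
    return None


kb = Path("my-kb")
local_result = resolve_add_target("paper.pdf", kb, fake_fetch_ok)
url_result = resolve_add_target("https://arxiv.org/pdf/2509.11420", kb, fake_fetch_ok)
failed_result = resolve_add_target("https://example.com/missing", kb, fake_fetch_fail)
uppercase_is_url = looks_like_url("HTTPS://example.com/a.pdf")
```

One consequence of the plain `startswith` check is that it is case-sensitive: an argument beginning with `HTTPS://` is not treated as a URL and falls through to the local-path branch.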

openkb/url_ingest.py

Lines changed: 248 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,248 @@
1+
"""Fetch remote URLs into the knowledge base's ``raw/`` directory.
2+
3+
This module is the input-acquisition layer for ``openkb add <URL>``: it
4+
takes an http(s) URL, decides whether it points at a PDF or an HTML
5+
document (using both the HTTP ``Content-Type`` header and a magic-byte
6+
sniff so a mistyped header doesn't fool us), saves a local file under
7+
``raw/``, and hands the path back to the caller.
8+
9+
The caller (``openkb.cli.add``) then dispatches the saved path to the
10+
normal local-file ingest pipeline (``add_single_file`` →
11+
``convert_document`` → markitdown / PageIndex), so all the existing
12+
short-vs-long-doc routing applies automatically based on the file
13+
extension and page count.
14+
15+
PDF responses are streamed to disk in chunks (large papers can be tens
16+
of MB). HTML responses are passed through trafilatura's main-content
17+
extractor — saving the raw HTML directly would feed nav/footer/cookie
18+
chrome into the LLM and produce noisy summaries.
19+
"""
20+
from __future__ import annotations
21+
22+
import re
23+
import urllib.error
24+
import urllib.request
25+
from pathlib import Path
26+
from urllib.parse import unquote, urlparse
27+
28+
import click
29+
30+
_USER_AGENT = "openkb/url-fetcher (+https://github.com/VectifyAI/OpenKB)"
31+
_TIMEOUT_SECONDS = 30
32+
_CHUNK_BYTES = 64 * 1024
33+
_SNIFF_BYTES = 512
34+
_HTML_MIN_EXTRACT_CHARS = 300
35+
_MAX_FILENAME_STEM = 80
36+
37+
38+
def looks_like_url(s: str) -> bool:
39+
"""Cheap check used by ``openkb add`` to branch into URL ingest."""
40+
return s.startswith(("http://", "https://"))
41+
42+
43+
def _sniff_content_type(head: bytes, declared: str) -> str:
44+
"""Return ``'pdf'``, ``'html'``, or ``'unknown'`` based on magic bytes,
45+
falling back to the server's declared Content-Type.
46+
47+
Magic bytes win when they disagree with the header (some servers
48+
mislabel PDFs as ``application/octet-stream`` or send HTML
49+
interstitial pages with ``application/pdf``).
50+
"""
51+
if head.startswith(b"%PDF-"):
52+
return "pdf"
53+
stripped = head.lstrip(b" \t\r\n\xef\xbb\xbf") # strip BOM + leading whitespace
54+
if stripped[:1] == b"<":
55+
return "html"
56+
declared_main = declared.split(";")[0].strip().lower()
57+
if declared_main == "application/pdf":
58+
return "pdf"
59+
if declared_main.startswith("text/html") or declared_main == "application/xhtml+xml":
60+
return "html"
61+
return "unknown"
62+
63+
64+
def _sanitize_filename(name: str, ext: str) -> str:
65+
"""Make a filename safe for shell + filesystem use.
66+
67+
- URL-decodes percent escapes.
68+
- Strips the existing extension **only when it matches the target
69+
``ext``** — so ``"2509.11420"`` keeps its dot when we're saving a
70+
``.pdf`` (the dot is part of the arxiv identifier, not an
71+
extension).
72+
- Replaces whitespace / parentheses / other non-``[a-zA-Z0-9._-]``
73+
chars with ``-``, collapses repeated ``-``, and trims.
74+
- Caps the stem at 80 chars to avoid filesystem limits.
75+
- Returns ``<stem><ext>``, falling back to ``document<ext>`` if the
76+
sanitized stem is empty.
77+
"""
78+
decoded = unquote(name)
79+
if ext and decoded.lower().endswith(ext.lower()):
80+
stem = decoded[: -len(ext)]
81+
else:
82+
stem = decoded
83+
stem = re.sub(r"[^a-zA-Z0-9._-]+", "-", stem)
84+
stem = re.sub(r"-+", "-", stem).strip("-._")
85+
stem = stem[:_MAX_FILENAME_STEM].rstrip("-._")
86+
return f"{stem}{ext}" if stem else f"document{ext}"
87+
88+
89+
def _parse_content_disposition_filename(header: str | None) -> str | None:
90+
"""Extract a filename hint from a ``Content-Disposition`` header.
91+
92+
Handles three forms (in priority order):
93+
94+
1. ``filename*=UTF-8''percent-encoded`` (RFC 5987)
95+
2. ``filename="quoted with spaces.pdf"``
96+
3. ``filename=unquoted-no-spaces.pdf``
97+
"""
98+
if not header:
99+
return None
100+
# RFC 5987 extended form first
101+
m = re.search(r"filename\*=(?:[\w-]+'[\w-]*')?([^;]+)", header)
102+
if m:
103+
return unquote(m.group(1).strip())
104+
# Quoted form (may contain spaces / parens / commas)
105+
m = re.search(r'filename="([^"]+)"', header)
106+
if m:
107+
return m.group(1)
108+
# Unquoted form (stops at whitespace or semicolon)
109+
m = re.search(r"filename=([^\s;]+)", header)
110+
if m:
111+
return m.group(1)
112+
return None
113+
114+
115+
def _pdf_filename(url: str, content_disposition: str | None) -> str:
116+
"""Derive a ``.pdf`` filename for a downloaded PDF.
117+
118+
Priority: ``Content-Disposition: filename=`` header → URL basename →
119+
URL host. The result is run through :func:`_sanitize_filename`.
120+
"""
121+
cd_name = _parse_content_disposition_filename(content_disposition)
122+
if cd_name:
123+
return _sanitize_filename(cd_name, ".pdf")
124+
parsed = urlparse(url)
125+
last = (parsed.path.rsplit("/", 1)[-1] or parsed.netloc).strip()
126+
return _sanitize_filename(last or "document", ".pdf")
127+
128+
129+
def _download_pdf_chunked(response, head_bytes: bytes, target: Path) -> None:
130+
"""Write the already-read ``head_bytes`` plus the remaining streamed
131+
body to ``target``. Chunked so very large PDFs (50+ MB) don't sit in
132+
RAM.
133+
"""
134+
with open(target, "wb") as fh:
135+
if head_bytes:
136+
fh.write(head_bytes)
137+
while True:
138+
chunk = response.read(_CHUNK_BYTES)
139+
if not chunk:
140+
break
141+
fh.write(chunk)
142+
143+
144+
def _extract_html(url: str, raw_dir: Path) -> Path | None:
145+
"""Fetch the URL through trafilatura, extract the main content as
146+
Markdown, and write it to ``raw/<title-slug>.md``.
147+
148+
Returns the saved path, or None if extraction failed entirely. A
149+
short extraction (< 300 chars) is saved anyway but flagged on
150+
stderr — pages that are JS-rendered or paywalled often produce
151+
near-empty extractions.
152+
"""
153+
import trafilatura
154+
155+
raw_html = trafilatura.fetch_url(url)
156+
if not raw_html:
157+
click.echo(f" [ERROR] Could not fetch URL: {url}", err=True)
158+
return None
159+
160+
markdown = trafilatura.extract(
161+
raw_html, output_format="markdown", include_links=True,
162+
)
163+
if not markdown:
164+
click.echo(
165+
" [ERROR] No main content extracted — page may be empty, "
166+
"JS-rendered, or paywalled.",
167+
err=True,
168+
)
169+
return None
170+
171+
if len(markdown) < _HTML_MIN_EXTRACT_CHARS:
172+
click.echo(
173+
f" [WARN] Only {len(markdown)} chars extracted — page may be "
174+
f"JS-rendered or behind a login. Saving anyway; inspect the "
175+
f"resulting wiki entry and use `openkb remove` if it's empty.",
176+
err=True,
177+
)
178+
179+
metadata = trafilatura.extract_metadata(raw_html)
180+
title = (metadata.title if metadata else None) or url
181+
filename = _sanitize_filename(title, ".md")
182+
target = raw_dir / filename
183+
target.write_text(markdown, encoding="utf-8")
184+
click.echo(
185+
f" Extracted: {title!r}\n"
186+
f" Saved: raw/{filename} ({len(markdown) // 1024 or 1} KB clean markdown)"
187+
)
188+
return target
189+
190+
191+
def fetch_url_to_raw(url: str, kb_dir: Path) -> Path | None:
192+
"""Fetch ``url`` into ``<kb>/raw/`` and return the local path.
193+
194+
Routing is decided by HTTP ``Content-Type`` validated against magic
195+
bytes (in case the server lies):
196+
197+
- PDF → urllib chunked download → ``raw/<sanitized>.pdf``
198+
- HTML → trafilatura main-content extract → ``raw/<title-slug>.md``
199+
- anything else → error, returns None
200+
201+
The caller then hands the saved path to ``add_single_file``, so the
202+
existing PageIndex / markitdown routing by file extension and page
203+
count takes over from there.
204+
"""
205+
raw_dir = kb_dir / "raw"
206+
raw_dir.mkdir(parents=True, exist_ok=True)
207+
208+
click.echo(f"Downloading: {url}")
209+
210+
request = urllib.request.Request(
211+
url, headers={"User-Agent": _USER_AGENT, "Accept": "*/*"},
212+
)
213+
try:
214+
response = urllib.request.urlopen(request, timeout=_TIMEOUT_SECONDS)
215+
except urllib.error.HTTPError as exc:
216+
click.echo(f" [ERROR] HTTP {exc.code} {exc.reason}", err=True)
217+
return None
218+
except urllib.error.URLError as exc:
219+
click.echo(f" [ERROR] Network error: {exc.reason}", err=True)
220+
return None
221+
except Exception as exc:
222+
click.echo(f" [ERROR] Fetch failed: {exc}", err=True)
223+
return None
224+
225+
with response:
226+
declared = response.headers.get("Content-Type", "") or ""
227+
head_bytes = response.read(_SNIFF_BYTES)
228+
actual = _sniff_content_type(head_bytes, declared)
229+
230+
if actual == "pdf":
231+
filename = _pdf_filename(
232+
url, response.headers.get("Content-Disposition"),
233+
)
234+
target = raw_dir / filename
235+
_download_pdf_chunked(response, head_bytes, target)
236+
size_mb = target.stat().st_size / (1024 * 1024)
237+
click.echo(f" Saved: raw/{filename} ({size_mb:.1f} MB PDF)")
238+
return target
239+
240+
if actual == "html":
241+
return _extract_html(url, raw_dir)
242+
243+
click.echo(
244+
f" [ERROR] Unsupported content type {declared!r} for URL ingest. "
245+
"Download the file manually and pass its path to `openkb add` instead.",
246+
err=True,
247+
)
248+
return None
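
The core pattern in `fetch_url_to_raw` is "read a small head, classify it, then stream the head plus the rest to disk". The sketch below reproduces that pattern offline, assuming an `io.BytesIO` as a stand-in for the urllib response object. The names `sniff`, `stream_to_file`, and the sample payloads are illustrative; the classification rules are copied from `_sniff_content_type` above.

```python
import io
import tempfile
from pathlib import Path

SNIFF_BYTES = 512          # same size as _SNIFF_BYTES
CHUNK_BYTES = 64 * 1024    # same size as _CHUNK_BYTES


def sniff(head: bytes, declared: str) -> str:
    """Magic bytes first, declared Content-Type as the fallback."""
    if head.startswith(b"%PDF-"):
        return "pdf"
    if head.lstrip(b" \t\r\n\xef\xbb\xbf")[:1] == b"<":
        return "html"
    main = declared.split(";")[0].strip().lower()
    if main == "application/pdf":
        return "pdf"
    if main.startswith("text/html") or main == "application/xhtml+xml":
        return "html"
    return "unknown"


def stream_to_file(response, head: bytes, target: Path) -> int:
    """Write the already-consumed head, then the rest in fixed-size chunks."""
    written = 0
    with open(target, "wb") as fh:
        fh.write(head)
        written += len(head)
        while True:
            chunk = response.read(CHUNK_BYTES)
            if not chunk:
                break
            fh.write(chunk)
            written += len(chunk)
    return written


# A fake "PDF" body served with a misleading Content-Type.
body = b"%PDF-1.7\n" + b"x" * 200_000
response = io.BytesIO(body)  # stand-in for the object returned by urlopen
head = response.read(SNIFF_BYTES)
kind = sniff(head, "application/octet-stream")

with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "example.pdf"
    size = stream_to_file(response, head, target)
    roundtrip_ok = target.read_bytes() == body

# The header says PDF, but the body is an HTML page behind a BOM.
mislabelled_html = sniff(b"\xef\xbb\xbf<!doctype html>", "application/pdf")
plain_text = sniff(b"hello", "text/plain; charset=utf-8")
```

Reading the head consumes those bytes from the stream, which is why they have to be written back before the chunk loop. Note also that the `<` test is deliberately loose: any body whose first non-whitespace byte is `<` (an XML feed or an SVG, for instance) is classified as `html` under these rules.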

pyproject.toml

Lines changed: 1 addition & 0 deletions
@@ -28,6 +28,7 @@ keywords = ["ai", "rag", "retrieval", "knowledge-base", "llm", "pageindex", "age
 dependencies = [
     "pageindex==0.3.0.dev1",
     "markitdown[all]",
+    "trafilatura>=2.0",
     "click>=8.0",
     "watchdog>=3.0",
     "litellm",
