
Commit 2cc7de3

readers (txt, url, wikipedia)

1 parent c5eae5f commit 2cc7de3

14 files changed

Lines changed: 2959 additions & 3 deletions

File tree

CHANGELOG.md

Lines changed: 2 additions & 3 deletions
```diff
@@ -3,9 +3,8 @@
 ## [0.8.1] - work in progress
 
 ### Added
-- cli interface.
-- repl cli entry point.
-- parsers cli entry point.
+- readers (txt, url, wikipedia).
+- cli interface with repl, parsers, readers.
 - hyperedge.Hyperedge.match function (calls parsers.match_pattern).
 
 ### Changed
```

docs/manual/readers.md

Lines changed: 140 additions & 0 deletions
# Readers

The `hyperbase.readers` module provides a way to read and parse text from various sources directly into Semantic Hypergraphs. A reader handles the extraction and segmentation of text into paragraph-sized blocks, which can then be fed to a parser. Hyperbase ships with three built-in readers -- plain text files, URLs and Wikipedia articles -- and provides a registration mechanism for custom readers.

## Reading and parsing sources

The preferred way to read and parse a source is through the `Parser` methods `read_source()` and `read_source_to_jsonl()`. These handle reader selection automatically, so you only need a parser instance:

```python
from hyperbase import get_parser

parser = get_parser("generative")

# Iterate over parse results block by block
for results in parser.read_source("article.txt"):
    for result in results:
        print(result["edge"])

# Or write everything to a JSONL file in one call
parser.read_source_to_jsonl("article.txt", "output.jsonl", progress=True)
```

Both methods accept an optional `reader` argument to force a specific reader instead of auto-detection:

```python
# Force the generic URL reader on a Wikipedia link
for results in parser.read_source(
    "https://en.wikipedia.org/wiki/Hypergraph", reader="url"
):
    ...
```

### Extracting raw text (no parsing)

To extract text blocks without parsing, use a reader directly via `get_reader()` (available from `hyperbase.readers`):

```python
from hyperbase.readers import get_reader

reader = get_reader("article.txt")

# Iterate over text blocks
for block in reader.read("article.txt"):
    print(block)

# Or write blocks to a plain text file
reader.read_to_text("article.txt", "output.txt", progress=True)
```

When a named reader is given, a source is not required to obtain the reader instance:

```python
reader = get_reader(reader="wikipedia")
```

Either `source` or a named `reader` must be provided -- calling `get_reader()` with neither raises a `ValueError`.

## CLI

The `hyperbase read` command provides a convenient way to read and parse sources from the command line:

```bash
# Parse a local file to JSONL
hyperbase read article.txt -o output.jsonl

# Extract raw text blocks (no parsing)
hyperbase read article.txt -o output.txt

# Parse a Wikipedia article
hyperbase read https://en.wikipedia.org/wiki/Hypergraph -o output.jsonl

# Specify reader and parser explicitly
hyperbase read source.txt -o output.jsonl --reader plain_text --parser alphabeta --language en
```

## Built-in readers

### `plain_text`

Reads local text files. Accepts any source that is a valid file path. The text is split into paragraph-sized blocks: if blank lines are found, they are used as paragraph separators; otherwise each line becomes its own block.
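
The splitting rule can be sketched as follows. This is a minimal illustration of the behavior described above, not the library's actual code; `split_blocks` is a hypothetical helper name:

```python
def split_blocks(text: str) -> list[str]:
    """Split text into paragraph-sized blocks.

    Blank lines act as paragraph separators when present;
    otherwise every non-empty line becomes its own block.
    """
    if '\n\n' in text:
        parts = text.split('\n\n')
    else:
        parts = text.splitlines()
    return [p.strip() for p in parts if p.strip()]

print(split_blocks("First paragraph.\n\nSecond paragraph."))
# → ['First paragraph.', 'Second paragraph.']
```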

### `url`

Reads web pages via HTTP/HTTPS. Uses [trafilatura](https://trafilatura.readthedocs.io/) to extract the main text content from the HTML, stripping navigation, ads and other boilerplate.

### `wikipedia`

Reads Wikipedia articles directly from the MediaWiki API. Accepts any URL matching `*.wikipedia.org/wiki/*`. It parses the wikicode markup to extract clean text, organized by section. Boilerplate sections (e.g. "References", "See also", "External links") are automatically discarded based on the article language. The Wikipedia reader declares `url` as `more_general`, so it takes priority over the URL reader when the source is a Wikipedia link.
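
The section filtering can be pictured like this. A hypothetical sketch only: the actual per-language section titles live inside the library, and `keep_section` is an invented helper name:

```python
# Hypothetical boilerplate-section titles, keyed by article language.
BOILERPLATE = {
    'en': {'references', 'see also', 'external links',
           'further reading', 'notes', 'bibliography'},
}

def keep_section(title: str, lang: str = 'en') -> bool:
    """Return True if a section title is not boilerplate for this language."""
    return title.strip().lower() not in BOILERPLATE.get(lang, set())

sections = ['Introduction', 'Properties', 'See also', 'References']
print([s for s in sections if keep_section(s)])
# → ['Introduction', 'Properties']
```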

## Auto-detection

When a reader is not explicitly specified, all registered readers are checked and those whose `accepts()` method returns `True` for the given source are collected. If more than one reader matches, the `more_general` mechanism is used to pick the most specific one. For example, a Wikipedia URL is accepted by both the `url` and `wikipedia` readers, but because `WikipediaReader` declares `more_general = ['url']`, the Wikipedia reader is selected.
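
The selection step can be sketched as follows. The names `accepts` and `more_general` follow the text above, but the registry and the selection logic here are assumptions for illustration, not the library's implementation:

```python
class UrlReader:
    more_general = []

    @staticmethod
    def accepts(source: str) -> bool:
        return source.startswith(('http://', 'https://'))


class WikipediaReader:
    more_general = ['url']  # more specific than the generic URL reader

    @staticmethod
    def accepts(source: str) -> bool:
        return 'wikipedia.org/wiki/' in source


# Hypothetical registry of named readers.
REGISTRY = {'url': UrlReader, 'wikipedia': WikipediaReader}


def detect_reader(source: str):
    # Collect every reader whose accepts() matches the source.
    matches = {name: cls for name, cls in REGISTRY.items()
               if cls.accepts(source)}
    # Drop any match that another match declares as more general.
    general = {g for cls in matches.values() for g in cls.more_general}
    specific = [cls for name, cls in matches.items() if name not in general]
    if len(specific) != 1:
        raise ValueError(f"no unique reader for source: {source!r}")
    return specific[0]
```

With this sketch, `detect_reader("https://en.wikipedia.org/wiki/Hypergraph")` yields `WikipediaReader`, while a plain `https://` URL falls back to `UrlReader`.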

## Custom readers

You can create and register your own readers. A custom reader must subclass `Reader` and implement two methods:

- `accepts(source)` -- a static method that returns `True` if the reader can handle the given source string.
- `read(source)` -- a generator that yields text blocks from the source.

Optionally, you can implement `block_count(source)` to return the total number of blocks (enabling progress bars), and set the `more_general` class attribute to declare that this reader is more specific than others.

Here is an example:

```python
from hyperbase.readers import Reader, register_reader

class RSSReader(Reader):
    more_general = ['url']  # take priority over the generic URL reader

    @staticmethod
    def accepts(source: str) -> bool:
        return source.endswith('.rss') or source.endswith('/feed')

    def read(self, source: str):
        import feedparser
        feed = feedparser.parse(source)
        for entry in feed.entries:
            # yield the text content of each entry as a block
            yield entry.get('summary', '')

register_reader('rss', RSSReader)
```

After registration, the new reader is automatically considered during auto-detection. It can also be requested by name:

```python
parser.read_source_to_jsonl("https://example.com/feed", "feed.jsonl", reader="rss")
```

## Listing registered readers

To see all currently registered readers:

```python
from hyperbase.readers import list_readers

for name, cls in list_readers().items():
    print(f"{name}: {cls.__name__}")
```

mkdocs.yml

Lines changed: 1 addition & 0 deletions
```diff
@@ -67,6 +67,7 @@ nav:
     - Hyperedge Operations: manual/hyperedge-operations.md
     - Patterns: manual/patterns.md
     - Parsing: manual/parsing.md
+    - Readers: manual/readers.md
     - API Reference: manual/api.md
     - Authors: authors.md
```

pyproject.toml

Lines changed: 2 additions & 0 deletions
```diff
@@ -30,6 +30,8 @@ dependencies = [
     "tqdm>=4.65.0",
     "prompt-toolkit>=3.0.0",
     "rich>=13.0.0",
+    "mwparserfromhell>=0.7.2",
+    "trafilatura>=2.0.0",
 ]
 
 [project.scripts]
```

src/hyperbase/cli/__init__.py

Lines changed: 58 additions & 0 deletions
```diff
@@ -15,6 +15,59 @@ def main():
         help="List installed parser plugins",
     )
 
+    # --- read subcommand ---------------------------------------------------
+    read_parser = subparsers.add_parser(
+        "read",
+        help="Read a source, parse it, and write JSONL output",
+    )
+    read_parser.add_argument(
+        "source",
+        type=str,
+        help="Source to read (file path or URL)",
+    )
+    read_parser.add_argument(
+        "-o", "--output",
+        type=str,
+        required=True,
+        help="Output file path (.jsonl for parsed output, .txt for raw text)",
+    )
+    read_parser.add_argument(
+        "--parser",
+        type=str,
+        default="generative",
+        help="Parser plugin name (default: generative)",
+    )
+    read_parser.add_argument(
+        "--reader",
+        type=str,
+        default="auto",
+        help="Reader name or 'auto' (default: auto)",
+    )
+    read_parser.add_argument(
+        "--model_path",
+        type=str,
+        default=None,
+        help="Path to trained model (generative parser)",
+    )
+    read_parser.add_argument(
+        "--language",
+        type=str,
+        default=None,
+        help="Language for alphabeta parser",
+    )
+    read_parser.add_argument(
+        "--device",
+        type=str,
+        default=None,
+        help="Device to use (cuda/cpu/mps)",
+    )
+    read_parser.add_argument(
+        "--batch_size",
+        type=int,
+        default=8,
+        help="Batch size for parsing (default: 8)",
+    )
+
     # --- repl subcommand ---------------------------------------------------
     repl_parser = subparsers.add_parser(
         "repl",
@@ -94,6 +147,11 @@ def main():
         run_parsers()
         sys.exit(0)
 
+    if args.command == "read":
+        from hyperbase.cli.read import run_read
+        run_read(args)
+        sys.exit(0)
+
     if args.command == "repl":
         from hyperbase.cli.repl import run_repl
         run_repl(args)
```

src/hyperbase/cli/read.py

Lines changed: 70 additions & 0 deletions
```python
import argparse
import json
import os
import sys

from hyperbase.parsers import get_parser
from hyperbase.readers import get_reader


def run_read(args: argparse.Namespace):
    ext = os.path.splitext(args.output)[1].lower()

    if ext == '.txt':
        try:
            reader = get_reader(args.source, reader=args.reader)
        except ValueError as e:
            print(f"Error: {e}", file=sys.stderr)
            sys.exit(1)
        print(f"Reader: {type(reader).__name__}", file=sys.stderr)
        reader.read_to_text(args.source, args.output, progress=True)
        print(f"\nOutput: {args.output}", file=sys.stderr)
        return

    if ext != '.jsonl':
        print(f"Error: unsupported output extension {ext!r}"
              " (use .jsonl or .txt)", file=sys.stderr)
        sys.exit(1)

    # Build parser kwargs
    kwargs = {}
    if args.parser == 'generative':
        if args.model_path:
            kwargs['model_path'] = args.model_path
        if args.device:
            kwargs['device'] = args.device
    elif args.parser == 'alphabeta':
        if args.language:
            kwargs['lang'] = args.language

    try:
        parser = get_parser(args.parser, **kwargs)
    except ValueError as e:
        print(f"Error: {e}", file=sys.stderr)
        sys.exit(1)

    print(f"Parser: {args.parser}", file=sys.stderr)

    sentences = 0
    edges = 0
    errors = 0

    with open(args.output, 'w') as f:
        for results in parser.read_source(
            args.source, reader=args.reader,
            batch_size=args.batch_size, progress=True,
        ):
            for result in results:
                sentences += 1
                if result.get('failed'):
                    errors += 1
                elif result.get('edge') is not None:
                    edges += 1
                    result['edge'] = str(result['edge'])
                f.write(json.dumps(result, ensure_ascii=False,
                                   default=str) + '\n')

    print(f"\nSentences: {sentences}", file=sys.stderr)
    print(f"Edges: {edges}", file=sys.stderr)
    print(f"Errors: {errors}", file=sys.stderr)
    print(f"Output: {args.output}", file=sys.stderr)
```

src/hyperbase/data/__init__.py

Whitespace-only changes.
