# Readers

The `hyperbase.readers` module provides a way to read and parse text from various sources directly into Semantic Hypergraphs. A reader handles the extraction and segmentation of text into paragraph-sized blocks, which can then be fed to a parser. Hyperbase ships with three built-in readers -- plain text files, URLs and Wikipedia articles -- and provides a registration mechanism for custom readers.

## Reading and parsing sources

The preferred way to read and parse a source is through the `Parser` methods `read_source()` and `read_source_to_jsonl()`. These handle reader selection automatically, so you only need a parser instance:

```python
from hyperbase import get_parser

parser = get_parser("generative")

# Iterate over parse results block by block
for results in parser.read_source("article.txt"):
    for result in results:
        print(result["edge"])

# Or write everything to a JSONL file in one call
parser.read_source_to_jsonl("article.txt", "output.jsonl", progress=True)
```

Both methods accept an optional `reader` argument to force a specific reader instead of auto-detection:

```python
# Force the generic URL reader on a Wikipedia link
for results in parser.read_source(
    "https://en.wikipedia.org/wiki/Hypergraph", reader="url"
):
    ...
```

### Extracting raw text (no parsing)

To extract text blocks without parsing, use a reader directly via `get_reader()` (available from `hyperbase.readers`):

```python
from hyperbase.readers import get_reader

reader = get_reader("article.txt")

# Iterate over text blocks
for block in reader.read("article.txt"):
    print(block)

# Or write blocks to a plain text file
reader.read_to_text("article.txt", "output.txt", progress=True)
```

When a reader is requested by name, no source is needed to obtain the instance:

```python
reader = get_reader(reader="wikipedia")
```

Either `source` or a named `reader` must be provided -- calling `get_reader()` with neither raises a `ValueError`.

## CLI

The `hyperbase read` command provides a convenient way to read and parse sources from the command line:

```bash
# Parse a local file to JSONL
hyperbase read article.txt -o output.jsonl

# Extract raw text blocks (no parsing)
hyperbase read article.txt -o output.txt

# Parse a Wikipedia article
hyperbase read https://en.wikipedia.org/wiki/Hypergraph -o output.jsonl

# Specify reader and parser explicitly
hyperbase read source.txt -o output.jsonl --reader plain_text --parser alphabeta --language en
```

## Built-in readers

### `plain_text`

Reads local text files. Accepts any source that is a valid file path. The text is split into paragraph-sized blocks: if blank lines are found, they are used as paragraph separators; otherwise each line becomes its own block.
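
The splitting behaviour can be illustrated with a small standalone sketch; `split_blocks` is a hypothetical name, not part of the hyperbase API:

```python
import re

def split_blocks(text: str) -> list[str]:
    if re.search(r"\n\s*\n", text):
        # Blank lines present: treat them as paragraph separators.
        blocks = re.split(r"\n\s*\n", text)
    else:
        # No blank lines: each line becomes its own block.
        blocks = text.splitlines()
    # Drop surrounding whitespace and empty blocks.
    return [b.strip() for b in blocks if b.strip()]

split_blocks("line one\nline two")                       # one block per line
split_blocks("para one,\nstill para one.\n\npara two.")  # one block per paragraph
```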

### `url`

Reads web pages via HTTP/HTTPS. Uses [trafilatura](https://trafilatura.readthedocs.io/) to extract the main text content from the HTML, stripping navigation, ads and other boilerplate.

### `wikipedia`

Reads Wikipedia articles directly from the MediaWiki API. Accepts any URL matching `*.wikipedia.org/wiki/*`. It parses the wikicode markup to extract clean text, organized by section. Boilerplate sections (e.g. "References", "See also", "External links") are automatically discarded based on the article language. The Wikipedia reader declares `url` as `more_general`, so it takes priority over the URL reader when the source is a Wikipedia link.
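
The section filtering can be sketched like this; the title set is an illustrative English subset, since the real list is language-dependent:

```python
# Illustrative subset of boilerplate section titles for English articles.
BOILERPLATE_EN = {"references", "see also", "external links", "further reading", "notes"}

def keep_section(title: str) -> bool:
    # Discard a section when its title matches a known boilerplate heading.
    return title.strip().lower() not in BOILERPLATE_EN
```

Sections for which `keep_section()` returns `False` would simply be skipped before any text blocks are emitted.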

## Auto-detection

When a reader is not explicitly specified, all registered readers are checked and those whose `accepts()` method returns `True` for the given source are collected. If more than one reader matches, the `more_general` mechanism is used to pick the most specific one. For example, a Wikipedia URL is accepted by both the `url` and `wikipedia` readers, but because `WikipediaReader` declares `more_general = ['url']`, the Wikipedia reader is selected.
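
The selection logic can be sketched as follows; the toy reader classes and the `select_reader` helper are illustrative, not the actual implementation:

```python
class UrlReader:
    more_general: list = []

    @staticmethod
    def accepts(source: str) -> bool:
        return source.startswith("http://") or source.startswith("https://")

class WikipediaReader:
    more_general = ["url"]  # beats the generic URL reader on Wikipedia links

    @staticmethod
    def accepts(source: str) -> bool:
        return "wikipedia.org/wiki/" in source

READERS = {"url": UrlReader, "wikipedia": WikipediaReader}

def select_reader(source: str) -> str:
    # Collect every reader that accepts the source.
    matches = {name: cls for name, cls in READERS.items() if cls.accepts(source)}
    # Drop any match that another match declares to be more general.
    shadowed = {g for cls in matches.values() for g in cls.more_general}
    specific = [name for name in matches if name not in shadowed]
    if len(specific) != 1:
        raise ValueError(f"cannot auto-detect a reader for {source!r}")
    return specific[0]
```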

## Custom readers

You can create and register your own readers. A custom reader must subclass `Reader` and implement two methods:

- `accepts(source)` -- a static method that returns `True` if the reader can handle the given source string.
- `read(source)` -- a generator that yields text blocks from the source.

Optionally, you can implement `block_count(source)` to return the total number of blocks (enabling progress bars), and set the `more_general` class attribute to declare that this reader is more specific than others.

Here is an example:

```python
from hyperbase.readers import Reader, register_reader

class RSSReader(Reader):
    more_general = ['url']  # take priority over the generic URL reader

    @staticmethod
    def accepts(source: str) -> bool:
        return source.endswith('.rss') or source.endswith('/feed')

    def read(self, source: str):
        import feedparser
        feed = feedparser.parse(source)
        for entry in feed.entries:
            # yield the text content of each entry as a block,
            # skipping entries without a summary
            text = entry.get('summary', '')
            if text:
                yield text

register_reader('rss', RSSReader)
```

After registration, the new reader is automatically considered during auto-detection. It can also be requested by name:

```python
parser.read_source_to_jsonl("https://example.com/feed", "feed.jsonl", reader="rss")
```

## Listing registered readers

To see all currently registered readers:

```python
from hyperbase.readers import list_readers

for name, cls in list_readers().items():
    print(f"{name}: {cls.__name__}")
```