libfyaml Python Binding — API Reference

The libfyaml Python binding exposes the high-performance libfyaml C library directly. Parsed documents are represented as FyGeneric objects — lazy wrappers that defer conversion to Python types until you ask for them. This keeps memory low and lets you navigate large documents without materialising every node.


Table of Contents

  1. Quick Start
  2. Parsing
  3. The FyGeneric Type
  4. Serialisation
  5. Converting Python objects
  6. Path navigation
  7. Mutability
  8. FyDocumentState
  9. Memory management
  10. Error handling
  11. Comparison with PyYAML

Quick Start

import libfyaml as fy

# Parse a YAML string
doc = fy.loads("name: Alice\nage: 30")
print(doc["name"])   # FyGeneric wrapping "Alice"
print(str(doc["name"]))  # "Alice"
print(doc.to_python())   # {'name': 'Alice', 'age': 30}

# Parse a file
doc = fy.load("config.yaml")

# Serialise back to YAML
print(fy.dumps(doc))

# Parse JSON
data = fy.loads('{"x": 1}', mode='json')

Parsing

loads(s, **options) → FyGeneric

Parse a YAML or JSON string. Raises ValueError if the input contains more than one document — use loads_all for multi-document streams.

doc = fy.loads("key: value")
docs = fy.loads_all("---\na: 1\n---\nb: 2")  # list of FyGeneric

load(file, **options) → FyGeneric

Parse from a file path (string — uses mmap internally) or any file-like object with a .read() method.

doc = fy.load("data.yaml")

with open("data.yaml") as f:
    doc = fy.load(f)

loads_all(s, **options) → list[FyGeneric]

load_all(file, **options) → list[FyGeneric]

Return all documents in a multi-document stream as a list.

docs = fy.loads_all("---\n1\n---\n2\n---\n3")
# [FyGeneric(1), FyGeneric(2), FyGeneric(3)]

Parse modes

The mode parameter controls which YAML dialect is accepted:

| Mode string | Meaning |
| --- | --- |
| 'yaml', 'yaml1.2', '1.2' | YAML 1.2 (the default) |
| 'yaml1.1', '1.1' | YAML 1.1 (accepts merge keys <<, sexagesimal numbers, etc.) |
| 'yaml1.1-pyyaml', 'pyyaml' | YAML 1.1 with PyYAML-compatible quirks (used by the compat layer) |
| 'json' | Strict JSON |

# Merge keys only work in YAML 1.1
doc = fy.loads("""
defaults: &defaults
  timeout: 30

server:
  <<: *defaults
  host: localhost
""", mode='yaml1.1')

Parser options

All four parse functions accept the same keyword options:

| Option | Default | Description |
| --- | --- | --- |
| mode | 'yaml' | Dialect (see above) |
| dedup | True | Use the deduplication allocator (saves memory for documents with repeated content) |
| trim | True | Release unused allocator memory after parsing |
| mutable | False | Produce mutable FyGeneric objects (required for __setitem__ and set_at_path) |
| collect_diag | False | Attach parse diagnostics to the result instead of raising |
| create_markers | False | Record byte/line/column positions for every node |
| keep_comments | False | Preserve YAML comments in the document |
| keep_style | False | Preserve original scalar styles (literal, folded, quoted, …) |

The FyGeneric Type

FyGeneric is the type returned by all parse functions. It wraps a C fy_generic value without copying data. Conversion to Python only happens when you explicitly ask for it.

doc = fy.loads("x: 42")
type(doc)          # <class 'libfyaml._libfyaml.FyGeneric'>
doc.__class__      # <class 'dict'>  — the Python equivalent class

Type checking

Eight predicate methods, all return bool:

v = fy.loads("42")
v.is_null()       # False
v.is_bool()       # False
v.is_int()        # True
v.is_float()      # False
v.is_string()     # False
v.is_sequence()   # False
v.is_mapping()    # False
v.is_indirect()   # True if the value carries a tag or anchor

Converting to Python

doc = fy.loads("items: [1, 2, 3]")

# Recursive — the whole document becomes plain Python
doc.to_python()   # {'items': [1, 2, 3]}

# Scalar coercions
n = fy.loads("99")
int(n)    # 99
float(n)  # 99.0
bool(n)   # True
str(n)    # "99"

to_python() raises TypeError if a mapping key is unhashable (e.g. a nested mapping used as a key).

Container access

Sequences and mappings support the standard Python container protocol:

doc = fy.loads("fruits: [apple, banana, cherry]")
fruits = doc["fruits"]

len(fruits)      # 3
fruits[0]        # FyGeneric("apple")
str(fruits[0])   # "apple"
"banana" in fruits  # True (linear scan)

for item in fruits:
    print(str(item))

# Mappings
doc["fruits"]           # FyGeneric sequence
doc.keys()              # ['fruits']
doc.values()            # [FyGeneric sequence]
doc.items()             # [('fruits', FyGeneric sequence)]

Attribute access on mappings delegates to the underlying dict:

doc = fy.loads("host: localhost\nport: 8080")
str(doc.host)   # "localhost"
int(doc.port)   # 8080

Numeric operations on integer and float values work directly:

v = fy.loads("10")
v + 5    # 15
v * 2    # 20
v > 5    # True

Tags and anchors

doc = fy.loads("value: !!int '42'")
v = doc["value"]
v.has_tag()    # True
v.get_tag()    # "tag:yaml.org,2002:int"

doc2 = fy.loads("x: &myanchor hello\ny: *myanchor")
doc2["x"].has_anchor()   # True
doc2["x"].get_anchor()   # "myanchor"

Source markers

Markers record the byte offset, line, and column of each node in the original source. Enable them at parse time with create_markers=True.

doc = fy.loads("host: localhost\nport: 8080", create_markers=True)

m = doc["host"].get_marker()
# (start_byte, start_line, start_col, end_byte, end_line, end_col)
# e.g. (6, 0, 6, 15, 0, 15)

doc["host"].has_marker()   # True
doc["port"].get_marker()   # (22, 1, 6, 26, 1, 10)

Lines and columns are zero-based. get_marker() returns None when markers were not enabled.
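
As a small illustration of how the 6-tuple might be consumed, the sketch below turns a marker into a 1-based location string for error messages. `describe_marker` is a hypothetical helper, not part of the binding, and it assumes the offsets are UTF-8 byte offsets with an exclusive end, which is what the example values above suggest.

```python
def describe_marker(source, marker):
    """Format a 6-tuple marker as a 1-based 'line:col' plus the text it covers."""
    start_byte, start_line, start_col, end_byte, end_line, end_col = marker
    # Assumption: offsets count bytes, so slice the encoded source, not the str.
    snippet = source.encode("utf-8")[start_byte:end_byte].decode("utf-8")
    return f"{start_line + 1}:{start_col + 1} {snippet!r}"


source = "host: localhost\nport: 8080"
location = describe_marker(source, (6, 0, 6, 15, 0, 15))
print(location)   # 1:7 'localhost'
```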

Comments

Preserve YAML comments by parsing with keep_comments=True.

yaml_text = """\
# Server settings
host: localhost  # primary
port: 8080
"""
doc = fy.loads(yaml_text, keep_comments=True)
doc["host"].get_comment()   # "# primary"
doc["host"].has_comment()   # True

Diagnostics

With collect_diag=True parse errors are attached to the document rather than raised immediately. This lets you process partially-valid input.

doc = fy.loads("good: ok\nbad: {unclosed", collect_diag=True)
doc.has_diag()   # True
doc.get_diag()   # FyGeneric describing the error(s)

Serialisation

dumps(obj, *, compact=False, json=False, style=None, indent=0) → str

Serialise a FyGeneric or plain Python object to a YAML (or JSON) string.

doc = fy.loads("name: Alice\nscores: [10, 20, 30]")
print(fy.dumps(doc))
# name: Alice
# scores:
#   - 10
#   - 20
#   - 30

print(fy.dumps(doc, compact=True))
# {name: Alice, scores: [10, 20, 30]}

print(fy.dumps(doc, json=True))
# {"name": "Alice", "scores": [10, 20, 30]}

indent sets the indentation width (2–8 spaces; 0 uses the library default).

dump(file, obj, *, mode='yaml', compact=False)

Write to a file path (string) or file-like object. mode accepts 'yaml' or 'json'.

fy.dump("output.yaml", doc)

with open("output.json", "w") as f:
    fy.dump(f, doc, mode='json')

dumps_all(documents, *, compact=False, json=False, style=None) → str

dump_all(file, documents, *, compact=False, json=False)

Serialise a list of documents with --- separators.

docs = fy.loads_all("---\na: 1\n---\nb: 2")
print(fy.dumps_all(docs))
# ---
# a: 1
# ---
# b: 2

Individual node serialisation

FyGeneric objects have their own .dump() method:

import sys

doc = fy.loads("x: 1\ny: 2")
doc["x"].dump()                          # returns "1\n"
doc["x"].dump(strip_newline=True)        # returns "1"
doc["x"].dump("node.yaml")               # writes to file
doc["x"].dump(sys.stdout, mode='json')   # writes to file object

Scalar styles

The style parameter controls how scalar values are written. Accepted values:

| Style | Effect |
| --- | --- |
| None or 'default' | Library default (usually plain) |
| 'original' | Preserve the style from the parsed input (requires keep_style=True at parse time) |
| 'block' | Block scalars (literal \| or folded >) |
| 'flow' | Flow / inline style |
| 'pretty' | Readable multi-line format |
| 'compact' | Compact single-line |
| 'oneline' | Force everything onto one line |

doc = fy.loads("text: 'hello world'")
print(fy.dumps(doc, style='block'))
print(fy.dumps(doc, style='flow'))

Converting Python objects

from_python(obj, *, tag=None, style=None, mutable=False, dedup=True) → FyGeneric

Convert a plain Python object (dict, list, str, int, float, bool, None) to a FyGeneric. Useful for attaching tags or styles before serialisation.

# Attach a YAML tag
v = fy.from_python("hello", tag="!mytag")
print(fy.dumps(v))   # !mytag hello

# Control the scalar style
text = fy.from_python("line one\nline two\n", style='|')
print(fy.dumps(text))
# |
#   line one
#   line two

Scalar style values accepted by from_python:

| Style | Meaning |
| --- | --- |
| '\|' | Literal block scalar |
| '>' | Folded block scalar |
| "'" | Single-quoted |
| '"' | Double-quoted |
| 'plain' or '' | Plain (unquoted) |

Path navigation

get_at_path(path) → FyGeneric

get_at_unix_path(path_str) → FyGeneric

Navigate into a nested document. A path is a list of keys (strings) and indices (integers).

doc = fy.loads("""
servers:
  - host: web01
    port: 80
  - host: web02
    port: 443
""")

doc.get_at_path(["servers", 0, "host"])      # FyGeneric("web01")
doc.get_at_unix_path("/servers/0/host")      # FyGeneric("web01")
doc.get_at_unix_path("/servers/1/port")      # FyGeneric(443)

get_at_path raises KeyError if the path does not exist.
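
To make the path model concrete, the sketch below follows the same kind of key/index list through plain Python data, such as the result of to_python(). `walk` is a hypothetical helper, not part of the binding; unlike get_at_path it returns a default rather than raising when a step is missing.

```python
def walk(data, path, default=None):
    """Follow a list of keys (str) and indices (int) through nested dicts and lists."""
    current = data
    for step in path:
        try:
            current = current[step]
        except (KeyError, IndexError, TypeError):
            # Missing key, index out of range, or a scalar reached too early.
            return default
    return current


plain = {"servers": [{"host": "web01", "port": 80},
                     {"host": "web02", "port": 443}]}
print(walk(plain, ["servers", 1, "port"]))          # 443
print(walk(plain, ["servers", 5, "host"], "n/a"))   # n/a
```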

get_path() → tuple / get_unix_path() → str

Return the path of a node within its document (useful when iterating):

doc = fy.loads("a:\n  b:\n    c: 42")
v = doc.get_at_unix_path("/a/b/c")
v.get_unix_path()    # "/a/b/c"
v.get_path()         # ('a', 'b', 'c')

Path utility functions

fy.path_list_to_unix_path(["servers", 0, "host"])   # "/servers/0/host"
fy.unix_path_to_path_list("/servers/0/host")         # ["servers", 0, "host"]
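
For intuition only, the conversion can be modelled in a few lines of plain Python. This is a sketch of the mapping shown above, not the library's implementation: it assumes keys contain no "/" and that every all-digit component is a sequence index, and the binding's actual escaping rules may differ.

```python
def to_unix_path(path):
    """Join keys and indices into a '/'-separated path string."""
    return "/" + "/".join(str(step) for step in path)


def from_unix_path(path_str):
    """Split a '/'-separated path; all-digit components become int indices."""
    steps = [s for s in path_str.split("/") if s]
    return [int(s) if s.isascii() and s.isdigit() else s for s in steps]


print(to_unix_path(["servers", 0, "host"]))   # /servers/0/host
print(from_unix_path("/servers/0/host"))      # ['servers', 0, 'host']
```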

Mutability

By default FyGeneric objects are immutable. Pass mutable=True to the parse function (or from_python) to allow in-place modification.

doc = fy.loads("x: 1\ny: 2", mutable=True)

doc["x"] = 99
str(doc["x"])   # "99"

doc.set_at_path(["y"], "updated")
doc.set_at_unix_path("/x", 0)

print(fy.dumps(doc))
# x: 0
# y: updated

Attempting to modify an immutable object raises TypeError.


FyDocumentState

FyDocumentState carries the YAML directives that appeared before a document. Access it via FyGeneric.document_state.

doc = fy.loads("%YAML 1.2\n---\nkey: value")
ds = doc.document_state

ds.version           # (1, 2)
ds.version_explicit  # True
ds.json_mode         # False
ds.tags              # list of {'handle': ..., 'prefix': ...} dicts
ds.tags_explicit     # True if %TAG directives were present

document_state is None for values that are not document roots.


Memory management

Allocator strategy

The dedup=True default uses a deduplication allocator that stores only one copy of repeated strings or scalars. This is a significant win for large documents with repeated content (e.g. YAML files with many identical keys or values).

Set dedup=False to use the standard allocator, which may be faster for small documents or documents with little repetition.
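
The idea behind deduplication can be shown with a toy store that keeps each distinct value once and hands out a slot number for repeats. This is a conceptual illustration only; `DedupStore` is hypothetical and says nothing about how the C allocator is actually implemented.

```python
class DedupStore:
    """Toy model of a deduplicating store: each distinct value is kept once."""

    def __init__(self):
        self._index = {}    # value -> slot number
        self._values = []   # one entry per distinct value

    def add(self, value):
        """Store `value` and return its slot; repeats reuse the existing slot."""
        slot = self._index.get(value)
        if slot is None:
            slot = len(self._values)
            self._index[value] = slot
            self._values.append(value)
        return slot

    def __len__(self):
        return len(self._values)


store = DedupStore()
keys = ["name", "type", "description"] * 1000   # 3000 scalars, 3 distinct
slots = [store.add(k) for k in keys]
print(len(keys), len(store))   # 3000 3
```

The lookup on every `add` is the price paid for sharing, which is why a document with little repetition can come out ahead without it.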

Trim

trim=True (default) releases unused allocator pages after parsing is complete. Disable with trim=False if you will be building on the document after parsing and want to avoid reallocation.

Manual trim

doc = fy.loads(large_yaml, trim=False)
# ... do some work ...
doc.trim()   # release unused memory now

Clone

clone() creates an independent copy of a FyGeneric value, decoupled from the original document's allocator:

original = fy.load("big.yaml")
part = original.get_at_unix_path("/config/server").clone()
del original   # can now be collected

Error handling

| Exception | Raised when |
| --- | --- |
| ValueError | Parse error; invalid mode string; invalid style; multiple documents where one was expected |
| TypeError | Wrong argument type; mutation on an immutable object; unhashable mapping key in to_python() or items() |
| KeyError | Path not found in get_at_path / get_at_unix_path |
| RuntimeError | Internal builder or emitter failure; file write error |
| AttributeError | Attribute access on a non-mapping FyGeneric |
| NotImplementedError | del on a FyGeneric item |

try:
    doc = fy.loads("key: [unclosed")
except ValueError as e:
    print(f"Parse error: {e}")

# Or collect errors without raising:
doc = fy.loads("key: [unclosed", collect_diag=True)
if doc.has_diag():
    print(doc.get_diag().to_python())

Comparison with PyYAML

This section describes how the core libfyaml binding relates to PyYAML.

Where they are similar

  • Function names: load, loads, dump, dumps follow the same naming convention as PyYAML's yaml.safe_load / yaml.dump.
  • Python types out: both ultimately produce dict, list, str, int, float, bool, and None. Call .to_python() on a FyGeneric to get the plain Python value.
  • YAML tag handling: both support !!str, !!int, !!float, !!bool, !!null, !!seq, !!map, !!binary, and custom tags.
  • Multi-document streams: both support documents separated by --- via load_all / loads_all.

Where they diverge

Return type

The most immediate difference: loads returns a FyGeneric, not a native Python object. You must call .to_python() (or use the object directly via the container/numeric protocols) to get a plain dict or list.

# PyYAML
import yaml
result = yaml.safe_load("x: 1")
type(result)          # dict

# libfyaml
import libfyaml as fy
result = fy.loads("x: 1")
type(result)          # FyGeneric
type(result.to_python())  # dict

API shape: mode instead of Loader

PyYAML selects behaviour through Loader classes (SafeLoader, FullLoader, BaseLoader). libfyaml uses a mode string:

# PyYAML
yaml.load(s, Loader=yaml.SafeLoader)
yaml.safe_load(s)

# libfyaml
fy.loads(s)                      # YAML 1.2 (roughly equivalent to SafeLoader)
fy.loads(s, mode='yaml1.1-pyyaml')  # closest to PyYAML's SafeLoader behaviour

There are no Loader or Dumper classes in the core binding.

Default YAML version: 1.2 not 1.1

libfyaml defaults to YAML 1.2. PyYAML implements YAML 1.1. This affects implicit type resolution:

| Input | PyYAML (1.1) | libfyaml default (1.2) |
| --- | --- | --- |
| yes / no / on / off | True / False | string |
| 0755 | 493 (octal int) | string |
| 1:30 (sexagesimal) | 90 (int) | string |
| 1.5e3 | 1500.0 | 1500.0 |
| .inf / .nan | inf / nan | inf / nan |

Use mode='yaml1.1' or mode='yaml1.1-pyyaml' to get YAML 1.1 resolution.
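
The three rows that differ come down to a handful of YAML 1.1 resolution rules. The sketch below reproduces them in plain Python so the numbers in the table can be checked by hand. `resolve_yaml11` is a hypothetical helper that covers only these cases (booleans, leading-zero octal, base-60 integers); it is not how either library resolves scalars internally.

```python
import re

YAML11_BOOLS = {"yes": True, "no": False, "on": True, "off": False,
                "true": True, "false": False}


def resolve_yaml11(scalar):
    """Resolve a plain scalar using a small subset of YAML 1.1 implicit typing."""
    lowered = scalar.lower()
    # YAML 1.1 accepts lower, Capitalised and UPPER spellings only.
    if lowered in YAML11_BOOLS and scalar in (lowered, lowered.capitalize(), lowered.upper()):
        return YAML11_BOOLS[lowered]
    if re.fullmatch(r"0[0-7]+", scalar):                      # leading zero: octal
        return int(scalar, 8)
    if re.fullmatch(r"[1-9][0-9]*(:[0-5]?[0-9])+", scalar):   # base-60 integer
        value = 0
        for part in scalar.split(":"):
            value = value * 60 + int(part)
        return value
    return scalar   # everything else stays a string in this sketch


for text in ("yes", "0755", "1:30", "hello"):
    print(text, "->", repr(resolve_yaml11(text)))
```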

Strictness differences in YAML 1.1 mode

Even in yaml1.1-pyyaml mode a few corner cases differ because libfyaml follows the YAML specification more strictly than PyYAML does:

| Situation | PyYAML | libfyaml |
| --- | --- | --- |
| Duplicate anchor (&a 1 ... &a 2) | ComposerError | accepted (spec §3.2.2.2 allows redefinition) |
| Unknown %DIRECTIVE | ScannerError | warning, continues (spec §6.8.1 says SHOULD warn) |
| ? in anchor name (&?foo) | ScannerError | accepted (? is a valid ns-anchor-char per spec §6.9.2) |
| Sexagesimal integers (190:20:30) | 685230 | string (not resolved) |
| Sexagesimal floats (190:20:30.15) | 685230.15 | string (not resolved) |
| Single dot (.) | string | 0.0 (float; C library bug) |
| --- as flow scalar | string | null (C library bug) |

Error messages

libfyaml and PyYAML produce different human-readable error messages for the same parse errors. Code that pattern-matches exception strings will need adjustment; code that only catches the exception type will be fine.
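
A minimal sketch of the type-only pattern follows. `try_parse` is a hypothetical wrapper, and `json.loads` stands in for `fy.loads` purely so the snippet runs without the binding installed; it works as a stand-in because `json.JSONDecodeError` is a `ValueError` subclass, the same exception type the error table lists for parse errors.

```python
import json


def try_parse(parse, text):
    """Call parse(text); return (result, None) or (None, error message)."""
    try:
        return parse(text), None
    except ValueError as exc:
        # Branch on the exception type only. The message wording is
        # library-specific and is treated here as display text.
        return None, str(exc)


doc, err = try_parse(json.loads, '{"key": [1, 2')
print(doc is None, err is not None)   # True True
```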

Block scalar emission

libfyaml follows the YAML spec strictly when choosing scalar styles, which means it will refuse to use a block scalar (| or >) in contexts where the spec does not permit one — for example as a value inside a flow collection. PyYAML emits block scalars in those contexts anyway, producing output that is technically non-conformant. If you serialise a document that PyYAML would render with block scalars inside flow collections, libfyaml will choose a flow-compatible style (double-quoted) instead.

Unicode line separators (U+2028 / U+2029)

The YAML 1.2 spec (§6.5) classifies U+2028 (LINE SEPARATOR) and U+2029 (PARAGRAPH SEPARATOR) as line-break characters. libfyaml honours this in block scalars, treating them as line breaks during both parsing and emission. PyYAML predates this clarification and treats them as ordinary non-breaking characters throughout. If your data contains these code points, block-style round-trips will produce different results between the two libraries. Use double-quoted scalars to preserve them unambiguously in either library.
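
Python's own string methods show the same split in interpretation: `str.splitlines()` treats both code points as line boundaries, while splitting on "\n" does not. The helper below is a hypothetical pre-check that could be run on data before choosing a scalar style; it is not part of either library.

```python
LINE_SEPARATORS = ("\u2028", "\u2029")


def has_unicode_line_separator(text):
    """True if text contains U+2028 (LINE SEPARATOR) or U+2029 (PARAGRAPH SEPARATOR)."""
    return any(ch in text for ch in LINE_SEPARATORS)


sample = "first\u2028second"
print(has_unicode_line_separator(sample))   # True
print(len(sample.split("\n")))              # 1
print(sample.splitlines())                  # ['first', 'second']
```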

!!binary tag syntax

libfyaml accepts inline !!binary scalars (!!binary aGVsbG8=) in addition to the block form that PyYAML requires (!!binary |\n aGVsbG8=). Both forms decode to bytes.
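
The payload in both forms is ordinary base64, which can be checked with the standard library alone. The snippet below decodes the inline payload from the example and a block-style payload split across lines; it does not call either YAML library.

```python
import base64

payload = "aGVsbG8="
decoded = base64.b64decode(payload)
print(decoded)   # b'hello'

# A block scalar may carry line breaks and indentation inside the payload;
# removing all whitespace before decoding yields the same bytes.
block_form = "  aGVs\n  bG8=\n"
decoded_block = base64.b64decode("".join(block_form.split()))
print(decoded_block)   # b'hello'
```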

Features not in PyYAML

The core binding provides capabilities that PyYAML has no equivalent for:

  • Source markers (create_markers=True) — byte/line/column positions for every node, without the overhead of PyYAML's Mark objects on events.
  • Comment preservation (keep_comments=True).
  • Style preservation (keep_style=True) — round-trip the original scalar style (literal, folded, single-quoted, etc.).
  • Path navigation (get_at_unix_path, set_at_unix_path): direct document surgery without tree traversal code.
  • Deduplication allocator — dramatically lower memory usage for documents with repeated content.
  • FyDocumentState — programmatic access to %YAML and %TAG directives.

Appendix: Parse performance

Methodology

Configurations were measured by running docs/benchmark-parse.py against two real-world YAML files. Each configuration runs in an isolated subprocess so that allocations from earlier runs cannot inflate later measurements.

All libraries are imported before the baseline RSS is measured so that library load cost (the .so footprint) is excluded from the delta. The RSS delta therefore reflects only the memory added by parsing that specific file — the data structures created, the source text mapped, the allocator pages used.

Five timed repetitions were taken per configuration; the tables report the median parse time and median peak RSS delta across those runs.
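
The timing half of that procedure can be sketched as follows. This is not the benchmark script: `median_parse_ms` is a hypothetical helper that runs in-process, so it neither isolates runs in a subprocess nor measures RSS, and `json.loads` is only a stand-in parser so the snippet runs without any YAML library.

```python
import json
import statistics
import time


def median_parse_ms(parse, text, runs=5):
    """Time parse(text) `runs` times; return (median, minimum) in milliseconds."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        parse(text)
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples), min(samples)


median_ms, min_ms = median_parse_ms(json.loads, '{"a": [1, 2, 3]}')
print(f"median {median_ms:.3f} ms, min {min_ms:.3f} ms")
```

Reporting the median rather than the mean keeps a single slow run (for example one hit by a cold cache) from skewing the figure.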

The benchmark can be reproduced on any YAML file:

python3 docs/benchmark-parse.py <file.yaml> [--runs N] [--multi]

Use --multi for files containing multiple documents separated by ---.

Note on PyYAML compatibility. PyYAML's SafeLoader and CLoader do not recognise tag:yaml.org,2002:value, the tag YAML 1.1 assigns to a bare = scalar. YAML 1.2 treats = as a plain string, and it appears legitimately in both test files (e.g. as an enum value in Kubernetes CRD schemas). The benchmark registers a one-line constructor fix so PyYAML can parse these files; libfyaml handles them correctly without any patching.

Environment

| Item | Version |
| --- | --- |
| CPU | AMD Ryzen 5 5600X |
| Python | 3.12.3 |
| PyYAML | 6.0.1 |
| libyaml (CLoader) | 0.2.5 |
| libfyaml | v0.9.3-278 (release build) |

Results — 6.4 MB (AtomicCards-2-cleaned-small.yaml, single-doc)

Magic: The Gathering card database — highly varied text content with moderate key repetition.

[Chart: Parse time, AtomicCards 6.4 MB (ms, lower is better)]
[Chart: RSS delta, AtomicCards 6.4 MB (MB, lower is better)]

| Configuration | Median time | Min time | RSS delta |
| --- | --- | --- | --- |
| PyYAML safe_load (pure Python) | 7155 ms | 7033 ms | +164 MB |
| PyYAML CLoader (libyaml) | 1228 ms | 1172 ms | +123 MB |
| libfyaml dedup=True (default) | 115 ms | 114 ms | +28 MB |
| libfyaml dedup=False | 102 ms | 101 ms | +25 MB |

Results — 4.3 MB (bundle.yaml, multi-doc, 24 documents)

Prometheus Operator CRD bundle (source) — structured Kubernetes schemas with heavy key repetition (name, type, description, properties, spec recurring throughout).

[Chart: Parse time, bundle.yaml 4.3 MB (ms, lower is better)]
[Chart: RSS delta, bundle.yaml 4.3 MB (MB, lower is better)]

| Configuration | Median time | Min time | RSS delta |
| --- | --- | --- | --- |
| PyYAML safe_load (pure Python) | 2964 ms | 2919 ms | +16 MB |
| PyYAML CLoader (libyaml) | 274 ms | 267 ms | +14 MB |
| libfyaml dedup=True (default) | 53 ms | 52 ms | +3 MB |
| libfyaml dedup=False | 48 ms | 48 ms | +10 MB |

Analysis

Speed. With the default settings, libfyaml is roughly 5× (CRD bundle) to 11× (card database) faster than CLoader and 56–62× faster than pure-Python PyYAML. The gap against the pure-Python loader is expected: PyYAML constructs every node as a heap-allocated Python object while iterating the event stream in interpreted bytecode. The gap against CLoader is more meaningful: both parsers are written in C, but libfyaml uses mmap for file I/O, a purpose-built allocator, and avoids the two-phase parse/construct split that libyaml's event model requires.

Memory. libfyaml consistently uses less RSS than PyYAML for the parsed data structure. PyYAML allocates a heap object (dict, list, str, int, …) for every node in the document; libfyaml stores values in its arena allocator with FyGeneric wrappers created lazily on access. On the card database, libfyaml uses ~78% less RSS than CLoader (+25–28 MB vs +123 MB); on the CRD bundle it uses ~29–79% less (+3–10 MB vs +14 MB), with the larger saving coming from the default dedup=True.

Note that libfyaml's .so file itself has a significant up-front import cost (~50 MB RSS), which is a fixed one-time overhead amortised across all subsequent load() calls and not included in the delta figures above.

dedup vs no-dedup. On the card database, dedup=True adds ~13 ms and uses ~3 MB more RSS (+28 MB vs +25 MB): the text content is highly varied, so the dedup allocator finds little to share and its bookkeeping appears to cost more than it saves. On the CRD bundle, dedup=True saves 7 MB compared to dedup=False because Kubernetes schemas repeat the same field names (name, type, description, properties, …) thousands of times across 24 documents. The deduplication allocator is the right default for structured configuration and API-schema YAML; for documents with unique free-form text, dedup=False is marginally faster.
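
The speed and memory ratios can be re-derived from the two result tables with a few lines of arithmetic. The figures below are copied from those tables; `speedup` and `rss_saving` are hypothetical helper names used only for this check.

```python
# (median parse time in ms, RSS delta in MB), copied from the result tables.
RESULTS = {
    "AtomicCards": {"safe_load": (7155, 164), "CLoader": (1228, 123),
                    "dedup=True": (115, 28), "dedup=False": (102, 25)},
    "bundle": {"safe_load": (2964, 16), "CLoader": (274, 14),
               "dedup=True": (53, 3), "dedup=False": (48, 10)},
}


def speedup(dataset, baseline, candidate="dedup=True"):
    """How many times faster `candidate` is than `baseline` (ratio of medians)."""
    return RESULTS[dataset][baseline][0] / RESULTS[dataset][candidate][0]


def rss_saving(dataset, baseline, candidate="dedup=True"):
    """Fraction of the baseline RSS delta that `candidate` avoids."""
    base, cand = RESULTS[dataset][baseline][1], RESULTS[dataset][candidate][1]
    return 1 - cand / base


for name in RESULTS:
    print(name,
          f"{speedup(name, 'CLoader'):.1f}x vs CLoader,",
          f"{speedup(name, 'safe_load'):.1f}x vs safe_load,",
          f"{rss_saving(name, 'CLoader'):.0%} less RSS than CLoader")
```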