The libfyaml Python binding exposes the high-performance libfyaml C library
directly. Parsed documents are represented as FyGeneric objects — lazy
wrappers that defer conversion to Python types until you ask for them. This
keeps memory low and lets you navigate large documents without materialising
every node.
- Quick Start
- Parsing
- The FyGeneric Type
- Serialisation
- Converting Python objects
- Path navigation
- Mutability
- FyDocumentState
- Memory management
- Error handling
- Comparison with PyYAML
import libfyaml as fy
# Parse a YAML string
doc = fy.loads("name: Alice\nage: 30")
print(doc["name"]) # FyGeneric wrapping "Alice"
print(str(doc["name"])) # "Alice"
print(doc.to_python()) # {'name': 'Alice', 'age': 30}
# Parse a file
doc = fy.load("config.yaml")
# Serialise back to YAML
print(fy.dumps(doc))
# Parse JSON
data = fy.loads('{"x": 1}', mode='json')

Parse a YAML or JSON string. Raises ValueError if the input contains
more than one document; use loads_all for multi-document streams.
doc = fy.loads("key: value")
docs = fy.loads_all("---\na: 1\n---\nb: 2") # list of FyGeneric

Parse from a file path (a string, read via mmap internally) or any file-like
object with a .read() method.
doc = fy.load("data.yaml")
with open("data.yaml") as f:
    doc = fy.load(f)

Return all documents in a multi-document stream as a list.
docs = fy.loads_all("---\n1\n---\n2\n---\n3")
# [FyGeneric(1), FyGeneric(2), FyGeneric(3)]

The mode parameter controls which YAML dialect is accepted:

| Mode string | Meaning |
|---|---|
| 'yaml', 'yaml1.2', '1.2' | YAML 1.2 (the default) |
| 'yaml1.1', '1.1' | YAML 1.1 (accepts merge keys <<, sexagesimal numbers, etc.) |
| 'yaml1.1-pyyaml', 'pyyaml' | YAML 1.1 with PyYAML-compatible quirks (used by the compat layer) |
| 'json' | Strict JSON |
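The alias groups in the table can be pictured as a lookup from each accepted string to one canonical dialect name. The sketch below is purely illustrative: `normalise_mode` and `MODE_ALIASES` are hypothetical names, not part of the libfyaml API, and raising ValueError for an unknown string is modelled on the error-handling table later in this document.

```python
# Hypothetical helper: maps every mode string from the table to one canonical name.
MODE_ALIASES = {
    "yaml": "yaml1.2", "yaml1.2": "yaml1.2", "1.2": "yaml1.2",
    "yaml1.1": "yaml1.1", "1.1": "yaml1.1",
    "yaml1.1-pyyaml": "yaml1.1-pyyaml", "pyyaml": "yaml1.1-pyyaml",
    "json": "json",
}


def normalise_mode(mode="yaml"):
    """Return the canonical dialect name, or raise ValueError if unknown."""
    try:
        return MODE_ALIASES[mode]
    except KeyError:
        raise ValueError(f"invalid mode string: {mode!r}") from None


print(normalise_mode("pyyaml"))  # yaml1.1-pyyaml
print(normalise_mode())          # yaml1.2
```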
# Merge keys only work in YAML 1.1
doc = fy.loads("""
defaults: &defaults
  timeout: 30
server:
  <<: *defaults
  host: localhost
""", mode='yaml1.1')

All four parse functions accept the same keyword options:

| Option | Default | Description |
|---|---|---|
| mode | 'yaml' | Dialect (see above) |
| dedup | True | Use the deduplication allocator (saves memory for documents with repeated content) |
| trim | True | Release unused allocator memory after parsing |
| mutable | False | Produce mutable FyGeneric objects (required for __setitem__ and set_at_path) |
| collect_diag | False | Attach parse diagnostics to the result instead of raising |
| create_markers | False | Record byte/line/column positions for every node |
| keep_comments | False | Preserve YAML comments in the document |
| keep_style | False | Preserve original scalar styles (literal, folded, quoted, …) |

FyGeneric is the type returned by all parse functions. It wraps a C
fy_generic value without copying data. Conversion to Python only happens
when you explicitly ask for it.
doc = fy.loads("x: 42")
type(doc) # <class 'libfyaml._libfyaml.FyGeneric'>
doc.__class__ # <class 'dict'>, the equivalent Python class

Eight predicate methods, all returning bool:
v = fy.loads("42")
v.is_null() # False
v.is_bool() # False
v.is_int() # True
v.is_float() # False
v.is_string() # False
v.is_sequence() # False
v.is_mapping() # False
v.is_indirect() # True if the value carries a tag or anchor

doc = fy.loads("items: [1, 2, 3]")
# Recursive — the whole document becomes plain Python
doc.to_python() # {'items': [1, 2, 3]}
# Scalar coercions
n = fy.loads("99")
int(n) # 99
float(n) # 99.0
bool(n) # True
str(n) # "99"

to_python() raises TypeError if a mapping key is unhashable (e.g. a
nested mapping used as a key).
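The TypeError comes from a Python restriction rather than from YAML: dict keys must be hashable, and a dict is not. The short sketch below reproduces the situation with plain Python objects only. The tuple-based workaround is an assumption for illustration, not a libfyaml feature.

```python
nested_key = {"a": 1}  # stands in for a YAML mapping used as a key

try:
    plain = {nested_key: "value"}  # dict keys must be hashable
except TypeError:
    raised = True
else:
    raised = False
print(raised)  # True

# One possible workaround (an assumption, not a libfyaml feature):
# freeze the key into a hashable tuple of sorted items.
frozen_key = tuple(sorted(nested_key.items()))
plain = {frozen_key: "value"}
print(plain)  # {(('a', 1),): 'value'}
```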
Sequences and mappings support the standard Python container protocol:
doc = fy.loads("fruits: [apple, banana, cherry]")
fruits = doc["fruits"]
len(fruits) # 3
fruits[0] # FyGeneric("apple")
str(fruits[0]) # "apple"
"banana" in fruits # True (linear scan)
for item in fruits:
    print(str(item))
# Mappings
doc["fruits"] # FyGeneric sequence
doc.keys() # ['fruits']
doc.values() # [FyGeneric sequence]
doc.items() # [('fruits', FyGeneric sequence)]

Attribute access on mappings delegates to the underlying dict:
doc = fy.loads("host: localhost\nport: 8080")
str(doc.host) # "localhost"
int(doc.port) # 8080

Numeric operations on integer and float values work directly:
v = fy.loads("10")
v + 5 # 15
v * 2 # 20
v > 5 # True

doc = fy.loads("value: !!int '42'")
v = doc["value"]
v.has_tag() # True
v.get_tag() # "tag:yaml.org,2002:int"
doc2 = fy.loads("x: &myanchor hello\ny: *myanchor")
doc2["x"].has_anchor() # True
doc2["x"].get_anchor() # "myanchor"

Markers record the byte offset, line, and column of each node in the original
source. Enable them at parse time with create_markers=True.
doc = fy.loads("host: localhost\nport: 8080", create_markers=True)
m = doc["host"].get_marker()
# (start_byte, start_line, start_col, end_byte, end_line, end_col)
# e.g. (6, 0, 6, 15, 0, 15)
doc["host"].has_marker() # True
doc["port"].get_marker() # (22, 1, 6, 26, 1, 10)

Lines and columns are zero-based. get_marker() returns None when markers
were not enabled.
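To make the tuple layout concrete, the following sketch recomputes a marker for a scalar by searching the source text directly. It is a pure-Python illustration under stated assumptions: `value_marker` is a hypothetical helper, columns are counted in bytes, and the end position is exclusive. It does not call libfyaml.

```python
def value_marker(source, value):
    """Return (start_byte, start_line, start_col, end_byte, end_line, end_col)."""
    data = source.encode("utf-8")
    needle = value.encode("utf-8")
    start = data.index(needle)       # first occurrence of the value
    end = start + len(needle)        # exclusive end offset

    def line_col(offset):
        line = data.count(b"\n", 0, offset)          # zero-based line
        line_start = data.rfind(b"\n", 0, offset) + 1
        return line, offset - line_start             # zero-based column

    start_line, start_col = line_col(start)
    end_line, end_col = line_col(end)
    return (start, start_line, start_col, end, end_line, end_col)


source = "host: localhost\nport: 8080"
print(value_marker(source, "localhost"))  # (6, 0, 6, 15, 0, 15)
```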
Preserve YAML comments by parsing with keep_comments=True.
yaml_text = """\
# Server settings
host: localhost # primary
port: 8080
"""
doc = fy.loads(yaml_text, keep_comments=True)
doc["host"].get_comment() # "# primary"
doc["host"].has_comment() # True

With collect_diag=True, parse errors are attached to the document rather than
raised immediately. This lets you process partially valid input.
doc = fy.loads("good: ok\nbad: {unclosed", collect_diag=True)
doc.has_diag() # True
doc.get_diag() # FyGeneric describing the error(s)

Serialise a FyGeneric or plain Python object to a YAML (or JSON) string.
doc = fy.loads("name: Alice\nscores: [10, 20, 30]")
print(fy.dumps(doc))
# name: Alice
# scores:
# - 10
# - 20
# - 30
print(fy.dumps(doc, compact=True))
# {name: Alice, scores: [10, 20, 30]}
print(fy.dumps(doc, json=True))
# {"name": "Alice", "scores": [10, 20, 30]}

indent sets the indentation width (2–8 spaces; 0 uses the library default).

Write to a file path (string) or file-like object. mode accepts 'yaml' or
'json'.
fy.dump("output.yaml", doc)
with open("output.json", "w") as f:
    fy.dump(f, doc, mode='json')

Serialise a list of documents with --- separators.
docs = fy.loads_all("---\na: 1\n---\nb: 2")
print(fy.dumps_all(docs))
# ---
# a: 1
# ---
# b: 2

FyGeneric objects have their own .dump() method:
import sys
doc = fy.loads("x: 1\ny: 2")
doc["x"].dump() # returns "1\n"
doc["x"].dump(strip_newline=True) # returns "1"
doc["x"].dump("node.yaml") # writes to file
doc["x"].dump(sys.stdout, mode='json') # writes to file object

The style parameter controls how scalar values are written. Accepted values:

| Style | Effect |
|---|---|
| None or 'default' | Library default (usually plain) |
| 'original' | Preserve the style from the parsed input (requires keep_style=True at parse time) |
| 'block' | Block scalars (literal \| or folded >) |
| 'flow' | Flow / inline style |
| 'pretty' | Readable multi-line format |
| 'compact' | Compact single-line |
| 'oneline' | Force everything onto one line |

doc = fy.loads("text: 'hello world'")
print(fy.dumps(doc, style='block'))
print(fy.dumps(doc, style='flow'))

Convert a plain Python object (dict, list, str, int, float, bool,
None) to a FyGeneric. Useful for attaching tags or styles before
serialisation.
# Attach a YAML tag
v = fy.from_python("hello", tag="!mytag")
print(fy.dumps(v)) # !mytag hello
# Control the scalar style
text = fy.from_python("line one\nline two\n", style='|')
print(fy.dumps(text))
# |
#   line one
#   line two

Scalar style values accepted by from_python:

| Style | Meaning |
|---|---|
| '\|' | Literal block scalar |
| '>' | Folded block scalar |
| "'" | Single-quoted |
| '"' | Double-quoted |
| 'plain' or '' | Plain (unquoted) |

Navigate into a nested document. A path is a list of keys (strings) and indices (integers).
doc = fy.loads("""
servers:
  - host: web01
    port: 80
  - host: web02
    port: 443
""")
doc.get_at_path(["servers", 0, "host"]) # FyGeneric("web01")
doc.get_at_unix_path("/servers/0/host") # FyGeneric("web01")
doc.get_at_unix_path("/servers/1/port") # FyGeneric(443)

get_at_path raises KeyError if the path does not exist.
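The lookup semantics can be sketched on plain Python containers. This is a minimal illustration, not libfyaml's implementation: `unix_path_to_keys` and `walk_path` are hypothetical helpers, and treating all-digit path segments as sequence indices is an assumption.

```python
def unix_path_to_keys(path):
    """'/servers/0/host' -> ['servers', 0, 'host'] (all-digit parts become ints)."""
    parts = [p for p in path.strip("/").split("/") if p]
    return [int(p) if p.isdigit() else p for p in parts]


def walk_path(data, keys):
    """Follow keys and indices through nested dicts/lists; KeyError if missing."""
    node = data
    for key in keys:
        try:
            node = node[key]
        except (KeyError, IndexError, TypeError):
            raise KeyError(keys) from None
    return node


config = {"servers": [{"host": "web01", "port": 80},
                      {"host": "web02", "port": 443}]}

print(walk_path(config, ["servers", 0, "host"]))                  # web01
print(walk_path(config, unix_path_to_keys("/servers/1/port")))    # 443
```

A missing key, an out-of-range index, and an attempt to index into a scalar are all reported as KeyError, mirroring the behaviour described for get_at_path.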
Return the path of a node within its document (useful when iterating):
doc = fy.loads("a:\n  b:\n    c: 42")
v = doc.get_at_unix_path("/a/b/c")
v.get_unix_path() # "/a/b/c"
v.get_path() # ('a', 'b', 'c')

fy.path_list_to_unix_path(["servers", 0, "host"]) # "/servers/0/host"
fy.unix_path_to_path_list("/servers/0/host") # ["servers", 0, "host"]

By default FyGeneric objects are immutable. Pass mutable=True to the parse
function (or from_python) to allow in-place modification.
doc = fy.loads("x: 1\ny: 2", mutable=True)
doc["x"] = 99
str(doc["x"]) # "99"
doc.set_at_path(["y"], "updated")
doc.set_at_unix_path("/x", 0)
print(fy.dumps(doc))
# x: 0
# y: updated

Attempting to modify an immutable object raises TypeError.
FyDocumentState carries the YAML directives that appeared before a document.
Access it via FyGeneric.document_state.
doc = fy.loads("%YAML 1.2\n---\nkey: value")
ds = doc.document_state
ds.version # (1, 2)
ds.version_explicit # True
ds.json_mode # False
ds.tags # list of {'handle': ..., 'prefix': ...} dicts
ds.tags_explicit # True if %TAG directives were present

document_state is None for values that are not document roots.
The dedup=True default uses a deduplication allocator that stores only one
copy of repeated strings or scalars. This is a significant win for large
documents with repeated content (e.g. YAML files with many identical keys or
values).
Set dedup=False to use the standard allocator, which may be faster for
small documents or documents with little repetition.
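The idea behind deduplication can be shown with a small pure-Python pool that keeps one shared copy of each distinct value. This is only a conceptual sketch: `DedupPool` is a hypothetical class and says nothing about how the C allocator is actually implemented.

```python
class DedupPool:
    """Keeps one shared copy of each distinct value."""

    def __init__(self):
        self._pool = {}

    def store(self, value):
        # setdefault returns the first copy stored for an equal value
        return self._pool.setdefault(value, value)

    def __len__(self):
        return len(self._pool)


field_names = ["name", "type", "description", "properties"] * 1000
raw = [name.encode("utf-8") for name in field_names]  # 4000 encoded values
pool = DedupPool()
shared = [pool.store(item) for item in raw]

distinct_objects = len({id(item) for item in shared})
print(len(shared), len(pool), distinct_objects)  # 4000 4 4
```

With heavy repetition the pool holds four values for four thousand references. With mostly unique content the pool would grow almost as large as the input, and the lookup work would be overhead.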
trim=True (default) releases unused allocator pages after parsing is
complete. Disable with trim=False if you will be building on the document
after parsing and want to avoid reallocation.
doc = fy.loads(large_yaml, trim=False)
# ... do some work ...
doc.trim() # release unused memory now

clone() creates an independent copy of a FyGeneric value, decoupled from
the original document's allocator:
original = fy.load("big.yaml")
part = original.get_at_unix_path("/config/server").clone()
del original # can now be collected

| Exception | Raised when |
|---|---|
| ValueError | Parse error; invalid mode string; invalid style; multiple documents where one was expected |
| TypeError | Wrong argument type; mutation on an immutable object; unhashable mapping key in to_python() or items() |
| KeyError | Path not found in get_at_path / get_at_unix_path |
| RuntimeError | Internal builder or emitter failure; file write error |
| AttributeError | Attribute access on a non-mapping FyGeneric |
| NotImplementedError | del on a FyGeneric item |

try:
    doc = fy.loads("key: [unclosed")
except ValueError as e:
    print(f"Parse error: {e}")
# Or collect errors without raising:
doc = fy.loads("key: [unclosed", collect_diag=True)
if doc.has_diag():
    print(doc.get_diag().to_python())

This section describes how the core libfyaml binding relates to PyYAML.
- Function names: load, loads, dump, dumps follow the same naming convention as PyYAML's yaml.safe_load / yaml.dump.
- Python types out: both ultimately produce dict, list, str, int, float, bool, and None. Call .to_python() on a FyGeneric to get the plain Python value.
- YAML tag handling: both support !!str, !!int, !!float, !!bool, !!null, !!seq, !!map, !!binary, and custom tags.
- Multi-document streams: both support documents separated by --- via load_all / loads_all.
The most immediate difference: loads returns a FyGeneric, not a native
Python object. You must call .to_python() (or use the object directly via
the container/numeric protocols) to get a plain dict or list.
# PyYAML
import yaml
result = yaml.safe_load("x: 1")
type(result) # dict
# libfyaml
import libfyaml as fy
result = fy.loads("x: 1")
type(result) # FyGeneric
type(result.to_python()) # dict

PyYAML selects behaviour through Loader classes (SafeLoader,
FullLoader, BaseLoader). libfyaml uses a mode string:
# PyYAML
yaml.load(s, Loader=yaml.SafeLoader)
yaml.safe_load(s)
# libfyaml
fy.loads(s) # YAML 1.2 (roughly equivalent to SafeLoader)
fy.loads(s, mode='yaml1.1-pyyaml') # closest to PyYAML's SafeLoader behaviour

There are no Loader or Dumper classes in the core binding.
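Because loads returns a lazy object while PyYAML returns native data, code shared between the two libraries may want to normalise results at the boundary. The duck-typed helper below is one possible sketch: `to_plain` and `FakeLazy` are hypothetical names, and the only thing borrowed from the binding is the to_python() method name described earlier.

```python
def to_plain(value):
    """Return plain Python data whether `value` is lazy or already native."""
    convert = getattr(value, "to_python", None)
    return convert() if callable(convert) else value


class FakeLazy:
    """Stand-in for a lazy wrapper, used only to exercise the helper."""

    def __init__(self, data):
        self._data = data

    def to_python(self):
        return self._data


print(to_plain({"x": 1}))            # {'x': 1}
print(to_plain(FakeLazy({"x": 1})))  # {'x': 1}
```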
libfyaml defaults to YAML 1.2. PyYAML implements YAML 1.1. This affects implicit type resolution:

| Input | PyYAML (1.1) | libfyaml default (1.2) |
|---|---|---|
| yes / no / on / off | True / False | string |
| 0755 | 493 (octal int) | string |
| 1:30 (sexagesimal) | 90 (int) | string |
| 1.5e3 | 1500.0 | 1500.0 |
| .inf / .nan | inf / nan | inf / nan |

Use mode='yaml1.1' or mode='yaml1.1-pyyaml' to get YAML 1.1 resolution.
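The YAML 1.1 values in the table follow from simple arithmetic: a leading zero means base 8, and colon-separated groups are base-60 digits. The sketch below is a simplified illustration of those rules only. `resolve_yaml11` is a hypothetical function, it covers just booleans, octal and sexagesimal integers, and it is not how either library is implemented (for example, it lowercases booleans more permissively than a real resolver).

```python
import re

BOOL_11 = {"yes": True, "no": False, "on": True, "off": False,
           "true": True, "false": False}
OCTAL_11 = re.compile(r"^[-+]?0[0-7]+$")
SEXAGESIMAL_11 = re.compile(r"^[-+]?[1-9][0-9]*(:[0-5]?[0-9])+$")


def resolve_yaml11(text):
    """Resolve a plain scalar using a subset of YAML 1.1 implicit typing."""
    lowered = text.lower()
    if lowered in BOOL_11:
        return BOOL_11[lowered]
    if OCTAL_11.match(text):
        return int(text, 8)               # 0755 -> 7*64 + 5*8 + 5
    if SEXAGESIMAL_11.match(text):
        sign = -1 if text.startswith("-") else 1
        total = 0
        for part in text.lstrip("+-").split(":"):
            total = total * 60 + int(part)  # each group is a base-60 digit
        return sign * total
    return text                            # everything else stays a string


print(resolve_yaml11("yes"))        # True
print(resolve_yaml11("0755"))       # 493
print(resolve_yaml11("1:30"))       # 90
print(resolve_yaml11("190:20:30"))  # 685230
```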
Even in yaml1.1-pyyaml mode a few corner cases differ because libfyaml
follows the YAML specification more strictly than PyYAML does:

| Situation | PyYAML | libfyaml |
|---|---|---|
| Duplicate anchor (&a 1 ... &a 2) | ComposerError | accepted (spec §3.2.2.2 allows redefinition) |
| Unknown %DIRECTIVE | ScannerError | warning, continues (spec §6.8.1 says SHOULD warn) |
| ? in anchor name (&?foo) | ScannerError | accepted (? is a valid ns-anchor-char per spec §6.9.2) |
| Sexagesimal integers (190:20:30) | 685230 | string (not resolved) |
| Sexagesimal floats (190:20:30.15) | 685230.15 | string (not resolved) |
| Single dot (.) | string | 0.0 (float; C library bug) |
| --- as flow scalar | string | null (C library bug) |

libfyaml and PyYAML produce different human-readable error messages for the same parse errors. Code that pattern-matches exception strings will need adjustment; code that only catches the exception type will be fine.
libfyaml follows the YAML spec strictly when choosing scalar styles, which
means it will refuse to use a block scalar (| or >) in contexts where
the spec does not permit one — for example as a value inside a flow
collection. PyYAML emits block scalars in those contexts anyway, producing
output that is technically non-conformant. If you serialise a document that
PyYAML would render with block scalars inside flow collections, libfyaml will
choose a flow-compatible style (double-quoted) instead.
The YAML 1.2 spec (§6.5) classifies U+2028 (LINE SEPARATOR) and U+2029 (PARAGRAPH SEPARATOR) as line-break characters. libfyaml honours this in block scalars, treating them as line breaks during both parsing and emission. PyYAML predates this clarification and treats them as ordinary non-breaking characters throughout. If your data contains these code points, block-style round-trips will produce different results between the two libraries. Use double-quoted scalars to preserve them unambiguously in either library.
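Python's own string handling shows both halves of this advice. The sketch uses only the standard library: str.splitlines() treats U+2028 as a line boundary, while an escaped, double-quoted form (produced here with json.dumps, whose output is also a valid YAML double-quoted scalar) carries the character through unchanged.

```python
import json

text = "first\u2028second"

# splitlines() recognises U+2028 as a line boundary; split("\n") does not.
print(text.splitlines())       # ['first', 'second']
print(len(text.split("\n")))   # 1

# Escaping inside a double-quoted scalar keeps the code point unambiguous.
quoted = json.dumps(text)
print(quoted)                  # "first\u2028second"
print(json.loads(quoted) == text)  # True
```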
libfyaml accepts inline !!binary scalars (!!binary aGVsbG8=) in addition
to the block form that PyYAML requires (!!binary |\n aGVsbG8=). Both forms
decode to bytes.
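For reference, the payload in both forms is ordinary base64, so the expected bytes can be checked with the standard library alone. This sketch does not call libfyaml; it only shows what the two scalar bodies decode to.

```python
import base64

inline_body = "aGVsbG8="      # body of the inline form
block_body = "aGVsbG8=\n"     # body of the block form, with its trailing newline

# b64decode ignores the trailing newline by default (validate=False).
print(base64.b64decode(inline_body))  # b'hello'
print(base64.b64decode(block_body))   # b'hello'
```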
The core binding provides capabilities that PyYAML has no equivalent for:
- Source markers (create_markers=True): byte/line/column positions for every node, without the overhead of PyYAML's Mark objects on events.
- Comment preservation (keep_comments=True).
- Style preservation (keep_style=True): round-trip the original scalar style (literal, folded, single-quoted, etc.).
- Path navigation: get_at_unix_path, set_at_unix_path for direct document surgery without tree traversal code.
- Deduplication allocator: dramatically lower memory usage for documents with repeated content.
- FyDocumentState: programmatic access to %YAML and %TAG directives.
Configurations were measured by running docs/benchmark-parse.py against
two real-world YAML files. Each configuration runs in an isolated
subprocess so that allocations from earlier runs cannot inflate later
measurements.
All libraries are imported before the baseline RSS is measured so that
library load cost (the .so footprint) is excluded from the delta. The RSS
delta therefore reflects only the memory added by parsing that specific file —
the data structures created, the source text mapped, the allocator pages used.
Five timed repetitions were taken per configuration; the tables report the median parse time and median peak RSS delta across those runs.
The benchmark can be reproduced on any YAML file:
python3 docs/benchmark-parse.py <file.yaml> [--runs N] [--multi]
Use --multi for files containing multiple documents separated by ---.
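The timing part of the methodology (several repetitions, median and minimum reported) can be approximated with the standard library. This is a simplified sketch and not the contents of docs/benchmark-parse.py: `time_parse` is a hypothetical helper, json.loads is only a stand-in parser so the code runs anywhere, and the subprocess isolation and RSS measurement described above are omitted.

```python
import json
import statistics
import time


def time_parse(parse, text, runs=5):
    """Return (median_ms, min_ms) over `runs` timed calls of parse(text)."""
    samples_ms = []
    for _ in range(runs):
        start = time.perf_counter()
        parse(text)
        samples_ms.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples_ms), min(samples_ms)


# json.loads is only a stand-in parser so the sketch runs anywhere.
payload = json.dumps({"items": list(range(10_000))})
median_ms, min_ms = time_parse(json.loads, payload)
print(f"median {median_ms:.3f} ms, min {min_ms:.3f} ms")
```

Reporting the median rather than the mean keeps a single slow run (for example, one hit by a page-cache miss) from skewing the figure.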
Note on PyYAML compatibility. PyYAML's SafeLoader and CLoader do not
recognise tag:yaml.org,2002:value, the tag YAML 1.1 assigns to a bare =
scalar. YAML 1.2 treats = as a plain string, and it appears legitimately in
both test files (e.g. as an enum value in Kubernetes CRD schemas). The
benchmark registers a one-line constructor fix so PyYAML can parse these files;
libfyaml handles them correctly without any patching.
Environment
| Item | Version |
|---|---|
| CPU | AMD Ryzen 5 5600X |
| Python | 3.12.3 |
| PyYAML | 6.0.1 |
| libyaml (CLoader) | 0.2.5 |
| libfyaml | v0.9.3-278 (release build) |
Magic: The Gathering card database — highly varied text content with moderate key repetition.
xychart-beta horizontal
title "Parse time — AtomicCards 6.4 MB (ms, lower is better)"
x-axis ["PyYAML safe_load", "PyYAML CLoader", "libfyaml dedup=on", "libfyaml dedup=off"]
y-axis "ms" 0 --> 7500
bar [7155, 1228, 115, 102]
xychart-beta horizontal
title "RSS delta — AtomicCards 6.4 MB (MB, lower is better)"
x-axis ["PyYAML safe_load", "PyYAML CLoader", "libfyaml dedup=on", "libfyaml dedup=off"]
y-axis "MB" 0 --> 175
bar [164, 123, 28, 25]

| Configuration | Median | Min | RSS delta |
|---|---|---|---|
| PyYAML safe_load (pure Python) | 7155 ms | 7033 ms | +164 MB |
| PyYAML CLoader (libyaml) | 1228 ms | 1172 ms | +123 MB |
| libfyaml dedup=True (default) | 115 ms | 114 ms | +28 MB |
| libfyaml dedup=False | 102 ms | 101 ms | +25 MB |

Prometheus Operator CRD bundle (source): structured Kubernetes schemas with
heavy key repetition (name, type, description, properties, spec recurring
throughout).
xychart-beta horizontal
title "Parse time — bundle.yaml 4.3 MB (ms, lower is better)"
x-axis ["PyYAML safe_load", "PyYAML CLoader", "libfyaml dedup=on", "libfyaml dedup=off"]
y-axis "ms" 0 --> 3200
bar [2964, 274, 53, 48]
xychart-beta horizontal
title "RSS delta — bundle.yaml 4.3 MB (MB, lower is better)"
x-axis ["PyYAML safe_load", "PyYAML CLoader", "libfyaml dedup=on", "libfyaml dedup=off"]
y-axis "MB" 0 --> 20
bar [16, 14, 3, 10]

| Configuration | Median | Min | RSS delta |
|---|---|---|---|
| PyYAML safe_load (pure Python) | 2964 ms | 2919 ms | +16 MB |
| PyYAML CLoader (libyaml) | 274 ms | 267 ms | +14 MB |
| libfyaml dedup=True (default) | 53 ms | 52 ms | +3 MB |
| libfyaml dedup=False | 48 ms | 48 ms | +10 MB |

Speed. Across both files, libfyaml in its default configuration is roughly 5–11× faster than CLoader and 56–62× faster than pure-Python PyYAML. The gap against the pure-Python loader is expected: PyYAML constructs every node as a heap-allocated Python object while iterating the event stream in interpreted bytecode. The gap against CLoader is more meaningful: both parsers are written in C, but libfyaml uses mmap for file I/O, a purpose-built allocator, and avoids the two-phase parse/construct split that libyaml's event model requires.
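The speedup ratios can be recomputed directly from the median parse times in the two result tables. The snippet below does only that arithmetic, using the default dedup=True configuration as the libfyaml figure.

```python
# Median parse times in ms, copied from the two result tables.
median_ms = {
    "AtomicCards": {"safe_load": 7155, "CLoader": 1228, "dedup_on": 115},
    "bundle.yaml": {"safe_load": 2964, "CLoader": 274, "dedup_on": 53},
}

speedups = {}
for name, row in median_ms.items():
    vs_cloader = round(row["CLoader"] / row["dedup_on"], 1)
    vs_pure = round(row["safe_load"] / row["dedup_on"], 1)
    speedups[name] = (vs_cloader, vs_pure)
    print(name, vs_cloader, vs_pure)
# AtomicCards 10.7 62.2
# bundle.yaml 5.2 55.9
```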
Memory. libfyaml uses less RSS than PyYAML for the parsed data structure in
every configuration measured. PyYAML allocates a heap object (dict, list, str,
int, …) for every node in the document; libfyaml stores values in its arena
allocator with FyGeneric wrappers created lazily on access. On the card
database, libfyaml uses ~77–80% less RSS than CLoader (+25–28 MB vs +123 MB);
on the CRD bundle it uses ~29–79% less (+3–10 MB vs +14 MB).
Note that libfyaml's .so file itself has a significant up-front import cost
(~50 MB RSS), which is a fixed one-time overhead amortised across all subsequent
load() calls and not included in the delta figures above.
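The relative RSS savings against CLoader can be derived from the RSS delta column of the result tables. The snippet below performs that calculation for both libfyaml configurations; it uses only the figures reported above.

```python
# RSS deltas in MB, copied from the two result tables.
rss_mb = {
    "AtomicCards": {"CLoader": 123, "dedup_on": 28, "dedup_off": 25},
    "bundle.yaml": {"CLoader": 14, "dedup_on": 3, "dedup_off": 10},
}


def percent_less(baseline, value):
    """Percentage by which `value` is smaller than `baseline`."""
    return round(100 * (1 - value / baseline), 1)


savings = {}
for name, row in rss_mb.items():
    savings[name] = (percent_less(row["CLoader"], row["dedup_on"]),
                     percent_less(row["CLoader"], row["dedup_off"]))
    print(name, *savings[name])
# AtomicCards 77.2 79.7
# bundle.yaml 78.6 28.6
```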
dedup vs no-dedup. On the card database, dedup=True adds ~13 ms and ~3 MB of
RSS: the text content is highly varied, so the dedup allocator finds little to
share. On the CRD bundle, dedup=True saves 7 MB compared to dedup=False because
Kubernetes schemas repeat the same field names (name, type, description,
properties, …) thousands of times across 24 documents.
The deduplication allocator is the right default for structured configuration
and API-schema YAML; for documents with unique free-form text, dedup=False is
marginally faster.