
Token-Efficient Tool Payloads & Prompt Compression in PromptWare OS #15

@huan

A living design note for humans and future AI agents.

0. Why this file exists

This document records the full path from an initial tokenization question (“what does a newline cost?”) to an architectural direction for PromptWare OS:

  • Tokens are a scarce kernel resource.
  • JSON is often token-expensive for tool calls and context packing.
  • Long prompt/reference corpora can be compressed safely with research-backed methods (LLMLingua family).
  • The syscall ingest should evolve into a Context Memory Manager (budgeting, compression, placement, audit).

The goal: design PromptWareOS prompt loading and tool-call payload formats that are context-window friendly, robust, and future-proof.


1. Origin: “How many tokens is a newline?”

1.1 Newline (actual line break) vs literal \n

A key realization:

  • A real newline in plain text / markdown is often tokenized cheaply (commonly ~1 token for \n, and sometimes \n\n may be 1 token too, depending on tokenizer and surrounding text).
  • The literal two-character sequence \n inside a JSON string is different: it requires a backslash + letter n and often costs more tokens.

1.2 Why JSON gets expensive

JSON is token-expensive for several reasons:

  • Escaping: each newline becomes the two-character sequence \n, each quote becomes \", and each backslash becomes \\.
  • Syntax overhead: braces, quotes, commas, colons, repeated key names.
  • Tokenizer friendliness: natural whitespace patterns merge well; escape sequences often don’t.

This leads to the principle:

Do not escape prose unless you must.
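
The inflation is easy to see with Python's stdlib json module, before any tokenizer is even involved: every real newline becomes two characters, every quote becomes two characters, and the value gains surrounding quotes.

```python
import json

text = 'Line1\nLine2\nShe said "hi"\n'
encoded = json.dumps(text)  # the string as it would appear inside a JSON payload

print(len(text), len(encoded))
# Each real newline became the two characters \n and each quote became \",
# plus the outer quotes -- and on top of the character inflation, escape
# sequences also tend to tokenize worse than natural whitespace does.
```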


2. First architectural inference: escaped JSON costs more than plain text

2.1 The practical consequence

If PromptWareOS passes large reference content as JSON strings, the token cost can balloon.

That suggests a design goal:

Build a token-optimized data object format to pass from Prompt Kernel → Software Kernel (LLM → tool call), replacing JSON when possible.


3. Research direction: minimizing tokens for structured outputs

The AI research + engineering ecosystem is converging on a few adjacent threads:

3.1 Format choice matters (JSON vs alternatives)

Empirical comparisons (papers and serious practitioner benchmarks) show format choice affects:

  • token usage
  • parse reliability
  • accuracy

3.2 Constrained decoding / structured outputs

Instead of “freehand JSON”, many systems use:

  • JSON schema or grammar guided decoding
  • constrained generation to ensure validity

This changes the game:

  • you can use a more compact surface syntax
  • correctness comes from the decoder/grammar, not from verbose self-describing text

3.3 Prompt compression research

A separate (but crucial) research thread: compressing long prompts / references while retaining downstream performance.

This leads directly to LLMLingua.


4. Token-optimized payload formats (JSON replacement ideas)

4.1 The core problem

For tool calls / syscalls, we want:

  • minimal tokens
  • easy parsing
  • stable schemas
  • no fragile escaping

4.2 Three principles

Principle A — Schema-first, not self-describing

  • The Software Kernel already knows schemas; payloads don’t need repeated key strings.

Principle B — Avoid escaping with length-prefixed strings

  • Escapes are the silent token killer.
  • Length-prefix allows raw text (including newlines and quotes) without backslashes.

Principle C — Flat, positional, typed

  • Many tool calls are small arg lists.
  • Prefer typed atoms and positional arguments over nested structures when possible.

4.3 Candidate compact format: LTON (Length-Typed Object Notation)

A “prompt-friendly protobuf” textual encoding.

Example (keyed):

!call weather.v1
(city s11:San Leandro)
(days u1:7)
(unit e1:C)
(note s18:Line1
Line2

Line4)

Example (positional):

!call weather.v1
(s11:San Leandro u1:7 e1:C s18:Line1
Line2

Line4)

Where:

  • sN: = string of exactly N characters, read raw (no escaping, so newlines and quotes pass through)
  • uN: = unsigned integer written in N characters
  • eN: = enum value written in N characters

Advantages:

  • no quotes
  • no backslashes
  • multi-line strings stay cheap (literal newlines)
  • schema can be enforced in the Software Kernel
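
Length-prefixing is what makes this parseable without escapes: after sN: the reader consumes exactly N characters, so string values may contain newlines, quotes, and parentheses. A minimal Python sketch of a keyed-LTON reader (LTON is a hypothetical format; this is not a spec, just an existence proof of cheap parsing):

```python
def parse_lton(payload: str):
    """Parse the keyed LTON form sketched above (hypothetical format)."""
    header, _, body = payload.partition("\n")
    if not header.startswith("!call "):
        raise ValueError("expected '!call <target>' header")
    target = header[len("!call "):]
    args, pos = {}, 0
    while pos < len(body):
        if body[pos] in " \n":            # whitespace between tuples
            pos += 1
            continue
        if body[pos] != "(":
            raise ValueError(f"expected '(' at {pos}")
        pos += 1
        sep = body.index(" ", pos)        # key ends at the first space
        key, pos = body[pos:sep], sep + 1
        kind = body[pos]                  # 's', 'u', or 'e'
        colon = body.index(":", pos)
        n = int(body[pos + 1:colon])      # the length prefix
        pos = colon + 1
        raw = body[pos:pos + n]           # consume exactly n chars, raw
        pos += n
        args[key] = int(raw) if kind == "u" else raw
        if body[pos] != ")":
            raise ValueError(f"expected ')' at {pos}")
        pos += 1
    return target, args

payload = (
    "!call weather.v1\n"
    "(city s11:San Leandro)\n"
    "(days u1:7)\n"
    "(unit e1:C)\n"
    "(note s18:Line1\nLine2\n\nLine4)"
)
print(parse_lton(payload))
```

Note that the multi-line note value round-trips with literal newlines and zero backslashes; the Software Kernel can validate the result against the weather.v1 schema afterward.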

4.4 Other candidates worth exploring

  • A TOON-like “token-oriented object notation” approach (community proposals)
  • TSV / line-oriented K=V formats for shallow payloads
  • Binary formats (protobuf/cbor) with base64 or sidechannel transport (token tradeoffs)
  • “Hybrid”: compact arguments + separate raw text blocks referenced by IDs

5. Prompt compression: LLMLingua as a first-class kernel subsystem

5.1 What LLMLingua is (the mental model)

LLMLingua is a family of extractive prompt compression methods: given a long prompt, it deletes (drops) tokens that are predicted to be less important, producing a shorter prompt that aims to preserve downstream task quality.

Think of it as a token garbage collector for prompts:

  • Input: long prompt (ICL examples, CoT traces, docs, transcripts)
  • Output: shorter prompt that keeps the “information-bearing” parts
  • Goal: reduce cost and latency while maintaining accuracy

This is especially relevant when prompts reach 10k–100k tokens, where inference cost and long-context failure modes become dominant.

5.2 LLMLingua (v1): coarse-to-fine compression with a budget controller

LLMLingua v1 is built around three big ideas:

  1. Budget controller
     • You set a target token budget (or a compression ratio).
     • The compressor tries to preserve semantic integrity while hitting that budget.
  2. Token-level iterative compression
     • Compression is not “one shot.”
     • It iteratively prunes tokens while considering dependencies between remaining parts.
  3. Distribution alignment between compressor and target LLM
     • The compressor is typically a smaller model.
     • LLMLingua uses instruction tuning to reduce mismatch so the compressed prompt still “works” for the target LLM.

Tiny example (conceptual):

Before:

You are an agent. Follow rules A…Z.

Reference:
- Long doc paragraph 1...
- Long doc paragraph 2...
- Long doc paragraph 3...

Question: What are the key constraints?

After (same structure, less redundancy):

You are an agent. Follow rules A…Z.

Reference (compressed):
- Keep only constraints, definitions, edge cases.

Question: What are the key constraints?
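
A toy illustration of the budget-controller idea (this is not the actual LLMLingua algorithm, which prunes at token granularity using a small LM's perplexity scores; the segment scores here are assumed to come from somewhere upstream):

```python
def budget_compress(segments, scores, budget):
    """Toy budget controller: keep the highest-scoring segments, preserved
    in their original order, until the token budget is spent. Real LLMLingua
    works token-by-token and iterates so that dependencies between the
    surviving parts are respected."""
    ranked = sorted(range(len(segments)), key=lambda i: scores[i], reverse=True)
    kept, used = set(), 0
    for i in ranked:
        cost = len(segments[i].split())  # crude proxy for a token count
        if used + cost <= budget:
            kept.add(i)
            used += cost
    return " ".join(segments[i] for i in sorted(kept))

print(budget_compress(
    ["keep this rule", "boring history stuff here", "key constraint"],
    [0.9, 0.1, 0.8],
    budget=6,
))  # -> keep this rule key constraint
```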

5.3 LLMLingua-2: fast, faithful, task-agnostic compression

LLMLingua-2 reframes compression as token classification:

  • A small bidirectional encoder reads the full text.
  • It labels which tokens should be kept vs dropped.
  • Supervision is obtained via data distillation from a powerful LLM (e.g., GPT-4-generated compressed targets).

Why it matters operationally:

  • Fast (often multiple times faster than v1)
  • Task-agnostic (better out-of-domain behavior)
  • Faithful (extractive, no paraphrasing / reduced hallucination risk)

Tiny example (conceptual):

Input doc:

Install steps:
1. Download package.
2. Run installer.
3. Reboot.
Troubleshooting:
- If error X, do Y.
- If error Z, do W.
History:
This tool began in 2014...
(10 paragraphs of history)

Compressed output (keeps steps + troubleshooting; drops history):

Install steps: 1) Download package 2) Run installer 3) Reboot
Troubleshooting: error X→Y; error Z→W
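
The token-classification framing fits in a few lines; in the real system the keep probabilities come from a fine-tuned bidirectional encoder, whereas the scores below are made up for illustration:

```python
def extract_kept(tokens, keep_probs, threshold=0.5):
    """Token classification as extraction: tokens whose keep probability
    clears the threshold survive, in original order, so the output is a
    strict subsequence of the input (no paraphrasing, no new tokens)."""
    return [t for t, p in zip(tokens, keep_probs) if p >= threshold]

tokens = "This tool began in 2014 ; if error X , do Y".split()
probs = [0.2, 0.3, 0.1, 0.1, 0.2, 0.4, 0.9, 0.9, 0.9, 0.6, 0.9, 0.9]
print(" ".join(extract_kept(tokens, probs)))  # -> if error X , do Y
```

The extractive guarantee is what makes this mode "faithful": a dropped token loses information, but a kept token is never rewritten.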

5.4 LongLLMLingua: long-context + RAG-oriented compression

LongLLMLingua targets long-context issues:

  • High cost (tokens)
  • High latency
  • Performance drops in long contexts (including position bias / “lost in the middle” effects)

It focuses on improving the density and placement of key information:

  • Question-aware coarse-grained compression: compress documents relative to the query.
  • Reordering: move the most relevant information earlier (and sometimes strategically later).
  • Subsequence recovery: preserve or recover critical spans to improve faithfulness.

Tiny example (query-aware):

Query: “Which flags are required to enable non-daemon mode?”

  • Keep: paragraphs mentioning daemon/client mode, flags, examples.
  • Drop: unrelated architecture background, changelog, long introductions.
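
A crude stand-in for question-aware coarse-grained scoring plus reordering (LongLLMLingua actually scores documents by the conditional perplexity of the question given each document; word overlap here is only for illustration):

```python
def rank_docs(query: str, docs: list[str]) -> list[str]:
    """Order documents by (toy) relevance to the query so the most relevant
    material is placed earliest in the packed context."""
    q = set(query.lower().split())
    return sorted(docs, key=lambda d: len(q & set(d.lower().split())), reverse=True)

docs = [
    "changelog for the 2021 release",
    "the --no-daemon and --client flags control daemon mode",
    "ten paragraphs of project history",
]
print(rank_docs("which flags enable non-daemon mode", docs)[0])
```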

5.5 What LLMLingua is good at (and what it is not)

Great at:

  • Shrinking long, redundant references
  • Keeping dense, task-relevant snippets
  • Lowering token/cost/latency for long-context pipelines
  • Helping the model “see” key info by removing noise

Not great / caution areas:

  • Compressing high-stakes invariants (kernel laws, safety constraints)
  • Compressing tool schemas or strict formatting examples
  • Anything where a single missing negation changes meaning

Rule of thumb:

Compress the world. Don’t compress the kernel.

5.6 How to think about LLMLingua inside PromptWareOS

PromptWareOS has a natural boundary where LLMLingua belongs:

  • Prompt Kernel: declares intent + rules + ABI
  • Software Kernel: loads resources and manages execution

So LLMLingua should live near ingest as a Context Memory Manager component:

  • classify content
  • allocate budgets
  • compress references/skills (selectively)
  • pack into context
  • audit invariants

6. Reframing ingest: from “load files” to “Context Memory Manager”

Instead of merely concatenating prompt files, ingest should:

  1. Classify content (core vs optional)
  2. Allocate budget per class and per doc
  3. Compress where appropriate (LLMLingua family)
  4. Place the results into context (ordering/packing)
  5. Audit invariants and parse reliability

This yields a “context image” concept:

  • context.core (raw, pinned)
  • context.skills (lightly compressed)
  • context.refs (heavily compressed)
  • context.sidecar (raw store addressable via syscalls)
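
The context image can be sketched as a plain data structure, with the class names taken directly from the list above (the field types and the pack() ordering are assumptions, not a spec):

```python
from dataclasses import dataclass, field

@dataclass
class ContextImage:
    core: list[str] = field(default_factory=list)     # raw, pinned, never compressed
    skills: list[str] = field(default_factory=list)   # lightly compressed
    refs: list[str] = field(default_factory=list)     # heavily compressed
    sidecar: dict[str, str] = field(default_factory=dict)  # raw store, syscall-addressable

    def pack(self) -> str:
        """Placement step: core first, then skills, then refs; sidecar
        content never enters the context window."""
        return "\n\n".join([*self.core, *self.skills, *self.refs])
```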

7. Integration designs for LLMLingua in PromptWareOS

Design A — Deterministic build artifact (offline compression)

Idea: compress at build/CI time.

  • Generate .pwc.md artifacts (PromptWare Compressed)
  • Store manifest: original hash → compressed hash → ratio
  • Runtime uses compressed by default; --raw for debugging

Why it fits:

  • deterministic + reviewable diffs
  • fast runtime
  • works like shipping binaries

Design B — Adaptive memory paging (runtime compression)

Idea: ingest --budget N with paging.

  • Page 0 pinned core
  • Other pages compressible
  • When new info arrives, re-compress low-priority pages more aggressively

Why it fits:

  • handles agent life: late evidence
  • stable core + adaptive periphery
  • most “OS-like”

Design C — Question-aware ingest (LongLLMLingua / RAG-optimized)

Idea: compress with awareness of the current task/query.

  • retrieve candidates
  • compress them conditioned on the question
  • preserve key spans for grounding

Why it fits:

  • avoids compressed irrelevance
  • reduces long-context failure patterns

Design D — Two-track ingest (compressed context + raw sidecar)

Idea: keep raw docs outside context, page-in slices on demand.

  • Track 1: compressed goes into context
  • Track 2: raw stored, addressable via sys_read(ref_id, span)

Why it fits:

  • token savings + high-fidelity evidence
  • prevents “compression drift” → “factual drift”
  • like mmap() for documents
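
A minimal sketch of the two-track sidecar, assuming a (ref_id, span) addressing scheme as named above; the sys_read signature is the document's own hypothetical syscall, implemented here as plain slicing:

```python
class Sidecar:
    """Track 2 of the two-track design: raw documents live outside the
    context window and are paged in by span on demand, like mmap() for
    documents."""

    def __init__(self):
        self._store: dict[str, str] = {}

    def register(self, ref_id: str, text: str) -> None:
        self._store[ref_id] = text  # full-fidelity original, never compressed

    def sys_read(self, ref_id: str, span: tuple[int, int]) -> str:
        start, end = span
        return self._store[ref_id][start:end]  # exact raw evidence slice
```

Because the sidecar keeps byte-exact originals, any "compression drift" in Track 1 can be checked against ground truth before it becomes factual drift.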

Design E — Policy-tagged markdown (author-controlled compression)

Idea: authors annotate blocks:

:::pwo:pin
MUST keep
:::

:::pwo:compress ratio=0.25
Large background
:::

ingest honors tags.

Why it fits:

  • avoids compressing guardrails
  • readable and maintainable
  • works naturally with skills + references
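
The :::pwo: directive blocks shown above are simple enough to parse with a regex; a sketch (the directive grammar is only what the two examples imply, and nested or ::: -containing bodies are out of scope):

```python
import re

# Matches a block of the form ":::pwo:TAG [params]\nBODY\n:::"
_BLOCK = re.compile(r"^:::pwo:(\w+)([^\n]*)\n(.*?)\n:::$", re.M | re.S)

def parse_policy_blocks(doc: str) -> list[tuple[str, str, str]]:
    """Return (tag, params, body) triples for every policy-tagged block,
    e.g. ('pin', '', 'MUST keep') or ('compress', 'ratio=0.25', ...)."""
    return [(m.group(1), m.group(2).strip(), m.group(3))
            for m in _BLOCK.finditer(doc)]
```

ingest would then route pin blocks straight into context.core and hand compress blocks to the compressor with the requested ratio.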

8. Recommended operational policies (v1)

8.1 Compression policy

  • Never compress kernel invariants, tool ABI, or critical formatting examples.
  • Compress references heavily; optionally keep a raw sidecar.

8.2 Budgeting

Allocate token budgets by class:

  • Core: fixed, pinned
  • Skills: medium compression
  • References: aggressive compression

8.3 Auditing

Log for every ingest:

  • compression ratio per document
  • hashes for raw and compressed
  • budgets used
  • parsing success/failure
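
One audit entry per document could look like the following sketch, with the fields taken from the checklist above (the record shape and the character-based ratio are assumptions; a token-based ratio would be the real metric):

```python
import hashlib

def audit_record(doc_id: str, raw: str, compressed: str, budget: int) -> dict:
    """Build a per-document ingest audit entry: compression ratio plus
    content hashes for both the raw and compressed forms."""
    return {
        "doc": doc_id,
        "ratio": round(len(compressed) / max(len(raw), 1), 3),
        "raw_sha256": hashlib.sha256(raw.encode()).hexdigest(),
        "compressed_sha256": hashlib.sha256(compressed.encode()).hexdigest(),
        "budget": budget,
    }
```

Hashing both sides makes ingest reproducible and reviewable: the same inputs must yield the same compressed artifacts, or the audit log flags the drift.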

8.4 Prefer LLMLingua-2 as default engine

Use LLMLingua-2 for throughput; use LongLLMLingua where long-context/RAG behavior dominates.


9. Open questions / next steps

  1. Define a PromptWareOS-native payload spec:

    • LTON v0.1 (typed atoms, length-prefixed strings)
    • grammar + parser in Software Kernel
  2. Define ingest API:

    • inputs: file paths, URLs, tags, budgets
    • outputs: context image, sidecar IDs, audit logs
  3. Benchmarks:

    • compare JSON vs LTON token counts for typical tool calls
    • compare LLMLingua vs LLMLingua-2 for reference corpora
    • measure downstream task accuracy vs compression ratio
  4. Safety envelope:

    • invariants checker
    • schema validation
    • fallbacks (--raw, “re-ingest with lower compression”)

10. Summary

We started from a seemingly small question (“newline token cost”) and reached an OS-level conclusion:

  • Treat the context window as managed memory.
  • JSON is often an inefficient wire format for tool arguments and large text.
  • Build a compact, schema-first payload format (e.g., LTON).
  • Turn ingest into a Context Memory Manager with LLMLingua-powered compression.
  • Keep the kernel’s invariants uncompressed; compress the world.

This document is the seed for turning PromptWareOS into a system that is not just prompt-engineered, but prompt-operating-system engineered.
