
Token-Efficient Tool Payloads & Prompt Compression in PromptWare OS #15

@huan

A living design note for humans and future AI agents.

0. Why this file exists

This document records the full path from an initial tokenization question (“what does a newline cost?”) to an architectural direction for PromptWare OS:

  • Tokens are a scarce kernel resource.
  • JSON is often token-expensive for tool calls and context packing.
  • Long prompt/reference corpora can be compressed safely with research-backed methods (LLMLingua family).
  • The syscall ingest should evolve into a Context Memory Manager (budgeting, compression, placement, audit).

The goal: design PromptWareOS prompt loading and tool-call payload formats that are context-window friendly, robust, and future-proof.


1. Origin: “How many tokens is a newline?”

1.1 Newline (actual line break) vs literal \n

A key realization:

  • A real newline in plain text / markdown is often tokenized cheaply (commonly ~1 token for \n, and sometimes \n\n may be 1 token too, depending on tokenizer and surrounding text).
  • The literal two-character sequence \n inside a JSON string is different: it requires a backslash + letter n and often costs more tokens.

1.2 Why JSON gets expensive

JSON is token-expensive for several reasons:

  • Escaping: each newline becomes the two-character sequence \n, each quote becomes \", and each backslash becomes \\.
  • Syntax overhead: braces, quotes, commas, colons, repeated key names.
  • Tokenizer friendliness: natural whitespace patterns merge well; escape sequences often don’t.

This leads to the principle:

Do not escape prose unless you must.
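
The inflation is easy to see with Python's stdlib json module, before any tokenizer is even involved: every real newline becomes two characters, every quote becomes two characters, and the value gains surrounding quotes.

```python
import json

text = 'Line1\nLine2\nShe said "hi"\n'
encoded = json.dumps(text)  # the string as it would appear inside a JSON payload

print(len(text), len(encoded))
# Each real newline became the two characters \n and each quote became \",
# plus the outer quotes -- and on top of the character inflation, escape
# sequences also tend to tokenize worse than natural whitespace does.
```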


2. First architectural inference: escaped JSON costs more than plain text

2.1 The practical consequence

If PromptWareOS passes large reference content as JSON strings, the token cost can balloon.

That suggests a design goal:

Build a token-optimized data object format to pass from Prompt Kernel → Software Kernel (LLM → tool call), replacing JSON when possible.


3. Research direction: minimizing tokens for structured outputs

The AI research + engineering ecosystem is converging on a few adjacent threads:

3.1 Format choice matters (JSON vs alternatives)

Empirical comparisons (papers and serious practitioner benchmarks) show format choice affects:

  • token usage
  • parse reliability
  • accuracy

3.2 Constrained decoding / structured outputs

Instead of “freehand JSON”, many systems use:

  • JSON schema or grammar guided decoding
  • constrained generation to ensure validity

This changes the game:

  • you can use a more compact surface syntax
  • correctness comes from the decoder/grammar, not from verbose self-describing text

3.3 Prompt compression research

A separate (but crucial) research thread: compressing long prompts / references while retaining downstream performance.

This leads directly to LLMLingua.


4. Token-optimized payload formats (JSON replacement ideas)

4.1 The core problem

For tool calls / syscalls, we want:

  • minimal tokens
  • easy parsing
  • stable schemas
  • no fragile escaping

4.2 Three principles

Principle A — Schema-first, not self-describing

  • The Software Kernel already knows schemas; payloads don’t need repeated key strings.

Principle B — Avoid escaping with length-prefixed strings

  • Escapes are the silent token killer.
  • Length-prefix allows raw text (including newlines and quotes) without backslashes.

Principle C — Flat, positional, typed

  • Many tool calls are small arg lists.
  • Prefer typed atoms and positional arguments over nested structures when possible.

4.3 Candidate compact format: LTON (Length-Typed Object Notation)

A “prompt-friendly protobuf” textual encoding.

Example (keyed):

!call weather.v1
(city s11:San Leandro)
(days u1:7)
(unit e1:C)
(note s18:Line1
Line2

Line4)

Example (positional):

!call weather.v1
(s11:San Leandro u1:7 e1:C s18:Line1
Line2

Line4)

Where:

  • sN: = string of exactly N characters, read raw (no escaping, so newlines and quotes pass through)
  • uN: = unsigned integer written in N characters
  • eN: = enum value written in N characters

Advantages:

  • no quotes
  • no backslashes
  • multi-line strings stay cheap (literal newlines)
  • schema can be enforced in the Software Kernel
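
Length-prefixing is what makes this parseable without escapes: after sN: the reader consumes exactly N characters, so string values may contain newlines, quotes, and parentheses. A minimal Python sketch of a keyed-LTON reader (LTON is a hypothetical format; this is not a spec, just an existence proof of cheap parsing):

```python
def parse_lton(payload: str):
    """Parse the keyed LTON form sketched above (hypothetical format)."""
    header, _, body = payload.partition("\n")
    if not header.startswith("!call "):
        raise ValueError("expected '!call <target>' header")
    target = header[len("!call "):]
    args, pos = {}, 0
    while pos < len(body):
        if body[pos] in " \n":            # whitespace between tuples
            pos += 1
            continue
        if body[pos] != "(":
            raise ValueError(f"expected '(' at {pos}")
        pos += 1
        sep = body.index(" ", pos)        # key ends at the first space
        key, pos = body[pos:sep], sep + 1
        kind = body[pos]                  # 's', 'u', or 'e'
        colon = body.index(":", pos)
        n = int(body[pos + 1:colon])      # the length prefix
        pos = colon + 1
        raw = body[pos:pos + n]           # consume exactly n chars, raw
        pos += n
        args[key] = int(raw) if kind == "u" else raw
        if body[pos] != ")":
            raise ValueError(f"expected ')' at {pos}")
        pos += 1
    return target, args

payload = (
    "!call weather.v1\n"
    "(city s11:San Leandro)\n"
    "(days u1:7)\n"
    "(unit e1:C)\n"
    "(note s18:Line1\nLine2\n\nLine4)"
)
print(parse_lton(payload))
```

Note that the multi-line note value round-trips with literal newlines and zero backslashes; the Software Kernel can validate the result against the weather.v1 schema afterward.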

4.4 Other candidates worth exploring

  • A TOON-like “token-oriented object notation” approach (community proposals)
  • TSV / line-oriented K=V formats for shallow payloads
  • Binary formats (protobuf/cbor) with base64 or sidechannel transport (token tradeoffs)
  • “Hybrid”: compact arguments + separate raw text blocks referenced by IDs

5. Prompt compression: LLMLingua as a first-class kernel subsystem

5.1 What LLMLingua is (the mental model)

LLMLingua is a family of extractive prompt compression methods: given a long prompt, it deletes (drops) tokens that are predicted to be less important, producing a shorter prompt that aims to preserve downstream task quality.

Think of it as a token garbage collector for prompts:

  • Input: long prompt (ICL examples, CoT traces, docs, transcripts)
  • Output: shorter prompt that keeps the “information-bearing” parts
  • Goal: reduce cost and latency while maintaining accuracy

This is especially relevant when prompts reach 10k–100k tokens, where inference cost and long-context failure modes become dominant.

5.2 LLMLingua (v1): coarse-to-fine compression with a budget controller

LLMLingua v1 is built around three big ideas:

  1. Budget controller
     • You set a target token budget (or a compression ratio).
     • The compressor tries to preserve semantic integrity while hitting that budget.
  2. Token-level iterative compression
     • Compression is not “one shot.”
     • It iteratively prunes tokens while considering dependencies between remaining parts.
  3. Distribution alignment between compressor and target LLM
     • The compressor is typically a smaller model.
     • LLMLingua uses instruction tuning to reduce mismatch so the compressed prompt still “works” for the target LLM.

Tiny example (conceptual):

Before:

You are an agent. Follow rules A…Z.

Reference:
- Long doc paragraph 1...
- Long doc paragraph 2...
- Long doc paragraph 3...

Question: What are the key constraints?

After (same structure, less redundancy):

You are an agent. Follow rules A…Z.

Reference (compressed):
- Keep only constraints, definitions, edge cases.

Question: What are the key constraints?
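
A toy illustration of the budget-controller idea (this is not the actual LLMLingua algorithm, which prunes at token granularity using a small LM's perplexity scores; the segment scores here are assumed to come from somewhere upstream):

```python
def budget_compress(segments, scores, budget):
    """Toy budget controller: keep the highest-scoring segments, preserved
    in their original order, until the token budget is spent. Real LLMLingua
    works token-by-token and iterates so that dependencies between the
    surviving parts are respected."""
    ranked = sorted(range(len(segments)), key=lambda i: scores[i], reverse=True)
    kept, used = set(), 0
    for i in ranked:
        cost = len(segments[i].split())  # crude proxy for a token count
        if used + cost <= budget:
            kept.add(i)
            used += cost
    return " ".join(segments[i] for i in sorted(kept))

print(budget_compress(
    ["keep this rule", "boring history stuff here", "key constraint"],
    [0.9, 0.1, 0.8],
    budget=6,
))  # -> keep this rule key constraint
```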

5.3 LLMLingua-2: fast, faithful, task-agnostic compression

LLMLingua-2 reframes compression as token classification:

  • A small bidirectional encoder reads the full text.
  • It labels which tokens should be kept vs dropped.
  • Supervision is obtained via data distillation from a powerful LLM (e.g., GPT-4-generated compressed targets).

Why it matters operationally:

  • Fast (often multiple times faster than v1)
  • Task-agnostic (better out-of-domain behavior)
  • Faithful (extractive, no paraphrasing / reduced hallucination risk)

Tiny example (conceptual):

Input doc:

Install steps:
1. Download package.
2. Run installer.
3. Reboot.
Troubleshooting:
- If error X, do Y.
- If error Z, do W.
History:
This tool began in 2014...
(10 paragraphs of history)

Compressed output (keeps steps + troubleshooting; drops history):

Install steps: 1) Download package 2) Run installer 3) Reboot
Troubleshooting: error X→Y; error Z→W
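
The token-classification framing fits in a few lines; in the real system the keep probabilities come from a fine-tuned bidirectional encoder, whereas the scores below are made up for illustration:

```python
def extract_kept(tokens, keep_probs, threshold=0.5):
    """Token classification as extraction: tokens whose keep probability
    clears the threshold survive, in original order, so the output is a
    strict subsequence of the input (no paraphrasing, no new tokens)."""
    return [t for t, p in zip(tokens, keep_probs) if p >= threshold]

tokens = "This tool began in 2014 ; if error X , do Y".split()
probs = [0.2, 0.3, 0.1, 0.1, 0.2, 0.4, 0.9, 0.9, 0.9, 0.6, 0.9, 0.9]
print(" ".join(extract_kept(tokens, probs)))  # -> if error X , do Y
```

The extractive guarantee is what makes this mode "faithful": a dropped token loses information, but a kept token is never rewritten.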

5.4 LongLLMLingua: long-context + RAG-oriented compression

LongLLMLingua targets long-context issues:

  • High cost (tokens)
  • High latency
  • Performance drops in long contexts (including position bias / “lost in the middle” effects)

It focuses on improving the density and placement of key information:

  • Question-aware coarse-grained compression: compress documents relative to the query.
  • Reordering: move the most relevant information earlier (and sometimes strategically later).
  • Subsequence recovery: preserve or recover critical spans to improve faithfulness.

Tiny example (query-aware):

Query: “Which flags are required to enable non-daemon mode?”

  • Keep: paragraphs mentioning daemon/client mode, flags, examples.
  • Drop: unrelated architecture background, changelog, long introductions.
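
A crude stand-in for question-aware coarse-grained scoring plus reordering (LongLLMLingua actually scores documents by the conditional perplexity of the question given each document; word overlap here is only for illustration):

```python
def rank_docs(query: str, docs: list[str]) -> list[str]:
    """Order documents by (toy) relevance to the query so the most relevant
    material is placed earliest in the packed context."""
    q = set(query.lower().split())
    return sorted(docs, key=lambda d: len(q & set(d.lower().split())), reverse=True)

docs = [
    "changelog for the 2021 release",
    "the --no-daemon and --client flags control daemon mode",
    "ten paragraphs of project history",
]
print(rank_docs("which flags enable non-daemon mode", docs)[0])
```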

5.5 What LLMLingua is good at (and what it is not)

Great at:

  • Shrinking long, redundant references
  • Keeping dense, task-relevant snippets
  • Lowering token/cost/latency for long-context pipelines
  • Helping the model “see” key info by removing noise

Not great / caution areas:

  • Compressing high-stakes invariants (kernel laws, safety constraints)
  • Compressing tool schemas or strict formatting examples
  • Anything where a single missing negation changes meaning

Rule of thumb:

Compress the world. Don’t compress the kernel.

5.6 How to think about LLMLingua inside PromptWareOS

PromptWareOS has a natural boundary where LLMLingua belongs:

  • Prompt Kernel: declares intent + rules + ABI
  • Software Kernel: loads resources and manages execution

So LLMLingua should live near ingest as a Context Memory Manager component:

  • classify content
  • allocate budgets
  • compress references/skills (selectively)
  • pack into context
  • audit invariants

6. Reframing ingest: from “load files” to “Context Memory Manager”

Instead of merely concatenating prompt files, ingest should:

  1. Classify content (core vs optional)
  2. Allocate budget per class and per doc
  3. Compress where appropriate (LLMLingua family)
  4. Place the results into context (ordering/packing)
  5. Audit invariants and parse reliability

This yields a “context image” concept:

  • context.core (raw, pinned)
  • context.skills (lightly compressed)
  • context.refs (heavily compressed)
  • context.sidecar (raw store addressable via syscalls)
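
The context image can be sketched as a plain data structure, with the class names taken directly from the list above (the field types and the pack() ordering are assumptions, not a spec):

```python
from dataclasses import dataclass, field

@dataclass
class ContextImage:
    core: list[str] = field(default_factory=list)     # raw, pinned, never compressed
    skills: list[str] = field(default_factory=list)   # lightly compressed
    refs: list[str] = field(default_factory=list)     # heavily compressed
    sidecar: dict[str, str] = field(default_factory=dict)  # raw store, syscall-addressable

    def pack(self) -> str:
        """Placement step: core first, then skills, then refs; sidecar
        content never enters the context window."""
        return "\n\n".join([*self.core, *self.skills, *self.refs])
```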

7. Integration designs for LLMLingua in PromptWareOS

Design A — Deterministic build artifact (offline compression)

Idea: compress at build/CI time.

  • Generate .pwc.md artifacts (PromptWare Compressed)
  • Store manifest: original hash → compressed hash → ratio
  • Runtime uses compressed by default; --raw for debugging

Why it fits:

  • deterministic + reviewable diffs
  • fast runtime
  • works like shipping binaries

Design B — Adaptive memory paging (runtime compression)

Idea: ingest --budget N with paging.

  • Page 0 pinned core
  • Other pages compressible
  • When new info arrives, re-compress low-priority pages more aggressively

Why it fits:

  • handles agent life: late evidence
  • stable core + adaptive periphery
  • most “OS-like”

Design C — Question-aware ingest (LongLLMLingua / RAG-optimized)

Idea: compress with awareness of the current task/query.

  • retrieve candidates
  • compress them conditioned on the question
  • preserve key spans for grounding

Why it fits:

  • avoids compressed irrelevance
  • reduces long-context failure patterns

Design D — Two-track ingest (compressed context + raw sidecar)

Idea: keep raw docs outside context, page-in slices on demand.

  • Track 1: compressed goes into context
  • Track 2: raw stored, addressable via sys_read(ref_id, span)

Why it fits:

  • token savings + high-fidelity evidence
  • prevents “compression drift” → “factual drift”
  • like mmap() for documents
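
A minimal sketch of the two-track sidecar, assuming a (ref_id, span) addressing scheme as named above; the sys_read signature is the document's own hypothetical syscall, implemented here as plain slicing:

```python
class Sidecar:
    """Track 2 of the two-track design: raw documents live outside the
    context window and are paged in by span on demand, like mmap() for
    documents."""

    def __init__(self):
        self._store: dict[str, str] = {}

    def register(self, ref_id: str, text: str) -> None:
        self._store[ref_id] = text  # full-fidelity original, never compressed

    def sys_read(self, ref_id: str, span: tuple[int, int]) -> str:
        start, end = span
        return self._store[ref_id][start:end]  # exact raw evidence slice
```

Because the sidecar keeps byte-exact originals, any "compression drift" in Track 1 can be checked against ground truth before it becomes factual drift.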

Design E — Policy-tagged markdown (author-controlled compression)

Idea: authors annotate blocks:

:::pwo:pin
MUST keep
:::

:::pwo:compress ratio=0.25
Large background
:::

ingest honors tags.

Why it fits:

  • avoids compressing guardrails
  • readable and maintainable
  • works naturally with skills + references
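
The :::pwo: directive blocks shown above are simple enough to parse with a regex; a sketch (the directive grammar is only what the two examples imply, and nested or ::: -containing bodies are out of scope):

```python
import re

# Matches a block of the form ":::pwo:TAG [params]\nBODY\n:::"
_BLOCK = re.compile(r"^:::pwo:(\w+)([^\n]*)\n(.*?)\n:::$", re.M | re.S)

def parse_policy_blocks(doc: str) -> list[tuple[str, str, str]]:
    """Return (tag, params, body) triples for every policy-tagged block,
    e.g. ('pin', '', 'MUST keep') or ('compress', 'ratio=0.25', ...)."""
    return [(m.group(1), m.group(2).strip(), m.group(3))
            for m in _BLOCK.finditer(doc)]
```

ingest would then route pin blocks straight into context.core and hand compress blocks to the compressor with the requested ratio.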

8. Recommended operational policies (v1)

8.1 Compression policy

  • Never compress kernel invariants, tool ABI, or critical formatting examples.
  • Compress references heavily; optionally keep a raw sidecar.

8.2 Budgeting

Allocate token budgets by class:

  • Core: fixed, pinned
  • Skills: medium compression
  • References: aggressive compression

8.3 Auditing

Log for every ingest:

  • compression ratio per document
  • hashes for raw and compressed
  • budgets used
  • parsing success/failure
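
One audit entry per document could look like the following sketch, with the fields taken from the checklist above (the record shape and the character-based ratio are assumptions; a token-based ratio would be the real metric):

```python
import hashlib

def audit_record(doc_id: str, raw: str, compressed: str, budget: int) -> dict:
    """Build a per-document ingest audit entry: compression ratio plus
    content hashes for both the raw and compressed forms."""
    return {
        "doc": doc_id,
        "ratio": round(len(compressed) / max(len(raw), 1), 3),
        "raw_sha256": hashlib.sha256(raw.encode()).hexdigest(),
        "compressed_sha256": hashlib.sha256(compressed.encode()).hexdigest(),
        "budget": budget,
    }
```

Hashing both sides makes ingest reproducible and reviewable: the same inputs must yield the same compressed artifacts, or the audit log flags the drift.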

8.4 Prefer LLMLingua-2 as default engine

Use LLMLingua-2 for throughput; use LongLLMLingua where long-context/RAG behavior dominates.


9. Open questions / next steps

  1. Define a PromptWareOS-native payload spec:

    • LTON v0.1 (typed atoms, length-prefixed strings)
    • grammar + parser in Software Kernel
  2. Define ingest API:

    • inputs: file paths, URLs, tags, budgets
    • outputs: context image, sidecar IDs, audit logs
  3. Benchmarks:

    • compare JSON vs LTON token counts for typical tool calls
    • compare LLMLingua vs LLMLingua-2 for reference corpora
    • measure downstream task accuracy vs compression ratio
  4. Safety envelope:

    • invariants checker
    • schema validation
    • fallbacks (--raw, “re-ingest with lower compression”)

10. Summary

We started from a seemingly small question (“newline token cost”) and reached an OS-level conclusion:

  • Treat the context window as managed memory.
  • JSON is often an inefficient wire format for tool arguments and large text.
  • Build a compact, schema-first payload format (e.g., LTON).
  • Turn ingest into a Context Memory Manager with LLMLingua-powered compression.
  • Keep the kernel’s invariants uncompressed; compress the world.

This document is the seed for turning PromptWareOS into a system that is not just prompt-engineered, but prompt-operating-system engineered.
