
fix(spec): add Base45 and Proquint fixtures & update encoding rules#138

Open
rvagg wants to merge 2 commits into master from rvagg/test-vectors

Conversation

@rvagg (Member) commented May 11, 2026

Replaces #125

  • Drops case insensitivity; RFC has a "MUST" on this
  • Adds an opinionated odd-byte Proquint rule; see #125 ("Update test files") for background. I'm picking my version from that unresolved thread, since I think it's the superior choice for us.
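
To illustrate the first bullet, here is a minimal sketch of a case-sensitive Base45 codec. It assumes the RFC in question is RFC 9285, whose alphabet has only digits, uppercase letters, and nine symbols, so a strict decoder rejects lowercase input. The names `b45encode` and `b45decode` are hypothetical, and the code handles only the Base45 payload, not the multibase prefix.

```python
ALPHABET = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ $%*+-./:"  # RFC 9285 alphabet


def b45encode(data: bytes) -> str:
    """Encode bytes as Base45: 2 bytes -> 3 chars, a trailing byte -> 2 chars."""
    out = []
    for i in range(0, len(data) - 1, 2):
        n = (data[i] << 8) | data[i + 1]
        n, c = divmod(n, 45)   # least-significant base-45 digit comes first
        e, d = divmod(n, 45)
        out.append(ALPHABET[c] + ALPHABET[d] + ALPHABET[e])
    if len(data) % 2:
        d, c = divmod(data[-1], 45)
        out.append(ALPHABET[c] + ALPHABET[d])
    return "".join(out)


def b45decode(text: str) -> bytes:
    """Strict decode: any character outside the alphabet (e.g. lowercase) is an error."""
    try:
        digits = [ALPHABET.index(ch) for ch in text]
    except ValueError:
        raise ValueError("character outside the Base45 alphabet") from None
    if len(digits) % 3 == 1:
        raise ValueError("invalid Base45 length")
    out = bytearray()
    for i in range(0, len(digits), 3):
        chunk = digits[i:i + 3]
        if len(chunk) == 3:
            n = chunk[0] + chunk[1] * 45 + chunk[2] * 45 * 45
            if n > 0xFFFF:
                raise ValueError("3-character group exceeds 16 bits")
            out += n.to_bytes(2, "big")
        else:
            n = chunk[0] + chunk[1] * 45
            if n > 0xFF:
                raise ValueError("2-character group exceeds 8 bits")
            out.append(n)
    return bytes(out)


print(b45encode(b"AB"))   # BB8 (example from RFC 9285)
print(b45decode("BB8"))   # b'AB'
```

With this decoder, `b45decode("bb8")` raises `ValueError` rather than being folded to uppercase.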

ben221199 and others added 2 commits May 11, 2026 20:55
Add test vectors for the newly-registered base45 (RFC9285) and proquint
encodings across the basic, leading_zero, two_leading_zeros and
case_insensitivity test files.
@lidel (Member) left a comment

I was unsure what this PR is about, so I did some spelunking. Thanks for taking this on @rvagg, it's been sitting for a while.

In a vacuum, lgtm, but there are multiple ways of solving this, and I'm unsure which one is canonical. Sharing what I found to save others time and to catch any misunderstanding on my end:

The problem this PR fixes

The original proquint paper (arxiv.org/abs/0901.4016 from 2009) describes encoding 16-bit words as CVCVC blocks and doesn't say what to do with an odd number of input bytes.

Multibase needs to handle any byte sequence, so somebody has to decide.
This PR locks that decision in, but there are multiple ways of solving this:

Four different rules exist for this

PR #125 discussed three; an IETF draft added a fourth later:

  • CVC truncation (this PR, by @rvagg): trailing byte → 3-character CVC block. [0x21] → pro-fah.
  • Pad + prefix swap (by @ben221199): pad with 0x00, swap pro- → por-. [0x21] → por-fahab.
  • Silent "y" (by @dsw, the original proquint author): trailing vowels with an unwritten silent y. [0x00] → baa, read "baya". dsw promised a spec update on July 18, 2024 but in Feb 2025 confirmed he hadn't gotten to it. dsw/proquint has had no spec activity since.
  • Pad + trailing hyphen: draft-rayner-proquint, an independent IETF submission, v00 through v11 between Aug 11 and Oct 13, 2025. Pad with 0x00, append a literal - (U+002D HYPHEN-MINUS) as the signal. [0x21] → pro-fahab-. Withdrawn Nov 21, 2025, expired Apr 16, 2026, never adopted by a WG.
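
To make the divergence concrete, here is a minimal Python sketch of three of the four rules applied to the same input. It is an illustration built from the descriptions above, not code from any of the linked PRs or drafts; the function names are hypothetical. The silent "y" rule is left out because the thread does not describe it in enough detail to implement.

```python
CONSONANTS = "bdfghjklmnprstvz"  # 4 bits per consonant
VOWELS = "aiou"                  # 2 bits per vowel


def encode_word(word: int) -> str:
    """Encode one 16-bit word as a CVCVC block, per the original proquint paper."""
    return (
        CONSONANTS[(word >> 12) & 0xF]
        + VOWELS[(word >> 10) & 0x3]
        + CONSONANTS[(word >> 6) & 0xF]
        + VOWELS[(word >> 4) & 0x3]
        + CONSONANTS[word & 0xF]
    )


def encode_blocks(data: bytes) -> list[str]:
    """Encode an even-length byte string as a list of CVCVC blocks."""
    assert len(data) % 2 == 0
    return [encode_word((data[i] << 8) | data[i + 1]) for i in range(0, len(data), 2)]


def cvc_truncation(data: bytes) -> str:
    """Rule 1: a trailing byte becomes a 3-character CVC block."""
    odd = len(data) % 2 == 1
    blocks = encode_blocks(data[:-1] if odd else data)
    if odd:
        blocks.append(encode_word(data[-1] << 8)[:3])
    return "pro-" + "-".join(blocks)


def pad_prefix_swap(data: bytes) -> str:
    """Rule 2: pad with 0x00 and signal the padding by using 'por-' instead of 'pro-'."""
    odd = len(data) % 2 == 1
    blocks = encode_blocks(data + b"\x00" if odd else data)
    return ("por-" if odd else "pro-") + "-".join(blocks)


def pad_trailing_hyphen(data: bytes) -> str:
    """Rule 4: pad with 0x00 and signal the padding with a trailing hyphen."""
    odd = len(data) % 2 == 1
    blocks = encode_blocks(data + b"\x00" if odd else data)
    return "pro-" + "-".join(blocks) + ("-" if odd else "")


print(cvc_truncation(b"\x21"))       # pro-fah
print(pad_prefix_swap(b"\x21"))      # por-fahab
print(pad_trailing_hyphen(b"\x21"))  # pro-fahab-
```

For even-length input all three functions agree; they only diverge on the trailing byte.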

Our libraries

  • multiformats/go-multibase (Kubo): no proquint or working base45 support. There's a Base45 = 'R' constant in the source, but it isn't registered in the EncodingToStr map, so the encoder can't actually use it.
  • multiformats/js-multiformats (Helia): proquint PR #292 is in draft; base45 PR #291 is open + approved but unmerged. Both by @rvagg; #292 uses this PR's CVC-truncation rule.

Neither library encodes proquint today.

My concern

This PR isn't documenting deployed behavior. It picks one of four proposed rules and makes it canonical for multibase. The other three are live enough that someone reading the paper, the IETF draft, or the PR #125 thread could reasonably implement a different rule. When that happens, pro-* strings won't round-trip between implementations and the failure is silent.

Preference

I'd lean toward parking the proquint parts of this PR (the rfcs/Proquint.md change and the proquint rows in the test CSVs) until we have upstream clarity. The base45 fixtures look ready to ship and could move to a separate PR so they're not blocked on the proquint conversation.

I know things can often get stuck forever for lack of a decision, but how urgent is it for us to resolve the odd-byte proquint question?

I pinged dsw in dsw/proquint#23 with this four-option summary. Asking him to either endorse one (with an update to the paper) or explicitly say the spec doesn't cover odd-byte inputs seems like the sensible next step. If he stays quiet for, say, 30 days, this PR can land as-is, having made a good-faith effort, and at least we'll have left breadcrumbs for people to discover the discrepancy once they start googling.

Comment thread rfcs/Proquint.md

The original proquint specification operates on 16-bit chunks and is silent on inputs whose length is not a multiple of two bytes. Multibase requires arbitrary byte sequences to be encodable, so this document specifies the following extension:

When the input has an odd number of bytes, every pair of bytes is encoded as a 5-character `CVCVC` block as usual, and the final byte is encoded as a 3-character `CVC` block representing the high 8 bits of a 16-bit value whose low 8 bits are zero. The trailing `CVC` block is joined to the preceding blocks with `-` in the usual way.

(optional) hm.. the 16-bit framing is hard to follow (and slightly imprecise, iiuc a CVC block carries 10 bits, not 8?). A bullet list spelling out where each pair of bits goes is clearer and avoids the implicit 16-bit construction:

Suggested change
When the input has an odd number of bytes, every pair of bytes is encoded as a 5-character `CVCVC` block as usual, and the final byte is encoded as a 3-character `CVC` block representing the high 8 bits of a 16-bit value whose low 8 bits are zero. The trailing `CVC` block is joined to the preceding blocks with `-` in the usual way.
When the input has an odd number of bytes, every pair of bytes is encoded as a 5-character `CVCVC` block as usual, and the final byte is encoded as a 3-character `CVC` block in which:
- the first consonant carries bits 7-4 of the byte;
- the vowel carries bits 3-2 of the byte;
- the second consonant carries bits 1-0 of the byte in its high 2 bits, with its low 2 bits zero.
The trailing `CVC` block is joined to the preceding blocks with `-` in the usual way.
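
The bit layout in the suggested wording can be sketched directly, without going through a 16-bit word. This is a minimal Python illustration rather than code from any multiformats library; `encode_trailing_byte` is a hypothetical helper name, and the alphabets are the ones from the proquint paper.

```python
CONSONANTS = "bdfghjklmnprstvz"  # index = 4-bit value
VOWELS = "aiou"                  # index = 2-bit value


def encode_trailing_byte(byte: int) -> str:
    """Encode a single trailing byte as a 3-character CVC block."""
    if not 0 <= byte <= 0xFF:
        raise ValueError("expected a value in the range 0..255")
    first = CONSONANTS[byte >> 4]             # bits 7-4 of the byte
    vowel = VOWELS[(byte >> 2) & 0x3]         # bits 3-2 of the byte
    second = CONSONANTS[(byte & 0x3) << 2]    # bits 1-0 in the high 2 bits, low 2 bits zero
    return first + vowel + second


print(encode_trailing_byte(0x21))  # fah
```

The block carries 10 bits of capacity (4 + 2 + 4), of which only 8 are used, which is why the low 2 bits of the second consonant are fixed at zero.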

Comment thread rfcs/Proquint.md
pro-fah
```

The second consonant in the trailing `CVC` block carries the two least-significant bits of the input byte in its two most-significant bits; its two least-significant bits MUST be zero. Decoders MUST reject any trailing `CVC` block where this is not the case, so that the encoding is canonical and bijective. The four valid trailing consonants are `b`, `h`, `m`, and `s`.

(optional) the example shows pro-fah for a 1-byte input, but the prose only says blocks join with - to preceding blocks. adding a sentence that covers the single-block case removes the ambiguity:

Suggested change
The second consonant in the trailing `CVC` block carries the two least-significant bits of the input byte in its two most-significant bits; its two least-significant bits MUST be zero. Decoders MUST reject any trailing `CVC` block where this is not the case, so that the encoding is canonical and bijective. The four valid trailing consonants are `b`, `h`, `m`, and `s`.
The second consonant in the trailing `CVC` block carries the two least-significant bits of the input byte in its two most-significant bits; its two least-significant bits MUST be zero. Decoders MUST reject any trailing `CVC` block where this is not the case, so that the encoding is canonical and bijective. The four valid trailing consonants are `b`, `h`, `m`, and `s`. When the input is a single byte the trailing `CVC` block is the only block, so the output is `pro-` followed directly by the three characters, as in the example above.
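
A decoder for the trailing block under this wording might look like the sketch below. It is an assumption-laden illustration, not the js-multiformats implementation; `decode_trailing_cvc` is a hypothetical name. It rejects uppercase input as well, in line with this PR dropping case insensitivity.

```python
CONSONANTS = "bdfghjklmnprstvz"  # index = 4-bit value
VOWELS = "aiou"                  # index = 2-bit value


def decode_trailing_cvc(block: str) -> int:
    """Decode a trailing 3-character CVC block back to one byte, strictly."""
    if len(block) != 3:
        raise ValueError("trailing block must be exactly 3 characters")
    first = CONSONANTS.find(block[0])
    vowel = VOWELS.find(block[1])
    second = CONSONANTS.find(block[2])
    if first < 0 or vowel < 0 or second < 0:
        raise ValueError("invalid character in trailing CVC block")
    if second & 0x3:
        # Low 2 bits of the second consonant must be zero for a canonical encoding.
        raise ValueError("non-canonical trailing consonant")
    return (first << 4) | (vowel << 2) | (second >> 2)


# The consonants whose low 2 bits are zero are the only valid trailing ones.
valid_trailing = [c for i, c in enumerate(CONSONANTS) if i & 0x3 == 0]
print(valid_trailing)                  # ['b', 'h', 'm', 's']
print(hex(decode_trailing_cvc("fah")))  # 0x21
```

Input such as `fad` decodes to the same high bits as `fab` but has non-zero low bits in the final consonant, so this sketch raises `ValueError` for it instead of silently accepting a second spelling of the same byte.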

@dsw commented May 12, 2026 via email

@rvagg (Member, Author) commented May 12, 2026

too much text @lidel; what's the TL;DR for me on this? This is a low-value PR for me and I'm inclined to close it given any friction. You're welcome to take it over if you care enough.

Development

Successfully merging this pull request may close these issues.

Test vectors needed for Base45

4 participants