fix(spec): add Base45 and Proquint fixtures & update encoding rules by rvagg · Pull Request #138 · multiformats/multibase

rvagg · 2026-05-11T11:04:53Z

Replaces #125

Drops case insensitivity; RFC has a "MUST" on this
Adds an opinionated odd-byte Proquint rule; see Update test files #125 for background. I'm picking my version from that unresolved thread because I think it's the superior choice for us.

Add test vectors for the newly-registered base45 (RFC9285) and proquint encodings across the basic, leading_zero, two_leading_zeros and case_insensitivity test files.

lidel

I was unsure what this PR is about, so I did some spelunking. Thanks for taking this on @rvagg, it's been sitting for a while.

In a vacuum, lgtm, but there are multiple ways of solving this, and I'm unsure which one is canonical. Sharing what I found to save others time and to catch any misunderstanding on my end:

The problem this PR fixes

The original proquint paper (arxiv.org/abs/0901.4016 from 2009) describes encoding 16-bit words as CVCVC blocks and doesn't say what to do with an odd number of input bytes.

Multibase needs to handle any byte sequence, so somebody has to decide.
This PR locks that decision in, but there are multiple ways of solving this:

Four different rules exist for this

PR #125 discussed three; an IETF draft added a fourth later:

- CVC truncation (this PR, by @rvagg): trailing byte → 3-character CVC block. [0x21] → pro-fah.
- Pad + prefix swap (by @ben221199): pad with 0x00, swap pro- → por-. [0x21] → por-fahab.
- Silent "y" (by @dsw, the original proquint author): trailing vowels with an unwritten silent y. [0x00] → baa, read "baya". dsw promised a spec update on July 18, 2024 but in Feb 2025 confirmed he hadn't gotten to it. dsw/proquint has had no spec activity since.
- Pad + trailing hyphen: draft-rayner-proquint, an independent IETF submission, v00 through v11 between Aug 11 and Oct 13, 2025. Pad with 0x00, append a literal - (U+002D HYPHEN-MINUS) as the signal. [0x21] → pro-fahab-. Withdrawn Nov 21, 2025, expired Apr 16, 2026, never adopted by a WG.
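To make the divergence concrete, here is a minimal Python sketch of three of the four rules, built only from the descriptions in this comment. The function names are hypothetical, and the alphabet assignment (consonants `bdfghjklmnprstvz`, vowels `aiou`) is the one from the proquint paper. The silent "y" rule is left out because this thread does not describe it in enough detail to implement.

```python
CONSONANTS = "bdfghjklmnprstvz"  # 16 symbols, 4 bits each
VOWELS = "aiou"                  # 4 symbols, 2 bits each


def quint(word: int) -> str:
    """Encode one 16-bit word as a 5-character CVCVC block."""
    return (
        CONSONANTS[(word >> 12) & 0xF]
        + VOWELS[(word >> 10) & 0x3]
        + CONSONANTS[(word >> 6) & 0xF]
        + VOWELS[(word >> 4) & 0x3]
        + CONSONANTS[word & 0xF]
    )


def full_blocks(data: bytes) -> list[str]:
    """CVCVC blocks for every complete byte pair; a trailing odd byte is ignored."""
    return [quint(data[i] << 8 | data[i + 1]) for i in range(0, len(data) - 1, 2)]


def cvc_truncation(data: bytes) -> str:
    """Rule from this PR: the odd trailing byte becomes a 3-character CVC block."""
    blocks = full_blocks(data)
    if len(data) % 2:
        blocks.append(quint(data[-1] << 8)[:3])
    return "pro-" + "-".join(blocks)


def pad_prefix_swap(data: bytes) -> str:
    """Pad with 0x00 and signal the padding by swapping the prefix to 'por-'."""
    odd = len(data) % 2
    prefix = "por-" if odd else "pro-"
    return prefix + "-".join(full_blocks(data + b"\x00" * odd))


def pad_trailing_hyphen(data: bytes) -> str:
    """Pad with 0x00 and signal the padding with a trailing '-'."""
    odd = len(data) % 2
    return "pro-" + "-".join(full_blocks(data + b"\x00" * odd)) + ("-" if odd else "")


print(cvc_truncation(b"\x21"))       # pro-fah
print(pad_prefix_swap(b"\x21"))      # por-fahab
print(pad_trailing_hyphen(b"\x21"))  # pro-fahab-
```

For even-length inputs all three functions return the same string; they only disagree on the odd trailing byte, which is why a mismatch between implementations would be easy to miss.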

Our libraries

- multiformats/go-multibase (Kubo): no proquint or working base45 support. There's a Base45 = 'R' constant in the source, but it isn't registered in the EncodingToStr map, so the encoder can't actually use it.
- multiformats/js-multiformats (Helia): proquint PR #292 is in draft; base45 PR #291 is open + approved but unmerged. Both by @rvagg; #292 uses this PR's CVC-truncation rule.

Neither library encodes proquint today.

My concern

This PR isn't documenting deployed behavior. It picks one of four proposed rules and makes it canonical for multibase. The other three are live enough that someone reading the paper, the IETF draft, or the PR #125 thread could reasonably implement a different rule. When that happens, pro-* strings won't round-trip between implementations and the failure is silent.

Preference

I'd lean toward parking the proquint parts of this PR (the rfcs/Proquint.md change and the proquint rows in the test CSVs) until we have upstream clarity. The base45 fixtures look ready to ship and could move to a separate PR so they're not blocked on the proquint conversation.

I know things can often get stuck forever for lack of a decision, but how urgent is it for us to resolve the odd-byte proquint question?

I pinged dsw in dsw/proquint#23 with this four-option summary; asking him to either endorse one (with an update to the paper) or explicitly say the spec doesn't cover odd-byte inputs seems like the sensible next step. If he stays quiet for, say, 30 days, this PR can land as-is, having made a good-faith effort, and at least we've left breadcrumbs for people to discover the discrepancy once they start googling.

lidel · 2026-05-11T12:18:45Z

+
+The original proquint specification operates on 16-bit chunks and is silent on inputs whose length is not a multiple of two bytes. Multibase requires arbitrary byte sequences to be encodable, so this document specifies the following extension:
+
+When the input has an odd number of bytes, every pair of bytes is encoded as a 5-character `CVCVC` block as usual, and the final byte is encoded as a 3-character `CVC` block representing the high 8 bits of a 16-bit value whose low 8 bits are zero. The trailing `CVC` block is joined to the preceding blocks with `-` in the usual way.


(optional) hm.. the 16-bit framing is hard to follow (and slightly imprecise, iiuc a CVC block carries 10 bits, not 8?). A bullet list spelling out where each pair of bits goes is clearer and avoids the implicit 16-bit construction:

Suggested change

When the input has an odd number of bytes, every pair of bytes is encoded as a 5-character `CVCVC` block as usual, and the final byte is encoded as a 3-character `CVC` block representing the high 8 bits of a 16-bit value whose low 8 bits are zero. The trailing `CVC` block is joined to the preceding blocks with `-` in the usual way.

When the input has an odd number of bytes, every pair of bytes is encoded as a 5-character `CVCVC` block as usual, and the final byte is encoded as a 3-character `CVC` block in which:

- the first consonant carries bits 7-4 of the byte;

- the vowel carries bits 3-2 of the byte;

- the second consonant carries bits 1-0 of the byte in its high 2 bits, with its low 2 bits zero.

The trailing `CVC` block is joined to the preceding blocks with `-` in the usual way.
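The bit layout in the bullet list can be sketched in a few lines of Python. This is an illustration only: `encode_trailing_cvc` is a hypothetical helper name, and the alphabets are assumed to be the ones from the proquint paper.

```python
CONSONANTS = "bdfghjklmnprstvz"  # 16 symbols, 4 bits each
VOWELS = "aiou"                  # 4 symbols, 2 bits each


def encode_trailing_cvc(byte: int) -> str:
    """Encode a single trailing byte as a 3-character CVC block."""
    if not 0 <= byte <= 0xFF:
        raise ValueError("expected a single byte (0..255)")
    first = CONSONANTS[byte >> 4]            # bits 7-4 of the byte
    vowel = VOWELS[(byte >> 2) & 0x3]        # bits 3-2 of the byte
    second = CONSONANTS[(byte & 0x3) << 2]   # bits 1-0 in the high 2 bits, low 2 bits zero
    return first + vowel + second


print(encode_trailing_cvc(0x21))  # fah
```

Because the low two bits of the second consonant are always zero, only four of the sixteen consonants can appear in that position.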

lidel · 2026-05-11T12:20:39Z

+pro-fah
+```
+
+The second consonant in the trailing `CVC` block carries the two least-significant bits of the input byte in its two most-significant bits; its two least-significant bits MUST be zero. Decoders MUST reject any trailing `CVC` block where this is not the case, so that the encoding is canonical and bijective. The four valid trailing consonants are `b`, `h`, `m`, and `s`.


(optional) the example shows pro-fah for a 1-byte input, but the prose only says blocks join with - to preceding blocks. adding a sentence that covers the single-block case removes the ambiguity:

Suggested change

The second consonant in the trailing `CVC` block carries the two least-significant bits of the input byte in its two most-significant bits; its two least-significant bits MUST be zero. Decoders MUST reject any trailing `CVC` block where this is not the case, so that the encoding is canonical and bijective. The four valid trailing consonants are `b`, `h`, `m`, and `s`.

The second consonant in the trailing `CVC` block carries the two least-significant bits of the input byte in its two most-significant bits; its two least-significant bits MUST be zero. Decoders MUST reject any trailing `CVC` block where this is not the case, so that the encoding is canonical and bijective. The four valid trailing consonants are `b`, `h`, `m`, and `s`. When the input is a single byte the trailing `CVC` block is the only block, so the output is `pro-` followed directly by the three characters, as in the example above.
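A decoder following this paragraph might look roughly like the Python sketch below. `decode_trailing_cvc` is a hypothetical name, the alphabets are assumed to be those of the proquint paper, and accepting only lowercase input is a simplification made for this sketch.

```python
CONSONANTS = "bdfghjklmnprstvz"  # 16 symbols, 4 bits each
VOWELS = "aiou"                  # 4 symbols, 2 bits each


def decode_trailing_cvc(block: str) -> int:
    """Decode a trailing 3-character CVC block back to one byte.

    Raises ValueError for malformed or non-canonical blocks.
    """
    if len(block) != 3:
        raise ValueError("trailing block must be exactly 3 characters")
    try:
        c1 = CONSONANTS.index(block[0])
        v = VOWELS.index(block[1])
        c2 = CONSONANTS.index(block[2])
    except ValueError:
        raise ValueError(f"invalid character in block {block!r}") from None
    # The low 2 bits of the second consonant MUST be zero, which leaves
    # only b, h, m and s as valid trailing consonants.
    if c2 & 0x3:
        raise ValueError(f"non-canonical trailing consonant {block[2]!r}")
    return (c1 << 4) | (v << 2) | (c2 >> 2)


print(hex(decode_trailing_cvc("fah")))  # 0x21
```

Rejecting the other twelve consonants leaves 16 × 4 × 4 = 256 valid blocks, exactly one per byte value, which is what makes the mapping bijective.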

dsw · 2026-05-12T00:30:19Z

I am being cc:ed on this conversation, but I lack context. Who are you and what is your interest in Proquints? Is all you want a way to extend Proquints to encode byte sequences that are not a multiple of 16 bytes? Daniel


rvagg · 2026-05-12T02:58:01Z

too much text @lidel; what's the TL;DR for me on this, this is a low-value PR for me, I'm inclined to close this given any friction, you're welcome to take it over if you care enough?

ben221199 and others added 2 commits May 11, 2026 20:55

test: add Base45 and Proquint test vectors

aac9633

Add test vectors for the newly-registered base45 (RFC9285) and proquint encodings across the basic, leading_zero, two_leading_zeros and case_insensitivity test files.

fix(spec): refine Base45 and Proquint encoding rules

c79d7be

rvagg requested a review from lidel May 11, 2026 11:04

rvagg linked an issue May 11, 2026 that may be closed by this pull request

Test vectors needed for Base45 #124

Open

lidel mentioned this pull request May 11, 2026

How should proquint encode inputs whose byte length isn't a multiple of two? dsw/proquint#23

Open

lidel reviewed May 11, 2026

rvagg wants to merge 2 commits into master from rvagg/test-vectors

4 participants

