fix(spec): add Base45 and Proquint fixtures & update encoding rules #138
rvagg wants to merge 2 commits into
Add test vectors for the newly-registered base45 (RFC9285) and proquint encodings across the basic, leading_zero, two_leading_zeros and case_insensitivity test files.
lidel left a comment
I was unsure what this PR is about, so I did some spelunking. Thanks for taking this on @rvagg, it's been sitting for a while.
In a vacuum, lgtm, but there are multiple ways of solving this, and unsure which one is canonical. Sharing what I found to save others time / also catch any misunderstanding on my end:
The problem this PR fixes
The original proquint paper (arxiv.org/abs/0901.4016 from 2009) describes encoding 16-bit words as CVCVC blocks and doesn't say what to do with an odd number of input bytes.
Multibase needs to handle any byte sequence, so somebody has to decide.
This PR locks that decision in, but there are multiple ways of solving this:
Four different rules exist for this
PR #125 discussed three; an IETF draft added a fourth later:
- CVC truncation (this PR, by @rvagg): trailing byte → 3-character `CVC` block. `[0x21]` → `pro-fah`.
- Pad + prefix swap (by @ben221199): pad with `0x00`, swap `pro-` → `por-`. `[0x21]` → `por-fahab`.
- Silent "y" (by @dsw, the original proquint author): trailing vowels with an unwritten silent `y`. `[0x00]` → `baa`, read "baya". dsw promised a spec update on July 18, 2024 but in Feb 2025 confirmed he hadn't gotten to it. `dsw/proquint` has had no spec activity since.
- Pad + trailing hyphen: `draft-rayner-proquint`, an independent IETF submission, v00 through v11 between Aug 11 and Oct 13, 2025. Pad with `0x00`, append a literal `-` (U+002D HYPHEN-MINUS) as the signal. `[0x21]` → `pro-fahab-`. Withdrawn Nov 21, 2025, expired Apr 16, 2026, never adopted by a WG.
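To make the divergence concrete, here is a minimal Python sketch of what three of the four rules produce for a single trailing byte. It assumes the consonant and vowel tables from the 2009 paper and infers each rule only from the `[0x21]` examples listed above; the function names are hypothetical, and the silent "y" rule is left out because the thread does not describe it in enough detail to implement.

```python
CONSONANTS = "bdfghjklmnprstvz"  # 4 bits per consonant
VOWELS = "aiou"                  # 2 bits per vowel


def quint(word: int) -> str:
    """Encode one 16-bit word as a CVCVC block."""
    return (
        CONSONANTS[(word >> 12) & 0xF]
        + VOWELS[(word >> 10) & 0x3]
        + CONSONANTS[(word >> 6) & 0xF]
        + VOWELS[(word >> 4) & 0x3]
        + CONSONANTS[word & 0xF]
    )


def trailing_byte_variants(byte: int) -> dict:
    """Return the output each rule gives for a one-byte input (illustrative only)."""
    padded = quint(byte << 8)  # the byte followed by a 0x00 pad byte
    return {
        # this PR: keep only the first three characters of the padded block
        "cvc_truncation": "pro-" + padded[:3],
        # pad, then signal the padding by swapping the prefix
        "pad_prefix_swap": "por-" + padded,
        # pad, then signal the padding with a trailing hyphen
        "pad_trailing_hyphen": "pro-" + padded + "-",
    }


print(trailing_byte_variants(0x21))
```

All three strings begin with the same `f`, `a`, `h` characters, which is why a decoder written for one rule can misread or reject output produced under another.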
Our libraries
- `multiformats/go-multibase` (Kubo): no proquint or working base45 support. There's a `Base45 = 'R'` constant in the source, but it isn't registered in the `EncodingToStr` map, so the encoder can't actually use it.
- `multiformats/js-multiformats` (Helia): proquint PR #292 is in draft; base45 PR #291 is open + approved but unmerged. Both by @rvagg; #292 uses this PR's CVC-truncation rule.
Neither library encodes proquint today.
My concern
This PR isn't documenting deployed behavior. It picks one of four proposed rules and makes it canonical for multibase. The other three are live enough that someone reading the paper, the IETF draft, or the PR #125 thread could reasonably implement a different rule. When that happens, pro-* strings won't round-trip between implementations and the failure is silent.
Preference
I'd lean toward parking the proquint parts of this PR (the rfcs/Proquint.md change and the proquint rows in the test CSVs) until we have upstream clarity. The base45 fixtures look ready to ship and could move to a separate PR so they're not blocked on the proquint conversation.
I know things can often get stuck forever for lack of a decision, but how urgent is it for us to resolve the odd-byte proquint question?
I pinged dsw in dsw/proquint#23 with this four-option summary. Asking him to either endorse one (with an update to the paper) or explicitly say the spec doesn't cover odd-byte inputs seems like the sensible next step. If he stays quiet for, say, 30 days, this PR can land as-is: we will have made a good-faith effort and at least left breadcrumbs for people to discover the discrepancy once they start googling.
In rfcs/Proquint.md:
> The original proquint specification operates on 16-bit chunks and is silent on inputs whose length is not a multiple of two bytes. Multibase requires arbitrary byte sequences to be encodable, so this document specifies the following extension:
>
> When the input has an odd number of bytes, every pair of bytes is encoded as a 5-character `CVCVC` block as usual, and the final byte is encoded as a 3-character `CVC` block representing the high 8 bits of a 16-bit value whose low 8 bits are zero. The trailing `CVC` block is joined to the preceding blocks with `-` in the usual way.
(optional) hm.. the 16-bit framing is hard to follow (and slightly imprecise, iiuc a CVC block carries 10 bits, not 8?). A bullet list spelling out where each pair of bits goes is clearer and avoids the implicit 16-bit construction:
Suggested change
-When the input has an odd number of bytes, every pair of bytes is encoded as a 5-character `CVCVC` block as usual, and the final byte is encoded as a 3-character `CVC` block representing the high 8 bits of a 16-bit value whose low 8 bits are zero. The trailing `CVC` block is joined to the preceding blocks with `-` in the usual way.
+When the input has an odd number of bytes, every pair of bytes is encoded as a 5-character `CVCVC` block as usual, and the final byte is encoded as a 3-character `CVC` block in which:
+
+- the first consonant carries bits 7-4 of the byte;
+- the vowel carries bits 3-2 of the byte;
+- the second consonant carries bits 1-0 of the byte in its high 2 bits, with its low 2 bits zero.
+
+The trailing `CVC` block is joined to the preceding blocks with `-` in the usual way.
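The bit layout in the suggested wording can be sketched as a small encoder. This is an illustrative Python sketch under the CVC-truncation rule, not the code from either library; `encode_proquint` is a hypothetical name, and returning a bare `pro-` for empty input is an assumption the thread does not address.

```python
CONSONANTS = "bdfghjklmnprstvz"  # 4 bits per consonant
VOWELS = "aiou"                  # 2 bits per vowel


def encode_proquint(data: bytes) -> str:
    """Encode bytes as multibase proquint using the CVC-truncation rule."""
    blocks = []
    # Every complete pair of bytes becomes an ordinary CVCVC block.
    for i in range(0, len(data) - 1, 2):
        word = (data[i] << 8) | data[i + 1]
        blocks.append(
            CONSONANTS[word >> 12]
            + VOWELS[(word >> 10) & 0x3]
            + CONSONANTS[(word >> 6) & 0xF]
            + VOWELS[(word >> 4) & 0x3]
            + CONSONANTS[word & 0xF]
        )
    # A leftover byte becomes a trailing CVC block.
    if len(data) % 2 == 1:
        b = data[-1]
        blocks.append(
            CONSONANTS[b >> 4]              # bits 7-4
            + VOWELS[(b >> 2) & 0x3]        # bits 3-2
            + CONSONANTS[(b & 0x3) << 2]    # bits 1-0 in the high 2 bits, low 2 bits zero
        )
    return "pro-" + "-".join(blocks)


print(encode_proquint(bytes([0x21])))                    # pro-fah
print(encode_proquint(bytes([0x7F, 0, 0, 1, 0x21])))     # pro-lusab-babad-fah
```

Because the second consonant's index is always a multiple of four, only `b`, `h`, `m`, and `s` can end a trailing block.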
In rfcs/Proquint.md:
> For example, the single byte `[0x21]` (`!`) is encoded as:
>
> ```
> pro-fah
> ```
>
> The second consonant in the trailing `CVC` block carries the two least-significant bits of the input byte in its two most-significant bits; its two least-significant bits MUST be zero. Decoders MUST reject any trailing `CVC` block where this is not the case, so that the encoding is canonical and bijective. The four valid trailing consonants are `b`, `h`, `m`, and `s`.
(optional) the example shows pro-fah for a 1-byte input, but the prose only says blocks join with - to preceding blocks. adding a sentence that covers the single-block case removes the ambiguity:
Suggested change
-The second consonant in the trailing `CVC` block carries the two least-significant bits of the input byte in its two most-significant bits; its two least-significant bits MUST be zero. Decoders MUST reject any trailing `CVC` block where this is not the case, so that the encoding is canonical and bijective. The four valid trailing consonants are `b`, `h`, `m`, and `s`.
+The second consonant in the trailing `CVC` block carries the two least-significant bits of the input byte in its two most-significant bits; its two least-significant bits MUST be zero. Decoders MUST reject any trailing `CVC` block where this is not the case, so that the encoding is canonical and bijective. The four valid trailing consonants are `b`, `h`, `m`, and `s`. When the input is a single byte the trailing `CVC` block is the only block, so the output is `pro-` followed directly by the three characters, as in the example above.
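The decoder-side rejection rule quoted above could look roughly like the following Python sketch. It assumes lowercase input and a hypothetical `decode_proquint` helper; rejecting a `CVC` block anywhere but the final position is an assumption implied by the word "trailing" rather than spelled out in the text.

```python
CONSONANTS = "bdfghjklmnprstvz"  # 4 bits per consonant
VOWELS = "aiou"                  # 2 bits per vowel


def decode_proquint(text: str) -> bytes:
    """Decode a multibase proquint string; raise ValueError on invalid input."""
    if not text.startswith("pro-"):
        raise ValueError("missing 'pro-' prefix")
    body = text[4:]
    blocks = body.split("-") if body else []
    out = bytearray()
    for position, block in enumerate(blocks):
        is_last = position == len(blocks) - 1
        # str.index raises ValueError for characters outside the tables.
        if len(block) == 5:
            word = (
                (CONSONANTS.index(block[0]) << 12)
                | (VOWELS.index(block[1]) << 10)
                | (CONSONANTS.index(block[2]) << 6)
                | (VOWELS.index(block[3]) << 4)
                | CONSONANTS.index(block[4])
            )
            out += word.to_bytes(2, "big")
        elif len(block) == 3 and is_last:
            tail = CONSONANTS.index(block[2])
            if tail & 0x3:
                # Low 2 bits set: not one of b, h, m, s, so not canonical.
                raise ValueError(f"non-canonical trailing block {block!r}")
            out.append(
                (CONSONANTS.index(block[0]) << 4)
                | (VOWELS.index(block[1]) << 2)
                | (tail >> 2)
            )
        else:
            raise ValueError(f"unexpected block {block!r}")
    return bytes(out)


print(decode_proquint("pro-fah"))  # b'!'
```

Without the `tail & 0x3` check, `pro-fad`, `pro-faf`, and `pro-fag` would silently decode to the same byte as `pro-fab`, and the encoding would no longer be bijective.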
I am being cc:ed on this conversation, but I lack context. Who are you and what is your interest in Proquints?
Is all you want a way to extend Proquints to encode byte sequences that are not a multiple of 16 bits?
Daniel
On Mon, May 11, 2026 at 9:32 AM Marcin Rataj wrote: [full quote of the review comment above]
Too much text, @lidel; what's the TL;DR for me on this? This is a low-value PR for me and I'm inclined to close it given any friction. You're welcome to take it over if you care enough.
Replaces #125