
Commit cec061e

Authored by Luke Sneeringer
feat: AIP-210 – Unicode (#28)
1 parent 683f5e5 commit cec061e

2 files changed

Lines changed: 165 additions & 0 deletions

File tree

aip/general/0210/aip.md

Lines changed: 158 additions & 0 deletions
@@ -0,0 +1,158 @@
# Unicode

APIs should be consistent on how they explain, limit, and bill for string
values and their encodings. This ranges from small ambiguities (like fields
"limited to 1024 characters") all the way to billing confusion (are names and
values of properties in Datastore billed based on characters or bytes?).

In general, if limits are measured in bytes, we are discriminating against
non-ASCII text since it takes up more space. On the other hand, if limits are
measured in "characters", it is ambiguous whether those are Unicode "code
points", "code units" for a particular encoding (e.g. UTF-8 or UTF-16),
"graphemes", or "grapheme clusters".
## Unicode primer

Character encoding tends to be an area we often gloss over, so a quick primer:

- Strings are just sequences of bytes that represent text according to some
  encoding format.
- When we talk about **characters**, we sometimes mean Unicode **code
  points**, which are 21-bit unsigned integers `0` through `0x10FFFF`.
- Other times we might mean **grapheme clusters**, which are _perceived_ as
  single characters but may be composed of multiple code points. For example,
  `á` can be represented as the single code point `U+00E1` or as a sequence of
  `U+0061` followed by `U+0301` (the letter `a`, then a combining acute
  accent).
- Protocol buffers use **UTF-8** ("Unicode Transformation Format"), a
  variable-length encoding scheme that represents each code point as a
  sequence of 1 to 4 single-byte **code units**.
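The distinction between code points, grapheme clusters, and UTF-8 code units
can be made concrete with a few lines of Python; this is an illustrative
sketch rather than part of the AIP itself:

```python
import unicodedata

# Two representations of the same perceived character "á":
composed = "\u00E1"      # one code point: LATIN SMALL LETTER A WITH ACUTE
decomposed = "a\u0301"   # two code points: "a" + COMBINING ACUTE ACCENT

print(len(composed))                     # 1 code point
print(len(decomposed))                   # 2 code points
print(len(composed.encode("utf-8")))     # 2 UTF-8 code units (bytes)
print(len(decomposed.encode("utf-8")))   # 3 UTF-8 code units (bytes)

# Both render as a single grapheme cluster, and Normalization Form C
# maps them to the same string.
print(unicodedata.normalize("NFC", decomposed) == composed)  # True
```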
## Guidance

### Character definition

**TL;DR:** In our APIs, "character" means "Unicode code point".

In API documentation (e.g., API reference documents, blog posts, marketing
documentation, billing explanations, etc.), "character" **must** be defined as
a Unicode code point.
### Length units

**TL;DR:** Set size limits in "characters" (as defined above).

All string field length limits defined in the API **must** be measured and
enforced in characters as defined above. This means that there is an
underlying maximum limit of (`4 * characters`) bytes, though this limit will
only be hit when the input consists exclusively of characters whose UTF-8
encoding requires 4 code units (32 bits).

If you use a database system (e.g. Spanner) which allows you to define a
limit in characters, it is safe to assume that this byte-defined requirement
is handled by the underlying storage system.
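In Python, `len()` on a `str` counts code points, so a character-based limit
can be checked directly. A minimal sketch, in which the field name and the
1024-character limit are illustrative rather than taken from the AIP:

```python
MAX_DISPLAY_NAME_CHARS = 1024  # hypothetical limit, in code points

def check_length(value: str) -> None:
    """Enforce a string limit measured in characters (Unicode code points)."""
    if len(value) > MAX_DISPLAY_NAME_CHARS:
        raise ValueError(
            f"display_name exceeds {MAX_DISPLAY_NAME_CHARS} characters"
        )

# The worst-case storage footprint is 4 bytes per code point:
assert len("\U0001F600".encode("utf-8")) == 4  # emoji needs 4 UTF-8 code units
```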
### Billing units

APIs **may** use either code points or bytes (using the UTF-8 encoding) as the
unit for billing or quota measurement (e.g., Cloud Translation chooses to use
characters). If an API does not define this, the assumption is that the unit
of billing is characters (e.g., $0.01 _per character_, not $0.01 _per byte_).
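For non-ASCII text the two units diverge, which is why the default matters. A
quick illustration, using a per-unit rate in cents for exact arithmetic:

```python
RATE_CENTS = 1  # hypothetical rate: 1 cent per billing unit

text = "estaré"
chars = len(text)                       # 6 code points
utf8_bytes = len(text.encode("utf-8"))  # 7 bytes ("é" takes two)

print(chars * RATE_CENTS)       # 6 cents when billing per character
print(utf8_bytes * RATE_CENTS)  # 7 cents when billing per byte
```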
### Unique identifiers

**TL;DR:** Unique identifiers **should** be limited to ASCII, generally only
letters, numbers, hyphens, and underscores. Additionally, unique identifiers
**should** start with a letter, **should** end in either a letter or number,
and **should not** have hyphens or underscores that are adjacent to other
hyphens or underscores.
Strings used as unique identifiers **should** limit inputs to ASCII
characters, typically letters, numbers, hyphens, and underscores
(`[a-zA-Z][a-zA-Z0-9_-]*`). This ensures that there are never accidental
collisions due to normalization. If an API decides to allow all valid Unicode
characters in unique identifiers, the API **must** reject any inputs that are
not in Normalization Form C.

Unique identifiers **should** use a maximum length of 64 characters, though
this limit may be expanded as necessary. 64 characters should be sufficient
for most purposes, as even UUIDs only require 36 characters.
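Taken together, these rules can be expressed as a single check. The sketch
below is one possible reading of the guidance; the exact anchored pattern and
the adjacency check are inferred from the prose, not spelled out in the AIP:

```python
import re

# Starts with a letter, ends with a letter or number, ASCII only.
_ID_PATTERN = re.compile(r"[a-zA-Z]([a-zA-Z0-9_-]*[a-zA-Z0-9])?")

def is_valid_identifier(value: str) -> bool:
    """Check an identifier against the AIP-210 recommendations."""
    if len(value) > 64:               # recommended maximum length
        return False
    if re.search(r"[-_]{2}", value):  # no adjacent hyphens/underscores
        return False
    return _ID_PATTERN.fullmatch(value) is not None

assert is_valid_identifier("my-resource-1")
assert not is_valid_identifier("my--resource")  # adjacent hyphens
assert not is_valid_identifier("1resource")     # must start with a letter
assert not is_valid_identifier("resource-")     # must end in letter or number
```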
### Normalization

**TL;DR:** Unicode values **should** be stored in [Normalization Form C][].

Values **should** always be normalized into Normalization Form C. Unique
identifiers **must** always be stored in Normalization Form C (see the next
section).
Imagine we're dealing with the Spanish input "estar**é**" (the accented part
will be bolded throughout). This text has 6 grapheme clusters, and can be
represented by two distinct sequences of Unicode code points:

- Using 6 code points: `U+0065` `U+0073` `U+0074` `U+0061` `U+0072`
  **`U+00E9`**
- Using 7 code points: `U+0065` `U+0073` `U+0074` `U+0061` `U+0072`
  **`U+0065` `U+0301`**

Further, when encoding to UTF-8, these code points have two different
serialized representations:

- Using 7 code units (7 bytes): `0x65` `0x73` `0x74` `0x61` `0x72` **`0xC3`
  `0xA9`**
- Using 8 code units (8 bytes): `0x65` `0x73` `0x74` `0x61` `0x72` **`0x65`
  `0xCC` `0x81`**

To avoid this discrepancy in size (both in code units and in code points), use
[Normalization Form C][], which provides a canonical representation for
strings.

[normalization form c]: https://unicode.org/reports/tr15/
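The standard library can reproduce this example directly; a brief sketch
using Python's `unicodedata` module:

```python
import unicodedata

composed = "estar\u00E9"     # 6 code points: é precomposed
decomposed = "estare\u0301"  # 7 code points: e + combining acute accent

print(composed == decomposed)           # False: different code point sequences
print(len(composed.encode("utf-8")))    # 7 bytes
print(len(decomposed.encode("utf-8")))  # 8 bytes

# NFC collapses both to the same canonical (and most compact) form.
nfc = unicodedata.normalize("NFC", decomposed)
print(nfc == composed)                  # True
```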
### Uniqueness

**TL;DR:** Unicode values **must** be normalized to [Normalization Form C][]
before checking uniqueness.

For the purposes of unique identification (e.g., `name`, `id`, or `parent`),
the value **must** be normalized into [Normalization Form C][] (which happens
to be the most compact). Otherwise we may have what is essentially "the same
string" used to identify two entirely different resources.
In our example above, there are two ways of representing what is essentially
the same text. This raises the question of whether the two representations
should be treated as equivalent. In other words, if someone were to use both
of those byte sequences in a string field that acts as a unique identifier,
would it violate a uniqueness constraint?

The W3C recommends using Normalization Form C for all content moving across
the internet. It is the most compact normalized form of Unicode text, and
avoids most interoperability problems. If we were to treat two Unicode byte
sequences as different when they have the same representation in NFC, we
would be required to reply to possible "Get" requests with content that is
**not** in normalized form. Since that is unacceptable, we **must** treat the
two as identical, either by transforming any incoming string data into
Normalization Form C or by rejecting identifiers not in the normalized form.
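Both options at the end of that paragraph are one-liners in Python 3.8+
(where `unicodedata.is_normalized` was added); a sketch of each, with
illustrative function names:

```python
import unicodedata

def canonicalize(identifier: str) -> str:
    """Option 1: transform incoming identifiers into NFC before storage."""
    return unicodedata.normalize("NFC", identifier)

def validate(identifier: str) -> None:
    """Option 2: reject identifiers that are not already in NFC."""
    if not unicodedata.is_normalized("NFC", identifier):
        raise ValueError("identifier must be in Normalization Form C")

# Either way, the two spellings of "estaré" can no longer collide:
assert canonicalize("estare\u0301") == canonicalize("estar\u00E9")
```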
There is some debate about whether we should view strings as sequences of
code points encoded into byte sequences (leading to uniqueness determined by
the byte representation of the string), or to interpret strings as a
higher-level abstraction with many different possible byte representations.
The stance taken here is that we already have a field type for the former:
`bytes`. Fields of type `string` already express an opinion about the
validity of an input (it must be valid UTF-8). As a result, treating two
inputs that have identical normalized forms as different due to their
underlying byte representations seems to go against the original intent of
the `string` type. This distinction typically doesn't matter for strings that
are opaque to our services (e.g., `description` or `display_name`); however,
when we rely on strings to uniquely identify resources, we are forced to take
a stance.

Put differently, our goal is to allow someone with text in any encoding
(ASCII, UTF-16, UTF-32, etc.) to interact with our APIs without a lot of
"gotchas".
## References

- [Unicode normalization forms](https://unicode.org/reports/tr15/)
- [Datastore pricing "name and value of each property"](https://cloud.google.com/datastore/pricing)
  doesn't clarify this.
- [Natural Language pricing](https://cloud.google.com/natural-language/pricing)
  charges based on Unicode code points rather than UTF-8 code units.
- [Text matching and normalization](https://sites.google.com/a/google.com/intl-eng/apis/matching?pli=1)

aip/general/0210/aip.yaml

Lines changed: 7 additions & 0 deletions
@@ -0,0 +1,7 @@
---
id: 210
state: approved
created: 2018-08-20
placement:
  category: design-patterns
  order: 110
