# Unicode

APIs should be consistent about how they explain, limit, and bill for string
values and their encodings. This ranges from small ambiguities (like fields
"limited to 1024 characters") all the way to billing confusion (are names and
values of properties in Datastore billed based on characters or bytes?).

In general, if limits are measured in bytes, we are discriminating against
non-ASCII text since it takes up more space. On the other hand, if limits are
measured in "characters", it is ambiguous whether those are Unicode
"code points", "code units" for a particular encoding (e.g., UTF-8 or UTF-16),
"graphemes", or "grapheme clusters".

## Unicode primer

Character encoding tends to be an area we often gloss over, so a quick primer:

- Strings are just sequences of bytes that represent text according to some
  encoding format.
- When we talk about **characters**, we sometimes mean Unicode **code points**,
  which are 21-bit unsigned integers `0` through `0x10FFFF`.
- Other times we might mean **grapheme clusters**, which are _perceived_ as
  single characters but may be composed of multiple code points. For example,
  `á` can be represented as the single code point `U+00E1` or as a sequence of
  `U+0061` followed by `U+0301` (the letter `a`, then a combining acute
  accent).
- Protocol Buffers uses **UTF-8** ("Unicode Transformation Format, 8-bit"), a
  variable-length encoding scheme that represents each code point as a
  sequence of one to four single-byte **code units**.

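These distinctions can be seen directly in Python, where `len` on a `str`
counts code points and `len` on its UTF-8 encoding counts code units; a small
sketch using the `á` example above:

```python
# The two representations of "á" described above.
composed = "\u00E1"          # single code point: U+00E1
decomposed = "\u0061\u0301"  # U+0061 ("a") + U+0301 (combining acute accent)

# Both are perceived as one grapheme cluster, but they differ in
# code points and in UTF-8 code units (bytes).
print(len(composed), len(decomposed))  # 1 2
print(len(composed.encode("utf-8")),
      len(decomposed.encode("utf-8")))  # 2 3
```
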
## Guidance

### Character definition

**TL;DR:** In our APIs, "character" means "Unicode code point".

In API documentation (e.g., API reference documents, blog posts, marketing
documentation, billing explanations, etc.), "character" **must** be defined as
a Unicode code point.

### Length units

**TL;DR:** Set size limits in "characters" (as defined above).

All string field length limits defined in the API **must** be measured and
enforced in characters as defined above. This means that the underlying
maximum limit is (`4 * characters`) bytes, though this limit is only reached
when the input consists exclusively of characters encoded as four UTF-8 code
units (32 bits).

If you use a database system (e.g., Spanner) that allows you to define a limit
in characters, it is safe to assume that the underlying storage system handles
this byte-level requirement.

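As a sketch of what enforcement might look like, assuming Python (whose `len`
on a `str` counts code points, matching the definition of "character" above);
the limit value and function name here are illustrative:

```python
MAX_CHARS = 1024  # illustrative limit, "characters" meaning code points

def validate_length(value: str, max_chars: int = MAX_CHARS) -> None:
    # len() on a Python str counts Unicode code points.
    if len(value) > max_chars:
        raise ValueError(f"value exceeds {max_chars} characters")

# Worst case: every code point can take up to four UTF-8 code units,
# so the underlying byte ceiling is 4 * MAX_CHARS.
max_bytes = 4 * MAX_CHARS  # 4096
```
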
### Billing units

APIs **may** use either code points or bytes (using the UTF-8 encoding) as the
unit for billing or quota measurement (e.g., Cloud Translation chooses to use
characters). If an API does not define this, the assumption is that the unit of
billing is characters (e.g., $0.01 _per character_, not $0.01 _per byte_).

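The character-vs-byte distinction only matters for non-ASCII text; a quick
sketch in Python (the one-cent-per-unit price is hypothetical):

```python
text = "estar\u00E9"  # "estaré" with a precomposed é

characters = len(text)                  # 6 code points
utf8_bytes = len(text.encode("utf-8"))  # 7 bytes ("é" takes two code units)

# Billed at a hypothetical 1 cent per unit, the basis matters:
print(characters, "cents if billed per character")  # 6
print(utf8_bytes, "cents if billed per byte")       # 7
```
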
### Unique identifiers

**TL;DR:** Unique identifiers **should** be limited to ASCII, generally only
letters, numbers, hyphens, and underscores. Additionally, unique identifiers
**should** start with a letter, **should** end in either a letter or number,
and **should not** have hyphens or underscores that are adjacent to other
hyphens or underscores.

Strings used as unique identifiers **should** limit inputs to ASCII characters,
typically letters, numbers, hyphens, and underscores
(`[a-zA-Z][a-zA-Z0-9_-]*`). This ensures that there are never accidental
collisions due to normalization. If an API decides to allow all valid Unicode
characters in unique identifiers, the API **must** reject any inputs that are
not in Normalization Form C.

Unique identifiers **should** use a maximum length of 64 characters, though
this limit may be expanded as necessary. 64 characters should be sufficient for
most purposes, as even UUIDs only require 36 characters.

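A sketch of such validation in Python; the function name and the exact regular
expression are illustrative, but the pattern encodes the rules above (starts
with a letter, ends with a letter or number, no adjacent hyphens or
underscores, at most 64 characters):

```python
import re

# Each "-" or "_" must be followed by a letter or digit, which rules out
# identifiers that end in (or double up) hyphens and underscores.
_IDENTIFIER_RE = re.compile(r"^[a-zA-Z](?:[a-zA-Z0-9]|[-_](?=[a-zA-Z0-9]))*$")

def is_valid_identifier(value: str) -> bool:
    return len(value) <= 64 and bool(_IDENTIFIER_RE.match(value))

print(is_valid_identifier("my-resource_1"))  # True
print(is_valid_identifier("1-resource"))     # False (starts with a digit)
print(is_valid_identifier("my--resource"))   # False (adjacent hyphens)
print(is_valid_identifier("my-resource-"))   # False (ends with a hyphen)
```
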
### Normalization

**TL;DR:** Unicode values **should** be stored in [Normalization Form C][].

Values **should** always be normalized into Normalization Form C. Unique
identifiers **must** always be stored in Normalization Form C (see the next
section).

Imagine we're dealing with Spanish input "estar**é**" (the accented part will
be bolded throughout). This text has 6 grapheme clusters, and can be
represented by two distinct sequences of Unicode code points:

- Using 6 code points: `U+0065` `U+0073` `U+0074` `U+0061` `U+0072`
  **`U+00E9`**
- Using 7 code points: `U+0065` `U+0073` `U+0074` `U+0061` `U+0072` **`U+0065`
  `U+0301`**

Further, when encoding to UTF-8, these code points have two different
serialized representations:

- Using 7 code units (7 bytes): `0x65` `0x73` `0x74` `0x61` `0x72` **`0xC3`
  `0xA9`**
- Using 8 code units (8 bytes): `0x65` `0x73` `0x74` `0x61` `0x72` **`0x65`
  `0xCC` `0x81`**

To avoid this discrepancy in size (both code units and code points), use
[Normalization Form C][], which provides a canonical representation for
strings.

[normalization form c]: https://unicode.org/reports/tr15/

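The example above can be checked directly with Python's standard-library
`unicodedata` module:

```python
import unicodedata

# The two code point sequences for "estaré" described above.
nfc = "estar\u00E9"   # 6 code points, ending in U+00E9
nfd = "estare\u0301"  # 7 code points, ending in U+0065 U+0301

# Distinct sequences, distinct UTF-8 lengths...
print(nfc == nfd)                # False
print(len(nfc.encode("utf-8")))  # 7
print(len(nfd.encode("utf-8")))  # 8

# ...but normalizing to NFC yields one canonical representation.
print(unicodedata.normalize("NFC", nfd) == nfc)  # True
```
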
### Uniqueness

**TL;DR:** Unicode values **must** be normalized to [Normalization Form C][]
before checking uniqueness.

For the purposes of unique identification (e.g., `name`, `id`, or `parent`),
the value **must** be normalized into [Normalization Form C][] (which happens
to be the most compact). Otherwise we may have what is essentially "the same
string" used to identify two entirely different resources.

In our example above, there are two ways of representing what is essentially
the same text. This raises the question of whether the two representations
should be treated as equivalent. In other words, if someone were to use both
of those byte sequences in a string field that acts as a unique identifier,
would it violate a uniqueness constraint?

The W3C recommends using Normalization Form C for all content moving across
the internet. It is the most compact normalized form of Unicode text, and
avoids most interoperability problems. If we were to treat two Unicode byte
sequences as different when they have the same representation in NFC, we would
be required to reply to possible "Get" requests with content that is **not**
in normalized form. Since that is clearly unacceptable, we **must** treat the
two as identical, either by transforming any incoming string data into
Normalization Form C or by rejecting identifiers not in the normalized form.

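A sketch of those two acceptable behaviors in Python (the function names are
illustrative; `unicodedata.is_normalized` requires Python 3.8+):

```python
import unicodedata

def normalize_identifier(value: str) -> str:
    # Canonicalize before storing or comparing for uniqueness.
    return unicodedata.normalize("NFC", value)

def require_nfc(value: str) -> str:
    # Alternatively, reject anything not already in NFC.
    if not unicodedata.is_normalized("NFC", value):
        raise ValueError("identifier must be in Normalization Form C")
    return value

# The two "estaré" forms collide once normalized.
print(normalize_identifier("estare\u0301") ==
      normalize_identifier("estar\u00E9"))  # True
```
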
There is some debate about whether we should view strings as sequences of code
points encoded into byte sequences (leading to uniqueness determined by the
byte representation of the string) or interpret strings as a higher-level
abstraction with many different possible byte representations. The stance
taken here is that we already have a field type for the former: `bytes`.
Fields of type `string` already express an opinion about the validity of an
input (it must be valid UTF-8). As a result, treating two inputs that have
identical normalized forms as different due to their underlying byte
representation goes against the original intent of the `string` type. This
distinction typically doesn't matter for strings that are opaque to our
services (e.g., `description` or `display_name`); however, when we rely on
strings to uniquely identify resources, we are forced to take a stance.

Put differently, our goal is to allow someone with text in any encoding
(ASCII, UTF-16, UTF-32, etc.) to interact with our APIs without a lot of
"gotchas".

## References

- [Unicode normalization forms](https://unicode.org/reports/tr15/)
- [Datastore pricing "name and value of each property"](https://cloud.google.com/datastore/pricing)
  doesn't clarify this.
- [Natural Language pricing](https://cloud.google.com/natural-language/pricing)
  charges based on Unicode code points rather than UTF-8 code units.
- [Text matching and normalization](https://sites.google.com/a/google.com/intl-eng/apis/matching?pli=1)