Version: 3.11 (__version__)
Type: Pure Python (with ~13 KLoC of embedded Unicode tables)
SPM target: Bundled in the Python framework (pulled in by requests)
Total Python modules: 8
Internationalized Domain Names
encoder / decoder. Lets you handle hostnames with non-ASCII characters
(münchen.de, 中国.中国, ดอทคอม.ไทย) by converting them to ASCII
"Punycode" form (xn--mnchen-3ya.de).
requests / urllib3 call this for you when you pass a Unicode URL.
You'd call it directly to validate input, normalize stored hostnames,
or implement a custom URL parser.
| Module | What it does |
|---|---|
idna.__init__ |
Public API: encode, decode, IDNAError, IDNABidiError, InvalidCodepoint, InvalidCodepointContext, alabel, ulabel, check_bidi, check_hyphen_ok, check_initial_combiner, check_label, check_nfc, uts46_remap, valid_contextj, valid_contexto, valid_label_length, valid_string_length, intranges_contain |
idna.core |
The encoder/decoder algorithm — UTS #46 + RFC 3490 / 5891 (437 lines) |
idna.codec |
Python codec registration — "name".encode("idna") (122 lines) |
idna.compat |
py2/py3 shim (deprecated, kept for compat — 15 lines) |
idna.idnadata |
Unicode tables — joining types, categories, BIDI classes (4309 lines, ~1.5 MB) |
idna.intranges |
Range-based table compression / lookup (57 lines) |
idna.package_data |
__version__ = "3.11" |
idna.uts46data |
UTS #46 case-mapping + character-status table (8841 lines, ~3 MB) |
import idna
# Encode → ASCII Punycode (for putting on the wire)
print(idna.encode("münchen.de")) # b'xn--mnchen-3ya.de'
print(idna.encode("ドメイン.テスト")) # b'xn--eckwd4c7c.xn--zckzah'
# Decode → human-readable Unicode
print(idna.decode("xn--mnchen-3ya.de")) # 'münchen.de'
# Validate a label (single hostname segment, e.g. "münchen")
try:
idna.uts46_remap("test_label") # underscores forbidden
except idna.IDNAError as e:
print(f"invalid: {e}")
# Codec registration — once `import idna` runs, `"name".encode("idna")` works
print("münchen".encode("idna")) # b'xn--mnchen-3ya'
print(b"xn--mnchen-3ya".decode("idna")) # 'münchen'# Useful when normalizing user-typed URLs
from urllib.parse import urlparse
import idna
def canonical_host(url: str) -> str:
"""Return the URL's hostname in ASCII Punycode form."""
host = urlparse(url).hostname or ""
try:
return idna.encode(host).decode("ascii")
except idna.IDNAError:
return host # already ASCII or invalid- Validating hostname input before passing it to a non-Unicode API
- Comparing hostnames safely (always normalize via Punycode first)
- Building a DNS lookup tool / URL bar
- UTS #46 strictness mode — by default,
idna.encodeusesuts46=Falseand rejects some ambiguous Unicode (full-width ASCII, deprecated codepoints). Passuts46=Trueto relax - No internationalized scheme support (e.g.
http://中国.中国/works becauserequestscallsidnaon the hostname; the scheme + path remain ASCII) - No emoji domains. Per RFC 5891, emoji are intentionally not IDNA-allowed.
xn--encodings of emoji you see in the wild are technically invalid even though they may resolve
idna.idnadata (1.5 MB) holds joining types, Unicode categories, and BIDI
classes per codepoint. idna.uts46data (3 MB) holds UTS #46 case mappings
and per-character status (valid / disallowed / deviation / ignored / mapped).
Combined, the embedded data is ~4.5 MB — the biggest cost of bundling
idna, but it lets the library work with no external dependencies.
When you do requests.get("https://münchen.de"):
requestsconstructs aurllib3.HTTPSConnectionPoolwithhost="münchen.de"urllib3passes the host toidna.encode(...)→b'xn--mnchen-3ya.de'urllib3opens a TCP socket to that ASCII hostname- Server responds;
requestsdecodes the body viacharset_normalizer
Both idna and charset_normalizer are essential AND silent — you'll only
notice them when something goes wrong (a server with Content-Type: text/html no charset, or a hostname with disallowed characters).
- 100% pure Python with embedded Unicode tables — no native code, no platform-specific paths
- ~4.5 MB on disk (mostly the
idnadata+uts46datatables) — biggest cost item in this library - Works identically on iOS as on any other platform
- requests.md — primary consumer
- encoding.md — TOC linking this + charset_normalizer
- charset-normalizer.md — the other "requests pulls it in" encoding library