
Commit 7b208bc

Mazyod and claude committed

feat: add normalized semantic tokens with canonical legend

Add a normalized semantic tokens layer that allows Monaco/editors to use a single fixed legend regardless of which backend (Pyright, Pyrefly, ty) is active.

- Add lsp_types/semantic_tokens.py with CANONICAL_LEGEND and normalization
- Add get_semantic_tokens(normalize=True) parameter to Session
- Add canonical_legend and backend_legend properties to Session
- Capture server legend during Session initialization
- Add get_semantic_tokens_legend() to LSPBackend protocol
- Export CANONICAL_LEGEND from lsp_types

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

1 parent 0158416 commit 7b208bc

11 files changed

Lines changed: 408 additions & 17 deletions

File tree

CLAUDE.md

Lines changed: 8 additions & 0 deletions
```diff
@@ -70,6 +70,14 @@ This is a minimal-dependency Python library providing typed LSP (Language Server
 - Reusable across different LSP implementations (not just Pyright)
 - Handles process lifecycle: creation, reuse, idle cleanup, and shutdown
 
+**Semantic Tokens Normalization (`lsp_types/semantic_tokens.py`)**
+- `CANONICAL_LEGEND`: Fixed canonical legend for Monaco/editor integration
+- `CANONICAL_TOKEN_TYPES`, `CANONICAL_TOKEN_MODIFIERS`: LSP standard types/modifiers plus backend-specific
+- `build_type_mapping()`, `build_modifier_mapping()`: Create index mapping tables
+- `normalize_tokens()`: Remap token indices from backend-specific to canonical legend
+- `PYREFLY_LEGEND`: Hardcoded legend for Pyrefly (doesn't advertise via LSP)
+- Used by `Session.get_semantic_tokens(normalize=True)` for backend-agnostic tokens
+
 **Backend Integrations**
 
 **Pyright Integration (`lsp_types/pyright/`)**
```

docs/SEMANTIC_TOKENS.md

Lines changed: 59 additions & 0 deletions
@@ -205,6 +205,65 @@ The token types and modifiers must be registered in the **exact same order** as

---

## Normalized Semantic Tokens API

The library provides a **normalized tokens API** that remaps token indices to a canonical legend. This allows Monaco/editors to use a single fixed legend regardless of which backend is active.

### The Problem

Each backend has a different legend ordering:

| Token | Pyright Index | Pyrefly Index | ty Index |
|-------|---------------|---------------|----------|
| `namespace` | 0 | 0 | 0 |
| `class` | 2 | 2 | 1 |
| `variable` | 6 | 8 | 5 |
| `function` | 9 | 12 | 7 |

A Monaco client configured with one legend breaks when switching backends.

### The Solution

Use the `normalize=True` parameter to get tokens with indices remapped to the canonical legend:

```python
from lsp_types import Session, CANONICAL_LEGEND
from lsp_types.pyright.backend import PyrightBackend

session = await Session.create(PyrightBackend(), initial_code="x = 1")

# Original tokens (backend-specific indices)
raw = await session.get_semantic_tokens()

# Normalized tokens (canonical indices matching CANONICAL_LEGEND)
normalized = await session.get_semantic_tokens(normalize=True)

# Monaco uses one fixed legend for all backends
monaco_legend = CANONICAL_LEGEND
```

### Available Properties

```python
session.canonical_legend  # The canonical legend (fixed, same for all backends)
session.backend_legend    # The original legend from the server/backend
```

### Canonical Legend Order

The canonical legend follows LSP standard ordering, with backend-specific tokens appended:

**Token Types (index 0-26):**
- 0-22: LSP standard types (namespace, type, class, enum, interface, struct, typeParameter, parameter, variable, property, enumMember, event, function, method, macro, keyword, modifier, comment, string, number, regexp, operator, decorator)
- 23: label (LSP standard)
- 24-26: Backend-specific (selfParameter, clsParameter, builtinConstant)

**Token Modifiers (bit 0-12):**
- 0-9: LSP standard modifiers (declaration, definition, readonly, static, deprecated, abstract, async, modification, documentation, defaultLibrary)
- 10-12: Backend-specific (builtin, classMember, parameter)

---

## Updating This Document

Run the extraction script to get the latest legends:
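Whatever legend is in effect, the raw `data` array still uses the standard LSP delta encoding (five integers per token). The decoder below is a standalone sketch of how indices and modifier bits resolve to names; the abbreviated toy legend here is illustrative and is not the library's API:

```python
# Decode an LSP semantic token stream (groups of 5 ints) against a legend.
# Abbreviated toy legend for illustration -- the real canonical legend has
# 27 token types and 13 modifier bits.
TOKEN_TYPES = {0: "namespace", 8: "variable", 12: "function"}
MODIFIERS = ["declaration", "definition", "readonly"]

def decode(data: list[int]) -> list[tuple[int, int, int, str, list[str]]]:
    tokens = []
    line = col = 0
    for i in range(0, len(data), 5):
        delta_line, delta_start, length, type_idx, mod_bits = data[i:i + 5]
        if delta_line:
            line += delta_line
            col = delta_start  # deltaStart is absolute on a new line
        else:
            col += delta_start  # relative to the previous token on the same line
        mods = [m for b, m in enumerate(MODIFIERS) if mod_bits & (1 << b)]
        tokens.append((line, col, length, TOKEN_TYPES.get(type_idx, "?"), mods))
    return tokens

# A declared variable at (0, 0) and a function reference at (0, 4)
print(decode([0, 0, 1, 8, 0b001, 0, 4, 3, 12, 0]))
# -> [(0, 0, 1, 'variable', ['declaration']), (0, 4, 3, 'function', [])]
```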

examples/extract_semantic_legends.py

Lines changed: 15 additions & 11 deletions
```diff
@@ -65,22 +65,26 @@ async def extract_legend(
 
     if semantic_provider is None:
         # Try requesting tokens anyway - some servers respond without advertising
-        await process.notify.did_open_text_document({
-            "textDocument": {
-                "uri": f"file://{base_path}/test.py",
-                "languageId": types.LanguageKind.Python,
-                "version": 1,
-                "text": "x = 1\n",
+        await process.notify.did_open_text_document(
+            {
+                "textDocument": {
+                    "uri": f"file://{base_path}/test.py",
+                    "languageId": types.LanguageKind.Python,
+                    "version": 1,
+                    "text": "x = 1\n",
+                }
             }
-        })
+        )
         tokens = await asyncio.wait_for(
-            process.send.semantic_tokens_full({
-                "textDocument": {"uri": f"file://{base_path}/test.py"}
-            }),
+            process.send.semantic_tokens_full(
+                {"textDocument": {"uri": f"file://{base_path}/test.py"}}
+            ),
             timeout=5.0,
         )
         if tokens and tokens.get("data"):
-            print(f"  {backend_name}: No legend advertised, but returns tokens (unusable without legend)")
+            print(
+                f"  {backend_name}: No legend advertised, but returns tokens (unusable without legend)"
+            )
         else:
             print(f"  {backend_name}: No semantic tokens provider")
         return None
```
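The "unusable without legend" caveat in the script is worth making concrete: the same `data` array resolves to different token types under different legends, so tokens from a server with no known legend cannot be interpreted. The two legends below are hypothetical:

```python
# One token whose typeIndex is 3; its meaning depends entirely on the legend.
data = [0, 0, 5, 3, 0]  # deltaLine, deltaStart, length, typeIndex, modifiers

legend_a = ["namespace", "type", "class", "enum"]          # hypothetical server A
legend_b = ["namespace", "class", "variable", "function"]  # hypothetical server B

print(legend_a[data[3]])  # -> enum
print(legend_b[data[3]])  # -> function
```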

lsp_types/__init__.py

Lines changed: 1 addition & 0 deletions
```diff
@@ -2,6 +2,7 @@
 
 from . import methods  # noqa: F401
 from .requests import *  # noqa: F401, F403
+from .semantic_tokens import CANONICAL_LEGEND  # noqa: F401
 from .session import *  # noqa: F401, F403
 from .types import *  # noqa: F401, F403
```
78

lsp_types/pyrefly/backend.py

Lines changed: 5 additions & 0 deletions
```diff
@@ -7,6 +7,7 @@
 import lsp_types
 from lsp_types import types
 from lsp_types.process import ProcessLaunchInfo
+from lsp_types.semantic_tokens import PYREFLY_LEGEND
 from lsp_types.session import LSPBackend
 
 from .config_schema import Model as PyreflyConfig
@@ -79,3 +80,7 @@ def get_workspace_settings(
     ) -> types.DidChangeConfigurationParams:
         """Get workspace settings for didChangeConfiguration"""
         return {"settings": options}
+
+    def get_semantic_tokens_legend(self) -> types.SemanticTokensLegend | None:
+        """Pyrefly doesn't advertise legend via LSP, return hardcoded legend."""
+        return PYREFLY_LEGEND
```

lsp_types/pyright/backend.py

Lines changed: 4 additions & 0 deletions
```diff
@@ -79,3 +79,7 @@ def get_workspace_settings(
     ) -> types.DidChangeConfigurationParams:
         """Get workspace settings for didChangeConfiguration"""
         return {"settings": options}
+
+    def get_semantic_tokens_legend(self) -> types.SemanticTokensLegend | None:
+        """Pyright advertises legend via LSP, use server-provided."""
+        return None
```

lsp_types/semantic_tokens.py

Lines changed: 185 additions & 0 deletions
@@ -0,0 +1,185 @@
```python
"""Canonical semantic token legend and normalization utilities."""

from __future__ import annotations

from . import types

# Canonical token types (LSP standard order, then backend-specific)
CANONICAL_TOKEN_TYPES: list[str] = [
    # LSP standard (SemanticTokenTypes enum order)
    "namespace",      # 0
    "type",           # 1
    "class",          # 2
    "enum",           # 3
    "interface",      # 4
    "struct",         # 5
    "typeParameter",  # 6
    "parameter",      # 7
    "variable",       # 8
    "property",       # 9
    "enumMember",     # 10
    "event",          # 11
    "function",       # 12
    "method",         # 13
    "macro",          # 14
    "keyword",        # 15
    "modifier",       # 16
    "comment",        # 17
    "string",         # 18
    "number",         # 19
    "regexp",         # 20
    "operator",       # 21
    "decorator",      # 22
    "label",          # 23 (LSP standard)
    # Backend-specific (appended)
    "selfParameter",    # 24 (pyright, ty)
    "clsParameter",     # 25 (pyright, ty)
    "builtinConstant",  # 26 (ty)
]

# Canonical token modifiers (LSP standard order, then backend-specific)
CANONICAL_TOKEN_MODIFIERS: list[str] = [
    # LSP standard (SemanticTokenModifiers enum order)
    "declaration",     # bit 0
    "definition",      # bit 1
    "readonly",        # bit 2
    "static",          # bit 3
    "deprecated",      # bit 4
    "abstract",        # bit 5
    "async",           # bit 6
    "modification",    # bit 7
    "documentation",   # bit 8
    "defaultLibrary",  # bit 9
    # Backend-specific (appended)
    "builtin",      # bit 10 (pyright)
    "classMember",  # bit 11 (pyright)
    "parameter",    # bit 12 (pyright - modifier, not to be confused with type)
]

# The canonical legend for Monaco/editor integration
CANONICAL_LEGEND: types.SemanticTokensLegend = {
    "tokenTypes": CANONICAL_TOKEN_TYPES,
    "tokenModifiers": CANONICAL_TOKEN_MODIFIERS,
}

# Build lookup tables for canonical indices
_CANONICAL_TYPE_INDEX: dict[str, int] = {
    name: idx for idx, name in enumerate(CANONICAL_TOKEN_TYPES)
}
_CANONICAL_MODIFIER_INDEX: dict[str, int] = {
    name: idx for idx, name in enumerate(CANONICAL_TOKEN_MODIFIERS)
}

# Pyrefly legend (server doesn't advertise it via LSP)
# Source: https://github.com/facebook/pyrefly/blob/main/pyrefly/lib/state/semantic_tokens.rs
PYREFLY_LEGEND: types.SemanticTokensLegend = {
    "tokenTypes": [
        "namespace",      # 0
        "type",           # 1
        "class",          # 2
        "enum",           # 3
        "interface",      # 4
        "struct",         # 5
        "typeParameter",  # 6
        "parameter",      # 7
        "variable",       # 8
        "property",       # 9
        "enumMember",     # 10
        "event",          # 11
        "function",       # 12
        "method",         # 13
        "macro",          # 14
        "keyword",        # 15
        "modifier",       # 16
        "comment",        # 17
        "string",         # 18
        "number",         # 19
        "regexp",         # 20
        "operator",       # 21
        "decorator",      # 22
    ],
    "tokenModifiers": [
        "declaration",     # bit 0
        "definition",      # bit 1
        "readonly",        # bit 2
        "static",          # bit 3
        "deprecated",      # bit 4
        "abstract",        # bit 5
        "async",           # bit 6
        "modification",    # bit 7
        "documentation",   # bit 8
        "defaultLibrary",  # bit 9
    ],
}


def build_type_mapping(backend_legend: types.SemanticTokensLegend) -> dict[int, int]:
    """Build mapping from backend token type indices to canonical indices."""
    mapping: dict[int, int] = {}
    for backend_idx, type_name in enumerate(backend_legend["tokenTypes"]):
        canonical_idx = _CANONICAL_TYPE_INDEX.get(type_name, -1)
        mapping[backend_idx] = canonical_idx
    return mapping


def build_modifier_mapping(
    backend_legend: types.SemanticTokensLegend,
) -> dict[int, int]:
    """Build mapping from backend modifier bit positions to canonical positions."""
    mapping: dict[int, int] = {}
    for backend_bit, modifier_name in enumerate(backend_legend["tokenModifiers"]):
        canonical_bit = _CANONICAL_MODIFIER_INDEX.get(modifier_name, -1)
        mapping[backend_bit] = canonical_bit
    return mapping


def normalize_tokens(
    tokens: types.SemanticTokens,
    type_map: dict[int, int],
    modifier_map: dict[int, int],
) -> types.SemanticTokens:
    """Remap token indices to use canonical legend."""
    data = tokens.get("data", [])
    if not data:
        return tokens

    # Each token is 5 integers: deltaLine, deltaStart, length, typeIndex, modifiers
    normalized_data: list[int] = []

    for i in range(0, len(data), 5):
        if i + 4 >= len(data):
            break  # Incomplete token data

        delta_line = data[i]
        delta_start = data[i + 1]
        length = data[i + 2]
        type_index = data[i + 3]
        modifier_bits = data[i + 4]

        # Remap token type index
        canonical_type = type_map.get(type_index, type_index)
        if canonical_type == -1:
            canonical_type = type_index  # Keep original if unknown

        # Remap modifier bitmask
        canonical_modifiers = 0
        for backend_bit, canonical_bit in modifier_map.items():
            if modifier_bits & (1 << backend_bit):
                if canonical_bit >= 0:
                    canonical_modifiers |= 1 << canonical_bit

        normalized_data.extend(
            [
                delta_line,
                delta_start,
                length,
                canonical_type,
                canonical_modifiers,
            ]
        )

    result: types.SemanticTokens = {"data": normalized_data}
    if "resultId" in tokens:
        result["resultId"] = tokens["resultId"]

    return result
```
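A minimal end-to-end run of the pipeline above, using toy legends rather than the real backend legends (the mapping and remapping logic is restated inline so the sketch runs standalone):

```python
# Toy canonical order and a backend whose legend is a reordered subset.
CANONICAL_TYPES = ["namespace", "class", "variable", "function"]
CANONICAL_MODIFIERS = ["declaration", "readonly", "async"]

backend_legend = {
    "tokenTypes": ["variable", "function"],      # backend indices 0, 1
    "tokenModifiers": ["async", "declaration"],  # backend bits 0, 1
}

# Same shape as build_type_mapping() / build_modifier_mapping() output
type_map = {
    i: CANONICAL_TYPES.index(n) if n in CANONICAL_TYPES else -1
    for i, n in enumerate(backend_legend["tokenTypes"])
}
modifier_map = {
    i: CANONICAL_MODIFIERS.index(n) if n in CANONICAL_MODIFIERS else -1
    for i, n in enumerate(backend_legend["tokenModifiers"])
}

# One token: typeIndex 1 ("function"), backend modifier bit 0 ("async") set
data = [0, 4, 3, 1, 0b01]

out: list[int] = []
for i in range(0, len(data), 5):
    dl, ds, ln, t, mods = data[i:i + 5]
    new_t = type_map.get(t, t)
    if new_t == -1:
        new_t = t  # keep unknown types as-is
    new_mods = 0
    for b_bit, c_bit in modifier_map.items():
        if mods & (1 << b_bit) and c_bit >= 0:
            new_mods |= 1 << c_bit
    out.extend([dl, ds, ln, new_t, new_mods])

print(out)  # -> [0, 4, 3, 3, 4]  (type 1 -> 3 "function"; "async" bit 0 -> bit 2)
```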
