Commit 7976dc3
Use two-level dedup in named data store (#18351)
Replace the single SHA-256 hash with a two-level approach:
1. Fast fingerprint (length + first 32 bytes) for cheap rejection
2. SHA-256 only when the fingerprint matches, to confirm a duplicate
without a full byte comparison
For a 35B MoE model with ~29 GB of named data where most buffers
are unique, the fingerprint rejects non-matches instantly. SHA-256
is only computed on the rare fingerprint match, avoiding the ~98s
cost of hashing everything upfront.
Fingerprint collisions are handled by storing a list of candidate
buffer indices per fingerprint, so no dedup opportunities are lost.
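
The two-level scheme described above can be sketched roughly as follows. This is a minimal illustration, not the code from this commit: the class name `TwoLevelDedupStore`, its methods, and the `hash_calls` counter are hypothetical, and caching the digests of stored buffers lazily is an assumption about one reasonable way to do it.

```python
import hashlib
from typing import Dict, List, Tuple

# (length, first 32 bytes), as described in the commit message.
Fingerprint = Tuple[int, bytes]


class TwoLevelDedupStore:
    """Hypothetical sketch; not the real named data store API."""

    def __init__(self) -> None:
        self.buffers: List[bytes] = []
        # fingerprint -> indices of all buffers sharing that fingerprint
        self._candidates: Dict[Fingerprint, List[int]] = {}
        # buffer index -> SHA-256 digest, computed only when needed
        self._digests: Dict[int, bytes] = {}
        # Counts SHA-256 invocations, for illustration only.
        self.hash_calls = 0

    @staticmethod
    def _fingerprint(data: bytes) -> Fingerprint:
        return (len(data), data[:32])

    def _sha256(self, data: bytes) -> bytes:
        self.hash_calls += 1
        return hashlib.sha256(data).digest()

    def _digest_of(self, idx: int) -> bytes:
        # Hash a stored buffer the first time it is compared, then cache it.
        if idx not in self._digests:
            self._digests[idx] = self._sha256(self.buffers[idx])
        return self._digests[idx]

    def add(self, data: bytes) -> int:
        """Return the index of the buffer holding `data`, deduplicating."""
        fp = self._fingerprint(data)
        candidates = self._candidates.setdefault(fp, [])
        new_idx = len(self.buffers)
        if candidates:
            # Level 2: the fingerprint matched, so confirm with SHA-256.
            digest = self._sha256(data)
            for idx in candidates:
                if self._digest_of(idx) == digest:
                    return idx  # true duplicate
            # Fingerprint collision with different content: keep the digest.
            self._digests[new_idx] = digest
        # Level 1 rejection (or a collision): store as a new buffer.
        self.buffers.append(data)
        candidates.append(new_idx)
        return new_idx
```

In this sketch, adding a buffer whose fingerprint has never been seen performs no hashing at all. Two buffers of equal length that share their first 32 bytes land in the same candidate list but receive separate indices, and a later true duplicate of either one still resolves to the original index.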
Test plan:
- All 12 tests pass in test_named_data_store.py
- Added test_fingerprint_collision: same fingerprint, different
content produces separate buffers
- Added test_fingerprint_collision_with_dedup: after a collision,
a true duplicate of an earlier blob still dedupes correctly
1 parent 4f80b77 commit 7976dc3
2 files changed
Lines changed: 73 additions & 28 deletions