<b>Figure 1. Overview.</b> Our method enables direct interpretation of vision encoder features through image reconstruction, revealing how different architectures internally represent visual information. We demonstrate this by (a) comparing feature informativeness between model families, (b) ranking encoders by their feature representation quality, and (c) showing how controlled feature space manipulations produce predictable image changes.
<p>We introduce a new approach to interpret vision encoder features through direct image reconstruction, providing insights into how these models internally represent visual information.</p>
</div>
</div>
<div class="box">
<div class="content">
<h4>📊 Model Family Comparison</h4>
<p>We reveal that encoders pre-trained on image-based tasks retain significantly more image information than those trained with contrastive objectives, as demonstrated by our SigLIP vs SigLIP2 analysis.</p>
</div>
</div>
<div class="box">
<div class="content">
<h4>🎨 Feature Space Control</h4>
<p>We demonstrate that orthogonal rotations in feature space control color encoding, enabling predictable image manipulations and revealing the structured nature of the feature representations.</p>
Our approach trains a decoder network to reconstruct the original image from its feature representation; reconstruction quality then serves as a quantitative measure of feature informativeness.
<b>Figure 1.</b> Our reconstruction framework trains a decoder to restore images from feature representations, enabling direct assessment of feature informativeness.
</p>
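<p>A minimal sketch of this training loop is shown below, assuming a frozen pretrained encoder that returns a grid of patch embeddings. The decoder architecture, loss, and hyperparameters are illustrative placeholders rather than the paper's exact configuration.</p>
<pre><code>import torch
import torch.nn as nn

class Reconstructor(nn.Module):
    """Toy decoder: maps patch embeddings back to RGB pixels."""
    def __init__(self, dim=768, patch=16, grid=14):
        super().__init__()
        self.grid, self.patch = grid, patch
        # Each patch embedding is decoded into a patch x patch RGB tile.
        self.head = nn.Sequential(
            nn.Linear(dim, 1024), nn.GELU(),
            nn.Linear(1024, 3 * patch * patch),
        )

    def forward(self, feats):  # feats: (B, grid*grid, dim)
        b = feats.shape[0]
        tiles = self.head(feats).view(
            b, self.grid, self.grid, 3, self.patch, self.patch)
        # Rearrange patch tiles into a full (B, 3, H, W) image.
        return tiles.permute(0, 3, 1, 4, 2, 5).reshape(
            b, 3, self.grid * self.patch, self.grid * self.patch)

decoder = Reconstructor()
opt = torch.optim.AdamW(decoder.parameters(), lr=1e-4)

def train_step(encoder, images):
    with torch.no_grad():              # the encoder stays frozen
        feats = encoder(images)
    loss = nn.functional.l1_loss(decoder(feats), images)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
</code></pre>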
</div>
</div>
<!-- Comparative Analysis -->
<div class="content mt-6">
<h3 class="title is-4">Comparative Analysis: SigLIP vs SigLIP2</h3>
<p class="has-text-justified">
We compare two related model families that differ only in their training objective: SigLIP (trained with contrastive learning) and SigLIP2 (trained on image-based tasks). This controlled comparison reveals how training objectives influence feature representations.
<b>Figure 2.</b> Reconstruction quality comparison between SigLIP and SigLIP2 across different image resolutions demonstrates that image-based training leads to more informative feature representations.
</p>
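<p>Reconstruction quality can be scored with standard image metrics. Below is a minimal PSNR-based sketch; the metric choice and the <code>encoders</code>/<code>decoders</code> dictionaries (one trained reconstructor per encoder) are illustrative assumptions.</p>
<pre><code>import torch

def psnr(x, y, max_val=1.0):
    """Peak signal-to-noise ratio for image batches scaled to [0, 1]."""
    mse = torch.mean((x - y) ** 2, dim=(1, 2, 3))
    return 10.0 * torch.log10(max_val ** 2 / mse)

def compare(encoders, decoders, loader):
    """Average PSNR per encoder over a dataset of images."""
    scores = {name: [] for name in encoders}
    with torch.no_grad():
        for images, _ in loader:
            for name, enc in encoders.items():
                recon = decoders[name](enc(images)).clamp(0, 1)
                scores[name].extend(psnr(recon, images).tolist())
    return {name: sum(v) / len(v) for name, v in scores.items()}
</code></pre>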
<!-- Framework visualization: generalize the operator in image space and in feature space -->
<!-- Examples of RGB manipulation -->
<!-- Examples of suppressing a single channel (yellowing) -->
<!-- Spectrum of such a matrix: show that only a small number of channels change -->

<!-- Feature Space Analysis -->
<section class="section">
<div class="container is-max-desktop">
<div class="columns is-centered">
<div class="column is-four-fifths">
<h2 class="title is-3 has-text-centered">Feature Space Analysis</h2>
<!-- Q Matrix Framework -->
<div class="content">
<h3 class="title is-4">Q Matrix: A Tool for Feature Manipulation</h3>
<p class="has-text-justified">
We introduce the Q matrix framework that enables controlled manipulation of feature representations. This orthogonal transformation matrix is learned to perform specific image manipulations, revealing how visual attributes are encoded in the feature space.
<b>Figure 4.</b> Once the Q matrix is calculated, it is applied to every patch embedding in the feature space, enabling controlled image manipulation.
</p>
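<p>One classical way to obtain such an orthogonal map from paired features is the orthogonal Procrustes solution sketched below: given patch embeddings of an image and of its edited counterpart, the closest orthogonal Q has a closed form via the SVD. This is an illustration of the mechanics under that assumption, not necessarily the exact procedure used in the paper.</p>
<pre><code>import torch

def fit_q(feats_src, feats_tgt):
    """Orthogonal Procrustes: the orthogonal Q minimizing
    ||feats_src @ Q - feats_tgt||_F over paired patch embeddings.
    feats_src, feats_tgt: (num_pairs, dim)."""
    m = feats_src.T @ feats_tgt        # (dim, dim) cross-correlation
    u, _, vh = torch.linalg.svd(m)
    return u @ vh                      # orthogonal by construction

def apply_q(feats, q):
    """Apply the same rotation to every patch embedding."""
    return feats @ q                   # (B, N, dim) @ (dim, dim)
</code></pre>
<p>Running the trained reconstructor on the rotated embeddings then yields the edited image, which is how the effect of a feature-space transformation can be verified visually.</p>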
</div>
</div>
</div>
<!-- Color Swap Examples -->
<div class="content mt-6">
<h3 class="title is-4">Color Swap Examples</h3>
<p class="has-text-justified">
Through our Q matrix framework, we demonstrate precise control over color attributes in the feature space. Our experiments reveal that color information is encoded through orthogonal rotations rather than spatial transformations.
<b>Figure 6.</b> Eigenvalues of the red-blue channel-swap matrix: most are close to 1, so the transformation acts as the identity on those feature directions, while a second cluster near -1 marks the directions that are flipped.
</p>
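<p>This eigenvalue structure is easy to inspect directly: an orthogonal matrix has eigenvalues on the unit circle, so a channel-swap Q should show a large cluster near +1 (untouched directions) and a small cluster near -1 (flipped directions). A quick check, assuming a matrix <code>q</code> as in the earlier sketch:</p>
<pre><code>import torch

def eig_summary(q, tol=1e-2):
    """Count eigenvalues of an orthogonal matrix near +1 and near -1."""
    ev = torch.linalg.eigvals(q)       # complex values on the unit circle
    near_pos = ((ev - 1.0).abs() &lt; tol).sum().item()
    near_neg = ((ev + 1.0).abs() &lt; tol).sum().item()
    print(f"near +1: {near_pos}, near -1: {near_neg}, dim: {q.shape[0]}")
</code></pre>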
</div>
</div>
</section>

<!-- Conclusion -->
<section class="section">
<div class="container is-max-desktop">
<div class="columns is-centered">
<div class="column is-four-fifths">
<h2 class="title is-3 has-text-centered">Conclusion</h2>
<div class="content has-text-justified">
<p>
Our work introduces a novel approach to understanding vision encoder features through image reconstruction. We demonstrate that:
</p>
<ul>
<li>Training objectives significantly impact how models internally represent visual information</li>
<li>Image-based pre-training leads to more informative feature representations than contrastive learning</li>
<li>Color information is encoded through orthogonal rotations in feature space</li>
<li>Our method provides a general framework for analyzing any vision encoder's feature representations</li>
</ul>
<p>
These findings have important implications for model design and provide new tools for understanding and controlling vision encoder behavior. Our approach opens new avenues for feature analysis and manipulation in vision models.
</p>
</div>
</div>
</div>
</div>
</section>