Infrastructure for capturing LLM activations and SAE (Sparse Autoencoder) features, training probes for prompt-maliciousness detection, and evaluating out-of-distribution generalization with Leave-One-Dataset-Out (LODO).
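As a rough illustration of what LODO evaluation of a maliciousness probe can look like, here is a minimal sketch; the dataset names, random activation arrays, and logistic-regression probe are illustrative assumptions, not the repository's actual data or API.

```python
# Hypothetical sketch of Leave-One-Dataset-Out (LODO) probe evaluation.
# Assumes activations were already captured as NumPy arrays; dataset names
# and the probe choice are placeholders, not the repo's actual pipeline.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# activations -> (n_prompts, d_model), labels -> (n_prompts,)
datasets = {
    "jailbreaks":  (np.random.randn(200, 768), np.random.randint(0, 2, 200)),
    "injections":  (np.random.randn(200, 768), np.random.randint(0, 2, 200)),
    "benign_chat": (np.random.randn(200, 768), np.random.randint(0, 2, 200)),
}

for held_out in datasets:
    # Train on every dataset except the held-out one, test on the held-out one.
    X_train = np.concatenate([X for name, (X, _) in datasets.items() if name != held_out])
    y_train = np.concatenate([y for name, (_, y) in datasets.items() if name != held_out])
    X_test, y_test = datasets[held_out]

    probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    auc = roc_auc_score(y_test, probe.predict_proba(X_test)[:, 1])
    print(f"LODO fold (held out: {held_out}): AUROC = {auc:.3f}")
```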
Transformers carry internal error signals long before output. Architecture determines whether those signals are linearly monitorable or effectively hidden.
Evaluation framework for different methods of probing and steering LLM activations to mitigate Chain-of-Thought unfaithfulness. Research project by Giovanni M. Occhipinti (University of Bologna), Alessandro Abate and Nandi Schoots (University of Oxford).
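For intuition on what steering LLM activations can mean in practice, the sketch below adds a fixed direction to one layer's residual stream via a PyTorch forward hook; the model name, layer index, scaling factor, and steering vector are placeholders, not the project's actual method.

```python
# Hypothetical sketch: steer a model by adding a direction to one layer's
# hidden states through a forward hook. Model, layer, and vector are
# placeholders, not the project's actual configuration.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder model
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

steering_vector = torch.randn(model.config.hidden_size)  # e.g. a probe direction
alpha, layer_idx = 4.0, 6

def add_steering(module, inputs, output):
    # GPT-2 blocks return a tuple; the hidden states are the first element.
    hidden = output[0] + alpha * steering_vector.to(output[0].dtype)
    return (hidden,) + output[1:]

handle = model.transformer.h[layer_idx].register_forward_hook(add_steering)
ids = tok("The chain of thought says", return_tensors="pt")
out = model.generate(**ids, max_new_tokens=20)
print(tok.decode(out[0], skip_special_tokens=True))
handle.remove()
```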