Infrastructure for capturing LLM activations and SAE (Sparse Autoencoder) features, training probes for prompt-maliciousness detection, and evaluating out-of-distribution generalization with Leave-One-Dataset-Out (LODO).
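As a rough illustration of what LODO evaluation of a maliciousness probe can look like, here is a minimal sketch; the dataset names, random activation arrays, and logistic-regression probe are illustrative assumptions, not the repository's actual data or API.

```python
# Hypothetical sketch of Leave-One-Dataset-Out (LODO) probe evaluation.
# Assumes activations were already captured as NumPy arrays; dataset names
# and the probe choice are placeholders, not the repo's actual pipeline.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# activations -> (n_prompts, d_model), labels -> (n_prompts,)
datasets = {
    "jailbreaks":  (np.random.randn(200, 768), np.random.randint(0, 2, 200)),
    "injections":  (np.random.randn(200, 768), np.random.randint(0, 2, 200)),
    "benign_chat": (np.random.randn(200, 768), np.random.randint(0, 2, 200)),
}

for held_out in datasets:
    # Train on every dataset except the held-out one, test on the held-out one.
    X_train = np.concatenate([X for name, (X, _) in datasets.items() if name != held_out])
    y_train = np.concatenate([y for name, (_, y) in datasets.items() if name != held_out])
    X_test, y_test = datasets[held_out]

    probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    auc = roc_auc_score(y_test, probe.predict_proba(X_test)[:, 1])
    print(f"LODO fold (held out: {held_out}): AUROC = {auc:.3f}")
```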
Transformers carry internal error signals long before output. Architecture determines whether those signals are linearly monitorable or effectively hidden.
Evaluation framework for different methods of probing and steering LLM activations to mitigate Chain-of-Thought unfaithfulness. Research project by Giovanni M. Occhipinti (University of Bologna), Alessandro Abate and Nandi Schoots (University of Oxford).
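For intuition on what steering LLM activations can mean in practice, the sketch below adds a fixed direction to one layer's residual stream via a PyTorch forward hook; the model name, layer index, scaling factor, and steering vector are placeholders, not the project's actual method.

```python
# Hypothetical sketch: steer a model by adding a direction to one layer's
# hidden states through a forward hook. Model, layer, and vector are
# placeholders, not the project's actual configuration.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder model
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

steering_vector = torch.randn(model.config.hidden_size)  # e.g. a probe direction
alpha, layer_idx = 4.0, 6

def add_steering(module, inputs, output):
    # GPT-2 blocks return a tuple; the hidden states are the first element.
    hidden = output[0] + alpha * steering_vector.to(output[0].dtype)
    return (hidden,) + output[1:]

handle = model.transformer.h[layer_idx].register_forward_hook(add_steering)
ids = tok("The chain of thought says", return_tensors="pt")
out = model.generate(**ids, max_new_tokens=20)
print(tok.decode(out[0], skip_special_tokens=True))
handle.remove()
```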