Vansh Vazirani


Replicating Introspection on Injected Content in Open-Source Language Models

This is my attempt to reproduce concept injection locally and to investigate whether small open-source models can genuinely introspect and detect concepts artificially injected into their internal states.

Dec 02, 2025
10 min read

Categories

Interpretability, Reasoning models