Replicating Introspection on Injected Content in Open-Source Language Models
This is my attempt to reproduce concept injection locally and to investigate whether small open-source models can genuinely introspect, detecting concepts artificially injected into their internal activations.
Dec 02, 2025
10 min read