Findings
This is where I'll be saving links to interesting things I stumble across on the internet. You'll find cool research papers on LLMs that have caught my attention, fascinating tech discoveries, and other curiosities worth preserving.
DFlash: Block Diffusion for Flash Speculative Decoding
DFlash uses a lightweight block diffusion model as a drafter for speculative decoding: the drafter is conditioned on the K/V cache extracted directly from the target LLM during inference and predicts whole blocks of future tokens at once. It reportedly achieves a ~6.17x lossless speed-up for Qwen3-8B.
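Not DFlash's actual implementation, but a minimal sketch of the draft-and-verify loop that speculative decoding is built on, assuming hypothetical `target_model` and `drafter` callables that return logits. The drafter below proposes tokens one at a time for simplicity; DFlash replaces it with a block diffusion model conditioned on the target's K/V cache so the whole block is proposed in parallel.

```python
import torch


def speculative_step(target_model, drafter, tokens: torch.Tensor, block_size: int = 8) -> torch.Tensor:
    """One draft-and-verify step. `tokens` is a [1, seq_len] tensor of token ids."""
    # 1) Drafter proposes `block_size` future tokens (greedy, token by token here).
    draft = tokens
    for _ in range(block_size):
        logits = drafter(draft)                                 # [1, seq, vocab]
        next_tok = logits[:, -1].argmax(dim=-1, keepdim=True)   # [1, 1]
        draft = torch.cat([draft, next_tok], dim=-1)

    # 2) Target scores the entire draft in a single forward pass.
    target_logits = target_model(draft)                         # [1, seq + block, vocab]
    # Target's greedy choice at every drafted position.
    target_choice = target_logits[:, tokens.shape[1] - 1 : -1].argmax(dim=-1)  # [1, block]
    proposed = draft[:, tokens.shape[1]:]                       # [1, block]

    # 3) Accept the longest prefix where drafter and target agree, so the output
    #    matches greedy decoding of the target alone (this is the "lossless" part).
    n_accept = 0
    for i in range(block_size):
        if proposed[0, i].item() != target_choice[0, i].item():
            break
        n_accept += 1

    if n_accept == block_size:
        # Whole block accepted: also take the target's "bonus" next token.
        bonus = target_logits[:, -1].argmax(dim=-1, keepdim=True)
        return torch.cat([draft, bonus], dim=-1)

    # Keep accepted tokens plus the target's correction at the first mismatch.
    correction = target_choice[:, n_accept : n_accept + 1]
    return torch.cat([draft[:, : tokens.shape[1] + n_accept], correction], dim=-1)
```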
The Universal Weight Subspace Hypothesis
This research from JHU presents groundbreaking evidence that deep neural networks trained on vastly different tasks converge to shared, low-dimensional parametric subspaces. LoRA fine-tuning already updates only a low-rank slice of the weights; this paper suggests that even those low-rank updates are largely redundant across different tasks (a toy illustration follows after the notes below).
- They show that a subspace learned from one set of tasks can be effectively applied to adapt models to completely different distributions with minimal performance loss.
- The paper provides mathematical analysis to explain why these subspaces emerge, linking them to the spectral properties of the model's Hessian and the pre-training data distribution.
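A toy numerical illustration, not the paper's method: plant a shared k-dimensional subspace in a set of synthetic "weight updates", recover it with a plain SVD, and check how much of a held-out task's update that basis explains. All shapes, the rank, and the noise level are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_tasks, k = 4096, 32, 8            # flattened update size, number of tasks, subspace rank

# Synthetic data with a planted shared structure: every task's flattened
# low-rank update lives mostly in the same k-dimensional subspace plus noise.
true_basis = np.linalg.qr(rng.standard_normal((d, k)))[0]       # [d, k], orthonormal columns
updates = rng.standard_normal((n_tasks, k)) @ true_basis.T      # [n_tasks, d]
updates += 0.01 * rng.standard_normal((n_tasks, d))             # small off-subspace noise

# Recover a shared basis from the observed updates via SVD.
_, _, vt = np.linalg.svd(updates, full_matrices=False)
basis = vt[:k]                                                   # [k, d], rows are orthonormal directions

# Held-out task drawn from the same structure: how much does the basis explain?
new_update = true_basis @ rng.standard_normal(k) + 0.01 * rng.standard_normal(d)
coords = basis @ new_update                                      # coordinates in the recovered subspace
explained = np.linalg.norm(basis.T @ coords) ** 2 / np.linalg.norm(new_update) ** 2
print(f"fraction of the held-out update's energy in the recovered subspace: {explained:.3f}")
```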
Emergent Introspective Awareness in LLMs
Research report from Anthropic showing that language models can exhibit a degree of introspective awareness of their internal states. Under certain conditions, Claude models are able to notice and identify concept vectors injected into their activations.
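A rough sketch of the kind of concept injection described here, done in the style of open-source activation steering: add a fixed direction to one layer's residual-stream output via a PyTorch forward hook. The model handle and layer path in the usage comment are hypothetical; Anthropic's actual setup on Claude is internal and not reproduced here.

```python
import torch


def inject_concept(model, layer, concept_vec: torch.Tensor, scale: float = 4.0):
    """Register a forward hook that adds `scale * concept_vec` ([d_model]) to the layer's output."""
    def hook(_module, _inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + scale * concept_vec.to(hidden.dtype).to(hidden.device)
        # Returning a value from a forward hook replaces the module's output.
        return (hidden, *output[1:]) if isinstance(output, tuple) else hidden

    return layer.register_forward_hook(hook)


# Usage (hypothetical HuggingFace-style model): inject at a middle layer, then
# ask the model whether it notices anything unusual about its own processing.
# handle = inject_concept(model, model.model.layers[16], concept_vec)
# ... run generation, then: handle.remove()
```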
Llama Scope: Extracting Millions of Features from Llama-3.1-8B with Sparse Autoencoders
This paper releases 256 open-source SAEs trained on every layer and sublayer of Llama-3.1-8B, covering residual streams, attention outputs, MLP outputs, and transcoders at both 32K and 128K feature widths.
- Sparse Autoencoders (SAEs) are unsupervised machine learning methods designed to extract interpretable features from neural networks by addressing superposition of features.
- The paper employs TopK SAEs, an improved variant that directly selects the K highest-activating features rather than relying on L1 penalties. This, along with other improvements, results in high reconstruction quality while achieving ~3x better sparsity (L0 ≈ 50 vs. 150) compared to state-of-the-art JumpReLU SAEs (a minimal sketch of the TopK mechanism follows below).
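A minimal TopK SAE sketch in the spirit of the setup described above; the dimensions and k below are illustrative defaults, not the released Llama Scope configurations.

```python
import torch
import torch.nn as nn


class TopKSAE(nn.Module):
    def __init__(self, d_model: int = 4096, d_features: int = 32768, k: int = 50):
        super().__init__()
        self.k = k
        self.encoder = nn.Linear(d_model, d_features)
        self.decoder = nn.Linear(d_features, d_model)

    def forward(self, x: torch.Tensor):
        # Encode, then keep only the k largest activations per token: sparsity is
        # enforced structurally instead of via an L1 penalty, so L0 is bounded by k.
        acts = torch.relu(self.encoder(x))
        topk = torch.topk(acts, self.k, dim=-1)
        sparse = torch.zeros_like(acts).scatter_(-1, topk.indices, topk.values)
        recon = self.decoder(sparse)
        return recon, sparse


sae = TopKSAE()
residual = torch.randn(8, 4096)           # a batch of residual-stream activations
recon, features = sae(residual)
print((features != 0).sum(dim=-1))        # number of active features per row (at most k)
```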
Attention Is All You Need
The seminal paper introducing the Transformer architecture.