Engram: DeepSeek's Brilliant Architecture for Information Recall

Vansh Vazirani·Apr 12, 2026·LLM architecture

Engram is a new conditional memory module for the Transformer architecture proposed by DeepSeek which acts as a massive lookup table storing embedding vectors for known sequences of multiple tokens (N-grams). It functions by using a multi-head hashing mechanism to retrieve these memory vectors in $O(1)$ time and fuses them into the model's hidden state through a context-aware gate.

LLMs are incredibly heavyweight machines. When you give a model a prompt, it throws every token at a series of massive attention and feedforward layers to predict the next one - and it does this regardless of whether it's solving a complex quantum physics equation or simply recalling what the capital of France is. Even with optimisations like Mixture-of-Experts and speculative decoding, these models always have to perform the cumbersome process of reconstructing the meaning of sequences of tokens, layer by layer.

The problem runs deeper than raw compute, though. Even when knowledge is already encoded in the weights, a transformer still has to reconstruct its meaning at runtime. For single tokens this is relatively cheap, but when knowledge is spread across multiple tokens, the model has to spend computation building up a unified representation of that multi-token pattern before it can be used downstream. It has to do this every single time the same pattern appears. Performing this kind of neural computation solely to perform information recall is really inefficient, especially when it can be offloaded to much simpler mechanisms.

Lookup tables.

Specifically, hashed N-gram embedding tables – a structure so computationally cheap it predates neural networks entirely and responds in $O(1)$ time. DeepSeek's Engram takes this classical idea and integrates it into transformer layers as a fundamental architectural component. Understanding exactly how requires unpacking three things: how the lookup works, how the model decides whether to trust it, and how it integrates with the rest of the network.

The Engram Architecture

They say a picture is worth a thousand words - so here is a picturesque representation of Engram, presented to you by Claude and yours truly.

Hashing the context

For each token position, the Engram layer computes multiple hash indices – one for each N-gram order up to some configurable maximum. With a maximum of 3, for instance, every token produces two indices: one derived from the current and previous token (the 2-gram), and one derived from the current and two previous tokens (the 3-gram). Each index points to a specific row in a massive embedding table, and that row is retrieved directly.
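To make the indexing concrete, here is a minimal sketch of how one position could yield one index per N-gram order. The function names, the table size and the toy rolling hash are all assumptions chosen for illustration; the paper's actual hash function is not reproduced here.

```python
def ngram_suffixes(token_ids, position, max_order):
    """Return the N-grams (orders 2..max_order) that end at `position`.

    Orders that would reach before the start of the sequence are skipped.
    """
    grams = {}
    for n in range(2, max_order + 1):
        start = position - n + 1
        if start < 0:
            continue
        grams[n] = tuple(token_ids[start:position + 1])
    return grams


def hash_index(gram, table_size):
    """Toy rolling hash (an assumption, not Engram's real hash)."""
    h = 0
    for tok in gram:
        h = (h * 1000003 + tok + 1) % (2**61 - 1)
    return h % table_size


tokens = [17, 42, 8, 99]
grams = ngram_suffixes(tokens, position=3, max_order=3)
# With max_order=3 the last token produces a 2-gram and a 3-gram.
indices = {n: hash_index(g, table_size=1024) for n, g in grams.items()}
```

Each value in `indices` would then be used as a row number in the embedding table for that N-gram order.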

These multi-token embeddings are learned during training. Through standard backpropagation, the table rows addressed by frequently appearing token sequences are updated until they become useful for next-token prediction. So if a particular sequence of tokens appears frequently enough during training, its corresponding embedding vector gradually becomes a rich representation of that entity.

Before hashing, the tokens go through a compression step which collapses semantically equivalent tokens onto the same canonical ID. "By the way", "by the way", and "btw" all point to the same embedding. Because these N-grams address the same embedding, the learned vectors are updated more frequently during training, making the embedding more reliable.
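A rough sketch of what such a compression map might look like is shown here. The normalisation rule (Unicode NFKC folding, whitespace stripping, lowercasing) and the helper names are assumptions for illustration only. Folding an abbreviation onto its expansion would need an explicit alias table, which this sketch leaves out.

```python
import unicodedata


def canonical_form(token_text):
    """Assumed normalisation: NFKC fold, strip whitespace, lowercase."""
    return unicodedata.normalize("NFKC", token_text).strip().lower()


def build_compression_map(vocab):
    """Map every token ID to a canonical ID shared by equivalent tokens.

    `vocab` is a dict of token ID -> surface string.
    """
    canonical_ids = {}
    mapping = {}
    for token_id, text in sorted(vocab.items()):
        key = canonical_form(text)
        if key not in canonical_ids:
            canonical_ids[key] = len(canonical_ids)
        mapping[token_id] = canonical_ids[key]
    return mapping


vocab = {0: "Apple", 1: " apple", 2: "APPLE", 3: "pear"}
mapping = build_compression_map(vocab)
# Tokens 0, 1 and 2 share one canonical ID; token 3 gets its own.
```

The N-gram hash would then be computed over the canonical IDs rather than the raw token IDs, so all surface variants train the same table row.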

Assigning N-grams to embeddings in the lookup table is tricky, because giving every possible token sequence its own row would require a combinatorially large table. To address this, Engram hashes the compressed N-grams using multiple distinct hash functions. Each function independently maps the same N-gram to a slot in its own sub-table. A given 3-gram might collide with an unrelated sequence under one hash function, but the probability of colliding under all of them simultaneously is vanishingly small – so the combined result remains distinctive even when individual heads overlap. The retrieved vectors from all N-gram orders and all hash heads are then concatenated into a single memory vector, which is passed to the gating stage.
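The multi-head lookup can be sketched in a few lines. Everything here is a stand-in: the FNV-style hash, the per-head seeding, the table sizes and the random initial values are assumptions, and a real implementation would use learned tensors rather than Python lists.

```python
import random


def make_tables(num_heads, slots, dim, seed=0):
    """One small embedding sub-table per hash head, randomly initialised."""
    rng = random.Random(seed)
    return [
        [[rng.gauss(0.0, 0.02) for _ in range(dim)] for _ in range(slots)]
        for _ in range(num_heads)
    ]


def head_index(gram, head, slots):
    """FNV-style hash salted by the head number (illustrative only)."""
    h = 1469598103934665603 ^ (head * 0x9E3779B97F4A7C15)
    for tok in gram:
        h = ((h ^ (tok + 1)) * 1099511628211) % 2**64
    return h % slots


def lookup(gram, tables):
    """Fetch one row per head and concatenate them into one memory vector."""
    slots = len(tables[0])
    memory = []
    for head, table in enumerate(tables):
        memory.extend(table[head_index(gram, head, slots)])
    return memory


tables = make_tables(num_heads=4, slots=257, dim=8)
vec = lookup((42, 8, 99), tables)
# 4 heads x 8 dimensions -> a 32-dimensional memory vector
```

Because each head salts the hash differently, two N-grams that share a slot in one sub-table will usually land in different slots in the others.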

The gating mechanism

The raw retrieved vector represents the isolated N-gram representation learned during training, without any knowledge of context. Because phrases can have multiple possible representations depending on the context, we cannot trust the N-gram just yet. The Big Bang might mean the theory explaining the birth of the Universe or the popular sitcom. Hash collisions pose a similar problem: the retrieved vector might be meaningful for the phrase that most frequently occupies that slot, but noise for a rarer phrase that collided into it. So how does the model know when to trust Engram's retrieved embeddings?

Engram addresses this with a gating mechanism that operates like a stripped-down attention operation. The current hidden state $h_t$ – which has already passed through attention in previous layers and carries contextual information – acts as the query. The retrieved memory vector is projected into a key and a value. The dot product of query and key, squashed through a sigmoid, produces a scalar gate value $\alpha$ between 0 and 1, which is simply multiplied by the value projection. If the retrieved memory aligns with the current context, $\alpha$ opens. If it contradicts the context or is irrelevant, $\alpha$ closes and the module contributes almost nothing to the residual stream. A short depthwise causal convolution then refines the gated output before it's added back to $h_t$.
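Here is a small, dependency-free sketch of that gate. It assumes the query-key score is RMS-normalised, scaled by the square root of the hidden size and squashed with a sigmoid so that it lands between 0 and 1. The projection matrices `W_k` and `W_v` are hypothetical hand-written values, and the depthwise convolution is left out for brevity.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def matvec(W, v):
    return [dot(row, v) for row in W]


def rms_norm(v, eps=1e-6):
    scale = math.sqrt(sum(x * x for x in v) / len(v) + eps)
    return [x / scale for x in v]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gated_memory(h_t, memory, W_k, W_v):
    """Return (alpha, alpha * value) for one token position."""
    key = matvec(W_k, memory)
    value = matvec(W_v, memory)
    score = dot(rms_norm(h_t), rms_norm(key)) / math.sqrt(len(h_t))
    alpha = sigmoid(score)  # scalar gate in (0, 1)
    return alpha, [alpha * v for v in value]


W_k = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
W_v = [[0.5, 0.0, 0.0], [0.0, 0.0, 0.5]]
h_t = [1.0, 0.0]

# Memory whose key points the same way as h_t: the gate opens.
alpha_open, out_open = gated_memory(h_t, [1.0, 0.0, 2.0], W_k, W_v)
# Memory whose key points the opposite way: the gate closes.
alpha_closed, out_closed = gated_memory(h_t, [-1.0, 0.0, 2.0], W_k, W_v)

# Residual update for the aligned case.
new_h = [h + g for h, g in zip(h_t, out_open)]
```

The useful property is that a single scalar decides how much of the retrieved vector reaches the residual stream, so a colliding or out-of-context lookup gets scaled down rather than blindly added.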

With hashing and context-aware gating implemented, this is what Engram essentially looks like.

DeepSeek mHC

One subtle but important detail: DeepSeek uses mHC as their backbone, which expands the residual stream into multiple parallel branches. This turns out to work beautifully with Engram's gating – each branch independently decides how much to trust the retrieved memory using its own key projection, allowing for multiple perspectives of the embedding to be considered independently. Ablation experiments show that removing multi-branch integration causes the single largest performance decrease of any design choice tested, more damaging than even removing the gating mechanism itself. mHC was published in December 2025, and it's already proving indispensable for the next generation of sparse architectures. That a research technique this new is already being rigorously integrated and ablated – and proves to be of such paramount importance – says so much about the pace and depth of thinking at DeepSeek. Seriously, hats off to DeepSeek's researchers!

A New Axis of Sparsity

Currently, the prime design principle for scaling LLM capacity without proportionally scaling compute is sparsity, predominantly through Mixture-of-Experts. MoE scales a model's capacity via conditional computation: by activating only a small, input-dependent subset of FFN experts for each token, you can drastically increase model size without proportionally blowing up the compute budget.

DeepSeek proposes that Engram introduces a new dimension of sparsity to language models, which they call conditional memory. A model can maintain vast lookup tables storing embeddings for a plethora of possible multi-token sequences, and selectively retrieve very few of those embeddings on any given forward pass.

The sparsity allocation problem

These two dimensions of sparsity – MoE and Engram – can be seen as two independent ways of scaling model size without increasing compute costs. But in practice, when you have a fixed parameter budget, you need to decide how to distribute parameters between them. DeepSeek formalises this as the Sparsity Allocation problem. They hold total parameter count and per-token FLOPs constant, then define an allocation ratio $\rho$ representing the fraction of the inactive parameter budget assigned to MoE experts. At $\rho=1$ the model is a pure MoE; at $\rho=0$ all inactive parameters are Engram embeddings with no routed experts at all.
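The bookkeeping behind $\rho$ is simple enough to write down. This sketch uses made-up parameter counts (10B total, 1B active) purely to show the arithmetic; the function name and the numbers are not from the paper.

```python
def split_sparse_budget(total_params, active_params, rho):
    """Split the inactive parameter budget between MoE experts and Engram.

    rho is the fraction of inactive parameters given to MoE experts.
    """
    if not 0.0 <= rho <= 1.0:
        raise ValueError("rho must lie in [0, 1]")
    inactive = total_params - active_params
    moe = rho * inactive
    return {"moe_experts": moe, "engram_tables": inactive - moe}


# Hypothetical model: 10B total parameters, 1B active per token.
budget = split_sparse_budget(10e9, 1e9, rho=0.75)
pure_moe = split_sparse_budget(10e9, 1e9, rho=1.0)
pure_engram = split_sparse_budget(10e9, 1e9, rho=0.0)
```

Sweeping `rho` while keeping `total_params` and `active_params` fixed is the experiment the allocation ratio describes: every configuration has the same size and the same per-token cost, and only the split changes.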

What they found was a U-shaped curve between validation loss and $\rho$. Loss is minimised when roughly 20-25% of the sparse parameter budget is allocated to Engram (that is, $\rho$ of roughly 0.75 to 0.8), worsening as you approach either extreme. A pure MoE model wastes depth reconstructing meanings that could simply be looked up. A pure Engram model cannot reason as well without enough MoE parameters, because you can't always replace true reasoning with memory. The 20-25% optimum was stable across two different compute budgets ($2 \times 10^{20}$ and $6 \times 10^{20}$ FLOPs).

What happens if we scale to infinity?

Under a fixed budget, the U-shaped curve clearly tells us the optimal sparsity to allocate to Engram. But since retrieving an N-gram embedding costs O(1) regardless of how large the embedding table is, what happens if you keep growing the table beyond a normal parameter budget? To test this out, DeepSeek held the MoE backbone fixed and swept the number of Engram embedding slots from roughly 250,000 up to 10 million – adding up to around 13 billion additional parameters in the process. Validation loss followed a clean power law, falling log-linearly with the number of embedding slots and showing no signs of saturation in the explored range. All of those performance gains came at virtually no additional compute cost.
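For readers who want the shape spelled out: a power law in the number of slots $N$ has the form below, where $a$ and $b$ are fitted constants used here purely for illustration (they are not values from the paper).

$$L(N) = a \, N^{-b} \quad\Longrightarrow\quad \log L(N) = \log a - b \log N$$

Taking logs turns the curve into a straight line with slope $-b$, which is why such a trend looks linear on log-log axes, and why each tenfold increase in slots multiplies the loss by the same factor $10^{-b}$.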

Turns out memory isn't a problem

Scaling the embedding tables into the hundreds of billions of parameters is pretty alarming at first sight, but during inference, Engram's tables don't need to sit on-device at all. You can offload them to host memory (or to a different machine entirely) and prefetch them while the preceding transformer block is still computing, hiding the transfer delay almost entirely. The overhead clocks in at under 3% in DeepSeek's experiments even at 100B parameters. So the lookup is $O(1)$, and the memory is someone else's problem.
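The overlap trick can be illustrated with a toy thread-based sketch. This is only an analogy for the real host-to-device transfer: `HOST_TABLE`, `fetch_rows` and `preceding_block` are hypothetical stand-ins, and a production system would use asynchronous device copies rather than Python threads.

```python
from concurrent.futures import ThreadPoolExecutor

# Stand-in for an embedding table living in host memory.
HOST_TABLE = {i: [float(i)] * 4 for i in range(1000)}


def fetch_rows(indices):
    """Pretend host-memory read: gather the requested rows."""
    return [HOST_TABLE[i] for i in indices]


def preceding_block(hidden):
    """Pretend transformer block that runs before the Engram layer."""
    return [x * 2.0 for x in hidden]


def forward_with_prefetch(hidden, indices):
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(fetch_rows, indices)  # fetch starts right away
        hidden = preceding_block(hidden)            # compute overlaps the fetch
        rows = pending.result()                     # blocks only if fetch is slower
    return hidden, rows


hidden, rows = forward_with_prefetch([1.0, 2.0], [3, 7])
```

This works because the hash indices depend only on the token IDs, not on any hidden state, so the rows to fetch are known before the earlier layers have finished.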

Where should memory be injected?

The gating mechanism depends on $h_t$ being sufficiently contextualised – the hidden state needs enough signal from surrounding tokens so that the gate can judge whether a retrieved N-gram is actually relevant. But the whole point of injecting Engram early is to spare the backbone from reconstructing static patterns through multiple expensive layers. So Engram needs context to work – which is a bit awkward, given that its whole job is to arrive before the network has built up much context.

One round of attention, it turns out, is already sufficient. In DeepSeek's experiments, layer 2 turned out to be the best single placement, giving the hidden state just enough contextual grounding to gate reliably. Splitting the total parameter budget across two Engram layers – layers 2 and 6 – performs even better. This way, the first module handles the earliest patterns with a lightweight gate, and the second can use richer hidden states to fetch relevant embeddings and modulate them more precisely.

Engram as a depth multiplier

This placement strategy hints at how Engram actually influences a model's computation. Standard transformers have to reconstruct the meaning of multi-token entities layer by layer, leading to wasted depth. By short-circuiting that reconstruction, Engram effectively makes the model simulate a deeper network than its layer count implies. DeepSeek verified this using two mechanistic interpretability tools:

LogitLens – which projects each intermediate layer's hidden state through the LM head and measures the KL divergence against the final prediction – showed that Engram models reach high-confidence, prediction-ready representations much earlier than the MoE baseline.
Centered Kernel Alignment – a measure of the similarity between layers' hidden states – revealed the scale of this shift. Layer 5 of their experimental model, Engram-27B, aligns most closely with layer 12 of the MoE baseline. This suggests the model is effectively traversing the equivalent of seven extra layers of feature composition before it even reaches its fifth physical layer.
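Both measurements are easy to state in code. The sketch below is a plain-Python illustration, not DeepSeek's analysis pipeline: it assumes the KL divergence is taken from the final prediction to the intermediate one, uses the linear variant of CKA, and runs on tiny made-up activations.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def center_columns(X):
    n = len(X)
    means = [sum(col) / n for col in zip(*X)]
    return [[x - m for x, m in zip(row, means)] for row in X]


def cross_gram_sq_norm(X, Y):
    """Squared Frobenius norm of X^T Y (rows are samples)."""
    total = 0.0
    for x_col in zip(*X):
        for y_col in zip(*Y):
            total += sum(a * b for a, b in zip(x_col, y_col)) ** 2
    return total


def linear_cka(X, Y):
    """Linear CKA between two activation matrices with matching rows."""
    X, Y = center_columns(X), center_columns(Y)
    denom = math.sqrt(cross_gram_sq_norm(X, X) * cross_gram_sq_norm(Y, Y))
    return cross_gram_sq_norm(X, Y) / denom


# LogitLens-style check: how far is an early layer from the final prediction?
final_probs = softmax([4.0, 1.0, 0.0])
early_probs = softmax([1.0, 1.0, 1.0])
gap = kl_divergence(final_probs, early_probs)

# CKA check: a rescaled copy of the same activations scores (almost exactly) 1.
layer_a = [[1.0, 2.0], [3.0, 1.0], [0.0, 4.0], [2.0, 2.0]]
layer_b = [[2.0 * x for x in row] for row in layer_a]
similarity = linear_cka(layer_a, layer_b)
```

A smaller `gap` at a given layer means that layer is already close to the final answer, and comparing `linear_cka` across all layer pairs of two models is what produces the kind of layer-to-layer alignment described above.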

Conclusion

Engram offers a new perspective on building models at scale, showing that the dominant framing of language model scaling – more compute and more parameters – has been leaving a lot of performance on the table. The scaling curves are genuinely insightful, and indicate that the frontier model of the future will almost certainly be a hybrid: spending its parameter budget on dynamic computation where the task demands it, and on cheap lookup where it doesn't.

It's hard not to feel a little awestruck reading this. The sheer detail with which DeepSeek has chased down the idea, from architecture to scaling laws to mechanistic analysis to perfecting the design – all tightly integrated in one paper – is really rare to see. Whoever's behind Engram, hats off!

Thank you for reading,
Vansh