1. Introduction

Transformers provide accurate short-term memory through attention but suffer from quadratic cost and a fixed context-window limit. Linear Transformers and modern linear RNNs improve efficiency but must compress the entire history into a fixed-size state, which causes memory overflow and poor long-range recall.

Human memory has separate short-term and long-term systems; existing architectures usually lack a long-term memory that can adapt at test time.

Key question: how can we design a neural long-term memory that learns, forgets, and recalls information efficiently over extremely long contexts?

Titans introduce:

  • A neural long-term memory module (LMM) that updates its weights at test time.
  • Titans architectures that integrate LMM with attention and persistent memory.

2. Method

2.1 Neural Long-term Memory (LMM)

LMM treats learning as memorizing past tokens into its parameters during test time.

Uses a surprise metric: the gradient of the associative-memory loss with respect to the memory parameters; a larger gradient means a more “surprising” token, which is written into memory more strongly.

The memory update dynamics are as follows (a minimal code sketch follows the list):

  1. The memory stores associative key–value pairs using the loss:

$$ \ell = \| M_{t-1}(k_t) - v_t \|_2^2 $$

where $k_t = x_t W_K$ and $v_t = x_t W_V$.

  2. Surprise gradient:

$$ g_t = \nabla_{M_{t-1}}\ell $$

  3. Surprise momentum:

$$ S_t = \eta_t S_{t-1} - \theta_t g_t $$

  • Past and current surprise are combined using a data-dependent decay $\eta_t$ and learning rate $\theta_t$.
  • A forget gate $\alpha_t \in [0,1]$ controls the memory update below:
    • $\alpha_t \to 1$ ⇒ forget history
    • $\alpha_t \to 0$ ⇒ retain history
  4. Memory update:

$$ M_t = (1 - \alpha_t) M_{t-1} + S_t $$

  5. Retrieval:

$$ y_t = M_t(q_t) $$
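
To make the dynamics concrete, below is a minimal PyTorch sketch of one write/read step. It assumes a single linear memory layer (a $d \times d$ matrix) in place of the deep MLP, and treats the projections $W_K, W_V, W_Q$ and the gates $\alpha_t, \eta_t, \theta_t$ as given; in Titans these are learned in the outer loop, and the names and values here are placeholders.

```python
import torch

d = 64
M = torch.zeros(d, d)               # memory parameters M_{t-1}
S = torch.zeros(d, d)               # surprise momentum S_{t-1}
W_K = torch.randn(d, d) / d ** 0.5  # placeholder key projection
W_V = torch.randn(d, d) / d ** 0.5  # placeholder value projection
W_Q = torch.randn(d, d) / d ** 0.5  # placeholder query projection

def lmm_write(x_t, M, S, alpha_t, eta_t, theta_t):
    """One token's write into the (linear) long-term memory."""
    k_t, v_t = x_t @ W_K, x_t @ W_V
    err = k_t @ M - v_t                    # M_{t-1}(k_t) - v_t
    g_t = 2.0 * torch.outer(k_t, err)      # grad of ||M(k_t) - v_t||_2^2 w.r.t. M
    S = eta_t * S - theta_t * g_t          # surprise momentum
    M = (1.0 - alpha_t) * M + S            # forget gate, then write
    return M, S

def lmm_read(x_t, M):
    """Retrieval is a forward pass on the query; no weights change."""
    return (x_t @ W_Q) @ M

x_t = torch.randn(d)
M, S = lmm_write(x_t, M, S, alpha_t=0.01, eta_t=0.9, theta_t=0.1)
y_t = lmm_read(x_t, M)
```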

2.2 Parallelizable Training

The LMM's inner (test-time) update is equivalent to mini-batch gradient descent with momentum + weight decay.

The authors show this can be reformulated into operations using matmuls + associative scan, enabling fast, hardware-friendly parallel training.
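
As a toy check (not the paper's kernel), the sketch below shows why the decay recurrence $M_t = (1 - \alpha_t) M_{t-1} + S_t$ admits a parallel scan: each step can be represented as a pair $(\beta_t, S_t)$ with $\beta_t = 1 - \alpha_t$, composing two steps yields another step, so prefixes can be tree-reduced in $O(\log T)$ depth. Scalars stand in for the full memory tensors; the algebra is identical.

```python
import torch

def compose(a, b):
    """Apply step a, then step b: x -> beta_b * (beta_a * x + s_a) + s_b."""
    beta_a, s_a = a
    beta_b, s_b = b
    return beta_a * beta_b, beta_b * s_a + s_b

T = 8
betas = torch.rand(T)          # data-dependent decay (1 - alpha_t)
surprises = torch.randn(T)     # momentum terms S_t

# Sequential reference: M_t = beta_t * M_{t-1} + S_t, starting from M_0 = 0
M = torch.tensor(0.0)
for b, s in zip(betas, surprises):
    M = b * M + s

# Same result by composing the per-step pairs (a real kernel would
# tree-reduce them in parallel instead of folding left-to-right)
step = (torch.tensor(1.0), torch.tensor(0.0))  # identity step
for b, s in zip(betas, surprises):
    step = compose(step, (b, s))
assert torch.allclose(M, step[1])              # composed step applied to M_0 = 0
```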

2.3 Persistent Memory

A small set of fixed, learnable vectors prepended to the sequence (a minimal code sketch follows the list below).

Purpose:

  • Store task-level knowledge (not input-dependent).
  • Counteract attention bias toward early tokens.
  • Equivalent to data-independent attention keys/values (via the reinterpretation of an FFN with a softmax activation as attention).
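
A minimal sketch of this idea, assuming persistent memory is implemented as learnable vectors concatenated in front of each segment before attention; the class name and sizes are illustrative.

```python
import torch
import torch.nn as nn

class PersistentMemory(nn.Module):
    def __init__(self, n_persist: int, d_model: int):
        super().__init__()
        # Fixed (input-independent) but learnable task-level tokens
        self.tokens = nn.Parameter(torch.randn(n_persist, d_model) * 0.02)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model) -> (batch, n_persist + seq_len, d_model)
        prefix = self.tokens.unsqueeze(0).expand(x.shape[0], -1, -1)
        return torch.cat([prefix, x], dim=1)

# Every segment entering attention gets the same learned prefix.
pm = PersistentMemory(n_persist=4, d_model=64)
segment = torch.randn(2, 128, 64)
extended = pm(segment)   # shape (2, 132, 64)
```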

2.4 Titans Architectures (Three Variants)

All Titans have three components:

  1. Short-term memory = attention (sliding window attention)
  2. Long-term memory = LMM
  3. Persistent memory = learned prefix

Variants:

  • MAC — Memory as Context:

Retrieve from long-term memory → concatenate with persistent memory and the current segment → feed into attention. Best long-context performance among the variants.

  • MAG — Memory as Gate: Combine sliding-window attention output and memory output via gating (see the sketch after this list).

  • MAL — Memory as Layer: Sequential: LMM → Sliding Window Attention. Simpler but weaker performance.

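As an illustration of the MAG combination, the sketch below mixes the two branches with a learned gate. The sliding-window attention and LMM outputs are assumed to be computed elsewhere, and the sigmoid-over-linear gate used here is an illustrative parameterization, not necessarily the paper's exact formula.

```python
import torch
import torch.nn as nn

class MAGCombine(nn.Module):
    """Blend short-term (attention) and long-term (LMM) branches via a gate."""
    def __init__(self, d_model: int):
        super().__init__()
        self.gate = nn.Linear(d_model, d_model)

    def forward(self, attn_out: torch.Tensor, lmm_out: torch.Tensor) -> torch.Tensor:
        # attn_out, lmm_out: (batch, seq_len, d_model)
        g = torch.sigmoid(self.gate(attn_out))       # data-dependent gate in (0, 1)
        return g * attn_out + (1.0 - g) * lmm_out    # convex mix of the two memories
```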

3. Novelty

3.1. Memory Structure

  • Titans introduce a long-term memory (LMM) that can learn and store information across millions of tokens.
  • This memory is a deep, learnable module, not just a matrix or KV-cache.
  • Three designs (MAC, MAG, MAL) show flexible ways to combine long-term memory with short-term attention.

3.2. Memory Update

  • LMM learns during inference, using a simple idea: more surprising tokens are written more strongly.
  • Updates use momentum (past surprise + current surprise) for stability.
  • A forget gate decides how much old memory to remove to avoid overflow.

3.3. Memory Retrieval

  • LMM learns a key → value mapping, acting like a smart, compressing KV-cache.
  • The model retrieves long-term information when needed and mixes it with short-term attention (SWA).

4. More Details

4.1. What is M?

M is a learnable function (an MLP) that stores long-term information.

$$ M : \mathbb{R}^d \rightarrow \mathbb{R}^d $$

It takes a vector (key or query) and outputs another vector (a “memory value”).

You can think of M as a neural dictionary:

  • input = address
  • output = content

Except that this dictionary learns at test time and compresses many past tokens into a fixed-size neural network.
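
A minimal sketch of such a deep memory, assuming a two-layer MLP; the depth, width, and activation are illustrative choices, not the paper's exact configuration.

```python
import torch.nn as nn

class DeepMemory(nn.Module):
    """M : R^d -> R^d; the MLP weights are the storage medium."""
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, d_hidden),
            nn.SiLU(),
            nn.Linear(d_hidden, d_model),
        )

    def forward(self, k):
        # "Address in, content out": reads and writes both go through this map.
        return self.net(k)
```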

4.2. What does M do during RETRIEVAL?

Retrieval input: query

$$ q_t = x_t W_Q $$

Memory returns:

$$ y^{(LMM)}_t = M(q_t) $$

Meaning:

“Given this query, what long-term knowledge have we stored that matches it?”

So during retrieval:

  • M behaves as a lookup function
  • q = the question
  • M(q) = the answer from long-term memory

Here, all the knowledge is compressed inside the MLP weights.
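
A small usage sketch, assuming the memory is an arbitrary MLP and $W_Q$ is a placeholder projection: retrieval is a plain forward pass and updates no weights.

```python
import torch
import torch.nn as nn

d = 64
memory = nn.Sequential(nn.Linear(d, 2 * d), nn.SiLU(), nn.Linear(2 * d, d))
W_Q = torch.randn(d, d) / d ** 0.5   # placeholder query projection
x_t = torch.randn(d)

with torch.no_grad():                # reading never changes the memory
    y_t = memory(x_t @ W_Q)          # y_t = M(q_t), recalled from the MLP weights
```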

4.3. What does M do during UPDATE?

Update input: key

$$ k_t = x_t W_K,\qquad v_t = x_t W_V $$

Memory tries to learn the mapping:

$$ M(k_t) \approx v_t $$

Error:

$$ \ell = \| M(k_t) - v_t \|_2^2 $$

Gradient:

$$ g_t = \nabla_{M} \ell $$

Memory update:

$$ M \leftarrow M - \theta g_t $$

Meaning:

“Given this key, the correct value should be v. Update yourself so you can remember this in the future.”

So during update:

  • M behaves as a learnable associative memory
  • k = the address
  • v = the content
  • M learns: k → v

This is exactly like writing to a KV-cache — except the cache is a learnable neural network that can compress, forget, and generalize.
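
A minimal sketch of one such write, using autograd for the surprise gradient with respect to the MLP weights; momentum and the forget gate from Section 2.1 are omitted, and the step size, layer sizes, and projections are placeholders.

```python
import torch
import torch.nn as nn

def write(memory: nn.Module, k_t: torch.Tensor, v_t: torch.Tensor, theta: float = 0.1):
    """One test-time gradient step so that M(k_t) moves toward v_t."""
    loss = ((memory(k_t) - v_t) ** 2).sum()                     # || M(k_t) - v_t ||_2^2
    grads = torch.autograd.grad(loss, list(memory.parameters()))
    with torch.no_grad():
        for p, g in zip(memory.parameters(), grads):
            p -= theta * g                                      # M <- M - theta * g_t

d = 64
memory = nn.Sequential(nn.Linear(d, 2 * d), nn.SiLU(), nn.Linear(2 * d, d))
W_K = torch.randn(d, d) / d ** 0.5
W_V = torch.randn(d, d) / d ** 0.5
x_t = torch.randn(d)
write(memory, x_t @ W_K, x_t @ W_V)
```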

4.4. What role does M play?

Role 1: long-term storage

M learns from (k,v) pairs:

$$ M(k) \approx v $$

This is how it stores information.

Role 2: long-term retrieval

M responds to queries:

$$ M(q) = \text{long-term memory output} $$

This is how it retrieves information.

Why both? Because:

  • Keys write into memory
  • Queries read from memory
  • Both must use the same space so they match