1. Core Question
Transformers work extremely well, but what can they theoretically express or compute? (Not about training, but about the architecture’s raw capability.)
Expressivity is studied from two angles:
- Approximation theory: Can transformers approximate certain functions?
- Formal language theory (the focus): Can transformers recognize/generate certain formal languages?
2. Formal Background
2.1. Chomsky hierarchy
- Regular → finite automaton
- Context-free → pushdown automaton
- Context-sensitive → linear-bounded automaton (a Turing machine whose tape is bounded by the input length)
- Recursively enumerable → full Turing machine
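To ground the first two levels, here is a minimal Python sketch (the function names and example languages are illustrative choices, not from any particular source): a regular language like (ab)* needs only a fixed set of states, while a context-free language like aⁿbⁿ needs an unbounded counter, the job a pushdown automaton's stack does.

```python
def recognize_ab_star(s: str) -> bool:
    """Regular language (ab)*: a two-state finite automaton, no memory beyond the state."""
    state = 0
    for ch in s:
        if state == 0 and ch == "a":
            state = 1
        elif state == 1 and ch == "b":
            state = 0
        else:
            return False
    return state == 0


def recognize_an_bn(s: str) -> bool:
    """Context-free language a^n b^n: needs an unbounded counter (a one-symbol stack)."""
    count = 0
    seen_b = False
    for ch in s:
        if ch == "a":
            if seen_b:
                return False   # an 'a' after a 'b' breaks the a^n b^n shape
            count += 1
        elif ch == "b":
            seen_b = True
            count -= 1
            if count < 0:
                return False   # more b's than a's so far
        else:
            return False
    return count == 0


assert recognize_ab_star("abab") and not recognize_ab_star("aab")
assert recognize_an_bn("aaabbb") and not recognize_an_bn("aabbb")
```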
2.2. Circuit hierarchy
Transformers behave like constant-depth parallel circuits, so we compare with:
- AC⁰: constant-depth, polynomial-size circuits of AND/OR/NOT gates (very weak; cannot compute PARITY)
- TC⁰: AC⁰ plus MAJORITY/threshold gates (can count; still constant depth)
- NC¹: logarithmic-depth circuits with bounded fan-in gates (stronger; TC⁰ ⊆ NC¹, and the inclusion is believed to be strict)
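The two separating functions are simple to state; here is a quick illustrative sketch of the functions themselves (plain Python standing in for circuits):

```python
def parity(bits: list[int]) -> int:
    """1 iff the number of 1s is odd. Provably NOT computable in AC^0."""
    acc = 0
    for b in bits:
        acc ^= b
    return acc


def majority(bits: list[int]) -> int:
    """1 iff more than half the bits are 1. The defining gate of TC^0."""
    return 1 if 2 * sum(bits) > len(bits) else 0


def parity_via_counting(bits: list[int]) -> int:
    """How TC^0 recovers PARITY: threshold gates can count 1s exactly,
    and parity is just the low bit of that count."""
    return sum(bits) % 2
```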
2.3. DLOGTIME-uniform
Means: the circuit for input length n (which gates exist and how they are wired) must be generable by a very simple algorithm running in O(log n) time.
Intuitively: no cheating by hard-coding answers into each circuit; one systematic recipe produces the whole family, just as a single transformer architecture handles every input length.
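A toy illustration of uniformity (the indexing scheme here is my own, hypothetical choice): one cheap procedure answers "what is gate g of the circuit for length n?" for every n, instead of a per-length lookup table.

```python
def describe_gate(n: int, g: int) -> tuple[str, list[int]]:
    """One systematic rule generates the circuit for every input length n.
    Here the family is a balanced binary OR-tree over n inputs, stored
    heap-style: gates 0..n-2 are internal ORs, gates n-1..2n-2 are inputs.
    Answering such queries in O(log n) time is what DLOGTIME-uniform asks."""
    if g >= n - 1:
        return ("INPUT", [])               # a leaf: one of the n input wires
    return ("OR", [2 * g + 1, 2 * g + 2])  # children in heap indexing
```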
3. Key Positive Results
Depending on assumptions:
With hard attention, special positional encodings, or extra intermediate steps (typically together with unbounded precision)
→ transformers can simulate arbitrary algorithms (Turing-complete).
With scratchpads / chain-of-thought
→ decoder-only LMs can match probabilistic Turing machines in expressivity (see the sketch after this list).
With sinusoidal or arbitrary positional encodings
→ they can handle some context-free patterns (e.g., bounded-depth Dyck languages, Dyck-(k, D)).
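To see why generated intermediate steps add power, take PARITY as a toy case (my own sketch, not a construction from a specific paper): computed in one shot it is a global function of all the bits, but with a scratchpad each emitted token only requires a trivial local update.

```python
def parity_with_scratchpad(bits: list[int]) -> tuple[int, list[int]]:
    """Chain-of-thought-style PARITY: emit the running parity after each bit.
    Every step is a trivial local function (previous token XOR next bit),
    even though parity of the whole string is hard for one constant-depth pass."""
    scratchpad = []
    acc = 0
    for b in bits:
        acc ^= b                 # local update
        scratchpad.append(acc)   # "emit" an intermediate token
    return acc, scratchpad


result, steps = parity_with_scratchpad([1, 0, 1, 1])
assert result == 1 and steps == [1, 1, 0, 1]
```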
4. Key Negative Results
Under realistic constraints:
- constant depth
- finite precision
- softmax attention
- no scratchpad / CoT
transformers collapse to very weak circuit classes (AC⁰ or TC⁰).
Consequences:
- They cannot express PARITY at unbounded input lengths.
- They struggle with even simple forms of unbounded counting or nesting (e.g., Dyck-k; see the sketch below).
- They cannot recognize languages lying outside TC⁰ (e.g., problems hard for NC¹, assuming TC⁰ ≠ NC¹).
Interpretation:
A shallow, finite-precision transformer cannot compute global, unbounded structure in a single forward pass.
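A minimal sketch of the kind of state at issue (illustrative Python, not a transformer construction): recognizing Dyck-k, balanced strings over k bracket types, takes a stack whose depth can grow with the input, exactly the global, unbounded bookkeeping a fixed-depth, finite-precision forward pass has nowhere to store.

```python
PAIRS = {")": "(", "]": "[", "}": "{"}   # Dyck-3: three bracket types

def is_dyck(s: str) -> bool:
    """Recognize Dyck-k with an explicit stack. The stack depth is unbounded
    in the input length, which is the nesting state a constant-depth,
    finite-precision transformer cannot maintain."""
    stack = []
    for ch in s:
        if ch in "([{":
            stack.append(ch)
        elif ch in PAIRS:
            if not stack or stack.pop() != PAIRS[ch]:
                return False
        else:
            return False
    return not stack


assert is_dyck("([]{})") and not is_dyck("([)]")
```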
5. Ramifications (Why this matters)
- Theory explains why LLMs struggle with long-range algorithmic tasks.
- Theory explains why chain-of-thought reliably boosts reasoning: extra generated steps dramatically increase theoretical power.
- It clarifies how sensitive expressivity is to assumptions like precision, depth, attention type, and positional encoding.
6. Takeaway
A finite-precision transformer with fixed depth is as weak as a small parallel circuit (TC⁰), but with hard attention or chain-of-thought steps, its expressivity can rise all the way to Turing-complete.