1. Core Question

Transformers work extremely well in practice, but what can the architecture express or compute in principle? (This is a question about raw representational capability, not about training or learnability.)

Expressivity is studied from two angles:

  • Approximation theory: Can transformers approximate certain functions?
  • Formal language theory (the focus): Can transformers recognize/generate certain formal languages?

2. Formal Background

2.1. Chomsky hierarchy

  • Regular → finite automaton
  • Context-free → pushdown automaton
  • Context-sensitive → linear-bounded automaton (a Turing machine whose tape is bounded by the input length)
  • Recursively enumerable → full Turing machine
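For concreteness, here is a minimal sketch contrasting the first two levels (the function names and examples are illustrative, not from any specific formalization):

```python
# A regular language needs only finite state; a context-free one does not.

def accepts_even_as(s: str) -> bool:
    """Regular: {w : w contains an even number of 'a's}.
    Recognized by a 2-state finite automaton (state = parity seen so far)."""
    state = 0
    for ch in s:
        if ch == "a":
            state ^= 1              # flip between the two states
    return state == 0

def accepts_anbn(s: str) -> bool:
    """Context-free: {a^n b^n : n >= 0}. No finite automaton recognizes it;
    the counter below plays the role of a pushdown stack."""
    count, seen_b = 0, False
    for ch in s:
        if ch == "a":
            if seen_b:
                return False        # an 'a' after a 'b' is out of order
            count += 1
        elif ch == "b":
            seen_b = True
            count -= 1
            if count < 0:
                return False        # more b's than a's so far
        else:
            return False
    return count == 0

assert accepts_even_as("abab") and not accepts_even_as("ab")
assert accepts_anbn("aaabbb") and not accepts_anbn("aabbb")
```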

2.2. Circuit hierarchy

A transformer applies a fixed number of layers to all positions in parallel, so it behaves like a constant-depth parallel circuit. The relevant classes:

  • AC⁰: constant-depth, unbounded fan-in AND/OR/NOT circuits (very weak; provably cannot compute PARITY)
  • TC⁰: AC⁰ plus MAJORITY/threshold gates
  • NC¹: logarithmic-depth, bounded fan-in circuits (contains TC⁰, and believed to be strictly stronger)
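To make the AC⁰/TC⁰ distinction concrete, here is a toy illustration (plain Python functions standing in for gates; the names are mine):

```python
# MAJORITY is a single threshold applied to the sum of the inputs: exactly
# what one TC0 threshold gate computes. PARITY is the sum mod 2: it is not
# any single threshold of the sum, and AC0 circuits provably cannot compute
# it at all (Furst-Saxe-Sipser / Hastad). TC0, by contrast, *can* compute
# PARITY, since threshold gates can count.

def majority(bits):
    return sum(bits) > len(bits) // 2   # one threshold over the global count

def parity(bits):
    return sum(bits) % 2 == 1           # flips with every single input bit

print(majority([1, 1, 0, 1]))  # True  (three of four inputs are 1)
print(parity([1, 1, 0, 1]))    # True  (odd number of 1s)
```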

2.3. DLOGTIME-uniform

Means: the circuit for input length n must be generable by a very simple algorithm running in O(log n) time; given a gate index, that algorithm can answer what type the gate is and which gates feed into it.

Intuitively: no cheating. A single systematic rule generates the circuit for every input length, just as one transformer architecture handles inputs of all lengths. A toy sketch follows.
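Here is a toy picture of that idea (a hypothetical helper, not the formal definition), using the balanced XOR tree that computes PARITY as the circuit family:

```python
# One systematic wiring rule generates the circuit for every input length:
# simple arithmetic on the (log n)-bit gate index determines the gate's type
# and its children. No per-length table of circuits is needed.

def gate(n: int, i: int):
    """Describe gate i of a PARITY circuit for inputs x[0..n-1] (n a power
    of two). Gates 1..n-1 are internal XOR gates in a balanced binary tree
    (heap indexing); gates n..2n-1 are the input leaves."""
    if i >= n:
        return ("INPUT", i - n)          # leaf: reads input bit i - n
    return ("XOR", 2 * i, 2 * i + 1)     # internal node: children by index arithmetic

print(gate(8, 1))   # ('XOR', 2, 3): the output gate
print(gate(8, 9))   # ('INPUT', 1): a leaf reading x[1]
```

(This particular family is log-depth, so it illustrates uniformity, not constant depth.)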

3. Key Positive Results

Depending on assumptions:

  • With hard attention, special positional encodings, or extra intermediate steps

    → transformers can simulate arbitrary algorithms (Turing-complete).

  • With scratchpads / chain-of-thought

    → decoder-only LMs can match probabilistic Turing machines in expressive power (see the scratchpad sketch after this list).

  • With sinusoidal or arbitrary positional encodings

    → they can handle some context-free patterns (e.g., limited Dyck languages; see the bracket-matching sketch below).
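The mechanism behind the scratchpad result can be seen in miniature with PARITY, which is hard for a fixed-depth forward pass but trivial as a left-to-right scan (a toy stand-in for the actual constructions, which simulate full Turing machines):

```python
# With a scratchpad, each generated "token" is the running parity, and each
# step only needs the previous token plus the next input bit. The extra
# generated steps supply the serial computation that a constant-depth
# forward pass lacks.

def parity_with_scratchpad(bits):
    scratchpad = ["0"]                      # one token per decoding step
    for b in bits:
        prev = int(scratchpad[-1])
        scratchpad.append(str(prev ^ b))    # constant work per emitted token
    return scratchpad[-1], scratchpad

answer, trace = parity_with_scratchpad([1, 0, 1, 1])
print(answer)   # '1': an odd number of 1s
print(trace)    # ['0', '1', '1', '0', '1']: the chain of thought
```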
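And for the Dyck result, a stack-based recognizer makes clear what "limited" buys (an illustrative sketch; bounding the nesting depth is what caps the memory a recognizer needs):

```python
# Full Dyck-k (k bracket types, unbounded nesting) needs a stack and is
# genuinely context-free. The "limited" variants bound the nesting depth,
# which brings recognition within reach of fixed-capacity models.

def is_dyck(s, max_depth=None):
    pairs = {")": "(", "]": "[", "}": "{"}
    stack = []
    for ch in s:
        if ch in pairs.values():            # opening bracket
            stack.append(ch)
            if max_depth is not None and len(stack) > max_depth:
                return False                # exceeds the allowed nesting depth
        elif ch in pairs:                   # closing bracket
            if not stack or stack.pop() != pairs[ch]:
                return False
        else:
            return False
    return not stack

assert is_dyck("([]{})")
assert not is_dyck("([)]")
assert not is_dyck("((()))", max_depth=2)   # valid Dyck, but nested too deep
```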

4. Key Negative Results

Under realistic constraints:

  • constant depth
  • finite precision
  • softmax attention
  • no scratchpad / CoT

transformers collapse to very weak circuit classes (AC⁰ or TC⁰, depending on the exact attention and precision assumptions).

Consequences:

  • They cannot express PARITY at unbounded input lengths.
  • They struggle with even simple forms of counting or nesting (e.g., Dyck-k with unbounded depth).
  • They cannot recognize languages outside TC⁰, which (assuming TC⁰ ≠ NC¹) rules out NC¹-complete problems such as Boolean formula evaluation.

Interpretation:

A shallow, finite-precision transformer cannot track global structure that grows without bound with the input length.
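A numeric toy version of the underlying averaging argument (an illustration of the intuition, not a proof; uniform average pooling stands in for softmax attention):

```python
# Attention-style pooling maps an input to (roughly) a mean over token
# values. Flipping one bit changes that mean by only 1/n, so at any fixed
# precision, long inputs with different parities collide into the same
# representation.

def pooled(bits, precision_bits=8):
    mean = sum(bits) / len(bits)            # global average, attention-style
    scale = 2 ** precision_bits
    return round(mean * scale) / scale      # quantize to fixed precision

n = 4096
x = [1] * 10 + [0] * (n - 10)               # ten 1s:    parity 0
y = [1] * 11 + [0] * (n - 11)               # eleven 1s: parity 1
print(pooled(x) == pooled(y))               # True: the pooled values collide,
                                            # yet the parities differ
```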

5. Ramifications (Why this matters)

  • Theory explains why LLMs struggle with long-range algorithmic tasks.
  • Theory explains why chain-of-thought reliably boosts reasoning: each extra generated step adds serial computation that the fixed-depth forward pass lacks, dramatically increasing theoretical power.
  • It clarifies how sensitive expressivity is to assumptions like precision, depth, attention type, and positional encoding.

6. Takeaway

A fixed-depth, finite-precision transformer is no stronger than a shallow parallel circuit (TC⁰), but with hard attention, richer positional encodings, or chain-of-thought steps, its expressivity can rise all the way to Turing-completeness.