Architecture Overview

The Free Transformer introduces a novel approach to sequence modeling by incorporating explicit latent planning into the traditional autoregressive generation process.

Core Concept

Traditional autoregressive Transformers generate tokens sequentially, conditioning only on previously generated tokens. This "reactive" approach can lead to:

  • Local coherence but global inconsistency
  • Difficulty with long-range planning
  • Limited controllability in generation

The Free Transformer addresses these limitations by introducing a latent planning mechanism that:

  1. First creates an abstract plan Z for the entire sequence
  2. Then generates tokens conditioned on both the history and the plan

High-Level Architecture

```mermaid
graph TB
    subgraph "Input Processing"
        A[Input Tokens] --> B[Token Embeddings]
        B --> C[Positional Encoding]
    end

    subgraph "Early Layers (Context Building)"
        C --> D[Decoder Blocks 1...L/2]
    end

    subgraph "Latent Planning (Training Only)"
        D --> E[Non-Causal Encoder]
        E --> F[Latent Variable Z]
        F --> G[Plan Injection]
    end

    subgraph "Late Layers (Plan-Conditioned Generation)"
        D --> H[Decoder Blocks L/2+1...L]
        G --> H
        H --> I[Output Logits]
    end

    subgraph "Inference Mode"
        J[Random Z Sampling] --> K[Plan Injection]
        K --> H
    end
```
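
Read as code, the diagram corresponds to a forward pass with two ways of obtaining Z: inferred from the full sequence during training, or supplied from the prior at inference. The sketch below is a simplified, hypothetical skeleton (names such as `FreeTransformerSketch`, `plan_proj`, and the stand-in `block` are illustrative, not the repository's API); the non-causal encoder and binary mapper are reduced to a crude placeholder here and sketched properly in the component sections below.

```python
import torch
import torch.nn as nn

def block(d_model):
    # Stand-in for a Llama-style decoder block (attention + SwiGLU); details omitted.
    return nn.Sequential(nn.LayerNorm(d_model), nn.Linear(d_model, d_model), nn.GELU())

class FreeTransformerSketch(nn.Module):
    """Hypothetical skeleton of the forward pass in the diagram above; not the real API."""

    def __init__(self, vocab_size=100, d_model=64, n_layers=4, latent_dim=16):
        super().__init__()
        half = n_layers // 2
        self.embed = nn.Embedding(vocab_size, d_model)
        self.early = nn.ModuleList(block(d_model) for _ in range(half))            # context building
        self.late = nn.ModuleList(block(d_model) for _ in range(n_layers - half))  # plan-conditioned
        self.pool = nn.Linear(d_model, latent_dim)       # crude stand-in for the non-causal encoder
        self.plan_proj = nn.Linear(latent_dim, d_model)  # post-sampler FC layer (plan injection)
        self.lm_head = nn.Linear(d_model, vocab_size, bias=False)

    def forward(self, tokens, z=None):
        h = self.embed(tokens)
        for blk in self.early:
            h = blk(h)
        if z is None:
            # Training-style path: derive a binary plan from the whole sequence.
            # (The real model uses the non-causal encoder + Gumbel-Softmax mapper shown below.)
            z = (torch.sigmoid(self.pool(h.mean(dim=1))) > 0.5).float()
        h = h + self.plan_proj(z).unsqueeze(1)           # plan injection with a residual add
        for blk in self.late:
            h = blk(h)
        return self.lm_head(h), z

# Training-style call (plan inferred); at inference, pass z sampled from the uniform prior.
logits, z = FreeTransformerSketch()(torch.randint(0, 100, (2, 8)))
```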

Key Components

1. Decoder Backbone

Based on the Llama architecture with modern optimizations:

  • RMSNorm: Root-mean-square normalization; cheaper than LayerNorm (no mean centering) and standard in Llama-style models
  • SwiGLU Activation: Gated feed-forward activation that typically outperforms ReLU/GELU feed-forward blocks
  • RoPE: Rotary Position Embedding for better length generalization
  • Grouped-Query Attention (GQA): Shares key/value heads across query heads to reduce KV-cache size
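
Two of these pieces are easy to show compactly. The following is a minimal, generic sketch of RMSNorm and a SwiGLU feed-forward block, written from their standard definitions rather than taken from this repository; hidden sizes and module names are illustrative.

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """Rescales features by their root mean square; no mean subtraction, no bias."""
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))
        self.eps = eps

    def forward(self, x):
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * x * rms

class SwiGLU(nn.Module):
    """Gated feed-forward block: SiLU(x W1) * (x W3), projected back by W2 (Llama-style)."""
    def __init__(self, dim, hidden_dim):
        super().__init__()
        self.w1 = nn.Linear(dim, hidden_dim, bias=False)   # gate branch
        self.w3 = nn.Linear(dim, hidden_dim, bias=False)   # value branch
        self.w2 = nn.Linear(hidden_dim, dim, bias=False)   # output projection

    def forward(self, x):
        return self.w2(nn.functional.silu(self.w1(x)) * self.w3(x))

x = torch.randn(2, 8, 64)
y = SwiGLU(64, 256)(RMSNorm(64)(x))   # typical pre-norm ordering
```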

2. Latent Planning System

The core innovation consists of three components:

Encoder Block

  • Non-causal attention: Can attend to the entire sequence
  • Learned query vector ζ: Aggregates sequence information
  • Separate from decoder: Doesn't interfere with autoregressive flow
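
One minimal realization of "a learned query that attends non-causally over the whole sequence" is single-query cross-attention. The sketch below assumes that formulation and uses `torch.nn.MultiheadAttention`; the class name `LatentEncoder` and the head count are illustrative, not the repository's actual encoder.

```python
import torch
import torch.nn as nn

class LatentEncoder(nn.Module):
    """Learned query zeta attends (non-causally) over the hidden states of the full sequence."""
    def __init__(self, d_model, n_heads=4):
        super().__init__()
        self.zeta = nn.Parameter(torch.randn(1, 1, d_model) * 0.02)   # learned query vector
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, h):                      # h: (batch, seq_len, d_model) from the early layers
        q = self.zeta.expand(h.size(0), -1, -1)
        pooled, _ = self.attn(q, h, h)         # no causal mask: the query sees every position
        return pooled.squeeze(1)               # (batch, d_model) summary of the sequence

summary = LatentEncoder(64)(torch.randn(2, 10, 64))   # -> shape (2, 64)
```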

Binary Mapper

  • Differentiable discretization: Converts continuous representations to binary plans
  • Gumbel-Softmax: Enables gradient flow through discrete sampling
  • Configurable dimensionality: Latent plan size Z ∈ {0,1}^d
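
A common way to obtain differentiable binary codes is to treat each bit as a two-way categorical and sample it with straight-through Gumbel-Softmax (`torch.nn.functional.gumbel_softmax` with `hard=True`). The sketch below assumes that formulation; the actual mapper may use a different relaxation, and the name `BinaryMapper` is illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BinaryMapper(nn.Module):
    """Maps a continuous summary vector to a binary plan Z in {0,1}^d while keeping gradients."""
    def __init__(self, d_model, latent_dim, tau=1.0):
        super().__init__()
        self.to_logits = nn.Linear(d_model, latent_dim * 2)   # 2 logits (off/on) per bit
        self.latent_dim = latent_dim
        self.tau = tau

    def forward(self, h_enc):                                 # h_enc: (batch, d_model)
        logits = self.to_logits(h_enc).view(-1, self.latent_dim, 2)
        if self.training:
            one_hot = F.gumbel_softmax(logits, tau=self.tau, hard=True)  # straight-through sample
        else:
            one_hot = F.one_hot(logits.argmax(-1), 2).float()            # deterministic rounding
        return one_hot[..., 1]                                # take the "on" slot as the bit value

z = BinaryMapper(64, 16)(torch.randn(2, 64))   # -> binary tensor of shape (2, 16)
```

In this formulation, a softmax over the same two logits also yields the per-bit posterior probabilities used by the KL term discussed below.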

Plan Injection

  • Post-sampler FC layer: Integrates plan into decoder representations
  • Residual connections: Preserves original information flow
  • Layer-wise injection: Plan influences multiple decoder layers
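
In its simplest form, plan injection projects Z into the model dimension and adds it residually to the hidden states that feed the later decoder layers, broadcast across positions. The sketch below shows a single injection point (layer-wise injection repeats the same pattern); names are illustrative.

```python
import torch
import torch.nn as nn

class PlanInjection(nn.Module):
    """Projects the binary plan Z and adds it residually to the decoder hidden states."""
    def __init__(self, latent_dim, d_model):
        super().__init__()
        self.fc = nn.Linear(latent_dim, d_model)   # post-sampler FC layer

    def forward(self, h_dec, z):                   # h_dec: (batch, seq, d_model), z: (batch, latent_dim)
        return h_dec + self.fc(z).unsqueeze(1)     # broadcast the plan over every position

h = PlanInjection(16, 64)(torch.randn(2, 10, 64), torch.randint(0, 2, (2, 16)).float())
```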

3. Conditional VAE Framework

The model is trained as a conditional Variational Autoencoder:

Training Mode

Objective: maximize the ELBO \(\mathbb{E}_{Z \sim q(Z|X)}[\log p(X|Z)] - \text{KL}(q(Z|X)\,\|\,p(Z))\), where \(q(Z|X)\) is the encoder's approximate posterior.
  • Reconstruction Loss: Standard language modeling loss
  • KL Divergence: Regularizes latent space toward uniform prior
  • Free Bits: Prevents posterior collapse
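
Concretely, if q(Z|X) factorizes into per-bit Bernoulli distributions and the prior is uniform (each bit is 1 with probability 0.5), the loss is token-level cross-entropy plus a closed-form Bernoulli KL, as in the hypothetical sketch below (free bits are added in the sketch after the loss formula further down).

```python
import torch
import torch.nn.functional as F

def cvae_loss(token_logits, targets, bit_probs, beta=1.0):
    """Language-modeling reconstruction loss plus beta * KL(q(Z|X) || uniform prior)."""
    recon = F.cross_entropy(token_logits.flatten(0, 1), targets.flatten())
    # KL between Bernoulli(p) and Bernoulli(0.5), summed over the latent bits.
    p = bit_probs.clamp(1e-6, 1 - 1e-6)
    kl_per_bit = p * torch.log(2 * p) + (1 - p) * torch.log(2 * (1 - p))
    kl = kl_per_bit.sum(dim=-1).mean()
    return recon + beta * kl

loss = cvae_loss(torch.randn(2, 8, 100), torch.randint(0, 100, (2, 8)), torch.rand(2, 16))
```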

Inference Mode

Generative model: \(p(X, Z) = p(X|Z)\,p(Z)\); sample the plan \(Z\) from the uniform prior, then decode conditioned on it.
  • Prior Sampling: Sample z from uniform distribution
  • Conditional Generation: Generate tokens given the sampled plan

Training vs Inference Modes

Training Mode Flow

  1. Forward Pass: Input → Early Layers → Encoder → Latent Z
  2. Plan Injection: Z → Late Layers → Output Logits
  3. Loss Computation: Reconstruction + KL Divergence
  4. Backward Pass: Gradients flow through differentiable components
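
These steps map onto an ordinary PyTorch training step. The sketch below assumes a hypothetical model whose forward returns the logits together with the posterior bit probabilities, and a loss function like the `cvae_loss` sketch above; the real training loop will differ in details such as mixed precision, gradient checkpointing, and logging.

```python
def train_step(model, optimizer, tokens, targets, loss_fn, beta=1.0):
    """One illustrative training step; the model/loss interfaces are assumptions, not the real API."""
    model.train()
    logits, bit_probs = model(tokens)                 # 1-2. forward pass: early layers -> Z -> injection -> late layers
    loss = loss_fn(logits, targets, bit_probs, beta)  # 3. reconstruction + KL divergence
    optimizer.zero_grad(set_to_none=True)
    loss.backward()                                   # 4. gradients flow through the differentiable sampler
    optimizer.step()
    return loss.item()
```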

Inference Mode Flow

  1. Context Processing: Prompt → Early Layers
  2. Plan Sampling: Sample Z from uniform prior
  3. Plan Injection: Z → Late Layers
  4. Generation: Autoregressive token generation
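
A minimal generation loop following these steps might look like the sketch below. It assumes the hypothetical `forward(tokens, z=...)` signature from the architecture sketch above, samples each bit of Z uniformly, and recomputes the full sequence at every step (a real implementation would use a KV cache).

```python
import torch

@torch.no_grad()
def generate(model, prompt_tokens, latent_dim, max_new_tokens=32):
    """Sample a plan Z from the uniform prior, then decode autoregressively conditioned on it."""
    model.eval()
    tokens = prompt_tokens.clone()                                    # (batch, prompt_len)
    z = torch.randint(0, 2, (tokens.size(0), latent_dim)).float()     # 2. uniform prior over {0,1}^d
    for _ in range(max_new_tokens):
        logits, _ = model(tokens, z=z)                                # 1 + 3. context processing + plan injection
        next_token = logits[:, -1].argmax(dim=-1, keepdim=True)       # greedy pick (sampling also works)
        tokens = torch.cat([tokens, next_token], dim=-1)              # 4. autoregressive extension
    return tokens
```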

Mathematical Formulation

Encoder

The encoder produces a latent representation from the full sequence:

\[h_{enc} = \text{Encoder}(X, \zeta)\]

where \(\zeta\) is a learned query vector and \(X\) is the input sequence.

Binary Mapping

The continuous representation is mapped to a binary plan:

\[Z = \text{BinaryMapper}(h_{enc})\]

using differentiable binary encoding (e.g., Gumbel-Softmax).

Plan Injection

The binary plan is injected into the decoder:

\[h_{inj} = h_{dec} + \text{FC}(Z)\]

where \(h_{dec}\) comes from the early decoder layers.

Loss Function

The total loss combines reconstruction and regularization:

\[\mathcal{L} = \mathcal{L}_{recon} + \beta \cdot \text{KL}(q(Z|X) || p(Z))\]

with free bits regularization to prevent collapse.
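
Free bits are commonly implemented by giving each latent dimension a small KL budget \(\lambda\) below which it is not penalized, so the optimizer cannot profit from collapsing \(q(Z|X)\) onto the prior. A minimal sketch, assuming a per-dimension KL tensor:

```python
import torch

def free_bits_kl(kl_per_dim, lam=0.1):
    """Clamp each latent dimension's KL at a floor of lam nats before summing.

    Dimensions whose KL is already below the budget contribute a constant,
    so they receive no gradient pushing them further toward the prior
    (the mechanism behind posterior collapse).
    """
    return torch.clamp(kl_per_dim, min=lam).sum(dim=-1).mean()

kl = free_bits_kl(torch.rand(4, 16) * 0.05)   # every dim below budget -> 16 * 0.1 = 1.6
```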

Design Principles

1. Modularity

  • Separate components: Encoder, decoder, and injection are independent
  • Configurable: Easy to modify latent dimensions, injection points
  • Extensible: Can add new components without major changes

2. Efficiency

  • Shared backbone: Reuses decoder architecture
  • Minimal overhead: Encoder only active during training
  • Memory efficient: Gradient checkpointing and optimized attention

3. Compatibility

  • Standard interfaces: Compatible with HuggingFace ecosystem
  • Flexible training: Works with existing training pipelines
  • Easy deployment: Standard PyTorch model for inference

Comparison with Baselines

| Aspect | Standard Transformer | Free Transformer |
| --- | --- | --- |
| Planning | Implicit, reactive | Explicit, proactive |
| Coherence | Local | Global + local |
| Controllability | Limited | High (via plan manipulation) |
| Training | Language modeling | Conditional VAE |
| Inference | Autoregressive | Plan-conditioned autoregressive |
| Complexity | O(n²) attention | O(n²) + O(d) latent |

Next Steps