Architecture Overview¶
The Free Transformer introduces a novel approach to sequence modeling by incorporating explicit latent planning into the traditional autoregressive generation process.
Core Concept¶
Traditional autoregressive Transformers generate tokens sequentially, conditioning only on previously generated tokens. This "reactive" approach can lead to:
- Local coherence but global inconsistency
- Difficulty with long-range planning
- Limited controllability in generation
The Free Transformer addresses these limitations by introducing a latent planning mechanism that:
- First creates an abstract plan \(Z\) for the entire sequence
- Then generates tokens conditioned on both the history and the plan
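In probabilistic terms (a sketch in our notation, following the conditional-VAE framing described below), the usual autoregressive factorization \(\prod_t p_\theta(x_t \mid x_{<t})\) is replaced by a plan-conditioned one:

\[
p_\theta(x) = \sum_{z \in \{0,1\}^d} p(z) \prod_{t=1}^{T} p_\theta(x_t \mid x_{<t}, z)
\]

where \(z\) is the binary plan and \(p(z)\) is a uniform prior.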
High-Level Architecture¶
```mermaid
graph TB
    subgraph "Input Processing"
        A[Input Tokens] --> B[Token Embeddings]
        B --> C[Positional Encoding]
    end
    subgraph "Early Layers (Context Building)"
        C --> D[Decoder Blocks 1...L/2]
    end
    subgraph "Latent Planning (Training Only)"
        D --> E[Non-Causal Encoder]
        E --> F[Latent Variable Z]
        F --> G[Plan Injection]
    end
    subgraph "Late Layers (Plan-Conditioned Generation)"
        D --> H[Decoder Blocks L/2+1...L]
        G --> H
        H --> I[Output Logits]
    end
    subgraph "Inference Mode"
        J[Random Z Sampling] --> K[Plan Injection]
        K --> H
    end
```
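The following self-contained PyTorch sketch mirrors this data flow. It is illustrative only: decoder blocks, the ζ-query encoder, and the binary mapper are replaced by simple stand-in layers (more realistic versions are sketched in the component sections below), and all names (`FreeTransformerSketch`, `encode_plan`, `inject_plan`) are ours, not the actual implementation.

```python
import torch
import torch.nn as nn

class FreeTransformerSketch(nn.Module):
    """Minimal sketch of the data flow: early layers -> plan Z -> late layers."""

    def __init__(self, vocab_size=100, d_model=64, n_layers=4, latent_dim=16):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        half = n_layers // 2
        # Stand-ins for causal decoder blocks (context building / generation).
        self.early = nn.ModuleList(nn.Linear(d_model, d_model) for _ in range(half))
        self.late = nn.ModuleList(nn.Linear(d_model, d_model) for _ in range(n_layers - half))
        # Stand-ins for the non-causal encoder, binary mapper, and plan injection.
        self.encode_plan = nn.Linear(d_model, latent_dim)
        self.inject_plan = nn.Linear(latent_dim, d_model)
        self.lm_head = nn.Linear(d_model, vocab_size)
        self.latent_dim = latent_dim

    def forward(self, tokens, z=None):
        h = self.embed(tokens)
        for block in self.early:
            h = torch.relu(block(h))                      # context building
        if z is None:
            if self.training:
                # Training: derive Z from the whole sequence (approximate posterior).
                # The hard threshold is a simplification; the differentiable
                # mapper is sketched under "Binary Mapper" below.
                pooled = h.mean(dim=1)
                z = (torch.sigmoid(self.encode_plan(pooled)) > 0.5).float()
            else:
                # Inference: sample Z from the uniform prior over {0,1}^d.
                z = torch.randint(0, 2, (tokens.size(0), self.latent_dim),
                                  device=tokens.device).float()
        h = h + self.inject_plan(z).unsqueeze(1)          # residual plan injection
        for block in self.late:
            h = torch.relu(block(h))                      # plan-conditioned generation
        return self.lm_head(h)
```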
Key Components¶
1. Decoder Backbone¶
Based on the Llama architecture with modern optimizations:
- RMSNorm: Simpler, cheaper normalization than LayerNorm (no mean-centering)
- SwiGLU Activation: Gated MLP activation that typically outperforms ReLU/GELU
- RoPE: Rotary Position Embedding, encoding relative positions for better length generalization
- Grouped-Query Attention (GQA): Shares key/value heads across query heads to reduce KV-cache memory
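A configuration along these lines captures the backbone choices; the field names and default values below are illustrative (Llama-style), not the project's actual config class:

```python
from dataclasses import dataclass

@dataclass
class BackboneConfig:
    """Illustrative Llama-style backbone hyperparameters (names are ours)."""
    vocab_size: int = 32_000
    hidden_size: int = 2048
    num_layers: int = 24
    num_attention_heads: int = 16
    num_key_value_heads: int = 4      # fewer KV heads than query heads => GQA
    rms_norm_eps: float = 1e-5        # RMSNorm instead of LayerNorm
    rope_theta: float = 10_000.0      # RoPE base frequency
    intermediate_size: int = 5632     # SwiGLU MLP width
```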
2. Latent Planning System¶
The core innovation consists of three components:
Encoder Block¶
- Non-causal attention: Can attend to the entire sequence
- Learned query vector ζ: Aggregates sequence information
- Separate from decoder: Doesn't interfere with autoregressive flow
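A minimal sketch of such an encoder, assuming a single learned query \(\zeta\) that cross-attends (without a causal mask) over the hidden states produced by the early decoder layers; the class name, layer choices, and normalization here are illustrative:

```python
import torch
import torch.nn as nn

class LatentEncoder(nn.Module):
    """Non-causal encoder: a learned query ζ attends over the full sequence."""

    def __init__(self, d_model: int, n_heads: int = 8):
        super().__init__()
        self.zeta = nn.Parameter(torch.randn(1, 1, d_model))   # learned query ζ
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (batch, seq_len, d_model) hidden states from the early decoder layers.
        # No causal mask: the query may attend to every position.
        q = self.zeta.expand(h.size(0), -1, -1)
        pooled, _ = self.attn(q, h, h, need_weights=False)
        return self.norm(pooled.squeeze(1))                     # (batch, d_model)
```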
Binary Mapper¶
- Differentiable discretization: Converts continuous representations to binary plans
- Gumbel-Softmax: Enables gradient flow through discrete sampling
- Configurable dimensionality: Latent plan size \(Z \in \{0,1\}^d\)
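A sketch of a differentiable binary mapper using a Gumbel-Sigmoid-style relaxation with a straight-through estimator; this is one of several possible discretization schemes, and the exact variant used by the model may differ:

```python
import torch
import torch.nn as nn

class BinaryMapper(nn.Module):
    """Maps a continuous vector to a binary plan Z ∈ {0,1}^d while keeping gradients."""

    def __init__(self, d_model: int, latent_dim: int, tau: float = 1.0):
        super().__init__()
        self.proj = nn.Linear(d_model, latent_dim)   # logits for each plan bit
        self.tau = tau                               # relaxation temperature

    def forward(self, pooled: torch.Tensor) -> torch.Tensor:
        logits = self.proj(pooled)                                 # (batch, latent_dim)
        if self.training:
            # Gumbel-Sigmoid relaxation: add logistic noise, then squash.
            u = torch.rand_like(logits).clamp(1e-6, 1 - 1e-6)
            noise = torch.log(u) - torch.log1p(-u)                 # Logistic(0, 1) sample
            soft = torch.sigmoid((logits + noise) / self.tau)
        else:
            soft = torch.sigmoid(logits)
        hard = (soft > 0.5).float()
        # Straight-through estimator: forward pass uses `hard`, gradients flow through `soft`.
        return soft + (hard - soft).detach()
```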
Plan Injection¶
- Post-sampler FC layer: Integrates plan into decoder representations
- Residual connections: Preserves original information flow
- Layer-wise injection: Plan influences multiple decoder layers
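A sketch of the injection step: a post-sampler fully connected layer maps \(Z\) back into the model dimension and adds it residually to the decoder hidden states (names are illustrative):

```python
import torch
import torch.nn as nn

class PlanInjection(nn.Module):
    """Injects the binary plan Z into the decoder stream via FC + residual add."""

    def __init__(self, d_model: int, latent_dim: int):
        super().__init__()
        self.fc = nn.Linear(latent_dim, d_model)   # post-sampler FC layer

    def forward(self, h: torch.Tensor, z: torch.Tensor) -> torch.Tensor:
        # h: (batch, seq_len, d_model) decoder states; z: (batch, latent_dim) plan.
        # The residual add preserves the original information flow.
        return h + self.fc(z).unsqueeze(1)
```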
3. Conditional VAE Framework¶
The model is trained as a conditional Variational Autoencoder:
Training Mode¶
- Reconstruction Loss: Standard language modeling loss
- KL Divergence: Regularizes latent space toward uniform prior
- Free Bits: Prevents posterior collapse
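A sketch of how this objective could be computed, assuming a Bernoulli posterior with per-bit probabilities `bit_probs` and a per-bit free-bits threshold; the function name, signature, and default threshold are ours:

```python
import torch
import torch.nn.functional as F

def free_transformer_loss(logits, targets, bit_probs, free_bits=0.1):
    """Reconstruction loss plus free-bits-regularized KL to a uniform prior.

    logits:    (batch, seq_len, vocab) next-token predictions
    targets:   (batch, seq_len) ground-truth token ids
    bit_probs: (batch, latent_dim) posterior probabilities q(Z_i = 1 | X)
    """
    # Standard language-modeling (reconstruction) loss.
    recon = F.cross_entropy(logits.flatten(0, 1), targets.flatten())

    # Per-bit KL( Bernoulli(q) || Bernoulli(0.5) ), clamped for numerical stability.
    q = bit_probs.clamp(1e-6, 1 - 1e-6)
    kl_per_bit = q * torch.log(2 * q) + (1 - q) * torch.log(2 * (1 - q))

    # Free bits: only penalize KL above the threshold, discouraging posterior collapse.
    kl = torch.clamp(kl_per_bit, min=free_bits).sum(dim=-1).mean()
    return recon + kl, {"recon": recon.item(), "kl": kl.item()}
```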
Inference Mode¶
- Prior Sampling: Sample \(Z\) from the uniform prior distribution
- Conditional Generation: Generate tokens given the sampled plan
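Since the prior is uniform over \(\{0,1\}^d\), prior sampling reduces to an independent fair coin flip per bit; a minimal sketch:

```python
import torch

def sample_plan(batch_size: int, latent_dim: int, device="cpu") -> torch.Tensor:
    """Sample Z uniformly from {0,1}^latent_dim (each bit ~ Bernoulli(0.5))."""
    return torch.randint(0, 2, (batch_size, latent_dim), device=device).float()
```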
Training vs Inference Modes¶
Training Mode Flow¶
1. Forward Pass: Input → Early Layers → Encoder → Latent Z
2. Plan Injection: Z → Late Layers → Output Logits
3. Loss Computation: Reconstruction + KL Divergence
4. Backward Pass: Gradients flow through differentiable components
Inference Mode Flow¶
1. Context Processing: Prompt → Early Layers
2. Plan Sampling: Sample Z from uniform prior
3. Plan Injection: Z → Late Layers
4. Generation: Autoregressive token generation
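Putting these steps together, a simplified greedy generation loop (built on the illustrative `FreeTransformerSketch` from the overview above; a real implementation would add KV caching, sampling temperature, and stopping criteria) could look like:

```python
import torch

@torch.no_grad()
def generate(model, prompt_ids, max_new_tokens=32):
    """Plan-conditioned autoregressive generation with one plan fixed up front."""
    model.eval()
    # Sample the plan Z once from the uniform prior and keep it fixed.
    z = torch.randint(0, 2, (prompt_ids.size(0), model.latent_dim),
                      device=prompt_ids.device).float()
    tokens = prompt_ids.clone()
    for _ in range(max_new_tokens):
        logits = model(tokens, z=z)                               # (batch, T, vocab)
        next_token = logits[:, -1].argmax(dim=-1, keepdim=True)   # greedy decoding
        tokens = torch.cat([tokens, next_token], dim=1)
    return tokens
```

Note that \(Z\) is sampled once and held fixed for the whole continuation, which is what lets the plan shape the sequence globally rather than token by token.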
Mathematical Formulation¶
Encoder¶
The encoder produces a latent representation from the full sequence:

\[
h_{\text{enc}} = \operatorname{Attn}(\zeta,\, X,\, X)
\]

where \(\zeta\) is a learned query vector (the attention query) and \(X\) is the input sequence (the keys and values).
Binary Mapping¶
The continuous representation is mapped to a binary plan:

\[
Z = \operatorname{BinaryMap}(h_{\text{enc}}), \qquad Z \in \{0,1\}^d
\]

using differentiable binary encoding (e.g., Gumbel-Softmax).
Plan Injection¶
The binary plan is injected into the decoder via the post-sampler FC layer \(W_Z\) and a residual connection:

\[
h'_{\text{dec}} = h_{\text{dec}} + W_Z\, Z
\]

where \(h_{\text{dec}}\) comes from the early decoder layers.
Loss Function¶
The total loss combines reconstruction and regularization:

\[
\mathcal{L} = \mathcal{L}_{\text{recon}} + \sum_{i=1}^{d} \max\!\left( D_{\mathrm{KL}}\!\left( q(Z_i \mid X) \,\|\, p(Z_i) \right),\ \lambda \right)
\]

where \(\mathcal{L}_{\text{recon}}\) is the standard language-modeling loss, \(p(Z)\) is the uniform prior, and the free-bits threshold \(\lambda\) prevents posterior collapse.
Design Principles¶
1. Modularity¶
- Separate components: Encoder, decoder, and injection are independent
- Configurable: Easy to modify latent dimensions, injection points
- Extensible: Can add new components without major changes
2. Efficiency¶
- Shared backbone: Reuses decoder architecture
- Minimal overhead: Encoder only active during training
- Memory efficient: Gradient checkpointing and optimized attention
3. Compatibility¶
- Standard interfaces: Compatible with HuggingFace ecosystem
- Flexible training: Works with existing training pipelines
- Easy deployment: Standard PyTorch model for inference
Comparison with Baselines¶
| Aspect | Standard Transformer | Free Transformer |
|---|---|---|
| Planning | Implicit, reactive | Explicit, proactive |
| Coherence | Local | Global + Local |
| Controllability | Limited | High (via plan manipulation) |
| Training | Language modeling | Conditional VAE |
| Inference | Autoregressive | Plan-conditioned autoregressive |
| Complexity | O(n²) attention | O(n²) + O(d) latent |
Next Steps¶
- Free Transformer Details: Deep dive into the model
- Latent Planning: Understanding the planning mechanism
- Training Guide: How to train the model effectively