Frequently Asked Questions¶
General Questions¶
What is the Free Transformer?¶
The Free Transformer is a novel neural architecture that extends traditional autoregressive Transformers with explicit latent planning. Instead of generating tokens purely reactively based on previous tokens, it first creates an abstract "plan" (latent variable Z) and then generates tokens conditioned on both the history and this plan.
How does it differ from standard Transformers?¶
| Aspect | Standard Transformer | Free Transformer |
|---|---|---|
| Generation | Reactive (token-by-token) | Plan-then-generate |
| Training | Language modeling loss | Conditional VAE loss |
| Coherence | Local | Global + Local |
| Controllability | Limited | High (via plan manipulation) |
| Architecture | Decoder-only | Decoder + Encoder + Latent |
What are the main benefits?¶
- Better long-range coherence: The latent plan helps maintain consistency across long sequences
- Controllable generation: You can potentially manipulate the latent plan for controlled text generation
- Richer representations: The model learns more structured internal representations
- Improved sample diversity: Different plans lead to different generation styles
Technical Questions¶
What is the latent dimension and how do I choose it?¶
The latent dimension (latent_dim) determines the size of the binary plan vector Z. Typical values:
- Small models (< 100M params): 8-16 dimensions
- Medium models (100M-1B params): 16-32 dimensions
- Large models (> 1B params): 32-64 dimensions
Start with 16-32 and adjust based on your model size and task complexity.
What is "free bits" and why is it important?¶
Free bits is a regularization technique that prevents posterior collapse in VAE training. It sets a per-dimension floor on the KL divergence term: once a latent dimension's KL falls below the threshold, that dimension stops contributing gradient, so the encoder is no longer pushed to collapse it.
Typical values: 0.5-2.0. Higher values encourage more latent variable usage but may hurt reconstruction quality.
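As an illustration (not the repository's actual loss code, which lives in losses.py), the per-dimension clamp typically looks like this:

```python
import torch

def kl_with_free_bits(kl_per_dim: torch.Tensor, free_bits: float = 1.0) -> torch.Tensor:
    """Clamp each latent dimension's KL to a floor before summing.

    kl_per_dim: (batch, latent_dim) KL of each latent bit against the prior.
    Dimensions whose KL is below `free_bits` contribute the floor value instead,
    which removes the gradient pressure that drives them toward collapse.
    """
    return torch.clamp(kl_per_dim, min=free_bits).sum(dim=-1).mean()
```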
How do I know if my model is working correctly?¶
Monitor these metrics during training:
- KL loss should be positive: If it drops to zero, you have posterior collapse
- Reconstruction loss should decrease: Standard language modeling progress
- Total loss should be stable: No sudden spikes or instability
- Generation quality: Manually inspect generated text
What's the difference between training and inference modes?¶
Training mode:

- Uses the encoder to compute the latent plan from the full sequence
- Optimizes both reconstruction and KL losses
- The plan is derived from the actual data

Inference mode:

- Samples the latent plan from the uniform prior (no encoder needed)
- Only uses the reconstruction path for generation
- The plan is randomly sampled
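A minimal sketch of the two call paths, assuming the `mode` keyword shown later in this FAQ and a `(logits, recon_loss, kl_loss)` return in training mode (check the model's actual signature before relying on this):

```python
# Training: the encoder sees the full sequence; both losses are optimized.
logits, recon_loss, kl_loss = model(tokens, mode='train')   # return values assumed
loss = recon_loss + kl_weight * kl_loss

# Inference: the plan Z is sampled from the uniform prior; no encoder pass.
logits = model(tokens, mode='inference')
```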
Usage Questions¶
Can I use this for real-world datasets?¶
Yes! While the examples use synthetic data for quick prototyping, the model works with any text dataset. You can:
- Use HuggingFace datasets directly
- Provide your own text files
- Modify the data loading pipeline in `synthetic_data.py`
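For example, a HuggingFace dataset can be pulled in with `datasets.load_dataset`; the dataset name below is just a public example, and the tokenization step is a placeholder for whatever tokenizer the project uses:

```python
from datasets import load_dataset

# Any text dataset works; wikitext-2 is just a convenient public example.
dataset = load_dataset("wikitext", "wikitext-2-raw-v1", split="train")
texts = [row["text"] for row in dataset if row["text"].strip()]

# tokenize(texts) -> token id tensors, then feed them into the existing data pipeline.
```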
How do I run on multiple GPUs?¶
Use FSDP (Fully Sharded Data Parallel):
```bash
# Automatic GPU detection
torchrun --nproc_per_node=auto examples/train_free.py --config configs/free_transformer.yaml --use-fsdp

# Or use the Makefile
make train-free-fsdp
```
Can I run this without a GPU?¶
Yes, but it will be much slower. Use the CPU Docker image, or set the device to CPU in your code.
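A minimal sketch of the in-code option, assuming `model` and `tokens` are already constructed:

```python
import torch

# Force CPU execution; the model and its inputs must live on the same device.
device = torch.device("cpu")
model = model.to(device)
tokens = tokens.to(device)
```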
How do I change the model size?¶
Edit the configuration file or create a new one:
```yaml
model:
  hidden_dim: 768   # Increase for larger model
  num_layers: 24    # More layers = more capacity
  num_heads: 12     # Usually hidden_dim // 64
  latent_dim: 32    # Scale with model size
```
Training Questions¶
My KL loss is zero. What's wrong?¶
This is posterior collapse. The model is ignoring the latent variable. Solutions:
- Increase free bits: Try 1.0-2.0 instead of 0.5
- Reduce KL weight: Start with 0.01-0.05 instead of 0.1
- Use KL annealing: Gradually increase the KL weight during training (see the sketch after this list)
- Check latent dimension: Might be too large for your model
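A minimal sketch of linear KL annealing; the schedule length and target weight are assumptions, not project defaults:

```python
def kl_weight_schedule(step: int, warmup_steps: int = 5000, max_weight: float = 0.1) -> float:
    """Linearly ramp the KL weight from 0 to max_weight over warmup_steps."""
    return max_weight * min(1.0, step / warmup_steps)

# In the training loop:
# loss = recon_loss + kl_weight_schedule(global_step) * kl_loss
```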
Training is unstable. How do I fix it?¶
Common solutions:
- Gradient clipping: `torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)`
- Lower learning rate: Try 1e-5 instead of 1e-4
- Warmup: Use learning rate warmup for the first 1000 steps (see the sketch below)
- Mixed precision: Can help with stability and speed
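Putting clipping and warmup together, a sketch of a stabilized training step; `model`, `loader`, and `compute_loss` stand in for your own objects and loss call:

```python
import torch

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
# Linear warmup over the first 1000 steps, then a constant learning rate.
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lr_lambda=lambda step: min(1.0, (step + 1) / 1000)
)

for batch in loader:
    optimizer.zero_grad()
    loss = compute_loss(model, batch)   # placeholder for the actual loss call
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)   # gradient clipping
    optimizer.step()
    scheduler.step()
```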
How long should I train?¶
Depends on your dataset and model size:
- Small synthetic data: 5-10 epochs
- Medium datasets (1M-10M tokens): 10-50 epochs
- Large datasets (100M+ tokens): 1-5 epochs
Monitor validation loss and stop when it plateaus.
Should I use curriculum learning?¶
Yes, it often helps! Start with:
- Short sequences (128 tokens) → Long sequences (512+ tokens)
- High KL weight (1.0) → Low KL weight (0.1)
- Simple data → Complex data
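One way to encode such a curriculum is a small schedule function; the epoch boundaries and values below are illustrative, not tuned recommendations:

```python
def curriculum(epoch: int) -> tuple[int, float]:
    """Return (sequence length, KL weight) for the given epoch."""
    seq_len = 128 if epoch < 5 else 256 if epoch < 10 else 512   # short -> long
    kl_weight = max(0.1, 1.0 - 0.1 * epoch)                      # high -> low
    return seq_len, kl_weight
```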
Comparison Questions¶
How does it compare to other VAE-based language models?¶
The Free Transformer is specifically designed for autoregressive generation with:
- Explicit binary plans: More interpretable than continuous latents
- Llama-style backbone: Modern, efficient architecture
- Flexible injection: Plan can influence multiple layers
- Training efficiency: Competitive with standard Transformers
When should I use Free Transformer vs standard Transformer?¶
Use Free Transformer when:

- You need better long-range coherence
- Controllable generation is important
- You're working with structured text (stories, articles)
- Sample diversity matters

Use standard Transformer when:

- You need maximum training efficiency
- You're working with very short sequences
- Simplicity is preferred
- You have limited computational resources
Deployment Questions¶
Can I deploy this in production?¶
Yes, but consider:
- Inference mode is efficient: No encoder overhead
- Model size: Similar to equivalent standard Transformer
- Memory usage: Slightly higher due to latent computations
- Latency: Comparable to baseline models
How do I optimize for inference?¶
- Use inference mode: `model(tokens, mode='inference')`
- Enable eval mode: `model.eval()`
- Disable gradients: `torch.no_grad()`
- Consider quantization: Standard PyTorch quantization works
- Batch inference: Process multiple sequences together
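Combined, an inference call might look like this; the `mode='inference'` keyword matches the call shown above, and everything else is standard PyTorch:

```python
import torch

model.eval()                 # disable dropout and other training-only behavior
with torch.no_grad():        # skip gradient bookkeeping
    logits = model(tokens, mode='inference')   # plan Z sampled from the prior
```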
Can I convert to ONNX or TensorRT?¶
The model uses standard PyTorch operations, so conversion should work, but:
- Test thoroughly: Some operations might not be supported
- Separate modes: Export training and inference modes separately
- Dynamic shapes: May need fixed input sizes
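A hedged sketch of exporting the inference path to ONNX, wrapping the model so the exported graph always takes that path; `vocab_size`, the fixed sequence length, and the output name are assumptions:

```python
import torch

class InferenceWrapper(torch.nn.Module):
    """Pin the exported graph to the inference code path."""
    def __init__(self, model):
        super().__init__()
        self.model = model

    def forward(self, tokens):
        return self.model(tokens, mode='inference')

vocab_size = 32000                                       # assumed; use the model's actual value
wrapper = InferenceWrapper(model).eval()
dummy_tokens = torch.randint(0, vocab_size, (1, 512))    # fixed sequence length
torch.onnx.export(
    wrapper,
    (dummy_tokens,),
    "free_transformer.onnx",
    input_names=["tokens"],
    output_names=["logits"],
    dynamic_axes={"tokens": {0: "batch"}},               # variable batch, fixed length
)
```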
Development Questions¶
How do I contribute?¶
- Fork the repository
- Create a feature branch
- Make your changes
- Run tests: `make test`
- Run quality checks: `make quality`
- Submit a pull request
How do I add custom components?¶
The architecture is modular:
- Custom encoder: Inherit from `nn.Module` and implement `forward()` (see the sketch below)
- Custom injection: Modify `injection.py`
- Custom losses: Add to `losses.py`
- Custom data: Extend `synthetic_data.py`
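For instance, a custom encoder only needs to map hidden states to logits for the latent bits; the exact interface the model expects is an assumption here, so check the existing encoder first:

```python
import torch
import torch.nn as nn

class MeanPoolEncoder(nn.Module):
    """Hypothetical drop-in encoder: mean-pool the sequence, predict logits for Z."""

    def __init__(self, hidden_dim: int, latent_dim: int):
        super().__init__()
        self.proj = nn.Linear(hidden_dim, latent_dim)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, seq_len, hidden_dim) -> (batch, latent_dim) logits
        pooled = hidden_states.mean(dim=1)
        return self.proj(pooled)
```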
Where can I get help?¶
- Documentation: This site covers most use cases
- GitHub Issues: Report bugs or ask questions
- Code examples: Check the `examples/` directory
- Tests: Look at `tests/` for usage patterns
Performance Questions¶
How much slower is it than baseline Transformers?¶
- Training: ~20-30% slower due to the encoder and VAE loss
- Inference: ~5-10% slower due to latent computations
The overhead is minimal and often worth it for the improved capabilities.
How much memory does it use?¶
- Training: ~30-40% more memory than baseline (due to the encoder)
- Inference: ~10-15% more memory than baseline
Use gradient checkpointing and mixed precision to reduce memory usage.
Can I make it faster?¶
- Use mixed precision: `torch.cuda.amp` (see the sketch after this list)
- Gradient checkpointing: Trades compute for memory
- Efficient attention: Flash Attention (planned feature)
- Model parallelism: FSDP for large models
- Batch size tuning: Find optimal batch size for your hardware
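A minimal mixed-precision training step with `torch.cuda.amp`; as before, `model`, `loader`, `optimizer`, and `compute_loss` are placeholders for your own objects:

```python
import torch

scaler = torch.cuda.amp.GradScaler()

for batch in loader:
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():           # run the forward pass in mixed precision
        loss = compute_loss(model, batch)     # placeholder for the actual loss call
    scaler.scale(loss).backward()             # scale to avoid fp16 gradient underflow
    scaler.step(optimizer)
    scaler.update()
```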
Still have questions? Open an issue on GitHub!