Frequently Asked Questions¶
General Questions¶
What is the Free Transformer?¶
The Free Transformer is a novel neural architecture that extends traditional autoregressive Transformers with explicit latent planning. Instead of generating tokens purely reactively based on previous tokens, it first creates an abstract "plan" (latent variable Z) and then generates tokens conditioned on both the history and this plan.
How does it differ from standard Transformers?¶
| Aspect | Standard Transformer | Free Transformer |
|---|---|---|
| Generation | Reactive (token-by-token) | Plan-then-generate |
| Training | Language modeling loss | Conditional VAE loss |
| Coherence | Local | Global + Local |
| Controllability | Limited | High (via plan manipulation) |
| Architecture | Decoder-only | Decoder + Encoder + Latent |
What are the main benefits?¶
- Better long-range coherence: The latent plan helps maintain consistency across long sequences
- Controllable generation: You can potentially manipulate the latent plan for controlled text generation
- Richer representations: The model learns more structured internal representations
- Improved sample diversity: Different plans lead to different generation styles
Technical Questions¶
What is the latent dimension and how do I choose it?¶
The latent dimension (latent_dim) determines the size of the binary plan vector Z. Typical values:
- Small models (< 100M params): 8-16 dimensions
- Medium models (100M-1B params): 16-32 dimensions
- Large models (> 1B params): 32-64 dimensions
Start with 16-32 and adjust based on your model size and task complexity.
What is "free bits" and why is it important?¶
Free bits is a regularization technique that prevents posterior collapse in VAE training. It sets a per-dimension floor on the KL divergence term: once a latent dimension's KL falls below the threshold, that dimension stops contributing gradient, so the encoder is no longer pushed to collapse it.
Typical values: 0.5-2.0. Higher values encourage more latent variable usage but may hurt reconstruction quality.
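As an illustration (not the repository's actual loss code, which lives in losses.py), the per-dimension clamp typically looks like this:

```python
import torch

def kl_with_free_bits(kl_per_dim: torch.Tensor, free_bits: float = 1.0) -> torch.Tensor:
    """Clamp each latent dimension's KL to a floor before summing.

    kl_per_dim: (batch, latent_dim) KL of each latent bit against the prior.
    Dimensions whose KL is below `free_bits` contribute the floor value instead,
    which removes the gradient pressure that drives them toward collapse.
    """
    return torch.clamp(kl_per_dim, min=free_bits).sum(dim=-1).mean()
```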
How do I know if my model is working correctly?¶
Monitor these metrics during training:
- KL loss should be positive: If it drops to zero, you have posterior collapse
- Reconstruction loss should decrease: Standard language modeling progress
- Total loss should be stable: No sudden spikes or instability
- Generation quality: Manually inspect generated text
What's the difference between training and inference modes?¶
Training mode:

- Uses the encoder to compute the latent plan from the full sequence
- Optimizes both reconstruction and KL losses
- The plan is derived from the actual data

Inference mode:

- Samples the latent plan from the uniform prior (no encoder needed)
- Only uses the reconstruction path for generation
- The plan is randomly sampled
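A minimal sketch of the two call paths, assuming the `mode` keyword shown later in this FAQ and a `(logits, recon_loss, kl_loss)` return in training mode (check the model's actual signature before relying on this):

```python
# Training: the encoder sees the full sequence; both losses are optimized.
logits, recon_loss, kl_loss = model(tokens, mode='train')   # return values assumed
loss = recon_loss + kl_weight * kl_loss

# Inference: the plan Z is sampled from the uniform prior; no encoder pass.
logits = model(tokens, mode='inference')
```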
Usage Questions¶
Can I use this for real-world datasets?¶
Yes! While the examples use synthetic data for quick prototyping, the model works with any text dataset. You can:
- Use HuggingFace datasets directly
- Provide your own text files
- Modify the data loading pipeline in `synthetic_data.py`
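For example, a HuggingFace dataset can be pulled in with `datasets.load_dataset`; the dataset name below is just a public example, and the tokenization step is a placeholder for whatever tokenizer the project uses:

```python
from datasets import load_dataset

# Any text dataset works; wikitext-2 is just a convenient public example.
dataset = load_dataset("wikitext", "wikitext-2-raw-v1", split="train")
texts = [row["text"] for row in dataset if row["text"].strip()]

# tokenize(texts) -> token id tensors, then feed them into the existing data pipeline.
```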
How do I run on multiple GPUs?¶
Use FSDP (Fully Sharded Data Parallel):
```bash
# Automatic GPU detection
torchrun --nproc_per_node=auto examples/train_free.py --config configs/free_transformer.yaml --use-fsdp

# Or use the Makefile
make train-free-fsdp
```
Can I run this without a GPU?¶
Yes, but it will be much slower. Use the CPU Docker image, or set the device to CPU in your code.
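A minimal sketch of the in-code option, assuming `model` and `tokens` are already constructed:

```python
import torch

# Force CPU execution; the model and its inputs must live on the same device.
device = torch.device("cpu")
model = model.to(device)
tokens = tokens.to(device)
```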
How do I change the model size?¶
Edit the configuration file or create a new one:
```yaml
model:
  hidden_dim: 768   # Increase for larger model
  num_layers: 24    # More layers = more capacity
  num_heads: 12     # Usually hidden_dim // 64
  latent_dim: 32    # Scale with model size
```
Training Questions¶
My KL loss is zero. What's wrong?¶
This is posterior collapse. The model is ignoring the latent variable. Solutions:
- Increase free bits: Try 1.0-2.0 instead of 0.5
- Reduce KL weight: Start with 0.01-0.05 instead of 0.1
- Use KL annealing: Gradually increase the KL weight during training (see the sketch after this list)
- Check latent dimension: Might be too large for your model
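A minimal sketch of linear KL annealing; the schedule length and target weight are assumptions, not project defaults:

```python
def kl_weight_schedule(step: int, warmup_steps: int = 5000, max_weight: float = 0.1) -> float:
    """Linearly ramp the KL weight from 0 to max_weight over warmup_steps."""
    return max_weight * min(1.0, step / warmup_steps)

# In the training loop:
# loss = recon_loss + kl_weight_schedule(global_step) * kl_loss
```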
Training is unstable. How do I fix it?¶
Common solutions:
- Gradient clipping: `torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)`
- Lower learning rate: Try 1e-5 instead of 1e-4
- Warmup: Use learning rate warmup for the first 1000 steps (see the sketch below)
- Mixed precision: Can help with stability and speed
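Putting clipping and warmup together, a sketch of a stabilized training step; `model`, `loader`, and `compute_loss` stand in for your own objects and loss call:

```python
import torch

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
# Linear warmup over the first 1000 steps, then a constant learning rate.
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lr_lambda=lambda step: min(1.0, (step + 1) / 1000)
)

for batch in loader:
    optimizer.zero_grad()
    loss = compute_loss(model, batch)   # placeholder for the actual loss call
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)   # gradient clipping
    optimizer.step()
    scheduler.step()
```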
How long should I train?¶
Depends on your dataset and model size:
- Small synthetic data: 5-10 epochs
- Medium datasets (1M-10M tokens): 10-50 epochs
- Large datasets (100M+ tokens): 1-5 epochs
Monitor validation loss and stop when it plateaus.
Should I use curriculum learning?¶
Yes, it often helps! Start with:
- Short sequences (128 tokens) → Long sequences (512+ tokens)
- High KL weight (1.0) → Low KL weight (0.1)
- Simple data → Complex data
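One way to encode such a curriculum is a small schedule function; the epoch boundaries and values below are illustrative, not tuned recommendations:

```python
def curriculum(epoch: int) -> tuple[int, float]:
    """Return (sequence length, KL weight) for the given epoch."""
    seq_len = 128 if epoch < 5 else 256 if epoch < 10 else 512   # short -> long
    kl_weight = max(0.1, 1.0 - 0.1 * epoch)                      # high -> low
    return seq_len, kl_weight
```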
Comparison Questions¶
How does it compare to other VAE-based language models?¶
The Free Transformer is specifically designed for autoregressive generation with:
- Explicit binary plans: More interpretable than continuous latents
- Llama-style backbone: Modern, efficient architecture
- Flexible injection: Plan can influence multiple layers
- Training efficiency: Competitive with standard Transformers
When should I use Free Transformer vs standard Transformer?¶
Use Free Transformer when:

- You need better long-range coherence
- Controllable generation is important
- You're working with structured text (stories, articles)
- Sample diversity matters

Use standard Transformer when:

- You need maximum training efficiency
- You're working with very short sequences
- Simplicity is preferred
- You have limited computational resources
Deployment Questions¶
Can I deploy this in production?¶
Yes, but consider:
- Inference mode is efficient: No encoder overhead
- Model size: Similar to equivalent standard Transformer
- Memory usage: Slightly higher due to latent computations
- Latency: Comparable to baseline models
How do I optimize for inference?¶
- Use inference mode: `model(tokens, mode='inference')`
- Enable eval mode: `model.eval()`
- Disable gradients: `torch.no_grad()`
- Consider quantization: Standard PyTorch quantization works
- Batch inference: Process multiple sequences together
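Combined, an inference call might look like this; the `mode='inference'` keyword matches the call shown above, and everything else is standard PyTorch:

```python
import torch

model.eval()                 # disable dropout and other training-only behavior
with torch.no_grad():        # skip gradient bookkeeping
    logits = model(tokens, mode='inference')   # plan Z sampled from the prior
```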
Can I convert to ONNX or TensorRT?¶
The model uses standard PyTorch operations, so conversion should work, but:
- Test thoroughly: Some operations might not be supported
- Separate modes: Export training and inference modes separately
- Dynamic shapes: May need fixed input sizes
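A hedged sketch of exporting the inference path to ONNX, wrapping the model so the exported graph always takes that path; `vocab_size`, the fixed sequence length, and the output name are assumptions:

```python
import torch

class InferenceWrapper(torch.nn.Module):
    """Pin the exported graph to the inference code path."""
    def __init__(self, model):
        super().__init__()
        self.model = model

    def forward(self, tokens):
        return self.model(tokens, mode='inference')

vocab_size = 32000                                       # assumed; use the model's actual value
wrapper = InferenceWrapper(model).eval()
dummy_tokens = torch.randint(0, vocab_size, (1, 512))    # fixed sequence length
torch.onnx.export(
    wrapper,
    (dummy_tokens,),
    "free_transformer.onnx",
    input_names=["tokens"],
    output_names=["logits"],
    dynamic_axes={"tokens": {0: "batch"}},               # variable batch, fixed length
)
```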
Development Questions¶
How do I contribute?¶
- Fork the repository
- Create a feature branch
- Make your changes
- Run tests: `make test`
- Run quality checks: `make quality`
- Submit a pull request
How do I add custom components?¶
The architecture is modular:
- Custom encoder: Inherit from `nn.Module` and implement `forward()` (see the sketch below)
- Custom injection: Modify `injection.py`
- Custom losses: Add to `losses.py`
- Custom data: Extend `synthetic_data.py`
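For instance, a custom encoder only needs to map hidden states to logits for the latent bits; the exact interface the model expects is an assumption here, so check the existing encoder first:

```python
import torch
import torch.nn as nn

class MeanPoolEncoder(nn.Module):
    """Hypothetical drop-in encoder: mean-pool the sequence, predict logits for Z."""

    def __init__(self, hidden_dim: int, latent_dim: int):
        super().__init__()
        self.proj = nn.Linear(hidden_dim, latent_dim)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, seq_len, hidden_dim) -> (batch, latent_dim) logits
        pooled = hidden_states.mean(dim=1)
        return self.proj(pooled)
```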
Where can I get help?¶
- Documentation: This site covers most use cases
- GitHub Issues: Report bugs or ask questions
- Code examples: Check the `examples/` directory
- Tests: Look at `tests/` for usage patterns
Performance Questions¶
How much slower is it than baseline Transformers?¶
- Training: ~20-30% slower due to the encoder and VAE loss
- Inference: ~5-10% slower due to latent computations
The overhead is minimal and often worth it for the improved capabilities.
How much memory does it use?¶
- Training: ~30-40% more memory than baseline (due to the encoder)
- Inference: ~10-15% more memory than baseline
Use gradient checkpointing and mixed precision to reduce memory usage.
Can I make it faster?¶
- Use mixed precision: `torch.cuda.amp` (see the sketch after this list)
- Gradient checkpointing: Trades compute for memory
- Efficient attention: Flash Attention (planned feature)
- Model parallelism: FSDP for large models
- Batch size tuning: Find optimal batch size for your hardware
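A minimal mixed-precision training step with `torch.cuda.amp`; as before, `model`, `loader`, `optimizer`, and `compute_loss` are placeholders for your own objects:

```python
import torch

scaler = torch.cuda.amp.GradScaler()

for batch in loader:
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():           # run the forward pass in mixed precision
        loss = compute_loss(model, batch)     # placeholder for the actual loss call
    scaler.scale(loss).backward()             # scale to avoid fp16 gradient underflow
    scaler.step(optimizer)
    scaler.update()
```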
Still have questions? Open an issue on GitHub!