# Synthetic Data Generation
This guide covers generating synthetic training data for Free Transformer experiments and prototyping.
## Overview

Synthetic data generation enables:

- Fast prototyping: Test models without large datasets
- Controlled experiments: Known data properties for analysis
- Ablation studies: Isolate specific model behaviors
- Development: Quick iteration during model development
## Quick Start

### Using Makefile

```bash
# Generate large synthetic dataset
make generate-data

# Generate small dataset for testing
make generate-data-small
```
### Using Python Script

```bash
# Basic usage
python examples/generate_data.py --output-dir data/synthetic

# Custom parameters
python examples/generate_data.py \
    --output-dir data/custom \
    --vocab-size 5000 \
    --seq-length 256 \
    --num-train 10000 \
    --num-val 1000 \
    --seed 42
```
## Data Generation Types

### 1. Random Token Sequences

Simple random sequences for basic testing:

```python
from free_transformer.synthetic_data import generate_random_sequences

data = generate_random_sequences(
    vocab_size=1000,
    seq_length=128,
    num_samples=5000,
    seed=42
)
```
Use cases:

- Model architecture testing
- Memory usage profiling
- Basic training pipeline validation
### 2. Pattern-Based Sequences

Sequences with embedded patterns for coherence testing:

```python
from free_transformer.synthetic_data import generate_pattern_sequences

data = generate_pattern_sequences(
    vocab_size=1000,
    seq_length=256,
    num_samples=10000,
    pattern_types=['repetition', 'alternation', 'progression'],
    pattern_probability=0.3,
    seed=42
)
```
Pattern types:
- Repetition: [A, B, C, A, B, C, ...]
- Alternation: [A, B, A, B, A, B, ...]
- Progression: [1, 2, 3, 4, 5, ...]
- Reversal: [A, B, C, C, B, A, ...]
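The generator handles pattern insertion internally; purely for intuition, a minimal standalone sketch of how such patterns could be embedded into a random sequence (the `embed_pattern` helper below is illustrative, not part of the `free_transformer` API) might look like this:

```python
import numpy as np

def embed_pattern(sequence, pattern_type, start, motif):
    """Overwrite part of `sequence` with a simple pattern (illustrative only)."""
    if pattern_type == 'repetition':       # [A, B, C, A, B, C, ...]
        block = list(motif) * 2
    elif pattern_type == 'alternation':    # [A, B, A, B, A, B, ...]
        block = [motif[0], motif[1]] * 3
    elif pattern_type == 'progression':    # [1, 2, 3, 4, 5, ...]
        block = list(range(motif[0], motif[0] + 6))
    elif pattern_type == 'reversal':       # [A, B, C, C, B, A, ...]
        block = list(motif) + list(reversed(motif))
    else:
        raise ValueError(f"Unknown pattern type: {pattern_type}")
    end = min(start + len(block), len(sequence))
    sequence[start:end] = block[:end - start]
    return sequence

rng = np.random.default_rng(42)
sequence = rng.integers(0, 1000, size=32).tolist()
sequence = embed_pattern(sequence, 'repetition', start=4, motif=[7, 8, 9])
```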
### 3. Hierarchical Sequences

Multi-level structured sequences:

```python
from free_transformer.synthetic_data import generate_hierarchical_sequences

data = generate_hierarchical_sequences(
    vocab_size=1000,
    seq_length=512,
    num_samples=20000,
    hierarchy_levels=3,
    structure_probability=0.5,
    seed=42
)
```
Structure levels:

- Global: Overall sequence theme
- Local: Subsequence patterns
- Token: Individual token relationships
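`generate_hierarchical_sequences` implements this internally; as a rough, assumption-laden illustration of the idea (not the library's actual algorithm), a global theme can restrict the token range while local motifs repeat within it:

```python
import numpy as np

def toy_hierarchical_sequence(vocab_size=1000, seq_length=64, seed=42):
    """Illustrative only: a global theme picks a token band, local motifs repeat inside it."""
    rng = np.random.default_rng(seed)
    theme_start = int(rng.integers(0, vocab_size - 100))              # global level
    sequence = []
    while len(sequence) < seq_length:
        motif = rng.integers(theme_start, theme_start + 100, size=4)  # local level
        sequence.extend(motif.tolist() * 2)                           # token level
    return sequence[:seq_length]

sequence = toy_hierarchical_sequence()
```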
### 4. Conditional Sequences

Sequences conditioned on control tokens:

```python
from free_transformer.synthetic_data import generate_conditional_sequences

data = generate_conditional_sequences(
    vocab_size=1000,
    seq_length=256,
    num_samples=15000,
    num_conditions=10,
    condition_strength=0.8,
    seed=42
)
```
Conditioning:

- First token determines sequence properties
- Style, length, or pattern is controlled by the condition
- Tests the model's ability to follow instructions
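As a rough illustration of this idea (a sketch under assumptions, not the library's implementation), the condition token can be prepended and then bias which part of the vocabulary the remaining tokens are drawn from:

```python
import numpy as np

def toy_conditional_sequence(condition, vocab_size=1000, seq_length=32,
                             condition_strength=0.8, seed=42):
    """Illustrative only: the condition picks a token band; `condition_strength`
    controls how often tokens come from that band instead of the full vocabulary."""
    rng = np.random.default_rng(seed)
    band = 100
    band_start = (condition * band) % (vocab_size - band)
    sequence = [condition]  # the first token is the condition itself
    for _ in range(seq_length - 1):
        if rng.random() < condition_strength:
            sequence.append(int(rng.integers(band_start, band_start + band)))
        else:
            sequence.append(int(rng.integers(0, vocab_size)))
    return sequence

sequence = toy_conditional_sequence(condition=3)
```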
## Configuration Options

### Basic Parameters

```python
config = {
    'vocab_size': 10000,  # Vocabulary size
    'seq_length': 512,    # Sequence length
    'num_train': 50000,   # Training samples
    'num_val': 5000,      # Validation samples
    'seed': 42,           # Random seed
}
```
### Advanced Parameters

```python
config = {
    # Pattern control
    'pattern_probability': 0.3,               # Probability of pattern occurrence
    'pattern_length': 10,                     # Average pattern length
    'pattern_types': ['rep', 'alt', 'prog'],  # Pattern types to use

    # Hierarchy control
    'hierarchy_levels': 3,                    # Number of hierarchy levels
    'structure_probability': 0.5,             # Probability of structured content
    'global_coherence': 0.7,                  # Global coherence strength

    # Conditioning
    'num_conditions': 20,                     # Number of condition types
    'condition_strength': 0.8,                # How strongly the condition affects the sequence
    'condition_tokens': [0, 1, 2],            # Special condition tokens

    # Noise and variation
    'noise_probability': 0.1,                 # Random token injection probability
    'length_variation': 0.2,                  # Sequence length variation
    'vocab_distribution': 'uniform',          # Token distribution ('uniform' or 'zipf')
}
```
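The `'vocab_distribution'` option names `'uniform'` and `'zipf'`; as a hedged sketch (not the library's sampler), Zipf-distributed token ids can be drawn with NumPy and resampled into the vocabulary range, since the raw Zipf draw is unbounded:

```python
import numpy as np

def sample_tokens(num_tokens, vocab_size, distribution='uniform', seed=42):
    """Sample token ids under a uniform or truncated-Zipf distribution (sketch)."""
    rng = np.random.default_rng(seed)
    if distribution == 'uniform':
        return rng.integers(0, vocab_size, size=num_tokens)
    if distribution == 'zipf':
        # Raw Zipf samples are unbounded; redraw anything outside the vocabulary,
        # then shift to 0-based token ids.
        samples = rng.zipf(a=1.2, size=num_tokens)
        while np.any(samples > vocab_size):
            out_of_range = samples > vocab_size
            samples[out_of_range] = rng.zipf(a=1.2, size=int(out_of_range.sum()))
        return samples - 1
    raise ValueError(f"Unknown distribution: {distribution}")

tokens = sample_tokens(1000, vocab_size=10000, distribution='zipf')
```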
## Data Analysis

### Sequence Statistics

```python
from collections import Counter

import numpy as np

from free_transformer.synthetic_data import analyze_sequences


def analyze_dataset(sequences):
    """Analyze synthetic dataset properties."""
    stats = {
        'num_sequences': len(sequences),
        'avg_length': np.mean([len(seq) for seq in sequences]),
        'vocab_usage': len(set(token for seq in sequences for token in seq)),
        'token_distribution': Counter(token for seq in sequences for token in seq),
        'pattern_detection': detect_patterns(sequences),  # defined in "Pattern Detection" below
    }
    return stats

# Example usage
train_data = load_synthetic_data('data/synthetic/train.jsonl')
stats = analyze_dataset(train_data)

print(f"Dataset contains {stats['num_sequences']} sequences")
print(f"Average length: {stats['avg_length']:.1f} tokens")
print(f"Vocabulary usage: {stats['vocab_usage']}/{config['vocab_size']}")
```
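`load_synthetic_data` is used throughout this guide but not defined in it. If you need a standalone version, a minimal JSONL loader might look like the sketch below; the per-line schema (a JSON object with a "tokens" list, or a bare list of token ids) is an assumption, so check the files that `examples/generate_data.py` actually writes.

```python
import json

def load_synthetic_data(path):
    """Load token sequences from a JSONL file (assumed schema, see note above)."""
    sequences = []
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            record = json.loads(line)
            # Assumed schema: either a bare list of ids or an object with a "tokens" key.
            sequences.append(record['tokens'] if isinstance(record, dict) else record)
    return sequences
```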
### Pattern Detection

```python
def detect_patterns(sequences, min_length=3):
    """Detect common patterns in sequences."""
    patterns = Counter()
    for seq in sequences:
        for i in range(len(seq) - min_length + 1):
            pattern = tuple(seq[i:i + min_length])
            patterns[pattern] += 1

    # Find most common patterns
    common_patterns = patterns.most_common(10)
    return common_patterns

# Example usage
patterns = detect_patterns(train_data)
print("Most common patterns:")
for pattern, count in patterns:
    print(f"  {pattern}: {count} occurrences")
```
### Coherence Metrics

```python
def measure_coherence(sequences):
    """Measure sequence coherence metrics."""
    metrics = {}

    # Local coherence (adjacent token similarity)
    local_coherence = []
    for seq in sequences:
        similarities = []
        for i in range(len(seq) - 1):
            # Simple similarity: same token = 1, different = 0
            sim = 1 if seq[i] == seq[i + 1] else 0
            similarities.append(sim)
        local_coherence.append(np.mean(similarities))
    metrics['local_coherence'] = np.mean(local_coherence)

    # Global coherence (sequence-level consistency)
    global_coherence = []
    for seq in sequences:
        # Measure repetition of tokens
        token_counts = Counter(seq)
        max_count = max(token_counts.values())
        coherence = max_count / len(seq)
        global_coherence.append(coherence)
    metrics['global_coherence'] = np.mean(global_coherence)

    return metrics
```
## Custom Data Generators

### Creating Custom Generators

```python
import numpy as np

from free_transformer.synthetic_data import BaseDataGenerator


class CustomDataGenerator(BaseDataGenerator):
    def __init__(self, vocab_size, seq_length, **kwargs):
        super().__init__(vocab_size, seq_length, **kwargs)
        self.custom_param = kwargs.get('custom_param', 0.5)

    def generate_sequence(self):
        """Generate a single sequence with custom logic."""
        sequence = []
        # Custom generation logic here
        for i in range(self.seq_length):
            if np.random.random() < self.custom_param:
                # Custom behavior
                token = self.generate_special_token(i)
            else:
                # Random token
                token = np.random.randint(0, self.vocab_size)
            sequence.append(token)
        return sequence

    def generate_special_token(self, position):
        """Generate a special token based on position."""
        # Example: position-dependent token
        return position % 100

# Usage
generator = CustomDataGenerator(
    vocab_size=1000,
    seq_length=256,
    custom_param=0.3
)
sequences = generator.generate_dataset(num_samples=10000)
```
### Template-Based Generation

```python
class TemplateDataGenerator(BaseDataGenerator):
    def __init__(self, vocab_size, seq_length, templates, **kwargs):
        super().__init__(vocab_size, seq_length, **kwargs)
        self.templates = templates

    def generate_sequence(self):
        """Generate a sequence from a randomly chosen template."""
        # Note: np.random.choice cannot pick from a list of unequal-length lists,
        # so choose a template by index instead.
        template = self.templates[np.random.randint(len(self.templates))]
        sequence = []
        for element in template:
            if element == 'RANDOM':
                token = np.random.randint(0, self.vocab_size)
            elif element == 'PATTERN':
                token = self.generate_pattern_token(len(sequence))
            else:
                token = element
            sequence.append(token)

        # Pad or truncate to the desired length
        sequence = self.adjust_length(sequence)
        return sequence

# Define templates
templates = [
    [1, 'RANDOM', 'RANDOM', 2, 'PATTERN', 'PATTERN'],
    [3, 'PATTERN', 'RANDOM', 'PATTERN', 4],
    ['RANDOM'] * 10 + [5] + ['PATTERN'] * 5,
]

generator = TemplateDataGenerator(
    vocab_size=1000,
    seq_length=256,
    templates=templates
)
```
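`generate_pattern_token` and `adjust_length` are used above but not defined in this guide. One possible sketch of these helpers (the subclass name and the pad token 0 are assumptions, and `self.seq_length` / `self.vocab_size` are assumed to be set by `BaseDataGenerator`):

```python
class PaddedTemplateDataGenerator(TemplateDataGenerator):
    def generate_pattern_token(self, position):
        """Cycle through a small motif so 'PATTERN' slots look structured."""
        motif = [11, 12, 13]
        return motif[position % len(motif)]

    def adjust_length(self, sequence, pad_token=0):
        """Pad with `pad_token` or truncate so the sequence has exactly self.seq_length tokens."""
        if len(sequence) < self.seq_length:
            sequence = sequence + [pad_token] * (self.seq_length - len(sequence))
        return sequence[:self.seq_length]
```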
## Data Quality Control

### Validation Checks

```python
def validate_synthetic_data(sequences, config):
    """Validate generated synthetic data."""
    issues = []

    # Check sequence lengths
    lengths = [len(seq) for seq in sequences]
    if min(lengths) < config['min_length']:
        issues.append(f"Sequences too short: min={min(lengths)}")
    if max(lengths) > config['max_length']:
        issues.append(f"Sequences too long: max={max(lengths)}")

    # Check vocabulary usage
    all_tokens = set(token for seq in sequences for token in seq)
    if max(all_tokens) >= config['vocab_size']:
        issues.append(f"Invalid tokens: max={max(all_tokens)}")
    if min(all_tokens) < 0:
        issues.append(f"Negative tokens: min={min(all_tokens)}")

    # Check for empty sequences
    empty_count = sum(1 for seq in sequences if len(seq) == 0)
    if empty_count > 0:
        issues.append(f"Empty sequences: {empty_count}")

    return issues

# Usage
issues = validate_synthetic_data(train_data, config)
if issues:
    print("Data validation issues:")
    for issue in issues:
        print(f"  - {issue}")
else:
    print("Data validation passed!")
```
### Quality Metrics

```python
def compute_quality_metrics(sequences):
    """Compute data quality metrics."""
    metrics = {}

    # Diversity metrics
    unique_sequences = len(set(tuple(seq) for seq in sequences))
    metrics['sequence_diversity'] = unique_sequences / len(sequences)

    all_tokens = [token for seq in sequences for token in seq]
    unique_tokens = len(set(all_tokens))
    metrics['token_diversity'] = unique_tokens

    # Complexity metrics
    avg_entropy = np.mean([compute_entropy(seq) for seq in sequences])
    metrics['average_entropy'] = avg_entropy

    # Pattern metrics
    pattern_count = sum(1 for seq in sequences if has_pattern(seq))
    metrics['pattern_ratio'] = pattern_count / len(sequences)

    return metrics


def compute_entropy(sequence):
    """Compute the entropy of a sequence."""
    token_counts = Counter(sequence)
    total_tokens = len(sequence)

    entropy = 0.0
    for count in token_counts.values():
        prob = count / total_tokens
        entropy -= prob * np.log2(prob)
    return entropy
```
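`has_pattern` is referenced above but not defined in this guide; a simple stand-in (an assumption, reusing the n-gram counting idea from `detect_patterns`) treats a sequence as patterned if any short n-gram repeats:

```python
from collections import Counter

def has_pattern(sequence, min_length=3, min_repeats=2):
    """Return True if any n-gram of `min_length` tokens occurs at least `min_repeats` times."""
    counts = Counter(
        tuple(sequence[i:i + min_length])
        for i in range(len(sequence) - min_length + 1)
    )
    return any(count >= min_repeats for count in counts.values())
```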
## Integration with Training

### Data Loading

```python
import torch
from torch.utils.data import Dataset, DataLoader


class SyntheticDataset(Dataset):
    def __init__(self, sequences, max_length=None):
        self.sequences = sequences
        self.max_length = max_length

    def __len__(self):
        return len(self.sequences)

    def __getitem__(self, idx):
        sequence = self.sequences[idx]
        if self.max_length:
            sequence = sequence[:self.max_length]
        return {
            'input_ids': torch.tensor(sequence, dtype=torch.long),
            'length': len(sequence)
        }

# Usage
train_sequences = load_synthetic_data('data/synthetic/train.jsonl')
train_dataset = SyntheticDataset(train_sequences, max_length=512)
train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)
```
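The default DataLoader collation stacks tensors, which only works if every sequence in a batch has the same length. If your sequences vary in length (for example via `length_variation`), a padding collate function like the sketch below keeps batching working; the pad id 0 is an assumption.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

def pad_collate(batch, pad_token_id=0):
    """Pad variable-length sequences in a batch up to the longest one."""
    input_ids = pad_sequence(
        [item['input_ids'] for item in batch],
        batch_first=True,
        padding_value=pad_token_id,
    )
    lengths = torch.tensor([item['length'] for item in batch], dtype=torch.long)
    return {'input_ids': input_ids, 'lengths': lengths}

train_loader = DataLoader(
    train_dataset, batch_size=32, shuffle=True, collate_fn=pad_collate
)
```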
### Curriculum Learning

```python
def create_curriculum_data(base_sequences, stages):
    """Create curriculum learning datasets."""
    curriculum_data = {}

    for stage_name, stage_config in stages.items():
        # Filter sequences by complexity
        filtered_sequences = filter_by_complexity(
            base_sequences,
            stage_config['min_complexity'],
            stage_config['max_complexity']
        )

        # Adjust sequence lengths
        adjusted_sequences = adjust_sequence_lengths(
            filtered_sequences,
            stage_config['max_length']
        )

        curriculum_data[stage_name] = adjusted_sequences

    return curriculum_data

# Define curriculum stages
stages = {
    'stage1': {'min_complexity': 0.0, 'max_complexity': 0.3, 'max_length': 128},
    'stage2': {'min_complexity': 0.2, 'max_complexity': 0.6, 'max_length': 256},
    'stage3': {'min_complexity': 0.5, 'max_complexity': 1.0, 'max_length': 512},
}

curriculum_data = create_curriculum_data(train_sequences, stages)
```
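`filter_by_complexity` and `adjust_sequence_lengths` are not defined in this guide. One hedged interpretation, scoring complexity with the per-sequence entropy from `compute_entropy` normalized to [0, 1], is sketched below:

```python
import numpy as np

def sequence_complexity(sequence):
    """Entropy-based complexity score in [0, 1] (one possible definition)."""
    num_unique = len(set(sequence))
    if num_unique < 2:
        return 0.0
    return compute_entropy(sequence) / np.log2(num_unique)

def filter_by_complexity(sequences, min_complexity, max_complexity):
    """Keep sequences whose complexity score lies in [min_complexity, max_complexity]."""
    return [
        seq for seq in sequences
        if min_complexity <= sequence_complexity(seq) <= max_complexity
    ]

def adjust_sequence_lengths(sequences, max_length):
    """Truncate sequences to at most `max_length` tokens."""
    return [seq[:max_length] for seq in sequences]
```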
## Best Practices
- Start simple: Begin with basic random sequences
- Add complexity gradually: Introduce patterns incrementally
- Validate data: Always check generated data quality
- Monitor diversity: Ensure sufficient sequence variation
- Use appropriate size: Match real data characteristics
- Document generation: Keep track of generation parameters
- Version control: Save generation configs and seeds (see the sketch below)
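To make the last two points concrete, one simple approach (a sketch, not a project convention) is to write the generation parameters, including the seed, next to the data they produced:

```python
import json
from pathlib import Path

def save_generation_config(config, output_dir):
    """Write the generation parameters (including the seed) alongside the dataset."""
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    with open(output_dir / 'generation_config.json', 'w') as f:
        json.dump(config, f, indent=2, sort_keys=True)

save_generation_config(config, 'data/synthetic')
```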
## Troubleshooting

### Common Issues

**Low diversity**

- Increase vocabulary size
- Reduce pattern probability
- Add noise to generation

**Poor patterns**

- Adjust pattern parameters
- Check pattern detection logic
- Validate pattern implementation

**Memory issues**

- Generate data in batches
- Use streaming data loading
- Reduce sequence length

**Training instability**

- Check data distribution
- Validate sequence lengths
- Ensure proper tokenization
## Next Steps
- Training Guide: Use synthetic data for training
- Configuration: Configure data parameters
- Examples: See synthetic data in action