Compute-Optimal Model Scaling

Precise FLOP and hyperparameter calculations using the DeepSeek LLM scaling-law methodology

Baseline Model Configuration
Inputs:
  Total parameter count, including embeddings (use B for billions, T for trillions)
  Hidden size (d_model); common values: 1024 (small), 2048 (1B), 4096 (7B), 8192 (70B)
  Embedding tying; many modern models tie the input and output embeddings

Computed outputs:
  Total Parameters
  Embedding Parameters (Llama 3 vocab)
  Non-Embedding Parameters
  Training FLOPs (DeepSeek formula)
  Optimal Learning Rate
  Optimal Batch Size
Alternative Model Configurations
Enter total parameter counts (including embeddings)

DeepSeek Precision Formulas

Model Architecture (Llama 3 tokenizer):
  Vocab size = 128,256 tokens
  Tied embeddings: params = vocab_size × hidden_dim
  Separate embeddings: params = vocab_size × hidden_dim × 2
  Non-embedding params = total_params - embedding_params
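
A minimal Python sketch of this parameter split, assuming the Llama 3 vocabulary size; the function names are illustrative, not part of the calculator:

  VOCAB_SIZE = 128_256  # Llama 3 tokenizer

  def embedding_params(hidden_dim: int, tied: bool = True) -> int:
      # Tied embeddings share one vocab_size x hidden_dim matrix between the
      # input embedding and the output (LM head) projection; separate
      # embeddings keep two such matrices.
      return VOCAB_SIZE * hidden_dim * (1 if tied else 2)

  def non_embedding_params(total_params: int, hidden_dim: int, tied: bool = True) -> int:
      return total_params - embedding_params(hidden_dim, tied)

  # Example: a 7B-parameter model with d_model = 4096 and tied embeddings
  print(non_embedding_params(7_000_000_000, 4096))  # 6,474,663,424 (~6.47B)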

Training FLOPs (DeepSeek method):
  M = 2 × non_embedding_params (forward-pass FLOPs per token)
  C = M × D × 3 = 6 × non_embedding_params × tokens (backward pass ≈ 2× forward, hence the factor of 3)
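
A sketch of the corresponding compute estimate; the 1T-token figure in the example is illustrative:

  def training_flops(non_embedding_params: int, tokens: float) -> float:
      # M = forward-pass FLOPs per token; the backward pass is counted as
      # roughly twice the forward pass, giving C = 6 * N * D overall.
      m = 2 * non_embedding_params
      return m * tokens * 3

  # Example: ~6.47B non-embedding parameters trained on 1T tokens
  print(f"{training_flops(6_474_663_424, 1e12):.3e}")  # ~3.885e+22 FLOPs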

Hyperparameter Scaling:
  η_opt = 0.3118 × C^(-0.1250) (learning rate)
  B_opt = 0.2920 × C^(0.3271) (batch size in tokens)
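
Applying the two power laws directly in Python (C is the training FLOP budget; the numbers in the comments continue the example above):

  def optimal_lr(c_flops: float) -> float:
      return 0.3118 * c_flops ** -0.1250

  def optimal_batch_tokens(c_flops: float) -> float:
      return 0.2920 * c_flops ** 0.3271

  C = 3.885e22  # FLOP budget from the previous example
  print(f"learning rate ≈ {optimal_lr(C):.2e}")               # ~4.7e-04
  print(f"batch size ≈ {optimal_batch_tokens(C):.2e} tokens")  # ~7.2e+06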

Global Batch Sizes:
  Sequences per global batch = B_opt / context_length
  The calculator lists power-of-2 and highly divisible rounding options
  Node counts are shown for data-parallel scaling (8 GPUs per node)
  Micro-batch size = sequences_per_GPU / gradient_accumulation_steps
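
One possible way to turn the token-level batch size into a concrete layout; the rounding strategy (snapping down to a multiple of GPUs × accumulation steps) and the example sizes are assumptions, not the calculator's exact behavior:

  def batch_layout(batch_tokens: float, context_length: int,
                   dp_gpus: int, grad_accum_steps: int) -> dict:
      raw_sequences = batch_tokens / context_length
      # Round down to a multiple of dp_gpus * grad_accum_steps so the batch
      # splits evenly into integer micro-batches per GPU.
      divisor = dp_gpus * grad_accum_steps
      sequences = int(raw_sequences // divisor) * divisor
      return {
          "sequences": sequences,
          "nodes": dp_gpus // 8,  # 8 GPUs per node
          "micro_batch": sequences // divisor,
      }

  # Example: ~7.15M-token batch, 4096-token context, 64 data-parallel GPUs,
  # 4 gradient accumulation steps
  print(batch_layout(7.15e6, 4096, dp_gpus=64, grad_accum_steps=4))
  # {'sequences': 1536, 'nodes': 8, 'micro_batch': 6}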