Compute-Optimal Model Scaling

Precise FLOP and hyperparameter calculations using the DeepSeek LLM scaling-law methodology

Baseline Model Configuration
Inputs:
  Total parameter count, including embeddings (use B for billions, T for trillions)
  Hidden size (d_model); common values: 1024 (small), 2048 (1B), 4096 (7B), 8192 (70B)
  Embedding tying; many modern models tie the input and output embeddings

Computed outputs:
  Total Parameters
  Embedding Parameters (Llama 3 vocab)
  Non-Embedding Parameters
  Training FLOPs (DeepSeek formula)
  Optimal Learning Rate
  Optimal Batch Size
Alternative Model Configurations
Enter total parameter counts (including embeddings)

DeepSeek Precision Formulas

Model Architecture (Llama 3 tokenizer):
  Vocab size = 128,256 tokens
  Tied embeddings: params = vocab_size × hidden_dim
  Separate embeddings: params = vocab_size × hidden_dim × 2
  Non-embedding params = total_params - embedding_params
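
A minimal Python sketch of this parameter split, assuming the Llama 3 vocabulary size; the function names are illustrative, not part of the calculator:

  VOCAB_SIZE = 128_256  # Llama 3 tokenizer

  def embedding_params(hidden_dim: int, tied: bool = True) -> int:
      # Tied embeddings share one vocab_size x hidden_dim matrix between the
      # input embedding and the output (LM head) projection; separate
      # embeddings keep two such matrices.
      return VOCAB_SIZE * hidden_dim * (1 if tied else 2)

  def non_embedding_params(total_params: int, hidden_dim: int, tied: bool = True) -> int:
      return total_params - embedding_params(hidden_dim, tied)

  # Example: a 7B-parameter model with d_model = 4096 and tied embeddings
  print(non_embedding_params(7_000_000_000, 4096))  # 6,474,663,424 (~6.47B)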

Training FLOPs (DeepSeek method):
  M = 2 × non_embedding_params (forward-pass FLOPs per token)
  C = M × D × 3 = 6 × non_embedding_params × tokens (backward pass ≈ 2× forward, hence the factor of 3)
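
A sketch of the corresponding compute estimate; the 1T-token figure in the example is illustrative:

  def training_flops(non_embedding_params: int, tokens: float) -> float:
      # M = forward-pass FLOPs per token; the backward pass is counted as
      # roughly twice the forward pass, giving C = 6 * N * D overall.
      m = 2 * non_embedding_params
      return m * tokens * 3

  # Example: ~6.47B non-embedding parameters trained on 1T tokens
  print(f"{training_flops(6_474_663_424, 1e12):.3e}")  # ~3.885e+22 FLOPs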

Hyperparameter Scaling:
  η_opt = 0.3118 × C^(-0.1250) (learning rate)
  B_opt = 0.2920 × C^(0.3271) (batch size in tokens)
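
Applying the two power laws directly in Python (C is the training FLOP budget; the numbers in the comments continue the example above):

  def optimal_lr(c_flops: float) -> float:
      return 0.3118 * c_flops ** -0.1250

  def optimal_batch_tokens(c_flops: float) -> float:
      return 0.2920 * c_flops ** 0.3271

  C = 3.885e22  # FLOP budget from the previous example
  print(f"learning rate ≈ {optimal_lr(C):.2e}")               # ~4.7e-04
  print(f"batch size ≈ {optimal_batch_tokens(C):.2e} tokens")  # ~7.2e+06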

Global Batch Sizes:
  Sequences per global batch = B_opt / context_length
  The calculator lists power-of-2 and highly divisible rounding options
  Node counts are shown for data-parallel scaling (8 GPUs per node)
  Micro-batch size = sequences_per_GPU / gradient_accumulation_steps
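
One possible way to turn the token-level batch size into a concrete layout; the rounding strategy (snapping down to a multiple of GPUs × accumulation steps) and the example sizes are assumptions, not the calculator's exact behavior:

  def batch_layout(batch_tokens: float, context_length: int,
                   dp_gpus: int, grad_accum_steps: int) -> dict:
      raw_sequences = batch_tokens / context_length
      # Round down to a multiple of dp_gpus * grad_accum_steps so the batch
      # splits evenly into integer micro-batches per GPU.
      divisor = dp_gpus * grad_accum_steps
      sequences = int(raw_sequences // divisor) * divisor
      return {
          "sequences": sequences,
          "nodes": dp_gpus // 8,  # 8 GPUs per node
          "micro_batch": sequences // divisor,
      }

  # Example: ~7.15M-token batch, 4096-token context, 64 data-parallel GPUs,
  # 4 gradient accumulation steps
  print(batch_layout(7.15e6, 4096, dp_gpus=64, grad_accum_steps=4))
  # {'sequences': 1536, 'nodes': 8, 'micro_batch': 6}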