Baseline Model Configuration
Total Parameters:
-
Embedding Parameters (Llama 3 vocab):
-
Non-Embedding Parameters:
-
Training FLOPs (DeepSeek formula):
-
Optimal Learning Rate:
-
Optimal Batch Size:
-
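One possible reading of "DeepSeek formula" above is the compute definition C = M · D from the DeepSeek LLM scaling-law work, where M is the non-embedding FLOPs per token and D is the number of training tokens. The sketch below assumes that reading. The function names and the example layer count, hidden size, and sequence length are hypothetical, and this page may compute the value differently.

```python
def flops_per_token(n_layer: int, d_model: int, seq_len: int) -> int:
    """Non-embedding FLOPs per token, M, under the assumed DeepSeek-style definition.

    M = 72 * n_layer * d_model^2 + 12 * n_layer * d_model * seq_len
    The second term accounts for attention over the sequence.
    """
    return 72 * n_layer * d_model**2 + 12 * n_layer * d_model * seq_len


def training_flops(n_layer: int, d_model: int, seq_len: int, tokens: int) -> int:
    """Total training compute C = M * D for D training tokens."""
    return flops_per_token(n_layer, d_model, seq_len) * tokens


if __name__ == "__main__":
    # Hypothetical configuration, for illustration only.
    c = training_flops(n_layer=30, d_model=4096, seq_len=4096, tokens=2_000_000_000_000)
    print(f"Training FLOPs: {c:.3e}")
```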
Alternative Model Configurations
Compute Budget
⚠️ Note: The parameter counts suggested below cover non-embedding parameters only.
Add the embedding parameters yourself, based on your model's hidden dimension and on whether the input and output embeddings are tied.
Typical values for tied embeddings with the Llama 3 vocabulary: ~131M (hidden size 1024), ~525M (hidden size 4096), ~1.05B (hidden size 8192).
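The typical values in the note can be reproduced with a short calculation. This is a minimal sketch, assuming the Llama 3 vocabulary size of 128,256 tokens and a single vocab-by-hidden embedding matrix when embeddings are tied (two matrices when untied). The function names are hypothetical and are not part of this calculator.

```python
LLAMA3_VOCAB_SIZE = 128_256  # assumed Llama 3 tokenizer vocabulary size


def embedding_params(hidden_size: int,
                     vocab_size: int = LLAMA3_VOCAB_SIZE,
                     tied: bool = True) -> int:
    """Embedding parameter count: one vocab x hidden matrix if tied, two if untied."""
    matrices = 1 if tied else 2
    return matrices * vocab_size * hidden_size


def total_params(non_embedding_params: int, hidden_size: int, tied: bool = True) -> int:
    """Add embedding parameters to a non-embedding parameter count."""
    return non_embedding_params + embedding_params(hidden_size, tied=tied)


if __name__ == "__main__":
    for hidden in (1024, 4096, 8192):
        print(f"hidden={hidden}: {embedding_params(hidden) / 1e6:.0f}M tied embedding params")
```

With untied embeddings the counts double, so the gap between non-embedding and total parameters matters most for small models with large vocabularies.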
Optimal Learning Rate:
-
Optimal Batch Size:
-
Compute-Optimal Configurations:
Custom Token/Parameter Ratios