Author: Kyo Takano
Date: 2023-03-29
I discuss the following points in this article.
Hoffmann et al. (2022) formulated the cross-entropy loss of a trained large language model as
$$ L=E+\frac{A}{N^{\alpha}}+\frac{B}{D^{\beta}} $$
where $N$ represents the total number of model parameters; $D$, the number of training tokens; and $E$, the estimated intrinsic entropy of the language data (i.e., the floor of the training loss). The parameters $A$, $B$, $\alpha$, and $\beta$ are fitted to observed data points.
Suppose we have a fixed compute budget of $C\approx6ND$ FLOPs. Minimizing $L$ under this constraint, we can combine the fitted parameters into $G=({\alpha A}/{\beta B})^{\frac{1}{\alpha+\beta}}$. With it, the optimal $N$ and $D$ are estimated as $N\approx G\times(C/6)^{\beta/(\alpha+\beta)}$ and $D\approx G^{-1}\times(C/6)^{\alpha/(\alpha+\beta)}$, respectively.
Therefore, the optimal ratio of tokens per parameter $D/N$ is:
$$ \frac{D}{N} = G^{-2}\times(C/6)^{\frac{\alpha-\beta}{\alpha+\beta}} = \left(\frac{\alpha A}{\beta B}\right)^{-\frac{2}{\alpha+\beta}}\times(C/6)^{\frac{\alpha-\beta}{\alpha+\beta}} $$
Cerebras says “20 tokens per parameter” is compute-optimal, implying that the ratio is independent of the compute budget $C$. However, the equation shows that the ratio follows a power law in $C$, scaling by the factor $C^\frac{\alpha-\beta}{\alpha+\beta}$. When $\alpha>\beta$ (Hoffmann et al., 2022), the ratio increases as the compute budget grows.
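To get a feel for how strongly the ratio depends on compute, here is a minimal sketch that evaluates only the exponent of $C$. It assumes the fitted values $\alpha=0.34$ and $\beta=0.28$ from Hoffmann et al. (2022), which are quoted later in this article; the function name `ratio_growth` is purely illustrative.

```python
def ratio_growth(compute_multiplier: float, alpha: float = 0.34, beta: float = 0.28) -> float:
    """Factor by which the optimal D/N changes when C is multiplied by `compute_multiplier`.

    Follows from D/N being proportional to C^((alpha - beta) / (alpha + beta)).
    The default alpha and beta are the fitted values assumed from Hoffmann et al. (2022).
    """
    exponent = (alpha - beta) / (alpha + beta)
    return compute_multiplier ** exponent


if __name__ == "__main__":
    for multiplier in (10, 100, 1000):
        growth = ratio_growth(multiplier)
        print(f"{multiplier:>5}x compute -> optimal D/N grows by {growth:.2f}x")
```

With these assumed exponents, each tenfold increase in compute raises the optimal tokens-per-parameter ratio by roughly a quarter. The dependence is mild but not flat, so a single fixed ratio can only be optimal near one particular budget.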

Figure 2. Cerebras-GPT vs. Pythia. Lower curves show greater compute efficiency for a given loss level (Cerebras, 2023)
As Figure 2 shows, Cerebras-GPT requires significantly more parameters to reach the same performance as Pythia and other models. This is probably because the largest Chinchilla model, which Cerebras referenced, was actually undertrained and not perfectly compute-optimal.
Hoffmann et al. (2022) estimated the fitted parameters $\alpha$, $\beta$, $A$, and $B$ as 0.34, 0.28, 406.4, and 410.7, respectively. However, if you plug these values into the equation for $D/N$ at Chinchilla's compute budget, you get about 92, which contradicts the actual value $1.4\text{T}/67\text{B}\approx20$. This gap is most plausibly due to the following factors:
embed_dim, num_heads, num_layers, etc.), and chose the one whose $N$ is larger than and closest to $N_{opt}$. This suggests that Chinchilla might actually be undertrained.
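The figure of 92 quoted above can be reproduced with a few lines of Python. This is a minimal sketch under the article's own assumptions: the fitted values of $\alpha$, $\beta$, $A$, and $B$ from Hoffmann et al. (2022), the approximation $C\approx6ND$, and a budget taken as $6\times67\text{B}\times1.4\text{T}$ FLOPs. The function name `compute_optimal` is hypothetical.

```python
def compute_optimal(C, alpha=0.34, beta=0.28, A=406.4, B=410.7):
    """Return (N_opt, D_opt) for a FLOPs budget C, assuming C ~= 6 * N * D.

    Defaults are the fitted values quoted in this article (an assumption of
    the sketch, not something re-derived here).
    """
    G = (alpha * A / (beta * B)) ** (1 / (alpha + beta))
    N_opt = G * (C / 6) ** (beta / (alpha + beta))    # model parameters
    D_opt = (C / 6) ** (alpha / (alpha + beta)) / G   # training tokens
    return N_opt, D_opt


if __name__ == "__main__":
    # Budget implied by the figures used in this article: 67B parameters, 1.4T tokens.
    C = 6 * 67e9 * 1.4e12
    N_opt, D_opt = compute_optimal(C)
    print(f"N_opt = {N_opt:.3g} parameters")
    print(f"D_opt = {D_opt:.3g} tokens")
    print(f"tokens per parameter (formula): {D_opt / N_opt:.1f}")
    print(f"tokens per parameter (actual):  {1.4e12 / 67e9:.1f}")
```

Under these assumptions, the formula puts the optimum at a smaller model trained on more tokens than the configuration actually used. That is why its ratio comes out several times larger than the actual one.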