Author: Kyo Takano
Date: 2023-03-29
In this article, I examine whether the popular “20 tokens per parameter” rule is truly compute-optimal, starting from the scaling law of Hoffmann et al. (2022).
Hoffmann et al. (2022) formulated the cross-entropy loss of a trained large language model as
$$ L=E+\frac{A}{N^{\alpha}}+\frac{B}{D^{\beta}} $$
where $N$ represents the number of model parameters; $D$, the number of training tokens; and $E$, the estimated entropy intrinsic to the language data (i.e., the floor of the training loss). The parameters $A$, $B$, $\alpha$, and $\beta$ are fitted to observed data points.
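For concreteness, here is this loss formula as a minimal Python sketch. The constants are the fitted values from Hoffmann et al. (2022) quoted later in this article, plus their fitted entropy term $E=1.69$:

```python
# The parametric loss L(N, D) above, with Hoffmann et al.'s fitted values
# (alpha, beta, A, B are quoted below; E = 1.69 is their fitted entropy).
def predicted_loss(N, D, E=1.69, A=406.4, B=410.7, alpha=0.34, beta=0.28):
    """Predicted cross-entropy for N parameters trained on D tokens."""
    return E + A / N**alpha + B / D**beta

print(predicted_loss(N=70e9, D=1.4e12))  # Chinchilla-scale run: ≈ 1.94
```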
Suppose we have a fixed FLOPs compute budget $C\approx6ND$. Defining $G=({\alpha A}/{\beta B})^{\frac{1}{\alpha+\beta}}$ from the fitted parameters, we can estimate the optimal $N$ and $D$ as $N\approx G\times(C/6)^{\beta/(\alpha+\beta)}$ and $D\approx G^{-1}\times(C/6)^{\alpha/(\alpha+\beta)}$, respectively.
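The allocation rule above is easy to sketch in code (the derivation is Hoffmann et al.'s; the function below is my own minimal rendering, with their fitted values as defaults):

```python
# Split a FLOPs budget C ≈ 6ND into a compute-optimal N (parameters)
# and D (tokens), per the formulas above. Default constants are the
# fitted values from Hoffmann et al. (2022).
def optimal_allocation(C, alpha=0.34, beta=0.28, A=406.4, B=410.7):
    G = (alpha * A / (beta * B)) ** (1 / (alpha + beta))
    N = G * (C / 6) ** (beta / (alpha + beta))
    D = (1 / G) * (C / 6) ** (alpha / (alpha + beta))
    return N, D

N, D = optimal_allocation(C=5.76e23)  # ≈ Gopher's training budget
print(f"N ≈ {N:.2e} params, D ≈ {D:.2e} tokens")  # ≈ 3.2e10, ≈ 3.0e12
```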
Therefore, the optimal ratio of tokens per parameter $D/N$ is:
$$ \frac{D}{N} = G^{-2}\times(C/6)^{\frac{\alpha-\beta}{\alpha+\beta}} = \left(\frac{\alpha A}{\beta B}\right)^{-\frac{2}{\alpha+\beta}}\times(C/6)^{\frac{\alpha-\beta}{\alpha+\beta}} $$
Cerebras says “20 tokens per parameter” is compute-optimal, implying that the ratio is indifferent to the compute budget $C$. However, the equation above shows that the optimal ratio is not constant: it scales as a power of the budget, $C^{\frac{\alpha-\beta}{\alpha+\beta}}$. Since $\alpha>\beta$ (Hoffmann et al., 2022), the optimal tokens-per-parameter ratio increases as the compute budget grows.
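A quick numerical check makes this concrete, using the fitted values quoted further below:

```python
# Optimal tokens-per-parameter ratio D/N at increasing FLOPs budgets,
# using Hoffmann et al.'s fitted constants. The ratio grows with C
# instead of staying fixed at 20.
A, B, ALPHA, BETA = 406.4, 410.7, 0.34, 0.28
G = (ALPHA * A / (BETA * B)) ** (1 / (ALPHA + BETA))

for C in (1e20, 1e22, 1e24):
    ratio = G**-2 * (C / 6) ** ((ALPHA - BETA) / (ALPHA + BETA))
    print(f"C = {C:.0e} FLOPs: optimal D/N ≈ {ratio:.0f}")  # ≈ 40, 63, 98
```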
Figure 2. Cerebras-GPT vs. Pythia. Lower curves show greater compute efficiency for a given loss level (Cerebras, 2023)
As Figure 2 shows, Cerebras-GPT requires significantly more parameters than Pythia and other models to reach the same performance. This is probably because the largest Chinchilla model, which Cerebras took as its reference, was itself undertrained and therefore not perfectly compute-optimal.
Hoffmann et al. (2022) estimated the parameters $\alpha$, $\beta$, $A$, and $B$ as 0.34, 0.28, 406.4, and 410.7, respectively. However, if you plug these values into the ratio $D/N$ above at Chinchilla's compute budget, you get roughly 92, which contradicts the actual value of $1.4\text{T}/70\text{B}=20$.
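You can verify this number in a couple of lines:

```python
# Plug the fitted values into the D/N formula above at Chinchilla's
# approximate budget C = 6 * 70e9 * 1.4e12 ≈ 5.9e23 FLOPs.
A, B, ALPHA, BETA = 406.4, 410.7, 0.34, 0.28
C = 6 * 70e9 * 1.4e12

G = (ALPHA * A / (BETA * B)) ** (1 / (ALPHA + BETA))
print(G**-2 * (C / 6) ** ((ALPHA - BETA) / (ALPHA + BETA)))  # ≈ 92.8, not 20
```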
This gap is most plausibly explained by how the model configuration was chosen: Hoffmann et al. enumerated feasible configurations (combinations of embed_dim, num_heads, num_layers, etc.) and chose the one whose $N$ is larger than and closest to $N_{opt}$. This suggests that Chinchilla might actually be undertrained.
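To illustrate the selection rule (a toy sketch under my reading of the paper, with a hypothetical size grid, not DeepMind's actual sweep): at a fixed budget, choosing an $N$ above $N_{opt}$ forces $D=C/6N$ below $D_{opt}$, which deflates the realized tokens-per-parameter ratio.

```python
# Toy sketch: from a hypothetical grid of feasible model sizes, pick the
# smallest N that still exceeds N_opt, then see what D/N that implies.
A, B, ALPHA, BETA = 406.4, 410.7, 0.34, 0.28
C = 5.76e23  # ≈ Gopher's training budget

G = (ALPHA * A / (BETA * B)) ** (1 / (ALPHA + BETA))
N_opt = G * (C / 6) ** (BETA / (ALPHA + BETA))  # ≈ 3.2e10 parameters

candidates = [1e9, 7e9, 30e9, 70e9, 175e9]  # hypothetical size grid
N = min(n for n in candidates if n > N_opt)  # -> 70e9
D = C / (6 * N)
print(f"N = {N:.0e}, D = {D:.2e}, D/N ≈ {D / N:.0f}")  # D/N ≈ 20, not ~92
```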