
Using alpha to monitor the training process of the model #307

Open
cobraheleah opened this issue Mar 7, 2024 · 5 comments

Comments

@cobraheleah

I use the alpha value to monitor the training of the model, but I found that as training progresses, the alpha value drops rapidly in the early stage and then slowly increases in the middle and later stages. The specific data are as follows:

| training step | alpha value |
|---------------|-------------|
| 1k            | 28.94       |
| 2k            | 15          |
| 3k            | 7.1         |
| 5k            | 4.4         |
| 6k            | 4.2         |
| 100k          | 4.7         |
| 200k          | 4.9         |
| 300k          | 5.0         |
| 400k          | 5.1         |
| 500k          | 5.2         |

However, throughout the entire training process, the model's evaluation results (e.g. MMLU, BBH) kept improving.
Is it normal for the alpha value to slowly increase in the middle and later stages of training?
It seems that the alpha value is not strictly correlated with the model's evaluation results.
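
A minimal sketch of this kind of monitoring loop, assuming the alphas are computed with weightwatcher on checkpoints saved in Hugging Face format (the paths and step list are placeholders):

```python
# Hypothetical monitoring loop: mean layer alpha per saved checkpoint.
import weightwatcher as ww
from transformers import AutoModelForCausalLM

steps = [1_000, 2_000, 3_000, 5_000, 6_000, 100_000]   # placeholder checkpoint steps
for step in steps:
    model = AutoModelForCausalLM.from_pretrained(f"ckpts/step_{step}")  # placeholder path
    watcher = ww.WeightWatcher(model=model)
    details = watcher.analyze()             # per-layer DataFrame with an 'alpha' column
    print(step, details["alpha"].mean())    # mean layer alpha at this training step
```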

@charlesmartin14
Member

charlesmartin14 commented Mar 7, 2024 via email

@cobraheleah
Author

Thank you very much for your response. Here is the basic setup of the model training:
The model is an LLM with roughly 60B parameters, and its architecture is similar to Llama 2. The data come from public datasets. I have three questions:

  1. During training, the alpha values of certain layers changed significantly. Further analysis showed that the alpha of the V weight matrix in the 6th layer's self-attention fluctuated strongly, as shown below (see also the tracking sketch after this list):

     | training step | alpha of layer-6 self-attention V |
     |---------------|-----------------------------------|
     | 502k          | 12.85                             |
     | 503k          | 5.2                               |
     | 504k          | 6.4                               |
     | 505k          | 6.1                               |
     | 506k          | 10.5                              |
     | 507k          | 6.8                               |

     Is this kind of fluctuation normal? From the downstream evaluation results, no anomalies were found at the checkpoints with large alpha values.
  2. If alpha values only reflect the convergence of individual layers, what metric should be used to measure the training of the entire model? Or how can a more robust average be computed?
  3. I saw on https://weightwatcher.ai/ that an alpha value between 2 and 6 is considered reasonable for a layer. Can you please clarify whether this specific range is theoretically derived or experimentally obtained?
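
For reference, a minimal sketch of the per-layer tracking behind the table in question 1, assuming weightwatcher (the layer id and checkpoint paths are placeholders; `watcher.describe()` lists the real layer ids):

```python
# Hypothetical tracking of one layer's alpha across checkpoints.
import weightwatcher as ww
from transformers import AutoModelForCausalLM

for step in [502_000, 503_000, 504_000]:    # placeholder checkpoint steps
    model = AutoModelForCausalLM.from_pretrained(f"ckpts/step_{step}")  # placeholder path
    details = ww.WeightWatcher(model=model).analyze()
    row = details[details["layer_id"] == 42]           # hypothetical layer id of layer-6 V
    print(step, float(row["alpha"].iloc[0]))
```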

@charlesmartin14
Member

charlesmartin14 commented Mar 7, 2024

1) I suspect Llama is too big for your data set.
In fact, we think that Llama itself is not well sized for its data...see this comparison with Falcon
https://weightwatcher.ai/leaderboard.html

[Screenshot: weightwatcher leaderboard comparison of Falcon and Llama]

Both Falcon and Mistral show much better quality scores than Llama.

It appears that while some of the Llama layers are learning information, others are just becoming more random.
There are two things you can do:
There are 2 things you can do

1a) check for Dragon Kings
https://calculatedcontent.com/2024/01/29/evaluating-llms-with-weightwatcher-part-iii-the-magic-of-mistral-a-story-of-dragon-kings/

[Figure: Dragon King example from the blog post above]

This takes a little more compute, but it can sometimes reveal alphas that are unusually large.
1b) See also the ShortGPT study on pruning models:
https://twitter.com/_akhaliq/status/1765607379264024693

1c) Don't include any layer with alpha > 8 in your average.
That is:
2) Use robust statistics: compute the median alpha and/or throw away outliers.
2a) I can add a method to weightwatcher to do this; a sketch of the idea follows.
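
A minimal sketch of that robust average, assuming the per-layer details DataFrame returned by weightwatcher's analyze() (the cutoff of 8 is the one suggested in 1c):

```python
import pandas as pd

def robust_alpha(details: pd.DataFrame, cutoff: float = 8.0) -> float:
    """Median layer alpha, ignoring layers whose alpha exceeds the cutoff."""
    alphas = details["alpha"].dropna()
    return float(alphas[alphas <= cutoff].median())
```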

2b) Also, if you are training models from scratch, the spectral density of the data Jacobian / gradients is also useful to observe.
But this is very expensive to compute, and usually cost prohibitive.

3) The alpha range comes from the JMLR paper on HTSR theory:
https://jmlr.org/papers/v22/20-410.html

The theory holds in an infinite-size limit, so in principle the values should not depend on the size and shape of W.
But in practice they do depend on these a little, so I usually say 6 is a good upper limit in general.

4) If you are training Llama from scratch, you might want to try the GaLore optimizer:

Memory-Efficient LLM Training by Gradient Low-Rank Projection
https://lnkd.in/g2P_HTPE

@cobraheleah
Author

Thanks for your reply. There are still a few remaining issues I would like to ask about.

  1. For the first question, I don't think it is due to insufficient data. As you suggested, I excluded every layer with alpha > 8 from the average, but the phenomenon remains: the alpha value still drops rapidly in the early stage and slowly increases in the middle and later stages. Moreover, other open-source models that published intermediate checkpoints, such as Baichuan2-7B, show the same pattern of alpha first decreasing and then increasing, and our smaller models (6B, 10B) exhibit it as well. Has anyone else used the alpha value to monitor the training process (rather than just comparing final models) and seen a similar phenomenon? If so, could you please share more information?

  2. In the https://weightwatcher.ai/leaderboard.html comparison of Llama to Falcon, Falcon is labeled well-sized and Llama widely-overparametrized, yet the alpha value of Falcon-40b-instruct is higher than that of Llama-65b. Does this mean we shouldn't just compare average alpha values, and that a better approach would be to compare the distribution of alpha values? By the way, how is a model's alpha value computed on the leaderboard? Is it an average, and are outliers removed?

@charlesmartin14
Member

charlesmartin14 commented Mar 12, 2024

> I don't think it is due to insufficient data. ... Other open-source models that published intermediate checkpoints, such as Baichuan2-7B, show the same pattern of alpha first decreasing and then increasing.

Like Llama, the Baichuan2-7B model is thought to have underfit / redundant layers.
See the recent ShortGPT paper:
https://arxiv.org/abs/2403.03853

Because these layers are not converging and/or are redundant, the model is not well sized, and it is possible that other layers are 'soaking up' the correlations, causing their alphas to be smaller than expected.


The HTSR theory was developed and presented as a late-stage theory: it basically argues that the layers of a NN become power-law (PL) near convergence.

https://jmlr.org/papers/v22/20-410.html
[Screenshot: excerpt from the JMLR paper]

"Depending on [blah blah blah], additional training can lead to a Heavy-Tailed model"

If you are going to apply weightwatcher early in training, you need to check a few things, because it is quite possible that fits early in training are simply spurious, since the layer is still far from "convergence" (or may never become heavy tailed).

You can fit any data set and get an alpha, so in addition to computing alpha, you have to check that:

  • the tail is large enough to get a reliable fit

  • the quality of fit (D) is good

  • the PL fit is stable

  • the ESD is unimodal, heavy-tailed, and sufficiently different from random

  • there are no Correlation Traps that can cause spuriously small alphas

  • there is no rank collapse which can cause spuriously small alphas

  • the layer alphas for the model correlate well with other metrics, such as the rand_distance, spectral norm, and distance from init
    (see this blog post: https://calculatedcontent.com/2021/10/17/fantastic-measures-of-generalization-that-actually-work-part-1/)

  • the eigenvectors of the tail have lower entropy than the bulk

[Screenshot: per-layer ESD fit diagnostics]
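
A minimal sketch of flagging layers that fail these checks; the column names ('alpha', 'D', 'rand_distance') follow weightwatcher's details output, but the thresholds here are illustrative assumptions, not official values:

```python
import weightwatcher as ww

def flag_suspect_layers(model):
    """Return details rows whose power-law fit looks unreliable."""
    watcher = ww.WeightWatcher(model=model)       # any supported PyTorch/Keras model
    details = watcher.analyze(randomize=True)     # randomize=True adds rand_distance
    suspect = (
        (details["D"] > 0.1)                      # poor quality of the PL fit
        | (details["rand_distance"] < 0.02)       # ESD barely differs from random
        | (details["alpha"] < 2)                  # possibly spuriously small alpha
        | (details["alpha"] > 8)                  # fit likely unreliable
    )
    return details[suspect][["layer_id", "alpha", "D", "rand_distance"]]
```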


> the alpha value of 40b-instruct Falcon is higher than 65b Llama

The quality of the alpha fit is frequently a more reliable metric than the value of alpha itself.
[Screenshot: alpha fit quality example]


> A better approach would be to compare the distribution of alpha values?
weightwatcher is a diagnostic tool for analyzing how models converge, layer by layer.
But the theory is only exact for single-layer models
(i.e. it works perfectly on the original double-descent problem, is well understood on small MLPs, etc.)

I developed the tool to study how an individual layer converges, the correlation flow, how layers inter-correlate with each other, etc., but we don't fully understand how all these interactions affect convergence.
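
As a starting point, here is a sketch of such a distribution-level comparison of layer alphas for two models (the model handles are placeholders):

```python
import matplotlib.pyplot as plt
import weightwatcher as ww

def plot_alpha_distributions(model_a, model_b):
    """Overlay the per-layer alpha histograms of two models."""
    alphas_a = ww.WeightWatcher(model=model_a).analyze()["alpha"].dropna()
    alphas_b = ww.WeightWatcher(model=model_b).analyze()["alpha"].dropna()
    plt.hist(alphas_a, bins=30, alpha=0.5, label="model A")
    plt.hist(alphas_b, bins=30, alpha=0.5, label="model B")
    plt.axvspan(2, 6, color="green", alpha=0.1, label="2 <= alpha <= 6")
    plt.xlabel("layer alpha")
    plt.legend()
    plt.show()
```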

I'm happy to collaborate on this.
