
Using alpha to monitor the training process of the model #307

Open
cobraheleah opened this issue Mar 7, 2024 · 5 comments

Comments

@cobraheleah

I use the alpha value to monitor the training of the model, but I found that as training progresses, the alpha value drops rapidly in the early stage and then slowly increases in the middle and later stages. The specific data are as follows:

| training step | alpha value |
|---------------|-------------|
| 1k            | 28.94       |
| 2k            | 15          |
| 3k            | 7.1         |
| 5k            | 4.4         |
| 6k            | 4.2         |
| 100k          | 4.7         |
| 200k          | 4.9         |
| 300k          | 5.0         |
| 400k          | 5.1         |
| 500k          | 5.2         |

However, throughout the entire training process, the model's evaluation results (e.g. MMLU, BBH) kept improving.
Is it normal for the alpha value to slowly increase in the middle and later stages of training?
It seems that the alpha value is not strictly correlated with the model's evaluation results.
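
A minimal sketch of this kind of monitoring loop, assuming the alphas are computed with weightwatcher on checkpoints saved in Hugging Face format (the paths and step list are placeholders):

```python
# Hypothetical monitoring loop: mean layer alpha per saved checkpoint.
import weightwatcher as ww
from transformers import AutoModelForCausalLM

steps = [1_000, 2_000, 3_000, 5_000, 6_000, 100_000]   # placeholder checkpoint steps
for step in steps:
    model = AutoModelForCausalLM.from_pretrained(f"ckpts/step_{step}")  # placeholder path
    watcher = ww.WeightWatcher(model=model)
    details = watcher.analyze()             # per-layer DataFrame with an 'alpha' column
    print(step, details["alpha"].mean())    # mean layer alpha at this training step
```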

@charlesmartin14
Member

charlesmartin14 commented Mar 7, 2024 via email

@cobraheleah
Author

Thank you very much for your response. Here is the basic setup of the model training:
The model is an LLM with roughly 60B parameters, and its architecture is similar to Llama 2. The data come from public datasets. I have three questions:

  1. During training, the alpha values of certain layers changed significantly. Further analysis showed that the alpha of the V weight matrix in the 6th layer's self-attention fluctuated strongly, as shown below (see also the tracking sketch after this list):

     | training step | alpha of layer-6 self-attention V |
     |---------------|-----------------------------------|
     | 502k          | 12.85                             |
     | 503k          | 5.2                               |
     | 504k          | 6.4                               |
     | 505k          | 6.1                               |
     | 506k          | 10.5                              |
     | 507k          | 6.8                               |

     Is this kind of fluctuation normal? From the downstream evaluation results, no anomalies were found at the checkpoints with large alpha values.
  2. If alpha values only reflect the convergence of individual layers, what metric should be used to measure the training of the entire model? Or how can a more robust average be computed?
  3. I saw on https://weightwatcher.ai/ that an alpha value between 2 and 6 is considered reasonable for a layer. Can you please clarify whether this specific range is theoretically derived or experimentally obtained?
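
For reference, a minimal sketch of the per-layer tracking behind the table in question 1, assuming weightwatcher (the layer id and checkpoint paths are placeholders; `watcher.describe()` lists the real layer ids):

```python
# Hypothetical tracking of one layer's alpha across checkpoints.
import weightwatcher as ww
from transformers import AutoModelForCausalLM

for step in [502_000, 503_000, 504_000]:    # placeholder checkpoint steps
    model = AutoModelForCausalLM.from_pretrained(f"ckpts/step_{step}")  # placeholder path
    details = ww.WeightWatcher(model=model).analyze()
    row = details[details["layer_id"] == 42]           # hypothetical layer id of layer-6 V
    print(step, float(row["alpha"].iloc[0]))
```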

@charlesmartin14
Member

charlesmartin14 commented Mar 7, 2024

1) I suspect Llama is too big for your data set.
In fact, we think that Llama itself is not well sized for its data...see this comparison with Falcon
https://weightwatcher.ai/leaderboard.html

[Screenshot: weightwatcher leaderboard comparison of Falcon and Llama]

Both Falcon and Mistral show much better quality scores than Llama.

It appears that while some of the Llama layers are learning information, others are just becoming more random.
There are two things you can do:
There are 2 things you can do

1a) check for Dragon Kings
https://calculatedcontent.com/2024/01/29/evaluating-llms-with-weightwatcher-part-iii-the-magic-of-mistral-a-story-of-dragon-kings/

[Figure: Dragon King example from the blog post above]

This takes a little more compute, but it can sometimes reveal alphas that are unusually large.
1b) See also the ShortGPT study on pruning models:
https://twitter.com/_akhaliq/status/1765607379264024693

1c) Don't include any layer with alpha > 8 in your average.
That is:
2) Use robust statistics: compute the median alpha and/or throw away outliers.
2a) I can add a method to weightwatcher to do this; a sketch of the idea follows.
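
A minimal sketch of that robust average, assuming the per-layer details DataFrame returned by weightwatcher's analyze() (the cutoff of 8 is the one suggested in 1c):

```python
import pandas as pd

def robust_alpha(details: pd.DataFrame, cutoff: float = 8.0) -> float:
    """Median layer alpha, ignoring layers whose alpha exceeds the cutoff."""
    alphas = details["alpha"].dropna()
    return float(alphas[alphas <= cutoff].median())
```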

2b) Also, if you are training models from scratch, the spectral density of the data Jacobian / gradients is also useful to observe.
But this is very expensive to compute, and usually cost prohibitive.

3) The alpha range comes from the JMLR paper on HTSR theory:
https://jmlr.org/papers/v22/20-410.html

The theory holds in an infinite-size limit, so in principle the values should not depend on the size and shape of W.
But in practice they do depend on these a little, so I usually say 6 is a good upper limit in general.

4) If you are training Llama from scratch, you might want to try the GaLore optimizer:

Memory-Efficient LLM Training by Gradient Low-Rank Projection
https://lnkd.in/g2P_HTPE

@cobraheleah
Author

Thanks for your reply. There are still a few remaining issues I would like to ask about.

  1. For the first question, I don't think it is due to insufficient data. As you suggested, I excluded every layer with alpha > 8 from the average, but the phenomenon remains: the alpha value still drops rapidly in the early stage and slowly increases in the middle and later stages. Moreover, other open-source models that published intermediate checkpoints, such as Baichuan2-7B, show the same pattern of alpha first decreasing and then increasing, and our smaller models (6B, 10B) exhibit it as well. Has anyone else used the alpha value to monitor the training process (rather than just comparing final models) and seen a similar phenomenon? If so, could you please share more information?

  2. In the https://weightwatcher.ai/leaderboard.html comparison of Llama to Falcon, Falcon is labeled well-sized and Llama widely-overparametrized, yet the alpha value of Falcon-40b-instruct is higher than that of Llama-65b. Does this mean we shouldn't just compare average alpha values, and that a better approach would be to compare the distribution of alpha values? By the way, how is a model's alpha value computed on the leaderboard? Is it an average, and are outliers removed?

@charlesmartin14
Member

charlesmartin14 commented Mar 12, 2024

> I don't think it is due to insufficient data. ... Other open-source models that published intermediate checkpoints, such as Baichuan2-7B, show the same pattern of alpha first decreasing and then increasing.

Like Llama, the Baichuan2-7B model is thought to have underfit / redundant layers.
See the recent ShortGPT paper:
https://arxiv.org/abs/2403.03853

Because these layers are not converging and/or are redundant, the model is not well sized, and it is possible that other layers are 'soaking up' the correlations, causing their alphas to be smaller than expected.


The HTSR theory was developed and presented as a late-stage theory: it basically argues that the layers of a NN become power-law (PL) near convergence.

https://jmlr.org/papers/v22/20-410.html
[Screenshot: excerpt from the JMLR paper]

"Depending on [blah blah blah], additional training can lead to a Heavy-Tailed model"

If you are going to apply weightwatcher early in training, you need to check a few things, because it is quite possible that fits early in training are simply spurious, since the layer is still far from "convergence" (or may never become heavy tailed).

You can fit any data set and get an alpha, so in addition to computing alpha, you have to check that:

  • the tail is large enough to get a reliable fit

  • the quality of fit (D) is good

  • the PL fit is stable

  • the ESD is unimodal, heavy-tailed, and sufficiently different from random

  • there are no Correlation Traps that can cause spuriously small alphas

  • there is no rank collapse which can cause spuriously small alphas

  • the layer alphas for the model correlate well with other metrics, such as the rand_distance, spectral norm, and distance from init
    (see this blog post: https://calculatedcontent.com/2021/10/17/fantastic-measures-of-generalization-that-actually-work-part-1/)

  • the eigenvectors of the tail have lower entropy than the bulk

[Screenshot: per-layer ESD fit diagnostics]
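
A minimal sketch of flagging layers that fail these checks; the column names ('alpha', 'D', 'rand_distance') follow weightwatcher's details output, but the thresholds here are illustrative assumptions, not official values:

```python
import weightwatcher as ww

def flag_suspect_layers(model):
    """Return details rows whose power-law fit looks unreliable."""
    watcher = ww.WeightWatcher(model=model)       # any supported PyTorch/Keras model
    details = watcher.analyze(randomize=True)     # randomize=True adds rand_distance
    suspect = (
        (details["D"] > 0.1)                      # poor quality of the PL fit
        | (details["rand_distance"] < 0.02)       # ESD barely differs from random
        | (details["alpha"] < 2)                  # possibly spuriously small alpha
        | (details["alpha"] > 8)                  # fit likely unreliable
    )
    return details[suspect][["layer_id", "alpha", "D", "rand_distance"]]
```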


> the alpha value of 40b-instruct Falcon is higher than 65b Llama

The quality of the alpha fit is frequently a more reliable metric than the value of alpha itself.
[Screenshot: alpha fit quality example]


> A better approach would be to compare the distribution of alpha values?
weightwatcher is a diagnostic tool for analyzing how models converge, layer by layer.
But the theory is only exact for single-layer models
(i.e. it works perfectly on the original double-descent problem, is well understood on small MLPs, etc.)

I developed the tool to study how an individual layer converges, the correlation flow, how layers inter-correlate with each other, etc., but we don't fully understand how all these interactions affect convergence.
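
As a starting point, here is a sketch of such a distribution-level comparison of layer alphas for two models (the model handles are placeholders):

```python
import matplotlib.pyplot as plt
import weightwatcher as ww

def plot_alpha_distributions(model_a, model_b):
    """Overlay the per-layer alpha histograms of two models."""
    alphas_a = ww.WeightWatcher(model=model_a).analyze()["alpha"].dropna()
    alphas_b = ww.WeightWatcher(model=model_b).analyze()["alpha"].dropna()
    plt.hist(alphas_a, bins=30, alpha=0.5, label="model A")
    plt.hist(alphas_b, bins=30, alpha=0.5, label="model B")
    plt.axvspan(2, 6, color="green", alpha=0.1, label="2 <= alpha <= 6")
    plt.xlabel("layer alpha")
    plt.legend()
    plt.show()
```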

I'm happy to collaborate on this.
