Using alpha to monitor the training process of the model #307
Comments
The short answer is: the individual layer alphas give some estimate of how well each layer has converged, but some layers may be converging faster or slower than others (or even backtracking), causing the average alpha to go up.
So one has to take the average in a robust way, and the tool right now does something very simple.
If I can see more of your data and understand your model better, I can address this.
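In case a concrete example helps, here is a minimal sketch of pulling the per-layer alphas out of weightwatcher to see which layers are converging faster or slower (assuming the usual analyze() API; column names such as 'layer_id' and 'alpha' may differ slightly across versions):

```python
import weightwatcher as ww

# `model` is assumed to be the (partially) trained model already in memory,
# e.g. a PyTorch or Keras model.
watcher = ww.WeightWatcher(model=model)
details = watcher.analyze()   # pandas DataFrame, one row per analyzed layer

# Per-layer alphas: layers converging slower (or backtracking) show larger alphas.
print(details[["layer_id", "alpha"]].sort_values("alpha"))

# The plain mean is the simple summary the tool reports; the median is a
# more robust alternative, less sensitive to a few outlier layers.
print("mean alpha:  ", details["alpha"].mean())
print("median alpha:", details["alpha"].median())
```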
Thank you very much for your response. Here is the basic situation of the model training:
1) I suspect Llama is too big for your data set. Both Falcon and Mistral show much better quality scores than Llama; it appears that as some of the Llama layers are learning information, others are just becoming more random.
1a) Check for Dragon Kings. This takes a little more compute, but it can sometimes catch alphas that are unusually large.
1c) Don't include any layer with alpha > 8 in your average (see the sketch after this list).
2b) Also, if you are training models from scratch, then the spectral density of the data Jacobian / gradients is also useful to observe.
3) The alpha range comes from the JMLR theory paper. The theory is derived in an infinite-size limit, so the values should not depend on the size and shape of W.
4) If you are training Llama from scratch, you might want to try the GaLore optimizer: Memory-Efficient LLM Training by Gradient Low-Rank Projection.
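For point 1c, a rough sketch of what that filtering could look like on top of the details frame (this is my own filtering, not a built-in option; the 8.0 cutoff is just the rule of thumb above):

```python
import weightwatcher as ww

watcher = ww.WeightWatcher(model=model)
details = watcher.analyze()

# Drop layers whose alpha is implausibly large (> 8); those fits are usually
# spurious, or the layer simply is not heavy-tailed yet.
good = details[details["alpha"] <= 8.0]

robust_avg = good["alpha"].mean()
print(f"kept {len(good)} of {len(details)} layers, robust avg alpha = {robust_avg:.2f}")
```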
Thanks for your reply. There are still some remaining issues I would like to seek advice on.
Like Llama, the Baichuan2-7B model is thought to have underfit / redundant layers. Because these layers are not converging and/or are redundant, the model is not well sized, and it is possible that other layers are 'soaking up' the correlations, causing their layer alphas to be smaller than expected.

The HTSR theory was developed and presented as a late-stage theory, where it was basically argued that the layers in the NN become power-law (PL) near convergence: https://jmlr.org/papers/v22/20-410.html. "Depending on [blah blah blah], additional training can lead to a Heavy-Tailed model." If you are going to apply weightwatcher early in training, you need to check a few things, because it is quite possible that the fits early in training are simply spurious, since the layer is so far from "convergence" (or may just never become heavy tailed). You can fit any data set and get an alpha, so in addition to computing alpha, you have to check that the power-law fit itself is actually a good fit.

I developed the tool to study how individual layers converge, the correlation flow, how layers inter-correlate with each other, etc., but we don't fully understand how all these interactions affect convergence. I'm happy to collaborate on this.
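To make the "check that the fit is real" point concrete, here is one way the fit quality can be screened alongside alpha (assuming the details frame exposes the KS distance of the power-law fit as a 'D' column; the 0.10 and 2-8 thresholds are only illustrative):

```python
import weightwatcher as ww

watcher = ww.WeightWatcher(model=model)
details = watcher.analyze()

# Early in training many layers are not heavy-tailed yet, so a power-law fit
# can be spurious.  Keep a layer's alpha only if the fit quality is acceptable
# (small KS distance D) and alpha falls in a plausible range.
trusted = details[(details["D"] < 0.10) & (details["alpha"].between(2.0, 8.0))]

print(f"{len(trusted)} of {len(details)} layers have trustworthy fits")
print("avg alpha over trusted layers:", trusted["alpha"].mean())
```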
I use the alpha value to monitor the training of the model, but I found that as training progresses, the alpha value decreases rapidly in the early stage and then slowly increases in the middle and later stages. The specific data are as follows:
| training step | alpha value |
| --- | --- |
| 1k | 28.94 |
| 2k | 15 |
| 3k | 7.1 |
| 5k | 4.4 |
| 6k | 4.2 |
| 100k | 4.7 |
| 200k | 4.9 |
| 300k | 5.0 |
| 400k | 5.1 |
| 500k | 5.2 |
However, throughout the entire training process, the evaluation metrics of the model (e.g. MMLU, BBH) have kept improving.
Is it normal for the alpha value to slowly increase in the middle and later stages of training?
It seems that the alpha value is not strictly correlated with the model's evaluation performance.
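A minimal sketch of how average alpha can be logged per checkpoint in this way (load_checkpoint below is a hypothetical helper standing in for however the checkpoints are actually restored):

```python
import weightwatcher as ww

def load_checkpoint(step):
    """Hypothetical helper: restore the model saved at a given training step."""
    raise NotImplementedError

steps = [1_000, 2_000, 3_000, 5_000, 6_000, 100_000, 200_000, 300_000, 400_000, 500_000]

for step in steps:
    model = load_checkpoint(step)
    watcher = ww.WeightWatcher(model=model)
    details = watcher.analyze()
    print(f"step {step}: avg alpha = {details['alpha'].mean():.2f}")
```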