Mean/Standard deviation over training dataset to remove rogue embedding dimensions outside normal magnitudes? #22
Comments
If the training data has been aggregated on your side and is easily accessible, I can do this myself, but I'm not sure if I have access to all of these datasets. |
Hi, actually in RemoteCLIP we use the default normalization mean/std for training, which can be found in: https://github.com/mlfoundations/open_clip/blob/main/src/open_clip/constants.py#L1-L2 We haven't done any ablation on these mean/std. Since the final performance is not bad, we don't see this as a major bottleneck; also, using the default parameters keeps the input images in the domain that CLIP vision encoders are familiar with. But it will be interesting to dive deeper into that! Thank you for your suggestion, and we will investigate it further in future work. |
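For reference, the defaults the link above points to are the standard OpenAI CLIP preprocessing statistics (values copied here for convenience; the linked constants.py is authoritative):

```python
# Default per-channel normalization used by open_clip image preprocessing,
# as defined in src/open_clip/constants.py.
OPENAI_DATASET_MEAN = (0.48145466, 0.4578275, 0.40821073)
OPENAI_DATASET_STD = (0.26862954, 0.26130258, 0.27577711)
```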
Thanks! When working with RSICD CLIP [1] we actually found some interesting "rogue dimensions": even after normalization, some vector components were dominating across all generated embeddings. We found that if we normalized these rogue dimensions, we got much better class separability and performance on downstream tasks.

This is what rogue dimensions look like (on the left). A handful of the 512 dimensions have values that are routinely 2-10x larger than the other dimensions. We can treat these by calculating the mean and standard deviation for each dimension over a large enough batch of vectors and then rescaling those dimensions, giving the image on the right. In the right-hand case, each dimension has zero mean and unit standard deviation. This has the property that, when taking differences between vectors, the vector magnitudes are not dominated by the rogue dimensions, which can lead to cleaner cluster separation and better downstream performance on zero-shot tasks.

We suspect that the following paper found this problem first: "All Bark and No Bite: Rogue Dimensions in Transformer Language Models Obscure Representational Quality" [2]. They found that normalizing and removing these rogue dimensions is important in an applied setting.

We do this normalization using scikit-learn's StandardScaler [3]: we take the RSICD training dataset, compute image and text embeddings on it, and then compute batch statistics to get appropriate means and standard deviations per dimension of the 512-dimensional embedding. This means we end up with a 512-length mean and standard deviation for images, and the same for text embeddings, which we can then use to normalize embeddings and remove rogue dimensions.

We suspect that RemoteCLIP suffers from these same rogue dimensions and would be greatly helped by normalizing and removing them. Unfortunately, RemoteCLIP was trained on a very diverse set of data that we don't have access to. If you are able to, it would greatly help if you could randomly sample some subset of your training data, compute image and text embeddings, and then provide these 512-length mean and standard deviation values for images and texts. Others have found this kind of normalization important. We are fine doing it ourselves, but then we would need access to your training corpus. Alternatively, you could randomly sample a representative subset of your training images and texts, put it in a cloud bucket of some kind, and we could compute those statistics and provide them here.

[1] https://huggingface.co/flax-community/clip-rsicd-v2 |
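A minimal sketch of the StandardScaler procedure described above, with stand-in arrays where the real RSICD-derived embeddings would go:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Stand-ins for embeddings computed over the RSICD training set; in
# practice these would come from the CLIP image and text encoders.
train_image_embeddings = np.random.randn(10_000, 512)
train_text_embeddings = np.random.randn(10_000, 512)

# Fit per-dimension mean/std over the training embeddings.
image_scaler = StandardScaler().fit(train_image_embeddings)
text_scaler = StandardScaler().fit(train_text_embeddings)

def remove_rogue_dims(embeddings: np.ndarray, scaler: StandardScaler) -> np.ndarray:
    # Rescale every dimension to zero mean and unit standard deviation, so
    # no single dimension dominates distances between vectors.
    return scaler.transform(embeddings)
```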
Hmm, that's very interesting, and thanks for your comments! 😀 I'm not sure if you apply normalization to each individual embedding before evaluation? (e.g., https://github.com/ChenDelong1999/RemoteCLIP/blob/main/retrieval.py#L138-L139) For data samples, the publicly available RSICD, RSITMD, and UCM datasets have both images and captions, and they are used for training RemoteCLIP. I guess using them can give a good enough estimate of the full RemoteCLIP training data, which I won't be able to release before paper acceptance : ( Btw, I am very curious: how much can this rogue-dimension normalization improve retrieval/zero-shot performance? |
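For clarity, the two operations under discussion differ as sketched below; the linked retrieval.py lines appear to apply the per-embedding variant. Tensor names here are illustrative, not from the repo:

```python
import torch

def l2_normalize(features: torch.Tensor) -> torch.Tensor:
    # Per-embedding normalization: each vector is rescaled to unit length.
    # Rogue dimensions keep their relative dominance within the vector.
    return features / features.norm(dim=-1, keepdim=True)

def standardize_dims(features: torch.Tensor, mean: torch.Tensor, std: torch.Tensor) -> torch.Tensor:
    # Per-dimension normalization: each dimension is rescaled with mean/std
    # estimated over a training batch, pulling rogue dimensions back into
    # the same range as the others.
    return (features - mean) / std
```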
We've noticed that normalization seems to increase class separability when clustering, but we haven't yet gauged its effect on actual downstream performance. Since you have a test harness, it might be interesting for you to see if it gives you an accuracy bump. We are still building our own test harness for our particular problem, but increased class separability tends to be a proxy for improved downstream performance.

I agree we can probably just use the same values as RSICD. Would you like me to provide the values we calculated for that ourselves, so you can check in your own test harness whether they increase your accuracy numbers? Could be a nice bump for your paper! |
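One cheap way to quantify the separability claim above is a silhouette score before and after standardization. A sketch, with random stand-ins where real labeled embeddings (e.g., RSICD category embeddings) would go:

```python
import numpy as np
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Stand-ins: embeddings with known class labels; with real data, a higher
# score after scaling indicates cleaner cluster separation.
embeddings = np.random.randn(500, 512)
labels = np.random.randint(0, 10, size=500)

before = silhouette_score(embeddings, labels, metric="cosine")
after = silhouette_score(StandardScaler().fit_transform(embeddings), labels, metric="cosine")
print(f"separability before: {before:.3f}, after: {after:.3f}")
```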
I think calculating the values is not very complicated; we could first try it ourselves and investigate further. I am now super curious about which types of inputs make these rogue dimensions activate 🤔 Thank you very much for the discussion, let us give it a try and get back to you if we have some results! |
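One hedged way to probe that question: flag the dimensions whose typical magnitude is several times the median across dimensions, then inspect which inputs most activate them. A sketch; the threshold and names are assumptions, not from this repo:

```python
import numpy as np

def find_rogue_dims(embeddings: np.ndarray, factor: float = 3.0) -> np.ndarray:
    # Typical magnitude of each dimension over a batch of embeddings.
    mags = np.abs(embeddings).mean(axis=0)
    # Flag dimensions whose typical magnitude exceeds `factor` times the
    # median dimension; sorting inputs by activation of these dimensions
    # shows which samples drive them.
    return np.where(mags > factor * np.median(mags))[0]
```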
Do you have an email address? I could send you some sample Planet images that we've been testing with. |
I've previously been experimenting with the RSICD CLIP model [1], which was trained only on the RSICD dataset. I'm very impressed by how many data sources you've used to train RemoteCLIP, which you document in a table in your paper.

In my experiments with RSICD, I found that having the mean and standard deviation of the training data for normalization was an important part of getting quality results: input imagery then follows the remote-sensing distribution set by the training data, rather than the mean/std of the parent OpenCLIP model. Remote sensing imagery obviously has a very different distribution than standard consumer photography.

Since you trained on so many datasets, which is a strength, it is unfortunately quite hard for us to compute our own mean/std. Might it be possible for you to compute a per-band mean and std over the datasets you trained with that you might have locally? It's probably fine to compute this using a sampling strategy, as long as the sample size is large enough; a sketch of what I mean follows.
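A minimal sketch of that sampled per-band computation, assuming `image_paths` is a list of RGB training images; names are illustrative, not from the RemoteCLIP codebase:

```python
import random
import numpy as np
from PIL import Image

def per_band_stats(image_paths, sample_size=10_000, seed=0):
    rng = random.Random(seed)
    sample = rng.sample(image_paths, min(sample_size, len(image_paths)))
    # Accumulate per-channel sums and squared sums over all pixels.
    total = np.zeros(3)
    total_sq = np.zeros(3)
    n_pixels = 0
    for path in sample:
        arr = np.asarray(Image.open(path).convert("RGB"), dtype=np.float64) / 255.0
        total += arr.sum(axis=(0, 1))
        total_sq += (arr ** 2).sum(axis=(0, 1))
        n_pixels += arr.shape[0] * arr.shape[1]
    mean = total / n_pixels
    std = np.sqrt(total_sq / n_pixels - mean ** 2)
    return mean, std  # per-band, in [0, 1] like the open_clip defaults
```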
Thanks again for such a great paper and for open-sourcing your project :)
[1] https://github.com/arampacha/CLIP-rsicd