Inquiry Regarding Vocabulary Size in CodonBERT PyTorch Model #6
Comments
I also noticed this problem. Is this a mistake in the model the authors provided? Because of it, to run the fine-tuning code successfully I had to modify the model code: I changed the vocabulary size from 69 to 130, kept the weights of the original model, and filled the remaining rows with zeros or random normal values. However, this operation may affect the model's performance.
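For reference, a minimal sketch of the kind of workaround described above, assuming the checkpoint loads as a Hugging Face `BertForMaskedLM` (the path, and the choice between zero and random initialization for the new rows, are placeholders):

```python
import torch
from transformers import BertConfig, BertForMaskedLM

# Placeholder sizes/path: the released checkpoint uses vocab_size=69,
# while the tokenizer built from list('AUGCN') produces 130 tokens.
OLD_VOCAB, NEW_VOCAB = 69, 130

config = BertConfig.from_pretrained("path/to/codonbert", vocab_size=OLD_VOCAB)
model = BertForMaskedLM.from_pretrained("path/to/codonbert", config=config)

# Grow the input embeddings (and the tied LM head) to the larger vocabulary.
# resize_token_embeddings keeps the original 69 rows and initializes the new
# rows with the model's default initializer; zero them explicitly if desired.
model.resize_token_embeddings(NEW_VOCAB)
with torch.no_grad():
    emb = model.get_input_embeddings().weight
    emb[OLD_VOCAB:].zero_()  # or leave the random initialization in place
```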
I noticed that there are 64 different codons and 5 special tokens, which add up to a vocabulary size of 69. But the order of the codons in the vocabulary table still remains a problem.
In pretrain.py and finetune.py, after data processing, the variable dic_voc stores the vocabulary table, and the output is: Total 130 (0~129). But the model's config says 69.
To my knowledge, there are only 64 valid codons. The codons and their corresponding amino acids are listed below:
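As a sketch of that list, the snippet below enumerates the 64 RNA codons and, assuming Biopython is available, pairs each with its amino acid (stop codons are kept separately in Biopython's table):

```python
from itertools import product
from Bio.Data import CodonTable  # assumes Biopython is installed

# All triplets over the four RNA bases: 4^3 = 64 codons.
codons = ["".join(p) for p in product("AUGC", repeat=3)]
assert len(codons) == 64

table = CodonTable.unambiguous_rna_by_name["Standard"]
for codon in codons:
    # Stop codons are not in forward_table, so label them explicitly.
    aa = table.forward_table.get(codon, "Stop")
    print(codon, aa)
```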
Yeah, that's correct. Thank you for the reply. After my last reply, I immediately checked the code again and I think I see why the difference appears. In finetune.py and pretrain.py, the variable lst_ele (line 70 in pretrain.py) is list('AUGCN'), so after processing there are 130 vocabulary entries. When I changed it to list('AUGC'), there were 69 entries, which is consistent with the model's config. Given this result, another question arises: the model config the authors provided is 69, so does this mean they pretrained the model with list('AUGC') rather than list('AUGCN')? If so, the released model would not be consistent with the paper, which states that they pretrained with list('AUGCN'). A small counting sketch is shown below.
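A minimal sketch of that counting, mirroring what the vocabulary construction in pretrain.py appears to do (all 3-mers over lst_ele plus 5 special tokens; the special-token names here are illustrative, not taken from the repo):

```python
from itertools import product

SPECIAL_TOKENS = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"]  # illustrative names

def vocab_size(alphabet):
    # All 3-mers over the alphabet, plus the special tokens.
    kmers = ["".join(p) for p in product(alphabet, repeat=3)]
    return len(SPECIAL_TOKENS) + len(kmers)

print(vocab_size(list("AUGCN")))  # 5 + 125 = 130, matches dic_voc under the paper's setting
print(vocab_size(list("AUGC")))   # 5 + 64  = 69,  matches the released model config
```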
That is what happens when corporations release something that is not peer reviewed.
I would like to express my gratitude for your excellent work on CodonBERT. I have been thoroughly impressed by your research and the accompanying code.
However, I have encountered a discrepancy that I would like to clarify. In your paper and code, the vocabulary size is given as 5^3 + 5 = 130, i.e. all triplets over the characters 'A', 'U', 'G', 'C', and 'N' plus five special tokens. Yet, in the CodonBERT PyTorch model you provided, the vocabulary size is set to 69.
Could you please explain the rationale behind this difference in vocabulary size? Understanding this would greatly help me in comprehending and utilizing your model more effectively.
Thank you in advance for your assistance.