
Missing documentation - new languages #12

Open
AmitMY opened this issue Feb 14, 2025 · 2 comments
Labels: bug (Something isn't working), enhancement (New feature or request)

Comments

AmitMY (Contributor) commented Feb 14, 2025

There's this file:
https://github.com/GerrySant/multimodalhugs/blob/master/examples/multimodal_translation/pose2text_translation/other/new_languages_how2sign.txt

But I don't know how it should be constructed.
Also, why is there no `__slt__` token like in the documentation, or `__en__`?

During setup, that file should be validated to make sure it fits the expected format.
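A validation step could look like the minimal sketch below. Since the expected file format is undocumented (which is the point of this issue), the format checked here — one `__token__`-style entry per line — is an assumption, and `validate_new_languages_file` is a hypothetical helper, not part of multimodalhugs:

```python
import re

# Assumed format: one language token per line, wrapped in double underscores.
TOKEN_RE = re.compile(r"^__[A-Za-z0-9_-]+__$")

def validate_new_languages_file(path):
    """Raise ValueError if any non-empty line does not look like a __token__ entry."""
    errors = []
    with open(path) as f:
        for lineno, line in enumerate(f, start=1):
            token = line.strip()
            if not token:
                continue  # allow blank lines
            if not TOKEN_RE.match(token):
                errors.append(f"line {lineno}: {token!r} does not match the __token__ format")
    if errors:
        raise ValueError("Invalid new-languages file:\n" + "\n".join(errors))
```

Running this at the start of `multimodalhugs-setup` would turn a confusing tensor-shape crash into an immediate, actionable message.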


In my run, I created a file with one new token per line, but it errors out, so I expect that format is wrong.

pytorch_model.bin: 100%|████████████████████| 1.20G/1.20G [00:06<00:00, 196MB/s]
generation_config.json: 100%|████████████████████| 147/147 [00:00<00:00, 2.06MB/s]
Traceback (most recent call last):
  File "/data/amoryo/conda/envs/multimodalhugs/bin/multimodalhugs-setup", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/data/amoryo/conda/envs/multimodalhugs/lib/python3.11/site-packages/multimodalhugs/multimodalhugs_cli/training_setup.py", line 34, in main
    pose2text_setup(args.config_path)
  File "/data/amoryo/conda/envs/multimodalhugs/lib/python3.11/site-packages/multimodalhugs/training_setup/pose2sign_training_setup.py", line 66, in main
    model = model_class.build_model(**model_kwargs)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/amoryo/conda/envs/multimodalhugs/lib/python3.11/site-packages/multimodalhugs/models/multimodal_embedder.py", line 451, in build_model
    source_embeddings = SpecialTokensEmbeddings.build_module(
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/amoryo/conda/envs/multimodalhugs/lib/python3.11/site-packages/multimodalhugs/modules/special_tokens_embeddings.py", line 49, in build_module
    custom_embeddings = CustomEmbedding.build_module(
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/amoryo/conda/envs/multimodalhugs/lib/python3.11/site-packages/multimodalhugs/modules/custom_embedding.py", line 55, in build_module
    module.old_embeddings.weight.data[:] = old_embs_weight[:used_size]
    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^
RuntimeError: The expanded size of the tensor (1024) must match the existing size (1472) at non-singleton dimension 1.  Target sizes: [384, 1024].  Tensor sizes: [384, 1472]
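The error reads like a mismatch between the configured embedding width (1024) and the pretrained backbone's embedding width (1472), rather than a problem with the file contents. A minimal PyTorch sketch reproducing the same failure, using the dimensions from the traceback (the variable names mirror `custom_embedding.py` but are illustrative):

```python
import torch

# Dimensions from the traceback: the checkpoint's embedding table is 384 x 1472,
# while the new table was built with encoder_embed_dim=1024.
old_embs_weight = torch.randn(384, 1472)        # pretrained backbone embeddings
new_embeddings = torch.nn.Embedding(384, 1024)  # table sized from the config value

try:
    # Same copy as custom_embedding.py line 55: rows match (384) but columns don't.
    new_embeddings.weight.data[:] = old_embs_weight[:384]
except RuntimeError as e:
    print("shape mismatch:", e)
```

If this reading is right, no new-languages file could avoid the crash: the fix is to make `encoder_embed_dim` agree with the backbone's hidden size (see GerrySant's idea 1 below is not assumed here; this only illustrates the failure mode).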
AmitMY (Contributor, Author) commented Feb 14, 2025

Even when my file is just `__token__ 1`, the error persists.

Saving the dataset (1/1 shards): 100%|████████████████████| 96404/96404 [00:00<00:00, 1277394.34 examples/s]
Saving the dataset (1/1 shards): 100%|████████████████████| 1738/1738 [00:00<00:00, 165089.69 examples/s]
Saving the dataset (1/1 shards): 100%|████████████████████| 1090/1090 [00:00<00:00, 99849.11 examples/s]
/data/amoryo/conda/envs/multimodalhugs/lib/python3.11/site-packages/transformers/tokenization_utils_base.py:1601: FutureWarning: `clean_up_tokenization_spaces` was not set. It will be set to `True` by default. This behavior will be depracted in transformers v4.45, and will be then set to `False` by default. For more details check this issue: https://github.com/huggingface/transformers/issues/31884
  warnings.warn(
Traceback (most recent call last):
  File "/data/amoryo/conda/envs/multimodalhugs/bin/multimodalhugs-setup", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/data/amoryo/conda/envs/multimodalhugs/lib/python3.11/site-packages/multimodalhugs/multimodalhugs_cli/training_setup.py", line 34, in main
    pose2text_setup(args.config_path)
  File "/data/amoryo/conda/envs/multimodalhugs/lib/python3.11/site-packages/multimodalhugs/training_setup/pose2sign_training_setup.py", line 66, in main
    model = model_class.build_model(**model_kwargs)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/amoryo/conda/envs/multimodalhugs/lib/python3.11/site-packages/multimodalhugs/models/multimodal_embedder.py", line 451, in build_model
    source_embeddings = SpecialTokensEmbeddings.build_module(
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/amoryo/conda/envs/multimodalhugs/lib/python3.11/site-packages/multimodalhugs/modules/special_tokens_embeddings.py", line 49, in build_module
    custom_embeddings = CustomEmbedding.build_module(
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/amoryo/conda/envs/multimodalhugs/lib/python3.11/site-packages/multimodalhugs/modules/custom_embedding.py", line 55, in build_module
    module.old_embeddings.weight.data[:] = old_embs_weight[:used_size]
    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^
RuntimeError: The expanded size of the tensor (1024) must match the existing size (1472) at non-singleton dimension 1.  Target sizes: [384, 1024].  Tensor sizes: [384, 1472]

GerrySant (Owner) commented Feb 14, 2025

Idea 1: implement the code necessary to set `encoder_embed_dim` automatically to `model.backbone.d_model` when `pretrained_backbone` is defined.

Idea 2: during setup, validate the new-languages file to make sure it fits the expected format. Also allow specifying tokens already present in the `text_tokenizer`.
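Idea 1 could be sketched as a small resolver run before `build_model`. This is a hypothetical helper, not existing multimodalhugs code, and it assumes the backbone exposes its hidden size as `config.d_model` (true for T5/M2M-style Hugging Face models; other architectures use `hidden_size`):

```python
def resolve_encoder_embed_dim(config, backbone=None):
    """Prefer the pretrained backbone's hidden size over the configured value.

    config: dict with an 'encoder_embed_dim' entry.
    backbone: optional pretrained model exposing config.d_model (assumed attribute).
    """
    if backbone is not None:
        d_model = backbone.config.d_model
        configured = config.get("encoder_embed_dim")
        if configured not in (None, d_model):
            # Warn rather than crash later with a tensor-shape error.
            print(f"Overriding encoder_embed_dim={configured} with backbone d_model={d_model}")
        return d_model
    return config["encoder_embed_dim"]
```

With the dimensions from the traceback above (configured 1024, backbone width 1472), this would silently-but-loudly pick 1472 and avoid the copy failure in `custom_embedding.py`.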

GerrySant added the bug and enhancement labels on Feb 15, 2025