
Missing documentation - new languages #12

Open
AmitMY opened this issue Feb 14, 2025 · 2 comments
Labels: bug (Something isn't working), enhancement (New feature or request)

Comments

AmitMY (Contributor) commented Feb 14, 2025

There's this file:
https://github.com/GerrySant/multimodalhugs/blob/master/examples/multimodal_translation/pose2text_translation/other/new_languages_how2sign.txt

But I don't know how it should be constructed.
Also, why is there no `__slt__` token like in the documentation, or `__en__`?

During setup, that file should be validated to make sure it fits the expected format.
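A validation step could look like the minimal sketch below. Since the expected file format is undocumented (which is the point of this issue), the format checked here — one `__token__`-style entry per line — is an assumption, and `validate_new_languages_file` is a hypothetical helper, not part of multimodalhugs:

```python
import re

# Assumed format: one language token per line, wrapped in double underscores.
TOKEN_RE = re.compile(r"^__[A-Za-z0-9_-]+__$")

def validate_new_languages_file(path):
    """Raise ValueError if any non-empty line does not look like a __token__ entry."""
    errors = []
    with open(path) as f:
        for lineno, line in enumerate(f, start=1):
            token = line.strip()
            if not token:
                continue  # allow blank lines
            if not TOKEN_RE.match(token):
                errors.append(f"line {lineno}: {token!r} does not match the __token__ format")
    if errors:
        raise ValueError("Invalid new-languages file:\n" + "\n".join(errors))
```

Running this at the start of `multimodalhugs-setup` would turn a confusing tensor-shape crash into an immediate, actionable message.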


In my run, I created a file with one new token per line, but it errors out, so I expect that format is wrong.

pytorch_model.bin: 100%|████████████████████| 1.20G/1.20G [00:06<00:00, 196MB/s]
generation_config.json: 100%|████████████████████| 147/147 [00:00<00:00, 2.06MB/s]
Traceback (most recent call last):
  File "/data/amoryo/conda/envs/multimodalhugs/bin/multimodalhugs-setup", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/data/amoryo/conda/envs/multimodalhugs/lib/python3.11/site-packages/multimodalhugs/multimodalhugs_cli/training_setup.py", line 34, in main
    pose2text_setup(args.config_path)
  File "/data/amoryo/conda/envs/multimodalhugs/lib/python3.11/site-packages/multimodalhugs/training_setup/pose2sign_training_setup.py", line 66, in main
    model = model_class.build_model(**model_kwargs)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/amoryo/conda/envs/multimodalhugs/lib/python3.11/site-packages/multimodalhugs/models/multimodal_embedder.py", line 451, in build_model
    source_embeddings = SpecialTokensEmbeddings.build_module(
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/amoryo/conda/envs/multimodalhugs/lib/python3.11/site-packages/multimodalhugs/modules/special_tokens_embeddings.py", line 49, in build_module
    custom_embeddings = CustomEmbedding.build_module(
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/amoryo/conda/envs/multimodalhugs/lib/python3.11/site-packages/multimodalhugs/modules/custom_embedding.py", line 55, in build_module
    module.old_embeddings.weight.data[:] = old_embs_weight[:used_size]
    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^
RuntimeError: The expanded size of the tensor (1024) must match the existing size (1472) at non-singleton dimension 1.  Target sizes: [384, 1024].  Tensor sizes: [384, 1472]
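The error reads like a mismatch between the configured embedding width (1024) and the pretrained backbone's embedding width (1472), rather than a problem with the file contents. A minimal PyTorch sketch reproducing the same failure, using the dimensions from the traceback (the variable names mirror `custom_embedding.py` but are illustrative):

```python
import torch

# Dimensions from the traceback: the checkpoint's embedding table is 384 x 1472,
# while the new table was built with encoder_embed_dim=1024.
old_embs_weight = torch.randn(384, 1472)        # pretrained backbone embeddings
new_embeddings = torch.nn.Embedding(384, 1024)  # table sized from the config value

try:
    # Same copy as custom_embedding.py line 55: rows match (384) but columns don't.
    new_embeddings.weight.data[:] = old_embs_weight[:384]
except RuntimeError as e:
    print("shape mismatch:", e)
```

If this reading is right, no new-languages file could avoid the crash: the fix is to make `encoder_embed_dim` agree with the backbone's hidden size (see GerrySant's idea 1 below is not assumed here; this only illustrates the failure mode).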
AmitMY (Contributor, Author) commented Feb 14, 2025

Even when my file is just `__token__ 1`, the error persists.

Saving the dataset (1/1 shards): 100%|████████████████████| 96404/96404 [00:00<00:00, 1277394.34 examples/s]
Saving the dataset (1/1 shards): 100%|████████████████████| 1738/1738 [00:00<00:00, 165089.69 examples/s]
Saving the dataset (1/1 shards): 100%|████████████████████| 1090/1090 [00:00<00:00, 99849.11 examples/s]
/data/amoryo/conda/envs/multimodalhugs/lib/python3.11/site-packages/transformers/tokenization_utils_base.py:1601: FutureWarning: `clean_up_tokenization_spaces` was not set. It will be set to `True` by default. This behavior will be depracted in transformers v4.45, and will be then set to `False` by default. For more details check this issue: https://github.com/huggingface/transformers/issues/31884
  warnings.warn(
Traceback (most recent call last):
  File "/data/amoryo/conda/envs/multimodalhugs/bin/multimodalhugs-setup", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/data/amoryo/conda/envs/multimodalhugs/lib/python3.11/site-packages/multimodalhugs/multimodalhugs_cli/training_setup.py", line 34, in main
    pose2text_setup(args.config_path)
  File "/data/amoryo/conda/envs/multimodalhugs/lib/python3.11/site-packages/multimodalhugs/training_setup/pose2sign_training_setup.py", line 66, in main
    model = model_class.build_model(**model_kwargs)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/amoryo/conda/envs/multimodalhugs/lib/python3.11/site-packages/multimodalhugs/models/multimodal_embedder.py", line 451, in build_model
    source_embeddings = SpecialTokensEmbeddings.build_module(
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/amoryo/conda/envs/multimodalhugs/lib/python3.11/site-packages/multimodalhugs/modules/special_tokens_embeddings.py", line 49, in build_module
    custom_embeddings = CustomEmbedding.build_module(
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/amoryo/conda/envs/multimodalhugs/lib/python3.11/site-packages/multimodalhugs/modules/custom_embedding.py", line 55, in build_module
    module.old_embeddings.weight.data[:] = old_embs_weight[:used_size]
    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^
RuntimeError: The expanded size of the tensor (1024) must match the existing size (1472) at non-singleton dimension 1.  Target sizes: [384, 1024].  Tensor sizes: [384, 1472]

GerrySant (Owner) commented Feb 14, 2025

Idea 1: implement the code necessary to set `encoder_embed_dim` automatically to `model.backbone.d_model` when `pretrained_backbone` is defined.

Idea 2: during setup, validate the new-languages file to make sure it fits the expected format. Also allow specifying tokens already present in the `text_tokenizer`.
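Idea 1 could be sketched as a small resolver run before `build_model`. This is a hypothetical helper, not existing multimodalhugs code, and it assumes the backbone exposes its hidden size as `config.d_model` (true for T5/M2M-style Hugging Face models; other architectures use `hidden_size`):

```python
def resolve_encoder_embed_dim(config, backbone=None):
    """Prefer the pretrained backbone's hidden size over the configured value.

    config: dict with an 'encoder_embed_dim' entry.
    backbone: optional pretrained model exposing config.d_model (assumed attribute).
    """
    if backbone is not None:
        d_model = backbone.config.d_model
        configured = config.get("encoder_embed_dim")
        if configured not in (None, d_model):
            # Warn rather than crash later with a tensor-shape error.
            print(f"Overriding encoder_embed_dim={configured} with backbone d_model={d_model}")
        return d_model
    return config["encoder_embed_dim"]
```

With the dimensions from the traceback above (configured 1024, backbone width 1472), this would silently-but-loudly pick 1472 and avoid the copy failure in `custom_embedding.py`.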

GerrySant added the bug and enhancement labels on Feb 15, 2025