pose2text: reduce disk read pressure #10

Merged (6 commits) on Feb 15, 2025

AmitMY
Contributor

@AmitMY AmitMY commented Feb 14, 2025

  • there is no need to check again whether the file exists
  • if one file has many items, it shouldn't be re-read from disk for every item
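
To illustrate the second point, here is a minimal sketch, not the code from this PR. It assumes several samples can point at the same pose file and uses a small in-memory cache so the file is read from disk once. The names `read_pose_bytes` and `read_count`, and the sample layout, are hypothetical.

```python
import os
import tempfile
from functools import lru_cache

read_count = 0  # counts real disk reads; for illustration only


@lru_cache(maxsize=8)  # small cache: assumes samples from one file are adjacent
def read_pose_bytes(file_path: str) -> bytes:
    """Hypothetical helper: read a pose file once and reuse the bytes."""
    global read_count
    read_count += 1
    with open(file_path, "rb") as pose_file:
        return pose_file.read()


with tempfile.TemporaryDirectory() as tmp_dir:
    # Stand-in for a real .pose file shared by many samples.
    pose_path = os.path.join(tmp_dir, "example.pose")
    with open(pose_path, "wb") as f:
        f.write(b"fake pose bytes")

    # 1000 samples that all reference the same file.
    samples = [{"source": pose_path, "item": i} for i in range(1000)]
    buffers = [read_pose_bytes(sample["source"]) for sample in samples]
```

With the cache in place, the 1000 lookups above trigger a single `open` call. The cache size is a trade-off between memory use and repeated reads, so it would need tuning for real pose files.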

@GerrySant
Owner

The version you propose can raise an error when a file does not exist:

Traceback (most recent call last):
  File "/Users/sant/anaconda3/envs/multimodalhugs/lib/python3.12/site-packages/datasets/builder.py", line 1606, in _prepare_split_single
    for key, record in generator:
  File "/Users/sant/repositories/multimodalhugs/multimodalhugs/data/datasets/pose2text.py", line 110, in _generate_examples
    dataset = dataset.map(mapping_function)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/sant/anaconda3/envs/multimodalhugs/lib/python3.12/site-packages/datasets/arrow_dataset.py", line 560, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
                                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/sant/anaconda3/envs/multimodalhugs/lib/python3.12/site-packages/datasets/arrow_dataset.py", line 3035, in map
    for rank, done, content in Dataset._map_single(**dataset_kwargs):
  File "/Users/sant/anaconda3/envs/multimodalhugs/lib/python3.12/site-packages/datasets/arrow_dataset.py", line 3408, in _map_single
    example = apply_function_on_filtered_inputs(example, i, offset=offset)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/sant/anaconda3/envs/multimodalhugs/lib/python3.12/site-packages/datasets/arrow_dataset.py", line 3300, in apply_function_on_filtered_inputs
    processed_inputs = function(*fn_args, *additional_args, **fn_kwargs)
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/sant/repositories/multimodalhugs/multimodalhugs/data/datasets/pose2text.py", line 100, in mapping_function
    buffer = self._read_pose(sample['source'])
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/sant/repositories/multimodalhugs/multimodalhugs/data/datasets/pose2text.py", line 83, in _read_pose
    with open(file_path, "rb") as pose_file:
         ^^^^^^^^^^^^^^^^^^^^^
  File "/Users/sant/anaconda3/envs/multimodalhugs/lib/python3.12/site-packages/datasets/streaming.py", line 75, in wrapper
    return function(*args, download_config=download_config, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/sant/anaconda3/envs/multimodalhugs/lib/python3.12/site-packages/datasets/utils/file_utils.py", line 947, in xopen
    return open(main_hop, mode, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
FileNotFoundError: [Errno 2] No such file or directory: '/Users/sant/phd/first_paper/data/How2Sign/sentence_level/val/rgb_front/pose_estimation/eY32ru3Nstc_20-8-rgb_front.pose'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/Users/sant/repositories/multimodalhugs/multimodalhugs/multimodalhugs_cli/training_setup.py", line 47, in <module>
    main()
  File "/Users/sant/repositories/multimodalhugs/multimodalhugs/multimodalhugs_cli/training_setup.py", line 34, in main
    pose2text_setup(args.config_path)
  File "/Users/sant/repositories/multimodalhugs/multimodalhugs/training_setup/pose2sign_training_setup.py", line 25, in main
    dataset.download_and_prepare(data_path)
  File "/Users/sant/anaconda3/envs/multimodalhugs/lib/python3.12/site-packages/datasets/builder.py", line 924, in download_and_prepare
    self._download_and_prepare(
  File "/Users/sant/anaconda3/envs/multimodalhugs/lib/python3.12/site-packages/datasets/builder.py", line 1647, in _download_and_prepare
    super()._download_and_prepare(
  File "/Users/sant/anaconda3/envs/multimodalhugs/lib/python3.12/site-packages/datasets/builder.py", line 999, in _download_and_prepare
    self._prepare_split(split_generator, **prepare_split_kwargs)
  File "/Users/sant/anaconda3/envs/multimodalhugs/lib/python3.12/site-packages/datasets/builder.py", line 1485, in _prepare_split
    for job_id, done, content in self._prepare_split_single(
  File "/Users/sant/anaconda3/envs/multimodalhugs/lib/python3.12/site-packages/datasets/builder.py", line 1642, in _prepare_split_single
    raise DatasetGenerationError("An error occurred while generating the dataset") from e
datasets.exceptions.DatasetGenerationError: An error occurred while generating the dataset

So I suggest changing this to the following code fragment:

        # Filter out samples where the file path does not exist
        dataset = dataset.filter(lambda sample: file_exists_filter('source_signal', sample))

        # Apply the update to the VIDEO_NAME column
        dataset = dataset.map(mapping_function)
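
Why the order matters can be sketched with plain Python lists instead of a `datasets.Dataset`. This is an illustration under assumptions, not the project's implementation: `file_exists_filter` here is a hypothetical stand-in that only mirrors the argument order used in the fragment above, and `mapping_function` is a simplified placeholder that opens the file.

```python
import os
import tempfile


def file_exists_filter(column_name, sample):
    """Hypothetical stand-in: keep a sample only if its file is on disk."""
    return os.path.exists(sample[column_name])


def mapping_function(sample):
    """Simplified placeholder: opens the file, so it fails on a missing path."""
    with open(sample["source_signal"], "rb") as pose_file:
        return {**sample, "num_bytes": len(pose_file.read())}


with tempfile.TemporaryDirectory() as tmp_dir:
    present = os.path.join(tmp_dir, "present.pose")
    missing = os.path.join(tmp_dir, "missing.pose")  # never created
    with open(present, "wb") as f:
        f.write(b"abc")

    samples = [{"source_signal": present}, {"source_signal": missing}]

    # Filter first, then map: the missing file is dropped before it is opened.
    kept = [s for s in samples if file_exists_filter("source_signal", s)]
    mapped = [mapping_function(s) for s in kept]

    # Mapping without filtering first hits the missing file and raises.
    try:
        [mapping_function(s) for s in samples]
        unfiltered_map_failed = False
    except FileNotFoundError:
        unfiltered_map_failed = True
```

Filtering first also means the existence check runs once per sample, so the mapping step does not need to repeat it.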

@AmitMY
Contributor Author

AmitMY commented Feb 15, 2025

Addressed. Changed the order of filter and map.

@GerrySant GerrySant merged commit 2abdd56 into GerrySant:master Feb 15, 2025
1 check passed
@AmitMY AmitMY deleted the patch-3 branch February 15, 2025 12:40

2 participants