slow dataset loading #1559

Open
assaftibm opened this issue Jan 27, 2025 · 17 comments

@assaftibm
Member

It seems that loading a dataset from HF with Unitxt is much slower than loading it with the datasets package directly.

Compare this:

from datasets import load_dataset
from time import time
import os
from uuid import uuid4

path = os.path.join("cache", str(uuid4()))

t0 = time()
ds = load_dataset("PrimeQA/clapnq_passages", cache_dir=path)
t1 = time()

print(t1-t0)

print(len(ds))

To:

from time import time

from unitxt import load_dataset


t0 = time()
ds = load_dataset('card=cards.rag.documents.clap_nq.en')
t1 = time()

print(t1-t0)

print(len(ds))

The Unitxt version takes about 5× longer.

In both cases a fresh copy is downloaded.

@dafnapension
Collaborator

dafnapension commented Jan 27, 2025

Hi, I combined both cases and ran each to completion: not just loading the beginning of the dataset, but actually running the whole process. I also deleted the default Hugging Face cache before each run:

from datasets import load_dataset as hf_load_dataset
from time import time
import os
from uuid import uuid4
from unitxt.api import load_dataset
import shutil 

path = "/home/dafna/.cache/huggingface"
shutil.rmtree(path)
t0 = time()
ds = hf_load_dataset("PrimeQA/clapnq_passages")
print(f"hf ds: {ds}")
t1 = time()
instances = list(ds["train"])
t2 = time()
print(f"num of instances in ds = {len(instances)}")
print(f"from start to after load_dataset: {t1-t0}")
print(f"from after load_dataset to after list of ds = {t2-t1}")
print()
shutil.rmtree(path)
t0 = time()
ds = load_dataset('card=cards.rag.documents.clap_nq.en')
print(f"unitxt ds: {ds}")
t1 = time()
instances = list(ds["train"])
t2 = time()
print(f"num of instances in ds = {len(instances)}")
print(f"from start to after load_dataset: {t1-t0}")
print(f"from after load_dataset to after list of ds = {t2-t1}")

@dafnapension
Collaborator

dafnapension commented Jan 27, 2025

The printout is:

(virtual39) dafna@DESKTOP-GM8R3J7:~/workspaces/unitxt$ python -m tests.mine.checktimes_assaf
None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.
README.md: 100%|█████████████████████████████████████████████████████████████████████| 48.0/48.0 [00:00<00:00, 8.89kB/s]
passages.tsv: 100%|██████████████████████████████████████████████████████████████████| 120M/120M [00:18<00:00, 6.38MB/s]
Generating train split: 100%|█████████████████████████████████████████| 178890/178890 [00:02<00:00, 83884.14 examples/s]
hf ds: DatasetDict({
    train: Dataset({
        features: ['id', 'text', 'title'],
        num_rows: 178890
    })
})
num of instances in ds = 178890
from start to after load_dataset: 24.695245504379272
from after load_dataset to after list of ds = 13.033636331558228

Template was not specified in recipe, using the first template from the card by default.
README.md: 100%|█████████████████████████████████████████████████████████████████████| 48.0/48.0 [00:00<00:00, 4.59kB/s]
Generating train split: 178890 examples [03:13, 924.58 examples/s]
unitxt ds: DatasetDict({
    train: Dataset({
        features: ['source', 'target', 'references', 'metrics', 'groups', 'subset', 'media', 'postprocessors', 'task_data', 'data_classification_policy'],
        num_rows: 178890
    })
})
num of instances in ds = 178890
from start to after load_dataset: 195.73645544052124
from after load_dataset to after list of ds = 37.590712785720825
(virtual39) dafna@DESKTOP-GM8R3J7:~/workspaces/unitxt$

@dafnapension
Collaborator

dafnapension commented Jan 27, 2025

I then removed all IterableDatasets from Unitxt Loaders, and from the end of api.load_dataset:

api.load_dataset now reads:

recipe = load_recipe(dataset_query, **kwargs)
return recipe()

and LoaderHF.load_iterables() reads:

def load_iterables(self) -> IterableDatasetDict:
    dataset = self.load_dataset()

    if self.filtering_lambda is not None:
        dataset = self.filter_load(dataset)

    limit = self.get_limit()
    if limit is not None:
        .. unchanged ..
        return result
    return dataset

and self.load_dataset, in turn, has all .to_iterable_dataset() calls removed.
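The effect of removing the conversion can be sketched in pure Python (this is an analogy, not the real unitxt/datasets code): an eagerly materialized table is cheap to consume afterwards, while an iterable wrapper defers the per-item Python overhead to every consumer.

```python
import time

# Pure-Python analogy (not the real unitxt/datasets code): an eagerly
# materialized list of rows vs. a generator that pays per-item Python
# overhead when it is consumed.

def make_rows(n):
    return [{"id": i, "text": f"passage {i}"} for i in range(n)]

def eager_load(rows):
    # Analogous to a plain datasets.Dataset: the table is materialized
    # once, and iterating over it afterwards is cheap.
    return list(rows)

def lazy_load(rows):
    # Analogous to wrapping with .to_iterable_dataset(): every consumer
    # pays the per-item cost at iteration time, not at load time.
    return ({**row, "len": len(row["text"])} for row in rows)

rows = make_rows(100_000)

t0 = time.time()
eager = eager_load(rows)      # work happens here, once
t1 = time.time()
lazy = lazy_load(rows)        # instant: nothing has run yet
t2 = time.time()
materialized = list(lazy)     # the deferred work happens only here
t3 = time.time()
```

This mirrors the measurements in the thread: dropping the iterable wrapper moved unitxt's cost out of `load_dataset` and into the later iteration step.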

Now, both loads match. Unitxt floats like a butterfly, stings like a bee:

(virtual39) dafna@DESKTOP-GM8R3J7:~/workspaces/unitxt$ python -m tests.mine.checktimes_assaf
None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.
README.md: 100%|█████████████████████████████████████████████████████████████████████| 48.0/48.0 [00:00<00:00, 6.90kB/s]
passages.tsv: 100%|██████████████████████████████████████████████████████████████████| 120M/120M [00:22<00:00, 5.34MB/s]
Generating train split: 100%|█████████████████████████████████████████| 178890/178890 [00:02<00:00, 71896.42 examples/s]
hf ds: DatasetDict({
    train: Dataset({
        features: ['id', 'text', 'title'],
        num_rows: 178890
    })
})
num of instances in ds = 178890
from start to after load_dataset: 28.13403820991516
from after load_dataset to after list of ds = 8.479058742523193

Template was not specified in recipe, using the first template from the card by default.
README.md: 100%|█████████████████████████████████████████████████████████████████████| 48.0/48.0 [00:00<00:00, 8.89kB/s]
passages.tsv: 100%|██████████████████████████████████████████████████████████████████| 120M/120M [00:22<00:00, 5.34MB/s]
Generating train split: 100%|█████████████████████████████████████████| 178890/178890 [00:02<00:00, 74905.96 examples/s]
unitxt ds keys: ['train']
num of instances in ds = 178890
from start to after load_dataset: 27.448201894760132
from after load_dataset to after list of ds = 100.16085314750671
(virtual39) dafna@DESKTOP-GM8R3J7:~/workspaces/unitxt$

@assaftibm
Member Author

Thanks! Nice find. I guess the default should be the standard Dataset; only for very large datasets would it be better to use the iterable variant, since it doesn't assume the full dataset is downloaded to disk / read into memory. Perhaps this can be indicated in the dataset card with a flag.
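A per-card flag could look roughly like this. Everything here is a hypothetical sketch: `streaming` is not an existing unitxt card field, and the names only illustrate the decision being proposed.

```python
# Hypothetical sketch of the proposed per-card flag (not the unitxt API).
from dataclasses import dataclass

@dataclass
class CardSpec:
    name: str
    streaming: bool = False  # set True only for very large datasets

def pick_representation(card: CardSpec) -> str:
    """Choose the HF dataset representation for a card."""
    if card.streaming:
        # Huge dataset: avoid assuming it fits on disk / in memory.
        return "IterableDataset"
    # Default: the plain Arrow-backed Dataset, which is much faster to iterate.
    return "Dataset"
```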

@assaftibm
Member Author

@dafnapension notice the diff in these prints:

HF:

from after load_dataset to after list of ds = 8.479058742523193

Unitxt:

from after load_dataset to after list of ds = 100.16085314750671

Any idea why the second one is so much longer? 100 secs vs. 8.5 secs?

@dafnapension
Collaborator

Hi @assaftibm , the first printout comes after streaming all ~180K instances through list() in the HF case: just iterating over them.
The second comes after streaming them through all the steps of the recipe defined by the card.
So it depends on the card.
The card here is not too terrible: it does not contain a mix of streams (which would imply re-reads of the contributing streams). But it still involves manipulation of fields you see explicitly in the card (like the Copy and the Set), manipulations you do not see but that do occur (to the recipe_metadata and task_data fields), the formation of the source field from the template and the genuine input fields, and the constant overhead of managing the whole Unitxt MultiStream operation.
I am not sure this justifies the whole 100 seconds, but it is at least an excuse..
I think the difference is mostly determined by whether the whole dataset can sit in RAM, including the space needed for processing each instance (as explained). The numbers I printed come from a little Dell/Win10 machine I have at home (on WSL - Windows Subsystem for Linux); I think its usable RAM is not large.
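The recipe steps described above can be sketched as chained generators (a toy model, not unitxt internals): each step wraps the stream in another generator, so building the pipeline is instant and all the per-instance work is paid when the stream is finally consumed by list().

```python
# Toy model (not unitxt internals) of why the cost shows up at
# list(ds["train"]): each recipe step wraps the stream in another
# generator, and nothing runs until the stream is consumed.

def copy_field(stream, src, dst):
    for instance in stream:
        instance = dict(instance)
        instance[dst] = instance[src]
        yield instance

def set_field(stream, key, value):
    for instance in stream:
        instance = dict(instance)
        instance[key] = value
        yield instance

def render_source(stream, template):
    for instance in stream:
        instance = dict(instance)
        instance["source"] = template.format(**instance)
        yield instance

raw = ({"text": f"passage {i}"} for i in range(3))
stream = copy_field(raw, "text", "document")
stream = set_field(stream, "metrics", ["metrics.rouge"])
stream = render_source(stream, "Document: {document}")

# Building the pipeline above was instant; all the per-instance work
# (and hence the measured time) happens only here:
instances = list(stream)
```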

@assaftibm
Member Author

Hi @dafnapension ,

I was talking only about one print -

print(f"from after load_dataset to after list of ds = {t2-t1}")

which reports the duration of

instances = list(ds["train"])

(and I'm comparing it for the HF case and for the UT case).

I'll benchmark on my Mac.

@assaftibm
Member Author

btw, this line:

print(f"unitxt ds: {ds}")

prints the full dataset to the console -

unitxt ds: {'train': [{'metrics': ['metrics.rouge'], 'data_classification_policy': ['public'], 'media': {'images': [], 'audios': []}, 'postprocessors': ['processors.to_string_stripped'], 'target': '', 'references': [''], 'source': '', 'task_data': '{"document_id": "827849752_115-357", "title": "Egocentrism", "passages": ["Egocentrism is the inability to differentiate between self and other . More specifically , it is the inability to untangle subjective schemas from objective reality ; an inability to understand or assume any perspective other than their own ."], "metadata_field": "", "metadata": {"data_classification_policy": ["public"], "num_demos": 0, "demos_pool_size": 0, "template": {"__type__": "input_output_template", "input_format": "", "output_format": ""}}}', 'groups': [], 'subset': []}, {'metrics': ['metrics.rouge'], 'data_classification_policy': ['public'], 'media': {'images': [], 'audios': []}, 'postprocessors': ['processors.to_string_stripped'], 'target': '', 'references'...ges Plane Crashed November 23 , 1996 Comoros , Indian Ocean Plane Crash Survivors Sky - die - ver April 23 , 2003 Skydive DeLand DeLand , FL Skydiver Chris Colwell D * * k Sucked May 5 , 2005 Mexico City , Mexico Producer / Actor Sergio Mayer I Flipped Offed < *.........```

@assaftibm
Member Author

assaftibm commented Jan 29, 2025

@dafnapension this is the output on my computer:

/Users/assaft/workspace/utest/venv/bin/python /Users/assaft/workspace/utest/my_test2.py 
Generating train split: 100%|██████████| 178890/178890 [00:01<00:00, 164677.96 examples/s]
hf ds: DatasetDict({
    train: Dataset({
        features: ['id', 'text', 'title'],
        num_rows: 178890
    })
})
num of instances in ds = 178890
from start to after load_dataset: 4.948763132095337
from after load_dataset to after list of ds = 1.8445498943328857

Template was not specified in recipe, using the first template from the card by default.
Generating train split: 100%|██████████| 178890/178890 [00:01<00:00, 161153.35 examples/s]
unitxt ds: 1
num of instances in ds = 178890
from start to after load_dataset: 21.21868586540222
from after load_dataset to after list of ds = 0.03181910514831543

Process finished with exit code 0

Three conclusions:

  • Unlike in your run, here the main latency is in from start to after load_dataset (i.e. t1-t0).
  • The latency in from after load_dataset to after list of ds (i.e. t2-t1) is negligible in my case.
  • Overall, Unitxt is about 3× slower.

@assaftibm
Member Author

@elronbandel / @yoavkatz can you benchmark on your computer?

Just make sure you are on the right branch - no_iterable_datasets

I'm using this version of @dafnapension 's code:

from datasets import load_dataset as hf_load_dataset
from time import time
import os
from uuid import uuid4
from unitxt.api import load_dataset
import shutil

path = os.path.join("cache", str(uuid4()))

# path = "/home/dafna/.cache/huggingface"
# shutil.rmtree(path)
t0 = time()
ds = hf_load_dataset("PrimeQA/clapnq_passages", cache_dir=path)
print(f"hf ds: {ds}")
t1 = time()
instances = list(ds["train"])
t2 = time()
print(f"num of instances in ds = {len(instances)}")
print(f"from start to after load_dataset: {t1-t0}")
print(f"from after load_dataset to after list of ds = {t2-t1}")
print()
# shutil.rmtree(path)

t0 = time()
ds = load_dataset('card=cards.rag.documents.clap_nq.en')
print(f"unitxt ds: {len(ds)}")
t1 = time()
instances = list(ds["train"])
t2 = time()
print(f"num of instances in ds = {len(instances)}")
print(f"from start to after load_dataset: {t1-t0}")
print(f"from after load_dataset to after list of ds = {t2-t1}")

@dafnapension
Collaborator

dafnapension commented Jan 29, 2025

Hi @assaftibm ,
First, I tried to run both cases without any file-system cache waiting "hot" for them.
That is why I cleaned the (default) Hugging Face file cache dir:

path = "/home/dafna/.cache/huggingface"
shutil.rmtree(path)

Second, the first printout was taken before I removed the .to_iterable_dataset() at the end of unitxt's api.load_dataset, so it still went through the conversion to an iterable dataset and printed something "reasonable to the eye".
After I removed the .to_iterable_dataset() it prints the dict representing a unitxt MultiStream. That specific MultiStream has just one key ("train"), hence len = 1. The generator that "train" maps to is a full-fledged unitxt generator that remembers the whole history of the recipe pipeline it needs to run. This is why it looked so "fat" to your eye.
I also bumped into this fat thing (which is hard to read, but at least convinces you that something big is going on here, and perhaps justifies the 100 sec..), so in the second printout I printed just the keys, not the whole dictionary, without drawing your attention to it:

Template was not specified in recipe, using the first template from the card by default.
README.md: 100%|█████████████████████████████████████████████████████████████████████| 48.0/48.0 [00:00<00:00, 8.89kB/s]
passages.tsv: 100%|██████████████████████████████████████████████████████████████████| 120M/120M [00:22<00:00, 5.34MB/s]
Generating train split: 100%|█████████████████████████████████████████| 178890/178890 [00:02<00:00, 74905.96 examples/s]
unitxt ds keys: ['train']
num of instances in ds = 178890
from start to after load_dataset: 27.448201894760132
from after load_dataset to after list of ds = 100.16085314750671
(virtual39) dafna@DESKTOP-GM8R3J7:~/workspaces/unitxt$

Third, in both cases (your hf and unitxt) I added the list(ds["train"]) after the load. Both hf and unitxt load_dataset calls give you an object you can iterate over (more precisely, a dictionary with a single entry, "train", that maps to something iterable), and I wanted to measure the time that iteration takes. That is: to make sure all instances actually arrived from HF to my laptop and went through the process. For your hf experiment the "process" is just listing them (making sure they all arrived, rather than holding a generator that would fetch them only when triggered); for unitxt it means they go through the whole card you specified. This is why I added list(ds["train"]) to both. But please bear in mind that what you iterate over is different in the two cases, although both are called ds and both are obtained from a load_dataset call (two different load_dataset functions: hf's and unitxt's).
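The two object shapes being compared can be sketched like this (a minimal illustration, not the real Dataset/MultiStream classes): both are dicts with a single "train" entry, which is why print(len(ds)) shows 1, but one entry is already materialized while the other is a generator that drags the whole pipeline behind it.

```python
# Minimal sketch of the two "ds" shapes being compared (not the real classes).

hf_like = {"train": [{"id": i} for i in range(4)]}  # already materialized

def recipe_stream():
    for i in range(4):
        # in real unitxt, every recipe step would run here, per instance
        yield {"source": f"instance {i}"}

unitxt_like = {"train": recipe_stream()}            # lazy generator

assert len(hf_like) == len(unitxt_like) == 1  # why print(len(ds)) showed 1

hf_instances = list(hf_like["train"])      # cheap: the list already exists
ut_instances = list(unitxt_like["train"])  # the pipeline actually runs now
```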

And yes, the branch I closed, no_iterable_datasets, contains the modifications I made to throw away all iterable datasets for the experiments I reported. But since the branch did not pass all the existing tests (it did pass the little script I wrote following your start), I closed the PR for now.
Meanwhile, following what we found thanks to your observation, Elron started a PR that removes the use of iterable datasets much more elegantly.

@assaftibm
Member Author

Hi @dafnapension

Thanks for the explanation! I think we understand each other :-)

Can you put a link to the new PR that @elronbandel started?

Thanks

@assaftibm
Member Author

I assume it's #1562 or #1564

@dafnapension
Collaborator

Hi @assaftibm , yes, the latter, #1564

@assaftibm
Member Author

assaftibm commented Jan 29, 2025

Slightly updated version of the test code. Using the cache_dir parameter makes it easy to delete the directory without deleting other cached datasets/models.

import os.path
import shutil
from time import time

from datasets import load_dataset as hf_load_dataset
from unitxt.api import load_dataset
import unitxt

path = "hf_cache"
# path = "/home/dafna/.cache/huggingface"

if os.path.exists(path):
    shutil.rmtree(path)
t0 = time()
ds = hf_load_dataset("PrimeQA/clapnq_passages", cache_dir=path)
print(f"hf ds: {ds}")
t1 = time()
instances = list(ds["train"])
t2 = time()
print(f"num of instances in ds = {len(instances)}")
print(f"from start to after load_dataset: {t1-t0}")
print(f"from after load_dataset to after list of ds = {t2-t1}")
print()
if os.path.exists(path):
    shutil.rmtree(path)

# disable the cache for hf datasets
unitxt.settings.disable_hf_datasets_cache=True
t0 = time()
ds = load_dataset('card=cards.rag.documents.clap_nq.en')
t1 = time()
instances = list(ds["train"])
t2 = time()
print(f"num of instances in ds = {len(instances)}")
print(f"from start to after load_dataset: {t1-t0}")
print(f"from after load_dataset to after list of ds = {t2-t1}")

@dafnapension
Collaborator

dafnapension commented Jan 30, 2025

Hi @assaftibm , for unitxt, depending on the setting:

settings.disable_hf_datasets_cache = (bool, True)

the HF loader may or may not use the default file-system cache that HF uses (HF uses it when you do not pass anything non-None for the cache_dir parameter of its load_dataset):

if settings.disable_hf_datasets_cache and not self.streaming:

So, if that setting happens to be False, your test script plays in favor of unitxt by allowing it to use the file-system cache when loading from HF, whereas 'your part' does wipe out the potential cache_dir before invoking load_dataset.
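The quoted condition can be restated as a small truth table (a paraphrase of the loader line above, not the actual unitxt source):

```python
# Paraphrase of the quoted loader condition as a pure function.
def uses_hf_file_cache(disable_hf_datasets_cache: bool, streaming: bool) -> bool:
    # The loader bypasses HF's file-system cache only when caching is
    # disabled AND the loader is not streaming; otherwise HF's default
    # cache_dir is in play.
    bypass = disable_hf_datasets_cache and not streaming
    return not bypass
```

Under this reading, only the (disabled, non-streaming) combination gives a truly cold load; every other combination lets unitxt benefit from a warm file-system cache.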

@dafnapension
Collaborator

And I now realize that I misled you: both of @elronbandel's PRs, #1562 and #1564, aim to solve your important observation.
